CoCalc Public FilesEDA workshop 1 / Workshop 1 - Datasets of one variable.ipynb
Author: Charles Farrant
Views : 130
Compute Environment: Ubuntu 20.04 (Default)

# Workshop 1

## Datasets of one variable

### Task 1.1: Running speeds of spiders

Male spiders in the genus Tidarren are tiny, weighing only about 1% as much as females. They also have disproportionately large pedipalps, copulatory organs that make up about 10% of a male’s mass. (See image; the pedipalps are indicated by arrows.) Males load the pedipalps with sperm and then search for females to inseminate. Astonishingly, male Tidarren spiders voluntarily amputate one of their two organs, right or left, just before sexual maturity.

Why do they do this? Ramos et al. (2004) suggested that perhaps speed is important to males searching for females, and amputation increases running performance. They used video to measure running speed of males on strands of spider silk before and after voluntary amputation. The running speeds (in cm/s) are in the file tidarren.csv.

1. Read in this dataset and print it.
In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

Print(Spider_running_speed)

--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-4-8166e72c5a8f> in <module> 4 5 Spider_running_speed = pd.read_csv('tidarren.csv') ----> 6 Print(Spider_running_speed) NameError: name 'Print' is not defined 

#### Creating a new variable from existing ones

What we're interested in is the difference in each spider's speed before and after the amputation.

Fortunately, pandas makes it easy to create new variables from existing ones. The following command creates a new numerical variable called difference from the existing variables before and after:

DataFrame['difference'] = DataFrame['after'] - DataFrame['before']


where you should replace DataFrame with whatever you've called your dataset.

What this command does is create a new column in your DataFrame with the header difference and fills that column with the differences between speeds for each spider.

1. Create this new variable and print your DataFrame again to see what has happened to it.
In [ ]:


1. Create a labelled histogram of the difference in running speeds.
In [ ]:


4. Does this data support Ramos's hypothesis? Explain your reasoning in the cell below.

### Task 1.2: Darwin's cross-fertilised and self-fertilised plants

Darwin wanted to find out if cross-fertilised corn plants grew with greater vigour than self-fertilised corn plants. Pairs of seedlings of the same age, one produced by cross-fertilisation and the other by self-fertilisation, were grown together so that members of each pair were reared under nearly identical conditions. The final heights of all the plants were measured to the nearest millimetre.

The file containing the data on Darwin's plant heights is called darwin.csv.

1. Read in the dataset from the file darwin.csv and print it.
In [ ]:



Each row in the table represents a pair of plants. The second column contains the height of cross-fertilised plants and the third column contains the height of self-fertilised plants.

We are not interested in the heights themselves, but rather the differences in heights for each pair. Do you know why?

1. Create a new variable of the difference in heights for each pair of plants and print the new DataFrame.
In [ ]:


1. Plot a labelled histogram of the difference in plant heights.
2. Looking at the histogram, do you consider that cross-fertilised plants grow with greater vigor than self-fertilised plants? Explain your reasoning.
In [ ]:



### Task 1.3: The lengths of human genes

The international Human Genome Project was the largest coordinated research effort in the history of biology. It yielded the DNA sequence of all 23 human chromosomes, each consisting of millions of nucleotides chained end to end. These encode the genes whose products - RNA and proteins - shape the growth and development of each individual. The file human_genes.csv contains the lengths of all 20,290 known and predicted genes of the published genome sequence (Hubbard et al. 2005). The length of a gene refers to the total number of nucleotides comprising the coding regions.

1. Read in the human gene dataset and print it.
In [ ]:


1. Create a labelled histogram of the length of human genes.
In [ ]:



#### Changing the number and width of histogram bins

The histogram extends out to 100,000 nucleotides but doesn't appear to show any genes between 20,000 and 100,000nt. This is because there are a few very long genes which are so rare that they cannot be seen in this histogram. The longest human gene, with nearly 100,000nt, encodes the gigantic protein titin, which is expressed in heart and skeletal muscle. The protein was named for the titans of Greek mythology, giants who ruled the earth until overthrown by the Olympians. Some mutations in the titin gene cause heart muscle disease and muscular dystrophy.

These few very long genes make the bulk of the histogram bunch up to the left hiding details of the shorter genes. Matplotlib automatically calculates the number of bins and their width for you. In this case the bin width is about 10,000nt.

Rather than let matplotlib set the bins for you, you can set the bins manually. This is done with the bins=range(start, end, width) argument in .hist() where start, end and width are integers.

For example, bins=range(0, 20000, 1000) plots a histogram with bins of width 1000 ranging from 0 to 20,000

1. Play around with different values for start, end and width to see if you can reveal the distribution of gene lengths that doesn't include the very long genes such as titin. Do you find any spikes in the distribution?
In [ ]:



### Task 1.4: Non-randomness of haphazard choice

Some stage performers claim to have real telepathic powers - the ability to read minds. However, these powers can be convincingly faked. In one example, the performer asks every member of the audience to think of a two-digit number. After a show of pretending to read their minds, the performer states a number that a surprisingly large fraction of the audience was thinking of. This feat would be surprising if the people thought of all two-digit numbers with equal probability, but not if people everywhere tend to pick the same few numbers. Marks (2000) asked 315 volunteers to independently think of a two-digit number. The results are in the file two_digit_numbers.csv.

1. Read in the dataset from the file two_digit_numbers.csv and print it.
2. Create a labelled histogram of the data choosing an appropriate number of bins.
3. Are all two-digit numbers selected with equal probability? Describe the distribution of chosen two-digit numbers.
In [ ]:



Niderkorn's (1872) measurements on 144 human corpses provided the first quantitative study on the development of rigor mortis. The data, contained in the file rigor.csv, gives the number of bodies achieving rigor mortis in each hour after death, recorded in one-hour intervals.

1. Read the data in from the file rigor.csv.
2. Examine the dataset by printing it out.
In [ ]:



The categorical variable in this dataset is Hour which is ordinal; it has an inherent order (i.e., 1 hour after death, 2 hours after death, etc.).

1. Construct a frequency table.
In [ ]:



You will notice that most bodies (31 in total) achieved rigor mortis during the fourth hour. But presenting the data ordered by descending frequency is not that useful here. It would make more sense to order by hour so that we can see the distribution of hours when rigor mortis is achieved. To sort on the categories rather their frequencies use value_counts(sort=False).

1. Do this and create a new frequency table.
2. Construct a labelled bar graph ordered by hour.
In [ ]:



The file hurricanes.csv contains the named hurricanes in the Atlantic between 2001 and 2010, along with their severity on the Saffir-Simpson Hurricane scale which categorises each hurricane by a label 1 to 5 depending on its power.

1. Read the data in from the file hurricanes.csv.
2. Examine the dataset by printing it out.
In [ ]:



There are three categorical variables: year which is categorical ordinal, name which is categorical nominal and severity which is categorical ordinal.

1. Construct a frequency table showing the frequency of hurricanes in each severity category.
In [ ]:


1. Construct a frequency table showing the frequency of hurricanes in each year.
In [ ]:


1. Construct a labelled bar graph of the frequency of hurricanes in each severity category.
In [ ]:


1. Construct a labelled bar graph of the frequency of hurricanes in each severity category.
In [ ]:



### References

Ramos, M., et al. (2004). Overcoming an evolutionary conflict: removal of a reproductive organ greatly increases locomotor performance. Proc. Nat. Acad. Sci. (USA) 101:4883-4887.

Darwin, C. (1876). Effects of Cross and Self-fertilization in the Vegetable Kingdom. London: John Murray.

Hubbard, T., et al. (2005). Ensembl 2005. Nucl. Acid. Res. 33:D447-D453.

Marks, D. (2000). The Psychology of the Psychic. Amherst, NY: Prometheus Books.

Niderkorn (1872). Archived here