Project: Charles Farrant - 2020-21/EDA workshop 1 Group B (Thursday)

One categorical variable

Categorical data for a single variable are usually displayed either in a frequency table, in which the number of occurrences of each category is tabulated, or in a bar graph, in which the height of each rectangular bar represents the frequency or relative frequency of a category.

We will use an example dataset on causes of human deaths by tigers to demonstrate these methods of presentation.

Death by tiger

Conflict between humans and tigers threatens tiger populations, kills people, and reduces public support for conservation.

Gurung et al. (2008) investigated causes of human deaths by tigers near the protected area of Chitwan National Park, Nepal. Eighty-eight people were killed by 36 individual tigers between 1979 and 2006, mainly within 1 km of the park edge. Table 1 lists the main activities of people at the time they were killed. Such information may be helpful to identify activities that increase vulnerability to attack.

Activity
Collecting grass or fodder for livestock
Collecting non-timber forest products
Fishing
Herding livestock
Disturbing tiger at its kill
Collecting fuel wood or timber
Sleeping in a house
Walking in forest
Using an outside toilet

Table 1: Activities of people at the time they were killed.

The file containing the data on tiger deaths is called tiger_deaths.csv.

Run the following code cell, which reads the file tiger_deaths.csv into a DataFrame called deaths and displays it.

In [0]:

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Read the tiger_deaths dataset into a DataFrame called deaths.
deaths = pd.read_csv('tiger_deaths.csv')

# Print the dataset.
deaths

There are three columns. The first column is an index. The second column is a unique number identifying the person killed; it contains no useful information and can be ignored. The third column contains all values of the variable called activity, which records what each person was doing when they were killed by a tiger.

What type of variable is activity? Is it categorical nominal or categorical ordinal?

Categories of the categorical variable

It is easy to confuse the terms "categorical variable" and "categories". In this dataset we have one categorical variable called activity. There are nine categories, or values, that this variable can take: "Grass/fodder", "Forest products", "Fishing", etc.

To print out all of the categories of a categorical variable we can use the unique() method as demonstrated in the following code.

In [0]:

# Print the unique categories of the categorical variable activity (returned as a NumPy array).
print(deaths['activity'].unique())

print()

# Or loop through the unique categories to print each category on a separate line.
for category in deaths['activity'].unique():
    print(category)
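
As a quick check on the distinction between the variable and its categories, the number of distinct categories can be counted with the nunique() method. The sketch below is only an illustration: so that it runs without tiger_deaths.csv, it rebuilds a stand-in deaths DataFrame from the counts listed in Table 2 of this notebook, and it assumes the labels stored in the file match the labels in that table.

```python
import pandas as pd

# Stand-in for pd.read_csv('tiger_deaths.csv'): counts copied from Table 2.
# Assumption: the labels in the real file match the labels used in the table.
counts = {'Grass/fodder': 44, 'Forest products': 11, 'Fishing': 8,
          'Herding': 7, 'Disturbing tiger kill': 5, 'Fuelwood/timber': 5,
          'Sleeping in house': 3, 'Walking': 3, 'Toilet': 2}
deaths = pd.DataFrame(
    {'activity': [a for a, n in counts.items() for _ in range(n)]})

# One categorical variable (the column), several categories (its distinct values).
n_categories = deaths['activity'].nunique()
print('Number of categories:', n_categories)
print('Number of observations:', len(deaths))
```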

Counting categories: Frequency table

The most common thing to do with a categorical variable is to count the number of occurrences (or frequency) of each category. In pandas we do this with the value_counts() method, like so:

freq_table = DataFrame['variable'].value_counts()

Run the following code to see how this is done.

In [0]:

# Create a frequency table of each activity and save to the Python variable freq_table
freq_table = deaths['activity'].value_counts()

freq_table

What we have done here is produce a frequency table of the main activities of people at the time they were killed. For example, 44 people were killed while collecting grass or fodder for livestock.
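
Because value_counts() returns a pandas Series indexed by category, individual counts can be looked up by label and the counts should add up to the number of rows. The following sketch illustrates this on a stand-in deaths DataFrame rebuilt from the counts in Table 2 (an assumption made so the example runs without the CSV file; the real labels may differ).

```python
import pandas as pd

# Stand-in for pd.read_csv('tiger_deaths.csv'): counts copied from Table 2.
# Assumption: the labels in the real file match the labels used in the table.
counts = {'Grass/fodder': 44, 'Forest products': 11, 'Fishing': 8,
          'Herding': 7, 'Disturbing tiger kill': 5, 'Fuelwood/timber': 5,
          'Sleeping in house': 3, 'Walking': 3, 'Toilet': 2}
deaths = pd.DataFrame(
    {'activity': [a for a, n in counts.items() for _ in range(n)]})

freq_table = deaths['activity'].value_counts()

# Look up a single category by its label.
print(freq_table['Grass/fodder'])

# The most frequent category, and a sanity check that the counts cover every row.
print(freq_table.idxmax())
print(freq_table.sum() == len(deaths))
```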

Relative frequencies

If we want the relative frequencies of categories (rather than the absolute frequencies we calculated just now) then we need to divide the frequencies by the total number of people killed. This transformation is often called normalisation in science. Pandas can do this for you if you include the argument normalize=True (note the American spelling), like so:

rel_freq_table = deaths['activity'].value_counts(normalize=True)

Try this in the following code cell to see the result.

In [0]:

# Create a relative frequency table of each activity and save to the Python variable rel_freq_table
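
To make the link between the two approaches concrete, the sketch below divides the absolute frequencies by their total and compares the result with normalize=True. It is an illustration under an assumption: the deaths DataFrame is rebuilt from the counts in Table 2 so the code runs without the CSV file.

```python
import pandas as pd

# Stand-in for pd.read_csv('tiger_deaths.csv'): counts copied from Table 2.
# Assumption: the labels in the real file match the labels used in the table.
counts = {'Grass/fodder': 44, 'Forest products': 11, 'Fishing': 8,
          'Herding': 7, 'Disturbing tiger kill': 5, 'Fuelwood/timber': 5,
          'Sleeping in house': 3, 'Walking': 3, 'Toilet': 2}
deaths = pd.DataFrame(
    {'activity': [a for a, n in counts.items() for _ in range(n)]})

# Normalise by hand: each frequency divided by the total number of people.
freq_table = deaths['activity'].value_counts()
manual = freq_table / freq_table.sum()

# Let pandas do the same thing.
rel_freq_table = deaths['activity'].value_counts(normalize=True)

# Largest difference between the two versions (aligned on category label).
print((manual - rel_freq_table).abs().max())
```

Relative frequencies always sum to 1, which is a useful check that nothing has been dropped.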

Rounding numbers in DataFrames

You should have noticed that the relative frequencies are printed to 6 decimal places, which is unnecessary and makes the numbers hard to read. We can round the values to 2 decimal places using the round() method, like so:

rel_freq_table.round(2)

Try this in the above code cell to see the result.
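
One detail worth knowing is that round() returns a new object and leaves the original untouched, so the result has to be displayed or saved to a variable. A minimal sketch, again using a stand-in deaths DataFrame rebuilt from Table 2 (an assumption so the example is self-contained):

```python
import pandas as pd

# Stand-in for pd.read_csv('tiger_deaths.csv'): counts copied from Table 2.
# Assumption: the labels in the real file match the labels used in the table.
counts = {'Grass/fodder': 44, 'Forest products': 11, 'Fishing': 8,
          'Herding': 7, 'Disturbing tiger kill': 5, 'Fuelwood/timber': 5,
          'Sleeping in house': 3, 'Walking': 3, 'Toilet': 2}
deaths = pd.DataFrame(
    {'activity': [a for a, n in counts.items() for _ in range(n)]})

rel_freq_table = deaths['activity'].value_counts(normalize=True)

# round() returns a rounded copy; rel_freq_table keeps its full precision.
rounded = rel_freq_table.round(2)
print(rounded)
```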

Create easy to read frequency tables

Frequency tables are a clear and concise way of presenting data of a single categorical variable. Notice that when you printed the frequency table using value_counts(), the categories were automatically ranked by pandas in descending order of frequency. This makes it quick and easy for someone to compare across categories. If they were mixed up as shown in Table 2 below, the reader would have to do a lot more mental work to rank the categories.

Activity	Number of people
Fishing	8
Herding	7
Fuelwood/timber	5
Grass/fodder	44
Disturbing tiger kill	5
Toilet	2
Sleeping in house	3
Walking	3
Forest products	11

Table 2: A poorly designed table because categories are not ranked in descending order.

Having said that, when the categories are ordinal, i.e., they have an inherent order (e.g., "low", "medium", "high"), then ordering on the categories, rather than the frequencies, is preferred. You will look at an example of this in the exercises.

We can stop pandas sorting by descending frequency by adding sort=False, as follows:

freq_table = deaths['activity'].value_counts(sort=False)

Try this in the above code cell to see the result.
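
For ordinal categories, one possible way to present the table in the inherent category order is to tabulate and then reindex by a list of the categories in that order. The data below are hypothetical and invented purely for illustration; the names levels and ratings are not part of the tiger dataset.

```python
import pandas as pd

# Hypothetical ordinal data, invented purely for illustration.
levels = ['low', 'medium', 'high']
ratings = pd.Series(['medium', 'low', 'high', 'medium', 'medium', 'low'])

# Tabulate, then reorder the rows to follow the inherent order of the categories
# rather than the order of the frequencies.
freq_table = ratings.value_counts().reindex(levels)
print(freq_table)
```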

Remember to create tables and graphs so that your audience immediately sees any interesting patterns in your data. If your audience has to spend time trying to decipher poorly created graphs and tables you will annoy them and they may not understand what you are presenting.

(It's worth remembering that your audience includes the professors that will mark your work during your time at uni and you may lose marks if your tables and graphs are poorly constructed and annotated.)

Plotting frequencies: bar plot

The frequencies of the categories of a single categorical variable can be displayed in a bar plot. Categories are plotted on the $x$-axis and frequencies on the $y$-axis. To create one for tiger deaths we use the frequency table freq_table we created using value_counts() (or if we wanted the relative frequencies we would use rel_freq_table).

Run the following code to see how this is done.

In [0]:

# Create a frequency table of each activity.
freq_table = deaths['activity'].value_counts()

# Create a bar plot of the frequency table of each activity.
freq_table.plot.bar();

We should add axis labels and a title to the graph:

plt.xlabel('Activity')
plt.ylabel('Number of people')
plt.title('Frequency of activities when people were killed by tigers')

Make these changes to the graph to make it clearer.
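
Putting the plotting and labelling steps together might look like the sketch below. It uses the Axes object returned by plot.bar() rather than the plt functions, which is an equivalent alternative, not the notebook's own approach. The deaths DataFrame is a stand-in rebuilt from the counts in Table 2 (an assumption so the code runs without the CSV file), and the %matplotlib inline line is omitted because it only works inside a notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for pd.read_csv('tiger_deaths.csv'): counts copied from Table 2.
# Assumption: the labels in the real file match the labels used in the table.
counts = {'Grass/fodder': 44, 'Forest products': 11, 'Fishing': 8,
          'Herding': 7, 'Disturbing tiger kill': 5, 'Fuelwood/timber': 5,
          'Sleeping in house': 3, 'Walking': 3, 'Toilet': 2}
deaths = pd.DataFrame(
    {'activity': [a for a, n in counts.items() for _ in range(n)]})

freq_table = deaths['activity'].value_counts()

# plot.bar() returns the matplotlib Axes, which can then be labelled.
ax = freq_table.plot.bar()
ax.set_xlabel('Activity')
ax.set_ylabel('Number of people')
ax.set_title('Frequency of activities when people were killed by tigers')

# Make room for the rotated category labels.
plt.tight_layout()
```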

A bar graph is useful for immediately visualising the relative frequencies of categories. You can see at a glance that collecting grass and fodder for livestock is by far the most risky activity, while the other activities are far less risky and carry similar levels of risk. However, the bar graph loses information compared to a frequency table, as it is not easy to read off the numerical frequencies. Generally an ordered table is more informative (i.e., contains more information) than a bar graph.

A principle of good table and graph design

A key principle of good table and graph design is to represent the most information with the least ink. Following this principle creates clear and concise tables and graphs that are readily interpreted by your audience.

In addition, up to a tenth of your male audience is red-green colour blind. So colour choice needs some thought. Websites, such as this one, can help in choosing colour palettes for all of your audience.

The bar plot above uses a lot of ink just to display a single number representing frequency. Less ink could be used by making the bars white and just drawing their outlines. This can be accomplished by setting facecolor='w' (w stands for white) and edgecolor='k' (k stands for black) like so

freq_table.plot.bar(facecolor='w', edgecolor='k');

Try this in the above graph.
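
A self-contained sketch of the low-ink version is given below. As before, the deaths DataFrame is a stand-in rebuilt from the counts in Table 2, which is an assumption made so that the example runs without the CSV file.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for pd.read_csv('tiger_deaths.csv'): counts copied from Table 2.
# Assumption: the labels in the real file match the labels used in the table.
counts = {'Grass/fodder': 44, 'Forest products': 11, 'Fishing': 8,
          'Herding': 7, 'Disturbing tiger kill': 5, 'Fuelwood/timber': 5,
          'Sleeping in house': 3, 'Walking': 3, 'Toilet': 2}
deaths = pd.DataFrame(
    {'activity': [a for a, n in counts.items() for _ in range(n)]})

freq_table = deaths['activity'].value_counts()

# White bars with black outlines: same information, far less ink.
ax = freq_table.plot.bar(facecolor='w', edgecolor='k')
ax.set_xlabel('Activity')
ax.set_ylabel('Number of people')
plt.tight_layout()
```

Black outlines on a white background also avoid relying on colour at all, which sidesteps the colour-blindness issue mentioned above.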

Never use pie charts!

The pie chart is another type of graph often used to display frequencies of a categorical variable. This method uses coloured wedges around the circumference of a circle to represent frequency or relative frequency.

Run the following code to produce a pie chart of activities.

In [0]:

freq_table.plot.pie(figsize=(6, 6));

The pie chart has received a lot of criticism from experts in information graphics. One reason is that while it is straightforward to visualise the frequency of deaths in the first and most frequent category (Collecting grass/fodder), it is more difficult to compare frequencies in the remaining categories by eye. This problem worsens as the number of categories increases. Another reason is that it is very difficult to compare frequencies between two or more pie charts side by side, especially when there are many categories. To compensate, pie charts are often drawn with the frequencies added as text around the circle perimeter. The result is not better than a table. The shape of a frequency distribution is more readily perceived in a bar graph than a pie chart, and it is easier to compare frequencies between two or more bar graphs than between pie charts. Use the bar graph instead of the pie chart for showing frequencies in categorical data.

References

Gurung, B., et al. (2008). Factors associated with human-killing tigers in Chitwan National Park, Nepal. Biol. Conserv. 141:3069-3078.

Exercise Notebook

One categorical variable

Next Notebook

Two numerical variables
