One categorical variable
Data on a single categorical variable are usually displayed either in a frequency table, in which the number of occurrences of each category is tabulated, or in a bar graph, in which the height of each rectangular bar represents the frequency or relative frequency of a category.
We will use an example dataset on causes of human deaths by tigers to demonstrate these methods of presentation.
Death by tiger
Conflict between humans and tigers threatens tiger populations, kills people, and reduces public support for conservation.
Gurung et al. (2008) investigated causes of human deaths by tigers near the protected area of Chitwan National Park, Nepal. Eighty-eight people were killed by 36 individual tigers between 1979 and 2006, mainly within 1 km of the park edge. Table 1 lists the main activities of people at the time they were killed. Such information may be helpful to identify activities that increase vulnerability to attack.
Activity |
---|
Collecting grass or fodder for livestock |
Collecting non-timber forest products |
Fishing |
Herding livestock |
Disturbing tiger at its kill |
Collecting fuel wood or timber |
Sleeping in a house |
Walking in forest |
Using an outside toilet |
Table 1: Activities of people at the time they were killed.
The data on tiger deaths are stored in the file tiger_deaths.csv.
Run the following code cell, which reads in the file tiger_deaths.csv, names it deaths, and prints it.
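A minimal sketch of such a cell is given here. Because tiger_deaths.csv itself is not reproduced in this section, the sketch first writes a stand-in file rebuilt from the counts in Table 2; the column name person is an assumption, whereas activity is the name used in the text. With the real file to hand, only the read_csv line and the print are needed.

```python
import os
import tempfile

import pandas as pd

# Number of people killed per activity, taken from Table 2 of this section.
counts = {"Grass/fodder": 44, "Forest products": 11, "Fishing": 8,
          "Herding": 7, "Disturbing tiger kill": 5, "Fuelwood/timber": 5,
          "Sleeping in house": 3, "Walking": 3, "Toilet": 2}

# Stand-in for the real file: one row per person killed.
# The column name "person" is an assumption made for this sketch.
rows = [activity for activity, n in counts.items() for _ in range(n)]
stand_in = pd.DataFrame({"person": range(1, len(rows) + 1), "activity": rows})

# Write the stand-in to a temporary folder so that read_csv has a file to open.
path = os.path.join(tempfile.mkdtemp(), "tiger_deaths.csv")
stand_in.to_csv(path, index=False)

# Read the file, name the result deaths, and print it.
deaths = pd.read_csv(path)
print(deaths)
```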
There are three columns. The first column is an index. The second column is a unique number identifying the person killed; it contains no useful information and can be ignored. The third column contains all values of the variable called activity, the activity each person was doing when they were killed by a tiger.
What type of variable is activity? Is it a nominal or an ordinal categorical variable?
Categories of the categorical variable
It is easy to confuse the terms "categorical variable" and "categories". In this dataset we have one categorical variable called activity. There are nine categories, or values, that this variable can take: "Grass/fodder", "Forest products", "Fishing", etc.
To print out all of the categories of a categorical variable we can use the unique() method, as demonstrated in the following code.
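One way this could look is sketched here. The DataFrame is a stand-in rebuilt from the counts in Table 2 rather than read from tiger_deaths.csv, and the column name person is an assumption.

```python
import pandas as pd

# Stand-in for deaths, rebuilt from the counts in Table 2.
counts = {"Grass/fodder": 44, "Forest products": 11, "Fishing": 8,
          "Herding": 7, "Disturbing tiger kill": 5, "Fuelwood/timber": 5,
          "Sleeping in house": 3, "Walking": 3, "Toilet": 2}
rows = [activity for activity, n in counts.items() for _ in range(n)]
deaths = pd.DataFrame({"person": range(1, len(rows) + 1), "activity": rows})

# unique() lists each category once, in order of first appearance.
categories = deaths["activity"].unique()
print(categories)
print(len(categories), "categories")
```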
Counting categories: Frequency table
The most common thing to do with a categorical variable is to count the number of occurrences (or frequency) of each category. We do this in pandas with the method value_counts().
Run the following code to see how this is done.
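A sketch of the counting step, assuming a stand-in for deaths rebuilt from the counts in Table 2 (the column name person is an assumption):

```python
import pandas as pd

# Stand-in for deaths, rebuilt from the counts in Table 2.
counts = {"Grass/fodder": 44, "Forest products": 11, "Fishing": 8,
          "Herding": 7, "Disturbing tiger kill": 5, "Fuelwood/timber": 5,
          "Sleeping in house": 3, "Walking": 3, "Toilet": 2}
rows = [activity for activity, n in counts.items() for _ in range(n)]
deaths = pd.DataFrame({"person": range(1, len(rows) + 1), "activity": rows})

# Count how many times each category occurs in the activity column.
freq_table = deaths["activity"].value_counts()
print(freq_table)
```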
What we have done here is produce a frequency table of the main activities of people at the time they were killed. For example, 44 people were killed while collecting grass or fodder for livestock.
Relative frequencies
If we want the relative frequencies of categories (rather than the absolute frequencies we calculated just now) then we need to divide the frequencies by the total number of people killed. This transformation is often called normalisation in science. Pandas can do this for you if you include the argument normalize=True (note the American spelling).
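The call might look like the sketch below, which again uses a stand-in for deaths rebuilt from the counts in Table 2 (the column name person is an assumption). It also divides the absolute frequencies by the total by hand to show that both routes give the same values.

```python
import pandas as pd

# Stand-in for deaths, rebuilt from the counts in Table 2.
counts = {"Grass/fodder": 44, "Forest products": 11, "Fishing": 8,
          "Herding": 7, "Disturbing tiger kill": 5, "Fuelwood/timber": 5,
          "Sleeping in house": 3, "Walking": 3, "Toilet": 2}
rows = [activity for activity, n in counts.items() for _ in range(n)]
deaths = pd.DataFrame({"person": range(1, len(rows) + 1), "activity": rows})

# Relative frequencies: each count divided by the total number of people.
rel_freq_table = deaths["activity"].value_counts(normalize=True)
print(rel_freq_table)

# The same calculation done by hand, for comparison.
by_hand = deaths["activity"].value_counts() / len(deaths)
print(by_hand)
```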
Rounding numbers in DataFrames
You should have noticed that the relative frequencies are printed to 6 decimal places, which is unnecessary and makes the numbers hard to read. We can round the values to 2 decimal places using the round() method.
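A sketch of the rounding step, assuming a stand-in for deaths rebuilt from the counts in Table 2 (the column name person and the variable name rounded are assumptions):

```python
import pandas as pd

# Stand-in for deaths, rebuilt from the counts in Table 2.
counts = {"Grass/fodder": 44, "Forest products": 11, "Fishing": 8,
          "Herding": 7, "Disturbing tiger kill": 5, "Fuelwood/timber": 5,
          "Sleeping in house": 3, "Walking": 3, "Toilet": 2}
rows = [activity for activity, n in counts.items() for _ in range(n)]
deaths = pd.DataFrame({"person": range(1, len(rows) + 1), "activity": rows})

rel_freq_table = deaths["activity"].value_counts(normalize=True)

# Round every relative frequency to 2 decimal places.
rounded = rel_freq_table.round(2)
print(rounded)
```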
Create easy to read frequency tables
Frequency tables are a clear and concise way of presenting data on a single categorical variable. Notice that when you printed the frequency table using value_counts(), the categories were automatically ranked by pandas in descending frequency. This makes it quick and easy for a reader to compare across categories. If the categories were mixed up, as in Table 2 below, the reader would have to do a lot more mental work to rank them.
Activity | Number of people |
---|---|
Fishing | 8 |
Herding | 7 |
Fuelwood/timber | 5 |
Grass/fodder | 44 |
Disturbing tiger kill | 5 |
Toilet | 2 |
Sleeping in house | 3 |
Walking | 3 |
Forest products | 11 |
Table 2: A poorly designed table because categories are not ranked in descending order.
Having said that, when the categories are ordinal, i.e., they have an inherent order (e.g., "low", "medium", "high"), then ordering on the categories, rather than the frequencies, is preferred. You will look at an example of this in the exercises.
We can force pandas not to sort by descending frequency by adding the argument sort=False.
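A sketch of the unsorted count, assuming a stand-in for deaths rebuilt from the counts in Table 2 (the column name person and the variable name unsorted_table are assumptions). The counts are unchanged; only the ranking is dropped, and the exact row order you see may depend on your pandas version.

```python
import pandas as pd

# Stand-in for deaths, rebuilt from the counts in Table 2.
counts = {"Grass/fodder": 44, "Forest products": 11, "Fishing": 8,
          "Herding": 7, "Disturbing tiger kill": 5, "Fuelwood/timber": 5,
          "Sleeping in house": 3, "Walking": 3, "Toilet": 2}
rows = [activity for activity, n in counts.items() for _ in range(n)]
deaths = pd.DataFrame({"person": range(1, len(rows) + 1), "activity": rows})

# sort=False switches off the ranking by descending frequency.
unsorted_table = deaths["activity"].value_counts(sort=False)
print(unsorted_table)
```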
Remember to create tables and graphs so that your audience immediately sees any interesting patterns in your data. If your audience has to spend time trying to decipher poorly created graphs and tables you will annoy them and they may not understand what you are presenting.
(It's worth remembering that your audience includes the professors that will mark your work during your time at uni and you may lose marks if your tables and graphs are poorly constructed and annotated.)
Plotting frequencies: bar plot
The frequencies of the categories of a single categorical variable can be displayed in a bar plot. Categories are plotted on the x-axis and frequencies on the y-axis. To create one for tiger deaths we use the frequency table freq_table that we created using value_counts() (or, if we wanted the relative frequencies, we would use rel_freq_table).
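One possible way to draw the plot is sketched here using the pandas plot() method with kind="bar", which draws with matplotlib; the original notebook may have used a different call. The stand-in for deaths is rebuilt from the counts in Table 2, and the column name person is an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for deaths, rebuilt from the counts in Table 2.
counts = {"Grass/fodder": 44, "Forest products": 11, "Fishing": 8,
          "Herding": 7, "Disturbing tiger kill": 5, "Fuelwood/timber": 5,
          "Sleeping in house": 3, "Walking": 3, "Toilet": 2}
rows = [activity for activity, n in counts.items() for _ in range(n)]
deaths = pd.DataFrame({"person": range(1, len(rows) + 1), "activity": rows})

freq_table = deaths["activity"].value_counts()

# One bar per category: categories on the x-axis, frequencies on the y-axis.
ax = freq_table.plot(kind="bar")
```

In a notebook the figure is normally displayed below the cell; in a plain script, calling plt.show() afterwards opens it in a window.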
We should add axis labels and a title to the graph.
Make these changes to the graph to make it clearer.
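A sketch of one way to do this, using the matplotlib Axes object returned by the pandas plot() method. The label and title wording here is only illustrative, and the stand-in for deaths is rebuilt from the counts in Table 2 (the column name person is an assumption).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for deaths, rebuilt from the counts in Table 2.
counts = {"Grass/fodder": 44, "Forest products": 11, "Fishing": 8,
          "Herding": 7, "Disturbing tiger kill": 5, "Fuelwood/timber": 5,
          "Sleeping in house": 3, "Walking": 3, "Toilet": 2}
rows = [activity for activity, n in counts.items() for _ in range(n)]
deaths = pd.DataFrame({"person": range(1, len(rows) + 1), "activity": rows})

freq_table = deaths["activity"].value_counts()
ax = freq_table.plot(kind="bar")

# Illustrative wording; choose labels that suit your own report.
ax.set_xlabel("Activity")
ax.set_ylabel("Number of people killed")
ax.set_title("Human deaths by tigers near Chitwan National Park, 1979-2006")

# Make room for the rotated category names under the bars.
plt.tight_layout()
```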
A bar graph is useful for immediately visualising the relative frequencies of categories. You notice at once that collecting grass and fodder for livestock is by far the riskiest activity. The other activities are far less risky and have similar levels of risk. However, the bar graph loses information compared to a frequency table, as it is not easy to read off the numerical frequencies. Generally an ordered table is more informative (i.e., contains more information) than a bar graph.
A principle of good table and graph design
A key principle of good table and graph design is to represent the most information with the least ink. Following this principle creates clear and concise tables and graphs that are readily interpreted by your audience.
In addition, up to a tenth of your male audience is red-green colour blind, so colour choice needs some thought. Websites that offer colour-blind-friendly palettes can help in choosing colours that work for all of your audience.
The bar plot above uses a lot of ink just to display a single number representing frequency. Less ink could be used by making the bars white and just drawing their outlines. This can be accomplished by setting facecolor='w' (w stands for white) and edgecolor='k' (k stands for black).
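A sketch of how these two arguments might be passed, assuming the plot is drawn with the pandas plot() method, which hands extra keyword arguments on to matplotlib. The stand-in for deaths is rebuilt from the counts in Table 2, and the column name person is an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for deaths, rebuilt from the counts in Table 2.
counts = {"Grass/fodder": 44, "Forest products": 11, "Fishing": 8,
          "Herding": 7, "Disturbing tiger kill": 5, "Fuelwood/timber": 5,
          "Sleeping in house": 3, "Walking": 3, "Toilet": 2}
rows = [activity for activity, n in counts.items() for _ in range(n)]
deaths = pd.DataFrame({"person": range(1, len(rows) + 1), "activity": rows})

freq_table = deaths["activity"].value_counts()

# White bars with black outlines use far less ink than filled bars.
ax = freq_table.plot(kind="bar", facecolor="w", edgecolor="k")
```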
Try this in the above graph.
Never use pie charts!
The pie chart is another type of graph often used to display frequencies of a categorical variable. This method uses coloured wedges around the circumference of a circle to represent frequency or relative frequency.
The pie chart has received a lot of criticism from experts in information graphics. One reason is that while it is straightforward to visualise the frequency of deaths in the first and most frequent category (Collecting grass/fodder), it is more difficult to compare frequencies in the remaining categories by eye. This problem worsens as the number of categories increases. Another reason is that it is very difficult to compare frequencies between two or more pie charts side by side, especially when there are many categories. To compensate, pie charts are often drawn with the frequencies added as text around the circle perimeter. The result is not better than a table. The shape of a frequency distribution is more readily perceived in a bar graph than a pie chart, and it is easier to compare frequencies between two or more bar graphs than between pie charts. Use the bar graph instead of the pie chart for showing frequencies in categorical data.
References
Gurung, B., et al. (2008). Factors associated with human-killing tigers in Chitwan National Park, Nepal. Biol. Conserv. 141:3069-3078.