Data visualization example
In a recent blog post, I showed figures from a recent paper and invited readers to redesign them to communicate their message more effectively.
This notebook shows one way we might redesign the figures. At the same time, it demonstrates a simple use of a Pandas MultiIndex.
The study reports the distribution of student evaluation scores for instructors under eight conditions. At the top level, they report scores from evaluations with a 10-point or 6-point scale.
At the next level, they distinguish fields of study as "least" or "most" male-dominated.
And they distinguish between male and female instructors.
We can assemble those levels into a MultiIndex like this:
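The original code cell is not reproduced here, so the following is a minimal sketch of one way to build such an index with `pd.MultiIndex.from_product`. The variable names (`scale`, `area`, `instructor`, `index`) are my own; the level names match the table shown later in this post.

```python
import pandas as pd

# The three levels of the experimental design (2 x 2 x 2 = 8 conditions).
scale = ["10-point", "6-point"]
area = ["LeastMaleDominated", "MostMaleDominated"]
instructor = ["Male", "Female"]

# from_product enumerates every combination, with the last level varying fastest.
index = pd.MultiIndex.from_product(
    [scale, area, instructor],
    names=["Scale", "Area", "Instructor"],
)
print(index)
```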
For each of these eight conditions, the original paper reports the entire distribution of student evaluation scores. To make a simpler and clearer visualization of the results, I am going to present a summary of these distributions.
I could take the mean of each distribution, and that would show the effect. But to make it even clearer, I will use the fraction of "top" scores, meaning a 9 or 10 on the 10-point scale and a 6 on the 6-point scale.
Now, to get the data, I used the figures from the paper and estimated numbers by eye. So these numbers are only approximate!
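As a sketch of how the approximate numbers might be loaded, the block below pairs the eight estimates from the table that follows with a three-level index. The names `index`, `data`, and `df` are assumptions made for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        ["10-point", "6-point"],
        ["LeastMaleDominated", "MostMaleDominated"],
        ["Male", "Female"],
    ],
    names=["Scale", "Area", "Instructor"],
)

# Percentages of "top" scores, estimated by eye from the paper's figures,
# listed in the same order as the index (last level varies fastest).
data = [60, 60, 54, 38, 43, 42, 41, 41]

df = pd.DataFrame(data, columns=["TopScore%"], index=index)
print(df)
```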
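| Scale | Area | Instructor | TopScore% |
| --- | --- | --- | --- |
| 10-point | LeastMaleDominated | Male | 60 |
| 10-point | LeastMaleDominated | Female | 60 |
| 10-point | MostMaleDominated | Male | 54 |
| 10-point | MostMaleDominated | Female | 38 |
| 6-point | LeastMaleDominated | Male | 43 |
| 6-point | LeastMaleDominated | Female | 42 |
| 6-point | MostMaleDominated | Male | 41 |
| 6-point | MostMaleDominated | Female | 41 |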
To extract the subset of the data on a 10-point scale, we can use `loc` in the usual way.
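A minimal sketch of that selection, assuming the data live in a DataFrame called `df` (a name chosen here) with the index levels `Scale`, `Area`, and `Instructor`:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        ["10-point", "6-point"],
        ["LeastMaleDominated", "MostMaleDominated"],
        ["Male", "Female"],
    ],
    names=["Scale", "Area", "Instructor"],
)
df = pd.DataFrame(
    [60, 60, 54, 38, 43, 42, 41, 41], columns=["TopScore%"], index=index
)

# Selecting a label from the outermost level drops that level from the result.
ten_point = df.loc["10-point"]
print(ten_point)
```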
| Area | Instructor | TopScore% |
| --- | --- | --- |
| LeastMaleDominated | Male | 60 |
| LeastMaleDominated | Female | 60 |
| MostMaleDominated | Male | 54 |
| MostMaleDominated | Female | 38 |
To extract subsets at other levels, we can use `xs`. This example takes a cross-section of the second level.
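One way to write this cross-section, again assuming a DataFrame named `df` with the level names used above, is to pass the label and the `level` argument to `xs`:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        ["10-point", "6-point"],
        ["LeastMaleDominated", "MostMaleDominated"],
        ["Male", "Female"],
    ],
    names=["Scale", "Area", "Instructor"],
)
df = pd.DataFrame(
    [60, 60, 54, 38, 43, 42, 41, 41], columns=["TopScore%"], index=index
)

# Keep only the rows for the most male-dominated fields;
# the Area level is dropped from the result.
most = df.xs("MostMaleDominated", level="Area")
print(most)
```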
| Scale | Instructor | TopScore% |
| --- | --- | --- |
| 10-point | Male | 54 |
| 10-point | Female | 38 |
| 6-point | Male | 41 |
| 6-point | Female | 41 |
This example takes a cross-section of the third level.
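The same pattern applies to the innermost level. This sketch, which assumes the same `df` as before, selects the rows for male instructors:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        ["10-point", "6-point"],
        ["LeastMaleDominated", "MostMaleDominated"],
        ["Male", "Female"],
    ],
    names=["Scale", "Area", "Instructor"],
)
df = pd.DataFrame(
    [60, 60, 54, 38, 43, 42, 41, 41], columns=["TopScore%"], index=index
)

# Cross-section on the third level; the Instructor level is dropped.
male = df.xs("Male", level="Instructor")
print(male)
```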
| Scale | Area | TopScore% |
| --- | --- | --- |
| 10-point | LeastMaleDominated | 60 |
| 10-point | MostMaleDominated | 54 |
| 6-point | LeastMaleDominated | 43 |
| 6-point | MostMaleDominated | 41 |
Ok, now to think about presenting the data. At the top level, the 10-point scale and the 6-point scale are different enough that I want to put them on different axes. So I'll start by splitting the data at the top level.
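A sketch of the split, assuming the DataFrame is called `df`; the names `ten` and `six` are placeholders I chose for the two halves.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        ["10-point", "6-point"],
        ["LeastMaleDominated", "MostMaleDominated"],
        ["Male", "Female"],
    ],
    names=["Scale", "Area", "Instructor"],
)
df = pd.DataFrame(
    [60, 60, 54, 38, 43, 42, 41, 41], columns=["TopScore%"], index=index
)

# One DataFrame per scale, each indexed by (Area, Instructor).
ten = df.loc["10-point"]
six = df.loc["6-point"]
print(ten)
```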
| Area | Instructor | TopScore% |
| --- | --- | --- |
| LeastMaleDominated | Male | 60 |
| LeastMaleDominated | Female | 60 |
| MostMaleDominated | Male | 54 |
| MostMaleDominated | Female | 38 |
Now, the primary thing I want the reader to see is a discrepancy in percentages. For comparison of two or more values, a bar plot is often a good choice.
As a starting place, I'll try the Pandas default for showing a bar plot of this data.
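The exact call from the notebook is not shown here, so this is one plausible version: unstack the `Instructor` level so that each area gets a pair of bars, then use the built-in bar plot. The names `df`, `ten`, and `wide` are assumptions.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        ["10-point", "6-point"],
        ["LeastMaleDominated", "MostMaleDominated"],
        ["Male", "Female"],
    ],
    names=["Scale", "Area", "Instructor"],
)
df = pd.DataFrame(
    [60, 60, 54, 38, 43, 42, 41, 41], columns=["TopScore%"], index=index
)
ten = df.loc["10-point"]

# Move Instructor from the row index to the columns:
# rows are areas, columns are Male and Female.
wide = ten["TopScore%"].unstack("Instructor")

# Pandas draws one group of bars per row and one bar per column.
ax = wide.plot(kind="bar")
```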
As defaults go, that's not bad. From this figure it is immediately clear that there is a substantial difference in scores between male and female instructors in the most male-dominated areas, and no difference in the least male-dominated areas.
The following function cleans up some of the details in the presentation.
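The author's function is not reproduced here. As a stand-in, this is a hypothetical helper, `plot_top_scores`, that shows the kind of cleanup I have in mind: a fixed column order, readable tick labels, an axis label, and a title. The function name, labels, and styling choices are all assumptions.

```python
import pandas as pd

# Hypothetical display names for the two areas.
AREA_LABELS = {
    "LeastMaleDominated": "Least male-dominated",
    "MostMaleDominated": "Most male-dominated",
}

def plot_top_scores(frame, title):
    """Bar plot of top-score percentages by area and instructor gender.

    frame: DataFrame indexed by (Area, Instructor) with a 'TopScore%' column.
    """
    wide = frame["TopScore%"].unstack("Instructor")[["Male", "Female"]]
    ax = wide.plot(kind="bar", rot=0)
    ax.set_xticklabels([AREA_LABELS[area] for area in wide.index])
    ax.set_xlabel("")
    ax.set_ylabel("Top scores (%)")
    ax.set_title(title)
    return ax

index = pd.MultiIndex.from_product(
    [
        ["10-point", "6-point"],
        ["LeastMaleDominated", "MostMaleDominated"],
        ["Male", "Female"],
    ],
    names=["Scale", "Area", "Instructor"],
)
df = pd.DataFrame(
    [60, 60, 54, 38, 43, 42, 41, 41], columns=["TopScore%"], index=index
)

ax_ten = plot_top_scores(df.loc["10-point"], "10-point scale")
ax_six = plot_top_scores(df.loc["6-point"], "6-point scale")
```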
Here are the results for the 10-point scale.
And here are the results for the 6-point scale, which show clearly that the effect disappears when a 6-point scale is used (at least in this experiment).
Presenting two figures might be the best option, but in my challenge I asked for a single figure.
Here's a version that uses Pandas defaults with minimal customization.
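A sketch of one such single-figure version, assuming the full DataFrame is named `df`: unstacking only the `Instructor` level leaves `(Scale, Area)` pairs on the x-axis, so all four comparisons share one set of axes. The y-axis label is my own addition.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        ["10-point", "6-point"],
        ["LeastMaleDominated", "MostMaleDominated"],
        ["Male", "Female"],
    ],
    names=["Scale", "Area", "Instructor"],
)
df = pd.DataFrame(
    [60, 60, 54, 38, 43, 42, 41, 41], columns=["TopScore%"], index=index
)

# Rows are (Scale, Area) pairs; columns are Male and Female.
wide = df["TopScore%"].unstack("Instructor")[["Male", "Female"]]

# Four groups of two bars; tick labels default to the (Scale, Area) tuples.
ax = wide.plot(kind="bar")
ax.set_ylabel("Top scores (%)")
```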
With a little tuning, this could be a good choice. It clearly shows that there is a substantial difference in only one of the four conditions.