Scatter plot
Scatter plots are a good way to visualize the relationship between two variables, but it is surprisingly hard to make a good one.
Here's a simple plot of height and weight.
The center of this plot is saturated, so it is not as dark as it should be, which means the rest of the plot is relatively darker than it should be. It gives too much visual weight to the outliers and obscures the shape of the relationship.
Exercise: Use keywords alpha and markersize to avoid saturation.
With transparency and smaller markers, you will be able to see that height and weight are discretized.
Exercise: Use np.random.normal to add enough noise to height and weight so the vertical lines in the scatter plot are blurred out. Create variables named height_jitter and weight_jitter.
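As a minimal sketch of the jittering technique, assuming hypothetical heights rounded to the nearest centimeter: the array name `height_cm` and the noise scale of 2 are illustrative choices, not values taken from the dataset.

```python
import numpy as np

np.random.seed(17)  # arbitrary seed so the sketch is repeatable

# Hypothetical heights rounded to whole centimeters (not the survey data)
height_cm = np.round(np.random.normal(170, 10, size=1000))

# Add Gaussian noise; the scale is a judgment call: large enough to
# blur the columns of points, small enough to preserve the overall shape
jittered = height_cm + np.random.normal(0, 2, size=len(height_cm))

print(len(np.unique(height_cm)), "distinct values before jitter")
print(len(np.unique(jittered)), "distinct values after jitter")
```

Jittering is only for visualization; statistics should still be computed from the original values.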
Linear regression
We can use scipy.stats to find the linear least squares fit to weight as a function of height.
The LinregressResult object contains the estimated parameters and a few other statistics.
We can use the estimated slope and intercept to plot the line of best fit.
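The steps in this section can be sketched as follows. This is only an illustration on made-up data: the DataFrame `df`, its column names, and the numbers used to generate it are assumptions, not the survey data.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

np.random.seed(0)

# Hypothetical data standing in for the survey; column names are made up
n = 500
height = np.random.normal(170, 10, size=n)
weight = 0.9 * height - 75 + np.random.normal(0, 12, size=n)
df = pd.DataFrame({"height": height, "weight": weight})
df.loc[df.index % 50 == 0, "weight"] = np.nan  # simulate missing responses

# linregress does not skip NaN, so drop incomplete rows first
subset = df.dropna(subset=["height", "weight"])
res = linregress(subset["height"], subset["weight"])
print(res.slope, res.intercept, res.rvalue)

# Endpoints of the fitted line, ready to pass to a plotting call
fx = np.array([subset["height"].min(), subset["height"].max()])
fy = res.intercept + res.slope * fx
```

Because the fitted line is straight, two endpoints are enough to draw it.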
Weight and age
Exercise: Make a scatter plot of weight and age. The variable AGE is discretized in 5-year intervals, so you might want to jitter it.
Adjust transparency and marker size to generate the best view of the relationship.
Exercise: Use linregress to estimate the slope and intercept of the line of best fit for this data.
Note: as in the previous example, use dropna to drop rows that contain NaN for either variable, and use the resulting subset to compute the arguments for linregress.
Exercise: Generate a plot that shows the estimated line and a scatter plot of the data.
Box and violin plots
The Seaborn package, which is usually imported as sns, provides two functions used to show the distribution of one variable as a function of another variable.
The following box plot shows the distribution of weight in each age category. Read the documentation so you know what it means.
This figure makes the shape of the relationship clearer; average weight increases between ages 20 and 50, and then decreases.
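For reference, a box plot summarizes each group with a handful of quantiles. The following sketch computes those quantities for made-up data, assuming the common convention (the default in Matplotlib and Seaborn) that whiskers extend to the most extreme points within 1.5 times the interquartile range of the box. The array `weights` is hypothetical.

```python
import numpy as np

np.random.seed(1)
# Made-up, right-skewed weights in kg (not the survey data)
weights = np.random.lognormal(mean=4.3, sigma=0.2, size=1000)

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(weights, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme observations inside the fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = weights[(weights >= lower_fence) & (weights <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the whiskers are drawn individually as outliers
outliers = weights[(weights < lower_fence) | (weights > upper_fence)]
print(q1, median, q3, whisker_low, whisker_high, len(outliers))
```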
A violin plot is another way to show the same thing. Again, read the documentation so you know what it means.
Exercise: Make a box plot that shows the distribution of weight as a function of income. The variable INCOME2 contains income codes with 8 levels.
Use dropna to select the rows with valid income and weight information.
Exercise: Make a violin plot with the same variables.
Plotting percentiles
One more way to show the relationship between two variables is to break one variable into groups and plot percentiles of the other variable across groups.
As a starting place, here's the median weight in each age group.
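A grouped median like this can be sketched with the Pandas groupby method. The data below are made up, and the column names `age` and `weight` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(2)

# Hypothetical stand-in for the survey: ages coded at 5-year intervals
n = 2000
age = np.random.choice(np.arange(20, 85, 5), size=n)
weight = 80 - 0.01 * (age - 50) ** 2 + np.random.normal(0, 15, size=n)
df = pd.DataFrame({"age": age, "weight": weight})

# Split the rows by age code, then take the median weight in each group
grouped = df.groupby("age")
median_weight = grouped["weight"].median()
print(median_weight)
```

The result is a Series indexed by age group, so it can be plotted directly against the group labels.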
To get the other percentiles, we can use a Cdf.
Now I'll collect those results in a list of arrays:
To get the age groups, we can extract the "keys" from the groupby object.
Now, we want to loop through the columns of the list of arrays; to do that, we want to transpose it.
Now we can plot the percentiles across the groups.
In my opinion, this plot shows the shape of the relationship most clearly.
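The steps above can be sketched as follows. Note the substitution: this version uses np.percentile in place of the Cdf object, and it collects the group labels by iterating over the groupby object. The data and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

np.random.seed(3)

# Hypothetical stand-in data: ages in 5-year codes, made-up weights
n = 2000
age = np.random.choice(np.arange(20, 85, 5), size=n)
weight = 80 - 0.01 * (age - 50) ** 2 + np.random.normal(0, 15, size=n)
df = pd.DataFrame({"age": age, "weight": weight})
grouped = df.groupby("age")

ps = [95, 75, 50, 25, 5]

# One array of percentiles per age group
res = [np.percentile(group["weight"], ps) for _, group in grouped]

# Group labels, in the same order the groupby iterates
xs = [name for name, _ in grouped]

# Transpose so each row holds one percentile across all groups
rows = np.transpose(res)

for p, row in zip(ps, rows):
    # Each (xs, row) pair is one line to draw on the plot
    print(p, np.round(row[:3], 1))
```

Before the transpose, each array holds all percentiles for one group; afterward, each row holds one percentile for all groups, which is the shape a line plot needs.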
Discretizing variables
Box plots, violin plots, and percentile line plots don't work as well if the number of groups on the x-axis is too big. For example, here's a box plot of weight versus height.
This would look better and mean more if there were fewer height groups. We can use pd.cut to put people into height groups where each group spans 10 cm.
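Here is a minimal sketch of that binning on made-up heights. The bin range of 130 to 210 cm, the column names, and the choice to label each bin by its lower edge are assumptions for this example.

```python
import numpy as np
import pandas as pd

np.random.seed(4)
height = np.random.normal(170, 10, size=1000)  # made-up heights in cm
df = pd.DataFrame({"height": height})

# Bin edges every 10 cm; the range is an assumption for this sketch
bins = np.arange(130, 220, 10)

# Label each bin by its lower edge; values outside the edges become NaN
df["height_group"] = pd.cut(df["height"], bins=bins, labels=bins[:-1])

print(df["height_group"].value_counts().sort_index())
```

By default pd.cut makes intervals closed on the right, so a height of exactly 170 falls in the group labeled 160.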
Now here's what the plot looks like.
Exercise: Plot percentiles of weight versus these height groups.
Vegetables
Exercise: The variable _VEGESU1 contains the self-reported number of servings of vegetables each respondent eats per day. Explore relationships between this variable and the other variables in the dataset, and design visualizations that show any relationship you find as clearly as possible.
Correlation
One way to compute correlations is the Pandas method corr, which returns a correlation matrix.
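For example, on a made-up DataFrame (the columns and the way they are generated are assumptions, not the survey variables):

```python
import numpy as np
import pandas as pd

np.random.seed(5)
n = 1000
height = np.random.normal(170, 10, size=n)
weight = 0.9 * height - 75 + np.random.normal(0, 12, size=n)
age = np.random.uniform(20, 80, size=n)  # independent of the others here
df = pd.DataFrame({"height": height, "weight": weight, "age": age})

# Select the columns of interest; corr computes Pearson correlation by default
corr_matrix = df[["height", "weight", "age"]].corr()
print(corr_matrix.round(2))
```

The matrix is symmetric with ones on the diagonal, so each pair of variables appears twice.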
Exercise: Compute a correlation matrix for age, income, and vegetable servings.
Correlation calibration
To calibrate your sense of correlation, let's look at scatter plots for fake data with different values of rho.
The following function generates random normally-distributed data with approximately the given coefficient of correlation.
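One way such a function could be written is sketched below. The name `gen_corr` and its signature are hypothetical. It relies on a standard construction: mixing a standard normal variable with independent standard normal noise so that the result has unit variance and correlation rho with the original.

```python
import numpy as np

def gen_corr(rho, n=1000):
    """Generate n pairs of standard normal values with correlation near rho."""
    xs = np.random.normal(size=n)
    noise = np.random.normal(size=n)
    # The weights rho and sqrt(1 - rho^2) keep the variance of ys at 1
    ys = rho * xs + np.sqrt(1 - rho**2) * noise
    return xs, ys

np.random.seed(6)
xs, ys = gen_corr(0.7)

# The sample correlation varies around the target from run to run
actual_rho = np.corrcoef(xs, ys)[0, 1]
print(actual_rho)
```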
This function makes a scatter plot and shows the actual value of rho.
The following plots show what scatter plots look like with different values of rho.
Here are all the plots side-by-side for comparison.
Nonlinear relationships
Here's an example that generates fake data with a nonlinear relationship.
This relationship is quite strong, in the sense that we can make a much better guess about y if we know x than if we don't.
But if we compute correlations, they don't show the relationship.
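To make this concrete, here is a sketch with a hypothetical nonlinear relationship, a parabola plus a little noise. The functional form and noise level are assumptions chosen for illustration, not necessarily the ones used in this section.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

np.random.seed(7)
n = 2000
x = np.linspace(-1, 1, n)
y = x**2 + np.random.normal(0, 0.05, size=n)  # made-up U-shaped relationship

# Both coefficients come out close to zero for this symmetric shape
r_pearson = pearsonr(x, y)[0]
r_spearman = spearmanr(x, y)[0]
print(r_pearson, r_spearman)

# Yet x predicts y well: compare the spread of y overall
# with the spread of y around the curve
print(np.std(y), np.std(y - x**2))
```

Pearson correlation measures linear association and Spearman measures monotonic association, so neither detects a relationship that rises on one side and falls on the other.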
Correlation strength
Here are two fake datasets showing hypothetical relationships between weight and age.
Which relationship is stronger?
It depends on what we mean. Clearly, the first one has a higher coefficient of correlation. In that world, knowing someone's age would allow you to make a better guess about their weight.
But look more closely at the y-axis in the two plots. How much weight do people gain per year in each of these hypothetical worlds?
In fact, the slope for the second data set is almost 10 times higher.
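The following sketch shows how correlation and slope can point in opposite directions. The two datasets are hypothetical, and their slopes and noise levels are made-up values rather than the ones behind the figures in this section.

```python
import numpy as np
from scipy.stats import linregress

np.random.seed(8)
n = 1000
age = np.random.uniform(20, 50, size=n)

# World 1: small slope, very little scatter around the line
weight1 = 70 + 0.02 * age + np.random.normal(0, 0.1, size=n)
# World 2: slope ten times larger, but much more scatter
weight2 = 65 + 0.2 * age + np.random.normal(0, 5, size=n)

res1 = linregress(age, weight1)
res2 = linregress(age, weight2)

for res in (res1, res2):
    print(f"rho = {res.rvalue:.2f}, slope = {res.slope:.3f} kg/year")
```

Correlation depends on how tightly the points cluster around the line, while slope depends on how steep the line is, so one can be high while the other is low.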
The following figures show the same data again, this time with the line of best fit and the estimated slope.
The difference is not obvious from looking at the figure; you have to look carefully at the y-axis labels and the estimated slope.
And you have to interpret the slope in context. In the first case, people gain about 0.019 kg per year, which works out to less than half a pound per decade. In the second case, they gain almost 4 pounds per decade.
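The unit conversion behind these figures is simple arithmetic. Only the slope of 0.019 kg per year comes from the text; the conversion factor is the standard one.

```python
KG_TO_LB = 2.20462  # pounds per kilogram

slope_kg_per_year = 0.019  # first case, from the text
lb_per_decade = slope_kg_per_year * 10 * KG_TO_LB
print(round(lb_per_decade, 2))  # 0.42

# Working backward, 4 pounds per decade corresponds to this many kg per year
kg_per_year = 4 / KG_TO_LB / 10
print(round(kg_per_year, 3))  # 0.181
```

Since the second case is described as almost 4 pounds per decade, its slope is a little under 0.181 kg per year, which is consistent with being almost 10 times the first.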
But remember that in the first case, the coefficient of correlation is substantially higher.
Exercise: So, in which case is the relationship "stronger"? Write a sentence or two below to summarize your thoughts.