Jupyter notebook ATMS-305-Content/Week-11/Week 11, Exercise 3: Histograms, PDFs, and CDFs.ipynb
ATMS 305: Week 11
Histograms, PDFs, and CDFs
In this exercise, we will practice plotting the histograms, Probability Distribution Functions (PDFs), and Cumulative Distribution Functions (CDFs) of a dataset. We'll use the precipitation dataset we used previously.
Let's look at how the distribution of Tropical precipitation rates has changed over time. We'll group the data by decade and plot the distribution of the point-by-point precipitation rates within that region.
First, let's subset the data by space.
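As a rough illustration of spatial subsetting, here is a minimal sketch that uses a synthetic NumPy array in place of the course precipitation dataset. The grid, the variable names, and the 25°S to 25°N definition of the Tropics are all assumptions made for this example, not the values used in class.

```python
import numpy as np

# Hypothetical stand-in for the course dataset: precipitation (mm) on a
# coarse 10-degree grid with dimensions (time, lat, lon).
rng = np.random.default_rng(0)
lat = np.arange(-85.0, 90.0, 10.0)   # grid-cell centre latitudes
lon = np.arange(5.0, 360.0, 10.0)    # grid-cell centre longitudes
precip = rng.gamma(shape=0.8, scale=100.0, size=(24, lat.size, lon.size))

# Assumed definition of "the Tropics": 25S to 25N (adjust as needed).
in_tropics = (lat >= -25.0) & (lat <= 25.0)

# Boolean indexing along the latitude axis keeps every time and longitude.
tropics = precip[:, in_tropics, :]
print(precip.shape, "->", tropics.shape)
```

The same boolean-mask idea extends to longitude if you need a box rather than a latitude band.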
OK, now that we know how to create a subset, let's do the statistical analysis of this field. First, let's create a histogram of all the values (note we are not doing any spatial or time averaging, just subsetting). We will set up the histogram with 31 bin edges spaced 30 mm apart between 0 and 900 mm, which gives 30 bins. The `plt.hist` function counts the values between consecutive bin edges and displays the counts as a plot. Then we will create a PDF by enabling `normed=True`, and a CDF by enabling both `normed=True` and `cumulative=True`. (Current Matplotlib releases spell this option `density=True`; the older `normed` keyword has been removed.) Note that the returned value includes the bin values and the bin edges.
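To make the bin setup and the normalization concrete, here is a small sketch on synthetic data (the gamma-distributed values are made up, not the course dataset). It uses `np.histogram` so it runs without a display. `plt.hist` accepts the same `bins` array, and in current Matplotlib the normalized and cumulative versions are requested with `density=True` and `cumulative=True`.

```python
import numpy as np

# Synthetic "rain" values in mm (an assumption for illustration only).
rng = np.random.default_rng(1)
values = rng.gamma(shape=0.8, scale=100.0, size=5000)

# 31 bin edges from 0 to 900 mm -> 30 bins, each 30 mm wide.
bins = np.linspace(0, 900, 31)

# Raw counts per bin; values outside [0, 900] are ignored.
counts, edges = np.histogram(values, bins=bins)

# density=True divides by (total count * bin width), so the bars integrate to 1.
density, _ = np.histogram(values, bins=bins, density=True)

# CDF: accumulate probability per bin (density times bin width).
cdf = np.cumsum(density * np.diff(edges))

print(len(edges), len(counts), cdf[-1])
```

Note that a density is a probability per millimetre, so individual bar heights are small. It is the area under the bars that sums to one.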
Note that the distribution is far from "normal": it is strongly right-skewed, with most of the probability concentrated at low values and a long tail stretching toward high values. Physically, this corresponds to a higher probability of light rain than heavy rain in the observations, which makes sense. Examine how the PDF and CDF look for this distribution.
Now, let's process all of the decades. I'll set up a `for` loop to step through them. We will use pandas' handy built-in date functionality, which is nice for selecting date periods automatically. Decades can be selected using `10AS` as the frequency string (recent pandas releases prefer the equivalent `10YS`), and we'll select 11 decades starting in 1900.
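A minimal sketch of generating the decade start dates with pandas follows. The variable names are hypothetical, and the sketch uses the `10YS` alias because newer pandas releases deprecate `AS`; older releases accept `10AS` with the same meaning.

```python
import pandas as pd

# "10YS" = every 10th year start; older pandas releases also accept "10AS".
decade_starts = pd.date_range("1900-01-01", periods=11, freq="10YS")

for start in decade_starts:
    end = start + pd.DateOffset(years=10)  # exclusive end of the decade
    print(f"{start.year}s: {start.date()} <= time < {end.date()}")
```

Each `start`/`end` pair can then be used to select the time steps that fall within that decade.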
Now it is just a matter of looping over the decades, storing the data into arrays, and then plotting. We will set up a numpy array to store the data from the histograms. Instead of `plt.hist`, I'll use `np.histogram`, which is very similar, because I don't want to plot the data yet.
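The loop described above might look something like the following sketch. It uses synthetic values and plain integer years instead of the course dataset and its time coordinate, so the names `years`, `rain`, and `hist` are assumptions for illustration. The storage array puts precipitation bins along axis 0 and decades along axis 1.

```python
import numpy as np

# Synthetic observations: a year and a rain amount (mm) for each sample.
rng = np.random.default_rng(2)
n = 20000
years = rng.integers(1900, 2010, size=n)          # years 1900..2009
rain = rng.gamma(shape=0.8, scale=100.0, size=n)

bins = np.linspace(0, 900, 31)                    # 31 edges -> 30 bins
decades = np.arange(1900, 2010, 10)               # 11 decade start years

# One column per decade, one row per precipitation bin.
hist = np.zeros((bins.size - 1, decades.size))

for i, start in enumerate(decades):
    in_decade = (years >= start) & (years < start + 10)
    # np.histogram returns (counts, edges); only the counts are stored.
    hist[:, i], _ = np.histogram(rain[in_decade], bins=bins)

print(hist.shape)
```

Because nothing is plotted inside the loop, all decades can be drawn together afterwards from the single `hist` array.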
Interestingly, it looks like there were more locations with stations in the Tropics in the mid-20th century than there are now. This has been associated with a decline in the observational network in Africa.
That takes care of the histogram, but what about the PDF?
The PDF here is the histogram divided by the total of the histogram, which yields the fraction, or probability, of obtaining a data point in that range out of the population of data points. We can divide the histogram by its sum along `axis=0` (which totals each decade's histogram over the precipitation bins) to get the PDF.
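Here is a tiny worked example of that normalization, using a made-up histogram with four precipitation bins (rows) and two decades (columns). The numbers are invented purely to make the arithmetic easy to check.

```python
import numpy as np

# Made-up counts: 4 precipitation bins (axis 0) x 2 decades (axis 1).
hist = np.array([[60.0, 30.0],
                 [25.0, 12.0],
                 [10.0,  6.0],
                 [ 5.0,  2.0]])

# hist.sum(axis=0) gives one total per decade; broadcasting divides
# each column by its own total.
pdf = hist / hist.sum(axis=0)

print(pdf)
print(pdf.sum(axis=0))  # each decade's probabilities add up to 1
```

Each entry is the probability of landing in that bin. Dividing further by the bin width would convert these probabilities into a density per millimetre.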
To get the CDF, we can use the `cumsum` function, which accumulates the PDF along a specified axis. In this case we want axis 0, which is the precipitation-rate axis.
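As a quick check of the cumulative sum, the sketch below starts from a small made-up PDF array (rows are precipitation bins, columns are decades) and accumulates along axis 0. The values are invented for illustration.

```python
import numpy as np

# Made-up per-bin probabilities: 4 precipitation bins x 2 decades.
pdf = np.array([[0.60, 0.60],
                [0.25, 0.24],
                [0.10, 0.12],
                [0.05, 0.04]])

# Running total down each column: the probability of a value falling in
# this bin or any lower one.
cdf = np.cumsum(pdf, axis=0)

print(cdf)
```

A valid CDF never decreases and ends at 1 in every column, which is a handy sanity check on real data.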
OK, we see that the statistical distribution of observed Tropical rain rates is quite similar over the decades. We have different numbers of stations, but the distribution of rain rates at those stations hasn't changed much.
For your homework, you need to do the statistics when subsetting over space. It is quite similar to what we did here.