
Jupyter notebook ATMS-305-Content/Week-11/Week 11, Exercise 3: Histograms, PDFs, and CDFs.ipynb

Kernel: Python 3 (Anaconda)

ATMS 305: Week 11

Histograms, PDFs, and CDFs

In this exercise, we will practice plotting histograms, probability density functions (PDFs), and cumulative distribution functions (CDFs) of a dataset. We'll use the precipitation dataset we used previously.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr

precip = xr.open_dataset('http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/ghcngridded/precip.mon.total.nc')
precip
<xarray.Dataset>
Dimensions:    (lat: 36, lon: 72, nbnds: 2, time: 1385)
Coordinates:
  * lat        (lat) float32 87.5 82.5 77.5 72.5 67.5 62.5 57.5 52.5 47.5 ...
  * lon        (lon) float32 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 ...
  * time       (time) datetime64[ns] 1900-01-01 1900-02-01 1900-03-01 ...
Dimensions without coordinates: nbnds
Data variables:
    time_bnds  (time, nbnds) float64 ...
    precip     (time, lat, lon) float64 ...
Attributes:
    platform: Surface Land
    title: GHCN version 2 Land Precipitation Mean
    history: Originally created at NOAA/ESRL PSD by CAS 10/2011 from files obtained at NCDC
    Conventions: CF-1.0
    References: https://www.esrl.noaa.gov/psd/data/gridded/data.ghcnv2.html
    Source: Obtained from https://www.ncdc.noaa.gov/temp-and-precip/ghcn-gridded-products.php via ftp ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/grid/
    dataset_title: GHCN Version 3 Land Temperature and Version 2 Land Precipitation Dataset
    DODS_EXTRA.Unlimited_Dimension: time

Let's look at how the distribution of tropical precipitation rates has changed over time. We'll group the data by decade and plot the distribution of the point-by-point precipitation rates within that region.

First, let's subset the data by space.

precip_space_subset = precip['precip'].sel(lat=slice(20, -20))  # go in same order as the data when slicing
precip_space_subset
<xarray.DataArray 'precip' (time: 1385, lat: 18, lon: 72)>
[1794960 values with dtype=float64]
Coordinates:
  * lat      (lat) float32 87.5 82.5 77.5 72.5 67.5 62.5 57.5 52.5 47.5 42.5 ...
  * lon      (lon) float32 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 ...
  * time     (time) datetime64[ns] 1900-01-01 1900-02-01 1900-03-01 ...
Attributes:
    dataset: GHCN version 2 Gridded Land Precipitation Dataset
    var_desc: Precipitation
    level_desc: Surface
    statistic: Mean
    parent_stat: Observation
    valid_range: [-1000. 2000.]
    actual_range: [ -518.34997559 1693.90002441]
    units: mm
    long_name: GHCN V2 Gridded Precipitation
    precision: 2
    cell_methods: time: average (monthly from values)
    standard_name: precip
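One gotcha worth a quick note: xarray's label-based slice follows the stored order of the coordinate, and lat here runs from north to south, so the slice bounds must also run high-to-low. A minimal check (the printed values assume this 5-degree grid):

# lat is stored in descending order, so slice(20, -20) selects the tropical band;
# slice(-20, 20) would silently return an empty selection.
print(precip['precip'].sel(lat=slice(20, -20)).lat.values)  # e.g. 17.5, 12.5, ..., -17.5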
precip_time_subset = precip_space_subset.sel(time=slice('1900-01-01', '1909-12-31'))
precip_time_subset
#print(precip_time_subset['time'])
<xarray.DataArray 'precip' (time: 120, lat: 18, lon: 72)>
[155520 values with dtype=float64]
Coordinates:
  * lat      (lat) float32 87.5 82.5 77.5 72.5 67.5 62.5 57.5 52.5 47.5 42.5 ...
  * lon      (lon) float32 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 ...
  * time     (time) datetime64[ns] 1900-01-01 1900-02-01 1900-03-01 ...
Attributes:
    dataset: GHCN version 2 Gridded Land Precipitation Dataset
    var_desc: Precipitation
    level_desc: Surface
    statistic: Mean
    parent_stat: Observation
    valid_range: [-1000. 2000.]
    actual_range: [ -518.34997559 1693.90002441]
    units: mm
    long_name: GHCN V2 Gridded Precipitation
    precision: 2
    cell_methods: time: average (monthly from values)
    standard_name: precip

OK, now that we know how to create a subset, let's do the statistical analysis of this field. First, let's create a histogram of all the values (note we are not doing any spatial or time averaging, just subsetting). We will set up histogram bins 30 mm wide; for the first histogram the edges run from 0 to 1980 mm to show the full range, and for the PDF and CDF we will use 31 edges between 0 and 900 mm, which define 30 bins. The plt.hist function (which xr.plot.hist wraps) counts the values falling between consecutive bin edges and displays the counts as a plot. Then we will create a PDF by enabling normed=True, and a CDF by enabling both normed=True and cumulative=True. Note that the returned value contains the counts, the bin edges, and the plot patches.
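Before running this on the real data, here is a tiny synthetic sketch (made-up exponential values, not the GHCN data) that makes the edges-vs-counts relationship explicit:

import numpy as np

# Purely illustrative, rain-like synthetic values (mm).
sample = np.random.exponential(scale=60.0, size=1000)

edges = np.arange(0, 901, 30)          # 31 edges define 30 bins, each 30 mm wide
counts, edges_out = np.histogram(sample, bins=edges)
print(len(edges_out), len(counts))     # 31 edges, 30 counts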

hist_1910=xr.plot.hist(precip_time_subset,bins=np.arange(0,2000,30))
[Figure: histogram of monthly precipitation values, 1900-1909 subset]
hist_1910
(array([ 1.20900000e+04, 8.40200000e+03, 5.26000000e+03, 3.19100000e+03, 1.86800000e+03, 1.14000000e+03,
         7.91000000e+02, 5.59000000e+02, 4.25000000e+02, 2.93000000e+02, 2.37000000e+02, 2.04000000e+02,
         1.33000000e+02, 1.02000000e+02, 9.30000000e+01, 7.10000000e+01, 5.50000000e+01, 3.70000000e+01,
         5.50000000e+01, 1.40000000e+01, 1.80000000e+01, 1.20000000e+01, 2.00000000e+00, 1.00000000e+01,
         7.00000000e+00, 1.00000000e+01, 6.00000000e+00, 2.00000000e+00, 3.00000000e+00, 0.00000000e+00,
         1.00000000e+00, 2.00000000e+00, 0.00000000e+00, 2.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 2.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]),
 array([   0,   30,   60,   90,  120,  150,  180,  210,  240,  270,  300,  330,
         360,  390,  420,  450,  480,  510,  540,  570,  600,  630,  660,  690,
         720,  750,  780,  810,  840,  870,  900,  930,  960,  990, 1020, 1050,
        1080, 1110, 1140, 1170, 1200, 1230, 1260, 1290, 1320, 1350, 1380, 1410,
        1440, 1470, 1500, 1530, 1560, 1590, 1620, 1650, 1680, 1710, 1740, 1770,
        1800, 1830, 1860, 1890, 1920, 1950, 1980]),
 <a list of 66 Patch objects>)
pdf_1910=xr.plot.hist(precip_time_subset,bins=30*np.arange(31),normed=True)
[Figure: normed histogram (PDF) of the same subset]
pdf_1910
(array([ 1.14847535e-02, 7.98138121e-03, 4.99667522e-03, 3.03125297e-03, 1.77448466e-03, 1.08292961e-03,
         7.51401159e-04, 5.31015484e-04, 4.03723758e-04, 2.78331908e-04, 2.25135366e-04, 1.93787404e-04,
         1.26341788e-04, 9.68937019e-05, 8.83442576e-05, 6.74456160e-05, 5.22466040e-05, 3.51477154e-05,
         5.22466040e-05, 1.32991356e-05, 1.70988886e-05, 1.13992590e-05, 1.89987651e-06, 9.49938254e-06,
         6.64956778e-06, 9.49938254e-06, 5.69962952e-06, 1.89987651e-06, 2.84981476e-06, 0.00000000e+00]),
 array([  0,  30,  60,  90, 120, 150, 180, 210, 240, 270, 300, 330, 360, 390,
        420, 450, 480, 510, 540, 570, 600, 630, 660, 690, 720, 750, 780, 810,
        840, 870, 900]),
 <a list of 30 Patch objects>)
cdf_1910=xr.plot.hist(precip_time_subset, bins=30*np.arange(31), normed=True, cumulative=True)
[Figure: cumulative normed histogram (CDF) of the same subset]
cdf_1910
(array([ 0.3445426 , 0.58398404, 0.7338843 , 0.82482189, 0.87805643, 0.91054431,
         0.93308635, 0.94901681, 0.96112853, 0.96947848, 0.97623254, 0.98204617,
         0.98583642, 0.98874323, 0.99139356, 0.99341693, 0.99498433, 0.99603876,
         0.99760616, 0.99800513, 0.9985181 , 0.99886007, 0.99891707, 0.99920205,
         0.99940154, 0.99968652, 0.99985751, 0.99991451, 1.        , 1.        ]),
 array([  0,  30,  60,  90, 120, 150, 180, 210, 240, 270, 300, 330, 360, 390,
        420, 450, 480, 510, 540, 570, 600, 630, 660, 690, 720, 750, 780, 810,
        840, 870, 900]),
 <a list of 30 Patch objects>)
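A portability note: recent matplotlib releases removed the normed keyword, so on a newer install than this notebook's kernel the equivalent calls would use density=True instead (a sketch, assuming a current matplotlib):

# Same plots on newer matplotlib, where normed no longer exists.
pdf_1910 = xr.plot.hist(precip_time_subset, bins=30*np.arange(31), density=True)
cdf_1910 = xr.plot.hist(precip_time_subset, bins=30*np.arange(31), density=True, cumulative=True)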

Note that the distribution is far from "normal": most of the values fall in the lowest bins, and the distribution has a long, heavy right tail toward large rain totals. Physically, this corresponds to light rain being much more probable than heavy rain in the observations, which makes sense. Examine how the PDF and CDF look for this distribution.


Now, let's process all of the decades. I'll set up a for loop to work through them. We will use pandas' handy built-in date functionality, which is convenient for selecting date periods automatically. Decades can be selected using 10AS as the frequency string, and we'll select 11 decades starting in 1900.

import pandas as pd

time_start = pd.date_range('1900-01-01', freq='10AS', periods=11)
time_end = pd.date_range('1909-12-31', freq='10A', periods=11)
print(time_start)
print(time_end)
DatetimeIndex(['1900-01-01', '1910-01-01', '1920-01-01', '1930-01-01',
               '1940-01-01', '1950-01-01', '1960-01-01', '1970-01-01',
               '1980-01-01', '1990-01-01', '2000-01-01'],
              dtype='datetime64[ns]', freq='10AS-JAN')
DatetimeIndex(['1909-12-31', '1919-12-31', '1929-12-31', '1939-12-31',
               '1949-12-31', '1959-12-31', '1969-12-31', '1979-12-31',
               '1989-12-31', '1999-12-31', '2009-12-31'],
              dtype='datetime64[ns]', freq='10A-DEC')

Now it is just a matter of looping over the decades, storing the data into arrays, and then plotting. We will set up a numpy array to store the counts from the histograms. I'll keep using xr.plot.hist and grab just the counts (the first element of the tuple it returns); np.histogram is a very similar, non-plotting alternative if you don't want figures drawn at this stage.
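For reference, here is what the non-plotting route looks like for one decade (a sketch; note that xr.plot.hist drops missing values internally, while np.histogram expects you to mask NaNs yourself):

# A non-plotting alternative for one decade: np.histogram returns (counts, edges).
vals = precip_time_subset.values.ravel()
vals = vals[~np.isnan(vals)]               # drop missing grid points by hand
counts, edges = np.histogram(vals, bins=30*np.arange(31))
print(counts[:5])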

ndecades = len(time_start)

# bins for histograms
bins = 30*np.arange(31)

# What is going on here? Test!
bin_centers = bins[0:-1] + np.diff(bins)/2.

nbins = len(bins)

# need to subtract 1 from nbins because bins are the edges
# and are 1 more value than the histogram
histogram_decades = np.zeros((nbins-1, ndecades))

for i in np.arange(ndecades):
    data_this_decade = precip_space_subset.sel(time=slice(time_start[i], time_end[i]))
    histogram_decades[:, i] = xr.plot.hist(data_this_decade, bins=bins)[0]
[Figure: histograms drawn by xr.plot.hist while looping over the decades]
histogram_decades
array([[ 1.20900000e+04, 1.35050000e+04, 1.54030000e+04, 1.92280000e+04, 2.31060000e+04, 2.68530000e+04, 2.90300000e+04, 2.95360000e+04, 2.89660000e+04, 2.10120000e+04, 1.56520000e+04],
       [ 8.40200000e+03, 9.12600000e+03, 9.58500000e+03, 1.09830000e+04, 1.22010000e+04, 1.33950000e+04, 1.43650000e+04, 1.41130000e+04, 1.42200000e+04, 1.17280000e+04, 9.17800000e+03],
       [ 5.26000000e+03, 5.81900000e+03, 6.12700000e+03, 6.89900000e+03, 7.51300000e+03, 8.15700000e+03, 8.48800000e+03, 8.41900000e+03, 8.27600000e+03, 7.36400000e+03, 5.91700000e+03],
       [ 3.19100000e+03, 3.47800000e+03, 3.91900000e+03, 4.21300000e+03, 4.48200000e+03, 5.01200000e+03, 5.25100000e+03, 5.11100000e+03, 5.09400000e+03, 4.47500000e+03, 3.79700000e+03],
       [ 1.86800000e+03, 2.08800000e+03, 2.37000000e+03, 2.69800000e+03, 2.75800000e+03, 3.08700000e+03, 3.16300000e+03, 3.25500000e+03, 3.25300000e+03, 2.74300000e+03, 2.35600000e+03],
       [ 1.14000000e+03, 1.32800000e+03, 1.55400000e+03, 1.70100000e+03, 1.76400000e+03, 2.04300000e+03, 2.10800000e+03, 2.09100000e+03, 2.15300000e+03, 1.76000000e+03, 1.46100000e+03],
       [ 7.91000000e+02, 9.64000000e+02, 1.04900000e+03, 1.21000000e+03, 1.25100000e+03, 1.48700000e+03, 1.49700000e+03, 1.52300000e+03, 1.44800000e+03, 1.22200000e+03, 1.06700000e+03],
       [ 5.59000000e+02, 6.66000000e+02, 7.86000000e+02, 9.18000000e+02, 8.44000000e+02, 1.09600000e+03, 1.09500000e+03, 1.10900000e+03, 1.11200000e+03, 8.60000000e+02, 7.60000000e+02],
       [ 4.25000000e+02, 5.04000000e+02, 5.72000000e+02, 7.09000000e+02, 6.80000000e+02, 8.88000000e+02, 8.40000000e+02, 8.26000000e+02, 7.84000000e+02, 6.78000000e+02, 5.45000000e+02],
       [ 2.93000000e+02, 3.68000000e+02, 4.05000000e+02, 5.32000000e+02, 4.67000000e+02, 6.22000000e+02, 6.77000000e+02, 6.17000000e+02, 6.18000000e+02, 4.99000000e+02, 4.46000000e+02],
       [ 2.37000000e+02, 3.06000000e+02, 2.95000000e+02, 3.73000000e+02, 3.52000000e+02, 4.96000000e+02, 5.11000000e+02, 4.92000000e+02, 4.46000000e+02, 3.63000000e+02, 3.27000000e+02],
       [ 2.04000000e+02, 2.31000000e+02, 2.32000000e+02, 2.92000000e+02, 2.77000000e+02, 3.84000000e+02, 3.78000000e+02, 3.84000000e+02, 3.59000000e+02, 2.94000000e+02, 2.66000000e+02],
       [ 1.33000000e+02, 1.88000000e+02, 1.69000000e+02, 2.52000000e+02, 2.16000000e+02, 2.86000000e+02, 2.50000000e+02, 2.77000000e+02, 2.77000000e+02, 2.20000000e+02, 1.85000000e+02],
       [ 1.02000000e+02, 1.05000000e+02, 1.40000000e+02, 1.44000000e+02, 1.56000000e+02, 2.20000000e+02, 2.48000000e+02, 2.14000000e+02, 2.02000000e+02, 1.56000000e+02, 1.36000000e+02],
       [ 9.30000000e+01, 9.40000000e+01, 1.21000000e+02, 1.22000000e+02, 9.60000000e+01, 1.56000000e+02, 1.55000000e+02, 1.85000000e+02, 1.74000000e+02, 1.30000000e+02, 1.19000000e+02],
       [ 7.10000000e+01, 6.00000000e+01, 8.00000000e+01, 1.01000000e+02, 7.80000000e+01, 1.35000000e+02, 1.12000000e+02, 1.27000000e+02, 1.08000000e+02, 8.10000000e+01, 8.20000000e+01],
       [ 5.50000000e+01, 4.20000000e+01, 7.70000000e+01, 7.40000000e+01, 6.40000000e+01, 1.00000000e+02, 8.70000000e+01, 1.07000000e+02, 9.50000000e+01, 8.80000000e+01, 5.30000000e+01],
       [ 3.70000000e+01, 4.20000000e+01, 4.40000000e+01, 6.00000000e+01, 6.00000000e+01, 6.80000000e+01, 7.30000000e+01, 7.70000000e+01, 4.80000000e+01, 4.90000000e+01, 5.80000000e+01],
       [ 5.50000000e+01, 2.90000000e+01, 3.90000000e+01, 4.90000000e+01, 2.80000000e+01, 5.20000000e+01, 6.30000000e+01, 4.10000000e+01, 3.70000000e+01, 3.90000000e+01, 4.80000000e+01],
       [ 1.40000000e+01, 2.60000000e+01, 2.40000000e+01, 3.70000000e+01, 2.30000000e+01, 4.30000000e+01, 2.90000000e+01, 4.50000000e+01, 3.60000000e+01, 3.00000000e+01, 3.00000000e+01],
       [ 1.80000000e+01, 1.60000000e+01, 2.10000000e+01, 3.00000000e+01, 2.20000000e+01, 3.10000000e+01, 4.10000000e+01, 4.20000000e+01, 3.20000000e+01, 2.10000000e+01, 2.50000000e+01],
       [ 1.20000000e+01, 1.60000000e+01, 3.10000000e+01, 2.10000000e+01, 2.00000000e+01, 2.20000000e+01, 2.80000000e+01, 1.60000000e+01, 1.90000000e+01, 2.30000000e+01, 2.30000000e+01],
       [ 2.00000000e+00, 1.40000000e+01, 1.30000000e+01, 1.80000000e+01, 9.00000000e+00, 1.90000000e+01, 1.90000000e+01, 2.20000000e+01, 1.50000000e+01, 1.20000000e+01, 9.00000000e+00],
       [ 1.00000000e+01, 1.40000000e+01, 1.60000000e+01, 1.40000000e+01, 9.00000000e+00, 2.20000000e+01, 2.10000000e+01, 1.50000000e+01, 1.80000000e+01, 1.30000000e+01, 1.40000000e+01],
       [ 7.00000000e+00, 6.00000000e+00, 8.00000000e+00, 8.00000000e+00, 4.00000000e+00, 1.10000000e+01, 1.40000000e+01, 1.20000000e+01, 9.00000000e+00, 4.00000000e+00, 8.00000000e+00],
       [ 1.00000000e+01, 6.00000000e+00, 7.00000000e+00, 1.10000000e+01, 7.00000000e+00, 8.00000000e+00, 7.00000000e+00, 1.00000000e+01, 3.00000000e+00, 5.00000000e+00, 6.00000000e+00],
       [ 6.00000000e+00, 5.00000000e+00, 7.00000000e+00, 4.00000000e+00, 6.00000000e+00, 8.00000000e+00, 7.00000000e+00, 3.00000000e+00, 4.00000000e+00, 8.00000000e+00, 8.00000000e+00],
       [ 2.00000000e+00, 3.00000000e+00, 5.00000000e+00, 2.00000000e+00, 4.00000000e+00, 6.00000000e+00, 1.40000000e+01, 6.00000000e+00, 2.00000000e+00, 3.00000000e+00, 1.00000000e+00],
       [ 3.00000000e+00, 1.00000000e+00, 7.00000000e+00, 8.00000000e+00, 3.00000000e+00, 4.00000000e+00, 2.00000000e+00, 4.00000000e+00, 2.00000000e+00, 3.00000000e+00, 4.00000000e+00],
       [ 0.00000000e+00, 4.00000000e+00, 3.00000000e+00, 5.00000000e+00, 1.00000000e+00, 8.00000000e+00, 3.00000000e+00, 0.00000000e+00, 3.00000000e+00, 2.00000000e+00, 4.00000000e+00]])
plt.figure()
plt.plot(bin_centers, histogram_decades)
plt.xlabel('Precipitation (mm/mon)')
plt.ylabel('Histogram counts')
plt.legend(time_start, fontsize=8)
plt.show()
[Figure: histogram counts vs. precipitation, one line per decade]

Interestingly, it looks like there were more locations with stations in the mid-20th century than now in the Tropics. This has been associated with a decline in the observational network in Africa.

That takes care of the histogram, but what about the PDF?

A PDF is the histogram divided by the total number of counts, which yields the fraction, or probability, of obtaining a data point in each bin out of the whole population of data points. We can divide the histogram by its sum over axis=0 (which sums each decade's histogram separately) to get the PDF.

pdf_decades=histogram_decades/np.sum(histogram_decades,axis=0)
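As a quick sanity check, each column of pdf_decades should sum to 1. (Strictly speaking, a density would also be divided by the 30-mm bin width, but that only rescales the y-axis; the shape is identical.)

print(pdf_decades.sum(axis=0))   # expect ~1.0 for every decade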
plt.figure()
plt.plot(bin_centers, pdf_decades)
plt.xlabel('Precipitation (mm/mon)')
plt.ylabel('PDF')
plt.legend(time_start, fontsize=8)
plt.show()
[Figure: PDF vs. precipitation, one line per decade]

To get the CDF, we can use the cumsum method, which cumulatively sums the PDF along a specified axis; in this case we want axis 0, the precipitation-rate bins.

cdf_decades=pdf_decades.cumsum(axis=0)
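A matching sanity check for the CDF: it should be non-decreasing and reach 1 at the last bin in every decade.

print(np.allclose(cdf_decades[-1, :], 1.0))        # ends at 1
print(np.all(np.diff(cdf_decades, axis=0) >= 0))   # monotonically non-decreasing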
plt.figure()
plt.plot(bin_centers, cdf_decades)
plt.xlabel('Precipitation (mm/mon)')
plt.ylabel('CDF')
plt.legend(time_start, fontsize=8, loc='lower right')
plt.show()
[Figure: CDF vs. precipitation, one line per decade]

OK, we see that the statistical distribution of observed Tropical rain rates is quite similar over the decades. We have different numbers of stations, but the distribution of rain rates at those stations hasn't changed much.

For your homework, you need to do the same statistics when subsetting over space. It is quite similar to what we did here.
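As a starting point, here is a minimal sketch of that direction using hypothetical latitude bands (swap in whatever regions the assignment actually specifies):

# Hypothetical regions -- replace with the ones in your assignment.
regions = {'Tropics': slice(20, -20), 'NH midlatitudes': slice(60, 20)}

for name, lat_band in regions.items():
    subset = precip['precip'].sel(lat=lat_band, time=slice('1900-01-01', '1909-12-31'))
    vals = subset.values.ravel()
    vals = vals[~np.isnan(vals)]
    counts, _ = np.histogram(vals, bins=30*np.arange(31))
    print(name, int(counts.sum()), 'valid points')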