ATMS 305: Week 11

Histograms, PDFs, and CDFs

In this exercise, we will practicing plotting the histograms, Probability Distribution Functions, and Cumulative Distribution Functions of a dataset. We'll use the precipitation dataset we used previously.

In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

import xarray as xr

precip = xr.open_dataset('http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/ghcngridded/precip.mon.total.nc')
precip
Out[1]:
<xarray.Dataset>
Dimensions:    (lat: 36, lon: 72, nbnds: 2, time: 1385)
Coordinates:
  * lat        (lat) float32 87.5 82.5 77.5 72.5 67.5 62.5 57.5 52.5 47.5 ...
  * lon        (lon) float32 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 ...
  * time       (time) datetime64[ns] 1900-01-01 1900-02-01 1900-03-01 ...
Dimensions without coordinates: nbnds
Data variables:
    time_bnds  (time, nbnds) float64 ...
    precip     (time, lat, lon) float64 ...
Attributes:
    platform: Surface Land
    title: GHCN version 2 Land Precipitation Mean
    history: Originally created at NOAA/ESRL PSD by CAS 10/2011 from files obtained at NCDC
    Conventions: CF-1.0
    References: https://www.esrl.noaa.gov/psd/data/gridded/data.ghcnv2.html
    Source: Obtained from https://www.ncdc.noaa.gov/temp-and-precip/ghcn-gridded-products.php via ftp ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/grid/
    dataset_title: GHCN Version 3 Land Temperature and Version 2 Land Precipitation Dataset
    DODS_EXTRA.Unlimited_Dimension: time

Let's look at how the distribution of Tropical precipitation rates have changed over time. Let's group the data by decade, and plot a distribution of the point by point precipitation rates within that region.

First, let's subset the data by space.

In [2]:
precip_space_subset=precip['precip'].sel(lat=slice(20 -20)) #go in ame order as the data when slicing
precip_space_subset
Out[2]:
<xarray.DataArray 'precip' (time: 1385, lat: 18, lon: 72)>
[1794960 values with dtype=float64]
Coordinates:
  * lat      (lat) float32 87.5 82.5 77.5 72.5 67.5 62.5 57.5 52.5 47.5 42.5 ...
  * lon      (lon) float32 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 ...
  * time     (time) datetime64[ns] 1900-01-01 1900-02-01 1900-03-01 ...
Attributes:
    dataset: GHCN version 2 Gridded Land Precipitation Dataset
    var_desc: Precipitation
    level_desc: Surface
    statistic: Mean
    parent_stat: Observation
    valid_range: [-1000.  2000.]
    actual_range: [ -518.34997559  1693.90002441]
    units: mm
    long_name: GHCN V2 Gridded Precipitation
    precision: 2
    cell_methods: time: average (monthly from values)
    standard_name: precip
In [3]:
precip_time_subset=precip_space_subset.sel(time=slice('1900-01-01', '1909-12-31'))
precip_time_subset
#print(precip_time_subset['time'])
Out[3]:
<xarray.DataArray 'precip' (time: 120, lat: 18, lon: 72)>
[155520 values with dtype=float64]
Coordinates:
  * lat      (lat) float32 87.5 82.5 77.5 72.5 67.5 62.5 57.5 52.5 47.5 42.5 ...
  * lon      (lon) float32 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 ...
  * time     (time) datetime64[ns] 1900-01-01 1900-02-01 1900-03-01 ...
Attributes:
    dataset: GHCN version 2 Gridded Land Precipitation Dataset
    var_desc: Precipitation
    level_desc: Surface
    statistic: Mean
    parent_stat: Observation
    valid_range: [-1000.  2000.]
    actual_range: [ -518.34997559  1693.90002441]
    units: mm
    long_name: GHCN V2 Gridded Precipitation
    precision: 2
    cell_methods: time: average (monthly from values)
    standard_name: precip

OK, now that we know how to create a subset, let's do the statistical analysis of this field. First, let's create a histogram of all the values (note we are not doing any spatial or time averaging, just subsetting), and we will set up the bins in the histogram to have 31 values, 30 mm wide between 0 and 900 mm. These are the bin edges. The plt.hist function will count the values between the bin edges, and display them as a plot. Then we will create a PDF by enabling normed=True, and a CDF by normed=True and cumulative=True, respectively. Note that the returned value has the values, and the bin edges.

In [6]:
hist_1910=xr.plot.hist(precip_time_subset,bins=np.arange(0,2000,30))
In [7]:
hist_1910
Out[7]:
(array([  1.20900000e+04,   8.40200000e+03,   5.26000000e+03,
          3.19100000e+03,   1.86800000e+03,   1.14000000e+03,
          7.91000000e+02,   5.59000000e+02,   4.25000000e+02,
          2.93000000e+02,   2.37000000e+02,   2.04000000e+02,
          1.33000000e+02,   1.02000000e+02,   9.30000000e+01,
          7.10000000e+01,   5.50000000e+01,   3.70000000e+01,
          5.50000000e+01,   1.40000000e+01,   1.80000000e+01,
          1.20000000e+01,   2.00000000e+00,   1.00000000e+01,
          7.00000000e+00,   1.00000000e+01,   6.00000000e+00,
          2.00000000e+00,   3.00000000e+00,   0.00000000e+00,
          1.00000000e+00,   2.00000000e+00,   0.00000000e+00,
          2.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          2.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          1.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00]),
 array([   0,   30,   60,   90,  120,  150,  180,  210,  240,  270,  300,
         330,  360,  390,  420,  450,  480,  510,  540,  570,  600,  630,
         660,  690,  720,  750,  780,  810,  840,  870,  900,  930,  960,
         990, 1020, 1050, 1080, 1110, 1140, 1170, 1200, 1230, 1260, 1290,
        1320, 1350, 1380, 1410, 1440, 1470, 1500, 1530, 1560, 1590, 1620,
        1650, 1680, 1710, 1740, 1770, 1800, 1830, 1860, 1890, 1920, 1950,
        1980]),
 <a list of 66 Patch objects>)
In [8]:
pdf_1910=xr.plot.hist(precip_time_subset,bins=30*np.arange(31),normed=True)
In [9]:
pdf_1910
Out[9]:
(array([  1.14847535e-02,   7.98138121e-03,   4.99667522e-03,
          3.03125297e-03,   1.77448466e-03,   1.08292961e-03,
          7.51401159e-04,   5.31015484e-04,   4.03723758e-04,
          2.78331908e-04,   2.25135366e-04,   1.93787404e-04,
          1.26341788e-04,   9.68937019e-05,   8.83442576e-05,
          6.74456160e-05,   5.22466040e-05,   3.51477154e-05,
          5.22466040e-05,   1.32991356e-05,   1.70988886e-05,
          1.13992590e-05,   1.89987651e-06,   9.49938254e-06,
          6.64956778e-06,   9.49938254e-06,   5.69962952e-06,
          1.89987651e-06,   2.84981476e-06,   0.00000000e+00]),
 array([  0,  30,  60,  90, 120, 150, 180, 210, 240, 270, 300, 330, 360,
        390, 420, 450, 480, 510, 540, 570, 600, 630, 660, 690, 720, 750,
        780, 810, 840, 870, 900]),
 <a list of 30 Patch objects>)
In [10]:
cdf_1910=xr.plot.hist(precip_time_subset, bins=30*np.arange(31), normed=True, cumulative=True)
In [11]:
cdf_1910
Out[11]:
(array([ 0.3445426 ,  0.58398404,  0.7338843 ,  0.82482189,  0.87805643,
         0.91054431,  0.93308635,  0.94901681,  0.96112853,  0.96947848,
         0.97623254,  0.98204617,  0.98583642,  0.98874323,  0.99139356,
         0.99341693,  0.99498433,  0.99603876,  0.99760616,  0.99800513,
         0.9985181 ,  0.99886007,  0.99891707,  0.99920205,  0.99940154,
         0.99968652,  0.99985751,  0.99991451,  1.        ,  1.        ]),
 array([  0,  30,  60,  90, 120, 150, 180, 210, 240, 270, 300, 330, 360,
        390, 420, 450, 480, 510, 540, 570, 600, 630, 660, 690, 720, 750,
        780, 810, 840, 870, 900]),
 <a list of 30 Patch objects>)

Note that the distribution is far from "normal" - the left tail of the distribution is "heavier" than the right tail. This physically corresponds with a higher probability of light rain than heavy rain in the observations which makes sense. Examine how the PDF and CDF looks in this case corresponding with this distribution.


Now, let's process all of the decades. I'll set up a for loop to process through. We will use pandas handy built in date functionality which is nice for selecting date periods automatically. Decades can be selected using 10AS as the frequency string, and we'll select 11 decades starting in 1900.

In [12]:
import pandas as pd

time_start = pd.date_range('1900-01-01', freq='10AS', periods=11)
time_end = pd.date_range('1909-12-31', freq='10A', periods=11)
print(time_start)
print(time_end)
DatetimeIndex(['1900-01-01', '1910-01-01', '1920-01-01', '1930-01-01',
               '1940-01-01', '1950-01-01', '1960-01-01', '1970-01-01',
               '1980-01-01', '1990-01-01', '2000-01-01'],
              dtype='datetime64[ns]', freq='10AS-JAN')
DatetimeIndex(['1909-12-31', '1919-12-31', '1929-12-31', '1939-12-31',
               '1949-12-31', '1959-12-31', '1969-12-31', '1979-12-31',
               '1989-12-31', '1999-12-31', '2009-12-31'],
              dtype='datetime64[ns]', freq='10A-DEC')

Now, it is just a matter of looping over the decades, storing the data into arrays, and then plotting. We will set up a numpy array to store the data from the histograms. Instead of plt.hist i'll use np.histogram, which is very similar, because I don't want to plot the data yet.

In [13]:
ndecades=len(time_start)

#bins for histograms
bins=30*np.arange(31)
#What is going on here?  Test!
bin_centers=bins[0:-1]+np.diff(bins)/2.
nbins=len(bins)

#need to subtract 1 from nbins because bins are the edges
#and are 1 more value than the histogram
histogram_decades=np.zeros((nbins-1,ndecades))

for i in np.arange(ndecades):
    data_this_decade=precip_space_subset.sel(time=slice(time_start[i], time_end[i]))
    histogram_decades[:,i]=xr.plot.hist(data_this_decade,bins=bins)[0]
In [14]:
histogram_decades
Out[14]:
array([[  1.20900000e+04,   1.35050000e+04,   1.54030000e+04,
          1.92280000e+04,   2.31060000e+04,   2.68530000e+04,
          2.90300000e+04,   2.95360000e+04,   2.89660000e+04,
          2.10120000e+04,   1.56520000e+04],
       [  8.40200000e+03,   9.12600000e+03,   9.58500000e+03,
          1.09830000e+04,   1.22010000e+04,   1.33950000e+04,
          1.43650000e+04,   1.41130000e+04,   1.42200000e+04,
          1.17280000e+04,   9.17800000e+03],
       [  5.26000000e+03,   5.81900000e+03,   6.12700000e+03,
          6.89900000e+03,   7.51300000e+03,   8.15700000e+03,
          8.48800000e+03,   8.41900000e+03,   8.27600000e+03,
          7.36400000e+03,   5.91700000e+03],
       [  3.19100000e+03,   3.47800000e+03,   3.91900000e+03,
          4.21300000e+03,   4.48200000e+03,   5.01200000e+03,
          5.25100000e+03,   5.11100000e+03,   5.09400000e+03,
          4.47500000e+03,   3.79700000e+03],
       [  1.86800000e+03,   2.08800000e+03,   2.37000000e+03,
          2.69800000e+03,   2.75800000e+03,   3.08700000e+03,
          3.16300000e+03,   3.25500000e+03,   3.25300000e+03,
          2.74300000e+03,   2.35600000e+03],
       [  1.14000000e+03,   1.32800000e+03,   1.55400000e+03,
          1.70100000e+03,   1.76400000e+03,   2.04300000e+03,
          2.10800000e+03,   2.09100000e+03,   2.15300000e+03,
          1.76000000e+03,   1.46100000e+03],
       [  7.91000000e+02,   9.64000000e+02,   1.04900000e+03,
          1.21000000e+03,   1.25100000e+03,   1.48700000e+03,
          1.49700000e+03,   1.52300000e+03,   1.44800000e+03,
          1.22200000e+03,   1.06700000e+03],
       [  5.59000000e+02,   6.66000000e+02,   7.86000000e+02,
          9.18000000e+02,   8.44000000e+02,   1.09600000e+03,
          1.09500000e+03,   1.10900000e+03,   1.11200000e+03,
          8.60000000e+02,   7.60000000e+02],
       [  4.25000000e+02,   5.04000000e+02,   5.72000000e+02,
          7.09000000e+02,   6.80000000e+02,   8.88000000e+02,
          8.40000000e+02,   8.26000000e+02,   7.84000000e+02,
          6.78000000e+02,   5.45000000e+02],
       [  2.93000000e+02,   3.68000000e+02,   4.05000000e+02,
          5.32000000e+02,   4.67000000e+02,   6.22000000e+02,
          6.77000000e+02,   6.17000000e+02,   6.18000000e+02,
          4.99000000e+02,   4.46000000e+02],
       [  2.37000000e+02,   3.06000000e+02,   2.95000000e+02,
          3.73000000e+02,   3.52000000e+02,   4.96000000e+02,
          5.11000000e+02,   4.92000000e+02,   4.46000000e+02,
          3.63000000e+02,   3.27000000e+02],
       [  2.04000000e+02,   2.31000000e+02,   2.32000000e+02,
          2.92000000e+02,   2.77000000e+02,   3.84000000e+02,
          3.78000000e+02,   3.84000000e+02,   3.59000000e+02,
          2.94000000e+02,   2.66000000e+02],
       [  1.33000000e+02,   1.88000000e+02,   1.69000000e+02,
          2.52000000e+02,   2.16000000e+02,   2.86000000e+02,
          2.50000000e+02,   2.77000000e+02,   2.77000000e+02,
          2.20000000e+02,   1.85000000e+02],
       [  1.02000000e+02,   1.05000000e+02,   1.40000000e+02,
          1.44000000e+02,   1.56000000e+02,   2.20000000e+02,
          2.48000000e+02,   2.14000000e+02,   2.02000000e+02,
          1.56000000e+02,   1.36000000e+02],
       [  9.30000000e+01,   9.40000000e+01,   1.21000000e+02,
          1.22000000e+02,   9.60000000e+01,   1.56000000e+02,
          1.55000000e+02,   1.85000000e+02,   1.74000000e+02,
          1.30000000e+02,   1.19000000e+02],
       [  7.10000000e+01,   6.00000000e+01,   8.00000000e+01,
          1.01000000e+02,   7.80000000e+01,   1.35000000e+02,
          1.12000000e+02,   1.27000000e+02,   1.08000000e+02,
          8.10000000e+01,   8.20000000e+01],
       [  5.50000000e+01,   4.20000000e+01,   7.70000000e+01,
          7.40000000e+01,   6.40000000e+01,   1.00000000e+02,
          8.70000000e+01,   1.07000000e+02,   9.50000000e+01,
          8.80000000e+01,   5.30000000e+01],
       [  3.70000000e+01,   4.20000000e+01,   4.40000000e+01,
          6.00000000e+01,   6.00000000e+01,   6.80000000e+01,
          7.30000000e+01,   7.70000000e+01,   4.80000000e+01,
          4.90000000e+01,   5.80000000e+01],
       [  5.50000000e+01,   2.90000000e+01,   3.90000000e+01,
          4.90000000e+01,   2.80000000e+01,   5.20000000e+01,
          6.30000000e+01,   4.10000000e+01,   3.70000000e+01,
          3.90000000e+01,   4.80000000e+01],
       [  1.40000000e+01,   2.60000000e+01,   2.40000000e+01,
          3.70000000e+01,   2.30000000e+01,   4.30000000e+01,
          2.90000000e+01,   4.50000000e+01,   3.60000000e+01,
          3.00000000e+01,   3.00000000e+01],
       [  1.80000000e+01,   1.60000000e+01,   2.10000000e+01,
          3.00000000e+01,   2.20000000e+01,   3.10000000e+01,
          4.10000000e+01,   4.20000000e+01,   3.20000000e+01,
          2.10000000e+01,   2.50000000e+01],
       [  1.20000000e+01,   1.60000000e+01,   3.10000000e+01,
          2.10000000e+01,   2.00000000e+01,   2.20000000e+01,
          2.80000000e+01,   1.60000000e+01,   1.90000000e+01,
          2.30000000e+01,   2.30000000e+01],
       [  2.00000000e+00,   1.40000000e+01,   1.30000000e+01,
          1.80000000e+01,   9.00000000e+00,   1.90000000e+01,
          1.90000000e+01,   2.20000000e+01,   1.50000000e+01,
          1.20000000e+01,   9.00000000e+00],
       [  1.00000000e+01,   1.40000000e+01,   1.60000000e+01,
          1.40000000e+01,   9.00000000e+00,   2.20000000e+01,
          2.10000000e+01,   1.50000000e+01,   1.80000000e+01,
          1.30000000e+01,   1.40000000e+01],
       [  7.00000000e+00,   6.00000000e+00,   8.00000000e+00,
          8.00000000e+00,   4.00000000e+00,   1.10000000e+01,
          1.40000000e+01,   1.20000000e+01,   9.00000000e+00,
          4.00000000e+00,   8.00000000e+00],
       [  1.00000000e+01,   6.00000000e+00,   7.00000000e+00,
          1.10000000e+01,   7.00000000e+00,   8.00000000e+00,
          7.00000000e+00,   1.00000000e+01,   3.00000000e+00,
          5.00000000e+00,   6.00000000e+00],
       [  6.00000000e+00,   5.00000000e+00,   7.00000000e+00,
          4.00000000e+00,   6.00000000e+00,   8.00000000e+00,
          7.00000000e+00,   3.00000000e+00,   4.00000000e+00,
          8.00000000e+00,   8.00000000e+00],
       [  2.00000000e+00,   3.00000000e+00,   5.00000000e+00,
          2.00000000e+00,   4.00000000e+00,   6.00000000e+00,
          1.40000000e+01,   6.00000000e+00,   2.00000000e+00,
          3.00000000e+00,   1.00000000e+00],
       [  3.00000000e+00,   1.00000000e+00,   7.00000000e+00,
          8.00000000e+00,   3.00000000e+00,   4.00000000e+00,
          2.00000000e+00,   4.00000000e+00,   2.00000000e+00,
          3.00000000e+00,   4.00000000e+00],
       [  0.00000000e+00,   4.00000000e+00,   3.00000000e+00,
          5.00000000e+00,   1.00000000e+00,   8.00000000e+00,
          3.00000000e+00,   0.00000000e+00,   3.00000000e+00,
          2.00000000e+00,   4.00000000e+00]])
In [15]:
plt.figure()
plt.plot(bin_centers,histogram_decades)
plt.xlabel('Precipitation (mm/mon)')
plt.ylabel('Histogram counts')
plt.legend(time_start,fontsize=8)
plt.show()

Interestingly, it looks like there are more lications with stations in the mid-20th century than now in the Tropics. This has been associated with a decline in the observational network in Africa.

That takes care of the histogram, but what about the PDF?

The definition of a PDF is the sum of the histogram divided by the total of the histogram, which yields the fraction or probability of obtaining a data in that range out of the population of datapoints. We can divide the histogram by the sum on axis=0 (which will sum each decadal pdf) to get the PDF.

In [16]:
pdf_decades=histogram_decades/np.sum(histogram_decades,axis=0)
In [17]:
plt.figure()
plt.plot(bin_centers,pdf_decades)
plt.xlabel('Precipitation (mm/mon)')
plt.ylabel('PDF')
plt.legend(time_start,fontsize=8)
plt.show()

To get the CDF, we can use the cumsum object, which will sum the PDF over a specified axis, in this case we want axis 0, which are the precipitation rates.

In [18]:
cdf_decades=pdf_decades.cumsum(axis=0)
In [19]:
plt.figure()
plt.plot(bin_centers,cdf_decades)
plt.xlabel('Precipitation (mm/mon)')
plt.xlabel('CDF')
plt.legend(time_start,fontsize=8,loc='lower right')
plt.show()

OK, we see that the statistical distribution of observed Tropical rain rates is quite similar over the decades. We have different numbers of stations, but the distribution of rain rates at those stations hasn't changed much.

For you homework, you need to do the statistics when subsetting over space. It is quite similar to what we did here.

In [ ]: