Authors: Katherine Inskip, Mark Quinn

# Analysis #2: Python

Learning objectives:

1. Perform a least squares correlation analysis

2. Linearise non-linear data

3. Perform a non-linear correlation analysis

4. Use Python programming to:

• analyse data

• visualise data and analysis results

A key skill for any scientific problem solver is the analysis and visualisation of data. During semester 1 of the Pro Skills module you will already have had experience with running an experiment such as measuring gravitational acceleration. The data from such investigations typically requires some level of analysis to determine a key measurement and the respective errors. So far, you would have performed this analysis using a spreadsheet program such as Excel.

An alternative tool in performing your analysis is via a programming language such as Python. While classes in Python programming currently begin in year 2 (PHY235) here we will introduce a basic practical usage in the context of data analysis. This option does not require any prior knowledge of programming. Depending on your preference, the Python option may be faster, more efficient and produce nicer looking figures than Excel.

## What's a computer programme?

If you haven't programmed before, you will be relieved to know that it is easy to learn yet very powerful.

Physicists and other scientists typically perform a lot of data analysis, visualisation and simulations. Programming is the only way we can do this in an efficient and reproducible manner. The ability to directly control a computer's brain (the CPU) gives you the power to perform over 100 billion calculations per second!

The easy part comes from realising that a programme is simply a list of instructions, which the CPU typically carries out from top to bottom.

A programming language acts as a facilitator of your commands, translating English into the zeros and ones understandable by a CPU.

Our choice of language is Python: it's easy to write and very powerful.

The short programme below is all you need to perform a full least squares correlation analysis and to visualise the results ... all in a fraction of a second!

The key learning objective of this session is essentially understanding and using this short programme.

After today, you should be able to simply repeat this programme for your weekly lab analysis in 1st year physics.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Load the data and define the reading error on the length
filename = "Data/LengthTempData.txt"
temp, length = np.loadtxt(filename, unpack=True)
error = 0.0005  # [m]

slope, intercept, r_value, p_value, err = linregress(temp, length)
plt.errorbar(temp, length, fmt='o', label='Data', yerr=error)
plt.xlabel('Temp [$^{\circ}$C]')
plt.ylabel('Length [m]')
xfit = np.linspace(0, 300, 100)
yfit = slope*xfit + intercept
plt.plot(xfit, yfit, 'r-', label='Linear Fit')
plt.legend(loc='best')
plt.show()


# Python, Jupyter Notebooks & CoCalc

Python is the programming language of choice for many scientists: it offers a great deal of power to analyse and model scientific data with relatively little overhead in terms of learning, installation or development time. It is a language you can pick up in a weekend, and use for the rest of your life. Plus it's free! If you have never programmed before, don't worry, Python is very intuitive.

Jupyter Notebooks enable you to programme in Python (and other languages) from within a nice interactive document like this one. Traditionally, programmes were written in simple text files called scripts. Notebooks are much more dynamic and enable you to write a narrative around your programme or analysis, explaining its context and procedure. A Jupyter notebook file ends in .ipynb, which stands for IPython Notebook.

CoCalc enables you to do all this online without having to install anything! Using a good browser like Chrome, you should be able to programme online on your laptop, tablet or even phone. If you need extra power or control, you can install Python locally on a computer. For this and more info on programming see the links at the end of this notebook.

## Cells

A Jupyter notebook is made up of cells, each being either text or programming code.

You can run a cell by:

• selecting it with a mouse click
• then clicking the play symbol on the toolbar above
• or pressing the Ctrl + Enter keys

Typically a normal Python script would be just all the code lines written in a simple text file with a name ending in .py. If you prefer using scripts as opposed to notebooks, then that's fine too :)

## Help!

Learning to programme is all about knowing where to get help when debugging your code. Here are some tips:

• Press the 'H' key to see the handy shortcuts for Jupyter Notebooks

• To get help on a Python function simply type and run "?functionname" or "help(functionname)" (without the quotes)

• To zoom in/out see the View menu above

• When all else fails ask the internet, someone has probably already asked the same question :)

• Otherwise feel free ask the academics and their assistants during this session

In [2]:
?print


A comment is a message of plain text in your programme to remind you/others what's happening!

• In Python a comment is defined using the hash-symbol #

Don't underestimate the value of adding comments to your code; it might seem superfluous at the moment but when you look at your code in six months time you will be pleased that you added some comments. It is considered good programming practice to add comments to the code you write, so try to ensure that you do this as you work your way through these exercises.

Comments also enable you to switch off code for debugging. This is the process of identifying a problem in your code which is causing an error.

• Simply add a # to the beginning of the suspect line
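Here is a minimal sketch of both uses of comments (the variable names and the conversion chosen here are just illustrative):

```python
# Convert a temperature from Celsius to Kelvin
temp_c = 25                    # measured temperature in degrees C
temp_k = temp_c + 273.15       # add the offset to get Kelvin

# The line below has been "switched off" for debugging:
# temp_k = temp_c + 273        # old, less precise version

print(temp_k)
```

Python simply skips everything after a `#`, so the commented-out line has no effect on the result.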

### Data & Variables

As the name suggests, computers compute! And to do so they need data and variables that point to the data. Variables enable us to reference and manipulate the data stored within the computer's memory. A variable name is assigned to our data by the use of the assignment operator =

• x = 23

• the data is 23 and is an integer type

• x is the variable name

In Python we have the option of creating:

• simple data types: integers and decimal (floating) point numbers

• collections of numbers: lists and arrays

A list is defined using square brackets, e.g. my_list = [2, 3, 4, 5], while arrays are defined using Numerical Python (numpy) functions. Arrays are great for scientists, not just for storing data but also for performing calculations and analysis.

Here is a brief example on the key data types that we use:

In [1]:
# Make an integer
i = 3

# Make a float (decimal point)
k = 1.5

# Make a list
d = [10.2, 12.5, 16.74 ]

# Make an array (need Numpy module)
import numpy as np

a = np.array([2, 4.5, 6.25]) # 1D array

b = np.array([ [2, 4.5, 6.25], [ 7, 9.6, 11.63] ]) # 2D array

# Print the variables to screen
print(i)
print(k)
print(d)
print(a)
print(b)


You should notice that a list and a 1D array are very similar. The key difference is that arrays enable math operations while lists do not. So, when we want $x^2$ rather than just $x$ then our x-data should be an array. Here is a simple example:

In [2]:
x = np.array([2,4,6,8])
x2 = x**2 # x-squared
print(x2)


## Memory

Remember each time you create data and variables, they are stored in the short term memory (RAM) of the computer.

To see what you have in memory, run the %whos command:

In [3]:
# See what variables are in memory
%whos


Note how many bytes of memory are being used to store the data. A byte is 8 bits, and a bit is a 0 or 1. Computers store data (be it numbers, images, audio, video etc.) as binary numbers. Obviously, knowing that your computer has 8 gigabytes ($8\times 10^9$ bytes) of memory should tell you that memory is not a limitation .... for now. In future work, you may very well be working with gigabytes of data; indeed, experiments at CERN produce gigabytes of data per second!
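As a quick sketch of the byte counting above (the array here is just an example): a numpy array of 64-bit floats uses 8 bytes per value, which you can confirm with the array's nbytes attribute.

```python
import numpy as np

a = np.zeros(1000)   # 1000 double-precision (64-bit) floating point numbers
print(a.nbytes)      # 8 bytes per float, so 8000 bytes in total
```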

## Check-point 1

> 1. Get the attention of a staff member
> 2. Discuss if you have programmed before
> 3. Discuss your grasp of the basic concepts described above
> 4. For example, can you describe the difference between an integer and a floating point variable?
> 5. Finally, what version of Python are we using?

## Exercise 0

> Create a new Jupyter notebook file called **"LinearAnalysis.ipynb"** by clicking on the 'New' tab above.

Make sure the notebook is using Python 3 (Anaconda)

As you work through the exercises here you will be asked to add code to this assignment notebook.

You can use this new notebook to analyse and fit your experimental data each week in year 1 lab and beyond.

As with any programming language you will want to load specific modules from libraries. Modules/libraries are composed of pre-written code that provides the user with many useful functions. This allows you to perform complex tasks without having to do much programming! Python has a huge library of modules included with the distribution. To keep things simple, most of these variables and functions are not normally loaded at startup. Instead, you have to import them when you need them. Three key modules that we will need for today are:

• numpy for numerical Python which comes with many functions for numerical calculations

• matplotlib.pyplot which lets us create nice graphs of our data

• scipy for scientific Python which has many powerful functions for data analysis

In the cell below we will import numpy and shorten its name to np and similarly use plt for matplotlib.pyplot. This means that we can then call their functions using np.xxxx and plt.xxxx respectively.

In [2]:
# Load numerical Python
import numpy as np

import matplotlib.pyplot as plt

from scipy.stats import linregress
from scipy.optimize import curve_fit


### $\uparrow$ Run that Cell $\uparrow$

This cell above is a code cell of python commands, rather than these text cells which are written in markdown script. You can see from the dropdown menu on the top toolbar above how each cell in this notebook is defined. Remember, by clicking into each cell you can edit its contents and run it.

Now run this first code cell above by selecting it and using the Ctrl+Enter shortcut. The "play" button in the toolbar does the same thing. Each time you run a code cell you update what information is stored in memory.

## Exercise 1

> 1. Copy and paste the load modules code above into the first cell of your assignment notebook
>
> 2. As you work through the exercises, add a comment line at the beginning of each cell starting with a hash symbol '#'. In this case the comment could be: *# Exercise 1: Load useful Python modules*

# Visualising data in Python

Before we work with real data, we can start by plotting and visualising some basic maths functions.

First we need a list of x-values. The numpy function linspace is useful here as it makes an array of data from a starting value to an ending value. If you provide a third argument, it takes that as the total number of points to create. If you don't provide the argument, it gives 50 values. The linspace function will come in handy later in creating lists of x-values for our fit lines.
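A quick sketch of both forms of the linspace call described above:

```python
import numpy as np

x5 = np.linspace(0, 1, 5)       # 5 evenly spaced values from 0 to 1 inclusive
print(x5)                       # 0, 0.25, 0.5, 0.75, 1

x_default = np.linspace(0, 1)   # no third argument: 50 values by default
print(len(x_default))           # 50
```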

Once you have set up a list of x-values, the corresponding $y$ values can be defined as a function of $x$. The maths functions in Python are again provided by numpy, for example np.sqrt, np.log, np.exp, np.cos, np.sin etc.

Try this example here for a Gaussian function $y=\mathrm{e}^{-x^2}$:

In [7]:
x = np.linspace(-3, 3, 100) # generate 100 x values in the range [-3, 3]
y = np.exp(-x**2) # compute f(x) for all generated x values

plt.plot(x, y) # plot f(x)
plt.xlabel('[x]')
plt.ylabel('[y]')
plt.show()


Here you can see that the syntax for a power, $x^2$, is written as x**2.

## Check-point 2

> 1. Get the attention of a staff member
> 2. Show off by visualising a function of your choice
> 3. The numpy module has all the basic maths functions such as log, sqrt and the trig functions, and also constants such as pi

# Analysing Experimental Data with Python

Throughout this session, the basic analysis procedure will involve:

1. Load/define the data points

2. Fit the data with SciPy

• Show the fit coefficients

3. Visualise the data and the fit

• Plot the data with error-bars

• Define and plot the fit line

• Label the axes

• Save the graph as a .png file

In the following sections, we will provide examples of this procedure in the case of linear and nonlinear data analysis. Your exercises will then ask you to apply these methods to other data sets.

## Step 1: Load/Define data points

Our first example data set is thermal expansion data for aluminium. (Does this look familiar?)

For a quick look at a small data sample, it's usually convenient just to assign the values directly to our variable(s). Here, because we want a list of values to be assigned to each variable (as opposed to a single value) we place the data between square brackets [1,2,3] like so:

In [1]:
# Enter Data here, between [ ] brackets separated by comma ','

# Temperature [deg C]
temp = [ 0, 25, 50,   75,  100,  125,  150,  175,  200,  225,  250]
# Length [m]
length = [ 1.1155,  1.1164,  1.1170,   1.1172,  1.1180,   1.1190,   1.1199,  1.1210,  1.1213,  1.1223,  1.1223]
# Error on y [m]
error = 0.0005


### Import data via a text file

For most other purposes, raw data is typically recorded in text files and then loaded into the analysis programme. This is especially the case if the data has been recorded by a computer or there is a lot of it!

The module numpy has a handy function called loadtxt which loads the data and assigns variables to different columns using the 'unpack' flag.

In [4]:
# Load Data
filename = "Data/LengthTempData.txt"
temp, length = np.loadtxt(filename, unpack=True)


Check how the data is stored in the text file:

• by default the loadtxt function assumes the data is separated by a space, which is the case for "LengthTempData.txt"

• otherwise you must state the delimiter (see the function help using ?loadtxt)

• for example, if the delimiter is a comma, simply use: loadtxt(filename, unpack=True, delimiter=',')

• open the text file to check!
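Here is a self-contained sketch of the delimiter option (io.StringIO stands in for an open comma-separated file here, so the example does not refer to any real data file; with a real file you would pass its name instead):

```python
import numpy as np
from io import StringIO

# Three rows of comma-separated temperature/length data, as if read from a file
csv_text = StringIO("0,1.1155\n25,1.1164\n50,1.1170")

# unpack=True assigns each column to its own variable
temp, length = np.loadtxt(csv_text, unpack=True, delimiter=',')
print(temp)     # the first column
print(length)   # the second column
```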

Once you load the data you can easily check the values using print(variable_name_1, variable_name_2):

In [10]:
#prints all the numbers stored in each variable:
print (temp)
print (length)
print (error)

#prints just the first number in the temp, length and error list (the number in the square brackets is an index referenced from 0)
print(temp[0], length[0])


## Exercise 2

> You have carried out an investigation to calculate the gravitational acceleration, $g$, using a pendulum, ruler and stopwatch.
>
> You varied the length (L) of the pendulum, measured the period (T), and saved the results in "PendulumData.txt". The estimated reading errors are 2 mm and 5 ms for length and time respectively.
>
> You are going to write Python code to analyse the data.
>
> 1. Create a new cell in your own notebook (click on the '+' symbol above)
>
> 2. Open the data file and have a look at its formatting
>
> 3. Add code to the cell to import the data into your assignment notebook.

## Step 2. Analyse data for linear correlation $\rightarrow$ Fit data using Sci-Py

So far in your lectures and indeed in the Excel session on curve fitting, you have learnt to use a least squares method for fitting your experimental data. In Excel, this is achieved using the LINEST function. The key limitation with the least squares method is that it does not care about error-bars on your data points. It assumes they are all the same, or indeed not present.

In this Python session, we will look at using 2 methods:

• the **linregress** function for least squares fitting

• the **curve_fit** function for linear and nonlinear fitting using a weighted approach (chi-squared, $\chi^2$) method

### Least Squares method: linregress function

First we will apply the least squares method to the data. As with using any function the general format is:

• output = function(input)

which in this case will look something like this:

• slope, intercept, r_value, p_value, err = linregress(x,y)

• remember you can call the variable names anything you wish

• again, to see more details on the function, simply run ?linregress in a code cell

Once the function returns the fit parameters you can simply print them to the screen as shown below:

In [5]:
# Define x and y values
x = temp
y = length

# Fit x/y Data using a linear regression
slope, intercept, r_value, p_value, err = linregress(x,y)

# Print the fit coeffs
print('Slope, Error, Intercept, Correlation Coefficient:')
print(slope, err, intercept, r_value**2)

print('Linear thermal expansion coefficient = ', slope, '+/-', err, 'm/deg C')


So, from the calculation of the slope we can state that the linear thermal expansion coefficient for aluminium is $(29\pm 1)\times 10^{-6}$ m/$^{\circ}$C. Notice here that we are limited to 1 significant figure on the error due to the limited precision on our initial data.
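As an aside, rounding the error to 1 significant figure (and the value to match) can itself be done in Python. This is an illustrative sketch using made-up numbers, not part of the assignment code:

```python
from math import floor, log10

slope = 2.8614e-05   # example fitted slope [m/deg C]
err = 1.23e-06       # example error on the slope

# Decimal place of the error's first significant figure
decimals = -int(floor(log10(abs(err))))

err_1sf = round(err, decimals)     # error to 1 significant figure
slope_r = round(slope, decimals)   # value quoted to the same precision

print(slope_r, '+/-', err_1sf, 'm/deg C')   # e.g. 2.9e-05 +/- 1e-06 m/deg C
```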

Next it is your turn to perform a similar fit to the pendulum data.

## Exercise 3

> The simple pendulum model has $T = 2\pi\sqrt{\frac{L}{g}}$. In order to calculate a value of $g$ you need to linearise the data by analysing $T(\sqrt{L})$ or alternatively $T^2(L)$.
>
> 1. Define $x$ and $y$ such that the pendulum data can be fit by a linear function $y=mx+c$ with slope $m$ and intercept $c$
>
> 2. Fit the measured data using a linear regression for the slope, intercept, error on the slope and the correlation coefficient ($r^2$)
>
> 3. Calculate and print a value of $g =$ value $\pm$ error with units

Note: Error analysis lectures begin this week. If the propagation of the errors on $g$ is a problem, ask a member of staff.

## Check-point 3

> 1. After completing Exercise 3, get the attention of a member of staff
> 2. Check that the linearisation process is valid
> 3. Discuss the linearisation process: why do this? What's the alternative?

## Step 3: Visualise the results of our analysis & save the graphs

You will find that Python is very powerful at creating graphs to visualise data. Here we will use 2 basic commands:

• errorbar: to show the raw data with errorbars
• plot: to show the fit line

In order to show the fit line, we will create some new data called xfit and yfit. Once we have a good looking plot, we will save it as an image with good resolution.

In [6]:
# Plot Data (yerr is the measurement error on y, not the fit error 'err')
plt.errorbar(x, y, fmt='o', label='Data', yerr=error)
# Label Axes
plt.xlabel('Temp [$^{\circ}$C]')
plt.ylabel('Length [m] ')

# Make a list of 100 x-values to plot the fit
xfit = np.linspace(0, 300, 100)
yfit = slope*xfit + intercept

# Plot Fit results
plt.plot(xfit,yfit, 'r-', label='Linear Fit')

plt.legend(loc='best')
# Save figure as png file with 300dpi
plt.savefig('TempLength_plot.png', dpi=300, format='png', bbox_inches='tight')
# Show the Figure
plt.show()



As with any good figure, notice how we avoid covering the data with the legend.

## Exercise 4

> 1. Visualise (plot) the linearised pendulum data with errorbars
> 2. Create a list of $x,y$ fit values using the slope and intercept
> 3. Plot the least squares analysis results as a straight line
> 4. Add a suitable legend and axis labels
> 5. Save the plot as a png image with a resolution of 300 dpi

# Non Linear Curve Fitting

Most of the analysis you will perform in year 1 physics lab will be of a linear nature. However, in later studies, you will see many nonlinear datasets and functions which will require analysis.

Our science toolbox called SciPy has a general curve_fit function which can analyse both linear and nonlinear data.

The key difference is that you define a fit function and in most nonlinear cases you need to provide an initial rough guess of the fit parameters.

In this example we will use the laser profile data which you may already be familiar with. First, let us load and visualise the data:

In [3]:
filename = "Data/LaserProfileData.txt"
[x,y,dy] = np.loadtxt(filename, delimiter = ',',  unpack=True)
plt.errorbar(x, y, fmt='o',label='Data', yerr=dy)
plt.xlabel('[mm]')
plt.ylabel('Intensity [W/mm$^2$]')


Our first guess is that the data exhibits a Gaussian curve, so we will define a suitable fit function using: $y= A\,\mathrm{e}^{-\frac{(x-x_0)^2}{2\sigma^2}}$

Our function in Python will take 4 inputs and will then return a value of $y$. These inputs are $x$, $x_0$, $\sigma$ and the amplitude $A$. As in the Excel work, we can also define a function to calculate the full width at half maximum (FWHM) of the fitted Gaussian: $\mathrm{FWHM} = 2\sigma\sqrt{2\ln(2)}$. This second function will take 1 input ($\sigma$) and return the diameter of the laser spot.
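The FWHM formula follows from setting $y = A/2$ in the Gaussian:

$$\frac{A}{2} = A\,\mathrm{e}^{-\frac{(x-x_0)^2}{2\sigma^2}} \;\Rightarrow\; (x-x_0)^2 = 2\sigma^2\ln 2 \;\Rightarrow\; x = x_0 \pm \sigma\sqrt{2\ln 2},$$

so the full width between the two half-maximum points is $2\sigma\sqrt{2\ln(2)}$.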

## Functions in Python

Functions are defined using def followed by a name and a list of inputs. When your function is then called in the programme, it calculates and then returns a result.

Here are the functions we need:

In [4]:
# Basic Gaussian fit function
# 'x' is the list of x-values, 'a' is the peak value located at 'b' and 'c' is the standard deviation
def gaussfit(x, a, b, c):
    return a * np.exp(-(x - b)**2 / (2*c**2))

# FWHM for a given value of the standard deviation (sd)
def FWHM(sd):
    return 2*np.sqrt(2*np.log(2))*sd


Now that these functions are defined in memory, I can create any Gaussian I want:

In [6]:
x1 = np.linspace(0,12, 100)

y1 = gaussfit(x1, 1, 6, 2)

plt.plot(x1, y1)

print("FWHM is:", FWHM(2))


### Initial guess parameters

Once we have a function which we wish to fit to our data, we then provide this to the curve_fit function. In most cases it is also useful to provide rough guesses of the fit parameters.

A key difference compared to linear methods is that nonlinear fits tend to need the user's input in defining an initial guess of the fit parameters. In this case we will define initial values of (a, b, c) for the fit function gaussfit(x, a, b, c). For the curve_fit function these guess values are stored in a list called $p0$. Without an initial guess, the function sets each parameter to a default value of 1.

Looking at our laser profile data, we can already see that the data peaks near 1 W/mm$^2$, it is centred near x = 3 mm, and its half-width at $\mathrm{e}^{-1/2}\approx 61\%$ of the peak gives $\sigma\approx 0.4$ mm. Therefore the guess variable is $p0=[1,3,0.4]$.

### Including the error-bars in the fit

A nice feature of curve_fit in Python is that it includes the effects of the error-bars in calculating the best fit. This is only important if the error-bars are all different sizes, i.e. non-uniform. This process is typically referred to as a Chi-squared analysis, and is a topic you will see in your error-analysis lectures.
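As an illustrative sketch of what "including the error-bars" means (the numbers here are made up, not the laser data): the quantity being minimised is the sum of squared residuals, each divided by its error-bar, so points with small error-bars count for more.

```python
import numpy as np

y_data = np.array([1.0, 2.1, 2.9])   # measured values
y_fit = np.array([1.1, 2.0, 3.0])    # values predicted by some fit
dy = np.array([0.1, 0.2, 0.1])       # error-bar on each point

# Weighted (chi-squared) sum: each residual is scaled by its error-bar
chi2 = np.sum(((y_data - y_fit) / dy) ** 2)
print(chi2)   # roughly 2.25 for these made-up numbers
```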

Once we have defined a:

• fit function: gaussfit

• initial guess: p0

• x and y data: x, y

• y error-bars: dy

we can call the curve_fit function: **popt, pcov = curve_fit(gaussfit, x, y, p0, sigma=dy)**.

This then returns the three optimised fit parameters as $popt$ and the errors in the form of the covariance matrix $pcov$:

In [18]:
p0 = [1, 3, 0.4] # Initial guess for a,b,c

# Fit the x-y data and error dy with curve_fit function
popt, pcov = curve_fit(gaussfit, x, y, p0, sigma=dy)

# To plot the fit we need to create the fit data for x/y
xfit = np.linspace(2.4,3.5, 100)
yfit = gaussfit(xfit, popt[0], popt[1], popt[2])

# Plot the data with errorbars and the fit as a line
plt.errorbar(x, y, fmt='o',label='Data', yerr=dy)
# Now plot the fit
plt.plot(xfit, yfit, label='GaussFit')

# Labels
plt.legend(loc='best')
plt.xlabel('[mm]')
plt.ylabel('Intensity [W/mm$^2$]')

# Print the optimised fit parameters
print("Peak:",popt[0],"W/mm2")
print("Centred at:",popt[1],"mm")
print( "Standard Deviation:",popt[2],"mm")


Finally, we can calculate the width of the laser spot. Here, we take the fitted value of $\sigma$ and convert it to the full-width at half-maximum (FWHM):

In [14]:
fwhm= FWHM(popt[2])
error = np.sqrt(pcov[2, 2]) / popt[2] * fwhm
print("FWHM  =  %.2f +/- %.2f mm" % (fwhm, error)) # print values to 2 decimal places floating point


## Exercise 5

> Instead of a Gaussian, try to fit a Lorentzian (Cauchy) distribution to the data.
> 1. Make a new Jupyter Notebook called "NonlinearAnalysis.ipynb"
> 2. Copy the example code above
> 3. Define a Lorentzian function (see Analysis 1)
> 4. Perform the analysis using both the Lorentzian and the Gaussian functions
> 5. Comment on which fit has the least error in terms of the calculated FWHM
> 6. Visualise the results
> 7. Save the figure for later

## Check-point 5

> 1. After completing Exercise 5, get the attention of a member of staff
> 2. Check that the analysis process is valid
> 3. Discuss and compare the Gaussian/Lorentzian results. Which is best here and why?

## Bonus Exercise

> Programming online is easy and offers great flexibility. As a backup, in times of no internet, it is useful to know how to programme locally on the University desktop or your own laptop.
> 1. Download [Anaconda Python](https://www.anaconda.com/download/) for Windows, Mac or Linux
> 2. For a University desktop, check if it is already installed. If not, see the software centre.
> 3. Besides Jupyter notebooks, try programming a simple Python script. Anaconda provides a useful integrated development environment (IDE) called Spyder.