Learning objectives:
Perform a least squares correlation analysis
Linearise non-linear data
Perform a non-linear correlation analysis
Use Python programming to:
load data
analyse data
visualise data and analysis results
A key skill for any scientific problem solver is the analysis and visualisation of data. During semester 1 of the Pro Skills module you will already have had experience with running an experiment such as measuring gravitational acceleration. The data from such investigations typically requires some level of analysis to determine a key measurement and the respective errors. So far, you would have performed this analysis using a spreadsheet program such as Excel.
An alternative tool for performing your analysis is a programming language such as Python. While classes in Python programming currently begin in year 2 (PHY235), here we will introduce its basic practical usage in the context of data analysis. This option does not require any prior knowledge of programming. Depending on your preference, the Python option may be faster, more efficient and produce nicer looking figures than Excel.
If you haven't programmed before, you will be relieved to know it's easy yet very powerful.
Physicists and other scientists typically perform a lot of data analysis, visualisation and simulations. Programming is the only way we can do this in an efficient and reproducible manner. The ability to directly control a computer's brain (CPU) gives you the power to perform over 100 billion calculations per second!
The easy part is down to realising that a programme is a simple list of instructions and the CPU will typically carry out the commands from top to bottom.
A programming language acts as a facilitator of your commands, translating English into the zeros and ones understandable by a CPU.
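To make this concrete, here is a minimal sketch (the variable names are just illustrative): four instructions that the computer carries out in order, from top to bottom.

```python
# Each line is one instruction; they run in order, top to bottom
a = 2           # store 2 in memory under the name 'a'
b = 3           # store 3 under the name 'b'
total = a + b   # add them and store the result
print(total)    # display the result: 5
```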
Our choice of language is Python: it's easy to write and very powerful.
The 13 lines of code below are all you need to perform a full least squares correlation analysis and to visualise the results ... all in a fraction of a second!
The key learning objective of this session is essentially understanding and using these 13 lines of code.
After today, you should be able to simply repeat this programme for your weekly lab analysis in 1st year physics.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
filename = "Data/LengthTempData.txt"
[temp, length, error] = np.loadtxt(filename, unpack=True)
slope, intercept, r_value, p_value, err = linregress(temp, length)
plt.errorbar(temp, length, fmt='o', label='Data', yerr=error)
plt.xlabel('Temp [$^{\circ}$C]')
plt.ylabel('Length [m]')
xfit = np.linspace(0, 300, 100)
yfit = slope*xfit + intercept
plt.plot(xfit, yfit, 'r-', label='Linear Fit')
plt.legend(loc='best')
Python is the programming language of choice for many scientists: it offers a great deal of power to analyse and model scientific data with relatively little overhead in terms of learning, installation or development time. It is a language you can pick up in a weekend, and use for the rest of your life. Plus it's free! If you have never programmed before, don't worry, Python is very intuitive.
Jupyter Notebooks enable you to programme in Python (and other languages) from within a nice interactive document like this one. Traditionally, programmes were written in simple text files called scripts. Notebooks are much more dynamic and enable you to write a narrative around your programme or analysis explaining its context and procedure. A Jupyter notebook file ends in .ipynb, which stands for Interactive-Python-NoteBook.
CoCalc enables you to do all this online without having to install anything! Using a good browser like Chrome, you should be able to programme online on your laptop, tablet or even phone. If you need extra power or control, you can install Python locally on a computer. For this and more info on programming see the links at the end of this notebook.
A Jupyter notebook is comprised of cells, each being either text or programming code.
You can run a cell by selecting it and pressing Ctrl+Enter, or by clicking the "play" button in the toolbar.
Typically a normal Python script would be just all the code lines written in a simple text file with a name ending in .py. If you prefer using scripts as opposed to notebooks, then that's fine too :)
Learning to programme is all about knowing where to get help when debugging your code. Here are some tips:
Press the 'H' key to see the handy shortcuts for Jupyter Notebooks
To get help on a Python function simply type and run "?functionname" or "help(functionname)" (without the quotes)
To zoom in/out see the View menu above
When all else fails ask the internet, someone has probably already asked the same question :)
Otherwise feel free ask the academics and their assistants during this session
A comment is a message of plain text in your programme to remind you/others what's happening!
Don't underestimate the value of adding comments to your code; it might seem superfluous at the moment but when you look at your code in six months time you will be pleased that you added some comments. It is considered good programming practice to add comments to the code you write, so try to ensure that you do this as you work your way through these exercises.
Comments also enable you to switch off code for debugging. This is the process of identifying a problem in your code which is causing an error.
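As a minimal sketch (the variable names here are just illustrative), note how the last print line has been "switched off" with a # so it no longer runs:

```python
# Calculate the area of a circle of radius 2 m
radius = 2.0
area = 3.14159 * radius**2   # area = pi * r squared
print(area)
# print("debug: radius =", radius)   # commented out: this line is switched off
```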
As the name suggests, computers compute! And to do so they need data and variables that point to the data. Variables enable us to reference and manipulate the data stored within the computer's memory. A variable name is assigned to our data by use of the assignment operator =
x = 23
the data is 23 and is an integer type
x is the variable name
In Python we have the option of creating:
simple data types:
advanced data types:
A list is defined using square brackets, i.e. my_list = [2, 3, 4, 5], while arrays are defined using Numerical Python (numpy) functions. Arrays are great for scientists, not just for storing data but for performing calculations and analysis.
Here is a brief example on the key data types that we use:
# Make an integer
i = 3
# Make a float (decimal point)
k = 1.5
# Make a list
d = [10.2, 12.5, 16.74]
# Make an array (needs the numpy module)
import numpy as np
a = np.array([2, 4.5, 6.25])                      # 1D array
b = np.array([[2, 4.5, 6.25], [7, 9.6, 11.63]])   # 2D array
# Print the variables to screen
print(i)
print(k)
print(d)
print(a)
print(b)
You should notice that a list and a 1D array are very similar. The key difference is that arrays enable maths operations while lists do not. So, when we want $x^2$ rather than just $x$, our x-data should be an array. Here is a simple example:
x = np.array([2, 4, 6, 8])
x2 = x**2   # x-squared
print(x2)
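To see the difference directly (a small illustrative sketch), compare what * 2 does to a list versus an array:

```python
import numpy as np

my_list = [1, 2, 3]
my_array = np.array([1, 2, 3])

print(my_list * 2)    # repeats the list: [1, 2, 3, 1, 2, 3]
print(my_array * 2)   # element-wise maths: [2 4 6]
```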
Remember, each time you create data and variables, they are stored in the short-term memory (RAM) of the computer.
To see what you have in memory, run the %whos command:
# See what variables are in memory
%whos
Note how many bytes of memory are being used to store the data. A byte is 8 bits, and a bit is a 0 or 1. Computers store data (be it numbers, images, audio, video etc.) as binary numbers. Obviously, knowing that your computer has 8 gigabytes ($8\times 10^9$ bytes) of memory should tell you that memory is not a limitation ... for now. In future work, you may very well be working with gigabytes of data; indeed, experiments at CERN produce gigabytes of data per second!
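You can check these sizes yourself; as a quick sketch, numpy arrays report their memory usage via the itemsize and nbytes attributes:

```python
import numpy as np

# A standard numpy float is 64 bits = 8 bytes,
# so 1000 of them occupy 8000 bytes
a = np.zeros(1000)    # array of 1000 floats (float64 by default)
print(a.itemsize)     # bytes per element
print(a.nbytes)       # total bytes used by the data
```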
Make sure the notebook is using Python 3 (Anaconda)
As you work through the exercises here you will be asked to add code to this assignment notebook.
You can use this new notebook to analyse and fit your experimental data each week in year 1 lab and beyond.
As with any programming language you will want to load specific modules from libraries. Modules/libraries are composed of pre-written code that provides the user with many useful functions. This allows you to perform complex tasks without having to do much programming! Python has a huge library of modules included with the distribution. To keep things simple, most of these variables and functions are not normally loaded at startup. Instead, you have to import them when you need them. Three key modules that we will need for today are:
numpy for numerical Python which comes with many functions for numerical calculations
matplotlib.pyplot which lets us create nice graphs of our data
scipy for scientific Python which has many powerful functions for data analysis
In the cell below we will import numpy and shorten its name to np, and similarly use plt for matplotlib.pyplot. This means that we can then call their functions using np.xxxx and plt.xxxx respectively.
# Load numerical Python
import numpy as np
# Load plotting functions
import matplotlib.pyplot as plt
# Load fitting functions
from scipy.stats import linregress
from scipy.optimize import curve_fit
The cell above is a code cell of Python commands, rather than these text cells which are written in Markdown. You can see from the dropdown menu on the top toolbar above how each cell in this notebook is defined. Remember, by clicking into each cell you can edit its contents and run it.
Now run this first code cell above by selecting it and using the Ctrl+Enter shortcut. The "play" button in the toolbar does the same thing. Each time you run a code cell you update what information is stored in memory.
Before we work with real data, we can start by plotting and visualising some basic maths functions.
First we need a list of x-values. The numpy function linspace is useful here as it makes an array of data from a starting value to an ending value. If you provide a third argument, it takes that as the total number of points to create. If you don't provide the argument, it gives 50 values. The linspace function will come in handy later in creating lists of x-values for our fit lines.
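For example (a quick illustrative check), with and without the third argument:

```python
import numpy as np

x_default = np.linspace(0, 10)       # no third argument: 50 values by default
x_fine = np.linspace(0, 10, 101)     # 101 evenly spaced values from 0 to 10
print(len(x_default), len(x_fine))
print(x_fine[0], x_fine[-1])         # both endpoints are included
```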
Once you have set up a list of x-values, the corresponding $y$ values can be defined as a function of $x$. The maths functions in Python are again provided by numpy, for example np.sqrt, np.log, np.exp, np.cos, np.sin etc.
Try this example here for a Gaussian function $y=\mathrm{e}^{-x^2}$:
x = np.linspace(-3, 3, 100)   # generate 100 x values in the range [-3, 3]
y = np.exp(-x**2)             # compute f(x) for all generated x values
plt.plot(x, y)                # plot f(x)
plt.xlabel('[x]')
plt.ylabel('[y]')
plt.show()
Here you can see the syntax for power, $x^2$, is written as x**2.
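A couple of quick examples of the ** operator:

```python
print(3**2)     # 3 squared is 9
print(2**0.5)   # fractional powers work too: the square root of 2
```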
Throughout this session, the basic analysis procedure will involve:
Load/define data points
Fit data with Sci-Py
Visualise data and fit
Plot data with error-bars
Define and plot fit
Label axis
Add legend
Save graph as .png file
In the following sections, we will provide examples of this procedure in the case of linear and nonlinear data analysis. Your exercises will then ask you to apply these methods to other data sets.
Our first example data set is thermal expansion data for aluminium. (Does this look familiar?)
For a quick look at a small data sample, it's usually convenient just to assign the values directly to our variable(s). Here, because we want a list of values to be assigned to each variable (as opposed to a single value) we place the data between square brackets [1,2,3] like so:
# Enter data here, between [ ] brackets, separated by commas ','
# Temperature [deg C]
temp = [0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250]
# Length [m]
length = [1.1155, 1.1164, 1.1170, 1.1172, 1.1180, 1.1190, 1.1199, 1.1210, 1.1213, 1.1223, 1.1223]
# Error on y [m]
error = 0.0005
For most other purposes, raw data is typically recorded in text files and then loaded into the analysis programme. This is especially the case if the data has been recorded by a computer or there is a lot of it!
The module numpy has a handy function called loadtxt which loads the data and assigns variables to different columns using the 'unpack' flag.
# Load data
filename = "Data/LengthTempData.txt"
[temp, length, error] = np.loadtxt(filename, unpack=True)
Check how the data is stored in the text file:
by default the loadtxt function assumes the data is separated by a space
otherwise you must state the delimiter
see the function help file using ?loadtxt
for example if it is a comma, simply use: loadtxt(filename, unpack=True, delimiter=',' )
open the text file to check!
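As a quick illustration (using a made-up file, example_data.csv, with values echoing the thermal expansion data), you could write a small comma-separated file and load it back, stating the delimiter:

```python
import numpy as np

# Write a small comma-separated file (hypothetical example data)
with open("example_data.csv", "w") as f:
    f.write("0,1.1155\n25,1.1164\n50,1.1170\n")

# Load it back: the delimiter must be stated because it is not a space
temp, length = np.loadtxt("example_data.csv", unpack=True, delimiter=',')
print(temp)
print(length)
```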
Once you load the data you can easily check the values using print(variable_name_1, variable_name_2):
# Print all the numbers stored in each variable:
print(temp)
print(length)
print(error)
# Print just the first number in the temp and length lists
# (the number in the square brackets is an index, referenced from 0)
print(temp[0], length[0])
So far in your lectures and indeed in the Excel session on curve fitting, you have learnt to use a least squares method for fitting your experimental data. In Excel, this is achieved using the LINEST function. The key limitation with the least squares method is that it does not care about error-bars on your data points. It assumes they are all the same, or indeed not present.
In this Python session, we will look at using 2 methods:
the **linregress** function for least squares fitting
the **curve_fit** function for linear and nonlinear fitting using a weighted (chi-squared, $\chi^2$) method
First we will apply the least squares method to the data. As with any function, the general format is output_variables = function(inputs), which in this case will look something like this:
slope, intercept, r_value, p_value, err = linregress(x,y)
remember, you can give the returned variables any names you wish
again, for more details on the function, simply run ?linregress in a code cell
Once the function returns the fit parameters you can simply print them to the screen as shown below:
# Define x and y values
x = temp
y = length
# Fit x/y data using a linear regression
slope, intercept, r_value, p_value, err = linregress(x, y)
# Print the fit coefficients
print('Slope, Error, Intercept, Correlation Coefficient:')
print(slope, err, intercept, r_value**2)
print('Linear thermal expansion coefficient =', slope, '+/-', err, 'm/deg C')
So, from the calculation of the slope we can state that the linear thermal expansion coefficient for aluminium is $(29\pm 1)\times 10^{-6}$ m/$^{\circ}$C. Notice here that we are limited to 1 significant figure on the error due to the limited precision on our initial data.
Next it is your turn to perform a similar fit to the pendulum data.
Note: Error analysis lectures begin this week. If the propagation of the errors on $g$ is a problem, ask a member of staff.
You will find that Python is very powerful at creating graphs to visualise data. Here we will use two basic commands: plt.errorbar to plot the data points with error-bars, and plt.plot to draw the fit line.
In order to show the fit line, we will create some new data called xfit and yfit. Once we have a good looking plot, we will save it as an image with good resolution.
# Plot data (yerr takes the measurement errors, not the fit error 'err')
plt.errorbar(x, y, fmt='o', label='Data', yerr=error)
# Label axes
plt.xlabel('Temp [$^{\circ}$C]')
plt.ylabel('Length [m]')
# Make a list of 100 x-values to plot the fit
xfit = np.linspace(0, 300, 100)
yfit = slope*xfit + intercept
# Plot fit results
plt.plot(xfit, yfit, 'r-', label='Linear Fit')
# Add legend
plt.legend(loc='best')
# Save figure as a png file with 300 dpi
plt.savefig('TempLength_plot.png', dpi=300, format='png', bbox_inches='tight')
# Show the figure
plt.show()
Notice how, by using loc='best' for the legend, we avoid covering the data points with it.
Most of the analysis you will perform in year 1 physics lab will be of a linear nature. However, in later studies, you will see many nonlinear datasets and functions which will require analysis.
Our science toolbox called SciPy has a general curve_fit function which can analyse both linear and nonlinear data.
The key difference is that you define a fit function and in most nonlinear cases you need to provide an initial rough guess of the fit parameters.
In this example we will use the laser profile data which you may already be familiar with. First, let us load and visualise the data:
filename = "Data/LaserProfileData.txt"
[x, y, dy] = np.loadtxt(filename, delimiter=',', unpack=True)
plt.errorbar(x, y, fmt='o', label='Data', yerr=dy)
plt.xlabel('[mm]')
plt.ylabel('Intensity [W/mm$^2$]')
Our first guess is that the data exhibits a Gaussian curve, so we will define a suitable fit function using: $y= A\,\mathrm{e}^{-\frac{(x-x_0)^2}{2\sigma^2}}$
Our function in Python will take 4 inputs and will then return a value of $y$. These inputs are $x$, $x_0$, $\sigma$ and the amplitude $A$. As in the Excel work, we can also define a function to calculate the full width at half maximum (FWHM) of the fitted Gaussian: $FWHM = 2\sigma\sqrt{2\ln(2)}$. This simple function will take 1 input ($\sigma$) and return the diameter of the laser spot.
Functions are defined using def followed by a name and a list of inputs. When your function is then called in the programme, it calculates and then returns a result.
Here are the functions we need:
# Basic Gaussian fit function
# 'x' is the list of x-values, 'a' is the peak value located at 'b',
# and 'c' is the value of the standard deviation
def gaussfit(x, a, b, c):
    return a * np.exp(-(x - b)**2 / (2*c**2))

# FWHM for a specific value of standard deviation (sd)
def FWHM(sd):
    return 2*np.sqrt(2*np.log(2))*sd
Now that these functions are defined in memory, we can create any Gaussian we want:
x1 = np.linspace(0, 12, 100)
y1 = gaussfit(x1, 1, 6, 2)
plt.plot(x1, y1)
print("FWHM is:", FWHM(2))
Once we have a function which we wish to fit to our data, we then provide this to the curve_fit function. In most cases it is also useful to provide rough guesses of the fit parameters.
A key difference compared to linear methods is that nonlinear fits tend to need the user's input in defining an initial guess of the fit parameters. In this case we will define initial values of (a, b, c) for the fit function gaussfit(x, a, b, c). For the curve_fit function these guess values are stored in a list called p0. Without an initial guess, curve_fit sets all the starting values to 1.
Looking at our laser profile data, we can already see that the data peaks near 1 W/mm$^2$, it is centred near x = 3 mm, and its half-width where the curve falls to $\mathrm{e}^{-1/2}\approx 61\%$ of the peak gives $\sigma\approx 0.4$ mm. Therefore the guess list is p0 = [1, 3, 0.4].
A nice feature of curve_fit in Python is that it includes the effects of the error-bars in calculating the best fit. This is only important if the error-bars are all different sizes, i.e. non-uniform. This process is typically referred to as a Chi-squared analysis, and is a topic you will see in your error-analysis lectures.
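To see the weighting in action, here is a small sketch with made-up straight-line data in which one point has a much larger error-bar; passing sigma tells curve_fit to trust that point less (the data values and the line function are purely illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, m, c):
    return m*x + c

# Made-up data lying near y = 2x + 1; the last point has a 10x larger error-bar
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.0, 9.1])
dy = np.array([0.1, 0.1, 0.1, 0.1, 1.0])

# sigma=dy weights each point by 1/dy**2 in the chi-squared sum
popt, pcov = curve_fit(line, x, y, sigma=dy, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))   # 1-sigma errors on m and c
print("slope =", popt[0], "+/-", perr[0])
print("intercept =", popt[1], "+/-", perr[1])
```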
Once we have defined a:
fit function: gaussfit
initial guess: p0
x and y data: x, y
y error-bars: dy
we can call the curve_fit function: **popt, pcov = curve_fit(gaussfit, x, y, p0, sigma=dy)**.
This then returns the three optimised fit parameters as popt and the errors in the form of the covariance matrix pcov:
p0 = [1, 3, 0.2]   # Initial guess for a, b, c

# Fit the x-y data and error dy with the curve_fit function
popt, pcov = curve_fit(gaussfit, x, y, p0, sigma=dy)

# To plot the fit we need to create the fit data for x/y
xfit = np.linspace(2.4, 3.5, 100)
yfit = gaussfit(xfit, popt[0], popt[1], popt[2])

# Plot the data with error-bars and the fit as a line
plt.errorbar(x, y, fmt='o', label='Data', yerr=dy)
# Now plot the fit
plt.plot(xfit, yfit, label='GaussFit')
# Labels
plt.legend(loc='best')
plt.xlabel('[mm]')
plt.ylabel('Intensity [W/mm$^2$]')

# Print the optimised fit parameters
print("Peak:", popt[0], "W/mm2")
print("Centred at:", popt[1], "mm")
print("Standard deviation:", popt[2], "mm")
Finally, we can calculate the width of the laser spot. Here, we take the fitted value of $\sigma$ and convert it to the full-width at half-maximum (FWHM):
fwhm = FWHM(popt[2])
error = np.sqrt(pcov[2, 2]) / popt[2] * fwhm   # scale the relative error on sigma up to the FWHM
print("FWHM = %.2f +/- %.2f mm" % (fwhm, error))   # print values to 2 decimal places