Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupport News AboutSign UpSign In
| Download
Views: 90

Curve fitting, including linear regression

In this module we look at how to find the curve of best-fit for a given set of (x,y)(x,y) data values. From a modeling point of view this is a critically important need, with a wide variety of applications. Linear regression is a particularly important "special case" of curve fitting, in which we seek the best straight line to fit our data.


In all the examples shown here, we will use a specific dataset from an input file. In this dataset the xx-variable is the calendar year, and the yy-variable is the global mean temperature on the earth's surface. The following code segment reads the data and stores it in the variables: xdata, ydata.
After that it plots the data to see what it looks like.

f = open('./globtemp.csv') # This file has comma separated data, with a plain text header line. xdata = [] # This will assemble my x-data in the form of a list ydata = [] # This will assemble my y-data in the same form headerline = f.readline() # Read & discard the header line in the input file line = f.readline() # This is the first line of actual data while (line !=''): xy = line.split(',') # The comma splits each line into two items of data xdata.append(float(xy[0])) # Append to xdata ydata.append(float(xy[1])) # Append to ydata line = f.readline() # Read the next line # Next, plot the data and see what it looks like: point(zip(xdata, ydata), axes_labels=['year', 'temp $^\circ$C'], pointsize=30)
Curve fitting methods


There are at least 3 different options that Sage offers for doing linear regression. Keep in mind that there is only one unique line of best-fit for any given dataset. This is because the line of best-fit minimizes the sum of the squares of the errors. So, we expect all the implementations to give the same answer!


Method 1:   Using stats.linregress

As the name suggests, this method only does linear regression -- as opposed to fitting other (nonlinear) models to the data.
The code segment below shows how to use it to fit the global temperature dataset. In addition, it shows how to plot the results.

# The library "stats" must first be imported from scipy: from scipy import stats # Give xdata, ydata as input to stats.linregress. # Store the results in some variable (e.g., fit_output). fit_output = stats.linregress(xdata, ydata) #slope, intercept, r_value, p_value, slope_std_error = fit_output print fit_output # The rest of this code just plots and shows the results from scipy import polyval ymod = polyval([fit_output[0],fit_output[1]],xdata) # Compute y-values for the linear model praw = point(zip(xdata, ydata), pointsize=30) # Plot the original dataset # Plot the fitted model pfit = points(zip(xdata,ymod), axes_labels=['year', 'temp $^\circ$C'], pointsize=8, color='darkgreen') show(praw+pfit) print "The mathematical form of the model is: y =", fit_output[0], "* x + ", fit_output[1]
LinregressResult(slope=0.0050534351145038172, intercept=5.1333435114503807, rvalue=0.81149837783481804, pvalue=6.8562147164742632e-32, stderr=0.00032039129165884194)
The mathematical form of the model is: y = 0.0050534351145 * x + 5.13334351145
Method 2:   Using polyfit from scipy


The polyfit function can fit a polynomial of any specified order. Thus it can be used for linear regression, as well as for higher order regression. It returns the best least-squares fit to the dataset.

The code segment below illustrates how to use it to fit a linear, and a 3rd order polynomial to the global temperature dataset. In addition, it shows plots of the results.

# The function "polyfit" must first be imported from scipy: from scipy import polyfit # Let's first do a linear fit: fit_out_linear = polyfit(xdata, ydata, 1) # Next, we'll try a cubic fit for the same data: fit_out_cubic = polyfit(xdata, ydata, 3) # The rest of this code just plots and shows the results. # NOTE: polyfit seems to give output in format for direct input to polyval from scipy import polyval ymod_linear = polyval(fit_out_linear, xdata) # Compute y-values for the linear model ymod_cubic = polyval(fit_out_cubic, xdata) # Compute y-values for cubic model praw = point(zip(xdata, ydata), pointsize=30) # Plot the original dataset # Plot the fitted models pfit_linear = points(zip(xdata,ymod_linear), pointsize=8, color='darkgreen') pfit_cubic = points(zip(xdata,ymod_cubic), axes_labels=['year', 'temp $^\circ$C'], pointsize=8, color='darkred') show(praw+pfit_linear+pfit_cubic) print "The linear model is: y =", fit_out_linear[0], "* x + ", fit_out_linear[1] print "The cubic model is: y =", fit_out_cubic[0], "* x^3 + ", fit_out_cubic[1], "* x^2 + ", fit_out_cubic[2], "* x + ", fit_out_cubic[3]
The linear model is: y = 0.0050534351145 * x + 5.13334351145 The cubic model is: y = 4.05660084499e-07 * x^3 + -0.00235611960143 * x^2 + 4.56551474091 * x + -2936.55073889
Method 3:   Using curve_fit from scipy.optimize


The curve_fit function let's the user define any type of model function, including linear, polynomial, trigonometric, exponential, etc. It returns the best least-squares fit of the model function to the given dataset. This method requires a bit more coding, since it allows for more general models.

The code segment below illustrates how to use it to fit a linear, and a trigonometric function to the global temperature dataset. Note that the user must define a prototype function of each kind before curve_fit can be used. The inputs to curve_fit are the name of the function prototype, and the xx, yy data sets.

# User-defined model of linear function: def f_linear(x, m, b): return m*x + b # User-defined model of trigonometric function: def f_exp(x, a, b, c, d): return a + b*x + c*sin(d*x) # Import the "curve_fit" function from scipy.optimize: from scipy.optimize import curve_fit # Find the best-fit linear model; save results in temporary variables: coefs, pcov = curve_fit(f_linear, xdata, ydata) # Compute y-values for the linear model: ymodel = [] for i in range(len(xdata)): ymodel.append(f_linear(xdata[i], coefs[0], coefs[1])) # Plot original data and linear fit: praw = point(zip(xdata, ydata), pointsize=30) pfit_linear = point(zip(xdata, ymodel), pointsize=8, color='darkgreen') # Find the best-fit trigonometric model; save results in temporary variables: coefs, pcov = curve_fit(f_exp, xdata, ydata) # Compute y-values for the model: ymodel = [] for i in range(len(xdata)): ymodel.append(f_exp(xdata[i], coefs[0], coefs[1], coefs[2], coefs[3])) # Plot trigonometric fit: pfit_exp = point(zip(xdata, ymodel), axes_labels=['year', 'temp $^\circ$C'], pointsize=8, color='darkred') show(praw+pfit_linear+pfit_exp)
Exercise


The file Thailand_tourism_2009_2016.csv contains data from the Department of Tourism on total monthly foreign tourist arrivals in Thailand between the years 2009 and 2012. The data is in the form of a comma-separated spreadsheet with 3 coulmns: number of tourists, year, month. The units used for the number of tourists is millions, and the months are numbered consecutively starting from 1 to 96. The first line in the file is a header containing coulmn headings. The actual data start from the 2nd line.

The goal of this exercise is to read the necessary data from the input file and fit two different models to it: (1) linear, and (2) any one nonlinear model.

Plot your results in the form of graphs, and also give each model's mathematical form.