Curve fitting, including linear regression
In this module we look at how to find the curve of best-fit for a given set of data values. From a modeling point of view this is a critically important need, with a wide variety of applications. Linear regression is a particularly important "special case" of curve fitting, in which we seek the best straight line to fit our data.
In all the examples shown here, we will use a specific dataset
from an input
file. In this dataset the x-variable is the calendar year,
and the y-variable is the global mean temperature on the
earth's surface. The following code segment reads the data
and stores it in the variables: xdata, ydata.
After that it plots the data to see what it looks like.
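As a rough sketch of the reading step, assume the input file is a plain text file with two comma-separated columns (year, temperature) and no header. The file name temperature_sample.csv and the values in it are placeholders that the code writes for itself so that it runs on its own; they are not the real dataset.

```python
import csv
import os
import tempfile

# Placeholder rows (year, temperature). These values are made up so the
# sketch is self-contained; they are NOT real temperature measurements.
sample_rows = [(2000, 14.0), (2001, 14.1), (2002, 14.2), (2003, 14.3), (2004, 14.4)]

xdata, ydata = [], []
with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name; substitute the actual input file here.
    path = os.path.join(tmpdir, "temperature_sample.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(sample_rows)

    # Read the file back: first column -> xdata, second column -> ydata.
    with open(path, newline="") as f:
        for row in csv.reader(f):
            xdata.append(float(row[0]))
            ydata.append(float(row[1]))

print(xdata)
print(ydata)
```

In a Sage worksheet, the points could then be displayed with something like list_plot(list(zip(xdata, ydata))).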
There are at least 3 different options that Sage offers
for doing linear regression. Keep in mind that the line of
best fit for any given dataset is unique (as long as the
x-values are not all identical). This is because it is defined
as the line that minimizes the sum of the squares of the errors,
and that minimization problem has exactly one solution.
So, we expect all the implementations to give the same answer!
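To make the least-squares idea concrete, here is a minimal sketch that computes the slope and intercept directly from the standard closed-form formulas. The function name best_fit_line and the four data points are illustrative assumptions, not part of the original worksheet.

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x-values are identical; the line is not unique")
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Illustrative points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = best_fit_line(xs, ys)
print(slope, intercept)
```

Because these formulas determine a single slope and intercept, any correct linear-regression routine should reproduce the same two numbers, up to rounding.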
Method 1: Using stats.linregress
As the name suggests, this method only does linear regression
-- as opposed to fitting other (nonlinear) models to the data.
The code segment below shows how to use it to fit the global
temperature dataset. In addition, it shows how to plot the results.
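A minimal sketch of this method, assuming that stats refers to the scipy.stats module (as is usual when SciPy is used from Sage). The xdata and ydata arrays below are synthetic stand-ins generated from a formula, not the real temperature dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT real temperature measurements):
# a straight line plus a small alternating wiggle.
xdata = np.arange(1900, 2000, 10)                  # 1900, 1910, ..., 1990
wiggle = 0.05 * (-1.0) ** np.arange(len(xdata))    # +0.05, -0.05, ...
ydata = 13.5 + 0.01 * (xdata - 1900) + wiggle

# linregress returns the slope, intercept and some fit statistics.
result = stats.linregress(xdata, ydata)
slope, intercept = result.slope, result.intercept
print("slope     =", slope)
print("intercept =", intercept)
print("r-value   =", result.rvalue)

# Values of the fitted line at the data points, ready for plotting.
yfit = slope * xdata + intercept
```

The fitted values in yfit can be drawn on top of the data points to compare the line with the data.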
Method 2: Using polyfit
The polyfit function can fit a polynomial of any
specified order. Thus it can be used for linear regression,
as well as for higher order regression. It returns the best
least-squares fit to the dataset.
The code segment below illustrates how to use it to fit both a linear and a 3rd-order polynomial to the global temperature dataset. In addition, it shows plots of the results.
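A minimal sketch of this method, assuming that polyfit refers to numpy.polyfit. The data are synthetic (generated from a known cubic), and shifting the years so that they start at zero is a choice made here to keep the cubic fit numerically well behaved; neither comes from the original worksheet.

```python
import numpy as np

# Synthetic stand-in data (NOT real temperature measurements).
xdata = np.arange(1900, 2000, 10)        # 1900, 1910, ..., 1990
t = xdata - xdata[0]                     # years since the first data point
ydata = 13.5 + 0.01 * t - 1e-4 * t**2 + 1e-6 * t**3

# polyfit returns coefficients ordered from the highest power down.
linear_coeffs = np.polyfit(t, ydata, 1)  # [slope, intercept]
cubic_coeffs = np.polyfit(t, ydata, 3)   # [c3, c2, c1, c0]

# Evaluate both fitted polynomials at the data points.
linear_fit = np.polyval(linear_coeffs, t)
cubic_fit = np.polyval(cubic_coeffs, t)

# Compare the sums of squared errors of the two fits.
linear_sse = np.sum((ydata - linear_fit) ** 2)
cubic_sse = np.sum((ydata - cubic_fit) ** 2)
print("linear coefficients:", linear_coeffs)
print("cubic coefficients: ", cubic_coeffs)
print("linear SSE:", linear_sse, " cubic SSE:", cubic_sse)
```

The cubic can never have a larger sum of squared errors than the line, since every line is also a cubic whose two highest coefficients are zero.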
Method 3: Using curve_fit
The curve_fit function lets the user define any type
of model function, including linear, polynomial, trigonometric,
exponential, etc. It returns the best least-squares fit of the
model function to the given dataset. This method requires a bit
more coding, since it allows for more general models.
The code segment below illustrates how to use it to fit both a linear and a trigonometric function to the global temperature dataset. Note that the user must define a prototype function of each kind before curve_fit can be used. The inputs to curve_fit are the name of the function prototype, and the x, y data sets.
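A minimal sketch of this method, assuming that curve_fit refers to scipy.optimize.curve_fit. The model names linear_model and trig_model, the fixed PERIOD constant, and the synthetic data are all illustrative assumptions; the original worksheet may use a different trigonometric form.

```python
import numpy as np
from scipy.optimize import curve_fit

PERIOD = 20.0  # hypothetical, fixed period of the oscillation (in x-units)


def linear_model(x, m, c):
    """Prototype for a straight line: y = m*x + c."""
    return m * x + c


def trig_model(x, a, b, c):
    """Prototype for a constant plus one sine/cosine pair of fixed period."""
    w = 2.0 * np.pi / PERIOD
    return a + b * np.sin(w * x) + c * np.cos(w * x)


# Synthetic stand-in data (NOT real temperature measurements),
# generated from trig_model with known parameters.
xdata = np.arange(50, dtype=float)
ydata = trig_model(xdata, 14.0, 0.3, -0.2)

# curve_fit takes the prototype function followed by the x and y data.
# It returns the best-fit parameters and their covariance matrix.
linear_params, linear_cov = curve_fit(linear_model, xdata, ydata)
trig_params, trig_cov = curve_fit(trig_model, xdata, ydata)

print("linear parameters (m, c):  ", linear_params)
print("trig parameters (a, b, c): ", trig_params)
```

Keeping the period fixed makes the trigonometric model linear in its parameters, so no initial guess is needed. If the frequency were itself a fitted parameter, a reasonable starting guess (the p0 argument of curve_fit) would usually be required.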
Exercise
The file Thailand_tourism_2009_2016.csv
contains data from the Department of Tourism
on total monthly foreign tourist arrivals in Thailand between the
years 2009 and 2016. The data is in the form of a comma-separated
spreadsheet with 3 columns: number of tourists, year, month. The
number of tourists is given in millions, and the months
are numbered consecutively from 1 to 96.
The first line in the file
is a header containing column headings. The actual data start from
the 2nd line.
The goal of this exercise is to read the necessary data from the input file and fit two different models to it: (1) linear, and (2) any one nonlinear model.
Plot your results in the form of graphs, and also give each model's mathematical form.
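As a starting hint for the reading step only, here is a minimal sketch that loads a comma-separated file with one header line and three columns. To stay self-contained it writes a tiny placeholder file first; the header names and the three data rows are made up and are not taken from Thailand_tourism_2009_2016.csv.

```python
import os
import tempfile

import numpy as np

# Placeholder contents: hypothetical header names and made-up rows,
# laid out as described above (tourists in millions, year, month number).
placeholder_text = (
    "tourists,year,month\n"
    "1.0,2009,1\n"
    "1.1,2009,2\n"
    "1.2,2009,3\n"
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "placeholder_tourism.csv")
    with open(path, "w") as f:
        f.write(placeholder_text)

    # skiprows=1 skips the header line; unpack=True returns one array per column.
    tourists, year, month = np.loadtxt(path, delimiter=",", skiprows=1, unpack=True)

print(month)     # consecutive month numbers (the x-variable)
print(tourists)  # tourist arrivals in millions (the y-variable)
```

With the real file, only the path needs to change. The month and tourists arrays can then be passed to any of the three fitting methods above.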