Project: Samuel Cabrera Valencia - f19_modsim

Path: projects / guidelines / programming_tips.ipynb

Kernel: Python 3

Programming Tips

This notebook covers various programming tips that you might find useful as you work through projects.

Pandas

In ModSim, you will primarily work with two types of Pandas data: series and data frames. (The TimeSeries and SweepSeries types in ModSim work very similarly to series in Pandas.) The Pandas documentation for series and for data frames provides example code and shows everything you can do with these data types. You do not need to know what all of these features do, but if you want to see whether there is an easy way to do X with a data frame or series, the documentation is a good place to start. Below, I will highlight a few things that you may want to do in your project.

Here is some sample data to serve as examples:

In [1]:

import pandas as pd

series = pd.Series([4, 36, 45, 50, 75])
series

0     4
1    36
2    45
3    50
4    75
dtype: int64

In [2]:

frame = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})
frame

	a	b	c
0	1	2	3
1	4	5	6
2	7	8	9

How to get a specific row or column out of a data frame

Getting a specific value out of a series is similar to how you set the value of a series: put the index in square brackets. Remember that the index starts at 0, so using an index like [1] will actually give you the second item in the series.

In [3]:

series[1]

36

This works with getting columns out of a data frame:

In [4]:

frame['a']

0    1
1    4
2    7
Name: a, dtype: int64

But this doesn't work for getting a row out of a data frame:

In [5]:

# This will result in KeyError: 1
frame[1]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 1

During handling of the above exception, another exception occurred:
KeyError                                  Traceback (most recent call last)
<ipython-input-5-c53f53f30931> in <module>
      1 # This will give result in KeyError: 1
----> 2 frame[1]

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2978             if self.columns.nlevels > 1:
   2979                 return self._getitem_multilevel(key)
-> 2980             indexer = self.columns.get_loc(key)
   2981             if is_integer(indexer):
   2982                 indexer = [indexer]
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897                 return self._engine.get_loc(key)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2901         if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 1

With data frames, you have to use loc, which looks up a row by its index label, like so:

In [6]:

frame.loc[1]

a    4
b    5
c    6
Name: 1, dtype: int64

If you want to get a specific cell of a data frame, you can use at:

In [7]:

frame.at[1, 'b']

5
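
One caveat worth sketching: loc and at look things up by label, not by position. The two only coincide above because frame uses the default 0, 1, 2 index. The example below is not from the original notebook; it assumes a hypothetical frame called labeled with string row labels, and uses iloc and iat for position-based access.

```python
import pandas as pd

# Hypothetical frame with string row labels instead of the default 0, 1, 2.
labeled = pd.DataFrame(
    {'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]},
    index=['x', 'y', 'z'],
)

row_by_label = labeled.loc['y']        # the row whose label is 'y'
row_by_position = labeled.iloc[1]      # the second row, whatever its label
cell_by_label = labeled.at['y', 'b']   # one cell, by row and column label
cell_by_position = labeled.iat[1, 1]   # one cell, by row and column position

print(row_by_label.equals(row_by_position))  # True
print(cell_by_label, cell_by_position)       # 5 5
```

With this frame, labeled.loc[1] would fail with a KeyError, because no row is labeled 1.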

Information about series or data frames

You can find the rows and columns of a data frame by just printing it:

In [8]:

frame

	a	b	c
0	1	2	3
1	4	5	6
2	7	8	9

But if the data frame is large, this will take up a lot of screen space. You can instead use index (for the rows), which works for both series and data frames, and columns (for the columns), which only works for data frames:

In [9]:

series.index

RangeIndex(start=0, stop=5, step=1)

In [10]:

frame.index

RangeIndex(start=0, stop=3, step=1)

In [11]:

frame.columns

Index(['a', 'b', 'c'], dtype='object')

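RangeIndex might look familiar: it works like Python's range. It stores a start, a stop, and a step, and represents the values from start up to (but not including) stop, spaced step apart. So RangeIndex(start=0, stop=5, step=1) stands for the labels 0, 1, 2, 3, 4.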
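
A quick way to see what a RangeIndex contains is to turn it into a list. This is a small sketch rather than part of the original notebook; it shows that stop itself is excluded and that step is the gap between consecutive values.

```python
import pandas as pd

index = pd.RangeIndex(start=0, stop=5, step=1)
values = list(index)
print(values)  # [0, 1, 2, 3, 4]

# A step of 2 skips every other value; stop (10) is still excluded.
every_other = list(pd.RangeIndex(start=0, stop=10, step=2))
print(every_other)  # [0, 2, 4, 6, 8]
```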

You can also get the total number of values in a data frame or series:

In [12]:

series.size

5

In [13]:

frame.size

9

The "shape" will tell you how many rows and columns are in a data frame, or how many values are in a series:

In [14]:

series.shape

(5,)

In [15]:

frame.shape

(3, 3)

To get just the number of rows, put [0] after shape; for just the number of columns, use [1]. A series has no columns, so only [0] works for it:

In [16]:

series.shape[0]

5

In [17]:

frame.shape[0]

3

In [18]:

frame.shape[1]

3
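
As a small aside, and assuming the same sample data as above, you can also unpack shape into two variables, or use Python's built-in len to count rows. This is just one convenient way to write it.

```python
import pandas as pd

series = pd.Series([4, 36, 45, 50, 75])
frame = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})

# Unpack the (rows, columns) tuple in one step.
n_rows, n_columns = frame.shape
print(n_rows, n_columns)  # 3 3

# len counts rows for a data frame and values for a series.
print(len(frame))   # 3
print(len(series))  # 5
```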

It's not always clear if you are working with a series or data frame. You can use type to see what a piece of data is.

In [19]:

type(series)

pandas.core.series.Series

In [20]:

type(frame)

pandas.core.frame.DataFrame

For example, getting a row or column out of a data frame will give you a series:

In [21]:

type(frame['b'])

pandas.core.series.Series

In [22]:

type(frame.loc[1])

pandas.core.series.Series
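
If you want your code to behave differently depending on what it was given, isinstance is often handier than reading the output of type. The helper below, describe_kind, is a hypothetical name made up for this sketch.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})

def describe_kind(data):
    """Return a short label for the kind of Pandas object passed in."""
    if isinstance(data, pd.DataFrame):
        return 'data frame'
    if isinstance(data, pd.Series):
        return 'series'
    return 'something else'

print(describe_kind(frame))             # data frame
print(describe_kind(frame['b']))        # series
print(describe_kind(frame.loc[1]))      # series
print(describe_kind(frame.at[1, 'b']))  # something else (a single number)
```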

How to find the mean, median, etc. of a data frame or series

Typically you will get the mean or median of a specific row or column of a data frame, which as we saw above is a series. We do that like this:

In [23]:

series.mean()

42.0

In [24]:

frame['a'].median()

4.0

The .0 at the end means that the value is being treated as a float, which is Python's way of representing decimal numbers. If you need an integer for whatever reason, you can use int. Note that int drops the fractional part rather than rounding:

In [25]:

int(series.median())

45
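
The median above happens to be a whole number, so int changes nothing visible. With a made-up value like 42.9 (not from the sample data), the difference between int and round shows up:

```python
value = 42.9  # made-up value for illustration

truncated = int(value)  # drops the fractional part
rounded = round(value)  # rounds to the nearest integer

print(truncated, rounded)  # 42 43
```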

You can get the maximum and minimum values with max and min.

In [26]:

series.max()

75

For a data frame, this will compute the maximum and minimum for each column by default:

In [27]:

frame.min()

a    1
b    2
c    3
dtype: int64
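
If you want one value per row instead of per column, these functions accept an axis argument. A minimal sketch using the same sample frame:

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})

column_max = frame.max()          # one value per column (the default, axis=0)
row_max = frame.max(axis=1)       # one value per row
overall_max = frame.max().max()   # the largest value in the whole frame

print(column_max.tolist())  # [7, 8, 9]
print(row_max.tolist())     # [3, 6, 9]
print(overall_max)          # 9
```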

Another useful thing you can do is to sum the values in a series.

In [28]:

series.sum()

210

Making changes to series or data frames

If you assign a series to another variable, it does not copy the series. Any change you make to that variable will affect the original series.

In [29]:

series

0     4
1    36
2    45
3    50
4    75
dtype: int64

In [30]:

copy = series
copy[1] = 42
series

0     4
1    42
2    45
3    50
4    75
dtype: int64

Let's set that back to its original value:

In [31]:

series[1] = 36
series

0     4
1    36
2    45
3    50
4    75
dtype: int64

If you need to keep the original values, use copy:

In [32]:

copy = series.copy()
copy[1] = 42
copy

0     4
1    42
2    45
3    50
4    75
dtype: int64

Now copy and series are different:

In [33]:

series

0     4
1    36
2    45
3    50
4    75
dtype: int64
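
One way to check whether two variables refer to the very same series is Python's is operator. The names alias and duplicate below are made up for this sketch.

```python
import pandas as pd

series = pd.Series([4, 36, 45, 50, 75])

alias = series             # a second name for the same object
duplicate = series.copy()  # a new, independent object

print(alias is series)      # True
print(duplicate is series)  # False

# Changing the copy leaves the original untouched.
duplicate[1] = 42
print(series[1])     # 36
print(duplicate[1])  # 42
```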

How to only get cells that match a certain condition

You can actually compare series or data frames to values to get a frame of True and False values that reflect that comparison:

In [34]:

frame > 5

	a	b	c
0	False	False	False
1	False	False	True
2	True	True	True

You can use these to get only the cells that match that condition:

In [35]:

series[series > 40]

2    45
3    50
4    75
dtype: int64
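
The same idea extends to combined conditions and to picking rows out of a data frame. The sketch below uses the sample data from this notebook; note that each condition needs its own parentheses when you combine them with & (and) or | (or).

```python
import pandas as pd

series = pd.Series([4, 36, 45, 50, 75])
frame = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})

# Values strictly between 30 and 60.
middle = series[(series > 30) & (series < 60)]
print(middle.tolist())  # [36, 45, 50]

# Keep only the rows of the frame where column 'a' is greater than 1.
rows = frame[frame['a'] > 1]
print(rows.index.tolist())  # [1, 2]
```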

How to check equality

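If you want to check that two series are equal, using == won't work as you might expect: it compares the series element by element and gives you a series of True and False values rather than a single answer: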

In [36]:

other_series = pd.Series([4, 42, 45, 50, 75])
series == other_series

0     True
1    False
2     True
3     True
4     True
dtype: bool

To just get a yes or no answer as to whether two series match completely, use equals:

In [37]:

series.equals(other_series)

False
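
One thing to watch for: equals demands an exact match, which can be surprising with floats. The sketch below uses made-up values and NumPy's allclose (an addition, not something the original notebook uses) to compare within a small tolerance instead.

```python
import numpy as np
import pandas as pd

computed = pd.Series([0.1 + 0.2, 1.0])
expected = pd.Series([0.3, 1.0])

# 0.1 + 0.2 is not exactly 0.3 in floating point, so equals says no.
print(computed.equals(expected))  # False

# allclose treats values as equal if they differ only by a tiny amount.
print(np.allclose(computed, expected))  # True
```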

How to read CSV files

A CSV (comma-separated values) file contains data that typically looks like this:

a,b,c
1,2,3
4,5,6
7,8,9

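You can read a CSV file into a data frame using read_csv. Note that the data.csv file read below holds different values from the sample above: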

In [38]:

frame_from_file = pd.read_csv('data.csv')
frame_from_file

	a	b	c
0	3	4	5
1	5	12	13
2	8	15	17
3	7	24	25

By default, read_csv treats the first line of the file as the column headers.
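
If you want to experiment without creating a file, you can hand read_csv some text wrapped in io.StringIO. This is a sketch using the sample CSV text shown earlier in this section; it also shows what happens if you tell read_csv that there is no header line.

```python
import io
import pandas as pd

# The sample CSV text, held in memory instead of in a file.
csv_text = "a,b,c\n1,2,3\n4,5,6\n7,8,9\n"

with_header = pd.read_csv(io.StringIO(csv_text))
print(with_header.columns.tolist())  # ['a', 'b', 'c']
print(with_header.shape)             # (3, 3)

# With header=None, the first line is treated as data, not as column names.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header.shape)  # (4, 3)
```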

Debugging

If something does not go as you expect, you should start by thinking about (and maybe writing down) what you expect to have happened versus what did happen.

Let's go back to an earlier example error:

In [39]:

# This will result in an error
frame[1]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 1

During handling of the above exception, another exception occurred:
KeyError                                  Traceback (most recent call last)
<ipython-input-39-a4ecee57c551> in <module>
      1 # This will result in an error
----> 2 frame[1]

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2978             if self.columns.nlevels > 1:
   2979                 return self._getitem_multilevel(key)
-> 2980             indexer = self.columns.get_loc(key)
   2981             if is_integer(indexer):
   2982                 indexer = [indexer]
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897                 return self._engine.get_loc(key)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2901         if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 1

The very last line tells us what the specific error was: KeyError: 1. If you are new to Python, this is probably not very helpful. But if we figure out where the error happened, we might be able to get a better sense of what the error actually was. Error messages in Python usually include a traceback (the word appears near the top of the message), which roughly describes all the functions that were executing when the error occurred.

Above the last line, there are lines that look like some/folder/file.ext in some.python.function(). We can see that the files look like pandas/_libs/... which means that the traceback is looking through code that is part of the Pandas library. The error is probably not here, since we can be reasonably sure that the Pandas authors did a good job. Very occasionally there is a mistake in the library itself, but you should assume that this is almost certainly not the case.

If we continue up from the bottom, we see the traceback go through /srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py with lines of code beneath. The site-packages part of the path indicates that this is still library code (Pandas, as installed in the conda environment), which is also probably not the cause of the error.

Finally, we see this:

<ipython-input-39-a4ecee57c551> in <module>
      1 # This will result in an error
----> 2 frame[1]

The <ipython-input-...> means that this is happening in a Jupyter cell. In fact, the lines that follow are what I typed into the cell. The error message is pointing at the statement frame[1], which indicates that this line is responsible for the error. Generally, you should take the line that is closest to the bottom and not part of library code to be the source of the error.

In case you are curious, when you type something like frame[1], the 1 is called the key. So KeyError: 1 means that in frame, the use of 1 as a key failed. Intuitively, this means that frame does not have a column named 1 and therefore Python doesn't know how to satisfy your request for frame[1].
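
Following that reasoning, one way to investigate a KeyError is to check whether the key is actually among the column names, and to catch the error so you can inspect it without the long traceback. This is only a sketch; lookup_failed is a made-up variable name.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})

# Is the key one of the column names?
print(1 in frame.columns)    # False
print('a' in frame.columns)  # True

# Catch the error to confirm which lookup failed.
try:
    frame[1]
    lookup_failed = False
except KeyError as error:
    lookup_failed = True
    print('KeyError for key:', error)

print(lookup_failed)  # True
```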

Implementation

Always know what your code should be doing before you implement a function. If you cannot articulate what your function is supposed to do, then you should stop writing that function until you can sketch out what the function does.

It is often helpful to start by documenting your function before you write any code in the body of the function. You can use the pass keyword to avoid errors about not having written any code in the function.

In [40]:

def relative_to_mean(series):
    """
    Compute the proportion of each of a series's cell values relative to the mean.
    
    Args:
        series: the Pandas series containing the input data values.
        
    Returns:
        A Pandas series where each cell's value is the original cell's value relative to the series's mean.
    """
    pass

It can also be helpful to write a test function that will tell you if your function is doing what you expect.

In [41]:

def test_relative_to_mean(relative_to_mean):
    series = pd.Series([4, 36, 45, 50, 75])
    expected_results = pd.Series([4 / 42, 36 / 42, 45 / 42, 50 / 42, 75 / 42])
    return expected_results.equals(relative_to_mean(series))

This test fails for now, because the function body is just pass and so the function returns None:

In [42]:

test_relative_to_mean(relative_to_mean)

False

We can start implementing by writing a series of comments that break the function into its major steps.

In [43]:

def relative_to_mean(series):
    """
    Compute the proportion of each of a series's cell values relative to the mean.
    
    Args:
        series: the Pandas series containing the input data values.
        
    Returns:
        A Pandas series where each cell's value is the original cell's value relative to the series's mean.
    """
    # Calculate the mean of the series.
    # Divide each cell value by the mean.
    # Return the results.
    pass

We know how to do the first step, so we can write the first line of code. We can also take out pass once we write the first line:

In [44]:

def relative_to_mean(series):
    """
    Compute the proportion of each of a series's cell values relative to the mean.
    
    Args:
        series: the Pandas series containing the input data values.
        
    Returns:
        A Pandas series where each cell's value is the original cell's value relative to the series's mean.
    """
    # Calculate the mean of the series.
    avg = series.mean()
    # Divide each cell value by the mean.
    # Return the results.

How do we divide each cell value by the mean? To find out, I looked at the documentation for Pandas series and did a Ctrl-F search for "divide". I found a function called div, which has its own documentation. It says that other can be a scalar value, so let's try passing a single integer.

In [45]:

series.div(2)

0     2.0
1    18.0
2    22.5
3    25.0
4    37.5
dtype: float64
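
As far as I can tell, div with a scalar gives the same result as the ordinary / operator, so you can use whichever reads better to you. A quick check on the sample series:

```python
import pandas as pd

series = pd.Series([4, 36, 45, 50, 75])

with_method = series.div(2)
with_operator = series / 2

print(with_method.equals(with_operator))  # True
```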

Nice! So now we can add the second line of code to the function:

In [46]:

def relative_to_mean(series):
    """
    Compute the proportion of each of a series's cell values relative to the mean.
    
    Args:
        series: the Pandas series containing the input data values.
        
    Returns:
        A Pandas series where each cell's value is the original cell's value relative to the series's mean.
    """
    # Calculate the mean of the series.
    avg = series.mean()
    # Divide each cell value by the mean.
    divided_series = series.div(avg)
    # Return the results.

Finally, we just need to return the variable we just made.

In [47]:

def relative_to_mean(series):
    """
    Compute the proportion of each of a series's cell values relative to the mean.
    
    Args:
        series: the Pandas series containing the input data values.
        
    Returns:
        A Pandas series where each cell's value is the original cell's value relative to the series's mean.
    """
    # Calculate the mean of the series.
    avg = series.mean()
    # Divide each cell value by the mean.
    divided_series = series.div(avg)
    # Return the results.
    return divided_series

We can take out the comments for a short function like this. If it takes many lines of code to do each major step, we might want to leave in those comments. One way I personally use to decide whether to keep the comments in is to ask myself how obvious the code would be to someone who knows more Python than me. If I don't think it would be obvious (particularly why I wrote that code), I leave the comment in.

In this case, we can just take them out, so our final function looks like this:

In [48]:

def relative_to_mean(series):
    """
    Compute the proportion of each of a series's cell values relative to the mean.
    
    Args:
        series: the Pandas series containing the input data values.
        
    Returns:
        A Pandas series where each cell's value is the original cell's value relative to the series's mean.
    """
    avg = series.mean()
    divided_series = series.div(avg)
    return divided_series
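
To check the finished function, one option besides equals is pd.testing.assert_series_equal, which compares floats within a small tolerance and raises an AssertionError describing the mismatch if the series differ. The sketch below repeats a condensed version of the function so that it runs on its own; using assert_series_equal here is my suggestion, not something the test function above does.

```python
import pandas as pd

def relative_to_mean(series):
    """Divide each value in a series by the series's mean."""
    return series.div(series.mean())

series = pd.Series([4, 36, 45, 50, 75])
expected = pd.Series([4 / 42, 36 / 42, 45 / 42, 50 / 42, 75 / 42])

result = relative_to_mean(series)

# Raises an AssertionError with details if the two series do not match.
pd.testing.assert_series_equal(result, expected)
print('relative_to_mean matches the expected values')
```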

Jupyter Pitfalls

When you work in Jupyter notebooks, errors can sometimes be a bit unintuitive.

Out-of-Order Execution

If you see this:

In [42]: x = 42

In [41]: x = 3.14

What will the value of x be?

In [43]: x
Out [43]: 42

The number in In [42] is larger than the one in In [41], which means that cell ran more recently. So even though it appears first in the notebook, the current value of x is 42.

If you suspect this has happened and is causing an error, you can select a cell, and then use Cell -> Run All Above in the top menu to run all the cells above that one in order. Sometimes, this can get the state of the notebook closer to what you expect it to be, which will in turn make finding the bug easier.

Repeated Variable Names

Avoid repeating variable names within a notebook if possible.

This can cause your errors to appear as if they are something completely different. For example, suppose you make a state object with sample values, like state = make_state(1, 2, 3), and later write something like make_state(42, 2, 100) (note that the result is not assigned to a variable). When you run your simulation with run_simulation(state, system), it still uses the old sample values, so you are going to get completely wrong results. It will look like the error is in the code that runs your simulation, even though the real error is that you forgot to assign your new state to a variable.

This also illustrates another pitfall of repeating variable names: it can cause your code to appear to succeed, even if your code is wrong.
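
Here is a rough sketch of that forgotten-assignment bug. The make_state and run_simulation functions below are simplified, hypothetical stand-ins written for this example (the real ones in your project will look different, and run_simulation here takes only a state).

```python
# Hypothetical stand-ins for the helpers named in the text.
def make_state(a, b, c):
    return {'a': a, 'b': b, 'c': c}

def run_simulation(state):
    return state['a'] + state['b'] + state['c']

state = make_state(1, 2, 3)  # sample values from earlier in the notebook
make_state(42, 2, 100)       # bug: the new state is created but never assigned

stale_result = run_simulation(state)
print(stale_result)  # 6, computed from the old sample values

state = make_state(42, 2, 100)  # fix: assign the new state
fresh_result = run_simulation(state)
print(fresh_result)  # 144
```

Notice that the buggy version runs without any error message, which is exactly what makes this kind of mistake hard to spot.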