Programming Tips
Steve Matsumoto
This notebook will cover various tips for programming that you might find useful as you work through projects.
Pandas
In ModSim, you will primarily work with two types of Pandas data: series and data frames. (The TimeSeries and SweepSeries types in ModSim work very similarly to series in Pandas.) The Pandas documentation for series and for data frames provides example code and illustrates everything you can do with these data types. You do not need to know what all of these features do, but if you want to see whether there is an easy way to do X with a data frame or series, the documentation is a good place to start. Below, I will highlight a few things that you may consider doing in your project.
Here is some sample data to serve as examples:
| | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |
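The table above could be built with something like the following sketch. The name `frame` matches the error output later in this notebook; the values in `series` are made up purely for illustration.

```python
import pandas as pd

# Rebuild the sample table: rows 0-2 and columns a, b, c.
frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# A small series to experiment with (hypothetical values).
series = pd.Series([10, 20, 30])

print(frame)
print(series)
```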
How to get a specific row or column out of a data frame
Getting a specific value out of a series is similar to how you set the value of a series. Remember that the index starts at 0, so using an index like `[1]` will actually give you the second item in the series.
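A minimal sketch of reading and setting a value, assuming a series with the default 0, 1, 2, ... index (the values are hypothetical):

```python
import pandas as pd

series = pd.Series([10, 20, 30])  # hypothetical values, default index 0, 1, 2

second = series[1]   # the item labelled 1, which is the second item
print(second)        # 20

series[1] = 25       # setting a value uses the same bracket syntax
print(series[1])     # 25
```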
This works with getting columns out of a data frame:
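For instance, assuming `frame` holds the sample data from above, square brackets with a column name return that column:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

column_a = frame["a"]  # the key is the column name
print(column_a)
```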
But this doesn't work for getting a row out of a data frame:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2896 try:
-> 2897 return self._engine.get_loc(key)
2898 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 1
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-5-c53f53f30931> in <module>
1 # This will give result in KeyError: 1
----> 2 frame[1]
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2978 if self.columns.nlevels > 1:
2979 return self._getitem_multilevel(key)
-> 2980 indexer = self.columns.get_loc(key)
2981 if is_integer(indexer):
2982 indexer = [indexer]
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2897 return self._engine.get_loc(key)
2898 except KeyError:
-> 2899 return self._engine.get_loc(self._maybe_cast_indexer(key))
2900 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2901 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 1
With data frames, you have to use `loc`, like so:
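A short sketch of row access with `loc`, again assuming `frame` holds the sample data:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

row_1 = frame.loc[1]  # the row whose index label is 1
print(row_1)
```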
If you want to get a specific cell of a data frame, you can use `at`:
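As a sketch on the sample data, `at` takes the row label first and the column name second:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

cell = frame.at[1, "b"]  # row 1, column b
print(cell)              # 5
```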
Information about series or data frames
You can find the rows and columns of a data frame by just printing it:
| | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |
But if the data frame is large, this will take up a lot of screen space. You can instead use `index` (for the rows), which works for both series and data frames, and `columns` (for the...columns), which only works for data frames:
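A sketch of both attributes on the sample data (the `series` values are hypothetical):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})
series = pd.Series([10, 20, 30])  # hypothetical values

print(frame.index)    # RangeIndex(start=0, stop=3, step=1)
print(series.index)   # RangeIndex(start=0, stop=3, step=1)
print(frame.columns)  # the column labels a, b, c
```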
`RangeIndex` might look familiar: it gives you a start, a stop, and a step size, and represents the evenly spaced values from the start up to (but not including) the stop.
You can also get the total number of values in a data frame or series:
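One way to do this is the `size` attribute, sketched here on the sample data (the `series` values are hypothetical):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})
series = pd.Series([10, 20, 30])  # hypothetical values

print(frame.size)   # 9 (3 rows times 3 columns)
print(series.size)  # 3
```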
The "shape" will tell you how many rows and columns are in a series or data frame:
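A quick sketch using the `shape` attribute. Note that a series has only one dimension, so its shape has a single entry.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})
series = pd.Series([10, 20, 30])  # hypothetical values

print(frame.shape)   # (3, 3): rows first, then columns
print(series.shape)  # (3,)
```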
To get just the number of rows or just the number of columns, add `[0]` after `shape` for the number of rows or `[1]` for the number of columns:
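For example, on the sample data:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

n_rows = frame.shape[0]
n_columns = frame.shape[1]
print(n_rows, n_columns)  # 3 3
```

Since a series has a one-entry shape, only `[0]` applies to it.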
It's not always clear whether you are working with a series or a data frame. You can use `type` to see what a piece of data is.
For example, getting a row or column out of a data frame will give you a series:
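A sketch of checking types on the sample data; `isinstance` is used here as a compact way to confirm what `type` reports:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

print(type(frame))         # a DataFrame
print(type(frame["a"]))    # a column is a Series
print(type(frame.loc[0]))  # a row is a Series too

column_is_series = isinstance(frame["a"], pd.Series)
row_is_series = isinstance(frame.loc[0], pd.Series)
```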
How to find the mean, median, etc. of a data frame or series
Typically you will get the mean or median of a specific row of a data frame, which as we saw above is a series. We do that like this:
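A minimal sketch, taking row 1 of the sample data as the series:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

row = frame.loc[1]   # the values 4, 5, 6
print(row.mean())    # 5.0
print(row.median())  # 5.0
```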
The `.0` at the end means that the value is being treated as a float, which is Python's way of representing decimal numbers. If you need to use an integer for whatever reason, you can use `int`:
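For example, converting the mean of row 1 of the sample data. Keep in mind that `int` drops anything after the decimal point rather than rounding.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

mean_as_int = int(frame.loc[1].mean())
print(mean_as_int)  # 5

print(int(2.9))     # 2, because int truncates instead of rounding
```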
You can get the maximum and minimum values with `max` and `min`.
For a data frame, this will compute the maximum and minimum for each column by default:
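A sketch of both cases on the sample data, first for a single row (a series) and then for the whole data frame:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

row = frame.loc[1]
print(row.max())  # 6
print(row.min())  # 4

column_max = frame.max()  # one maximum per column
column_min = frame.min()  # one minimum per column
print(column_max)
print(column_min)
```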
Another useful thing you can do is to sum the values in a series.
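For example, summing one row and one column of the sample data:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

row_total = frame.loc[1].sum()    # 4 + 5 + 6
column_total = frame["a"].sum()   # 1 + 4 + 7
print(row_total, column_total)    # 15 12
```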
Making changes to series or data frames
If you assign a series to another variable, it does not copy the series. Any change you make to that variable will affect the original series.
Let's set that back to its original value:
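The aliasing behavior and the reset described above can be sketched as follows. The name `alias` and the values are hypothetical.

```python
import pandas as pd

series = pd.Series([10, 20, 30])  # hypothetical values

alias = series       # no copy is made: both names refer to the same series
alias[0] = 99
changed = series[0]
print(changed)       # 99, the original changed too

series[0] = 10       # set it back to its original value
```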
If you need to keep the original values, use `copy`:
Now `copy` and `series` are different:
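A sketch of the copy in action, with hypothetical values. The variable is named `copy` to match the text above.

```python
import pandas as pd

series = pd.Series([10, 20, 30])  # hypothetical values

copy = series.copy()  # an independent copy of the data
copy[0] = 99

print(series[0])  # 10, the original is untouched
print(copy[0])    # 99
```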
How to only get cells that match a certain condition
You can compare a series or data frame to a value to get a series or frame of `True` and `False` values that reflect that comparison:
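On the sample data, a comparison such as `frame > 5` is consistent with the output shown here, though the exact comparison in the original cell is an assumption:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

mask = frame > 5  # True wherever the cell value is greater than 5
print(mask)
```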
| | a | b | c |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | True |
| 2 | True | True | True |
You can use these to get only the cells that match that condition:
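A sketch of filtering, assuming a "greater than 5" condition and hypothetical `series` values. In a data frame, cells that fail the condition become `NaN`; in a series, they are dropped.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})
series = pd.Series([10, 20, 30])  # hypothetical values

filtered = frame[frame > 5]  # non-matching cells are replaced with NaN
print(filtered)

large = series[series > 15]  # non-matching items are left out
print(large)
```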
How to check equality
If you want to check that two series are equal, using `==` won't work as you expect:
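The reason is that `==` compares the series item by item and returns a series of `True` and `False` values rather than a single answer. A sketch with hypothetical values:

```python
import pandas as pd

first = pd.Series([1, 2, 3])
second = pd.Series([1, 2, 4])

comparison = first == second  # one True or False per item
print(comparison)
```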
To just get a yes or no answer as to whether two series match completely, use `equals`:
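A sketch with hypothetical values:

```python
import pandas as pd

first = pd.Series([1, 2, 3])
same = pd.Series([1, 2, 3])
different = pd.Series([1, 2, 4])

print(first.equals(same))       # True
print(first.equals(different))  # False
```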
How to read CSV files
A CSV (comma-separated values) file contains data that typically looks like this:
You can read these values using `read_csv`:
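The original file is not available here, so this sketch recreates CSV text that matches the output shown and reads it from memory with `io.StringIO`. With a real file, you would pass its path to `read_csv` instead.

```python
import io

import pandas as pd

# CSV text reconstructed from the table shown in this section.
csv_text = "a,b,c\n3,4,5\n5,12,13\n8,15,17\n7,24,25\n"

# StringIO lets read_csv treat the string as if it were a file.
triples = pd.read_csv(io.StringIO(csv_text))
print(triples)
```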
| | a | b | c |
|---|---|---|---|
| 0 | 3 | 4 | 5 |
| 1 | 5 | 12 | 13 |
| 2 | 8 | 15 | 17 |
| 3 | 7 | 24 | 25 |
By default, the function treats the first line of the file as the column headers.
Debugging
If something does not go as you expect, you should start by thinking about (and maybe writing down) what you expected to happen versus what actually happened.
Let's go back to an earlier example error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2896 try:
-> 2897 return self._engine.get_loc(key)
2898 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 1
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-39-a4ecee57c551> in <module>
1 # This will result in an error
----> 2 frame[1]
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2978 if self.columns.nlevels > 1:
2979 return self._getitem_multilevel(key)
-> 2980 indexer = self.columns.get_loc(key)
2981 if is_integer(indexer):
2982 indexer = [indexer]
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2897 return self._engine.get_loc(key)
2898 except KeyError:
-> 2899 return self._engine.get_loc(self._maybe_cast_indexer(key))
2900 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2901 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 1
The very last line tells us what the specific error was: `KeyError: 1`. If you are new to Python, this is probably not very helpful. But if we figure out where the error happened, we might be able to get a better sense of what the error actually was. Error messages in Python usually include a traceback (mentioned near the top), which roughly describes all the functions that were executing when the error occurred.
Above the last line, there are lines that look like `some/folder/file.ext in some.python.function()`. We can see that the files look like `pandas/_libs/...`, which means that the traceback is looking through code that is part of the Pandas library. The error is probably not here, since we can be reasonably sure that the Pandas authors did a good job. Very occasionally there is a mistake in the library itself, but you should assume that this is almost certainly not the case.
If we continue up from the bottom, we see the traceback go through `/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py` with lines of code beneath. The path shows that this file lives in the `site-packages/pandas` folder of the conda environment, so it is still installed library code and is also probably not the cause of the error.
Finally, we reach the entry labeled `<ipython-input-39-a4ecee57c551> in <module>`, where the arrow points at line 2, `frame[1]`.
The `<ipython-input-...>` means that this is happening in a Jupyter cell. In fact, the lines that follow are what I typed into the cell. The error message is pointing at the statement `frame[1]`, which indicates that this line is responsible for the error. Generally, you should take the line that is closest to the bottom and not part of library code to be the source of the error.
In case you are curious, when you type something like `frame[1]`, the `1` is called the key. So `KeyError: 1` means that in `frame`, the use of `1` as a key failed. Intuitively, this means that `frame` does not have a column named `1`, and therefore Python doesn't know how to satisfy your request for `frame[1]`.
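To see the key for yourself, one option is to catch the error and inspect it. This is only a sketch for exploration (the name `failed_key` is hypothetical); in a project you would normally just fix the lookup.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

try:
    frame[1]  # looks for a column named 1, which does not exist
except KeyError as error:
    failed_key = error.args[0]  # the key that could not be found
    print("No column named", failed_key)

row = frame.loc[1]  # asking for row 1 works
print(row)
```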
Implementation
Always know what your code should be doing before you implement a function. If you cannot articulate what your function is supposed to do, then you should stop writing that function until you can sketch out what the function does.
It is often helpful to start by documenting your function before you write any code in the body of the function. You can use the `pass` keyword to avoid errors about not having written any code in the function.
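A sketch of such a documented stub. The name `normalize_by_mean` is hypothetical, and the purpose (dividing each value in a series by the series mean) is inferred from the steps described later in this section.

```python
def normalize_by_mean(series):
    """Return a new series with each value divided by the mean of the series.

    series: a Pandas series of numbers
    """
    pass  # placeholder so that Python accepts the empty function body
```

Calling the stub runs without error, but it returns `None` because there is no code in the body yet.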
It can also be helpful to write a test function that will tell you if your function is doing what you expect.
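A sketch of a simple test function, using the hypothetical name `normalize_by_mean` and assuming the function should divide each value by the mean. The test input and expected values are chosen by hand.

```python
import pandas as pd


def normalize_by_mean(series):
    """Return a new series with each value divided by the mean of the series."""
    pass  # not implemented yet


def test_normalize_by_mean():
    """Return True if normalize_by_mean gives the expected result."""
    series = pd.Series([1, 2, 3])             # the mean is 2
    expected = pd.Series([0.5, 1.0, 1.5])     # each value divided by 2
    result = normalize_by_mean(series)
    return result is not None and result.equals(expected)


print(test_normalize_by_mean())  # False until the function is implemented
```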
Obviously, this will not work now:
We can start implementing by writing a series of comments that break the function into its major steps.
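For the hypothetical `normalize_by_mean` function, the comment outline might look like this sketch:

```python
def normalize_by_mean(series):
    """Return a new series with each value divided by the mean of the series."""
    # Step 1: compute the mean of the series.
    # Step 2: divide each value in the series by the mean.
    # Step 3: return the result.
    pass  # still needed, because comments alone are not a function body
```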
We know how to do the first step, so we can write the first line of code. We can also take out `pass` once we write the first line:
How do we divide each cell value by the mean? To do this, I actually looked at the documentation for Pandas series and did a Ctrl-F search for "divide". I found a function called `div`, which has its own documentation. It says that `other` can be a scalar value, so let's try passing a single integer.
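A quick experiment along those lines, with hypothetical values:

```python
import pandas as pd

series = pd.Series([2, 4, 6])  # hypothetical values

halved = series.div(2)  # divide every item by the same number
print(halved)
```

Every item is divided by 2, and the result comes back as a new series of floats.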
Nice! So now we can add the second line of code to the function:
Finally, we just need to return the variable we just made.
We can take out the comments for a short function like this. If it takes many lines of code to do each major step, we might want to leave in those comments. One way I personally use to decide whether to keep the comments in is to ask myself how obvious the code would be to someone who knows more Python than me. If I don't think it would be obvious (particularly why I wrote that code), I leave the comment in.
In this case, we can just take it out, so our final function looks like this:
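Under the assumptions used in this section's sketches (a hypothetical function name and a series as input), the finished function might look like this:

```python
import pandas as pd


def normalize_by_mean(series):
    """Return a new series with each value divided by the mean of the series."""
    mean = series.mean()
    normalized = series.div(mean)
    return normalized


print(normalize_by_mean(pd.Series([1, 2, 3])))
```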
Jupyter Pitfalls
When you work in Jupyter notebooks, errors can sometimes be a bit unintuitive.
Out-of-Order Execution
If you see this:
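What will the value of `x` be?
The answer depends on the order in which the cells were run, not the order in which they appear on the page. The `In[42]` label indicates that its cell ran more recently than the `In[41]` cell, so even though it appears earlier in the notebook, the current value of `x` is 42.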
If you suspect this has happened and is causing an error, you can select a cell, and then use `Cell -> Run All Above` in the top menu to run all the cells above that one in order. Sometimes, this can get the state of the notebook closer to what you expect it to be, which will in turn make finding the bug easier.
Repeated Variable Names
Avoid repeating variable names within a notebook if possible.
This can cause your errors to appear as if they are something completely different. For example, if you make a state object with sample values, like `state = make_state(1, 2, 3)`, and then write something like `make_state(42, 2, 100)` (note that it is not assigned to the variable), when you run your simulation like `run_simulation(state, system)`, you are going to get completely wrong results. It will look like the error is in the code that runs your simulation, even though the real error is that you forgot to assign your new state to a variable.
This also illustrates another pitfall of repeating variable names: it can cause your code to appear to succeed, even if your code is wrong.
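The pitfall can be reproduced with a small sketch. The `make_state` and `run_simulation` functions below are simplified, hypothetical stand-ins written only for this example; they are not the versions used in ModSim or in your project.

```python
def make_state(a, b, c):
    # Hypothetical stand-in: bundle three values into a dictionary.
    return {"a": a, "b": b, "c": c}


def run_simulation(state, system):
    # Hypothetical stand-in: a real simulation would do much more.
    return sum(state.values()) * system["scale"]


system = {"scale": 1}

state = make_state(1, 2, 3)  # sample values from earlier experimentation
make_state(42, 2, 100)       # bug: the new state is created but never assigned

result = run_simulation(state, system)
print(result)  # 6, computed from the stale sample values

state = make_state(42, 2, 100)  # fix: assign the new state to the variable
fixed_result = run_simulation(state, system)
print(fixed_result)  # 144
```

No error is raised in the buggy version, which is what makes this kind of mistake hard to spot.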