Python (Python 3.7.0): a very popular open source programming language. The core distribution contains only the essential features of a general programming language. In order to conduct statistical analysis, we will need to explicitly load several additional packages.
To manually find the current version of Python, open a new command window and type: python --version
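The version can also be read from inside a running interpreter. The following is a minimal sketch using only the standard library; the variable names are illustrative and not part of the course material.

```python
import platform
import sys

# Full version string of the running interpreter, e.g. "3.7.0"
python_version = platform.python_version()

# The same information as integers, convenient for comparisons
major, minor = sys.version_info[:2]

print("Python", python_version)
```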
IPython (ipython 6.5.0): the computational kernel that runs the Python commands. It provides the tools for interactive data analysis: you can quickly display graphs, change directories, explore the workspace, and browse the command history.
To manually find the current version of IPython, open a new command window and type: ipython --version
To manually upgrade to the current version of IPython, type: pip install --upgrade ipython
PyPI: The Python Package Index
The Python Package Index (PyPI) is a repository of software for the Python programming language. Individual packages from PyPI can be installed easily from the Windows command shell (cmd) or the terminal window:
pip install [_package_]
To update a package:
pip install --upgrade [_package_]
To get a list of all Python packages on your computer:
pip list
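A similar listing can be produced from inside Python. This is a hedged sketch that assumes Python 3.8 or later, where importlib.metadata is part of the standard library (it is not available in the Python 3.7.0 listed above). The helper name installed_packages is hypothetical.

```python
from importlib import metadata  # standard library from Python 3.8 onward


def installed_packages():
    """Return a sorted list of (name, version) pairs for installed distributions."""
    pairs = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
        except Exception:
            # Unreadable metadata: skip this distribution
            continue
        if name:
            pairs.add((name, dist.version))
    return sorted(pairs, key=lambda pair: pair[0].lower())


packages = installed_packages()
for name, version in packages:
    print(f"{name}=={version}")
```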
Anaconda (Anaconda Navigator 1.8.7) is a package manager, an environment manager, a Python distribution, and a collection of over 1,000 of the most commonly used open source packages. Anaconda uses conda (conda 4.5.11), a more powerful installation manager; however, pip also works from the command prompt with Anaconda.
NumPy (numpy 1.15.1): the most important package for scientific applications, which makes working with vectors and matrices fast and efficient. Provides N-dimensional numerical arrays and vectors, linear algebra, and Fourier transforms.
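As a rough illustration of the three features named above (arrays, linear algebra, Fourier transforms), here is a minimal sketch. The numbers are made up for the example.

```python
import numpy as np

# A 2x2 matrix and a length-2 vector stored as NumPy arrays
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Vectorized arithmetic: applied to every element without an explicit loop
doubled = 2 * b

# Linear algebra: solve the system A @ x = b
x = np.linalg.solve(A, b)

# Fourier transform of a short, made-up signal
spectrum = np.fft.fft(np.array([1.0, 0.0, -1.0, 0.0]))

print(doubled, x, spectrum)
```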
SciPy (scipy 1.1.0): builds closely on NumPy, providing more advanced numerical methods such as integration and ordinary differential equation (ODE) solvers. For statistical data analysis, scipy.stats contains the algorithms for basic statistics.
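The sketch below shows one numerical integration and two scipy.stats calls. The sample values are invented for illustration, and the choice of a one-sample t-test is only an example of what the module offers.

```python
import numpy as np
from scipy import integrate, stats

# Numerical integration: the integral of sin(x) from 0 to pi
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Made-up sample of eight measurements
sample = np.array([5.1, 4.9, 5.6, 5.8, 5.0, 5.3, 5.2, 5.4])

# Descriptive statistics: count, min/max, mean, variance, skewness, kurtosis
summary = stats.describe(sample)

# One-sample t-test against a hypothesized mean of 5.0
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

print(area, summary.mean, t_stat, p_value)
```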
Matplotlib (matplotlib 2.2.3): Python’s main graphing/plotting library. The documentation on the Matplotlib website is good, especially the gallery.
Jupyter (jupyter 1.0.0): rather than using the interactive IPython command line, during class we will use Python in a ‘notebook’ style from inside the web browser. This keeps the commands and their outputs together in a single document that you can reference later on.
Analyzing and Manipulating Data
Pandas (pandas 0.23.4): provides fast, flexible, and expressive data structures (called DataFrames) designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
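A small sketch of both kinds of data mentioned above: a DataFrame with mixed column types, and a time series indexed by date. The column names and values are hypothetical.

```python
import pandas as pd

# Tabular data with heterogeneous columns (text and numbers)
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 6.0],
})

# Split the rows by group and average each group's values
group_means = df.groupby("group")["value"].mean()

# A short time series with one observation per day
ts = pd.Series(
    [10.0, 12.0, 11.0],
    index=pd.date_range("2018-01-01", periods=3, freq="D"),
)

# Day-to-day change; the first entry is NaN because it has no predecessor
daily_change = ts.diff()

print(group_means)
print(daily_change)
```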
patsy (patsy 0.5.0): for describing statistical models (especially linear models, or models that have a linear component) and building design matrices. Brings the convenience of R "formulas" to Python.
xlrd (xlrd 1.1.0): extracts data from MS Excel spreadsheets on any platform.
PyMC (pymc 2.3.6): for Bayesian statistics, including Markov chain Monte Carlo simulations.
scikit-learn (scikit-learn 0.19.2): machine learning tools for Python. Increasingly popular; contains the main algorithms used in this field, such as K-means clustering.
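To give a feel for the K-means clustering mentioned above, here is a minimal sketch on six made-up points. The parameter choices (two clusters, ten restarts, a fixed random seed) are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two well-separated groups
points = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
])

# Fit two clusters; random_state makes the run repeatable
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(points)

# One center per cluster, in the same coordinates as the input
centers = model.cluster_centers_

print(labels)
print(centers)
```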
scikits.bootstrap (scikits.bootstrap 1.0.0): provides bootstrap confidence interval algorithms for SciPy.
scikit-image (scikit-image 0.14.0): a collection of algorithms for image analysis, including the analysis of satellite images.
lifelines (lifelines 0.14.6): survival analysis in Python.
xarray (xarray 0.10.8): brings the labeled data power of Pandas to the physical sciences, by providing N-dimensional variants of the core Pandas data structures. Provides a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays.
Statsmodels (statsmodels 0.9.0): provides implementations of the major statistical algorithms. Works preferentially with Pandas DataFrames. Optionally accepts R-like formula syntax, which you’ll probably like if you’re familiar with R.
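The R-like syntax can be sketched with an ordinary least squares fit. The data below are invented, and the column names x and y are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data that roughly follows y = 1 + 2x
data = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0, 4.0],
    "y": [1.1, 2.9, 5.2, 6.8, 9.0],
})

# The formula "y ~ x" means: model y as an intercept plus a slope times x
results = smf.ols("y ~ x", data=data).fit()

intercept = results.params["Intercept"]
slope = results.params["x"]

print(results.params)
```

Calling results.summary() prints a fuller regression table.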
seaborn (seaborn 0.9.0): a set of statistical plotting tools which extends the plotting abilities of matplotlib. The plots look very elegant. Well worth looking at if you do a lot of statistical work. Takes Pandas DataFrames as standard.
Other Application-Dependent Packages
SPy (spectral 0.19): for processing hyperspectral image data (imaging spectroscopy data). It has functions for reading, displaying, manipulating, and classifying hyperspectral imagery.
AstroPy (astropy 3.0.4): contains core functionality and some common tools needed for performing astronomy and astrophysics research with Python.
PyTables (tables 3.4.4): for managing hierarchical datasets; designed to cope efficiently with extremely large amounts of data (note that Pandas does this pretty well for the most part).