{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Problem description\n",
    "\n",
    "## Context\n",
    "Police deparments are stirving to implement more automated and predictive data systems into their everyday processes to reduce crime and deploy scarce resources more efficiently. This provides an opportunity for more proactive policing if it were possible to alert resources of abnormal patterns in the data as they occur. Boston police department released public dataset with incident reports reported to its 911 call center.\n",
    "\n",
    "## Challenge\n",
    "Assess the potential of the provided data set for predicting where police patrols should be dispatched in order to serve, protect, and optimize (people, money, resources, time).\n",
    "\n",
    "### Description of columns\n",
    "_(As provided online)_\n",
    "\n",
    "1.   `incident_num` (varchar; required) - Internal BPD report number\n",
    "2.   `offense_code` (varchar) - Numerical code of offense description\n",
    "3.   `Offense_Code_Group_Description` (varchar) - Internal categorization of [offense_description]\n",
    "4.   `Offense_Description` (varchar) - Primary descriptor of incident\n",
    "5.   `district` (varchar) - What district the crime was reported in\n",
    "6.   `reporting_area` (varchar) - RA number associated with the where the crime was reported from.\n",
    "7.   `shooting` (char) - Indicated a shooting took place.\n",
    "8.   `occurred_on` (datetime) - Earliest date and time the incident could have taken place\n",
    "9.   `UCR_Part` (varchar) - Universal Crime Reporting Part number (1, 2, 3)\n",
    "10.  `street` (varchar) - Street name the incident took place"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "We load the data in the cells below. Uncomment and run the one corresponding to the language of your choice!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
   ],
   "source": [
    "import pandas as pd\n",
    "reports_df = pd.read_csv('boston_crime_incident_reports_2015aug-2018apr.csv', encoding='latin-1')\n",
    "weather_df = pd.read_csv('boston_weather_data_cleaned_2018oct05.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
   ],
   "source": [
    "# reports_df <- read.csv('boston_crime_incident_reports_2015aug-2018apr.csv', header=TRUE)\n",
    "# weather_df <- read.csv('boston_weather_data_cleaned_2018oct05.csv', header=TRUE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>INCIDENT_NUMBER</th>\n",
       "      <th>OFFENSE_CODE</th>\n",
       "      <th>OFFENSE_CODE_GROUP</th>\n",
       "      <th>OFFENSE_DESCRIPTION</th>\n",
       "      <th>DISTRICT</th>\n",
       "      <th>REPORTING_AREA</th>\n",
       "      <th>SHOOTING</th>\n",
       "      <th>OCCURRED_ON_DATE</th>\n",
       "      <th>YEAR</th>\n",
       "      <th>MONTH</th>\n",
       "      <th>DAY_OF_WEEK</th>\n",
       "      <th>HOUR</th>\n",
       "      <th>UCR_PART</th>\n",
       "      <th>STREET</th>\n",
       "      <th>Lat</th>\n",
       "      <th>Long</th>\n",
       "      <th>Location</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>I182024895</td>\n",
       "      <td>2629</td>\n",
       "      <td>Harassment</td>\n",
       "      <td>HARASSMENT</td>\n",
       "      <td>B3</td>\n",
       "      <td>442</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2018-04-03 20:00:00</td>\n",
       "      <td>2018</td>\n",
       "      <td>4</td>\n",
       "      <td>Tuesday</td>\n",
       "      <td>20</td>\n",
       "      <td>Part Two</td>\n",
       "      <td>WESTCOTT ST</td>\n",
       "      <td>42.293218</td>\n",
       "      <td>-71.078865</td>\n",
       "      <td>(42.29321805, -71.07886455)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>I182024895</td>\n",
       "      <td>619</td>\n",
       "      <td>Larceny</td>\n",
       "      <td>LARCENY ALL OTHERS</td>\n",
       "      <td>B3</td>\n",
       "      <td>442</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2018-04-03 20:00:00</td>\n",
       "      <td>2018</td>\n",
       "      <td>4</td>\n",
       "      <td>Tuesday</td>\n",
       "      <td>20</td>\n",
       "      <td>Part One</td>\n",
       "      <td>WESTCOTT ST</td>\n",
       "      <td>42.293218</td>\n",
       "      <td>-71.078865</td>\n",
       "      <td>(42.29321805, -71.07886455)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>I182024887</td>\n",
       "      <td>1402</td>\n",
       "      <td>Vandalism</td>\n",
       "      <td>VANDALISM</td>\n",
       "      <td>B3</td>\n",
       "      <td>469</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2018-03-28 20:30:00</td>\n",
       "      <td>2018</td>\n",
       "      <td>3</td>\n",
       "      <td>Wednesday</td>\n",
       "      <td>20</td>\n",
       "      <td>Part Two</td>\n",
       "      <td>ALMONT ST</td>\n",
       "      <td>42.275277</td>\n",
       "      <td>-71.095542</td>\n",
       "      <td>(42.27527670, -71.09554245)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "execution_count": 3,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reports_df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Most of the interesting features are not numbers, so the above is not very useful.\n",
    "\n",
    "Other than the incident ID, we can split the data features into two groups: Incident description or space-time coordinates\n",
    "\n",
    "For the Incident description, a lot is redundant. OFFENSE_CODE is an integer representation of OFFENSE_CODE_GROUP, so we will not bother with it (let's stay human readable here).\n",
    "\n",
    "OFFENSE_CODE_GROUP and OFFENSE_DESCRIPTION are pretty similar. The description is more granular, too granular. We don't want to use it.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "HOUR  DISTRICT\n",
       "0     A1          1547\n",
       "      A15          218\n",
       "      A7           495\n",
       "      B2          1786\n",
       "      B3          1231\n",
       "      C11         1705\n",
       "      C6           955\n",
       "      D14          928\n",
       "      D4          1540\n",
       "      E13          569\n",
       "      E18          583\n",
       "      E5           552\n",
       "1     A1          1382\n",
       "      A15           87\n",
       "      A7           395\n",
       "      B2          1135\n",
       "      B3           831\n",
       "      C11          899\n",
       "      C6           493\n",
       "      D14          501\n",
       "      D4           737\n",
       "      E13          320\n",
       "      E18          315\n",
       "      E5           235\n",
       "2     A1          1188\n",
       "      A15           67\n",
       "      A7           329\n",
       "      B2           914\n",
       "      B3           593\n",
       "      C11          671\n",
       "                  ... \n",
       "21    C6           719\n",
       "      D14          718\n",
       "      D4          1338\n",
       "      E13          620\n",
       "      E18          590\n",
       "      E5           441\n",
       "22    A1           951\n",
       "      A15          198\n",
       "      A7           477\n",
       "      B2          1872\n",
       "      B3          1347\n",
       "      C11         1528\n",
       "      C6           645\n",
       "      D14          706\n",
       "      D4          1153\n",
       "      E13          575\n",
       "      E18          512\n",
       "      E5           414\n",
       "23    A1           970\n",
       "      A15          153\n",
       "      A7           386\n",
       "      B2          1479\n",
       "      B3           969\n",
       "      C11         1256\n",
       "      C6           548\n",
       "      D14          662\n",
       "      D4           949\n",
       "      E13          442\n",
       "      E18          392\n",
       "      E5           344\n",
       "Name: MONTH, Length: 288, dtype: int64"
      ]
     },
     "execution_count": 15,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This can give us a first idea of crimes by date and district. For all hours of day, assign police resources in proportions to crimes commited in each district.\n",
    "# Note that some crimes are more important than others.\n",
    "reports_df.groupby(['HOUR', 'DISTRICT']).count().MONTH"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Looking at the different types of crimes, we may want to eventually categorize the crimes in further subcategories. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OFFENSE_CODE_GROUP\n",
       "Motor Vehicle Accident Response    26459\n",
       "Larceny                            21670\n",
       "Medical Assistance                 18911\n",
       "Investigate Person                 15646\n",
       "Other                              14763\n",
       "Vandalism                          13074\n",
       "Drug Violation                     12698\n",
       "Simple Assault                     12604\n",
       "Verbal Disputes                    11009\n",
       "Towed                               9069\n",
       "Name: MONTH, dtype: int64"
      ]
     },
     "execution_count": 16,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reports_df.groupby('OFFENSE_CODE_GROUP').count().sort_values('MONTH', ascending=False).head(10).MONTH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x7f0a90860550>"
      ]
     },
     "execution_count": 17,
     "metadata": {
     },
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "a4abbcd27b4e88ade9719901251095297cdb8eaa"
     },
     "execution_count": 17,
     "metadata": {
      "image/png": {
       "height": 263,
       "width": 384
      },
      "needs_background": "light"
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# If we want to visualize the crimes location.\n",
    "reports_df.plot.scatter(x='Long', y='Lat')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x7f0a9068aa20>"
      ]
     },
     "execution_count": 18,
     "metadata": {
     },
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "5d76c1a5d3299d00be70959c4ec24bf1c33dc4b4"
     },
     "execution_count": 18,
     "metadata": {
      "image/png": {
       "height": 263,
       "width": 406
      },
      "needs_background": "light"
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Uhhh? Turns out there are some data with bad values, eg Lat and Long with value of (0,0). Lets get rid of these\n",
    "reports_df[(reports_df['Lat'] > 10) & (reports_df['Long'] < -50)].plot.scatter(x='Long', y='Lat', s=1) # This should work better, we'll need s=1 to make the individual dots easier to see."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Cool, a map of Boston! What are the holes? Why are there disconnected regions? (There is a park in the middle of town, cars can't get in there. Also there are bodies of water, and a prominent bridge is traced out in the upper left part of the map.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
   ],
   "source": [
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (Anaconda 2019)",
   "env": {
    "AR": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-ar",
    "AS": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-as",
    "CC": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-cc",
    "CONDA_EXE": "/ext/anaconda-2019.03/bin/conda",
    "CONDA_PREFIX": "/ext/anaconda-2019.03",
    "CONDA_PYTHON_EXE": "/ext/anaconda-2019.03/bin/python",
    "CPP": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-cpp",
    "CXX": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-c++",
    "CXXFILT": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-c++filt",
    "ELFEDIT": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-elfedit",
    "F77": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gfortran",
    "F90": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gfortran",
    "F95": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-f95",
    "FC": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gfortran",
    "GCC": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gcc",
    "GCC_AR": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gcc-ar",
    "GCC_NM": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gcc-nm",
    "GCC_RANLIB": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gcc-ranlib",
    "GDAL_DATA": "/ext/anaconda-2019.03/share/gdal",
    "GFORTRAN": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gfortran",
    "GPROF": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gprof",
    "GXX": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-g++",
    "JAVA_HOME": "/ext/anaconda-2019.03",
    "JAVA_LD_LIBRARY_PATH": "/ext/anaconda-2019.03/lib/server",
    "LD": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-ld",
    "LD_GOLD": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-ld.gold",
    "LD_LIBRARY_PATH": "/ext/anaconda-2019.03/lib",
    "NM": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-nm",
    "OBJCOPY": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-objcopy",
    "OBJDUMP": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-objdump",
    "OCAMLFIND_CONF": "/ext/anaconda-2019.03/etc/findlib.conf",
    "OCAMLLIB": "/ext/anaconda-2019.03/lib/ocaml",
    "OCAML_PREFIX": "/ext/anaconda-2019.03",
    "PATH": "/ext/anaconda-2019.03/bin:/ext/anaconda-2019.03/condabin:/cocalc/bin:/cocalc/src/smc-project/bin:/home/user/bin:/home/user/.local/bin:/ext/bin:/usr/lib/xpra:/opt/ghc/bin:/usr/local/bin:/usr/bin:/bin:/ext/data/homer/bin:/ext/data/weblogo:/ext/intellij/idea/bin:/ext/pycharm/pycharm/bin:/usr/lib/postgresql/10/bin",
    "PROJ_LIB": "/ext/anaconda-2019.03/share/proj",
    "PYTHONHOME": "/ext/anaconda-2019.03/lib/python3.7",
    "PYTHONPATH": "/ext/anaconda-2019.03/lib/python3.7:/ext/anaconda-2019.03/lib/python3.7/site-packages",
    "RANLIB": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-ranlib",
    "READELF": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-readelf",
    "RSTUDIO_WHICH_R": "/ext/anaconda-2019.03/bin/R",
    "SIZE": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-size",
    "STRINGS": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-strings",
    "STRIP": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-strip"
   },
   "language": "python",
   "metadata": {
    "cocalc": {
     "description": "Python/R distribution for data science",
     "priority": 5,
     "url": "https://www.anaconda.com/distribution/"
    }
   },
   "name": "anaconda2019"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}