{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Problem description\n", "\n", "## Context\n", "Police departments are striving to build more automated and predictive data systems into their everyday processes to reduce crime and deploy scarce resources more efficiently. This creates an opportunity for more proactive policing, if resources can be alerted to abnormal patterns in the data as they occur. The Boston Police Department released a public dataset of incident reports made to its 911 call center.\n", "\n", "## Challenge\n", "Assess the potential of the provided data set for predicting where police patrols should be dispatched in order to serve, protect, and optimize (people, money, resources, time).\n", "\n", "### Description of columns\n", "_(As provided online)_\n", "\n", "1. `incident_num` (varchar; required) - Internal BPD report number\n", "2. `offense_code` (varchar) - Numerical code of offense description\n", "3. `Offense_Code_Group_Description` (varchar) - Internal categorization of [offense_description]\n", "4. `Offense_Description` (varchar) - Primary descriptor of incident\n", "5. `district` (varchar) - What district the crime was reported in\n", "6. `reporting_area` (varchar) - RA number associated with where the crime was reported from\n", "7. `shooting` (char) - Indicates that a shooting took place\n", "8. `occurred_on` (datetime) - Earliest date and time the incident could have taken place\n", "9. `UCR_Part` (varchar) - Uniform Crime Reporting Part number (1, 2, 3)\n", "10. `street` (varchar) - Name of the street where the incident took place" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "We load the data in the cells below. Run the cell for the language of your choice (the R version is commented out; uncomment it to use it)." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "import pandas as pd\n", "reports_df = pd.read_csv('boston_crime_incident_reports_2015aug-2018apr.csv', encoding='latin-1')\n", "weather_df = pd.read_csv('boston_weather_data_cleaned_2018oct05.csv')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "# reports_df <- read.csv('boston_crime_incident_reports_2015aug-2018apr.csv', header=TRUE)\n", "# weather_df <- read.csv('boston_weather_data_cleaned_2018oct05.csv', header=TRUE)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<table>\n", " <thead>\n",
" <tr><th></th><th>INCIDENT_NUMBER</th><th>OFFENSE_CODE</th><th>OFFENSE_CODE_GROUP</th><th>OFFENSE_DESCRIPTION</th><th>DISTRICT</th><th>REPORTING_AREA</th><th>SHOOTING</th><th>OCCURRED_ON_DATE</th><th>YEAR</th><th>MONTH</th><th>DAY_OF_WEEK</th><th>HOUR</th><th>UCR_PART</th><th>STREET</th><th>Lat</th><th>Long</th><th>Location</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>I182024895</td><td>2629</td><td>Harassment</td><td>HARASSMENT</td><td>B3</td><td>442</td><td>NaN</td><td>2018-04-03 20:00:00</td><td>2018</td><td>4</td><td>Tuesday</td><td>20</td><td>Part Two</td><td>WESTCOTT ST</td><td>42.293218</td><td>-71.078865</td><td>(42.29321805, -71.07886455)</td></tr>\n",
" <tr><th>1</th><td>I182024895</td><td>619</td><td>Larceny</td><td>LARCENY ALL OTHERS</td><td>B3</td><td>442</td><td>NaN</td><td>2018-04-03 20:00:00</td><td>2018</td><td>4</td><td>Tuesday</td><td>20</td><td>Part One</td><td>WESTCOTT ST</td><td>42.293218</td><td>-71.078865</td><td>(42.29321805, -71.07886455)</td></tr>\n",
" <tr><th>2</th><td>I182024887</td><td>1402</td><td>Vandalism</td><td>VANDALISM</td><td>B3</td><td>469</td><td>NaN</td><td>2018-03-28 20:30:00</td><td>2018</td><td>3</td><td>Wednesday</td><td>20</td><td>Part Two</td><td>ALMONT ST</td><td>42.275277</td><td>-71.095542</td><td>(42.27527670, -71.09554245)</td></tr>\n",
" </tbody>\n", "</table>" ] }, "execution_count": 3, "metadata": { }, "output_type": "execute_result" } ], "source": [ "reports_df.head(3)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Most of the interesting features are not numbers, so the above is not very useful.\n", "\n", "Other than the incident ID, the features fall into two groups: incident description and space-time coordinates.\n", "\n", "Within the incident description, a lot is redundant. OFFENSE_CODE is a numerical code for the offense description, so we will not bother with it (let's stay human-readable here).\n", "\n", "OFFENSE_CODE_GROUP and OFFENSE_DESCRIPTION are pretty similar, but the description is too granular for our purposes, so we will use the group instead.\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "HOUR DISTRICT\n", "0 A1 1547\n", " A15 218\n", " A7 495\n", " B2 1786\n", " B3 1231\n", " C11 1705\n", " C6 955\n", " D14 928\n", " D4 1540\n", " E13 569\n", " E18 583\n", " E5 552\n", "1 A1 1382\n", " A15 87\n", " A7 395\n", " B2 1135\n", " B3 831\n", " C11 899\n", " C6 493\n", " D14 501\n", " D4 737\n", " E13 320\n", " E18 315\n", " E5 235\n", "2 A1 1188\n", " A15 67\n", " A7 329\n", " B2 914\n", " B3 593\n", " C11 671\n", " ... \n", "21 C6 719\n", " D14 718\n", " D4 1338\n", " E13 620\n", " E18 590\n", " E5 441\n", "22 A1 951\n", " A15 198\n", " A7 477\n", " B2 1872\n", " B3 1347\n", " C11 1528\n", " C6 645\n", " D14 706\n", " D4 1153\n", " E13 575\n", " E18 512\n", " E5 414\n", "23 A1 970\n", " A15 153\n", " A7 386\n", " B2 1479\n", " B3 969\n", " C11 1256\n", " C6 548\n", " D14 662\n", " D4 949\n", " E13 442\n", " E18 392\n", " E5 344\n", "Name: MONTH, Length: 288, dtype: int64" ] }, "execution_count": 15, "metadata": { }, "output_type": "execute_result" } ], "source": [ "# This gives a first idea of crime counts by hour of day and district.\n", "# For each hour of the day, police resources could be assigned in proportion to the crimes committed in each district.\n", "# Note that some crimes are more important than others.\n", "reports_df.groupby(['HOUR', 'DISTRICT']).count().MONTH" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Looking at the different types of crimes, we may eventually want to split them into further subcategories." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "OFFENSE_CODE_GROUP\n", "Motor Vehicle Accident Response 26459\n", "Larceny 21670\n", "Medical Assistance 18911\n", "Investigate Person 15646\n", "Other 14763\n", "Vandalism 13074\n", "Drug Violation 12698\n", "Simple Assault 12604\n", "Verbal Disputes 11009\n", "Towed 9069\n", "Name: MONTH, dtype: int64" ] }, "execution_count": 16, "metadata": { }, "output_type": "execute_result" } ], "source": [ "reports_df.groupby('OFFENSE_CODE_GROUP').count().sort_values('MONTH', ascending=False).head(10).MONTH" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": { }, "output_type": "execute_result" }, { "data": { "image/png": "a4abbcd27b4e88ade9719901251095297cdb8eaa" }, "execution_count": 17, "metadata": { "image/png": { "height": 263, "width": 384 }, "needs_background": "light" }, "output_type": "execute_result" } ], "source": [ "# Visualize the crime locations.\n", "reports_df.plot.scatter(x='Long', y='Lat')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 18, "metadata": { }, "output_type": "execute_result" }, { "data": { "image/png": "5d76c1a5d3299d00be70959c4ec24bf1c33dc4b4" }, "execution_count": 18, "metadata": { "image/png": { "height": 263, "width": 406 }, "needs_background": "light" },
"output_type": "execute_result" } ], "source": [ "# Uhhh? Turns out some rows have bad values, e.g. Lat and Long equal to (0, 0). Let's get rid of these.\n", "reports_df[(reports_df['Lat'] > 10) & (reports_df['Long'] < -50)].plot.scatter(x='Long', y='Lat', s=1) # This should work better; s=1 makes the individual dots easier to see." ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Cool, a map of Boston! What are the holes? Why are there disconnected regions? (There is a park in the middle of town where cars can't go. There are also bodies of water, and a prominent bridge is traced out in the upper-left part of the map.)" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (Anaconda 2019)", "env": { "AR": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-ar", "AS": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-as", "CC": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-cc", "CONDA_EXE": "/ext/anaconda-2019.03/bin/conda", "CONDA_PREFIX": "/ext/anaconda-2019.03", "CONDA_PYTHON_EXE": "/ext/anaconda-2019.03/bin/python", "CPP": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-cpp", "CXX": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-c++", "CXXFILT": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-c++filt", "ELFEDIT": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-elfedit", "F77": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gfortran", "F90": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gfortran", "F95": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-f95", "FC": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gfortran", "GCC": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gcc", "GCC_AR": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gcc-ar", "GCC_NM": 
"/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gcc-nm", "GCC_RANLIB": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gcc-ranlib", "GDAL_DATA": "/ext/anaconda-2019.03/share/gdal", "GFORTRAN": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gfortran", "GPROF": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-gprof", "GXX": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-g++", "JAVA_HOME": "/ext/anaconda-2019.03", "JAVA_LD_LIBRARY_PATH": "/ext/anaconda-2019.03/lib/server", "LD": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-ld", "LD_GOLD": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-ld.gold", "LD_LIBRARY_PATH": "/ext/anaconda-2019.03/lib", "NM": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-nm", "OBJCOPY": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-objcopy", "OBJDUMP": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-objdump", "OCAMLFIND_CONF": "/ext/anaconda-2019.03/etc/findlib.conf", "OCAMLLIB": "/ext/anaconda-2019.03/lib/ocaml", "OCAML_PREFIX": "/ext/anaconda-2019.03", "PATH": "/ext/anaconda-2019.03/bin:/ext/anaconda-2019.03/condabin:/cocalc/bin:/cocalc/src/smc-project/bin:/home/user/bin:/home/user/.local/bin:/ext/bin:/usr/lib/xpra:/opt/ghc/bin:/usr/local/bin:/usr/bin:/bin:/ext/data/homer/bin:/ext/data/weblogo:/ext/intellij/idea/bin:/ext/pycharm/pycharm/bin:/usr/lib/postgresql/10/bin", "PROJ_LIB": "/ext/anaconda-2019.03/share/proj", "PYTHONHOME": "/ext/anaconda-2019.03/lib/python3.7", "PYTHONPATH": "/ext/anaconda-2019.03/lib/python3.7:/ext/anaconda-2019.03/lib/python3.7/site-packages", "RANLIB": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-ranlib", "READELF": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-readelf", "RSTUDIO_WHICH_R": "/ext/anaconda-2019.03/bin/R", "SIZE": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-size", "STRINGS": "/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-strings", "STRIP": 
"/ext/anaconda-2019.03/bin/x86_64-conda_cos6-linux-gnu-strip" }, "language": "python", "metadata": { "cocalc": { "description": "Python/R distribution for data science", "priority": 5, "url": "https://www.anaconda.com/distribution/" } }, "name": "anaconda2019" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }