interview-1.ipynb
Author: Element AI Interviews

# Problem description

## Context

Police departments are striving to implement more automated and predictive data systems into their everyday processes to reduce crime and deploy scarce resources more efficiently. This creates an opportunity for more proactive policing if it were possible to alert resources to abnormal patterns in the data as they occur. The Boston Police Department released a public dataset of incident reports made to its 911 call center.

## Challenge

Assess the potential of the provided data set for predicting where police patrols should be dispatched in order to serve, protect, and optimize (people, money, resources, time).

### Description of columns

(As provided online)

1. incident_num (varchar; required) - Internal BPD report number
2. offense_code (varchar) - Numerical code of offense description
3. Offense_Code_Group_Description (varchar) - Internal categorization of [offense_description]
4. Offense_Description (varchar) - Primary descriptor of incident
5. district (varchar) - What district the crime was reported in
6. reporting_area (varchar) - RA number associated with where the crime was reported from.
7. shooting (char) - Indicates whether a shooting took place.
8. occurred_on (datetime) - Earliest date and time the incident could have taken place
9. UCR_Part (varchar) - Uniform Crime Reporting Part number (1, 2, 3)
10. street (varchar) - Street name where the incident took place

We load the data in the cells below. Uncomment and run the one corresponding to the language of your choice!

In [1]:
import pandas as pd
reports_df = pd.read_csv('boston_crime_incident_reports_2015aug-2018apr.csv', encoding='latin-1')
weather_df = pd.read_csv('boston_weather_data_cleaned_2018oct05.csv')
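If you would rather parse the timestamp up front than rely on the precomputed YEAR/MONTH/HOUR columns, `pd.read_csv` can do it with `parse_dates`. A minimal sketch on an inline sample (the column names mirror the real file, the rows are illustrative):

```python
from io import StringIO
import pandas as pd

# Tiny inline sample mimicking the relevant columns of the real CSV.
sample = StringIO(
    "INCIDENT_NUMBER,OCCURRED_ON_DATE,DISTRICT\n"
    "I182024895,2018-04-03 20:00:00,B3\n"
    "I182024887,2018-03-28 20:30:00,B3\n"
)
df = pd.read_csv(sample, parse_dates=["OCCURRED_ON_DATE"])

# The column is now datetime64, so the .dt accessors work directly.
print(df["OCCURRED_ON_DATE"].dt.hour.tolist())  # [20, 20]
```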

In [2]:
# reports_df <- read.csv('boston_crime_incident_reports_2015aug-2018apr.csv', header=TRUE)
# weather_df <- read.csv('boston_weather_data_cleaned_2018oct05.csv', header=TRUE)

In [3]:
reports_df.head(3)

INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location
0 I182024895 2629 Harassment HARASSMENT B3 442 NaN 2018-04-03 20:00:00 2018 4 Tuesday 20 Part Two WESTCOTT ST 42.293218 -71.078865 (42.29321805, -71.07886455)
1 I182024895 619 Larceny LARCENY ALL OTHERS B3 442 NaN 2018-04-03 20:00:00 2018 4 Tuesday 20 Part One WESTCOTT ST 42.293218 -71.078865 (42.29321805, -71.07886455)
2 I182024887 1402 Vandalism VANDALISM B3 469 NaN 2018-03-28 20:30:00 2018 3 Wednesday 20 Part Two ALMONT ST 42.275277 -71.095542 (42.27527670, -71.09554245)

Most of the interesting features are not numbers, so the above is not very useful.

Other than the incident ID, we can split the data features into two groups: incident description and space-time coordinates.

For the incident description, a lot is redundant. OFFENSE_CODE is just a numeric encoding of the offense, so we will not bother with it (let's stay human-readable here).

OFFENSE_CODE_GROUP and OFFENSE_DESCRIPTION are pretty similar. The description is more granular, too granular for our purposes, so we won't use it either.
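The pruning described above amounts to a `DataFrame.drop`; a sketch on a toy frame with the same redundancy structure (the rows are illustrative):

```python
import pandas as pd

# Toy frame with the same redundant columns as reports_df.
df = pd.DataFrame({
    "OFFENSE_CODE": [2629, 619],
    "OFFENSE_CODE_GROUP": ["Harassment", "Larceny"],
    "OFFENSE_DESCRIPTION": ["HARASSMENT", "LARCENY ALL OTHERS"],
    "DISTRICT": ["B3", "B3"],
})

# Keep the human-readable group; drop the integer code and the
# overly granular description.
df = df.drop(columns=["OFFENSE_CODE", "OFFENSE_DESCRIPTION"])
print(list(df.columns))  # ['OFFENSE_CODE_GROUP', 'DISTRICT']
```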

In [15]:
# This can give us a first idea of crimes by date and district. For each hour of the day, assign police resources in proportion to crimes committed in each district.
# Note that some crimes are more important than others.
reports_df.groupby(['HOUR', 'DISTRICT']).count().MONTH

HOUR  DISTRICT
0     A1          1547
      A15          218
      A7           495
      B2          1786
      B3          1231
      C11         1705
      C6           955
      D14          928
      D4          1540
      E13          569
      E18          583
      E5           552
1     A1          1382
      A15           87
      A7           395
      B2          1135
      B3           831
      C11          899
      C6           493
      D14          501
      D4           737
      E13          320
      E18          315
      E5           235
2     A1          1188
      A15           67
      A7           329
      B2           914
      B3           593
      C11          671
                   ...
21    C6           719
      D14          718
      D4          1338
      E13          620
      E18          590
      E5           441
22    A1           951
      A15          198
      A7           477
      B2          1872
      B3          1347
      C11         1528
      C6           645
      D14          706
      D4          1153
      E13          575
      E18          512
      E5           414
23    A1           970
      A15          153
      A7           386
      B2          1479
      B3           969
      C11         1256
      C6           548
      D14          662
      D4           949
      E13          442
      E18          392
      E5           344
Name: MONTH, Length: 288, dtype: int64
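The "resources in proportion to counts" idea from the comment can be computed directly from the grouped counts; a sketch on a toy slice of the output above (hours 0–1, districts A1/A15/A7):

```python
import pandas as pd

# Toy counts in the same (HOUR, DISTRICT) shape as the groupby output.
counts = pd.Series(
    [1547, 218, 495, 1382, 87, 395],
    index=pd.MultiIndex.from_product(
        [[0, 1], ["A1", "A15", "A7"]], names=["HOUR", "DISTRICT"]
    ),
)

# Each district's share of its hour's incidents: divide by the hour total.
shares = counts / counts.groupby(level="HOUR").transform("sum")
print(shares.round(3))
```

Within each hour the shares sum to 1, so they can be read directly as a resource-allocation fraction.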

Looking at the different types of crimes, we may want to eventually categorize the crimes in further subcategories.

In [16]:
reports_df.groupby('OFFENSE_CODE_GROUP').count().sort_values('MONTH', ascending=False).head(10).MONTH

OFFENSE_CODE_GROUP
Motor Vehicle Accident Response    26459
Larceny                            21670
Medical Assistance                 18911
Investigate Person                 15646
Other                              14763
Vandalism                          13074
Drug Violation                     12698
Simple Assault                     12604
Verbal Disputes                    11009
Towed                               9069
Name: MONTH, dtype: int64
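One simple way to get the coarser subcategories suggested above is a plain dict mapped over OFFENSE_CODE_GROUP. The bucket names here are hypothetical, not BPD's categorization:

```python
import pandas as pd

# Hypothetical coarse buckets for a few of the top offense groups above.
coarse = {
    "Larceny": "property",
    "Vandalism": "property",
    "Simple Assault": "violent",
    "Verbal Disputes": "disorder",
    "Motor Vehicle Accident Response": "traffic",
    "Towed": "traffic",
}

groups = pd.Series(["Larceny", "Towed", "Simple Assault", "Arson"])
# Unmapped groups fall through to a catch-all bucket.
print(groups.map(coarse).fillna("other").tolist())
# ['property', 'traffic', 'violent', 'other']
```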
In [17]:
# Visualize the crime locations.
reports_df.plot.scatter(x='Long', y='Lat')

<matplotlib.axes._subplots.AxesSubplot at 0x7f0a90860550>
In [18]:
# Uhhh? Turns out some rows have bad values, e.g. Lat and Long of (0, 0). Let's get rid of these.
reports_df[(reports_df['Lat'] > 10) & (reports_df['Long'] < -50)].plot.scatter(x='Long', y='Lat', s=1) # This should work better, we'll need s=1 to make the individual dots easier to see.

<matplotlib.axes._subplots.AxesSubplot at 0x7f0a9068aa20>

Cool, a map of Boston! What are the holes? Why are there disconnected regions? (There is a park in the middle of town, cars can't get in there. Also there are bodies of water, and a prominent bridge is traced out in the upper left part of the map.)
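An alternative to repeating the inline filter each time is to mark out-of-range coordinates as missing once; plots and aggregations then skip them automatically. A sketch with an assumed Boston bounding box (the box limits are an assumption, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: two valid Boston-area points and one (0, 0) placeholder.
df = pd.DataFrame({
    "Lat": [42.2932, 42.2753, 0.0],
    "Long": [-71.0789, -71.0955, 0.0],
})

# Flag coordinates outside a rough Boston bounding box and blank them out.
bad = ~df["Lat"].between(42.0, 43.0) | ~df["Long"].between(-72.0, -70.0)
df.loc[bad, ["Lat", "Long"]] = np.nan
print(df["Lat"].notna().sum())  # 2
```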

In [ ]: