interview-1.ipynb
Author: Element AI Interviews

# Problem description

## Context

Police departments are striving to implement more automated and predictive data systems into their everyday processes to reduce crime and deploy scarce resources more efficiently. This creates an opportunity for more proactive policing if it were possible to alert resources to abnormal patterns in the data as they occur. The Boston Police Department released a public dataset of incident reports made to its 911 call center.

## Challenge

Assess the potential of the provided data set for predicting where police patrols should be dispatched in order to serve, protect, and optimize (people, money, resources, time).

### Description of columns

(As provided online)

1. incident_num (varchar; required) - Internal BPD report number
2. offense_code (varchar) - Numerical code of offense description
3. Offense_Code_Group_Description (varchar) - Internal categorization of [offense_description]
4. Offense_Description (varchar) - Primary descriptor of incident
5. district (varchar) - What district the crime was reported in
6. reporting_area (varchar) - RA number associated with where the crime was reported from.
7. shooting (char) - Indicates whether a shooting took place.
8. occurred_on (datetime) - Earliest date and time the incident could have taken place
9. UCR_Part (varchar) - Uniform Crime Reporting Part number (1, 2, 3)
10. street (varchar) - Street name where the incident took place

We load the data in the cells below. Uncomment and run the one corresponding to the language of your choice!

In [1]:
import pandas as pd
reports_df = pd.read_csv('boston_crime_incident_reports_2015aug-2018apr.csv', encoding='latin-1')
weather_df = pd.read_csv('boston_weather_data_cleaned_2018oct05.csv')
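If you would rather parse the timestamp up front than rely on the precomputed YEAR/MONTH/HOUR columns, `pd.read_csv` can do it with `parse_dates`. A minimal sketch on an inline sample (the column names mirror the real file, the rows are illustrative):

```python
from io import StringIO
import pandas as pd

# Tiny inline sample mimicking the relevant columns of the real CSV.
sample = StringIO(
    "INCIDENT_NUMBER,OCCURRED_ON_DATE,DISTRICT\n"
    "I182024895,2018-04-03 20:00:00,B3\n"
    "I182024887,2018-03-28 20:30:00,B3\n"
)
df = pd.read_csv(sample, parse_dates=["OCCURRED_ON_DATE"])

# The column is now datetime64, so the .dt accessors work directly.
print(df["OCCURRED_ON_DATE"].dt.hour.tolist())  # [20, 20]
```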

In [2]:
# reports_df <- read.csv('boston_crime_incident_reports_2015aug-2018apr.csv', header=TRUE)
# weather_df <- read.csv('boston_weather_data_cleaned_2018oct05.csv', header=TRUE)

In [3]:
reports_df.head(3)

INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location
0 I182024895 2629 Harassment HARASSMENT B3 442 NaN 2018-04-03 20:00:00 2018 4 Tuesday 20 Part Two WESTCOTT ST 42.293218 -71.078865 (42.29321805, -71.07886455)
1 I182024895 619 Larceny LARCENY ALL OTHERS B3 442 NaN 2018-04-03 20:00:00 2018 4 Tuesday 20 Part One WESTCOTT ST 42.293218 -71.078865 (42.29321805, -71.07886455)
2 I182024887 1402 Vandalism VANDALISM B3 469 NaN 2018-03-28 20:30:00 2018 3 Wednesday 20 Part Two ALMONT ST 42.275277 -71.095542 (42.27527670, -71.09554245)

Most of the interesting features are not numbers, so the above is not very useful.

Other than the incident ID, we can split the data features into two groups: incident description and space-time coordinates.

For the incident description, a lot is redundant. OFFENSE_CODE is just a numeric encoding of the offense, so we will not bother with it (let's stay human-readable here).

OFFENSE_CODE_GROUP and OFFENSE_DESCRIPTION are pretty similar. The description is more granular, too granular for our purposes, so we won't use it either.
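The pruning described above amounts to a `DataFrame.drop`; a sketch on a toy frame with the same redundancy structure (the rows are illustrative):

```python
import pandas as pd

# Toy frame with the same redundant columns as reports_df.
df = pd.DataFrame({
    "OFFENSE_CODE": [2629, 619],
    "OFFENSE_CODE_GROUP": ["Harassment", "Larceny"],
    "OFFENSE_DESCRIPTION": ["HARASSMENT", "LARCENY ALL OTHERS"],
    "DISTRICT": ["B3", "B3"],
})

# Keep the human-readable group; drop the integer code and the
# overly granular description.
df = df.drop(columns=["OFFENSE_CODE", "OFFENSE_DESCRIPTION"])
print(list(df.columns))  # ['OFFENSE_CODE_GROUP', 'DISTRICT']
```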

In [15]:
# This can give us a first idea of crimes by date and district. For each hour of the day, assign police resources in proportion to crimes committed in each district.
# Note that some crimes are more important than others.
reports_df.groupby(['HOUR', 'DISTRICT']).count().MONTH

HOUR  DISTRICT
0     A1          1547
      A15          218
      A7           495
      B2          1786
      B3          1231
      C11         1705
      C6           955
      D14          928
      D4          1540
      E13          569
      E18          583
      E5           552
1     A1          1382
      A15           87
      A7           395
      B2          1135
      B3           831
      C11          899
      C6           493
      D14          501
      D4           737
      E13          320
      E18          315
      E5           235
2     A1          1188
      A15           67
      A7           329
      B2           914
      B3           593
      C11          671
                   ...
21    C6           719
      D14          718
      D4          1338
      E13          620
      E18          590
      E5           441
22    A1           951
      A15          198
      A7           477
      B2          1872
      B3          1347
      C11         1528
      C6           645
      D14          706
      D4          1153
      E13          575
      E18          512
      E5           414
23    A1           970
      A15          153
      A7           386
      B2          1479
      B3           969
      C11         1256
      C6           548
      D14          662
      D4           949
      E13          442
      E18          392
      E5           344
Name: MONTH, Length: 288, dtype: int64
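The "resources in proportion to counts" idea from the comment can be computed directly from the grouped counts; a sketch on a toy slice of the output above (hours 0–1, districts A1/A15/A7):

```python
import pandas as pd

# Toy counts in the same (HOUR, DISTRICT) shape as the groupby output.
counts = pd.Series(
    [1547, 218, 495, 1382, 87, 395],
    index=pd.MultiIndex.from_product(
        [[0, 1], ["A1", "A15", "A7"]], names=["HOUR", "DISTRICT"]
    ),
)

# Each district's share of its hour's incidents: divide by the hour total.
shares = counts / counts.groupby(level="HOUR").transform("sum")
print(shares.round(3))
```

Within each hour the shares sum to 1, so they can be read directly as a resource-allocation fraction.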

Looking at the different types of crimes, we may want to eventually categorize the crimes in further subcategories.

In [16]:
reports_df.groupby('OFFENSE_CODE_GROUP').count().sort_values('MONTH', ascending=False).head(10).MONTH

OFFENSE_CODE_GROUP
Motor Vehicle Accident Response    26459
Larceny                            21670
Medical Assistance                 18911
Investigate Person                 15646
Other                              14763
Vandalism                          13074
Drug Violation                     12698
Simple Assault                     12604
Verbal Disputes                    11009
Towed                               9069
Name: MONTH, dtype: int64
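One simple way to get the coarser subcategories suggested above is a plain dict mapped over OFFENSE_CODE_GROUP. The bucket names here are hypothetical, not BPD's categorization:

```python
import pandas as pd

# Hypothetical coarse buckets for a few of the top offense groups above.
coarse = {
    "Larceny": "property",
    "Vandalism": "property",
    "Simple Assault": "violent",
    "Verbal Disputes": "disorder",
    "Motor Vehicle Accident Response": "traffic",
    "Towed": "traffic",
}

groups = pd.Series(["Larceny", "Towed", "Simple Assault", "Arson"])
# Unmapped groups fall through to a catch-all bucket.
print(groups.map(coarse).fillna("other").tolist())
# ['property', 'traffic', 'violent', 'other']
```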
In [17]:
# Visualize the crime locations.
reports_df.plot.scatter(x='Long', y='Lat')

<matplotlib.axes._subplots.AxesSubplot at 0x7f0a90860550>
In [18]:
# Uhhh? Turns out some rows have bad values, e.g. Lat and Long of (0, 0). Let's get rid of these.
reports_df[(reports_df['Lat'] > 10) & (reports_df['Long'] < -50)].plot.scatter(x='Long', y='Lat', s=1) # This should work better, we'll need s=1 to make the individual dots easier to see.

<matplotlib.axes._subplots.AxesSubplot at 0x7f0a9068aa20>

Cool, a map of Boston! What are the holes? Why are there disconnected regions? (There is a park in the middle of town, cars can't get in there. Also there are bodies of water, and a prominent bridge is traced out in the upper left part of the map.)
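An alternative to repeating the inline filter each time is to mark out-of-range coordinates as missing once; plots and aggregations then skip them automatically. A sketch with an assumed Boston bounding box (the box limits are an assumption, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: two valid Boston-area points and one (0, 0) placeholder.
df = pd.DataFrame({
    "Lat": [42.2932, 42.2753, 0.0],
    "Long": [-71.0789, -71.0955, 0.0],
})

# Flag coordinates outside a rough Boston bounding box and blank them out.
bad = ~df["Lat"].between(42.0, 43.0) | ~df["Long"].between(-72.0, -70.0)
df.loc[bad, ["Lat", "Long"]] = np.nan
print(df["Lat"].notna().sum())  # 2
```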

In [ ]: