Problem description
Context
Police departments are striving to bring more automated and predictive data systems into their everyday processes in order to reduce crime and deploy scarce resources more efficiently. This opens the door to more proactive policing, provided that abnormal patterns in the data can be flagged as they occur. The Boston Police Department has released a public dataset of incident reports made to its 911 call center.
Challenge
Assess the potential of the provided data set for predicting where police patrols should be dispatched in order to serve, protect, and optimize (people, money, resources, time).
Description of columns
(As provided online)
incident_num
(varchar; required) - Internal BPD report number
offense_code
(varchar) - Numerical code of offense description
Offense_Code_Group_Description
(varchar) - Internal categorization of [offense_description]
Offense_Description
(varchar) - Primary descriptor of incident
district
(varchar) - What district the crime was reported in
reporting_area
(varchar) - RA number associated with the where the crime was reported from.
shooting
(char) - Indicated a shooting took place.
occurred_on
(datetime) - Earliest date and time the incident could have taken place
UCR_Part
(varchar) - Universal Crime Reporting Part number (1, 2, 3)
street
(varchar) - Street name the incident took place
We load the data in the cells below. Uncomment and run the one corresponding to the language of your choice!
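A minimal Python sketch of such a loading cell, assuming pandas and a CSV export with the column names shown in the table. The helper name `load_incidents` is hypothetical, and the in-memory sample (three rows copied from the table) stands in for the real file, whose name is not given here.

```python
import io

import pandas as pd


def load_incidents(source):
    """Read incident reports from a CSV path or file-like object (hypothetical helper)."""
    return pd.read_csv(
        source,
        parse_dates=["OCCURRED_ON_DATE"],  # real timestamps instead of strings
        dtype={"INCIDENT_NUMBER": str, "REPORTING_AREA": str},  # identifiers, not quantities
    )


# Stand-in for the real file: three rows copied from the table in this notebook.
SAMPLE_CSV = """INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,Location
I182024895,2629,Harassment,HARASSMENT,B3,442,,2018-04-03 20:00:00,2018,4,Tuesday,20,Part Two,WESTCOTT ST,42.293218,-71.078865,"(42.29321805, -71.07886455)"
I182024895,619,Larceny,LARCENY ALL OTHERS,B3,442,,2018-04-03 20:00:00,2018,4,Tuesday,20,Part One,WESTCOTT ST,42.293218,-71.078865,"(42.29321805, -71.07886455)"
I182024887,1402,Vandalism,VANDALISM,B3,469,,2018-03-28 20:30:00,2018,3,Wednesday,20,Part Two,ALMONT ST,42.275277,-71.095542,"(42.27527670, -71.09554245)"
"""

df = load_incidents(io.StringIO(SAMPLE_CSV))
print(df.head())
```

To read the actual dataset, pass the path of the downloaded CSV to `load_incidents` instead of the in-memory sample.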
 | INCIDENT_NUMBER | OFFENSE_CODE | OFFENSE_CODE_GROUP | OFFENSE_DESCRIPTION | DISTRICT | REPORTING_AREA | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | UCR_PART | STREET | Lat | Long | Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | I182024895 | 2629 | Harassment | HARASSMENT | B3 | 442 | NaN | 2018-04-03 20:00:00 | 2018 | 4 | Tuesday | 20 | Part Two | WESTCOTT ST | 42.293218 | -71.078865 | (42.29321805, -71.07886455) |
1 | I182024895 | 619 | Larceny | LARCENY ALL OTHERS | B3 | 442 | NaN | 2018-04-03 20:00:00 | 2018 | 4 | Tuesday | 20 | Part One | WESTCOTT ST | 42.293218 | -71.078865 | (42.29321805, -71.07886455) |
2 | I182024887 | 1402 | Vandalism | VANDALISM | B3 | 469 | NaN | 2018-03-28 20:30:00 | 2018 | 3 | Wednesday | 20 | Part Two | ALMONT ST | 42.275277 | -71.095542 | (42.27527670, -71.09554245) |
Most of the interesting features are not numbers, so the above is not very useful.
Other than the incident ID, we can split the data features into two groups: incident description and space-time coordinates.
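For the incident description, a lot is redundant. OFFENSE_CODE is a numerical code for the offense (the column notes above describe it as the code of the offense description), so it adds nothing beyond the text columns and we will not bother with it (let's stay human readable here).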
OFFENSE_CODE_GROUP and OFFENSE_DESCRIPTION are pretty similar. The description is more granular, too granular for our purposes, so we will not use it.
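The redundancy argument above can be checked rather than assumed. The sketch below, assuming pandas, counts how many distinct groups each offense code maps to and then drops the two columns we decided not to use. The small frame is a stand-in built from the sample rows shown earlier, and the variable names are illustrative.

```python
import pandas as pd

# Tiny stand-in for the incident table; values copied from the sample rows.
df = pd.DataFrame({
    "OFFENSE_CODE": [2629, 619, 1402],
    "OFFENSE_CODE_GROUP": ["Harassment", "Larceny", "Vandalism"],
    "OFFENSE_DESCRIPTION": ["HARASSMENT", "LARCENY ALL OTHERS", "VANDALISM"],
    "DISTRICT": ["B3", "B3", "B3"],
})

# Number of distinct groups per code; if every value is 1, the code
# carries no information about the group that the group column lacks.
groups_per_code = df.groupby("OFFENSE_CODE")["OFFENSE_CODE_GROUP"].nunique()
code_is_redundant = bool((groups_per_code == 1).all())

# Keep only the human-readable group column.
slim = df.drop(columns=["OFFENSE_CODE", "OFFENSE_DESCRIPTION"])

# How often each group occurs, most frequent first.
group_counts = slim["OFFENSE_CODE_GROUP"].value_counts()

print("code redundant given group:", code_is_redundant)
print(group_counts)
```

On the full dataset the same check would show whether any code is split across several groups before the column is discarded.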
Looking at the different types of crimes, we may eventually want to sort them into further subcategories.
Cool, a map of Boston! What are the holes? Why are there disconnected regions? (There is a park in the middle of town, cars can't get in there. Also there are bodies of water, and a prominent bridge is traced out in the upper left part of the map.)
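One rough way to look for such holes numerically is to bin the coordinates into a coarse grid and count incidents per cell; empty cells inside the city outline are then candidates for parks or water. This is only a sketch, assuming pandas and NumPy. The function name `grid_counts`, the bin count, and the toy frame (coordinates copied from the sample rows, plus one row with a missing latitude) are illustrative choices.

```python
import numpy as np
import pandas as pd


def grid_counts(df, bins=50):
    """Count incidents per cell of a bins x bins latitude/longitude grid."""
    coords = df[["Lat", "Long"]].dropna()  # rows without coordinates cannot be placed
    counts, lat_edges, long_edges = np.histogram2d(
        coords["Lat"], coords["Long"], bins=bins
    )
    return counts.astype(int), lat_edges, long_edges


# Toy stand-in for the incident table.
toy = pd.DataFrame({
    "Lat": [42.293218, 42.293218, 42.275277, None],
    "Long": [-71.078865, -71.078865, -71.095542, -71.05],
})

counts, lat_edges, long_edges = grid_counts(toy, bins=2)
empty_cells = int((counts == 0).sum())

print(counts)
print("empty cells:", empty_cells)
```

A finer grid resolves smaller gaps but leaves more cells empty by chance, so the bin count is a trade-off to tune against the real data.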