Midterm
Helena Gray
CDS-101
March 28 2017
Dataset Summary
This dataset contains crimes committed in the city of Chicago since 2001. The source of the dataset is the Chicago Police Department. The maintainer's email is '[email protected]', which belongs to the Research and Development Division of the Chicago police force. Officially, the dataset is from data.cityofchicago.org (Chicago Crime Since 2001).
data.cityofchicago.org. (2017). Crimes - 2001 to present [Comma-separated values file]. Retrieved from https://catalog.data.gov/dataset/crimes-2001-to-present-398a4.
The dataset has 6,291,788 rows and 22 columns. The data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system.
A description of the columns is as follows:
X is the row number of the dataset.
ID is the identification number of the incident report.
Case.Number is another identification tag for each incident report.
Date is the date and time that the incident happened.
IUCR is the Illinois Uniform Crime Reporting (IUCR) code of the crime.
Block is the address where the crime occurred. The street number is partially blocked out.
Primary.Type describes the category of the crime.
Description is the specific crime committed within the primary type of transgression.
Location.Description is the type of location where the crime was committed.
Arrest is whether or not an arrest was made.
Domestic is whether or not it was a domestic crime.
Ward is the ward the crime was committed in. A ward is a local authority area, typically used for electoral purposes.
Community.Area is the community area in which the crime was committed. The Social Science Research Committee at the University of Chicago divided the city of Chicago into 77 community areas, which are officially recognized by the City of Chicago. These community areas correspond roughly to neighborhoods or inter-related neighborhoods within the city.
FBI.Code is the FBI code of the crime.
X.Coordinate refers to a horizontal coordinate on a map of the City of Chicago where the crime occurred.
Y.Coordinate refers to a vertical coordinate on a map of the City of Chicago where the crime occurred.
Year is the year in which the offense occurred.
Updated.On refers to when this particular offense was updated or appended to the dataset.
Latitude refers to the latitudinal coordinate of the offense.
Longitude refers to the longitudinal coordinate of the offense.
Location combines the columns of Latitude and Longitude to give a geographic location.
The code below loads the tidyverse and lubridate packages, imports the dataset from a comma-separated values file, and assigns it to a variable named 'chicago.crime'. The dataset should be in the working directory before attempting to import. Prior to importing, the dataset was significantly reduced in size so that it was possible to import.
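A minimal sketch of the import step, assuming read.csv() was used (the dotted column names such as 'Primary.Type' and the row-number column 'X' are consistent with it). The file path and the two rows written to it are made-up stand-ins so the example runs on its own; in practice 'csv_path' would point to the reduced file in the working directory.

```r
library(tidyverse)
library(lubridate)

# Stand-in file with made-up rows; replace csv_path with the real file name.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"","ID","Case Number","Date","Primary Type","Year"',
  '"1",1,"HY1","01/05/2015 01:30:00 PM","THEFT",2015',
  '"2",2,"HY2","07/04/2016 11:15:00 PM","BATTERY",2016'
), csv_path)

# read.csv() turns spaces in column names into dots (e.g. Primary.Type)
# and names an unnamed first column "X".
chicago.crime <- as_tibble(read.csv(csv_path, stringsAsFactors = FALSE))

head(chicago.crime)
```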
Cleaning the Dataset
I cleaned the dataset which originally had 6,291,788 rows and 22 columns to only include the years 2015 to 2017. This reduced the dataset to 577,130 rows. I did this in RStudio outside of SageMathCloud.
I then removed columns which would not be needed for my analysis, including X, ID, Case.Number, Block, Beat, FBI.Code, X.Coordinate, Y.Coordinate, Updated.On, Latitude, Longitude, and Location. Some columns, such as "Ward", "X.Coordinate", "Y.Coordinate", "Latitude", and "Longitude", had values of "NA", "NULL", or "NaN"; the coordinate columns were among those removed because they were not necessary for this analysis.
I separated the column "Date" into 2 separate columns, "DateTime" and "TimeofDay".
I converted the column "DateTime" into a datetime format that could be used in later steps.
I checked the columns again for NA, NULL, or NaN values prior to data analysis.
The code below first checks for 'NA', 'NULL', or 'NaN' values using the apply() function. It then removes the columns that are unnecessary for this analysis by subsetting the tibble with the subset() function, using a negative index over the combined columns to remove them. For ease of analysis, the 'Date' column is separated into 'DateTime' and 'TimeofDay' (which is AM or PM) using the separate() function. The mutate() function then converts the 'DateTime' column to the type 'datetime', in the format month-day-year hour-minute-second. Each operation assigns its result to the variable 'chicago.crime.test'. Finally, the check for NULL or NA values is repeated to confirm a 'clean' dataset, and the top of the dataset is shown using the head() function. The dataset appears to conform to 'tidy' data principles.
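The cleaning steps described above can be sketched as follows. This is an illustration under assumptions, not the exact code used: the rows are made up, only a few columns are included, and the regular expression passed to separate() is one possible way to split off the AM/PM marker.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Made-up stand-in rows; the real data has many more columns.
chicago.crime <- tibble(
  X = 1:3,
  Case.Number = c("HY1", "HY2", "HY3"),
  Date = c("01/05/2015 01:30:00 PM", "07/04/2016 11:15:00 PM",
           "01/10/2017 09:00:00 AM"),
  Primary.Type = c("THEFT", "BATTERY", "HOMICIDE"),
  Ward = c(28L, NA, 42L),
  Year = c(2015L, 2016L, 2017L)
)

# Count missing values in each column.
na_counts <- apply(chicago.crime, 2, function(col) sum(is.na(col)))

# Drop unneeded columns with a negative selection.
chicago.crime.test <- subset(chicago.crime, select = -c(X, Case.Number, Ward))

# Split "Date" just before the AM/PM marker.
chicago.crime.test <- separate(chicago.crime.test, Date,
                               into = c("DateTime", "TimeofDay"),
                               sep = " (?=[AP]M$)")

# Parse the month/day/year hour:minute:second text into a date-time.
# The AM/PM marker now lives in TimeofDay, so the parsed hour stays on a
# 12-hour clock.
chicago.crime.test <- mutate(chicago.crime.test, DateTime = mdy_hms(DateTime))

head(chicago.crime.test)
```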
Exploratory Data Analysis and Visualization
Data Transformation
The code below creates a new column using the mutate() function as well as arithmetic operations.
The new column shows the proportion of crimes of each type, calculated by counting the crimes of a specific type and dividing that count by the total number of crimes for that year. First, the code groups the dataset by year and primary type of crime. Then, a column called 'count' is created that holds the number of crimes in each category for each year, and the tibble is arranged in descending order of number of crimes. Finally, a new column called "Proportion" is created that represents the share of that year's crimes that fell into each category.
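A minimal sketch of this calculation, assuming made-up rows in place of the real data (so the counts here are not the ones reported in this analysis):

```r
library(dplyr)

# Made-up stand-in rows.
crimes <- tibble(
  Year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016),
  Primary.Type = c("THEFT", "THEFT", "BATTERY", "NARCOTICS",
                   "THEFT", "BATTERY", "BATTERY", "BATTERY")
)

crime.proportions <- crimes %>%
  group_by(Year, Primary.Type) %>%
  summarize(count = n(), .groups = "drop") %>%   # crimes per type per year
  group_by(Year) %>%
  mutate(Proportion = count / sum(count)) %>%    # share of that year's crimes
  ungroup() %>%
  arrange(desc(count))

crime.proportions
```

Grouping by 'Year' again before mutate() makes sum(count) the yearly total, so the proportions within each year add up to 1.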
In the dataset, theft, battery, and criminal damage were the most common crimes committed in the city of Chicago. The number of crimes in these categories rose slightly from 2015 to 2016.
The code below uses the group_by() function to group the dataset by Community.Area and then uses summarize() to count the total number of crimes for each community area for each year.
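A minimal sketch of the grouping and counting, assuming made-up rows and a hypothetical result name 'crimes.by.area':

```r
library(dplyr)

# Made-up stand-in rows.
crimes <- tibble(
  Year = c(2015, 2015, 2015, 2016, 2016, 2016),
  Community.Area = c(25, 25, 8, 25, 8, 8)
)

crimes.by.area <- crimes %>%
  group_by(Year, Community.Area) %>%
  summarize(total = n(), .groups = "drop") %>%  # crimes per area per year
  arrange(Year, desc(total))

crimes.by.area
```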
Community areas 25, 8, 28, 43, and 29 had consistently high numbers of crimes in 2015 and 2016.
The two most noticeable community areas with consistently high numbers of crimes in 2015 and 2016 are 25 and 8. Community Area 25 corresponds to the 'Austin' neighborhood in Chicago. Community Area 8 corresponds to the 'Near North Side' neighborhood in Chicago.
The code below first filters the dataset to only show homicides for the years 2016 and 2015 using the filter() function and specifying the 'Year' and 'Primary.Type'.
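A minimal sketch of this filter, assuming made-up rows. The count() call at the end is an addition for illustration that tallies the filtered homicides per year.

```r
library(dplyr)

# Made-up stand-in rows.
crimes <- tibble(
  Year = c(2015, 2016, 2016, 2016, 2017),
  Primary.Type = c("HOMICIDE", "HOMICIDE", "THEFT", "HOMICIDE", "HOMICIDE")
)

homicides <- filter(crimes,
                    Year %in% c(2015, 2016),
                    Primary.Type == "HOMICIDE")

# Number of homicides per year in the filtered rows.
homicides.per.year <- count(homicides, Year)
homicides.per.year
```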
It appears that homicides rose from 2015 to 2016.
Key Questions
Do crimes tend to rise during summer? Yes, especially during June, July, and August.
What drugs were being manufactured/delivered in Chicago in 2016? Mainly white heroin, crack, and marijuana under 10 grams. However, manufacture/delivery of marijuana over 10 grams rose during 2016.
Which neighborhoods have the most crimes, and how many crimes do they typically have? The community area 'Austin' had around 45 crimes a day in 2016 and the community area 'Near North Side' had around 28 crimes a day in 2016.
Is Trump right to demand that "Chicago fix the horrible "carnage" going on, 228 shootings in 2017 with 42 killings (up 24% from 2016)?" Should he "send in the feds"?
How many homicides happened in 2017 from January 1st to when Trump tweeted this? How many homicides in this same time period for 2016? From January 1st to January 24th, there were 37 homicides in 2016 and 45 homicides in 2017.
How many crimes involving handguns or 'shootings' have happened in 2017 during this time period? According to the data, not only were there no 'shootings' in 2017 but there were no crimes related to handguns at all in 2017.
Data Visualization
The code below filters the dataset to only show homicides using the filter() function. It then separates the 'DateTime' column into two columns, 'Date' and 'Time', using the separate() function. Finally, it separates the 'Date' column into three different columns: 'Year1', 'Month', and 'Day'. The 'Year1' column was not removed because the end goal was to make a plot, so the 'tidiness' of the dataset was not a priority. The dataset was then grouped by Year, Month, and Primary.Type using the group_by() function and summarized to count the number of homicides that happened each month for each year.
A plot was then made to show the number of homicides for each month for each year, using different colors for different years.
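The reshaping and plotting steps above can be sketched as follows. The rows are made up, 'DateTime' is assumed to be text in "YYYY-MM-DD HH:MM:SS" form, and the names 'homicides.by.month' and 'p' are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up stand-in rows.
crimes <- tibble(
  Year = c(2015, 2015, 2015, 2016, 2016, 2016, 2016),
  Primary.Type = c("HOMICIDE", "HOMICIDE", "THEFT", "HOMICIDE",
                   "HOMICIDE", "HOMICIDE", "BATTERY"),
  DateTime = c("2015-01-09 03:00:00", "2015-06-03 02:10:00",
               "2015-06-04 10:00:00", "2016-01-15 11:30:00",
               "2016-07-04 09:45:00", "2016-07-20 01:05:00",
               "2016-07-21 08:00:00")
)

homicides.by.month <- crimes %>%
  filter(Primary.Type == "HOMICIDE") %>%
  separate(DateTime, into = c("Date", "Time"), sep = " ") %>%
  separate(Date, into = c("Year1", "Month", "Day"), sep = "-") %>%
  group_by(Year, Month, Primary.Type) %>%
  summarize(count = n(), .groups = "drop")

# One line per year, colored by year.
p <- ggplot(homicides.by.month,
            aes(x = Month, y = count,
                color = factor(Year), group = factor(Year))) +
  geom_line() +
  geom_point() +
  labs(x = "Month", y = "Number of homicides", color = "Year")
```

Mapping factor(Year) to both color and group draws a separate line for each year even though 'Month' is a text column.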
This plot shows how the number of homicides varies by month. Homicides do appear to go up during the summer or hotter months, specifically June, July, and August.
The code below uses the separate() function to split the 'DateTime' column into two columns, 'Date' and 'Time', and assigns the result to a variable 'chicago.crime.test.narc'. It then uses the filter() function to keep only the offenses that happened in 2016 and were drug-related. The filter() function is used again to keep only the offenses whose description contains the string 'MANU', so that only drug-related crimes involving the manufacture or distribution of drugs remain.
The code then separates the 'Date' column into three columns: 'Year1', 'Month', and 'Day'. The 'Year1' column was not removed because the ultimate goal was a graph, so the 'tidiness' of the tibble was not a priority.
The code then uses the group_by() function to group the dataset by year, month, and description and finally uses the summarize() and arrange() functions to count the number of crimes per category per month and then organizes the tibble in descending order of number of crimes. The top of the tibble is shown using the head() function. It appears as if the main drug being distributed in Chicago is 'white heroin'.
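A minimal sketch of the steps above, under several assumptions: the rows and description strings are made up for illustration, 'DateTime' is text in "YYYY-MM-DD HH:MM:SS" form, and "drug-related" is taken to mean Primary.Type equal to "NARCOTICS".

```r
library(dplyr)
library(tidyr)

# Made-up stand-in rows.
chicago.crime.test <- tibble(
  Year = c(2016, 2016, 2016, 2016, 2015),
  Primary.Type = c("NARCOTICS", "NARCOTICS", "NARCOTICS", "THEFT", "NARCOTICS"),
  Description = c("MANU/DELIVER: HEROIN (WHITE)", "MANU/DELIVER: HEROIN (WHITE)",
                  "POSS: CANNABIS 30GMS OR LESS", "OVER $500",
                  "MANU/DELIVER:CRACK"),
  DateTime = c("2016-01-05 13:30:00", "2016-01-20 18:00:00",
               "2016-01-21 09:00:00", "2016-02-01 12:00:00",
               "2015-03-01 12:00:00")
)

chicago.crime.test.narc <- separate(chicago.crime.test, DateTime,
                                    into = c("Date", "Time"), sep = " ")

by_date_and_type <- chicago.crime.test.narc %>%
  filter(Year == 2016, Primary.Type == "NARCOTICS") %>%  # assumed meaning of drug-related
  filter(grepl("MANU", Description)) %>%                 # manufacture/delivery only
  separate(Date, into = c("Year1", "Month", "Day"), sep = "-") %>%
  group_by(Year, Month, Description) %>%
  summarize(count = n(), .groups = "drop") %>%
  arrange(desc(count))

head(by_date_and_type)
```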
The code below takes the tibble 'by_date_and_type' and creates a line graph for each type of offense in manufacturing/distributing drugs over a period of months for 2016 and assigns it to a variable 'q'. The options() function controls the height and width of the graph display window.
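A minimal sketch of such a line graph, assuming made-up monthly counts and assuming that the display size is set through the 'repr.plot.width' and 'repr.plot.height' options used by Jupyter R kernels:

```r
library(ggplot2)

# Made-up stand-in for the summarized tibble.
by_date_and_type <- data.frame(
  Month = rep(c("01", "02", "03"), times = 2),
  Description = rep(c("MANU/DELIVER: HEROIN (WHITE)",
                      "MANU/DEL:CANNABIS OVER 10 GMS"), each = 3),
  count = c(30, 25, 18, 5, 9, 14)
)

# Display size of the plot, in inches (assumed option names).
options(repr.plot.width = 10, repr.plot.height = 5)

# One line per type of offense.
q <- ggplot(by_date_and_type,
            aes(x = Month, y = count,
                color = Description, group = Description)) +
  geom_line() +
  labs(x = "Month (2016)", y = "Number of offenses", color = "Offense")
```

Evaluating 'q' on its own line in the notebook displays the graph.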
The graph shows that a large amount of 'white heroin' was being distributed in Chicago at the beginning of 2016 and that this amount declined swiftly near the end of 2016, while the manufacture/distribution of marijuana in quantities over 10 grams rose as the year went on.
The code below separates the dataset column 'Date' into two columns, 'Date' and 'Time', using the separate() function and then filters the dataset to show only data for the year 2016 using the filter() function.
The resulting dataset is then grouped by Year, Community.Area, and Date using the group_by() function. A new column is added counting the number of crimes per day per community area using the summarize() function nested in the arrange() function.
The summarize() function is used again to get summary statistics for the daily counts, such as the mean, minimum, maximum, upper and lower quartiles, median, variance, and standard deviation.
Finally, the dataset is arranged in descending order of mean number of crimes using the arrange() function. The top of this dataset is shown using the head() function.
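The daily counts and summary statistics described above can be sketched as follows, assuming made-up rows and hypothetical names ('daily.counts', 'area.stats', and the statistic column names):

```r
library(dplyr)

# Made-up stand-in rows: one row per reported crime.
crimes <- tibble(
  Year = 2016,
  Community.Area = c(rep(25, 6), rep(8, 4)),
  Date = c(rep("2016-01-01", 4), rep("2016-01-02", 2),
           "2016-01-01", rep("2016-01-02", 3))
)

# Number of crimes per day per community area.
daily.counts <- crimes %>%
  filter(Year == 2016) %>%
  group_by(Year, Community.Area, Date) %>%
  summarize(count = n(), .groups = "drop_last")

# Summary statistics of the daily counts for each community area.
area.stats <- daily.counts %>%
  summarize(
    mean.count = mean(count),
    min.count = min(count),
    lower.quartile = quantile(count, 0.25, names = FALSE),
    median.count = median(count),
    upper.quartile = quantile(count, 0.75, names = FALSE),
    max.count = max(count),
    variance = var(count),
    std.dev = sd(count),
    .groups = "drop"
  ) %>%
  arrange(desc(mean.count))

head(area.stats)
```

Dropping only the last grouping level ('Date') after the first summarize() leaves the data grouped by year and community area, so the second summarize() produces one row of statistics per area.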
The code below uses ggplot() and geom_boxplot() to create a boxplot for each community area, showing some of the summary statistics for 2016. As the boxplot shows, community areas 25, 8, and 28 have very high rates of crime compared to other community areas.
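A minimal sketch of such a boxplot, assuming made-up daily counts and a hypothetical plot name 'b':

```r
library(ggplot2)

# Made-up daily crime counts for three community areas.
daily.counts <- data.frame(
  Community.Area = rep(c(25, 8, 28), each = 4),
  count = c(44, 47, 41, 50, 27, 30, 25, 29, 24, 26, 22, 28)
)

# One box per community area; factor() treats the area number as a label.
b <- ggplot(daily.counts, aes(x = factor(Community.Area), y = count)) +
  geom_boxplot() +
  labs(x = "Community area", y = "Crimes per day (2016)")
```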
The code below uses the filter() function to show only homicides from January 1st to January 24th, 2016, and then uses it again to show homicides within the same time period in 2017. This was done to test the validity of a comment Trump made about the rising number of homicides in Chicago.
It appears that he was correct that homicides in Chicago rose over the time period he specified, from 2016 to 2017. However, during this time period homicides rose 22% from 2016, not 24%, and there were actually 45 homicides rather than 42 in 2017.
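A minimal sketch of the date-window filter and of the percentage calculation. The rows are made up and 'Date' is assumed to be stored as a Date; the percentage uses the counts reported above (37 homicides in 2016 and 45 in 2017).

```r
library(dplyr)

# Made-up stand-in rows.
crimes <- tibble(
  Primary.Type = c("HOMICIDE", "HOMICIDE", "HOMICIDE", "THEFT", "HOMICIDE"),
  Date = as.Date(c("2016-01-03", "2016-01-30", "2017-01-10",
                   "2017-01-11", "2017-01-24"))
)

homicides.2016 <- filter(crimes, Primary.Type == "HOMICIDE",
                         Date >= as.Date("2016-01-01"),
                         Date <= as.Date("2016-01-24"))

homicides.2017 <- filter(crimes, Primary.Type == "HOMICIDE",
                         Date >= as.Date("2017-01-01"),
                         Date <= as.Date("2017-01-24"))

# Percentage change from the counts reported in the text.
percent.change <- (45 - 37) / 37 * 100
round(percent.change)
```

With 37 homicides in 2016 and 45 in 2017, the increase is 8 / 37, or about 21.6%, which rounds to 22%.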
The code below uses the filter() function to show the different types of crime that were handgun-related during the time period in which Trump suggested that there were 228 shootings.
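A minimal sketch of such a filter, assuming that handgun-related crimes are identified by the string 'HANDGUN' in the 'Description' column and that 'Date' is stored as a Date. The rows are made up and are not real records.

```r
library(dplyr)

# Made-up stand-in rows.
crimes <- tibble(
  Date = as.Date(c("2016-01-05", "2017-01-10", "2017-01-12", "2017-02-01")),
  Primary.Type = c("ROBBERY", "THEFT", "ASSAULT", "ROBBERY"),
  Description = c("ARMED: HANDGUN", "OVER $500",
                  "AGGRAVATED: HANDGUN", "ARMED: HANDGUN")
)

# Handgun-related crimes from January 1 to January 24, 2017.
handgun.2017 <- filter(crimes,
                       Date >= as.Date("2017-01-01"),
                       Date <= as.Date("2017-01-24"),
                       grepl("HANDGUN", Description))

nrow(handgun.2017)
```

Because grepl() is case-sensitive by default, the pattern has to match the capitalization used in the 'Description' column.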
It appears as if Trump fabricated his data about shootings in Chicago for 2017, as the dataset shows no crimes involving handguns from January 1, 2017 to the day Trump tweeted that he 'should call in the feds for Chicago'.