databreach.ipynb
Author: Data 88
License: Apache License 2.0
Description: A Python notebook on Data Breaches
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set_style("whitegrid")
sns.set_context("talk")
In [4]:
df = pd.read_csv('./datab2.csv', sep = ';')
In [5]:
df
Entity Story Year Records Sector Method
0 River City Media A dodgy backup has allegedly resulted in over ... 2017 1370000000 Web Accidentally published
1 Unique Identification Authority of India A report says that full data base has been exp... 2017 1000000000 Government Poor security
2 Spambot A misconfigured spambot has leaked over 700m r... 2017 711000000 Web Poor security
3 Friend Finder Network Usernames, email addresses, passwords for site... 2016 412000000 Web Hacked
4 Equifax If you have a credit report, there’s a good ch... 2017 143000000 Financial Hacked
... ... ... ... ... ... ...
265 Cardsystems Solutions Inc. CardSystems was fingered by MasterCard after i... 2005 40000000 Financial Hacked
266 Citigroup Blame the messenger! A box of computer tapes c... 2005 3900000 Financial Lost / stolen device or media
267 Ameritrade Inc. Computer backup tape containing personal infor... 2005 200000 Financial Lost / stolen device or media
268 Automatic Data Processing NaN 2005 125000 Financial Poor security
269 AOL A former America Online software engineer stol... 2004 92000000 Web Inside job

270 rows × 6 columns

In [6]:
df.shape
(270, 6)
In [7]:
df.dtypes
Entity object
Story object
Year int64
Records int64
Sector object
Method object
dtype: object
In [8]:
df.head()
Entity Story Year Records Sector Method
0 River City Media A dodgy backup has allegedly resulted in over ... 2017 1370000000 Web Accidentally published
1 Unique Identification Authority of India A report says that full data base has been exp... 2017 1000000000 Government Poor security
2 Spambot A misconfigured spambot has leaked over 700m r... 2017 711000000 Web Poor security
3 Friend Finder Network Usernames, email addresses, passwords for site... 2016 412000000 Web Hacked
4 Equifax If you have a credit report, there’s a good ch... 2017 143000000 Financial Hacked
In [9]:
df.index
RangeIndex(start=0, stop=270, step=1)
In [10]:
df.plot(x="Sector", y="Records", kind = 'line')
<matplotlib.axes._subplots.AxesSubplot at 0x7f3a43ccaba8>
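A line plot over the categorical Sector column connects rows in file order, so the line itself carries little meaning. One possible alternative is to total Records per sector first. The sketch below is illustrative only: it uses just the first five rows shown in the df output above, and the names `sample` and `sector_totals` are not part of the original notebook.

```python
import pandas as pd

# Five rows copied from the df output above (the notebook itself reads datab2.csv).
sample = pd.DataFrame({
    "Entity": [
        "River City Media",
        "Unique Identification Authority of India",
        "Spambot",
        "Friend Finder Network",
        "Equifax",
    ],
    "Records": [1370000000, 1000000000, 711000000, 412000000, 143000000],
    "Sector": ["Web", "Government", "Web", "Web", "Financial"],
})

# Sum the breached records within each sector, largest total first.
sector_totals = (
    sample.groupby("Sector")["Records"]
    .sum()
    .sort_values(ascending=False)
)
print(sector_totals)
```

On the full DataFrame the same groupby would apply to df, and `sector_totals.plot(kind="bar")` would then draw one bar per sector.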
In [11]:
ax = sns.barplot(x="Year", y="Records", data=df)
In [12]:
df.sort_values('Records', ascending=False)
Entity Story Year Records Sector Method
0 River City Media A dodgy backup has allegedly resulted in over ... 2017 1370000000 Web Accidentally published
1 Unique Identification Authority of India A report says that full data base has been exp... 2017 1000000000 Government Poor security
103 Yahoo Happened in 2013 but only disclosed late 2016.... 2013 1000000000 Web Hacked
2 Spambot A misconfigured spambot has leaked over 700m r... 2017 711000000 Web Poor security
78 Yahoo Happened in 2014, but no. records stolen was o... 2014 500000000 Web Hacked
... ... ... ... ... ... ...
53 uTorrent It's unclear what data has been breached, exac... 2016 35000 Web Hacked
194 Morgan Stanley Smith Barney Morgan Stanley mailed a CD containing sensitiv... 2011 34000 Financial Lost / stolen device or media
36 Quest Diagnostics Nov. The stolen data contained names, DOBs, la... 2016 34000 Healthcare Hacked
159 Dropbox Websites stolen from other websites used to si... 2012 30000 Web Hacked
54 Wendy's Malware has been used in 1025 of Wendy's resta... 2016 1025 Retail Hacked

270 rows × 6 columns

In [13]:
df.describe()
Year Records
count 270.000000 2.700000e+02
mean 2012.222222 3.275059e+07
std 3.134332 1.341436e+08
min 2004.000000 1.025000e+03
25% 2010.000000 3.223948e+05
50% 2012.000000 2.000000e+06
75% 2015.000000 1.075000e+07
max 2017.000000 1.370000e+09
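The summary above puts the mean of Records (about 3.3e7) well above the median (2.0e6), which suggests a strongly right-skewed distribution dominated by a few very large breaches. The sketch below only re-uses the numbers printed by df.describe(); the names `summary`, `mean_to_median` and `log10_summary` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Values copied from the Records column of the df.describe() output above.
summary = pd.Series({
    "min": 1.025e3,
    "25%": 3.223948e5,
    "50%": 2.0e6,
    "75%": 1.075e7,
    "max": 1.37e9,
    "mean": 3.275059e7,
})

# How many times larger the mean is than the median.
mean_to_median = summary["mean"] / summary["50%"]

# On a log10 scale the quartiles are spread far more evenly.
log10_summary = np.log10(summary)

print(mean_to_median)
print(log10_summary.round(2))
```

Given a spread of more than six orders of magnitude between min and max, a logarithmic axis (for example `plt.yscale("log")`) may make the later plots easier to read.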
In [14]:
df_by_year = df.groupby('Year')
In [15]:
type(df_by_year)
pandas.core.groupby.generic.DataFrameGroupBy
In [16]:
df_by_year.describe()
Records
count mean std min 25% 50% 75% max
Year
2004 1.0 9.200000e+07 NaN 92000000.0 92000000.00 92000000.0 92000000.00 9.200000e+07
2005 4.0 1.105625e+07 1.937613e+07 125000.0 181250.00 2050000.0 12925000.00 4.000000e+07
2006 6.0 1.171667e+07 1.086617e+07 200000.0 2950000.00 10500000.0 19250000.00 2.650000e+07
2007 13.0 1.202203e+07 2.550905e+07 89000.0 1000000.00 3000000.0 8500000.00 9.400000e+07
2008 17.0 4.033324e+06 5.164474e+06 50500.0 113000.00 2100000.0 5000000.00 1.800000e+07
2009 14.0 1.834375e+07 3.831834e+07 72000.0 391284.25 1121604.5 7443033.50 1.300000e+08
2010 20.0 8.068238e+05 9.542018e+05 43000.0 174083.25 395000.0 975000.00 3.300000e+06
2011 35.0 6.556471e+06 1.484513e+07 34000.0 200000.00 1000000.0 4572433.00 7.700000e+07
2012 26.0 2.753285e+07 5.208669e+07 30000.0 785000.00 7500000.0 16625000.00 2.000000e+08
2013 31.0 4.233181e+07 1.787852e+08 100000.0 165000.00 1460000.0 5350000.00 1.000000e+09
2014 27.0 3.628846e+07 9.850089e+07 52000.0 1050000.00 4000000.0 15500000.00 5.000000e+08
2015 25.0 1.848113e+07 4.295943e+07 40000.0 400000.00 2700000.0 13000000.00 1.980000e+08
2016 33.0 3.463194e+07 7.684370e+07 1025.0 790724.00 6600000.0 40000000.00 4.120000e+08
2017 18.0 1.831136e+08 4.058881e+08 40000.0 1150000.00 3000000.0 27720469.25 1.370000e+09
In [17]:
list(df_by_year)[10]
(2014,
Entity Story Year Records Sector Method
6 Malaysian telcos & MVNOs Oct. Data from numerous Malaysian telco & MVNO... 2014 46200000 Telecoms Hacked
47 Privatization Agency of the Republic of Serbia A text file with personal data and financial d... 2014 5190396 Government Accidentally published
78 Yahoo Happened in 2014, but no. records stolen was o... 2014 500000000 Web Hacked
79 Ebay The company has said hackers attacked between ... 2014 145000000 Web Hacked
80 JP Morgan Chase July 2014: The US's largest bank was compromis... 2014 76000000 Financial Hacked
81 Target Investigators believe the data was obtained vi... 2014 70000000 Retail Hacked
82 Home Depot Malware installed on cash register system acro... 2014 56000000 Retail Hacked
83 Korea Credit Bureau NaN 2014 20000000 Financial Inside job
84 Premera Detected 29th Jan 2015. Occured May 2014. Coul... 2014 11000000 Healthcare Hacked
85 Sony Pictures Wide-ranging hack of potentially every piece o... 2014 10000000 Media Hacked
86 Twitch.tv March 23rd. Details unknown at this point. All... 2014 10000000 Gaming Hacked
87 Gmail* 5 million Gmail account passwords leaked to a ... 2014 5000000 Web Hacked
88 Community Health Systems Aug 2014: Community Health Systems, which oper... 2014 4500000 Healthcare Hacked
89 European Central Bank NaN 2014 4000000 Financial Hacked
90 UPS Malware was discovered in the credit & debit c... 2014 4000000 Retail Hacked
91 HSBC Turkey In a message to customers on its website, the ... 2014 2700000 Financial Hacked
92 AOL NaN 2014 2400000 Web Hacked
93 Imgur Imgur are still investigating how the breach t... 2014 1700000 App Hacked
94 Staples NaN 2014 1160000 Retail Hacked
95 Neiman Marcus NaN 2014 1100000 Retail Hacked
96 D&B, Altegrity Hackers stole millions of social security numb... 2014 1000000 Tech Hacked
97 MacRumours.com NaN 2014 860000 Web Hacked
98 Japan Airlines Oct 2014: Japan Airlines confirmed the possibl... 2014 750000 Transport Hacked
99 Dominios Pizzas (France) NaN 2014 600000 Web Hacked
100 NASDAQ Nasdaq forum website hacked by hacking ring, e... 2014 500000 Financial Hacked
101 Mozilla NaN 2014 76000 Web Poor security
102 New York Taxis A freedom of information request resulted in t... 2014 52000 Transport Poor security
)
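Indexing list(df_by_year) by position works, but it depends on knowing where a given year falls in the sorted group order. Looking a group up by its key with get_group may be easier to read. The sketch below is a minimal illustration on the five 2004 and 2005 rows shown at the end of the df output; `sample`, `by_year` and `rows_2005` are names introduced here.

```python
import pandas as pd

# The 2004 and 2005 rows copied from the tail of the df output (Story omitted).
sample = pd.DataFrame({
    "Entity": [
        "Cardsystems Solutions Inc.",
        "Citigroup",
        "Ameritrade Inc.",
        "Automatic Data Processing",
        "AOL",
    ],
    "Year": [2005, 2005, 2005, 2005, 2004],
    "Records": [40000000, 3900000, 200000, 125000, 92000000],
})

by_year = sample.groupby("Year")

# Fetch one year's rows by key instead of by position in the group list.
rows_2005 = by_year.get_group(2005)
print(rows_2005)
```

On the notebook's own groupby object, `df_by_year.get_group(2014)` should return the same rows as the second element of the tuple above.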
In [18]:
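# numeric_only=True skips the text columns, which recent pandas versions refuse to average
df_mean_by_year = df_by_year.mean(numeric_only=True)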
In [19]:
df_mean_by_year
Records
Year
2004 9.200000e+07
2005 1.105625e+07
2006 1.171667e+07
2007 1.202203e+07
2008 4.033324e+06
2009 1.834375e+07
2010 8.068238e+05
2011 6.556471e+06
2012 2.753285e+07
2013 4.233181e+07
2014 3.628846e+07
2015 1.848113e+07
2016 3.463194e+07
2017 1.831136e+08
In [20]:
print(df.index)
RangeIndex(start=0, stop=270, step=1)
In [21]:
print(df_mean_by_year)
Records
Year
2004 9.200000e+07
2005 1.105625e+07
2006 1.171667e+07
2007 1.202203e+07
2008 4.033324e+06
2009 1.834375e+07
2010 8.068238e+05
2011 6.556471e+06
2012 2.753285e+07
2013 4.233181e+07
2014 3.628846e+07
2015 1.848113e+07
2016 3.463194e+07
2017 1.831136e+08
In [22]:
df_records_by_year = df_mean_by_year['Records']
plt.scatter(df_records_by_year.index, df_records_by_year)
plt.xlabel('Year')
# The plotted values are per-year means, so the axis is labelled accordingly
plt.ylabel('Mean Records Breached')
Text(0, 0.5, 'Mean Records Breached')
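The values plotted in this cell come from df_mean_by_year, so they are per-year means. Because a handful of very large breaches pull the mean upward, a per-year median can be worth computing alongside it. The sketch below is a small illustration on the 2004 and 2005 rows from the df output; `sample` and `stats` are names introduced here, and the results agree with the mean and 50% columns of df_by_year.describe() for those years.

```python
import pandas as pd

# The 2004 and 2005 rows copied from the tail of the df output.
sample = pd.DataFrame({
    "Year": [2005, 2005, 2005, 2005, 2004],
    "Records": [40000000, 3900000, 200000, 125000, 92000000],
})

# Compute both statistics per year in one pass.
stats = sample.groupby("Year")["Records"].agg(["mean", "median"])
print(stats)
# 2005: mean 11056250.0, median 2050000.0
```

For 2005 the mean is more than five times the median, driven almost entirely by the single 40-million-record breach.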
In [23]:
sns.relplot(x='Year', y='Records', data=df)
<seaborn.axisgrid.FacetGrid at 0x7f3a41632400>
In [24]:
sns.set(style="white")

# Plot sector against year, with colour for breach method and marker size for records breached
sns.relplot(x="Year", y="Sector", hue="Method", size="Records",
            sizes=(40, 400), alpha=.5, palette="muted",
            height=6, data=df)
<seaborn.axisgrid.FacetGrid at 0x7f3a415edba8>
In [25]:
sns.relplot(x="Year", y="Method", hue="Sector", size="Records", sizes=(1000, 9000), alpha=0.5, palette="muted", height=16, data=df)
<seaborn.axisgrid.FacetGrid at 0x7f3a415356d8>
In [26]:
sns.relplot(x="Year", y="Records", hue="Method", size="Records", sizes=(1000, 9000), alpha=0.5, height=16, data=df)
<seaborn.axisgrid.FacetGrid at 0x7f3a41480e80>