Author: Anastasia Morozova
Description: Jupyter notebook 2017-03-31-131047.ipynb

Download the Duncan.csv data set and place it in the "data" folder.

Task 1. Fit linear regression relationships: Prestige(Income)

Method 1: Fit a linear relationship by ordinary least squares. Determine which points are outliers. Refit the line on the data with the outliers removed.
Method 2: Fit a linear regression using M-regression based on the Huber function (link). Identify the outliers.
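
For reference, the Huber function behind method 2 is commonly written as below. The symbols are illustrative choices, not taken from the assignment: $r$ is a residual and $\delta > 0$ is the threshold at which the loss switches from quadratic to linear.

```latex
\rho_\delta(r) =
\begin{cases}
  \tfrac{1}{2}\, r^{2}, & |r| \le \delta, \\[4pt]
  \delta \left( |r| - \tfrac{1}{2}\,\delta \right), & |r| > \delta,
\end{cases}
\qquad
(\hat a, \hat b) = \arg\min_{a,\,b} \sum_{i=1}^{n} \rho_\delta\!\left( y_i - a - b\,x_i \right).
```

Because large residuals are penalized only linearly, a few extreme points pull the fitted line far less than they do under ordinary least squares.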

In [1]:
# Prestige(Income)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model as sklm
import sklearn.metrics as metrics
import statsmodels.api as sm
import patsy as pt
In [8]:
df = pd.read_csv('data/Duncan.csv', sep=',')  # load the Duncan data set from the "data" folder
In [4]:
print(type(df))
df
In [5]:
df.index
In [6]:
X = df.prestige     # select the column by name
X = df.iloc[:, 4]   # prestige, selected by position
print(X)
In [7]:
Y = df.income       # select the column by name
Y = df.iloc[:, 2]   # income, selected by position
print(Y)
In [3]:
plt.figure(figsize=(5, 5))
plt.scatter(X, Y, s=36, c='r')
plt.xlabel("prestige")
plt.ylabel("income")
plt.grid(True)
plt.show()
In [8]:
Sx = sum(X)          # sum of X
print(Sx)
z = float(len(X))    # number of observations
print(z)
Cx = float(Sx / z)   # mean of X
print(Cx)
2146
45.0
47.68888888888889
In [9]:
Sx3 = Cx**2   # square of the mean of X
print(Sx3)
2274.2301234567904
In [10]:
Sy = sum(Y)          # sum of Y
print(Sy)
c = float(len(Y))    # number of observations
print(c)
Cy = float(Sy / c)   # mean of Y
print(Cy)
1884
45.0
41.86666666666667
In [11]:
Sx2 = 0  # sum of squares
for x in X:
    Sx2 += x * x
print(Sx2)
Cx2 = Sx2 / len(X)  # arithmetic mean of the squares
print(Cx)  # prints the mean of X; Cx2 is used in the next cell
146028
47.68888888888889
In [12]:
S1 = 0  # sum of the paired products x_i * y_i
for x, y in zip(X, Y):
    S1 += x * y
Ss = S1 / len(X)  # mean of the paired products
print(Ss)
print(Cx * Cy)
print(Cx2 - Sx3)
print(Cy)
In [40]:
# least-squares slope and intercept
b = (Ss - Cx * Cy) / (Cx2 - Sx3)
a = Cy - b * Cx
print(b)
print(a)
In [41]:
Yy = a + b * X  # fitted values for every observation
print(Yy)
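
The slope and intercept computed by hand above follow the closed-form least-squares solution, b = (mean(xy) - mean(x)·mean(y)) / (mean(x²) - mean(x)²) and a = mean(y) - b·mean(x). The sketch below checks that formula against `np.polyfit`. It uses small synthetic data rather than the Duncan values, and the helper name `ols_line` is hypothetical.

```python
import numpy as np

def ols_line(x, y):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    mean_xy = (x * y).mean()   # mean of the paired products x_i * y_i
    mean_x2 = (x * x).mean()   # mean of the squares of x
    b = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
    a = mean_y - b * mean_x
    return a, b

# Synthetic, noise-free data (an assumption for illustration): y = 3 + 2x
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = 3.0 + 2.0 * x_demo

a_demo, b_demo = ols_line(x_demo, y_demo)
slope_ref, intercept_ref = np.polyfit(x_demo, y_demo, 1)  # reference fit
print(a_demo, b_demo)
print(intercept_ref, slope_ref)
```

The product term must pair each x with its own y. Summing over every combination of x and y gives sum(X)·sum(Y) instead, which is not the quantity the slope formula needs.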
In [42]:
plt.figure(figsize=(5, 5))
plt.scatter(X, Y, s=20, c='r')
# plt.scatter(X, Y, s=20, c='w', edgecolors='k', linewidth=1.5)
plt.xlabel("prestige")
plt.ylabel("income")
plt.grid(True)
plt.plot(X, Yy, c='r', linestyle='--', label='original')  # hand-computed line
plt.show()
In [43]:
rgr = sklm.LinearRegression()
Xr = X.values.reshape((-1, 1))  # column vector of shape (n_samples, 1)
rgr.fit(Xr, Y)  # fit the model
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
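
Method 1 also asks for the outliers to be identified and the line refitted without them. One possible way to do this is sketched below: flag points whose residual exceeds twice the residual standard deviation, then refit on the rest. The 2-sigma cutoff, the synthetic data and all variable names are assumptions for illustration, not part of the assignment.

```python
import numpy as np
import sklearn.linear_model as sklm

# Synthetic data (not the Duncan values): a noisy line with two planted outliers
rng = np.random.RandomState(0)
x_demo = np.linspace(0.0, 10.0, 30)
y_demo = 1.0 + 0.5 * x_demo + rng.normal(scale=0.2, size=x_demo.size)
y_demo[5] += 8.0
y_demo[20] -= 8.0

X_demo = x_demo.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

# Step 1: ordinary least squares on all points
ols_all = sklm.LinearRegression().fit(X_demo, y_demo)
resid = y_demo - ols_all.predict(X_demo)

# Step 2: residual standard deviation (two fitted parameters) and a 2-sigma rule
sigma = np.sqrt(np.sum(resid ** 2) / (len(y_demo) - 2))
outlier_mask = np.abs(resid) > 2.0 * sigma

# Step 3: refit on the points that were not flagged
ols_clean = sklm.LinearRegression().fit(X_demo[~outlier_mask], y_demo[~outlier_mask])
print(np.flatnonzero(outlier_mask))
print(ols_all.coef_[0], ols_clean.coef_[0])
```

On the notebook's data the same steps would use `Xr` and `Y` in place of the synthetic arrays. The cutoff is a judgment call, and a stricter or looser threshold changes which points are dropped.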
In [44]:
regr_hub = sklm.HuberRegressor()
regr_hub.fit(Xr, Y)
HuberRegressor(alpha=0.0001, epsilon=1.35, fit_intercept=True, max_iter=100, tol=1e-05, warm_start=False)
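
To identify outliers with method 2, a fitted `HuberRegressor` exposes a boolean `outliers_` mask marking the samples it treated as outliers. The sketch below shows this on synthetic data (an assumption; these are not the Duncan values) with two planted outliers.

```python
import numpy as np
import sklearn.linear_model as sklm

# Synthetic data: a noisy line with two planted outliers
rng = np.random.RandomState(1)
x_demo = np.linspace(0.0, 10.0, 30)
y_demo = 1.0 + 0.5 * x_demo + rng.normal(scale=0.2, size=x_demo.size)
y_demo[5] += 8.0
y_demo[20] -= 8.0

huber = sklm.HuberRegressor(epsilon=1.35).fit(x_demo.reshape(-1, 1), y_demo)

# Indices of the samples the Huber fit marked as outliers
flagged = np.flatnonzero(huber.outliers_)
print(flagged)
print(huber.coef_[0], huber.intercept_)
```

With the default `epsilon=1.35` the mask can also include ordinary points with moderately large residuals, so it is best read as a list of candidates. On the notebook's data the equivalent call would be `np.flatnonzero(regr_hub.outliers_)`.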
In [45]:
plt.figure(figsize=(9, 5))
plt.scatter(X, Y, s=50, c='w', edgecolors='k', linewidth=1.5)
plt.plot(X, Yy, linestyle='--', label='original')  # hand-computed line
plt.plot(X, rgr.predict(Xr), label='LinearRegression')
plt.plot(X, regr_hub.predict(Xr), label='Huber')
plt.xlabel("prestige")
plt.ylabel("income")
plt.legend()
plt.grid(True)
plt.minorticks_on()
plt.show()