Authors: Chi Hern Ang, Fabricius Somogyi

In [25]:
using Pkg
using RCall
include("./printmat.jl")

using Plots
# gr(size=(600,400))
#pyplot(size=(600,400))
gr(size=(480,320))
default(fmt = :svg)


# Calling R from Julia

R is one of the leading programming languages for statistics and econometrics. Its power comes from thousands of well-documented packages, often written by professors and leading researchers themselves. When there is no time to hand-code a GARCH estimation or a system of VAR equations, a good option is to transfer the data to R and let it do the computations. The (registered) package needed is RCall. There are several ways of interacting with R from Julia, all more or less equally straightforward. We will stick to treating R as a black box: put in the data, call the required function, and extract the result. Full and easy-to-understand documentation is available here.

# Task 1 (10 points)

Using R from Julia, simulate two sequences (100 values each) from a Cauchy distribution with location = 0 and scale = 1. Transfer them to Julia and assign them to the variables y1 and y2. The documentation on the Cauchy distribution in R is here.

In [40]:
R"y1 <- rcauchy(100, location = 0, scale = 1)"
R"y2 <- rcauchy(100, location = 0, scale = 1)"

RObject{RealSxp} [1] -7.244468866 0.847779270 -1.082416847 -0.651576274 -0.148679750 [6] 46.464599968 3.471882957 -0.270558850 -0.993496171 -0.951280297 [11] 0.273606194 12.045965976 3.071353942 -0.531499387 0.692925194 [16] -0.450799069 -0.692106099 -4.776544992 1.234736993 -0.169060122 [21] -0.102353746 1.355505764 -2.995360055 1.143103092 -0.798638660 [26] 1.175529673 -0.240438555 3.576590795 -0.173720442 -0.615909402 [31] -0.280300361 0.233946784 -1.191984555 2.335007457 0.447814783 [36] -1.717073994 -0.339348783 -1.626897344 -1.872519546 -25.147153715 [41] -1.808880527 1.237149087 1.415063353 0.539924469 1.014416002 [46] 0.166336026 0.298964990 1.336696623 -0.006271213 -0.735993539 [51] -4.965217798 -0.121421692 1.212729661 -26.200395124 -3.216235381 [56] 0.015580038 0.547205468 0.829951222 -2.861677037 -5.197022895 [61] 0.799260276 3.973996656 -1.562077153 0.199285650 -0.318847465 [66] -1.738368776 -0.022591935 -0.826909121 0.660515845 0.024908008 [71] -0.472798085 -3.473366740 -4.526689041 -0.400015128 -1.244143913 [76] 0.488441258 0.262895114 -1.549342496 1.710511103 -0.254675226 [81] 0.009795992 0.551233114 -1.347351051 -27.278947755 3.045290282 [86] 2.855803338 -0.632225074 2.838604421 1.001077691 0.476820466 [91] 0.067995222 -0.803835685 2.489997383 1.189002229 2.060836215 [96] 1.941041478 -1.733453414 -0.748453048 11.488377800 -10.362050224
In [36]:
@rget y1 y2

println("y1")
print(y1)
println("\ny2")
print(y2)

y1 [-31.55821572534249, 0.5300362367065363, -1.0904282948511226, -4.0406123926708855, 197.24928809746663, 1.5502485728868574, -0.18662992698031694, -0.3300397159360846, -0.7127604077564949, 0.0030992918438616673, 6.226539735223217, -4.3485616341358915, -3.3178984241697913, 3.6116882432892323, 0.9225408238550203, 0.4314494690799277, 0.8552284188932663, 0.4575301607312544, -5.000913735208004, 0.07806622567685487, -0.4999920214220553, 8.188348730240442, -1.223239645210061, 0.25561735111015654, 0.15178339277588176, 0.5618853377909098, -0.09547610209551784, -2.0532482235607703, 1.260012439165121, -43.09100924517277, -2.966408703043354, 3.513248949543934, -20.53248103836206, -0.26111265893321517, -0.18164621004629938, 1.4770842186413675, 0.09458506456302686, -0.6430436136258214, 0.5748498760666672, -4.183623547307209, -0.06791317544648086, 1.2065556353209663, 0.41538680165938435, 0.25203163901334963, 0.6861197523240763, 0.39189839739555027, -0.24236414651868257, -0.14109391193587995, 0.24252754783484526, -0.4580627018347213, -0.7480692506233314, -0.8121232054099217, -0.46323737657576003, -0.17216090862161598, -1.1546448721996105, 1.4037955549447865, -1.6613220397273083, -0.8486691557622563, -2.5852661459761515, 0.042263810869523286, -2.474663074806969, 0.5575562048164977, -1.0170735393491839, 2.250655592846028, 0.3282970965497873, 0.8474996587545602, -0.5415108149586035, 3.876020177911211, 1.964045952747126, -0.4578735543659743, 0.70216756674162, -0.462056512231815, 0.42702569324866646, 3.542763662642219, -15.258443117502726, 0.29313056945230004, 0.8701254748807753, 5.019392346979378, 0.7883504662700018, -0.7566053986094929, 0.5765444555157656, 13.883225312315094, 5.839424591765613, -0.044701183828743545, 0.7903854620386495, 1.5164592646990995, 0.3012839568303177, -1.2354692860461978, 3.875594683423663, -0.009950120628086898, 1.2765115773991373, 0.1467135098945041, -0.9035262120650954, -22.304912669922132, -0.3482779979049259, 0.13392757230467367, 1.6681540439943823, 
1.6110912842224132, 0.33238740833330577, -1.7139263554752673] y2 [-0.9629128024938183, -39.08494990611081, -5.3199140042643025, -1.2362854154482499, -0.615473962253258, -0.7091076253927686, -0.5600832160342895, 0.6569406960698989, 0.9665174281731566, -15.231632440152534, 0.31225219402970933, 1.2197902810266503, -0.18489426664175868, -4.256171971979463, 0.28202681880986363, 0.7983816918647068, -2.5902608421677953, 13.351989843438291, -0.5790355150081746, 0.7063956208450053, 0.2514717787388858, 1.456800929402857, -2.048089541047096, -0.8087554997523954, -0.2554797723455376, 1.2355013909278332, -3.6290074075202714, 1.4244105103131286, 1.6111268956024676, -0.2775572544250061, 0.6652462118949632, -2.431390805212439, -1.7605343597738983, -1.008998829215546, 2.242057021123544, -0.8185864762813686, -0.14665102704945274, -0.4886985450724228, -0.05323739547962144, 2.1570508172007727, -0.06011397246644186, 1.9233087854916289, -1.2204243540776472, -0.1016296562947172, -0.1757187369082507, 0.7063283994240329, -1.824046399149404, -1.561087189554391, 1.1861066107614295, -1.8130768663895493, -0.2827110177759487, 0.5265190199766472, 1.9709768267378491, 4.871339277120169, -2.2637306553456926, 1.2844444225962708, 0.8025637962334431, -0.42616641398945404, 0.10141806926894945, -0.9921328959493336, -1.677937928537833, 0.7166004732982686, 0.7771326965115335, 0.6787509703207182, -0.6042047517506698, 7.369622174536689, 2.6872538557410524, -3.198882021388472, 2.862702873820454, 0.693255865663856, -1.2978526216542, -0.3976487862752027, 0.5375722360236875, 10.414096857016249, 0.23557858552627223, -0.7213741169793125, -1.285054529278544, -0.6065463641595923, 1.1012783827301877, 1.4854850200659768, -3.336707576707835, -0.8428562865358522, 15.094556486886471, -6.301091506007204, 0.604295988277596, -0.5533065090200777, 0.6271098045423343, 2.6855614857646377, -2.1769356541496996, -5.766262659759941, 11.765252251435406, -1.1491974719633622, -0.06308893076800778, -2.414309564048888, 
1.442992888220538, 1.2477258425986169, 1.611037598401453, 0.3883993737822962, -0.010631427664816942, -1.7194609257058588]
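As a cross-check that does not depend on R, a comparable draw can be sketched in plain Julia by inverse-transform sampling: if U is uniform on (0, 1), then location + scale * tan(pi * (U - 1/2)) follows a Cauchy distribution. The names `cauchy_quantile` and `rcauchy_native` are hypothetical helpers introduced here for illustration; they are not part of RCall or of the assignment.

```julia
using Random

# Quantile function (inverse CDF) of the Cauchy distribution.
cauchy_quantile(p; location=0.0, scale=1.0) = location + scale * tan(pi * (p - 0.5))

# Hypothetical helper: draw n Cauchy variates by inverse-transform sampling.
function rcauchy_native(rng::AbstractRNG, n::Integer; location=0.0, scale=1.0)
    u = rand(rng, n)    # uniform draws on [0, 1)
    return cauchy_quantile.(u; location=location, scale=scale)
end

rng = MersenneTwister(42)    # fixed seed, for reproducibility only
z1 = rcauchy_native(rng, 100)
z2 = rcauchy_native(rng, 100)
```

Because the Cauchy distribution has very heavy tails, occasional values that are orders of magnitude larger than the rest (as in the output above) are expected and do not indicate a bug.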

# Task 2 (10 points)

Compute the summary statistics of the two sequences with the summary function. Print the output.

In [45]:
R"summary(y1)"

RObject{RealSxp}
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-141.65905   -1.33590   -0.08866   -0.46860    0.59458  110.55740
In [44]:
R"summary(y2)"

RObject{RealSxp}
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-27.2789  -1.0157  -0.1119  -0.3435   1.0044  46.4646

# Task 3 (10 points)

Create a Q-Q plot of y1 vs. the normal distribution using the function qqnorm. In general, a Q-Q plot compares the quantiles of two distributions against each other. If the distributions are equal, all quantiles lie on the 45-degree line.
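To make the idea concrete, here is a minimal sketch in plain Julia that computes the points of a two-sample Q-Q plot. It pairs empirical quantiles of two samples on a common probability grid, whereas qqnorm pairs sample quantiles with theoretical normal quantiles. The helper name `qq_points` and its `npoints` argument are assumptions made for illustration.

```julia
using Statistics

# Hypothetical helper: paired quantiles of two samples on a common probability grid.
function qq_points(a::AbstractVector{<:Real}, b::AbstractVector{<:Real}; npoints::Integer=19)
    probs = (1:npoints) ./ (npoints + 1)    # interior probabilities, e.g. 0.05, ..., 0.95
    return quantile(a, probs), quantile(b, probs)
end

a = collect(1.0:100.0)
qa, qb = qq_points(a, a)           # identical samples: points lie on the 45-degree line
qa2, qb2 = qq_points(a, 2 .* a)    # rescaled sample: points lie on a line with slope 2
```

Plotting `qb` against `qa` gives the Q-Q plot; systematic departures from the 45-degree line, especially at the extremes, point to differences in the tails.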

In [83]:
R"qqnorm(y1, ylim=c(-2,2),xlim=c(-2,2))"
R"abline(a = 0, b = 1)"

RObject{NilSxp} NULL
In [82]:
R"qqnorm(y2, ylim=c(-2,2),xlim=c(-2,2))"
R"abline(a = 0, b = 1)"

RObject{NilSxp} NULL

# Task 4 (30 points)

Write a simple function that returns the fitted residuals from a linear regression y ~ x. In R, having estimated a model with

mod = lm(y ~ x),


residuals can be extracted as

res = resid(mod)


Follow the sketch below.

"""
Fetch the residuals from a linear regression y ~ x.

Parameters
----------
x : Array
    regressor
y : Array
    response
addconstant : Bool
    true if constant should be added to x

Returns
-------
res : Array
    of fitted residuals
"""
function ols_resid(x::Array, y::Array; addconstant::Bool=false)

    return

end

In [85]:
"""
Fetch the residuals from a linear regression y ~ x.

Parameters
----------
x : Array
    regressor
y : Array
    response
addconstant : Bool
    true if constant should be added to x

Returns
-------
res : Array
    of fitted residuals
"""
function ols_resid(x::Array, y::Array; addconstant::Bool=false)

    # Copy x and y into R's global environment
    @rput x y

    # lm() includes an intercept by default; "-1" removes it
    formula = string("mod <- lm(y ~ x", addconstant ? "" : "-1", ")")

    reval(formula)

    reval("res <- resid(mod)")

    # Copy the residuals back to Julia (res also stays defined in R)
    @rget res

    return convert(Array{Float64, 1}, res)
end

ols_resid
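For comparison, the same residuals can be sketched without R by solving the least-squares problem directly. This is only an illustrative counterpart under the assumption of a single regressor; the name `ols_resid_native` is hypothetical and not part of the assignment.

```julia
using LinearAlgebra, Statistics

# Hypothetical pure-Julia counterpart to an R-based residual function.
function ols_resid_native(x::AbstractVector{<:Real}, y::AbstractVector{<:Real}; addconstant::Bool=false)
    # Design matrix: optional column of ones for the intercept, then the regressor.
    X = addconstant ? hcat(ones(length(x)), x) : reshape(float.(x), :, 1)
    b = X \ y            # least-squares coefficients
    return y .- X * b    # fitted residuals
end

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = 2.5 .- 3 .* xs .+ [0.1, -0.2, 0.0, 0.2, -0.1]
r = ols_resid_native(xs, ys; addconstant=true)
```

With a constant included, the residuals sum to zero and are orthogonal to the regressor, which is a quick sanity check for either implementation.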

# Task 5 (15 points)

Create a third variable y3 as follows:

y3 = 2.5 - 3*y1 + y2


and use your ols_resid function to fetch the residuals from regressing y3 on y1 (with a constant). Compute their summary statistics using the summary function.

In [86]:
y3 = 2.5 .- 3 .*y1 .+ y2;

@rput y3;

In [89]:
ols_resid(y1, y3; addconstant=true);
# ols_resid leaves `res` defined in R, so it can be summarised there directly
R"summary(res)"

RObject{RealSxp}
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-38.8619  -0.9665   0.1432   0.0000   1.2258  15.3024

# Task 6 (25 points)

This exercise is for ambitious students who would like to receive a high mark.

K-means clustering is one of the most commonly used unsupervised machine learning algorithms for partitioning a given dataset into a set of k groups (i.e., k clusters), where k represents the number of groups pre-specified by the analyst. It classifies objects into multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.
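The description above can be sketched as a single run of Lloyd's algorithm, the standard iteration behind k-means: assign every point to its nearest centroid, then move each centroid to the mean of its points, and repeat until the assignments stop changing. This is a minimal illustration in plain Julia, not the routine R's kmeans uses by default; the name `lloyd_kmeans` and its arguments are hypothetical.

```julia
using Random, Statistics

# Hypothetical helper: one run of Lloyd's algorithm on the rows of X.
function lloyd_kmeans(X::AbstractMatrix{<:Real}, k::Integer;
                      maxiter::Integer=100, rng::AbstractRNG=Random.default_rng())
    n = size(X, 1)
    # Initialise the centroids with k distinct, randomly chosen observations.
    centers = float.(X[randperm(rng, n)[1:k], :])
    labels = zeros(Int, n)
    for _ in 1:maxiter
        # Assignment step: nearest centroid by squared Euclidean distance.
        newlabels = [argmin([sum(abs2, X[i, :] .- centers[j, :]) for j in 1:k]) for i in 1:n]
        newlabels == labels && break    # converged: assignments unchanged
        labels = newlabels
        # Update step: each centroid becomes the mean of its assigned points.
        for j in 1:k
            members = findall(==(j), labels)
            isempty(members) || (centers[j, :] = vec(mean(X[members, :], dims=1)))
        end
    end
    return labels, centers
end

# Two well-separated groups of three points each.
pts = [0.0 0.0; 0.1 0.0; 0.0 0.1; 10.0 10.0; 10.1 10.0; 10.0 10.1]
labels, centers = lloyd_kmeans(pts, 2; rng=MersenneTwister(1))
```

A single run can end in a poor local optimum when the initial centroids are unlucky, which is why implementations usually repeat the procedure from several random starts and keep the best result.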

Run a k-means clustering analysis on the built-in R dataset USArrests. This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percentage of the population living in urban areas. Set the number of clusters to 3 (centers=3) and use 25 random starting configurations (nstart=25) for k-means (feel free to experiment with other values). Scale the input data to have zero mean and unit standard deviation. Use the fviz_cluster function from the R package factoextra to visualise the result, i.e. plot the clusters. The following link is useful for how to use RCall.
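The scaling step matters because k-means relies on Euclidean distances, so a variable measured on a larger scale would otherwise dominate the clustering. As a sketch of what standardisation does, here is a plain-Julia version that centres each column and divides by its sample standard deviation (n - 1 denominator). The helper name `standardize` and the small matrix are made up for illustration.

```julia
using Statistics

# Hypothetical helper: centre each column and divide by its sample standard deviation.
standardize(X::AbstractMatrix{<:Real}) = (X .- mean(X, dims=1)) ./ std(X, dims=1)

M = [1.0 10.0; 2.0 20.0; 3.0 60.0; 6.0 30.0]
Z = standardize(M)
```

After this transformation every column of `Z` has mean zero and standard deviation one, so all variables contribute on a comparable scale.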

Hint: Part of this exercise is to install and load the factoextra package in R.

R"options(repos='http://cran.rstudio.com/')"
"install R package"

In [102]:
# some essential setting for installation to be possible
R"options(repos='http://cran.rstudio.com/')"
R"install.packages('factoextra',lib='../../myRpackages/')"

RObject{NilSxp} NULL
In [119]:
R"library(datasets)"
R"scdata <- scale(USArrests)"

R"colMeans(scdata,dims=1)"

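# assumes factoextra was installed into '../../myRpackages/' as above,
# so that directory is added to R's library search path before loading
R".libPaths(c('../../myRpackages/', .libPaths()))"
R"library(factoextra)"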