Project: Chi Hern Ang - JuliaClass_Fall2020

Path: ProblemSets_BLANK/PS07/PS07_Ang_Pizio_Accountancy_MBF.ipynb

Views: ¹⁹⁸

Kernel: Julia 1.2

Load Packages

In [1]:

using Pkg
# Pkg.add("RCall")
using RCall
include("./printmat.jl")

using Plots
# gr(size=(600,400))
#pyplot(size=(600,400))    
gr(size=(480,320))
default(fmt = :svg)

Calling R from Julia

R is the leading programming language when it comes to statistics and econometrics. Its power emerges from thousands of well documented packages oftentimes written by professors and leading researchers themselves. When time is scarce to code lines of GARCH estimation code or systems of VAR equations, a good option is to transfer data to R and let it do the computations. The (registered) package needed is RCall. There are several ways of interacting with R in Julia, all are more or less equally straightforward. We will stick to treating R as a black box: put in the data, press a button of required label and extract the result. Full and easy-to-understand documentation is available here.

Task 1 (10 points)

Using R from Julia, simulate two sequences (100 values each) following Cauchy distribution with location = 0 and scale = 1. Transport them to Julia and assign them to variables y1 and y2. The documentation on Cauchy distribution in R is here.

In [2]:

R"y1 <- rcauchy(100, location = 0, scale = 1)";
R"y2 <- rcauchy(100, location = 0, scale = 1)";

In [17]:

@rget y1 y2;

println("y1")
print(y1)
println("\ny2")
print(y2)

y1
[-0.40127448184851017, -0.8827806894415366, -2.630885626879231, -0.22787564323298118, 1.8696619799104872, 0.5611128834749841, 1.0064766997114047, 0.4415505195783929, 2.1396018741559297, 2.4502431598217917, 1.4237987715301852, 8.0061239768102, 0.4482553262705586, -1.0900655790920193, 9.242760595231688, 0.8396905468314516, 1.5309364832539532, -0.6840932631995894, 2.053594603809899, -1.3030266382445599, 0.16970050003133336, 3.908729400408724, -1.2284860956646613, 4.955058126819033, -0.37966360784696945, -1.1109886050068953, 0.8295033310925066, -0.4601945278565793, 1.9839964505551624, 0.9705125478483894, 5.327440922103983, -4.4219383542285655, 4.733822765732337, -0.41104021951223224, 0.777517259364673, 0.006233009965834879, -1.8898014435748318, -1.8434869990608191, -0.3426983756084228, 3.15281122383142, 1.2163816270934296, 0.8124791941057143, 1.5009734077040522, -0.35571476246077366, -3.304445958881701, 0.09118775312742781, 3.629689265301339, 0.0837028556332276, -4.3923416236843345, 0.21269028892365704, -8.773989127995385, 1.8630250092725957, 0.40577690813580497, 0.8594285113217158, 0.007435287218386363, -1.0961442669444508, 1.9735353032999567, -6.765891983128649, 0.9299340704967338, 112.31396269760704, 0.30976011166216877, 2.232124339650933, 0.5902102073737298, 0.9547688033341911, -11.692525071159958, -0.4971060611206508, 8.59718785250857, -1.8988412447854124, 1.5767729377571855, -69.6299914208395, 0.3595933162744196, -4.237728569116978, -0.4745511570431522, 1.2699606580939127, -0.04385275382838605, 0.6271257405974169, -0.37867027653446356, -1.065384971984369, -0.027794463006654613, 2.3580720152073815, -0.7212617034000289, -0.08066046578357954, -0.619510035597776, 1.2817110666393354, -0.13874899956406325, -1.2474287599685443, 0.1729188426649283, 0.19833143228015365, 2.717188248989831, -0.552814234178762, 1.4367602420405503, -1.4237272927091496, -6.635570675632419, 3.471097608407259, -0.011330515450909757, 1.7636067589865712, -0.37778040124019713, 1.1243783657675066, 3.1135738828907136, 0.5639370077525172]
y2
[-4.599200460053147, 0.9465131349280903, -2.092506104308654, 1.636180911374073, -0.5781354280907853, 0.1120121933216801, -0.22563097377294622, -17.023432780613202, 3.8001411344065166, -0.02625066496062029, -2.773710750030968, 4.03453359663476, 17.453006854396023, 7.449995127584029, -1.584114505431959, -6.79294935716999, 0.6031829290043467, -0.8449551244190561, -1.3097017440519856, 1.3808781794527387, 0.19892802330304393, -0.16094555132692856, -0.037970656308170314, -1.6343354526293625, -0.7918280572437556, 1.5489470821310192, -0.02221817199197333, 0.45453575423066794, 0.18747068990997684, 0.8816645855330318, 0.9895347827305399, -0.6408498636056578, 1.380421542417587, -0.14638733343761173, -0.5224054230502674, 0.12476735993607693, -1.3687206002144903, -2.3086316325213563, -0.39708487825428884, 3.6214491547992105, 1.4580989179352981, -2.8555485187540155, 8.19741758393516, 8.099343216243405, 1.052499803897709, -0.7104045413099429, -1.1511984189901017, 0.733249216329992, 2.049677238822567, 0.15025754732798302, 0.044062393449126006, 0.159677522419271, 3.5190288771994704, -0.24259018970748644, -0.21756785760898253, 0.5544959861542452, 0.6373061081134824, -0.9387512297650141, -0.5067049756661499, -1.7506159561053183, -2.04726744472221, -1.1290896419590006, -30.475189600896215, 0.7139615367704037, -12.706087030408192, 0.7447074216871294, -1.4389311836995526, 14.688872021326242, 1.7650568978368044, -0.16921108373880944, 0.9305785969298231, 0.5859333134085813, -0.40764424246283176, 1.5828305402891847, 2.091783014652448, -1.268666005491125, -1.8940153166413332, -0.6507316363040037, -2.4433359794646092, -0.6953928830322529, 0.32811026260119924, -0.9882660216036473, 2.0270568978150347, 0.6497774741431493, -0.41292210658345313, -10.144199990693938, 0.13272318730165786, -0.5772461973361056, 5.905746513675395, -0.17232408516484604, -0.3536187012675405, 1.2631555478949645, -0.8799040779978813, 2.3671446445239273, 0.933239372376048, -4.8939731069718855, -0.5216441689995831, -1.6143818266020193, 0.5441468573728053, 1.983162624581192]

Task 2 (10 points)

Compute the summary statistics of the two sequences with the summary function. Print the output.

In [18]:

R"summary(y1)"

RObject{RealSxp}
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-69.6300  -0.5695   0.2612   0.7370   1.5085 112.3140 

In [19]:

R"summary(y2)"

RObject{RealSxp}
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-30.47519  -0.95113  -0.03211  -0.17442   0.95727  17.45301 

Task 3 (10 points)

Create a Q-Q plot of y1 vs. the normal distribution using function qqnorm. In general, a Q-Q plot compares the quantiles of two distribution against each other. If the distribution are equal, all quantiles have to lie on the 45 degree line.

In [6]:

R"qqnorm(y1, ylim=c(-2,2),xlim=c(-2,2))"
R"abline(a = 0, b = 1)"

RObject{NilSxp}
NULL

Task 4 (30 points)

Write a simple function that returns the fitted residuals from a linear regression y ~ x. In R, having estimated a model with

mod = lm(y ~ x),

residuals can be extracted as

res = resid(mod)

Follow the sketch below.

"""
Fetch the residuals from a linear regression y ~ x.

Parameters
----------
x : Array
    regressor
y : Array
    response
addconstant : Bool
    true if constant should be added to `x`

Returns
-------
res : Array
    of fitted residuals
"""
function ols_resid(x::Array, y::Array; addconstant::Bool=false)

    return
    
end

In [7]:

"""
Fetch the residuals from a linear regression y ~ x.

Parameters
----------
x : Array
    regressor
y : Array
    response
addconstant : Bool
    true if constant should be added to `x`

Returns
-------
res : Array
    of fitted residuals
"""
function ols_resid(x::Array, y::Array; addconstant::Bool=false)
 
    @rput x y
   
    formula = string("mod <- lm(y ~ x", addconstant ? "" : "-1", ")")
    
    reval(formula)
    
    reval("res <- resid(mod)")
    
    @rget res
    
    return convert(Array{Float64, 1}, res)
end

ols_resid

Task 5 (15 points)

Create a third variable y3 as follows:

y3 = 2.5 - 3*y1 + y2

and use your ols_resid function to fetch the residuals from regressing y3 on y1 (with a constant). Compute their summary statistics using the summary function.

In [8]:

y3 = 2.5 .- 3 .*y1 .+ y2;

@rput y3;

In [9]:

ols_resid(y1,y3;addconstant=true);
R"summary(res)"

RObject{RealSxp}
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-30.3009  -0.7820   0.1422   0.0000   1.1316  17.6272 

Task 6 (25 points)

This excercise is for ambitious students who would like to receive a high mark.

K-means clustering is one of the most commonly used unsupervised machine learning algorithm for partitioning a given dataset into a set of k groups (i.e. k clusters), where k represents the number of groups pre-specified by the analyst. It classifies objects in multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e, centroid) which corresponds to the mean of points assigned to the cluster.

Run k-means clustering analysis on the built in R dataset USArrests. This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas. Set the number of clusters equal to 3 (centers=3) and use 25 (nstart=25) as a starting value for k-means (feel free to experiment with other starting values). Scale the input data to have zero mean and standard deviaiton of unity. Use the fviz_cluster function from the Rpackage factoextra to visualise the result i.e. plot the clusters. The following link is useful for how to use RCall.

Hint: Part of this exercise is to install and load the factoextra package in R.

R"options(repos='http://cran.rstudio.com/')"
"install R package"
"load R package"

In [10]:

# some essential setting for installation to be possible
R"options(repos='http://cran.rstudio.com/')"
R"install.packages('factoextra',lib='../../myRpackages/')"

* installing *source* package ‘factoextra’ ...
** package ‘factoextra’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (factoextra)
┌ Warning: RCall.jl: trying URL 'http://cran.rstudio.com/src/contrib/factoextra_1.0.5.tar.gz'
│ Content type 'application/x-gzip' length 1390733 bytes (1.3 MB)
│ ==================================================
│ downloaded 1.3 MB
│ 
│ 
│ The downloaded source packages are in
│ 	‘/tmp/RtmpAs56hH/downloaded_packages’
└ @ RCall /home/user/.julia/packages/RCall/g7dhB/src/io.jl:113

RObject{NilSxp}
NULL

In [11]:

R"library(datasets)"
R"scdata <- scale(USArrests)"
R"head(scdata)"

RObject{RealSxp}
               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
Colorado   0.02571456 0.3988593  0.8608085  1.864967207

In [12]:

R"colMeans(scdata,dims=1)"

RObject{RealSxp}
       Murder       Assault      UrbanPop          Rape 
-7.667478e-17  1.111611e-16 -4.332645e-16  8.938163e-17 

In [14]:

R"library(factoextra)"
R"km.res <- kmeans(scdata, 3, iter.max = 10, nstart = 25)"
R"fviz_cluster(km.res, USArrests)"

RObject{VecSxp}

In [0]: