Load Packages
Calling R from Julia
R is one of the most widely used programming languages for statistics and econometrics. Its power comes from thousands of well-documented packages, often written by professors and leading researchers themselves. When there is no time to hand-code a GARCH estimation or a system of VAR equations, a good option is to transfer the data to R and let it do the computations. The (registered) package needed is RCall. There are several ways of interacting with R from Julia, all more or less equally straightforward. We will stick to treating R as a black box: put in the data, press the button with the required label, and extract the result. Full and easy-to-understand documentation is available here.
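The black-box workflow described above can be sketched as follows. This is a minimal illustration, assuming R is installed and RCall has been added to the Julia environment; the variable names x, m and s are made up for the example. It uses the @rput and @rget macros, the R"..." string macro and rcopy from RCall.

```julia
using RCall  # assumes a working R installation that RCall can find

x = randn(50)        # data created in Julia (illustrative only)

@rput x              # put in the data: copy x into the R session
R"m <- mean(x)"      # press the button: run R code on it
@rget m              # extract the result: copy m back into Julia

# Alternatively, interpolate the Julia variable with $ and convert
# the returned R object to a Julia value in one step.
s = rcopy(R"sd($x)")

println("mean = ", m, ", sd = ", s)
```

Both routes move data across the language boundary by copying, so for small vectors like these the choice between them is mostly a matter of taste.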
Task 1 (10 points)
Using R from Julia, simulate two sequences (100 values each) following a Cauchy distribution with location = 0 and scale = 1. Transport them to Julia and assign them to the variables y1 and y2. The documentation on the Cauchy distribution in R is here.
Task 2 (10 points)
Compute the summary statistics of the two sequences with the summary function. Print the output.
Task 3 (10 points)
Create a Q-Q plot of y1 vs. the normal distribution using the function qqnorm. In general, a Q-Q plot compares the quantiles of two distributions against each other. If the distributions are equal, all quantiles lie on the 45-degree line.
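To make the quantile comparison concrete, here is a small sketch in plain Julia (no R involved) that computes the points a Q-Q plot would show. The samples a, b and c and the probability grid p are invented for illustration; this is not the qqnorm function itself.

```julia
using Random, Statistics

Random.seed!(1)
a = randn(1000)        # sample from N(0, 1)
b = randn(1000)        # second sample from the same distribution
c = 3 .* randn(1000)   # same shape, but three times the scale

p = 0.05:0.05:0.95     # probability levels at which quantiles are compared

qa = quantile(a, p)
qb = quantile(b, p)
qc = quantile(c, p)

# Plotting qb against qa gives points close to the 45-degree line.
# Plotting qc against qa still gives a roughly straight line, but a steeper one.
for (u, v, w) in zip(qa, qb, qc)
    println(round(u, digits = 2), "  ", round(v, digits = 2), "  ", round(w, digits = 2))
end
```

A heavy-tailed sample, such as one drawn from a Cauchy distribution, would instead bend away from the line at both ends.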
Task 4 (30 points)
Write a simple function that returns the fitted residuals from a linear regression y ~ x. In R, having estimated a model with
residuals can be extracted as
Follow the sketch below.
Task 5 (15 points)
Create a third variable y3 as follows:
and use your ols_resid function to fetch the residuals from regressing y3 on y1 (with a constant). Compute their summary statistics using the summary function.
Task 6 (25 points)
This exercise is for ambitious students who would like to receive a high mark.
K-means clustering is one of the most commonly used unsupervised machine learning algorithms for partitioning a given dataset into a set of k groups (i.e., k clusters), where k is the number of groups pre-specified by the analyst. It classifies objects into multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.
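The idea of assigning points to the nearest centroid and then moving each centroid to the mean of its points can be sketched in a few lines of plain Julia. This is a minimal Lloyd-style illustration on made-up data, not the algorithm R's kmeans function uses by default. The function name simple_kmeans and its init_rows argument (the rows used as starting centroids) are hypothetical and exist only for this example.

```julia
using Random, Statistics

# Minimal Lloyd-style k-means sketch. X holds one observation per row;
# init_rows lists the rows of X used as the initial centroids.
function simple_kmeans(X::AbstractMatrix{<:Real}, init_rows::AbstractVector{<:Integer};
                       iters::Integer = 100)
    n, k = size(X, 1), length(init_rows)
    centers = float.(X[init_rows, :])
    labels = zeros(Int, n)
    for _ in 1:iters
        # Assignment step: each point joins its nearest centroid.
        new_labels = [argmin([sum(abs2, X[i, :] .- centers[j, :]) for j in 1:k]) for i in 1:n]
        new_labels == labels && break   # stop once assignments no longer change
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in 1:k
            members = findall(==(j), labels)
            isempty(members) || (centers[j, :] = vec(mean(X[members, :], dims = 1)))
        end
    end
    return labels, centers
end

Random.seed!(42)
# Two well-separated synthetic groups in the plane (invented data).
X = vcat(randn(20, 2), randn(20, 2) .+ 10)
labels, centers = simple_kmeans(X, [1, 21])   # one starting row from each group
println(labels)
println(centers)
```

The result of this procedure depends on where the centroids start, which is why implementations commonly repeat it from several random starting configurations and keep the best partition.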
Run a k-means clustering analysis on the built-in R dataset USArrests. This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percentage of the population living in urban areas. Set the number of clusters to 3 (centers=3) and use 25 random starting configurations (nstart=25) for k-means (feel free to experiment with other values). Scale the input data to have zero mean and unit standard deviation. Use the fviz_cluster function from the R package factoextra to visualise the result, i.e. plot the clusters. The following link is useful for how to use RCall.
Hint: Part of this exercise is to install and load the factoextra package in R.