
a research paper about FDSLRM modeling with supplementary materials - software, notebooks

Project: fdslrm
Kernel: R (R-Project)

Authors: Andrej Gajdoš, Martina Hančová, Jozef Hanč
Faculty of Science, P. J. Šafárik University in Košice, Slovakia
emails: [email protected], [email protected]


FDSLRM applications - Cyber attacks

Weekly cyber attacks against honeynet

Table of Contents

To get back to the contents, use the Home key.


Data and model

Data description

In this FDSLRM application we model the time series data set representing the total weekly number of cyber attacks against a honeynet. Data were collected from November 2014 to May 2016 in the CZ.NIC honeynet consisting of Kippo honeypots in medium-interaction mode. The number of time series observations is $n=72$; the corresponding plot with more details is shown in the following section Modeling. The data was adapted from Sokol, 2018.

Model description

The cyber attacks data can be successfully fitted by the FDSLRM of the form:

$$X(t)=\beta_1+\beta_2\cos\left(\tfrac{2\pi t\cdot 3}{72}\right)+\beta_3\sin\left(\tfrac{2\pi t\cdot 3}{72}\right)+\beta_4\sin\left(\tfrac{2\pi t\cdot 4}{72}\right)+Y_1\sin\left(\tfrac{2\pi t\cdot 6}{72}\right)+Y_2\sin\left(\tfrac{2\pi t\cdot 7}{72}\right)+w(t), \quad t\in \mathbb{N},$$

where

  • $\boldsymbol{\beta}=(\beta_1,\,\beta_2,\,\beta_3,\,\beta_4)' \in \mathbb{R}^4$ is a vector of real regression coefficients,

  • $\mathbf{Y} = (Y_1,Y_2)' \sim \mathcal{N}_2(\boldsymbol{0}, \mathrm{D})$ is an unobservable Gaussian random vector with zero mean vector and covariance matrix $\mathrm{D} = \begin{pmatrix} \sigma_1^2 & 0\\ 0 & \sigma_2^2 \end{pmatrix}$,

  • $w(t) \sim \mathrm{iid}\; \mathcal{N}(0, \sigma_0^2)$ is Gaussian iid noise with variance $\sigma_0^2$,

  • $\boldsymbol{\nu}= (\sigma_0^2, \sigma_1^2, \sigma_2^2) \in \mathbb{R}_{+}^3$ is a vector of real nonnegative variance-covariance parameters.
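With $\mathbf{Y}$ and $w(t)$ uncorrelated (the standard FDSLRM assumption), the model implies that the observations are Gaussian with mean equal to the deterministic trend above and covariance function

$$\mathrm{Cov}\left(X(t), X(s)\right) = \sigma_1^2 \sin\left(\tfrac{2\pi t\cdot 6}{72}\right)\sin\left(\tfrac{2\pi s\cdot 6}{72}\right) + \sigma_2^2 \sin\left(\tfrac{2\pi t\cdot 7}{72}\right)\sin\left(\tfrac{2\pi s\cdot 7}{72}\right) + \sigma_0^2\,\delta_{ts},$$

where $\delta_{ts}$ is the Kronecker delta.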

We identified this most parsimonious structure of the FDSLRM using an iterative process of model building and selection based on exploratory tools of spectral analysis (Gajdoš et al., 2017; Brockwell & Davis, 2006) and residual diagnostics (see sections Modeling and Residual diagnostics).

Estimating the model parameters

During the modelling, we obtained:

  • estimates $\boldsymbol{\nu}^*=(0.059, 0.024, 0.0139)'$ of variance-covariance components $\boldsymbol{\nu}$ by residual maximum likelihood (REML),

  • estimates $\boldsymbol{\beta}^*=(7.683, -0.145, 0.201, -0.130)'$ of regression parameters $\boldsymbol{\beta}$ in the trend by weighted least squares,

  • predictions $\mathbf{Y}^*=(0.149, -0.111)'$ of the random vector $\mathbf{Y}$ using the best linear unbiased predictor (BLUP) procedure.

The methods and procedures for FDSLRM are described in more detail in Štulajter 2002, 2003; Gajdoš et al. 2018. Estimated values of $\boldsymbol{\beta}, \boldsymbol{\nu}$ and predictions of $\mathbf{Y}$ are also presented in the form of tables in Fitting summary.
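In the matrix notation $\mathbf{X}=\mathrm{F}\boldsymbol{\beta}+\mathrm{V}\mathbf{Y}+\boldsymbol{w}$ introduced in section Residual diagnostics, these quantities follow the standard LMM formulas, with the REML estimate $\boldsymbol{\nu}^*$ plugged into $\mathrm{D}$ and $\sigma_0^2$:

$$\Sigma = \mathrm{V}\mathrm{D}\mathrm{V}' + \sigma_0^2\,\mathrm{I}, \qquad \boldsymbol{\beta}^* = \left(\mathrm{F}'\Sigma^{-1}\mathrm{F}\right)^{-1}\mathrm{F}'\Sigma^{-1}\mathbf{X}, \qquad \mathbf{Y}^* = \mathrm{D}\mathrm{V}'\Sigma^{-1}\left(\mathbf{X}-\mathrm{F}\boldsymbol{\beta}^*\right).$$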

Computational software

As for numerical calculations, we conducted our computations in the R statistical computing language (https://www.r-project.org; R Development Core Team, 2018) with the key library nlme (Pinheiro et al., 2018; Galecki & Burzykowski, 2013), R functions for LMM programmed by Singer (Singer et al., 2017), and R functions for FDSLRM programmed by the authors of this Jupyter notebook and included in the fdslrm package. The complete list of used R libraries is included in Session info.


Modeling

Remark.
The mean value (or FDSLRM trend) component $m(t)$ of our FDSLRM is a real function of the form:

  • the linear regression (LR) - the linear combination of deterministic real functions $f_1(t)=1$, $f_2(t)=\cos\left(\tfrac{2\pi t\cdot 3}{72}\right)$, $f_3(t)=\sin\left(\tfrac{2\pi t\cdot 3}{72}\right)$, $f_4(t)=\sin\left(\tfrac{2\pi t\cdot 4}{72}\right)$ with real amplitudes $\beta_1, \beta_2, \beta_3, \beta_4$: $$m(t) = \sum\limits_{i=1}^{4}f_i(t)\beta_i, \quad t \in \mathbb{N}.$$

The random errors (or FDSLRM errors) component $\varepsilon(t)$ of our FDSLRM is the zero-mean time series consisting of:

  • finite discrete spectrum errors (FDS errors) - the linear combination of deterministic real functions $v_1(t)=\sin\left(\tfrac{2\pi t\cdot 6}{72}\right)$, $v_2(t)=\sin\left(\tfrac{2\pi t\cdot 7}{72}\right)$ with mutually uncorrelated random amplitudes $Y_1, Y_2$ and

  • white noise errors (WN errors) - $\mathrm{wn}$ (or iid) noise $w(t)$: $$\varepsilon(t) = \sum\limits_{j=1}^{2}v_j(t)Y_j + w(t), \quad t \in \mathbb{N}.$$
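The following minimal sketch in base R illustrates this decomposition by simulating one trajectory of the model; the numerical values are the (rounded) estimates reported later in Fitting summary, and all object names are illustrative only.

# Sketch: simulate one trajectory of the fitted FDSLRM (illustrative values)
set.seed(1)
n <- 72
t <- 1:n

# trend m(t) = sum_i f_i(t) * beta_i with the WLS estimates of beta
beta <- c(7.683, -0.145, 0.201, -0.130)
Fmat <- cbind(1, cos(2*pi*t*3/n), sin(2*pi*t*3/n), sin(2*pi*t*4/n))
m <- as.vector(Fmat %*% beta)

# error eps(t) = v_1(t)*Y_1 + v_2(t)*Y_2 + w(t) with the REML variance estimates
Vmat <- cbind(sin(2*pi*t*6/n), sin(2*pi*t*7/n))
Y <- rnorm(2, mean = 0, sd = sqrt(c(0.024, 0.014)))   # FDS random effects
w <- rnorm(n, mean = 0, sd = sqrt(0.059))             # white noise
x_sim <- m + as.vector(Vmat %*% Y) + w

plot(t, x_sim, type = "o", xlab = "weeks", ylab = "simulated log(attacks)")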

Loading R functions and packages

A brief help on all applied R functions and packages designed to work with FDSLRM is in the section Appendix.

library(fdslrm)
initialFDSLRM()

Data plot

The cyber attacks data set was adapted from Sokol, 2018, where a more detailed data description can be found.

# Loading data
dt <- read.csv2("Cyberattacks.csv", header = FALSE, sep = ",")
dt <- as.numeric(as.vector(dt[-1,]))
t <- 1:length(dt)

# IPython setting for output
options(repr.plot.res=120, repr.plot.height=4.5, repr.plot.width=6.5)

# Plotting data
plot(t, dt, type = "o", xlab = "weeks", ylab = "attacks")
Image in a Jupyter notebook

Due to the large variability in the data, we consider a logarithmic transformation of the data.

dt_log <- log(dt)
plot(t, dt_log, type = "o", xlab = "weeks", ylab = "attacks")
Image in a Jupyter notebook

Spectral analysis - Periodogram

Each stationary time series can be decomposed into sines and cosines with different frequencies and random amplitudes. To identify significant (Fourier) frequencies, we apply a spectral time series exploratory tool called the periodogram (more details in Gajdoš et al., 2017).

# periodogram
periodo <- spec.pgram(dt_log, log="no")
Image in a Jupyter notebook

The six most significant frequencies according to the spectrum values in the periodogram.

drawTable(type = "periodogram", periodogram = periodo)
Frequencies by spectrum

| | | | | | | |
|---|---|---|---|---|---|---|
| spectrum | 0.9899895 | 0.5162030 | 0.4981809 | 0.3356683 | 0.2388390 | 0.2073755 |
| frequency (raw) | 0.0416667 | 0.0833333 | 0.0555556 | 0.0972222 | 0.1388889 | 0.1111111 |

The raw frequencies from the periodogram can be easily rewritten to the corresponding Fourier frequencies in the standard form $2\pi k/n$, where $k$ is the frequency order and $n$ is the number of time series observations (or a very close integer number); e.g. the first raw frequency $0.0416667$ can be expressed as $0.0416667=3/72$, which corresponds to the Fourier frequency $2\pi \cdot 3/72 = \pi/12$, or the last (sixth) significant frequency: $0.1111111\cdot 72=8 \Rightarrow 2\pi\cdot 8/72=2\pi/9$.

# orders k for Fourier frequencies
print(round(72*c(0.0416667, 0.0833333, 0.0555556, 0.0972222, 0.1388889, 0.1111111)))
[1] 3 6 4 7 10 8
fnames <- c("3/72", "6/72", "4/72", "7/72", "10/72", "8/72")
drawTable(type = "periodogram", periodogram = periodo, frequencies = fnames)
Frequencies by spectrum

| | | | | | | |
|---|---|---|---|---|---|---|
| spectrum | 0.9899895 | 0.5162030 | 0.4981809 | 0.3356683 | 0.2388390 | 0.2073755 |
| frequency (raw) | 0.04166667 | 0.08333333 | 0.05555556 | 0.09722222 | 0.13888889 | 0.11111111 |
| frequency | 3/72 | 6/72 | 4/72 | 7/72 | 10/72 | 8/72 |

Residual diagnostics

Our FDSLRM for the cyber attacks data can be rewritten in matrix form as a linear mixed model (LMM):

$$\mathbf{X}=\mathrm{F}\boldsymbol{\beta}+\mathrm{V}\mathbf{Y}+\boldsymbol{w},$$

where

  • $\mathbf{X} = (X(1), X(2), \ldots, X(72))'$ is a vector of the time series observations,

  • $\boldsymbol{w} = (w(1), w(2), \ldots, w(72))'$ is a random vector of corresponding iid (or white) noise values,

  • model design matrices $\mathrm{F}, \mathrm{V}$ for our final model have the following structure:
$$\mathrm{F} = \begin{pmatrix} 1 & \cos\left(\tfrac{2\pi\cdot 3}{72}\right) & \sin\left(\tfrac{2\pi\cdot 3}{72}\right) & \sin\left(\tfrac{2\pi\cdot 4}{72}\right) \\ 1 & \cos\left(\tfrac{2\pi\cdot 6}{72}\right) & \sin\left(\tfrac{2\pi\cdot 6}{72}\right) & \sin\left(\tfrac{2\pi\cdot 8}{72}\right) \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \cos\left(\tfrac{2\pi\cdot 216}{72}\right) & \sin\left(\tfrac{2\pi\cdot 216}{72}\right) & \sin\left(\tfrac{2\pi\cdot 288}{72}\right) \end{pmatrix}, \qquad \mathrm{V} = \begin{pmatrix} \sin\left(\tfrac{2\pi\cdot 6}{72}\right) & \sin\left(\tfrac{2\pi\cdot 7}{72}\right) \\ \sin\left(\tfrac{2\pi\cdot 12}{72}\right) & \sin\left(\tfrac{2\pi\cdot 14}{72}\right) \\ \vdots & \vdots \\ \sin\left(\tfrac{2\pi\cdot 432}{72}\right) & \sin\left(\tfrac{2\pi\cdot 504}{72}\right) \end{pmatrix}.$$

This fundamental FDSLRM property allows us to apply many results and mathematical techniques of the LMM methodology. In LMM terminology, $\boldsymbol{\beta}$ represents the vector of fixed effects, and the random component depends on the vector $\mathbf{Y}$ of random effects and the vector $\boldsymbol{w}$ of random errors. From the viewpoint of LMM residual analysis, we have to consider three types of residuals:

  • marginal residuals (FDSLRM residuals): $\mathbf{X}-\mathrm{F}\boldsymbol{\beta}^*$,

  • random effects residuals (FDS residuals): $\mathrm{V}\mathbf{Y}^*$,

  • conditional residuals (WN residuals): $\mathbf{X}-\mathrm{F}\boldsymbol{\beta}^*-\mathrm{V}\mathbf{Y}^*$.
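A minimal sketch of these three residual types in base R, assuming the design matrices $\mathrm{F}, \mathrm{V}$ above and the rounded estimates $\boldsymbol{\beta}^*, \mathbf{Y}^*$ from Fitting summary (all object names are illustrative; dt_log is the log-transformed series loaded earlier):

# Sketch: the three LMM residual types for the fitted FDSLRM (illustrative)
n <- 72
t <- 1:n
Fmat <- cbind(1, cos(2*pi*t*3/n), sin(2*pi*t*3/n), sin(2*pi*t*4/n))
Vmat <- cbind(sin(2*pi*t*6/n), sin(2*pi*t*7/n))
beta_star <- c(7.683, -0.145, 0.201, -0.130)   # WLS estimates of beta
Y_star    <- c(0.149, -0.111)                  # BLUP predictions of Y

marginal_resid    <- dt_log - as.vector(Fmat %*% beta_star)   # FDSLRM residuals
fds_resid         <- as.vector(Vmat %*% Y_star)               # FDS residuals
conditional_resid <- marginal_resid - fds_resid               # WN residuals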

How was the final form of $\mathrm{F}$ and $\mathrm{V}$ found?

Due to the LMM structure, we can apply graphical (exploratory) tools and quantitative tests of LMM residual diagnostics (Singer et al., 2017) for FDSLRM observations (see the next subsection Graphical tools). The most suitable form of $\mathrm{F}$ and $\mathrm{V}$ is found by an iterative process (see rules and steps explained in the next remark) of applying the mentioned tools, whose results are summarized in the following table. The two most adequate and parsimonious structures of the FDSLRM (2b, 3b) consist of two low frequencies $(2\pi\cdot 3/72, 2\pi\cdot 4/72)$ and two higher ones $(2\pi\cdot 6/72, 2\pi\cdot 7/72)$. Since the difference in AIC, BIC for both models is relatively small, our final choice is model 2b thanks to the generally smaller mean squared error in predictions (Hančová, 2007).

The columns give the graphical diagnostic tools (L, O1, H, O2, ACF, PACF, N1, N2, N3), the numerical diagnostic tests (normality, independence) and the frequencies (raw) included in the model.

| Iteration | L | O1 | H | O2 | ACF | PACF | N1 | N2 | N3 | Normality test | Independence test | 3/72 | 4/72 | 6/72 | 7/72 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ? | ? | ✓ | ✓ | × | T,1,1 | T,1,1 | R,1,1 | - |
| 2a. | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ? | ✓ | ✓ | ✓ | T,1,1 | T,1,1 | R,1,1 | R,1,1 |
| 2b. | ? | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | T,1,1 | T,0,1 | R,0,1 | R,0,1 |
| 3a. | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ? | ? | ✓ | ✓ | ✓ | T,1,1 | T,1,1 | T,1,1 | R,1,1 |
| 3b. | ? | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | T,1,1 | T,0,1 | T,0,1 | R,0,1 |

*Coding: letter (T or R) - presence of the frequency in the trend or random component; the first number (1 or 0) - presence of the frequency in the cos term; the second number (1 or 0) - in the sin term

Key rules and steps of the iterative econometric FDSLRM-building

General rules of the FDSLRM-building for econometric data (Štulajter, 2002; Box et al., 2016; Gajdoš et al., 2017):

  • the model should be parsimonious (in the form of a simple dependence with a small number of the parameters),

  • the model should create small unexplained (random) deviations (small variances, covariances or mean squared errors),

  • the model should include lower frequencies in the trend component, and higher ones in the random component.

Key steps of the FDSLRM-building for econometric data (our experience):

  • follow the scheme of the Box-Jenkins iterative, three-stage time series model-building approach: formulation (identification), estimation (fit), diagnostic (checking),

  • the more apparent the periodic or seasonal patterns in the data, the smaller the number of iterations needed in the model building,

  • start with at least the three most significant frequencies in the model,

  • use the graphical-tools diagnostic matrix and numerical tests (see below) to check the adequacy of the model,

  • remove or add other significant frequencies and check the model adequacy,

  • remove cos or sin term with a particular frequency and check the model adequacy,

  • very small or very big variance parameter estimates indicate a reason to remove the corresponding cos or sin term with a particular frequency in the random component,

  • very small predictions of random effects also indicate an exclusion of the corresponding term in the random component,

  • use information criteria AIC, BIC to choose between competing adequate models.

Graphical (exploratory) tools

In the LMM residual analysis for FDSLRM we can use the following matrix of graphical exploratory tools (plots) as diagnostic tools for all FDSLRM assumptions. A brief description of the tools is in the Appendix.

Graphical-tools diagnostic matrix

| | | |
|---|---|---|
| linearity of fixed effects (L): stand. marg. residuals vs marg. fitted values | outlying observations (O1): stand. marg. residuals vs times | independence of cond. errors (ACF): ACF of cond. residuals |
| homoscedasticity of cond. errors (H): stand. cond. residuals vs cond. predictions | outlying observations (O2): stand. cond. residuals vs times | independence of cond. errors (PACF): PACF of cond. residuals |
| normality of cond. errors (N1): histogram of cond. residuals | normality of cond. errors (N2): histogram of stand. least conf. residuals | normality of cond. errors (N3): stand. least conf. residuals vs $\mathcal{N}(0,1)$ quantiles |

We present the residual diagnostics results for the final model (2b).

# Fitting the final FDSLRM
output <- fitDiagFDSLRM(dt_log, t, c(3/72, 4/72), include_fixed_eff = c(1,1,0,1),
                        freq_random = c(6/72, 7/72), include_random_eff = c(0,1,0,1))

options(repr.plot.res=600, repr.plot.height=9, repr.plot.width=10)
drawDiagPlots("all", output)
Image in a Jupyter notebook

Single panels for diagnostics

Our function drawDiagPlots() also allows showing any of the diagnostic plots above in a single panel. Here we show two additional plots (not included in the graphical-tools diagnostic matrix).

  • plot: cumulative periodogram of conditional residuals - detection of periodic nonrandomness in conditional residuals

  • plot: standardized marginal residuals vs marginal fitted values - test of linearity of fixed effects

options(repr.plot.res=80, repr.plot.height=6, repr.plot.width=6)
drawDiagPlots(output$diagnostic_plots_names$CumulatPeriodogCondResid, output)
Image in a Jupyter notebook
options(repr.plot.res=100, repr.plot.height=5, repr.plot.width=7)
drawDiagPlots(output$diagnostic_plots_names$StdMarginalResidVsFittedValues, output)
Image in a Jupyter notebook

Numerical tests

Tests of residual independence

# output$Box_test
# output$BoxLjung_test
print(output$Box_test_lag10_resid)
print(output$BoxLjung_test_lag10_resid)

	Box-Pierce test

data:  resid(fit)
X-squared = 15.425, df = 10, p-value = 0.1173


	Box-Ljung test

data:  resid(fit)
X-squared = 16.36, df = 10, p-value = 0.08977
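For comparison, the same tests can be run directly with base R's Box.test from the stats package; conditional_resid stands for a vector of conditional (WN) residuals, e.g. the illustrative one sketched in section Residual diagnostics, so the numbers need not match the output above exactly.

# Sketch: Box-Pierce and Ljung-Box tests via base R (conditional_resid is illustrative)
Box.test(conditional_resid, lag = 10, type = "Box-Pierce")
Box.test(conditional_resid, lag = 10, type = "Ljung-Box")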

Test of residual normality

print(output$ShapiroWilk_test_norm_cond_resid)
print(output$ShapiroWilk_test_stand_least_conf_resid)

	Shapiro-Wilk normality test

data:  resid(fit, type = "normalized")
W = 0.97489, p-value = 0.1592


	Shapiro-Wilk normality test

data:  SingerEtAl_resid_diag$least.confounded.residuals
W = 0.96362, p-value = 0.04463
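Equivalently, base R's shapiro.test from the stats package can be applied directly to a residual vector (again, conditional_resid is the illustrative vector from section Residual diagnostics).

# Sketch: Shapiro-Wilk normality test via base R
shapiro.test(conditional_resid)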

Fitting summary

Parameter estimates

Estimates of regression coefficients

drawTable(type = "fixed", fixed_eff = output$fixed_effects)
| $\beta_{1}$ | $\beta_{2}$ | $\beta_{3}$ | $\beta_{4}$ |
|---|---|---|---|
| 7.683338 | -0.1448348 | 0.2014634 | -0.1296788 |

Predictions of random effects

drawTable(type = "random", random_eff = output$random_effects)
| $Y_{1}$ | $Y_{2}$ |
|---|---|
| 0.1492801 | -0.1112381 |

Estimates of variance parameters

drawTable(type = "variance", variances = c(output$error_variance, diag(output$rand_eff_variance)))
| $\sigma_{0}^2$ | $\sigma_{1}^2$ | $\sigma_{2}^2$ |
|---|---|---|
| 0.059342 | 0.0238263 | 0.0138469 |

Fit summary

Graphical summary for the final model (2b)

  • plot: time series observations (black), fitted values (blue), estimated trend (red) vs times

options(repr.plot.res=120, repr.plot.height=5, repr.plot.width=6.5)
drawDiagPlots(output$diagnostic_plots_names$FittedTimeSeries, output)
Image in a Jupyter notebook

Numerical summary for the final model (2b)

print(output$fit_summary)
Linear mixed-effects model fit by REML
 Data: d
       AIC      BIC    logLik
  34.91972 50.45628 -10.45986

Random effects:
 Formula: ~-1 + v1 + v2 | g
 Structure: Diagonal
               v1       v2  Residual
StdDev: 0.1543577 0.117673 0.2436022

Fixed effects: as.formula(paste("x~", paste(names(d)[2:kk], collapse = "+")))
               Value  Std.Error DF   t-value p-value
(Intercept) 7.683337 0.02870879 68 267.63014  0.0000
f2         -0.144835 0.04060036 68  -3.56733  0.0007
f3          0.201463 0.04060036 68   4.96211  0.0000
f4         -0.129679 0.04060036 68  -3.19403  0.0021
 Correlation:
   (Intr) f2 f3
f2 0
f3 0      0
f4 0      0  0

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max
-2.5042072 -0.4728622  0.0421905  0.5706248  1.8761966

Number of Observations: 72
Number of Groups: 1

Numerical summary for model 3b

# AIC, BIC, logLik for model 3b
output3b <- fitDiagFDSLRM(dt_log, t, c(3/72, 4/72, 6/72), include_fixed_eff = c(1,1,0,1,0,1),
                          freq_random = c(7/72), include_random_eff = c(0,1))
print(output3b$fit_summary)
Linear mixed-effects model fit by REML
 Data: d
       AIC      BIC    logLik
  35.75192 51.18476 -10.87596

Random effects:
 Formula: ~-1 + v1 | g
              v1  Residual
StdDev: 0.117673 0.2436022

Fixed effects: as.formula(paste("x~", paste(names(d)[2:kk], collapse = "+")))
               Value  Std.Error DF   t-value p-value
(Intercept) 7.683337 0.02870879 67 267.63014  0.0000
f2         -0.144835 0.04060036 67  -3.56733  0.0007
f3          0.201463 0.04060036 67   4.96211  0.0000
f4         -0.129679 0.04060036 67  -3.19403  0.0021
f5          0.159608 0.04060036 67   3.93119  0.0002
 Correlation:
   (Intr) f2 f3 f4
f2 0
f3 0      0
f4 0      0  0
f5 0      0  0  0

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-2.46749126 -0.49793962  0.04298845  0.54554736  1.89739460

Number of Observations: 72
Number of Groups: 1

Session info

print(sessionInfo())
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] CVXR_0.99-4      pracma_2.2.5     gnm_1.1-0        sommer_3.9.5
 [5] crayon_1.3.4     lattice_0.20-35  matrixcalc_1.0-3 fpp2_2.3
 [9] expsmooth_2.3    fma_2.3          ggplot2_3.1.1    forecast_8.5
[13] nlme_3.1-137     car_3.0-0        carData_3.0-2    Matrix_1.2-14
[17] MASS_7.3-50      IRdisplay_0.5.0  kableExtra_0.9.0 fdslrm_0.1.0

loaded via a namespace (and not attached):
 [1] xts_0.11-2           scs_1.2-3            bit64_0.9-7
 [4] httr_1.3.1           rprojroot_1.3-2      repr_1.0.0
 [7] tools_3.4.4          backports_1.1.1      R6_2.4.0
[10] mgcv_1.8-24          lazyeval_0.2.2       colorspace_1.4-1
[13] nnet_7.3-12          withr_2.1.2          tidyselect_0.2.5
[16] bit_1.1-14           curl_3.3             compiler_3.4.4
[19] rvest_0.3.2          Cairo_1.5-9          xml2_1.2.0
[22] tseries_0.10-45      scales_0.5.0         lmtest_0.9-37
[25] fracdiff_1.4-2       readr_1.3.1          quadprog_1.5-7
[28] pbdZMQ_0.3-3         stringr_1.4.0        digest_0.6.18
[31] relimp_1.0-5         foreign_0.8-69       rmarkdown_1.10
[34] R.utils_2.8.0        rio_0.5.10           base64enc_0.1-3
[37] pkgconfig_2.0.2      htmltools_0.3.6      highr_0.8
[40] rlang_0.3.4          readxl_1.0.0         TTR_0.23-3
[43] rstudioapi_0.10      quantmod_0.4-13      zoo_1.8-5
[46] jsonlite_1.6         dplyr_0.8.0.1        zip_2.0.1
[49] R.oo_1.22.0          magrittr_1.5         qvcalc_1.0.0
[52] ECOSolveR_0.5.2      Rcpp_1.0.1           IRkernel_0.8.12.9000
[55] munsell_0.5.0        abind_1.4-5          R.methodsS3_1.7.1
[58] stringi_1.4.3        plyr_1.8.4           grid_3.4.4
[61] parallel_3.4.4       forcats_0.4.0        haven_2.1.0
[64] hms_0.4.2            knitr_1.21           pillar_1.3.1
[67] uuid_0.1-2           urca_1.3-0           glue_1.3.1
[70] evaluate_0.13        getPass_0.2-2        data.table_1.12.2
[73] cellranger_1.1.0     gtable_0.3.0         purrr_0.3.2
[76] assertthat_0.2.1     xfun_0.6             openxlsx_4.1.0
[79] Rmpfr_0.7-2          viridisLite_0.3.0    timeDate_3043.102
[82] tibble_2.1.1         gmp_0.5-13.5

References

This notebook belongs to the supplementary materials of the paper submitted to Statistical Papers and available at https://arxiv.org/abs/1905.07771.

Abstract of the paper

We propose a two-stage estimation method of variance components in time series models known as FDSLRMs, whose observations can be described by a linear mixed model (LMM). We based estimating variances, fundamental quantities in a time series forecasting approach called kriging, on the empirical (plug-in) best linear unbiased predictions of unobservable random components in FDSLRM.

The method, providing invariant non-negative quadratic estimators, can be used for any absolutely continuous probability distribution of time series data. As a result of applying the convex optimization and the LMM methodology, we resolved two problems - theoretical existence and equivalence between least squares estimators, non-negative (M)DOOLSE, and maximum likelihood estimators, (RE)MLE, as possible starting points of our method and a practical lack of computational implementation for FDSLRM. As for computing (RE)MLE in the case of $n$ observed time series values, we also discovered a new algorithm of order $\mathcal{O}(n)$, which at the default precision is $10^7$ times more accurate and $n^2$ times faster than the best current Python (or R)-based computational packages, namely CVXPY, CVXR, nlme, sommer and mixed.

We illustrate our results on three real data sets - electricity consumption, tourism and cyber security - which are easily available, reproducible, sharable and modifiable in the form of interactive Jupyter notebooks.


Appendix - Tools, R functions

A brief help on applied diagnostic tools and R functions.

Graphical and numerical tools

  • standardized marginal residuals vs marginal fitted values: residuals not randomly distributed, an obvious pattern of dependency (systematic trend) → assumption of linearity of fixed effects violated → model rejected.

  • standardized conditional residuals vs predicted values: residuals not randomly distributed, increase in variance of residuals → assumption of homoscedasticity violated → model rejected.

  • standardized least confounded residuals vs $\mathcal{N}(0,1)$ quantiles: points do not lie close to the straight line, many points (more than 5%) lie out of confidence bounds → assumption of normality of conditional errors violated → model rejected.

  • standardized marginal residuals vs observation indices: some points are extremely far away from the majority of points → outliers detected.

  • standardized conditional residuals vs observation indices: some points are extremely far away from the majority of points → outliers detected.

  • autocorrelation function of conditional residuals: more than 5% of values cross the empirical bounds → independence of conditional errors violated → reject model.

  • partial autocorrelation function of conditional residuals: more than 5% of values cross the empirical bounds → independence of conditional errors violated → reject model.

  • histogram of conditional residuals: histogram does not approximately look like a Gaussian distribution → normality of conditional errors violated → reject model.

  • histogram of standardized least confounded residuals: histogram does not approximately look like a Gaussian distribution → normality of conditional errors violated → reject model.

As pointed out in Nobre and Singer, 2007, following Hilden-Minton, 1995, a residual is said to be confounded for a specific type of error if it also depends on errors different from those that it is supposed to predict. In linear mixed models, conditional residuals and the BLUP are confounded (Nobre and Singer, 2007). This implies, for example, that estimated conditional residuals may not be adequate to check for normality of conditional errors, since when random effects are grossly non-normal, estimated conditional residuals may not present a normal behavior even when the conditional errors are normal (Nobre and Singer, 2007, Section 4). Following the suggestion of Hilden-Minton, 1995, we consider standardized conditional least confounded residuals, obtained as linear combinations of the standardized conditional residuals that minimize the proportion of their variance due to the random effects.

  • cumulative periodogram of conditional residuals: cumulative periodogram shows strong systematic deviations from the straight line connecting point $[0,0]$ with point $[0.5,1]$ → reject model.
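Outside of drawDiagPlots(), essentially the same plot can be drawn with base R's cpgram from the stats package applied to the conditional residuals (conditional_resid is the illustrative vector from section Residual diagnostics).

# Sketch: cumulative periodogram of conditional residuals via base R
cpgram(conditional_resid, main = "Cumulative periodogram of conditional residuals")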

R functions

Packages - fdslrm, stats, repr

The R package fdslrm has been developed by the authors of this notebook and serves the purpose of modeling time series.

fdslrm: Time series analysis and forecasting using LMM

  • Purpose: R package for modeling and prediction of time series using linear mixed models.

  • Version: 0.1.0, 2019

  • Depends: kableExtra, IRdisplay, MASS, Matrix, car, nlme, stats, forecast, fpp2, matrixcalc, sommer, gnm, pracma, CVXR

  • Maintainer: Andrej Gajdoš

  • Authors: Andrej Gajdoš, Jozef Hanč, Martina Hančová

  • URL: https://github.com/fdslrm/R-package

  • Installation: Run the Jupyter notebook 00 installation fdslrm.ipynb once before the first run of any R-based Jupyter notebook.

The spec.pgram() function from the base R stats package produces a periodogram - an estimate of the spectral density of a time series.

The options(repr.plot.res=.., repr.plot.height=.., repr.plot.width=..) call from the repr package sets parameters for resizing R plots in Jupyter notebooks.

Authors' source code - R functions

The initialFDSLRM() function loads essential R packages (nlme, kableExtra, IRdisplay, MASS, Matrix, car, stats, ...) for time series analysis and visualization (it also installs any missing package; in that case it can take several minutes on a standard computer). Moreover, the function loads the authors' R functions designed to work with FDSLRM and some of Singer's R functions for LMM (Singer et al. 2017), modified by the authors of this Jupyter notebook.

The fitDiagFDSLRM() function fits the given FDSLRM and subsequently conducts the residual diagnostics. In particular, it computes all estimates of the unknown parameters and all the statistical tests of residuals (an annotated example call follows the parameter list below). The basic input parameters of this function are

  • x - time series (vector),

  • times - vector,

  • freq_mean - frequencies in trend (vector),

  • poly_trend_degree - trend polynomial degree, default is zero,

  • include_fixed_eff - vector of ones and zeros specifying whether the $\cos$ and $\sin$ component of the corresponding frequency is included in the trend or not,

  • freq_random - frequencies in random component (vector),

  • include_random_eff - vector of ones and zeros specifying whether the $\cos$ and $\sin$ component of the corresponding frequency is included in the random component or not,

  • season_period - number specifying seasonality (periodicity), default value is zero.
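For reference, the call used earlier for the final model (2b) maps onto these parameters as follows; the reading of the 0/1 vectors in the comments is our interpretation of the fitted model structure.

# final model (2b): trend with frequency 3/72 (cos and sin) and 4/72 (sin only),
# random component with frequencies 6/72 and 7/72 (sin only in both)
output <- fitDiagFDSLRM(dt_log, t, freq_mean = c(3/72, 4/72),
                        include_fixed_eff = c(1, 1, 0, 1),
                        freq_random = c(6/72, 7/72),
                        include_random_eff = c(0, 1, 0, 1))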

The drawDiagPlots() function draws the diagnostic plots of residuals. This function has two inputs:

  • plots_names - name of a particular graph;
    alternatively, the user can specify the input parameter plots_names = "all" and the matrix of all diagnostic plots will be displayed,

  • fit_diag_fdslrm_output - output of the function fitDiagFDSLRM().

The drawTable() function creates a table of significant frequencies. The basic parameters allow drawing tables of frequencies from the periodogram and of model parameters from the fitting.