Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutSign UpSign In
| Download

a research paper about FDSLRM modeling with supplementary materials - software, notebooks

Project: fdslrm
Views: 1041
Kernel: R (R-Project)

Authors: Andrej Gajdoš, Jozef Hanč, Martina Hančová
Faculty of Science, P. J. Šafárik University in Košice, Slovakia
email: [email protected]


MLE, REMLE for cyber attacks

R-based computational tools - fdslrm, nlme, MMEinR(mixed), sommer

Table of Contents

  • Data and model - data and model description of empirical data

  • MLE - computation by standard packages nlme, MMEinR(mixed) and fdslrm

  • REMLE - computation by standard package nlme, MMEinR(mixed), sommer and fdslrm

  • References

To get back to the table of contents, use the Home key.


Data and Model

This FDSLRM application describes the real time series data set representing total weekly number of cyber attacks against a honeynet -- an unconventional tool which mimics real systems connected to Internet like business or school computers intranets to study methods, tools and goals of cyber attackers.

Data, taken from from Sokol, 2017 were collected from November 2014 to May 2016 in CZ.NIC honeynet consisting of Kippo honeypots in medium-interaction mode. The number of time series observations is n=72n=72.

The suitable FDSLRM, after a preliminary logarithmic transformation of data Z(t)=logX(t)Z(t) = \log X(t), is Gaussian orthogonal:

Z(t) ⁣= ⁣β1+β2cos(2πt372)+β3sin(2πt372)+β4sin(2πt472)++Y1sin(2π t672)+Y2sin(2πt772)+w(t),tN, \begin{array}{rl} & Z(t) & \! = \! &\beta_1+\beta_2\cos\left(\tfrac{2\pi t\cdot 3}{72}\right)+\beta_3\sin\left(\tfrac{2\pi t\cdot 3}{72}\right)+\beta_4\sin\left(\tfrac{2\pi t\cdot 4}{72}\right) + \\ & & & +Y_1\sin\left(\tfrac{2\pi\ t\cdot 6}{72}\right)+Y_2\sin\left(\tfrac{2\pi\cdot t\cdot 7}{72}\right)+w(t), \, t\in \mathbb{N}, \end{array}

where β=(β1,β2,β3,β4)R4,Y=(Y1,Y2)N2(0,D),w(t)iidN(0,σ02),ν=(σ02,σ12,σ22)R+3\boldsymbol{\beta}=(\beta_1,\,\beta_2,\,\beta_3,\,\beta_4)' \in \mathbb{R}^4, \mathbf{Y} = (Y_1,Y_2)' \sim \mathcal{N}_2(\boldsymbol{0}, \mathrm{D}), w(t) \sim \mathcal{iid}\, \mathcal{N} (0, \sigma_0^2), \boldsymbol{\nu}= (\sigma_0^2, \sigma_1^2, \sigma_2^2) \in \mathbb{R}_{+}^3.

We identified the given and most parsimonious structure of the FDSLRM using an iterative process of the model building and selection based on exploratory tools of spectral analysis and residual diagnostics (for details see our Jupyter notebook cyberattacks.ipynb).

# data - time series observation x <- read.csv2('cyberattacks.csv', header = FALSE, sep = ",") x <- as.numeric(as.vector(x[-1,])) x <- log(x) t <- 1:length(x)
# auxilliary functions to create design matrices F, V and the projection matrix M_F makeF <- function(times, freqs) { n <- length(times) f <- length(freqs) c1 <- matrix(1, n) F <- matrix(c1, n, 1) for (i in 1:f) { F <- cbind(F, cos(2 * pi * freqs[i] * times)) F <- cbind(F, sin(2 * pi * freqs[i] * times)) } return(F) } makeV <- function(times, freqs) { V <- makeF(times, freqs) V <- V[, -1] return(V) } makeM_F <- function(F) { n <- nrow(F) c1 <- rep(1, n) I <- diag(c1) M_F <- I - F %*% solve((t(F) %*% F)) %*% t(F) return(M_F) }
# model parameters n <- 72 k <- 4 l <- 2 # model - design matrices F, V F <- makeF(t, c(3/72, 4/72)) F <- F[,-4] V <- makeV(t, c(6/72, 7/72)) V <- V[,-c(1,3)]

MLE

nlme: Linear and Nonlinear Mixed Effects Models

  • Purpose: R package to fit and compare Gaussian linear and nonlinear mixed-effects models

  • Version: 3.1-140, 2019

  • URL: https://CRAN.R-project.org/package=nlme]

  • Key tool: lme function for fitting LMM

  • Computational parameters of lme:

  • maxIter - maximum number of iterations (default: 50). * msMaxIter - maximum number of iterations for the optimization step (default: 50). * tolerance - tolerance for the convergence criterion (default: 1e-6). * niterEM - number of iterations for the EM algorithm used to refine the initial estimates of the random effects (default: 25). * msMaxEval - maximum number of evaluations of the objective function permitted for optimizer nlminb (default: 200). * msTol - tolerance for the convergence criterion on the first iteration when optim is used (default is 1e-7). * .relStep - relative step for numerical derivatives calculations. Default is .Machine$double.eps^(1/3) while .Machine$double.eps = 2.220446e-16. * opt - the optimizer to be used, either "nlminb" (the default) or "optim". * optimMethod - the optimization method, a version of quasi-Newton method to be used with the optim optimizer (default:"BFGS", alternative: "L-BFGS-B") * minAbsParApVar - numeric value - minimum absolute parameter value in the approximate variance calculation (default: 0.05).

library(nlme)
REMLE_nlme <- function(X, F, V, method) { g <- rep(1,length(X)) d <- data.frame(g,F,V,X) colnames(d) <- c("g","f1","f2","f3","f4","v1","v2","X") result <- NULL if(method == "ML") { result <- lme(fixed=X~f2+f3+f4,random=list(g=pdDiag(~-1+v1+v2)),data=d,method="ML") } else { result <- lme(fixed=X~f2+f3+f4,random=list(g=pdDiag(~-1+v1+v2)),data=d,method="REML") } return(as.vector(c(result$sigma^2, diag(getVarCov(result))))) }
# auxilliary function to calculate the norm of a vector norm_vec <- function(x) sqrt(sum(x^2))
MLEnlme <- REMLE_nlme(x, F, V, "ML") print(MLEnlme) norm_vec(MLEnlme)
[1] 0.05595104 0.02392048 0.01394113
[1] 0.06242647

fdslrm: Time series analysis and forecasting using LMM

  • Purpose: R package for modeling and prediction of time series using linear mixed models.

  • Version: 0.1.0, 2019

  • Depends: kableExtra, IRdisplay, MASS, Matrix, car, nlme, stats, forecast, fpp2, matrixcalc, sommer, gnm, pracma, CVXR

  • Maintainer: Andrej Gajdoš

  • Authors: Andrej Gajdoš, Jozef Hanč, Martina Hančová

  • URL: https://github.com/fdslrm/R-package

  • Installation: Run jupyter notebook 00 installation fdslrm.ipynb once before the first run of any R-based Jupyter notebook.

fdslrm (nlme)

function fitDiagFDSLRM via parameter "lme" implements lme function from nlme detailed explanat

library(fdslrm) initialFDSLRM()
Attaching package: ‘fdslrm’ The following objects are masked _by_ ‘.GlobalEnv’: makeF, makeM_F, makeV
# fit particular FDSLRM employing ML estimation of variances (by 'lme' function) MLEfitnlme <- fitDiagFDSLRM(x, t, freq_mean = c(3/72, 4/72), include_fixed_eff = c(1,1,0,1), freq_random = c(6/72, 7/72), include_random_eff = c(0,1,0,1), fit_function = "lme", var_estim_method = "ML")
print(as.vector(c(MLEfitnlme$error_variance, diag(MLEfitnlme$rand_eff_variance)))) norm_vec(as.vector(c(MLEfitnlme$error_variance, diag(MLEfitnlme$rand_eff_variance))))
[1] 0.05595104 0.02392048 0.01394113
[1] 0.06242647

MMEinR

R version of Witkovský's MATLAB function mixed https://www.mathworks.com/matlabcentral/fileexchange/200

  • Purpose: R function to estimate parameters of a linear mixed model (LMM) with a simple variance components structure

  • URL: https://github.com/fdslrm/MMEinR

  • Computational parameters of MMEinR (mixed):

  • tolerance for the convergence criterion (epss = 1e-8)

  • iterative method solving Henderson's mixed model equations

  • function return the estimates of variance parameters in a different order: (ν~1,,ν~l,ν~0)(\tilde{\nu}_1,\ldots,\tilde{\nu}_l,\tilde{\nu}_0)' in comparison with other tools

# load function 'MMEinR' devtools::source_url("https://github.com/fdslrm/MMEinR/blob/master/MMEinR.R?raw=TRUE")
SHA-1 hash of file is 36b3b4cd1a1f98c5fb36b541fc6b3956f6663aba
MLEmixed <- mixed(x, F, V, c(1,1), c(1,1,1), 1) print(MLEmixed$s2) print(norm_vec(MLEmixed$s2))
[1] 0.02392048 0.01394113 0.05595104 [1] 0.06242647

fdslrm (MMEinR)

function fitDiagFDSLRM via parameter "mixed" implements R version of MATLAB mixed function

# fit particular FDSLRM employing ML estimation of variances (by 'mixed' function) MLEfitmixed <- fitDiagFDSLRM(x, t, freq_mean = c(3/72, 4/72), include_fixed_eff = c(1,1,0,1), freq_random = c(6/72, 7/72), include_random_eff = c(0,1,0,1), fit_function = "mixed", var_estim_method = "ML")
print(as.vector(c(MLEfitmixed$error_variance, MLEfitmixed$rand_eff_variance))) norm_vec(as.vector(c(MLEfitmixed$error_variance, MLEfitmixed$rand_eff_variance)))
[1] 0.05595104 0.02392048 0.01394113
[1] 0.06242647

REMLE

nlme

REMLEnlme <- REMLE_nlme(x, F, V, "REML") print(REMLEnlme) norm_vec(REMLEnlme)
[1] 0.05934201 0.02382629 0.01384694
[1] 0.06542862

fdslrm (nlme)

# fit particular FDSLRM employing REML estimation of variances (by 'lme' function) REMLEfitnlme <- fitDiagFDSLRM(x, t, freq_mean = c(3/72, 4/72), include_fixed_eff = c(1,1,0,1), freq_random = c(6/72, 7/72), include_random_eff = c(0,1,0,1), fit_function = "lme", var_estim_method = "REML")
print(as.vector(c(REMLEfitnlme$error_variance, diag(REMLEfitnlme$rand_eff_variance)))) norm_vec(as.vector(c(REMLEfitnlme$error_variance, diag(REMLEfitnlme$rand_eff_variance))))
[1] 0.05934201 0.02382629 0.01384694
[1] 0.06542862

MMEinR

MLEmixed <- mixed(x, F, V, c(1,1), c(1,1,1), 2) print(MLEmixed$s2) print(norm_vec(MLEmixed$s2))
[1] 0.02382629 0.01384694 0.05934201 [1] 0.06542862

fdslrm (MMEinR)

# fit particular FDSLRM employing REML estimation of variances (by 'MMEinR (mixed)' function) REMLEfitmixed <- fitDiagFDSLRM(x, t, freq_mean = c(3/72, 4/72), include_fixed_eff = c(1,1,0,1), freq_random = c(6/72, 7/72), include_random_eff = c(0,1,0,1), fit_function = "mixed", var_estim_method = "REML")
print(as.vector(c(REMLEfitmixed$error_variance, REMLEfitmixed$rand_eff_variance))) norm_vec(as.vector(c(REMLEfitmixed$error_variance, REMLEfitmixed$rand_eff_variance)))
[1] 0.05934201 0.02382629 0.01384694
[1] 0.06542862

sommer: Solving Mixed Model Equations in R

  • Purpose: R package - structural multivariate-univariate linear mixed model solver for multiple random effects and estimation of unknown variance-covariance structures (i.e. heterogeneous and unstructured variance models). Designed for genomic prediction and genome wide association studies (GWAS), particularly focused in the pp > nn problem (more coefficients than observations) to include multiple known relationship matrices and estimate complex unknown covariance structures. Spatial models can be fitted using the two-dimensional spline functionality in sommer.

  • Version: 3.9.3, 2019

  • URL: https://CRAN.R-project.org/package=sommer

  • Key tool: mmer function for fitting

  • computational parameters of mmer:

  • iters - maximum number of iterations (default: 20). * tolpar - tolerance for the convergence criterion for the change in loglikelihood (default: 1e-03) * tolparinv - tolerance parameter for matrix inverse used when singularities are encountered (default:1e-06). * method - this refers to the method or algorithm to be used for estimating variance components (default: NR, direct-inversion Newton-Raphson method, alternative: AI, Average Information) * function return the estimates of variance parameters in a different order: (ν~1,,ν~l,ν~0)(\tilde{\nu}_1,\ldots,\tilde{\nu}_l,\tilde{\nu}_0)' in comparison with other tools

library(sommer)
REMLE_sommer <- function(X, F, V) { f1 <- F[,1] f2 <- F[,2] f3 <- F[,3] f4 <- F[,4] v1 <- V[,1] v2 <- V[,2] DT <- data.frame(X, f1, f2, f3, f4, v1, v2) suppressWarnings(result_new <- mmer(fixed = X~f2+f3+f4, random = ~ vs(ds(v1),1)+vs(ds(v2),1), data = DT, verbose = FALSE)) output <- as.vector(unlist(result_new$sigma)) return(output) }
REMLEsommer <- REMLE_sommer(x, F, V) print(REMLEsommer) print(norm_vec(REMLEsommer))
[1] 0.02382450 0.01384666 0.05934209 [1] 0.06542798

fdslrm (sommer)

function fitDiagFDSLRM via parameter "mmer" implements mmer function from sommer

# fit particular FDSLRM employing REML estimation of variances (by 'mmer' function) REMLEfitsommer <- suppressWarnings(fitDiagFDSLRM(x, t, freq_mean = c(3/72, 4/72), include_fixed_eff = c(1,1,0,1), freq_random = c(6/72, 7/72), include_random_eff = c(0,1,0,1), fit_function = "mmer", var_estim_method = "REML"))
fixed-effect model matrix is rank deficient so dropping 1 columns / coefficients
print(as.vector(c(REMLEfitsommer$error_variance, REMLEfitsommer$rand_eff_variance))) print(norm_vec(c(REMLEfitsommer$error_variance, REMLEfitsommer$rand_eff_variance)))
[1] 0.05934209 0.02382450 0.01384666 [1] 0.06542798

References

This notebook belongs to suplementary materials of the paper submitted to Statistical Papers and available at https://arxiv.org/abs/1905.07771.

Abstract of the paper

We propose a two-stage estimation method of variance components in time series models known as FDSLRMs, whose observations can be described by a linear mixed model (LMM). We based estimating variances, fundamental quantities in a time series forecasting approach called kriging, on the empirical (plug-in) best linear unbiased predictions of unobservable random components in FDSLRM.

The method, providing invariant non-negative quadratic estimators, can be used for any absolutely continuous probability distribution of time series data. As a result of applying the convex optimization and the LMM methodology, we resolved two problems - theoretical existence and equivalence between least squares estimators, non-negative (M)DOOLSE, and maximum likelihood estimators, (RE)MLE, as possible starting points of our method and a practical lack of computational implementation for FDSLRM. As for computing (RE)MLE in the case of n n observed time series values, we also discovered a new algorithm of order O(n)\mathcal{O}(n), which at the default precision is 10710^7 times more accurate and n2n^2 times faster than the best current Python(or R)-based computational packages, namely CVXPY, CVXR, nlme, sommer and mixed.

We illustrate our results on three real data sets - electricity consumption, tourism and cyber security - which are easily available, reproducible, sharable and modifiable in the form of interactive Jupyter notebooks.