
a research paper about FDSLRM modeling with supplementary materials - software, notebooks

Project: fdslrm
Kernel: R (R-Project)

Authors: Andrej Gajdoš, Jozef Hanč, Martina Hančová
Faculty of Science, P. J. Šafárik University in Košice, Slovakia
email: [email protected]


EBLUP-NE for cyber attacks

R-based computational tools - CVXR

Table of Contents

To get back to the contents, use the Home key.

R packages

CVXR: R-Embedded Object-Oriented Modeling Language for Convex Optimization

  • Purpose: R package for solving convex optimization tasks

  • Version: 0.99-5, 2019

  • URL: https://cvxr.rbind.io/

  • Computational parameters of CVXR:

  • solver - the convex optimization solver (ECOS, OSQP, or SCS) chosen according to the given problem

      • OSQP for convex quadratic problems
          • max_iter - maximum number of iterations (default: 4000)
          • eps_abs - absolute accuracy (default: 1e-3)
          • eps_rel - relative accuracy (default: 1e-4)
      • ECOS for convex second-order cone problems
          • maxit - maximum number of iterations (default: 100)
          • abstol - absolute accuracy (default: 1e-8)
          • reltol - relative accuracy (default: 1e-8)
          • feastol - tolerance for feasibility conditions (default: 1e-8)
          • abstol_inacc - absolute accuracy for an inaccurate solution (default: 5e-5)
          • reltol_inacc - relative accuracy for an inaccurate solution (default: 5e-5)
          • feastol_inacc - tolerance for feasibility conditions for an inaccurate solution (default: 1e-4)
      • SCS for large-scale convex cone problems
          • max_iters - maximum number of iterations (default: 5000)
          • eps - convergence tolerance (default: 1e-5)
          • alpha - relaxation parameter (default: 1.5)
          • scale - factor by which the data is rescaled, only used if normalize is TRUE (default: 1.0)
          • normalize - whether the heuristic data rescaling should be used (default: TRUE)
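A minimal sketch (not from the original notebook) of how such options can be set: CVXR forwards solver-specific arguments as named parameters of solve(); the argument names below follow the parameter list above and are an assumption based on CVXR's documented usage.

# a minimal sketch: choosing a solver and tightening its tolerances in CVXR
# (assumes solver options are forwarded as named arguments of solve())
library(CVXR)

y <- Variable(2)
toy <- Problem(Minimize(sum_squares(y - c(1, 2))), list(y >= 0))

sol_default <- solve(toy)  # solver chosen automatically by CVXR
sol_ecos <- solve(toy, solver = "ECOS", abstol = 1e-10, reltol = 1e-10)

cat("default:", as.vector(sol_default$getValue(y)), "\n")
cat("ECOS:   ", as.vector(sol_ecos$getValue(y)), "\n")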

library(CVXR)
Attaching package: ‘CVXR’

The following object is masked from ‘package:stats’:

    power

Data and Model

This FDSLRM application describes a real time series data set representing the total weekly number of cyber attacks against a honeynet, an unconventional tool which mimics real systems connected to the Internet, such as business or school intranets, in order to study the methods, tools and goals of cyber attackers.

The data, taken from Sokol, 2017, were collected from November 2014 to May 2016 in the CZ.NIC honeynet consisting of Kippo honeypots in medium-interaction mode. The number of time series observations is n = 72.

The suitable FDSLRM, after a preliminary logarithmic transformation of the data, Z(t) = \log X(t), is Gaussian orthogonal:

\begin{array}{rl}
Z(t) = & \beta_1+\beta_2\cos\left(\tfrac{2\pi t\cdot 3}{72}\right)+\beta_3\sin\left(\tfrac{2\pi t\cdot 3}{72}\right)+\beta_4\sin\left(\tfrac{2\pi t\cdot 4}{72}\right) + \\
 & Y_1\sin\left(\tfrac{2\pi t\cdot 6}{72}\right)+Y_2\sin\left(\tfrac{2\pi t\cdot 7}{72}\right)+w(t), \quad t\in \mathbb{N},
\end{array}

where \boldsymbol{\beta}=(\beta_1,\,\beta_2,\,\beta_3,\,\beta_4)' \in \mathbb{R}^4, \mathbf{Y} = (Y_1,Y_2)' \sim \mathcal{N}_2(\boldsymbol{0}, \mathrm{D}), w(t) \sim \mathrm{iid}\, \mathcal{N}(0, \sigma_0^2), \boldsymbol{\nu}= (\sigma_0^2, \sigma_1^2, \sigma_2^2) \in \mathbb{R}_{+}^3.

We identified this most parsimonious structure of the FDSLRM using an iterative process of model building and selection based on exploratory tools of spectral analysis and residual diagnostics (for details see our Jupyter notebook cyberattacks.ipynb).

# data - time series observations
x <- read.csv2('cyberattacks.csv', header = FALSE, sep = ",")
x <- as.numeric(as.vector(x[-1,]))
x <- log(x)
t <- 1:length(x)
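As a quick visual check (an addition, not part of the original analysis), the log-transformed series can be plotted with base R graphics:

# quick look at the log-transformed weekly series (base R graphics)
plot(t, x, type = "l", xlab = "week", ylab = "log(number of attacks)")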
# auxiliary functions to create design matrices F, V and the projection matrix M_F
makeF <- function(times, freqs) {
  n <- length(times)
  f <- length(freqs)
  c1 <- matrix(1, n)
  F <- matrix(c1, n, 1)
  for (i in 1:f) {
    F <- cbind(F, cos(2 * pi * freqs[i] * times))
    F <- cbind(F, sin(2 * pi * freqs[i] * times))
  }
  return(F)
}

makeV <- function(times, freqs) {
  V <- makeF(times, freqs)
  V <- V[, -1]
  return(V)
}

makeM_F <- function(F) {
  n <- nrow(F)
  c1 <- rep(1, n)
  I <- diag(c1)
  M_F <- I - F %*% solve(t(F) %*% F) %*% t(F)
  return(M_F)
}
# model parameters
n <- 72
k <- 4
l <- 2

# model - design matrices F, V
F <- makeF(t, c(3/72, 4/72))
F <- F[, -4]
V <- makeV(t, c(6/72, 7/72))
V <- V[, -c(1, 3)]

# columns v_j of V and their squared norms ||v_j||^2
nv2 <- colSums(V ^ 2)
# auxiliary matrices and vectors
# Gram matrices GF, GV
GF <- t(F) %*% F
GV <- t(V) %*% V
InvGF <- solve(GF)
InvGV <- solve(GV)

# projectors PF, MF, PV, MV
In <- diag(n)
PF <- F %*% InvGF %*% t(F)
PV <- V %*% InvGV %*% t(V)
MF <- In - PF
MV <- In - PV

# residuals e
e <- MF %*% x
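Because the model is Gaussian orthogonal, the columns of F and V should be mutually orthogonal and GV should be diagonal. A short numerical sanity check (an addition, not part of the original notebook):

# sanity check of the orthogonal design: both quantities should be ~ 0
max(abs(t(F) %*% V))           # columns of F and V are mutually orthogonal
max(abs(GV - diag(diag(GV))))  # GV is (numerically) diagonal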

Natural estimators

ANALYTICALLY

using formula (4.1) from Hancova et al 2019

\breve{\nu}_0 = \dfrac{1}{n-k-l}\,\mathbf{e}'\mathrm{M_V}\,\mathbf{e}, \qquad \breve{\nu}_j = \dfrac{(\mathbf{e}'\mathbf{v}_j)^2}{||\mathbf{v}_j||^4}, \quad j = 1, \ldots, l

1st stage of EBLUP-NE

# auxiliary function to calculate the Euclidean norm of a vector
norm_vec <- function(x) sqrt(sum(x^2))
# NE according to formula (4.1)
NE0 <- 1/(n-k-l) * t(e) %*% MV %*% e
NEj <- (t(e) %*% V) ^ 2 / nv2 ^ 2
NE <- c(NE0, NEj)
print(NE)
print(norm_vec(NE))
[1] 0.05934201 0.02547468 0.01549533
[1] 0.06641189

CVXR

NE as a convex optimization problem

\begin{array}{ll}
\textit{minimize} & \quad f_0(\boldsymbol{\nu})=||\mathbf{e}\mathbf{e}' - \mathrm{V}\mathrm{D}\mathrm{V}'||^2+||\mathrm{M_V}\mathbf{e}\mathbf{e}'\mathrm{M_V}-\nu_0\mathrm{M_F}\mathrm{M_V}||^2 \\[6pt]
\textit{subject to} & \quad \boldsymbol{\nu} = \left(\nu_0, \ldots, \nu_l\right)'\in [0, \infty)^{l+1}
\end{array}

# the optimization variable and objective function
v <- Variable(l+1)
fv <- sum_squares(e %*% t(e) - V %*% diag(v[2:(l+1)]) %*% t(V)) +
      sum_squares(MV %*% e %*% t(e) %*% MV - v[1] %*% MF %*% MV)

# the optimization problem for NE
objective <- Minimize(fv)
constraints <- list(v >= 0)
prob <- Problem(objective, constraints)

# solve the NE problem
sol <- solve(prob)
cat("NEcvxr =", as.vector(sol$getValue(v)))
cat("\n")
cat("norm =", norm_vec(as.vector(sol$getValue(v))))
NEcvxr = 0.05934215 0.02547457 0.01549518
norm = 0.06641194

2nd stage of EBLUP-NE

using formula (3.10) from Hancova et al 2019.

\mathring{\nu}_j = \rho_j^2\, \breve{\nu}_j, \quad j = 0, 1, \ldots, l \\
\rho_0 = 1, \quad \rho_j = \dfrac{\hat{\nu}_j||\mathbf{v}_j||^2}{\hat{\nu}_0+\hat{\nu}_j||\mathbf{v}_j||^2}

where \boldsymbol{\breve{\nu}} are the NE and \boldsymbol{\hat{\nu}} are the initial estimates for EBLUP-NE.

# EBLUP-NE based on formula (3.9)
rho2 <- function(est) {
  result <- c(1)
  for (j in 2:length(est)) {
    result <- c(result, (est[j]*nv2[j-1] / (est[1] + est[j]*nv2[j-1])) ^ 2)
  }
  return(result)
}

EBLUPNE <- function(est) {
  result <- NE * rho2(est)
  return(result)
}
# numerical results
print(rho2(NE))
[1] 1.0000000 0.8821446 0.8169426
print(EBLUPNE(NE))
print(norm_vec(EBLUPNE(NE)))
[1] 0.05934201 0.02247235 0.01265880
[1] 0.06470492

NN-DOOLSE or MLE

using the result of the KKT algorithm (Tab. 3, Hancova et al 2019) from PY-estimation-cyberattacks-SciPy.ipynb

1st stage of EBLUP-NE

KKT algorithm

using the KKT algorithm (Tab. 3, Hancova et al 2019)


\qquad \mathbf{q} = \left(\begin{array}{c} \mathbf{e}' \mathbf{e}\\ (\mathbf{e}' \mathbf{v}_{1})^2 \\ \vdots \\ (\mathbf{e}' \mathbf{v}_{l})^2 \end{array}\right)

\qquad \mathrm{G} = \left(\begin{array}{ccccc} n^* & ||\mathbf{v}_{1}||^2 & ||\mathbf{v}_{2}||^2 & \ldots & ||\mathbf{v}_{l}||^2 \\ ||\mathbf{v}_{1}||^2 & ||\mathbf{v}_{1}||^4 & 0 & \ldots & 0 \\ ||\mathbf{v}_{2}||^2 & 0 & ||\mathbf{v}_{2}||^4 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ ||\mathbf{v}_{l}||^2 & 0 & 0 & \ldots & ||\mathbf{v}_{l}||^4 \end{array}\right)

NNMDOOLSE_kkt <- function(X, F, V, method = "NNDOOLSE") {
  n_star <- length(X)
  k <- ncol(F)
  l <- ncol(V)
  if (method == "NNMDOOLSE") {
    n_star <- n_star - k
  }
  MF <- makeM_F(F)
  # squared norms ||v_j||^2 and the KKT matrix G
  u <- diag(t(V) %*% V)
  G <- rbind(c(n_star, u), cbind(u, diag(u^2)))
  # all 2^l patterns b indicating which constraints are inactive
  b_comb <- expand.grid(rep(list(0:1), l))
  eps <- c(MF %*% X)
  q <- c(eps %*% eps, (eps %*% V)^2)
  K <- G
  s <- vector()
  for (i in 1:nrow(b_comb)) {
    b <- as.vector(unlist(b_comb[i,]))
    # modify K according to the pattern b
    for (j in 1:length(b)) {
      if (b[j] == 0) {
        K[1, j+1] <- 0
        K[j+1, j+1] <- -1
      }
    }
    K_inv <- solve(K)
    beta <- c(K_inv %*% q)
    if (all(beta >= 0)) {
      # accept the first pattern giving a nonnegative solution
      s <- beta[1]
      for (m in 1:l) {
        if (b[m] == 0) {
          s <- c(s, 0)
        } else {
          s <- c(s, beta[m+1])
        }
      }
      break
    } else {
      K <- G
    }
  }
  return(list("estimates" = s, "b" = b))
}
NN_DOOLSE <- NNMDOOLSE_kkt(x, F, V)
print(NN_DOOLSE$estimates)
print(norm_vec(NN_DOOLSE$estimates))
print(NN_DOOLSE$b)
[1] 0.05595104 0.02392048 0.01394113
[1] 0.06242647
[1] 1 1
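Since the KKT algorithm returned b = (1, 1), no nonnegativity constraint is active and the estimates coincide with the unconstrained solution G^{-1}q. A quick verification sketch (an addition reconstructing q and G from the objects defined above):

# with b = (1,1) no constraint is active, so the estimates solve G beta = q
u <- diag(GV)                             # ||v_j||^2
G <- rbind(c(n, u), cbind(u, diag(u^2)))  # KKT matrix with n* = n (NN-DOOLSE)
q <- c(t(e) %*% e, (t(e) %*% V)^2)
print(solve(G, q))                        # should reproduce NN_DOOLSE$estimates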

CVXR

nonnegative DOOLSE as a convex optimization problem

\begin{array}{ll}
\textit{minimize} & \quad f_0(\boldsymbol{\nu})=||\mathbf{e}\mathbf{e}'-\Sigma_{\boldsymbol{\nu}}||^2, \quad \Sigma_{\boldsymbol{\nu}} = \nu_0\mathrm{I}_n + \mathrm{V}\mathrm{D}\mathrm{V}' \\[6pt]
\textit{subject to} & \quad \boldsymbol{\nu} = \left(\nu_0, \ldots, \nu_l\right)'\in [0, \infty)^{l+1}
\end{array}

NNDOOLSE_CVXR <- function(X, F, V) {
  n <- length(X)
  k <- ncol(F)
  l <- ncol(V)
  I <- diag(n)
  MF <- makeM_F(F)
  MFV <- MF %*% V
  SX <- MF %*% X %*% t(X) %*% MF
  s <- Variable(l+1)
  p_obj <- Minimize(sum_squares(SX - (s[1] %*% I) - (V %*% diag(s[2:(l+1)]) %*% t(V))))
  constr <- list(s >= 0)
  prob <- Problem(p_obj, constr)
  sol <- solve(prob)
  return(as.vector(sol$getValue(s)))
}
NN_DOOLSEcvxr <- NNDOOLSE_CVXR(x, F, V)
print(NN_DOOLSEcvxr)
print(norm_vec(NN_DOOLSEcvxr))
[1] 0.05595096 0.02392044 0.01394110
[1] 0.06242637

CVXR

using equivalent (RE)MLE convex problem (proposition 5, Hancova et al 2019)

\begin{array}{ll}
\textit{minimize} & \quad f_0(\mathbf{d})=-(n^*\!-l)\ln d_0 - \displaystyle\sum\limits_{j=1}^{l} \ln\left(d_0-d_j||\mathbf{v}_j||^2\right)+d_0\mathbf{e}'\mathbf{e}-\mathbf{e}'\mathrm{V}\,\mathrm{diag}\{d_j\}\mathrm{V}'\mathbf{e} \\[6pt]
\textit{subject to} & \quad d_0 > \max\{d_j||\mathbf{v}_j||^2, j = 1, \ldots, l\} \\
 & \quad d_j \geq 0, \; j=1,\ldots, l \\
 & \quad\text{for MLE: } n^* = n, \text{ for REMLE: } n^* = n-k \\
\textit{back transformation:} & \quad \nu_0 = \dfrac{1}{d_0}, \quad \nu_j = \dfrac{d_j}{d_0\left(d_0 -d_j||\mathbf{v}_j||^2\right)}
\end{array}

MLE_CVXR <- function(X, F, V) {
  n <- length(X)
  l <- ncol(V)
  MF <- makeM_F(F)
  GV <- t(V) %*% V
  p <- n - l
  e <- as.vector(MF %*% X)
  ee <- as.numeric(t(e) %*% e)
  eV <- t(e) %*% V
  Ve <- t(V) %*% e
  d <- Variable(l+1)
  logdetS <- p * log(d[1]) + sum(log(d[1] - GV %*% d[2:(l+1)]))
  obj <- Maximize(logdetS - ((d[1] * ee) - (eV %*% diag(d[2:(l+1)]) %*% Ve)))
  constr <- list(d[2:(l+1)] >= 0, d[1] >= max_entries(GV %*% d[2:(l+1)]))
  p_MLE <- Problem(obj, constr)
  sol <- solve(p_MLE)
  # back transformation d -> nu
  s <- 1 / sol$getValue(d)[1]
  sj <- vector()
  for (j in 2:(l+1)) {
    sj <- c(sj, sol$getValue(d)[j] /
              (sol$getValue(d)[1] * (sol$getValue(d)[1] - sol$getValue(d)[j] * GV[j-1, j-1])))
  }
  result <- c(s, sj)
  return(as.vector(result))
}
MLEcvxr <- MLE_CVXR(x, F, V)
print(MLEcvxr)
print(norm_vec(MLEcvxr))
[1] 0.05595104 0.02392050 0.01394115
[1] 0.06242648

2nd stage of EBLUP-NE

# numerical results
print(rho2(NN_DOOLSE$estimates))
[1] 1.0000000 0.8817033 0.8094585
print(EBLUPNE(NN_DOOLSE$estimates))
print(norm_vec(EBLUPNE(NN_DOOLSE$estimates)))
[1] 0.05934201 0.02246111 0.01254283
[1] 0.06467842

NN-DOOLSE or REMLE

using the KKT algorithm (Tab. 3, Hancova et al 2019)

1st stage of EBLUP-NE

KKT algorithm

NN_MDOOLSE <- NNMDOOLSE_kkt(x, F, V, method = "NNMDOOLSE")
print(NN_MDOOLSE$estimates)
print(norm_vec(NN_MDOOLSE$estimates))
print(NN_MDOOLSE$b)
[1] 0.05934201 0.02382629 0.01384694
[1] 0.06542862
[1] 1 1

CVXR

nonnegative DOOLSE as a convex optimization problem

\begin{array}{ll}
\textit{minimize} & \quad f_0(\boldsymbol{\nu})=||\mathbf{e}\mathbf{e}'-\mathrm{M_F}\Sigma_{\boldsymbol{\nu}}\mathrm{M_F}||^2, \quad \Sigma_{\boldsymbol{\nu}} = \nu_0\mathrm{I}_n + \mathrm{V}\mathrm{D}\mathrm{V}' \\[6pt]
\textit{subject to} & \quad \boldsymbol{\nu} = \left(\nu_0, \ldots, \nu_l\right)'\in [0, \infty)^{l+1}
\end{array}

NNMDOOLSE_CVXR <- function(X, F, V) {
  n <- length(X)
  k <- ncol(F)
  l <- ncol(V)
  I <- diag(n)
  MF <- makeM_F(F)
  MFV <- MF %*% V
  SX <- MF %*% X %*% t(X) %*% MF
  s <- Variable(l+1)
  p_obj <- Minimize(sum_squares(SX - (s[1] %*% MF) - (MFV %*% diag(s[2:(l+1)]) %*% t(MFV))))
  constr <- list(s >= 0)
  prob <- Problem(p_obj, constr)
  sol <- solve(prob)
  return(as.vector(sol$getValue(s)))
}
NN_MDOOLSEcvxr <- NNMDOOLSE_CVXR(x, F, V)
print(NN_MDOOLSEcvxr)
print(norm_vec(NN_MDOOLSEcvxr))
[1] 0.05934192 0.02382624 0.01384690
[1] 0.06542851

CVXR

using equivalent (RE)MLE convex problem (proposition 5, Hancova et al 2019)

\begin{array}{ll}
\textit{minimize} & \quad f_0(\mathbf{d})=-(n^*\!-l)\ln d_0 - \displaystyle\sum\limits_{j=1}^{l} \ln\left(d_0-d_j||\mathbf{v}_j||^2\right)+d_0\mathbf{e}'\mathbf{e}-\mathbf{e}'\mathrm{V}\,\mathrm{diag}\{d_j\}\mathrm{V}'\mathbf{e} \\[6pt]
\textit{subject to} & \quad d_0 > \max\{d_j||\mathbf{v}_j||^2, j = 1, \ldots, l\} \\
 & \quad d_j \geq 0, \; j=1,\ldots, l \\
 & \quad\text{for MLE: } n^* = n, \text{ for REMLE: } n^* = n-k \\
\textit{back transformation:} & \quad \nu_0 = \dfrac{1}{d_0}, \quad \nu_j = \dfrac{d_j}{d_0\left(d_0 -d_j||\mathbf{v}_j||^2\right)}
\end{array}

REMLE_CVXR <- function(X, F, V) {
  n <- length(X)
  k <- ncol(F)
  l <- ncol(V)
  MF <- makeM_F(F)
  GV <- t(V) %*% V
  p <- n - l - k
  e <- as.vector(MF %*% X)
  ee <- as.numeric(t(e) %*% e)
  eV <- t(e) %*% V
  Ve <- t(V) %*% e
  d <- Variable(l+1)
  logdetS <- p * log(d[1]) + sum(log(d[1] - GV %*% d[2:(l+1)]))
  obj <- Maximize(logdetS - ((d[1] * ee) - (eV %*% diag(d[2:(l+1)]) %*% Ve)))
  constr <- list(d[2:(l+1)] >= 0, d[1] >= max_entries(GV %*% d[2:(l+1)]))
  p_remle <- Problem(obj, constr)
  sol <- solve(p_remle)
  # back transformation d -> nu
  s <- 1 / sol$getValue(d)[1]
  sj <- vector()
  for (j in 2:(l+1)) {
    sj <- c(sj, sol$getValue(d)[j] /
              (sol$getValue(d)[1] * (sol$getValue(d)[1] - sol$getValue(d)[j] * GV[j-1, j-1])))
  }
  result <- c(s, sj)
  return(as.vector(result))
}
REMLEcvxr <- REMLE_CVXR(x, F, V)
print(REMLEcvxr)
print(norm_vec(REMLEcvxr))
[1] 0.05934202 0.02382630 0.01384695
[1] 0.06542863

2nd stage of EBLUP-NE

# numerical results
print(rho2(NN_MDOOLSE$estimates))
[1] 1.0000000 0.8747730 0.7985572
print(EBLUPNE(NN_MDOOLSE$estimates))
print(norm_vec(EBLUPNE(NN_MDOOLSE$estimates)))
[1] 0.05934201 0.02228456 0.01237391
[1] 0.06458475
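To close, a small sketch (an addition, not part of the original notebook) collecting the first-stage estimates and the corresponding second-stage EBLUP-NE computed above; all variable names reuse objects already defined in this notebook:

# summary of the estimates computed in this notebook
summary_est <- rbind(
  "NE"                 = NE,
  "EBLUP-NE(NE)"       = EBLUPNE(NE),
  "NN-DOOLSE (MLE)"    = NN_DOOLSE$estimates,
  "EBLUP-NE(DOOLSE)"   = EBLUPNE(NN_DOOLSE$estimates),
  "NN-MDOOLSE (REMLE)" = NN_MDOOLSE$estimates,
  "EBLUP-NE(MDOOLSE)"  = EBLUPNE(NN_MDOOLSE$estimates)
)
colnames(summary_est) <- c("nu0", "nu1", "nu2")
print(round(summary_est, 5))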

References

This notebook belongs to the supplementary materials of the paper submitted to Statistical Papers and available at https://arxiv.org/abs/1905.07771.

Abstract of the paper

We propose a two-stage estimation method of variance components in time series models known as FDSLRMs, whose observations can be described by a linear mixed model (LMM). We base the estimation of variances, fundamental quantities in the time series forecasting approach called kriging, on the empirical (plug-in) best linear unbiased predictions of unobservable random components in FDSLRM.

The method, providing invariant non-negative quadratic estimators, can be used for any absolutely continuous probability distribution of time series data. As a result of applying the convex optimization and the LMM methodology, we resolved two problems - theoretical existence and equivalence between least squares estimators, non-negative (M)DOOLSE, and maximum likelihood estimators, (RE)MLE, as possible starting points of our method and a practical lack of computational implementation for FDSLRM. As for computing (RE)MLE in the case of n observed time series values, we also discovered a new algorithm of order \mathcal{O}(n), which at the default precision is 10^7 times more accurate and n^2 times faster than the best current Python (or R)-based computational packages, namely CVXPY, CVXR, nlme, sommer and mixed.

We illustrate our results on three real data sets - electricity consumption, tourism and cyber security - which are easily available, reproducible, sharable and modifiable in the form of interactive Jupyter notebooks.