a research paper about FDSLRM modeling with supplementary materials - software, notebooks
Authors: Andrej Gajdoš, Martina Hančová, Jozef Hanč
Faculty of Science, P. J. Šafárik University in Košice, Slovakia
emails: [email protected], [email protected]
FDSLRM applications - Cyber attacks
Weekly cyber attacks against honeynet
Table of Contents
Data and model - data and model description, estimating parameters, software
Modeling - loading R functions and packages, data plot, periodogram
Residual diagnostics - description of graphical tools, numerical tests
Fitting summary - estimated model parameters, fit summary
Session info - list of applied R packages in computations
References - list of detailed references for data and applied methods
Appendix - Tools, R functions - brief help on applied diagnostic tools, R functions
To get back to the contents, use the Home key.
Data and model
Data description
In this FDSLRM application we model the time series data set representing total weekly number of cyber attacks against honeynet. Data were collected from November 2014 to May 2016 in CZ.NIC honeynet consisting of Kippo honeypots in medium-interaction mode. The number of time series observations is , the correspoding plot with more details is shown in the following section Modeling. The data was adapted from Sokol, 2018.
Model description
The cyber attacks data can be succesfully fitted by the FDSLRM of the form:
where
is a vector of real regression coefficients,
is an unobservable Gaussian random vector with zero mean vector and covariance matrix ,
is Gaussian iid noise with variance ,
is a vector of real nonnegative variance-covariance parameters.
We identified the given and most parsimonious structure of the FDSLRM using an iterative process of the model building and selection based on exploratory tools of spectral analysis (Gajdoš et al., 2017; Brockwell & Davis, 2006) and residual diagnostics (see sections Modelling and Residual diagnostics).
Estimating the model parameters
During the modelling, we obtained:
estimates of variance-covariance components by residual maximum likelihood (REML),
estimates of regression parameters in trend by weighted least squares,
predictions of the random vector using the best linear unbiased predictor (BLUP) procedure.
The methods and procedures described for FDSLRM in more detail can be found in Štulajter 2002, 2003; Gajdoš et al. 2018. Estimated values of and predictions of are also presented in the form of tables in Fitting summary.
Computational software
As for numerical calculations, we conducted our computations in the R statistical computing language (https://www.r-project.org; R Development Core Team, 2018) with the key libraries nlme (Pinhero et al., 2018; Galecki & Burzykowski, 2013), R functions for LMM programmed by Singer (Singer et al., 2017) and R functions for FDSLRM programmed by authors of the Jupyter notebook included in fdslrm package. The complete list of used R libraries is included in Session info.
Remark.
Mean value (or FDSLRM trend) component of our FDSLRM is the real function in the form of:
the linear regression (LR) the linear combination of deterministic real functions with real amplitudes
Random errors (or FDSLRM errors) component of our FDSLRM is the zero mean value time series consisting of:
finite discrete spectrum errors (FDS errors) the linear combination of deterministic real functions with mutually uncorrelated random amplitudes and
white noise errors (WN errors) (or ) noise
Loading R functions and packages
A brief help on all applied R functions and packages designed to work with FDSLRM is in the section Appendix.
Data plot
The cyber attacks data set was adapted from Sokol, 2018. The more detailed data description can be found in Sokol, 2018.
Due to big variability in the data we consider the logarithmic transformation of data.
Spectral analysis - Periodogram
Each stationary time series can be decomposed into sines and cosines with different frequencies and random amplitudes. To identify significant (Fourier) frequencies, we apply a spectral time series exploratory tool called periodogram (more details in Gajdoš et al., 2017).
Six most significant frequencies according to values of spectrum in periodogram.
spectrum | 0.9899895 | 0.5162030 | 0.4981809 | 0.3356683 | 0.2388390 | 0.2073755 |
frequency (raw) | 0.0416667 | 0.0833333 | 0.0555556 | 0.0972222 | 0.1388889 | 0.1111111 |
The raw frequencies from periodogram can be easily rewritten to the corresponding Fourier frequencies in the standard form , where is the frequency order and is the number of time series observations (or a very close integer number); e.g. the first raw frequency can be expressed as , which corresponds to the Fourier frequency or the last (sixth) significant frequency: .
spectrum | 0.9899895 | 0.5162030 | 0.4981809 | 0.3356683 | 0.2388390 | 0.2073755 |
frequency (raw) | 0.04166667 | 0.08333333 | 0.05555556 | 0.09722222 | 0.13888889 | 0.11111111 |
frequency |
Our FDSLRM for the tourism data can be rewritten in the matrix form as a linear mixed model (LMM):
where
is a vector of the time series observations,
is a random vector of corresponding iid (or white) noise values
model design matrices for our final model have the following structure:
This fundamental FDSLRM property allows us to apply many results and mathematical techniques of LMM methodology. In the language of LMM terminology represents the vector of fixed effects, the random component depends on vector of random effects and of random errors. From the viewpoint of LMM residual analysis, we have to consider three types of residuals:
marginal residuals (FDSLRM residuals): ,
random effects residuals (FDS residuals): ,
conditional residuals (WN residuals): .
How was the final form of and found?
Due to the LMM structure, we can apply graphical (exploratory) tools and quantitative tests of LMM residual diagnostics (Singer et al., 2017) for FDSLRM observations (see the next subsection Graphical tools). The most suitable form of and is found by an iterative process (see rules and steps explained in the next remark) of applying the mentioned tools whose results are summarized in the following table. Two most adequate and parsimonious structures of the FDSLRM (2b, 3b) consist of two low frequencies and two higher ones . Since the difference in AIC, BIC for both models is relatively small, our final choice is model 3b thanks to the generally smaller mean squared error in predictions (Hančová, 2007).
Iteration number |
Graphical diagnostic tools | Numerical diagnostic tests | Frequencies (raw) in model | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
L | O1 | H | O2 | ACF | PACF | N1 | N2 | N3 | Normality test |
Independence test |
|||||
1. | ? | ? | T,1,1 | T,1,1 | R,1,1 | - | |||||||||
2a. | ? | T,1,1 | T,1,1 | R,1,1 | R,1,1 | ||||||||||
2b. | T,1,1 | T,0,1 | R,0,1 | R,0,1 | |||||||||||
3a. | ? | ? | T,1,1 | T,1,1 | T,1,1 | R,1,1 | |||||||||
3b. | T,1,1 | T,0,1 | T,0,1 | R,0,1 |
*Coding: letter (T or R) presence of the frequency in the trend or random component; the first number (1 or 0) presence of the frequency in the cos term; the second number (1 or 0) in the sin term
Key rules and steps of the iterative econometric FDSLRM-building General rules of the FDSLRM-building for econometric data (Štulajter, 2002; Box et al., 2016; Gajdoš et al, 2017):
the model should be parsimonious (in the form of a simple dependence with a small number of the parameters),
the model should create small unexplained (random) deviations (small variances, covariances or mean squared errors),
the model should include lower frequencies in the trend component, and higher ones in the random component.
Key steps of the FDSLRM-building for econometric data (our experience):
follow the scheme of the Box-Jenkins iterative, three-stage time series model-building approach: formulation (identification), estimation (fit), diagnostic (checking),
the more apparent periodic or seasonal patterns in the data mean the smaller number of iterations in the model building,
start at least with the three most significant frequencies in the model,
use the graphical-tools diagnostic matrix and numerical tests (see below) to check the adequacy of the model,
remove or add other significant frequencies and check the model adequacy,
remove cos or sin term with a particular frequency and check the model adequacy,
very small or very big variance parameter estimates indicate a reason to remove the corresponding cos or sin term with a particular frequency in the random component,
very small predictions of random effects also indicate an exclusion of the corresponding term in the random component,
use information criteria AIC, BIC to choose between competing adequate models.
Graphical (exploratory) tools
In the LMM residual analysis for FDSLRM we can use the following matrix of graphical exploratory tools (plots) as diagnostic tools for all FDSLRM assumptions. A brief description of the tools is in the Appendix.
|||| |---|------------------------------------------------|---| | | || | | |stand. marg. residuals vs marg. fitted values|stand. marg. residuals vs times|ACF of cond. residuals| | | ||| | |stand. cond. residuals vs cond. predictions|stand. cond. residuals vs times|PACF of cond. residuals| | | ||| | |histogram of cond. residuals|histogram of stand. least conf. residuals|stand. least conf. residuals vs quantiles|
We present the residual diagnostics results for the final model (2b).
Single panels for diagnostic
Our function drawDiagPlots()
also allows to show any of diagnostic plots above in a single panel. Here we show two additional plots (not included in the Graphical-tools diagnostic matrix).
plot: cumulative periodogram of conditional residuals - detection of periodic nonrandomness in conditional residuals
plot: standardized marginal residuals vs marginal fitted values - test of linearity of fixed effects
Numerical tests
Tests of residual independence
Test of residual normality
Parameter estimates
Estimates of regression coefficients
7.683338 | -0.1448348 | 0.2014634 | -0.1296788 |
Predictions of random effects
0.1492801 | -0.1112381 |
Estimates of variance parameters
0.059342 | 0.0238263 | 0.0138469 |
Fit summary
Graphical summary for the final model (3b)
plot: time series observations (black), fitted values (blue), estimated trend (red) vs times
Numerical summary for the final model (2b)
Numerical summary for model 3b
References
This notebook belongs to suplementary materials of the paper submitted to Statistical Papers and available at https://arxiv.org/abs/1905.07771.
Hančová, M., Vozáriková, G., Gajdoš, A., Hanč, J. (2019). Estimating variance components in time series linear regression models using empirical BLUPs and convex optimization, https://arxiv.org/, 2019.
Abstract of the paper
We propose a two-stage estimation method of variance components in time series models known as FDSLRMs, whose observations can be described by a linear mixed model (LMM). We based estimating variances, fundamental quantities in a time series forecasting approach called kriging, on the empirical (plug-in) best linear unbiased predictions of unobservable random components in FDSLRM.
The method, providing invariant non-negative quadratic estimators, can be used for any absolutely continuous probability distribution of time series data. As a result of applying the convex optimization and the LMM methodology, we resolved two problems theoretical existence and equivalence between least squares estimators, non-negative (M)DOOLSE, and maximum likelihood estimators, (RE)MLE, as possible starting points of our method and a practical lack of computational implementation for FDSLRM. As for computing (RE)MLE in the case of observed time series values, we also discovered a new algorithm of order , which at the default precision is times more accurate and times faster than the best current Python(or R)-based computational packages, namely CVXPY, CVXR, nlme, sommer and mixed.
We illustrate our results on three real data sets electricity consumption, tourism and cyber security which are easily available, reproducible, sharable and modifiable in the form of interactive Jupyter notebooks.
Brockwell, P. J., Davis, R. A. (2016). Introduction to Time Series and Forecasting (3rd ed.). New York, NY: Springer
Brockwell, P. J., & Davis, R. A. (2006). Time Series: Theory and Methods (2nd ed.). New York: Springer-Verlag
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Hoboken, New Jersey: Wiley
Gajdoš, A., Hančová, M., Hanč, J. (2017). Kriging Methodology and Its Development in Forecasting Econometric Time Series. Statistica: Statistics and Economy Journal, 2017, Vol. 97, No. 1, pp. 59–73
Galecki, A, Burzykowski, T. (2013). Linear Mixed-Effects Models Using R: A Step-by-Step Approach. New York: Springer
Hančová, M. (2007). Comparison of prediction quality of the best linear unbiased predictors in time series linear regression models. Proceedings of 15th European Young Statisticians Meeting. Castro Urdiales (Spain): University of Extremadura, http://matematicas.unex.es/~idelpuerto/15thEYSM.html.
Hilden-Minton, J.A. (1995). Multilevel diagnostics for mixed and hierarchical linear models, Unpublished PhD Thesis, University of California, Los Angeles.
Nobre, J.S., Singer, J.M. (2007). Residual analysis for linear mixed models. Biom. J., Vol. 49, pp. 863–875
Pinheiro, J., Bates D., DebRoy S., Sarkar, D., R Core Team (2018). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-131, URL: https://CRAN.R-project.org/package=nlme
R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/
Singer, J. M., Rocha, F. M. M., Nobre, J. S. (2017). Graphical Tools for Detecting Departures from Linear Mixed Model Assumptions and Some Remedial Measures. International Statistical Review, Vol. 85, pp. 290–324; R functions for LMM residual diagnostics https://www.ime.usp.br/~jmsinger/lmmdiagnostics.zip
Sokol P., Gajdoš, A. (2017). Prediction of Attacks Against Honeynet Based on Time Series Modeling. Silhavy, R., Silhavy, P., & Prokopova, Z. (Eds.). (2017). Applied Computational Intelligence and Mathematical Methods (Vol. 662). Cham: Springer International Publishing. pp. 360-371
Štulajter, F. (2003). The MSE of the BLUP in a Finite Discrete Spectrum LRM. Tatra Mountains Mathematical Publications, 2003, Vol. 26, No. 1, pp. 125–131
Štulajter, F. (2002). Predictions in Time Series Using Regression Models. New York: Springer
A brief help on applied diagnostic tools and R functions.
Graphical and numerical tools
standardized marginal residuals vs marginal fitted values: residuals not randomly distributed, an obvious pattern of dependency (systematic trend) assumption of linearity of fixed effects violated model rejected.
standardized conditional residuals vs predicted values: residuals not randomly distributed, increase in variance of residuals assumption of homoscedasticity violated model rejected.
standardized least confounded residuals vs N(0,1) quantiles: points do not lie close to the straight line, many points (more than 5%) lie out of confidence bounds assumption of normality of conditionl errors violated model rejected.
standardized marginal residuals vs observation indices: some points are extremely far away from the majority of points outliers detected.
standardized conditional residuals vs observation indices: some points are extremely far away from the majority of points outliers detected.
autocorrelation function of conditional residuals: more than 5% of values cross the empirical bounds independence of conditional errors violated reject model.
partial autocorrelation function of conditional residuals: more than 5% of values cross the empirical bounds independence of conditional errors violated reject model.
histogram of conditional residuals: histogram does not approximately look like a Gaussian distribution normality of conditional errors violated reject model.
histogram of standardized least confounded residuals: histogram does not approximately look like a Gaussian distribution normality of conditional errors violated reject model.
As pointed out in Nobre and Singer, 2007 according to Hilden-Minton, 1995 a residual is said to be confounded for a specific type of error if it also depends on errors different from those that it is supposed to predict. In linear mixed models, conditional residuals and the BLUP are confounded (Nobre and Singer, 2007). This implies, for example, that estimated conditional residuals be may not be adequate to check for normality of conditional errors since when random effects are grossly non-normal, estimated conditional residuals may not present a normal behavior even when conditional error is normal (Nobre and Singer, 2007, Section 4). Following the suggestion of Hilden-Minton, 1995 we consider standardized conditional least confounded residuals, obtained as linear combinations of the standardized conditional residuals that minimize the proportion of their variance due to the random effects.
cumulative periodogram of conditional residuals: cumulative periodogram shows strong systematic deviations from the straight line connecting point with point reject model.
R functions
Packages - fdslrm, stats, repr
The R package fdslrm has been developed by authors of this notebook and serves the purpose of modelling time series.
fdslrm: Time series analysis and forecasting using LMM
Purpose: R package for modeling and prediction of time series using linear mixed models.
Version: 0.1.0, 2019
Depends: kableExtra, IRdisplay, MASS, Matrix, car, nlme, stats, forecast, fpp2, matrixcalc, sommer, gnm, pracma, CVXR
Maintainer: Andrej Gajdoš
Authors: Andrej Gajdoš, Jozef Hanč, Martina Hančová
Installation: Run jupyter notebook
00 installation fdslrm.ipynb
once before the first run of any R-based Jupyter notebook.
The
spec.pgram()
function from base R stats package produces periodogram - estimation for spectral density of a time series.
The
options(repr.plot.res=.., repr.plot.height=.., repr.plot.width=..)
from repr package set parameters for resizing R plots in Jupyter notebooks.
Authors' source code - R functions
The
initialFDSLRM()
loads essential R packages (nlme, kableExtra, IRdisplay, MASS, Matrix, car, stats,
...) for time series analysis and visualization (it also installs any missing package - in case of the installation it can take several minutes on the standard computer). Moreover, the function loads authors' R functions designed to work with FDSLRM and some Singer's R functions (Singer et al. 2017) for LMM modified by authors of this jupyter notebook .
The
fitDiagFDSLRM()
function fits the given FDSLRM and consequently conducts the residual diagnostics. Particularly, it means computing all estimates for the unknown parameters and all the statistical test of residuals. The basic input parameters of this function are
x
- time series (vector),
times
- vector,
freq_mean
- frequencies in trend (vector),
poly_trend_degree
- trend polynomial degree, default is zero,
include_fixed_eff
- vector of ones and zeros specifying if the and component of corresponding frequency is included in trend or not,
freq_random
- frequencies in random component (vector),
include_random_eff
- vector of ones and zeros specifying if the and component of corresponding frequency is included in random component or not,
season_period
- number specifying seasonality (periodicity), default value is zero.
The
drawDiagPlots()
function draws the diagnostics plots of residuals. This function has two inputs
plots_names
- name of particular graph,
alternatively, user can specify the input parameterplots_names = "all"
and the matrix of all diagnostics plots will be displayed,
fit_diag_fdslrm_output
- output of the funtionfitDiagFDSLRM()
.
The
drawTable()
function creates the table of significant frequencies. The basic parameters allow to draw table for frequencies from periodogram and model parameters from fitting.