1 Introduction
Our daily life is now surrounded by various types of sensors, ranging from smartphones, video cameras, and the Internet of Things to robots. The observations yielded by such devices over time are naturally organized as time series data Qin et al. (2017); Yang et al. (2015). In this paper, we focus on time series with exogenous variables. Specifically, given a target time series as well as an additional set of time series corresponding to exogenous variables, a predictive model that uses the historical observations of both target and exogenous variables to predict the future values of the target variable is an autoregressive exogenous model, referred to as ARX. ARX models have been successfully used for modeling the input-output behavior of many complex systems DiPietro et al. (2017); Zemouri et al. (2010); Lin et al. (1996). In addition to forecasting, the interpretability of such models is essential for deployment, e.g. understanding the relative importance of exogenous variables w.r.t. the evolution of the target one Hu et al. (2018); Siggiridou & Kugiumtzis (2016); Zhou et al. (2015).
Meanwhile, long short-term memory units (LSTM) Hochreiter & Schmidhuber (1997) and the gated recurrent unit (GRU) Cho et al. (2014), a class of recurrent neural networks (RNN), have achieved great success in various applications on sequence and time series data Lipton et al. (2015); Wang et al. (2016); Guo et al. (2016); Lin et al. (2017); Sutskever et al. (2014). However, current recurrent neural networks fall short of achieving interpretability at the variable level when they are used as ARX models. For instance, when fed with the multivariable historical observations of the target and exogenous variables, LSTM blindly blends the information of all variables into the memory cells and hidden states that are used for prediction. Therefore, it is intractable to distinguish the contribution of individual variables to the prediction by inspecting the hidden states Zhang et al. (2017).
Recently, attention-based neural networks Bahdanau et al. (2014); Vinyals et al. (2015); Chorowski et al. (2015); Choi et al. (2016); Qin et al. (2017); Cinar et al. (2017) have been proposed to enhance the ability of RNNs to selectively use long-term memory, as well as their interpretability. However, current attention mechanisms are mostly applied to hidden states across time steps, thereby focusing on capturing temporally important information while failing to uncover the different importance of input variables.
To this end, we aim to develop an LSTM-based ARX model that achieves a unified framework for both forecasting and knowledge discovery. In particular, the contribution is fourfold. First, we propose the multivariable LSTM, referred to as MVLSTM, with tensorized hidden states and an associated updating scheme, such that each element of the hidden state tensor encodes information for a certain input variable. Second, using the variable-wise hidden states, we develop a probabilistic mixture representation of temporal and variable attention. Learning and forecasting in MVLSTM are built on top of this mixture attention mechanism. Third, we propose to interpret and quantify variable importance by the posterior inference of variable attention. Lastly, we perform an extensive experimental evaluation of MVLSTM against statistical, machine learning, and neural network baselines to demonstrate its prediction performance and interpretability. The idea of MVLSTM easily applies to other RNN variants, e.g., GRU, or to stacking multiple MVLSTM layers; we leave these extensions to future work.
2 Related work
Vanilla recurrent neural networks have been used to study the nonlinear ARX problem in Zemouri et al. (2010); Diaconescu (2008); DiPietro et al. (2017). Tank et al. (2017, 2018) proposed to identify causal variables w.r.t. the target one via sparse regularization. Our MVLSTM aims to provide accurate prediction as well as interpretability of variable importance via an attention mechanism.
Recently, the attention mechanism has gained increasing popularity due to its ability to let recurrent neural networks select parts of hidden states across time steps and to enhance the interpretability of networks Bahdanau et al. (2014); Vinyals et al. (2015); Choi et al. (2016); Vaswani et al. (2017); Lai et al. (2017); Qin et al. (2017); Cinar et al. (2017); Choi et al. (2018); Guo et al. (2018). However, current attention mechanisms are normally applied to hidden states across time steps and, for multivariable input sequences, fail to characterize variable-level importance. Only some very recent studies Choi et al. (2016); Qin et al. (2017) attempted to develop attention mechanisms capable of handling multivariable sequence data. Qin et al. (2017); Choi et al. (2016) first use neural networks to learn weights on input variables and then feed the weighted input data into another neural network Qin et al. (2017) or use it directly for forecasting Choi et al. (2016). In our MVLSTM, temporal and variable attention are jointly derived from the hidden states of individual variables learned via one end-to-end network.
Another line of related research concerns the tensorization and selective updating of hidden states in recurrent neural networks. Novikov et al. (2015); Do et al. (2017) proposed to represent hidden states as a matrix. He et al. (2017) developed a tensorized LSTM in which hidden states are represented by tensors to enhance the capacity of networks without additional parameters. Koutnik et al. (2014); Neil et al. (2016); Kuchaiev & Ginsburg (2017) proposed to partition the hidden layer into separate modules acting as independent feature groups. In MVLSTM, hidden states are organized in a matrix, each element of which encodes information specific to one input variable. Meanwhile, the hidden states are correlatively updated such that the inter-correlation among input variables is still captured.
3 MultiVariable LSTM
Assume we have exogenous time series and a target series of length , where and . (Vectors are assumed to be in column form throughout this paper.) By stacking the exogenous time series and the target series, we define a multivariable input sequence as , where is the multivariable input at time step and is the observation of the th exogenous time series at time . Given , we aim to learn a nonlinear mapping to predict the next value of the target series, namely . The model should be interpretable in the sense that we can understand which exogenous variables are crucial for the prediction.
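As a concrete illustration of the problem setup above, the construction of multivariable input windows paired with the next target value can be sketched as follows (a minimal NumPy sketch; function and variable names are ours, not the paper's):

```python
import numpy as np

def make_arx_windows(exo, target, T):
    """Stack exogenous series (shape [length, N]) and the target series
    (shape [length]) into multivariable windows of T steps, each paired
    with the next value of the target series.
    Shapes and names are illustrative assumptions, not the paper's notation."""
    series = np.column_stack([exo, target])      # [length, N+1]
    X, y = [], []
    for t in range(len(target) - T):
        X.append(series[t:t + T])                # window of T time steps
        y.append(target[t + T])                  # next target value to predict
    return np.array(X), np.array(y)
```

Each window thus contains the histories of both the exogenous variables and the (autoregressive) target variable, matching the ARX formulation.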
3.1 Network Architecture
Inspired by He et al. (2017); Kuchaiev & Ginsburg (2017), in MVLSTM we develop tensorized hidden states and an associated update scheme, which ensure that each element of the hidden state tensor encapsulates information exclusively from a certain input variable. As a result, this enables a flexible temporal and variable attention mechanism on top of such hidden states.
Specifically, we define the hidden state tensor (matrix) at time step in an MVLSTM layer as , where , , and is the overall size of the layer. The element of is a hidden state vector specific to the th input variable. Then, we define the input-to-hidden transition tensor (matrix) as , where and . The hidden-to-hidden transition tensor is defined as , where and .
Similar to standard LSTM neural networks Hochreiter & Schmidhuber (1997), MVLSTM has input, forget, and output gates as well as memory cells to control the update of the hidden state matrix. Given the newly incoming input at time and the hidden state matrix and memory cell up to , we formulate the iterative update process in an MVLSTM layer as follows:
$$\mathbf{j}_t = \tanh\left(\mathcal{W} \circledast \mathbf{h}_{t-1} + \mathcal{U} \circledast \mathbf{x}_t + \mathbf{b}\right) \qquad (1)$$
$$\mathbf{i}_t,\ \mathbf{f}_t,\ \mathbf{o}_t = \sigma\left(\mathbf{W}\left[\mathrm{vec}(\mathbf{h}_{t-1}) \oplus \mathbf{x}_t\right] + \mathbf{b}_g\right) \qquad (2)$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathrm{vec}(\mathbf{j}_t) \qquad (3)$$
$$\mathbf{h}_t = \mathrm{mat}\left(\mathbf{o}_t \odot \tanh(\mathbf{c}_t)\right) \qquad (4)$$
Overall, Eq. (1) gives rise to the cell update matrix , where corresponds to the update w.r.t. input variable . The terms and respectively capture the update from the hidden states of the previous step and from the new input. Concretely, the tensor-dot operation in Eq. (1) returns the product of two tensors along a specified axis. Thus, given tensors and , the tensor-dot of and along the axis is expressed as , where . Additionally, we define as the product between the transition matrix and the input vector: .
Eq. (2) derives the input gate , forget gate , and output gate by using and . All these gates are vectors of dimension . refers to the vectorization operation; in Eq. (2) it concatenates the columns of into a vector of dimension . is the concatenation operation. represents the element-wise sigmoid activation function. Each element of the gate vectors is derived from , which carries information regarding all input variables, so as to utilize the cross-correlation between input variables. In Eq. (3), the memory cell vector is updated by using the previous cell and the vectorized cell update matrix obtained in Eq. (1). denotes element-wise multiplication. Finally, in Eq. (4) the new hidden state matrix at is the matricization (in our case, the operation that reshapes a vector of into a matrix of ) of , weighted by the output gate.
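One step of the update process described above can be sketched in NumPy as follows. This is a simplified sketch of Eqs. (1)-(4) under assumed shapes (N input variables, d hidden units per variable); the parameter names and the single stacked gate weight are our assumptions, not the paper's exact parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mv_lstm_step(x, h, c, params):
    """One MVLSTM update step (sketch of Eqs. (1)-(4)).
    x: [N] input at time t, h: [N, d] hidden state matrix,
    c: [N*d] memory cell vector. Weight shapes are assumptions:
    W [N, d, d], U [N, d], b [N, d], Wg [3*N*d, N*d+N], bg [3*N*d]."""
    W, U, b = params["W"], params["U"], params["b"]
    Wg, bg = params["Wg"], params["bg"]
    N, d = h.shape
    # Eq. (1): variable-wise cell update via tensor-dot; row n uses only h^n, x^n
    j = np.tanh(np.einsum('nij,nj->ni', W, h) + U * x[:, None] + b)
    # Eq. (2): gates from [vec(h_{t-1}); x_t], mixing all variables
    z = sigmoid(Wg @ np.concatenate([h.ravel(), x]) + bg)
    i, f, o = np.split(z, 3)
    # Eq. (3): element-wise memory update keeps variable-wise blocks separate
    c_new = f * c + i * j.ravel()
    # Eq. (4): apply output gate, then matricize back to [N, d]
    h_new = (o * np.tanh(c_new)).reshape(N, d)
    return h_new, c_new
```

Because the gates act element-wise in Eq. (3), each block of the memory cell (and hence of the hidden state matrix) stays associated with one input variable, which is the property the attention mechanism in Sec. 3.2 relies on.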
3.2 Mixture Temporal and Variable Attention
After feeding a sequence of into MVLSTM, we obtain a sequence of hidden state matrices, denoted by , where and element . is then used in our mixture temporal and variable attention mechanism, which facilitates the subsequent learning, inference, and interpretation of variable importance.
Specifically, our attention mechanism is based on a probabilistic mixture of experts model Zong et al. (2018); Graves (2013); Shazeer et al. (2017) over as:
(5) 
where and .
In Eq. (5), we introduce a latent random variable into the density function of to govern the generation of conditional on historical data . is a discrete variable over the set of values corresponding to the input variables. Mathematically, the joint density of and is decomposed into a component model (i.e. ) and the prior of conditioned on (i.e. ). The component model characterizes the density of conditioned on the historical data of variable , while the prior of controls to what extent is generated by variable and enables the model to adaptively adjust the contribution of variable to fit . For , it refers to the autoregressive part. Evaluating each part of Eq. (5) amounts to the temporal and variable attention process using hidden states in MVLSTM. Temporal attention is first applied to the sequence of hidden states of each variable, so as to obtain a summarized hidden state per variable. The history of each variable is encoded in these temporally summarized hidden states, which are used to calculate and . Then, since the prior in Eq. (5) is a discrete distribution over , it naturally characterizes the attention on the exogenous and autoregressive parts for predicting .
In detail, the weights and bias of the temporal attention process are defined as and , where the element corresponds to the th variable. The temporal attention is then derived as:
$$\mathbf{E} = \tanh\left(\mathcal{F} \circledast \mathbf{H} + \mathbf{b}_e\right) \qquad (6)$$
$$\alpha_k^n = \frac{\exp\left(e_k^n\right)}{\sum_{k'=1}^{T} \exp\left(e_{k'}^n\right)} \qquad (7)$$
$$\mathbf{r}^n = \sum_{k=1}^{T} \alpha_k^n \, \mathbf{h}_k^n \qquad (8)$$
$$\tilde{\mathbf{h}}^n = \mathbf{h}_T^n \oplus \mathbf{r}^n \qquad (9)$$
In Eq. (6), is derived via the tensor-dot operation, where element is the attention score on the previous steps of variable (other methods of deriving attention scores are compatible with MVLSTM Cinar et al. (2017); Qin et al. (2017); we use the simple one-layer transformation in the present paper). Then, the attention weights are obtained by applying the softmax to each row of . gives rise to the variable-wise context matrix . Recall that the hidden state matrix at is . By concatenating and along the axis in Eq. (9), we obtain the context-enhanced hidden state matrix , where is a hidden state summarizing the temporal information of variable .
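The temporal attention process can be sketched as follows: score each variable's hidden states over time, softmax per variable row, build a variable-wise context, and concatenate it with the last hidden state. The single-layer scoring here is a simplified stand-in for the paper's tensor-dot form, and the argument shapes are our assumptions:

```python
import numpy as np

def temporal_attention(H, F, b):
    """Variable-wise temporal attention over hidden states H of shape
    [T, N, d] (in the spirit of Eqs. (6)-(9)). F ([N, d]) and b ([N])
    play the role of the attention weights and bias."""
    scores = np.tanh(np.einsum('tnd,nd->nt', H, F) + b[:, None])   # [N, T] scores
    expv = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = expv / expv.sum(axis=1, keepdims=True)                 # softmax per row
    context = np.einsum('nt,tnd->nd', alpha, H)                    # context per variable
    # concatenate the last hidden state with the context, per variable
    return np.concatenate([H[-1], context], axis=1)                # [N, 2*d]
```

The returned matrix corresponds to the context-enhanced hidden states, one summarized row per input variable.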
Now we can formulate the individual component model in Eq. (5) as:

$$p\left(y_{T+1} \mid z_{T+1} = n,\ \mathbf{X}_T\right) = \mathcal{N}\left(\mu_n,\ \sigma_n^2\right), \quad \mu_n = \mathbf{w}_n^\top \tilde{\mathbf{h}}^n + b_n \qquad (10)$$

where we impose a normal distribution over , and and are the output weight and bias. In experiments, we simply set the variance to one. Meanwhile, by using the summarized hidden states , we derive the prior to characterize variable-level attention as:

$$\Pr\left(z_{T+1} = n \mid \mathbf{X}_T\right) = \frac{\exp\left(\mathbf{v}_n^\top \tilde{\mathbf{h}}^n + b_n^{v}\right)}{\sum_{n'} \exp\left(\mathbf{v}_{n'}^\top \tilde{\mathbf{h}}^{n'} + b_{n'}^{v}\right)} \qquad (11)$$

where is the variable attention weight and is the bias.
3.3 Learning, Inference and Interpretation
In the learning phase, denote by $\Theta$ the set of parameters in MVLSTM. Given a set of training sequences $\mathcal{D}$, the loss function to optimize is defined based on the negative log-likelihood of the mixture model plus the regularization term as:

$$\mathcal{L}(\Theta) = -\sum_{(\mathbf{X}_T,\, y_{T+1}) \in \mathcal{D}} \log p\left(y_{T+1} \mid \mathbf{X}_T;\ \Theta\right) + \lambda \lVert \Theta \rVert_2^2 \qquad (12)$$
In the inference phase, the prediction of is obtained by the weighted sum of means as Graves (2013); Bishop (1994): .
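The inference step and the mixture likelihood used in the loss can be sketched as follows, with the component variance fixed to one as in the experiments (function names and the vectorized form are our assumptions):

```python
import numpy as np

def mixture_predict(mu, pi):
    """Point prediction as the attention-weighted sum of component means
    (Sec. 3.3): mu[n] is the mean of variable n's component, pi[n] its
    prior attention weight."""
    return float(np.dot(pi, mu))

def mixture_nll(y, mu, pi, sigma=1.0):
    """Negative log-likelihood of the Gaussian mixture in Eq. (5),
    with sigma set to one as in the paper's experiments."""
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(-np.log(np.dot(pi, comp)))
```

Summing `mixture_nll` over training sequences and adding an L2 penalty yields the loss of Eq. (12).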
For the interpretation of variable importance via mixture attention, we consider the posterior of , i.e.

$$\Pr\left(z_{T+1} = n \mid y_{T+1},\ \mathbf{X}_T\right) = \frac{p\left(y_{T+1} \mid z_{T+1} = n,\ \mathbf{X}_T\right) \Pr\left(z_{T+1} = n \mid \mathbf{X}_T\right)}{\sum_{n'} p\left(y_{T+1} \mid z_{T+1} = n',\ \mathbf{X}_T\right) \Pr\left(z_{T+1} = n' \mid \mathbf{X}_T\right)} \qquad (13)$$
which takes the prediction performance of individual variables into account. We refer to the derived and respectively as posterior and prior attention.
Meanwhile, note that we obtain the posterior of for each training sequence. In order to attain a uniform view of variable importance over the whole dataset, we define the importance of an input variable by aggregating the posterior attention of this variable over all sequences as:

$$\mathrm{Imp}(n) = \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{X}_T,\, y_{T+1}) \in \mathcal{D}} \Pr\left(z_{T+1} = n \mid y_{T+1},\ \mathbf{X}_T\right) \qquad (14)$$
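The posterior reweighting of Eq. (13) and the per-dataset aggregation can be sketched as below. Simple averaging is used as the aggregation; the exact normalization of the paper's Eq. (14) may differ, so treat this as an assumption:

```python
import numpy as np

def posterior_attention(y, mu, pi, sigma=1.0):
    """Posterior of z (Eq. (13)): the prior attention pi is reweighted
    by each component's Gaussian likelihood of the observed y."""
    lik = np.exp(-0.5 * ((y - mu) / sigma) ** 2)
    post = pi * lik
    return post / post.sum()

def variable_importance(posteriors):
    """Aggregate per-sequence posterior attention (rows) into one
    importance score per variable (columns); averaging is an assumed
    aggregation scheme."""
    agg = np.mean(posteriors, axis=0)
    return agg / agg.sum()
```

Components whose predictions match the observed target gain posterior attention relative to their prior, which is what makes the posterior-based importance more distinguishable than raw attention weights.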
4 Experiments
In this part, we report experimental results. Due to page limitations, please refer to the appendix for full results.
4.1 Datasets
We use three real datasets (https://archive.ics.uci.edu/ml/datasets.html) and one synthetic dataset to evaluate MVLSTM and the baselines.
PM2.5: It contains hourly PM2.5 data and the associated meteorological data in Beijing, China. The PM2.5 measurement is the target series. The exogenous time series include dew point, temperature, pressure, combined wind direction, cumulated wind speed, hours of snow, and hours of rain. In total, we have multivariable sequences.
Energy: It collects the appliance energy use in a low-energy building. The target series is the energy data logged every 10 minutes. Exogenous time series consist of variables, e.g. the temperature conditions inside the house and outside weather information including temperature, wind speed, humidity, and dew point from the nearest weather station. The number of sequences is .
Plant: This dataset records the time series of energy production of a photovoltaic (PV) power plant in Italy Ceci et al. (2017). Exogenous data consists of dimensional time series regarding weather conditions (such as temperature, cloud coverage, etc.). It gives sequences for evaluation.
Synthetic: It is generated based on the idea of Lorenz model Tank et al. (2017, 2018). Exogenous series are generated via the ARMA process with randomized parameters. The target series is driven by an ARMA process plus coupled exogenous series of variable and with randomized autoregressive orders and thus the synthetic dataset has ground truth of variable importance. In total, we generate sequences of 10 exogenous time series.
For each dataset, we perform Augmented Dickey-Fuller (ADF) and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests to determine the necessity of differencing the time series Kirchgässner et al. (2012). The window size, namely in Sec. 3, is set to 30. We further study the prediction performance under different window sizes in the supplementary material. Each dataset is split into training (), validation (), and testing () sets.
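The preprocessing pipeline described above can be sketched as follows. The stationarity tests themselves would be run with an external package (e.g. statsmodels' `adfuller`/`kpss`, not shown); the split fractions below are an assumption, since the exact ratios are elided in the text:

```python
import numpy as np

def difference(series, d=1):
    """First-order differencing applied d times, used when ADF/KPSS
    tests indicate the series is non-stationary."""
    for _ in range(d):
        series = np.diff(series)
    return series

def chrono_split(X, y, train=0.8, val=0.1):
    """Chronological train/validation/test split. The 80/10/10 ratio
    is an assumed placeholder for the paper's elided split fractions."""
    n = len(y)
    i, j = int(n * train), int(n * (train + val))
    return (X[:i], y[:i]), (X[i:j], y[i:j]), (X[j:], y[j:])
```

A chronological (rather than random) split is the standard choice for time series, since it prevents future observations from leaking into training.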
4.2 Baselines and Evaluation Setup
The first category of statistical baselines includes:
STRX is the structural time series model with exogenous variables Scott & Varian (2014); Radinsky et al. (2012). It is formulated in terms of unobserved components via the state space method.
ARIMAX augments the classical time series autoregressive integrated moving average model (ARIMA) by adding regression terms on exogenous variables Hyndman & Athanasopoulos (2014).
The second category of machine learning baselines includes popular tree ensemble methods and regularized regression as:
RF refers to random forests, an ensemble learning method consisting of multiple decision trees Liaw et al. (2002); Meek et al. (2002), which has been used in time series prediction Patel et al. (2015).
XGT refers to extreme gradient boosting Chen & Guestrin (2016), the application of boosting methods to regression trees Friedman (2001).
ENET represents ElasticNet, a regularized regression method combining the L1 and L2 penalties of the lasso and ridge methods Zou & Hastie (2005); it has been used in time series analysis Liu et al. (2010); Bai & Ng (2008).
The third category of neural network baselines includes:
RETAIN pre-trains two recurrent neural networks to respectively derive weights on temporal steps and variables, which are then used to perform prediction Choi et al. (2016).
DUAL is built upon an encoder-decoder architecture Qin et al. (2017), which uses an encoder LSTM to learn weights on input variables and then feeds the pre-weighted input data into a decoder LSTM for forecasting.
cLSTM identifies Granger-causal variables via sparse regularization on the weights of an LSTM Tank et al. (2017, 2018).
Additionally, we have two variants of MVLSTM denoted by MVIndep and MVFusion, which are developed to evaluate the efficacy of the updating and mixture mechanism of MVLSTM. MVIndep builds independent recurrent neural networks for each input variable, whose outputs are fed into the mixture attention process to obtain prediction. The only difference between MVFusion and MVLSTM is that, instead of using mixture attention, MVFusion fuses the hidden states of each variable into one hidden state via variable attention.
In ARIMAX, the orders of the autoregression and moving-average terms are set via the autocorrelation and partial autocorrelation functions. For RF and XGT, the hyperparameters tree depth and number of iterations are chosen from the ranges and via grid search. For XGT, L2 regularization is added by searching within . As for ENET, the coefficients of the L2 and L1 penalties are selected from . For these machine learning baselines, multivariable input sequences are flattened into feature vectors.
We implemented MVLSTM and the neural network baselines with TensorFlow. (Code will be released upon request.) For training, we used Adam with a mini-batch of instances Kingma & Ba (2014). For the size of the recurrent and dense layers in the baselines, we conduct grid search over . The size of the MVLSTM recurrent layer is set by the number of neurons per variable, selected from . Dropout is set to . The learning rate is searched in . L2 regularization is added with the coefficient chosen from . We train each approach times and report average performance. We consider two metrics to measure the prediction performance. Specifically, RMSE is defined as . MAE is defined as .
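The two evaluation metrics can be written out explicitly; these are the standard definitions of RMSE and MAE, which the text's elided formulas presumably match:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(diff)))
```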
4.3 Prediction Performance
We report the prediction errors of all approaches in Table 1 and Table 2. In Table 1, we observe that most of the time, STRX and ARIMAX underperform the machine learning and neural network solutions. Among RF, XGT, and ENET, XGT mostly performs best. As for the neural network baselines, DUAL outperforms RETAIN and cLSTM as well as the machine learning baselines on the Synthetic and Energy datasets. Our MVLSTM outperforms the baselines by around at most. MVLSTM performs slightly better than both MVFusion and MVIndep, while providing the interpretation benefit shown in the next group of experiments. The above observations also apply to the MAE results in Table 2, so we skip the detailed description.
MethodsDataset  Synthetic  Energy  Plant  PM2.5 
STRX  
ARIMAX  
RF  
XGT  
ENET  
DUAL  
RETAIN  
cLSTM  
MVFusion  
MVIndep  
MVLSTM 
MethodsDataset  Synthetic  Energy  Plant  PM2.5 
STRX  
ARIMAX  
RF  
XGT  
ENET  
DUAL  
RETAIN  
cLSTM  
MVFusion  
MVIndep  
MVLSTM 
4.4 Model Interpretation
In this part, we compare MVLSTM to the baselines that also offer interpretability of variable importance, i.e. DUAL, RETAIN, and cLSTM. For the real datasets without ground truth about variable importance, we perform the Granger causality test Arnold et al. (2007) to identify causal variables, which are considered as important variables for the comparison. For the synthetic dataset, we evaluate whether an approach recognizes variables and with high importance values.
Similar to MVLSTM, we can collect the variable attentions of each sequence in DUAL and RETAIN and obtain importance values via Eq. (14). Note that the variable attentions obtained in RETAIN are unnormalized values. In cLSTM, we identify important variables by the nonzero corresponding weights of the neural network Tank et al. (2018) and thus have no importance values to report in Table 3.
Table 3 reports the top variables ranked by the corresponding importance values (in brackets). The higher the importance value, the more crucial the variable. In the PM2.5 dataset, the three variables (i.e. dew point, cumulated wind speed, and pressure) identified as Granger-causal variables are also top-ranked by the variable importance in MVLSTM. As pointed out by Liang et al. (2015), dew point and pressure are the most influential. Strong wind can bring dry and fresh air and is crucial as well. This is in line with the variable importance detected by MVLSTM. On the contrary, the baselines miss some of these variables. Likewise, for the Plant dataset, as suggested by Mekhilef et al. (2012); Ghazi & Ip (2014), in addition to cloud cover, humidity, wind speed, and temperature affect the efficiency of PV cells and are thus important for power generation.
Furthermore, Figure 3 visualizes the histograms of attention values of two example variables in the PM2.5 dataset. In MVLSTM, compared with priors, the posterior attention of the variable “dew point” shifts rightward, while the posterior of variable “cumulated hours of rain” moves towards zero. It indicates that posterior attention rectifies the prior by taking into account the predictive likelihood. As a result, the variable importance derived from posterior attention is more distinguishable and informative, compared with the attention weights in DUAL and RETAIN.
5 Conclusion
In this paper, we propose an interpretable multivariable LSTM for time series with exogenous variables. Based on the tensorized hidden states of MVLSTM, we develop a mixture temporal and variable attention mechanism, which enables us to infer and quantify variable importance w.r.t. the target series. Extensive experiments on a synthetic dataset with ground truth and on real datasets with Granger causality tests exhibit the superior prediction performance and interpretability of MVLSTM.
References
 Arnold et al. (2007) Andrew Arnold, Yan Liu, and Naoki Abe. Temporal causal modeling with graphical granger methods. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 66–75. ACM, 2007.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2014.
 Bai & Ng (2008) Jushan Bai and Serena Ng. Forecasting economic time series using targeted predictors. Journal of Econometrics, 146(2):304–317, 2008.
 Bishop (1994) Christopher M Bishop. Mixture density networks. 1994.
 Ceci et al. (2017) Michelangelo Ceci, Roberto Corizzo, Fabio Fumarola, Donato Malerba, and Aleksandra Rashkovska. Predictive modeling of pv energy production: How to set up the learning task for a better prediction? IEEE Transactions on Industrial Informatics, 13(3):956–966, 2017.
 Chen & Guestrin (2016) Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In SIGKDD, pp. 785–794. ACM, 2016.
 Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
 Choi et al. (2016) Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pp. 3504–3512, 2016.
 Choi et al. (2018) Heeyoul Choi, Kyunghyun Cho, and Yoshua Bengio. Fine-grained attention mechanism for neural machine translation. Neurocomputing, 284:171–176, 2018.
 Chorowski et al. (2015) Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in neural information processing systems, pp. 577–585, 2015.
 Cinar et al. (2017) Yagmur Gizem Cinar, Hamid Mirisaee, Parantapa Goswami, Eric Gaussier, Ali Aït-Bachir, and Vadim Strijov. Position-based content attention for time series forecasting with sequence-to-sequence RNNs. In International Conference on Neural Information Processing, pp. 533–544. Springer, 2017.
 Diaconescu (2008) Eugen Diaconescu. The use of narx neural networks to predict chaotic time series. Wseas Transactions on computer research, 3(3):182–191, 2008.
 DiPietro et al. (2017) Robert DiPietro, Christian Rupprecht, Nassir Navab, and Gregory D Hager. Analyzing and exploiting NARX recurrent neural networks for long-term dependencies. In International Conference on Learning Representations, 2017.
 Do et al. (2017) Kien Do, Truyen Tran, and Svetha Venkatesh. Matrixcentric neural networks. arXiv preprint arXiv:1703.01454, 2017.
 Friedman (2001) Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232, 2001.
 Ghazi & Ip (2014) Sanaz Ghazi and Kenneth Ip. The effect of weather conditions on the efficiency of pv panels in the southeast of uk. Renewable Energy, 69:50–59, 2014.
 Graves (2013) Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 Guo et al. (2016) Tian Guo, Zhao Xu, Xin Yao, Haifeng Chen, Karl Aberer, and Koichi Funaya. Robust online time series prediction with recurrent neural networks. In 2016 IEEE DSAA, pp. 816–825. IEEE, 2016.
 Guo et al. (2018) Tian Guo, Tao Lin, and Yao Lu. An interpretable lstm neural network for autoregressive exogenous model. In workshop track at International Conference on Learning Representations, 2018.
 He et al. (2017) Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. Wider and deeper, cheaper and faster: Tensorized lstms for sequence learning. In Advances in Neural Information Processing Systems, pp. 1–11, 2017.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 Hu et al. (2018) Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, and Tie-Yan Liu. Listening to chaotic whispers: A deep learning framework for news-oriented stock trend prediction. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 261–269. ACM, 2018.
 Hyndman & Athanasopoulos (2014) Rob J Hyndman and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2014.
 Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kirchgässner et al. (2012) Gebhard Kirchgässner, Jürgen Wolters, and Uwe Hassler. Introduction to modern time series analysis. Springer Science & Business Media, 2012.
 Koutnik et al. (2014) Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork rnn. In International Conference on Machine Learning, pp. 1863–1871, 2014.
 Kuchaiev & Ginsburg (2017) Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for lstm networks. arXiv preprint arXiv:1703.10722, 2017.
 Lai et al. (2017) Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks. arXiv preprint arXiv:1703.07015, 2017.
 Liang et al. (2015) Xuan Liang, Tao Zou, Bin Guo, Shuo Li, Haozhe Zhang, Shuyi Zhang, Hui Huang, and Song Xi Chen. Assessing beijing’s pm2. 5 pollution: severity, weather impact, apec and winter heating. In Proc. R. Soc. A, volume 471, pp. 20150257. The Royal Society, 2015.
 Liaw et al. (2002) Andy Liaw, Matthew Wiener, et al. Classification and regression by randomforest. R news, 2(3):18–22, 2002.
 Lin et al. (2017) Tao Lin, Tian Guo, and Karl Aberer. Hybrid neural networks for learning the trend in time series. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 2273–2279, 2017.
 Lin et al. (1996) Tsungnan Lin, Bill G Horne, Peter Tino, and C Lee Giles. Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338, 1996.
 Lipton et al. (2015) Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzell. Learning to diagnose with lstm recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015.
 Liu et al. (2010) Yan Liu, Alexandru NiculescuMizil, Aurelie C Lozano, and Yong Lu. Learning temporal causal graphs for relational timeseries analysis. In ICML, pp. 687–694, 2010.
 Meek et al. (2002) Christopher Meek, David Maxwell Chickering, and David Heckerman. Autoregressive tree models for timeseries analysis. In SDM, pp. 229–244. SIAM, 2002.
 Mekhilef et al. (2012) S Mekhilef, R Saidur, and M Kamalisarvestani. Effect of dust, humidity and air velocity on efficiency of photovoltaic cells. Renewable and sustainable energy reviews, 16(5):2920–2925, 2012.
 Neil et al. (2016) Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pp. 3882–3890, 2016.
 Novikov et al. (2015) Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450, 2015.
 Patel et al. (2015) Jigar Patel, Sahil Shah, Priyank Thakkar, and K Kotecha. Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. Expert Systems with Applications, 42(1):259–268, 2015.
 Qin et al. (2017) Yao Qin, Dongjin Song, Haifeng Cheng, Wei Cheng, Guofei Jiang, and Garrison W. Cottrell. A dual-stage attention-based recurrent neural network for time series prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 2627–2633. AAAI Press, 2017.
 Radinsky et al. (2012) Kira Radinsky, Krysta Svore, Susan Dumais, Jaime Teevan, Alex Bocharov, and Eric Horvitz. Modeling and predicting behavioral dynamics on the web. In WWW, pp. 599–608. ACM, 2012.
 Scott & Varian (2014) Steven L Scott and Hal R Varian. Predicting the present with bayesian structural time series. International Journal of Mathematical Modelling and Numerical Optimisation, 5(12):4–23, 2014.
 Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. International Conference on Learning Representations, 2017.
 Siggiridou & Kugiumtzis (2016) Elsa Siggiridou and Dimitris Kugiumtzis. Granger causality in multivariate time series using a time-ordered restricted vector autoregressive model. IEEE Transactions on Signal Processing, 64(7):1759–1773, 2016.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
 Tank et al. (2017) Alex Tank, Ian Cover, Nicholas J Foti, Ali Shojaie, and Emily B Fox. An interpretable and sparse neural network model for nonlinear granger causality discovery. arXiv preprint arXiv:1711.08160, 2017.
 Tank et al. (2018) Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily Fox. Neural granger causality for nonlinear time series. arXiv preprint arXiv:1802.05842, 2018.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010, 2017.
 Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700, 2015.
 Wang et al. (2016) Linlin Wang, Zhu Cao, Yu Xia, and Gerard de Melo. Morphological segmentation with window lstm neural networks. In AAAI, 2016.
 Yang et al. (2015) Jian Bo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In IJCAI, pp. 25–31, 2015.
 Zemouri et al. (2010) Ryad Zemouri, Rafael Gouriveau, and Noureddine Zerhouni. Defining and applying prediction performance metrics on a recurrent NARX time series model. Neurocomputing, 73(13-15):2506–2521, 2010.
 Zhang et al. (2017) Liheng Zhang, Charu Aggarwal, and Guo-Jun Qi. Stock price prediction via discovering multi-frequency trading patterns. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2141–2149. ACM, 2017.
 Zhou et al. (2015) Xiabing Zhou, Wenhao Huang, Ni Zhang, Weisong Hu, Sizhen Du, Guojie Song, and Kunqing Xie. Probabilistic dynamic causal model for temporal data. In Neural Networks (IJCNN), 2015 International Joint Conference on, pp. 1–8. IEEE, 2015.
 Zong et al. (2018) Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, 2018.
 Zou & Hastie (2005) Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
6 Appendix
6.1 MultiVariable LSTM
Theorem 1.
The hidden states and memory cells in MVLSTM are updated by the process below:
and therefore each element of the hidden state matrix encodes information exclusively from the corresponding input variable .
Proof.
By the tensor-dot operation , only the elements of and corresponding to are matched in the calculation, namely . Meanwhile, since the product between the input-to-hidden transition weights and the input vector is , each resulting element only carries information about variable . Then, although the derivation of the gates , , and mixes information from all input variables in order to capture the cross-correlation among them, the memory cells are updated by the multiplication between the gates and , and therefore the information encoded in remains specific to each input variable. Likewise, the hidden state matrix derived from the updated memory retains the variable-wise hidden states. ∎
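The variable-wise property of the tensor-dot in Eq. (1) can be checked numerically: perturbing one variable's hidden state and input leaves every other variable's row of the cell update unchanged. Shapes, seeds, and weight names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 3, 4
W = rng.normal(size=(N, d, d))   # hidden-to-hidden transition, one d-by-d block per variable
U = rng.normal(size=(N, d))      # input-to-hidden transition, one d-vector per variable
h = rng.normal(size=(N, d))      # hidden state matrix
x = rng.normal(size=N)           # multivariable input

def cell_update(h, x):
    """Variable-wise cell update of Eq. (1): row n depends only on h^n and x^n."""
    return np.tanh(np.einsum('nij,nj->ni', W, h) + U * x[:, None])

j0 = cell_update(h, x)
h2, x2 = h.copy(), x.copy()
h2[1] += 1.0
x2[1] += 1.0                     # perturb only variable 1
j1 = cell_update(h2, x2)
unchanged = np.allclose(j0[[0, 2]], j1[[0, 2]])  # rows of other variables unaffected
```

Note that in the full update, the gates do mix all variables (via the concatenated vector in Eq. (2)); the point of the theorem is that this mixing only scales each variable-specific block element-wise, without blending content across blocks.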
6.2 Prediction Performance
In addition to the results under window size in Tables 1 and 2, we report the prediction errors under different window sizes, i.e. in Eq. (12).
MethodsDataset  Synthetic  Energy  Plant  PM2.5 
STRX  
ARIMAX  
RF  
XGT  
ENET  
DUAL  
RETAIN  
cLSTM  
MVFusion  
MVIndep  
MVLSTM 
MethodsDataset  Synthetic  Energy  Plant  PM2.5 
STRX  
ARIMAX  
RF  
XGT  
ENET  
DUAL  
RETAIN  
cLSTM  
MVFusion  
MVIndep  
MVLSTM 
MethodsDataset  Synthetic  Energy  Plant  PM2.5 
STRX  
ARIMAX  
RF  
XGT  
ENET  
DUAL  
RETAIN  
cLSTM  
MVFusion  
MVIndep  
MVLSTM 
MethodsDataset  Synthetic  Energy  Plant  PM2.5 
STRX  
ARIMAX  
RF  
XGT  
ENET  
DUAL  
RETAIN  
cLSTM  
MVFusion  
MVIndep  
MVLSTM 
6.3 Model Interpretation
In this part, we provide the variable list for each dataset in Table 8 and report the full variable importance in Table 9. Figures 4 to 7 visualize the histograms of attention over all variables in each dataset. In MVLSTM, compared with the priors, the posterior attention rectifies the prior by taking into account the predictive likelihood, while the attention weights in DUAL and RETAIN are not as representative.
Dataset  Variables  
PM2.5 


Plant 


Energy 


Synthetic  Variable 0 to 9, target variable 10 