A Short History of Time-Series Forecasting (TSF)—with Core Formulas You Can Reuse

This post distills how TSF models evolved—from statistical workhorses to today’s diversified deep architectures—and adds copy-paste formulas for the main families (ARIMA/SARIMA, RNN/LSTM/GRU/TCN, Transformers, Diffusion, and SSM/S4/Mamba). The storyline and taxonomy follow the excellent survey by Kim et al. (2025).


TL;DR — Timeline of TSF Models

  • 1950s–1970s: Exponential Smoothing → ARIMA → SARIMA establish statistical foundations. These remain strong baselines due to simplicity and domain interpretability.
  • 1980s–2000s: Classic ML enters (trees, SVMs), followed by neural nets (MLP → RNN/LSTM/GRU). CNN/TCN/WaveNet capture local motifs; GNNs appear for spatio-temporal structure.
  • 2017→: Transformers jump from NLP into TSF; LogTrans (LogSparse), Reformer, Informer, Autoformer, Pyraformer, and FEDformer tackle cost, seasonality, and long context.
  • 2021–2024: Pushback: non-Transformer baselines (including simple linear/MLP) often rival or beat Transformers on long-term TSF; Transformer variants respond with sparsity, patching, frequency-domain tricks, and de-stationarization.
  • 2021–2024: Diffusion models arrive for probabilistic TSF (DDPM, SDE/score-based, latent diffusion, multi-resolution guidance).
  • 2024→: Mamba/modern SSMs bring linear-time, long-context sequence modeling; variants incorporate stability constraints, channel-dependency learning, and Transformer-inspired techniques.
For an at-a-glance visual of “who came when,” see the survey’s Fig. 3 (evolution) and Fig. 8 (remarkable historical models).

The Long View: How We Got Here (and What’s Next)

Statistical era. Box-Jenkins ARIMA/SARIMA and Holt-Winters provided principled recipes for autocorrelation, differencing, and seasonality. They endure as baselines thanks to clarity and speed, especially when data are scarce or explainability is paramount.

Classic ML & early neural nets. As compute and data grew, ML models and neural nets learned nonlinearities: MLP/RNN (BPTT), then LSTM/GRU to fix long-range memory; CNN/TCN/WaveNet captured local and multi-scale patterns; GNNs modeled spatial graphs (traffic, sensors).

Transformer decade. With attention’s parallelism and long-dependency handling, Transformers became the default; the TSF literature chronicles efficiency-oriented variants—LogTrans (LogSparse), Reformer, Informer, Autoformer, Pyraformer, FEDformer—plus de-stationary and frequency-domain upgrades.

Renaissance of diversity. Evidence that “simpler” non-Transformer baselines can match or beat attention on LTSF benchmarks reopened the field. Today’s practice is issue-driven (channel dependencies, distribution shift, causality, feature extraction), not architecture-fashion-driven.

Foundation & Diffusion. Pretrained (“foundation”) models pursue cross-domain generalization. Diffusion methods (DDPM, SDE/score-based, latent diffusion, guided sampling) deliver calibrated probabilistic forecasts and multi-resolution generation.

SSM/Mamba. Modern state-space models (S4 → H3) and Mamba-style selective SSMs achieve linear-time long-context modeling; recent TSF variants combine stability constraints, channel-dependency learning, patching, and frequency-domain ideas.

Bottom line for practitioners.
Pick the tool for the problem:

  • Very long context / latency-sensitive? Start with SSM/Mamba or strong linear/MLP baselines; compare to a lean Transformer variant.
  • Pronounced periodicity / multi-scale? Consider TCN/CNN or Transformer variants with decomposition/frequency components.
  • Need probabilistic forecasts? Explore Diffusion (DDPM/SDE/latent) with decomposition or multi-resolution guidance.
  • Expect distribution shift or cross-domain reuse? Look at issue-driven normalization/denormalization and pretraining adapters.

Math Appendix — Copy-Paste Formulas for the Main Families

Use $$ ... $$ (KaTeX/MathJax) in your blog. Metrics at the end pair with any model head. The model choices and equations follow the survey’s structure.

1) Statistical Baselines

Simple / Holt / Holt–Winters (additive)
$$
\hat y_{t+1}=\alpha y_t+(1-\alpha)\hat y_t,\quad 0<\alpha<1
$$
Holt (level $l_t$, trend $b_t$):
$$
\begin{aligned}
l_t&=\alpha y_t+(1-\alpha)(l_{t-1}+b_{t-1}),\\
b_t&=\beta(l_t-l_{t-1})+(1-\beta)b_{t-1},\\
\hat y_{t+h}&=l_t+h\,b_t .
\end{aligned}
$$
Holt–Winters (season $s_t$, period $m$):
$$
\begin{aligned}
l_t&=\alpha(y_t-s_{t-m})+(1-\alpha)(l_{t-1}+b_{t-1}),\\
b_t&=\beta(l_t-l_{t-1})+(1-\beta)b_{t-1},\\
s_t&=\gamma\,(y_t-l_t)+(1-\gamma)s_{t-m},\\
\hat y_{t+h}&=l_t+h\,b_t+s_{t-m+(h\bmod m)}.
\end{aligned}
$$
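
To make the recursions concrete, here is a minimal NumPy sketch of the additive Holt–Winters updates above. The initialization (first-season mean for the level, zero trend, de-meaned first season for the seasonals) and the fixed smoothing constants are simplifying assumptions; real implementations estimate them by optimization.

```python
import numpy as np

def holt_winters_additive(y, m, alpha=0.3, beta=0.05, gamma=0.1, horizon=12):
    """Additive Holt-Winters: h-step forecasts from the recursions above.
    Initialization is a simplification (first-season mean, zero trend)."""
    y = np.asarray(y, dtype=float)
    level = y[:m].mean()
    trend = 0.0
    season = y[:m] - level                      # one seasonal index per phase

    for t in range(m, len(y)):
        s_prev = season[t % m]
        new_level = alpha * (y[t] - s_prev) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % m] = gamma * (y[t] - new_level) + (1 - gamma) * s_prev
        level = new_level

    h = np.arange(1, horizon + 1)
    return level + h * trend + season[(len(y) + h - 1) % m]

# Toy usage: trend + period-12 seasonality + noise
rng = np.random.default_rng(0)
t = np.arange(120)
y = 10 + 0.05 * t + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, 120)
print(holt_winters_additive(y, m=12, horizon=6))
```
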
ARIMA / SARIMA (backshift $B$):
$$
\phi(B)\,(1-B)^d y_t=\theta(B)\,\varepsilon_t,\quad \varepsilon_t\sim\mathcal N(0,\sigma^2).
$$
$$
\Phi(B^m)(1-B^m)^D\,\phi(B)(1-B)^d\,y_t=\Theta(B^m)\theta(B)\,\varepsilon_t.
$$
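
In practice these models are rarely hand-rolled. A hedged sketch using statsmodels' SARIMAX class (assuming statsmodels is installed) looks like this, with the (p, d, q)×(P, D, Q, m) orders chosen purely for illustration, not tuned:

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Toy seasonal series: trend + period-12 cycle + noise
rng = np.random.default_rng(1)
t = np.arange(200)
y = 5 + 0.02 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.3, 200)

model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fit = model.fit(disp=False)          # maximum-likelihood estimation
print(fit.forecast(steps=12))        # 12-step-ahead point forecasts
```
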
Context: concise history and roles of exponential smoothing, ARIMA, SARIMA.


2) Neural Sequence Models

Vanilla RNN
$$
h_t=\sigma(W_h h_{t-1}+W_x x_t+b_h),\quad \hat y_t=W_y h_t+b_y.
$$

LSTM
$$
\begin{aligned}
i_t&=\sigma(W_i x_t+U_i h_{t-1}+b_i),\quad
f_t=\sigma(W_f x_t+U_f h_{t-1}+b_f),\\
o_t&=\sigma(W_o x_t+U_o h_{t-1}+b_o),\quad
\tilde c_t=\tanh(W_c x_t+U_c h_{t-1}+b_c),\\
c_t&=f_t\odot c_{t-1}+i_t\odot \tilde c_t,\quad
h_t=o_t\odot\tanh(c_t).
\end{aligned}
$$

GRU
$$
\begin{aligned}
z_t&=\sigma(W_z x_t+U_z h_{t-1}),\quad
r_t=\sigma(W_r x_t+U_r h_{t-1}),\\
\tilde h_t&=\tanh(W_h x_t+U_h(r_t\odot h_{t-1})),\\
h_t&=(1-z_t)\odot h_{t-1}+z_t\odot \tilde h_t.
\end{aligned}
$$
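
A minimal NumPy sketch of a single GRU step, following the update/reset/candidate equations above (biases omitted, as in the formulas); the weights below are random placeholders rather than trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step matching the equations above; `params` holds the weight matrices."""
    W_z, U_z, W_r, U_r, W_h, U_h = (params[k] for k in ("W_z", "U_z", "W_r", "U_r", "W_h", "U_h"))
    z = sigmoid(W_z @ x_t + U_z @ h_prev)                 # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev)                 # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev))     # candidate state
    return (1 - z) * h_prev + z * h_tilde                 # new hidden state

# Toy usage: input size 3, hidden size 4, random weights
rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = {k: rng.normal(0, 0.1, (d_h, d_in if k.startswith("W") else d_h))
          for k in ("W_z", "U_z", "W_r", "U_r", "W_h", "U_h")}
h = np.zeros(d_h)
for x in rng.normal(size=(10, d_in)):    # unroll over a length-10 sequence
    h = gru_cell(x, h, params)
print(h)
```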

TCN / Dilated Conv (kernel len $K$, dilation $d$):
$$
y(t)=\sum_{i=0}^{K-1} w_i\,x\big(t-d\cdot i\big).
$$
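
The dilated causal convolution translates almost directly into code. This single-channel NumPy sketch zero-pads on the left so the output at time t never looks into the future; real TCNs stack such layers with growing dilations and residual connections.

```python
import numpy as np

def dilated_causal_conv(x, w, d):
    """y(t) = sum_i w[i] * x[t - d*i], zero-padded on the left (causal)."""
    K = len(w)
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for i in range(K):
            if t - d * i >= 0:
                y[t] += w[i] * x[t - d * i]
    return y

x = np.arange(16, dtype=float)
print(dilated_causal_conv(x, w=np.array([0.5, 0.3, 0.2]), d=2))
```
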
Context: from MLP/RNN to LSTM/GRU and CNN/TCN/WaveNet/GNN in TSF.


3) Transformer for TSF

Scaled dot-product attention
$$
\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\Big(\frac{QK^\top}{\sqrt{d_k}}\Big)V .
$$
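
A minimal NumPy implementation of the formula above (single head, no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, exactly as above."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
L, d_k = 8, 16                                            # sequence length, key dim
Q, K, V = (rng.normal(size=(L, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)        # (8, 16)
```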

Encoder to regression head (sketch)
$$
H=\mathrm{TransformerEnc}(X+\mathrm{PE}),\quad
\hat Y=W_o\,\mathrm{Pool}(H)+b_o .
$$

Variants reduce the $O(L^2)$ attention cost (sparse attention, locality-sensitive hashing) and add seasonal-trend decomposition, patching, and frequency-domain blocks: LogTrans (LogSparse), Reformer, Informer, Autoformer, Pyraformer, FEDformer.
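
For the encoder-to-regression-head pattern sketched above, a compact PyTorch version might look like the following; the learned positional embedding, mean pooling, and all hyperparameters are illustrative assumptions, not a specific published architecture.

```python
import torch
import torch.nn as nn

class TinyTSTransformer(nn.Module):
    def __init__(self, d_in=1, d_model=64, nhead=4, num_layers=2, horizon=24, max_len=512):
        super().__init__()
        self.proj = nn.Linear(d_in, d_model)                       # value embedding
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positional embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, horizon)                    # pooled features -> forecasts

    def forward(self, x):                        # x: (batch, length, d_in)
        h = self.proj(x) + self.pos[:, : x.size(1)]
        h = self.encoder(h)
        return self.head(h.mean(dim=1))          # mean-pool over time, then regress

model = TinyTSTransformer()
print(model(torch.randn(8, 96, 1)).shape)        # torch.Size([8, 24])
```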

4) Diffusion Models (Probabilistic TSF)

DDPM (discrete)
Forward:
$$
q(\mathbf x_t|\mathbf x_{t-1})=\mathcal N\big(\sqrt{1-\beta_t}\,\mathbf x_{t-1},\,\beta_t I\big),
\quad
q(\mathbf x_t|\mathbf x_0)=\mathcal N\big(\sqrt{\bar\alpha_t}\,\mathbf x_0,\,(1-\bar\alpha_t)I\big),
$$
with $\alpha_t=1-\beta_t,\ \bar\alpha_t=\prod_{s=1}^t\alpha_s$.
Reverse:
$$
p_\theta(\mathbf x_{t-1}|\mathbf x_t)=
\mathcal N\big(\mu_\theta(\mathbf x_t,t),\,\Sigma_\theta(\mathbf x_t,t)\big).
$$
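
Because $q(\mathbf x_t|\mathbf x_0)$ is available in closed form, training reduces to noising a clean series and regressing the injected noise. The sketch below shows the forward sampling and the simplified epsilon-prediction loss; `eps_model` is a stand-in for any denoising network, and the linear beta schedule is one common choice.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # linear noise schedule (a common choice)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def simple_loss(eps_model, x0, rng):
    """L_simple = E || eps - eps_theta(x_t, t) ||^2 at a random timestep t."""
    t = rng.integers(T)
    x_t, eps = q_sample(x0, t, rng)
    return np.mean((eps - eps_model(x_t, t)) ** 2)

rng = np.random.default_rng(0)
x0 = np.sin(np.linspace(0, 4 * np.pi, 96))            # a toy "series" to noise
dummy_model = lambda x_t, t: np.zeros_like(x_t)       # placeholder denoiser
print(simple_loss(dummy_model, x0, rng))
```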

Score-based SDE (continuous)
$$
d\mathbf x=f(\mathbf x,t)\,dt+g(t)\,d\mathbf w,\quad
d\mathbf x=\big[f(\mathbf x,t)-g(t)^2\,\nabla_{\mathbf x}\log p_t(\mathbf x)\big]dt+g(t)\,d\bar{\mathbf w}.
$$
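
Sampling then amounts to integrating the reverse-time SDE backward, typically with an Euler-Maruyama discretization. In the sketch below, `f`, `g`, and `score` are placeholder callables; in practice the score is a learned network.

```python
import numpy as np

def reverse_sde_step(x, t, dt, f, g, score, rng):
    """One Euler-Maruyama step of the reverse-time SDE, from t to t - dt."""
    drift = f(x, t) - g(t) ** 2 * score(x, t)
    noise = rng.normal(size=x.shape)
    return x - drift * dt + g(t) * np.sqrt(dt) * noise

# Toy usage: f = 0, g = 1, and the score of a standard normal as stand-ins.
rng = np.random.default_rng(0)
f = lambda x, t: np.zeros_like(x)
g = lambda t: 1.0
score = lambda x, t: -x                       # placeholder score function
x = rng.normal(size=96)
for t in np.linspace(1.0, 1e-3, 100):
    x = reverse_sde_step(x, t, dt=1.0 / 100, f=f, g=g, score=score, rng=rng)
print(x.std())
```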

TSF implementations often combine decomposition (trend/seasonal), frequency-domain cues, multi-resolution scheduling, and latent diffusion for efficiency; guidance can be classifier-based or classifier-free.

5) State-Space Models (SSM), S4, and Mamba

Discrete LTI SSM
$$
\mathbf x_t=A\,\mathbf x_{t-1}+B\,\mathbf u_t,\quad
\mathbf y_t=C\,\mathbf x_t+D\,\mathbf u_t.
$$
Convolutional (kernel) view:
$$
\mathbf y_t=\sum_{\tau\ge0}K_\tau\,\mathbf u_{t-\tau},\quad K_\tau=C A^\tau B+\mathbf 1_{\{\tau=0\}}\,D.
$$
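
The recurrent and convolutional views are equivalent, which is easy to check numerically. This NumPy sketch runs the scan and materializes the kernel $K_\tau$ for a single-input, single-output system with randomly chosen (stable) matrices.

```python
import numpy as np

def ssm_scan(A, B, C, D, u):
    """Run x_t = A x_{t-1} + B u_t, y_t = C x_t + D u_t (single input/output)."""
    x, ys = np.zeros(A.shape[0]), []
    for u_t in u:
        x = A @ x + B * u_t
        ys.append(C @ x + D * u_t)
    return np.array(ys)

def ssm_kernel(A, B, C, D, L):
    """Materialize K_tau = C A^tau B (+ D at tau = 0) so the SSM is a convolution."""
    K = np.array([C @ np.linalg.matrix_power(A, tau) @ B for tau in range(L)])
    K[0] += D
    return K

rng = np.random.default_rng(0)
n, L = 4, 32
A = 0.9 * np.eye(n) + 0.05 * rng.normal(size=(n, n))      # roughly stable dynamics
B, C, D = rng.normal(size=n), rng.normal(size=n), 0.1
u = rng.normal(size=L)
y_scan = ssm_scan(A, B, C, D, u)
K = ssm_kernel(A, B, C, D, L)
y_conv = np.array([sum(K[tau] * u[t - tau] for tau in range(t + 1)) for t in range(L)])
print(np.allclose(y_scan, y_conv))            # True: recurrent and conv views agree
```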

S4 → H3: specialized parameterizations/diagonalization and hierarchical gated blocks stabilize and extend long-range modeling. Mamba makes SSM parameters input-selective and uses kernel fusion/parallel scan for linear-time inference:
$$
\mathbf x_t=A(\mathbf u_t)\mathbf x_{t-1}+B(\mathbf u_t)\mathbf u_t,\quad
\mathbf y_t=C(\mathbf u_t)\mathbf x_t.
$$
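
As a purely conceptual sketch of input-selective parameters (not Mamba's actual discretization, gating, or fused parallel scan), the state matrices can simply be made functions of $\mathbf u_t$:

```python
import numpy as np

def selective_ssm(u, n=4, seed=0):
    """Conceptual sketch only: A, B, C depend on the current input u_t."""
    rng = np.random.default_rng(seed)
    W_a, W_b, W_c = rng.normal(0, 0.1, (3, n))      # maps from u_t to per-step parameters
    x, ys = np.zeros(n), []
    for u_t in u:
        A_t = np.diag(np.exp(-np.abs(W_a * u_t)))   # input-dependent diagonal A in (0, 1]
        B_t = W_b * u_t
        C_t = W_c * u_t
        x = A_t @ x + B_t * u_t
        ys.append(C_t @ x)
    return np.array(ys)

print(selective_ssm(np.sin(np.linspace(0, 6, 50))).shape)   # (50,)
```
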
Context: why SSMs re-emerged for long-context TSF and how Mamba variants evolve.


6) Common TSF Losses

Point losses
$$
\mathrm{MSE}=\frac1n\sum_{t=1}^n (y_t-\hat y_t)^2,\qquad
\mathrm{MAE}=\frac1n\sum_{t=1}^n |y_t-\hat y_t|.
$$

Quantile (pinball) loss, $\tau\in(0,1)$
$$
\mathcal L_\tau(y,\hat q)=\max\big(\tau(y-\hat q),\,(\tau-1)(y-\hat q)\big).
$$

Gaussian NLL
$$
\mathcal L_{\text{NLL}}=\frac12\sum_{t=1}^n\Big[\log(2\pi\sigma_t^2)+\frac{(y_t-\mu_t)^2}{\sigma_t^2}\Big].
$$
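
All four losses are a few lines of NumPy each:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def pinball_loss(y, q_hat, tau):
    """Quantile (pinball) loss for quantile level tau, averaged over time steps."""
    diff = y - q_hat
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

def gaussian_nll(y, mu, sigma):
    """Gaussian negative log-likelihood, summed over time steps."""
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / sigma**2)

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([0.8, 2.5, 2.9])
print(mse(y, y_hat), mae(y, y_hat))
print(pinball_loss(y, y_hat, tau=0.9))
print(gaussian_nll(y, mu=y_hat, sigma=np.full(3, 0.5)))
```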

Context: metrics landscape in the survey; MAE/MSE remain common for comparability.


Practical Reading Guide (What to Reference When)

  • Need a map of models and eras? See the survey’s Sections 3 & 4, Fig. 3, Fig. 8.
  • Transformer variants & cost fixes? Section 3.3.1, Fig. 9 (full vs. sparse attention).
  • Diffusion for TSF (DDPM/SDE/latent/guidance)? Section 4.4.
  • SSM/S4/Mamba (structures, stability, channels)? Section 4.5, Fig. 14.
  • Issue-driven view (channel dependency, shift, causality, features)? Section 5.

Source

This article summarizes and quotes from:
J. Kim, H. Kim, H. Kim, D. Lee, S. Yoon (2025). “A Comprehensive Survey of Deep Learning for Time Series Forecasting: Architectural Diversity and Open Challenges.” (arXiv, May 1, 2025). Sections and figures referenced inline.
