[Review] Deep learning models for real-life human activity recognition from smartphone sensor data

Paper reviewed: García-González, D., Rivero, D., Fernández-Blanco, E., & Luaces, M. R. (2023). Deep learning models for real-life human activity recognition from smartphone sensor data. Internet of Things, 24, 100925. The article is open access under CC BY 4.0 and may be reused with attribution.

The authors target real-life (in-the-wild) human activity recognition (HAR) from personal smartphones and show that hybrid deep models—specifically (DS-CNN)-LSTM—reach 94.80% accuracy (stratified 10-fold CV) on a four-class dataset (inactive, active, walking, driving) collected without controlled placement or fixed sampling rates. Longer windows (60–90 s) help, and fusing accelerometer, gyroscope, magnetometer, and GPS outperforms single-sensor setups.


Background & Positioning

Classic smartphone HAR benchmarks (e.g., UCI-HAR, WISDM, HHAR, UniMiB-SHAR) were curated in controlled conditions with fixed placements (waist or pocket) and nearly constant sampling rates; strong numbers in lab conditions often do not transfer to daily life variability (orientation, placement, device heterogeneity). The paper contrasts these datasets and motivates a real-life alternative that adds GPS and leaves placement free, using participants’ own phones.


Dataset & Label Space

  • Participants: 19 adults (approx. 25–50 years old) using their own, heterogeneous phones in everyday life; only two women, so gender diversity is limited.
  • Sensors: Accelerometer, gyroscope, magnetometer, GPS.
  • Activities: Inactive (phone not carried on the person), Active (tasks involving movement but no sustained displacement; e.g., cleaning or lecturing), Walking (walking or jogging), Driving (all motorized transport).
  • Real-life collection: Users freely started/stopped sessions via an app; placement and orientation were unconstrained.

Why it’s hard: Sampling rates vary widely by sensor, activity, and handset: the accelerometer’s rate spikes under motion, the gyroscope’s rate rises with orientation changes, and GPS observations are sparse. Table 2 in the paper quantifies per-activity means and SDs (e.g., accelerometer: ~9.5 Hz when inactive vs ~37 Hz when driving).


Preprocessing & Segmentation

  • Outlier removal (GPS): Drop jumps >0.2° lat/long or >500 m altitude between samples.
  • Trim edges: Remove first/last 5 s of each session to avoid pickup/put-away transients.
  • GPS replication: Because observations are often >10 s apart, replicate at 10 s steps to align with window stride (paper deviates from the dataset paper’s 1 s replication). Sessions with no GPS are discarded.
  • Gap handling: Discard non-GPS gaps >5 s.
  • Resampling (key design choice): Instead of interpolation, the authors fix IMU-type sensors at 5 Hz (every 200 ms) by selecting the closest real sample; GPS is set to 0.1 Hz (every 10 s). This avoids synthetic samples amid irregular native rates (often ~10 ms or ~200 ms modes).
  • Windows & stride: 30/60/90 s windows with 20/50/80 s overlap (i.e., 10 s stride). Longer windows better reflect the “long-themed” activities. Class counts show imbalance toward inactive.
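A minimal sketch of this resampling-and-windowing step in Python, assuming each sensor stream is a pandas DataFrame indexed by timestamp; the function names, column layout, and the 90 s / 10 s example figures are illustrative, not the authors’ code.

import numpy as np
import pandas as pd

def resample_nearest(df: pd.DataFrame, period: str) -> pd.DataFrame:
    # Pick the closest real sample on a regular time grid (no interpolation),
    # e.g. period="200ms" for IMU-type sensors (5 Hz) or "10s" for GPS (0.1 Hz).
    grid = pd.date_range(df.index[0], df.index[-1], freq=period)
    return df.reindex(grid, method="nearest")

def sliding_windows(arr: np.ndarray, win_len: int, stride: int) -> np.ndarray:
    # Segment a (timesteps, channels) array into overlapping windows.
    # At 5 Hz, a 90 s window is 450 steps and a 10 s stride is 50 steps.
    starts = range(0, arr.shape[0] - win_len + 1, stride)
    return np.stack([arr[s:s + win_len] for s in starts])

# Usage (illustrative):
#   imu = resample_nearest(acc_df, "200ms")
#   X = sliding_windows(imu.to_numpy(), win_len=450, stride=50)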

Models & Architectures

Individual backbones

  • DS-CNN (Depthwise-Separable CNN): Efficiency-oriented 1-D convolutions + max-pooling; chosen for speed with similar accuracy to standard CNNs.
  • LSTM / Bi-LSTM: Sequence models capturing temporal context; Bi-LSTM concatenates forward/backward passes.

Hybrids (proposed)

  • (DS-CNN)-LSTM and (DS-CNN)-(Bi-LSTM) with DS-CNN layers first, then LSTM layers (e.g., for 2 layers: DS-CNN → DS-CNN → LSTM → LSTM). Max-pool outputs feed the LSTM stack.
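A minimal Keras sketch of the 2-layer hybrid pattern described above; the filter/unit counts, kernel size, and 450-step input (90 s at 5 Hz) are illustrative picks from the search grid, not the authors’ exact configuration.

from tensorflow import keras
from tensorflow.keras import layers

def build_dscnn_lstm(timesteps=450, channels=9, n_classes=4):
    # DS-CNN -> DS-CNN -> LSTM -> LSTM, with max-pool outputs feeding the LSTM stack.
    # channels: e.g. triaxial acc + gyro + mag; how GPS features are folded in
    # is an implementation choice, not specified here.
    inputs = keras.Input(shape=(timesteps, channels))
    x = layers.SeparableConv1D(32, kernel_size=5, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.SeparableConv1D(64, kernel_size=5, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.LSTM(64, return_sequences=True)(x)
    x = layers.LSTM(64)(x)
    x = layers.Dropout(0.5)(x)  # dropout before the softmax head
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)

# (DS-CNN)-(Bi-LSTM) swaps each LSTM layer for layers.Bidirectional(layers.LSTM(...)).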

Hyperparameters & Training

  • Grid: layers ∈ {1,2}; filters/neurons ∈ {16,32,64}; CNN kernels ∈ {3,5,7}; padding “same”; ReLU activations; Adam optimizer; cross-entropy loss. Batch size 32, chosen after pilot runs. Dropout 0.5 before the softmax head. At most 100 training iterations with early stopping (patience 20) on validation loss.
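A hedged sketch of the corresponding training setup (Adam, cross-entropy, batch size 32, up to 100 training iterations treated as epochs here, early stopping on validation loss with patience 20). build_dscnn_lstm and the train/validation arrays are assumed from the sketches above; restore_best_weights is an added convenience not stated in the paper.

from tensorflow import keras

model = build_dscnn_lstm()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer labels; use categorical_crossentropy for one-hot
              metrics=["accuracy"])
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=20,
                                           restore_best_weights=True)  # assumption, see lead-in
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=100, batch_size=32,
          callbacks=[early_stop])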

Validation & Metrics

  • Stratified 10-fold cross-validation (80% train / 10% validation / 10% test within each fold) to mitigate imbalance and reduce single-split variance. The primary metric is accuracy; macro F1 is reported for representative cases.
  • Note: Because splits are not subject-exclusive, some same-user leakage is possible; the authors argue that the diversity of behaviors lessens its impact but acknowledge it as a limitation.
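A minimal sketch of this 80/10/10 stratified scheme with scikit-learn; the window arrays X and y come from the preprocessing sketch, and the paper’s exact splitting mechanics may differ.

import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, test_idx in skf.split(X, y):
    X_rest, y_rest = X[train_idx], y[train_idx]   # remaining 90% of the data
    X_test, y_test = X[test_idx], y[test_idx]     # 10% test fold
    # carve ~10% of the full set out of the remaining 90% for validation (1/9 of it)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=1/9, stratify=y_rest, random_state=0)
    # ... build and train the model (see training sketch), evaluate on (X_test, y_test),
    #     and append the fold accuracy to fold_accuracies ...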

Results

Overall (all four sensors, best window)

  • (DS-CNN)-LSTM @ 90 s: 94.80% ± 4.09 (accuracy, 10-fold).
  • (DS-CNN)-(Bi-LSTM) @ 90 s: 94.16% ± 5.06.
  • LSTM @ 90 s: 93.52% ± 5.59; Bi-LSTM @ 90 s: 93.09% ± 5.10.
  • DS-CNN @ 90 s: 90.70% ± 7.29.

Window length effect: 60–90 s windows outperform 30 s on this long-duration activity set.

Baselines (same dataset): RF 92.97% ± 6.23, XGBoost 92.23% ± 7.30, k-NN 89.02% ± 8.00 (under comparable CV), confirming a deep-learning edge.

Class difficulty: Active remains hardest (broad, fuzzy definition; segments can include short walking/idle periods), though confusions are reduced relative to earlier work.

Sensor ablations (discussion): Acc & gyro are strongest; magnetometer and GPS alone trail but all-sensor fusion yields the best performance—now confirmed in a real-life dataset.


Discussion & Design Insights

  1. Hybrid > single-family: Temporal modeling is essential; adding LSTM on top of convolutional features boosts accuracy and stabilizes performance.
  2. Windowing matters: Real-life, long-horizon activities benefit from ≥60 s windows; too short misses context (traffic lights, mixed behaviors).
  3. No single “best” micro-tuning: Accuracy is not very sensitive to 1–2 layers, {16,32,64} units/filters, or kernels {3,5,7}; architectural pattern (hybrid) and window length dominate.
  4. Sampling strategy over interpolation: Choosing real nearest samples (5 Hz IMU; 0.1 Hz GPS) avoids artifacts in highly irregular streams and matches the 10 s window stride.

Strengths

  • In-the-wild evidence: Uses free placement, varied phones, and volatile sampling—closer to deployment reality than lab datasets.
  • Clear, reproducible pipeline: Preprocessing rules (outliers, trimming, resampling), transparent validation, and code/dataset links provided.
  • Comprehensive comparisons: Individual vs hybrid models, traditional baselines, window sizes, and sensor discussions.

Limitations

  • Cross-subject leakage risk: Stratified folds may include the same participant across train/test; a subject-wise split would further stress generalization.
  • Label granularity: Active aggregates disparate behaviors; finer subclasses (e.g., “housework,” “typing/standing”) might reduce confusion.
  • Fixed resampling heuristic: The 5 Hz/0.1 Hz choice is pragmatic; alternative interpolation or sensor-specific resampling might yield further gains.

Practical Takeaways (for your projects)

  • Start here (strong baseline): (DS-CNN)-LSTM, 2 layers each (e.g., 32–64 units/filters; kernels 3–7), 60–90 s windows, 10 s stride, Adam, dropout 0.5, early stopping with patience 20.
  • Sensors: Use Acc + Gyro at minimum; add Mag + GPS when available for best results.
  • Preprocessing: Trim edges, handle GPS outliers, nearest-sample resampling (IMU 5 Hz; GPS 0.1 Hz), discard long gaps. Align GPS replication with window stride (10 s).
  • Evaluation: Prefer subject-wise or leave-one-subject-out in follow-ups to quantify cross-user transfer; keep macro-F1 alongside accuracy on imbalanced sets.
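For the subject-wise follow-up suggested above, a hedged sketch using scikit-learn’s LeaveOneGroupOut (not what the paper does); groups is assumed to hold one participant id per window.

from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=groups):
    # All windows from the held-out participant land in the test set,
    # so train and test never share a user.
    # ... train on X[train_idx], predict y_pred on X[test_idx] ...
    # Report macro-F1 alongside accuracy on the imbalanced classes, e.g.
    # sklearn.metrics.f1_score(y[test_idx], y_pred, average="macro").
    ...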

Relation to Prior Work

The paper extends the deep-learning trend in smartphone HAR—e.g., CNNs for local patterns and LSTM/bi-LSTM for temporal context (Yang et al., Ronao & Cho, Ignatov; DeepConvLSTM-style hybrids)—by demonstrating their effectiveness under real-life constraints and by confirming that sensor fusion and longer windows are key in the wild.


Suggested BibTeX

@article{GarciaGonzalez2023RealLifeHAR,
  title   = {Deep learning models for real-life human activity recognition from smartphone sensor data},
  author  = {Garc{\'i}a-Gonz{\'a}lez, Daniel and Rivero, Daniel and Fern{\'a}ndez-Blanco, Enrique and Luaces, Miguel R.},
  journal = {Internet of Things},
  volume  = {24},
  pages   = {100925},
  year    = {2023},
  doi     = {10.1016/j.iot.2023.100925}
}

This article is published under Creative Commons Attribution 4.0 (CC BY 4.0). This review paraphrases the content and avoids reproducing figures/tables or long verbatim text. If you include any figures or extended excerpts in your journal, provide attribution and a link to the source, consistent with CC BY 4.0.