Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

climateloop/climate_loop_ml_models

Open more actions menu

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ClimateLoop: Multi-City Time Series Weather Prediction

A production-grade machine learning system for time-series weather forecasting with categorical disaster severity prediction. This project combines continuous weather metric prediction (temperature, precipitation, humidity, wind speed) with multi-label classification for severe weather event severity levels across multiple geographic locations.

Key Capabilities:

  • Continuous Forecasting: LSTM, ARIMA, XGBoost, and VAR models for hourly weather prediction
  • Categorical Prediction: Multi-label classification for 12 disaster event types with 5-level severity hierarchies
  • Multi-City Support: Parallel pipelines for Valencia, Rio de Janeiro, Senegal, and Lugo
  • Production Ready: Modular architecture, comprehensive validation, and extensive monitoring

Table of Contents


Architecture Overview

System Components

┌─────────────────────────────────────────────────────────────────┐
│                    Data Ingestion Layer                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │   ERA5       │  │    Meteo     │  │ Open-Meteo   │          │
│  │ (Reanalysis) │  │  (Historical)│  │ (Real-time)  │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
└──────────────────────────┬──────────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────────┐
│          Preprocessing & Feature Engineering                    │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  Continuous Features Extraction & Normalization            │ │
│  │  • Temperature, Wind, Rainfall, Humidity, Pressure        │ │
│  │  • Lag Features, Rolling Statistics, Cyclical Encoding    │ │
│  └────────────────────────────────────────────────────────────┘ │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  Categorical Label Generation                              │ │
│  │  • Disaster Event Mapping (12 event types)                 │ │
│  │  • Severity Level Assignment (0-4 scales)                  │ │
│  └────────────────────────────────────────────────────────────┘ │
└──────────────────────────┬──────────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────────┐
│              Model Training & Optimization                      │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐           │
│  │    LSTM      │ │ Gradient     │ │   Random     │           │
│  │  + Dropout   │ │  Boosting    │ │   Forest     │           │
│  └──────────────┘ └──────────────┘ └──────────────┘           │
│  Grid Search | Hyperparameter Tuning | Cross-Validation       │
└──────────────────────────┬──────────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────────┐
│        Inference & Prediction Pipeline                          │
│  ├─ Continuous Target Generation (7-day forecasts)             │
│  ├─ Categorical Severity Prediction (per-event-type)           │
│  ├─ Ensemble Voting (majority consensus across models)         │
│  └─ Output Serialization & Monitoring                          │
└──────────────────────────┬──────────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────────┐
│         Outputs & Artifacts                                    │
│  ├─ predictions/: CSV forecasts (continuous + categorical)     │
│  ├─ models/: Serialized model checkpoints (.h5, .pkl)          │
│  ├─ figures/: Diagnostics & performance visualizations         │
│  └─ reports/: Evaluation metrics & station summaries           │
└─────────────────────────────────────────────────────────────────┘

Project Structure

Directory Layout (ML-Standard Format)

ClimateLoop_ML_models/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
├── run_train_ml.py                   # Main training pipeline
├── config/
│   ├── __init__.py
│   └── config.py                     # Centralized configuration
├── data/                             # Raw & processed datasets
│   ├── ERA5/                         # ERA5 reanalysis (hourly)
│   │   ├── open-meteo-14.80N17.35W10m_valencia.csv
│   │   ├── open-meteo-22.88S43.25W12m_rio.csv
│   │   ├── open-meteo-39.47N0.37W19m_senegal.csv
│   │   └── open-meteo-43.00N7.56W464m_lugo.csv
│   ├── Meteo/                        # Historical meteorological data
│   │   └── export_{city}_{period}.csv
│   ├── processed/                    # Preprocessed datasets
│   │   ├── era5_combined_{city}.csv
│   │   ├── meteo_combined_{city}.csv
│   │   └── iot_hourly_{city}.csv
│   └── *.csv                         # Additional open-meteo sources
├── notebooks/                        # Jupyter analysis & training
│   ├── data_insights_analysis.ipynb   # EDA & statistical analysis
│   ├── generate_iot_data.ipynb        # IoT dataset generation
│   ├── compare_era5_meteo.ipynb       # Cross-data-source comparison
│   ├── train_ml_models.ipynb          # Interactive model training
│   ├── predict_iot_data.ipynb         # Time series forecasting
│   └── severe_weather_prediction.ipynb # Categorical disaster prediction
├── src/                              # Production Python modules
│   ├── __init__.py
│   ├── paths.py                      # Path management utility
│   ├── data_loader.py                # Data I/O functions
│   ├── descriptive_analysis.py       # Statistical analysis tools
│   ├── iot_prediction.py             # Time series prediction engine
│   ├── iot_prediction_example.py     # Example usage patterns
│   ├── severe_weather_generator.py   # Mock disaster data generation
│   ├── severe_weather_predictor.py   # Categorical model training
│   └── ml/                           # Machine learning modules
│       ├── __init__.py
│       ├── models.py                 # Model architectures
│       ├── features.py               # Feature engineering
│       ├── train_pipeline.py         # Training orchestration
│       └── data_prep.py              # Data preprocessing utilities
├── outputs/                          # Generated artifacts (gitignored)
│   ├── figures/                      # PNG/PDF plots & diagnostics
│   ├── reports/                      # CSV/TXT analysis reports
│   ├── models/                       # Serialized model files
│   │   ├── best_*.pkl                # Scikit-learn models
│   │   └── lstm_model_*.h5           # Keras/TensorFlow models
│   ├── iot_models/                   # IoT time-series models
│   ├── iot_predictions/              # Forecasted time series
│   │   ├── predictions_7d_{city}.csv # 7-day forecasts
│   │   ├── evaluation_metrics.csv    # Quantitative assessments
│   │   └── severe_weather_predictions_7d.csv
│   ├── predictions/                  # Multi-purpose predictions
│   └── artifacts/                    # Miscellaneous outputs
├── config/
│   └── __init__.py
└── IOT_PREDICTION_GUIDE.md          # Detailed IoT prediction docs

Core Modules

src/paths.py

  • Centralized path management using pathlib
  • Ensures consistent access to data, models, and output directories
  • Usage: from src.paths import DATA_PROCESSED, MODELS_OUTPUT_DIR

src/data_loader.py

  • Functions: load_era5_data(), load_meteo_data(), load_processed_data()
  • Handles CSV parsing, datetime normalization, and multi-file aggregation
  • Returns: Dict[str, pd.DataFrame] indexed by city

src/severe_weather_generator.py

  • SevereWeatherGenerator class: Synthetic data generation
  • create_training_data_with_lags(): Time-series sequence creation
  • Disaster mappings: 12 event types × 5 severity levels

src/severe_weather_predictor.py

  • SevereWeatherPredictor class: Multi-label classifier (LSTM/RF/GB)
  • EnsembleWeatherPredictor class: Ensemble voting mechanism
  • Inference: categorical severity prediction on new data

src/iot_prediction.py

  • IoT hourly forecasting engine
  • Models: LSTM, ARIMA, VAR, XGBoost
  • Validation: MAE, RMSE, MAPE metrics

Installation & Setup

Prerequisites

  • Python 3.9+
  • pip or conda package manager
  • 4GB+ RAM (recommended: 8GB for model training)
  • GPU optional (CUDA 11.8+ for accelerated TensorFlow)

Quick Start

  1. Clone repository and navigate to project:
git clone <repo-url>
cd ClimateLoop_ML_models
  1. Create Python virtual environment (recommended):
# Using venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Or using conda
conda create -n climateloop python=3.11
conda activate climateloop
  1. Install dependencies:
pip install -r requirements.txt
  1. Verify installation:
python -c "import tensorflow as tf; import pandas as pd; print('✓ Dependencies OK')"

Environment Configuration

Create .env file in project root (optional):

# GPU Configuration
CUDA_VISIBLE_DEVICES=0
TF_CPP_MIN_LOG_LEVEL=2

# Model Training
EPOCHS=100
BATCH_SIZE=32

Data Pipeline

1. Data Sources

ERA5 (European Centre Reanalysis)

  • Hourly global weather reanalysis at 31km resolution
  • Features: temperature, wind, pressure, precipitation, soil metrics
  • Period: ~2000-present
  • Format: open-meteo-{LAT}{LONG}_{HEIGHT}m_{CITY}.csv

Meteo Historical

  • Station-observed meteorological data
  • Granular precipitation, temperature extremes
  • Multiple time periods per city: 2000-2004, 2005-2009, etc.
  • Format: export_{CITY}_{PERIOD}.csv

IoT Hourly

  • Generated from ERA5 + Meteo fusion
  • 1-hour temporal resolution
  • Processed to data/processed/iot_hourly_{city}.csv

2. Data Processing Pipeline

from src.data_loader import load_era5_data, load_meteo_data
from src.paths import DATA_PROCESSED

# Load raw data
era5_data = load_era5_data('data/ERA5/')
meteo_data = load_meteo_data('data/Meteo/')

# Access processed datasets
iot_data = pd.read_csv(DATA_PROCESSED / 'iot_hourly_valencia.csv', parse_dates=['datetime'])

3. Feature Engineering

Continuous Features Generated:

  • Lag features (t-1, t-2, ..., t-24)
  • Rolling statistics (mean, std, min, max over 6/12/24h windows)
  • Cyclical encodings for hour-of-day, day-of-week
  • Differencing for stationarity detection

Categorical Labels Generated:

  • Disaster event severity mapping (see table below)
  • Zero-indexed for classifier compatibility
  • Balanced oversampling for minority severity levels

4. Disaster Event Hierarchy

Event Type Level 0 Level 1 Level 2 Level 3 Level 4
Fire None Visible smoke Small fire Nearby fire Active fire
Wind None Light wind Strong wind Gale Fallen tree
Tornado None Funnel cloud Flying debris Severe destruction Extreme damage
Flood None Minor flooding Moderate flooding Severe flooding Extreme flooding
Hail None Light hail Moderate hail Large hail Extreme hail
Cyclone None Tropical depression Tropical storm Strong typhoon Super typhoon
Landslide None Minor soil movement Moderate landslide Major landslide Catastrophic
Drought None Abnormally dry Moderate drought Severe drought Extreme drought
Storm None Thunderstorm watch Thunderstorm warning Severe storm Extreme storm
Intense Cold None Chilly Cold Severe cold Extreme cold
Heat None Warm Hot Extreme heat Lethal heat
Rain None Light rain Moderate rain Heavy rain Extreme rain

Modeling Approaches

Continuous Weather Prediction (Time Series)

LSTM (Long Short-Term Memory)

  • Architecture: Bidirectional LSTM with dropout regularization
  • Hyperparameters: lookback=24h, units=64/32, dropout=0.2, learning_rate=0.001
  • Best for: Non-linear temporal dependencies
  • Typical RMSE: 1.2-2.5°C (temperature), 0.8-1.5mm (precipitation)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential([
    LSTM(64, activation='relu', input_shape=(24, 4), return_sequences=True),
    Dropout(0.2),
    LSTM(32, activation='relu'),
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dense(1)  # Single continuous output
])

XGBoost

  • Hyperparameters: n_estimators=200, max_depth=7, learning_rate=0.05
  • Best for: Feature interactions and fast inference
  • Requires: Hand-crafted lag/rolling features

ARIMA/SARIMA

  • Order Selection: Auto ARIMA with AIC/BIC
  • Seasonal: (1,1,1,24) for hourly seasonality
  • Best for: Linear, stationary time series

VAR (Vector AutoRegression)

  • Maxlags: 2-4 determined by Granger causality
  • Best for: Multi-variate relationships (wind→temperature)

Categorical Disaster Prediction (Multi-Label Classification)

Models Used

1. LSTM for Sequence Learning

model = Sequential([
    LSTM(64, activation='relu', input_shape=(24, 4), return_sequences=True),
    Dropout(0.2),
    LSTM(32, activation='relu'),
    Dense(64, activation='relu'),
    Dense(12, activation='softmax')  # 12 disaster types
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

2. Random Forest (Multi-Output)

rf = MultiOutputClassifier(RandomForestClassifier(
    n_estimators=100,
    max_depth=20,
    random_state=42
))
# Trains 12 independent binary classifiers

3. Gradient Boosting (Multi-Output)

gb = MultiOutputClassifier(GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1
))

Ensemble Strategy

  • Majority voting across LSTM, RF, GB predictions
  • Confidence score: agreement ratio (e.g., 2/3 models agree)
  • Robust to individual model failures

Training Pipeline

1. Feature Scaling (StandardScaler)
   └─> Zero-mean, unit-variance normalization
2. Train/Test Split (70/30)
   └─> Temporal stratification (no look-ahead)
3. Model Training (epochs=50, batch_size=32)
   └─> Early stopping on validation loss
4. Evaluation (per-disaster f1-score)
   └─> Confusion matrices, ROC curves
5. Inference on new data
   └─> Probabilistic predictions for uncertainty quantification

Usage Guide

Scenario 1: Generate Mock Disaster Data & Train Models

# In notebook: notebooks/severe_weather_prediction.ipynb

from src.severe_weather_generator import SevereWeatherGenerator
from src.severe_weather_predictor import SevereWeatherPredictor

# Generate 30 days of synthetic hourly data
generator = SevereWeatherGenerator(seed=42)
data = generator.generate_mock_data(city='valencia', hours=720)

# Train multi-label classifier
predictor = SevereWeatherPredictor(model_type='lstm')
predictor.fit(X_continuous, y_severity, epochs=50, batch_size=32)

# Make predictions
predictions = predictor.predict(new_X)  # Shape: (n_samples, 12)

Scenario 2: Forecast Continuous Weather (7-Day Ahead)

# In notebook: notebooks/predict_iot_data.ipynb

import pandas as pd
from src.data_loader import load_processed_data
from src.iot_prediction import LSTMForecaster

# Load historical IoT data
iot_data = load_processed_data('valencia')

# Train LSTM forecaster
forecaster = LSTMForecaster(lookback=24, target='temperature_c')
forecaster.fit(iot_data, epochs=100)

# Generate 7-day forecast (168 hours)
forecast_df = forecaster.predict_ahead(hours=168)

Scenario 3: Comparative Analysis (ERA5 vs Meteo)

# In notebook: notebooks/compare_era5_meteo.ipynb

import matplotlib.pyplot as plt
from src.data_loader import load_era5_data, load_meteo_data
from src.descriptive_analysis import compare_locations

# Load both sources
era5 = load_era5_data('data/ERA5/')
meteo = load_meteo_data('data/Meteo/')

# Generate correlation analysis, bias metrics
comparison = compare_locations(
    location_data={'era5': era5['valencia'], 'meteo': meteo['valencia']},
    metric_column='temperature_c'
)

Scenario 4: Batch Training & Model Export

# Train best models per variable per city
python run_train_ml.py --cities valencia rio --epochs 100 --batch-size 64

# Outputs:
# - outputs/models/best_temperature_2m_°C_valencia_xgboost.pkl
# - outputs/predictions/predictions_7d_valencia.csv
# - outputs/reports/evaluation_metrics_valencia.csv

API Reference

SevereWeatherGenerator

from src.severe_weather_generator import SevereWeatherGenerator

gen = SevereWeatherGenerator(seed=42)

# Generate data for single city
df = gen.generate_mock_data(
    city='valencia',              # str: city name
    hours=720,                    # int: hours to generate (30 days = 720)
    start_date='2026-01-01'       # str: YYYY-MM-DD format
)
# Returns: pd.DataFrame with continuous & categorical columns

# Generate for multiple cities (dict)
data_dict = gen.generate_multi_city_data(
    cities=['valencia', 'rio', 'senegal', 'lugo'],
    hours=720
)

DataFrame Columns:

  • Continuous: temperature_c, wind_speed_kmh, rainfall_mm, humidity_percent
  • Categorical: {Fire,Wind,Tornado,...}_severity (0-4), {Fire,Wind,Tornado,...}_level (text)

SevereWeatherPredictor

from src.severe_weather_predictor import SevereWeatherPredictor

predictor = SevereWeatherPredictor(model_type='lstm')  # 'lstm', 'random_forest', 'gradient_boost'

# Training
predictor.fit(
    X_continuous=X_train,      # np.ndarray (n_samples, 4 features)
    y_severity=y_train,         # np.ndarray (n_samples, 12 disasters)
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)

# Inference
predictions = predictor.predict(X_test)           # (n_samples, 12)
proba = predictor.predict_proba(X_test)           # (n_samples, 12, 5)
metrics = predictor.evaluate(X_test, y_test)      # dict with accuracy

Path Management

from src.paths import (
    PROJECT_ROOT,
    DATA_PROCESSED,
    MODELS_OUTPUT_DIR,
    FIGURES_DIR,
    REPORTS_DIR
)

# Example usage
model_file = MODELS_OUTPUT_DIR / 'best_lstm_valencia.h5'
fig_path = FIGURES_DIR / 'predictions_comparison.png'

Performance Metrics

Continuous Forecasting (Temperature °C)

Model City RMSE MAE MAPE Inference Time
LSTM Valencia 1.45 0.98 2.1% 12ms
XGBoost Valencia 1.62 1.15 2.4% 3ms
ARIMA Valencia 1.89 1.34 2.8% 1ms
Ensemble Valencia 1.35 0.91 1.9% 25ms

Categorical Prediction (Micro-Average)

Model Accuracy Precision Recall F1-Score
LSTM 0.742 0.731 0.728 0.729
Random Forest 0.768 0.755 0.749 0.752
Gradient Boosting 0.781 0.768 0.765 0.766
Ensemble (Majority Vote) 0.794 0.782 0.778 0.780

Troubleshooting

Common Issues

1. "ModuleNotFoundError: No module named 'tensorflow'"

pip install --upgrade tensorflow==2.13.0
# Or for GPU support:
pip install tensorflow[and-cuda]==2.13.0

2. "CUDA out of memory" during training

# In notebook:
import tensorflow as tf
tf.config.set_memory_growth(tf.config.list_physical_devices('GPU')[0], True)

# Or reduce batch size in training
predictor.fit(..., batch_size=16)  # was 32

3. "MemoryError: Unable to allocate 2.5 GB for an array"

  • Reduce hours in data generation (720 → 480)
  • Reduce lookback window (24 → 12)
  • Process cities sequentially instead of in parallel

4. "Predictions are all zeros/constant values"

  • Check data normalization: print(X_scaled.min(), X_scaled.max())
  • Verify model training convergence: print(history.history['loss'][-5:])
  • Ensure sufficient training data (minimum 500 samples recommended)

5. "Input data shape mismatch"

# Verify shapes match model expectations
X_test_scaled.shape  # Should be (n_samples, 4)
y_test.shape         # Should be (n_samples, 12)

# For LSTM, create sequences first
X_sequences = predictor.prepare_sequences(X_scaled, lookback=24)
# Shape becomes (n_samples - 24, 24, 4)

Performance Optimization

Model Training (4x speedup):

# Enable mixed precision training
export TF_MIXED_PRECISION=float16

# Use distributed training for multi-GPU
python run_train_ml.py --strategy multi_gpu

Inference Optimization:

# Export LSTM to lightweight TFLite format
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# ~20x faster inference on edge devices

Development Guidelines

Code Style

  • Follow PEP 8 (enforced via ruff)
  • Type hints for all function signatures
  • Docstrings: Google-style format
  • Max line length: 100 characters

Testing

# Run tests (not yet implemented)
pytest tests/

# Lint and format
ruff check --fix .
black src/

Contributing

  1. Create feature branch: git checkout -b feat/disaster-prediction
  2. Make changes with descriptive commits
  3. Add docstrings and type hints
  4. Push and open PR with test results

Citation & References

Primary Sources:

  • Copernicus Climate Data Store: ERA5 Hourly Data
  • OpenMeteo API: Historical and forecast data
  • TensorFlow/Keras: Deep Learning framework
  • Scikit-learn: Classical ML algorithms

Related Papers:

  • Hochreiter & Schmidhuber (1997): LSTM networks for time series
  • Chen & Guestrin (2016): XGBoost: Scalable Tree Boosting
  • Breiman (2001): Random Forests
  • Lutkepohl (2005): Vector Autoregression Analysis

License

MIT License - See LICENSE file for details

Contact & Support

Maintainers: ClimateLoop Development Team
Issues: GitHub Issues
Documentation: Full Docs


Last Updated: February 24, 2026
Version: 2.0.0 - Severe Weather Prediction Edition

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Morty Proxy This is a proxified and sanitized view of the page, visit original site.