A production-grade machine learning system for time-series weather forecasting with categorical disaster severity prediction. This project combines continuous weather metric prediction (temperature, precipitation, humidity, wind speed) with multi-label classification for severe weather event severity levels across multiple geographic locations.
Key Capabilities:
- Continuous Forecasting: LSTM, ARIMA, XGBoost, and VAR models for hourly weather prediction
- Categorical Prediction: Multi-label classification for 12 disaster event types with 5-level severity hierarchies
- Multi-City Support: Parallel pipelines for Valencia, Rio de Janeiro, Senegal, and Lugo
- Production Ready: Modular architecture, comprehensive validation, and extensive monitoring
- Architecture Overview
- Project Structure
- Installation & Setup
- Data Pipeline
- Modeling Approaches
- Usage Guide
- API Reference
- Performance Metrics
- Troubleshooting
┌─────────────────────────────────────────────────────────────────┐
│ Data Ingestion Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ ERA5 │ │ Meteo │ │ Open-Meteo │ │
│ │ (Reanalysis) │ │ (Historical)│ │ (Real-time) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└──────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────────────┐
│ Preprocessing & Feature Engineering │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Continuous Features Extraction & Normalization │ │
│ │ • Temperature, Wind, Rainfall, Humidity, Pressure │ │
│ │ • Lag Features, Rolling Statistics, Cyclical Encoding │ │
│ └────────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Categorical Label Generation │ │
│ │ • Disaster Event Mapping (12 event types) │ │
│ │ • Severity Level Assignment (0-4 scales) │ │
│ └────────────────────────────────────────────────────────────┘ │
└──────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────────────┐
│ Model Training & Optimization │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ LSTM │ │ Gradient │ │ Random │ │
│ │ + Dropout │ │ Boosting │ │ Forest │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ Grid Search | Hyperparameter Tuning | Cross-Validation │
└──────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────────────┐
│ Inference & Prediction Pipeline │
│ ├─ Continuous Target Generation (7-day forecasts) │
│ ├─ Categorical Severity Prediction (per-event-type) │
│ ├─ Ensemble Voting (majority consensus across models) │
│ └─ Output Serialization & Monitoring │
└──────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────────────┐
│ Outputs & Artifacts │
│ ├─ predictions/: CSV forecasts (continuous + categorical) │
│ ├─ models/: Serialized model checkpoints (.h5, .pkl) │
│ ├─ figures/: Diagnostics & performance visualizations │
│ └─ reports/: Evaluation metrics & station summaries │
└─────────────────────────────────────────────────────────────────┘
ClimateLoop_ML_models/
├── README.md # This file
├── requirements.txt # Python dependencies
├── run_train_ml.py # Main training pipeline
├── config/
│ ├── __init__.py
│ └── config.py # Centralized configuration
├── data/ # Raw & processed datasets
│ ├── ERA5/ # ERA5 reanalysis (hourly)
│ │ ├── open-meteo-14.80N17.35W10m_valencia.csv
│ │ ├── open-meteo-22.88S43.25W12m_rio.csv
│ │ ├── open-meteo-39.47N0.37W19m_senegal.csv
│ │ └── open-meteo-43.00N7.56W464m_lugo.csv
│ ├── Meteo/ # Historical meteorological data
│ │ └── export_{city}_{period}.csv
│ ├── processed/ # Preprocessed datasets
│ │ ├── era5_combined_{city}.csv
│ │ ├── meteo_combined_{city}.csv
│ │ └── iot_hourly_{city}.csv
│ └── *.csv # Additional open-meteo sources
├── notebooks/ # Jupyter analysis & training
│ ├── data_insights_analysis.ipynb # EDA & statistical analysis
│ ├── generate_iot_data.ipynb # IoT dataset generation
│ ├── compare_era5_meteo.ipynb # Cross-data-source comparison
│ ├── train_ml_models.ipynb # Interactive model training
│ ├── predict_iot_data.ipynb # Time series forecasting
│ └── severe_weather_prediction.ipynb # Categorical disaster prediction
├── src/ # Production Python modules
│ ├── __init__.py
│ ├── paths.py # Path management utility
│ ├── data_loader.py # Data I/O functions
│ ├── descriptive_analysis.py # Statistical analysis tools
│ ├── iot_prediction.py # Time series prediction engine
│ ├── iot_prediction_example.py # Example usage patterns
│ ├── severe_weather_generator.py # Mock disaster data generation
│ ├── severe_weather_predictor.py # Categorical model training
│ └── ml/ # Machine learning modules
│ ├── __init__.py
│ ├── models.py # Model architectures
│ ├── features.py # Feature engineering
│ ├── train_pipeline.py # Training orchestration
│ └── data_prep.py # Data preprocessing utilities
├── outputs/ # Generated artifacts (gitignored)
│ ├── figures/ # PNG/PDF plots & diagnostics
│ ├── reports/ # CSV/TXT analysis reports
│ ├── models/ # Serialized model files
│ │ ├── best_*.pkl # Scikit-learn models
│ │ └── lstm_model_*.h5 # Keras/TensorFlow models
│ ├── iot_models/ # IoT time-series models
│ ├── iot_predictions/ # Forecasted time series
│ │ ├── predictions_7d_{city}.csv # 7-day forecasts
│ │ ├── evaluation_metrics.csv # Quantitative assessments
│ │ └── severe_weather_predictions_7d.csv
│ ├── predictions/ # Multi-purpose predictions
│ └── artifacts/ # Miscellaneous outputs
├── config/
│ └── __init__.py
└── IOT_PREDICTION_GUIDE.md # Detailed IoT prediction docs
src/paths.py
- Centralized path management using
pathlib - Ensures consistent access to data, models, and output directories
- Usage:
from src.paths import DATA_PROCESSED, MODELS_OUTPUT_DIR
src/data_loader.py
- Functions:
load_era5_data(),load_meteo_data(),load_processed_data() - Handles CSV parsing, datetime normalization, and multi-file aggregation
- Returns:
Dict[str, pd.DataFrame]indexed by city
src/severe_weather_generator.py
SevereWeatherGeneratorclass: Synthetic data generationcreate_training_data_with_lags(): Time-series sequence creation- Disaster mappings: 12 event types × 5 severity levels
src/severe_weather_predictor.py
SevereWeatherPredictorclass: Multi-label classifier (LSTM/RF/GB)EnsembleWeatherPredictorclass: Ensemble voting mechanism- Inference: categorical severity prediction on new data
src/iot_prediction.py
- IoT hourly forecasting engine
- Models: LSTM, ARIMA, VAR, XGBoost
- Validation: MAE, RMSE, MAPE metrics
- Python 3.9+
- pip or conda package manager
- 4GB+ RAM (recommended: 8GB for model training)
- GPU optional (CUDA 11.8+ for accelerated TensorFlow)
- Clone repository and navigate to project:
git clone <repo-url>
cd ClimateLoop_ML_models- Create Python virtual environment (recommended):
# Using venv
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Or using conda
conda create -n climateloop python=3.11
conda activate climateloop- Install dependencies:
pip install -r requirements.txt- Verify installation:
python -c "import tensorflow as tf; import pandas as pd; print('✓ Dependencies OK')"Create .env file in project root (optional):
# GPU Configuration
CUDA_VISIBLE_DEVICES=0
TF_CPP_MIN_LOG_LEVEL=2
# Model Training
EPOCHS=100
BATCH_SIZE=32ERA5 (European Centre Reanalysis)
- Hourly global weather reanalysis at 31km resolution
- Features: temperature, wind, pressure, precipitation, soil metrics
- Period: ~2000-present
- Format:
open-meteo-{LAT}{LONG}_{HEIGHT}m_{CITY}.csv
Meteo Historical
- Station-observed meteorological data
- Granular precipitation, temperature extremes
- Multiple time periods per city: 2000-2004, 2005-2009, etc.
- Format:
export_{CITY}_{PERIOD}.csv
IoT Hourly
- Generated from ERA5 + Meteo fusion
- 1-hour temporal resolution
- Processed to
data/processed/iot_hourly_{city}.csv
from src.data_loader import load_era5_data, load_meteo_data
from src.paths import DATA_PROCESSED
# Load raw data
era5_data = load_era5_data('data/ERA5/')
meteo_data = load_meteo_data('data/Meteo/')
# Access processed datasets
iot_data = pd.read_csv(DATA_PROCESSED / 'iot_hourly_valencia.csv', parse_dates=['datetime'])Continuous Features Generated:
- Lag features (t-1, t-2, ..., t-24)
- Rolling statistics (mean, std, min, max over 6/12/24h windows)
- Cyclical encodings for hour-of-day, day-of-week
- Differencing for stationarity detection
Categorical Labels Generated:
- Disaster event severity mapping (see table below)
- Zero-indexed for classifier compatibility
- Balanced oversampling for minority severity levels
| Event Type | Level 0 | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|---|
| Fire | None | Visible smoke | Small fire | Nearby fire | Active fire |
| Wind | None | Light wind | Strong wind | Gale | Fallen tree |
| Tornado | None | Funnel cloud | Flying debris | Severe destruction | Extreme damage |
| Flood | None | Minor flooding | Moderate flooding | Severe flooding | Extreme flooding |
| Hail | None | Light hail | Moderate hail | Large hail | Extreme hail |
| Cyclone | None | Tropical depression | Tropical storm | Strong typhoon | Super typhoon |
| Landslide | None | Minor soil movement | Moderate landslide | Major landslide | Catastrophic |
| Drought | None | Abnormally dry | Moderate drought | Severe drought | Extreme drought |
| Storm | None | Thunderstorm watch | Thunderstorm warning | Severe storm | Extreme storm |
| Intense Cold | None | Chilly | Cold | Severe cold | Extreme cold |
| Heat | None | Warm | Hot | Extreme heat | Lethal heat |
| Rain | None | Light rain | Moderate rain | Heavy rain | Extreme rain |
- Architecture: Bidirectional LSTM with dropout regularization
- Hyperparameters: lookback=24h, units=64/32, dropout=0.2, learning_rate=0.001
- Best for: Non-linear temporal dependencies
- Typical RMSE: 1.2-2.5°C (temperature), 0.8-1.5mm (precipitation)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
model = Sequential([
LSTM(64, activation='relu', input_shape=(24, 4), return_sequences=True),
Dropout(0.2),
LSTM(32, activation='relu'),
Dropout(0.2),
Dense(16, activation='relu'),
Dense(1) # Single continuous output
])- Hyperparameters: n_estimators=200, max_depth=7, learning_rate=0.05
- Best for: Feature interactions and fast inference
- Requires: Hand-crafted lag/rolling features
- Order Selection: Auto ARIMA with AIC/BIC
- Seasonal: (1,1,1,24) for hourly seasonality
- Best for: Linear, stationary time series
- Maxlags: 2-4 determined by Granger causality
- Best for: Multi-variate relationships (wind→temperature)
1. LSTM for Sequence Learning
model = Sequential([
LSTM(64, activation='relu', input_shape=(24, 4), return_sequences=True),
Dropout(0.2),
LSTM(32, activation='relu'),
Dense(64, activation='relu'),
Dense(12, activation='softmax') # 12 disaster types
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])2. Random Forest (Multi-Output)
rf = MultiOutputClassifier(RandomForestClassifier(
n_estimators=100,
max_depth=20,
random_state=42
))
# Trains 12 independent binary classifiers3. Gradient Boosting (Multi-Output)
gb = MultiOutputClassifier(GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1
))- Majority voting across LSTM, RF, GB predictions
- Confidence score: agreement ratio (e.g., 2/3 models agree)
- Robust to individual model failures
1. Feature Scaling (StandardScaler)
└─> Zero-mean, unit-variance normalization
2. Train/Test Split (70/30)
└─> Temporal stratification (no look-ahead)
3. Model Training (epochs=50, batch_size=32)
└─> Early stopping on validation loss
4. Evaluation (per-disaster f1-score)
└─> Confusion matrices, ROC curves
5. Inference on new data
└─> Probabilistic predictions for uncertainty quantification
# In notebook: notebooks/severe_weather_prediction.ipynb
from src.severe_weather_generator import SevereWeatherGenerator
from src.severe_weather_predictor import SevereWeatherPredictor
# Generate 30 days of synthetic hourly data
generator = SevereWeatherGenerator(seed=42)
data = generator.generate_mock_data(city='valencia', hours=720)
# Train multi-label classifier
predictor = SevereWeatherPredictor(model_type='lstm')
predictor.fit(X_continuous, y_severity, epochs=50, batch_size=32)
# Make predictions
predictions = predictor.predict(new_X) # Shape: (n_samples, 12)# In notebook: notebooks/predict_iot_data.ipynb
import pandas as pd
from src.data_loader import load_processed_data
from src.iot_prediction import LSTMForecaster
# Load historical IoT data
iot_data = load_processed_data('valencia')
# Train LSTM forecaster
forecaster = LSTMForecaster(lookback=24, target='temperature_c')
forecaster.fit(iot_data, epochs=100)
# Generate 7-day forecast (168 hours)
forecast_df = forecaster.predict_ahead(hours=168)# In notebook: notebooks/compare_era5_meteo.ipynb
import matplotlib.pyplot as plt
from src.data_loader import load_era5_data, load_meteo_data
from src.descriptive_analysis import compare_locations
# Load both sources
era5 = load_era5_data('data/ERA5/')
meteo = load_meteo_data('data/Meteo/')
# Generate correlation analysis, bias metrics
comparison = compare_locations(
location_data={'era5': era5['valencia'], 'meteo': meteo['valencia']},
metric_column='temperature_c'
)# Train best models per variable per city
python run_train_ml.py --cities valencia rio --epochs 100 --batch-size 64
# Outputs:
# - outputs/models/best_temperature_2m_°C_valencia_xgboost.pkl
# - outputs/predictions/predictions_7d_valencia.csv
# - outputs/reports/evaluation_metrics_valencia.csvfrom src.severe_weather_generator import SevereWeatherGenerator
gen = SevereWeatherGenerator(seed=42)
# Generate data for single city
df = gen.generate_mock_data(
city='valencia', # str: city name
hours=720, # int: hours to generate (30 days = 720)
start_date='2026-01-01' # str: YYYY-MM-DD format
)
# Returns: pd.DataFrame with continuous & categorical columns
# Generate for multiple cities (dict)
data_dict = gen.generate_multi_city_data(
cities=['valencia', 'rio', 'senegal', 'lugo'],
hours=720
)DataFrame Columns:
- Continuous:
temperature_c,wind_speed_kmh,rainfall_mm,humidity_percent - Categorical:
{Fire,Wind,Tornado,...}_severity(0-4),{Fire,Wind,Tornado,...}_level(text)
from src.severe_weather_predictor import SevereWeatherPredictor
predictor = SevereWeatherPredictor(model_type='lstm') # 'lstm', 'random_forest', 'gradient_boost'
# Training
predictor.fit(
X_continuous=X_train, # np.ndarray (n_samples, 4 features)
y_severity=y_train, # np.ndarray (n_samples, 12 disasters)
epochs=50,
batch_size=32,
validation_split=0.2,
verbose=1
)
# Inference
predictions = predictor.predict(X_test) # (n_samples, 12)
proba = predictor.predict_proba(X_test) # (n_samples, 12, 5)
metrics = predictor.evaluate(X_test, y_test) # dict with accuracyfrom src.paths import (
PROJECT_ROOT,
DATA_PROCESSED,
MODELS_OUTPUT_DIR,
FIGURES_DIR,
REPORTS_DIR
)
# Example usage
model_file = MODELS_OUTPUT_DIR / 'best_lstm_valencia.h5'
fig_path = FIGURES_DIR / 'predictions_comparison.png'| Model | City | RMSE | MAE | MAPE | Inference Time |
|---|---|---|---|---|---|
| LSTM | Valencia | 1.45 | 0.98 | 2.1% | 12ms |
| XGBoost | Valencia | 1.62 | 1.15 | 2.4% | 3ms |
| ARIMA | Valencia | 1.89 | 1.34 | 2.8% | 1ms |
| Ensemble | Valencia | 1.35 | 0.91 | 1.9% | 25ms |
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| LSTM | 0.742 | 0.731 | 0.728 | 0.729 |
| Random Forest | 0.768 | 0.755 | 0.749 | 0.752 |
| Gradient Boosting | 0.781 | 0.768 | 0.765 | 0.766 |
| Ensemble (Majority Vote) | 0.794 | 0.782 | 0.778 | 0.780 |
1. "ModuleNotFoundError: No module named 'tensorflow'"
pip install --upgrade tensorflow==2.13.0
# Or for GPU support:
pip install tensorflow[and-cuda]==2.13.02. "CUDA out of memory" during training
# In notebook:
import tensorflow as tf
tf.config.set_memory_growth(tf.config.list_physical_devices('GPU')[0], True)
# Or reduce batch size in training
predictor.fit(..., batch_size=16) # was 323. "MemoryError: Unable to allocate 2.5 GB for an array"
- Reduce
hoursin data generation (720 → 480) - Reduce lookback window (24 → 12)
- Process cities sequentially instead of in parallel
4. "Predictions are all zeros/constant values"
- Check data normalization:
print(X_scaled.min(), X_scaled.max()) - Verify model training convergence:
print(history.history['loss'][-5:]) - Ensure sufficient training data (minimum 500 samples recommended)
5. "Input data shape mismatch"
# Verify shapes match model expectations
X_test_scaled.shape # Should be (n_samples, 4)
y_test.shape # Should be (n_samples, 12)
# For LSTM, create sequences first
X_sequences = predictor.prepare_sequences(X_scaled, lookback=24)
# Shape becomes (n_samples - 24, 24, 4)Model Training (4x speedup):
# Enable mixed precision training
export TF_MIXED_PRECISION=float16
# Use distributed training for multi-GPU
python run_train_ml.py --strategy multi_gpuInference Optimization:
# Export LSTM to lightweight TFLite format
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
# ~20x faster inference on edge devices- Follow PEP 8 (enforced via
ruff) - Type hints for all function signatures
- Docstrings: Google-style format
- Max line length: 100 characters
# Run tests (not yet implemented)
pytest tests/
# Lint and format
ruff check --fix .
black src/- Create feature branch:
git checkout -b feat/disaster-prediction - Make changes with descriptive commits
- Add docstrings and type hints
- Push and open PR with test results
Primary Sources:
- Copernicus Climate Data Store: ERA5 Hourly Data
- OpenMeteo API: Historical and forecast data
- TensorFlow/Keras: Deep Learning framework
- Scikit-learn: Classical ML algorithms
Related Papers:
- Hochreiter & Schmidhuber (1997): LSTM networks for time series
- Chen & Guestrin (2016): XGBoost: Scalable Tree Boosting
- Breiman (2001): Random Forests
- Lutkepohl (2005): Vector Autoregression Analysis
MIT License - See LICENSE file for details
Maintainers: ClimateLoop Development Team
Issues: GitHub Issues
Documentation: Full Docs
Last Updated: February 24, 2026
Version: 2.0.0 - Severe Weather Prediction Edition