Real-Time Big Data Analytics Pipeline for Flash Sale Stockout Prevention

Overview

Flash sales such as 11.11 and 12.12 introduce extreme demand volatility in e-commerce systems. Inventory depletion can occur within minutes, while traditional batch analytics react too late — after revenue and customer trust are already lost.

This project implements a fully dockerized, real-time Big Data Analytics (BDA) pipeline designed to prevent stockouts during high-traffic flash sales by providing minute-level, action-oriented insights to decision-makers.

The system continuously ingests high-velocity data, computes distributed analytics, caches KPIs for low-latency access, archives historical data, and surfaces only decision-relevant signals through a tactical dashboard.

Business Problem

During flash sales:

Demand spikes are non-linear and time-dependent
Inventory drains faster than batch refresh cycles
Managers lack visibility into where and when stockouts will occur
Reactive reporting leads to:
- Lost revenue
- Order cancellations
- Wasted ad spend
- Fulfillment bottlenecks

Objective

Enable managers to answer — within one minute:

Which products are about to stock out?
Where is the risk occurring (city / warehouse)?
What action should be taken immediately? (restock, pause ads, reroute fulfillment, monitor)

System Architecture (High-Level)

End-to-end flow:

Statistical Data Generator
        ↓
     MongoDB (Hot Data)
        ↓
 Apache Spark (OLAP & KPIs)
        ↓
     Redis (KPI Cache)
        ↓
   Live Dashboard (Streamlit)

        ↘
        Hadoop HDFS (Archival & Cold Storage)

Orchestration: Apache Airflow Deployment: Fully containerized using Docker & Docker Compose

Technology Stack

Layer	Technology	Purpose
Data Generation	Python (Statistical Simulation)	Realistic flash-sale behavior
Fresh Storage	MongoDB	High-write, low-latency ingestion
Processing	Apache Spark	Distributed OLAP analytics
Orchestration	Apache Airflow	Scheduling & pipeline control
Caching	Redis	Sub-second KPI reads
Archival	Hadoop HDFS	Scalable cold storage
Visualization	Streamlit	Tactical real-time dashboard
Infrastructure	Docker	Reproducibility & isolation

Scale & Data Volume

This project operates at realistic big-data scale:

~69 million streaming records generated
~16 GB of incoming data
Continuous ingestion throughout runtime
Automatic archival once thresholds are crossed

This ensures:

Non-trivial join sizes
Meaningful Spark execution plans
Realistic performance characteristics

Statistical Data Generation (Not Random)

Data is generated using parameterized statistical models, not uniform randomness.

Key characteristics

Context-aware Poisson-based order arrivals
- Orders per minute sampled from Poisson(λ)
- λ dynamically adjusted using:
  - Flash sale intensity
  - Discount levels
  - City demand factors
  - Warehouse pressure

Weighted product selection

Probability proportional to:

popularity × discount_multiplier × scarcity_multiplier

Scarcity correlation
- Lower inventory increases selection bias
- Enables natural stockout emergence
Log-normal price modeling
- Prices centered around discounted base price
- Category-level variance for realism

This approach produces:

Demand skew
Hot SKUs
Geographic imbalance
Emergent stockouts (not hard-coded)

Data Model (Fact–Dimension Schema)

Fact Tables (Streaming)

`orders_fact`

order_id
product_id
seller_id
warehouse_id
city_id
order_qty
order_value
order_ts

`inventory_fact`

inventory_event_id
product_id
warehouse_id
inventory_on_hand
reserved_stock
fulfillment_delay
inventory_ts

Dimension Tables

products_dim (category, base price)
warehouses_dim (capacity, location)
cities_dim (city, province)

All analytics require multi-table joins, enabling OLAP-style queries.

Real-Time KPIs (Computed Every Minute and Cached in Redis)

Orders per minute
Sales velocity (units/min per product & warehouse)
Stockout ETA (minutes-to-stockout)
High-risk SKU count
Warehouse overload count
Geographic risk distribution
Risk trend over time
Actionable alerts table

Query characteristics

Uses WHERE, GROUP BY, and HAVING
Time-windowed (last 15 minutes)
Early aggregation before joins
Derived KPIs instead of raw scans

Orchestration (Airflow DAGs)

DAGs implemented

Dimension Seeding DAG
- Initializes reference tables
Minute-Level KPI DAG
- Triggers Spark analytics every minute
- Updates Redis cache
Archival DAG
- Moves older data from MongoDB → HDFS
- Keeps hot storage lean

Each DAG is isolated for:

Fault containment
Debuggability
Operational clarity

Performance Optimizations

Optimizations are applied at every layer:

Ingestion

Rate-controlled generation
Single write target (MongoDB)

Storage

Hot–cold data separation (MongoDB ↔ HDFS)
Time-windowed queries only

Processing

Early aggregation
Join ordering
Derived KPIs
No full scans

Caching

Redis stores only pre-aggregated KPIs
Eliminates Spark/Mongo hits from dashboard

Visualization

Dashboard reads Redis only
Sub-second refresh
Minimal, decision-centric visuals

Metadata & Governance

Clear schema definitions
Timestamped records
Partitioned archival (date / hour )
Airflow logs provide lineage and execution metadata

Dashboard Design

The dashboard is tactical, not descriptive.

Designed to help managers:

Spot stockout risk early
Identify where it’s happening
Decide action immediately

No scrolling, no heavy charts — only actionable insights.

Why This Matters

This project demonstrates how real-time analytics, when engineered correctly, can shift decision-making from reactive reporting to preventive action.

It reflects:

Realistic data scale
Industry-grade architecture
Practical engineering trade-offs
Business-first analytics design

Status

✔ Fully implemented ✔ Dockerized & reproducible ✔ Real-time & scalable ✔ Industry-aligned architecture

How to Run Locally (Dockerized Setup)

This project is fully containerized using Docker Compose. The steps below bring up the complete real-time Big Data Analytics pipeline locally, including ingestion, processing, orchestration, caching, archival, and visualization.

Prerequisites

Docker ≥ 24.x
Docker Compose (v2)
At least 8–12 GB RAM recommended
Linux (tested)

Build and start core services

Build all images and start base infrastructure services:

docker compose up -d --build

This initializes:

MongoDB (fresh streaming data)
Redis (KPI cache)
Hadoop NameNode & DataNode (archival storage)
Spark Master & Worker
Airflow container (initial state)

Initialize Airflow metadata database

Airflow requires an internal metadata database before it can schedule DAGs:

docker compose run --rm airflow airflow db init

This creates the Airflow metadata schema used for:

DAG state
Task execution history
Scheduling metadata

Create Airflow admin user

Create an admin user to access the Airflow UI:

docker compose run --rm airflow airflow users create \
  --username admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com \
  --password admin

Start Airflow services

Start the Airflow container in detached mode:

docker compose up -d airflow

(Optional) Reset the admin password if required:

docker exec -it airflow airflow users reset-password \
  --username admin \
  --password admin

Start the streaming data generator

Launch the statistical flash-sale data generator:

docker compose up -d generator

This begins:

Continuous order generation
Inventory updates
High-velocity streaming inserts into MongoDB

The generator simulates flash-sale behavior using statistical models, not random noise.

Start the dashboard

Launch the real-time dashboard:

docker compose up -d dashboard

The dashboard:

Reads only from Redis
Displays pre-aggregated KPIs
Updates automatically as data changes

Access it at:

http://localhost:8501

Start Airflow scheduler and webserver

Run the Airflow scheduler and webserver inside the container:

docker exec -it airflow bash -lc "airflow webserver -D && airflow scheduler -D"

Access the Airflow UI at:

http://localhost:8088

From here you can:

Enable DAGs
Monitor task execution
Inspect logs
Observe minute-level analytics & archival workflows

Prepare HDFS archive directories

Initialize the HDFS directory structure for archived data and metadata:

docker exec -it namenode bash -lc "
/opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/metadata &&
/opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/orders_fact &&
/opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/inventory_fact
"

These directories store:

Archived fact data (orders, inventory)
Metadata describing archival batches
Partitioned historical data (date/hour-based)

What Happens After Startup

Once all services are running:

The generator continuously inserts streaming data into MongoDB
Airflow DAGs trigger Spark jobs every minute
Spark computes KPIs using join-based OLAP queries
Redis caches KPIs for low-latency access
HDFS stores archived historical data
The dashboard updates live without recomputation

This setup enables real-time stockout risk detection under flash-sale load.

Verifying the System

Optional checks:

# Check Redis KPIs
docker exec -it redis redis-cli KEYS '*'

# Check MongoDB collections
docker exec -it mongo mongosh

# Check HDFS UI
http://localhost:9870

Shutting Down

To stop all services:

docker compose down

To remove volumes as well:

docker compose down -v

ℹ️ Notes

Kafka is intentionally not used — ingestion is deterministic and controlled for reproducible analytics.
The pipeline is designed for minute-level micro-batching, not event-driven chaos.
Redis is used strictly as a KPI cache, not a source of truth.

Name	Name	Last commit message	Last commit date
Latest commit History 2 Commits 2 Commits
airflow	airflow
dashboard	dashboard
generator	generator
spark	spark
.DS_Store	.DS_Store
README.md	README.md
docker-compose.yml	docker-compose.yml

Search code, repositories, users, issues, pull requests...

Folders and files

Latest commit

History

Repository files navigation

Real-Time Big Data Analytics Pipeline for Flash Sale Stockout Prevention

Overview

Business Problem

Objective

System Architecture (High-Level)

Technology Stack

Scale & Data Volume

Statistical Data Generation (Not Random)

Key characteristics

Data Model (Fact–Dimension Schema)

Fact Tables (Streaming)

orders_fact

inventory_fact

Dimension Tables

Real-Time KPIs (Computed Every Minute and Cached in Redis)

Query characteristics

Orchestration (Airflow DAGs)

DAGs implemented

Performance Optimizations

Ingestion

Storage

Processing

Caching

Visualization

Metadata & Governance

Dashboard Design

Why This Matters

Status

How to Run Locally (Dockerized Setup)

Build and start core services

Initialize Airflow metadata database

Create Airflow admin user

Start Airflow services

Start the streaming data generator

Start the dashboard

Start Airflow scheduler and webserver

Prepare HDFS archive directories

What Happens After Startup

Verifying the System

Shutting Down

ℹ️ Notes

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`orders_fact`

`inventory_fact`

Packages