zee404-code/FullStack_BigDataAnalytics_Project

Real-Time Big Data Analytics Pipeline for Flash Sale Stockout Prevention

Overview

Flash sales such as 11.11 and 12.12 introduce extreme demand volatility in e-commerce systems. Inventory depletion can occur within minutes, while traditional batch analytics react too late — after revenue and customer trust are already lost.

This project implements a fully dockerized, real-time Big Data Analytics (BDA) pipeline designed to prevent stockouts during high-traffic flash sales by providing minute-level, action-oriented insights to decision-makers.

The system continuously ingests high-velocity data, computes distributed analytics, caches KPIs for low-latency access, archives historical data, and surfaces only decision-relevant signals through a tactical dashboard.

Business Problem

During flash sales:

  • Demand spikes are non-linear and time-dependent

  • Inventory drains faster than batch refresh cycles

  • Managers lack visibility into where and when stockouts will occur

  • Reactive reporting leads to:

    • Lost revenue
    • Order cancellations
    • Wasted ad spend
    • Fulfillment bottlenecks

Objective

Enable managers to answer — within one minute:

  1. Which products are about to stock out?
  2. Where is the risk occurring (city / warehouse)?
  3. What action should be taken immediately? (restock, pause ads, reroute fulfillment, monitor)

System Architecture (High-Level)

End-to-end flow:

Statistical Data Generator
        ↓
     MongoDB (Hot Data)
        ↓
 Apache Spark (OLAP & KPIs)
        ↓
     Redis (KPI Cache)
        ↓
   Live Dashboard (Streamlit)

        ↘
        Hadoop HDFS (Archival & Cold Storage)

Orchestration: Apache Airflow

Deployment: Fully containerized using Docker & Docker Compose

[Figure: architecture diagram]

Technology Stack

Layer           | Technology                      | Purpose
Data Generation | Python (Statistical Simulation) | Realistic flash-sale behavior
Fresh Storage   | MongoDB                         | High-write, low-latency ingestion
Processing      | Apache Spark                    | Distributed OLAP analytics
Orchestration   | Apache Airflow                  | Scheduling & pipeline control
Caching         | Redis                           | Sub-second KPI reads
Archival        | Hadoop HDFS                     | Scalable cold storage
Visualization   | Streamlit                       | Tactical real-time dashboard
Infrastructure  | Docker                          | Reproducibility & isolation

Scale & Data Volume

This project operates at realistic big-data scale:

  • ~69 million streaming records generated
  • ~16 GB of incoming data
  • Continuous ingestion throughout runtime
  • Automatic archival once thresholds are crossed

This ensures:

  • Non-trivial join sizes
  • Meaningful Spark execution plans
  • Realistic performance characteristics

Statistical Data Generation (Not Random)

[Figure: statistical data generation]

Data is generated using parameterized statistical models, not uniform randomness.

Key characteristics

  • Context-aware Poisson-based order arrivals

    • Orders per minute sampled from Poisson(λ)

    • λ dynamically adjusted using:

      • Flash sale intensity
      • Discount levels
      • City demand factors
      • Warehouse pressure
  • Weighted product selection

    • Probability proportional to:

      popularity × discount_multiplier × scarcity_multiplier
      
  • Scarcity correlation

    • Lower inventory increases selection bias
    • Enables natural stockout emergence
  • Log-normal price modeling

    • Prices centered around discounted base price
    • Category-level variance for realism

This approach produces:

  • Demand skew
  • Hot SKUs
  • Geographic imbalance
  • Emergent stockouts (not hard-coded)
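
The generation steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the project's generator: the function names, the product dictionary fields, and the multiplicative form of `effective_rate` are all hypothetical. The Poisson draw uses Knuth's multiplication method, which is only suitable for small-to-moderate λ.

```python
import math
import random


def effective_rate(base_rate, sale_intensity, discount, city_factor, warehouse_pressure):
    """Scale a base orders-per-minute rate by context (assumed multiplicative form)."""
    return base_rate * sale_intensity * (1.0 + discount) * city_factor * warehouse_pressure


def poisson_sample(rng, lam):
    """Draw one Poisson(lam) sample using Knuth's multiplication method.

    exp(-lam) underflows for lam above roughly 700, so use a different
    sampler for very large rates.
    """
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def pick_product(rng, products):
    """Pick a product with probability proportional to the three multipliers."""
    weights = [
        p["popularity"] * p["discount_multiplier"] * p["scarcity_multiplier"]
        for p in products
    ]
    return rng.choices(products, weights=weights, k=1)[0]


def sample_price(rng, base_price, discount, sigma):
    """Log-normal price whose median is the discounted base price."""
    discounted = base_price * (1.0 - discount)
    return rng.lognormvariate(math.log(discounted), sigma)


def generate_minute(rng, products, lam, price_sigma=0.1):
    """Generate the orders for one simulated minute."""
    orders = []
    for _ in range(poisson_sample(rng, lam)):
        product = pick_product(rng, products)
        price = sample_price(rng, product["base_price"], product["discount"], price_sigma)
        orders.append({"product_id": product["product_id"], "order_value": round(price, 2)})
    return orders
```

In this sketch, a product whose `scarcity_multiplier` rises as its inventory falls is selected more often, which is one way the scarcity correlation described above could let stockouts emerge without being scripted.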

Data Model (Fact–Dimension Schema)

Data Schema

Fact Tables (Streaming)

orders_fact

  • order_id
  • product_id
  • seller_id
  • warehouse_id
  • city_id
  • order_qty
  • order_value
  • order_ts

inventory_fact

  • inventory_event_id
  • product_id
  • warehouse_id
  • inventory_on_hand
  • reserved_stock
  • fulfillment_delay
  • inventory_ts

Dimension Tables

  • products_dim (category, base price)
  • warehouses_dim (capacity, location)
  • cities_dim (city, province)

Every KPI query joins the streaming fact tables to these dimension tables, so the analytics are OLAP-style multi-table queries.

Real-Time KPIs (Computed Every Minute and Cached in Redis)

KPIs

  • Orders per minute
  • Sales velocity (units/min per product & warehouse)
  • Stockout ETA (minutes-to-stockout)
  • High-risk SKU count
  • Warehouse overload count
  • Geographic risk distribution
  • Risk trend over time
  • Actionable alerts table
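
The README does not give the KPI formulas, so the following is only a plausible sketch of how sales velocity and stockout ETA relate. It assumes that available stock is `inventory_on_hand` minus `reserved_stock`, and that a SKU counts as high risk when its ETA falls under a threshold; the function names and the 30-minute default are hypothetical.

```python
import math


def sales_velocity(units_sold, window_minutes):
    """Units sold per minute over the observation window."""
    return units_sold / window_minutes


def stockout_eta_minutes(inventory_on_hand, reserved_stock, velocity):
    """Minutes until available stock reaches zero at the current velocity."""
    available = max(inventory_on_hand - reserved_stock, 0)
    if velocity <= 0:
        # Nothing is selling, so no stockout is projected.
        return math.inf
    return available / velocity


def is_high_risk(eta_minutes, threshold_minutes=30.0):
    """Flag SKUs projected to stock out within the threshold (assumed rule)."""
    return eta_minutes <= threshold_minutes
```

For example, 300 units sold in a 15-minute window is a velocity of 20 units/min; with 500 units on hand and 100 reserved, the projected stockout is 400 / 20 = 20 minutes away.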

Query characteristics

  • Uses WHERE, GROUP BY, and HAVING
  • Time-windowed (last 15 minutes)
  • Early aggregation before joins
  • Derived KPIs instead of raw scans
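
The query shape listed above (window filter, aggregation, HAVING, then a join) can be illustrated in plain Python. The project runs these queries in Spark; this sketch only shows the order of operations on in-memory records, and the function name, the `min_units` threshold, and the dictionary layout of `products_dim` are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hot_skus(orders, products_dim, now, window_minutes=15, min_units=10):
    """Return per-(product, warehouse) sales inside the window, busiest first."""
    cutoff = now - timedelta(minutes=window_minutes)

    # WHERE + GROUP BY: filter to the time window and aggregate first,
    # so the later join touches one row per group rather than every order.
    units = defaultdict(int)
    for order in orders:
        if order["order_ts"] >= cutoff:
            units[(order["product_id"], order["warehouse_id"])] += order["order_qty"]

    rows = []
    for (product_id, warehouse_id), total in units.items():
        # HAVING: drop groups below the activity threshold.
        if total < min_units:
            continue
        # JOIN: enrich the small aggregated result with dimension attributes.
        rows.append({
            "product_id": product_id,
            "warehouse_id": warehouse_id,
            "units": total,
            "velocity": total / window_minutes,
            "category": products_dim[product_id]["category"],
        })
    return sorted(rows, key=lambda row: row["units"], reverse=True)
```

Aggregating before the join keeps the joined set proportional to the number of active product and warehouse pairs, not to the number of raw orders.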

Orchestration (Airflow DAGs)

DAGs implemented

  1. Dimension Seeding DAG

    • Initializes reference tables
  2. Minute-Level KPI DAG

    • Triggers Spark analytics every minute
    • Updates Redis cache
  3. Archival DAG

    • Moves older data from MongoDB → HDFS
    • Keeps hot storage lean

Each DAG is isolated for:

  • Fault containment
  • Debuggability
  • Operational clarity

Performance Optimizations

Optimizations are applied at every layer:

Ingestion

  • Rate-controlled generation
  • Single write target (MongoDB)

Storage

  • Hot–cold data separation (MongoDB → HDFS)
  • Time-windowed queries only

Processing

  • Early aggregation
  • Join ordering
  • Derived KPIs
  • No full scans

Caching

  • Redis stores only pre-aggregated KPIs
  • The dashboard never queries Spark or MongoDB directly
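
A minimal sketch of the caching pattern follows, assuming KPIs are stored as JSON strings under prefixed keys with an expiry. The `kpi:` prefix, the TTL, and the function names are hypothetical, not taken from the project. `client` is any object exposing `set(name, value, ex=...)` and `get(name)`, which matches the redis-py client; the `InMemoryClient` class is a stand-in so the example runs without a Redis server.

```python
import json


class InMemoryClient:
    """Stand-in with the subset of the redis-py interface used below."""

    def __init__(self):
        self.data = {}
        self.ttl = {}

    def set(self, name, value, ex=None):
        self.data[name] = value
        self.ttl[name] = ex  # recorded only; this stand-in never expires keys

    def get(self, name):
        return self.data.get(name)


def cache_kpis(client, kpis, ttl_seconds=120, prefix="kpi:"):
    """Write each pre-aggregated KPI as JSON so readers never recompute it."""
    for name, value in kpis.items():
        client.set(prefix + name, json.dumps(value), ex=ttl_seconds)


def read_kpi(client, name, prefix="kpi:"):
    """Read one KPI back, or None if it is missing or has expired."""
    raw = client.get(prefix + name)
    return None if raw is None else json.loads(raw)


if __name__ == "__main__":
    client = InMemoryClient()
    cache_kpis(client, {"orders_per_minute": 412, "high_risk_skus": ["P1", "P7"]})
    print(read_kpi(client, "high_risk_skus"))
```

A TTL slightly longer than the one-minute refresh interval is one way to make stale values disappear if the KPI job stops, which would let the dashboard show a gap instead of outdated numbers.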

Visualization

  • Dashboard reads Redis only
  • Sub-second refresh
  • Minimal, decision-centric visuals

Metadata & Governance

  • Clear schema definitions
  • Timestamped records
  • Partitioned archival (date / hour)
  • Airflow logs provide lineage and execution metadata

Dashboard Design

The dashboard is tactical, not descriptive.

Designed to help managers:

  • Spot stockout risk early
  • Identify where it’s happening
  • Decide action immediately

No scrolling, no heavy charts — only actionable insights.

Why This Matters

This project demonstrates how real-time analytics, when engineered correctly, can shift decision-making from reactive reporting to preventive action.

It reflects:

  • Realistic data scale
  • Industry-grade architecture
  • Practical engineering trade-offs
  • Business-first analytics design

Status

  ✔ Fully implemented
  ✔ Dockerized & reproducible
  ✔ Real-time & scalable
  ✔ Industry-aligned architecture

How to Run Locally (Dockerized Setup)

This project is fully containerized using Docker Compose. The steps below bring up the complete real-time Big Data Analytics pipeline locally, including ingestion, processing, orchestration, caching, archival, and visualization.

Prerequisites

  • Docker ≥ 24.x
  • Docker Compose (v2)
  • 8–12 GB RAM recommended
  • Linux (tested)

Build and start core services

Build all images and start base infrastructure services:

docker compose up -d --build

This initializes:

  • MongoDB (fresh streaming data)
  • Redis (KPI cache)
  • Hadoop NameNode & DataNode (archival storage)
  • Spark Master & Worker
  • Airflow container (initial state)

Initialize Airflow metadata database

Airflow requires an internal metadata database before it can schedule DAGs:

docker compose run --rm airflow airflow db init

This creates the Airflow metadata schema used for:

  • DAG state
  • Task execution history
  • Scheduling metadata

Create Airflow admin user

Create an admin user to access the Airflow UI:

docker compose run --rm airflow airflow users create \
  --username admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com \
  --password admin

Start Airflow services

Start the Airflow container in detached mode:

docker compose up -d airflow

(Optional) Reset the admin password if required:

docker exec -it airflow airflow users reset-password \
  --username admin \
  --password admin

Start the streaming data generator

Launch the statistical flash-sale data generator:

docker compose up -d generator

This begins:

  • Continuous order generation
  • Inventory updates
  • High-velocity streaming inserts into MongoDB

The generator simulates flash-sale behavior using statistical models, not random noise.

Start the dashboard

Launch the real-time dashboard:

docker compose up -d dashboard

The dashboard:

  • Reads only from Redis
  • Displays pre-aggregated KPIs
  • Updates automatically as data changes

Access it at:

http://localhost:8501

Start Airflow scheduler and webserver

Run the Airflow scheduler and webserver inside the container:

docker exec -it airflow bash -lc "airflow webserver -D && airflow scheduler -D"

Access the Airflow UI at:

http://localhost:8088

From here you can:

  • Enable DAGs
  • Monitor task execution
  • Inspect logs
  • Observe minute-level analytics & archival workflows

Prepare HDFS archive directories

Initialize the HDFS directory structure for archived data and metadata:

docker exec -it namenode bash -lc "
/opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/metadata &&
/opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/orders_fact &&
/opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/inventory_fact
"

These directories store:

  • Archived fact data (orders, inventory)
  • Metadata describing archival batches
  • Partitioned historical data (date/hour-based)
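
As an illustration of date/hour partitioning, the sketch below splits records into those that stay hot and those that are grouped by target archive directory. It is not the project's archival job: the Hive-style `date=.../hour=...` layout, the cutoff rule, and the function names are assumptions; only the `/daraz_flashsale_archive` root and table names come from the commands above.

```python
from collections import defaultdict
from datetime import datetime

ARCHIVE_ROOT = "/daraz_flashsale_archive"


def partition_path(table, ts):
    """Build the archive directory for a record timestamp (assumed layout)."""
    return f"{ARCHIVE_ROOT}/{table}/date={ts:%Y-%m-%d}/hour={ts:%H}"


def split_hot_cold(records, table, ts_field, cutoff):
    """Keep records at or after the cutoff hot; group older ones by partition."""
    hot = []
    cold = defaultdict(list)
    for record in records:
        ts = record[ts_field]
        if ts >= cutoff:
            hot.append(record)
        else:
            cold[partition_path(table, ts)].append(record)
    return hot, dict(cold)


if __name__ == "__main__":
    print(partition_path("orders_fact", datetime(2024, 11, 11, 9, 30)))
```

Grouping by partition before writing means each archival batch touches a small number of directories, and a later query for a given hour can read one directory instead of scanning the whole archive.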

What Happens After Startup

Once all services are running:

  • The generator continuously inserts streaming data into MongoDB
  • Airflow DAGs trigger Spark jobs every minute
  • Spark computes KPIs using join-based OLAP queries
  • Redis caches KPIs for low-latency access
  • HDFS stores archived historical data
  • The dashboard updates live without recomputation

This setup enables real-time stockout risk detection under flash-sale load.

Verifying the System

Optional checks:

# Check Redis KPIs
docker exec -it redis redis-cli KEYS '*'

# Check MongoDB collections
docker exec -it mongo mongosh

# Check HDFS UI (open in a browser)
http://localhost:9870

Shutting Down

To stop all services:

docker compose down

To remove volumes as well:

docker compose down -v

ℹ️ Notes

  • Kafka is intentionally not used — ingestion is deterministic and controlled for reproducible analytics.
  • The pipeline is designed for minute-level micro-batching, not per-event stream processing.
  • Redis is used strictly as a KPI cache, not a source of truth.

About

A complete end-to-end Dockerized Real-Time Big Data Analytics Pipeline using MongoDB, Spark, Redis, Airflow, and HDFS with a live analytics dashboard.
