smaruf/data-engineering

Data Engineering Learning Journey

Project for learning data engineering as a professional

3-Month Data Engineering Learning Plan

Month 1 — Python Data Engineering + SQL + ETL Basics

Learn:

Python libraries for data engineering:

  • pandas (for data manipulation)
  • SQLAlchemy (Python SQL toolkit)

SQL deep dive:

  • Complex queries, window functions, joins
  • Performance tuning
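
To make the window-function topic concrete, here is a minimal sketch that computes a per-customer running total. It uses Python's built-in `sqlite3` module with an in-memory database so it runs without any setup. The `orders` table, its columns, and the sample rows are hypothetical, and window functions need SQLite 3.25 or newer. The same `SUM(...) OVER (PARTITION BY ... ORDER BY ...)` syntax should also work in Postgres.

```python
import sqlite3

# Hypothetical sample data: (customer, order_day, amount).
ORDERS = [
    ("alice", 1, 10.0),
    ("alice", 2, 5.0),
    ("bob", 1, 7.0),
    ("alice", 3, 2.5),
    ("bob", 2, 3.0),
]


def running_totals(rows):
    """Return (customer, order_day, running_total) rows via a window function."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (customer TEXT, order_day INTEGER, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        query = """
            SELECT customer,
                   order_day,
                   SUM(amount) OVER (
                       PARTITION BY customer   -- restart the sum per customer
                       ORDER BY order_day      -- accumulate in day order
                   ) AS running_total
            FROM orders
            ORDER BY customer, order_day
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in running_totals(ORDERS):
        print(row)
```

Unlike `GROUP BY`, the window function keeps every input row and attaches the aggregate to it, which is why it suits running totals, rankings, and row-to-row comparisons.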

ETL concepts:

  • Building simple pipelines

Practice:

  • Build ETL scripts extracting data from CSV/JSON APIs
  • Transform data with pandas
  • Load data into a local Postgres DB
  • Write complex SQL queries to prepare datasets
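
The practice steps above can be sketched as a small extract, transform, load script. To stay dependency-free, this sketch swaps in the standard library: `csv` stands in for pandas and an in-memory `sqlite3` database stands in for the local Postgres DB. In a real version of this exercise, `pandas.read_csv` and `DataFrame.to_sql` with a SQLAlchemy engine would likely take those roles. The sample data, the `readings` table, and all function names are assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """city,temp_f
Oslo,41
Cairo,95
Lima,
Oslo,41
"""


def extract(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with missing readings, de-duplicate, convert F to C."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row["temp_f"]:
            continue  # skip rows with no temperature
        key = (row["city"], row["temp_f"])
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        temp_c = round((float(row["temp_f"]) - 32) * 5 / 9, 1)
        cleaned.append((row["city"], temp_c))
    return cleaned


def load(conn, rows):
    """Insert the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def run_pipeline(text, conn):
    """Run extract -> transform -> load, then read back what was stored."""
    load(conn, transform(extract(text)))
    return conn.execute(
        "SELECT city, temp_c FROM readings ORDER BY city"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(RAW_CSV, connection))
    connection.close()
```

Keeping the three stages as separate functions makes each one testable on its own and makes it easier to replace a stage later, for example moving the load step to Postgres.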

Resources:


Month 2 — Apache Spark + Data Pipeline Orchestration (Airflow)

Learn:

  • Apache Spark fundamentals (PySpark preferred)
  • Build batch data processing jobs
  • Apache Airflow basics: DAGs, operators, scheduling
  • Set up Airflow locally or in Docker

Practice:

  • Build a Spark job to process a medium-size public dataset (e.g., NYC Taxi Trips, Kaggle datasets)
  • Build an Airflow DAG to run your Spark job on schedule and track success/failure
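
Before writing a real Airflow DAG, it can help to see the core idea in isolation: tasks run in dependency order, and a downstream task does not run when an upstream task fails. The sketch below is not the Airflow API. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) as a stand-in, and the task names and the `run_dag` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run callables in dependency order and record each task's outcome.

    tasks:        dict of task name -> zero-argument callable
    dependencies: dict of task name -> set of upstream task names
    """
    status = {}
    graph = {name: dependencies.get(name, set()) for name in tasks}
    for name in TopologicalSorter(graph).static_order():
        upstream = dependencies.get(name, set())
        if any(status[u] != "success" for u in upstream):
            status[name] = "skipped"  # an upstream task did not succeed
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status


def failing_spark_job():
    """Hypothetical task that simulates a failed Spark job."""
    raise RuntimeError("spark job failed")


if __name__ == "__main__":
    tasks = {
        "download": lambda: None,
        "spark_job": failing_spark_job,
        "report": lambda: None,
    }
    dependencies = {"spark_job": {"download"}, "report": {"spark_job"}}
    print(run_dag(tasks, dependencies))
```

Airflow adds scheduling, retries, logging, and a UI on top of this ordering idea, so the sketch only covers the dependency and success/failure tracking part of the exercise.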

Resources:


Month 3 — Cloud Data Engineering + Streaming (AWS + Kafka)

Learn:

  • AWS Glue (serverless ETL)
  • Amazon Redshift (data warehouse)
  • Amazon Kinesis basics, or Apache Kafka (the open-source alternative)
  • Build real-time data ingestion and processing pipelines

Practice:

  • Create an ETL job in AWS Glue that extracts from S3 and loads into Redshift
  • Build a Kafka producer and consumer app in Python or Java
  • Set up a simple streaming pipeline to process data in real-time (Kafka → Spark Streaming or Kinesis Data Analytics)
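
The producer and consumer pattern behind the streaming exercises can be tried without a broker first. In the sketch below, a `queue.Queue` stands in for a Kafka topic and a background thread stands in for the consumer process. This is an in-process analogy, not a Kafka client. The event shape, the `SENTINEL` end-of-stream marker, and the function names are assumptions for illustration.

```python
import json
import queue
import threading

# Marks the end of the stream in this sketch; real Kafka topics are unbounded.
SENTINEL = None


def produce(topic, events):
    """Serialize each event to bytes and publish it, then signal completion."""
    for event in events:
        topic.put(json.dumps(event).encode("utf-8"))
    topic.put(SENTINEL)


def consume(topic, counts):
    """Read messages until the sentinel arrives, counting views per page."""
    while True:
        message = topic.get()
        if message is SENTINEL:
            break
        event = json.loads(message.decode("utf-8"))
        counts[event["page"]] = counts.get(event["page"], 0) + 1


def run_stream(events):
    """Wire one producer to one consumer and return the aggregated counts."""
    topic = queue.Queue()
    counts = {}
    consumer = threading.Thread(target=consume, args=(topic, counts))
    consumer.start()
    produce(topic, events)
    consumer.join()
    return counts


if __name__ == "__main__":
    sample = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
    print(run_stream(sample))
```

Messages are serialized to bytes on purpose, since a real broker also transports raw bytes and leaves encoding to the producer and consumer.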

Resources:


Bonus Tips

  • Document your projects on GitHub with READMEs and architecture diagrams
  • Share progress as blog posts or short videos — great for portfolio & networking
  • Join data engineering communities (LinkedIn, Reddit r/dataengineering, Slack groups)

Project Structure

This repository contains various data engineering projects and learning resources:

├── 3-weeks-plan/              # 3-week intensive data engineering plan
│   ├── week1-batch-etl/
│   ├── week2-streaming-airflow/
│   └── week3-cloud-etl/
├── full-phased-project/       # Comprehensive phased data engineering project
│   ├── phase1-batch-etl/
│   ├── phase2-streaming-orchestration/
│   └── phase3-cloud-pipeline/
├── basic-statistics/          # Production-ready statistics with Python+Fortran
│   ├── src/                   # Python and Fortran implementations
│   ├── docs/                  # Theory, theorems, and guides
│   ├── examples/              # Real-world use cases
│   ├── tests/                 # Comprehensive test suite
│   ├── api/                   # FastAPI service
│   └── docker/                # Production deployment
├── cobol-project/             # Production-ready COBOL project with converters
│   ├── src/                   # COBOL source programs
│   ├── converters/            # Python ↔ COBOL conversion tools
│   ├── examples/              # Example programs
│   └── docs/                  # Comprehensive documentation
├── fortan-ai/                 # Fortran AI project
├── snowflake-databricks-mastery/  # Cloud data warehouse projects
└── README.md

Featured Projects

📊 Basic Statistics - Production-Ready Statistical Computing

A comprehensive statistics project featuring:

  • Dual Implementation: Python (flexible) + Fortran (performance)
  • Complete Theory: Statistical theorems with proofs
  • Real Use Cases: A/B testing, quality control, market analysis
  • Big Data Ready: Spark integration, distributed computing
  • Production API: FastAPI service with Docker deployment
  • Fully Tested: 28+ unit tests, property-based testing

➡️ Explore Statistics Project

🔷 COBOL Project - Legacy Meets Modern

A comprehensive COBOL project featuring:

  • 5 Production-Ready COBOL Programs demonstrating different COBOL features
  • Bidirectional Converters: Python ↔ COBOL conversion tools
  • Complete Documentation: COBOL features guide and conversion guide
  • Example Programs: Ready-to-use examples for learning

➡️ Explore COBOL Project
