Analyzing Yellow Taxi Data. This project runs in a Docker container that loads data into the DB, and it can be scheduled as a job using Chronos on the Mesos platform or the Kubernetes scheduler.

The data downloading and loading processes are tracked and managed using the table ``tbl_status``.

Requirements: Python 3.7+ and PostgreSQL.

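
The status tracking described above can be illustrated with a minimal sketch. It is not the project's schema: the columns (``year``, ``month``, ``color``, ``status``), the status values, and the helper names are assumptions made for illustration, and the standard-library ``sqlite3`` module stands in for PostgreSQL so that the example runs without a server.

.. code-block:: python

   import sqlite3


   def init_status(conn):
       """Create a hypothetical tbl_status table (the columns are assumed)."""
       conn.execute(
           "CREATE TABLE IF NOT EXISTS tbl_status ("
           " year INTEGER, month INTEGER, color TEXT, status TEXT,"
           " PRIMARY KEY (year, month, color))"
       )


   def set_status(conn, year, month, color, status):
       """Record the latest status for one monthly file."""
       # INSERT OR REPLACE is SQLite syntax; PostgreSQL would use ON CONFLICT.
       conn.execute(
           "INSERT OR REPLACE INTO tbl_status (year, month, color, status)"
           " VALUES (?, ?, ?, ?)",
           (year, month, color, status),
       )
       conn.commit()


   def get_status(conn, year, month, color):
       """Return the recorded status, or None if the file was never seen."""
       row = conn.execute(
           "SELECT status FROM tbl_status"
           " WHERE year = ? AND month = ? AND color = ?",
           (year, month, color),
       ).fetchone()
       return row[0] if row else None

With a table like this, a job could skip a month whose status is already ``loaded`` instead of downloading it again.
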
.. code-block:: text

   .
   |-- Dockerfile
   |-- MANIFEST.in
   |-- Makefile
   |-- README.md
   |-- dataset
   |   |--
   |-- docker-compose.yml
   |-- requirements.txt
   |-- setup.cfg
   |-- setup.py
   |-- sql
   |   `-- V1.0_CREATE_TABLE.sql
   |-- src
   |   `-- processing
   |       |-- __init__.py
   |       |-- cli.py
   |       |-- constant.py
   |       |-- transform.py
   |       `-- util.py
   |-- tests
   |   |-- __init__.py
   |   `-- test_analysis.py
   `-- yellow_taxi_analysis.ipynb

``yellow_taxi_analysis.ipynb`` contains all the analysis outputs.

Steps:

- ``git clone https://github.com/krishnabigdata/taxi-data-analysis.git``
- ``pip install -r taxi-data-analysis/requirements.txt``
- ``pip install --upgrade taxi-data-analysis``
- ``docker-compose up -d``

Building the Docker image and running it with Docker:

- ``make build -e VERSION=latest``
- ``make push -e VERSION=latest``
- ``docker run -t -i --network host -v ${PWD}:/taxi_data_analysis/dataset docker.io/krishnabigdata/taxi_data_analysis --action download --year 2019 --month 1 --color yellow``

Commands to use the ``processing`` CLI:

.. code-block:: bash

   usage: processing [-h] [--year YEAR] [--month {1,2,3,4,5,6}]
                     [--color {yellow}] --action
                     {all,download,load,avg_trip,avg_trip_local,rolling_avg_trip}
                     [--verbose VERBOSE]

   Taxi Data Analysis

   optional arguments:
     -h, --help            show this help message and exit
     --year YEAR           year of data to load (default: 2019)
     --month {1,2,3,4,5,6}
                           month of data to load (default: 1)
     --color {yellow}      color of data to load (default: yellow)
     --action {all,download,load,avg_trip,avg_trip_local,rolling_avg_trip}
                           action to be performed (default: all)
     --verbose VERBOSE     logging action to be performed (default: True)

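
The usage text above is consistent with a parser built on Python's standard ``argparse`` module. The following is a minimal sketch of such a parser, not the project's actual ``cli.py``; the names ``build_parser`` and ``str_to_bool`` and the way ``--verbose`` is converted to a boolean are assumptions made for illustration.

.. code-block:: python

   import argparse

   ACTIONS = ["all", "download", "load", "avg_trip", "avg_trip_local",
              "rolling_avg_trip"]


   def str_to_bool(value):
       """Interpret common textual spellings of a boolean (assumed behaviour)."""
       return str(value).strip().lower() in {"1", "true", "yes", "y"}


   def build_parser():
       """Build a parser whose options mirror the usage text (hypothetical helper)."""
       parser = argparse.ArgumentParser(
           prog="processing",
           description="Taxi Data Analysis",
           formatter_class=argparse.ArgumentDefaultsHelpFormatter,
       )
       parser.add_argument("--year", type=int, default=2019,
                           help="year of data to load")
       parser.add_argument("--month", type=int, choices=range(1, 7), default=1,
                           help="month of data to load")
       parser.add_argument("--color", choices=["yellow"], default="yellow",
                           help="color of data to load")
       # The usage line shows --action without brackets, so it is required here.
       parser.add_argument("--action", choices=ACTIONS, required=True,
                           default="all", help="action to be performed")
       parser.add_argument("--verbose", type=str_to_bool, default=True,
                           help="logging action to be performed")
       return parser

For example, ``build_parser().parse_args(["--action", "download", "--month", "1"])`` yields a namespace with ``action="download"``, ``year=2019`` and ``month=1``.
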
- ``all``: performs all steps: downloading, loading to the DB, and querying the DB for the average and the rolling average.
- ``avg_trip_local``: calculates the average trip distance by month by querying the locally downloaded file.
- ``avg_trip``: calculates the average trip distance by month by querying the DB.
- ``rolling_avg_trip``: calculates the 45-day rolling average trip distance by querying the DB.

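
The two calculations named above can be sketched in plain Python. This is an illustration only, not the project's SQL or its ``transform.py``: the function names and the ``(pickup_date, distance)`` input format are assumptions, and the rolling average is assumed to be the mean over all trips whose pickup date falls in the 45-day window ending on each day (inclusive).

.. code-block:: python

   from collections import defaultdict
   from datetime import date, timedelta


   def monthly_average(trips):
       """Return {(year, month): mean distance} for (pickup_date, distance) pairs."""
       totals = defaultdict(lambda: [0.0, 0])  # key -> [distance sum, trip count]
       for pickup_date, distance in trips:
           bucket = totals[(pickup_date.year, pickup_date.month)]
           bucket[0] += distance
           bucket[1] += 1
       return {key: total / count for key, (total, count) in totals.items()}


   def rolling_average(trips, window_days=45):
       """Return {day: mean distance of trips in the window ending on that day}."""
       daily = defaultdict(lambda: [0.0, 0])  # day -> [distance sum, trip count]
       for pickup_date, distance in trips:
           daily[pickup_date][0] += distance
           daily[pickup_date][1] += 1
       result = {}
       for day in sorted(daily):
           start = day - timedelta(days=window_days - 1)
           total, count = 0.0, 0
           # Quadratic in the number of days; acceptable for a sketch.
           for other, (day_sum, day_count) in daily.items():
               if start <= other <= day:
                   total += day_sum
                   count += day_count
           result[day] = total / count
       return result


   sample = [(date(2019, 1, 1), 2.0), (date(2019, 1, 2), 4.0),
             (date(2019, 2, 20), 6.0)]
   monthly = monthly_average(sample)
   rolling = rolling_average(sample)

In the sample, the two January trips fall outside the 45-day window ending on 20 February, so only the February trip contributes to that day's rolling average.
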
The following options can be used for distributed processing when the data volume is too large for a single instance:

- pyspark: for distributed processing.
- DB: parallel loading of files to the DB and analysis using SQL queries.
- Streaming: data is produced as events to Kafka and processed using Kafka Connect connectors or Spark Structured Streaming, or consumed from Kafka and loaded to the DB for SQL queries.
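
The parallel-loading option above can be sketched with the standard-library ``concurrent.futures`` module. This is an assumption-laden illustration rather than the project's loader: ``load_month`` is a hypothetical placeholder that only returns a label, where a real implementation would read one monthly file and write it to the DB.

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor


   def load_month(year, month):
       """Hypothetical placeholder for loading one monthly file into the DB."""
       return f"{year}-{month:02d}"


   def load_in_parallel(year, months, max_workers=4):
       """Run load_month for each month concurrently; results keep input order."""
       with ThreadPoolExecutor(max_workers=max_workers) as pool:
           return list(pool.map(lambda month: load_month(year, month), months))

Threads suit this case because loading is dominated by file and network I/O; a CPU-bound transformation would be a better fit for processes or for pyspark.
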