nitish9413/open_auto_loader

🚀 OpenAutoLoader

PyPI version | License: MIT | Python 3.12+ | Powered by Polars | PyPI Downloads

OpenAutoLoader is a high-performance, incremental data ingestion engine. It loads new files from raw cloud storage into production-ready Delta Lake tables using Polars, a DataFrame library with a Rust core.

Stop writing complex Spark jobs for simple file ingestion. OpenAutoLoader provides a "Databricks-style" Auto Loader experience in a lightweight Python package.


💡 Why OpenAutoLoader?

Traditional ingestion often requires heavy JVM clusters (Spark) or manual file tracking. OpenAutoLoader changes that:

  • Zero-Spark Overhead: Runs on standard Python environments with Rust-level performance.
  • Exactly-Once Processing: Integrated SQLite checkpointing ensures no duplicate data, even if a job restarts.
  • Schema First: Automatically infers, saves, and enforces JSON schema contracts to prevent data corruption.
  • Cloud Native: A single API for Local, S3, Azure Blob (ABFSS), and GCS.
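
The checkpointing idea behind the exactly-once bullet can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the `processed_files` table, its columns, and the function names are all hypothetical, and the `write` callback stands in for the real Delta write.

```python
import sqlite3

def open_checkpoint(path=":memory:"):
    """Open (or create) a checkpoint store. The table layout is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_files ("
        " path TEXT PRIMARY KEY,"
        " batch_id TEXT NOT NULL)"
    )
    return conn

def pending_files(conn, candidates):
    """Return the candidates that no earlier batch has recorded, in order."""
    done = {row[0] for row in conn.execute("SELECT path FROM processed_files")}
    return [p for p in candidates if p not in done]

def run_batch(conn, candidates, batch_id, write):
    """Write pending files, then record them; a failed write records nothing."""
    todo = pending_files(conn, candidates)
    if not todo:
        return []
    write(todo)  # stand-in for the Delta write; may raise
    with conn:  # one transaction: every file is recorded, or none
        conn.executemany(
            "INSERT INTO processed_files (path, batch_id) VALUES (?, ?)",
            [(p, batch_id) for p in todo],
        )
    return todo

# A second run over the same listing only picks up the new file.
conn = open_checkpoint()
written = []
run_batch(conn, ["a.csv", "b.csv"], "batch_1", written.extend)
run_batch(conn, ["a.csv", "b.csv", "c.csv"], "batch_2", written.extend)
```

Recording files only after `write` returns means a crash during the write leaves them pending for the next run. A crash between the write and the checkpoint commit would still replay those files in this sketch; how the package handles that window is not shown here.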

🛠️ Installation

# Core (Local files only)
pip install open-auto-loader

# Full Cloud Support (Recommended)
pip install "open-auto-loader[all]"

🚀 Quick Start: S3 to Delta Lake

from open_auto_loader import OpenAutoLoader

# Define your cloud credentials
storage_options = {
    "aws_access_key_id": "YOUR_ACCESS_KEY",
    "aws_secret_access_key": "YOUR_SECRET_KEY",
    "region": "ap-south-1"
}

# Initialize the loader
loader = OpenAutoLoader(
    source="s3://my-raw-bucket/incoming_logs/",
    target="s3://my-silver-bucket/tables/user_logs",
    check_point="./metadata/checkpoints.db",
    schema_path="./metadata/schemas/",
    storage_options=storage_options
)

# Run the ingestion batch
loader.run(batch_id="daily_run_2026_03_18")
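
The Quick Start hard-codes placeholder keys. One way to keep real credentials out of source files is to build the same `storage_options` dict from environment variables. The helper below is a hypothetical sketch, not part of the package; it assumes the conventional AWS variable names and the three dict keys shown above.

```python
import os

def storage_options_from_env(env=os.environ):
    """Build the Quick Start's storage_options dict from environment variables.

    The helper name and the choice of AWS_REGION are assumptions made
    for this sketch.
    """
    required = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "region": "AWS_REGION",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        # Fail early with every missing name, rather than at first upload.
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in required.items()}
```

The result can be passed as `storage_options=storage_options_from_env()` in place of the literal dict.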

🏗️ Architecture: How it Works

  1. Scanner: Uses fsspec to identify files that have arrived since the last successful batch.
  2. Schema Guard: Checks the file header against the stored JSON contract in schema_path.
  3. Polars Engine: Streams the data using sink_delta(), minimizing memory footprint.
  4. Metadata Injection: Automatically adds _batch_id, _processed_at, and _source_file to every row for full auditability.
  5. Committer: Updates the SQLite checkpoint only after a successful Delta write.
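
Steps 2 and 4 above can be illustrated in plain Python. This is a sketch under stated assumptions, not the package's code: it uses the standard `csv` module instead of Polars, the contract layout (`{"columns": [...]}`) is invented for the example, and `check_header` and `ingest` are hypothetical names.

```python
import csv
import io
import json
from datetime import datetime, timezone

def check_header(header, contract):
    """Raise ValueError if the file's columns differ from the stored contract."""
    expected = contract["columns"]  # hypothetical contract layout
    if list(header) != expected:
        raise ValueError(f"schema mismatch: expected {expected}, got {list(header)}")

def ingest(text, source_file, batch_id, contract_json):
    """Validate one CSV payload and return its rows with audit columns added."""
    reader = csv.DictReader(io.StringIO(text))
    check_header(reader.fieldnames or [], json.loads(contract_json))
    processed_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            **row,
            "_batch_id": batch_id,
            "_processed_at": processed_at,
            "_source_file": source_file,
        }
        for row in reader
    ]

contract = json.dumps({"columns": ["user_id", "event"]})
rows = ingest(
    "user_id,event\n1,login\n",
    "incoming_logs/a.csv",
    "daily_run_2026_03_18",
    contract,
)
```

Checking the header before reading any rows means a file that violates the contract is rejected before anything reaches the target table.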

📋 Compatibility Matrix

Feature | Local | AWS S3 | Azure Blob | Google GCS
Incremental Loading
Schema Enforcement
Service Principal Auth | N/A
Streaming Sink

🤝 Contributing

Contributions are welcome! Whether it's a bug fix, a new cloud provider, or performance tuning, feel free to open a PR.

Created with ❤️ by Nitish Katkade

About

OpenAutoLoader: A lightweight, open-source alternative to Databricks Auto Loader. Built with Polars and SQLite for efficient, incremental file ingestion.
