Adlist-Parser

A high-performance Python utility that fetches and merges multiple adlists into domain-only output for DNS blockers like Pi-hole, AdGuard, and similar DNS filtering solutions.

Table of Contents

  • Features
  • Quick Start
  • Installation (optional)
  • Configuration
  • Output Format
  • Performance
  • How It Works
  • Architecture
  • VS Code Tasks
  • Redundancy Analysis
  • Input Format Examples
  • Troubleshooting
  • FAQ
  • Example Output
  • Requirements
  • Use Cases
  • License
  • Contributing
  • Acknowledgments

Features

  • Fast Concurrent Processing: Processes 1.6M+ entries from 50+ sources in ~50-60 seconds
  • Zero Dependencies: Uses only Python standard library (3.8+)
  • Dual Output: Generates both adlists and whitelists simultaneously
  • Smart Content Processing: Handles domains, wildcards, regex patterns, and Pi-hole format conversions
  • ABP Filter Support: Converts Pi-hole regex patterns to AdBlock Plus (ABP) format with automatic wildcard normalization
  • Intelligent Separation: Automatically separates exception rules (@@||) from the blocklist into the whitelist
  • Domain Validation: Validates and filters invalid domain entries during post-processing
  • Real-time Progress: Animated progress spinners with detailed status updates
  • Error Resilient: Failed fetches don't crash the pipeline; they're logged and filtered out

Quick Start

  1. Clone the repository:

    git clone https://github.com/Toomas633/Adlist-Parser.git
    cd Adlist-Parser
  2. Run the parser:

    • From a repo checkout:
      python -m adparser
    • If installed as a package:
      adlist-parser

    On Windows PowerShell:

    python -m adparser
    # or, if installed
    adlist-parser
  3. Find your results:

    • output/adlist.txt - Merged blocklist (~1.6M entries)
    • output/whitelist.txt - Merged whitelist (~2K entries)

Installation (optional)

You can run from a checkout (above) or install locally to get the adlist-parser command:

# Editable install for development
python -m pip install -e .

# Then run
adlist-parser

Configuration

Input Sources

Configure your sources in JSON files:

data/adlists.json - Blocklist sources:

{
  "lists": ["blacklist.txt", "old_adlist.txt"],
  "urls": [
    "https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts",
    "https://adaway.org/hosts.txt",
    "https://v.firebog.net/hosts/AdguardDNS.txt"
  ]
}

data/whitelists.json - Whitelist sources:

{
  "lists": ["whitelist.txt"],
  "urls": [
    "https://raw.githubusercontent.com/hagezi/dns-blocklists/main/domains/whitelist-referral.txt"
  ]
}

Source Types

  • URLs: HTTP/HTTPS links to remote lists
  • Local files: Relative paths to files in the data/ directory
  • Mixed format: Each source can contain domains, wildcards, regex patterns, or Pi-hole entries

Notes:

  • Path resolution: relative paths inside the JSON files are resolved relative to the JSON file location (not the CWD).
  • Accepted keys: both files accept any of lists, urls, adlists, or sources for compatibility; they are merged.
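
For illustration, here is a minimal sketch of the documented loading behavior: accepted keys merged, relative paths resolved against the JSON file's directory. The name load_sources and its signature are hypothetical, not the project's actual API:

import json
from pathlib import Path
from typing import List

# All four keys are accepted for compatibility and merged (documented above).
_ACCEPTED_KEYS = ("lists", "urls", "adlists", "sources")

def load_sources(config_path: str) -> List[str]:
    """Hypothetical loader: merge accepted keys, resolve paths against the JSON file."""
    path = Path(config_path)
    data = json.loads(path.read_text(encoding="utf-8"))
    merged: List[str] = []
    for key in _ACCEPTED_KEYS:
        merged.extend(data.get(key, []))
    resolved = []
    for entry in merged:
        if entry.startswith(("http://", "https://")):
            resolved.append(entry)  # remote source, used as-is
        else:
            # Relative paths resolve against the JSON file's directory, not the CWD.
            resolved.append(str((path.parent / entry).resolve()))
    return resolved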

Output Format

Supported Input Formats

The parser intelligently handles multiple input formats:

  • Plain domains: example.com
  • Wildcards: *.example.com
  • Pi-hole regex: (\.|^)example\.com$
  • AdBlock patterns: /pattern/flags
  • Host file entries: 0.0.0.0 example.com
  • Comments: Lines starting with #, !, //, or ;
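
To make the format handling concrete, here is a rough sketch of per-line normalization under the rules above. The helper name extract_domain and the domain pattern are illustrative assumptions; the project's actual DOMAIN_RE and parsing live in adparser/content.py and may differ:

import re

# A conservative domain pattern, for illustration only; the real DOMAIN_RE may differ.
_DOMAIN = re.compile(r"^(?:\*\.)?(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}$", re.IGNORECASE)
_COMMENT_PREFIXES = ("#", "!", "//", ";")

def extract_domain(line: str):
    """Hypothetical: return a domain token from one input line, or None."""
    line = line.strip()
    if not line or line.startswith(_COMMENT_PREFIXES):
        return None                       # blank line or whole-line comment
    line = line.split("#", 1)[0].strip()  # strip inline comments
    parts = line.split()
    if len(parts) == 2 and parts[0] in ("0.0.0.0", "127.0.0.1"):
        line = parts[1]                   # host-file entry: keep only the domain
    return line if _DOMAIN.match(line) else None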

Output Processing

  1. Domain Extraction: Extracts clean domains from various host file formats
  2. Wildcard Handling: *.domain.com is preserved as a domain token (wildcard not expanded). In the final domain output the leading *. is stripped to domain.com.
  3. ABP Normalization: Fixes broken ABP patterns automatically (see the sketch after this list):
    • ||*cdn.domain.com^ → ||*.cdn.domain.com^ (missing dot after wildcard)
    • ||app.*.adjust.com^ → ||*.adjust.com^ (wildcard-only label removed)
    • ||domain.google.*^ → ||domain.google^ (wildcard TLD removed - not supported)
    • -domain.com^ → ||-domain.com^ (adds missing || prefix)
    • @@|domain.com^| → @@||domain.com^ (fixes single pipe + trailing pipe)
  4. ABP Conversion: Pi-hole regex patterns convert to ||domain^ format when possible
  5. Blocklist/Whitelist Separation: Automatically moves @@|| exception entries from blocklist to whitelist
  6. Domain Validation: Validates and removes invalid domain entries during post-processing
  7. Regex Handling: Complex regexes that can't convert to ABP are discarded (pipeline doesn't crash)
  8. Deduplication: Preserves first-seen order during normalization; final outputs are sorted case-insensitively during post-processing
  9. Comment Filtering: Strips whole-line and inline comments (#, !, //, ;)
  10. HTML Filtering: Removes HTML tags and attributes from lists
  11. Error Resilience: Failed fetches logged and filtered during normalization
  12. Adlist merge: The adlist pipeline merges new entries with the prior output/adlist.txt before writing, preserving entries across transient source failures (the whitelist writes directly)
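
A minimal sketch of the normalization and separation in steps 3 and 5, under simplified assumptions (the names normalize_abp and separate are hypothetical; the real logic lives in adparser/content.py):

import re

def normalize_abp(rule: str) -> str:
    """Hypothetical: apply the documented wildcard fix-ups to one ABP rule."""
    rule = rule.strip()
    # @@|domain.com^|  ->  @@||domain.com^  (single pipe + trailing pipe)
    if rule.startswith("@@|") and not rule.startswith("@@||"):
        rule = "@@||" + rule[3:]
    rule = rule.rstrip("|")
    # -domain.com^  ->  ||-domain.com^  (adds missing || prefix)
    if not rule.startswith(("||", "@@||")):
        rule = "||" + rule
    # ||*cdn.domain.com^  ->  ||*.cdn.domain.com^  (missing dot after wildcard)
    rule = re.sub(r"\|\|\*(?!\.)", "||*.", rule)
    # ||app.*.adjust.com^  ->  ||*.adjust.com^  (wildcard-only label removed)
    rule = re.sub(r"\|\|[^*^|]+\.\*\.", "||*.", rule)
    # ||domain.google.*^  ->  ||domain.google^  (wildcard TLD not supported)
    rule = re.sub(r"\.\*\^$", "^", rule)
    return rule

def separate(rules):
    """Hypothetical sketch of step 5: move @@|| exceptions into the whitelist."""
    blocklist, whitelist = [], []
    for rule in rules:
        if rule.startswith("@@||"):
            whitelist.append(rule[2:])  # @@||x^ lands in the whitelist as ||x^
        else:
            blocklist.append(rule)
    return blocklist, whitelist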

Determinism and file format

  • Outputs use LF-only line endings.
  • Sorting is deterministic and case-insensitive; deduplication is case-insensitive and whitespace-trimmed.
  • Headers are regenerated during post-processing (don’t hand-edit outputs).
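
A small sketch of these guarantees (hypothetical helpers; the project's actual writer is io.write_output):

from typing import Iterable, List

def finalize(entries: Iterable[str]) -> List[str]:
    """Hypothetical: whitespace-trim, dedupe case-insensitively, sort deterministically."""
    seen = set()
    unique = []
    for entry in entries:
        entry = entry.strip()
        key = entry.lower()
        if entry and key not in seen:
            seen.add(key)
            unique.append(entry)
    return sorted(unique, key=str.lower)

def write_lf(path: str, lines: List[str]) -> None:
    """Write with LF-only line endings regardless of platform."""
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write("\n".join(lines) + "\n")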

Output file header (sample)

Each output starts with a generated header like this:

# Adlist - Generated by Adlist-Parser
# https://github.com/Toomas633/Adlist-Parser
#
# Created/modified: 2025-01-01 00:00:00 UTC
# Total entries: 1,684,272
# Domains: 400,527
# ABP-style rules: 1,283,745
# Sources processed: 50
#
# This file is automatically generated. Do not edit manually.
# To update, run: adlist-parser or python -m adparser
#

Performance

  • Concurrency: Fetches multiple sources simultaneously (max 16 workers)
  • Async Processing: Adlists and whitelists processed in parallel
  • Memory Efficient: Line-by-line processing for large datasets
  • Real-world Scale: Tested with 1.6M+ entries from 50+ sources

Tuning

  • Concurrency: network fetching uses up to 16 workers (see adparser/fetcher.py); you can adjust the cap there if needed for your environment (a fetch sketch follows this list)
  • I/O: most heavy I/O runs off the event loop using asyncio.to_thread(); disk speed can impact total time.
  • Output size: output/adlist.txt can reach ~1.6–1.7M lines depending on sources.
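
The fetch stage can be sketched with only the standard library, consistent with the zero-dependency rule. This is an illustration, not adparser/fetcher.py itself; the timeout value and helper names are assumptions:

import urllib.request
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 16  # cap documented above; the real cap lives in adparser/fetcher.py

def _fetch_one(url: str):
    """Hypothetical: return (url, text) on success, (url, None) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        print(f"fetch failed: {url}: {exc}")  # failures are logged, not fatal
        return url, None

def fetch_all(urls):
    """Fetch concurrently; failed sources are filtered out, never raised."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        results = list(pool.map(_fetch_one, urls))
    return {url: text for url, text in results if text is not None}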

How It Works

Two concurrent pipelines run via asyncio.gather() in adparser/cli.py:

Pipeline Flow (for both adlist and whitelist):

  1. Load Sources → Parse JSON configs and resolve paths
  2. Fetch Content → Concurrent downloads (16 workers max)
  3. Generate List → Normalize, categorize, and convert entries
  4. Adlist-only merge → Merge new entries with the prior output/adlist.txt before writing (preserves previous content across transient source failures); the whitelist writes directly
  5. Write Output → Save with auto-generated headers (LF-only line endings)
  6. Post-Processing → Separate blocklist/whitelist entries, validate domains, regenerate headers, and re-write both files
  7. Redundancy Report → Analyze duplicates and overlaps
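
In miniature, the concurrency model looks like the following sketch. The pipeline bodies are placeholders; only the asyncio.gather() / asyncio.to_thread() shape is taken from the description above:

import asyncio

async def run_pipeline(name: str) -> str:
    """Hypothetical stand-in for one pipeline: fetch, normalize, write."""
    # Heavy/blocking work is pushed off the event loop, as the real
    # pipelines do with asyncio.to_thread().
    await asyncio.to_thread(lambda: None)  # placeholder for blocking I/O
    return f"{name}: done"

async def main() -> None:
    # Both pipelines run concurrently, mirroring adparser/cli.py.
    results = await asyncio.gather(run_pipeline("adlist"), run_pipeline("whitelist"))
    print(*results, sep="\n")

if __name__ == "__main__":
    asyncio.run(main())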

Key Processing Steps:

  • All heavy I/O is wrapped with asyncio.to_thread() to keep the event loop responsive
  • Progress displayed via animated spinners (adparser/status.py)
  • Domain validation using DOMAIN_RE regex pattern
  • Pi-hole regex → ABP conversion when patterns are simple enough (sketched after this list)
  • ABP wildcard normalization fixes malformed patterns automatically
  • Post-processing separates @@|| exception entries to whitelist
  • Cross-list deduplication ensures no conflicts between blocklist and whitelist
  • Failed sources tracked separately and reported at end
  • Files are written with LF-only line endings
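
A sketch of that regex-to-ABP conversion, assuming only the common (\.|^)domain\.tld$ shape converts; the function name is hypothetical, and complex patterns return None and are discarded, matching the documented behavior:

import re

# Matches the common Pi-hole form (\.|^)example\.com$ — illustration only.
_PIHOLE_RE = re.compile(r"^\(\\\.\|\^\)((?:[a-z0-9-]+\\\.)+[a-z]{2,})\$$", re.IGNORECASE)

def pihole_to_abp(pattern: str):
    """Hypothetical: convert a simple anchored Pi-hole regex to ||domain^, else None."""
    m = _PIHOLE_RE.match(pattern.strip())
    if not m:
        return None  # complex regex: discarded rather than crashing the pipeline
    return "||" + m.group(1).replace("\\.", ".") + "^"

# Example: pihole_to_abp(r"(\.|^)example\.com$") -> "||example.com^"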

Architecture

The codebase follows a modular async architecture with strict separation of concerns:

adparser/cli.py           # Main orchestrator with async/await
├── adparser/io.py        # JSON parsing, path resolution, file I/O
├── adparser/fetcher.py   # Concurrent HTTP fetching (ThreadPoolExecutor)
├── adparser/content.py   # Domain extraction, normalization, regex conversion
├── adparser/models.py    # Source descriptor dataclass (URL vs local files)
├── adparser/status.py    # Progress spinners and terminal UI updates
├── adparser/reporting.py # Results summary with emoji formatting
├── adparser/redundancy.py # Duplicate detection and overlap analysis
└── adparser/constants.py # File path constants

Design Principles:

  • Single Responsibility: Each module handles one concern (fetch/parse/write separated)
  • Error Isolation: Failed sources don't crash the pipeline
  • Async I/O: Heavy operations run in thread pool via to_thread()
  • Progress Feedback: Global status_display coordinates concurrent spinners
  • Order Preservation: Deduplication via _dedupe_preserve_order() before final sort

VS Code Tasks

This repository includes ready-to-run tasks for Windows PowerShell:

  • Adlist-Parser: runs the tool end-to-end (equivalent to python -m adparser).
  • Tests: Pytest (coverage): runs pytest with coverage as configured in pyproject.toml.
  • Lint: Pylint (report): runs pylint and writes pylint-report.txt; non-zero exit is allowed while still generating the report.

Quality gates expectations:

  • Build: N/A (pure Python). Runtime entry point is adparser.cli:main.
  • Tests: PASS on local data set; full runs may take ~50–60s.
  • Lint: PASS is ideal; if not, review pylint-report.txt.

Redundancy Analysis

The parser includes built-in redundancy detection to help optimize your source lists:

Features:

  • Duplicate Detection: Identifies sources with identical content
  • Local File Analysis: Shows which entries in local files are already covered by remote sources
  • Removal Suggestions: Lists the first 20 redundant entries plus a count of the remainder

Example Output:


🔁 Duplicate sources (identical content): 2 groups
├─ 🌐 https://example.com/list1.txt
└─ 🌐 https://example.com/list2.txt
💡 Tip: Keep one source from this group, remove the others

📄 Local file redundancy analysis:
• blacklist.txt: 150/200 entries (75.0%) already in remote sources
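
Conceptually, the redundancy checks reduce to set operations, as in this sketch (hypothetical helpers; the real analysis is in adparser/redundancy.py):

from collections import defaultdict
from typing import Dict, List, Set

def duplicate_groups(contents: Dict[str, Set[str]]) -> List[List[str]]:
    """Hypothetical: group sources whose normalized entry sets are identical."""
    by_content = defaultdict(list)
    for source, entries in contents.items():
        by_content[frozenset(entries)].append(source)
    return [group for group in by_content.values() if len(group) > 1]

def local_coverage(local: Set[str], remote: Set[str]) -> float:
    """Fraction of a local file's entries already provided by remote sources."""
    return len(local & remote) / len(local) if local else 0.0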

Input Format Examples

The parser intelligently processes various input formats and converts them appropriately:

Input Format              | Processing Result                        | Notes
example.com               | example.com (domain)                     | Plain domain preserved
*.example.com             | ||*.example.com^ (ABP rule)              | Wildcard converted to ABP
0.0.0.0 example.com       | example.com (domain)                     | Host file format extracted
(\.|^)example\.com$       | ||example.com^ (ABP rule)                | Pi-hole regex converted
/ads?/                    | ABP rule or discarded                    | Converted if simple, discarded if complex
# Comment line            | filtered                                 | Comment removed
domain.com # inline       | domain.com                               | Inline comment stripped
<div>html</div>           | filtered                                 | HTML tags removed
@@||exception.com^        | moved to whitelist as ||exception.com^   | Exception rule separated
||*cdn.example.com^       | ||*.cdn.example.com^                     | Malformed ABP pattern normalized

Troubleshooting

Common Issues:

  • Network Errors: Failed sources are listed as "UNAVAILABLE SOURCES" in the final report with 🌐 (remote) or 📄 (local) indicators
  • Proxy Issues: Configure system proxy settings or mirror remote sources locally in data/ and update JSON configs
  • Large Files: output/adlist.txt can be 30MB+; use command-line tools (grep, wc -l) for inspection
  • Slow Performance: Check network speed; adjust worker count in adparser/fetcher.py (default: 16)
  • Memory Usage: The parser uses line-by-line processing, so memory footprint stays low even with 1.6M+ entries

FAQ

  • Why are element-hiding rules (e.g., ##, #@?#) missing from outputs?
    • This tool targets DNS blocklists. Element-hiding is cosmetic (browser-side), so such rules are dropped during normalization.
  • Why do some regex rules disappear?
    • Only simple, anchored Pi-hole patterns are converted to ABP (||domain^). Complex/JS-like regex is discarded for safety and DNS relevance.
  • My local file entries are already covered by remotes—how do I find them?
    • Check the redundancy section at the end of the run; it lists duplicates and local entries already provided by remote sources.

Example Output

🚀 Starting Adlist-Parser...
⚡ Processing adlists and whitelists concurrently...

⚡ Adlist: Fetching content... |/-\ [48/50 (96%)]
⚡ Whitelist: Processing domains...

Adlist: ✅ Complete - 1684272 entries (400527 domains, 1283745 ABP rules)
Whitelist: ✅ Complete - 2337 entries (1346 domains, 991 ABP rules)

=== Adlists redundancy analysis ===
Analyzed 50 sources.
✅ No redundancy issues detected

============================================================
🎉 ALL PROCESSING COMPLETED IN 53.16 SECONDS! 🎉
============================================================
📊 RESULTS SUMMARY:
┌──────────────────────────────────────────────────────────┐
│ 🛡️  ADLIST:    50 sources → 1684272 entries              │
│   📝 Domains:  400527 | ABP rules: 1283745              │
├──────────────────────────────────────────────────────────┤
│ ✅ WHITELIST:  6 sources →    2337 entries               │
│   📝 Domains:    1346 | ABP rules:     991              │
├──────────────────────────────────────────────────────────┤
│ 📁 Output files:                                         │
│   • output/adlist.txt                                    │
│   • output/whitelist.txt                                 │
└──────────────────────────────────────────────────────────┘

Requirements

  • Python 3.8 or higher
  • No external dependencies (uses only standard library)

Use Cases

  • Pi-hole: Use output/adlist.txt as a blocklist and output/whitelist.txt as an allowlist
  • AdGuard Home: Import both files as custom filtering rules
  • DNS Filtering: Any DNS-based ad blocker that supports domain lists
  • Network Security: Corporate firewall domain blocking lists

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes following the existing patterns:
    • Keep modules single-responsibility
    • Use asyncio.to_thread() for I/O operations
    • Wrap long operations with spinner.show_progress()
    • Add inline documentation for complex logic
    • Keep the runtime stdlib-only; do not add dependencies or modify pyproject.toml beyond [project.optional-dependencies].dev
    • Preserve public contracts: fetcher.fetch, content.generate_list, content.separate_blocklist_whitelist, io.write_output
    • Do not widen the domain regex or IDN heuristics; keep _maybe_extract_domain and DOMAIN_RE conservative
  4. Test thoroughly with python -m adparser
  5. Verify output files are generated correctly
  6. Submit a pull request with a clear description

Development Tips:

  • Read .github/copilot-instructions.md for architecture overview
  • Check adparser/content.py for parsing rules and regex patterns
  • Use existing regex patterns rather than adding new ones
  • Maintain backward compatibility with existing JSON configs

Fast dev loop

For quick iterations, limit sources to local files to reduce runtime:

  • Edit data/adlists.json and data/whitelists.json to only include the local files:

    { "lists": ["blacklist.txt"], "urls": [] }
    { "lists": ["whitelist.txt"], "urls": [] }
  • Add a few test lines to data/blacklist.txt and data/whitelist.txt and run:

    python -m adparser

This exercises the full pipeline (status UI, normalization, separation, reporting) in seconds.

Acknowledgments

  • Built for the DNS filtering community
  • Inspired by the need for fast, reliable adlist aggregation
  • Uses high-quality sources from the community (StevenBlack, Hagezi, FadeMind, and others)
