A high-performance Python utility that fetches and merges multiple adlists into domain-only output for DNS blockers like Pi-hole, AdGuard, and similar DNS filtering solutions.
- Features
- Quick Start
- Installation (optional)
- Configuration
- Output Format
- Performance
- How It Works
- Architecture
- VS Code Tasks
- Redundancy Analysis
- Troubleshooting
- Example Output
- Requirements
- Use Cases
- Contributing
- License
- Acknowledgments
- Fast Concurrent Processing: Processes 1.6M+ entries from 50+ sources in ~50-60 seconds
- Zero Dependencies: Uses only Python standard library (3.8+)
- Dual Output: Generates both adlists and whitelists simultaneously
- Smart Content Processing: Handles domains, wildcards, regex patterns, and Pi-hole format conversions
- ABP Filter Support: Converts Pi-hole regex patterns to AdBlock Plus (ABP) format with automatic wildcard normalization
- Intelligent Separation: Automatically separates exception rules (`@@||`) from blocklist to whitelist
- Domain Validation: Validates and filters invalid domain entries during post-processing
- Real-time Progress: Animated progress spinners with detailed status updates
- Error Resilient: Failed fetches don't crash the pipeline; they're logged and filtered out
- Clone the repository:

  ```bash
  git clone https://github.com/Toomas633/Adlist-Parser.git
  cd Adlist-Parser
  ```

- Run the parser:

  - From a repo checkout:

    ```bash
    python -m adparser
    ```

  - If installed as a package:

    ```bash
    adlist-parser
    ```

  On Windows PowerShell:

  ```powershell
  python -m adparser
  # or, if installed:
  adlist-parser
  ```

- Find your results:

  - `output/adlist.txt` - Merged blocklist (~1.6M entries)
  - `output/whitelist.txt` - Merged whitelist (~2K entries)
You can run from a checkout (above) or install locally to get the `adlist-parser` command:

```bash
# Editable install for development
python -m pip install -e .

# Then run
adlist-parser
```

Configure your sources in JSON files:
`data/adlists.json` - Blocklist sources:

```json
{
  "lists": ["blacklist.txt", "old_adlist.txt"],
  "urls": [
    "https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts",
    "https://adaway.org/hosts.txt",
    "https://v.firebog.net/hosts/AdguardDNS.txt"
  ]
}
```

`data/whitelists.json` - Whitelist sources:

```json
{
  "lists": ["whitelist.txt"],
  "urls": [
    "https://raw.githubusercontent.com/hagezi/dns-blocklists/main/domains/whitelist-referral.txt"
  ]
}
```

- URLs: HTTP/HTTPS links to remote lists
- Local files: Relative paths to files in the `data/` directory
- Mixed format: Each source can contain domains, wildcards, regex patterns, or Pi-hole entries
Notes:
- Path resolution: relative paths inside the JSON files are resolved relative to the JSON file location (not the CWD).
- Accepted keys: both files accept any of `lists`, `urls`, `adlists`, or `sources` for compatibility; they are merged (see the sketch below).
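For illustration, here is a minimal sketch of how these two rules could be applied when loading a config. The helper name `load_sources` is hypothetical, not the project's actual API (the real logic lives in `adparser/io.py`):

```python
# Hypothetical loader illustrating key merging and JSON-relative path resolution.
import json
from pathlib import Path
from typing import List

ACCEPTED_KEYS = ("lists", "urls", "adlists", "sources")

def load_sources(config_path: str) -> List[str]:
    path = Path(config_path)
    data = json.loads(path.read_text(encoding="utf-8"))
    sources: List[str] = []
    for key in ACCEPTED_KEYS:               # all accepted keys are merged
        for entry in data.get(key, []):
            if entry.startswith(("http://", "https://")):
                sources.append(entry)       # remote URL, fetched later
            else:
                # Local file: resolved relative to the JSON file, not the CWD
                sources.append(str((path.parent / entry).resolve()))
    return sources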
The parser intelligently handles multiple input formats:
- Plain domains:
example.com - Wildcards:
*.example.com - Pi-hole regex:
(\.|^)example\.com$ - AdBlock patterns:
/pattern/flags - Host file entries:
0.0.0.0 example.com - Comments: Lines starting with
#,!,//, or;
- Domain Extraction: Extracts clean domains from various host file formats
- Wildcard Handling: `*.domain.com` is preserved as a domain token (the wildcard is not expanded). In the final domain output the leading `*.` is stripped, leaving `domain.com`.
- ABP Normalization: Fixes broken ABP patterns automatically (see the sketch after this list):
  - `||*cdn.domain.com^` → `||*.cdn.domain.com^` (missing dot after wildcard)
  - `||app.*.adjust.com^` → `||*.adjust.com^` (wildcard-only label removed)
  - `||domain.google.*^` → `||domain.google^` (wildcard TLD removed - not supported)
  - `-domain.com^` → `||-domain.com^` (adds the missing `||` prefix)
  - `@@|domain.com^|` → `@@||domain.com^` (fixes single pipe and trailing pipe)
- ABP Conversion: Pi-hole regex patterns are converted to the `||domain^` format when possible
- Blocklist/Whitelist Separation: Automatically moves `@@||` exception entries from the blocklist to the whitelist
- Domain Validation: Validates and removes invalid domain entries during post-processing
- Regex Handling: Complex regexes that can't be converted to ABP are discarded (the pipeline doesn't crash)
- Deduplication: Preserves first-seen order during normalization; final outputs are sorted case-insensitively during post-processing
- Comment Filtering: Strips whole-line and inline comments (`#`, `!`, `//`, `;`)
- HTML Filtering: Removes HTML tags and attributes from lists
- Error Resilience: Failed fetches are logged and filtered out during normalization
- Adlist merge: The adlist pipeline merges with the prior `output/adlist.txt` before writing, preserving entries across transient source failures (the whitelist is written directly)
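As a rough illustration of the conversion and normalization rules above, here is a minimal sketch. The regexes and function names are assumptions for this example, not the project's actual implementation (see `adparser/content.py` for that):

```python
# Illustrative only: converts the common anchored Pi-hole pattern to ABP and
# applies two of the wildcard fixes listed above.
import re
from typing import Optional

# Matches literally "(\.|^)" at the start and "$" at the end,
# e.g. (\.|^)example\.com$
PIHOLE_RE = re.compile(r"^\(\\\.\|\^\)(?P<body>.+)\$$")

def pihole_to_abp(rule: str) -> Optional[str]:
    m = PIHOLE_RE.match(rule)
    if m is None:
        return None  # too complex: discarded instead of crashing the pipeline
    return "||" + m.group("body").replace(r"\.", ".") + "^"

def normalize_abp(rule: str) -> str:
    # ||*cdn.domain.com^ -> ||*.cdn.domain.com^ (add the missing dot)
    rule = re.sub(r"^\|\|\*(?=[^.*])", "||*.", rule)
    # ||domain.google.*^ -> ||domain.google^ (wildcard TLDs are unsupported)
    rule = re.sub(r"\.\*\^$", "^", rule)
    return rule

print(pihole_to_abp(r"(\.|^)example\.com$"))   # ||example.com^
print(normalize_abp("||*cdn.domain.com^"))     # ||*.cdn.domain.com^
```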
Notes:

- Outputs use LF-only line endings.
- Sorting is deterministic and case-insensitive; deduplication is case-insensitive and whitespace-trimmed.
- Headers are regenerated during post-processing (don't hand-edit outputs).
Each output starts with a generated header like this:
```text
# Adlist - Generated by Adlist-Parser
# https://github.com/Toomas633/Adlist-Parser
#
# Created/modified: 2025-01-01 00:00:00 UTC
# Total entries: 1,684,272
# Domains: 400,527
# ABP-style rules: 1,283,745
# Sources processed: 50
#
# This file is automatically generated. Do not edit manually.
# To update, run: adlist-parser or python -m adparser
#
```
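Since headers are regenerated on every run, a minimal sketch of how one could be built (the field layout mirrors the example above; `build_header` is an illustrative name, not the project's API):

```python
from datetime import datetime, timezone

def build_header(total: int, domains: int, abp_rules: int, sources: int) -> str:
    # UTC timestamp matching the "Created/modified" line format above
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return "\n".join([
        "# Adlist - Generated by Adlist-Parser",
        "# https://github.com/Toomas633/Adlist-Parser",
        "#",
        f"# Created/modified: {stamp}",
        f"# Total entries: {total:,}",
        f"# Domains: {domains:,}",
        f"# ABP-style rules: {abp_rules:,}",
        f"# Sources processed: {sources}",
        "#",
        "# This file is automatically generated. Do not edit manually.",
        "# To update, run: adlist-parser or python -m adparser",
        "#",
    ])

print(build_header(1_684_272, 400_527, 1_283_745, 50))
```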
- Concurrency: Fetches multiple sources simultaneously (max 16 workers)
- Async Processing: Adlists and whitelists processed in parallel
- Memory Efficient: Line-by-line processing for large datasets
- Real-world Scale: Tested with 1.6M+ entries from 50+ sources
- Concurrency: network fetching uses up to 16 workers (see `adparser/fetcher.py`); you can adjust the cap there if needed for your environment. A minimal sketch follows this list.
- I/O: most heavy I/O runs off the event loop using `asyncio.to_thread()`; disk speed can impact total time.
- Output size: `output/adlist.txt` can reach ~1.6–1.7M lines depending on sources.
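For reference, a stdlib-only sketch of capped concurrent fetching in this style. The names and the timeout value are illustrative assumptions, not the project's actual fetcher:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen
from typing import Dict, Iterable, List, Tuple

MAX_WORKERS = 16  # mirrors the cap described above

def fetch_one(url: str) -> str:
    with urlopen(url, timeout=30) as resp:  # timeout value is an assumption
        return resp.read().decode("utf-8", errors="replace")

def fetch_all(urls: Iterable[str]) -> Tuple[Dict[str, str], List[str]]:
    results: Dict[str, str] = {}
    failures: List[str] = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(fetch_one, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception:
                failures.append(url)  # failed fetches are reported, not fatal
    return results, failures
```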
Two concurrent pipelines run via `asyncio.gather()` in `adparser/cli.py`:
Pipeline Flow (for both adlist and whitelist):
- Load Sources → Parse JSON configs and resolve paths
- Fetch Content → Concurrent downloads (16 workers max)
- Generate List → Normalize, categorize, and convert entries
- Adlist-only merge → Merge new entries with the prior `output/adlist.txt` before writing (preserves previous content across transient source failures); the whitelist is written directly
- Write Output → Save with auto-generated headers (LF-only line endings)
- Post-Processing → Separate blocklist/whitelist entries, validate domains, regenerate headers, and re-write both files
- Redundancy Report → Analyze duplicates and overlaps
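A minimal sketch of the two-pipeline orchestration described above, with stand-in stubs for the real loading and fetching steps (the coroutine names are illustrative; the real entry point is in `adparser/cli.py`):

```python
import asyncio
from typing import List

def load_sources(config: str) -> List[str]:
    return []  # stand-in for the real JSON loader

def fetch_and_normalize(sources: List[str]) -> List[str]:
    return []  # stand-in for fetching + normalization

async def run_pipeline(name: str, config: str) -> int:
    # Heavy I/O is pushed off the event loop with asyncio.to_thread()
    sources = await asyncio.to_thread(load_sources, config)
    entries = await asyncio.to_thread(fetch_and_normalize, sources)
    return len(entries)

async def main() -> None:
    # Adlist and whitelist pipelines run concurrently, as in adparser/cli.py
    counts = await asyncio.gather(
        run_pipeline("adlist", "data/adlists.json"),
        run_pipeline("whitelist", "data/whitelists.json"),
    )
    print(counts)

asyncio.run(main())
```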
Key Processing Steps:
- All heavy I/O is wrapped with `asyncio.to_thread()` to keep the event loop responsive
- Progress is displayed via animated spinners (`adparser/status.py`)
- Domain validation uses the `DOMAIN_RE` regex pattern (a conservative sketch follows this list)
- Pi-hole regex → ABP conversion when patterns are simple enough
- ABP wildcard normalization fixes malformed patterns automatically
- Post-processing moves `@@||` exception entries to the whitelist
- Cross-list deduplication ensures no conflicts between the blocklist and whitelist
- Failed sources are tracked separately and reported at the end
- Files are written with LF-only line endings
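For illustration, a conservative domain check in this spirit. This regex is an assumption for the example, not the project's actual `DOMAIN_RE`:

```python
import re

# Assumed for illustration: RFC-style labels, alphabetic TLD, 253-char cap.
DOMAIN_RE = re.compile(
    r"^(?=.{1,253}$)"                                # overall length limit
    r"(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+"   # dot-terminated labels
    r"[a-z]{2,63}$"                                  # alphabetic TLD
)

def is_valid_domain(token: str) -> bool:
    return DOMAIN_RE.match(token.strip().lower()) is not None

assert is_valid_domain("example.com")
assert not is_valid_domain("0.0.0.0 example.com")  # host-file lines fail as-is
```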
The codebase follows a modular async architecture with strict separation of concerns:
```text
adparser/cli.py            # Main orchestrator with async/await
├── adparser/io.py         # JSON parsing, path resolution, file I/O
├── adparser/fetcher.py    # Concurrent HTTP fetching (ThreadPoolExecutor)
├── adparser/content.py    # Domain extraction, normalization, regex conversion
├── adparser/models.py     # Source descriptor dataclass (URL vs local files)
├── adparser/status.py     # Progress spinners and terminal UI updates
├── adparser/reporting.py  # Results summary with emoji formatting
├── adparser/redundancy.py # Duplicate detection and overlap analysis
└── adparser/constants.py  # File path constants
```
Design Principles:
- Single Responsibility: Each module handles one concern (fetch/parse/write separated)
- Error Isolation: Failed sources don't crash the pipeline
- Async I/O: Heavy operations run in a thread pool via `to_thread()`
- Progress Feedback: Global `status_display` coordinates concurrent spinners
- Order Preservation: Deduplication via `_dedupe_preserve_order()` before the final sort (a sketch follows this list)
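A minimal sketch of order-preserving, case-insensitive deduplication in the spirit of `_dedupe_preserve_order()`; the actual implementation may differ:

```python
from typing import Iterable, List

def dedupe_preserve_order(entries: Iterable[str]) -> List[str]:
    seen = set()
    kept: List[str] = []
    for entry in entries:
        key = entry.strip().lower()      # case-insensitive, whitespace-trimmed
        if key and key not in seen:
            seen.add(key)
            kept.append(entry.strip())   # first-seen spelling wins
    return kept

print(dedupe_preserve_order(["Ads.example", "b.example", "ads.example "]))
# -> ['Ads.example', 'b.example']
```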
This repository includes ready-to-run tasks for Windows PowerShell:
- Adlist-Parser: runs the tool end-to-end (equivalent to `python -m adparser`).
- Tests: Pytest (coverage): runs `pytest` with coverage as configured in `pyproject.toml`.
- Lint: Pylint (report): runs pylint and writes `pylint-report.txt`; a non-zero exit is allowed while still generating the report.
Quality gates expectations:
- Build: N/A (pure Python). Runtime entry point is `adparser.cli:main`.
- Tests: PASS on the local data set; full runs may take ~50–60s.
- Lint: PASS is ideal; if not, review `pylint-report.txt`.
The parser includes built-in redundancy detection to help optimize your source lists:
Features:
- Duplicate Detection: Identifies sources with identical content
- Local File Analysis: Shows which entries in local files are already covered by remote sources
- Removal Suggestions: Lists first 20 redundant entries with count of remaining
Example Output:

```text
🔁 Duplicate sources (identical content): 2 groups
  ├─ 🌐 https://example.com/list1.txt
  └─ 🌐 https://example.com/list2.txt
  💡 Tip: Keep one source from this group, remove the others

📄 Local file redundancy analysis:
  • blacklist.txt: 150/200 entries (75.0%) already in remote sources
```
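One plausible way to detect identical sources is to hash normalized content. This sketch is an assumption about the approach, not the code in `adparser/redundancy.py`:

```python
import hashlib
from collections import defaultdict
from typing import Dict, List

def duplicate_groups(contents: Dict[str, str]) -> List[List[str]]:
    groups: Dict[str, List[str]] = defaultdict(list)
    for source, text in contents.items():
        # Sort and strip lines so ordering/whitespace differences
        # don't hide otherwise identical content.
        lines = sorted(ln.strip() for ln in text.splitlines() if ln.strip())
        digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
        groups[digest].append(source)
    return [srcs for srcs in groups.values() if len(srcs) > 1]
```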
The parser intelligently processes various input formats and converts them appropriately:
| Input Format | Processing Result | Notes |
|---|---|---|
| `example.com` | → `example.com` (domain) | Plain domain preserved |
| `*.example.com` | → `\|\|*.example.com^` (ABP rule) | Wildcard converted to ABP |
| `0.0.0.0 example.com` | → `example.com` (domain) | Host file format extracted |
| `(\.\|^)example\.com$` | → `\|\|example.com^` (ABP rule) | Pi-hole regex converted |
| `/ads?/` | → ABP rule or discarded | Converted if simple, discarded if complex |
| `# Comment line` | → filtered | Comment removed |
| `domain.com # inline` | → `domain.com` | Inline comment stripped |
| `<div>html</div>` | → filtered | HTML tags removed |
| `@@\|\|exception.com^` | → moved to whitelist as `\|\|exception.com^` | Exception rule separated |
| `\|\|*cdn.example.com^` | → `\|\|*.cdn.example.com^` | Malformed ABP pattern normalized |
Common Issues:
- Network Errors: Failed sources are listed under "UNAVAILABLE SOURCES" in the final report with 🌐 (remote) or 📄 (local) indicators
- Proxy Issues: Configure system proxy settings, or mirror remote sources locally in `data/` and update the JSON configs
- Large Files: `output/adlist.txt` can be 30MB+; use command-line tools (`grep`, `wc -l`) for inspection, or the Python sketch after this list
- Slow Performance: Check network speed; adjust the worker count in `adparser/fetcher.py` (default: 16)
- Memory Usage: The parser processes lines one at a time, so the memory footprint stays low even with 1.6M+ entries
- Why are element-hiding rules (e.g., `##`, `#@?#`) missing from outputs?
  - This tool targets DNS blocklists. Element hiding is cosmetic (browser-side), so such rules are dropped during normalization.
- Why do some regex rules disappear?
  - Only simple, anchored Pi-hole patterns are converted to ABP (`||domain^`). Complex, JS-like regex is discarded for safety and DNS relevance.
- My local file entries are already covered by remotes—how do I find them?
  - Check the redundancy section at the end of the run; it lists duplicates and local entries already provided by remote sources.
```text
🚀 Starting Adlist-Parser...
⚡ Processing adlists and whitelists concurrently...
⚡ Adlist: Fetching content... |/-\ [48/50 (96%)]
⚡ Whitelist: Processing domains...
Adlist: ✅ Complete - 1684272 entries (400527 domains, 1283745 ABP rules)
Whitelist: ✅ Complete - 2337 entries (1346 domains, 991 ABP rules)

=== Adlists redundancy analysis ===
Analyzed 50 sources.
✅ No redundancy issues detected

============================================================
🎉 ALL PROCESSING COMPLETED IN 53.16 SECONDS! 🎉
============================================================
📊 RESULTS SUMMARY:
┌──────────────────────────────────────────────────────────┐
│ 🛡️ ADLIST: 50 sources → 1684272 entries                  │
│ 📝 Domains: 400527 | ABP rules: 1283745                  │
├──────────────────────────────────────────────────────────┤
│ ✅ WHITELIST: 6 sources → 2337 entries                   │
│ 📝 Domains: 1346 | ABP rules: 991                        │
├──────────────────────────────────────────────────────────┤
│ 📁 Output files:                                         │
│   • output/adlist.txt                                    │
│   • output/whitelist.txt                                 │
└──────────────────────────────────────────────────────────┘
```
- Python 3.8 or higher
- No external dependencies (uses only standard library)
- Pi-hole: Use `output/adlist.txt` as a blocklist and `output/whitelist.txt` as an allowlist
- AdGuard Home: Import both files as custom filtering rules
- DNS Filtering: Any DNS-based ad blocker that supports domain lists
- Network Security: Corporate firewall domain blocking lists
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Here's how to get started:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes following the existing patterns:
  - Keep modules single-responsibility
  - Use `asyncio.to_thread()` for I/O operations
  - Wrap long operations with `spinner.show_progress()`
  - Add inline documentation for complex logic
  - Keep the runtime stdlib-only; do not add dependencies or modify `pyproject.toml` beyond `[project.optional-dependencies].dev`
  - Preserve public contracts: `fetcher.fetch`, `content.generate_list`, `content.separate_blocklist_whitelist`, `io.write_output`
  - Do not widen the domain regex or IDN heuristics; keep `_maybe_extract_domain` and `DOMAIN_RE` conservative
- Test thoroughly with `python -m adparser`
- Verify output files are generated correctly
- Submit a pull request with a clear description
Development Tips:
- Read `.github/copilot-instructions.md` for an architecture overview
- Check `adparser/content.py` for parsing rules and regex patterns
- Use existing regex patterns rather than adding new ones
- Maintain backward compatibility with existing JSON configs
For quick iterations, limit sources to local files to reduce runtime:
- Edit `data/adlists.json` and `data/whitelists.json` to include only the local files:

  ```json
  { "lists": ["blacklist.txt"], "urls": [] }
  ```

  ```json
  { "lists": ["whitelist.txt"], "urls": [] }
  ```

- Add a few test lines to `data/blacklist.txt` and `data/whitelist.txt` and run:

  ```bash
  python -m adparser
  ```
This exercises the full pipeline (status UI, normalization, separation, reporting) in seconds.
- Built for the DNS filtering community
- Inspired by the need for fast, reliable adlist aggregation
- Uses high-quality sources from the community (StevenBlack, Hagezi, FadeMind, and others)