superglue1/racing-scraper-backtest-engine

Racing Scraper & Backtest Engine

A production-grade data pipeline and quantitative backtesting engine for horse racing analytics. This project demonstrates end-to-end work across three problem areas that each carry their own engineering challenges:

  1. Anti-bot data acquisition — collecting hundreds of thousands of records from a Cloudflare-protected ASP.NET WebForms site using a layered bypass strategy (TLS fingerprint impersonation as the primary path, full headed browser automation as fallback).
  2. High-throughput parameter search — exhaustive evaluation of millions of strategy configurations on a 12-month dataset using multiprocessing.Pool across all available CPU cores.
  3. Honest validation — walk-forward train/test split to distinguish real edges from overfit noise, with reproducible Excel reporting.

The codebase is sanitized for public release: real source URLs are replaced with placeholders, no real scraped data is included, and a synthetic dataset generator is provided so the engine can be run end-to-end with one command.


Why this project is interesting

Most "scraper portfolio projects" you see on GitHub are 50 lines of requests.get() + BeautifulSoup against a static site. This one solves the problem you actually hit in real freelance work: the site is hostile to scraping, the obvious approaches all fail, and the only thing that works is building a layered fallback ladder — TLS impersonation when possible, real-browser automation when not.

Most "backtest portfolio projects" optimize a single strategy and report the in-sample ROI. This one runs an exhaustive grid search over 2.26 million parameter combinations across 15 cores, then runs walk-forward validation on a 9-month train / 3-month test split to check whether the strategies survive on unseen data.

Both pieces are wired together end-to-end: scraper output feeds the engine, engine output is a multi-sheet Excel deliverable.


Quickstart

git clone https://github.com/yourname/racing-scraper-backtest-engine.git
cd racing-scraper-backtest-engine
pip install -r requirements.txt

# Generate synthetic dataset (1000 races, ~10k runners) and run the engine
python run_demo.py

Output: data/output/demo_results.xlsx — multi-sheet Excel with strategy recommendations, modifier breakdowns by day-of-week / class / track condition, walk-forward validation, and Kelly stake sizing.

The demo runs in about a minute on a modern laptop. The same pipeline against a real 12-month dataset (~210k runners, 190k enriched past starts) takes 5–10 minutes for the grid search on a 16-core box.


Architecture overview

                     ┌──────────────────────────────────────┐
                     │   STAGE 1: API harvest               │
                     │   curl-cffi + Chrome TLS impersonation│
                     │   →  ~210k runner records            │
                     └─────────────┬────────────────────────┘
                                   │ runners.csv
                                   ▼
                     ┌──────────────────────────────────────┐
                     │   STAGE 2: enrichment                │
                     │   Playwright (real Chromium)         │
                     │   bypasses Cloudflare + ASP.NET      │
                     │   →  ~190k past starts                │
                     └─────────────┬────────────────────────┘
                                   │ history.csv
                                   ▼
                     ┌──────────────────────────────────────┐
                     │   MERGE & ENRICH                     │
                     │   pandas merge on (HorseID,Date,Track)│
                     │   + race-class parser                │
                     │   + track-type classifier (Metro/    │
                     │     Provincial/Country)              │
                     └─────────────┬────────────────────────┘
                                   │ pool DataFrame
                                   ▼
                     ┌──────────────────────────────────────┐
                     │   GRID SEARCH (multiprocessing)      │
                     │   2.26M combos, 15-core Pool         │
                     │   two-tier: base filter + modifiers  │
                     └─────────────┬────────────────────────┘
                                   │ ranked configs
                                   ▼
                     ┌──────────────────────────────────────┐
                     │   WALK-FORWARD VALIDATION            │
                     │   9mo train / 3mo test               │
                     │   →  reject overfit, keep real edges │
                     └─────────────┬────────────────────────┘
                                   │
                                   ▼
                     ┌──────────────────────────────────────┐
                     │   EXCEL OUTPUT                       │
                     │   15+ sheets: Overview, Plans,       │
                     │   Volume options, modifier tables    │
                     └──────────────────────────────────────┘

Each stage is a self-contained module with explicit inputs/outputs. The scraper can be skipped entirely if you have your own dataset — the engine reads CSV.
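The merge stage in the diagram can be pictured as a left join keyed on (HorseID, Date, Track). The sketch below uses plain dictionaries so it runs without dependencies; the repo itself uses pandas, and the non-key columns (`Price`, `LastStartPos`) and the left-join behaviour are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (HorseID, Date, Track)


def merge_key(row: Dict[str, str]) -> Key:
    """Build the join key named in the architecture diagram."""
    return (row["HorseID"], row["Date"], row["Track"])


def left_join(runners: List[Dict[str, str]],
              history: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Attach history columns to each runner; runners without history are kept."""
    index = {merge_key(h): h for h in history}
    merged = []
    for runner in runners:
        extra = index.get(merge_key(runner), {})
        merged.append({**extra, **runner})  # runner values win on column clashes
    return merged


# Hypothetical sample rows; real column names may differ.
runners = [
    {"HorseID": "H1", "Date": "2024-01-06", "Track": "TrackA", "Price": "4.50"},
    {"HorseID": "H2", "Date": "2024-01-06", "Track": "TrackA", "Price": "9.00"},
]
history = [
    {"HorseID": "H1", "Date": "2024-01-06", "Track": "TrackA", "LastStartPos": "2"},
]
pool = left_join(runners, history)
```

With pandas, the same join would typically be written as `runners.merge(history, on=["HorseID", "Date", "Track"], how="left")`.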


Anti-bot bypass — the interesting part

The target site sits behind Cloudflare and uses ASP.NET WebForms with __VIEWSTATE / __EVENTVALIDATION tokens. Eight different scraping approaches were tested before settling on the final two-tier strategy:

| # | Approach | Result |
|---|----------|--------|
| 1 | requests.get() with browser headers | 403 Forbidden (TLS fingerprint) |
| 2 | requests + persistent cookies | 403 Forbidden |
| 3 | cloudscraper library | Outdated, fails on modern challenge |
| 4 | httpx with HTTP/2 | 403 Forbidden |
| 5 | Selenium with stealth plugin | Detected as automation |
| 6 | Selenium-undetected-chromedriver | Worked initially, broke after CF update |
| 7 | curl-cffi with Chrome impersonation | Works for the public API endpoint |
| 8 | Playwright headed Chromium + persistent session | Works for the protected pages |
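Independently of the Cloudflare layer, a WebForms postback has to echo the hidden state tokens back to the server. A minimal sketch of pulling `__VIEWSTATE` and `__EVENTVALIDATION` out of a page, using only the standard library; the sample HTML, the form field names other than the two tokens, and the function name are hypothetical, and the repo's own parser may work differently.

```python
from html.parser import HTMLParser
from typing import Dict


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_webforms_tokens(html: str) -> Dict[str, str]:
    """Return only the double-underscore state fields used by WebForms."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return {k: v for k, v in parser.fields.items() if k.startswith("__")}


# Hypothetical page fragment for illustration.
sample = """
<form method="post" action="./Results.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbCdEf" />
  <input type="text" name="txtHorse" value="" />
</form>
"""
tokens = extract_webforms_tokens(sample)
```

The returned dictionary can be merged into the form data of the next POST so the server accepts the postback.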

The key insight: Cloudflare's TLS fingerprinting (JA3) rejects clients built on Python's standard ssl module, because their TLS handshake looks nothing like a real Chrome handshake. curl-cffi solves this by binding to curl-impersonate, which replays the exact TLS extensions, cipher suite ordering, and ALPN preferences that a real Chrome 120+ would send.

For pages where the site additionally serves browser-side JavaScript challenges, no HTTP-level approach works; only a real browser does. Playwright with headless=False and a persistent session storage file gets through.
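The two tiers described above can be arranged as a simple fallback ladder. This is a sketch under stated assumptions, not the repo's implementation: the function names, the `session_state.json` path, and the "try each tier in order" policy are hypothetical. The third-party imports are deferred into the functions so the ladder logic itself can be exercised without curl-cffi or Playwright installed.

```python
import os
from typing import Callable, List, Tuple

Fetcher = Callable[[str], str]
STATE_FILE = "session_state.json"  # hypothetical path for the saved session


def fetch_via_tls_impersonation(url: str) -> str:
    """Tier 1: plain HTTP request with a Chrome-like TLS handshake."""
    from curl_cffi import requests  # third-party, imported lazily

    response = requests.get(url, impersonate="chrome120", timeout=30)
    response.raise_for_status()
    return response.text


def fetch_via_headed_browser(url: str) -> str:
    """Tier 2: real (headed) Chromium, reusing saved cookies when present."""
    from playwright.sync_api import sync_playwright  # third-party, lazy import

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        state = STATE_FILE if os.path.exists(STATE_FILE) else None
        context = browser.new_context(storage_state=state)
        page = context.new_page()
        page.goto(url)
        html = page.content()
        context.storage_state(path=STATE_FILE)  # persist session for next run
        browser.close()
    return html


def fetch_with_fallback(url: str, tiers: List[Tuple[str, Fetcher]]) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, html) from the first success."""
    errors = []
    for name, fetch in tiers:
        try:
            return name, fetch(url)
        except Exception as exc:  # any failure moves on to the next tier
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all tiers failed: " + "; ".join(errors))
```

A caller would pass `[("tls", fetch_via_tls_impersonation), ("browser", fetch_via_headed_browser)]` so the cheap HTTP path is always attempted before a browser window is opened.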

Full writeup with code excerpts: docs/anti_bot_bypass.md


Example output

The engine produces a multi-sheet Excel report. Here's what each section looks like (screenshots from a 12-month run on ~145k runners):

Overview — strategy summary at a glance

Overview sheet

Recommended Plans — actionable rules with confidence and stake size

Recommended plans

Volume Options — ROI vs bet frequency tradeoff

Volume options

Strategy Summary — base settings + pool stats

Strategy summary

Modifier breakdown — by day of week

Day of week breakdown


Backtest engine — what makes it non-trivial

A naive grid search over 7 ticks × 7 price-min × 6 price-max × 6 standout × 4 fav-toggle × 36 state-filters × 7 track-type-filters gives roughly 1.78M raw combos per strategy (7 × 7 × 6 × 6 × 4 × 36 × 7 = 1,778,112). Across 2 strategies, and after excluding invalid combos, that came to ~2.26M evaluated combinations in the real run. Single-process pandas over 210k rows × 2.26M filters would take hours.

The engine uses multiprocessing.Pool with imap_unordered and a chunksize of 50, which on a 16-core box completes the search in 4–7 minutes. On Linux, workers inherit the DataFrame through fork; on Windows and macOS, where the default start method is spawn, it has to be pickled to each worker. On Windows the DataFrame is shared via Manager to avoid per-worker memory blowup.

Two-tier strategy architecture separates two concerns:

  • Stage 1 (base filter) — UI-selectable settings: min ticks, price band, state inclusion/exclusion, track-type filter, favourite toggle. These define the pool of bets the user is willing to consider.
  • Stage 2 (modifiers) — additional dimensions (day-of-week, race class group, track condition) that further refine the pool. Each modifier value is scored against the base pool's ROI to compute an edge and a confidence (sample-size-adjusted).

The output is a list of concrete actionable rules: "Base + DayOfWeek=Saturday → ROI +X% over N bets, suggested stake Y units (fractional Kelly capped at 3u)."
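One way such a rule could be scored is sketched below. The Kelly formula is the standard one; everything else is an assumption made for illustration: the shrinkage-style confidence `n / (n + k)`, the quarter-Kelly scale, the 100-unit bankroll, and all names. Only the 3-unit cap comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class ModifierScore:
    edge: float         # modifier ROI minus base-pool ROI
    confidence: float   # 0..1, grows with sample size
    stake_units: float  # fractional-Kelly stake after the cap


def kelly_fraction(win_prob: float, decimal_odds: float) -> float:
    """Full-Kelly bankroll fraction f* = (b*p - q) / b, floored at zero."""
    b = decimal_odds - 1.0  # net odds received on a win
    if b <= 0:
        return 0.0
    return max(0.0, (b * win_prob - (1.0 - win_prob)) / b)


def suggested_stake(win_prob: float, decimal_odds: float,
                    bankroll_units: float = 100.0,  # assumed bankroll
                    kelly_scale: float = 0.25,      # assumed quarter Kelly
                    cap_units: float = 3.0) -> float:
    """Fractional Kelly stake in units, capped at cap_units."""
    raw = bankroll_units * kelly_scale * kelly_fraction(win_prob, decimal_odds)
    return min(cap_units, raw)


def score_modifier(base_roi: float, mod_roi: float, n_bets: int,
                   win_prob: float, avg_odds: float,
                   shrink_k: int = 200) -> ModifierScore:
    """Score one modifier value against the base pool (hypothetical scheme)."""
    return ModifierScore(
        edge=mod_roi - base_roi,
        confidence=n_bets / (n_bets + shrink_k),
        stake_units=suggested_stake(win_prob, avg_odds),
    )


saturday = score_modifier(base_roi=0.02, mod_roi=0.10, n_bets=200,
                          win_prob=0.25, avg_odds=5.0)
```

The cap matters because full Kelly grows quickly with estimated edge, and the win probability here is itself an estimate from a limited sample.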

Full writeup: docs/grid_search.md


Walk-forward validation

A grid search of 2.26M combinations will find configurations that look profitable on any random dataset — that's the multiple-testing problem. To distinguish real edges from data-mined noise, the engine runs a walk-forward split:

  • Train: months 1–9 of the dataset
  • Test: months 10–12

The best strategies from the grid search are then re-evaluated separately on train and test. If the test ROI is in the same ballpark as the train ROI, the edge is plausibly real. If the test ROI collapses (or goes negative) while train looks great, the strategy is overfit and is dropped.
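A minimal sketch of that check, assuming each bet is a (date, stake, returned) record and that "same ballpark" means the test ROI is positive and at least half the train ROI. The `min_ratio` threshold and all names are hypothetical; the actual rule lives in src/validation/walk_forward.py.

```python
from datetime import date
from typing import List, NamedTuple, Tuple


class Bet(NamedTuple):
    day: date
    stake: float
    returned: float  # total payout, 0.0 for a losing bet


def add_months(d: date, months: int) -> date:
    """First day of the month that is `months` after d's month."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def roi(bets: List[Bet]) -> float:
    """Profit divided by total stake; 0.0 when nothing was staked."""
    staked = sum(b.stake for b in bets)
    if staked == 0:
        return 0.0
    return (sum(b.returned for b in bets) - staked) / staked


def walk_forward_split(bets: List[Bet],
                       train_months: int = 9) -> Tuple[List[Bet], List[Bet]]:
    """Train on the first `train_months` calendar months, test on the rest."""
    cutoff = add_months(min(b.day for b in bets), train_months)
    train = [b for b in bets if b.day < cutoff]
    test = [b for b in bets if b.day >= cutoff]
    return train, test


def survives(bets: List[Bet], min_ratio: float = 0.5) -> bool:
    """Keep a strategy only if the edge holds up on the unseen months."""
    train, test = walk_forward_split(bets)
    train_roi, test_roi = roi(train), roi(test)
    return train_roi > 0 and test_roi > 0 and test_roi >= min_ratio * train_roi
```

Splitting on calendar time rather than at random keeps later races out of the training set, which is what makes the test period genuinely unseen.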

On the original real-world run, the top strategies retained positive ROI on out-of-sample data, suggesting the edges generalize beyond the in-sample period.
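The multiple-testing effect that motivates this check is easy to reproduce. The sketch below is a toy simulation, not part of the repo: it scores many "strategies" that are pure coin flips at even money, so every one has a true ROI of zero, and then reports the best in-sample ROI among them.

```python
import random
from typing import Tuple


def best_of_random_strategies(n_strategies: int, n_bets: int,
                              seed: int = 0) -> Tuple[float, float]:
    """Return (best ROI, mean ROI) across strategies with no real edge."""
    rng = random.Random(seed)
    rois = []
    for _ in range(n_strategies):
        # Even-money bets with a true 50% win rate: expected profit is zero.
        profit = sum(1.0 if rng.random() < 0.5 else -1.0 for _ in range(n_bets))
        rois.append(profit / n_bets)
    return max(rois), sum(rois) / len(rois)


best, mean = best_of_random_strategies(n_strategies=2000, n_bets=100)
print(f"best in-sample ROI: {best:+.2%}, mean ROI: {mean:+.2%}")
```

The mean stays close to zero while the best strategy looks clearly profitable, which is why ranking by in-sample ROI alone cannot separate skill from selection.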

See src/validation/walk_forward.py and the validation section in docs/architecture.md.


Repository layout

racing-scraper-backtest-engine/
├── README.md                      # this file
├── LICENSE                        # MIT
├── requirements.txt
├── run_demo.py                    # end-to-end demo on synthetic data
├── src/
│   ├── scraper/
│   │   ├── api_scraper.py         # Stage 1: curl-cffi + TLS impersonation
│   │   ├── browser_scraper.py     # Stage 2: Playwright fallback
│   │   └── parser.py              # HTML → structured rows
│   ├── engine/
│   │   ├── grid_search.py         # multiprocessing parameter search
│   │   ├── modifiers.py           # Stage 2 modifier scoring + Kelly staking
│   │   └── excel_report.py        # multi-sheet Excel output
│   ├── validation/
│   │   └── walk_forward.py        # train/test split validator
│   └── utils/
│       ├── track_types.py         # Metro/Provincial/Country classifier
│       └── race_class.py          # race-class parser & grouping
├── data/
│   ├── synthetic/
│   │   └── generate_synthetic.py  # creates demo CSVs
│   └── output/                    # Excel results land here (gitignored)
├── docs/
│   ├── anti_bot_bypass.md         # detailed bypass writeup
│   ├── architecture.md            # data flow and design decisions
│   └── grid_search.md             # engine internals
└── tests/
    ├── test_parser.py
    ├── test_track_classifier.py
    ├── test_race_class.py
    └── test_grid_search.py

Tech stack

  • Python 3.10+
  • pandas, numpy — data manipulation
  • curl-cffi — Chrome TLS impersonation
  • playwright — headed browser automation
  • beautifulsoup4, lxml — HTML parsing
  • openpyxl — Excel output
  • pytest — testing

What this project does NOT include

To keep the repo focused on engineering and to avoid glamorizing gambling:

  • No betting account integration. The engine's output is an Excel file. What you do with it is your business.
  • No "guaranteed strategy" claims. Walk-forward validation is the closest thing to honest evaluation; even strategies that pass it can fail in production due to regime change, market efficiency, or simple bad luck.
  • No real scraped data. The synthetic generator produces data with the same schema and similar statistical properties, but no actual races or horses.

License

MIT. See LICENSE.

This project is for educational and portfolio purposes. The author is not responsible for any losses incurred from following any output of this code.
