This document describes the layered scraping strategy used to collect ~210k runner records and ~190k enriched past-start records from a Cloudflare-protected ASP.NET WebForms site.
The names of the target site and the project's commercial use case are deliberately omitted. Every technique below is publicly documented and is used routinely in legitimate web automation work (e.g. enterprise QA, regression testing, monitoring of one's own services). The same techniques are also used by bad actors; that is true of every web automation tool ever built. This writeup is for the legitimate use case: you have a contract or a personal project that requires data from a site whose owners are indifferent or hostile to programmatic access, and the data is publicly visible to any logged-in user.
Two layers, in this order:
- `curl-cffi` with `impersonate="chrome"` for the public AJAX endpoint. Replays Chrome's TLS handshake (JA3 fingerprint, ALPN, cipher-suite ordering) so Cloudflare's TLS inspection cannot distinguish the client from a real browser. ~30 records/second sustained.
- Playwright with a real Chromium and a persistent session for protected pages where TLS impersonation is not enough — the site fires a JS challenge that requires a real browser to compute. Manual login once; the session cookie is saved and reused on every subsequent run. ~1 page every 2–3 seconds.
The `requests` library does not work, period. Eight different approaches were tested before this combination converged.
- Static frontend pages live behind Cloudflare with the standard challenge page enabled.
- Backend is ASP.NET WebForms (`__VIEWSTATE`, `__EVENTVALIDATION`, postback model).
- One key endpoint is an AJAX search source that returns HTML fragments. This endpoint is reachable directly with the right cookies and TLS fingerprint, but only because Cloudflare's protections on it are looser than on the main pages.
- The protected horse-profile pages cannot be scraped via HTTP at all — the page renders some content via a client-side AJAX call that itself depends on JS state established at page load. A bare HTTP request gets a stub.
Every approach in this list was tried, in roughly this order. Each one failed in a different way, and each failure mode taught me something about what the site was actually checking.
Plain `requests` with browser-like headers:

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Accept": "text/html,application/xhtml+xml,...",
    "Accept-Language": "en-AU,en;q=0.9",
}
r = requests.get(url, headers=headers)
# → 403 Forbidden (Cloudflare challenge page)
```

Why it fails: Cloudflare doesn't only look at headers. The TLS handshake itself is fingerprinted. Python's `ssl` module sends a TLS Client Hello with a distinctive set of cipher suites, extensions, and ordering — the JA3 fingerprint — that no real browser emits. Cloudflare maintains a denylist of "non-browser" JA3 hashes, and Python's is on it.
You can change every HTTP header to look like Chrome and you still get blocked, because the block happens before HTTP — at the TLS layer.
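To make the fingerprint concrete: JA3 is just an MD5 hash over five comma-separated fields of the Client Hello. A minimal sketch of the computation — the field values below are invented for illustration, not Chrome's real ones:

```python
import hashlib

# JA3 string = "TLSVersion,Ciphers,Extensions,EllipticCurves,ECPointFormats".
# Each field is a dash-separated list of decimal values, in the exact
# order they appear in the Client Hello — the ordering is the whole point.
fields = [
    "771",                   # 0x0303, TLS 1.2 in the handshake header
    "4865-4866-4867-49195",  # offered cipher suites, in order
    "0-23-65281-10-11",      # offered extensions, in order
    "29-23-24",              # supported groups (elliptic curves)
    "0",                     # EC point formats
]
ja3 = hashlib.md5(",".join(fields).encode()).hexdigest()
print(ja3)  # this 32-hex-char hash is what gets matched against a denylist
```

Change the cipher order or drop one extension and the hash changes completely, which is why header spoofing alone can never fix it.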
The same request, with cookies copied from a real Chrome session:

```python
sess = requests.Session()
sess.cookies.update(cookies_from_real_chrome)
r = sess.get(url, headers=headers)
# → 403 Forbidden (still)
```

Why it fails: same TLS-fingerprint problem. The session has the right cookies but the wrong handshake, so Cloudflare rejects the request before any cookie is even read.
`cloudscraper`, a popular library that historically solved Cloudflare's older JavaScript challenges by emulating them in Python:
```python
import cloudscraper

scraper = cloudscraper.create_scraper()
r = scraper.get(url)
# → 403 Forbidden
```

Why it fails: cloudscraper was written for Cloudflare's older challenge format. Cloudflare ships challenge updates regularly; current challenges (Turnstile, the v2 JS challenges) require a real JS engine to solve, and cloudscraper cannot keep up. The library is effectively unmaintained against current CF protections.
`httpx` with HTTP/2 enabled:

```python
import httpx

client = httpx.Client(http2=True)
r = client.get(url, headers=headers)
# → 403 Forbidden
```

Why it fails: httpx uses the same `ssl` module internals as requests for the TLS handshake, and speaking HTTP/2 doesn't change the JA3.
Selenium with the `selenium-stealth` patches:

```python
from selenium import webdriver
from selenium_stealth import stealth

driver = webdriver.Chrome()
stealth(driver, ...)
driver.get(url)
# → Detected as automation, eventually challenged
```

Why it fails: Cloudflare runs JS-side checks for automation markers — `navigator.webdriver`, the presence of the CDP pipe, odd timing of `requestIdleCallback`, headless-Chrome quirks in `navigator.permissions` and the WebGL renderer string. selenium-stealth patches the most obvious ones but misses subtler tells, and the cat-and-mouse continues.
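You can see part of what those JS-side checks see by asking the driver directly — a quick diagnostic sketch, assuming the `driver` from the snippet above:

```python
# Inspect a few automation markers from inside the page's own JS context.
markers = driver.execute_script("""
    return {
        webdriver: navigator.webdriver,    // true on an unpatched driver
        plugins: navigator.plugins.length,
        languages: navigator.languages,
    };
""")
print(markers)
```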
`undetected-chromedriver`, a more aggressive stealth fork of Selenium's chromedriver:
```python
import undetected_chromedriver as uc

driver = uc.Chrome()
driver.get(url)
# → Worked initially; broke after a Cloudflare update
```

Why it fails: this actually got through for a couple of weeks during initial development. Then a Cloudflare-side update added a new check that the patched chromedriver didn't pass. This is the fundamental problem with any "stealth" fork — it's a moving target, and the project maintaining the patches has to track every Cloudflare update, which is unsustainable.
This is the breakthrough: `curl-cffi` with Chrome TLS impersonation.
```python
from curl_cffi import requests as cffi_requests

r = cffi_requests.get(
    url,
    headers=headers,
    cookies=session_cookies,
    impersonate="chrome",  # ← the key argument
    timeout=45,
)
# → 200 OK, real HTML response
```

Why it works: curl-cffi is Python bindings for curl-impersonate, a fork of curl that has been patched to send the exact TLS Client Hello of a real Chrome browser — the same cipher-suite ordering, the same TLS extensions in the same order, the same ALPN negotiation, the same HTTP/2 SETTINGS frames. From Cloudflare's perspective, the JA3 hash is indistinguishable from Chrome's.
The `impersonate="chrome"` argument selects the latest tracked Chrome version. There are also `chrome110`, `chrome116`, `chrome120` and similar pinned versions if you need stability across Cloudflare updates.
This works for the AJAX endpoint because the endpoint's only protection is Cloudflare's TLS-level filter. Once you're through that, the cookies + headers do the rest.
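Pinning is a one-line change — a minimal variation of the call above, assuming the same `url`, `headers`, and `session_cookies`:

```python
# Pin a specific Chrome fingerprint instead of tracking the latest;
# "chrome" re-maps as curl-cffi updates, "chrome120" never does.
r = cffi_requests.get(
    url,
    headers=headers,
    cookies=session_cookies,
    impersonate="chrome120",
    timeout=45,
)
```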
Some pages on the target site are not reachable via the AJAX endpoint. They require loading the full HTML page, which fires a chain of JS that establishes state in the page's own JS context, which is then required by a subsequent AJAX call to actually return content. No HTTP-only approach can replay this — you need a real browser.
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=False,  # headless mode is detected
        args=["--disable-blink-features=AutomationControlled"],
    )
    context = browser.new_context(storage_state="session.json")
    page = context.new_page()
    page.goto(url, wait_until="domcontentloaded")
    page.wait_for_selector("li.timeline-event[data-racedate]", timeout=12_000)
    html = page.content()
    # → parse out the data
```

Why it works: it IS Chrome. The TLS handshake is real Chrome's (not curl-impersonate's replay). The JS engine is V8. `navigator.webdriver` is suppressed by the launch arg. The session cookies are real Chrome session cookies, persisted to disk via Playwright's storage state.
The cost: ~2.5 seconds per page (page navigation + wait for AJAX render + DOM read). For 210k records this would be a week. But because most of the data was already harvestable via curl-cffi, Playwright was only needed for the enrichment stage (~30k unique horses), which finishes overnight.
```python
# src/scraper/api_scraper.py (excerpt)
import time

from curl_cffi import requests as cffi_requests


def fetch_records(params, cookies, max_retries=3):
    """
    Fetch a page from the public AJAX endpoint.

    Uses Chrome TLS impersonation to defeat Cloudflare JA3 filtering.
    """
    headers = {
        "User-Agent": CHROME_UA,
        "Accept": "*/*",
        "Accept-Language": "en-AU,en;q=0.9",
        "X-Requested-With": "XMLHttpRequest",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Ch-Ua": CHROME_SEC_CH_UA,
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
    }
    last_err = None
    # one immediate attempt plus up to max_retries backed-off retries
    for backoff in [0, 5, 15, 45][: max_retries + 1]:
        if backoff:
            time.sleep(backoff)
        try:
            r = cffi_requests.get(
                ENDPOINT_URL,
                params=params,
                headers=headers,
                cookies=cookies,
                impersonate="chrome",
                timeout=45,
            )
            if r.status_code in (401, 403):
                raise AuthExpired()  # session cookie is dead; retrying won't help
            if r.status_code == 200:
                return r.text
        except AuthExpired:
            raise
        except Exception as e:
            last_err = e
            continue
    raise RuntimeError(f"Gave up after retries: {last_err}")
```

Headers exactly match Chrome 120+. The `Sec-Fetch-*` and `Sec-Ch-Ua-*` headers are what a real Chrome AJAX call from this domain would send — getting them wrong is enough to fail.
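For context, a hypothetical caller — `load_session_cookies`, `refresh_session`, `all_param_sets`, `parse_fragment`, and `store` are invented names, and the real query parameters are omitted deliberately:

```python
cookies = load_session_cookies()     # hypothetical helper
for params in all_param_sets:        # one entry per search-results page
    try:
        html = fetch_records(params, cookies)
    except AuthExpired:
        cookies = refresh_session()  # hypothetical: re-login, reload cookies
        html = fetch_records(params, cookies)
    store(parse_fragment(html))
    time.sleep(0.4)                  # politeness delay, discussed below
```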
```python
# src/scraper/browser_scraper.py (excerpt)
def fetch_one_via_browser(context, record_id, slug):
    """
    Load a protected profile page, wait for AJAX-rendered content,
    extract the data via the same parser used by Stage 1.
    """
    url = f"https://example-racing-data.com/profile/{slug}/{record_id}"
    page = context.new_page()
    rows = []
    try:
        page.goto(url, wait_until="domcontentloaded", timeout=30_000)
        # wait for the AJAX-rendered content to appear
        page.wait_for_selector(
            "li.timeline-event[data-racedate], .PagerResults",
            timeout=12_000,
        )
        # paginate by clicking Next until exhausted
        for _ in range(MAX_PAGES):
            html = page.content()
            rows.extend(parse_page(html, record_id))
            next_link = page.query_selector("a.UnselectedNext")
            if not next_link:
                break
            current = int(page.query_selector(".SelectedPage").inner_text())
            next_link.click()
            # wait for the page number to advance (see note below)
            page.wait_for_function(
                f"() => parseInt(document.querySelector('.SelectedPage').innerText) > {current}",
                timeout=10_000,
            )
    finally:
        page.close()
    return rows
```

Three details here are non-obvious:
- `headless=False`. Headless Chrome is detectable in a dozen ways (window dimensions, `navigator.plugins`, missing audio APIs, GPU rendering quirks). For this site, headless was a hard fail. The visible window is annoying during dev, but it's cheap to minimize and the trade-off is fine for an overnight run.
- `page.wait_for_function` for pagination. Just waiting for `.SelectedPage` to exist isn't enough — it always exists. We have to wait for its value to change, which is the only signal that the AJAX click handler has finished and replaced the page content.
- Persistent storage state. The first run prompts for manual login; afterwards `context.storage_state(path=...)` saves the session, as sketched after this list. Subsequent runs reload it and skip the login flow entirely. This is the single biggest reliability win — the script can resume after any crash without losing the session.
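A minimal one-time bootstrap for that third point — `LOGIN_URL` is a placeholder, and the manual step is deliberate:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto(LOGIN_URL)
    input("Log in manually in the browser window, then press Enter here... ")
    context.storage_state(path="session.json")  # what later runs reload
    browser.close()
```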
A few features that are easy to overlook but matter a lot in practice when you're running a scraper for hours or days unattended:
A JSON progress file is written every 25 records:
```python
import json


def save_progress(done_ids):
    PROGRESS_FILE.write_text(json.dumps(sorted(done_ids)))


def load_progress():
    if PROGRESS_FILE.exists():
        return set(json.loads(PROGRESS_FILE.read_text()))
    return set()
```

If the run crashes, restarting picks up where it left off. No duplicate work.
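Wired into the main loop it looks roughly like this — `all_record_ids`, `slug_for`, and `store` are hypothetical names:

```python
done = load_progress()
processed = 0
for record_id in all_record_ids:
    if record_id in done:
        continue  # finished in a previous run
    store(fetch_one_via_browser(context, record_id, slug_for(record_id)))
    done.add(record_id)
    processed += 1
    if processed % 25 == 0:
        save_progress(done)
save_progress(done)  # final flush
```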
A failure mode that wastes hours is a session expiring partway through a long run — the scraper keeps making requests, they all return empty, and you discover at the end that the last 50% of records have no data. Solution: track consecutive empty responses and abort early.
```python
if len(rows) == 0:
    consecutive_empty += 1
else:
    consecutive_empty = 0
if consecutive_empty >= 10:
    print("10 records in a row returned empty — session likely expired.")
    print("Delete session.json and re-authenticate.")
    break
```

`time.sleep(0.4)` between requests, plus the natural ~2s page load, works out to approximately one request every 3 seconds — well within what a human user casually reading the site would generate. The goal is not just to evade detection; it's also to not overload a small operator's site.
- Use a residential proxy pool from the start. This run was done from a single IP and finished without any IP-level block, but for a larger or longer-running job a proxy rotator is cheap insurance. Bright Data, Decodo, Webshare are all reasonable.
- Accept the time cost of Playwright earlier. I spent two days exploring HTTP-only options before resigning myself to Playwright for the enrichment stage. Looking back, if a site has any meaningful anti-bot protection, going straight to a real browser is the highest-EV first move and you can optimize later if it's too slow.
- Persist the parser separately from the fetcher. The parser (regex + BeautifulSoup) is the same regardless of how the HTML was retrieved. In the codebase here it's already in its own module, but in earlier iterations the parser was glued into the fetcher, which made it harder to test. Now `parse_page(html, record_id) -> list[dict]` is a pure function and is unit-tested independently.
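What that buys you, as a sketch — the fixture path and the expected row key are invented for illustration:

```python
from pathlib import Path


def test_parse_page_extracts_rows():
    # HTML saved once from a real page; the test never touches the network
    html = Path("tests/fixtures/profile_page.html").read_text()
    rows = parse_page(html, record_id=12345)
    assert rows, "parser found no rows in a known-good page"
    assert all("race_date" in row for row in rows)
```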
Scraping is governed by the site's terms of service, by the laws of your jurisdiction (CFAA in the US, Computer Misuse Act in the UK), and increasingly by case law on database rights and on whether scraping public data violates contract. This document describes techniques in a generic way; whether a specific use of them is legitimate depends on facts I don't have.
Rule of thumb that has served me well:
- Public data, accessed at human-like rates, no circumvention of paywalls or auth → almost always fine.
- Behind a paywall you're a paying customer of → check ToS, but generally OK for personal use.
- Behind auth you don't have, or scraping someone's PII → don't.
If in doubt, ask the site owner. Many small operators are happy to provide an API or a data export if you explain what you want and offer to pay for it.