Anti-Bot Bypass: Engineering Notes

This document describes the layered scraping strategy used to collect ~210k runner records and ~190k enriched past-start records from a Cloudflare-protected ASP.NET WebForms site.

The names of the target site and the project's commercial use case are deliberately omitted. Every technique below is publicly documented and is used routinely in legitimate web automation work (e.g. enterprise QA, regression testing, monitoring of one's own services). The same techniques are also used by bad actors; that is true of every web automation tool ever built. This writeup is for the legitimate use case: you have a contract or a personal project that requires data from a site whose owners are indifferent or hostile to programmatic access, and the data is publicly visible to any logged-in user.


TL;DR

Two layers, in this order:

  1. curl-cffi with impersonate="chrome" for the public AJAX endpoint. Replays Chrome's TLS handshake (JA3 fingerprint, ALPN, cipher suite ordering) so Cloudflare's TLS inspection cannot distinguish the client from a real browser. ~30 records/second sustained.

  2. Playwright with a real Chromium and a persistent session for protected pages where TLS impersonation is not enough — the site fires a JS challenge that requires a real browser to compute. Manual login once; the session cookie is saved and reused headlessly thereafter. ~1 page every 2–3 seconds.

The requests library does not work, period. Eight different approaches were tested before this combination converged.


What the target looks like

  • Static frontend pages live behind Cloudflare with the standard challenge page enabled.
  • Backend is ASP.NET WebForms (__VIEWSTATE, __EVENTVALIDATION, postback model).
  • One key endpoint is an AJAX search source that returns HTML fragments. This endpoint is reachable directly with the right cookies and TLS fingerprint, but only because Cloudflare's protections on it are looser than on the main pages.
  • The protected horse-profile pages cannot be scraped via HTTP at all — the page renders some content via a client-side AJAX call that itself depends on JS state established at page load. A bare HTTP request gets a stub.
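The WebForms postback model has a practical consequence for any scraper: every POST must echo back the hidden state fields rendered into the previous response. A minimal sketch of pulling those fields out with the standard library (the sample markup is illustrative, not the target site's):

```python
import re

WEBFORMS_FIELDS = ("__VIEWSTATE", "__VIEWSTATEGENERATOR", "__EVENTVALIDATION")

def extract_webforms_state(html: str) -> dict:
    """Collect the hidden fields an ASP.NET WebForms postback must echo back."""
    fields = {}
    for name in WEBFORMS_FIELDS:
        # WebForms renders these as <input type="hidden" ... id="NAME" value="..." />
        m = re.search(rf'id="{name}"\s+value="([^"]*)"', html)
        if m:
            fields[name] = m.group(1)
    return fields

sample = '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTM=" />'
print(extract_webforms_state(sample))  # → {'__VIEWSTATE': 'dDwtMTM='}
```

Merge the extracted fields into the form data of the next POST; a postback that omits __EVENTVALIDATION is rejected server-side regardless of what the anti-bot layer thinks of you.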

What did NOT work

Every approach in this list was tried, in roughly this order. Each one failed in a different way, and each failure mode taught me something about what the site was actually checking.

1. requests.get() with browser-like headers

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Accept": "text/html,application/xhtml+xml,...",
    "Accept-Language": "en-AU,en;q=0.9",
}
r = requests.get(url, headers=headers)
# → 403 Forbidden (Cloudflare challenge page)

Why it fails: Cloudflare doesn't only look at headers. The TLS handshake itself is fingerprinted. Python's ssl module sends a TLS Client Hello that has a distinctive set of cipher suites, extensions, and ordering — the JA3 fingerprint — that no real browser emits. Cloudflare maintains a denylist of "non-browser" JA3 hashes, and Python is on it.

You can change every HTTP header to look like Chrome and you still get blocked, because the block happens before HTTP — at the TLS layer.
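You can see one ingredient of that fingerprint directly. The cipher list Python offers in its Client Hello comes from the linked OpenSSL build and matches no browser; a quick way to inspect it (counts and names vary by Python/OpenSSL version, so no fixed output is shown):

```python
import ssl

# The ciphers offered in the Client Hello are one input to the JA3 hash,
# alongside the TLS version, extensions, elliptic curves, and point formats.
ctx = ssl.create_default_context()
cipher_names = [c["name"] for c in ctx.get_ciphers()]
print(len(cipher_names), "ciphers offered, e.g.", cipher_names[0])
```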

2. requests.Session() with persistent cookies copied from a real browser

sess = requests.Session()
sess.cookies.update(cookies_from_real_chrome)
r = sess.get(url, headers=headers)
# → 403 Forbidden (still)

Why it fails: Same TLS-fingerprint problem. The session has the right cookies but the wrong handshake, so Cloudflare rejects the request before any cookie is even read.

3. cloudscraper

A popular library that historically solved Cloudflare's older JavaScript challenges by emulating them in Python.

import cloudscraper
scraper = cloudscraper.create_scraper()
r = scraper.get(url)
# → 403 Forbidden

Why it fails: cloudscraper was written for Cloudflare's older challenge format. Cloudflare ships challenge updates regularly; current challenges (Turnstile, the v2 JS challenges) require a real JS engine to solve, and cloudscraper cannot keep up. The library is effectively unmaintained against current CF protections.

4. httpx with HTTP/2

import httpx
client = httpx.Client(http2=True)
r = client.get(url, headers=headers)
# → 403 Forbidden

Why it fails: Python's httpx uses the same ssl module internals as requests for the TLS handshake. HTTP/2 doesn't change the JA3.

5. Selenium with the selenium-stealth plugin

from selenium import webdriver
from selenium_stealth import stealth

driver = webdriver.Chrome()
stealth(driver, ...)
driver.get(url)
# → Detected as automation, eventually challenged

Why it fails: Cloudflare runs JS-side checks for automation markers — navigator.webdriver, the presence of the CDP pipe, odd timing of requestIdleCallback, headless-Chrome quirks in navigator.permissions and the WebGL renderer string. selenium-stealth patches the most obvious ones but misses subtler tells, and the cat-and-mouse continues.

6. undetected-chromedriver

A more aggressive stealth fork of Selenium's chromedriver.

import undetected_chromedriver as uc
driver = uc.Chrome()
driver.get(url)
# → Worked initially; broke after a Cloudflare update

Why it fails: This actually got through for a couple of weeks during initial development. Then a Cloudflare-side update added a new check that the patched chromedriver didn't match. This is the fundamental problem with any "stealth" fork — it's a moving target, and the project that maintains the patches has to track every CF update, which is unsustainable.

7. curl-cffi with Chrome impersonation — WORKS for the AJAX endpoint

This is the breakthrough.

from curl_cffi import requests as cffi_requests

r = cffi_requests.get(
    url,
    headers=headers,
    cookies=session_cookies,
    impersonate="chrome",   # ← key argument
    timeout=45,
)
# → 200 OK, real HTML response

Why it works: curl-cffi is Python bindings for curl-impersonate, a fork of curl that has been patched to send the exact TLS Client Hello of a real Chrome browser — the same cipher suite ordering, same TLS extensions in the same order, same ALPN negotiation, same H2 SETTINGS frames. From Cloudflare's perspective, the JA3 hash is indistinguishable from Chrome.

The impersonate="chrome" argument selects the latest tracked Chrome version. There are also chrome110, chrome116, chrome120 and similar pinned versions if you need stability across Cloudflare updates.

This works for the AJAX endpoint because the endpoint's only protection is Cloudflare's TLS-level filter. Once you're through that, the cookies + headers do the rest.
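One way to use those pinned profiles defensively is a fallback chain: try the latest profile first and fall back to older pinned ones if Cloudflare starts rejecting it. A sketch with the fetch function injected so the strategy can be tested offline; the profile names are curl-cffi's, the helper itself is mine:

```python
PROFILES = ["chrome", "chrome120", "chrome116", "chrome110"]

def fetch_with_profile_fallback(fetch, url, profiles=PROFILES):
    """Try each impersonation profile in order until one succeeds.

    `fetch` is any callable with the shape of curl_cffi.requests.get,
    e.g. lambda url, impersonate: cffi_requests.get(url, impersonate=impersonate).
    """
    last_err = None
    for profile in profiles:
        try:
            return fetch(url, impersonate=profile)
        except Exception as e:
            last_err = e  # remember why this profile failed, try the next one
    raise RuntimeError(f"all impersonation profiles failed: {last_err}")
```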

8. Playwright with a real headed Chromium — WORKS for protected pages

Some pages on the target site are not reachable via the AJAX endpoint. They require loading the full HTML page, which fires a chain of JS that establishes state in the page's own JS context, which is then required by a subsequent AJAX call to actually return content. No HTTP-only approach can replay this — you need a real browser.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=False,  # headless mode is detected
        args=["--disable-blink-features=AutomationControlled"],
    )
    context = browser.new_context(storage_state="session.json")
    page = context.new_page()
    page.goto(url, wait_until="domcontentloaded")
    page.wait_for_selector("li.timeline-event[data-racedate]", timeout=12000)
    html = page.content()
    # → parse out the data

Why it works: It IS Chrome. The TLS handshake is real Chrome (not curl-impersonate's replay). The JS engine is V8. navigator.webdriver can be patched out via the launch arg. The session cookies are real Chrome session cookies, written to disk in Chrome's own format.

The cost: ~2.5 seconds per page (page navigation + wait for AJAX render + DOM read). For 210k records this would be a week. But because most of the data was already harvestable via curl-cffi, Playwright was only needed for the enrichment stage (~30k unique horses), which finishes overnight.
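The back-of-envelope behind the "a week" figure, using the measured 2.5 s/page average:

```python
SEC_PER_PAGE = 2.5      # navigation + AJAX render + DOM read
ALL_RECORDS = 210_000   # full corpus, infeasible through a browser

hours = ALL_RECORDS * SEC_PER_PAGE / 3600
print(f"{hours:.0f} hours, ~{hours / 24:.1f} days")  # → 146 hours, ~6.1 days
```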


The two-stage architecture in code

Stage 1: bulk harvest via TLS impersonation

# src/scraper/api_scraper.py (excerpt)

import time

from curl_cffi import requests as cffi_requests

def fetch_records(params, cookies, max_retries=3):
    """
    Fetch a page from the public AJAX endpoint.
    Uses Chrome TLS impersonation to defeat Cloudflare JA3 filtering.
    """
    headers = {
        "User-Agent": CHROME_UA,
        "Accept": "*/*",
        "Accept-Language": "en-AU,en;q=0.9",
        "X-Requested-With": "XMLHttpRequest",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Ch-Ua": CHROME_SEC_CH_UA,
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
    }

    last_err = None
    for backoff in [0, 5, 15, 45][: max_retries + 1]:
        if backoff:
            time.sleep(backoff)
        try:
            r = cffi_requests.get(
                ENDPOINT_URL,
                params=params,
                headers=headers,
                cookies=cookies,
                impersonate="chrome",
                timeout=45,
            )
            if r.status_code in (401, 403):
                raise AuthExpired()
            if r.status_code == 200:
                return r.text
            last_err = RuntimeError(f"HTTP {r.status_code}")
        except AuthExpired:
            raise  # a dead session should surface immediately, not be retried
        except Exception as e:
            last_err = e
    raise RuntimeError(f"Gave up after retries: {last_err}")

Headers exactly match Chrome 120+. The Sec-Fetch-* and Sec-Ch-Ua-* headers are what a real Chrome AJAX call from this domain would send — getting them wrong is enough to fail.

Stage 2: protected-page enrichment via Playwright

# src/scraper/browser_scraper.py (excerpt)

def fetch_one_via_browser(context, record_id, slug):
    """
    Load a protected profile page, wait for AJAX-rendered content,
    extract the data via the same parser used by Stage 1.
    """
    url = f"https://example-racing-data.com/profile/{slug}/{record_id}"
    page = context.new_page()
    rows = []
    try:
        page.goto(url, wait_until="domcontentloaded", timeout=30_000)
        # wait for the AJAX-rendered content to appear
        page.wait_for_selector(
            "li.timeline-event[data-racedate], .PagerResults",
            timeout=12_000,
        )
        # paginate by clicking Next until exhausted
        for _ in range(MAX_PAGES):
            html = page.content()
            rows.extend(parse_page(html, record_id))
            next_link = page.query_selector("a.UnselectedNext")
            if not next_link:
                break
            current = int(page.query_selector(".SelectedPage").inner_text())
            next_link.click()
            page.wait_for_function(
                f"() => parseInt(document.querySelector('.SelectedPage').innerText) > {current}",
                timeout=10_000,
            )
    finally:
        page.close()
    return rows

Three details that are non-obvious:

  • headless=False. Headless Chrome is detectable in a dozen ways (window dimensions, navigator.plugins, missing audio APIs, GPU rendering quirks). For this site, headless was a hard fail. The visible window is annoying during dev but it's cheap to minimize and the trade-off is fine for an overnight run.
  • page.wait_for_function for pagination. Just waiting for .SelectedPage to exist isn't enough — it always exists. We have to wait for its value to change, which is the only signal that the AJAX click handler has finished and replaced the page content.
  • Persistent storage state. The first run prompts for manual login; afterwards context.storage_state(path=...) saves the session. Subsequent runs reload it and skip the login flow entirely. This is the single biggest reliability win — the script can resume after any crash without losing the session.
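The reuse-or-login decision can live outside the browser code entirely. A sketch (session.json is the file used above; the helper name is mine):

```python
from pathlib import Path

def context_kwargs(session_path: Path) -> dict:
    """Kwargs for browser.new_context(): reuse the saved Playwright storage
    state when it exists; otherwise start fresh, let the operator log in
    manually, then save it with context.storage_state(path=str(session_path))."""
    if session_path.exists():
        return {"storage_state": str(session_path)}
    return {}
```

On a fresh run this returns {}, the script pauses for the manual login and writes the state; every later run gets the saved cookies back for free.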

Reliability features

A few features that are easy to overlook but matter a lot in practice when you're running a scraper for hours or days unattended:

Resume support

A JSON progress file is written every 25 records:

import json
from pathlib import Path

PROGRESS_FILE = Path("progress.json")  # one JSON array: the sorted completed IDs

def save_progress(done_ids):
    PROGRESS_FILE.write_text(json.dumps(sorted(done_ids)))

def load_progress():
    if PROGRESS_FILE.exists():
        return set(json.loads(PROGRESS_FILE.read_text()))
    return set()

If the run crashes, restarting picks up where it left off. No duplicate work.

Empty-result detection

A failure mode that wastes hours is a session expiring partway through a long run — the scraper keeps making requests, they all return empty, and you discover at the end that the last 50% of records have no data. Solution: track consecutive empty responses and abort early.

consecutive_empty = 0  # reset once, before the main loop

# inside the loop, after each record is fetched:
if len(rows) == 0:
    consecutive_empty += 1
else:
    consecutive_empty = 0

if consecutive_empty >= 10:
    print("10 records in a row returned empty — session likely expired.")
    print("Delete session.json and re-authenticate.")
    break

Polite pacing

time.sleep(0.4) between requests, plus the natural ~2s page load, works out to roughly one request every two to three seconds. That is well within what a human casually reading the site would generate. The goal is not just to evade detection; it is also to avoid overloading a small operator's site.
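A refinement worth adding (my addition, not part of the original run): jitter the delay so the request timing doesn't form a machine-perfect comb.

```python
import random
import time

def polite_pause(base: float = 0.4, jitter: float = 0.3) -> float:
    """Sleep for `base` plus a random extra of up to `jitter` seconds,
    and return the delay actually used (handy for logging)."""
    delay = base + random.uniform(0.0, jitter)
    time.sleep(delay)
    return delay
```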


Things I would do differently next time

  • Use a residential proxy pool from the start. This run was done from a single IP and finished without any IP-level block, but for a larger or longer-running job a proxy rotator is cheap insurance. Bright Data, Decodo, Webshare are all reasonable.
  • Accept the time cost of Playwright earlier. I spent two days exploring HTTP-only options before resigning myself to Playwright for the enrichment stage. Looking back, if a site has any meaningful anti-bot protection, going straight to a real browser is the highest-EV first move and you can optimize later if it's too slow.
  • Persist the parser separately from the fetcher. The parser (regex + BeautifulSoup) is the same regardless of how the HTML was retrieved. In the codebase here it's already in its own module, but in earlier iterations the parser was glued into the fetcher and that made it harder to test. Now parse_page(html, record_id) -> list[dict] is a pure function and is unit-tested independently.
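What that separation buys you in tests, shown with a toy stand-in (the real parse_page belongs to the project; this one only illustrates the pure-function shape and the fixture-based testing it enables):

```python
import re

def parse_page(html: str, record_id: int) -> list[dict]:
    """Toy stand-in: a pure function of its inputs, so it runs against
    saved HTML fixtures with no network and no browser."""
    return [
        {"record_id": record_id, "racedate": date}
        for date in re.findall(r'data-racedate="([^"]+)"', html)
    ]

# A unit test is just a saved fragment and an expected list:
fixture = '<li class="timeline-event" data-racedate="2024-05-11"></li>'
assert parse_page(fixture, 42) == [{"record_id": 42, "racedate": "2024-05-11"}]
```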

Legal note

Scraping is governed by the site's terms of service, by the laws of your jurisdiction (CFAA in the US, Computer Misuse Act in the UK), and increasingly by case law on database rights and on whether scraping public data violates contract. This document describes techniques in a generic way; whether a specific use of them is legitimate depends on facts I don't have.

Rule of thumb that has served me well:

  • Public data, accessed at human-like rates, no circumvention of paywalls or auth → almost always fine.
  • Behind a paywall you're a paying customer of → check ToS, but generally OK for personal use.
  • Behind auth you don't have, or scraping someone's PII → don't.

If in doubt, ask the site owner. Many small operators are happy to provide an API or a data export if you explain what you want and offer to pay for it.
