feat: add Hudson Rock infostealer-corpus integration (v2.20.0) #9
abdullahbilal64 wants to merge 2 commits into OpenOSINT:main from abdullahbilal64:feat/hudsonrock-integration
Adds search_hudsonrock, a new OSINT tool that queries Hudson Rock's Cavalier API for credentials exposed via infostealer malware (RedLine, Lumma, Raccoon, Vidar, StealC, ...). Auto-routes by input shape: emails → /search-by-email, domains → /search-by-domain, usernames and E.164 phone numbers → /search-by-username. No API key required; the optional HUDSONROCK_API_KEY is sent as Bearer auth for commercial-tier access. Registers the tool across all six interface layers (agent loop, MCP server, CLI subcommand, REPL display, web UI catalog) per the integration checklist in CONTRIBUTING.md. Bumps version to 2.20.0 across openosint/__init__.py, pyproject.toml, and .mcp/server.json. Adds .env.example entry, README env + Integrations table rows, and a CHANGELOG entry. Closes OpenOSINT#4.
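The auto-routing described above can be pictured with a minimal sketch. The function name `classify_query` and both regular expressions are illustrative assumptions, not the PR's actual `_classify()` code; only the three endpoint paths come from the description.

```python
import re

# Hypothetical patterns for illustration only; the PR's real regexes may differ.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w.-]+\.[a-z]{2,}$", re.IGNORECASE)
DOMAIN_RE = re.compile(
    r"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$", re.IGNORECASE
)


def classify_query(query: str) -> str:
    """Pick an endpoint path from the shape of the input (assumed logic)."""
    q = query.strip()
    if EMAIL_RE.match(q):
        return "/search-by-email"
    if DOMAIN_RE.match(q):
        return "/search-by-domain"
    # Anything else, including E.164 phone numbers, goes to the username endpoint.
    return "/search-by-username"


if __name__ == "__main__":
    for sample in ("user@mail.example.com", "example.com", "+14155550123", "jdoe"):
        print(sample, "->", classify_query(sample))
```

With shape-based routing of this kind, a dotted username such as `john.doe` would be treated as a domain in this sketch; how the real tool resolves that ambiguity is not stated in the PR text.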
First off: this is a genuinely clean PR. Full integration checklist across all six layers, consistent version bump, 25 tests, and you flagged the _EMAIL_RE duplicate transparently instead of silently fixing it. Appreciated.

The (redacted) label on top_logins: in _format_stealers the records print under "Top logins (redacted):", but the formatter passes through whatever the API returns, and the test fixture has full addresses (user@example.com, admin@example.com). The label asserts a redaction the code doesn't actually perform. Either mask them in the formatter (e.g. first char of local-part + domain) or relabel to something accurate like "Top logins (as returned by API):". For a tool we ship "for authorized security research only," I'd rather the wording match the behavior exactly.

On the web_server.py:757 duplicate of the same regex flaw: good call keeping it out of scope. I'll open a follow-up issue to track it (or happy to take a second PR if you want to grab it).
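For reference, the masking option mentioned in the review (first character of the local part plus the domain) could look roughly like the sketch below. The helper name `mask_login` and the `***` filler are assumptions made for illustration; the PR ultimately relabeled the output instead of adding a helper like this.

```python
def mask_login(login: str) -> str:
    """Keep the first character of the local part and the full domain.

    Hypothetical helper: not part of the PR, shown only to illustrate
    the masking alternative raised in review.
    """
    local, sep, domain = login.partition("@")
    if not sep or not local:
        # Not email-shaped: keep only the first character, if there is one.
        return (login[:1] + "***") if login else login
    return f"{local[0]}***@{domain}"


if __name__ == "__main__":
    for sample in ("user@example.com", "admin@example.com", "jdoe"):
        print(mask_login(sample))
```

Masking in the formatter would make a "(redacted)" label true regardless of which API tier produced the response, at the cost of hiding detail from authorized users.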
- Rename "Top logins (redacted)" → "Top logins (as returned by API)". The formatter does no redaction itself; Hudson Rock's free tier already partial-masks server-side, but the test fixture and any future paid-tier response could carry unredacted logins under the old label. Honest label matches what the code does.
- Align [2.20.0] CHANGELOG date to 2026-06-05 to match the commit log and README footer.

Addresses review feedback on the PR.
Thanks for the quick and careful review. I've pushed both fixes in the latest commit. The redacted label was a genuinely sharp catch. It looked right end-to-end because the live API kept returning pre-masked strings during my testing, so the label happened to coincide with the behavior. I went with relabeling, so the output now tells the user that we pass along whatever the API returns. I've also fixed the CHANGELOG date to match the actual commit date. By the way, I'm happy to take the lead on the follow-up PR.
Summary

Adds `search_hudsonrock`, a new OSINT tool that queries Hudson Rock's Cavalier API for credentials exposed via infostealer malware (RedLine, Lumma, Raccoon, Vidar, StealC, …). It auto-routes by input shape (emails → `/search-by-email`, domains → `/search-by-domain`, usernames and E.164 phone numbers → `/search-by-username`) and works without an API key against the free public endpoint (50 req / 10 s rate limit). `HUDSONROCK_API_KEY`, if set, is sent as `Authorization: Bearer …` for commercial-tier access. Closes #4.

Why
Infostealer-corpus checks fill a coverage gap that `search_breach` (HaveIBeenPwned) doesn't address: HIBP indexes credentials that have been published as breaches, but a substantial fraction of compromised credentials surface only in malware botnet logs that are sold privately. For email and domain investigations this materially increases recall, and the `domain` mode returns useful aggregate signals (compromised-employee count, top stealer families, victim-AV breakdown) for assessing organisational exposure.

What's in the diff
New tool module

- `openosint/tools/search_hudsonrock.py`: async `run_hudsonrock_osint(query, timeout_seconds)`; `_classify()` selects the endpoint from input shape; separate formatters for the domain-aggregate response and per-record (email/username) responses; output redacts top-logins and masks victim IPs as returned by the API. Follows the project's tool-contract convention: never raises across the API boundary, returns descriptive error strings on failure.

Interface registration (per the integration checklist in `CONTRIBUTING.md`)

- `openosint/agent.py`: Anthropic tool definition + dispatch entry in `_TOOL_MAP`; `SYSTEM_PROMPT` now suggests `search_hudsonrock` alongside `search_breach` for credential-exposure investigations.
- `openosint/mcp_server.py`: `Tool(...)` entry in `list_tools()` and dispatch branch; module docstring updated to reflect 17 tools.
- `openosint/cli.py`: `openosint hudsonrock QUERY [-t SECONDS]` subcommand.
- `openosint/repl.py`: display row in `_TOOL_INFO_ROWS`.
- `openosint/web_server.py`: `_TOOL_CATALOG` entry (Identity category) + `_RUNNERS` mapping; tool surfaces in the web UI sidebar and the AI chat tool-use path.

Version + docs
- Version bumped to 2.20.0 in `openosint/__init__.py`, `pyproject.toml`, and `.mcp/server.json` (both the top-level `version` and the package entry). `_VERSION` in `web_server.py` was already at 2.20.0.
- `.env.example`: new `HUDSONROCK_API_KEY` entry with a comment explaining the public-vs-commercial-tier behavior.
- `README.md`: feature line (`16 tools` → `17`), env-var table row, Integrations table row pointing at hudsonrock.com.
- `CHANGELOG.md`: `[2.20.0]` entry under `Added` / `Changed`.

Drive-by cleanups inside the new file (flagging transparently; happy to split into a separate commit/PR if maintainers prefer)
- `_EMAIL_RE` had `[\w-]+` in the domain class, rejecting any multi-level domain (`user@mail.example.com` failed `_is_valid_email()`). Changed to `[\w.-]+`; the TLD `[a-z]{2,}` anchor and `re.IGNORECASE` flag are preserved. Note: the same flaw exists at `openosint/web_server.py:757` inside `_demo_chat_stream`; it is left untouched here to keep this PR scoped to the integration and can be a follow-up.
- `_fetch_hudsonrock` previously called `_raise_for_status(resp.status)` and then re-checked `resp.status == 404` to return `{}`. The two checks were redundant: `_raise_for_status` silently returned on 404, then the caller detected 404 again. Collapsed into a single explicit 404 short-circuit in the caller; `_raise_for_status` now only raises, matching its name.

Test plan
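As a quick regression check for the `_EMAIL_RE` change listed under the drive-by cleanups, the old and new domain classes can be compared side by side. This is a sketch under assumptions: only the domain class (`[\w-]+` vs `[\w.-]+`), the `[a-z]{2,}` TLD anchor, and `re.IGNORECASE` come from the PR text; the local-part class and the names `OLD_EMAIL_RE`, `NEW_EMAIL_RE`, and `accepts` are made up for the example.

```python
import re

# Assumed full patterns; the local-part class "[\w.+-]+" is a guess.
OLD_EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[a-z]{2,}$", re.IGNORECASE)
NEW_EMAIL_RE = re.compile(r"^[\w.+-]+@[\w.-]+\.[a-z]{2,}$", re.IGNORECASE)


def accepts(pattern: re.Pattern, address: str) -> bool:
    """Return True when the whole address matches the pattern."""
    return pattern.match(address) is not None


if __name__ == "__main__":
    for address in ("user@example.com", "user@mail.example.com"):
        print(address, accepts(OLD_EMAIL_RE, address), accepts(NEW_EMAIL_RE, address))
```

Without a dot in the domain class, the old pattern can only match a single label before the TLD, which is why subdomain addresses were rejected.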