DEVOP-571: add SECURITY-RUNBOOK.md (Shai-Hulud incident response) #3

Merged
spooktheducks merged 3 commits into allora-network:main from srt0422:devop-571-security-runbook
May 14, 2026

@srt0422 commented May 13, 2026

Summary

Org-wide incident response runbook for Shai-Hulud-class supply-chain compromise. Lives in the .github org repo so it surfaces on every repo's Security tab.

What landed

SECURITY-RUNBOOK.md at the repo root, covering all the required scenarios:

  • Detection sources — Falco, IOC sweep, Dependabot, secret scanning, manual report; channel + owner per source.
  • Triage decision tree in plain text (works on phone, in PDF, pasted into Slack).
  • Scenario A — dev workstation compromise (disconnect → revoke → wipe → rebuild, with the credential-by-credential checklist).
  • Scenario B — CI runner compromise (disable workflow, audit blast radius, rotate, audit recent publishes).
  • Scenario C — compromised package we published (yank/deprecate, advisory, clean rebuild + downstream notify).
  • Scenario D — cluster pod compromise (cordon, capture-before-delete forensics, rotate SA-scoped secrets, uncordon).
  • Token rotation cadence — quarterly default, with "rotate immediately if" triggers.
  • Tabletop exercise schedule — annual, Q1, with explicit format and skip-approval rule.
  • Appendix — gh search incantations, cosign verify cookbook, sweep trigger, node drain.
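
The quarterly rotation default listed above can be illustrated with a small age check. This is a minimal sketch, not taken from the runbook: it assumes "quarterly" is approximated as 90 days, and the names `ROTATION_MAX_DAYS` and `rotation_status` are hypothetical.

```shell
# Assumption: "quarterly" is treated as 90 days; the runbook may define it differently.
ROTATION_MAX_DAYS=90

# Usage: rotation_status <issued_epoch_seconds> <now_epoch_seconds>
# Prints whether a long-lived credential is past the assumed rotation window.
rotation_status() {
  age_days=$(( ($2 - $1) / 86400 ))
  if [ "$age_days" -ge "$ROTATION_MAX_DAYS" ]; then
    echo "overdue ($age_days days)"
  else
    echo "ok ($age_days days, $((ROTATION_MAX_DAYS - age_days)) left)"
  fi
}
```

A caller that stores the issue time as epoch seconds in `$issued` could run `rotation_status "$issued" "$(date +%s)"`. Passing both timestamps as arguments keeps the check independent of GNU versus BSD `date` option differences. The "rotate immediately if" triggers are event-driven and are not modeled here.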

Style is operational, not compliance-flavored: short imperative steps, explicit owner per step, each scenario sectioned as Stop the bleed / Audit blast radius / Restore service / Close-out so the on-call can skim during an actual incident.

Linear

https://linear.app/alloralabs/issue/DEVOP-571

Test plan

  • Walk a DevOps engineer through Scenario A start-to-finish and check whether the credential-by-credential list misses anything they actually use.
  • Confirm the runbook renders correctly on the org Security tab after merge.
  • First annual tabletop (DEVOP-573) will be the real test: the runbook should be updated afterward based on what proved slow or ambiguous.

🤖 Generated with Claude Code


Summary by cubic

Adds an org-wide incident response runbook for Shai-Hulud–class supply‑chain compromises. Lives in allora-network/.github as SECURITY-RUNBOOK.md so it shows on every repo’s Security tab and satisfies DEVOP-571.

  • New Features

    • Detection sources with channel and owner.
    • Plain‑text triage decision tree.
    • Four scenarios with step‑by‑step actions: developer workstation, CI runner, compromised publish, and cluster pod (Stop the bleed → Audit → Restore → Close‑out), including deriving the ServiceAccount from saved pod YAML during Scenario D audits to avoid races after pod deletion.
    • Token rotation cadence with immediate‑rotate triggers.
    • Annual tabletop schedule and close‑out rules.
    • Appendix with quick commands for gh, cosign, and kubectl.
  • Bug Fixes

    • Scenario D SA grep fallback now uses POSIX [[:space:]] to work on BSD/macOS grep.

Written for commit 6574da0. Summary will update on new commits.

Org-wide incident response runbook for Shai-Hulud-class supply-chain
compromise. Lives in the .github org repo so it surfaces on every repo's
Security tab.

Covers all required scenarios from the acceptance criteria:

* Detection sources table (Falco, IOC sweep, Dependabot, secret scanning,
  manual report) with channel + owner per source.
* Triage decision tree — flowchart in plain text so it works on phone /
  in PDF / pasted into Slack.
* Scenario A: developer workstation compromise (disconnect, revoke,
  wipe, rebuild — explicit credential-by-credential checklist).
* Scenario B: CI runner compromise (disable workflow, audit blast radius,
  rotate every secret in scope, audit recent publishes, restore on a
  clean rebuild).
* Scenario C: compromised package published from our org (yank +
  deprecate, advisory, publish clean rebuild from clean environment,
  downstream notification, IOC list update).
* Scenario D: cluster pod compromise (cordon, forensic capture in order,
  delete, audit ServiceAccount scope, rotate, uncordon).
* Token rotation cadence — quarterly default, with per-credential rules
  and "rotate immediately if" triggers.
* Tabletop exercise schedule — annual, with explicit format and
  skip-approval rule.
* Appendix of useful commands (gh search, cosign verify, sweep trigger,
  node drain).

The runbook deliberately reads like an actual operational document, not
a compliance artifact: short imperative sentences, explicit owner per
step, no passive voice. Each scenario starts with "Stop the bleed,"
"Audit blast radius," "Restore service," "Close-out" so an on-call can
skim and find their position.

Refs: https://linear.app/alloralabs/issue/DEVOP-571

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cubic-dev-ai (bot) left a comment

cubic analysis

1 issue found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="SECURITY-RUNBOOK.md">

<violation number="1" location="SECURITY-RUNBOOK.md:349">
P2: Deriving the ServiceAccount from `$POD` after the pod is deleted can fail and break blast-radius auditing. Capture SA from the previously saved pod YAML (or before deletion) so audit steps remain executable.</violation>
</file>

Linked issue analysis

Linked issue: DEVOP-571: Write SECURITY-RUNBOOK.md in .github org repo

| Acceptance criteria | Notes |
| --- | --- |
| Detection sources: where alerts fire (Falco → Slack, IOC sweep → GitHub Issue, Dependabot, secret scanning push protection) | SECURITY-RUNBOOK.md contains a '1. Detection sources' table listing Falco → #security-alerts, IOC sweep → GitHub issue workflow, Dependabot, and secret scanning push protection. |
| Triage decision tree: confirm vs. false-positive; who pages whom | The runbook includes a Triage decision tree flowchart with branching for false positives and instructions on who to page (on-call, publisher). |
| Dev machine suspected infected: disconnect, revoke tokens, wipe, rebuild | Scenario A lists immediate disconnect, revoke token steps (detailed per-credential), preserve evidence guidance, and wipe+reinstall + reissue credentials. |
| CI runner suspected infected: disable workflow, rotate every secret in the env, audit recent publishes/pushes, yank+republish if needed | Scenario B instructs disabling workflows or scaling runner pool to 0, enumerates rotating every credential in scope, and auditing recent publishes (with escalation to Scenario C if suspect). |
| Compromised package published: yank from npm/PyPI/Harbor, publish corrected version from clean environment, notify downstream | Scenario C details yanking/deprecating/unpublishing behavior per registry, steps to publish corrected version from a clean environment, and notifying downstream consumers plus updating IOC lists. |
| Cluster pod suspected compromised: cordon node, capture forensic data via Falco/audit logs, delete pod, rotate secrets the pod could read, post-mortem | Scenario D prescribes cordoning the node, captures-forensics commands (kubectl describe/logs/exec), deleting or scaling to 0, listing/rotating secrets the pod could read, and mandatory post-mortem. |
| Token rotation cadence: quarterly for any long-lived credential not on OIDC | Section 7 contains a token rotation table specifying quarterly rotation for PATs/npm/PyPI/Harbor and notes about migrating to OIDC. |
| Tabletop exercise schedule: annual (see DEVOP-573) | Section 8 defines an annual Q1 tabletop exercise, format, attendees, and skip rules referencing DEVOP-573. |
| PR merged | Merge is an acceptance condition but cannot be satisfied by the patch diff itself; the PR is adding the runbook but is not merged yet. |
Architecture diagram
sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub
    participant Slack as Slack #security-alerts
    participant Falco as Falco (Cluster Runtime)
    participant IOC as Daily IOC Sweep
    participant Oncall as DevOps On-Call
    participant Package as Package Registry (npm/PyPI)
    participant Cluster as Kubernetes Cluster
    
    Note over Dev,Cluster: Detection Sources
    
    Falco->>Oncall: Alert: container behavior violation
    Falco->>Slack: Cross-post alert via Falcosidekick
    IOC->>GH: Auto-file issue in incident-response repo
    IOC->>Slack: Cross-post alert
    GH->>Slack: Dependabot alert
    GH->>Dev: Secret scanning push blocked
    Dev->>Slack: Manual report: suspicious activity
    
    Note over Oncall,Cluster: Triage Decision Tree (5-10 min)
    
    Oncall->>Oncall: Acknowledge in Slack within 5 min
    alt Falco alert & known false-positive
        Oncall->>Slack: Ack, tune rule in flux-*/falco/rules.yaml
    else IOC match: package@version
        alt We published that package?
            Oncall->>Package: Scenario C: yank/deprecate package
            Oncall->>Oncall: Page publisher + on-call
        else Not our publish
            Oncall->>GH: Pin known-good version, open PR
        end
    else Secret detected
        Oncall->>Oncall: Rotate secret immediately (see §7)
        Oncall->>GH: Audit usage for last 90 days
    else Weird workstation behavior
        Oncall->>Dev: Scenario A
    else Weird CI runner behavior
        Oncall->>GH: Scenario B
    else Weird pod behavior
        Oncall->>Cluster: Scenario D
    else Unknown alert
        Oncall->>Slack: Dig in, file follow-up ticket
    end
    
    Note over Dev,Cluster: Scenario A - Workstation Compromise
    
    Dev->>Dev: Disconnect machine from network
    Dev->>Slack: Post alert from phone
    Dev->>GH: Revoke all PATs and SSH keys
    Dev->>Package: Revoke npm/PyPI tokens
    Dev->>Dev: Revoke AWS access keys
    Dev->>Dev: Wipe + reinstall OS
    Dev->>Dev: Reissue fine-grained credentials only
    
    Note over GH,Cluster: Scenario B - CI Runner Compromise
    
    Oncall->>GH: gh workflow disable <name>
    Oncall->>Cluster: Scale Arc runner replicas to 0
    Oncall->>Oncall: List runner access scope
    Oncall->>Oncall: Check GitHub Actions audit log
    
    Note over Dev,Cluster: Scenario C - Compromised Package
    
    Oncall->>Package: Yank/deprecate package version
    Oncall->>GH: Publish GHSA advisory
    Oncall->>Dev: Notify downstream consumers
    
    Note over GH,Cluster: Scenario D - Cluster Pod Compromise
    
    Oncall->>Cluster: kubectl cordon node
    Oncall->>Cluster: kubectl describe pod > capture.txt
    Oncall->>Cluster: kubectl delete pod
    Oncall->>Cluster: kubectl uncordon node
    
    Note over Dev,Oncall: Token Rotation Cadence (Quarterly)
    Note over GH,Cluster: Tabletop Exercise (Annual Q1)
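
The triage branches in the diagram above can be read as a simple routing table. The sketch below is illustrative only: the alert-kind labels (`falco-known-fp`, `ioc-match`, and so on) and the function name `triage_route` are hypothetical, and only the branch outcomes come from the diagram.

```shell
# Usage: triage_route <alert_kind> [we_published: yes|no]
# Prints the next action for an acknowledged alert, mirroring the diagram's branches.
triage_route() {
  case "$1" in
    falco-known-fp) echo "ack and tune Falco rule" ;;
    ioc-match)
      # The IOC branch splits on whether our org published the package.
      if [ "${2:-no}" = "yes" ]; then
        echo "Scenario C"
      else
        echo "pin known-good version and open PR"
      fi
      ;;
    secret)      echo "rotate secret and audit last 90 days" ;;
    workstation) echo "Scenario A" ;;
    ci-runner)   echo "Scenario B" ;;
    pod)         echo "Scenario D" ;;
    *)           echo "dig in and file follow-up ticket" ;;
  esac
}
```

For instance, `triage_route ioc-match yes` routes to Scenario C, while any unrecognized label falls through to the "unknown alert" branch.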

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread SECURITY-RUNBOOK.md Outdated
After step 3 deletes the pod (or scales the deployment to 0), a live
`kubectl get pod` lookup for the ServiceAccount in step 4 will fail or
return the SA of a freshly-recreated replacement pod. Read the SA from
the snapshot captured in step 2 so blast-radius auditing stays
executable after containment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
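
The ordering this commit describes (snapshot first, delete second, read the ServiceAccount from the snapshot third) might look like the sketch below. It is not the runbook's actual text: the function name `contain_pod`, its argument order, and the `pod.yaml` file name are assumptions made for illustration.

```shell
# Sketch only. Arguments: namespace, pod name, evidence directory (hypothetical layout).
contain_pod() (
  ns=$1
  pod=$2
  dir=$3
  mkdir -p "$dir" || exit 1
  # Step 2: snapshot the manifest while the pod still exists.
  kubectl get pod "$pod" -n "$ns" -o yaml > "$dir/pod.yaml" || exit 1
  # Step 3: contain. kubectl's confirmation goes to stderr so that stdout
  # carries only the ServiceAccount name.
  kubectl delete pod "$pod" -n "$ns" >&2 || exit 1
  # Step 4: read the ServiceAccount from the snapshot, not from the live API,
  # which may already be serving a replacement pod.
  awk '$1 == "serviceAccountName:" { print $2; exit }' "$dir/pod.yaml"
)
```

A caller could capture the result with `SA=$(contain_pod "$NS" "$POD" ./evidence)` and then use `$SA` for the blast-radius audit. The function body runs in a subshell, so its variables do not leak into the on-call's session.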

@cubic-dev-ai (bot) left a comment

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="SECURITY-RUNBOOK.md">

<violation number="1" location="SECURITY-RUNBOOK.md:356">
P1: Replace the non-portable `\s` with `[[:space:]]` in the grep fallback so the command works correctly on both GNU and BSD/macOS systems.</violation>
</file>

Comment thread SECURITY-RUNBOOK.md Outdated
BSD grep (default on macOS) does not honor the `\s` Perl-style
shorthand inside `-E` patterns. Switch to `[[:space:]]` so the
fallback works identically on GNU and BSD/macOS systems.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
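
A portable version of the grep fallback described in this commit might look like the following. This is a hedged sketch rather than the line from SECURITY-RUNBOOK.md: the function name `sa_from_snapshot` and the idea of passing the saved manifest path as an argument are assumptions. `[[:space:]]` is a POSIX character class, so the pattern does not depend on the `\s` shorthand.

```shell
# Hypothetical helper: print the serviceAccountName from a saved pod manifest.
# Usage: sa_from_snapshot <path-to-saved-pod-yaml>
sa_from_snapshot() {
  # Match the key with POSIX [[:space:]] so spaces and tabs are both accepted.
  grep -E '^[[:space:]]*serviceAccountName:[[:space:]]*' "$1" \
    | head -n 1 \
    | sed 's/^[[:space:]]*serviceAccountName:[[:space:]]*//'
}
```

The anchored pattern skips the older `serviceAccount:` field and any prefixed keys, so only the `serviceAccountName:` line is returned.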
@srt0422 added the shai-hulud label (Shai-Hulud supply-chain defense work) May 13, 2026
@spooktheducks spooktheducks merged commit 7365652 into allora-network:main May 14, 2026
1 check passed
srt0422 added a commit to srt0422/.github that referenced this pull request May 22, 2026
Documents the inaugural Shai-Hulud-class tabletop exercise: an injected
"eliza-allora-plugin was published with a postinstall payload yesterday
at 4pm" scenario that walks the team end-to-end through the
SECURITY-RUNBOOK (DEVOP-571).

The doc is operational, not a writeup. It contains:

* The injected scenario, including the specific exfil mechanics, the
  IOC discovery timeline, and the T+0 trigger.
* Pre-assigned roles (incident lead, communicator, executor, BE rep,
  FE rep, founder observer-only) with explicit don't-skip-a-role rule.
* Six phases keyed to runbook sections, each with a target elapsed
  time and explicit success/failure modes the facilitator watches
  for.
* The 30-minute time-to-clean-republish target broken into 4 phases
  (T+5 / T+10 / T+20 / T+30) so participants can self-check progress
  mid-exercise.
* A debrief script (6 questions, in order) that produces ticket
  inputs verbatim from the team's own language.
* Output checklist for the facilitator (Linear tickets, runbook PR,
  lessons-learned section update, next-year calendar invite).
* Notes-from-runbook-author section identifying the three seams in
  the runbook that the exercise should specifically stress.

The exercise itself is a team activity and is NOT considered complete
until the run + debrief actually happen. DEVOP-573 stays In Review
until the facilitator schedules and runs the live session.

Blocked-by: DEVOP-571 (runbook). PR allora-network#3 in this repo authors the
runbook; this PR cross-references it.

Refs: https://linear.app/alloralabs/issue/DEVOP-573

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

needs-human-review shai-hulud Shai-Hulud supply-chain defense work

3 participants