Skip attach E2E test on in-box Windows PowerShell (20260614 image regression); cap CI job #2318

Open
andyleejordan wants to merge 9 commits into PowerShell/PowerShellEditorServices:main from andyleejordan/reduce-ci-test-timeout

andyleejordan (Member) commented Jun 18, 2026

What

CanAttachScriptWithPathMappings (added in #2251) started hanging the Windows leg of CI for hours, riding GitHub's 6-hour default job timeout without ever throwing.

This PR is a stopgap, not the real fix:

  1. Skip the test on in-box Windows PowerShell so the windows-latest leg can complete again.
  2. Cap the CI test job at 30 minutes as a backstop so any future stall fails fast instead of burning a 6-hour runner.
  3. Minor harness hardening (yield instead of tight EOF/poll spins) that is good hygiene regardless.

The underlying in-box attach deadlock is tracked by #2323.

Root cause — a windows-latest runner-image refresh

This is not a code regression in PSES. Comparing the last green and first red main runs shows:

  • Last green (#2304) ran on image win25-vs2026/20260608.135.
  • First red (#2303) and every red after it ran on image 20260614.141.

Same image family, same runner (2.335.1), same -Preview PowerShell. The only thing that changed at the boundary is the weekly OS-servicing patch in the image. That refresh broke in-box Windows PowerShell 5.1's cross-process Debug-Runspace / Enter-PSHostProcess attach, which is exactly what this test exercises. The precise servicing delta (likely a .NET Framework / Windows IPC update) is still unknown and is the subject of #2323.

The hang is specifically the in-box Windows PowerShell E2E suite (TestE2EPowerShell). PowerShell Core (TestE2EPwsh) and the preview pass the same attach test, so the skip is scoped to IsWindowsPowerShell (covering the WinPS and WinPS-CLM suites) and Core / preview / macOS / Linux keep full coverage of the attach path.

Why #2303 is not the cause

An earlier per-commit bisection fingered #2303 (the strong-name identity change), but that was confounded: #2303 happened to be the first merge onto the new image, so the image bump and the code change moved together. Two independent proofs exonerate it.

#2303 is left intact.
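
The confounding described above can be sketched as a small consistency check. This is an illustrative Python sketch, not part of the PR: the `Run` record and `consistent_with` helper are hypothetical, the first two records encode the last-green and first-red runs named above, and the third encodes the later run of this branch with #2303 reverted (its image version is an assumption, based on the statement that every red run used 20260614.141).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    label: str
    has_2303: bool  # does the tested commit include #2303?
    image: str      # runner image version
    hung: bool


def consistent_with(runs, key) -> bool:
    """True if each value of key(run) always maps to the same outcome."""
    outcomes = {}
    for run in runs:
        outcomes.setdefault(key(run), set()).add(run.hung)
    return all(len(seen) == 1 for seen in outcomes.values())


bisection = [
    Run("last green (#2304)", False, "20260608.135", False),
    Run("first red (#2303)", True, "20260614.141", True),
]
# Assumed image version for the run with #2303 reverted.
reverted = Run("branch with #2303 reverted", False, "20260614.141", True)

for runs in (bisection, bisection + [reverted]):
    print(
        "commit explains outcome:", consistent_with(runs, lambda r: r.has_2303),
        "| image explains outcome:", consistent_with(runs, lambda r: r.image),
    )
```

With only the two bisection runs, both factors fit the outcomes equally well, which is why the bisection alone could not tell them apart. Adding a run that varies one factor while holding the other fixed is what breaks the tie.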

Follow-up

Stopgap for #2323. Once the real in-box attach fix lands, the Skip.If here should be removed; the timeout-minutes backstop can stay as a permanent guard.

The `CanAttachScriptWithPathMappings` E2E test intermittently hung
`windows-latest` CI for the full six-hour default — three of the last
eleven `main` runs died this way, all the same test, interspersed with
green runs (a classic flaky race, not a regression). None of the commits
whose runs hung touched the debugger attach path.

The hang mechanism lived in `ReadScriptLogLineAsync`: at EOF
`StreamReader.ReadLineAsync()` completes *synchronously* with `null`, so
the `while`/`await` polling loop never actually yielded. It busy-spun one
CPU at 100%, which starved the scheduler so none of the existing
cooperative safety nets — xUnit's `[SkippableFact(Timeout = 15000)]`, the
30s `debugTaskCts`, or `WaitForExitAsync` — could ever schedule their
continuations. A flaky few-second race thus escalated into a six-hour
wedge. Ironically the busy-loop landed in #2208, a PR meant to reduce
flakiness, and lay dormant until #2251 added a Windows-racy attach test
that actually hits the EOF spin.

- Back off with `await Task.Delay(100, token)` on EOF so we yield instead
  of busy-spinning, and cap the whole read with a 15s linked CTS that
  throws a clear `TimeoutException` naming the log path.
- Add `timeout-minutes: 15` to the `ci` job as a backstop so any future
  hang fails in 15 minutes instead of riding GitHub's 6-hour default. A
  normal run finishes well under that (Windows, the slowest, is ~12-14m).

The underlying attach race (reflection-based wait for `Debug-Runspace` to
subscribe) is still worth hardening, but it now fails fast instead of
hanging.
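
The EOF spin and the backoff described in this commit message can be illustrated with an analogous Python `asyncio` sketch (the harness itself is C#; this is not its code). The names `read_line` and `tail_line` and the use of a `deque` as the log source are assumptions made for illustration. Awaiting a coroutine that returns without suspending never hands control back to the event loop, which mirrors a read that completes synchronously at EOF.

```python
import asyncio
from collections import deque


async def read_line(lines: deque):
    """Hypothetical reader: at 'EOF' it returns None without ever suspending."""
    return lines.popleft() if lines else None


async def tail_line(lines: deque, log_path: str,
                    cap_s: float = 15.0, backoff_s: float = 0.1) -> str:
    """Return the next line, sleeping between empty reads so other tasks can run."""

    async def poll() -> str:
        while True:
            line = await read_line(lines)
            if line is not None:
                return line
            # Without this sleep the loop never yields, so the timeout
            # below (and every other task) would never get to run.
            await asyncio.sleep(backoff_s)

    try:
        return await asyncio.wait_for(poll(), timeout=cap_s)
    except asyncio.TimeoutError:
        # Name the log path so the failure is self-explanatory.
        raise TimeoutError(
            f"Timed out after {cap_s}s waiting for a line in {log_path}"
        ) from None
```

A caller would use it as `await tail_line(lines, "script.log")`. The sleep is what makes the surrounding timeout enforceable, since a cooperative timeout can only fire when the polling loop yields.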

Drafted by Copilot (Claude Opus 4.8).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
andyleejordan requested a review from a team as a code owner June 18, 2026 18:41
Copilot AI review requested due to automatic review settings June 18, 2026 18:41

Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses an intermittent six-hour CI hang on windows-latest caused by the CanAttachScriptWithPathMappings E2E test. The root cause is a busy-spin in ReadScriptLogLineAsync: at EOF, StreamReader.ReadLineAsync() completes synchronously with null, so the polling loop never yields, pegging a CPU at 100% and starving the cooperative timeouts (xUnit Timeout, internal CTS, WaitForExitAsync) that would otherwise abort the test. The change makes the reader yield and adds a CI-level backstop, fitting into the repo's broader effort (#2208) to reduce E2E test flakiness.

Changes:

  • Add await Task.Delay(100, token) backoff on EOF in ReadScriptLogLineAsync so the reader yields instead of busy-spinning, and wrap the read in a 15s linked CancellationTokenSource that throws a descriptive TimeoutException naming the log path.
  • Add an optional CancellationToken parameter (defaulting to default), keeping all existing callers unchanged.
  • Add timeout-minutes: 15 to the ci job as a hung-run backstop.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/PowerShellEditorServices.Test.E2E/DebugAdapterProtocolMessageTests.cs | Replaces the EOF busy-spin with a yielding backoff and a 15s timeout that fails fast with a clear message. |
| .github/workflows/ci-test.yml | Caps the ci job at 15 minutes so any future hang fails quickly instead of riding GitHub's 6-hour default. |

Comment thread .github/workflows/ci-test.yml Outdated
Comment thread test/PowerShellEditorServices.Test.E2E/DebugAdapterProtocolMessageTests.cs Outdated
andyleejordan (Member, Author) commented

@JustinGrote at least there's a timeout now but I think I broke the build somehow...

andyleejordan and others added 2 commits June 18, 2026 13:51
…2303)"

This reverts commit b9fd1b3.

#2303 is what broke `CanAttachScriptWithPathMappings` on Windows. A clean
bisection shows its parent (#2304, 6ad4f46) passed Windows E2E in ~12
minutes, while #2303 itself hung for 5h51m on that exact test -- and every
commit built on top of it inherited the hang. Months of green Windows runs
precede #2303.

The mechanism is in `PsesLoadContext.Load`. #2303 tightened
`IsSatisfyingAssembly` to also require a matching public key token and
culture. When a `$PSHOME` assembly previously satisfied a dependency by
name+version, `Load` returned `null` and PSES *shared* PowerShell's single
copy. Under the stricter check a token mismatch now fails that first test,
so `Load` falls through and loads our *own* bundled copy into the isolated
`PsesLoadContext` instead -- producing two copies of the same assembly in
two load contexts and a split type identity. The debugger-attach handshake
(`Debug-Runspace` subscribing to `RunspaceBase.AvailabilityChanged`, plus
the stopped-event plumbing in SMA) relies on cross-context event wiring
that silently breaks under such a split, so the attach never completes and
the test waits forever. It only trips on Windows because that is where the
`$PSHOME`-versus-bundled token divergence occurs. #2303's "no bundled
dependency changes resolution" check was static and missed an assembly
loaded dynamically during attach.

#2303 was self-described as "a focused trial of tightening" the matching,
so reverting it restores the long-standing, known-good behavior. We can
re-attempt the hardening later with this attach test as a guard.

Drafted by Copilot (Claude Opus 4.8).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
andyleejordan changed the title from "Reduce CI test timeout and fix busy-spin in ReadScriptLogLineAsync" to "Revert #2303 to fix Windows debugger-attach CI hang" Jun 18, 2026
andyleejordan enabled auto-merge (squash) June 18, 2026 20:55
The internal `CancelAfter` cap was 15s, exactly equal to the
`[SkippableFact(Timeout = 15000)]` on `CanAttachScriptWithPathMappings`.
Because xUnit's per-test timer covers the whole test -- attach,
setBreakpoints, configurationDone and waiting for stopped events all run
before `ReadScriptLogLineAsync` is even entered -- xUnit's generic
timeout would almost always fire first, so the descriptive
`TimeoutException` naming the log path would never surface for the very
test that motivated it.

Drop the cap to 10s so the clearer message can win for that test, while
still bounding the untimed `[Fact]` callers. Per review feedback from
copilot-pull-request-reviewer on #2318.
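
The ordering argument above (an inner cap must expire before the outer per-test timer for its message to surface) can be sketched in Python `asyncio` as an analogy. `LogReadTimeout`, `wait_for_log_line`, and `run_test` are hypothetical names, and the durations are scaled down from the 15s and 10s values in the commit message.

```python
import asyncio


class LogReadTimeout(Exception):
    """Hypothetical descriptive error raised by the inner wait."""


async def wait_for_log_line(log_path: str, cap_s: float) -> str:
    """Inner wait: stands in for a read that never produces a line."""
    try:
        await asyncio.wait_for(asyncio.sleep(3600), timeout=cap_s)
    except asyncio.TimeoutError:
        raise LogReadTimeout(f"no line in {log_path} after {cap_s}s") from None
    return ""


async def run_test(outer_s: float, inner_s: float, setup_s: float = 0.05) -> str:
    """Run setup plus the inner wait under an outer timer; report which one fired."""

    async def body() -> None:
        await asyncio.sleep(setup_s)  # stands in for the attach and breakpoint setup
        await wait_for_log_line("script.log", inner_s)

    try:
        await asyncio.wait_for(body(), timeout=outer_s)
    except LogReadTimeout as exc:
        return str(exc)
    except asyncio.TimeoutError:
        return "generic test timeout"
    return "passed"
```

Calling `asyncio.run(run_test(0.2, 0.2))` reports the generic timeout because the outer timer also covers the setup phase, while `asyncio.run(run_test(0.5, 0.1))` surfaces the descriptive message.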

Drafted by Copilot (Claude Opus 4.8).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
andyleejordan and others added 2 commits June 18, 2026 14:04
Reduce this branch to its one honest, effective change: a 30-minute
`timeout-minutes` on the CI test job. A normal run finishes well under
that (Windows, the slowest, is ~12-14 minutes), so the cap only bounds a
hung test instead of letting it ride GitHub's 6-hour default.

This un-reverts #2303 and drops the earlier `ReadScriptLogLineAsync`
change, both of which were based on a per-commit bisection that has since
been disproven. The Windows debugger-attach test
`CanAttachScriptWithPathMappings` intermittently wedges on the attach
handshake and rides the default timeout; the same hang reproduces on
`main` (which contains #2303) and reproduced here with #2303 reverted, so
#2303 is not the cause and is restored. The attach test wedges before it
ever reaches `ReadScriptLogLineAsync`, so that change could not affect the
hang and its short internal cap risked introducing new flakiness on a
slow-but-healthy attach; it is reverted too. The intermittent attach hang
is tracked separately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
andyleejordan changed the title from "Revert #2303 to fix Windows debugger-attach CI hang" to "Cap CI test job at 30 minutes to bound a hung Windows attach test" Jun 18, 2026
CanAttachScriptWithPathMappings intermittently hung Windows CI for hours
instead of failing fast. Its ReadScriptLogLineAsync tailed the script log
with `while (...) await ReadLineAsync()`, but at EOF ReadLineAsync
completes synchronously with null, so the loop never released its
thread-pool thread. On constrained CI runners that starved the pool,
which both wedged the DAP client's background I/O and prevented the xUnit
(15s) and harness (30s) timeout continuations from ever running -- so a
transient stall rode the job timeout for hours.

Await a short delay between reads so the tail loop yields, and add a
matching sleep to the child process's Debug-Runspace readiness poll so it
cannot peg a core during the attach handshake. Combined with the
30-minute CI job cap, a genuine stall now fails fast via the test's own
timeout instead of hanging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
andyleejordan changed the title from "Cap CI test job at 30 minutes to bound a hung Windows attach test" to "Fix attach E2E test hang from thread-pool starvation; cap CI job" Jun 18, 2026
andyleejordan and others added 2 commits June 18, 2026 17:15
CanAttachScriptWithPathMappings hangs on in-box Windows PowerShell 5.1
since the windows-2025-vs2026 runner image refreshed from 20260608 to
20260614. The cross-process Debug-Runspace attach wedges and the test
rides the job timeout; the windows-latest leg cannot complete.

Scope the skip to IsWindowsPowerShell so the in-box WinPS suites
(including CLM) are exempt while PowerShell Core, the preview, macOS, and
Linux keep full coverage of the attach path. This is a stopgap pending a
real fix for the in-box attach deadlock, tracked by #2323; the 30-minute
timeout-minutes backstop in ci-test.yml stays as a guard.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The earlier comment asserted the EOF tight-loop was the cause of the
multi-hour Windows hang. Deconfounding analysis disproved that: the hang
is the in-box Windows PowerShell attach regression from the 20260614
runner image, not thread-pool starvation here. Keep the yield as genuine
harness hardening but describe it as such rather than claiming it as the
fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
andyleejordan changed the title from "Fix attach E2E test hang from thread-pool starvation; cap CI job" to "Skip attach E2E test on in-box Windows PowerShell (20260614 image regression); cap CI job" Jun 19, 2026