Skip attach E2E test on in-box Windows PowerShell (20260614 image regression); cap CI job #2318
andyleejordan wants to merge 9 commits into `main` from `andyleejordan/reduce-ci-test-timeout`
The `CanAttachScriptWithPathMappings` E2E test intermittently hung `windows-latest` CI for the full six-hour default: three of the last eleven `main` runs died this way, all the same test, interspersed with green runs (a classic flaky race, not a regression). None of the commits whose runs hung touched the debugger attach path.

The hang mechanism lived in `ReadScriptLogLineAsync`: at EOF `StreamReader.ReadLineAsync()` completes *synchronously* with `null`, so the `while`/`await` polling loop never actually yielded. It busy-spun one CPU at 100%, which starved the scheduler so none of the existing cooperative safety nets (xUnit's `[SkippableFact(Timeout = 15000)]`, the 30s `debugTaskCts`, or `WaitForExitAsync`) could ever schedule their continuations. A flaky few-second race thus escalated into a six-hour wedge. Ironically the busy-loop landed in #2208, a PR meant to reduce flakiness, and lay dormant until #2251 added a Windows-racy attach test that actually hits the EOF spin.

- Back off with `await Task.Delay(100, token)` on EOF so we yield instead of busy-spinning, and cap the whole read with a 15s linked CTS that throws a clear `TimeoutException` naming the log path.
- Add `timeout-minutes: 15` to the `ci` job as a backstop so any future hang fails in 15 minutes instead of riding GitHub's 6-hour default. A normal run finishes well under that (Windows, the slowest, is ~12-14m).

The underlying attach race (reflection-based wait for `Debug-Runspace` to subscribe) is still worth hardening, but it now fails fast instead of hanging.

Drafted by Copilot (Claude Opus 4.8).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR addresses an intermittent six-hour CI hang on windows-latest caused by the CanAttachScriptWithPathMappings E2E test. The root cause is a busy-spin in ReadScriptLogLineAsync: at EOF, StreamReader.ReadLineAsync() completes synchronously with null, so the polling loop never yields, pegging a CPU at 100% and starving the cooperative timeouts (xUnit Timeout, internal CTS, WaitForExitAsync) that would otherwise abort the test. The change makes the reader yield and adds a CI-level backstop, fitting into the repo's broader effort (#2208) to reduce E2E test flakiness.
Changes:
- Add `await Task.Delay(100, token)` backoff on EOF in `ReadScriptLogLineAsync` so the reader yields instead of busy-spinning, and wrap the read in a 15s linked `CancellationTokenSource` that throws a descriptive `TimeoutException` naming the log path.
- Add an optional `CancellationToken` parameter (default), keeping all existing callers unchanged.
- Add `timeout-minutes: 15` to the `ci` job as a hung-run backstop.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `test/PowerShellEditorServices.Test.E2E/DebugAdapterProtocolMessageTests.cs` | Replaces the EOF busy-spin with a yielding backoff and a 15s timeout that fails fast with a clear message. |
| `.github/workflows/ci-test.yml` | Caps the `ci` job at 15 minutes so any future hang fails quickly instead of riding GitHub's 6-hour default. |
@JustinGrote at least there's a timeout now but I think I broke the build somehow...
…2303)"

This reverts commit b9fd1b3.

#2303 is what broke `CanAttachScriptWithPathMappings` on Windows. A clean bisection shows its parent (#2304, 6ad4f46) passed Windows E2E in ~12 minutes, while #2303 itself hung for 5h51m on that exact test -- and every commit built on top of it inherited the hang. Months of green Windows runs precede #2303.

The mechanism is in `PsesLoadContext.Load`. #2303 tightened `IsSatisfyingAssembly` to also require a matching public key token and culture. When a `$PSHOME` assembly previously satisfied a dependency by name+version, `Load` returned `null` and PSES *shared* PowerShell's single copy. Under the stricter check a token mismatch now fails that first test, so `Load` falls through and loads our *own* bundled copy into the isolated `PsesLoadContext` instead -- producing two copies of the same assembly in two load contexts and a split type identity. The debugger-attach handshake (`Debug-Runspace` subscribing to `RunspaceBase.AvailabilityChanged`, plus the stopped-event plumbing in SMA) relies on cross-context event wiring that silently breaks under such a split, so the attach never completes and the test waits forever. It only trips on Windows because that is where the `$PSHOME`-versus-bundled token divergence occurs. #2303's "no bundled dependency changes resolution" check was static and missed an assembly loaded dynamically during attach.

#2303 was self-described as "a focused trial of tightening" the matching, so reverting it restores the long-standing, known-good behavior. We can re-attempt the hardening later with this attach test as a guard.

Drafted by Copilot (Claude Opus 4.8).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ReadScriptLogLineAsync

The internal `CancelAfter` cap was 15s, exactly equal to the `[SkippableFact(Timeout = 15000)]` on `CanAttachScriptWithPathMappings`. Because xUnit's per-test timer covers the whole test -- attach, setBreakpoints, configurationDone and waiting for stopped events all run before `ReadScriptLogLineAsync` is even entered -- xUnit's generic timeout would almost always fire first, so the descriptive `TimeoutException` naming the log path would never surface for the very test that motivated it. Drop the cap to 10s so the clearer message can win for that test, while still bounding the untimed `[Fact]` callers.

Per review feedback from copilot-pull-request-reviewer on #2318.

Drafted by Copilot (Claude Opus 4.8).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reduce this branch to its one honest, effective change: a 30-minute `timeout-minutes` on the CI test job. A normal run finishes well under that (Windows, the slowest, is ~12-14 minutes), so the cap only bounds a hung test instead of letting it ride GitHub's 6-hour default. This un-reverts #2303 and drops the earlier `ReadScriptLogLineAsync` change, both of which were based on a per-commit bisection that has since been disproven. The Windows debugger-attach test `CanAttachScriptWithPathMappings` intermittently wedges on the attach handshake and rides the default timeout; the same hang reproduces on `main` (which contains #2303) and reproduced here with #2303 reverted, so #2303 is not the cause and is restored. The attach test wedges before it ever reaches `ReadScriptLogLineAsync`, so that change could not affect the hang and its short internal cap risked introducing new flakiness on a slow-but-healthy attach; it is reverted too. The intermittent attach hang is tracked separately. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
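For reference, a job-level cap of this kind might look like the fragment below. It is only a sketch: `timeout-minutes: 30` and the `ci` job name come from the discussion above, while the workflow name, trigger, runner, and steps are placeholders and not the actual contents of `ci-test.yml`.

```yaml
# Sketch of a job-level timeout in a GitHub Actions workflow.
# Everything except the `ci` job name and `timeout-minutes: 30` is a placeholder.
name: CI Tests (placeholder name)
on: [pull_request]
jobs:
  ci:
    runs-on: windows-latest
    # Cancel the whole job after 30 minutes instead of the 360-minute default.
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - name: Run tests (placeholder command)
        run: echo "run the test suite here"
```

Because the key sits at the job level, it bounds every step in the job together; a per-step `timeout-minutes` also exists if only one step needs a cap.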
CanAttachScriptWithPathMappings intermittently hung Windows CI for hours instead of failing fast. Its ReadScriptLogLineAsync tailed the script log with `while (...) await ReadLineAsync()`, but at EOF ReadLineAsync completes synchronously with null, so the loop never released its thread-pool thread. On constrained CI runners that starved the pool, which both wedged the DAP client's background I/O and prevented the xUnit (15s) and harness (30s) timeout continuations from ever running -- so a transient stall rode the job timeout for hours. Await a short delay between reads so the tail loop yields, and add a matching sleep to the child process's Debug-Runspace readiness poll so it cannot peg a core during the attach handshake. Combined with the 30-minute CI job cap, a genuine stall now fails fast via the test's own timeout instead of hanging. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CanAttachScriptWithPathMappings hangs on in-box Windows PowerShell 5.1 since the windows-2025-vs2026 runner image refreshed from 20260608 to 20260614. The cross-process Debug-Runspace attach wedges and the test rides the job timeout; the windows-latest leg cannot complete. Scope the skip to IsWindowsPowerShell so the in-box WinPS suites (including CLM) are exempt while PowerShell Core, the preview, macOS, and Linux keep full coverage of the attach path. This is a stopgap pending a real fix for the in-box attach deadlock, tracked by #2323; the 30-minute timeout-minutes backstop in ci-test.yml stays as a guard. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The earlier comment asserted the EOF tight-loop was the cause of the multi-hour Windows hang. Deconfounding analysis disproved that: the hang is the in-box Windows PowerShell attach regression from the 20260614 runner image, not thread-pool starvation here. Keep the yield as genuine harness hardening but describe it as such rather than claiming it as the fix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
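The yield kept as harness hardening can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `DebugAdapterProtocolMessageTests.cs`: the `LogTail` class, the `ReadLineWithBackoffAsync` name, and the `pollInterval` parameter are hypothetical.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; not the PSES test harness.
public static class LogTail
{
    // Polls the reader until a line is available, yielding between attempts.
    public static async Task<string> ReadLineWithBackoffAsync(
        StreamReader reader,
        TimeSpan pollInterval,
        CancellationToken token = default)
    {
        while (true)
        {
            // At EOF this returns null; per the commit messages above it can
            // complete synchronously, so awaiting it alone may never yield.
            string line = await reader.ReadLineAsync();
            if (line != null)
            {
                return line;
            }

            // Task.Delay gives up the thread between polls and throws an
            // OperationCanceledException if the caller cancels the token.
            await Task.Delay(pollInterval, token);
        }
    }
}
```

A caller would pass a token tied to whatever deadline it already has, so a missing log line surfaces as a cancellation rather than an unbounded wait. The sketch deliberately adds no internal time cap, in line with the later decision on this branch to drop the short cap.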
What
`CanAttachScriptWithPathMappings` (added in #2251) started hanging the Windows leg of CI for hours, riding GitHub's 6-hour default job timeout without ever throwing.

This PR is a stopgap, not the real fix: it skips the test on in-box Windows PowerShell and caps the CI job so the `windows-latest` leg can complete again.

The underlying in-box attach deadlock is tracked by #2323.
Root cause: a `windows-latest` runner-image refresh

This is not a code regression in PSES. By comparing the last green and first red `main` runs:

- Last green (#2304) ran on image `win25-vs2026/20260608.135`.
- First red (#2303) and every red after it ran on image `20260614.141`.

Same image family, same runner (`2.335.1`), same `-Preview` PowerShell. The only thing that changed at the boundary is the weekly OS-servicing patch in the image. That refresh broke in-box Windows PowerShell 5.1's cross-process `Debug-Runspace` / `Enter-PSHostProcess` attach, which is exactly what this test exercises. The precise servicing delta (likely a .NET Framework / Windows IPC update) is still unknown and is the subject of #2323.

The hang is specifically the in-box Windows PowerShell E2E suite (`TestE2EPowerShell`). PowerShell Core (`TestE2EPwsh`) and the preview pass the same attach test, so the skip is scoped to `IsWindowsPowerShell` (covering the WinPS and WinPS-CLM suites) and Core / preview / macOS / Linux keep full coverage of the attach path.

Why #2303 is not the cause

An earlier per-commit bisection fingered #2303 (the strong-name identity change), but that was confounded: #2303 happened to be the first merge onto the new image, so the image bump and the code change moved together. Two independent proofs exonerate it:

- `PsesLoadContext.cs` (the file #2303 touched) is `<Compile Remove>`'d for `net462` and uses the Core-only `AssemblyLoadContext`. In-box WinPS 5.1 loads `bin/Desktop/` (net462), so #2303's code isn't even compiled into the hanging configuration.

#2303 is left intact.
Follow-up
Stopgap for #2323. Once the real in-box attach fix lands, the `Skip.If` here should be removed; the `timeout-minutes` backstop can stay as a permanent guard.