Feat/resilience plugin by chillum-codeX · Pull Request #4537 · google/adk-python

chillum-codeX · Feb 18, 2026

feat(plugins): LlmResiliencePlugin – configurable retries/backoff and model fallbacks

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

Closes: N/A
Related: Add Built-in Retry Mechanism for API Errors (e.g., 429 Too Many Requests) in LLM Agent #1214
Related: ADK retry mechanism doesn't handle common network errors (httpx.RemoteProtocolError) in production environments #2561
Related discussions: How to add LLM Fallback and Max retries in Agents #2292, llm request maximum timeout #3199

2. Or, if no issue exists, describe the change:

Problem:
Production agents need first-class resilience to transient LLM/API failures
(timeouts, HTTP 429/5xx). Today, retry/fallback logic is often ad-hoc and
duplicated across projects.

Solution:
Introduce an opt-in plugin, LlmResiliencePlugin, that handles transient LLM
errors with configurable retries (exponential backoff + jitter) and optional
model fallbacks, without modifying core runner/flow logic.
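
To make the retry-then-fallback idea concrete, here is a minimal sketch of exponential backoff with full jitter followed by ordered model fallbacks. It is not the plugin's implementation: `call_with_resilience`, `backoff_delay`, the `TRANSIENT` tuple, and all parameter names and defaults are hypothetical and chosen only for illustration.

```python
import asyncio
import random
from typing import Awaitable, Callable, Sequence

# Hypothetical set of exception types treated as transient in this sketch.
TRANSIENT = (TimeoutError, ConnectionError)


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
  """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
  return random.uniform(0.0, min(cap, base * (2 ** attempt)))


async def call_with_resilience(
    call: Callable[[str], Awaitable[str]],
    models: Sequence[str],
    max_retries: int = 2,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> str:
  """Tries models[0] with retries, then each fallback model in order."""
  last_error: Exception | None = None
  for model in models:
    for attempt in range(max_retries + 1):
      try:
        return await call(model)
      except TRANSIENT as err:
        # Only transient errors are retried; anything else propagates.
        last_error = err
        if attempt < max_retries:
          await sleep(backoff_delay(attempt))
  if last_error is None:
    raise ValueError("models must not be empty")
  raise last_error
```

Passing `sleep` as a parameter is a design choice for this sketch: tests can substitute a no-op coroutine so retries run without real delays.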

Summary

Added src/google/adk/plugins/llm_resilience_plugin.py.
Exported LlmResiliencePlugin in src/google/adk/plugins/__init__.py.
Added unit tests in
tests/unittests/plugins/test_llm_resilience_plugin.py:
- test_retry_success_on_same_model
- test_fallback_model_used_after_retries
- test_non_transient_error_bubbles
Added samples/resilient_agent.py demo.

Testing Plan

Unit Tests:

I have added or updated unit tests for my change.
All unit tests pass locally.

Command run:

.venv/Scripts/python -m pytest tests/unittests/plugins/test_llm_resilience_plugin.py -v

Result summary:

collected 3 items
tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_fallback_model_used_after_retries PASSED
tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_non_transient_error_bubbles PASSED
tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_retry_success_on_same_model PASSED
3 passed

Manual End-to-End (E2E) Tests:

Run sample:

.venv/Scripts/python samples/resilient_agent.py

Observed output:

LLM retry attempt 1 failed: TimeoutError('Simulated transient failure')
Collected 1 events
MODEL: Recovered on retry!

Checklist

I have read the CONTRIBUTING.md document.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have added tests that prove my fix is effective or that my feature works.
New and existing unit tests pass locally with my changes.
I have manually tested my changes end-to-end.
Any dependent changes have been merged and published in downstream modules. (N/A; no dependent changes)

Additional context

Non-breaking: users opt in via
Runner(..., plugins=[LlmResiliencePlugin(...)]).
Transient detection currently targets common HTTP/timeouts and can be extended
in follow-ups (e.g., per-exception policy, circuit breaking).
Live bidirectional streaming paths are out of scope for this PR.
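
As an illustration of what "transient detection" could look like, the sketch below classifies an exception as retryable when it is a timeout or connection error, or when it carries an HTTP status of 429 or 5xx. This is an assumption-laden example, not the plugin's code: `is_transient`, `status_of`, and the `status_code`/`code` attribute names are hypothetical.

```python
from typing import Optional


def status_of(err: BaseException) -> Optional[int]:
  """Returns an HTTP-like status from hypothetical attributes, if present."""
  for name in ("status_code", "code"):
    value = getattr(err, name, None)
    if isinstance(value, int):
      return value
  return None


def is_transient(err: BaseException) -> bool:
  """True for timeouts, connection errors, and HTTP 429/5xx statuses."""
  if isinstance(err, (TimeoutError, ConnectionError)):
    return True
  status = status_of(err)
  if status is None:
    return False
  return status == 429 or 500 <= status <= 599
```

A per-exception policy, as mentioned for follow-ups, could replace the single boolean with a lookup that also returns retry limits.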

google-cla · Feb 18, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

gemini-code-assist · Feb 18, 2026

Summary of Changes

Hello @chillum-codeX, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the reliability of LLM interactions by introducing a centralized LlmResiliencePlugin. This plugin addresses the common problem of ad-hoc error handling for transient LLM/API failures (such as timeouts or HTTP 429/5xx errors) by providing a standardized, configurable mechanism for retries with exponential backoff and model fallbacks. The change improves the robustness of agents in production environments without altering the core application framework.

Highlights

New LLM Resilience Plugin: Introduced LlmResiliencePlugin to provide configurable retries with exponential backoff and jitter, along with optional model fallbacks, for handling transient LLM/API errors.
Opt-in and Non-breaking: The plugin is designed as an opt-in feature, integrating seamlessly without requiring modifications to core runner or flow logic, ensuring existing users are unaffected unless configured.
Comprehensive Testing and Sample: Includes dedicated unit tests covering retry success, fallback model usage, and correct propagation of non-transient errors, alongside a new resilient_agent.py sample demonstrating its practical application.

Changelog

CONTRIBUTION_NOTE.txt
- Added a detailed contribution note explaining the plugin's implementation, rationale, design choices, and validation steps.
PR_BODY.md
- Added the pull request description, including problem statement, solution, summary of changes, testing plan, and checklist.
samples/resilient_agent.py
- Added a new sample agent demonstrating the configuration and behavior of the LlmResiliencePlugin.
src/google/adk/plugins/__init__.py
- Updated to import and export the new LlmResiliencePlugin for discoverability within the plugin module.
src/google/adk/plugins/llm_resilience_plugin.py
- Added the core implementation of the LlmResiliencePlugin, including logic for retry with exponential backoff and jitter, model fallback, and transient error detection.
tests/unittests/plugins/test_llm_resilience_plugin.py
- Added unit tests to verify the LlmResiliencePlugin's functionality, covering scenarios such as retry success, fallback model usage, and correct error propagation.

Activity

The author performed a self-review of the code changes.
Unit tests were added for the new LlmResiliencePlugin and all passed locally.
Manual end-to-end tests were conducted using the new resilient_agent.py sample, demonstrating successful recovery from simulated failures.
The pull request description and testing evidence were updated to align with the repository's PR template.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

adk-bot · Feb 18, 2026

Response from ADK Triaging Agent

Hello @chillum-codeX, thank you for creating this PR!

It looks like you have not yet signed our Contributor License Agreement (CLA). Please visit https://cla.developers.google.com/ to sign it.

This information will help reviewers to review your PR more efficiently. Thanks!

gemini-code-assist

Code Review

This pull request introduces a valuable LlmResiliencePlugin for handling transient LLM errors through retries and fallbacks. The implementation is robust, covering both async generator and coroutine-based LLM providers, and includes comprehensive unit tests for the core logic. The addition of a sample application is also very helpful for understanding its usage.

My review includes a few suggestions to improve maintainability and fix a minor issue in the sample code. Specifically, I've recommended adding type hints to a helper function, narrowing the scope of some exception handlers, and correcting the state management in the demo model to ensure it behaves as intended.

gemini-code-assist · Feb 18, 2026

samples/resilient_agent.py

+class DemoFailThenSucceedModel(BaseLlm):
+  model: str = "demo-fail-succeed"
+  attempts: int = 0
+
+  @classmethod
+  def supported_models(cls) -> list[str]:
+    return ["demo-fail-succeed"]
+
+  async def generate_content_async(
+      self, llm_request: LlmRequest, stream: bool = False
+  ):
+    # Fail for the first attempt, then succeed
+    self.attempts += 1
+    if self.attempts < 2:
+      raise TimeoutError("Simulated transient failure")
+    yield LlmResponse(
+        content=types.Content(
+            role="model",
+            parts=[types.Part.from_text(text="Recovered on retry!")],
+        ),
+        partial=False,
+    )


The stateful logic in DemoFailThenSucceedModel relies on the attempts counter being incremented across calls. However, because the agent is configured with the model name as a string (model="demo-fail-succeed"), a new instance of DemoFailThenSucceedModel is created for the initial call and for each retry attempt. This resets self.attempts to 0 for each new instance, preventing the model from succeeding after a failure as intended in this demo.

To ensure the state is shared across these distinct instances, attempts should be a true class variable, accessed via the class itself.

class DemoFailThenSucceedModel(BaseLlm):
  model: str = "demo-fail-succeed"
  _attempts: int = 0

  @classmethod
  def supported_models(cls) -> list[str]:
    return ["demo-fail-succeed"]

  async def generate_content_async(
      self, llm_request: LlmRequest, stream: bool = False
  ):
    # Fail for the first attempt, then succeed
    DemoFailThenSucceedModel._attempts += 1
    if DemoFailThenSucceedModel._attempts < 2:
      raise TimeoutError("Simulated transient failure")
    yield LlmResponse(
        content=types.Content(
            role="model",
            parts=[types.Part.from_text(text="Recovered on retry!")],
        ),
        partial=False,
    )

chillum-codeX · Feb 18, 2026

What changed:
• Used ClassVar[int] from the typing module to declare the counter as a class variable
• Access the counter via DemoFailThenSucceedModel.attempts instead of self.attempts
• This ensures the counter is shared across all instances created during retries

Verified:
• ✅ Sample runs correctly: Recovered on retry!
• ✅ Pushed to PR
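
The difference between a per-instance counter and a counter shared through the class can be sketched with standard-library dataclasses as a stand-in. The class names below are hypothetical, and the sketch does not reproduce how the real model class handles its fields; it only shows why a `ClassVar` survives the creation of a fresh instance per attempt.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class PerInstanceCounter:
  attempts: int = 0  # a field: each new instance starts again at 0

  def call(self) -> int:
    self.attempts += 1
    return self.attempts


@dataclass
class SharedCounter:
  attempts: ClassVar[int] = 0  # not a field: one value stored on the class

  def call(self) -> int:
    SharedCounter.attempts += 1
    return SharedCounter.attempts


# A new instance per call mimics a model being re-created for each retry.
print([PerInstanceCounter().call() for _ in range(3)])  # [1, 1, 1]
print([SharedCounter().call() for _ in range(3)])  # [1, 2, 3]
```

With the per-instance field, a "fail on the first attempt" check would fail on every retry, which matches the problem the review describes.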

gemini-code-assist · Feb 18, 2026

src/google/adk/plugins/llm_resilience_plugin.py

For better type safety and code clarity, please add type hints for the callback_context parameter and the return value of the _get_invocation_context method.

Suggested change

-  def _get_invocation_context(self, callback_context):
+  def _get_invocation_context(
+      self, callback_context: CallbackContext | InvocationContext
+  ) -> InvocationContext:

gemini-code-assist · Feb 18, 2026

src/google/adk/plugins/llm_resilience_plugin.py

Using a broad except Exception: pass can hide unexpected errors and make debugging more difficult. It's better to catch only the specific exceptions you expect to handle. In this case, you're trying to guard against potential ImportError if the local import fails or AttributeError if the config structure is not as expected.

This same pattern is used in the _try_fallbacks method on lines 253-254 and should also be updated.

Suggested change

-    except Exception:
-      pass
+    except (ImportError, AttributeError):
+      pass

chillum-codeX · Feb 18, 2026

Updated both exception handlers from except ImportError: to except (ImportError, AttributeError): as suggested by the reviewer.
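
To show why the narrower handler matters, the sketch below reads a nested attribute and swallows only the expected `AttributeError`, so an unrelated bug still surfaces. The handler in the PR also lists `ImportError` because it guards a local import, which this sketch omits. The helper `streaming_mode_of` and the `run_config.streaming_mode` attribute path are hypothetical names used for illustration.

```python
from types import SimpleNamespace


def streaming_mode_of(context: object) -> object:
  """Returns context.run_config.streaming_mode, or None if it is absent."""
  try:
    return context.run_config.streaming_mode
  except AttributeError:
    # Expected case: the context has no run_config or no streaming_mode.
    return None


class BrokenConfig:
  """Stand-in for a config whose attribute access hits an unrelated bug."""

  @property
  def streaming_mode(self):
    raise RuntimeError("unexpected bug while computing the mode")


print(streaming_mode_of(SimpleNamespace()))  # None
try:
  streaming_mode_of(SimpleNamespace(run_config=BrokenConfig()))
except RuntimeError as err:
  print("propagated:", err)
```

With `except Exception: pass`, the `RuntimeError` above would be silently discarded and the caller would see `None` instead.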

ryanaiagent · Feb 19, 2026

Hi @chillum-codeX, thank you for your contribution! We appreciate you taking the time to submit this pull request.
Closing this PR here as it belongs in the adk-python-community repo.
We highly recommend releasing the feature as a standalone package that we will then share through: https://google.github.io/adk-docs/integrations/

chillum-codeX · Feb 19, 2026

Thanks for the guidance! I've submitted the plugin to the community repo as suggested:

google/adk-python-community#90

Appreciate the feedback! 🙏

agent added 9 commits February 14, 2026 15:35

feat(plugins): add LlmResiliencePlugin with retry/backoff and model f…

f53a5d8

…allbacks; tests for plugin behavior

fix(plugins): use CallbackContext directly (no private attrs) in LlmR…

77c3aa8

…esiliencePlugin; import InvocationContext for typing; keep tests consistent

test(plugins): stabilize LlmResiliencePlugin tests; support Invocatio…

7f1cab2

…nContext or Context and robust asyncgen handling

docs(samples): add resilient_agent.py sample demonstrating LlmResilie…

0344186

…ncePlugin usage

chore(plugins): export LlmResiliencePlugin in plugins package __init__

2364f4c

fix(plugins): remove InvocationContext import to avoid circular impor…

971984b

…t when importing plugins package; rely on duck-typing for context access

docs: add PR_BODY.md describing LlmResiliencePlugin motivation, desig…

679f7ba

…n, tests, and usage

fix(plugins): duck-typed InvocationContext resolution to avoid NameEr…

c9875a0

…ror; all plugin tests pass

fix(samples): use valid agent name (underscores) in resilient_agent.py

a216a41

adk-bot added the "core [Component]" label (This issue is related to the core interface and implementation) Feb 18, 2026

gemini-code-assist bot reviewed Feb 18, 2026

feat(plugins): add LlmResiliencePlugin with retries and fallbacks

2179788

Adds plugin export, unit tests, resilient sample, PR body updates, and contribution note with validation evidence.

chillum-codeX force-pushed the feat/resilience-plugin branch from e0bb1fa to 2179788 Compare February 18, 2026 17:12

chillum-codeX and others added 5 commits February 18, 2026 22:43

Merge branch 'main' into feat/resilience-plugin

b3a6bac

fix: address PR review comments

20a7e71

- Add type hints and docstring to _get_invocation_context helper
- Narrow exception handlers from Exception to ImportError
- Fix demo model state management: use instance variable instead of class variable

fix(sample): use ClassVar for shared state across model instances

8261659

Since the agent uses model name as string, new instances are created for each retry. Use typing.ClassVar to ensure the attempts counter is shared across all instances of DemoFailThenSucceedModel.

fix: narrow exception handlers to (ImportError, AttributeError)

39a2277

Catch only the specific exceptions expected when importing StreamingMode or accessing config attributes, rather than broad Exception.

Merge branch 'main' into feat/resilience-plugin

f962117

ryanaiagent self-assigned this Feb 19, 2026

ryanaiagent added the "community repo [Community]" label (FRs/issues well suited for google/adk-python-community repository) Feb 19, 2026

ryanaiagent closed this Feb 19, 2026

Feat/resilience plugin #4537
chillum-codeX wants to merge 15 commits into google:main from chillum-codeX:feat/resilience-plugin

3 participants
