Feat/resilience plugin #4537

Closed

chillum-codeX wants to merge 15 commits into google:main from chillum-codeX:feat/resilience-plugin

Conversation

@chillum-codeX

feat(plugins): LlmResiliencePlugin – configurable retries/backoff and model fallbacks

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

2. Or, if no issue exists, describe the change:

Problem:
Production agents need first-class resilience to transient LLM/API failures
(timeouts, HTTP 429/5xx). Today, retry/fallback logic is often ad-hoc and
duplicated across projects.

Solution:
Introduce an opt-in plugin, LlmResiliencePlugin, that handles transient LLM
errors with configurable retries (exponential backoff + jitter) and optional
model fallbacks, without modifying core runner/flow logic.
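For context, a minimal sketch of the retry-with-backoff-and-jitter pattern described above (illustrative only; the helper name call_with_backoff and parameters such as max_retries and base_delay are assumptions, not the plugin's actual API):

import asyncio
import random


async def call_with_backoff(call, max_retries=3, base_delay=0.5, max_delay=8.0):
  """Retry an async call on transient errors with exponential backoff and jitter."""
  for attempt in range(max_retries + 1):
    try:
      return await call()
    except (TimeoutError, ConnectionError):  # stand-in for transient-error detection
      if attempt == max_retries:
        raise  # retries exhausted; let the error bubble up
      delay = min(max_delay, base_delay * (2 ** attempt))
      delay *= random.uniform(0.5, 1.5)  # jitter to avoid synchronized retries
      await asyncio.sleep(delay)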

Summary

  • Added src/google/adk/plugins/llm_resilience_plugin.py.
  • Exported LlmResiliencePlugin in src/google/adk/plugins/__init__.py.
  • Added unit tests in
    tests/unittests/plugins/test_llm_resilience_plugin.py:
    • test_retry_success_on_same_model
    • test_fallback_model_used_after_retries
    • test_non_transient_error_bubbles
  • Added samples/resilient_agent.py demo.

Testing Plan

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

Command run:

.venv/Scripts/python -m pytest tests/unittests/plugins/test_llm_resilience_plugin.py -v

Result summary:

collected 3 items
tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_fallback_model_used_after_retries PASSED
tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_non_transient_error_bubbles PASSED
tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_retry_success_on_same_model PASSED
3 passed

Manual End-to-End (E2E) Tests:

Run sample:

.venv/Scripts/python samples/resilient_agent.py

Observed output:

LLM retry attempt 1 failed: TimeoutError('Simulated transient failure')
Collected 1 events
MODEL: Recovered on retry!

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules. (N/A; no dependent changes)

Additional context

  • Non-breaking: users opt in via
    Runner(..., plugins=[LlmResiliencePlugin(...)]); see the usage sketch after this list.
  • Transient detection currently targets common HTTP/timeouts and can be extended
    in follow-ups (e.g., per-exception policy, circuit breaking).
  • Live bidirectional streaming paths are out of scope for this PR.
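A minimal opt-in sketch for the point above (the plugin's constructor arguments max_retries and fallback_models are illustrative assumptions, not its confirmed API; only the Runner(..., plugins=[...]) opt-in path is stated in this PR):

from google.adk.agents import Agent
from google.adk.plugins import LlmResiliencePlugin
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService

# Assumed parameter names, for illustration only.
resilience = LlmResiliencePlugin(
    max_retries=3,                         # retry budget for transient errors
    fallback_models=["gemini-2.0-flash"],  # models tried after retries are exhausted
)

agent = Agent(name="resilient_agent", model="gemini-2.5-pro")

# Opt-in path described in the PR: pass the plugin to the Runner.
runner = Runner(
    agent=agent,
    app_name="resilient_app",
    session_service=InMemorySessionService(),
    plugins=[resilience],
)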

@google-cla

google-cla bot commented Feb 18, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Contributor

Summary of Changes

Hello @chillum-codeX, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the reliability of LLM interactions by introducing a centralized LlmResiliencePlugin. This plugin addresses the common problem of ad-hoc error handling for transient LLM/API failures (such as timeouts or HTTP 429/5xx errors) by providing a standardized, configurable mechanism for retries with exponential backoff and model fallbacks. The change improves the robustness of agents in production environments without altering the core application framework.

Highlights

  • New LLM Resilience Plugin: Introduced LlmResiliencePlugin to provide configurable retries with exponential backoff and jitter, along with optional model fallbacks, for handling transient LLM/API errors.
  • Opt-in and Non-breaking: The plugin is designed as an opt-in feature, integrating seamlessly without requiring modifications to core runner or flow logic, ensuring existing users are unaffected unless configured.
  • Comprehensive Testing and Sample: Includes dedicated unit tests covering retry success, fallback model usage, and correct propagation of non-transient errors, alongside a new resilient_agent.py sample demonstrating its practical application.


Changelog
  • CONTRIBUTION_NOTE.txt
    • Added a detailed contribution note explaining the plugin's implementation, rationale, design choices, and validation steps.
  • PR_BODY.md
    • Added the pull request description, including problem statement, solution, summary of changes, testing plan, and checklist.
  • samples/resilient_agent.py
    • Added a new sample agent demonstrating the configuration and behavior of the LlmResiliencePlugin.
  • src/google/adk/plugins/__init__.py
    • Updated to import and export the new LlmResiliencePlugin for discoverability within the plugin module.
  • src/google/adk/plugins/llm_resilience_plugin.py
    • Added the core implementation of the LlmResiliencePlugin, including logic for retry with exponential backoff and jitter, model fallback, and transient error detection.
  • tests/unittests/plugins/test_llm_resilience_plugin.py
    • Added unit tests to verify the LlmResiliencePlugin's functionality, covering scenarios such as retry success, fallback model usage, and correct error propagation.
Activity
  • The author performed a self-review of the code changes.
  • Unit tests were added for the new LlmResiliencePlugin and all passed locally.
  • Manual end-to-end tests were conducted using the new resilient_agent.py sample, demonstrating successful recovery from simulated failures.
  • The pull request description and testing evidence were updated to align with the repository's PR template.

@adk-bot added the core [Component] label on Feb 18, 2026
@adk-bot
Collaborator

adk-bot commented Feb 18, 2026

Response from ADK Triaging Agent

Hello @chillum-codeX, thank you for creating this PR!

It looks like you have not yet signed our Contributor License Agreement (CLA). Please visit https://cla.developers.google.com/ to sign it.

This information will help reviewers to review your PR more efficiently. Thanks!

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a valuable LlmResiliencePlugin for handling transient LLM errors through retries and fallbacks. The implementation is robust, covering both async generator and coroutine-based LLM providers, and includes comprehensive unit tests for the core logic. The addition of a sample application is also very helpful for understanding its usage.

My review includes a few suggestions to improve maintainability and fix a minor issue in the sample code. Specifically, I've recommended adding type hints to a helper function, narrowing the scope of some exception handlers, and correcting the state management in the demo model to ensure it behaves as intended.
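As a side note on the "async generator and coroutine-based LLM providers" point above, a minimal sketch of how such a dispatch can be done in general (illustrative only; the helper name _collect_responses is hypothetical and this is not the plugin's actual code):

import inspect


async def _collect_responses(call_result):
  """Normalize a provider result that may be an async generator or a coroutine."""
  if inspect.isasyncgen(call_result):
    return [response async for response in call_result]
  return [await call_result]  # coroutine-style providers resolve to a single response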

Comment on lines 27 to 48
class DemoFailThenSucceedModel(BaseLlm):
  model: str = "demo-fail-succeed"
  attempts: int = 0

  @classmethod
  def supported_models(cls) -> list[str]:
    return ["demo-fail-succeed"]

  async def generate_content_async(
      self, llm_request: LlmRequest, stream: bool = False
  ):
    # Fail for the first attempt, then succeed
    self.attempts += 1
    if self.attempts < 2:
      raise TimeoutError("Simulated transient failure")
    yield LlmResponse(
        content=types.Content(
            role="model",
            parts=[types.Part.from_text(text="Recovered on retry!")],
        ),
        partial=False,
    )
@gemini-code-assist (Contributor) commented (severity: medium):

The stateful logic in DemoFailThenSucceedModel relies on the attempts counter being incremented across calls. However, because the agent is configured with the model name as a string (model="demo-fail-succeed"), a new instance of DemoFailThenSucceedModel is created for the initial call and for each retry attempt. This resets self.attempts to 0 for each new instance, preventing the model from succeeding after a failure as intended in this demo.

To ensure the state is shared across these distinct instances, attempts should be a true class variable, accessed via the class itself.

class DemoFailThenSucceedModel(BaseLlm):
  model: str = "demo-fail-succeed"
  _attempts: int = 0

  @classmethod
  def supported_models(cls) -> list[str]:
    return ["demo-fail-succeed"]

  async def generate_content_async(
      self, llm_request: LlmRequest, stream: bool = False
  ):
    # Fail for the first attempt, then succeed
    DemoFailThenSucceedModel._attempts += 1
    if DemoFailThenSucceedModel._attempts < 2:
      raise TimeoutError("Simulated transient failure")
    yield LlmResponse(
        content=types.Content(
            role="model",
            parts=[types.Part.from_text(text="Recovered on retry!")],
        ),
        partial=False,
    )

@chillum-codeX (Author) replied:

What changed:
• Used ClassVar[int] from typing module to properly declare a class variable
• Access the counter via DemoFailThenSucceedModel.attempts instead of self._attempts
• This ensures the counter is shared across all instances created during retries (see the sketch below)

Verified:
• ✅ Sample runs correctly: Recovered on retry!
• ✅ Pushed to PR
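A minimal sketch of the ClassVar-based counter this reply describes (illustrative; the exact code pushed to the PR may differ):

from typing import ClassVar

from google.adk.models import BaseLlm


class DemoFailThenSucceedModel(BaseLlm):
  model: str = "demo-fail-succeed"
  # A ClassVar keeps one shared counter even though a new model instance is
  # created for the initial call and for each retry attempt.
  # Accessed as DemoFailThenSucceedModel.attempts inside generate_content_async.
  attempts: ClassVar[int] = 0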

@gemini-code-assist (Contributor) commented (severity: medium):

For better type safety and code clarity, please add type hints for the callback_context parameter and the return value of the _get_invocation_context method.

Suggested change

def _get_invocation_context(self, callback_context):

def _get_invocation_context(
    self, callback_context: CallbackContext | InvocationContext
) -> InvocationContext:

@chillum-codeX (Author) replied:

resolved

Comment on lines 203 to 204
@gemini-code-assist (Contributor) commented (severity: medium):

Using a broad except Exception: pass can hide unexpected errors and make debugging more difficult. It's better to catch only the specific exceptions you expect to handle. In this case, you're trying to guard against potential ImportError if the local import fails or AttributeError if the config structure is not as expected.

This same pattern is used in the _try_fallbacks method on lines 253-254 and should also be updated.

Suggested change

except Exception:
  pass

except (ImportError, AttributeError):
  pass

@chillum-codeX (Author) replied:
Updated both exception handlers from except ImportError: to except (ImportError, AttributeError): as suggested by the reviewer.

Adds plugin export, unit tests, resilient sample, PR body updates, and contribution note with validation evidence.
chillum-codeX and others added 5 commits February 18, 2026 22:43
- Add type hints and docstring to _get_invocation_context helper
- Narrow exception handlers from Exception to ImportError
- Fix demo model state management: use a shared class variable instead of an instance variable
Since the agent uses model name as string, new instances are created for each
retry. Use typing.ClassVar to ensure the attempts counter is shared across
all instances of DemoFailThenSucceedModel.
Catch only the specific exceptions expected when importing StreamingMode
or accessing config attributes, rather than broad Exception.
@ryanaiagent self-assigned this on Feb 19, 2026
@ryanaiagent added the community repo [Community] label on Feb 19, 2026
@ryanaiagent
Collaborator

Hi @chillum-codeX, thank you for your contribution! We appreciate you taking the time to submit this pull request.
Closing this PR here, as it belongs in the adk-python-community repo.
We highly recommend releasing the feature as a standalone package, which we will then share through https://google.github.io/adk-docs/integrations/

@chillum-codeX
Author

Thanks for the guidance! I've submitted the plugin to the community repo as suggested:

google/adk-python-community#90

Appreciate the feedback! 🙏


Labels

community repo [Community]: FRs/issues well suited for the google/adk-python-community repository
core [Component]: This issue is related to the core interface and implementation
