Conversation

@MatthewBonanni (Contributor) commented Oct 14, 2025

Purpose

Add tools for benchmarking attention backends. They can be used both for parameter tuning and for selecting the optimal backend for a particular configuration. These tools were built with heavy use of Claude Code.
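A benchmarking tool like this typically wraps each backend call in a warmup-then-measure harness. The sketch below is a minimal, generic illustration of that pattern; `benchmark` is a hypothetical helper, not the actual API added by this PR:

```python
import time
from statistics import median

def benchmark(fn, *, warmup: int = 10, iters: int = 100) -> float:
    """Return the median wall-clock time (seconds) of fn() over `iters` runs."""
    # Warm up first so caches, allocators, and any JIT compilation
    # do not pollute the measured iterations.
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    # Median is more robust to scheduler noise than the mean.
    return median(times)
```

Running `benchmark` once per (backend, batch shape) pair and tabulating the results is enough to rank backends for a given configuration.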

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

MatthewBonanni and others added 18 commits October 10, 2025 20:01
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
…into benchmark_attention

This commit fixes the attention benchmark to properly support both decode
and prefill pipelines for MLA backends after the recent refactor.

Key changes:
- Added MockKVBProj class to mock KV projection layer for prefill mode
- Created _create_input_tensors() to generate both decode and prefill inputs
  - Decode: uses kv_lora_rank (512) dimension
  - Prefill: uses qk_nope_head_dim (128) to stay under FlashAttention's 256 limit
- Added automatic mode selection: calls _forward_decode() or _forward_prefill()
  based on metadata.decode/metadata.prefill
- Fixed threshold setting: changed from a class attribute to an instance attribute
- Added traceback printing for better error debugging

The benchmark now successfully compares decode vs prefill pipelines:
  qlen=2: decode=0.000033s, prefill=0.000303s -> decode is 9.09x faster
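The automatic mode selection described above can be sketched as follows. This is a minimal illustration only; `AttentionMetadata`, `_forward_decode`, and `_forward_prefill` here are stand-ins for the real vLLM classes, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class AttentionMetadata:
    # Stand-in for the real metadata object: flags whether the batch
    # currently holds decode tokens or prefill tokens.
    decode: bool = False
    prefill: bool = False

def forward(metadata: AttentionMetadata) -> str:
    # Dispatch to the decode or prefill pipeline based on the metadata,
    # mirroring the mode selection described in the commit message.
    if metadata.decode:
        return "_forward_decode"
    if metadata.prefill:
        return "_forward_prefill"
    raise ValueError("metadata must mark at least one of decode/prefill")
```

Keeping the dispatch in one place means the benchmark can exercise either pipeline from the same entry point.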

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@mergify mergify bot added the performance Performance-related issues label Oct 14, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive attention benchmarking suite, which is a valuable addition for performance tuning and backend selection. The implementation is well-structured, with clear separation of concerns for parsing batch specifications, running benchmarks, and formatting results.

However, I've identified several critical issues related to the batch specification parser and its tests:

  • The parser implementation is inconsistent with the test suite and some default configurations, which will lead to runtime errors and test failures. In particular, the tests use an outdated grammar (with spec and chunk prefixes) that the parser doesn't support.
  • Some default arguments are invalid.
  • Helper functions for analyzing batch statistics will crash because they reference non-existent attributes.

Addressing these inconsistencies is crucial to making the new benchmarking tools functional and reliable.

benchmarks/attention_benchmarks/batch_spec.py (resolved)
benchmarks/attention_benchmarks/benchmark.py (resolved)
benchmarks/attention_benchmarks/test_batch_spec.py (resolved, outdated)
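To make the grammar-mismatch issue concrete, a batch-spec parser of the kind the review describes might look like the sketch below. The grammar shown here ("prefill:NxL" and "decode:N" entries) is invented purely for illustration; the actual grammar in batch_spec.py is not reproduced in this thread and may differ:

```python
import re

# Hypothetical grammar: a spec is a comma-separated list of entries such as
# "prefill:4x1024" (4 sequences of 1024 tokens) or "decode:8" (8 decode
# sequences). Anything else, e.g. an old "chunk:" prefix, is rejected.
_ENTRY = re.compile(r"^(prefill|decode):(\d+)(?:x(\d+))?$")

def parse_batch_spec(spec: str) -> list[tuple[str, int, int]]:
    entries = []
    for part in spec.split(","):
        m = _ENTRY.match(part.strip())
        if m is None:
            raise ValueError(f"invalid batch spec entry: {part!r}")
        mode, num_seqs, seq_len = m.group(1), int(m.group(2)), m.group(3)
        # Decode entries process one token per sequence by definition.
        entries.append((mode, num_seqs, int(seq_len) if seq_len else 1))
    return entries
```

Keeping the tests and the parser pointed at the same single grammar definition is what resolves the class of failure the review flags.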
@chatgpt-codex-connector bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.

benchmarks/attention_benchmarks/mla_runner.py (resolved, outdated)