Add attention benchmarking tools #26835
Conversation
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
…into benchmark_attention Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
This commit fixes the attention benchmark to properly support both decode and prefill pipelines for MLA backends after the recent refactor.

Key changes:
- Added a MockKVBProj class to mock the KV projection layer for prefill mode
- Created _create_input_tensors() to generate both decode and prefill inputs
  - Decode: uses the kv_lora_rank (512) dimension
  - Prefill: uses qk_nope_head_dim (128) to stay under FlashAttention's 256 limit
- Added automatic mode selection: calls _forward_decode() or _forward_prefill() based on metadata.decode/metadata.prefill
- Fixed the threshold setting: changed from a class variable to an instance variable
- Added traceback printing for better error debugging

The benchmark now successfully compares decode vs. prefill pipelines: at qlen=2, decode=0.000033s and prefill=0.000303s, so decode is 9.09x faster.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
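The mode-selection change described above can be sketched as follows. This is a hypothetical, simplified illustration: the class and attribute names (BenchmarkRunner, MockAttentionMetadata, DummyBackend) are stand-ins, not vLLM's actual API.

```python
from dataclasses import dataclass


@dataclass
class MockAttentionMetadata:
    decode: object = None   # populated when the batch contains decode tokens
    prefill: object = None  # populated when the batch contains prefill tokens


class DummyBackend:
    def forward_decode(self) -> str:
        return "decode"

    def forward_prefill(self) -> str:
        return "prefill"


class BenchmarkRunner:
    def __init__(self, backend):
        self.backend = backend
        # Instance variable (not a class attribute), mirroring the
        # threshold fix mentioned in the commit message.
        self.threshold = 0.05

    def run(self, metadata: MockAttentionMetadata) -> str:
        # Automatic mode selection: dispatch on which metadata is present.
        if metadata.decode is not None:
            return self.backend.forward_decode()
        if metadata.prefill is not None:
            return self.backend.forward_prefill()
        raise ValueError("metadata has neither decode nor prefill info")


runner = BenchmarkRunner(DummyBackend())
print(runner.run(MockAttentionMetadata(decode=object())))   # decode
print(runner.run(MockAttentionMetadata(prefill=object())))  # prefill
```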
Code Review
This pull request introduces a comprehensive attention benchmarking suite, which is a valuable addition for performance tuning and backend selection. The implementation is well-structured, with clear separation of concerns for parsing batch specifications, running benchmarks, and formatting results. However, I've identified several critical issues related to the batch specification parser and its tests. The parser implementation is inconsistent with the test suite and some default configurations, which will lead to runtime errors and test failures. Specifically, the tests use an outdated grammar (with `spec` and `chunk` prefixes) that the parser doesn't support, and some default arguments are invalid. Additionally, helper functions for analyzing batch statistics will crash because they reference non-existent attributes. Addressing these inconsistencies is crucial to making the new benchmarking tools functional and reliable.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Purpose
Add tools for benchmarking attention backends. These can be used for parameter tuning as well as for selecting optimal backends for particular configurations. These tools were built with heavy use of Claude Code.
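A minimal sketch of how such a benchmark might time a candidate kernel. The `attention_stub` below is a pure-Python stand-in for a real attention backend call (the PR's tooling targets actual vLLM backends); the warmup/median structure is the illustrative part.

```python
import statistics
import timeit


def attention_stub(qlen: int) -> int:
    # Placeholder for a real attention kernel invocation; just does
    # some deterministic work proportional to qlen.
    return sum(i * i for i in range(qlen * 64))


def benchmark(fn, *args, warmup: int = 3, iters: int = 20) -> float:
    # Warm up before timing so caches/JITs don't skew the first samples.
    for _ in range(warmup):
        fn(*args)
    # Take the median of several single-shot timings; the median is
    # more robust to scheduler noise than the mean.
    times = [timeit.timeit(lambda: fn(*args), number=1) for _ in range(iters)]
    return statistics.median(times)


decode_t = benchmark(attention_stub, 2)
prefill_t = benchmark(attention_stub, 128)
print(f"decode={decode_t:.6f}s prefill={prefill_t:.6f}s "
      f"-> ratio {prefill_t / decode_t:.2f}x")
```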
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.