blocking batch inference via regular LLM endpoints (not batch ones) #3509
ProblemFactory started this conversation in Feature Requests
Replies: 1 comment · 1 reply
-
Hi @ProblemFactory, The gateway itself should be able to handle way more than 500 concurrent inferences with minimal overhead. Let's try to diagnose what's going on:
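A quick first check is whether the extra ~1s comes from client-side connection setup rather than the gateway itself. Here's a minimal sketch, assuming the gateway is reachable at `http://localhost:3000/inference` and a function named `my_function` (both placeholders for your actual setup), that reuses a single connection pool and compares the slowest individual request against the total wall time:

```python
import asyncio
import time

import aiohttp

GATEWAY_URL = "http://localhost:3000/inference"  # assumed local gateway address
CONCURRENCY = 500


async def one_inference(session: aiohttp.ClientSession, i: int) -> float:
    """Send a single inference request and return its latency in seconds."""
    payload = {
        # "my_function" and the input shape are placeholders for your own config.
        "function_name": "my_function",
        "input": {
            "messages": [{"role": "user", "content": f"request {i}"}],
        },
    }
    start = time.perf_counter()
    async with session.post(GATEWAY_URL, json=payload) as resp:
        await resp.json()
    return time.perf_counter() - start


async def main() -> None:
    # A single session reuses TCP connections (keep-alive) across all requests,
    # so connection setup is not paid 500 times.
    connector = aiohttp.TCPConnector(limit=CONCURRENCY)
    async with aiohttp.ClientSession(connector=connector) as session:
        wall_start = time.perf_counter()
        latencies = await asyncio.gather(
            *(one_inference(session, i) for i in range(CONCURRENCY))
        )
        wall = time.perf_counter() - wall_start
    print(f"slowest single request: {max(latencies):.2f}s")
    print(f"total wall time:        {wall:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```

If the gap between the slowest request and the wall time shrinks with a shared session, the earlier overhead was mostly connection setup on the client; if it stays around a second, it's worth looking elsewhere (e.g. client-side serialization or event-loop scheduling).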
Thanks!
-
Because of my use case, I use LLMs with very high concurrency (~500), which means I need to spin up 500 inference requests at the same time on my local client, and that makes the end-to-end time unstable. For example, in one of my tests the total end-to-end latency for 500 concurrent inference requests was around 4.19 seconds, but the longest single request only took 3.16s. Is it possible to handle this kind of batched request in TensorZero (just normal inferences grouped together) to reduce the overhead of creating HTTP connections?
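To make the ask concrete, here is a purely illustrative sketch (the `inferences` list and the single-call semantics are hypothetical, not an existing TensorZero API) of "grouped together normal inferences": one blocking HTTP call carries a list of ordinary inference inputs and returns the results in the same order.

```python
import requests  # a single blocking call in place of 500 concurrent ones

# Hypothetical request shape -- the "inferences" list is illustrative only;
# it does not exist in TensorZero today.
grouped_request = {
    "function_name": "my_function",  # placeholder function name
    "inferences": [
        {"input": {"messages": [{"role": "user", "content": f"request {i}"}]}}
        for i in range(500)
    ],
}

# One HTTP connection and one request/response cycle for the whole batch;
# the gateway would fan the items out internally and block until every
# item has finished.
resp = requests.post("http://localhost:3000/inference", json=grouped_request)
results = resp.json()  # hypothetically: a list of 500 results, in order
```

Compared to 500 separate requests, the client pays for one connection and one request/response cycle, and the gateway can handle the fan-out itself.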
Also, if possible, providing a way to avoid duplicating arguments across batched requests would be really helpful for reducing network overhead.
For example, my request could contain two arguments, `context` and `search_result`, and my (batched) inference requests could look like the sketch below. The shared `context` ("User is asking about recent news.") could be grouped so it is not repeated for every item, since sending the same long context multiple times makes the request payload too large.
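A rough illustration of that shape (hypothetical field names such as `shared_arguments` and `inferences`, not an existing TensorZero schema): the long `context` is sent once at the top level, and only the small per-item `search_result` varies.

```python
# Hypothetical deduplicated batch payload -- "shared_arguments" and
# "inferences" are illustrative names, not an existing TensorZero schema.
grouped_request = {
    "function_name": "my_function",  # placeholder function name
    "shared_arguments": {
        # Sent once instead of being repeated in all 500 items.
        "context": "User is asking about recent news.",
    },
    "inferences": [
        {"arguments": {"search_result": "result for query 0"}},
        {"arguments": {"search_result": "result for query 1"}},
        # ... one entry per batched inference, each only a few bytes
    ],
}
```

Without something like `shared_arguments`, the same `context` string would be serialized once per item, which is where the payload size blows up for long contexts.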