
Because of my use case, I use LLMs with very high concurrency (~500), which means I need to spin up 500 inference requests at the same time on my local client, and this makes the end-to-end time unstable. For example, in one of my tests, the total end-to-end latency for 500 concurrent inference requests was around 4.19 seconds, but the longest individual request only took 3.16s. Is it possible to handle this kind of batched request on TensorZero (just normal inferences grouped together) to reduce the overhead of creating HTTP connections?

Also, if possible, providing a method to avoid duplicating arguments across batched requests would be really helpful in reducing network overhead.
For example, my requests could contain two arguments, context and search_result, and my (batched) inference requests could look like this:

[
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "arguments": {
                            "context": "User is asking about recent news.",
                            "search_result": "{news 1}"
                        }
                    }
                ]
            }
        ]
    },
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "arguments": {
                            "context": "User is asking about recent news.",
                            "search_result": "{news 2}"
                        }
                    }
                ]
            }
        ]
    }
]

The "User is asking about recent news." could be grouped to avoid repeating the same long context multiple times, which makes the request payload too large.


Replies: 1 comment · 1 reply


Hi @ProblemFactory,

The gateway itself should be able to handle way more than 500 concurrent inferences with minimal overhead. Let's try to diagnose what's going on:

  1. Are you using 2025.9.3 (published ~12h ago)? It includes further performance optimizations.
  2. Is the gateway located on the same machine, or at least in the same cloud region, as your application?
  3. Are you using the TensorZero Python SDK? Are you using build_http? Are you initializing the client once and re-using it for all the requests? (See the sketch after this list.)
  4. What's the approximate size of each inference input? (tokens/characters in the arguments)

Thanks!

1 reply
@ProblemFactory

I am not saying the gateway is the issue. I mean that for my use case, I need to create and manage 500 requests from my client, which adds significant overhead for latency-aware use cases. Combining those into one request could help reduce the connection overhead and the amount of data being transmitted. So this post is a feature request, not a bug report.
BTW, the 500 inference requests total about 10 MB of input. I am using Python aiohttp to send HTTP requests to the TensorZero gateway.
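
For context, my client side looks roughly like this (simplified; the /inference path and payload details are just what I currently assume):

import asyncio
import aiohttp

GATEWAY_URL = "http://localhost:3000/inference"  # assumed gateway inference endpoint

async def run_batch(payloads):
    # One shared session / connection pool for all 500 requests,
    # rather than opening a new connection per request.
    connector = aiohttp.TCPConnector(limit=500)
    async with aiohttp.ClientSession(connector=connector) as session:

        async def post_one(payload):
            async with session.post(GATEWAY_URL, json=payload) as resp:
                return await resp.json()

        # Send all requests concurrently and wait for every response.
        return await asyncio.gather(*(post_one(p) for p in payloads))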
