
Because of my use case, I use LLMs with very high concurrency (~500), which means I need to spin up 500 inference requests at the same time on my local client, and this makes the end-to-end time unstable. For example, in one of my tests, the total end-to-end latency for 500 concurrent inference requests was around 4.19 seconds, but the longest individual request only took 3.16s. Is it possible to handle this kind of batched request on TensorZero (just normal inferences grouped together) to reduce the overhead of creating HTTP connections?

Also, if possible, providing a method to avoid duplicating arguments across batched requests would be really helpful in reducing network overhead.
For example, my requests could contain two arguments, context and search_result, and my (batched) inference requests could look like this:

[
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "arguments": {
                            "context": "User is asking about recent news.",
                            "search_result": "{news 1}"
                        }
                    }
                ]
            }
        ]
    },
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "arguments": {
                            "context": "User is asking about recent news.",
                            "search_result": "{news 2}"
                        }
                    }
                ]
            }
        ]
    }
]

The "User is asking about recent news." could be grouped to avoid repeating the same long context multiple times, which makes the request payload too large.


Replies: 1 comment · 1 reply


Hi @ProblemFactory,

The gateway itself should be able to handle way more than 500 concurrent inferences with minimal overhead. Let's try to diagnose what's going on:

  1. Are you using 2025.9.3 (published ~12h ago)? It includes further performance optimizations.
  2. Is the gateway located on the same machine, or at least in the same cloud region, as your application?
  3. Are you using the TensorZero Python SDK? Are you using build_http? Are you initializing the client once and re-using it for all the requests? (See the sketch after this list.)
  4. What's the approximate size of each inference input? (tokens/characters in the arguments)

Thanks!

1 reply
@ProblemFactory

I am not saying the gateway is the issue. I mean that for my use case, I need to create and manage 500 requests from my client, which adds significant overhead for latency-aware use cases. Combining those into one request could help reduce the connection overhead and the amount of data being transmitted. So this post is a feature request, not a bug report.
BTW, the 500 inference requests total about 10 MB of input. I am using Python aiohttp to send HTTP requests to the TensorZero gateway.
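
For context, my client side looks roughly like this (simplified; the /inference path and payload details are just what I currently assume):

import asyncio
import aiohttp

GATEWAY_URL = "http://localhost:3000/inference"  # assumed gateway inference endpoint

async def run_batch(payloads):
    # One shared session / connection pool for all 500 requests,
    # rather than opening a new connection per request.
    connector = aiohttp.TCPConnector(limit=500)
    async with aiohttp.ClientSession(connector=connector) as session:

        async def post_one(payload):
            async with session.post(GATEWAY_URL, json=payload) as resp:
                return await resp.json()

        # Send all requests concurrently and wait for every response.
        return await asyncio.gather(*(post_one(p) for p in payloads))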
