Conversation

wuxibin89 (Collaborator) commented Jun 20, 2025

Checklist Before Starting

  • Searched for similar PR(s).
  • Checked PR Title format
    • In format of: [modules] type: Title
    • modules are in fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • type is in feat, fix, refactor, chore, test
    • can involve multiple modules, separated by , or space, like [megatron, fsdp, doc] feat: xxx

What does this PR do?

Add AgentLoopBase and AgentLoopManager for agentic rollout.

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

High-Level Design

New Components:

  • AgentLoopBase: abstract class representing the rollout loop of a single prompt; within the loop, the agent may chat with an OpenAI-compatible LLM server and interact with various environments (a sketch follows below).
  • AsyncLLMServerManager: sends chat completion requests to multiple LLM servers, providing load balancing and sticky sessions.
  • AgentLoopManager: gets a batch of prompts from the dataloader and splits it across multiple AgentLoopWorker instances.
  • AgentLoopWorker: for each prompt, creates an AgentLoopBase instance and runs the loop task.

[Architecture diagram: AgentLoopManager dispatching prompts to AgentLoopWorkers, each running an AgentLoopBase loop against the AsyncLLMServerManager.]
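To make the abstraction concrete, here is a minimal sketch of the loop style (illustrative only; class and method names approximate the API in this PR rather than reproduce it):

```python
# Minimal sketch of the loop-based abstraction above. Names are
# illustrative; the merged API may differ in details.
from abc import ABC, abstractmethod


class AgentLoopBase(ABC):
    """Rollout loop for a single prompt."""

    def __init__(self, server_manager, max_turns: int = 10):
        self.server_manager = server_manager  # AsyncLLMServerManager
        self.max_turns = max_turns

    @abstractmethod
    async def run(self, messages: list, sampling_params: dict):
        ...


class ToolAgentLoop(AgentLoopBase):
    async def run(self, messages, sampling_params):
        num_turns = 0
        while num_turns < self.max_turns:  # a loop instead of recursion
            # Ask the OpenAI-compatible server for the next assistant message.
            reply = await self.server_manager.chat_completion(
                messages=messages, **sampling_params
            )
            messages.append(reply)
            num_turns += 1
            if not reply.get("tool_calls"):
                break  # plain answer: the trajectory is finished
            # Interact with the environment/tools, then loop again.
            messages.extend(await self.call_tools(reply["tool_calls"]))
        return messages, num_turns

    async def call_tools(self, tool_calls):
        # Hypothetical helper: execute each tool and return tool messages.
        return [{"role": "tool", "content": "..."} for _ in tool_calls]
```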

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this
```

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title description if it breaks any API.
  • Update the documentation about your changes in the docs.
  • New CI unit test(s) are added to cover the code path.
  • Rely on existing unit tests on CI that covers the code path.

vermouth1992 (Collaborator) commented Jun 20, 2025

@ccclyu @SwordFaith Could you help review this design? Basically, it converts recursive LLM interactions to a loop-based design to make writing complex agent loops easier. The original idea is from @zw0610.

@PeterSH6 PeterSH6 requested a review from zw0610 June 20, 2025 16:54
vermouth1992 (Collaborator) commented

@casper-hansen Could you please also take a look and give some comments? Thanks.

zw0610 (Collaborator) commented Jun 21, 2025

@vermouth1992 This is a compromise version of what we've discussed. I would still suggest moving the agent loop out of the AsyncLLMServer and manipulating it directly on the driver process.

vermouth1992 (Collaborator) commented

> @vermouth1992 This is a compromise version of what we've discussed. I would still suggest moving the agent loop out of the AsyncLLMServer and manipulating it directly on the driver process.

It runs on another Ray actor. If we remove the Ray actor, it runs directly on the driver process.

zw0610 (Collaborator) commented Jun 21, 2025

> > @vermouth1992 This is a compromise version of what we've discussed. I would still suggest moving the agent loop out of the AsyncLLMServer and manipulating it directly on the driver process.
>
> It runs on another Ray actor. If we remove the Ray actor, it runs directly on the driver process.

My concern lies more with the flexibility of debugging and customizing the agent loop. Moreover, if we consider agent loops that consume many more local resources, transforming them into distributed Ray actors would relieve the resource pressure on either the Ray actor you mentioned in this implementation or the driver process.

casper-hansen (Contributor) commented Jun 21, 2025

My feedback below.

  1. Code Quality Improvement: Agent loop effectively refactors ChatCompletionScheduler by replacing recursion with a while loop, making the implementation simpler for the end-user.
  2. Missing Extensibility: Agent loop lacks the extensibility that ChatCompletionScheduler provides via config.multi_turn.completion_callback. Adding similar functionality would make this more useful.
  3. Development Experience Issue (agree with @zw0610): Both implementations suffer from slow iteration cycles due to the Ray dependency. I've implemented a private interoperable CompletionCallback that works with both training and inference, allowing direct calls to o4-mini or Qwen3 without booting veRL (which takes minutes).

I hope the feedback is useful. Point 3 is a general challenge that I think can be overcome in a future PR (scope is too big), so I would suggest adding it to the roadmap.

vermouth1992 (Collaborator) commented

> My feedback below.
>
> 1. Code Quality Improvement: Agent loop effectively refactors ChatCompletionScheduler by replacing recursion with a while loop, making the implementation simpler for the end-user.
> 2. Missing Extensibility: Agent loop lacks the extensibility that ChatCompletionScheduler provides via config.multi_turn.completion_callback. Adding similar functionality would make this more useful.
> 3. Development Experience Issue (agree with @zw0610): Both implementations suffer from slow iteration cycles due to the Ray dependency.

Thanks for your feedback! For number 2, I guess we can provide a registration mechanism for agent loops so that you can define your own outside the verl codebase? The main motivation of the loop-based agent is easy implementation when the agent interacts with an environment (e.g., a Pokémon game). Just wondering, do you still prefer the recursive style or the loop style? For number 3, I agree that it requires substantial optimization to make it fast.

SwordFaith (Collaborator) left a review

Looking forward to discussing further.

```python
self.server_addresses = server_addresses

# Least requests load balancing
self.weighted_addresses = [[0, address] for address in server_addresses]
```
SwordFaith (Collaborator) commented:

Least-request load balancing may also end up sharding equal data sizes to each DP group if there is no rate limit or running-requests-aware backpressure mechanism.

wuxibin89 (Collaborator, author) replied:

Yes, we need a more server-pressure-aware load balancing strategy; @chenhaiq is investigating it.

```python
heapq.heapify(self.weighted_addresses)

# LRU cache to map request_id to address
self.request_id_to_address = LRUCache(maxsize=max_cache_size)
```
SwordFaith (Collaborator) commented:

Is it possible to use radix-aware or prefix-aware routing, hybridized with least-requests and max-running-requests awareness?

wuxibin89 (Collaborator, author) replied:

By collecting server performance metrics (e.g., current requests, GPU cache usage, CPU cache usage), we can implement a better routing strategy (a sketch of the current scheme follows below).
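Putting the two snippets above together, the routing in AsyncLLMServerManager is roughly the following (a hedged sketch reconstructed from this diff; the cachetools import and method names are assumptions):

```python
import heapq
from uuid import uuid4

from cachetools import LRUCache


class LeastRequestRouter:
    """Sticky sessions: a known request_id keeps hitting the same server;
    new request_ids go to the server with the fewest dispatched requests."""

    def __init__(self, server_addresses, max_cache_size=10000):
        # Least-requests load balancing: min-heap of [request_count, address].
        self.weighted_addresses = [[0, addr] for addr in server_addresses]
        heapq.heapify(self.weighted_addresses)
        # LRU cache mapping request_id -> address for sticky sessions.
        self.request_id_to_address = LRUCache(maxsize=max_cache_size)

    def choose(self, request_id=None):
        if request_id in self.request_id_to_address:
            return self.request_id_to_address[request_id]  # sticky session
        # Pop the least-loaded server, bump its count, push it back.
        entry = heapq.heappop(self.weighted_addresses)
        entry[0] += 1
        heapq.heappush(self.weighted_addresses, entry)
        request_id = request_id or uuid4().hex
        self.request_id_to_address[request_id] = entry[1]
        return entry[1]
```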


```python
return await self._chat_completions(address, request_id, messages=messages, tools=tools, **sampling_params, **kwargs)

async def _chat_completions(self, address: str, request_id: str, **chat_complete_request) -> ChatCompletion:
```
A collaborator commented:

So did we finally choose to keep HTTP server routing rather than Ray-based routing?

A collaborator commented:

Did you observe a performance diff?

SwordFaith (Collaborator) commented Jun 21, 2025:

We suffered concurrency issues with sgl-router, which made Zilin Zhu switch to HTTP/2 several weeks ago. But it may be OK here because we only use it between servers.
Another consideration is resource affinity inside the message list, e.g., a tool creating a resource like an image in each turn.
Using Ray would help with auto-affinity scheduling; using HTTP would require the user to set up an S3-like OSS service and would cause a lot of network traffic during multi-turn generation. I guess supporting Ray-based routing could improve the user experience with fewer infrastructure requirements.
HTTP-based routing, on the other hand, ports more easily to existing open-source OpenAI-compatible server implementations and keeps verl simpler, which is also an advantage.

wuxibin89 (Collaborator, author) commented Jun 21, 2025:

> So did we finally choose to keep HTTP server routing rather than Ray-based routing?

The message protocol (chat_complete_request) between client and server should be OpenAI compatible, but the transport protocol can be either HTTP or a Ray remote call. So there will be two _chat_completions implementations; the second is in development (a sketch of the HTTP variant follows below):

  • _chat_completions_http
  • _chat_completions_ray
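For illustration, the HTTP variant might look roughly like this (a hedged sketch assuming an aiohttp client against a standard OpenAI-compatible endpoint; not the exact code in this PR):

```python
import aiohttp


async def _chat_completions_http(address: str, request_id: str, **chat_complete_request) -> dict:
    # Same OpenAI-compatible message protocol; plain HTTP as the transport.
    url = f"http://{address}/v1/chat/completions"
    headers = {"X-Request-Id": request_id}  # lets the manager keep sessions sticky
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=chat_complete_request, headers=headers) as resp:
            resp.raise_for_status()
            return await resp.json()
```

A Ray-based _chat_completions_ray would swap the HTTP round trip for a remote call on the server actor while keeping the same payload.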

```python
num_turns: int


class AgentLoopBase(ABC):
```
A collaborator commented:

At least we need an abort here. I wonder whether the main difference between the agent loop and an async rollout request is just a normal class versus a data class working with the rollout or a server+tool registry; why do we need yet another interact-with-environment implementation here?

A collaborator replied:

There is a fundamental difference between the two programming abstractions. In real-world agent applications, rollout is just a server. The agent receives a prompt and starts to query LLMs for instructions in an agent loop. Tool calling is implemented inside the agent loop because there is no way to implement it in the server (e.g., OpenAI models). The agent loop abstraction makes existing agent systems easily pluggable, and is not restricted to tools (e.g., agentless or predefined workflows).

A collaborator replied:

For predefined workflows, in my opinion, we should constrain our discussion to shared context vs. separate context and hybrid model vs. single model. Of course, the current loop abstraction with an HTTP server makes it easier to port existing workflows, but using a messages list rather than a dict of message lists seems unable to support the separate-context case. With the current AsyncLLMServerManager, it still seems hard to add yet another model that is not included in the current verl training. In #1630 we can support users training a single node in a hybrid workflow, regardless of which model the other nodes depend on. For an end-to-end agent with a shared model and shared context, the current implementation works fine. A key challenge is how to train a separate-context, single-model workflow in verl, which is what I want an agentic loop to accomplish.

A collaborator replied:

Let me take a close look at

```python
return DataProto(batch=batch, non_tensor_batch={"__num_turns__": num_turns})


class AgentLoopManager:
```
A collaborator commented:

It would be better to treat this as a replay buffer or rollout manager, i.e., the coordinator for partial rollout.


casper-hansen (Contributor) commented

> For number 2, I guess we can provide a registration mechanism for agent loops so that you can define your own outside the verl codebase?

Yes, allow users to define a custom agent loop easily.

> Just wondering, do you still prefer the recursive style or the loop style?

Loop style is preferred. It's easier to implement your own loop than to implement with recursion. Also, looking at the code, the loop style is easier to transfer to inference after training.

wuxibin89 (Collaborator, author) commented

> 2. Missing Extensibility: Agent loop lacks the extensibility that ChatCompletionScheduler provides via config.multi_turn.completion_callback. Adding similar functionality would make this more useful.

@casper-hansen Thanks for your feedback! For point 2, I'm adding an AgentLoop registry, allowing users to register their custom AgentLoop (see the sketch below).
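A minimal sketch of what such a registry could look like (a hedged illustration; the decorator and lookup names here are hypothetical, not necessarily the merged API):

```python
# Hypothetical AgentLoop registry sketch; names are illustrative.
_AGENT_LOOP_REGISTRY: dict = {}


def register_agent_loop(name):
    """Class decorator registering a custom AgentLoop under `name`."""
    def decorator(cls):
        _AGENT_LOOP_REGISTRY[name] = cls
        return cls
    return decorator


def get_agent_loop_cls(name):
    return _AGENT_LOOP_REGISTRY[name]


@register_agent_loop("my_custom_loop")
class MyCustomAgentLoop:
    # In practice this would subclass AgentLoopBase and implement run().
    ...
```

A worker could then instantiate the class named in the config, so custom loops live entirely outside the verl codebase.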

@wuxibin89 wuxibin89 force-pushed the wuxibin/agent_loop branch from 96f1f35 to 3891051 Compare June 23, 2025 06:58
@wuxibin89 wuxibin89 requested a review from chenhaiq as a code owner June 23, 2025 06:58
casper-hansen (Contributor) commented

@wuxibin89 Two minor issues to consider:

  1. When using Qwen3, you have to specify enable_thinking=False in the _postprocess() function, because otherwise it adds empty <think> </think> blocks to your messages, since thinking is enabled by default (see the snippet after this list). In the ChatCompletionScheduler, I can override this because we have direct access to modify _postprocess(). How would we achieve the same thing for the agent loop?
  2. Implementing streaming for inference with the recursive callbacks is pretty challenging. The agent loop seems easier because we can modify it to yield messages as they appear. Consider whether there should be a streaming implementation for this in an example.
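For context on point 1: Qwen3's chat template enables thinking by default, and it is disabled when the messages are rendered, roughly like this (a hedged snippet; enable_thinking is the Qwen3 chat-template kwarg passed through apply_chat_template):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
# Qwen3's chat template turns thinking mode on by default; passing
# enable_thinking=False avoids empty <think> </think> blocks in the output.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
```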

wuxibin89 (Collaborator, author) commented Jun 23, 2025

@casper-hansen

  1. Due to the problem described in [rollout] feat: Add LangGraph chat scheduler for agent training with tool use support #1946 (comment), I have replaced /v1/chat_completions with AsyncLLM.generate. Instead of passing text messages, we now pass prompt_ids and response_ids directly between client and server, so there's no need to tokenize messages in _postprocess anymore.
  2. Yes, streaming responses are easier to implement in AgentLoop, but I don't see why we need streaming responses, because we need the <EOS> token or other stop tokens to extract tool calls.

casper-hansen (Contributor) commented

> 1. Due to the problem described in [rollout] feat: Add LangGraph chat scheduler for agent training with tool use support #1946 (comment), I have replaced /v1/chat_completions with AsyncLLM.generate. Instead of passing text messages, we now pass prompt_ids and response_ids directly between client and server, so there's no need to tokenize messages in _postprocess anymore.

I see now, great update on this! The removal of tokenization is reassuring and should be compatible with all models now.

> 2. Yes, streaming responses are easier to implement in AgentLoop, but I don't see why we need streaming responses, because we need the <EOS> token or other stop tokens to extract tool calls.

After I train, I want to move to an inference system where I can see the intermediate messages from the trained agent, as this can take 5-10 minutes. This loop makes it easier to create an API in front of the trained model (see the streaming sketch below).
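A streaming loop is then almost mechanical. Below is a hedged sketch building on the hypothetical ToolAgentLoop above: an async generator variant of run that yields each message as it is produced (names are illustrative, not the merged API):

```python
# Hypothetical streaming variant of the ToolAgentLoop.run sketch above.
# An API server in front of the trained model can iterate this generator
# and forward each intermediate message to the client as it appears.
async def run_streaming(self, messages, sampling_params):
    while True:
        reply = await self.server_manager.chat_completion(
            messages=messages, **sampling_params
        )
        messages.append(reply)
        yield reply  # intermediate assistant message, visible immediately
        if not reply.get("tool_calls"):
            break  # plain answer: the trajectory is finished
        tool_messages = await self.call_tools(reply["tool_calls"])
        messages.extend(tool_messages)
        for msg in tool_messages:
            yield msg  # tool results streamed too
```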

@wuxibin89 wuxibin89 force-pushed the wuxibin/agent_loop branch 5 times, most recently from 30a3404 to f1c6a96 Compare June 25, 2025 11:41
```
@@ -0,0 +1,50 @@
# Copyright 2024 Bytedance Ltd. and/or its affiliates
```
A collaborator commented:

Can we rename this file to single_turn_agent_loop? single_agent_loop is confusing.

@vermouth1992 vermouth1992 merged commit 790a8a2 into volcengine:main Jun 27, 2025
37 of 38 checks passed
casper-hansen (Contributor) commented

@wuxibin89 I see this has been merged now. Will you follow up with a PR on extensibility so we can use our CustomAgentLoop by registering it through the config?

waleko commented Jun 30, 2025

@wuxibin89 Does this change mean we can no longer use a chat scheduler or completion_callback? That was quite helpful for our use case. Is there any way to restore or support this functionality again?

oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
yinzhangyue commented

Perfect work! Is there any guide on how to use the tool_agent_loop in training? Thanks!
