Conversation

wuxibin89 (Collaborator) commented Jun 20, 2025

Checklist Before Starting

  • Searched for similar PR(s).
  • Checked PR Title format
    • In format of: [modules] type: Title
    • modules are in fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • type is in feat, fix, refactor, chore, test
    • can involve multiple modules, separated by , or space, like [megatron, fsdp, doc] feat: xxx

What does this PR do?

Add AgentLoopBase and AgentLoopManager for agentic rollout.

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

High-Level Design

New Components:

  • AgentLoopBase: abstract class representing the rollout loop of a single prompt; within the loop, the agent may chat with an OpenAI-compatible LLM server and interact with various environments (a sketch follows below).
  • AsyncLLMServerManager: sends chat completion requests to multiple LLM servers, providing load balancing and sticky sessions.
  • AgentLoopManager: gets a batch of prompts from the dataloader and splits it across multiple AgentLoopWorker instances.
  • AgentLoopWorker: for each prompt, creates an AgentLoopBase instance and runs the loop task.

[Architecture diagram: AgentLoopManager dispatching prompts to AgentLoopWorkers, each running an AgentLoopBase loop against the AsyncLLMServerManager.]
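To make the abstraction concrete, here is a minimal sketch of the loop style (illustrative only; class and method names approximate the API in this PR rather than reproduce it):

```python
# Minimal sketch of the loop-based abstraction above. Names are
# illustrative; the merged API may differ in details.
from abc import ABC, abstractmethod


class AgentLoopBase(ABC):
    """Rollout loop for a single prompt."""

    def __init__(self, server_manager, max_turns: int = 10):
        self.server_manager = server_manager  # AsyncLLMServerManager
        self.max_turns = max_turns

    @abstractmethod
    async def run(self, messages: list, sampling_params: dict):
        ...


class ToolAgentLoop(AgentLoopBase):
    async def run(self, messages, sampling_params):
        num_turns = 0
        while num_turns < self.max_turns:  # a loop instead of recursion
            # Ask the OpenAI-compatible server for the next assistant message.
            reply = await self.server_manager.chat_completion(
                messages=messages, **sampling_params
            )
            messages.append(reply)
            num_turns += 1
            if not reply.get("tool_calls"):
                break  # plain answer: the trajectory is finished
            # Interact with the environment/tools, then loop again.
            messages.extend(await self.call_tools(reply["tool_calls"]))
        return messages, num_turns

    async def call_tools(self, tool_calls):
        # Hypothetical helper: execute each tool and return tool messages.
        return [{"role": "tool", "content": "..."} for _ in tool_calls]
```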

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this
```

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title description if it breaks any API.
  • Update the documentation about your changes in the docs.
  • New CI unit test(s) are added to cover the code path.
  • Rely on existing unit tests on CI that covers the code path.

vermouth1992 (Collaborator) commented Jun 20, 2025

@ccclyu @SwordFaith Could you help review this design? Basically, it converts recursive LLM interactions to a loop-based design to make writing complex agent loops easier. The original idea is from @zw0610.

@PeterSH6 PeterSH6 requested a review from zw0610 June 20, 2025 16:54
vermouth1992 (Collaborator) commented

@casper-hansen Could you please also take a look and give some comments? Thanks.

zw0610 (Collaborator) commented Jun 21, 2025

@vermouth1992 This is a compromise version of what we've discussed. I would still suggest moving the agent loop out of the AsyncLLMServer and manipulating it directly on the driver process.

vermouth1992 (Collaborator) commented

> @vermouth1992 This is a compromise version of what we've discussed. I would still suggest moving the agent loop out of the AsyncLLMServer and manipulating it directly on the driver process.

It runs on another Ray actor. If we remove the Ray actor, it runs directly on the driver process.

zw0610 (Collaborator) commented Jun 21, 2025

> > @vermouth1992 This is a compromise version of what we've discussed. I would still suggest moving the agent loop out of the AsyncLLMServer and manipulating it directly on the driver process.
>
> It runs on another Ray actor. If we remove the Ray actor, it runs directly on the driver process.

My concern lies more with the flexibility of debugging and customizing the agent loop. Moreover, if we consider agent loops that consume many more local resources, transforming them into distributed Ray actors would relieve the resource pressure on either the Ray actor you mentioned in this implementation or the driver process.

casper-hansen (Contributor) commented Jun 21, 2025

My feedback below.

  1. Code Quality Improvement: Agent loop effectively refactors ChatCompletionScheduler by replacing recursion with a while loop, making the implementation simpler for the end-user.
  2. Missing Extensibility: Agent loop lacks the extensibility that ChatCompletionScheduler provides via config.multi_turn.completion_callback. Adding similar functionality would make this more useful.
  3. Development Experience Issue (agree with @zw0610): Both implementations suffer from slow iteration cycles due to the Ray dependency. I've implemented a private interoperable CompletionCallback that works with both training and inference, allowing direct calls to o4-mini or Qwen3 without booting veRL (which takes minutes).

I hope the feedback is useful. Point 3 is a general challenge that I think can be overcome in a future PR (scope is too big), so I would suggest adding it to the roadmap.

vermouth1992 (Collaborator) commented

> My feedback below.
>
> 1. Code Quality Improvement: Agent loop effectively refactors ChatCompletionScheduler by replacing recursion with a while loop, making the implementation simpler for the end-user.
> 2. Missing Extensibility: Agent loop lacks the extensibility that ChatCompletionScheduler provides via config.multi_turn.completion_callback. Adding similar functionality would make this more useful.
> 3. Development Experience Issue (agree with @zw0610): Both implementations suffer from slow iteration cycles due to the Ray dependency.

Thanks for your feedback! For number 2, I guess we can provide a registration mechanism for agent loops so that you can define your own outside the verl codebase? The main motivation of the loop-based agent is easy implementation when the agent interacts with an environment (e.g., a Pokémon game). Just wondering, do you still prefer the recursive style or the loop style? For number 3, I agree that it requires substantial optimization to make it fast.

SwordFaith (Collaborator) left a review

Looking forward to discussing further.

```python
self.server_addresses = server_addresses

# Least requests load balancing
self.weighted_addresses = [[0, address] for address in server_addresses]
```
SwordFaith (Collaborator) commented:

Least-request load balancing may also end up sharding equal data sizes to each DP group if there is no rate limit or running-requests-aware backpressure mechanism.

wuxibin89 (Collaborator, author) replied:

Yes, we need a more server-pressure-aware load balancing strategy; @chenhaiq is investigating it.

```python
heapq.heapify(self.weighted_addresses)

# LRU cache to map request_id to address
self.request_id_to_address = LRUCache(maxsize=max_cache_size)
```
SwordFaith (Collaborator) commented:

Is it possible to use radix-aware or prefix-aware routing, hybridized with least-requests and max-running-requests awareness?

wuxibin89 (Collaborator, author) replied:

By collecting server performance metrics (e.g., current requests, GPU cache usage, CPU cache usage), we can implement a better routing strategy (a sketch of the current scheme follows below).
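Putting the two snippets above together, the routing in AsyncLLMServerManager is roughly the following (a hedged sketch reconstructed from this diff; the cachetools import and method names are assumptions):

```python
import heapq
from uuid import uuid4

from cachetools import LRUCache


class LeastRequestRouter:
    """Sticky sessions: a known request_id keeps hitting the same server;
    new request_ids go to the server with the fewest dispatched requests."""

    def __init__(self, server_addresses, max_cache_size=10000):
        # Least-requests load balancing: min-heap of [request_count, address].
        self.weighted_addresses = [[0, addr] for addr in server_addresses]
        heapq.heapify(self.weighted_addresses)
        # LRU cache mapping request_id -> address for sticky sessions.
        self.request_id_to_address = LRUCache(maxsize=max_cache_size)

    def choose(self, request_id=None):
        if request_id in self.request_id_to_address:
            return self.request_id_to_address[request_id]  # sticky session
        # Pop the least-loaded server, bump its count, push it back.
        entry = heapq.heappop(self.weighted_addresses)
        entry[0] += 1
        heapq.heappush(self.weighted_addresses, entry)
        request_id = request_id or uuid4().hex
        self.request_id_to_address[request_id] = entry[1]
        return entry[1]
```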


```python
return await self._chat_completions(address, request_id, messages=messages, tools=tools, **sampling_params, **kwargs)

async def _chat_completions(self, address: str, request_id: str, **chat_complete_request) -> ChatCompletion:
```
A collaborator commented:

So did we finally choose to keep HTTP server routing rather than Ray-based routing?

A collaborator commented:

Did you observe a performance diff?

SwordFaith (Collaborator) commented Jun 21, 2025:

We suffered concurrency issues with sgl-router, which made Zilin Zhu switch to HTTP/2 several weeks ago. But it may be OK here because we only use it between servers.
Another consideration is resource affinity inside the message list, e.g., a tool creating a resource like an image in each turn.
Using Ray would help with auto-affinity scheduling; using HTTP would require the user to set up an S3-like OSS service and would cause a lot of network traffic during multi-turn generation. I guess supporting Ray-based routing could improve the user experience with fewer infrastructure requirements.
HTTP-based routing, on the other hand, ports more easily to existing open-source OpenAI-compatible server implementations and keeps verl simpler, which is also an advantage.

wuxibin89 (Collaborator, author) commented Jun 21, 2025:

> So did we finally choose to keep HTTP server routing rather than Ray-based routing?

The message protocol (chat_complete_request) between client and server should be OpenAI compatible, but the transport protocol can be either HTTP or a Ray remote call. So there will be two _chat_completions implementations; the second is in development (a sketch of the HTTP variant follows below):

  • _chat_completions_http
  • _chat_completions_ray
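For illustration, the HTTP variant might look roughly like this (a hedged sketch assuming an aiohttp client against a standard OpenAI-compatible endpoint; not the exact code in this PR):

```python
import aiohttp


async def _chat_completions_http(address: str, request_id: str, **chat_complete_request) -> dict:
    # Same OpenAI-compatible message protocol; plain HTTP as the transport.
    url = f"http://{address}/v1/chat/completions"
    headers = {"X-Request-Id": request_id}  # lets the manager keep sessions sticky
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=chat_complete_request, headers=headers) as resp:
            resp.raise_for_status()
            return await resp.json()
```

A Ray-based _chat_completions_ray would swap the HTTP round trip for a remote call on the server actor while keeping the same payload.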

```python
num_turns: int


class AgentLoopBase(ABC):
```
A collaborator commented:

At least we need an abort here. I wonder whether the main difference between the agent loop and an async rollout request is just a normal class versus a data class working with the rollout or a server+tool registry; why do we need yet another interact-with-environment implementation here?

A collaborator replied:

There is a fundamental difference between the two programming abstractions. In real-world agent applications, rollout is just a server. The agent receives a prompt and starts to query LLMs for instructions in an agent loop. Tool calling is implemented inside the agent loop because there is no way to implement it in the server (e.g., OpenAI models). The agent loop abstraction makes existing agent systems easily pluggable, and is not restricted to tools (e.g., agentless or predefined workflows).

A collaborator replied:

For predefined workflows, in my opinion, we should constrain our discussion to shared context vs. separate context and hybrid model vs. single model. Of course, the current loop abstraction with an HTTP server makes it easier to port existing workflows, but using a messages list rather than a dict of message lists seems unable to support the separate-context case. With the current AsyncLLMServerManager, it still seems hard to add yet another model that is not included in the current verl training. In #1630 we can support users training a single node in a hybrid workflow, regardless of which model the other nodes depend on. For an end-to-end agent with a shared model and shared context, the current implementation works fine. A key challenge is how to train a separate-context, single-model workflow in verl, which is what I want an agentic loop to accomplish.

A collaborator replied:

Let me take a close look at

```python
return DataProto(batch=batch, non_tensor_batch={"__num_turns__": num_turns})


class AgentLoopManager:
```
A collaborator commented:

It would be better to treat this as a replay buffer or rollout manager, i.e., the coordinator for partial rollout.


casper-hansen (Contributor) commented

> For number 2, I guess we can provide a registration mechanism for agent loops so that you can define your own outside the verl codebase?

Yes, allow users to define a custom agent loop easily.

> Just wondering, do you still prefer the recursive style or the loop style?

Loop style is preferred. It's easier to implement your own loop than to implement with recursion. Also, looking at the code, the loop style is easier to transfer to inference after training.

wuxibin89 (Collaborator, author) commented

> 2. Missing Extensibility: Agent loop lacks the extensibility that ChatCompletionScheduler provides via config.multi_turn.completion_callback. Adding similar functionality would make this more useful.

@casper-hansen Thanks for your feedback! For point 2, I'm adding an AgentLoop registry, allowing users to register their custom AgentLoop (see the sketch below).
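A minimal sketch of what such a registry could look like (a hedged illustration; the decorator and lookup names here are hypothetical, not necessarily the merged API):

```python
# Hypothetical AgentLoop registry sketch; names are illustrative.
_AGENT_LOOP_REGISTRY: dict = {}


def register_agent_loop(name):
    """Class decorator registering a custom AgentLoop under `name`."""
    def decorator(cls):
        _AGENT_LOOP_REGISTRY[name] = cls
        return cls
    return decorator


def get_agent_loop_cls(name):
    return _AGENT_LOOP_REGISTRY[name]


@register_agent_loop("my_custom_loop")
class MyCustomAgentLoop:
    # In practice this would subclass AgentLoopBase and implement run().
    ...
```

A worker could then instantiate the class named in the config, so custom loops live entirely outside the verl codebase.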

@wuxibin89 wuxibin89 force-pushed the wuxibin/agent_loop branch from 96f1f35 to 3891051 Compare June 23, 2025 06:58
@wuxibin89 wuxibin89 requested a review from chenhaiq as a code owner June 23, 2025 06:58
casper-hansen (Contributor) commented

@wuxibin89 Two minor issues to consider:

  1. When using Qwen3, you have to specify enable_thinking=False in the _postprocess() function, because otherwise it adds empty <think> </think> blocks to your messages, since thinking is enabled by default (see the snippet after this list). In the ChatCompletionScheduler, I can override this because we have direct access to modify _postprocess(). How would we achieve the same thing for the agent loop?
  2. Implementing streaming for inference with the recursive callbacks is pretty challenging. The agent loop seems easier because we can modify it to yield messages as they appear. Consider whether there should be a streaming implementation for this in an example.
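For context on point 1: Qwen3's chat template enables thinking by default, and it is disabled when the messages are rendered, roughly like this (a hedged snippet; enable_thinking is the Qwen3 chat-template kwarg passed through apply_chat_template):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
# Qwen3's chat template turns thinking mode on by default; passing
# enable_thinking=False avoids empty <think> </think> blocks in the output.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
```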

wuxibin89 (Collaborator, author) commented Jun 23, 2025

@casper-hansen

  1. Due to the problem described in [rollout] feat: Add LangGraph chat scheduler for agent training with tool use support #1946 (comment), I have replaced /v1/chat_completions with AsyncLLM.generate. Instead of passing text messages, we now pass prompt_ids and response_ids directly between client and server, so there's no need to tokenize messages in _postprocess anymore.
  2. Yes, streaming responses are easier to implement in AgentLoop, but I don't see why we need streaming responses, because we need the <EOS> token or other stop tokens to extract tool calls.

casper-hansen (Contributor) commented

> 1. Due to the problem described in [rollout] feat: Add LangGraph chat scheduler for agent training with tool use support #1946 (comment), I have replaced /v1/chat_completions with AsyncLLM.generate. Instead of passing text messages, we now pass prompt_ids and response_ids directly between client and server, so there's no need to tokenize messages in _postprocess anymore.

I see now, great update on this! The removal of tokenization is reassuring and should be compatible with all models now.

> 2. Yes, streaming responses are easier to implement in AgentLoop, but I don't see why we need streaming responses, because we need the <EOS> token or other stop tokens to extract tool calls.

After I train, I want to move to an inference system where I can see the intermediate messages from the trained agent, as this can take 5-10 minutes. This loop makes it easier to create an API in front of the trained model (see the streaming sketch below).
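A streaming loop is then almost mechanical. Below is a hedged sketch building on the hypothetical ToolAgentLoop above: an async generator variant of run that yields each message as it is produced (names are illustrative, not the merged API):

```python
# Hypothetical streaming variant of the ToolAgentLoop.run sketch above.
# An API server in front of the trained model can iterate this generator
# and forward each intermediate message to the client as it appears.
async def run_streaming(self, messages, sampling_params):
    while True:
        reply = await self.server_manager.chat_completion(
            messages=messages, **sampling_params
        )
        messages.append(reply)
        yield reply  # intermediate assistant message, visible immediately
        if not reply.get("tool_calls"):
            break  # plain answer: the trajectory is finished
        tool_messages = await self.call_tools(reply["tool_calls"])
        messages.extend(tool_messages)
        for msg in tool_messages:
            yield msg  # tool results streamed too
```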

@wuxibin89 wuxibin89 force-pushed the wuxibin/agent_loop branch 5 times, most recently from 30a3404 to f1c6a96 Compare June 25, 2025 11:41
```
@@ -0,0 +1,50 @@
# Copyright 2024 Bytedance Ltd. and/or its affiliates
```
A collaborator commented:

Can we rename this file to single_turn_agent_loop? single_agent_loop is confusing.

@vermouth1992 vermouth1992 merged commit 790a8a2 into volcengine:main Jun 27, 2025
37 of 38 checks passed
casper-hansen (Contributor) commented

@wuxibin89 I see this has been merged now. Will you follow up with a PR on extensibility so we can use our CustomAgentLoop by registering it through the config?

waleko commented Jun 30, 2025

@wuxibin89 Does this change mean we can no longer use a chat scheduler or completion_callback? That was quite helpful for our use case. Is there any way to restore or support this functionality again?

oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
yinzhangyue commented

Perfect work! Is there any guide on how to use the tool_agent_loop in training? Thanks!
