⚡ Customization of VERL for ai-infra #2
base: rel/finai-v0.0.1
Conversation
data = TensorDict(data, batch_size=self.config.data.train_batch_size).to(self.device_name)
metric = self.training_step(data)
- train_time += metric["train/time(s)"]
+ train_time += metric["train/time_s"]
🐞 The old key `train/time(s)` was causing an MLflow metric-logging error (parentheses are not accepted in MLflow metric names).
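The failure mode can be illustrated with a small sketch. This is not the PR's code: the validation regex mirrors MLflow's documented rule that metric keys may only contain alphanumerics, underscores, dashes, periods, spaces, and slashes, and `sanitize_metric_key` is a hypothetical helper:

```python
import re

# Characters MLflow accepts in metric/param keys: alphanumerics,
# underscore, dash, period, space, and slash.
_VALID_KEY = re.compile(r"^[/\w.\- ]*$")

def sanitize_metric_key(key: str) -> str:
    """Replace characters MLflow disallows with underscores, trimming any
    trailing underscores the substitution leaves behind."""
    if _VALID_KEY.match(key):
        return key
    return re.sub(r"[^/\w.\- ]", "_", key).rstrip("_")

# sanitize_metric_key("train/time(s)") -> "train/time_s"
```

Renaming the key at the source, as the diff does, avoids needing any sanitizer at logging time.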
def _compute_loss_and_backward(self, batch, do_backward=True, n_micro_batches=1):
    """Compute loss with optional sequence parallelism and remove padding features"""
-   use_sp = self.use_remove_padding and self.config.ulysses_sequence_parallel_size > 1
+   use_sp = self.use_remove_padding and self.config.ulysses_sequence_parallel_size >= 1
🐞 Previously this didn't let us skip padding without sequence parallelism, which made memory usage incredibly high. Super small fix for that.
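For context, "remove padding" flattens a padded batch into only its real tokens, so downstream compute scales with actual token count rather than batch_size × max_seq_len. A minimal sketch of the idea (names are illustrative, not verl's implementation):

```python
import torch

def remove_padding(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Gather only the non-pad token ids of a padded (batch, seq_len) batch
    into one flat sequence, returning the flat tokens and their original
    flat indices (needed to scatter results back)."""
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    return input_ids.flatten()[indices], indices
```

With heavily padded batches, the flat sequence can be a small fraction of the padded tensor, which is where the memory savings come from.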
    sampling_metadata: SamplingMetadata,
) -> torch.Tensor:
-   logits = original_compute_logits(hidden_states, sampling_metadata)
+   logits = original_compute_logits(hidden_states)
🐞 Failing after the vLLM 0.11.0 update, since `SamplingMetadata` was deprecated and removed. Checked the vLLM repo and made a small fix to keep this aligned.
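A more defensive way to stay compatible across vLLM versions (a sketch, not what the PR does) is to inspect the wrapped function's signature and only pass the argument if it is still accepted:

```python
import inspect

def call_compute_logits(fn, hidden_states, sampling_metadata=None):
    """Call a compute_logits implementation with or without
    sampling_metadata, depending on whether its signature still accepts
    that parameter (it was removed in newer vLLM releases)."""
    if "sampling_metadata" in inspect.signature(fn).parameters:
        return fn(hidden_states, sampling_metadata)
    return fn(hidden_states)
```

Hard-coding the new call, as the diff does, is simpler when only one vLLM version needs to be supported.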
rollout_reward_scores = data_item.non_tensor_batch.get("reward_scores", {})
extra_info["num_turns"] = num_turns
extra_info["rollout_reward_scores"] = rollout_reward_scores
extra_info["prompt_str"] = prompt_str
Passing the prompt string through, since we need it for some of our reward computation.
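As a toy illustration of why the prompt needs to be threaded through `extra_info` (the actual reward functions are different; this only shows the access pattern):

```python
def compute_reward(response: str, extra_info: dict) -> float:
    """Hypothetical prompt-aware reward: penalize a response that merely
    echoes the prompt. Any reward that compares the response against the
    prompt needs extra_info["prompt_str"] to be populated upstream."""
    prompt = extra_info.get("prompt_str", "")
    return 0.0 if response.strip() == prompt.strip() else 1.0
```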
# If the experiment does not exist, a new one will be created
experiment = mlflow.set_experiment(project_name)
mlflow.start_run(experiment_id=experiment.experiment_id, run_name=experiment_name)
mlflow_tags = os.getenv("MLFLOW_TAGS", None)
Enhanced to support MLflow tags. Good for experiment logging in general.
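One plausible shape for this (the actual format of `MLFLOW_TAGS` is not shown in the diff; a JSON object is an assumption here, and `read_mlflow_tags` is a hypothetical helper):

```python
import json
import os

def read_mlflow_tags(env_var: str = "MLFLOW_TAGS") -> dict:
    """Parse run tags from an environment variable assumed to hold a JSON
    object, e.g. '{"team": "ai-infra", "stage": "rl"}'; returns {} when
    unset. The result could then be passed to mlflow.set_tags(...)."""
    raw = os.getenv(env_var)
    return json.loads(raw) if raw else {}
```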
elif is_npu_available:
    torch.distributed.all_reduce(step_loss)
    step_loss /= self.device_mesh.size(0)
return {
Added more metrics for tracking
Overrides to the default verl behaviour, specifically for our use cases.
docker/setup_and_run.sh
fi

echo "== Setup the active user =="
if [ -n "${HOST_UID}" ] && [ -n "${HOST_GID}" ] && [ -n "${INTERCOM_USER}" ]; then
🔴 Useful locally but failing on EKS. Need to debug why.
Skipping running as the active user and defaulting to root for now. Feels like a time sink at the moment.
🔴 This is brittle and could use a review/rewrite.
Restructured a fair bit, but still not able to move it into the form of new recipes. Should do for now though.
Warning: Socket detected alerts in direct dependencies. According to the organization's Security Policy, it is recommended to resolve "Warn" alerts.
Changed a bunch of things:
- Use the same base image as the rest of the model_training/ EKS jobs
- Use uv for installation (a lot of packages still require no build isolation)
- Made the build faster and the image smaller (reduced by ~5 GB; builds in 20 min, down from 30 min earlier)
- Pin core deps with uv.lock
- Remove unnecessary steps
- Removed the hack to install ai-datasets (the image still has the API key though)
Doc with summary of changes - Link