Training Library Usage #1370

Merged: mergify[bot] merged 24 commits into instructlab:main from cdoern:training on Jun 25, 2024

Conversation

@cdoern (Contributor) commented Jun 14, 2024

This is the implementation of high-fidelity training using the new backend library.

  1. Adds all flags outlined in the new training library to ilab model train.
  2. The defaults for these flags come from a new _train class in the config. That class is composed of the exact classes introduced in the training library, which I imported (a sketch follows this list).
  3. Adds a --train-profile option to ilab init, allowing a user to specify a YAML file matching the exact format of the training library args. The two sections in the YAML are train_args and torch_args.
  4. Cleans up mismatched and outdated flags, as needed for the new library.

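To illustrate item 2, here is a minimal sketch of the composition idea, written as a plain dataclass for illustration only; the real class lives in src/instructlab/configuration.py and may differ, and the instructlab.training import path is an assumption on my part, not something this PR pins down.

from dataclasses import dataclass
from instructlab.training import TorchrunArgs, TrainingArgs

@dataclass
class _train:
    # Composing the library's own classes keeps the config defaults and the
    # training library's expectations from drifting apart.
    train_args: TrainingArgs
    torch_args: TorchrunArgs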
The way to run this:

  • If you want to run the legacy Linux training path, just run ilab config init and then ilab train --legacy=true.
  • If you want to use the backend library, ilab config init now has a --train-profile flag which takes a valid YAML file of the following format:
train_args:
  model_path: instructlab/granite-7b-lab
  data_path: sample-data/train_all_pruned_SDG.jsonl
  ckpt_output_dir: checkpoints
  data_output_dir: output
  max_seq_len: 4096
  max_batch_len: 10000
  num_epochs: 10
  effective_batch_size: 3840
  save_samples: 25000
  learning_rate: 2e-6
  warmup_steps: 800
  is_padding_free: False
  random_seed: 42
torch_args:
  node_rank: 0
  nnodes: 1
  nproc_per_node: 1
  rdzv_id: 123
  rdzv_endpoint: '127.0.0.1:12345'

These are the exact TrainingArgs and TorchrunArgs from the training repo.
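For illustration, a minimal sketch of how such a profile could be loaded into those classes and handed to the library; the instructlab.training import path and the run_training entry point used here are assumptions, not necessarily the PR's actual wiring.

import yaml
from instructlab.training import TorchrunArgs, TrainingArgs, run_training

# Load the train profile and build the library's argument objects from it.
with open("trainargs.yaml", "r", encoding="utf-8") as f:
    profile = yaml.safe_load(f)

train_args = TrainingArgs(**profile["train_args"])
torch_args = TorchrunArgs(**profile["torch_args"])

# Hand both argument objects to the library's training entry point.
run_training(torch_args=torch_args, train_args=train_args)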

These options then funnel into the flags for ilab train, which can be used as overrides, but the hope is that @RobotSail and team can provide pre-baked training YAMLs for specific hardware so users will not need to specify ANY flags (this works in testing so far, which is really cool). A rough sketch of how that override layering could work follows.
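A hypothetical sketch of the layering; the helper name and signature are illustrative, not the PR's actual code. Profile values act as defaults, and only flags the user explicitly set on the command line win.

def merged_train_args(profile_args: dict, cli_overrides: dict) -> dict:
    # Start from the profile's defaults...
    merged = dict(profile_args)
    # ...then let explicitly-set CLI flags win; unset flags (None) are ignored.
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged

# e.g. merged_train_args(profile["train_args"], {"num_epochs": 1, "learning_rate": None})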

  • macOS training is detected based on the chip type, and users on a Mac will run it automatically.
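A minimal sketch of how that chip-based detection could look (an assumed approach, not necessarily the PR's exact code):

import platform

def is_apple_silicon_mac() -> bool:
    # Darwin + arm64 indicates an Apple Silicon Mac; Intel Macs report x86_64.
    return platform.system() == "Darwin" and platform.machine() == "arm64"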

@mergify bot added and removed the ci-failure label (Jun 14, 2024)
@mergify bot added and removed the ci-failure label (Jun 16, 2024)
Review threads on src/instructlab/configuration.py (outdated) and src/instructlab/model/train.py (two threads) were resolved.
@mergify bot removed the ci-failure label (Jun 17, 2024)
@mergify bot added and removed the ci-failure label (Jun 17, 2024)
@mergify bot added and removed the ci-failure label (Jun 18, 2024)
@cdoern (Contributor, Author) commented Jun 18, 2024

Some status here. This PR does the following:

  1. Adds all flags outlined in the new training library to ilab model train.
  2. The defaults for these flags come from a new _train class in the config. That class is composed of the exact classes introduced in the training library, which I imported.
  3. Adds a --train-profile option to ilab init, allowing a user to specify a YAML file matching the exact format of the training library args. The two sections in the YAML are train_args and torch_args.
  4. Cleans up mismatched and outdated flags, as needed for the new library.

The commits might be a bit messy right now; I will clean them up in the morning. Functionally, this is the structure we should go for. Eventually, all of the top-level config stuff should live in a profile. The number of flags being added to training is... quite a lot.

@mergify bot added and removed the ci-failure label (Jun 18, 2024)
@cdoern (Contributor, Author) commented Jun 18, 2024

Usage:

  1. ilab config init --train-profile /path/to/trainargs.yaml
  2. ilab train (no flags unless you want to override the YAML; use --legacy to use the old Linux training code)

A train profile takes the following format:

train_args:
  model_path: instructlab/granite-7b-lab
  data_path: sample-data/train_all_pruned_SDG.jsonl
  ckpt_output_dir: checkpoints
  data_output_dir: output
  max_seq_len: 4096
  max_batch_len: 10000
  num_epochs: 10
  effective_batch_size: 3840
  save_samples: 25000
  learning_rate: 2e-6
  warmup_steps: 800
  is_padding_free: False
  random_seed: 42
torch_args:
  node_rank: 0
  nnodes: 1
  nproc_per_node: 1
  rdzv_id: 123
  rdzv_endpoint: 127.0.0.1:12345

That's it :)

@mergify bot added and removed the ci-failure label (Jun 18, 2024)
RobotSail and others added 8 commits June 25, 2024 15:42
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
specify --legacy for tests

Signed-off-by: Charlie Doern <cdoern@redhat.com>
data-path can be a file or a directory, depending on whether linux_train or the library train path is used.

remove old flags that were commented out; kwargs should be fine for now.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
The TrainingArgs class has the LoRA and DeepSpeed option classes embedded inside it. This necessitated some custom handling: for the default map to auto-populate, all options need to live under a top-level train entry in the internal_map.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
instructlab/dolomite does not support Python 3.12 or 3.9. 3.9 could be removed entirely, as it is an older version; however, 3.12 support will eventually exist for dolomite.
Linux train still works when using --legacy on 3.12, and macOS train is fully intact. For now, I am removing CI for 3.12 but keeping the docs and pyproject lines saying we support 3.12.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
After adding the training library, the runner is running out of disk
space. Re-use our hack for freeing up disk space in this workflow by
making it an action we can use in multiple workflows.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
@mergify bot removed the ci-failure label (Jun 25, 2024)
@cdoern (Contributor, Author) commented Jun 25, 2024

I rebased, and added one more timeout bump.

@mergify bot added the ci-failure label (Jun 25, 2024)
@cdoern mentioned this pull request on Jun 25, 2024
@mergify bot removed the ci-failure label (Jun 25, 2024)
@JamesKunstle (Contributor) commented:

We're failing one macOS test. I manually ran the test to see if it's a test problem or a real problem. The manual testing passed.

i.e. if the server is running and we are in a chat window, and we kill the server while chat is emitting, the chat window will exit and let the user know that it's ceasing to emit a response.

I used:
watch -n 1 sudo lsof -iTCP -sTCP:LISTEN -nP

to validate that the serving process that we killed was the one that the chat instance was communicating on.

macOS has some additional timing issues due to the torch logs and validation in the training library.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
@mergify bot added the ci-failure label (Jun 25, 2024)
@mergify bot removed the ci-failure label (Jun 25, 2024)
@RobotSail (Member) left a comment:

LGTM

@mergify bot merged commit 8c8f5d7 into instructlab:main on Jun 25, 2024
@russellb (Contributor) commented:

For future reference, this commit history could have used a bit of squashing cleanup prior to merging.


Labels

CI/CD (Affects CI/CD configuration), documentation (Improvements or additions to documentation), testing (Relates to testing)

7 participants
