Training Library Usage #1370

Merged: mergify[bot] merged 24 commits into instructlab:main from cdoern:training on Jun 25, 2024

Conversation

@cdoern (Contributor) commented Jun 14, 2024

This is the implementation of high-fidelity training using the new backend library.

  1. Adds all flags outlined in the new training library to ilab model train.
  2. The defaults for these flags come from a new _train class in the config. That class is composed of the exact classes introduced in the training library, which I imported (a sketch follows this list).
  3. Adds a --train-profile option to ilab init, allowing a user to specify a YAML file matching the exact format of the training library args. The two sections in the YAML are train_args and torch_args.
  4. Cleans up mismatched and outdated flags, as needed for the new library.

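To illustrate item 2, here is a minimal sketch of the composition idea, written as a plain dataclass for illustration only; the real class lives in src/instructlab/configuration.py and may differ, and the instructlab.training import path is an assumption on my part, not something this PR pins down.

from dataclasses import dataclass
from instructlab.training import TorchrunArgs, TrainingArgs

@dataclass
class _train:
    # Composing the library's own classes keeps the config defaults and the
    # training library's expectations from drifting apart.
    train_args: TrainingArgs
    torch_args: TorchrunArgs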
The way to run this:

  • If you want to run the legacy Linux training path, just run ilab config init and then ilab train --legacy=true.
  • If you want to use the backend library, ilab config init now has a --train-profile flag which takes a valid YAML file of the following format:
train_args:
  model_path: instructlab/granite-7b-lab
  data_path: sample-data/train_all_pruned_SDG.jsonl
  ckpt_output_dir: checkpoints
  data_output_dir: output
  max_seq_len: 4096
  max_batch_len: 10000
  num_epochs: 10
  effective_batch_size: 3840
  save_samples: 25000
  learning_rate: 2e-6
  warmup_steps: 800
  is_padding_free: False
  random_seed: 42
torch_args:
  node_rank: 0
  nnodes: 1
  nproc_per_node: 1
  rdzv_id: 123
  rdzv_endpoint: '127.0.0.1:12345'

These are the exact TrainingArgs and TorchrunArgs from the training repo.
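For illustration, a minimal sketch of how such a profile could be loaded into those classes and handed to the library; the instructlab.training import path and the run_training entry point used here are assumptions, not necessarily the PR's actual wiring.

import yaml
from instructlab.training import TorchrunArgs, TrainingArgs, run_training

# Load the train profile and build the library's argument objects from it.
with open("trainargs.yaml", "r", encoding="utf-8") as f:
    profile = yaml.safe_load(f)

train_args = TrainingArgs(**profile["train_args"])
torch_args = TorchrunArgs(**profile["torch_args"])

# Hand both argument objects to the library's training entry point.
run_training(torch_args=torch_args, train_args=train_args)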

These options then funnel into the flags for ilab train, which can be used as overrides, but the hope is that @RobotSail and team can provide pre-baked training YAMLs for specific hardware so users will not need to specify ANY flags (this works in testing so far, which is really cool). A rough sketch of how that override layering could work follows.
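A hypothetical sketch of the layering; the helper name and signature are illustrative, not the PR's actual code. Profile values act as defaults, and only flags the user explicitly set on the command line win.

def merged_train_args(profile_args: dict, cli_overrides: dict) -> dict:
    # Start from the profile's defaults...
    merged = dict(profile_args)
    # ...then let explicitly-set CLI flags win; unset flags (None) are ignored.
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged

# e.g. merged_train_args(profile["train_args"], {"num_epochs": 1, "learning_rate": None})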

  • macOS training is detected based on the chip type, and users on a Mac will run it automatically.
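A minimal sketch of how that chip-based detection could look (an assumed approach, not necessarily the PR's exact code):

import platform

def is_apple_silicon_mac() -> bool:
    # Darwin + arm64 indicates an Apple Silicon Mac; Intel Macs report x86_64.
    return platform.system() == "Darwin" and platform.machine() == "arm64"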

@mergify bot added and removed the ci-failure label (Jun 14, 2024)
@mergify bot added and removed the ci-failure label (Jun 16, 2024)
Review threads on src/instructlab/configuration.py (outdated) and src/instructlab/model/train.py (two threads) were resolved.
@mergify bot removed the ci-failure label (Jun 17, 2024)
@mergify bot added and removed the ci-failure label (Jun 17, 2024)
@mergify bot added and removed the ci-failure label (Jun 18, 2024)
@cdoern (Contributor, Author) commented Jun 18, 2024

Some status here. This PR does the following:

  1. Adds all flags outlined in the new training library to ilab model train.
  2. The defaults for these flags come from a new _train class in the config. That class is composed of the exact classes introduced in the training library, which I imported.
  3. Adds a --train-profile option to ilab init, allowing a user to specify a YAML file matching the exact format of the training library args. The two sections in the YAML are train_args and torch_args.
  4. Cleans up mismatched and outdated flags, as needed for the new library.

The commits might be a bit messy right now; I will clean them up in the morning. Functionally, this is the structure we should go for. Eventually, all of the top-level config stuff should live in a profile. The number of flags being added to training is... quite a lot.

@mergify bot added and removed the ci-failure label (Jun 18, 2024)
@cdoern (Contributor, Author) commented Jun 18, 2024

Usage:

  1. ilab config init --train-profile /path/to/trainargs.yaml
  2. ilab train (no flags unless you want to override the YAML; use --legacy to use the old Linux training code)

A train profile takes the following format:

train_args:
  model_path: instructlab/granite-7b-lab
  data_path: sample-data/train_all_pruned_SDG.jsonl
  ckpt_output_dir: checkpoints
  data_output_dir: output
  max_seq_len: 4096
  max_batch_len: 10000
  num_epochs: 10
  effective_batch_size: 3840
  save_samples: 25000
  learning_rate: 2e-6
  warmup_steps: 800
  is_padding_free: False
  random_seed: 42
torch_args:
  node_rank: 0
  nnodes: 1
  nproc_per_node: 1
  rdzv_id: 123
  rdzv_endpoint: 127.0.0.1:12345

That's it :)

@mergify bot added and removed the ci-failure label (Jun 18, 2024)
RobotSail and others added 8 commits June 25, 2024 15:42
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
specify --legacy for tests

Signed-off-by: Charlie Doern <cdoern@redhat.com>
data-path can be a file or a directory, depending on whether linux_train or the library train path is used.

remove old flags that were commented out; kwargs should be fine for now.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
The TrainingArgs class has the LoRA and DeepSpeed option classes embedded inside it. This necessitated some custom handling: for the default map to auto-populate, all options need to live under a top-level train entry in the internal_map.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
instructlab/dolomite does not support Python 3.12 or 3.9. 3.9 could be removed entirely, as it is an older version; however, 3.12 support will eventually exist for dolomite.
Linux train still works when using --legacy on 3.12, and macOS train is fully intact. For now, I am removing CI for 3.12 but keeping the docs and pyproject lines saying we support 3.12.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
After adding the training library, the runner is running out of disk
space. Re-use our hack for freeing up disk space in this workflow by
making it an action we can use in multiple workflows.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
@mergify bot removed the ci-failure label (Jun 25, 2024)
@cdoern (Contributor, Author) commented Jun 25, 2024

I rebased, and added one more timeout bump.

@mergify bot added the ci-failure label (Jun 25, 2024)
@cdoern mentioned this pull request on Jun 25, 2024
@mergify bot removed the ci-failure label (Jun 25, 2024)
@JamesKunstle (Contributor) commented:

We're failing one macOS test. I manually ran the test to see if it's a test problem or a real problem. The manual testing passed.

i.e. if the server is running and we are in a chat window, and we kill the server while chat is emitting, the chat window will exit and let the user know that it's ceasing to emit a response.

I used:
watch -n 1 sudo lsof -iTCP -sTCP:LISTEN -nP

to validate that the serving process that we killed was the one that the chat instance was communicating on.

macOS has some additional timing issues due to the torch logs and validation in the training library.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
@mergify bot added the ci-failure label (Jun 25, 2024)
@mergify bot removed the ci-failure label (Jun 25, 2024)
@RobotSail (Member) left a comment:

LGTM

@mergify bot merged commit 8c8f5d7 into instructlab:main on Jun 25, 2024
@russellb (Contributor) commented:

For future reference, this commit history could have used a bit of squashing cleanup prior to merging.


Labels

CI/CD (Affects CI/CD configuration), documentation (Improvements or additions to documentation), testing (Relates to testing)

7 participants
