Training Library Usage #1370
mergify[bot] merged 24 commits into instructlab:main
Conversation
This PR does the following:
The commits might be a bit messy right now; I will clean them up in the morning. Functionally, this is the structure we should go for. Eventually, all of the top-level config stuff should live in a profile. The amount of flags being added to training is... quite a lot.
usage:
a train profile takes the following format (a sketch follows just below); that's it :)
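As an illustration only, a minimal train profile might look something like the sketch below. Every section and field name here (torchrun_args, train_args, model_path, and so on) is an assumption made for the example, not the exact schema from the training library.

```yaml
# Illustrative train profile only -- section and field names are assumed, not the real schema
torchrun_args:
  nnodes: 1             # single machine
  nproc_per_node: 4     # one worker per GPU
train_args:
  model_path: /path/to/base-model
  data_path: /path/to/training-data.jsonl
  ckpt_output_dir: ./checkpoints
  num_epochs: 10
  effective_batch_size: 3840
  max_seq_len: 4096
  learning_rate: 2.0e-5
  warmup_steps: 25
```

In this shape, the two top-level sections mirror the TorchrunArgs and TrainingArgs split described in the summary at the bottom of the PR.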
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
specify --legacy for tests Signed-off-by: Charlie Doern <cdoern@redhat.com>
data-path can be a file path or a directory depending on whether the legacy linux_train or the library train path is used. Remove the old flags that were commented out; kwargs should be fine for now. Signed-off-by: Charlie Doern <cdoern@redhat.com>
The TrainingArgs class has the LoRA and DeepSpeed options classes embedded inside of it. This necessitated some custom handling: for the default map to auto-populate, all options need to sit under a top-level train entry in the internal_map. Signed-off-by: Charlie Doern <cdoern@redhat.com>
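Roughly, that nesting implies a config shape along these lines; the key names below are hypothetical and only meant to show everything hanging off a single top-level train entry, not the actual internal_map contents.

```yaml
# Hypothetical config shape -- key names are illustrative, not the actual internal_map entries
train:
  model_path: /path/to/base-model
  num_epochs: 10
  lora:                       # options from the embedded LoRA class
    rank: 4
    alpha: 32
  deepspeed:                  # options from the embedded DeepSpeed class
    cpu_offload_optimizer: false
```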
instructlab/dolomite does not support Python 3.12 or 3.9. 3.9 could be dropped entirely since it is an older version, and 3.12 support for dolomite will eventually exist. linux train still works when using --legacy on 3.12, and macOS train is fully intact. For now, I am removing CI for 3.12 but keeping the docs and pyproject lines saying we support 3.12. Signed-off-by: Charlie Doern <cdoern@redhat.com>
After adding the training library, the runner is running out of disk space. Re-use our hack for freeing up disk space in this workflow by making it an action we can use in multiple workflows. Signed-off-by: Russell Bryant <rbryant@redhat.com>
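For readers unfamiliar with that pattern, a local composite action is one plausible way to share such a hack across workflows. The file path, step names, and the particular directories removed below are assumptions for the sketch, not the repository's actual action.

```yaml
# Hypothetical .github/actions/free-disk-space/action.yml -- a composite action wrapping the hack
name: free-disk-space
description: Free disk space on a hosted runner by removing unused preinstalled toolchains
runs:
  using: composite
  steps:
    - name: Remove large preinstalled packages
      shell: bash
      run: |
        # one common variant of the trick: delete toolchains this job never uses
        sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc

# Any workflow can then reuse it as a single step:
#   - name: Free disk space
#     uses: ./.github/actions/free-disk-space
```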
Signed-off-by: Charlie Doern <cdoern@redhat.com>
I rebased and added one more timeout bump.
We're failing one macOS test. I manually ran the test to see whether it's a test problem or a real problem, and the manual testing passed: if the server is running and we are in a chat window, and we kill the server while chat is emitting, the chat window will exit and let the user know that it's ceasing to emit a response. I used: to validate that the serving process we killed was the one that the chat instance was communicating on.
macOS has some additional timing issues due to the torch logs and validation in the training library. Signed-off-by: Charlie Doern <cdoern@redhat.com>
For future reference, this commit history could have used a bit of squashing cleanup prior to merging.
This is the implementation of high-fidelity training using the new backend library.
The way to run it is: ilab config init and then ilab train. --legacy=true keeps the old linux_train path, while the new --train-profile flag takes a valid yaml of the following format: these are the exact TrainingArgs and TorchrunArgs from the training repo.
These options then funnel into the flags for ilab train, which can be used as overrides, but the hope is that @RobotSail and team can provide pre-baked training yaml for specific hardware so users will not need to specify ANY flags (this works in testing so far, which is really cool).
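To make the pre-baked idea concrete, a hardware-specific profile might simply pin conservative values so a user only runs ilab train --train-profile <file> with no other flags. The names and numbers below are invented for illustration and are not shipped profiles.

```yaml
# Hypothetical pre-baked profile for a single consumer GPU -- values are illustrative only
torchrun_args:
  nnodes: 1
  nproc_per_node: 1
train_args:
  max_seq_len: 2048          # smaller context to fit limited VRAM
  max_batch_len: 10000       # conservative packed-batch budget
  effective_batch_size: 128
  num_epochs: 7
```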