#1010 - Optimize CPU training on Linux
jsight wants to merge 2 commits into instructlab:main from jsight:issue_1006_cpu_optimization
Conversation
- Timeout for chat mode increased to 300 seconds. Without this, it fails
  nearly silently after 30 seconds, which causes issues for CPU chat.
- Return the filename when converting the final resulting file and use this
  for copying. This is important when using f32, as llama_to_gguf will produce
  a different filename than expected (f16).
- Determine the available system memory and use it to choose the batch
  size in CPU mode. This enables more parallelization on smaller systems.
- Disable fp16 and bf16 in CPU mode. This also enables parallelization. Without
  it, training is unusable on my i7 with 64GB RAM. With it, training takes
  <30 minutes for one epoch of ~26 iterations.
- Turn off auto for dtype when loading the model. This greatly improved inference
  performance on my i7 as well. It went from single-threaded to fully utilizing
  the CPU.
- Updated the logging to be 1-based instead of zero-based :)
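The memory-based batch-size selection described above can be sketched as follows. The function name `pick_cpu_batch_size` and the GiB thresholds are illustrative assumptions, not the exact values from the PR; the actual change reads available memory via `psutil.virtual_memory()`.

```python
# Illustrative sketch (not the PR's exact code): scale the training
# batch size with available system memory so smaller machines still
# get some parallelism while larger ones get more.
def pick_cpu_batch_size(available_gib: float) -> int:
    """Pick a larger batch size when more RAM is free.

    The thresholds below are example values only; the PR derives the
    available figure from psutil.virtual_memory().available.
    """
    if available_gib >= 48:
        return 8
    if available_gib >= 24:
        return 4
    return 2
```

On a 64GB machine like the one mentioned above, this style of heuristic would land in the largest bucket.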
Signed-off-by: Jess Sightler <jesse.sightler@gmail.com>
Resolves: #1006
# CPU performance is very bad with fp16 or bf16, so disable both
use_fp16 = use_bf16 = False
The best option to use likely depends on specific CPU used here. Xeon CPUs with AVX512 and ipex may actually get better performance leaving bf16 enabled. For the sake of better CPU defaults, this is probably fine. But this is another example where the work @derekhiggins is doing around exposing all of these training arguments would be useful so that users could ultimately tweak these for their specific system until or unless we get smart enough to make optimal decisions in most cases.
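One possible way to make the default smarter along the lines suggested here would be to gate bf16 on CPU capability. This is an illustrative sketch only, not part of this PR: it checks for the `avx512f` flag in `/proc/cpuinfo` on Linux (the function name and the single-flag check are simplifying assumptions; ipex and finer-grained feature detection are out of scope).

```python
# Illustrative only: enable bf16 only when the CPU advertises AVX-512.
def cpu_supports_avx512(cpuinfo_text: str) -> bool:
    """Return True if the 'flags' line lists the AVX-512 foundation flag."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "avx512f" in line.split()
    return False

# On Linux this would be used roughly as:
#   with open("/proc/cpuinfo") as f:
#       use_bf16 = cpu_supports_avx512(f.read())
```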
I'll test this to compare on my HW but it will be early next week before I get a chance
Here is the PR I've been testing to allow arbitrary training options
#1008
Agreed. IMO, it'd be nice to have something like "profiles" and have it automatically select a reasonable default if none is specified.
I have been facing a similar problem for Intel Gaudi (Habana Labs HPUs and SynapseAI) and created an abstraction layer for Torch device properties. I just ripped the code out of my experimental branch, cleaned it up, and published it as draft PR #1015.
# SPDX-License-Identifier: Apache-2.0

# Standard
import psutil  # to determine system memory
It's a 3rd party import.
Please add the requirement to requirements.txt, too. Right now it is only pulled in as a transitive dependency of peft.
Thanks, I have added it to requirements.txt
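For reference, the fix amounts to a single line in requirements.txt (shown here unpinned; any version constraint used in the actual commit is not reproduced):

```
psutil
```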
Signed-off-by: Jess Sightler <jesse.sightler@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.

Closing... will follow up in #1109