#1006 - Optimize CPU training on Linux#1010

Closed
jsight wants to merge 2 commits into instructlab:main from jsight:issue_1006_cpu_optimization

Conversation


jsight commented Apr 25, 2024

  • Timeout for chat mode increased to 300 seconds. Without this, it fails nearly silently after 30 seconds, which causes issues for CPU chat.
  • Return the filename when converting the final resulting file and use it for copying. This matters when using f32, since llama_to_gguf produces a filename different from the expected (f16) one.
  • Determine the available system memory and use it to choose the batch size in CPU mode. This enables more parallelization on smaller systems.
  • Disable fp16 and bf16 in CPU mode. This also enables parallelization. Without it, training is unusable on my i7 with 64 GB of RAM; with it, training takes under 30 minutes for one epoch of ~26 iterations.
  • Turn off "auto" for dtype when loading the model. This greatly improved inference performance on my i7 as well: it went from single-threaded to fully utilizing the CPU.
  • Updated the logging to be 1-based instead of zero-based :)
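The memory-based batch-size heuristic described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the function name `choose_batch_size` and the per-sample memory estimate are invented for the example, and the PR obtains the available-memory figure via `psutil.virtual_memory().available`.

```python
def choose_batch_size(
    available_gb: float,
    gb_per_sample: float = 2.0,  # assumed memory cost per batch element
    max_batch: int = 8,
) -> int:
    """Derive a CPU batch size from available system memory.

    In the PR, available memory would come from
    psutil.virtual_memory().available (converted to GB); here it is a
    plain parameter so the sketch stays self-contained.
    """
    # Scale the batch size with free memory, clamped to [1, max_batch].
    return max(1, min(max_batch, int(available_gb // gb_per_sample)))
```

For example, a machine with 64 GB free would get the full batch of 8, while a machine with only 5 GB free would be clamped down to a batch of 2.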


Signed-off-by: Jess Sightler <jesse.sightler@gmail.com>

jsight commented Apr 25, 2024

Resolves: #1006

Comment on lines +263 to +264
# CPU performance is very bad with fp16 or bf16, so disable both
use_fp16 = use_bf16 = False
Contributor
The best option likely depends on the specific CPU used here. Xeon CPUs with AVX512 and ipex may actually get better performance leaving bf16 enabled. For the sake of better CPU defaults, this is probably fine. But this is another example where the work @derekhiggins is doing around exposing all of these training arguments would be useful, so that users could ultimately tweak these for their specific system until or unless we get smart enough to make optimal decisions in most cases.
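A "smart enough" default along the lines this comment suggests could, on Linux, check the CPU's feature flags before forcing bf16 off. This is an illustrative sketch only (not part of the PR); `cpu_supports_bf16` is an invented name, and reading `/proc/cpuinfo` is a Linux-specific best-effort check.

```python
from pathlib import Path


def cpu_supports_bf16() -> bool:
    """Best-effort check for AVX512 BF16 support via /proc/cpuinfo (Linux only)."""
    try:
        flags = Path("/proc/cpuinfo").read_text()
    except OSError:
        # Not Linux, or /proc unavailable: fall back to the safe default.
        return False
    return "avx512_bf16" in flags


# Keep bf16 only where the CPU can actually accelerate it;
# fp16 stays disabled on CPU either way.
use_bf16 = cpu_supports_bf16()
use_fp16 = False
```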

Contributor

I'll test this to compare on my HW, but it will be early next week before I get a chance.

Here is the PR I've been testing to allow arbitrary training options
#1008

Author

Agreed. IMO, it'd be nice to have something like "profiles" and have it automatically select a reasonable default if none is specified.
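The "profiles" idea floated here could look something like the following. Everything in this sketch is hypothetical: the profile names, the settings in each profile, and the `select_profile` helper are invented to illustrate the shape of the idea, not anything in this PR or the project.

```python
from typing import Optional

# Hypothetical hardware profiles; names and values are illustrative only.
PROFILES = {
    "cpu-default": {"use_fp16": False, "use_bf16": False, "dtype": "float32"},
    "xeon-avx512": {"use_fp16": False, "use_bf16": True, "dtype": "bfloat16"},
    "gpu": {"use_fp16": True, "use_bf16": False, "dtype": "auto"},
}


def select_profile(name: Optional[str] = None) -> dict:
    """Return the requested profile, falling back to a safe CPU default."""
    return PROFILES.get(name, PROFILES["cpu-default"])
```

Unknown or unspecified profile names fall back to the conservative CPU default, which matches the "reasonable default if none is specified" behavior suggested above.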

Contributor

I have been facing a similar problem for Intel Gaudi (Habana Labs HPUs and SynapseAI) and created an abstraction layer for Torch device properties. I just ripped the code out of my experimental branch, cleaned it up, and published it as draft PR #1015.

# SPDX-License-Identifier: Apache-2.0

# Standard
import psutil # to determine system memory
Contributor

It's a third-party import.

Please add the requirement to requirements.txt, too. Right now it's a transitive dependency from peft.

Author

Thanks, I have added it to requirements.txt.

Signed-off-by: Jess Sightler <jesse.sightler@gmail.com>
@mergify mergify bot added the needs-rebase This Pull Request needs to be rebased label May 2, 2024
Contributor

mergify bot commented May 2, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @jsight please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


jsight commented May 3, 2024

Closing... will follow up in #1109


jsight commented May 3, 2024

#1109

@jsight jsight closed this May 3, 2024