Auto-detect bf16 support for CUDA #993
tiran wants to merge 4 commits into instructlab:main
Conversation
@Mergifyio rebase

❌ Unable to rebase: user

@Mergifyio rebase

❌ Base branch update has failed
Force-pushed from d646d59 to ecc1a38
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 8e3d568 to 7999e89

Force-pushed from 5f01310 to 0519a1d
@tiran what's the status on this? Thanks!

@leseb I have rebased the PR. Let's see if tests are now passing.
On a test system with 64 GB RAM, this memory calculation came out as 62, not 64. Check for 60 instead of 64. Obviously this is not very scientific, as we're making very rough assumptions about what is required. It would be better to enhance the code further to actually calculate a memory requirement based on the model instead of just hard-coding a rough guess. Signed-off-by: Russell Bryant <rbryant@redhat.com>
I spoke with @leseb on Slack and we determined that the memory check came out to 62 on a system with 64 GB of RAM.
Here are the results I've been waiting for :), on the same system as commented in #993 (comment). Previously it took 1h28min to barely reach 29% of the training; now the whole training took 1h19min:
```python
torch_dtype = "auto" if device.type == "cuda" else None
if device.type == "cpu":
    total_memory = psutil.virtual_memory().total / (1024**3)
    if total_memory < 60:
```
```diff
-    if total_memory < 60:
+    if total_memory < 62:
```
A system with 64 GB of RAM will report:

```python
>>> import psutil
>>> mem = psutil.virtual_memory()
>>> mem
svmem(total=67228049408, available=31099351040, percent=53.7, used=35383861248, free=468701184, active=27983499264, inactive=37159084032, buffers=1079336960, cached=30296150016, shared=2109440, slab=1340628992)
```

And 67228049408 bytes converted to GiB gives us 67228049408 / 1024**3 ≈ 62.6 GiB.
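For reference, a quick sketch of the unit arithmetic (treating the missing ~1.4 GiB as kernel/firmware reservations is my assumption, not something measured here):

```python
# 64 "GB" of installed DIMMs is 64 GiB of raw capacity.
installed_gib = 64

# Total reported by psutil on the test system above, in bytes.
reported_bytes = 67228049408
reported_gib = reported_bytes / 1024**3  # ~62.61 GiB

# The gap is presumably memory claimed before userspace sees it
# (kernel, firmware, integrated graphics, ...).
print(f"reported: {reported_gib:.2f} GiB, missing: {installed_gib - reported_gib:.2f} GiB")
```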
```python
# There's more going on here and needs deeper exploration to find
# the right parameters to be checking for choosing the best
# configuration.
# Anecdotally, 64 GB seems to be enough, but this calculation
```
A system with 64 GB of RAM will report ~62.6 GiB, so we base our calculation on 62.
Since it's such a rough guess, 60 still seems fine? We need to actually do some math at some point ...
I'll share my math in a few :) stay tuned!
Some more numbers:
- The training part takes ~30 GB of RAM. There is a very small chance that this could work on a very minimal Linux installation; by minimal I mean only system-critical services running and nothing else.
- The inference part takes ~35 GB of RAM.

Essentially, a system with 48 GB of RAM should be able to run both training and inference, although 48 GB of RAM is not a very common configuration. A rough check along these lines is sketched below.
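A minimal sketch of what such a combined check could look like, assuming the ~30 GiB / ~35 GiB figures above and an arbitrary ~10 GiB of headroom for the OS (the helper name and constants are illustrative, not from this PR):

```python
import psutil

# Anecdotal figures from this thread, not rigorous measurements.
TRAIN_GIB = 30
INFERENCE_GIB = 35
OS_HEADROOM_GIB = 10  # arbitrary cushion for system services

def enough_ram_for_cpu_workflow() -> bool:
    """Rough guess: can this host run training and then inference on CPU?"""
    total_gib = psutil.virtual_memory().total / (1024**3)
    # Training and inference run one after the other, so the peak need
    # is the larger of the two plus some headroom, not their sum.
    return total_gib >= max(TRAIN_GIB, INFERENCE_GIB) + OS_HEADROOM_GIB
```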
This pull request has merge conflicts that must be resolved before it can be merged.

This pull request has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.

This pull request has merge conflicts that must be resolved before it can be merged.

This pull request has merge conflicts that must be resolved before it can be merged. @tiran please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Hi @tiran! Are you still working on this PR? We're looking to do some housekeeping and close out stale PRs, including drafts. If we don't hear back within 7 days, we will close this PR, but please know that you are more than welcome to reopen it if you'd like! Thank you!

This pull request has merge conflicts that must be resolved before it can be merged. @tiran please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

This pull request has merge conflicts that must be resolved before it can be merged. @tiran please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

Changes
Which issue is resolved by this Pull Request:
See #647
Description of your changes:
bf16 (bfloat16) is not available on CUDA versions older than 11.0, nor on devices with a CUDA compute capability below 8.0. linux_train now detects and reports bf16 support; training on CUDA falls back to fp16 (half-precision float) when bf16 is unavailable.
Also closes #1006
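For illustration, a minimal sketch of how such a detection can be done with PyTorch; this is an assumption about the approach, not necessarily the exact code in this PR (bf16 needs CUDA >= 11.0 and compute capability >= 8.0, i.e. Ampere or newer):

```python
import torch

def supports_bf16(device: torch.device) -> bool:
    """Best-effort bfloat16 support check for a CUDA device (sketch).

    Recent PyTorch also exposes torch.cuda.is_bf16_supported() for this.
    """
    if device.type != "cuda" or not torch.cuda.is_available():
        return False
    cuda_version = torch.version.cuda  # e.g. "11.8"; None on non-CUDA builds
    if cuda_version is None or int(cuda_version.split(".")[0]) < 11:
        return False
    major, _minor = torch.cuda.get_device_capability(device)
    return major >= 8

# Usage on a CUDA device: fall back to fp16 when bf16 is unavailable.
if torch.cuda.is_available():
    device = torch.device("cuda")
    dtype = torch.bfloat16 if supports_bf16(device) else torch.float16
```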