Auto-detect bf16 support for CUDA #993
tiran wants to merge 4 commits into instructlab:main
Conversation
@Mergifyio rebase

❌ Unable to rebase: user

@Mergifyio rebase

❌ Base branch update has failed
Force-pushed from d646d59 to ecc1a38
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 8e3d568 to 7999e89

Force-pushed from 5f01310 to 0519a1d
@tiran what's the status on this? Thanks!

@leseb I have rebased the PR. Let's see if tests are now passing.
On a test system with 64 GB RAM, this memory calculation came out as 62, not 64. Check for 60 instead of 64. Obviously this is not very scientific, as we're making very rough assumptions about what is required. It would be better to enhance the code further to actually calculate a memory requirement based on the model instead of just hard-coding a rough guess. Signed-off-by: Russell Bryant <rbryant@redhat.com>
I spoke with @leseb on Slack and we determined that the memory check came out to 62 on a system with 64 GB of RAM.
Here are the results I've been waiting for :), on the same system as commented in #993 (comment). Previously it took 1h28min to barely reach 29% of the training; now the whole training took 1h19min:
```python
torch_dtype = "auto" if device.type == "cuda" else None
if device.type == "cpu":
    total_memory = psutil.virtual_memory().total / (1024**3)
    if total_memory < 60:
```
```diff
-    if total_memory < 60:
+    if total_memory < 62:
```
A system with 64 GB of RAM will report:

```python
>>> import psutil
>>> mem = psutil.virtual_memory()
>>> mem
svmem(total=67228049408, available=31099351040, percent=53.7, used=35383861248, free=468701184, active=27983499264, inactive=37159084032, buffers=1079336960, cached=30296150016, shared=2109440, slab=1340628992)
```

And 67228049408 bytes converted to GiB gives us 67228049408 / 1024**3 ≈ 62.6 GiB.
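For reference, a quick sketch of the unit arithmetic (treating the missing ~1.4 GiB as kernel/firmware reservations is my assumption, not something measured here):

```python
# 64 "GB" of installed DIMMs is 64 GiB of raw capacity.
installed_gib = 64

# Total reported by psutil on the test system above, in bytes.
reported_bytes = 67228049408
reported_gib = reported_bytes / 1024**3  # ~62.61 GiB

# The gap is presumably memory claimed before userspace sees it
# (kernel, firmware, integrated graphics, ...).
print(f"reported: {reported_gib:.2f} GiB, missing: {installed_gib - reported_gib:.2f} GiB")
```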
```python
# There's more going on here and needs deeper exploration to find
# the right parameters to be checking for choosing the best
# configuration.
# Anecdotally, 64 GB seems to be enough, but this calculation
```
A system with 64 GB of RAM will report ~62.6 GiB, so we base our calculation on 62.
Since it's such a rough guess, 60 still seems fine? We need to actually do some math at some point ...
I'll share my math in a few :) stay tuned!
Some more numbers:
- The training part takes ~30 GB of RAM. There is a very small chance that this could work on a very minimal Linux installation; by minimal I mean only system-critical services running and nothing else.
- The inference part takes ~35 GB of RAM.

Essentially, a system with 48 GB of RAM should be able to run both training and inference, although 48 GB of RAM is not a very common configuration. A rough check along these lines is sketched below.
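A minimal sketch of what such a combined check could look like, assuming the ~30 GiB / ~35 GiB figures above and an arbitrary ~10 GiB of headroom for the OS (the helper name and constants are illustrative, not from this PR):

```python
import psutil

# Anecdotal figures from this thread, not rigorous measurements.
TRAIN_GIB = 30
INFERENCE_GIB = 35
OS_HEADROOM_GIB = 10  # arbitrary cushion for system services

def enough_ram_for_cpu_workflow() -> bool:
    """Rough guess: can this host run training and then inference on CPU?"""
    total_gib = psutil.virtual_memory().total / (1024**3)
    # Training and inference run one after the other, so the peak need
    # is the larger of the two plus some headroom, not their sum.
    return total_gib >= max(TRAIN_GIB, INFERENCE_GIB) + OS_HEADROOM_GIB
```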
This pull request has merge conflicts that must be resolved before it can be merged.

This pull request has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.

This pull request has merge conflicts that must be resolved before it can be merged.

This pull request has merge conflicts that must be resolved before it can be merged. @tiran please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Hi @tiran! Are you still working on this PR? We're looking to do some housekeeping and close out stale PRs, including drafts. If we don't hear back within 7 days, we will close this PR, but please know that you are more than welcome to reopen it if you'd like! Thank you!

This pull request has merge conflicts that must be resolved before it can be merged. @tiran please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

This pull request has merge conflicts that must be resolved before it can be merged. @tiran please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

Changes
Which issue is resolved by this Pull Request:
See #647
Description of your changes:
bf16 (bfloat16) is not available on CUDA versions older than 11.0, nor on devices with a CUDA compute capability below 8.0. linux_train now detects and reports bf16 support; training on CUDA falls back to fp16 (half-precision float) when bf16 is unavailable.
Also closes #1006
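For illustration, a minimal sketch of how such a detection can be done with PyTorch; this is an assumption about the approach, not necessarily the exact code in this PR (bf16 needs CUDA >= 11.0 and compute capability >= 8.0, i.e. Ampere or newer):

```python
import torch

def supports_bf16(device: torch.device) -> bool:
    """Best-effort bfloat16 support check for a CUDA device (sketch).

    Recent PyTorch also exposes torch.cuda.is_bf16_supported() for this.
    """
    if device.type != "cuda" or not torch.cuda.is_available():
        return False
    cuda_version = torch.version.cuda  # e.g. "11.8"; None on non-CUDA builds
    if cuda_version is None or int(cuda_version.split(".")[0]) < 11:
        return False
    major, _minor = torch.cuda.get_device_capability(device)
    return major >= 8

# Usage on a CUDA device: fall back to fp16 when bf16 is unavailable.
if torch.cuda.is_available():
    device = torch.device("cuda")
    dtype = torch.bfloat16 if supports_bf16(device) else torch.float16
```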