FSDP2: Fix Single-GPU Tests & Update Configs #319
Conversation
le1nux left a comment
Nice work! Looks basically mergeable to me. Only a few things were unclear to me.
```diff
  grads = [p.grad for p in self.wrapped_model.parameters() if p.grad is not None]
  total_norm = torch.nn.utils.get_total_norm(
-     tensors=grads, norm_type=self.norm_type, error_if_nonfinite=False, foreach=True
+     tensors=grads, norm_type=self.norm_type.value, error_if_nonfinite=False, foreach=True
```
Is this a potential bug that would explain the effects during model training?
This appears to have been introduced in one of the PRs related to the FSDP2 integration, since on main (8fb80fd) `self.norm_type.value` is used correctly:
```python
gradient_norm_score = self.wrapped_model.clip_grad_norm_(max_norm=self.max_norm, norm_type=self.norm_type.value)
gradient_norm_score = self.wrapped_model.clip_grad_norm_(max_norm=torch.inf, norm_type=self.norm_type.value)
```
Therefore, this does not affect previous model runs.
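For context, a minimal sketch of the difference; `NormType` below is a hypothetical stand-in for the repo's actual norm-type enum, not its real name:

```python
# Minimal sketch: a hypothetical NormType enum standing in for the repo's norm-type enum.
from enum import Enum

import torch


class NormType(Enum):
    P2_NORM = 2.0  # hypothetical member; .value is the plain float that torch expects


grads = [torch.randn(3, 3), torch.randn(5)]

# Correct: pass the numeric value so the L2 norm is computed (as main does via .value).
total_norm = torch.nn.utils.get_total_norm(
    tensors=grads, norm_type=NormType.P2_NORM.value, error_if_nonfinite=False, foreach=True
)
print(total_norm)

# Buggy variant from the diff above: the enum member itself is passed instead of its
# numeric value, so norm_type is not a plain float.
# total_norm = torch.nn.utils.get_total_norm(
#     tensors=grads, norm_type=NormType.P2_NORM, error_if_nonfinite=False, foreach=True
# )
```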
```python
    return model


def _load_gpt2(
```
Suggested change:
```diff
-def _load_gpt2(
+def _load_fsdp1_sharded_gpt2_model(
```
```python
    std: float | str = 0.02,
    sharding_strategy: ShardingStrategy = ShardingStrategy.NO_SHARD,
) -> FSDP:
    """load gpt2 or coca model from config and fsdp-wrap it"""
```
Does this really support CoCa?
It does not support the CoCa model (see the `if` conditions).
TODO: Adapt docstring.
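A hedged sketch of how the signature and docstring could read after the suggested rename (the body is omitted; the real helper builds the model from the test config):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy


def _load_fsdp1_sharded_gpt2_model(
    std: float | str = 0.02,
    sharding_strategy: ShardingStrategy = ShardingStrategy.NO_SHARD,
) -> FSDP:
    """Load a GPT-2 model from config and FSDP1-wrap it (CoCa is not supported)."""
    ...  # body omitted; see the test utilities touched by this PR
```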
```python
    [
        ("gpt2", ShardingStrategy.NO_SHARD, 145009152),
        ("gpt2", ShardingStrategy.FULL_SHARD, 145009152),
        ("gpt2", ShardingStrategy.HYBRID_SHARD, 145009152),
```
Should we extend this to also test CoCa?
I would suggest addressing this in a separate PR.
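For illustration, a hedged sketch of how the parametrization could be extended in that follow-up PR; the argument names and test name are assumptions, and the expected CoCa parameter count is left open:

```python
import pytest
from torch.distributed.fsdp import ShardingStrategy


@pytest.mark.parametrize(
    "model_name, sharding_strategy, expected_num_params",  # argument names assumed, not the repo's
    [
        ("gpt2", ShardingStrategy.NO_SHARD, 145009152),
        ("gpt2", ShardingStrategy.FULL_SHARD, 145009152),
        ("gpt2", ShardingStrategy.HYBRID_SHARD, 145009152),
        # ("coca", ShardingStrategy.FULL_SHARD, ...),  # expected count to be filled in the follow-up PR
    ],
)
def test_number_of_trainable_parameters(model_name, sharding_strategy, expected_num_params):
    ...  # body omitted; the real test lives in this repo's test suite
```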
I will merge this already, despite the open discussion points (we should still discuss those), and continue the integration of FSDP2 testing.
What does this PR do?
This PR
- `attention_norm_config` & `AppState`

In addition, the typing annotation `FSDPX = FSDP1 | FSDP2` is introduced (but so far only used in `mfu.py`); a minimal sketch follows below.
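A minimal sketch of what such an alias can look like; the exact import paths and alias definitions in the repo may differ:

```python
# Sketch only: the repo's actual alias definitions may differ.
from torch.distributed.fsdp import FSDPModule as FSDP2  # FSDP2: module type produced by fully_shard
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP1  # FSDP1: classic wrapper class

FSDPX = FSDP1 | FSDP2  # functions that accept either wrapper (currently only in mfu.py) can annotate with FSDPX


def _example_usage(wrapped_model: FSDPX) -> None:  # hypothetical helper, for illustration only
    print(type(wrapped_model).__name__)
```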
General Changes
- `get_total_number_of_trainable_parameters()` is now properly tested on multi-GPU instead of on a single GPU with mocking (see the rough sketch below)
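A rough, hedged illustration of the multi-rank idea (not the repo's actual test harness): spawn several processes, count trainable parameters per rank, and aggregate across ranks instead of mocking a single-GPU view; the gloo backend is used so the sketch also runs without GPUs.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int) -> None:
    # Minimal process-group setup; address and port are placeholder values.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Stand-in model; the real test builds and shards the GPT-2 model instead.
    model = torch.nn.Linear(10, 10)
    local_num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    # Aggregate across ranks, analogous to what a sharded parameter count has to do.
    total = torch.tensor(local_num_params)
    dist.all_reduce(total, op=dist.ReduceOp.SUM)
    if rank == 0:
        print(f"summed parameter count across {world_size} ranks: {total.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
```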
Breaking Changes

Checklist before submitting final PR
- Run the tests (`python tests/tests.py`)
- Update the changelog (`CHANGELOG_DEV.md`)