Conversation

@flxst (Member) commented Mar 14, 2025

What does this PR do?

This PR

  • updates configs to account for recent changes regarding attention_norm_config & AppState
  • updates single-gpu tests and fixes various bugs

In addition, the typing annotation FSDPX = FSDP1 | FSDP2 is introduced (currently only used in mfu.py).
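A minimal sketch of what this alias could look like, assuming FSDP1 refers to the classic FullyShardedDataParallel wrapper and FSDP2 to PyTorch's composable FSDPModule; the actual imports and usage in the codebase may differ:

    # Sketch only: FSDP1 is the classic wrapper class; using FSDPModule as FSDP2 is an
    # assumption based on PyTorch's composable fully_shard API (older releases expose it
    # under torch.distributed._composable.fsdp instead).
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP1
    from torch.distributed.fsdp import FSDPModule as FSDP2

    FSDPX = FSDP1 | FSDP2  # PEP 604 union, usable as a typing annotation on Python 3.10+

    def compute_mfu(wrapped_model: FSDPX) -> float:  # hypothetical signature, e.g. in mfu.py
        ...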

General Changes

  • get_total_number_of_trainable_parameters() is now properly tested on multi-GPU instead of single-GPU with mocking (see the sketch below)
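For context, a hedged sketch of why this needs a real multi-GPU setup: under FSDP sharding each rank only holds a shard of the parameters, so a counting helper typically sums local numels and all-reduces across ranks. This is illustrative only and not necessarily how get_total_number_of_trainable_parameters() is implemented:

    import torch
    import torch.distributed as dist

    def count_trainable_parameters(model: torch.nn.Module) -> int:
        # Each rank counts only the parameters (or shards) it actually holds ...
        local_count = sum(p.numel() for p in model.parameters() if p.requires_grad)
        # ... and the global total is recovered by summing over all ranks
        # (assumes one GPU per rank with a NCCL process group already initialized).
        counted = torch.tensor(local_count, dtype=torch.int64, device="cuda")
        dist.all_reduce(counted, op=dist.ReduceOp.SUM)
        return int(counted.item())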

Breaking Changes

  • None

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@flxst flxst self-assigned this Mar 14, 2025
@flxst flxst marked this pull request as draft March 14, 2025 10:30
@flxst flxst changed the title from "FSDP2: Fix Tests & Update Configs" to "FSDP2: Fix Single-GPU Tests & Update Configs" Mar 14, 2025
@flxst flxst marked this pull request as ready for review March 14, 2025 16:59
@flxst flxst requested a review from le1nux March 14, 2025 16:59
@le1nux (Member) left a comment


Nice work! Looks basically mergeable to me. Only a few things were unclear to me.

grads = [p.grad for p in self.wrapped_model.parameters() if p.grad is not None]
total_norm = torch.nn.utils.get_total_norm(
-    tensors=grads, norm_type=self.norm_type, error_if_nonfinite=False, foreach=True
+    tensors=grads, norm_type=self.norm_type.value, error_if_nonfinite=False, foreach=True

Is this a potential bug that would explain the effects during model training?


This appears to have been introduced in one of the PRs related to the FSDP2 integration, as main (8fb80fd) correctly uses self.norm_type.value:

gradient_norm_score = self.wrapped_model.clip_grad_norm_(max_norm=self.max_norm, norm_type=self.norm_type.value)
and
gradient_norm_score = self.wrapped_model.clip_grad_norm_(max_norm=torch.inf, norm_type=self.norm_type.value)

Therefore, this does not affect previous model runs.
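For illustration, a minimal standalone sketch of the enum-vs-value distinction discussed here, using a hypothetical stand-in for the project's norm-type enum (torch.nn.utils.get_total_norm is the utility already shown in the diff above):

    from enum import Enum

    import torch

    class NormType(Enum):  # hypothetical stand-in for the norm-type enum used in the codebase
        P2_NORM = 2.0

    grads = [torch.randn(3, 3), torch.randn(5)]

    # Passing the enum member itself would hand a non-numeric object to the utility;
    # the numeric payload is what norm_type expects, hence the .value access.
    total_norm = torch.nn.utils.get_total_norm(
        tensors=grads, norm_type=NormType.P2_NORM.value, error_if_nonfinite=False, foreach=True
    )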

return model


def _load_gpt2(
Suggested change:
-    def _load_gpt2(
+    def _load_fsdp1_sharded_gpt2_model(

std: float | str = 0.02,
sharding_strategy: ShardingStrategy = ShardingStrategy.NO_SHARD,
) -> FSDP:
"""load gpt2 or coca model from config and fsdp-wrap it"""
Does this really support CoCa?


It does not support the CoCa model (see the if conditions).

TODO: Adapt docstring.
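A hedged sketch combining the rename suggested above with the docstring TODO; the signature follows the diff excerpt, but the body is elided and the docstring wording is only a proposal:

    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import ShardingStrategy

    def _load_fsdp1_sharded_gpt2_model(
        std: float | str = 0.02,
        sharding_strategy: ShardingStrategy = ShardingStrategy.NO_SHARD,
    ) -> FSDP:
        """Load a GPT-2 model from config and wrap it with FSDP1 (CoCa is not supported here)."""
        ...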

[
("gpt2", ShardingStrategy.NO_SHARD, 145009152),
("gpt2", ShardingStrategy.FULL_SHARD, 145009152),
("gpt2", ShardingStrategy.HYBRID_SHARD, 145009152),
Should we extend this to also test CoCa?


I would suggest addressing this in a separate PR.
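For reference, a hypothetical sketch of how the parametrization quoted above could later be extended to CoCa in that separate PR; the test name is illustrative and the CoCa parameter count is deliberately left as a placeholder:

    import pytest
    from torch.distributed.fsdp import ShardingStrategy

    @pytest.mark.parametrize(
        "model_name, sharding_strategy, expected_num_params",
        [
            ("gpt2", ShardingStrategy.NO_SHARD, 145009152),
            ("gpt2", ShardingStrategy.FULL_SHARD, 145009152),
            ("gpt2", ShardingStrategy.HYBRID_SHARD, 145009152),
            # ("coca", ShardingStrategy.FULL_SHARD, ...),  # expected count to be determined
        ],
    )
    def test_total_number_of_trainable_parameters(model_name, sharding_strategy, expected_num_params):
        ...  # illustrative test shell; the real test lives in the repository's test suite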

@le1nux (Member) commented Mar 26, 2025

I will merge this now, despite the open discussion points (we should still discuss those), and continue the integration of FSDP2 testing.

@le1nux le1nux merged commit 3820195 into fsdp2_min_integration Mar 26, 2025
3 of 5 checks passed