FSDP2: Fix Single-GPU Tests & Update Configs #319
Conversation
le1nux left a comment
Nice work! Looks basically mergeable to me. Only a few things were unclear to me.
```diff
  grads = [p.grad for p in self.wrapped_model.parameters() if p.grad is not None]
  total_norm = torch.nn.utils.get_total_norm(
-     tensors=grads, norm_type=self.norm_type, error_if_nonfinite=False, foreach=True
+     tensors=grads, norm_type=self.norm_type.value, error_if_nonfinite=False, foreach=True
```
Is this a potential bug that would explain the effects during model training?
This appears to have been introduced in one of the PRs related to the FSDP2 integration, since on main (8fb80fd) `self.norm_type.value` is used correctly:
```python
gradient_norm_score = self.wrapped_model.clip_grad_norm_(max_norm=self.max_norm, norm_type=self.norm_type.value)
gradient_norm_score = self.wrapped_model.clip_grad_norm_(max_norm=torch.inf, norm_type=self.norm_type.value)
```
Therefore, this does not affect previous model runs.
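For context, a minimal sketch of the difference; `NormType` below is a hypothetical stand-in for the repo's actual norm-type enum, not its real name:

```python
# Minimal sketch: a hypothetical NormType enum standing in for the repo's norm-type enum.
from enum import Enum

import torch


class NormType(Enum):
    P2_NORM = 2.0  # hypothetical member; .value is the plain float that torch expects


grads = [torch.randn(3, 3), torch.randn(5)]

# Correct: pass the numeric value so the L2 norm is computed (as main does via .value).
total_norm = torch.nn.utils.get_total_norm(
    tensors=grads, norm_type=NormType.P2_NORM.value, error_if_nonfinite=False, foreach=True
)
print(total_norm)

# Buggy variant from the diff above: the enum member itself is passed instead of its
# numeric value, so norm_type is not a plain float.
# total_norm = torch.nn.utils.get_total_norm(
#     tensors=grads, norm_type=NormType.P2_NORM, error_if_nonfinite=False, foreach=True
# )
```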
```python
    return model


def _load_gpt2(
```
Suggested change:
```diff
-def _load_gpt2(
+def _load_fsdp1_sharded_gpt2_model(
```
```python
    std: float | str = 0.02,
    sharding_strategy: ShardingStrategy = ShardingStrategy.NO_SHARD,
) -> FSDP:
    """load gpt2 or coca model from config and fsdp-wrap it"""
```
Does this really support CoCa?
It does not support the CoCa model (see the `if` conditions).
TODO: Adapt docstring.
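A hedged sketch of how the signature and docstring could read after the suggested rename (the body is omitted; the real helper builds the model from the test config):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy


def _load_fsdp1_sharded_gpt2_model(
    std: float | str = 0.02,
    sharding_strategy: ShardingStrategy = ShardingStrategy.NO_SHARD,
) -> FSDP:
    """Load a GPT-2 model from config and FSDP1-wrap it (CoCa is not supported)."""
    ...  # body omitted; see the test utilities touched by this PR
```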
```python
    [
        ("gpt2", ShardingStrategy.NO_SHARD, 145009152),
        ("gpt2", ShardingStrategy.FULL_SHARD, 145009152),
        ("gpt2", ShardingStrategy.HYBRID_SHARD, 145009152),
```
Should we extend this to also test CoCa?
I would suggest addressing this in a separate PR.
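For illustration, a hedged sketch of how the parametrization could be extended in that follow-up PR; the argument names and test name are assumptions, and the expected CoCa parameter count is left open:

```python
import pytest
from torch.distributed.fsdp import ShardingStrategy


@pytest.mark.parametrize(
    "model_name, sharding_strategy, expected_num_params",  # argument names assumed, not the repo's
    [
        ("gpt2", ShardingStrategy.NO_SHARD, 145009152),
        ("gpt2", ShardingStrategy.FULL_SHARD, 145009152),
        ("gpt2", ShardingStrategy.HYBRID_SHARD, 145009152),
        # ("coca", ShardingStrategy.FULL_SHARD, ...),  # expected count to be filled in the follow-up PR
    ],
)
def test_number_of_trainable_parameters(model_name, sharding_strategy, expected_num_params):
    ...  # body omitted; the real test lives in this repo's test suite
```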
I will merge this already, despite the open discussion points (we should still discuss those), and continue the integration of FSDP2 testing.
What does this PR do?
This PR
- `attention_norm_config` & `AppState`

In addition, the typing annotation `FSDPX = FSDP1 | FSDP2` is introduced (but so far only used in `mfu.py`); a minimal sketch follows below.
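A minimal sketch of what such an alias can look like; the exact import paths and alias definitions in the repo may differ:

```python
# Sketch only: the repo's actual alias definitions may differ.
from torch.distributed.fsdp import FSDPModule as FSDP2  # FSDP2: module type produced by fully_shard
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP1  # FSDP1: classic wrapper class

FSDPX = FSDP1 | FSDP2  # functions that accept either wrapper (currently only in mfu.py) can annotate with FSDPX


def _example_usage(wrapped_model: FSDPX) -> None:  # hypothetical helper, for illustration only
    print(type(wrapped_model).__name__)
```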
General Changes
- `get_total_number_of_trainable_parameters()` is now properly tested on multi-GPU instead of on a single GPU with mocking (see the rough sketch below)
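A rough, hedged illustration of the multi-rank idea (not the repo's actual test harness): spawn several processes, count trainable parameters per rank, and aggregate across ranks instead of mocking a single-GPU view; the gloo backend is used so the sketch also runs without GPUs.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int) -> None:
    # Minimal process-group setup; address and port are placeholder values.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Stand-in model; the real test builds and shards the GPT-2 model instead.
    model = torch.nn.Linear(10, 10)
    local_num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    # Aggregate across ranks, analogous to what a sharded parameter count has to do.
    total = torch.tensor(local_num_params)
    dist.all_reduce(total, op=dist.ReduceOp.SUM)
    if rank == 0:
        print(f"summed parameter count across {world_size} ranks: {total.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
```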
Breaking Changes

Checklist before submitting final PR
- Run the tests (`python tests/tests.py`)
- Update the changelog (`CHANGELOG_DEV.md`)