Encoding checkpoint reshaping guide #349

Draft
tjruwase wants to merge 2 commits into bigscience-workshop/Megatron-DeepSpeed:main from bigscience-workshop/Megatron-DeepSpeed:universal_ckpt_info

Conversation

@tjruwase
Collaborator

This PR is a step towards generalizing the universal checkpointing approach that enables arbitrary reshapes of 3D parallel checkpoints. It eliminates the hardcoding of the BLOOM model architecture in the current implementation as follows:

  1. The client encodes any shape information required for extracting and merging tensor slices (e.g., which slices are to be averaged rather than concatenated). This information is included in the checkpoint file.
  2. Constant strings are replaced with symbolic constants defined by the DeepSpeed library.
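
To make the two points above concrete, here is a minimal sketch of the idea, not the code in this PR. Every name in it (`MERGE_INFO_KEY`, `MERGE_AVERAGE`, `MERGE_CONCAT`, `merge_slices`, `merge_checkpoint`, and the parameter names) is hypothetical. The real symbolic constants are defined by the DeepSpeed library and are not reproduced here. Plain Python lists stand in for tensors so the sketch runs without any dependencies.

```python
# Hypothetical symbolic constants; the real ones are defined by DeepSpeed.
MERGE_INFO_KEY = "merge_info"
MERGE_AVERAGE = "average"
MERGE_CONCAT = "concat"


def merge_slices(slices, strategy):
    """Merge per-rank slices (lists of floats standing in for tensors)."""
    if strategy == MERGE_AVERAGE:
        # Element-wise mean across ranks.
        count = len(slices)
        return [sum(values) / count for values in zip(*slices)]
    if strategy == MERGE_CONCAT:
        # Join the slices end to end, in rank order.
        return [value for piece in slices for value in piece]
    raise ValueError(f"unknown merge strategy: {strategy!r}")


def merge_checkpoint(shards):
    """Merge shards using the strategy table stored in the checkpoint itself."""
    # The merge rules travel with the checkpoint, so no model architecture
    # has to be hardcoded in the merging code.
    merge_info = shards[0][MERGE_INFO_KEY]
    return {
        name: merge_slices([shard["params"][name] for shard in shards], strategy)
        for name, strategy in merge_info.items()
    }


# Two hypothetical tensor-parallel shards of the same checkpoint.
merge_info = {"layernorm.weight": MERGE_AVERAGE, "dense.weight": MERGE_CONCAT}
shards = [
    {MERGE_INFO_KEY: merge_info,
     "params": {"layernorm.weight": [1.0, 3.0], "dense.weight": [1.0, 2.0]}},
    {MERGE_INFO_KEY: merge_info,
     "params": {"layernorm.weight": [3.0, 5.0], "dense.weight": [3.0, 4.0]}},
]

merged = merge_checkpoint(shards)
print(merged)
```

Because the strategy table is read from the checkpoint rather than from the merging code, supporting another architecture would, under these assumptions, only require the client to write a different table.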

Requires the companion DS PR.

@tjruwase tjruwase requested a review from stas00 September 20, 2022 03:33
@tjruwase
Collaborator Author

@stas00, I don't intend for this to be merged. Rather, I am sharing this PR to get your feedback on the generalization effort. As discussed earlier, the core logic will eventually move to DS.

Contributor

@stas00 stas00 left a comment

Looks pretty clean to me. Thank you for working on that, Tunji!

I know it's not as much fun when you're now working on it alone and without an immediately applicable context.

Comment thread megatron/checkpointing.py Outdated
@tjruwase tjruwase marked this pull request as draft September 20, 2022 13:04