add e2e testing for quantized backend training by cdoern · Pull Request #1494 · instructlab/instructlab

cdoern · Jun 27, 2024

adds another training test which runs after the --legacy=true test

Checklist:

Commit Message Formatting: Commit titles and messages follow guidelines in the
conventional commits.
Changelog updated with breaking and/or notable changes for the next minor release.
Documentation has been updated, if necessary.
Unit tests have been added, if necessary.
Integration tests have been added, if necessary.

mergify · Jun 28, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

russellb · Jun 28, 2024

It looks like e2e actually failed, but it says it passed, not sure why

https://github.com/instructlab/instructlab/actions/runs/9702039128/job/26776964340?pr=1494

cdoern · Jun 29, 2024

huh yeah I noticed that @russellb I will see if I can get it passing today

mergify · Jun 30, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · Jul 1, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

cdoern · Jul 1, 2024

switched to merlinite, let's see if that gets around the Ampere limitation. If not, @Maxusmusti has a fix in the training library to disable flash attn

mergify · Jul 2, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

cdoern · Jul 8, 2024

wondering if we should close this in favor of just using the A10s in #1557 @russellb @Maxusmusti WDYT? Is there any chance with LoRA I could get this running on the smaller instances?

Maxusmusti · Jul 8, 2024

@cdoern what GPU is being used in these instances?

russellb · Jul 8, 2024

@cdoern yeah, focusing new training on the larger instances makes sense to me.

I'm going to propose a workflow that uses 4x A10Gs. I think that would be a great place to introduce this coverage.

cdoern · Jul 12, 2024

closing in favor of #1557 which merged. If we need this version we can reopen another PR

mergify bot added the ci-failure label (PR has at least one CI failure) on Jun 27, 2024

cdoern force-pushed the training branch from 43bf8ab to 27ef746 on June 27, 2024 15:04

mergify bot removed the ci-failure label (PR has at least one CI failure) on Jun 27, 2024

cdoern force-pushed the training branch from 27ef746 to fa00cb3 on June 27, 2024 15:07

mergify bot added the ci-failure label (PR has at least one CI failure) on Jun 27, 2024

cdoern force-pushed the training branch from fa00cb3 to 3527883 on June 27, 2024 15:17

mergify bot removed the ci-failure label (PR has at least one CI failure) on Jun 27, 2024

cdoern force-pushed the training branch 2 times, most recently from 4648f8c to e0723c0 on June 27, 2024 17:48

ktam3 mentioned this pull request Jun 27, 2024

[Epic] RHEL AI backend commands #1503

Closed

ktam3 linked an issue Jun 27, 2024 that may be closed by this pull request

[Epic] RHEL AI backend commands #1503

Closed

cdoern force-pushed the training branch from e0723c0 to aeeb037 on June 27, 2024 18:50

mergify bot added the needs-rebase label (This Pull Request needs to be rebased) on Jun 28, 2024

cdoern force-pushed the training branch from aeeb037 to d17bf7f on June 29, 2024 20:03

mergify bot added the ci-failure label (PR has at least one CI failure) and removed the needs-rebase label (This Pull Request needs to be rebased) on Jun 29, 2024

russellb mentioned this pull request Jun 29, 2024

Use new training code in e2e CI job #1470

Closed

russellb linked an issue Jun 29, 2024 that may be closed by this pull request

Use new training code in e2e CI job #1470

Closed

mergify bot added the needs-rebase label (This Pull Request needs to be rebased) on Jun 30, 2024

cdoern force-pushed the training branch from 310abd9 to 4f61973 on June 30, 2024 21:42

mergify bot removed the ci-failure label (PR has at least one CI failure) on Jun 30, 2024

cdoern force-pushed the training branch from 4f61973 to 23c4b11 on June 30, 2024 21:44

mergify bot added the ci-failure label (PR has at least one CI failure) and removed the needs-rebase (This Pull Request needs to be rebased) and ci-failure labels on Jun 30, 2024

mergify bot removed the ci-failure label (PR has at least one CI failure) on Jun 30, 2024

nathan-weinberg added the testing label (Relates to testing) on Jul 1, 2024

mergify bot added the needs-rebase label (This Pull Request needs to be rebased) on Jul 1, 2024

RobotSail mentioned this pull request Jul 1, 2024

Revert "remove incorrect end-to-end implementation" instructlab/training#77

Closed

cdoern force-pushed the training branch from 6468329 to 5fcf2bc on July 1, 2024 21:13

cdoern added 2 commits July 1, 2024 17:14

add e2e testing for quantized backend training

fcbab10

adds a jsonl file for backend training so we don't need to worry about generation, uses LoRA

Signed-off-by: Charlie Doern <cdoern@redhat.com>

bump training lib, fix broken save_samples temporarily

10fa8e2

Signed-off-by: Charlie Doern <cdoern@redhat.com>

cdoern force-pushed the training branch from 5fcf2bc to b0cff88 on July 1, 2024 21:14

mergify bot added the ci-failure label (PR has at least one CI failure) and removed the needs-rebase label (This Pull Request needs to be rebased) on Jul 1, 2024

cdoern force-pushed the training branch from b0cff88 to 1dfafdd on July 2, 2024 01:50

mergify bot added the ci-failure label (PR has at least one CI failure) and removed the ci-failure label on Jul 2, 2024

cdoern force-pushed the training branch from 1dfafdd to 1d7d258 on July 2, 2024 02:05

mergify bot added the ci-failure label (PR has at least one CI failure) and removed the ci-failure label on Jul 2, 2024

flash-attn and packaging dependency

96c7897

Signed-off-by: Charlie Doern <cdoern@redhat.com>

cdoern force-pushed the training branch from 1d7d258 to 96c7897 on July 2, 2024 02:32

mergify bot added the needs-rebase label (This Pull Request needs to be rebased) and removed the ci-failure label (PR has at least one CI failure) on Jul 2, 2024

cdoern closed this Jul 12, 2024

ktam3 added this to the 0.18.0 milestone Jul 15, 2024

cdoern wants to merge 3 commits into instructlab:main from cdoern:training
