Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Conversation

flexthink
Copy link
Collaborator

What does this PR do?

(Work in progress) A universal feature extractor to extract arbitrary features from dataset (discrete tokens, continuous representations, etc) and save them using arbitrary formats.

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

Copy link
Collaborator

@Adel-Moumen Adel-Moumen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example of a train.py integration of your new tokens loader?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this script should be here. I think it should be dataset dependent similar to what we are doing for let's say librispeech_preparation.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was borrowed from DASB - my older approach integrated that with preparation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you should move this a unit-test. I think the extraction will requires extensive tests to make sure they are correct in the loading/saving process

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will create unit tests

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Adel-Moumen: Unit tests created in #2938

@flexthink
Copy link
Collaborator Author

Do you have an example of a train.py integration of your new tokens loader?

I have some private examples - but they are on new work in progress not ready to be merged yet, as well as older incarnations of Tokotron.

I would suggest choosing one existing recipe and integrating it.

@Adel-Moumen
Copy link
Collaborator

Also, quick question to @pplantinga, don't you think we should maybe aim for a single backend? Given that we are trying to minimise the number of dependencies, I would find it better to just stick to the best and more general purpose solution (instead of having something too general). I believe that most of them share similar pos and cons. In our context, I am not sure if we really need something very sota, I would prefer having something easy to use, where we only need low efforts to maintain the integration. So maybe, something like numpy or h5 is enough.

@flexthink
Copy link
Collaborator Author

See #2938 for a simplified H5-only version.

@TParcollet TParcollet added this to the v1.1.0 milestone Oct 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.