Generic feature extraction POC #2876
base: develop
Conversation
Do you have an example of a train.py integration of your new tokens loader?
I don't think this script should be here. I think it should be dataset-dependent, similar to what we do for, say, librispeech_preparation.py.
This was borrowed from DASB - my older approach integrated that with preparation.
Maybe you should move this to a unit test. I think the extraction will require extensive tests to make sure the loading/saving process is correct.
I will create unit tests
@Adel-Moumen: Unit tests created in #2938
I have some private examples, but they are part of new work in progress that is not ready to be merged yet, or belong to older incarnations of Tokotron. I would suggest choosing one existing recipe and integrating it.
Also, quick question to @pplantinga: don't you think we should aim for a single backend? Given that we are trying to minimise the number of dependencies, I would find it better to stick to the best and most general-purpose solution (instead of having something too general). I believe most of them share similar pros and cons. In our context, I am not sure we really need something very SOTA; I would prefer something easy to use, where maintaining the integration takes little effort. So maybe something like numpy or h5 is enough.
See #2938 for a simplified H5-only version.
What does this PR do?
(Work in progress) A universal feature extractor that extracts arbitrary features from a dataset (discrete tokens, continuous representations, etc.) and saves them in arbitrary formats.
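
To make the idea concrete, here is a minimal sketch of the extract-then-save pattern described above. It is not the code from this PR: `extract_and_save`, `load_features`, and `toy_tokens` are hypothetical names, and NumPy's `.npz` archive stands in for whichever storage backend is eventually chosen (the discussion mentions numpy and h5 as candidates).

```python
import os
import tempfile

import numpy as np


def extract_and_save(dataset, extractor, path):
    """Apply `extractor` to every (utt_id, waveform) pair and store the
    results in a single .npz archive keyed by utterance ID.

    Hypothetical helper for illustration; not part of the PR.
    """
    features = {utt_id: np.asarray(extractor(wav)) for utt_id, wav in dataset}
    np.savez(path, **features)
    return path


def load_features(path):
    """Read the archive back into a plain dict of arrays."""
    with np.load(path) as archive:
        return {utt_id: archive[utt_id] for utt_id in archive.files}


def toy_tokens(wav):
    """Stand-in 'tokenizer': 1 where the sample is positive, else 0."""
    return (wav > 0).astype(np.int64)


# Tiny in-memory dataset (assumed layout: utterance ID plus waveform).
dataset = [
    ("utt1", np.array([0.5, -0.5, 0.25, -0.25])),
    ("utt2", np.array([1.0, 2.0])),
]

with tempfile.TemporaryDirectory() as tmp:
    path = extract_and_save(dataset, toy_tokens, os.path.join(tmp, "tokens.npz"))
    loaded = load_features(path)
```

Because the extractor is just a callable, the same loop would serve discrete tokens and continuous representations alike; only the save/load pair would change if another backend were swapped in.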