Tokotron: Tokenized TTS (lite version - minimal dependencies) #2849

base: develop
Conversation
Thanks for your work minimizing the dependencies. This is still quite large, however, and will take some time to review; I only finished one pass and didn't have time to look at everything. I will also have to go and try to run the LibriTTS and LJSpeech recipes.
Overall, the code quality looks quite good, if a little verbose for my taste -- e.g. I'm not sure whether the Additive Embedding, Null Embedding, and Embedding Injection are really needed, or whether something simpler could be done. And some of the docstrings have extra spaces that don't quite match the overall SpeechBrain docstring style.
Anything you can do to simplify and keep only the parts that are really necessary will be a huge help for my review, as well as for future users!
The subfolder "fastspeech2" contains the recipes for training the non-autoregressive transformer based TTS model [FastSpeech2](https://arxiv.org/abs/2006.04558).

# Tokotron
The subfolder "tokotron" contains the recipes for training the transformer-based that uses discrete audio representations.
Suggested change:
- The subfolder "tokotron" contains the recipes for training the transformer-based that uses discrete audio representations.
+ The subfolder "tokotron" contains the recipes for training a transformer-based model that uses discrete audio representations.
    compatibility
g2p_src : str
    The source (HuggingFace hub or path) of the G2P model to be used
Used for what? Under what circumstances?
if model_name in ["Tacotron2", "FastSpeech2WithAlignment"]:
    if extract_phonemes:
        logger.info(
            "Computing phonemes for LJSpeech labels using SpeechBrain G2P. This may take a while."
Suggested change:
- "Computing phonemes for LJSpeech labels using SpeechBrain G2P. This may take a while."
+ f"Using G2P {g2p_src} to convert LJSpeech labels to phonemes. This may take a while."
if model_name is not None and "FastSpeech2" in model_name:
    if extract_phonemes:
        logger.info(
            "Computing pitch as required for FastSpeech2. This may take a while."
Is the pitch required for Tokotron as well? At the least, this message should be updated.
hubert: chaanks/hifigan-hubert-l1-3-7-12-18-23-LibriTTS
wav2vec: chaanks/hifigan-hubert-l1-3-7-12-18-23-LibriTTS
Are these supposed to be the same?
    ["out", "gate_out", "dec_self_attn", "dec_attn", "alignments", "context"],
)

TokotronDecoderInfernceOutput = namedtuple(
Inference, not Infernce
    ],
)

TokotronInfernceOutput = namedtuple(
Inference, not Infernce
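For reference, the rename amounts to fixing the spelling in the `namedtuple` definitions. A minimal sketch follows; the field names here are placeholders, not the actual fields from the PR's diff:

```python
from collections import namedtuple

# Corrected spelling: "Inference", not "Infernce".
# Field names are illustrative placeholders only.
TokotronDecoderInferenceOutput = namedtuple(
    "TokotronDecoderInferenceOutput", ["audio_tokens", "length"]
)
TokotronInferenceOutput = namedtuple(
    "TokotronInferenceOutput", ["audio_tokens", "length"]
)

out = TokotronDecoderInferenceOutput(audio_tokens=[1, 2, 3], length=3)
```

Since the typo is in the type name (not just a variable), all call sites constructing or annotating these tuples need the same rename.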
)
nn.init.xavier_normal_(self.in_proj.w.weight)

"""A simple embedding mechanism that adds the embedding to the inputs before the layer"""
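To make the simplification discussion concrete, an additive embedding of this kind can be sketched in a few lines. This is a hypothetical minimal version under assumed names and shapes, not the PR's actual implementation:

```python
import torch
import torch.nn as nn


class AdditiveEmbedding(nn.Module):
    """Adds a learned embedding to the inputs before applying a layer.

    Hypothetical sketch: module, argument names, and shapes are assumptions.
    """

    def __init__(self, num_embeddings, emb_dim, layer):
        super().__init__()
        self.emb = nn.Embedding(num_embeddings, emb_dim)
        self.layer = layer

    def forward(self, x, idx):
        # x: (batch, time, emb_dim); idx: (batch,) condition ids.
        # Broadcast the per-item embedding across the time axis, then
        # pass the sum through the wrapped layer.
        return self.layer(x + self.emb(idx).unsqueeze(1))


module = AdditiveEmbedding(num_embeddings=4, emb_dim=8, layer=nn.Linear(8, 8))
x = torch.randn(2, 5, 8)        # (batch, time, features)
idx = torch.tensor([0, 3])      # one condition id per batch item
y = module(x, idx)              # shape preserved: (2, 5, 8)
```

If this is all the mechanism does, it could arguably be inlined as `layer(x + emb(idx).unsqueeze(1))` at the call site, which is the kind of simplification the review asks about.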
loss = super().fit_batch(batch)
if self.hparams.lr_annealing_mode == "step":
    self.hparams.lr_annealing(self.optimizer)
return loss
"""Iterate epochs and datasets to improve objective.
Maybe instead of just copying the Brain docstring, this should state the changes that required overriding the default one.
Perhaps one thing we could do here is move the core changes to another PR: i.e. the four core (non-lobes) files in
What does this PR do?
Introduces a simple TTS architecture based on discrete speech representations from self-supervised models
Related to #2696
This version omits
Before submitting
PR review
Reviewer checklist