Conversation

flexthink (Collaborator)

What does this PR do?

Introduces a simple TTS architecture based on discrete speech representations from self-supervised models
Related to #2696

This version omits

  • Evaluation (UTMOS/dWER)
  • Utilities for saving samples and alignments
Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@flexthink flexthink requested a review from pplantinga March 4, 2025 15:01
pplantinga (Collaborator) left a comment:


Thanks for your work minimizing the dependencies. This is still quite large, however, and will take some time to review; I only finished one pass and didn't have time to look at everything. I will also have to try running the LibriTTS and LJSpeech recipes.

Overall, the code quality looks quite good, if a little verbose for my taste -- e.g., I'm not sure whether Additive Embedding, Null Embedding, and Embedding Injection are all really needed, or if something simpler could be done. Some of the docstrings also have extra spaces that don't quite match the overall SpeechBrain docstring style.

Anything you can do to simplify and keep only the parts that are really necessary will be a huge help for my review, as well as future users!

The subfolder "fastspeech2" contains the recipes for training the non-autoregressive transformer based TTS model [FastSpeech2](https://arxiv.org/abs/2006.04558).

# Tokotron
The subfolder "tokotron" contains the recipes for training the transformer-based that uses discrete audio representations.

Suggested change
The subfolder "tokotron" contains the recipes for training the transformer-based that uses discrete audio representations.
The subfolder "tokotron" contains the recipes for training a transformer-based model that uses discrete audio representations.

compatibility
g2p_src : str
The source (HuggingFace hub or path) of the G2P model to
be used

Used for what? Under what circumstances?

if model_name in ["Tacotron2", "FastSpeech2WithAlignment"]:
if extract_phonemes:
logger.info(
"Computing phonemes for LJSpeech labels using SpeechBrain G2P. This may take a while."

Suggested change
"Computing phonemes for LJSpeech labels using SpeechBrain G2P. This may take a while."
f"Using G2P {g2p_src} to convert LJSpeech labels to phonemes. This may take a while."

if model_name is not None and "FastSpeech2" in model_name:
if extract_phonemes:
logger.info(
"Computing pitch as required for FastSpeech2. This may take a while."

Is the pitch required for Tokotron as well? At the least this message should be updated.

Comment on lines +52 to +53
hubert: chaanks/hifigan-hubert-l1-3-7-12-18-23-LibriTTS
wav2vec: chaanks/hifigan-hubert-l1-3-7-12-18-23-LibriTTS

Are these supposed to be the same?

["out", "gate_out", "dec_self_attn", "dec_attn", "alignments", "context"],
)

TokotronDecoderInfernceOutput = namedtuple(

Inference, not Infernce

],
)

TokotronInfernceOutput = namedtuple(

Inference, not Infernce
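
For illustration, a minimal sketch of the corrected definition. The field list is borrowed from the decoder-output snippet above purely for illustration; the actual inference tuples in the PR may carry different fields.

```python
from collections import namedtuple

# Hypothetical sketch: the tuple's name corrected from "Infernce" to
# "Inference". Field names are reused from the snippet above as an
# example only.
TokotronDecoderInferenceOutput = namedtuple(
    "TokotronDecoderInferenceOutput",
    ["out", "gate_out", "dec_self_attn", "dec_attn", "alignments", "context"],
)
```

Since namedtuples are referenced by attribute name throughout calling code, renaming the class (and its name string) is a find-and-replace fix with no behavioral change.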

)
nn.init.xavier_normal_(self.in_proj.w.weight)

"""A simple embedding mechanism that adds the embedding to the inputs before the layer"""

Suggested change
"""A simple embedding mechanism that adds the embedding to the inputs before the layer"""

Comment on lines +416 to +419
loss = super().fit_batch(batch)
if self.hparams.lr_annealing_mode == "step":
self.hparams.lr_annealing(self.optimizer)
return loss
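
The override above applies the learning-rate scheduler once per optimizer step rather than once per epoch. A self-contained sketch of that pattern, with a toy scheduler standing in for the real SpeechBrain one (`TTSBrain`, `NoamLikeScheduler`, and `compute_loss` are hypothetical names, not the PR's actual API):

```python
from types import SimpleNamespace

class NoamLikeScheduler:
    """Toy stand-in for a step-wise LR scheduler (hypothetical; not the
    actual SpeechBrain scheduler API)."""

    def __init__(self, initial_lr, decay=0.99):
        self.lr = initial_lr
        self.decay = decay

    def __call__(self, optimizer):
        # Decay the learning rate once per call, i.e. once per training step
        self.lr *= self.decay
        optimizer["lr"] = self.lr

class TTSBrain:
    """Minimal sketch of a Brain-style class with the fit_batch override
    discussed above (names are illustrative)."""

    def __init__(self, hparams, optimizer):
        self.hparams = hparams
        self.optimizer = optimizer

    def compute_loss(self, batch):
        # Placeholder for the real forward/backward/optimizer step
        return sum(batch)

    def fit_batch(self, batch):
        loss = self.compute_loss(batch)
        # In "step" mode the scheduler fires after every batch; in an
        # epoch mode it would instead fire once per epoch elsewhere.
        if self.hparams.lr_annealing_mode == "step":
            self.hparams.lr_annealing(self.optimizer)
        return loss

# Example wiring: hparams as a simple namespace, optimizer as a dict
hparams = SimpleNamespace(
    lr_annealing_mode="step",
    lr_annealing=NoamLikeScheduler(1.0, decay=0.5),
)
```

Guarding on `lr_annealing_mode` keeps the same `fit_batch` usable for both per-step and per-epoch annealing configurations.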

Comment on lines +430 to +431
"""Iterate epochs and datasets to improve objective.

Maybe instead of just copying the brain docstring this should state the changes that required overriding the default one.

@pplantinga pplantinga added this to the v1.1.0 milestone Mar 7, 2025
@pplantinga pplantinga added the recipes Changes to recipes only (add/edit) label Mar 7, 2025

pplantinga commented Mar 7, 2025

Perhaps one thing we could do here is move the core changes to another PR: i.e., the four core (non-lobes) files in nnet, utils, and loss. We could merge that core PR first, and then the Tokotron one would depend on it.
