Tokotron: Tokenized TTS (lite version - minimal dependencies) #2849

base: develop
Conversation
Thanks for your work minimizing the dependencies. This is still quite large, however, and will take some time to review; I only finished one pass and didn't have time to look at everything. I will also have to go and try to run the LibriTTS and LJSpeech recipes.
Overall, the code quality looks quite good, if a little verbose for my taste -- e.g. I'm not sure whether the Additive Embedding, Null Embedding, and Embedding Injection are really needed, or whether something simpler could be done. And some of the docstrings have extra spaces that don't quite match the overall SpeechBrain docstring style.
Anything you can do to simplify and keep only the parts that are really necessary will be a huge help for my review, as well as for future users!
The subfolder "fastspeech2" contains the recipes for training the non-autoregressive transformer based TTS model [FastSpeech2](https://arxiv.org/abs/2006.04558).

# Tokotron
The subfolder "tokotron" contains the recipes for training the transformer-based that uses discrete audio representations.
Suggested change:
- The subfolder "tokotron" contains the recipes for training the transformer-based that uses discrete audio representations.
+ The subfolder "tokotron" contains the recipes for training a transformer-based model that uses discrete audio representations.
    compatibility
g2p_src : str
    The source (HuggingFace hub or path) of the G2P model to be used
Used for what? Under what circumstances?
if model_name in ["Tacotron2", "FastSpeech2WithAlignment"]:
    if extract_phonemes:
        logger.info(
            "Computing phonemes for LJSpeech labels using SpeechBrain G2P. This may take a while."
Suggested change:
- "Computing phonemes for LJSpeech labels using SpeechBrain G2P. This may take a while."
+ f"Using G2P {g2p_src} to convert LJSpeech labels to phonemes. This may take a while."
if model_name is not None and "FastSpeech2" in model_name:
    if extract_phonemes:
        logger.info(
            "Computing pitch as required for FastSpeech2. This may take a while."
Is the pitch required for Tokotron as well? At the least, this message should be updated.
hubert: chaanks/hifigan-hubert-l1-3-7-12-18-23-LibriTTS
wav2vec: chaanks/hifigan-hubert-l1-3-7-12-18-23-LibriTTS
Are these supposed to be the same?
    ["out", "gate_out", "dec_self_attn", "dec_attn", "alignments", "context"],
)

TokotronDecoderInfernceOutput = namedtuple(
Inference, not Infernce
    ],
)

TokotronInfernceOutput = namedtuple(
Inference, not Infernce
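For reference, the rename amounts to fixing the spelling in the `namedtuple` definitions. A minimal sketch follows; the field names here are placeholders, not the actual fields from the PR's diff:

```python
from collections import namedtuple

# Corrected spelling: "Inference", not "Infernce".
# Field names are illustrative placeholders only.
TokotronDecoderInferenceOutput = namedtuple(
    "TokotronDecoderInferenceOutput", ["audio_tokens", "length"]
)
TokotronInferenceOutput = namedtuple(
    "TokotronInferenceOutput", ["audio_tokens", "length"]
)

out = TokotronDecoderInferenceOutput(audio_tokens=[1, 2, 3], length=3)
```

Since the typo is in the type name (not just a variable), all call sites constructing or annotating these tuples need the same rename.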
)
nn.init.xavier_normal_(self.in_proj.w.weight)

"""A simple embedding mechanism that adds the embedding to the inputs before the layer"""
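To make the simplification discussion concrete, an additive embedding of this kind can be sketched in a few lines. This is a hypothetical minimal version under assumed names and shapes, not the PR's actual implementation:

```python
import torch
import torch.nn as nn


class AdditiveEmbedding(nn.Module):
    """Adds a learned embedding to the inputs before applying a layer.

    Hypothetical sketch: module, argument names, and shapes are assumptions.
    """

    def __init__(self, num_embeddings, emb_dim, layer):
        super().__init__()
        self.emb = nn.Embedding(num_embeddings, emb_dim)
        self.layer = layer

    def forward(self, x, idx):
        # x: (batch, time, emb_dim); idx: (batch,) condition ids.
        # Broadcast the per-item embedding across the time axis, then
        # pass the sum through the wrapped layer.
        return self.layer(x + self.emb(idx).unsqueeze(1))


module = AdditiveEmbedding(num_embeddings=4, emb_dim=8, layer=nn.Linear(8, 8))
x = torch.randn(2, 5, 8)        # (batch, time, features)
idx = torch.tensor([0, 3])      # one condition id per batch item
y = module(x, idx)              # shape preserved: (2, 5, 8)
```

If this is all the mechanism does, it could arguably be inlined as `layer(x + emb(idx).unsqueeze(1))` at the call site, which is the kind of simplification the review asks about.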
loss = super().fit_batch(batch)
if self.hparams.lr_annealing_mode == "step":
    self.hparams.lr_annealing(self.optimizer)
return loss
"""Iterate epochs and datasets to improve objective.
Maybe instead of just copying the Brain docstring, this should state the changes that required overriding the default one.
Perhaps one thing we could do here is move the core changes to another PR: i.e. the four core (non-lobes) files in
What does this PR do?
Introduces a simple TTS architecture based on discrete speech representations from self-supervised models
Related to #2696
This version omits
Before submitting
PR review
Reviewer checklist