Conversation


@younessdkhissi commented Sep 25, 2025

Adding a decoding loop at each decoding timestep. It is better suited to the RNN-T loss and significantly improves the results (especially for streaming recipes).

What does this PR do?

Fixes the transducer greedy decoding (this probably fixes issue #2753).

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

Adding a decoding loop at each decoding timestep. It is better suited to the RNN-T loss and significantly improves the results (especially for streaming recipes).
Set buffer_chunk_size to -1 to prevent dropping the first chunks, which happens especially when a small chunk size is used (e.g., 160 ms).
@Adel-Moumen
Collaborator

Hi @younessdkhissi,

Thanks for the PR.

Could you please elaborate a bit more on your suggested changes and explain why you think they are necessary? Furthermore, can you report the general improvements you've obtained in a streaming setting? Ideally, we would like to improve our transducer decoding interfaces by aligning them more closely with the literature, so it would be great if you could provide some references as well.

Thanks Youness!

@younessdkhissi
Author

Thanks @Adel-Moumen for taking the time to look at my PR.
The idea behind the changes made in the greedy search is to represent all the alignments we learn with the RNN-T loss during training. In the figure below, we can see that many of the RNN-T alignment paths assume that several tokens can be decoded at the same timestep.
In the current SpeechBrain greedy search, we can decode only one token at each timestep (similar to CTC), but this differs from the RNN-Transducer objective.
In streaming ASR, the transducer tends to delay its predictions until it is very confident, by waiting for more future frames (http://arxiv.org/abs/2111.01690). This pushes the model to learn to decode many tokens at the end. With the current algorithm, a number of deletions is therefore expected at the end of the transcriptions. That is why I mentioned issue #2753, where there is a difference between the evaluation inside the training script and the evaluation using the StreamingASR class: in that class, a number of zero chunks is injected at the end of the input stream, which gives the model more chances to decode additional tokens at the end.

I have tried to implement the following paper, "Dual-mode ASR: Unify and Improve Streaming ASR with Full-Context Modeling" (https://openreview.net/pdf?id=Pz_dcqfcKW8), which could be a future PR, and these are the results I get on LibriSpeech using the streaming mode:

  • Dual-mode ASR (streaming):
    • Using the current greedy search:
      • test-clean: 4.32%
      • test-other: 11.04%
    • Using the greedy search proposed in this PR:
      • test-clean: 3.88%
      • test-other: 9.88%
    • Original paper results:
      • test-clean: 3.7%
      • test-other: 9.2%
[Figure: rnn_t_alignment_paths — RNN-T alignment paths over the T×U lattice]
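
To make the point about multiple emissions per timestep concrete, here is a small illustrative script (not part of the PR; the frame/label bookkeeping is a simplification of the lattice in the figure) that enumerates the RNN-T alignment paths of the T×U lattice for a tiny case and shows which labels each path emits at which frame:

```python
# Illustration only (not SpeechBrain code): enumerate the RNN-T alignment
# paths for T = 2 encoder frames and the label sequence ("a", "b").
# A blank advances to the next frame; a label emission stays on the same
# frame, so some paths emit several labels at the same frame.
T, labels = 2, ("a", "b")
U = len(labels)

def paths(t, u, per_frame):
    """Walk the T×U lattice from (t=1, u=0) to (t=T, u=U)."""
    if t == T and u == U:
        yield [list(frame) for frame in per_frame]  # copy the finished path
        return
    if u < U:  # emit the next label at the current frame t
        per_frame[t - 1].append(labels[u])
        yield from paths(t, u + 1, per_frame)
        per_frame[t - 1].pop()
    if t < T:  # emit blank: move on to frame t + 1
        yield from paths(t + 1, u, per_frame)

for alignment in paths(1, 0, [[] for _ in range(T)]):
    print(alignment)
# [['a', 'b'], []]  -> both labels emitted at frame 1
# [['a'], ['b']]    -> one label per frame (what the current greedy search assumes)
# [[], ['a', 'b']]  -> both labels emitted at frame 2
```

A greedy search that emits at most one token per frame can only ever follow the middle kind of path, which is why deletions appear when the model has learned to emit several tokens near the end of the stream.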

@TParcollet TParcollet added this to the v1.1.0 milestone Oct 9, 2025
@TParcollet TParcollet self-assigned this Oct 9, 2025
@TParcollet
Collaborator

Hi @younessdkhissi, thanks for the work. I am really not sure what this PR does. I see that there is a new, arbitrary for loop for each timestep. Do you have a paper describing formally what is being done here? This would be important to attach to such a change to the greedy decoding.

@younessdkhissi
Author

Hello @TParcollet
In this PR, I changed the greedy search so that for each timestep, we decode until we predict the blank token, because the original Transducer paper (https://arxiv.org/abs/1211.3711) defines alignments over a T×U lattice (see the figure in the previous comment) with a blank symbol, allowing multiple label emissions in the same timestep.
To avoid an infinite loop, I defined a for loop bounded by "max_iterations", so that for each timestep we can decode at most "max_iterations" non-blank tokens.
Here is a paper from NVIDIA where they describe their greedy search algorithm (https://www.isca-archive.org/interspeech_2024/galvez24_interspeech.pdf): line 9 of Algorithm 1 is equivalent to the for loop that I have added in my PR.
The choice of the "max_iterations" value was inspired by this paper too.
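
For readers following along, here is a minimal sketch of the per-timestep inner loop being described. This is not the PR's actual code; the `predict` and `joint` callables and the tensor shapes are simplified assumptions.

```python
import torch

def transducer_greedy_decode(enc_out, predict, joint, blank_id, max_iterations=10):
    """Greedy RNN-T decoding sketch: at each encoder frame, keep emitting
    labels until the joint network outputs blank, capped at max_iterations.

    enc_out: (T, D) encoder output for one utterance.
    predict: fn(last_token, state) -> (pred_out, new_state), prediction network.
    joint:   fn(enc_frame, pred_out) -> (vocab_size,) logits incl. blank.
    """
    hyp = []
    pred_out, state = predict(blank_id, None)  # start from the blank/SOS state

    for t in range(enc_out.size(0)):
        # Inner loop: allow several non-blank emissions at the same frame,
        # matching the RNN-T lattice; a blank advances to the next frame.
        for _ in range(max_iterations):
            logits = joint(enc_out[t], pred_out)
            token = int(torch.argmax(logits))
            if token == blank_id:
                break
            hyp.append(token)
            pred_out, state = predict(token, state)
    return hyp
```

With max_iterations set to 1 this reduces to the current one-token-per-frame behaviour, so the change only matters on frames where the model wants to emit several tokens in a row (typically at the end of a streamed utterance).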

@TParcollet
Collaborator

Hi @younessdkhissi, and thank you. I'll take a closer look at this ASAP. In the meantime, could you please provide a few measurements of how the decoding speed is impacted by this change? Many thanks!

@younessdkhissi
Author

younessdkhissi commented Oct 15, 2025

Hi @TParcollet
I measured the inference time using both decoding methods and took the mean over 5 runs with different seeds.
I used an RTX 2080 Ti GPU. The measurements were made on a 12-layer Conformer transducer (similar architecture to the existing SpeechBrain recipe) using a chunk size of 320 ms without any left context. Inference was run on the LibriSpeech test sets with a batch size of 4.

  • With the current greedy decoding (test-clean / test-other):
    • WER: 3.85% / 10.36%
    • Inference time: 2 min 29 s / 2 min 18 s
  • With the proposed greedy decoding (test-clean / test-other):
    • WER: 3.79% / 10.30%
    • Inference time: 2 min 47 s / 2 min 35 s

If you want more measurements let me know :)
