Conversation


@younessdkhissi commented Sep 25, 2025

Adding a decoding loop at each decoding timestep. It is better suited to the RNN-T loss and significantly improves the results (especially for streaming recipes).

What does this PR do?

Fixes the transducer greedy decoding (this probably fixes issue #2753).

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

Adding a decoding loop at each decoding timestep. It is better suited to the RNN-T loss and significantly improves the results (especially for streaming recipes).
Set buffer_chunk_size to -1 to prevent dropping the first chunks, which happens especially when a small chunk size is used (e.g., 160 ms).
@Adel-Moumen
Collaborator

Hi @younessdkhissi,

Thanks for the PR.

Could you please elaborate a bit more on your suggested changes and explain why you think they are necessary? Furthermore, can you report the general improvements you've obtained in a streaming setting? Ideally, we would like to improve our transducer decoding interfaces by aligning them more closely with the literature, so it would be great if you could provide some references as well.

Thanks Youness!

@younessdkhissi
Author

Thanks @Adel-Moumen for taking the time to look at my PR.
The idea behind the changes made in the greedy search is to represent all the alignments we learn with the RNN-T loss during training. In the figure below, we can see that many of the RNN-T alignment paths assume that several tokens can be decoded at the same timestep.
In the current SpeechBrain greedy search, we can decode only one token at each timestep (similar to CTC), but this differs from the RNN-Transducer objective.
In streaming ASR, the transducer tends to delay its predictions until it is very confident, by waiting for more future frames (http://arxiv.org/abs/2111.01690). This pushes the model to learn to decode many tokens at the end. With the current algorithm, a number of deletions is therefore expected at the end of the transcriptions. That is why I mentioned issue #2753, where there is a difference between the evaluation inside the training script and the evaluation using the StreamingASR class: in that class, a number of zero chunks is injected at the end of the input stream, which gives the model more chances to decode additional tokens at the end.

I have tried to implement the following paper, "Dual-mode ASR: Unify and Improve Streaming ASR with Full-Context Modeling" (https://openreview.net/pdf?id=Pz_dcqfcKW8), which could be a future PR, and these are the results I get on LibriSpeech using the streaming mode:

  • Dual-mode ASR (streaming):
    • Using the current greedy search:
      • test-clean: 4.32%
      • test-other: 11.04%
    • Using the greedy search proposed in this PR:
      • test-clean: 3.88%
      • test-other: 9.88%
    • Original paper results:
      • test-clean: 3.7%
      • test-other: 9.2%
[Figure: rnn_t_alignment_paths — RNN-T alignment paths over the T×U lattice]
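
To make the point about multiple emissions per timestep concrete, here is a small illustrative script (not part of the PR; the frame/label bookkeeping is a simplification of the lattice in the figure) that enumerates the RNN-T alignment paths of the T×U lattice for a tiny case and shows which labels each path emits at which frame:

```python
# Illustration only (not SpeechBrain code): enumerate the RNN-T alignment
# paths for T = 2 encoder frames and the label sequence ("a", "b").
# A blank advances to the next frame; a label emission stays on the same
# frame, so some paths emit several labels at the same frame.
T, labels = 2, ("a", "b")
U = len(labels)

def paths(t, u, per_frame):
    """Walk the T×U lattice from (t=1, u=0) to (t=T, u=U)."""
    if t == T and u == U:
        yield [list(frame) for frame in per_frame]  # copy the finished path
        return
    if u < U:  # emit the next label at the current frame t
        per_frame[t - 1].append(labels[u])
        yield from paths(t, u + 1, per_frame)
        per_frame[t - 1].pop()
    if t < T:  # emit blank: move on to frame t + 1
        yield from paths(t + 1, u, per_frame)

for alignment in paths(1, 0, [[] for _ in range(T)]):
    print(alignment)
# [['a', 'b'], []]  -> both labels emitted at frame 1
# [['a'], ['b']]    -> one label per frame (what the current greedy search assumes)
# [[], ['a', 'b']]  -> both labels emitted at frame 2
```

A greedy search that emits at most one token per frame can only ever follow the middle kind of path, which is why deletions appear when the model has learned to emit several tokens near the end of the stream.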

@TParcollet TParcollet added this to the v1.1.0 milestone Oct 9, 2025
@TParcollet TParcollet self-assigned this Oct 9, 2025
@TParcollet
Collaborator

Hi @younessdkhissi, thanks for the work. I am really not sure what this PR does. I see that there is a new, arbitrary for loop for each timestep. Do you have a paper describing formally what is being done here? This would be important to attach to such a change to the greedy decoding.

@younessdkhissi
Author

Hello @TParcollet
In this PR, I changed the greedy search so that for each timestep, we decode until we predict the blank token, because the original Transducer paper (https://arxiv.org/abs/1211.3711) defines alignments over a T×U lattice (see the figure in the previous comment) with a blank symbol, allowing multiple label emissions in the same timestep.
To avoid an infinite loop, I defined a for loop bounded by "max_iterations", so that for each timestep we can decode at most "max_iterations" non-blank tokens.
Here is a paper from NVIDIA where they describe their greedy search algorithm (https://www.isca-archive.org/interspeech_2024/galvez24_interspeech.pdf): line 9 of Algorithm 1 is equivalent to the for loop that I have added in my PR.
The choice of the "max_iterations" value was inspired by this paper too.
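
For readers following along, here is a minimal sketch of the per-timestep inner loop being described. This is not the PR's actual code; the `predict` and `joint` callables and the tensor shapes are simplified assumptions.

```python
import torch

def transducer_greedy_decode(enc_out, predict, joint, blank_id, max_iterations=10):
    """Greedy RNN-T decoding sketch: at each encoder frame, keep emitting
    labels until the joint network outputs blank, capped at max_iterations.

    enc_out: (T, D) encoder output for one utterance.
    predict: fn(last_token, state) -> (pred_out, new_state), prediction network.
    joint:   fn(enc_frame, pred_out) -> (vocab_size,) logits incl. blank.
    """
    hyp = []
    pred_out, state = predict(blank_id, None)  # start from the blank/SOS state

    for t in range(enc_out.size(0)):
        # Inner loop: allow several non-blank emissions at the same frame,
        # matching the RNN-T lattice; a blank advances to the next frame.
        for _ in range(max_iterations):
            logits = joint(enc_out[t], pred_out)
            token = int(torch.argmax(logits))
            if token == blank_id:
                break
            hyp.append(token)
            pred_out, state = predict(token, state)
    return hyp
```

With max_iterations set to 1 this reduces to the current one-token-per-frame behaviour, so the change only matters on frames where the model wants to emit several tokens in a row (typically at the end of a streamed utterance).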

@TParcollet
Collaborator

Hi @younessdkhissi, and thank you. I'll take a closer look at this ASAP. In the meantime, could you please provide a few measurements of how the decoding speed is impacted by this change? Many thanks!

@younessdkhissi
Author

younessdkhissi commented Oct 15, 2025

Hi @TParcollet
I measured the inference time using both decoding methods and took the mean over 5 runs with different seeds.
I used an RTX 2080 Ti GPU. The measurements were made on a 12-layer Conformer transducer (similar architecture to the existing SpeechBrain recipe) using a chunk size of 320 ms without any left context. Inference was run on the LibriSpeech test sets with a batch size of 4.

  • With the current greedy decoding (test-clean / test-other):
    • WER: 3.85% / 10.36%
    • Inference time: 2 min 29 s / 2 min 18 s
  • With the proposed greedy decoding (test-clean / test-other):
    • WER: 3.79% / 10.30%
    • Inference time: 2 min 47 s / 2 min 35 s

If you want more measurements let me know :)
