Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Conversation

asumagic
Copy link
Collaborator

@asumagic asumagic commented Jun 4, 2024

What does this PR do?

Fixes #2030

WIP, introduces:

  • triangularclassic triangular filters (name is a WIP)
  • Functionality to allow aligning filterbank central frequencies to the closest lower FFT frequency bin

No breaking changes expected. Should test model training to see if those make a difference.

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@asumagic
Copy link
Collaborator Author

asumagic commented Jun 5, 2024

Currently training LibriSpeech/ASR/transducer/ with triangularclassic and without to see if it works. I don't expect the resulting WER difference to be any significant past the margin of error, but it is worth checking.

I am not sure if I would want to keep the bin alignment option; kaldi/k2 doesn't seem to be doing it and it doesn't feel like it would be beneficial especially without high n_fft.

@TParcollet
Copy link
Collaborator

It took us a long time due to many high priority things, and mostly because it does not seem to affect the results, but this is an important fix. Maybe @idruker-cerence wants to have a look.

@asumagic
Copy link
Collaborator Author

asumagic commented Jun 6, 2024

FWIW I have a stock RNN-T LibriSpeech Conformer being trained again with the triangularclassic option and at epoch ~34 there doesn't seem to be much of a difference yet.
Will report on the results when actually finished.

@idruker-cerence
Copy link

It took us a long time due to many high priority things, and mostly because it does not seem to affect the results, but this is an important fix. Maybe @idruker-cerence wants to have a look.

Hello

I am not a python expert so it's hard to me to review the code. However, I have a C++ implementation. We can do the following. You calculate the mel-spectrum with your new function and give me the numbers. Then I verify them with my C++ function.

@asumagic
Copy link
Collaborator Author

asumagic commented Jun 25, 2024

FWIW I have a stock RNN-T LibriSpeech Conformer being trained again with the triangularclassic option and at epoch ~34 there doesn't seem to be much of a difference yet. Will report on the results when actually finished.

Current code, lines are respectively test-clean and test-other:

Epoch loaded: 100 - test loss: 1.80e-02, test CER: 8.81e-01, test WER: 2.57
Epoch loaded: 100 - test loss: 2.16e-02, test CER: 2.50, test WER: 5.88

New "triangularclassic":

Epoch loaded: 100 - test loss: 1.88e-02, test CER: 8.57e-01, test WER: 2.59
Epoch loaded: 100 - test loss: 2.12e-02, test CER: 2.56, test WER: 5.97

That's only one run with 100 epochs at that, but it doesn't seem like a significant accuracy difference (~1% relative difference for only one run).

As for correctness, from my testing, "triangularclassic" matches what torchaudio produces, which itself is adapted from kaldi (which I also checked against, or rather k2, which reuses a good chunk of the same logic). So I think it would be ok to add for correctness.

I am tempted to turn this PR into only adding the "correct" filter as an option, but keep the default the same, and remove the snap_to_bins (which I don't think matters, and that I didn't test extensively).

At least in speech processing, correct me if I'm wrong, it doesn't seem so common to round the center frequencies to the closest bin in definitions of the mel filterbank (which is what I tried to implement as snap_to_bin).
I think you can only get "clean-looking" filter banks as usually shown in blog articles/papers if you really crank up the FFT resolution, because otherwise you're either trading off between obviously distorted central frequency spacing (due to the rounding) or not having the central frequency have a peak of 1.0 for on a specific FFT bin.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: wrong formula of calculation of mel filterbank?

4 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.