Add Granite Speech multimodal speech-to-text implementation #499

Open

gsmoon97 wants to merge 11 commits into foundation-model-stack:main from columbia-hpml-granite:granite-speech-dev

Conversation

@gsmoon97

This PR integrates IBM's Granite Speech 3.3 model into FMS. It was developed as part of a Columbia University course project (COMSE6998: High Performance Machine Learning) in collaboration with IBM Research.

Components

  • GraniteSpeech: Multimodal speech-to-text model combining Conformer encoder, Q-Former projector, and Granite decoder with LoRA adapter support
  • Conformer: Audio encoder with blocked self-attention, depthwise-separable convolution, Macaron-style feedforward, Shaw's relative positional embeddings, and optional CTC auxiliary loss
  • SpeechProjector: Q-Former bridge for projecting acoustic features to language model dimension using learnable query tokens and cross-attention
  • GraniteSpeechFeatureExtractor: Mel-spectrogram extraction with log normalization and 2-channel frame stacking
  • GraniteSpeechProcessor: Unified processor for audio-text multimodal inputs with automatic audio token expansion
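To illustrate the feature-extraction step above, here is a minimal sketch of log-mel frame stacking into channels. The function name `stack_frames` and the exact shapes are hypothetical, not the actual `GraniteSpeechFeatureExtractor` API; it only shows the general idea of pairing consecutive mel frames to halve the time dimension.

```python
import numpy as np

def stack_frames(mel: np.ndarray, stride: int = 2) -> np.ndarray:
    """Stack groups of `stride` consecutive log-mel frames along the
    feature axis, shrinking the time axis by `stride` (illustrative sketch)."""
    t, d = mel.shape
    if t % stride:  # pad time axis so frames divide evenly
        mel = np.pad(mel, ((0, stride - t % stride), (0, 0)))
    return mel.reshape(-1, stride * d)

# 99 frames of 80-dim log-mel features -> 50 stacked frames of 160 dims
mel = np.log(np.random.rand(99, 80) + 1e-9)
stacked = stack_frames(mel)
print(stacked.shape)  # (50, 160)
```

The real extractor additionally applies log normalization and emits the stacked features in the layout the Conformer encoder expects.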

Integration

  • Register granite_speech architecture with 3.3-2b and 3.3-8b variants
  • Add HF config mapping for GraniteSpeechForConditionalGeneration in utils.py
  • Add serialization adapters for HF checkpoint loading:
    • LoRA weight merging for decoder adapters
    • HF-to-FMS naming convention conversion
    • RoPE weight transformations for query/key projections
    • Attention and MLP weight fusion for inference efficiency
  • Support for frozen encoder/decoder finetuning modes
  • Generation interface with KV caching and automatic LoRA toggling
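As a rough sketch of the HF-to-FMS naming conversion mentioned above, the adapter can be thought of as a table of regex rewrites applied to each checkpoint key. The patterns and target names below are hypothetical placeholders; the real serialization adapter covers many more cases plus the LoRA merging and RoPE/fusion transforms.

```python
import re

# Hypothetical HF -> FMS key patterns for illustration only.
_PATTERNS = [
    (r"^language_model\.model\.layers\.(\d+)\.self_attn\.q_proj",
     r"decoder.layers.\1.attn.query"),
    (r"^language_model\.model\.layers\.(\d+)\.self_attn\.k_proj",
     r"decoder.layers.\1.attn.key"),
    (r"^encoder\.layers\.(\d+)\.", r"conformer.blocks.\1."),
]

def hf_to_fms_name(name: str) -> str:
    """Rewrite one HF checkpoint key to the FMS convention (sketch)."""
    for pat, repl in _PATTERNS:
        new, n = re.subn(pat, repl, name)
        if n:
            return new
    return name  # keys with no matching pattern pass through unchanged

print(hf_to_fms_name("language_model.model.layers.3.self_attn.q_proj.weight"))
# decoder.layers.3.attn.query.weight
```

In the actual adapters, weight *values* are also transformed (e.g. RoPE permutations, QKV/MLP fusion), not just the key names.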

Features

  • Audio token placeholder replacement with projected audio embeddings
  • Window-based processing for efficient long-form audio
  • Compatible with standard HuggingFace checkpoint format
  • Comprehensive inline documentation following FMS conventions
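The audio token placeholder replacement can be sketched as follows. The token id and function name are invented for illustration; the real implementation operates on batched tensors inside the model's forward pass.

```python
import numpy as np

AUDIO_TOKEN_ID = 49155  # hypothetical placeholder id, not the real one

def splice_audio_embeddings(input_ids, text_embeds, audio_embeds):
    """Overwrite embeddings at audio-placeholder positions with the
    projector's outputs, in order (unbatched sketch)."""
    out = text_embeds.copy()
    positions = np.flatnonzero(input_ids == AUDIO_TOKEN_ID)
    assert len(positions) == len(audio_embeds), "one embedding per placeholder"
    out[positions] = audio_embeds
    return out

ids = np.array([1, AUDIO_TOKEN_ID, AUDIO_TOKEN_ID, 2])
merged = splice_audio_embeddings(ids, np.zeros((4, 8)), np.ones((2, 8)))
print(merged[1].sum(), merged[0].sum())  # 8.0 0.0
```

The processor expands a single audio token into the right number of placeholders up front, so the count always matches the number of projected audio embeddings.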

Testing & Validation

  • Unit tests for all components (conformer, projector, model)
  • HuggingFace numerical equivalence validation
  • Complete end-to-end inference notebook (notebooks/granite_speech_inference.ipynb) demonstrating real audio transcription with LibriSpeech dataset
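The numerical equivalence validation amounts to comparing FMS outputs against the HF reference within a tolerance. A minimal sketch of such a check, with a hypothetical helper name and tolerances:

```python
import numpy as np

def assert_close(fms_out, hf_out, atol=1e-5, rtol=1e-5):
    """Fail if the FMS output diverges from the HF reference beyond
    tolerance; return the max absolute difference (illustrative helper)."""
    diff = float(np.max(np.abs(fms_out - hf_out)))
    if not np.allclose(fms_out, hf_out, atol=atol, rtol=rtol):
        raise AssertionError(f"max abs diff {diff:.2e} exceeds tolerance")
    return diff

ref = np.linspace(0.0, 1.0, 10)
assert_close(ref, ref + 1e-7)  # passes: well within atol
```

The actual tests run both implementations on the same inputs (encoder, projector, and full model) and compare logits in this fashion.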

Review Notes

We appreciate your time reviewing this contribution. All tests pass on CPU, with GPU-dependent equivalence tests marked for CUDA environments. The implementation follows FMS conventions and includes inline comments for maintainability.

Please let us know if any changes are needed or if you have questions about the implementation approach.

For more details about the project, please refer here.

Acknowledgements

Special thanks to our mentors @rzbhatti and @kaoutar55 from IBM Research for their invaluable guidance, technical insights, and support throughout this project. Their expertise in FMS architecture and multimodal models was instrumental in achieving a production-ready implementation.


Team:

Course: COMSE6998 High Performance Machine Learning (Fall 2025)

Supervisors:

gsmoon97 and others added 10 commits December 21, 2025 13:39
Co-authored-by: In Keun Kim <ik2619@columbia.edu>
Co-authored-by: Aneesh Durai <126147060+aneeshdurai@users.noreply.github.com>
Co-authored-by: Zachary Zusin <zacharyzusin@gmail.com>
Signed-off-by: Geonsik Moon <gsmoon97@gmail.com>
Signed-off-by: In Keun Kim <ik2619@columbia.edu>