Add Granite Speech multimodal speech-to-text implementation #499

Open

gsmoon97 wants to merge 11 commits into foundation-model-stack:main from columbia-hpml-granite:granite-speech-dev

Conversation

@gsmoon97

This PR integrates IBM's Granite Speech 3.3 model into FMS. It was developed as part of a Columbia University course project (COMSE6998: High Performance Machine Learning) in collaboration with IBM Research.

Components

  • GraniteSpeech: Multimodal speech-to-text model combining Conformer encoder, Q-Former projector, and Granite decoder with LoRA adapter support
  • Conformer: Audio encoder with blocked self-attention, depthwise-separable convolution, Macaron-style feedforward, Shaw's relative positional embeddings, and optional CTC auxiliary loss
  • SpeechProjector: Q-Former bridge for projecting acoustic features to language model dimension using learnable query tokens and cross-attention
  • GraniteSpeechFeatureExtractor: Mel-spectrogram extraction with log normalization and 2-channel frame stacking
  • GraniteSpeechProcessor: Unified processor for audio-text multimodal inputs with automatic audio token expansion
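To illustrate the feature-extraction step above, here is a minimal sketch of log-mel frame stacking into channels. The function name `stack_frames` and the exact shapes are hypothetical, not the actual `GraniteSpeechFeatureExtractor` API; it only shows the general idea of pairing consecutive mel frames to halve the time dimension.

```python
import numpy as np

def stack_frames(mel: np.ndarray, stride: int = 2) -> np.ndarray:
    """Stack groups of `stride` consecutive log-mel frames along the
    feature axis, shrinking the time axis by `stride` (illustrative sketch)."""
    t, d = mel.shape
    if t % stride:  # pad time axis so frames divide evenly
        mel = np.pad(mel, ((0, stride - t % stride), (0, 0)))
    return mel.reshape(-1, stride * d)

# 99 frames of 80-dim log-mel features -> 50 stacked frames of 160 dims
mel = np.log(np.random.rand(99, 80) + 1e-9)
stacked = stack_frames(mel)
print(stacked.shape)  # (50, 160)
```

The real extractor additionally applies log normalization and emits the stacked features in the layout the Conformer encoder expects.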

Integration

  • Register granite_speech architecture with 3.3-2b and 3.3-8b variants
  • Add HF config mapping for GraniteSpeechForConditionalGeneration in utils.py
  • Add serialization adapters for HF checkpoint loading:
    • LoRA weight merging for decoder adapters
    • HF-to-FMS naming convention conversion
    • RoPE weight transformations for query/key projections
    • Attention and MLP weight fusion for inference efficiency
  • Support for frozen encoder/decoder finetuning modes
  • Generation interface with KV caching and automatic LoRA toggling
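As a rough sketch of the HF-to-FMS naming conversion mentioned above, the adapter can be thought of as a table of regex rewrites applied to each checkpoint key. The patterns and target names below are hypothetical placeholders; the real serialization adapter covers many more cases plus the LoRA merging and RoPE/fusion transforms.

```python
import re

# Hypothetical HF -> FMS key patterns for illustration only.
_PATTERNS = [
    (r"^language_model\.model\.layers\.(\d+)\.self_attn\.q_proj",
     r"decoder.layers.\1.attn.query"),
    (r"^language_model\.model\.layers\.(\d+)\.self_attn\.k_proj",
     r"decoder.layers.\1.attn.key"),
    (r"^encoder\.layers\.(\d+)\.", r"conformer.blocks.\1."),
]

def hf_to_fms_name(name: str) -> str:
    """Rewrite one HF checkpoint key to the FMS convention (sketch)."""
    for pat, repl in _PATTERNS:
        new, n = re.subn(pat, repl, name)
        if n:
            return new
    return name  # keys with no matching pattern pass through unchanged

print(hf_to_fms_name("language_model.model.layers.3.self_attn.q_proj.weight"))
# decoder.layers.3.attn.query.weight
```

In the actual adapters, weight *values* are also transformed (e.g. RoPE permutations, QKV/MLP fusion), not just the key names.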

Features

  • Audio token placeholder replacement with projected audio embeddings
  • Window-based processing for efficient long-form audio
  • Compatible with standard HuggingFace checkpoint format
  • Comprehensive inline documentation following FMS conventions
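The audio token placeholder replacement can be sketched as follows. The token id and function name are invented for illustration; the real implementation operates on batched tensors inside the model's forward pass.

```python
import numpy as np

AUDIO_TOKEN_ID = 49155  # hypothetical placeholder id, not the real one

def splice_audio_embeddings(input_ids, text_embeds, audio_embeds):
    """Overwrite embeddings at audio-placeholder positions with the
    projector's outputs, in order (unbatched sketch)."""
    out = text_embeds.copy()
    positions = np.flatnonzero(input_ids == AUDIO_TOKEN_ID)
    assert len(positions) == len(audio_embeds), "one embedding per placeholder"
    out[positions] = audio_embeds
    return out

ids = np.array([1, AUDIO_TOKEN_ID, AUDIO_TOKEN_ID, 2])
merged = splice_audio_embeddings(ids, np.zeros((4, 8)), np.ones((2, 8)))
print(merged[1].sum(), merged[0].sum())  # 8.0 0.0
```

The processor expands a single audio token into the right number of placeholders up front, so the count always matches the number of projected audio embeddings.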

Testing & Validation

  • Unit tests for all components (conformer, projector, model)
  • HuggingFace numerical equivalence validation
  • Complete end-to-end inference notebook (notebooks/granite_speech_inference.ipynb) demonstrating real audio transcription with LibriSpeech dataset
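The numerical equivalence validation amounts to comparing FMS outputs against the HF reference within a tolerance. A minimal sketch of such a check, with a hypothetical helper name and tolerances:

```python
import numpy as np

def assert_close(fms_out, hf_out, atol=1e-5, rtol=1e-5):
    """Fail if the FMS output diverges from the HF reference beyond
    tolerance; return the max absolute difference (illustrative helper)."""
    diff = float(np.max(np.abs(fms_out - hf_out)))
    if not np.allclose(fms_out, hf_out, atol=atol, rtol=rtol):
        raise AssertionError(f"max abs diff {diff:.2e} exceeds tolerance")
    return diff

ref = np.linspace(0.0, 1.0, 10)
assert_close(ref, ref + 1e-7)  # passes: well within atol
```

The actual tests run both implementations on the same inputs (encoder, projector, and full model) and compare logits in this fashion.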

Review Notes

We appreciate your time reviewing this contribution. All tests pass on CPU, with GPU-dependent equivalence tests marked for CUDA environments. The implementation follows FMS conventions and includes inline comments for maintainability.

Please let us know if any changes are needed or if you have questions about the implementation approach.

For more details about the project, please refer here.

Acknowledgements

Special thanks to our mentors @rzbhatti and @kaoutar55 from IBM Research for their invaluable guidance, technical insights, and support throughout this project. Their expertise in FMS architecture and multimodal models was instrumental in achieving a production-ready implementation.


Team:

Course: COMSE6998 High Performance Machine Learning (Fall 2025)

Supervisors:

gsmoon97 and others added 10 commits December 21, 2025 13:39
Co-authored-by: In Keun Kim <ik2619@columbia.edu>
Co-authored-by: Aneesh Durai <126147060+aneeshdurai@users.noreply.github.com>
Co-authored-by: Zachary Zusin <zacharyzusin@gmail.com>
Signed-off-by: Geonsik Moon <gsmoon97@gmail.com>
Signed-off-by: In Keun Kim <ik2619@columbia.edu>