WaveKat ASR

Unified streaming speech-to-text for WaveKat voice pipelines, wrapping multiple ASR engines behind common Rust traits. Same pattern as wavekat-vad, wavekat-turn, and wavekat-tts.

Warning

Pre-1.0. The trait surface may change as more backends land, so pin to an exact patch version.

Backends

| Backend | Feature flag | Transport | Languages | Status | License |
|---|---|---|---|---|---|
| sherpa-onnx (streaming Zipformer / Paraformer) | sherpa-onnx | Local ONNX | EN, ZH, EN+ZH | ✅ Available | Apache 2.0 |

Local-first by design: the bundled sherpa-onnx backend ships today and runs entirely on-device.

Quick start

cargo add wavekat-asr --features sherpa-onnx

use wavekat_asr::backends::sherpa_onnx::SherpaOnnxAsr;
use wavekat_asr::{AudioFrame, Channel, StreamingAsr, TranscriptEvent};

let (mut asr, rx) = SherpaOnnxAsr::new()?;  // auto-downloads bilingual model on first run

let samples = vec![0.0f32; 16_000];          // 1 s of 16 kHz mono audio
let frame = AudioFrame::new(samples.as_slice(), 16_000);
asr.push_audio(&frame, Channel::Local)?;
asr.finish()?;

for event in rx.try_iter() {
    if let TranscriptEvent::Final { text, confidence, .. } = event {
        println!("final ({confidence:.2}): {text}");
    }
}

The StreamingAsr trait

All backends implement a common trait so you can write code generic over backends:

pub trait StreamingAsr: Send {
    fn push_audio(&mut self, frame: &AudioFrame, channel: Channel) -> Result<(), AsrError>;
    fn finish(&mut self) -> Result<(), AsrError>;
    fn reset(&mut self, channel: Channel) -> Result<(), AsrError>;
}
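
As a rough illustration of code that is generic over backends, the sketch below feeds audio to any StreamingAsr implementation in fixed-size chunks. It redeclares the trait locally so it compiles on its own; the AudioFrame fields, the AsrError shape, the transcribe_chunked helper, and the CountingAsr test double are all assumptions made for illustration, not the crate's actual definitions.

```rust
// Local stand-ins so the sketch compiles without the crate. In real code these
// come from `wavekat_asr`; the field layouts below are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Channel {
    Local,
    Remote,
}

#[derive(Debug)]
pub struct AsrError(pub String);

pub struct AudioFrame<'a> {
    pub samples: &'a [f32],
    pub sample_rate: u32,
}

pub trait StreamingAsr: Send {
    fn push_audio(&mut self, frame: &AudioFrame, channel: Channel) -> Result<(), AsrError>;
    fn finish(&mut self) -> Result<(), AsrError>;
    fn reset(&mut self, channel: Channel) -> Result<(), AsrError>;
}

/// Hypothetical helper: push `samples` to any backend in chunks of
/// `chunk_len`, then flush. Returns the number of frames pushed.
pub fn transcribe_chunked<A: StreamingAsr>(
    asr: &mut A,
    samples: &[f32],
    sample_rate: u32,
    chunk_len: usize,
    channel: Channel,
) -> Result<usize, AsrError> {
    if chunk_len == 0 {
        return Err(AsrError("chunk_len must be non-zero".into()));
    }
    let mut frames = 0;
    for chunk in samples.chunks(chunk_len) {
        let frame = AudioFrame { samples: chunk, sample_rate };
        asr.push_audio(&frame, channel)?;
        frames += 1;
    }
    asr.finish()?;
    Ok(frames)
}

/// Test double that only counts the samples it receives.
#[derive(Default)]
pub struct CountingAsr {
    pub samples_seen: usize,
    pub finished: bool,
}

impl StreamingAsr for CountingAsr {
    fn push_audio(&mut self, frame: &AudioFrame, _channel: Channel) -> Result<(), AsrError> {
        self.samples_seen += frame.samples.len();
        Ok(())
    }

    fn finish(&mut self) -> Result<(), AsrError> {
        self.finished = true;
        Ok(())
    }

    fn reset(&mut self, _channel: Channel) -> Result<(), AsrError> {
        self.samples_seen = 0;
        self.finished = false;
        Ok(())
    }
}
```

Because transcribe_chunked only depends on the trait bound, the same call site should work with a real backend or a test double such as CountingAsr.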

Transcript events come back through an mpsc::Receiver<TranscriptEvent> the backend hands you at construction time:

pub enum TranscriptEvent {
    SpeechStarted { channel, ts_ms },
    SpeechEnded   { channel, ts_ms },
    Partial       { channel, ts_ms, text },
    Final         { channel, ts_ms, end_ms, text, confidence },
    Warning(String),
}

Channel::{Local, Remote} tags which side of a two-channel call each event belongs to. The consuming daemon tees both RTP directions through one ASR instance.
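
One way a consumer might drain the receiver and split final transcripts by channel is sketched below. The enum is redeclared locally with assumed field types (u64 timestamps, String text, f32 confidence), since the README does not spell them out; CallTranscript and drain_finals are hypothetical names.

```rust
use std::sync::mpsc::Receiver;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Channel {
    Local,
    Remote,
}

// Stand-in for the crate's enum. The field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum TranscriptEvent {
    SpeechStarted { channel: Channel, ts_ms: u64 },
    SpeechEnded { channel: Channel, ts_ms: u64 },
    Partial { channel: Channel, ts_ms: u64, text: String },
    Final { channel: Channel, ts_ms: u64, end_ms: u64, text: String, confidence: f32 },
    Warning(String),
}

/// Hypothetical accumulator: finalized text per side of the call.
#[derive(Debug, Default, PartialEq)]
pub struct CallTranscript {
    pub local: Vec<String>,
    pub remote: Vec<String>,
    pub warnings: Vec<String>,
}

/// Drain whatever is queued right now without blocking, routing each
/// `Final` event to the side of the call it is tagged with.
pub fn drain_finals(rx: &Receiver<TranscriptEvent>, out: &mut CallTranscript) {
    for event in rx.try_iter() {
        match event {
            TranscriptEvent::Final { channel: Channel::Local, text, .. } => out.local.push(text),
            TranscriptEvent::Final { channel: Channel::Remote, text, .. } => out.remote.push(text),
            TranscriptEvent::Warning(msg) => out.warnings.push(msg),
            // Partials and speech boundaries are ignored in this sketch.
            _ => {}
        }
    }
}
```

Since try_iter never blocks, a function like this can be called once per tick of an existing event loop.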

Architecture

wavekat-vad   →  "is someone speaking?"
wavekat-turn  →  "are they done speaking?"
wavekat-asr   →  "what did they say?"
wavekat-tts   →  "synthesize the response"
     │                   │                     │                    │
     └───────────────────┴─────────────────────┴────────────────────┘
                                  │
                            AudioFrame (wavekat-core)

The trait surface stays deliberately small. Backends own their own resampling, network state, and tokenizer.

   AudioFrame ──▶  push_audio(frame, channel)  ──▶  ┌───────────┐
                                                    │  Backend  │
   end of call ─▶  finish()                    ──▶  │           │
                                                    │           │
                                  TranscriptEvent ◀─│           │
                                  on Receiver       └───────────┘

Why sync push + receiver, rather than async fn? The intended consumer already runs an event loop and fans events out to clients; matching that shape avoids forcing a tokio runtime through the trait. Backends that need their own runtime spawn one internally.
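
The shape described above can be sketched with the standard library alone. In this toy version a plain worker thread stands in for whatever runtime a real backend would spawn internally; ThreadedBackend, Command, and Event are hypothetical names, and the "transcript" is just a sample count.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};

#[derive(Debug, PartialEq)]
pub enum Event {
    Final(String),
}

enum Command {
    Audio(Vec<f32>),
    Finish,
}

/// Toy backend: the caller-facing methods are synchronous, while the work
/// happens on a thread the backend owns.
pub struct ThreadedBackend {
    commands: Sender<Command>,
    worker: Option<JoinHandle<()>>,
}

impl ThreadedBackend {
    /// Returns the backend together with the receiver for its events.
    pub fn new() -> (Self, Receiver<Event>) {
        let (cmd_tx, cmd_rx) = mpsc::channel::<Command>();
        let (evt_tx, evt_rx) = mpsc::channel::<Event>();
        let worker = thread::spawn(move || {
            let mut total = 0usize;
            for cmd in cmd_rx {
                match cmd {
                    Command::Audio(samples) => total += samples.len(),
                    Command::Finish => {
                        // A real backend would decode here; this placeholder
                        // only reports how much audio it saw.
                        let _ = evt_tx.send(Event::Final(format!("{total} samples")));
                        break;
                    }
                }
            }
        });
        (Self { commands: cmd_tx, worker: Some(worker) }, evt_rx)
    }

    /// Copies the samples into the worker's queue and returns immediately.
    pub fn push_audio(&mut self, samples: &[f32]) -> Result<(), String> {
        self.commands
            .send(Command::Audio(samples.to_vec()))
            .map_err(|e| e.to_string())
    }

    /// Signals end of input and waits for the worker to flush its events.
    pub fn finish(&mut self) -> Result<(), String> {
        self.commands.send(Command::Finish).map_err(|e| e.to_string())?;
        if let Some(handle) = self.worker.take() {
            handle.join().map_err(|_| "worker panicked".to_string())?;
        }
        Ok(())
    }
}
```

The caller never sees a future or a runtime handle: it pushes audio from its own loop and reads events from the receiver whenever it likes.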

sherpa-onnx backend

Local streaming Zipformer / Paraformer via sherpa-onnx. The backend auto-downloads the selected model from Hugging Face on first use and caches it under $HF_HOME/hub/ (default ~/.cache/huggingface/hub/).
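
To check where models end up on a given machine, the cache location rule can be written out as below. This is a minimal sketch of the rule as stated here, not the code the backend or hf-hub actually runs; hub_cache_dir is a hypothetical helper, and treating an empty HF_HOME as unset and reading HOME for the home directory are assumptions of this sketch.

```rust
use std::path::PathBuf;

/// Resolve the hub cache directory from explicit inputs so the rule is easy
/// to test: `$HF_HOME/hub` when HF_HOME is set, else `~/.cache/huggingface/hub`.
/// Returns `None` when neither value is available.
pub fn hub_cache_dir(hf_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    match hf_home {
        // Assumption: an empty HF_HOME is treated the same as an unset one.
        Some(dir) if !dir.is_empty() => Some(PathBuf::from(dir).join("hub")),
        _ => home.map(|h| {
            PathBuf::from(h)
                .join(".cache")
                .join("huggingface")
                .join("hub")
        }),
    }
}

/// Convenience wrapper that reads the process environment.
pub fn hub_cache_dir_from_env() -> Option<PathBuf> {
    let hf_home = std::env::var("HF_HOME").ok();
    let home = std::env::var("HOME").ok();
    hub_cache_dir(hf_home.as_deref(), home.as_deref())
}
```

Printing hub_cache_dir_from_env() before the first run shows which directory needs free disk space for the download.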

Model presets

Model choice is a construction-time decision: the ONNX files load into the recognizer, so switching models requires rebuilding the backend.

| WAVEKAT_ASR_PRESET | Constant | HF repo | Best for |
|---|---|---|---|
| bilingual (default) | BILINGUAL_ZH_EN | csukuangfj/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 | Mixed EN+ZH calls |
| en | ZIPFORMER_EN | csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26 | English-only |
| zh | PARAFORMER_ZH | csukuangfj/sherpa-onnx-streaming-paraformer-zh | Mandarin-only (often beats bilingual on ZH WER) |
| paraformer-zh-en | PARAFORMER_BILINGUAL_ZH_EN | csukuangfj/sherpa-onnx-streaming-paraformer-bilingual-zh-en | ZH-leaning bilingual alternative |
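
The preset table can be read as a simple lookup from the WAVEKAT_ASR_PRESET value to a Hugging Face repo. The sketch below encodes that table for illustration only: the constants here are plain strings (the crate's real constants may be a richer type), repo_for_preset is a hypothetical helper, and returning None for an unknown name is an assumption about error handling.

```rust
// Repo ids copied from the preset table. In the crate these constant names
// may refer to a richer preset type; plain strings are used here.
pub const BILINGUAL_ZH_EN: &str =
    "csukuangfj/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20";
pub const ZIPFORMER_EN: &str = "csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26";
pub const PARAFORMER_ZH: &str = "csukuangfj/sherpa-onnx-streaming-paraformer-zh";
pub const PARAFORMER_BILINGUAL_ZH_EN: &str =
    "csukuangfj/sherpa-onnx-streaming-paraformer-bilingual-zh-en";

/// Hypothetical lookup: map an optional preset name (for example the value of
/// WAVEKAT_ASR_PRESET) to a repo id. `None` as input selects the default.
pub fn repo_for_preset(value: Option<&str>) -> Option<&'static str> {
    match value.map(str::trim) {
        None | Some("bilingual") => Some(BILINGUAL_ZH_EN),
        Some("en") => Some(ZIPFORMER_EN),
        Some("zh") => Some(PARAFORMER_ZH),
        Some("paraformer-zh-en") => Some(PARAFORMER_BILINGUAL_ZH_EN),
        // Assumption: unknown names are reported to the caller rather than
        // silently falling back to the default.
        Some(_) => None,
    }
}
```

A caller could pass std::env::var("WAVEKAT_ASR_PRESET").ok().as_deref() to resolve the repo before constructing the backend.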

Examples

Two runnable examples ship behind --features sherpa-onnx. First run auto-downloads the selected model.

# Transcribe a 16 kHz mono WAV file
cargo run --release --example transcribe_wav --features sherpa-onnx -- audio.wav

# Live mic transcription (Ctrl-C to stop)
cargo run --release --example transcribe_mic --features sherpa-onnx

# Pick a different model (default is `bilingual`)
WAVEKAT_ASR_PRESET=en cargo run --release --example transcribe_mic --features sherpa-onnx

Feature flags

| Flag | Default | Description |
|---|---|---|
| sherpa-onnx | No | Local streaming Zipformer / Paraformer via sherpa-onnx; pulls in hf-hub for first-run model download |

Building from source

Enabling sherpa-onnx pulls in sherpa-onnx-sys, which builds vendored ONNX Runtime through CMake. You'll need:

  • A C++ toolchain (clang or gcc) and cmake on PATH.
  • On Linux, and only for the transcribe_mic example: ALSA development headers (libasound2-dev on Debian/Ubuntu, alsa-lib-devel on Fedora). The library itself has no system audio dependency.

The first build of sherpa-onnx-sys is slow (5–10 min); subsequent builds are cached by Cargo.

Important notes

  • Sample rate. The StreamingAsr trait accepts any AudioFrame sample rate, and backends are expected to resample internally. The sherpa-onnx backend does not do this yet: it currently expects 16 kHz f32 input, and 8 kHz telephony resampling is planned for a follow-up (see docs/03-sherpa-onnx-backend.md).
  • Dual-channel routing. Channel::{Local, Remote} is wired through the trait today, but the sherpa-onnx backend does not yet isolate per-channel state; that is planned for Phase 2.
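
Until the backend resamples on its own, a caller holding 8 kHz telephony audio could upsample to 16 kHz before pushing. The function below is a deliberately naive sketch using linear interpolation, with no anti-imaging filter; it is not what the planned backend resampler does, and upsample_2x_linear is a hypothetical name. A proper resampler should be preferred where accuracy matters.

```rust
/// Naive 2x upsampler (for example 8 kHz -> 16 kHz): keeps every input sample
/// and inserts the midpoint between it and the next one. The last sample is
/// repeated because it has no successor to interpolate towards.
pub fn upsample_2x_linear(input: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(input.len() * 2);
    for (i, &sample) in input.iter().enumerate() {
        let next = input.get(i + 1).copied().unwrap_or(sample);
        out.push(sample);
        out.push((sample + next) * 0.5);
    }
    out
}
```

The output always has exactly twice as many samples as the input, so one second of 8 kHz audio becomes the 16,000 samples that the backend expects per second.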

About WaveKat

wavekat-asr is part of WaveKat, an open-source ecosystem of Rust crates for building real-time voice pipelines. It handles streaming speech-to-text, alongside sibling crates for voice activity detection, turn detection, and text-to-speech.

See wavekat.com for the full project.

License

Licensed under Apache 2.0.

Copyright 2026 WaveKat.
