Real-time speech-to-text for Flutter using Groq's Whisper API

I am a Software Engineer focused on user‑friendly experiences & passionate about building excellent software that improves the lives of those around me.

I Built a Flutter Plugin for Real-Time Speech-to-Text Using Groq's Whisper API.

I needed real-time speech-to-text in a Flutter app. The existing options were paid SDKs, locked to a single platform, or too limited in the control they gave me over the transcription pipeline. So I built my own: groq_whisper_stt.

It captures audio from the device mic, detects speech using voice activity detection, sends chunks to Groq's Whisper API, and streams transcription results back. Works on both Android and iOS.

Package: pub.dev/packages/groq_whisper_stt

Source: github.com/akshatbhuhagal/groq-whisper-stt


Why Groq + Whisper?

Groq runs Whisper models on their LPU hardware, which makes inference incredibly fast. For real-time STT, speed matters — you don't want your users waiting seconds for each transcription chunk. Groq's Whisper endpoint consistently returns results in under a second, which makes the whole experience feel live.

And it's just an API call. No native ML models to bundle, no platform-specific speech frameworks to fight with.

How the Pipeline Works

The plugin handles capture, speech detection, and buffering on-device, so only audio that contains speech hits the network:

Microphone -> VAD -> Audio Buffer -> WAV Encode -> Groq API -> Result Assembly -> Stream

1. Microphone capture — Native platform code (Kotlin on Android, Swift on iOS) captures 16kHz mono 16-bit PCM audio in 20ms frames.

2. Voice Activity Detection — An energy-based VAD with an adaptive noise floor analyzes each frame. It only triggers recording when actual speech is detected — no sending silence to the API and burning through your quota.

3. Pre-speech ring buffer — A 300ms ring buffer continuously stores recent audio. When the VAD detects speech, this buffer is prepended to the recording so the beginning of the utterance isn't clipped. This was one of those "obvious in hindsight" features that made a huge quality difference.

4. Chunked sending with overlap — During continuous speech, audio is sent to Groq every 3 seconds (configurable). Each chunk overlaps the previous one by 500ms to avoid cutting words at boundaries.

5. Result assembly — The assembler deduplicates overlapping words at chunk boundaries by comparing the last 5 words of each result. It also chains prompt context from previous transcriptions so Whisper maintains continuity across chunks.

6. Smart filtering — Chunks where Whisper reports noSpeechProb > 0.9 are silently dropped. This prevents phantom transcriptions from background noise.
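
To make step 2 more concrete, here is a minimal sketch of an energy-based VAD with an adaptive noise floor. It is not the plugin's actual code: the class name, thresholds, and adaptation rate are all assumptions chosen for illustration. The input is one RMS value per frame; at 16kHz mono 16-bit with 20ms frames, each frame holds 320 samples (640 bytes).

```dart
import 'dart:math' as math;

/// Hypothetical energy-based VAD. All names and default values here are
/// illustrative assumptions, not taken from groq_whisper_stt.
class EnergyVad {
  EnergyVad({
    this.speechRatio = 3.0,
    this.floorAdaptRate = 0.05,
    this.onsetFrames = 3,
    this.hangoverFrames = 40, // 40 frames x 20 ms = 800 ms of silence
  });

  /// A frame counts as loud when its RMS exceeds noiseFloor * speechRatio.
  final double speechRatio;

  /// Smoothing factor for the noise floor on quiet frames.
  final double floorAdaptRate;

  /// Consecutive loud frames required before speech starts.
  final int onsetFrames;

  /// Consecutive quiet frames required before speech ends.
  final int hangoverFrames;

  double noiseFloor = 100.0; // initial guess, in 16-bit amplitude units
  bool inSpeech = false;
  int _loudRun = 0;
  int _quietRun = 0;

  /// Feeds one frame's RMS energy and returns true while speech is active.
  bool process(double rms) {
    final isLoud = rms > noiseFloor * speechRatio;
    if (isLoud) {
      _loudRun++;
      _quietRun = 0;
    } else {
      _quietRun++;
      _loudRun = 0;
      // Adapt the floor only on quiet frames so speech does not raise it.
      noiseFloor += floorAdaptRate * (rms - noiseFloor);
      noiseFloor = math.max(noiseFloor, 1.0);
    }
    if (!inSpeech && _loudRun >= onsetFrames) inSpeech = true;
    if (inSpeech && _quietRun >= hangoverFrames) inSpeech = false;
    return inSpeech;
  }
}
```

Requiring a few consecutive loud frames before switching on is what delays the onset, which is why a pre-speech buffer is needed to recover the start of the utterance.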

Usage

Install it:

dependencies:
  groq_whisper_stt: ^0.1.0

Use it:

import 'package:groq_whisper_stt/groq_whisper_stt.dart';

// The awaits below need an async context, so the calls are wrapped in a function.
Future<void> runTranscription() async {
  final stt = GroqWhisperStt(
    apiKey: 'your-groq-api-key',
    config: const SttConfig(
      model: WhisperModel.largev3Turbo,
      language: 'en',
    ),
  );

  await stt.initialize();

  stt.transcriptionStream.listen((result) {
    print(result.text);          // This chunk's text
    print(result.sessionText);   // Full session transcript
  });

  await stt.start();
  // ... user speaks ...
  await stt.stop();
  stt.dispose();
}

The API is stream-based. You get three broadcast streams:

  • transcriptionStream — transcription results with text, word timestamps, confidence scores

  • stateStream — lifecycle state changes (idle, listening, recording, processing)

  • errorStream — errors with typed exceptions

Configuration

Everything is tunable via SttConfig:

const config = SttConfig(
  model: WhisperModel.largev3Turbo, // Fast model (809M params)
  language: 'en',                    // Or null for auto-detect
  chunkDuration: Duration(seconds: 3),
  silenceTimeout: Duration(milliseconds: 800),
  enableWordTimestamps: true,
  prompt: 'Medical terminology',     // Context hint for Whisper
  maxRetries: 3,                     // Auto-retry on 429s and 5xx
);

The prompt parameter is worth setting: if you know your domain (medical, legal, technical), passing a context hint can noticeably improve accuracy on specialized vocabulary.

Challenges I Ran Into

Byte alignment on Android — Android's AudioRecord sends frames that aren't always 2-byte aligned in the underlying buffer. Creating an Int16List view for the VAD's RMS calculation would crash with RangeError: Offset must be a multiple of BYTES_PER_ELEMENT. Fixed by using ByteData.getInt16() which reads at arbitrary offsets.
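
A minimal sketch of that fix, assuming frames arrive as a Uint8List of little-endian 16-bit samples (the function name is hypothetical). ByteData.sublistView creates a view over exactly the bytes of the frame, and getInt16 has no alignment requirement.

```dart
import 'dart:math' as math;
import 'dart:typed_data';

/// RMS of little-endian 16-bit PCM. Works even when [frame] is a view that
/// starts at an odd byte offset in its underlying buffer.
/// (Hypothetical helper; not the plugin's actual function.)
double rmsOfPcm16(Uint8List frame) {
  final sampleCount = frame.lengthInBytes ~/ 2;
  if (sampleCount == 0) return 0.0;

  // A ByteData view over the same bytes as the frame; no copy is made.
  final data = ByteData.sublistView(frame);

  var sumSquares = 0.0;
  for (var i = 0; i < sampleCount; i++) {
    // getInt16 reads at any byte offset, unlike an Int16List view.
    final sample = data.getInt16(i * 2, Endian.little);
    sumSquares += sample * sample;
  }
  return math.sqrt(sumSquares / sampleCount);
}
```

Passing Endian.little explicitly matters here, because getInt16 defaults to big-endian.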

Chunk boundary artifacts — Without overlap, words at chunk boundaries would either get cut off or duplicated. The 500ms overlap + 5-word deduplication window solved this cleanly.
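
With 3-second chunks and a 500ms overlap, each chunk contributes 2.5 seconds of new audio, and the overlap region may be transcribed twice. The following is a rough sketch of tail-to-head word deduplication, not the plugin's actual assembler: the function names and the normalization rule (lowercase, strip non-word characters) are assumptions.

```dart
import 'dart:math' as math;

/// Appends [next] to [previous], dropping leading words of [next] that
/// repeat the trailing words of [previous]. Overlaps of up to [window]
/// words are checked, longest first. (Hypothetical helper.)
String mergeChunks(String previous, String next, {int window = 5}) {
  final prev = _words(previous);
  final nxt = _words(next);
  final maxOverlap = math.min(window, math.min(prev.length, nxt.length));

  var overlap = 0;
  for (var n = maxOverlap; n > 0; n--) {
    var matches = true;
    for (var i = 0; i < n; i++) {
      if (_norm(prev[prev.length - n + i]) != _norm(nxt[i])) {
        matches = false;
        break;
      }
    }
    if (matches) {
      overlap = n;
      break;
    }
  }
  return [...prev, ...nxt.skip(overlap)].join(' ');
}

/// Splits on whitespace and drops empty tokens.
List<String> _words(String s) =>
    s.trim().split(RegExp(r'\s+')).where((w) => w.isNotEmpty).toList();

/// Lowercases and strips punctuation so "World," matches "world.".
/// Note: \w is ASCII-only in Dart, so this is only suitable for English.
String _norm(String w) => w.toLowerCase().replaceAll(RegExp(r'[^\w]'), '');
```

For example, merging 'the quick brown fox' with 'brown fox jumps over' yields 'the quick brown fox jumps over'.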

Pre-speech clipping — The VAD needs a few frames to confirm speech onset, so by the time it fires speechStart, you've already missed the first ~100-300ms. The ring buffer captures this pre-speech audio and prepends it to the recording.
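
A ring buffer like this can be sketched in a few lines. The class below is a hypothetical illustration rather than the plugin's implementation. With 20ms frames, 300ms of history corresponds to a capacity of 15 frames.

```dart
import 'dart:typed_data';

/// Fixed-capacity ring buffer that keeps the most recent PCM frames.
/// (Hypothetical helper; names are assumptions.)
class PreSpeechBuffer {
  PreSpeechBuffer(this.capacityFrames)
      : _frames = List<Uint8List?>.filled(capacityFrames, null);

  final int capacityFrames;
  final List<Uint8List?> _frames;
  int _next = 0; // slot the next frame will be written to
  int _count = 0; // number of valid frames currently stored

  /// Stores [frame], overwriting the oldest one once the buffer is full.
  void add(Uint8List frame) {
    _frames[_next] = frame;
    _next = (_next + 1) % capacityFrames;
    if (_count < capacityFrames) _count++;
  }

  /// Returns the buffered audio oldest-first as one byte array and
  /// empties the buffer.
  Uint8List drain() {
    final out = BytesBuilder();
    final start = (_next - _count + capacityFrames) % capacityFrames;
    for (var i = 0; i < _count; i++) {
      out.add(_frames[(start + i) % capacityFrames]!);
    }
    _count = 0;
    _next = 0;
    return out.takeBytes();
  }
}
```

In use, every incoming frame would be added while the VAD is idle, and drain() would be called once on speech start so its result can be prepended to the recording.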

What's Next

  • Streaming partial results (currently each chunk is a complete result)

  • More Whisper models as Groq adds them

  • Web platform support

If you're building a Flutter app that needs speech-to-text, give it a try. Feedback and contributions welcome.
