Building a Text-to-Speech Engine in Pure C

title: Building a Text-to-Speech Engine in Pure C published: true description: Building a minimal Text-to-Speech engine in pure C using Qwen3-TTS without ML frameworks. tags: c, ai, tts, opensource cover_image: https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k8vpv35ge7n4xjcsglc7.png

I built a pure C inference engine for Qwen3-TTS, Alibaba's open-source text-to-speech model. The goal: run high-quality multilingual TTS on CPU, with zero Python dependencies, inspired by antirez's approach to minimal C inference engines (specifically his qwen-asr project). The code is on GitHub: gabriele-mastrapasqua/qwen3-tts.

What started as a "let's just get the basic pipeline working" turned into a full-featured TTS engine with streaming output, an HTTP server, voice cloning, and custom voice design — all in a single C binary.

Why pure C?

The official Qwen3-TTS runs on PyTorch with the usual stack of transformers, tokenizers, and CUDA. That's fine for a GPU server, but I wanted something that runs anywhere — a single binary, no runtime dependencies, just mmap the model weights and go.

The result: make blas, point it at a model directory, and you get a ~200KB binary that does everything.

The architecture

Qwen3-TTS is a three-stage pipeline:

Talker — a 28-layer causal Qwen3 LLM (0.6B or 1.7B params) with GQA, RoPE, and SwiGLU that generates discrete audio frame tokens from text
Code Predictor — a 5-layer transformer that runs 15 sequential passes per frame, filling in the remaining codebook entries
Speech Decoder — a causal ConvNet with Snake activations, ResBlocks, and 480x upsampling that converts discrete codes to 24kHz audio

Each stage was reimplemented from scratch in C. The model supports 9 preset voices, 10 languages, and both 0.6B and 1.7B model sizes (auto-detected from weights).

BF16 weights, float32 compute

The model weights are stored in bfloat16 and memory-mapped directly from standard HuggingFace safetensors files. On Apple Silicon, bf16-to-f32 conversion is essentially free (it's just a left shift), and this approach gives bit-identical results to the Python reference with greedy decoding.

I did experiment with INT4 quantization, but for the 0.6B model the matrices are too small to be bandwidth-bound — the Q4 unpack overhead actually made it 20% slower. BF16 turned out to be the sweet spot.

Streaming output

The speech decoder is fully causal (no lookahead), which made streaming architecturally possible. The engine generates N frames, decodes a chunk through the speech decoder, and writes audio immediately — no need to wait for the full sequence.

# Pipe raw PCM to an audio player for real-time playback
./qwen_tts -d qwen3-tts-0.6b --text "Hello world" --stdout | \
    play -t raw -r 24000 -e signed -b 16 -c 1 -

First audio arrives within ~1 second. The speech decoder uses incremental decoding with KV caching, so each streaming chunk is O(chunk_size) rather than re-processing the full sequence.

HTTP server

The engine includes an embedded HTTP server — no nginx, no FastAPI, just start it and send requests:

# Start server (model loaded once, shared across requests)
./qwen_tts -d qwen3-tts-0.6b --serve 8080

# Generate speech
curl -X POST http://localhost:8080/v1/tts \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello world","speaker":"ryan","language":"English"}' \
  -o output.wav

It also has an OpenAI-compatible endpoint (/v1/audio/speech) so you can use it as a drop-in replacement for OpenAI's TTS API in existing apps, plus a streaming endpoint that sends chunked PCM as it generates.

Voice cloning

Using the Base model variant, you can clone any voice from a few seconds of reference audio:

./qwen_tts -d qwen3-tts-0.6b-base --text "Hello, this is my cloned voice." \
    --ref-audio reference.wav -o cloned.wav

Under the hood, this runs a full ECAPA-TDNN speaker encoder to extract a 1024-dim speaker embedding from the reference audio's mel spectrogram. You can save and reload embeddings to avoid re-extracting:

# Extract and save
./qwen_tts -d qwen3-tts-0.6b-base --text "Hello" \
    --ref-audio ref.wav --save-voice my_voice.bin -o out.wav

# Reuse later (instant)
./qwen_tts -d qwen3-tts-0.6b-base --text "Another sentence" \
    --load-voice my_voice.bin -o out2.wav

The speech tokenizer encoder (Mimi-based, with 4-stage strided convolutions, 8-layer transformer, and split RVQ quantization) was also implemented for the full ICL mode.

VoiceDesign

The 1.7B VoiceDesign model can create entirely new voices from natural language descriptions:

./qwen_tts -d qwen3-tts-voice-design -l English \
    --instruct "A deep male voice with a British accent, speaking slowly and calmly" \
    --text "Hello, this is a test." -o british.wav

No reference audio needed — just describe what you want.

Style and emotion control

The 1.7B CustomVoice model supports an --instruct flag to control speaking style:

./qwen_tts -d qwen3-tts-1.7b --text "I cannot believe you did that." \
    --instruct "Speak in a very angry and aggressive tone" -o angry.wav

./qwen_tts -d qwen3-tts-1.7b --text "I cannot believe you did that." \
    --instruct "Speak very slowly and softly, in a sad whisper" -o whisper.wav

Same text, completely different delivery.

Performance

On Apple Silicon (M-series, 4 threads):

Model	Speed	Per-frame
0.6B	~0.7-0.86x realtime	Talker 24ms + CP 70ms
1.7B	~0.48x realtime	Talker 92ms + CP 75ms

The bottleneck is the Code Predictor — 15 sequential autoregressive passes per frame, no way around it.

Key optimizations:

NEON-optimized bf16 matvec with multi-row fusion (2-row fused dispatch)
Fused gate+up projections for SwiGLU in both Talker and Code Predictor
Unified QKV dispatch to reduce threading overhead
NEON kernels for RMSNorm, attention (dot+V accum), RoPE, Snake activation
Fused argmax+matvec in the Code Predictor hot loop
im2col + BLAS sgemm for the ConvNet decoder, with tiling for large sequences
Incremental speech decoder with KV cache for streaming
4-thread dispatch_apply (sweet spot — 8 threads hit the memory bandwidth ceiling)

Starting from ~0.4x realtime pre-optimization, these brought the 0.6B model to ~0.86x realtime.

The Metal GPU detour

I implemented a full Metal GPU backend — compute shaders, GPU-side transformer, the works. The result? ~1.3x slower than the optimized NEON CPU path. On Apple Silicon, CPU and GPU share the same memory bus, so there's no bandwidth advantage. The NEON path was already near-optimal for these model sizes. Deleted the whole thing.

The debugging journey

Getting bit-identical output required tracking down some non-obvious issues:

The model config says "interleaved": true for RoPE, but the Python code actually uses NeoX split-half rotation (the opposite!)
The Code Predictor's first codebook uses the Talker's codec embedding, not its own
Snake activations store alpha and beta in log space — sin²(exp(alpha) * x), not sin²(alpha * x)
All convolutions in the speech decoder are causal (left-only padding), including transposed convolutions
ResBlock dilations are [1, 3, 9], not [1, 1, 1] as you might assume
The 1.7B model needs a projection layer (2048→1024) between the Talker and Code Predictor that isn't in the 0.6B

Each of these was a "why doesn't my output match?" rabbit hole. The final validation: correlation 0.999996 with the Python reference across the full pipeline.

What it looks like

# Build
make blas

# Basic usage
./qwen_tts -d qwen3-tts-0.6b --text "Hello, how are you?" -o hello.wav

# Stream to speaker
./qwen_tts -d qwen3-tts-0.6b --text "Hello world" --stdout | \
    play -t raw -r 24000 -e signed -b 16 -c 1 -

# Start HTTP server
./qwen_tts -d qwen3-tts-0.6b --serve 8080

The project supports macOS (ARM/x86), Linux (ARM/x86), and Windows via WSL2. NEON and AVX SIMD paths are included. The 0.6B model needs ~3 GB of memory, the 1.7B needs ~8 GB.

The code is on GitHub: gabriele-mastrapasqua/qwen3-tts

Building a Text-to-Speech Engine in Pure C

title: Building a Text-to-Speech Engine in Pure C published: true description: Building a minimal Text-to-Speech engine in pure C using Qwen3-TTS without ML frameworks. tags: c, ai, tts, opensource cover_image: https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k8vpv35ge7n4xjcsglc7.png

Why pure C?

The architecture

BF16 weights, float32 compute

Streaming output

HTTP server

Voice cloning

VoiceDesign

Style and emotion control

Performance

The Metal GPU detour

The debugging journey

What it looks like

Tags

Comments

More Blog

Five Gemma-4 models, one accelerator: what porting E2B 31B to AWS Inferentia2 taught me

Hey DEV, I'm Tobore. Let's actually connect.

I burned through thousands of AI tokens. Then a friend did it for free

Claude might be saturating your machine

Automated GitHub Code Reviews Using Google Gemini

What is an "agentic harness," actually?

Ready-made automations for this