Note from June 4, 2026

Miso-TTS — open-source expressive voice model

Miso-TTS (announced by @AodenTeoMT at MisoLabs) is an 8B-parameter text-to-speech model focused on emotive, human-like delivery. The weights are open-sourced.

Why it's interesting

110 ms claimed latency, compared to ~300–700 ms for most competitors. Fast enough to feel real-time.
One-shot voice cloning from a 10-second sample.
Emotive output — the demo emphasizes emotional expressiveness, not just intelligibility.
Open weights on HuggingFace (MisoLabs/MisoTTS) — designed for local deployment.
English only at launch.

Install locally

Repo: github.com/MisoLabsAI/MisoTTS

Requirements: Python 3.10 and a CUDA GPU with enough VRAM for an 8B model in bfloat16 (about 16 GB for the weights alone, so plan for more than that).
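
Where the ~16 GB figure comes from: a quick sanity check, counting only the raw weights at 2 bytes per parameter. Activations, caches, and the watermarking model are not included, so real usage should be higher. The function names are mine, not from the repo.

```python
# Back-of-the-envelope VRAM estimate for the model weights alone.
PARAMS = 8_000_000_000   # 8B parameters, per the announcement
BYTES_PER_PARAM = 2      # bfloat16 stores each value in 16 bits


def weight_footprint_gb(params: int = PARAMS, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def weight_footprint_gib(params: int = PARAMS, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Same size in binary gibibytes (1 GiB = 2**30 bytes), which GPU tools often report."""
    return params * bytes_per_param / 2**30


if __name__ == "__main__":
    print(f"{weight_footprint_gb():.1f} GB")    # 16.0 GB
    print(f"{weight_footprint_gib():.1f} GiB")  # 14.9 GiB
```

So a card marketed as 16 GB is likely too tight once anything beyond the weights is loaded.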

# 1. Install uv (or skip and use pip)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and set up
git clone https://github.com/MisoLabsAI/MisoTTS.git
cd MisoTTS
uv sync --python 3.10
source .venv/bin/activate

# 3. Run inference (downloads weights from HuggingFace on first run)
uv run python run_misotts.py

Output lands at full_conversation.wav. Sony's SilentCipher watermarking model is auto-downloaded as part of the pipeline.

Pip alternative:

python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .
python run_misotts.py
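
Either install path should leave full_conversation.wav in the working directory. A small sketch for sanity-checking that file with Python's standard-library wave module. Assumption: the script writes uncompressed PCM; the wave module may raise wave.Error on other encodings. describe_wav is my own helper, not part of MisoTTS.

```python
import os
import wave


def describe_wav(path: str) -> dict:
    """Read basic header info from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }


if __name__ == "__main__":
    path = "full_conversation.wav"  # output name from the note above
    if os.path.exists(path):
        print(describe_wav(path))
    else:
        print(f"{path} not found; run run_misotts.py first")
```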

Things to try

Drop in a 10-second clip of my own voice and see how good the clone is.
Test latency end-to-end on my GPU (the claimed 110 ms is for the model alone, excluding network time).
Compare side-by-side with ElevenLabs / OpenAI TTS on the same script.
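
For the latency test, a minimal timing harness I could wrap around whatever the synthesis call turns out to be. I have not checked the MisoTTS Python API, so fake_synthesize below is a hypothetical stand-in that just sleeps; swap in the real call once I know it.

```python
import statistics
import time
from typing import Callable


def time_calls_ms(fn: Callable[[], object], runs: int = 10, warmup: int = 2) -> dict:
    """Call fn repeatedly and report wall-clock timings in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (first calls often include one-off setup)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


def fake_synthesize() -> None:
    """Hypothetical placeholder for the real synthesis call."""
    time.sleep(0.01)


if __name__ == "__main__":
    print(time_calls_ms(fake_synthesize))
```

Two caveats for the real measurement: GPU work in PyTorch is asynchronous, so the timed function probably needs a torch.cuda.synchronize() before returning, and it is unclear whether the 110 ms figure refers to time-to-first-audio or to a whole clip, so it is worth timing both.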