Note from June 4, 2026
Miso-TTS — open-source expressive voice model
Miso-TTS (announced by @AodenTeoMT at MisoLabs) is an 8B-parameter text-to-speech model focused on emotive, human-like delivery. The weights are open-sourced.
Why it's interesting
- 110 ms latency — compared to ~300–700 ms for most competitors. Fast enough to feel real-time.
- One-shot voice cloning from a 10-second sample.
- Emotive output — the demo emphasizes emotional expressiveness, not just intelligibility.
- Open weights on HuggingFace (MisoLabs/MisoTTS), designed for local deployment.
- English only at launch.
Install locally
Repo: github.com/MisoLabsAI/MisoTTS
Requirements: Python 3.10 and a CUDA GPU with enough VRAM for an 8B model in bfloat16 (about 16 GB for the weights alone, so plan for some headroom on top).
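The 16 GB estimate is simple arithmetic: parameter count times bytes per parameter. A minimal sketch of that calculation follows. It covers the weights only; the helper name `weight_memory_gb` is my own, and activations, caches, and the watermarking model are not counted.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 8e9  # 8B parameters, per the announcement

# bfloat16 stores each parameter in 2 bytes; float32 uses 4.
print(f"bfloat16: {weight_memory_gb(PARAMS, 2):.1f} GB")  # bfloat16: 16.0 GB
print(f"float32:  {weight_memory_gb(PARAMS, 4):.1f} GB")  # float32:  32.0 GB
```

Real usage will sit above the bfloat16 number, which is why the requirement reads as 16 GB and up.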
# 1. Install uv (or skip and use pip)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone and set up
git clone https://github.com/MisoLabsAI/MisoTTS.git
cd MisoTTS
uv sync --python 3.10
source .venv/bin/activate
# 3. Run inference (downloads weights from HuggingFace on first run)
uv run python run_misotts.py
Output lands at full_conversation.wav. Sony's SilentCipher watermarking model is auto-downloaded as part of the pipeline.
Pip alternative:
python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .
python run_misotts.py
Things to try
- Drop in a 10-second clip of my own voice and see how good the clone is.
- Test latency end-to-end on my GPU (the claimed 110 ms covers the model alone, not network overhead).
- Compare side-by-side with ElevenLabs / OpenAI TTS on the same script.
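For the latency check in the list above, a small timing harness could look like the sketch below. It uses only the standard library and does not call Miso-TTS: `fake_synthesize` is a hypothetical stand-in that just sleeps, to be swapped for whatever function actually generates audio in the repo. Warm-up runs are discarded so one-time costs such as weight loading do not skew the numbers.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(fn: Callable[[], object], warmup: int = 2, runs: int = 10) -> Dict[str, float]:
    """Time repeated calls to fn and return summary statistics in milliseconds."""
    # Discard warm-up calls (model load, kernel compilation, cache fills).
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        # Note: when timing GPU work with PyTorch, call torch.cuda.synchronize()
        # inside fn before it returns, otherwise only the kernel launch is timed.
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "runs": runs,
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


def fake_synthesize() -> None:
    """Hypothetical stand-in for a real synthesis call; sleeps for about 10 ms."""
    time.sleep(0.01)


stats = measure_latency_ms(fake_synthesize)
print(f"median {stats['median_ms']:.1f} ms over {stats['runs']} runs "
      f"(min {stats['min_ms']:.1f}, max {stats['max_ms']:.1f})")
```

Reporting the median rather than the mean keeps a single slow run from dominating the result. For the hosted services in the comparison, the same harness would also capture network time, so those numbers are not directly comparable to a local model-only figure.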