Bogdan Dragomir

Note from June 4, 2026

Miso-TTS — open-source expressive voice model

Miso-TTS (announced by @AodenTeoMT at MisoLabs) is an 8B-parameter text-to-speech model focused on emotive, human-like delivery. The weights are open-sourced.

Why it's interesting

  • Claimed 110 ms latency, versus roughly 300–700 ms for most competitors. Fast enough to feel real-time.
  • One-shot voice cloning from a 10-second sample.
  • Emotive output — the demo emphasizes emotional expressiveness, not just intelligibility.
  • Open weights on Hugging Face (MisoLabs/MisoTTS), designed for local deployment.
  • English only at launch.

Install locally

Repo: github.com/MisoLabsAI/MisoTTS

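Requirements: Python 3.10 and a CUDA GPU with enough VRAM for an 8B model in bfloat16. At 2 bytes per parameter the weights alone take ~16 GB, so a 16 GB card leaves no headroom for activations.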
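
Quick sanity check on that VRAM figure. This is only back-of-the-envelope arithmetic for the raw weights (bfloat16 is 2 bytes per parameter); it ignores activations, caches, and anything else the pipeline loads, so actual usage will be higher. The function name is mine, not something from the repo.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the raw weights only, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


params = 8e9  # 8B parameters, per the announcement

# bfloat16 stores each parameter in 2 bytes
print(f"bfloat16 weights: {weight_memory_gb(params):.1f} GB")   # bfloat16 weights: 16.0 GB

# GPU tools often report binary units (GiB), which makes the number look smaller
print(f"same figure in GiB: {params * 2 / 2**30:.1f} GiB")      # same figure in GiB: 14.9 GiB
```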

# 1. Install uv (or skip and use pip)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and set up
git clone https://github.com/MisoLabsAI/MisoTTS.git
cd MisoTTS
uv sync --python 3.10
source .venv/bin/activate

# 3. Run inference (downloads weights from HuggingFace on first run)
uv run python run_misotts.py

Output lands at full_conversation.wav. Sony's SilentCipher watermarking model is auto-downloaded as part of the pipeline.
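
To confirm the run actually produced audio, a minimal sketch using only Python's standard `wave` module. It assumes the output is a plain PCM WAV; if the script writes float samples instead, `wave.open` will raise `wave.Error` and a different reader is needed. `describe_wav` is my own helper name, not part of MisoTTS.

```python
import os
import wave


def describe_wav(path: str) -> dict:
    """Return basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": frames / rate,
        }


if __name__ == "__main__":
    path = "full_conversation.wav"  # output filename from the note above
    if os.path.exists(path):
        print(describe_wav(path))
    else:
        print(f"{path} not found; run the inference script first")
```

A duration of zero or a surprising sample rate is a quick hint that something went wrong before I bother listening.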

Pip alternative:

python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .
python run_misotts.py

Things to try

  • Drop in a 10-second clip of my own voice and see how good the clone is.
  • Test end-to-end latency on my GPU (the claimed 110 ms covers the model alone, not network overhead).
  • Compare side-by-side with ElevenLabs / OpenAI TTS on the same script.
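
For the latency test, a rough timing harness I can wrap around whatever call ends up doing the synthesis. I have not checked the repo's Python API, so `fake_synthesize` is a hypothetical stand-in that just sleeps; the real call would replace it. The harness does a few warm-up runs first, since first calls usually include one-off setup cost, then reports the median over several runs.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(fn: Callable[[], object], warmup: int = 2, runs: int = 10) -> dict:
    """Time fn() over several runs and return min/median/max in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


def fake_synthesize() -> None:
    # Hypothetical stand-in for the real model call; replace with actual synthesis.
    time.sleep(0.05)


if __name__ == "__main__":
    print(measure_latency_ms(fake_synthesize))
```

Two caveats when swapping in the real model. If it runs on PyTorch with CUDA, GPU work is asynchronous, so the timed function should call `torch.cuda.synchronize()` before returning or the numbers will be too optimistic. And a latency claim for TTS may refer to time-to-first-audio rather than time to finish the whole clip, so it is worth measuring both if the model can stream.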