Speaches
Speaches is a Python-based application that provides AI transcription, translation, and speech generation.
Self-hosted speech-to-text and text-to-speech, honestly reviewed. No marketing fluff, just what you get when you replace the OpenAI audio API with your own server.
TL;DR
- What it is: An OpenAI API-compatible server for speech-to-text and text-to-speech — a drop-in local replacement for OpenAI’s Whisper and TTS endpoints [README].
- Who it’s for: Developers and founders running audio transcription or voice features who want to stop paying per-minute to OpenAI, or who can’t send audio data to a third party due to privacy or compliance constraints [README][2].
- Cost savings: OpenAI’s Whisper API charges $0.006/minute for transcription [OpenAI pricing]. A founder transcribing 500 hours/month pays ~$180/mo. Self-hosted Speaches on a $20/mo GPU VPS costs ~$20/mo — roughly $160 saved monthly, or nearly $2,000/year.
- Key strength: True drop-in replacement. If your code already calls OpenAI’s audio endpoints, you change one URL and one API key — nothing else [README].
- Key weakness: Thin third-party review coverage; GPU RAM becomes the real cost driver at scale; TTS voice variety is narrower than commercial offerings; and as a relatively young project (3,075 GitHub stars), it lacks the battle-tested reliability record of alternatives like faster-whisper used standalone [README].
What is Speaches
Speaches is a self-hosted server that speaks OpenAI’s audio API dialect fluently. Point any client, SDK, or app at it instead of api.openai.com and it handles transcription (speech-to-text), translation, text-to-speech generation, and a real-time voice API — all running on your hardware [README].
Under the hood, the speech-to-text engine is faster-whisper, a CTranslate2-based reimplementation of OpenAI’s Whisper models that runs significantly faster with lower memory footprint than the original. For text-to-speech, Speaches ships two engines: Kokoro (ranked #1 in the HuggingFace TTS Arena as of this writing) and piper, a fast local voice synthesis model that runs on CPU without issues [README].
The project’s own framing is the clearest description: “This project aims to be Ollama, but for TTS/STT models.” That single sentence tells you everything: pull a model, serve it locally, expose a standard API, move on [README]. If you’ve used Ollama to self-host LLMs, the mental model is identical.
The project is MIT licensed, deployable via Docker Compose, and has accumulated 3,075 GitHub stars. It’s not a corporate product — it’s a developer-run open-source project with real traction but none of the support infrastructure of a commercial offering.
Why people choose it
The third-party review landscape for Speaches is sparse — it hasn’t attracted the wave of comparison articles that workflow automation tools or databases get. But the use case is specific enough that it doesn’t need them: if you’re already calling OpenAI’s audio API and want to stop, there’s a clear reason to switch.
The only substantive third-party mention found in the scraped sources comes from a 2025 Mozilla/EleutherAI collaboration [2]. Mozilla chose Speaches as the backend for their “Self-Hosted Audio Transcription with Whisper” toolkit — part of a suite designed to help developers build ethical AI datasets without routing audio through third-party cloud APIs. Mozilla’s framing: “This setup replicates the functionality of commercial transcription APIs while keeping all data under the user’s control. This is particularly important for developers working with sensitive or private audio data that must not be shared with third-party cloud providers” [2].
That quote from Mozilla captures the core reason developers reach for Speaches:
Privacy and data residency. Every audio file you send to OpenAI’s Whisper endpoint transits their infrastructure. For healthcare, legal, finance, or any product where audio contains personally identifiable information, that’s a liability. Speaches keeps the audio on your server, period [2][README].
Cost at volume. OpenAI’s pricing is fine at low volume. At 500 hours/month of transcription it becomes a line item worth optimizing. At thousands of hours (podcast platforms, call centers, voice-heavy products) it becomes a significant budget driver. The math on self-hosting becomes obvious fast [OpenAI pricing].
Rate limits and reliability. OpenAI’s API imposes rate limits that hit products with bursty transcription workloads. A self-hosted instance has no rate limit except your hardware.
Drop-in compatibility. Unlike most self-hosted alternatives that require code changes, Speaches is designed explicitly so that existing OpenAI SDK integrations work without modification. You change one environment variable [README].
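To make the "one environment variable" claim concrete, here is a minimal sketch of the swap. The `SPEACHES_BASE_URL` variable name and the fallback key value are illustrative assumptions, not values from the Speaches docs; the point is that the calling code is identical for both backends.

```python
import os

def audio_api_config():
    """Return (base_url, api_key) for either OpenAI or a self-hosted Speaches instance.

    With SPEACHES_BASE_URL unset, requests go to OpenAI as before; set it to
    e.g. "http://localhost:8000/v1" and the same code hits your own server.
    """
    base_url = os.environ.get("SPEACHES_BASE_URL", "https://api.openai.com/v1")
    # Speaches does not bill per request, but the OpenAI SDK still requires
    # some api_key string, so a placeholder is fine when self-hosting.
    api_key = os.environ.get("OPENAI_API_KEY", "not-needed-for-speaches")
    return base_url, api_key

# Usage with the official SDK, unchanged apart from the constructor:
#   from openai import OpenAI
#   base_url, api_key = audio_api_config()
#   client = OpenAI(base_url=base_url, api_key=api_key)
#   client.audio.transcriptions.create(model=..., file=open("audio.mp3", "rb"))
```

Because the selection happens in one place, rolling back to OpenAI is the same one-variable change in reverse.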
Features
Based on the README and documentation:
Speech-to-Text:
- OpenAI `/v1/audio/transcriptions` and `/v1/audio/translations` endpoints [README]
- Powered by faster-whisper — supports all Whisper model sizes (tiny through large-v3) [README]
- Streaming transcription via Server-Sent Events — tokens arrive as they’re recognized, not after the full audio is processed [README]
- Voice Activity Detection (VAD) to filter silence and improve accuracy [website nav]
- Dynamic model loading: specify a model in the request, it loads automatically; unloads after inactivity [README]
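On the streaming side, the client's job is to decode Server-Sent Events as they arrive. A hedged sketch of that decoding step follows: the `data: ` line prefix and `[DONE]` sentinel are the standard SSE/OpenAI streaming conventions, but the exact shape of each JSON payload from Speaches is an assumption here.

```python
import json

def parse_sse(chunk: str):
    """Yield decoded JSON payloads from a block of SSE-formatted text.

    Each event would carry a partial transcript that a UI can render
    immediately, instead of waiting for the whole file to finish.
    """
    for line in chunk.splitlines():
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload.strip() == "[DONE]":  # end-of-stream sentinel
                return
            yield json.loads(payload)
```

In a real client you would feed this from the HTTP response body as bytes arrive, rather than from a complete string.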
Text-to-Speech:
- OpenAI-compatible `/v1/audio/speech` endpoint [README]
- Two TTS engines: Kokoro (ranked #1 in TTS Arena, 82M parameters) and piper [README]
- Multiple voice options depending on which engine and models you download [README]
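The request body for the speech endpoint follows the OpenAI TTS schema. The sketch below builds that body; the model id and voice name are illustrative assumptions (the actual ids depend on which models you have pulled into your Speaches instance).

```python
import json

def speech_request(text, model="kokoro", voice="af", response_format="mp3"):
    """Build the JSON body an OpenAI-compatible /v1/audio/speech endpoint expects.

    "kokoro" and "af" are placeholder ids for illustration; substitute the
    model and voice names actually installed on your server.
    """
    return json.dumps({
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": response_format,
    })

# POST this body to {base_url}/v1/audio/speech with the usual headers;
# the response body is the encoded audio.
```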
Realtime API:
- Implements OpenAI’s Realtime API for bidirectional audio streaming [README][website]
- Enables async speech-to-speech: audio in, audio out, with model interaction in the middle [README]
- Demo video in the README shows it working end-to-end [README]
Audio Chat Completions:
- Hooks into the chat completions endpoint format for audio I/O [README][OpenAI docs]
- Enables: text body → spoken audio summary; audio recording → text analysis; full speech-to-speech with a model in the loop [README]
Infrastructure:
- GPU and CPU support — runs without a GPU, just slower [README]
- Docker and Docker Compose deployment [README][website]
- Open WebUI integration documented [website nav]
- Highly configurable — separate documentation page on configuration options [website]
- REST API documented at `/api/` [website nav]
What it does not do: speaker diarization (who said what), word-level timestamps exposed as a first-class feature, or multi-language model management with a GUI. It’s a server, not a platform.
Pricing: SaaS vs self-hosted math
OpenAI (the SaaS you’re replacing):
- Whisper transcription: $0.006 per minute [OpenAI pricing]
- TTS standard voices: $15.00 per 1 million characters [OpenAI pricing]
- TTS HD voices: $30.00 per 1 million characters [OpenAI pricing]
Concrete math for a transcription-heavy workload:
Say your product transcribes 200 hours of audio per month (meeting recorder, podcast tool, voice notes app):
- OpenAI Whisper: 200h × 60min × $0.006 = $72/mo
- Speaches on a $10 Hetzner CPU VPS: $10/mo (slower; CPU runs faster-whisper at maybe 5–10× realtime for small models)
- Speaches on a $25 GPU VPS (RTX 3060 tier): $25/mo (large-v3 model at 30–50× realtime)
At 500 hours/month:
- OpenAI: $180/mo
- Self-hosted: still $10–$25/mo
Year comparison at 500h/month: OpenAI ≈ $2,160. Self-hosted ≈ $120–$300. Savings: $1,860–$2,040/year.
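The break-even arithmetic above is simple enough to keep as a two-function sketch, useful for plugging in your own volume and hosting price. Rates are the OpenAI list prices cited in this article; your VPS cost is whatever you actually pay.

```python
def openai_whisper_cost(hours_per_month, rate_per_min=0.006):
    """Monthly OpenAI Whisper API cost in USD at $0.006/minute."""
    return hours_per_month * 60 * rate_per_min

def annual_savings(hours_per_month, vps_per_month):
    """Yearly savings from self-hosting versus the OpenAI Whisper API."""
    return 12 * (openai_whisper_cost(hours_per_month) - vps_per_month)

# 500 h/month against a $25/mo GPU VPS: roughly $180/mo on OpenAI,
# so on the order of $1,860 saved per year.
```

Note this ignores the engineering time to set up and maintain the server, which is the honest counterweight at low volumes.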
For TTS (say you generate 50 million characters/month for voice features):
- OpenAI standard: $750/mo
- Speaches with Kokoro on a GPU VPS: $25–$40/mo
The math compounds fast. Audio API costs are one of the more aggressive per-unit pricing models in the AI API market. Volume breaks the budget quickly.
Speaches has no SaaS tier. It’s purely self-hosted. There’s no managed cloud option, no paid support contract, no enterprise tier. You run it yourself or you don’t run it [README][website].
Deployment reality check
Mozilla’s description of the deployment process is accurate: “The toolkit offers a streamlined deployment process using either Docker or command-line setup, making it accessible even for developers with limited infrastructure experience” [2].
What you actually need:
- A Linux server (VPS or bare metal) — CPU works, GPU strongly recommended for production use
- Docker and docker-compose
- Disk space for models — faster-whisper large-v3 is ~3GB, Kokoro is ~82M parameters (small), piper voices vary
- A reverse proxy (Caddy/nginx) if you want HTTPS
- RAM: 4GB minimum for small models; 8GB+ for large-v3 on CPU; 6–8GB VRAM for GPU inference
What can go sideways:
The dynamic model loading feature is useful but means your first request to a model triggers a download. In production you want to pre-warm models to avoid latency on the first call [README].
GPU memory is the hard constraint. Running large-v3 Whisper plus a Kokoro TTS model simultaneously requires enough VRAM. On a 4GB VRAM card this gets tight; 8GB+ is comfortable for running both [README, inference from model sizes].
The documentation site (speaches.ai) is relatively thin — configuration options are listed but without deep explanations or troubleshooting guides. For a new self-hoster, this means more time reading the source code or GitHub issues than ideal [website].
There is no web UI bundled. Speaches is a server, not a SaaS product with a dashboard. Integration with Open WebUI is documented as an option [website], but that’s a separate tool you’d also need to set up.
Realistic deployment time: 30–60 minutes to a working instance for someone comfortable with Docker. A non-technical founder will need technical help.
Pros and cons
Pros
- True drop-in OpenAI replacement. If your codebase uses the OpenAI Python or Node SDK, the only change is `base_url`. No rewriting, no new client library to learn [README].
- MIT license. Use it, modify it, embed it in commercial products — no restrictions, no “fair-code” gotchas, no license compliance risk [README].
- Kokoro TTS is legitimately good. Ranked #1 in the HuggingFace TTS Arena — this isn’t a low-quality open-source voice synth. The output quality is competitive with commercial offerings [README].
- Streaming transcription. SSE-based streaming means your frontend gets partial transcripts in real time, not a batch result at the end. This matters for UX in voice interfaces [README].
- Realtime API. The bidirectional audio streaming API implementation makes Speaches usable for actual voice assistant products, not just batch transcription [README].
- Dynamic model management. Load and unload models on demand — useful if you want to run multiple Whisper model sizes without manually managing memory [README].
- Mozilla’s endorsement as infrastructure. Being chosen as the audio backend for Mozilla’s ethical AI dataset toolkit is meaningful third-party validation [2].
Cons
- No GUI, no dashboard. It’s a pure API server. Non-technical users can’t interact with it without building something on top.
- Thin documentation. The docs exist but aren’t comprehensive. Edge cases and troubleshooting require digging [website].
- GPU becomes the real cost. To get production-grade transcription speed on large Whisper models, you need GPU hosting — which costs more than a basic VPS and may reduce the savings margin for lower-volume workloads.
- Limited third-party review coverage. Almost no independent benchmarks, comparison articles, or community war stories as of this writing. You’re partly flying blind on real-world reliability.
- Young project. 3,075 stars is solid for a niche tool, but it’s not the scale of a battle-tested project. The commit history and issue tracker are the closest thing to a reliability track record [README].
- No managed fallback. If your self-hosted instance goes down, there’s no cloud tier to failover to. You need to build your own redundancy or accept the availability risk.
- TTS voice catalog is narrower than commercial. Kokoro and piper cover many use cases, but the range of voices and languages is smaller than AWS Polly, Azure Neural TTS, or OpenAI’s TTS endpoint.
Who should use this / who shouldn’t
Use Speaches if:
- You’re already calling OpenAI’s Whisper or TTS API and your monthly bill is above $50.
- You handle audio data that can’t leave your infrastructure — medical, legal, financial, HR.
- You’re building a product where audio API costs are a significant margin concern at scale.
- You have a developer who can manage a Docker container; the integration is essentially a URL swap.
- You want the Kokoro TTS quality without the per-character billing.
Skip it if:
- Your audio transcription volume is low (under 50 hours/month) — the OpenAI API cost is negligible and the operational overhead isn’t worth it.
- You need speaker diarization, multi-speaker identification, or fine-grained word-level timestamps as first-class features — use a dedicated service or Whisper with post-processing.
- You’re non-technical and don’t have someone to handle deployment and maintenance.
- You need enterprise SLAs, support contracts, or compliance certifications — there are none.
- You need a very broad TTS voice catalog with many languages and accents — commercial cloud TTS still wins here.
Alternatives worth considering
For self-hosted STT specifically:
- faster-whisper directly — the same STT engine Speaches uses, without the API server wrapper. More control, more setup. Good choice if you only need transcription as part of a larger pipeline, not as an API endpoint.
- LocalAI — also OpenAI-compatible, covers STT/TTS plus LLMs and image generation. More ambitious scope, more moving parts. Better if you want one server to replace multiple OpenAI endpoints.
- Whisper.cpp — C++ port, extremely fast, CLI-first. No built-in server or API. Requires wrapping if you need an HTTP endpoint.
For cloud STT (if self-hosting isn’t viable):
- Deepgram — generally faster and cheaper than OpenAI Whisper API for streaming use cases, with more features (diarization, language detection).
- AssemblyAI — strong on accuracy and features like speaker labels; priced per minute.
For self-hosted TTS:
- Coqui TTS — the most comprehensive open-source TTS toolkit, though the company shut down and the project is now community-maintained.
- piper standalone — the same engine Speaches uses for TTS; faster setup if TTS is your only need.
Speaches is the right choice specifically when: you want OpenAI API compatibility, you need both STT and TTS from one server, and you want drop-in replacement behavior rather than building a custom integration.
Bottom line
Speaches does exactly one thing well: it lets you replace OpenAI’s audio API endpoints with a server that runs on your own hardware, without changing your code. The “Ollama for TTS/STT” framing is accurate and honest. It doesn’t have a polished dashboard, a large community, or deep documentation — it has a clean API surface, MIT license, and legitimate quality on both STT (faster-whisper) and TTS (Kokoro, #1 in the TTS Arena). For a founder running a voice-heavy product who’s watching the OpenAI audio API line item grow month over month, that’s the relevant comparison. The math at a few hundred hours of audio per month closes fast. The setup requires a developer, but once running it’s a URL swap, not a migration.
If the deployment is the blocker, upready.dev handles exactly this kind of one-time infrastructure setup for founders who want the savings without the ops work.
Sources
- Mozilla / WebGLStats — “Mozilla Launches Open-Source Tools to Help Developers Build Ethical AI Datasets” (April 17, 2025). Covers Speaches as the audio transcription backend for Mozilla’s ethical AI dataset toolkit. https://webglstats.com/mozilla-launches-open-source-tools-to-help-developers-build-ethical-ai-datasets/
Primary sources:
- GitHub repository and README: https://github.com/speaches-ai/speaches (3,075 stars, MIT license)
- Official documentation and website: https://speaches.ai
- OpenAI Audio API pricing page: https://openai.com/api/pricing (Whisper: $0.006/min; TTS standard: $15/1M chars; TTS HD: $30/1M chars)
- HuggingFace TTS Spaces Arena (Kokoro ranking): https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena