ebook2audiobook
Self-hosted audiobook conversion, honestly reviewed. 1,158 languages, voice cloning, no subscription required.
TL;DR
- What it is: Open-source (Apache-2.0) converter that turns eBooks into full audiobooks with chapters, metadata, and optionally your own cloned voice — using local AI TTS engines [1][2].
- Who it’s for: Founders and knowledge workers who consume books on commutes, language learners who want audiobooks in non-English languages that Audible doesn’t carry, and anyone who owns a legal eBook library they can’t listen to [2][4].
- Cost savings: Audible runs $14.95/month. Speechify Premium runs $139/year. ebook2audiobook is $0 in licensing — you convert unlimited books on your own hardware, forever [2].
- Key strength: The breadth is genuinely remarkable — 8 TTS engines, 1,158 languages, 20+ eBook formats, voice cloning from a 6-second audio sample, OCR for image-based PDFs. Nothing in this space comes close on paper [1][2].
- Key weakness: “On paper” is doing real work in that sentence. Conversion is slow — a 300-page book on CPU can take hours — and the quality gap between cheap TTS engines (Fairseq) and expensive ones (XTTSv2) is large enough to matter for daily listening [1][2].
What is ebook2audiobook
ebook2audiobook is a Python project that converts eBooks into audiobooks using open-source TTS models. The pipeline is: Calibre parses and converts your eBook into a usable text format, the tool splits it by chapter, then feeds each chapter through whichever TTS engine you’ve configured. Output lands as a properly chaptered audio file — .m4b, .mp3, .flac, or several others — with embedded metadata [1][2].
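The pipeline above (extract text, split by chapter, synthesize each chapter) can be sketched in a few lines. This is a minimal illustration, not the project's code: `split_chapters`, `narrate`, and the `synthesize` callable are hypothetical names, and the chapter-heading pattern is an assumption about what the extracted text looks like.

```python
import re
from typing import Callable, List, Tuple

# Assumption: each chapter starts with a line such as "Chapter 1" or "CHAPTER 12: Title".
CHAPTER_RE = re.compile(r"^chapter\s+\d+.*$", re.IGNORECASE | re.MULTILINE)


def split_chapters(text: str) -> List[Tuple[str, str]]:
    """Return (title, body) pairs; text before the first heading is dropped."""
    matches = list(CHAPTER_RE.finditer(text))
    chapters = []
    for i, match in enumerate(matches):
        # A chapter body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((match.group(0).strip(), text[match.end():end].strip()))
    return chapters


def narrate(text: str, synthesize: Callable[[str], bytes]) -> List[Tuple[str, bytes]]:
    """Feed each chapter body through a TTS callable, one chapter at a time."""
    return [(title, synthesize(body)) for title, body in split_chapters(text)]
```

In the real tool, Calibre handles text extraction and the configured TTS engine takes the place of `synthesize`; the sketch only shows why reliable chapter detection matters for the final output.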
The project is built by Drew Thomasson and sits at 18,499 GitHub stars with 1.5K forks. The Apache-2.0 license means you can use it commercially, modify it, and redistribute it, provided you keep the license text and attribution notices [4]. It ships with a Gradio web GUI for non-technical users and a headless CLI for scripting and automation.
What makes it unusual is the combination of scope and local execution. Most audiobook services either require a subscription (Audible, Speechify) or are limited to narrow use cases. ebook2audiobook does local AI inference — the TTS runs on your hardware, no API keys, no per-character billing, no data sent to third parties [2]. For someone with a library of technical PDFs or foreign-language novels, that’s the entire pitch.
Version 2.0, released in late 2024, added a proper GUI, easier local installation on Mac/Windows/Linux, and streamlined access to the fine-tuned model collection [1].
Why people choose it
The core case for ebook2audiobook comes down to three problems it solves that nothing commercial does cleanly.
Audible doesn’t have the book you want. This is the quiet majority use case. Audible’s catalog is enormous but still misses academic titles, niche technical books, self-published works, and large chunks of non-English literature. If you legally own a .epub of a Python book from 2018 or a Japanese novel, ebook2audiobook will read it to you. Audible won’t [2][4].
Commercial TTS is per-character and expensive at scale. ElevenLabs charges per character. OpenAI’s TTS API charges per character. Converting a 100,000-word book at commercial API rates costs real money — and if you want to convert a library, multiply that. ebook2audiobook converts locally with no per-use cost after hardware [2].
Voice cloning. This one’s the community favorite. XTTSv2 supports cloning from a 6-second .wav file at 24000Hz — meaning you can have books narrated in your own voice, or a specific voice you prefer. The demos in the repository include a David Attenborough clone and an ASMR voice, which gives you a sense of the quality ceiling [2].
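Before handing a reference clip to a cloning engine, it can help to check that it matches the figures quoted above (about 6 seconds, 24000 Hz). The following is a sketch using only Python's standard `wave` module; `check_voice_sample` is a hypothetical helper, not part of ebook2audiobook, and it only reads uncompressed PCM .wav files.

```python
import wave
from typing import List


def check_voice_sample(path: str, min_seconds: float = 6.0,
                       expected_rate: int = 24000) -> List[str]:
    """Return a list of problems found in a .wav reference clip (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate  # frames divided by frames-per-second
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if duration < min_seconds:
        problems.append(f"clip is {duration:.1f} s, shorter than {min_seconds:.0f} s")
    return problems
```

An empty list means the clip meets both thresholds. The check says nothing about background noise or compression artifacts, which still have to be judged by ear.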
The AlternativeTo listing [4] categorizes it as “Text to Speech” running on Self-Hosted, Docker, Windows, Mac, and Linux — which covers the deployment reality accurately. It’s not a consumer product. It’s a tool for people comfortable running Python or Docker.
Features
Based on the README and review sources:
TTS engines:
- XTTSv2 (Coqui) — highest quality, supports voice cloning, requires more VRAM [1][2]
- Bark — good for expressive speech, slower
- Fairseq — the Facebook model that unlocks most of the 1,158-language support, lower quality but fast [1]
- VITS, Tacotron2, GlowTTS, YourTTS, Tortoise — range from fast/lower-quality to slow/higher-quality [2]
- Custom fine-tuned XTTSv2 models — you can upload your own zip file or use community models from the team’s collection
Input formats:
- .epub, .mobi, .azw3, .fb2, .pdf, .txt, .rtf, .doc, .docx, .html, .odt, and more — 20+ formats total [2]
- OCR support for PDFs where pages are images rather than selectable text [README]
- Best results with .epub and .mobi (chapter detection is more reliable) [2]
Output formats:
- .m4b, .m4a, .mp3, .flac, .aac, .ogg, .wav, .webm, .mp4, .mov — mono or stereo [README]
- Full chapter markers and metadata embedded [1]
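Chapter markers are commonly handed to ffmpeg as a plain-text metadata file. As an illustration of what "embedded chapter markers" means in practice, the sketch below builds such a file in ffmpeg's FFMETADATA1 format from chapter titles and durations. It is not taken from the project; `ffmetadata_chapters` and its parameters are hypothetical names.

```python
from typing import List, Tuple


def _escape(value: str) -> str:
    # ffmpeg's metadata format requires these characters to be backslash-escaped.
    for ch in ("\\", "=", ";", "#", "\n"):
        value = value.replace(ch, "\\" + ch)
    return value


def ffmetadata_chapters(chapters: List[Tuple[str, float]], album: str) -> str:
    """Build an FFMETADATA1 document from (title, duration_seconds) pairs."""
    lines = [";FFMETADATA1", f"title={_escape(album)}"]
    start_ms = 0
    for title, seconds in chapters:
        # Each chapter starts where the previous one ended.
        end_ms = start_ms + int(round(seconds * 1000))
        lines += ["[CHAPTER]", "TIMEBASE=1/1000",
                  f"START={start_ms}", f"END={end_ms}",
                  f"title={_escape(title)}"]
        start_ms = end_ms
    return "\n".join(lines) + "\n"
```

ffmpeg can read a file like this as an additional input and map its metadata and chapters onto the audio when writing an .m4b, which is what lets audiobook players show a chapter list.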
Language support:
- 1,158 languages and dialects via Fairseq [1][2]
- Major languages (English, Chinese, Spanish, French, German, Japanese, Hindi, etc.) with higher-quality voice options
- Voice cloning is limited to the engines that support it (primarily XTTSv2)
SML tags:
- Fine-grained control over breaks, pauses, and voice switching within the text [README]
- Lets you mark up the source text to control narration rhythm
Run modes:
- Local via Gradio web GUI (browser-based, no command line required) [1]
- Headless CLI for scripting
- Docker and docker-compose for server deployment [3]
- Hugging Face Spaces for zero-install cloud use (free tier, slow, may time out on long books) [1]
- Google Colab and Kaggle notebooks for GPU-accelerated conversion without a local GPU [README]
Hardware requirements:
- Minimum: 2GB RAM, 1GB VRAM [README]
- Recommended: 8GB RAM, 4GB VRAM
- CPU-only mode works but is slow — the README explicitly warns that modern TTS engines on CPU are very slow and recommends lower-quality engines (YourTTS, Tacotron2) for CPU-only setups [README]
- Supports CUDA, ROCm, Apple Silicon MPS, Intel/AMD XPU [README]
Pricing: SaaS vs self-hosted math
ebook2audiobook has no paid tier — it’s free software. The cost comparison is against commercial audiobook creation services and subscription platforms.
Audible:
- $14.95/month for one credit (one book)
- Credits don’t cover backlist academic or niche titles
- $179.40/year for 12 books
Speechify Premium:
- $139/year for TTS listening — you import PDFs and it reads them aloud in a voice
- Decent quality, but no voice cloning, no self-hosted option, and you’re paying annually
ElevenLabs (API, for bulk conversion):
- ~$0.30 per 1,000 characters at Starter tier
- A 300-page book ≈ 450,000 characters ≈ $135 per book in API fees at that rate
- Library of 50 books ≈ $6,750 in API costs
ebook2audiobook self-hosted:
- Software: $0 (Apache-2.0) [4]
- Server or NAS: already covered if you have one. A Raspberry Pi 5 or a Synology NAS handles it [3]
- Electricity for GPU conversion: a few cents per book
- Hugging Face free tier: $0, but limited to free CPU compute, so slow and unreliable for long books [1]
The math for a realistic use case: someone who wants to convert 20 books per year from their legal eBook library. On Audible, that’s about $300 per year (20 credits at $14.95) and assumes the books are in the catalog. With the ElevenLabs API at the rate above, that’s ~$2,700. With ebook2audiobook on existing hardware, that’s $0 ongoing after one afternoon of setup.
The caveat: if you need GPU speed and don’t have hardware, a short cloud VM session (Colab Pro, ~$10/month) covers most conversion needs.
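The per-book and per-year figures in this section are simple multiplication, so it is easy to rerun them with current prices. The sketch below makes the inputs explicit; the function names are illustrative, and the rates you pass in should come from the vendors' current price lists rather than from this article.

```python
def api_cost(characters: int, rate_per_1k: float) -> float:
    """Cost of synthesizing `characters` at a price per 1,000 characters."""
    return characters / 1000 * rate_per_1k


def subscription_cost(books: int, price_per_credit: float) -> float:
    """Cost of buying one credit per book."""
    return books * price_per_credit
```

For the scenario above, call `api_cost(450_000, 0.30)` for one 300-page book through a per-character API, and `subscription_cost(20, 14.95)` for twenty Audible credits.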
Deployment reality check
Three paths exist depending on your technical comfort level.
Zero-install (Hugging Face Spaces or Colab): The project runs on Hugging Face Spaces — visit the link, upload your eBook, convert. This works, but the free tier’s CPU compute makes it impractical for anything longer than a short story. The Notebookcheck review [1] explicitly notes potential timeouts on long books. Google Colab gives you GPU access on the free tier, and the project ships a notebook for it — more reliable for book-length content.
Local install (Windows/Mac/Linux): Version 2.0 added a proper installer. You run a shell script (.sh on Mac/Linux, .cmd on Windows), it installs dependencies including Calibre and ffmpeg, and launches the Gradio GUI. Minimum requirements are modest — 2GB RAM, 1GB VRAM — but the quality-to-speed trade-off is real. If you’re on a MacBook with Apple Silicon, MPS acceleration helps. If you’re on an old Windows laptop without a GPU, expect CPU-mode speed.
Docker (for server/NAS deployment): A docker-compose.yml ships with the project. Mariushosting lists it as a viable Synology NAS deployment [3], which suggests the Docker path is clean enough for home server users. The README also includes a podman-compose.yml for rootless environments.
What can go sideways:
- Long books on CPU mode take hours. A 400-page novel with XTTSv2 on CPU can run 8–12 hours. Use a GPU or the Colab notebook [1][2].
- The Hugging Face hosted version times out on anything substantial [1].
- Voice cloning requires a clean audio sample — background noise, compression artifacts, or a sample under 3 seconds will degrade the result significantly [2].
- PDF conversion quality varies. Text-native PDFs work well; scanned PDFs rely on OCR, which introduces errors that show up as mispronunciations or skipped lines [README].
- Fine-tuned models are community-contributed and vary in quality. The “official” preset list is small; for languages other than English, you’re often relying on Fairseq which has a noticeably more synthetic sound [1][2].
Realistic time estimate: 30–60 minutes to first working audiobook for someone comfortable with Docker or Python. 2–4 hours for a non-technical user on the local GUI installer, including downloading models (which are large — XTTSv2 alone is several GB).
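The CPU conversion figure quoted above (8 to 12 hours for a 400-page novel with XTTSv2) can be scaled to other book lengths for rough planning. This sketch assumes conversion time grows linearly with page count, which is an assumption of the example and not a measured property of the tool; `cpu_hours_estimate` is a hypothetical helper.

```python
from typing import Tuple


def cpu_hours_estimate(pages: int, ref_pages: int = 400,
                       ref_hours: Tuple[float, float] = (8.0, 12.0)) -> Tuple[float, float]:
    """Scale a reference (low, high) hour range linearly to another page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    low, high = ref_hours
    return (pages * low / ref_pages, pages * high / ref_pages)
```

Treat the result as an order-of-magnitude guide: actual speed depends on the engine, the hardware, and how dense the text is.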
Pros and cons
Pros
- Apache-2.0 license. Use it commercially or embed it in your product; the main obligations are keeping the license and attribution notices [4]. This matters if you’re thinking about building a service on top of it.
- Broadest language support in the category. 1,158 languages via Fairseq is not marketing hyperbole — it’s the Facebook MMS model which genuinely covers minority languages that no commercial service touches [1][2].
- Voice cloning from a 6-second clip. XTTSv2’s voice cloning is good enough for extended listening. The demos are convincing [2].
- No per-use cost. Once set up, unlimited conversion with no API bills [2].
- Multiple run modes. Hugging Face for zero-install, Colab for free GPU, local GUI for non-technical users, CLI for power users, Docker for servers. The project doesn’t lock you into one path [1][README].
- OCR support. Covers scanned PDFs that other converters refuse to touch [README].
- Chapter-aware output. Embedded chapter markers in .m4b files means proper navigation in any audiobook player [1].
Cons
- CPU conversion is genuinely slow. This isn’t a “slightly slow” situation — hours per book on CPU with quality TTS engines [1]. Either you need a GPU or you need to use cloud notebooks and tolerate the setup friction.
- Quality gap between engines is large. Fairseq (the broad-language engine) sounds synthetic. XTTSv2 sounds much better but needs more compute. You can’t get 1,158-language coverage AND high-quality narration simultaneously without the right hardware [1][2].
- No active cloud product. The Hugging Face instance times out. Colab is free but requires Google account management and notebook setup. There’s no hosted version that “just works” for non-technical users [1].
- Calibre dependency. The conversion pipeline depends on Calibre being installed, which adds a step and occasional format-specific quirks.
- Community project maintenance. The project is maintained by an individual developer supported via Ko-fi donations, not a funded company. The issue tracker has accumulated reports; longevity is not guaranteed.
- PDF quality is hit-or-miss. Scanned PDFs with OCR produce errors that appear as mispronunciations mid-sentence. Text-native PDFs work far better [README].
- Large model downloads. XTTSv2 model files are several GB. First run requires download time and disk space you may not have budgeted for.
Who should use this / who shouldn’t
Use ebook2audiobook if:
- You have a legal eBook library — technical books, academic papers, self-published novels — that you want to listen to and Audible doesn’t carry it.
- You have a GPU (even a modest one — 4GB VRAM is comfortable) or are fine using Google Colab notebooks.
- You want audiobooks in a language that no commercial service supports. 1,158 languages is not a feature any SaaS can match.
- You want voice cloning — hearing a specific voice narrate your books is a genuinely different experience.
- You’re comfortable with Docker or a Python installer and can invest an afternoon in setup.
Skip it (use Speechify or read-aloud browser extensions) if:
- You want something that works in five minutes with no setup. ebook2audiobook is not a consumer app.
- You only need to listen to books that are on Audible. Buy the credit; the quality and convenience aren’t worth the self-hosting friction.
- You don’t have a GPU and don’t want to use cloud notebooks. CPU mode is too slow to be a daily tool.
Skip it (use ElevenLabs or a professional narrator) if:
- You need broadcast-quality audio for a podcast or commercial audiobook production. ebook2audiobook’s best output is good for personal listening; it’s not production-ready without significant post-processing.
Alternatives worth considering
- Speechify — subscription-based, polished iOS/Android/browser app, solid TTS quality, reads PDFs and web articles. No self-hosting, no voice cloning, no obscure language support. The right choice if you want it to just work.
- Kokoro TTS — newer open-source TTS model with better English quality than XTTSv2 at lower compute cost. Not an end-to-end ebook converter, but can be integrated if you want better English narration.
- Calibre alone — Calibre has basic TTS read-aloud built in. Lower quality than ebook2audiobook but zero setup friction if you’re already using it as your eBook library manager.
- Piper TTS — fast, lightweight, runs on a Raspberry Pi. Used in Home Assistant. Less polished than XTTSv2 but excellent for low-power hardware. Not an ebook converter — you’d need to script it yourself.
- XTTS WebUI (standalone) — if you want voice cloning without the ebook conversion wrapper, the underlying Coqui XTTS model has its own web interfaces.
- Audible + Whispersync — for Amazon ecosystem users with Kindle books, this is still the most friction-free path to audiobooks. Works only within Amazon’s walled garden.
Bottom line
ebook2audiobook does something genuinely useful that no commercial product covers: it converts your legal eBook library into audiobooks using local AI, with voice cloning and a language list that extends far beyond English. The Apache-2.0 license, 18K+ GitHub stars, and eight supported TTS engines make it the clear choice in its specific niche.
The catches are real. CPU speed makes it impractical without a GPU or cloud compute. Quality varies significantly by engine. The hosted options are either slow (Hugging Face) or require notebook fiddling (Colab). This is a hobbyist and power-user tool, not a consumer product.
For a non-technical founder, the value case depends on your library. If you have dozens of technical books you bought and never listened to, the one-time setup investment pays off quickly. If you’re starting from zero with mainstream titles, Speechify or Audible is less friction. If you need a specific language or voice cloning, nothing else comes close.
Sources
- [1] Stephen Pereyra, Notebookcheck, “Open-source ebook to audiobook converter supports a massive 1000+ languages” (Dec 30, 2024). https://www.notebookcheck.net/Open-source-ebook-to-audiobook-converter-supports-a-massive-1000-languages.938716.0.html
- [2] april, Medium, “ebook2audiobook: Turn Any Legal eBook into a High-Quality Audiobook in 1110+ Languages!” (Oct 21, 2025). https://medium.com/@april-4/ebook2audiobook-turn-any-legal-ebook-into-a-high-quality-audiobook-in-1110-languages-7ae375a3b435
- [3] Marius Hosting, “Synology: Best Docker Container Converters”. https://mariushosting.com/synology-best-docker-container-converters/
- [4] AlternativeTo, “Voice Book Reader Alternatives” (ebook2audiobook listing). https://alternativeto.net/software/voice-book-reader/
Primary sources:
- GitHub repository and README: https://github.com/DrewThomasson/ebook2audiobook (18,499 stars, Apache-2.0 license)
- Docker Hub image: https://hub.docker.com/r/athomasson2/ebook2audiobook
- Hugging Face Space: https://huggingface.co/spaces/drewThomasson/ebook2audiobook