Local Voiceover/Audiobook generation (infosec.pub)

submitted 2 months ago* (last edited 2 months ago) by splendoruranium@infosec.pub to c/localllama@sh.itjust.works


I'm looking to locally generate voiceovers from text and also try to generate audiobooks. Does anyone have experience with sherpa-onnx? There also appear to be two separate frontends for Kokoro specifically dedicated for audiobook creation, but they appear to both be abandoned. Or am I barking up the completely wrong tree?
Thanks!


[-] hendrik@palaver.p3x.de 10 points 2 months ago* (last edited 2 months ago)

Sherpa looks like a nice all-in-one solution. I'm just not sure if Kaldi is still the latest and greatest when it comes to high-quality TTS.

I've tried Kokoro, Orpheus and Chatterbox, and those sound very nice, but they're just the TTS part. A few I haven't tried are fish and OuteTTS. And then there are Kaldi and xTTSv2.

There are some projects building upon TTS to create similar things, though I've mainly looked into the dialogue and podcast creation:

Dia (does dialogue, works for English and can do a few rudimentary things like laugh and chuckle)
Mozilla's Document-to-Podcast
Chatterbox-Audio is someone trying to do audiobooks with Chatterbox, though I'm not sure about the quality and progress of the attempt
And a few abandoned attempts like Podcastify and Podcast-LLM, from when Google introduced NotebookLM and people tried to recreate that. But I don't think they're very helpful if you want a tool that actually works.

I ended up writing a Python script myself to split up the text into paragraphs, feed it to Kokoro/Orpheus and then stitch back together the generated audio into a dialogue. I didn't find anything that did the exact thing I wanted to do. And Dia gave me a particularly bad time because the talking speed was all over the place and the voices weren't very consistent.
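For anyone wanting to try the same approach, here is a minimal sketch of that split, synthesize and stitch loop. It is not the actual script: `synthesize` is a hypothetical stand-in that only emits a short tone, the voice names and the 24 kHz sample rate are assumptions, and you would swap in a real call to whichever engine you use (Kokoro, Orpheus, ...).

```python
import math
import re
import struct
import wave

SAMPLE_RATE = 24_000  # assumed output rate; use whatever your TTS engine produces


def split_paragraphs(text):
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def synthesize(paragraph, voice):
    """Hypothetical stand-in for a real TTS call.

    Returns 16-bit mono PCM bytes: a tone whose length grows with the
    number of words (capped at one second), so the pipeline can be tested
    without any model installed.
    """
    freq = 220.0 if voice == "narrator_a" else 330.0
    n = int(SAMPLE_RATE * min(0.05 * len(paragraph.split()), 1.0))
    return b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)))
        for i in range(n)
    )


def silence(seconds):
    """PCM silence used as a pause between paragraphs."""
    return b"\x00\x00" * int(SAMPLE_RATE * seconds)


def build_dialogue(text, voices, out_path, pause=0.4):
    """Alternate voices per paragraph and write everything into one WAV file."""
    paragraphs = split_paragraphs(text)
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        for i, para in enumerate(paragraphs):
            voice = voices[i % len(voices)]
            wav.writeframes(synthesize(para, voice))
            wav.writeframes(silence(pause))
    return len(paragraphs)
```

Calling `build_dialogue(text, ["narrator_a", "narrator_b"], "out.wav")` alternates the two voices paragraph by paragraph. Writing PCM straight into one file avoids a separate concatenation step, but it only works if every chunk shares the same sample rate and format.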

[-] splendoruranium@infosec.pub 3 points 2 months ago

Thanks, that's a lot of pointers!
I'm fine with not the absolutely latest and greatest. In fact, since basically all of this is still cutting-edge I'm mainly trying to weed out the hype, the scams and the malware and at least find the somewhat well-trodden paths. It's all terribly fascinating but casually downloading month-old programs from random github repos and running cryptic docker containers with instructions that are half in Chinese and half in LLM-generated English (which still ends up too technical for me to fully understand) is kind of... stressful for me. That's just not how I usually use a computer 😅

[-] hendrik@palaver.p3x.de 2 points 2 months ago* (last edited 2 months ago)

Yes. I run into the same issues. I haven't had scams and malware yet. But I frequently read some news about the latest nice thing, and then I'm hyped and want to use it. But oftentimes it's a tech demo and it would need months of additional work to make it usable for real-world applications. And then we have some companies who just dump stuff on GitHub, including quite a few Chinese startups and such. More than once I've downloaded something and it didn't work, and it couldn't ever work, because the Python code had several errors or wasn't complete. They often upload something different from what they used for their experiments or tech demo. And then AI moves so fast, people will have moved on to the next thing before finishing the previous one. So we also end up with a lot of abandoned projects which never got to the state where they'd be useful. It's really hard in the field of AI.

(And a bit unrelated, but I'm a bit disappointed that we don't have good TTS built right into our Linux desktop. Like Kokoro, just multilingual and with an open training set, plus the right toolkit to tie it into the desktop, browser, commandline and arbitrary tools building on top of it.)

[-] Smokeydope@lemmy.world 3 points 2 months ago

Nice post Hendrik thanks for sharing your knowledge and helping people out :)

[-] hedgehog@ttrpg.network 6 points 2 months ago

Is your goal to create things that can be published or used in a project, or to create audiobooks for yourself to listen to?

For voiceovers for text, I use Kokoro FastAPI, which has a web frontend. The frontend is only compatible with Chromium browsers on desktop or Android, which sucks as my daily drivers are Firefox and an iPhone (there are workarounds in the thread), but it supports voice mixing, speed changes, etc. It also has an issue where it keeps the models (about 3GB) in memory; I keep the CPU version loaded normally and swap to the GPU version if I need it to be faster. If you want something similar for Bark, check out Bark-GUI.
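If you'd rather script it than use the web frontend, a rough sketch of calling the server from Python could look like this. Everything specific here is an assumption to check against your own install: the port (8880), the OpenAI-style `/v1/audio/speech` path, the `kokoro` model name, the `af_bella` / `af_sky` voice IDs, and the `+` syntax for mixing voices.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8880"  # assumption: default local port of the server


def build_speech_request(text, voice="af_bella", speed=1.0, base_url=BASE_URL):
    """Build (but do not send) a POST request for an OpenAI-style speech endpoint."""
    payload = {
        "model": "kokoro",        # assumed model name
        "input": text,
        "voice": voice,           # assumed: "a+b" mixes two voices
        "speed": speed,
        "response_format": "mp3",
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path, **kwargs):
    """Send the request to a running server and save the returned audio bytes."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

With the server running, `synthesize_to_file("Hello world", "hello.mp3", voice="af_bella+af_sky", speed=1.1)` would write one clip. Keeping request building separate from sending makes it easy to inspect the payload before anything hits the network.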

I’ve also dabbled a bit in some TTS features that have Comfy nodes, though at this point mostly just in terms of getting them set up. For my purposes thus far Kokoro has been fine (and I prefer the FastAPI project over the Comfy nodes for most of my uses), but I’ve found nodes for Kokoro, Dia, F5 TTS, Orpheus, and Zonos.

Autiobooks and audiblez both look promising. A few weeks ago, I used the Kokoro FastAPI web frontend to create an audiobook for an ebook I worked on that used entirely self-hosted AI generation for the outlining and prose. Audiblez, which I found out about like two days after that, looks like it would have simplified that process substantially. Still, I’d personally like something more like an audiobook studio, where I can more easily swap voices back and forth, add emotions, play with speed on a more granular level, etc. I’m thinking about building something that contains that at some point myself, but it’ll be a minute - hopefully someone else will beat me there.

I posted a comment here a few weeks back on a similar topic. I’ve since used OpenReader-WebUI and like it, though that’s not for producing audiobooks, but for a read-along experience. Reproducing the comment below in case it’s helpful for you:

If you want to generate audiobooks using your own / a hosted TTS server, check out one of these options:

OpenReader-WebUI - this has built-in read along capability and can be deployed as a PWA that can allow you to download the audiobooks to your phone so you can use them offline
p0n1/epub to audiobook
ebook2audiobook

If you don’t have a decent GPU, Kokoro is a great option as it’s fast enough to run on CPU and still sounds very good. If you’re going to use Kokoro, Audiblez (posted by another commenter) looks like it makes that more of an all-in-one option.

If you want something that you can use without an upfront building of the audiobook, of the above options, only OpenReader-WebUI supports that. RealtimeTTS is a library that handles that, but I don’t know if there are already any apps out there that integrate it.

If you have the audiobook generation handled and just want to be able to follow along with text / switch between text and audio, check out https://storyteller-platform.gitlab.io/storyteller/

[-] Smokeydope@lemmy.world 2 points 2 months ago

I once got KoboldCpp working with their collection of TTS model + WavTokenizer system. Here's the wiki page on it.

It may not be as natural as a commercial voice model, but it may be enough to whet your appetite in the event that other solutions feel overwhelmingly complicated.

this post was submitted on 13 Jun 2025
