For the podcast automation feature that I’m planning for a future version of poketto.me, I’ve been experimenting with various text-to-speech solutions. The easiest and highest-quality approach would have been the ElevenLabs API. However, considering the “throwaway” nature of these audio files – most of which would only be listened to once by one person – and the cost structure that this would introduce, I desperately need a cheaper approach.
The Python library 🐸 CoquiTTS is pretty awesome: there are many different models to choose from, ranging from super-low-latency to high-quality (including voice cloning). poketto.me users could therefore choose from many different voices, and from a commercial perspective, I could set different price points for different levels of quality and latency. However, all of these models require significant computing power to run.
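As a sketch of how those quality/price tiers could map onto Coqui models: the tier names and the helper below are my own assumptions, but the model identifiers are real Coqui model names.

```python
# Hypothetical mapping from a user-facing quality tier to a Coqui TTS model.
# The tier names are assumptions; the model IDs are real Coqui model names.
QUALITY_TIERS = {
    "fast": "tts_models/en/ljspeech/speedy-speech",              # low latency
    "standard": "tts_models/en/ljspeech/tacotron2-DDC",          # balanced
    "premium": "tts_models/multilingual/multi-dataset/xtts_v2",  # voice cloning
}

def model_for_tier(tier: str) -> str:
    """Resolve a pricing tier to a Coqui model name (sketch)."""
    try:
        return QUALITY_TIERS[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}")
```

The resolved name would then be handed to Coqui's `TTS.api.TTS(model_name=...)` to synthesize audio, e.g. via `tts_to_file(...)` (omitted here, since loading a model downloads large weights).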
Being very naive about this initially, I tried to run CoquiTTS directly in a Cloud Run instance. After that (predictably) failed (more on this tomorrow!), I reworked the architecture so that these workloads run in a dedicated preemptible VM.
However, integrating this with the rest of my cloud infrastructure wasn’t easy:
👷 I set up a separate Cloud Build pipeline for the text-to-speech service.
⛅️ Instead of deploying to Cloud Run, this build pushes the container image to my Artifact Registry.
📄A separate deployment script creates the text-to-speech VM.
🐳The VM's startup script installs Docker, then pulls and runs the image from the Artifact Registry.
🚀 Because the VM is preemptible for cost reasons, it can be stopped at any time. It therefore needs a static IP address, and the actual 'main' poketto.me backend needs to be aware of this and kickstart the VM whenever it isn't running.
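The VM startup script from the list above could look roughly like this. Treat it as a config-fragment sketch: the region, project, repository, and image names are all placeholders, and it assumes a Debian-based VM image with `gcloud` preinstalled.

```shell
#!/bin/bash
# Hypothetical Compute Engine startup script (all names are placeholders).
set -euo pipefail

# Install Docker if it isn't there yet (Debian-based image assumed).
if ! command -v docker >/dev/null 2>&1; then
  apt-get update -y
  apt-get install -y docker.io
fi

# Let Docker authenticate against Artifact Registry via the VM's service account.
gcloud auth configure-docker europe-west1-docker.pkg.dev --quiet

# Pull and run the text-to-speech image.
docker pull europe-west1-docker.pkg.dev/my-project/tts-repo/tts-service:latest
docker run -d --restart=always -p 8080:8080 \
  europe-west1-docker.pkg.dev/my-project/tts-repo/tts-service:latest
```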
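The backend's "kickstart the VM if it's not running" check can be sketched as a small helper. The function name and wiring are my own assumptions; the client is injected so it can be faked in tests, and its `get`/`start` calls mirror the methods of `google.cloud.compute_v1.InstancesClient`.

```python
def ensure_tts_vm_running(instances_client, project: str, zone: str, name: str) -> bool:
    """Start the preemptible TTS VM if it isn't already running.

    `instances_client` is expected to behave like a
    google.cloud.compute_v1.InstancesClient (hypothetical wiring).
    Returns True if a start was issued, False if the VM was already up.
    """
    vm = instances_client.get(project=project, zone=zone, instance=name)
    if vm.status == "RUNNING":
        return False
    # Preemptible VMs get stopped regularly, so this is the common path.
    instances_client.start(project=project, zone=zone, instance=name)
    return True
```

Since the static IP survives restarts, the backend can fire this check and retry its TTS request against the same address once the VM is back up.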
Conclusion? TTS in the cloud is harder than you'd think!