Generally there are not LLMs that do this, but you start building up a workflow. You speak, one service reads in the audio and translates it to text. Then you feed that into an LLM, it responds in text, and you have another service translate that into audio.
Home Assistant is the easiest way to get them all put together.
https://www.home-assistant.io/integrations/assist_pipeline
Edit agree with others below. Use the apps that are made for it.
- Whisper for STT
- Any hosted LLM can work, text-generation-webui or tabbyapi
- I use xttsv2 for TTS