See: https://en.m.wikipedia.org/wiki/Secure_voice
It's all about bandwidth. At the low end, a cellular voice channel carries only about 4.7 kilobits per second.
On top of that, the cell phone network and the phones themselves work very hard to compress that data stream, and the compression is tuned for human voices only.
So if you use acoustic coupling to send data as audio over the cellular network, you're not going to get 4.7 kilobits per second; you're going to get less, because anything that isn't speech survives the voice compression poorly.
Encrypting your voice stream is going to take some bandwidth too. Let's say there's a 10% overhead: that leaves about 4 kilobits per second (4.7 × 0.9 ≈ 4.2) of total voice bandwidth after encryption, and that's the best case.
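To make the arithmetic above concrete, here is a small sketch of the bandwidth budget. The 4.7 kbit/s figure and the 10% overhead are the rough numbers from this answer, not measurements, and the function name is just for illustration.

```python
def usable_bitrate(channel_bps: float, overhead_fraction: float) -> float:
    """Return the bitrate left after removing a fractional overhead."""
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return channel_bps * (1.0 - overhead_fraction)


CHANNEL_BPS = 4700.0         # rough low-end cellular voice rate from the text
ENCRYPTION_OVERHEAD = 0.10   # assumed overhead, a guess rather than a measurement

payload_bps = usable_bitrate(CHANNEL_BPS, ENCRYPTION_OVERHEAD)
print(f"Best-case payload: {payload_bps:.0f} bit/s")
```

Whatever the acoustic path and the voice codec lose comes out of this figure as well, so treat it as an upper bound.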
http://www.whence.com/minimodem/ is a neat program that does audio encoding and decoding in software (it's an FSK modem), so you could run a virtual modem from your desktop or phone.
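To give a feel for what a software audio modem does, here is a minimal FSK sketch in plain Python. It is not minimodem's code, and the sample rate, baud rate, and tone frequencies are assumptions picked so that each bit holds a whole number of tone cycles. It has no framing, no timing recovery, and no noise handling, all of which a real link over a phone call would need.

```python
import math

SAMPLE_RATE = 8000    # Hz, assumed telephone-style sample rate
BAUD = 100            # bits per second, assumed (deliberately slow)
MARK_HZ = 1200        # tone for a 1 bit, assumed
SPACE_HZ = 2200       # tone for a 0 bit, assumed
SAMPLES_PER_BIT = SAMPLE_RATE // BAUD


def bytes_to_bits(data: bytes) -> list[int]:
    """Unpack bytes into bits, least significant bit first."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]


def bits_to_bytes(bits: list[int]) -> bytes:
    """Pack bits (LSB first) back into bytes, dropping any partial byte."""
    out = bytearray()
    for i in range(0, len(bits) - len(bits) % 8, 8):
        out.append(sum(bit << j for j, bit in enumerate(bits[i:i + 8])))
    return bytes(out)


def modulate(bits: list[int]) -> list[float]:
    """Turn bits into audio samples, one tone burst per bit."""
    samples = []
    phase = 0.0  # carried across bits so the waveform has no jumps
    for bit in bits:
        step = 2 * math.pi * (MARK_HZ if bit else SPACE_HZ) / SAMPLE_RATE
        for _ in range(SAMPLES_PER_BIT):
            samples.append(math.sin(phase))
            phase += step
    return samples


def tone_energy(chunk: list[float], freq: float) -> float:
    """Correlate a chunk against one tone (in-phase and quadrature)."""
    w = 2 * math.pi * freq / SAMPLE_RATE
    i = sum(s * math.cos(w * n) for n, s in enumerate(chunk))
    q = sum(s * math.sin(w * n) for n, s in enumerate(chunk))
    return i * i + q * q


def demodulate(samples: list[float]) -> list[int]:
    """Decide each bit by comparing energy at the mark and space tones."""
    bits = []
    for start in range(0, len(samples) - SAMPLES_PER_BIT + 1, SAMPLES_PER_BIT):
        chunk = samples[start:start + SAMPLES_PER_BIT]
        is_mark = tone_energy(chunk, MARK_HZ) > tone_energy(chunk, SPACE_HZ)
        bits.append(1 if is_mark else 0)
    return bits


message = b"hello"
audio = modulate(bytes_to_bits(message))
recovered = bits_to_bytes(demodulate(audio))
print(recovered == message)
```

This round trip only proves the idea on a clean signal. Pushed through a cellular voice codec, the tones get distorted, which is why the usable rate ends up well below the nominal channel rate.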
Here is a demonstration of the general concept: https://www.youtube.com/watch?v=uQqWHLZjOjA
Background on why cellular voice quality is so limited: https://spectrum.ieee.org/why-mobile-voice-quality-still-stinksand-how-to-fix-it
Here are some secure voice modulation demonstrations that you could use as a starting point: https://www.youtube.com/watch?v=BLKHf40K0Wk
This program does exactly what you want: it implements secure voice in software. You would just have to transmit its audio over your cell phone. It's possible this could be built out for Android for your specific use case, but right now it's built for general radio transmission. So you could build two of these, one for each party, and connect them over an ordinary audio call on the cell phone. You just might not get as much bandwidth or call quality as you want, and you'd have to keep reducing the settings until it worked.
https://github.com/aarmono/crypto_transceiver_instructions
One big obstacle is that you're going to have to do a physical key exchange between your endpoints. If you use the internet for that key exchange, you might as well use encrypted VoIP.