Testing voice apps when you cannot be the microphone
We built a voice roleplay app: you pitch, an AI buyer pushes back, you get a score at the end. Useful for practice. Hard to test.
You cannot click "say hello." Every release risks breaking the listen → think → speak loop. Manual QA means a human talking to their laptop on repeat.
We tried automating the conversation itself—two agents, no person in the loop—and learned where that works and where it lies to you.
The setup that looked good (but was not)
First version:
- Agent A generates a sales pitch
- TTS turns it into audio
- Chromium plays that audio as fake microphone input
- The app transcribes and responds
- We capture the response audio and loop back
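For reference, the fake-microphone step above usually comes down to three Chromium command-line switches. This is a minimal sketch, not our exact setup: the helper name `fakeMicArgs` and the file name `pitch.wav` are hypothetical, while the switches themselves are Chromium's fake media stream flags.

```typescript
// Builds Chromium launch flags that replace the real microphone with a WAV file.
// The helper name and the example path are hypothetical; the flags are Chromium's.
function fakeMicArgs(wavPath: string, loop = false): string[] {
  // Chromium loops the file by default; appending "%noloop" plays it once.
  const source = loop ? wavPath : `${wavPath}%noloop`;
  return [
    "--use-fake-ui-for-media-stream",     // auto-accept the microphone permission prompt
    "--use-fake-device-for-media-stream", // swap real capture devices for fake ones
    `--use-file-for-fake-audio-capture=${source}`,
  ];
}
```

With Playwright, the result would be passed as `chromium.launch({ args: fakeMicArgs("pitch.wav") })`. These flags only feed `getUserMedia`. They say nothing about whether the app's speech recognizer consumes that stream.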
Green CI. Pretty diagram. Not a real voice test.
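The catch: in the test build, speech recognition was a Web Speech API stub, and the stub never heard the WAV. We pushed text through a test hook instead:
// Bypasses audio entirely: hands the pitch text straight to the fake recognizer.
await page.evaluate((text) => (window as any).__fakeSR?.emit(text), pitch);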
The app passed because we fed it strings—not because audio flowed end to end.
What we changed
We rebuilt the loop to exercise the actual path:
- Real audio injection where the platform allows it
- Transcription output compared against expected meaning (not exact strings)
- Timeouts for stalled sessions
- Failure artifacts: audio files, transcripts, screenshots
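The timeout item above can be sketched as a small wrapper around any step that might hang, such as waiting for the buyer's reply. The name `withTimeout` and the error wording are assumptions for illustration, not code from our suite.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
// `label` names the step so the failure message says what stalled.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} stalled after ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

A test would wrap each turn, for example `await withTimeout(waitForReply(), 15000, "buyer reply")`, where `waitForReply` is a hypothetical helper. The wrapper does not cancel the underlying work. It only stops the test from waiting on it forever.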
The test got slower and flakier. It also started catching real regressions.
What automated voice testing is good for
- Smoke coverage on every deploy
- Regression detection when API or model behavior shifts
- Rubric checks—did the session end with a score, did required fields populate
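A rubric check of this kind can be a plain function over whatever the session returns at the end. The sketch below assumes a result shape with `score`, `summary`, and `transcript` fields. Those names are hypothetical stand-ins for your app's real fields.

```typescript
// Hypothetical shape of the end-of-session result; field names are assumptions.
interface SessionResult {
  score?: number;
  summary?: string;
  transcript?: string[];
}

// Returns a list of rubric failures; an empty list means the session passed.
function rubricFailures(result: SessionResult): string[] {
  const failures: string[] = [];
  if (typeof result.score !== "number" || isNaN(result.score)) {
    failures.push("score missing");
  }
  if (!result.summary || result.summary.trim() === "") {
    failures.push("summary empty");
  }
  if (!result.transcript || result.transcript.length === 0) {
    failures.push("transcript empty");
  }
  return failures;
}
```

Returning every failure at once, rather than throwing on the first, keeps the CI log useful when a single regression blanks several fields.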
What it is bad for
- Judging whether the buyer "feels human"
- Accent, pacing, emotion
- Legal or compliance nuance
Keep humans in the loop for product feel. Use agents for plumbing.
Practical takeaway
If your test bypasses the hard part—the microphone, the codec, the latency—you are not testing a voice app. You are testing a chat app wearing headphones.
Name that honestly in your suite. We split tests into:
- Unit / API — fast, deterministic
- Audio path — slower, runs on main branch
- Human spot check — before big releases
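One way to keep that split honest is to make the tier selection explicit in code rather than leave it as tribal knowledge. This is a sketch under assumptions: the tier names and the `RunContext` fields are invented here, and the human tier is only a reminder because nothing about it is automated.

```typescript
type Tier = "unit" | "audio" | "human";

// Hypothetical description of the current CI run.
interface RunContext {
  branch: string;
  isBigRelease: boolean;
}

// Decides which tiers apply to a given run.
function tiersFor(ctx: RunContext): Tier[] {
  const tiers: Tier[] = ["unit"];                  // fast and deterministic: always run
  if (ctx.branch === "main") tiers.push("audio");  // slower audio path: main branch only
  if (ctx.isBigRelease) tiers.push("human");       // not automated: flags a spot check
  return tiers;
}
```

The returned list could drive a tag filter in the test runner, so a feature branch never quietly reports "voice tests passed" when only the unit tier ran.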
Where this goes next
We are experimenting with scoring tolerances (semantic match vs exact transcript) and recording failing sessions for replay. The goal is not infinite automation—it is confidence without burning an hour per deploy.
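To make the tolerance idea concrete, here is a rough sketch that scores a transcript by word overlap (Jaccard similarity) instead of exact equality. Word overlap is a crude stand-in for semantic match, not the approach we have settled on. The function names and the 0.6 threshold are assumptions to tune against your own transcripts.

```typescript
// Lowercases, strips punctuation, and returns the set of distinct words.
function tokens(text: string): Set<string> {
  const words = text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter(Boolean);
  return new Set(words);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, a value between 0 and 1.
function similarity(expected: string, actual: string): number {
  const a = tokens(expected);
  const b = tokens(actual);
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((word) => { if (b.has(word)) shared += 1; });
  return shared / (a.size + b.size - shared);
}

// True when the transcript is within tolerance of the expected text.
function closeEnough(expected: string, actual: string, threshold = 0.6): boolean {
  return similarity(expected, actual) >= threshold;
}
```

A transcript that swaps "a month" for "per month" still passes, while an unrelated sentence does not. Paraphrases with little shared vocabulary will fail this check, which is the point where an embedding-based comparison would be worth trying.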