Testing voice apps when you cannot be the microphone

We built a voice roleplay app: you pitch, an AI buyer pushes back, you get a score at the end. Useful for practice. Hard to test.

You cannot click "say hello." Every release risks breaking the listen → think → speak loop. Manual QA means a human talking to their laptop on repeat.

We tried automating the conversation itself—two agents, no person in the loop—and learned where that works and where it lies to you.

The setup that looked good (but was not)

First version:

Agent A generates a sales pitch
TTS turns it into audio
Chromium plays that audio as fake microphone input
The app transcribes and responds
We capture the response audio and loop back
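
The fake-microphone step relies on Chromium launch flags that replace the real capture device with a WAV file. Here is a minimal sketch of how those flags might be assembled. The helper name fakeMicArgs and the file path are hypothetical, and the returned array is assumed to be passed as the args option of whatever launches the browser (for example Playwright's chromium.launch). These flags only feed the browser's media-capture device; they do not guarantee that a stubbed recognizer ever receives the audio.

```typescript
// Hypothetical helper: builds Chromium launch flags for fake microphone input.
function fakeMicArgs(wavPath: string, loop = false): string[] {
  if (!wavPath.toLowerCase().endsWith(".wav")) {
    throw new Error(`expected a .wav file, got: ${wavPath}`);
  }
  return [
    // Auto-accept the microphone permission prompt.
    "--use-fake-ui-for-media-stream",
    // Replace real capture devices with Chromium's fake ones.
    "--use-fake-device-for-media-stream",
    // Chromium loops the file by default; the "%noloop" suffix plays it once.
    `--use-file-for-fake-audio-capture=${wavPath}${loop ? "" : "%noloop"}`,
  ];
}
```

Playing the file once (the default in this sketch) keeps the app from hearing the same pitch twice if the session runs longer than the clip.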

Green CI. Pretty diagram. Not a real voice test.

Our stubbed Web Speech API recognizer never heard the WAV. The test pushed text into it through a test hook instead:

// Test hook: hands the pitch text straight to the stubbed recognizer, so no audio is involved
await page.evaluate((text) => (window as any).__fakeSR?.emit(text), pitch)

The app passed because we fed it strings—not because audio flowed end to end.

What we changed

We rebuilt the loop to exercise the actual path:

Real audio injection where the platform allows it
Transcription output compared against expected meaning (not exact strings)
Timeouts for stalled sessions
Failure artifacts: audio files, transcripts, screenshots
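
The stalled-session timeout can be sketched as a small wrapper around any awaited step. This is an illustration, not our exact code: the helper name withTimeout and the label text are hypothetical.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} stalled after ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer); // settled in time, so cancel the stall timer
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

A test step would wrap each wait in it, for instance awaiting withTimeout(waitForBuyerReply(), 15000, "buyer reply"), where waitForBuyerReply stands in for whatever promise your harness exposes. The labelled error makes it obvious in CI logs which stage of the loop stopped responding.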

The test got slower and flakier. It also started catching real regressions.

What automated voice testing is good for

Smoke coverage on every deploy
Regression detection when API or model behavior shifts
Rubric checks—did the session end with a score, did required fields populate
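
A rubric check of this kind can stay purely structural. The sketch below assumes a hypothetical SessionResult shape; the field names are illustrative and not taken from our app.

```typescript
// Hypothetical shape of a finished session; field names are illustrative only.
interface SessionResult {
  score?: number | null;
  fields: Record<string, string | null | undefined>;
}

// Returns human-readable rubric failures; an empty list means the check passed.
function rubricFailures(result: SessionResult, requiredFields: string[]): string[] {
  const failures: string[] = [];
  if (typeof result.score !== "number" || Number.isNaN(result.score)) {
    failures.push("session ended without a numeric score");
  }
  for (const name of requiredFields) {
    const value = result.fields[name];
    // Treat missing, null, and whitespace-only values as unpopulated.
    if (value == null || value.trim() === "") {
      failures.push(`required field is empty: ${name}`);
    }
  }
  return failures;
}
```

Returning every failure at once, rather than throwing on the first, gives a fuller picture when a model change breaks several fields together.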

What it is bad for

Judging whether the buyer "feels human"
Accent, pacing, emotion
Legal or compliance nuance

Keep humans in the loop for product feel. Use agents for plumbing.

Practical takeaway

If your test bypasses the hard part—the microphone, the codec, the latency—you are not testing a voice app. You are testing a chat app wearing headphones.

Name that honestly in your suite. We split tests into:

Unit / API — fast, deterministic
Audio path — slower, runs on main branch
Human spot check — before big releases
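
One way to wire the first two tiers into CI is to tag the slow tests and filter by branch. The sketch below rests on assumptions: that audio-path tests carry an "@audio" tag in their titles, and that the suite runs under Playwright Test, whose CLI accepts --grep-invert. The helper name is hypothetical.

```typescript
// Assumption: audio-path tests have "@audio" in their titles and the suite
// runs under Playwright Test. The human spot check stays a release checklist
// item and is not selected here.
function playwrightArgsFor(branch: string): string[] {
  const base = ["playwright", "test"];
  // Off main, skip the slow audio tier; on main, run every automated test.
  return branch === "main" ? base : [...base, "--grep-invert", "@audio"];
}
```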

Where this goes next

We are experimenting with scoring tolerances (semantic match vs exact transcript) and recording failing sessions for replay. The goal is not infinite automation—it is confidence without burning an hour per deploy.
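
To make the tolerance idea concrete, here is a minimal sketch that swaps in token overlap (Jaccard similarity) as a crude stand-in for semantic matching. It is not the scoring we have settled on; the function names and the 0.6 threshold are hypothetical, and a real semantic comparison would likely need embeddings or a model-based judge.

```typescript
// Lowercase, replace punctuation with spaces, and collect unique word tokens.
// ASCII-only, which is adequate for English transcripts in this sketch.
function tokens(text: string): Set<string> {
  const words = text
    .toLowerCase()
    .replace(/[^a-z0-9\s']/g, " ")
    .split(/\s+/)
    .filter(Boolean);
  return new Set(words);
}

// Jaccard similarity of the two token sets: shared / union, in [0, 1].
function tokenOverlap(expected: string, actual: string): number {
  const a = tokens(expected);
  const b = tokens(actual);
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((t) => {
    if (b.has(t)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Hypothetical tolerance; the threshold would need tuning on real transcripts.
function transcriptMatches(expected: string, actual: string, minOverlap = 0.6): boolean {
  return tokenOverlap(expected, actual) >= minOverlap;
}
```

With this, a transcript that renders "forty" as "40" still passes where an exact string comparison would fail, while an unrelated reply scores near zero. Token overlap ignores word order and synonyms, so treat it as a cheap first filter.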

Building something with voice, forms, or AI workflows on the web? DroSeo ships production systems—testing included.
