Testing Voice Apps When You Cannot Be the Microphone

April 19, 2026
2 min read

About this article

How we used two AI agents in a Playwright loop to stress-test a voice roleplay app—and why the first passing test did not count.

Category: Engineering

We built a voice roleplay app: you pitch, an AI buyer pushes back, you get a score at the end. Useful for practice. Hard to test.

You cannot click "say hello." Every release risks breaking the listen → think → speak loop. Manual QA means a human talking to their laptop on repeat.

We tried automating the conversation itself—two agents, no person in the loop—and learned where that works and where it lies to you.

The setup that looked good (but was not)

First version:

  1. Agent A generates a sales pitch
  2. TTS turns it into audio
  3. Chromium plays that audio as fake microphone input
  4. The app transcribes and responds
  5. We capture the response audio and loop back
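Step 3 can be sketched with Chromium's fake media stream flags. The flag names are Chromium's own; the `fakeMicArgs` helper and its option names are hypothetical. These flags feed `getUserMedia` capture, so on their own they do not prove that the app's speech recognizer consumed the audio.

```typescript
// Sketch: build Chromium launch flags that present a WAV file as the microphone.
// The helper and its option names are hypothetical; only the flags are Chromium's.
interface FakeMicOptions {
  wavPath: string; // absolute path to the WAV file produced by the TTS step
  loop?: boolean;  // Chromium repeats the file unless told otherwise
}

function fakeMicArgs({ wavPath, loop = false }: FakeMicOptions): string[] {
  // Appending "%noloop" asks Chromium to play the file once instead of repeating it.
  const source = loop ? wavPath : `${wavPath}%noloop`;
  return [
    "--use-fake-device-for-media-stream", // swap real capture devices for fake ones
    "--use-fake-ui-for-media-stream",     // auto-accept the microphone permission prompt
    `--use-file-for-fake-audio-capture=${source}`,
  ];
}
```

With Playwright, the result could be passed as `chromium.launch({ args: fakeMicArgs({ wavPath: "/tmp/pitch.wav" }) })`, where the path is a placeholder.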

Green CI. Pretty diagram. Not a real voice test.

The Web Speech API stub never heard the WAV. We pushed text through a test hook instead:

// Test hook: hands the recognizer stub a string, so no audio is ever decoded.
await page.evaluate((text) => (window as any).__fakeSR?.emit(text), pitch);

The app passed because we fed it strings—not because audio flowed end to end.

What we changed

We rebuilt the loop to exercise the actual path:

  • Real audio injection where the platform allows it
  • Transcription output compared against expected meaning (not exact strings)
  • Timeouts for stalled sessions
  • Failure artifacts: audio files, transcripts, screenshots
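The timeout item in the list above could look something like the following. This is a minimal sketch, not our production helper; `withTimeout` and its `label` parameter are illustrative names.

```typescript
// Sketch: fail a turn that stalls instead of letting the whole run hang.
// `withTimeout` and `label` are hypothetical names used for illustration.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} stalled after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer so the process can exit.
  return Promise.race([work, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

A caller can catch the rejection and save failure artifacts before rethrowing, for example a screenshot via Playwright's `page.screenshot({ path })`.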

The test got slower and flakier. It also started catching real regressions.

What automated voice testing is good for

  • Smoke coverage on every deploy
  • Regression detection when API or model behavior shifts
  • Rubric checks—did the session end with a score, did required fields populate
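A rubric check of this kind can be a plain function over the session result. The `SessionResult` shape below is an assumption for illustration, not the app's real schema.

```typescript
// Sketch of a rubric check. SessionResult is a hypothetical shape.
interface SessionResult {
  ended: boolean;
  score?: number | null;
  fields: Record<string, string | undefined>;
}

// Returns a list of human-readable failures; an empty list means the rubric passed.
function rubricFailures(result: SessionResult, requiredFields: string[]): string[] {
  const failures: string[] = [];
  if (!result.ended) failures.push("session did not end");
  if (typeof result.score !== "number" || Number.isNaN(result.score)) {
    failures.push("no numeric score");
  }
  for (const name of requiredFields) {
    const value = result.fields[name];
    if (value === undefined || value.trim() === "") failures.push(`missing field: ${name}`);
  }
  return failures;
}
```

Returning every failure at once, rather than throwing on the first, keeps a single slow voice run as informative as possible.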

What it is bad for

  • Judging whether the buyer "feels human"
  • Accent, pacing, emotion
  • Legal or compliance nuance

Keep humans in the loop for product feel. Use agents for plumbing.

Practical takeaway

If your test bypasses the hard part—the microphone, the codec, the latency—you are not testing a voice app. You are testing a chat app wearing headphones.

Name that honestly in your suite. We split tests into:

  • Unit / API — fast, deterministic
  • Audio path — slower, runs on main branch
  • Human spot check — before big releases
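One way to encode that split in CI is a small selector for the two automated tiers. The tier names mirror the list above; the `Trigger` shape is hypothetical, and the human spot check stays outside automation.

```typescript
// Sketch: decide which automated tiers run for a given CI trigger.
// The Trigger shape is a hypothetical example.
type Tier = "unit-api" | "audio-path";

interface Trigger {
  branch: string;
}

function tiersFor(trigger: Trigger): Tier[] {
  // Fast, deterministic tests run on every branch.
  const tiers: Tier[] = ["unit-api"];
  // The slower audio path runs only on main.
  if (trigger.branch === "main") tiers.push("audio-path");
  return tiers;
}
```

In a Playwright suite, the same split could be done by putting a tag such as `@audio` in test titles and filtering with `npx playwright test --grep @audio`.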

Where this goes next

We are experimenting with scoring tolerances (semantic match vs exact transcript) and recording failing sessions for replay. The goal is not infinite automation—it is confidence without burning an hour per deploy.
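To make the tolerance idea concrete, here is a minimal sketch that uses token overlap (Jaccard similarity) as a crude stand-in for semantic matching. It is lexical, not truly semantic, and it is not the scoring we run; the function names and the 0.6 threshold are arbitrary examples.

```typescript
// Sketch: tolerant transcript comparison via token overlap (Jaccard similarity).
// A crude proxy for semantic matching; the threshold is an arbitrary example.
function tokens(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9']+/g) ?? []);
}

function similarity(expected: string, actual: string): number {
  const a = tokens(expected);
  const b = tokens(actual);
  if (a.size === 0 && b.size === 0) return 1; // two empty transcripts count as equal
  let shared = 0;
  a.forEach((t) => {
    if (b.has(t)) shared++;
  });
  // Jaccard: size of the intersection divided by size of the union.
  return shared / (a.size + b.size - shared);
}

function matchesLoosely(expected: string, actual: string, threshold = 0.6): boolean {
  return similarity(expected, actual) >= threshold;
}
```

Note the limits: "fifty" and "50" count as different tokens here, so a transcript that is correct in meaning can still lose points. An embedding-based comparison would be closer to real semantic matching, at the cost of another model in the loop.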

