ASR Fairness Evaluation Pipeline

Built an end-to-end pipeline to evaluate speech recognition providers across languages, accents, and demographics for law enforcement use cases. Published as FairLENS.

Speech Recognition · Fairness · Evaluation · Python

Overview

Axon offers automatic speech recognition services for law enforcement, used to transcribe interviews, incident reports, and other recordings. Customers span North America and Europe, which means the ASR systems need to handle multiple languages, dialects, and accents. The end users, members of the public interacting with law enforcement, are demographically diverse in ways that standard benchmarks do not capture.

Evaluating a new ASR provider took weeks. The process was largely manual: someone would integrate the provider's API, run test audio through it, compare transcripts against ground truth, compute error rates, and compile the results. Every time a new provider entered the picture or an existing one released an update, the cycle started over. With tens of providers to evaluate across five or more languages, this did not scale.

This was my first project at Axon. I built an automated evaluation pipeline and a fairness evaluation framework that became the basis for a published paper.

Result

Evaluation time for a new ASR provider went from weeks to minutes.

What I Built

Evaluation Pipeline

The pipeline evaluates any ASR provider end to end:

  • Integrate the provider's API.
  • Run Axon's proprietary evaluation dataset (a curated set of audio recordings with ground truth transcripts) through it.
  • Compute Word Error Rate (WER) and other metrics on the output.
  • Generate standardized reports and push results to a database.
  • Rank providers in a leaderboard.
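WER is the word-level edit distance between a hypothesis transcript and the reference, normalized by reference length. A minimal sketch of the metric, purely illustrative and not the pipeline's actual implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words, single-row dynamic programming.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]          # d[i-1][j-1] for the inner loop
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]       # d[i-1][j], needed as next prev
            d[j] = min(d[j] + 1,         # deletion
                       d[j - 1] + 1,     # insertion
                       prev + (r != h))  # substitution (free on match)
            prev = cur
    return d[-1] / max(len(ref), 1)
```

In production one would typically reach for an established implementation (e.g. the `jiwer` package) rather than hand-rolling the distance, and normalize text (casing, punctuation) before scoring.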

The proprietary dataset was critical. Public ASR benchmarks are useful for general comparisons, but they do not reflect the acoustic conditions or speaker demographics that Axon's products encounter in practice. Law enforcement recordings include background noise, cross-talk, varying microphone quality, and speakers from a wide range of demographic backgrounds. Axon built a dataset that captured these conditions, and the pipeline used it as the evaluation standard.

The speed improvement came from automation, but the more important outcome was consistency. Every provider was evaluated against the same dataset, with the same metrics, in the same pipeline. Product managers could make provider selection decisions based on standardized data instead of ad hoc test results.

Fairness Evaluation

I built a fairness evaluation component on top of the accuracy pipeline. It sliced results by four dimensions: accent, gender, age, and dialect. For each provider and language, you could see not just the overall WER but how that WER varied across demographic groups. This made disparities visible and quantifiable instead of anecdotal.
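The slicing step can be sketched as follows. The record fields (`errors`, `ref_words`, `accent`) and the max-min gap are illustrative assumptions for this sketch, not the assessment methods from the paper:

```python
from collections import defaultdict

def group_wer(records, dimension):
    """Pool per-utterance results into per-group WER along one dimension.

    records: dicts with 'errors' (word edit distance), 'ref_words'
    (reference length), and demographic fields such as 'accent'.
    Pooled WER = total errors / total reference words per group,
    so long utterances weigh more than in a naive mean of ratios.
    """
    errs, words = defaultdict(int), defaultdict(int)
    for rec in records:
        g = rec[dimension]
        errs[g] += rec["errors"]
        words[g] += rec["ref_words"]
    return {g: errs[g] / words[g] for g in errs}

def wer_gap(per_group):
    """Max-min disparity: a simple single-number fairness gap."""
    return max(per_group.values()) - min(per_group.values())
```

Repeating this over each dimension (accent, gender, age, dialect) for every provider and language yields the grid of per-group WERs that the dashboard surfaces.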

I also built a dashboard that surfaced these fairness results in a format accessible to executives and product managers. The people making provider selection and deployment decisions were not ML engineers. They needed to understand the fairness implications of their choices without digging through raw metrics tables.

FairLENS

This work grew into a broader research effort that we published as FairLENS: Assessing Fairness in Law Enforcement Speech Recognition (Wang, Cusick, Laila, Puech, Ji, Hu, Wilson, Spitzer-Williams, Wheeler, Ibrahim). The paper evaluated 12 ASR systems, one open-source and 11 commercial, using a systematic framework with new assessment methods for comparing fairness disparities across models.

One finding from the research: acoustic domain shifts can introduce new biases. A model that performs equitably in one acoustic environment may show significant disparities in another. Fairness is not a static property of a model. It needs to be re-evaluated when the deployment context changes, when you move to a new locale, when the acoustic conditions differ, or when the speaker population shifts.

Results

  • Cut ASR provider evaluation time from weeks to minutes.
  • Evaluated 12 ASR systems across multiple languages and demographic dimensions.
  • Fairness dashboard gave executives and PMs direct visibility into demographic performance disparities for provider selection.
  • Published as a peer-reviewed paper: FairLENS (arXiv:2405.13166).