Project Euphonia: advancing inclusive speech recognition through expanded data collection and evaluation

Centre for Digital Language Inclusion
May 12, 2026

Abstract

Speech recognition models, predominantly trained on standard speech, often exhibit lower accuracy for individuals with accents, dialects, or speech impairments. This disparity is particularly pronounced for economically or socially marginalized communities, including those with disabilities or diverse linguistic backgrounds. Project Euphonia, a Google initiative dedicated to improving Automatic Speech Recognition (ASR) of disordered speech, originally launched for English. It is now expanding its data collection and evaluation efforts to additional languages, including Spanish, Japanese, French, and Hindi, in a continued effort to enhance inclusivity. This paper presents an overview of how the processes and methods used for English data collection were extended to more languages and locales, reports progress on the collected data, and details our model evaluation process, which focuses on meaning preservation assessed with Generative AI.

1 Introduction

Traditional ASR models, trained primarily on standard speech patterns, often fail to accurately interpret the diverse spectrum of human voices. This creates a communication barrier that can perpetuate social inequities. The consequences are especially serious in healthcare settings, where misinterpretations can lead to misdiagnosis, incorrect treatment, and even patient harm (Topaz et al., 2018). Project Euphonia, a Google Research initiative, is tackling this challenge by building the world's largest dataset of disordered speech. Using a proprietary web-based audio tool, Project Euphonia gathers speech samples from consenting participants, who record prompted phrases. As of February 2025, this dataset includes over 1.5 million utterances from approximately 3,000 speakers.

This paper outlines our approach to collecting and curating a high-quality disordered speech dataset, with the goal of supporting improved ASR accuracy across languages and diverse speech patterns. Building on the success of our English-language dataset, we have expanded our efforts globally, capturing the rich diversity of speech within languages like Spanish, French, Japanese, and Hindi (Jiang, 2022). This global expansion allows us to create a high-quality, multilingual corpus for training more inclusive and accurate ASR models. Notably, prior research demonstrates that personalized ASR models trained on Project Euphonia data can outperform human transcribers for individuals with disordered speech, highlighting the transformative potential of this approach (Green et al., 2021).

This work underscores the critical importance of incorporating disordered speech data into ASR model development. By building more inclusive datasets, we enable the creation of models that better serve users with diverse speech patterns. This directly supports initiatives like the Speech Accessibility Project, which aims to make voice-enabled technology accessible to all users, regardless of their speech characteristics (University of Illinois at Urbana-Champaign, 2024).

While Project Euphonia's data corpus is proprietary and accessible only to Google researchers for training and fine-tuning ASR models, this paper contributes to the broader community in three significant ways. Firstly, it details important considerations for curating datasets aimed at enhancing ASR across multiple languages. Secondly, it provides open-source resources to enable researchers and developers to replicate these data curation and ASR improvement processes. Thirdly, it introduces an alternative approach to analyzing ASR model performance in non-English languages, focusing on meaning preservation as a complement to the more traditional metric of word error rate.
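
To make the contrast between word error rate and meaning preservation concrete, the following is a minimal sketch of the standard word error rate definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. It is not the evaluation code used in this work. The function name `word_error_rate`, the whitespace tokenization, and the example phrases are assumptions chosen for illustration, and whitespace tokenization in particular would not suit a language such as Japanese.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenization by whitespace is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example phrases, not drawn from the Euphonia corpus.
reference = "turn on the kitchen light"
print(word_error_rate(reference, "turn on kitchen lights"))     # 0.4
print(word_error_rate(reference, "turn off the kitchen light")) # 0.2
```

In this illustration the first transcript has the higher word error rate yet arguably keeps the speaker's intent, while the second has the lower word error rate but reverses the meaning. Cases like these are one reason a meaning-focused assessment can be a useful complement to word error rate.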