Many Google products involve speech recognition. For example, Google Assistant allows you to ask for help by voice, Gboard lets you dictate messages to your friends, and Google Meet provides auto captioning for your meetings.
Speech technologies increasingly rely on deep neural networks, a type of machine learning that helps us build more accurate and faster speech recognition models. Generally, deep neural networks need large amounts of data to work well and to improve over time. This process of improvement is called model training.
What technologies we use to train speech models
Google’s speech team uses 3 broad classes of technologies to train speech models: conventional learning, federated learning, and ephemeral learning. Depending on the task and situation, some of these are more effective than others, and in some cases we use a combination of them. This allows us to achieve the best quality possible while providing privacy by design.
Conventional learning
Conventional learning is how most of our speech models are trained.
How conventional learning works to train speech models
- With your explicit consent, audio samples are collected and stored on Google’s servers.
- A portion of these audio samples are annotated by human reviewers.
- A training algorithm learns from annotated audio data samples.
- In supervised training: Models are trained to mimic annotations from human reviewers for the same audio.
- In unsupervised training: Machine annotations are used instead of human annotations.
When trained on equal amounts of data, supervised training typically produces better speech recognition models than unsupervised training because the annotations are higher quality. On the other hand, unsupervised training can learn from far more audio samples, since machine annotations are much easier to produce than human ones.
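The supervised and unsupervised modes above can be sketched in a few lines. This is an illustrative toy, not Google's actual pipeline: `transcribe_with_existing_model` and `train` are hypothetical stand-ins, and the "transcription" is fake. The point is only the data-flow difference: supervised training pairs audio with human annotations, while unsupervised training pairs audio with machine-generated annotations (pseudo-labels), which scale to much more data.

```python
# Toy sketch (illustrative names, not a real pipeline) of the two
# training modes: supervised = human annotations, unsupervised =
# machine annotations ("pseudo-labels") from an existing model.

def transcribe_with_existing_model(audio: str) -> str:
    """Stand-in for a pretrained recognizer producing machine annotations."""
    return audio.upper()  # fake "transcription" for demonstration only

def train(samples):
    """Stand-in for a training step: just collects (audio, annotation) pairs."""
    return [(audio, annotation) for audio, annotation in samples]

# Supervised: audio paired with human transcripts -- higher quality,
# but fewer samples, because human annotation is expensive.
supervised_data = [("hey google", "hey google"), ("set a timer", "set a timer")]
supervised_model = train(supervised_data)

# Unsupervised: machine annotations are cheap, so far more audio can be used.
unlabeled_audio = ["play music", "call mom", "what time is it"]
unsupervised_data = [(a, transcribe_with_existing_model(a)) for a in unlabeled_audio]
unsupervised_model = train(unsupervised_data)
```

In practice the unsupervised pool is orders of magnitude larger than the supervised one, which is why both modes are worth combining despite the noisier machine annotations.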
How your data stays private
Federated learning
Federated learning is a privacy-preserving technique developed at Google to train AI models directly on your phone or other device. We use federated learning to train a speech model when the model runs on your device and data is available for the model to learn from.
How federated learning works to train speech models
With federated learning, we train speech models without sending your audio data to Google’s servers.
- To enable federated learning, we save your audio data on your device.
- A training algorithm learns from this data on your device.
- A new speech model is formed by aggregating the learnings from your device with those from all other participating devices.
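The steps above follow the shape of federated averaging: each device trains locally on its own data, and only model weights, never the raw audio, are combined on the server. Below is a minimal sketch under toy assumptions (a linear model with squared loss, plain Python lists); all function names are illustrative, not Google's APIs.

```python
# Minimal federated-averaging sketch (illustrative only): devices train
# locally, the server averages the resulting weights. Raw data never
# leaves the device -- only updated weights do.
from typing import List, Tuple

def local_update(global_weights: List[float],
                 device_data: List[Tuple[List[float], float]],
                 lr: float = 0.1) -> List[float]:
    """One local pass on a single device (toy linear model, squared loss)."""
    w = list(global_weights)
    for x, y in device_data:          # raw samples stay on-device
        pred = sum(wi * xi for wi, xi in zip(w, x))
        grad = [2 * (pred - y) * xi for xi in x]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def federated_round(global_weights, all_device_data):
    """One round: every device trains locally; the server averages weights."""
    local_models = [local_update(global_weights, d) for d in all_device_data]
    n = len(local_models)
    return [sum(ws) / n for ws in zip(*local_models)]

# Usage: two simulated devices, each holding its own (features, label) pairs.
devices = [
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)],
    [([1.0, 1.0], 1.0)],
]
weights = [0.0, 0.0]
for _ in range(20):
    weights = federated_round(weights, devices)
```

Only the two-element weight vector ever crosses the device boundary here; the per-device sample lists are read exclusively inside `local_update`, which is the privacy property the bullet points describe.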
How your data stays private
Learn how your voice & audio data stays private while Google Assistant improves.
Ephemeral learning
How ephemeral learning works to train speech models
- As our systems convert incoming audio samples into text, those samples are sent to short-term memory (RAM).
- While the data is in RAM, a training algorithm learns from those audio data samples in real time.
- These audio data samples are deleted from short-term memory within minutes.
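The three steps above can be sketched as a short-lived in-memory buffer that a model learns from before the samples expire. This is a toy illustration, not Google's implementation: the class names, the counting "model," and the 120-second retention window are all assumptions made for the example.

```python
# Sketch of the ephemeral-learning flow (illustrative only): samples live
# briefly in RAM, a model learns from them in real time, and they are
# deleted within minutes. Nothing is ever written to disk.
import time

RETENTION_SECONDS = 120  # assumed retention window; real limits may differ

class EphemeralBuffer:
    def __init__(self):
        self._samples = []  # (arrival_time, audio) pairs, held in RAM only

    def add(self, audio):
        self._samples.append((time.monotonic(), audio))

    def expire(self):
        """Delete any sample older than the retention window."""
        cutoff = time.monotonic() - RETENTION_SECONDS
        self._samples = [(t, a) for t, a in self._samples if t >= cutoff]

    def train_step(self, model):
        """Learn from whatever is currently in RAM, then expire old samples."""
        for _, audio in self._samples:
            model.update(audio)
        self.expire()

class CountingModel:
    """Toy model: just counts how many samples it has learned from."""
    def __init__(self):
        self.updates = 0

    def update(self, audio):
        self.updates += 1

# Usage: two samples arrive, the model learns from them while they are
# still in RAM; after RETENTION_SECONDS they would be expired on the
# next train_step, leaving no trace.
buffer = EphemeralBuffer()
model = CountingModel()
buffer.add("sample one")
buffer.add("sample two")
buffer.train_step(model)
```

The key property is that the samples exist only inside `_samples` in process memory: there is no persistence layer to leak from, which is what the bullets under "How your data stays private" formalize.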
How your data stays private
With ephemeral learning, your audio data samples are:
- Held only in short-term memory (RAM), and for no more than a few minutes.
- Never accessible to a human.
- Never stored on a server.
- Used to train models without any additional data that can identify you.
How Google will use & invest in these technologies
We’ll continue to use all 3 technologies, often in combination, for higher quality. We’re also actively working to improve both federated and ephemeral learning for speech technologies. Our goal is to make them more effective and useful in ways that preserve privacy by default.