
Gemini 3.1 Flash Live: Google’s real-time AI voice assistant

Artificial Intelligence
Nicolas
11 min read

On March 26, 2026, Google announced Gemini 3.1 Flash Live, its new AI model built for real-time voice conversations.

Two silhouettes connected by luminous audio waves and a central AI crystal, illustrating the native audio-to-audio architecture of Gemini 3.1 Flash Live

This is not a simple update: it’s a complete architectural overhaul that drops the classic transcription → reasoning → speech synthesis pipeline in favor of native audio-to-audio processing.

The result: latency cut to a third, unprecedented emotional understanding, and an immediate global rollout through Search Live in more than 200 countries.

Here’s what this concretely changes for developers, businesses, and French-speaking users.

Key takeaways:

  • Native audio-to-audio architecture: no more STT → LLM → TTS pipeline, latency drops to ~300 ms vs. 800 ms to 2 s previously
  • Score of 90.8% on ComplexFuncBench Audio: voice agents can now run multi-step workflows directly from voice input
  • Currently free in preview via Google AI Studio, with post-GA pricing estimated at $3-5/M audio input tokens
  • Two features are still missing: proactive audio and affective dialogue, both available in the previous 2.5 Flash model
  • Search Live goes global with this model: conversational voice search is now available in 90+ languages

What Google announced on March 26, 2026

Native audio-to-audio model

The March 26 announcement marks the end of an era for AI voice assistants.

Until now, all AI voice systems ran on a three-step pipeline: a speech recognition engine converts audio to text, an LLM processes that text and generates a response, then a TTS engine converts it back to speech.

This pipeline accumulates latency: Deepgram STT adds ~150 ms, ElevenLabs TTS ~75 ms, and the round trips between these fragmented services push the total to anywhere from 800 ms to 2 seconds.
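The cumulative effect is easy to see as a back-of-the-envelope budget. The STT and TTS figures below come from the article; the LLM first-token and network values are illustrative assumptions, not measurements:

```python
# Latency budget for a classic STT -> LLM -> TTS pipeline (milliseconds).
# STT and TTS figures are from the article; the LLM and network entries
# are assumed values for illustration only.
stages_ms = {
    "stt (Deepgram)": 150,
    "llm first token": 400,      # assumed
    "tts (ElevenLabs)": 75,
    "network round trips": 200,  # assumed: 2-3 hops between hosted services
}

total_ms = sum(stages_ms.values())
print(f"pipeline total: ~{total_ms} ms")  # ~825 ms, the low end of the 800 ms - 2 s range
```

A native audio-to-audio model collapses all four rows into a single model pass, which is why the floor drops so sharply.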

Gemini 3.1 Flash Live takes raw PCM audio as input and generates PCM audio as output, with no intermediate conversion to text.

It’s the difference between sending a letter (STT→LLM→TTS pipeline) and making a phone call (native audio-to-audio): the message arrives instantly, with no transcription or back-translation.

The technical specs: input in 16-bit PCM at 16 kHz, output in PCM at 24 kHz for richer voice quality.
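Those sample rates translate directly into streaming throughput. A quick sketch of the raw PCM math (standard audio arithmetic, not an API call):

```python
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int = 16,
                         channels: int = 1) -> int:
    """Raw (uncompressed) PCM throughput in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels

def chunk_size(sample_rate_hz: int, frame_ms: int,
               bits_per_sample: int = 16) -> int:
    """Bytes in one streaming chunk of frame_ms milliseconds of mono PCM."""
    return pcm_bytes_per_second(sample_rate_hz, bits_per_sample) * frame_ms // 1000

print(pcm_bytes_per_second(16_000))  # 32000 bytes/s for 16-bit, 16 kHz mono input
print(pcm_bytes_per_second(24_000))  # 48000 bytes/s for the 24 kHz output
print(chunk_size(16_000, 20))        # 640 bytes per 20 ms input chunk
```

At 32 KB/s upstream and 48 KB/s downstream, a full-duplex session stays well under 1 Mbps, which is why the audio-only mode works fine on mobile connections.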

The model directly captures prosodic nuances, pitch variations, and the emotional markers that intermediate transcription consistently stripped away.

Doubled context window and 90+ languages

Gemini 3.1 Flash Live features a 131,072-token input context window and 65,536 tokens for output.

In practice, an audio-only session lasts up to 15 minutes without losing the thread of the conversation.

The model handles real-time language switching: a user can go from French to English to Spanish within the same session without any special configuration.

The 90+ supported languages include tonal languages (Mandarin, Vietnamese) where melodic variations carry lexical meaning, a dimension that consistently challenged transcription-based architectures.

This native multilingual support made the global rollout of Search Live possible in a single launch on March 26.

Technical specs worth knowing

Benchmarks

Google has published three reference scores for Gemini 3.1 Flash Live.

On ComplexFuncBench Audio, the model reaches 90.8%: this benchmark measures the ability to execute chains of interdependent function calls directly from audio input, without transcription.

On Scale AI Audio MultiChallenge (with thinking mode enabled), it scores 36.1%: this benchmark tests performance in real-world conditions, with interruptions, hesitations, and background noise, not clean lab recordings.

On Big Bench Audio (thinking high), the score reaches 95.9%, compared to 70.5% in minimal thinking mode.

This gap illustrates an important architectural trade-off for developers: reduced latency versus deeper reasoning, configurable per use case.

Latency and noise handling

The model maintains stable latency throughout the session, without degradation as context accumulates.

The 16 kHz PCM audio input captures the human voice frequency range (80 Hz to 8 kHz) with enough precision to distinguish regional accents, hesitations, and mid-sentence corrections.

Gemini 3.1 Flash Live handles interruptions, filler words, and background noise without loss of comprehension, as validated by the Scale AI Audio MultiChallenge.

Companies like Verizon and Home Depot have confirmed in their feedback that the model detects customer frustration signals and dynamically adjusts its response style, shifting to a more empathetic tone when needed.

An AI voice session that picks up frustration before the sentence is even finished, and adjusts its tone accordingly: that’s what Verizon is running in production today.

SynthID audio watermark

SynthID is the invisible watermarking system developed by Google DeepMind to authenticate AI-generated content.

Gemini 3.1 Flash Live embeds SynthID directly into generated audio, encoding a watermark that is imperceptible to the human ear but detectable by analysis tools.

The watermark is designed to counter the spread of voice deepfakes: a voice synthesized by Gemini can be identified as such, even after compression or re-encoding.

The reliability question remains open: a voice deepfake produced by a different model won’t carry the SynthID watermark, which limits the defensive reach of the system to Google models only.

Voice AI showdown 2026: GPT-4o vs Gemini 3.1 Flash Live

Latency and naturalness

OpenAI Realtime API (GPT-4o Voice) shows an average latency of around 0.32 seconds under normal conditions, compared to ~0.21 seconds for a natural human response.

Gemini 3.1 Flash Live maintains comparable but more stable latency on long conversations and large contexts, where OpenAI’s Realtime API shows noticeable degradation.

On voice quality, GPT-4o voices have a timbral richness that works well for short exchanges; Gemini 3.1 Flash Live voices hold up better over long interactions, without the rhythm variations that can sometimes give away synthesis.

For use cases with frequent interruptions (dictation, rapid corrections), OpenAI’s Realtime API responds marginally faster.

For guided conversational workflows (customer support, tutorials), Gemini’s latency stability creates a smoother experience.

Worth noting: our analysis of Gemini 2.5 Pro had already highlighted Google’s growing strength in large context window models.

API and pricing

During the preview period, Gemini 3.1 Flash Live is completely free via Google AI Studio.

Post-GA, estimated pricing sits around $3 to $5 per million audio input tokens and $12 to $20 per million audio output tokens.

OpenAI’s Realtime API currently charges $0.06 per minute of audio input and $0.24 per minute of audio output for GPT-4o Realtime, which works out to roughly $3.60/hour of input and $14.40/hour of output.
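The per-minute figures make it easy to budget a session. A minimal cost helper using the rates quoted above (the 15/5-minute split is an assumed usage pattern, not a measured one):

```python
def openai_realtime_cost(minutes_in: float, minutes_out: float,
                         rate_in: float = 0.06, rate_out: float = 0.24) -> float:
    """Session cost in USD at the GPT-4o Realtime per-minute rates above."""
    return minutes_in * rate_in + minutes_out * rate_out

# A 15-minute session (the Live API audio cap), assuming the model
# speaks for roughly a third of it:
print(round(openai_realtime_cost(15, 5), 2))  # 0.06*15 + 0.24*5 = $2.10
```

During Gemini's preview the same session costs $0.00, which is the whole point of the adoption window discussed below.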

Pricing is comparable at full maturity, but Google is offering a free adoption window that developers would be wrong to overlook.

On Vertex AI, enterprise deployments get reserved capacity with volume discounts and managed support.

Google’s free preview period is no accident: it’s an aggressive adoption strategy to pull developers in before OpenAI’s Realtime API becomes the market standard.

Impact on AI agent developers

Live API: capabilities and architecture

Google’s Live API uses stateful bidirectional WebSocket (WSS) connections that keep the session alive without HTTP request-response cycles.

Audio-only sessions last up to 15 minutes; sessions combining audio and video are capped at 2 minutes (bandwidth constraint), with video streamed as JPEG frames at ~1 frame per second.

Function calls work directly from audio input: the agent can query a database or trigger an external workflow in response to a voice command, without prior transcription.
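On the application side, a voice-triggered function call still resolves to a tool name plus arguments that your code must route. A minimal dispatcher sketch; the tool name, the handler, and the shape of `call` are illustrative, not the Live API wire format:

```python
from typing import Any, Callable

def get_order_status(order_id: str) -> dict:
    # Hypothetical business function; would query a real database in production.
    return {"order_id": order_id, "status": "shipped"}

# Registry mapping tool names the model may emit to local handlers.
TOOLS: dict[str, Callable[..., Any]] = {"get_order_status": get_order_status}

def dispatch(call: dict) -> Any:
    """Route a {'name': ..., 'args': {...}} tool call to its handler."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return fn(**call["args"])

print(dispatch({"name": "get_order_status", "args": {"order_id": "A-42"}}))
```

The result is then sent back into the session so the model can voice it to the user.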

For teams building voice agents integrated with Google Workspace, our guide on Google Workspace AI agents via GWS CLI covers the available integration patterns.

Production limits on Vertex AI: 1,000 concurrent sessions per project and 4 million tokens per minute.

Migrating from the previous Gemini 2.5 Flash model

The previous model was called gemini-live-2.5-flash-native-audio; the new one is gemini-3.1-flash-live-preview.

The migration is not seamless: proactive audio (filtering out conversations not directed at the device) and affective dialogue (adapting style based on detected emotions) are both absent in version 3.1.

Developers migrating from 2.5 Flash need to remove the configuration code for both features to avoid initialization errors.
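In practice that cleanup can be a one-liner over the session config. A sketch, assuming a plain-dict config; the key names here are illustrative, so check your SDK version's config schema for the exact field names:

```python
# Config keys for features removed in 3.1 (names are illustrative).
DEPRECATED_KEYS = {"proactive_audio", "affective_dialog"}

def migrate_config(config: dict) -> dict:
    """Return a copy of a 2.5 Flash Live config with removed features dropped."""
    return {k: v for k, v in config.items() if k not in DEPRECATED_KEYS}

old_config = {
    "response_modalities": ["AUDIO"],
    "proactive_audio": {"enabled": True},
    "affective_dialog": True,
}
print(migrate_config(old_config))  # {'response_modalities': ['AUDIO']}
```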

Production-validated use cases

Verizon uses it for voice customer support: problem identification, account access via function calling, natural response, automatic escalation when frustration is detected.

Home Depot is deploying a visual assistant: customers point their camera at a product to assemble, ask questions by voice, and receive step-by-step instructions with links to the relevant manuals and videos.

LiveKit integrates the model into its real-time communication platform for developers who want to build voice agents without managing WebSocket infrastructure manually.

These use cases are part of the broader trend toward autonomous agents: our analysis of Manus AI and autonomous agents covers the architectures that come closest to what Gemini 3.1 Flash Live makes possible in voice.

Search Live: conversation as a search engine

Google rolled out Search Live to more than 200 countries and territories on March 26, 2026, simultaneously with the model announcement.

Access is straightforward: open the Google app (Android or iOS), tap the Live icon below the search bar, and start talking.

Search Live supports the 90+ languages of the model, making it the most widely available conversational voice search system on the market.

The integration with Google Lens is the standout feature: users point their camera at an object and ask questions by voice, and the model answers based on exactly what the user sees.

For French-language SEO, the impact is real: if Search Live captures part of text-based search traffic, long-form and conversational queries will grow, shifting ranking patterns for informational content.

Google has not published data on Search Live’s adoption rate relative to classic search.

The strategic question remains: if AI answers directly out loud, how many users still click through to organic results?

Limitations and what to watch for

Function calling works synchronously: the model pauses during external function execution before resuming audio generation.

On slow API calls (database queries, third-party services), this creates silences perceived as delays that degrade the conversational experience.

Async function calling, which would allow the model to keep talking during execution, is a known limitation still awaiting development.
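Until async function calling ships, one common workaround is to mask the silence with a spoken filler when a tool exceeds a latency threshold. A minimal sketch of that pattern using `asyncio`; the filler text, the threshold, and `slow_tool` are illustrative, and this is an application-level workaround, not an API feature:

```python
import asyncio

async def slow_tool() -> str:
    await asyncio.sleep(0.3)  # simulates a slow third-party API call
    return "order A-42 has shipped"

async def answer_with_filler(threshold: float = 0.1) -> list[str]:
    """Speak a filler line if the tool exceeds `threshold` seconds,
    then deliver the real answer once it completes."""
    spoken: list[str] = []
    task = asyncio.create_task(slow_tool())
    try:
        # shield() keeps the task running even if wait_for times out.
        result = await asyncio.wait_for(asyncio.shield(task), threshold)
    except asyncio.TimeoutError:
        spoken.append("One moment while I check that for you.")
        result = await task
    spoken.append(result)
    return spoken

print(asyncio.run(answer_with_filler()))
```

In a real agent, each appended string would be sent to the TTS path of the session instead of collected in a list.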

Session context is capped at 128K tokens: on long sessions, developers need to add explicit context overflow handling to avoid abrupt cutoffs.
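A simple form of that overflow handling is a sliding window over the turn history. A sketch, assuming per-turn token counts come from the API's usage metadata (the turn structure below is illustrative):

```python
CONTEXT_CAP = 131_072  # input context window from the spec (128K)

def trim_history(turns: list[dict], cap: int = CONTEXT_CAP) -> list[dict]:
    """Drop the oldest turns until the total token count fits under `cap`."""
    total = sum(t["tokens"] for t in turns)
    i = 0
    while total > cap and i < len(turns) - 1:  # always keep the latest turn
        total -= turns[i]["tokens"]
        i += 1
    return turns[i:]

history = [
    {"role": "user", "tokens": 70_000},
    {"role": "model", "tokens": 50_000},
    {"role": "user", "tokens": 30_000},
]
print(trim_history(history))  # drops the first turn; 80,000 tokens remain
```

Production agents usually pair this with summarization of the dropped turns so long-session context is compressed rather than lost outright.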

The absence of proactive audio remains a barrier for applications in multi-person environments (meeting rooms, open offices, home assistants).

GDPR compliance in Europe for enterprise deployments via Vertex AI requires verifying data storage and voice processing configurations per jurisdiction.

On the ethical side, the growing naturalness of AI voices creates a risk of voice manipulation that SynthID can only partially address: it only covers voices generated by Google models.

Verdict

Gemini 3.1 Flash Live is the most capable AI voice model available to developers in March 2026.

The native audio-to-audio architecture structurally solves the latency problem that classic pipelines could never overcome.

The scores on ComplexFuncBench Audio and Scale AI Audio MultiChallenge validate real production use cases, not just lab benchmarks.

The free preview period is an adoption window worth seizing: teams building their voice agents now will have a head start before paid GA.

The current limitations (synchronous function calling, no affective dialogue) are real but manageable for the vast majority of use cases.

With Search Live going global, Google is positioning voice as the next interface layer between users and information.

Building a voice agent or looking for the best AI assistant in 2026?

Test Gemini 3.1 Flash Live in Google AI Studio and tell us in the comments how it compares to your current setup.

FAQ

What is Gemini 3.1 Flash Live and how is it different from previous voice assistants?

It’s Google’s native audio-to-audio model: it processes the audio signal directly without intermediate transcription, which reduces latency and preserves the prosodic and emotional nuances lost in classic STT→LLM→TTS pipelines.

What is the actual latency of Gemini 3.1 Flash Live compared to GPT-4o Voice?

Both systems achieve latency below 400 ms under normal conditions; Gemini stands out for more stable latency on long sessions and large contexts, where GPT-4o Realtime API shows degradation.

Is Gemini 3.1 Flash Live free for developers?

Yes, during the preview period via Google AI Studio, access is free.

Post-GA, estimated pricing is around $3 to $5 per million audio input tokens and $12 to $20 per million output tokens.

How to migrate from Gemini 2.5 Flash Native Audio?

Replace the model identifier with gemini-3.1-flash-live-preview and remove the configuration code for proactive audio and affective dialogue, as both features are not yet available in version 3.1.

Does the model work well in French?

French is among the 90+ natively supported languages; the model recognizes regional accents, hesitations, and speech rate variations typical of spoken French, though Google has not published language-specific metrics.

What is SynthID and does it effectively protect against voice deepfakes?

SynthID encodes an imperceptible watermark into audio generated by Gemini, allowing it to be identified as AI content; the limit is that it only covers voices produced by Google models, not those generated by other systems.

What are the main current technical limitations?

The three key limitations: function calling is synchronous (blocking during execution), proactive audio and affective dialogue are absent, and audio+video sessions are limited to 2 minutes of continuous streaming.

Does Search Live replace classic Google search?

Search Live is a conversational layer added on top of search; it does not yet replace organic results, but its growing adoption will shift query patterns toward longer, conversational formulations.

Is Gemini 3.1 Flash Live available in Europe with GDPR compliance?

Access via Vertex AI on Google Cloud enables GDPR-compliant deployment configurations, but each deployment must be verified against data storage and voice processing rules for the relevant jurisdiction.

Which professional use cases are already validated in production?

Verizon uses it for voice customer support with emotion detection and function calling on account data; Home Depot is deploying a voice visual assistant for DIY projects; LiveKit integrates it as infrastructure for developers building conversational agents.
