Connect with us

NEWS

Google and Microsoft Take Opposite Paths on Speech AI

Google’s on-device Eloquent and Microsoft’s MAI-Transcribe-1.5 hit the market the same week with opposite architectures and very different enterprise trade-offs.

Published

on

At Build 2026 on June 2, Microsoft unveiled MAI-Transcribe-1.5, a cloud speech API supporting 43 languages that processes an hour of audio in under 15 seconds through Azure. One day later, Google put a fully offline, on-device speech recognition app on macOS, powered by Gemma 4 12B and keeping all audio on the machine at no cost.

Google’s On-Device Dictation Reaches the Mac

Google AI Edge Eloquent, free and available from the Mac App Store, transcribes speech, strips filler words, and applies editorial polish through a customizable keyboard shortcut that works on top of any open application. All audio stays on the device. No internet connection required. Google released it alongside Gemma 4 12B on June 3.

The Gemma 4 12B Foundation

The model requires at least 16GB of unified memory and runs on Apple Silicon through Google’s LiteRT-LM inference engine. Google describes Gemma 4 12B as capable of performance comparable to its 26-billion-parameter mixture-of-experts variant while fitting the hardware limits of a standard consumer MacBook. It handles text, vision, and audio from a single checkpoint. Weights are on Hugging Face and Kaggle under Apache 2.0, permitting commercial use; companies can build products on top of the model weights without a separate licensing arrangement with Google.

Google’s LiteRT-LM command-line tool also gained a serve command this week, exposing a local endpoint compatible with the OpenAI API format and letting existing developer tools point at an on-device model without code changes. The companion app for running the Gemma family locally, Google AI Edge Gallery, limits users to Google’s own model weights as of this week’s launch. Platforms like Ollama and LM Studio accept any compatible open-weight model from Hugging Face; Gallery currently does not.

What Eloquent Handles and Its Limits

Voice Edit is the headline capability. Select text on screen, speak a command aloud, and the model rewrites it locally: translate a paragraph, summarize an email thread, reformat rough notes into an executive briefing. Output appears in whatever application is active. Custom vocabulary can be added for recurring proper nouns and technical terms to cut correction frequency, and writing style options let users match the tone of the target document.

The app launches English-only. Additional languages are promised but undated. On machines with less than 16GB of unified memory, users can fall back to lighter Gemma 4 E2B or E4B variants through the Gallery app, which handle simpler transcription tasks but lose the complex rewrite capability that Voice Edit requires.

Seven New Models at Microsoft Build 2026

On June 2, Microsoft announced seven in-house AI models spanning transcription, voice, image, reasoning, and coding, all trained on clean licensed data with zero distillation from third-party models and co-designed with the company’s proprietary Maia 200 AI accelerator chip. The family is available through Azure AI Foundry and the MAI Playground, with MAI-Thinking-1 still limited to private preview.

MAI-Transcribe-1.5 in Production

The model lands third on the Artificial Analysis speech-to-text leaderboard, a research firm that benchmarks commercial STT models, with a word error rate of 2.4% on the FLEURS multilingual benchmark across 43 languages. Alibaba’s Fun-Realtime-ASR-preview holds first at 1.7% WER; ElevenLabs’ Scribe v2 is second at 2.2%. The previous version, MAI-Transcribe-1, launched in April 2026 with 25 languages at the same price point; the 1.5 update expanded to 43 languages without raising the fee.

Speed is where the numbers pull away from the competition. Artificial Analysis clocks the model at roughly 276 times real-time, more than double the second-fastest model in the top ten for accuracy. Keyword biasing, detailed in Microsoft’s MAI-Transcribe-1.5 launch announcement, lets developers supply up to 200 domain-specific terms and cuts word error rate on specialized vocabulary by 30%. Pricing is $0.36 per hour of audio through Azure Speech. The model now processes audio behind Teams meeting transcription, Copilot, GitHub, and Dynamics 365 Contact Center, so Azure enterprise customers encounter it without a separate integration step.

Voice and the Broader Stack

All seven MAI models were built without data from OpenAI, Anthropic, or other labs. Mustafa Suleyman, Microsoft AI’s chief executive, framed the direction at the June 2 Build keynote:

building a hill-climbing machine to keep you at the frontier

MAI-Voice-2, the companion text-to-speech model, covers 15 languages with emotion and style controls and supports zero-shot voice cloning from as little as five seconds of sample audio, with consent safeguards built into the pipeline. Microsoft reports it was preferred over MAI-Voice-1 in 72% of side-by-side listener evaluations. Pricing is $22 per million characters through Azure Speech. The full family, available through Azure AI Foundry:

Model Capability Starting Price Status
MAI-Transcribe-1.5 Speech-to-text, 43 languages $0.36 per audio hour Available
MAI-Voice-2 Text-to-speech, 15 languages, voice cloning $22 per 1M characters Available
MAI-Image-2.5 Text-to-image and editing $47 per 1M output tokens Available
MAI-Image-2.5 Flash Faster image generation $33 per 1M output tokens Available
MAI-Thinking-1 Reasoning, 35B active params, 256K context Not yet public Private preview
MAI-Code-1-Flash Agentic coding, GitHub Copilot Below Haiku pricing Rolling out

MAI-Thinking-1, a sparse mixture-of-experts model with 35 billion active parameters and a 256K token context window, matches Claude Opus 4.6 on SWE-Bench Pro and scored 97% on AIME 2025, per Microsoft’s published benchmarks. It remains in private preview pending wider production validation.

First-Look Tests Complicate the Benchmark Story

PCMag UK’s first-look comparison of Microsoft’s transcription model against Google’s Gemini cloud transcription model recorded 13 errors for the Microsoft model and six for Google’s in a real-world dictation session. The same first look described MAI-Voice-2 as sounding “robotic.” Both models are in limited preview, and Microsoft has not set a general availability date for either.

Both models in that session were cloud APIs. The result measures how the Microsoft transcription API performed against Google’s Gemini cloud model on specific test audio, not cloud against on-device processing.

The gap between leaderboard rank and a journalist’s dictation session has a structural component. FLEURS and the Artificial Analysis WER metric use curated multilingual audio under controlled conditions. A real-world session adds conversational pacing, ambient noise, and accent variation that benchmarks weight differently. Microsoft’s speed claim of 276 times real-time comes from batch transcription of pre-recorded clean audio, the workload the model’s current optimization targets most precisely. Microsoft’s preview label for both speech models acknowledges that production edge cases take longer to surface than benchmark runs.

Eloquent’s accuracy profile is shaped by different constraints. Running on a 12B-parameter on-device model means far less compute than a cloud API applies to the same audio clip. Automatic filler-word removal and editorial polishing help the output read cleanly, but for verbatim legal or medical dictation, where the precise wording of the original speech matters, that cleanup rewrites content the user intended to preserve.

Formal accuracy leadership sits outside both companies. Alibaba’s Fun-Realtime-ASR-preview holds first on the Artificial Analysis leaderboard at 1.7% WER, a figure neither company’s model has matched on that benchmark.

NVIDIA and Others Press the Open-Weights Case

NVIDIA released Nemotron 3.5 ASR on June 4, a streaming speech recognition model covering 40 language locales from a single 600-million-parameter checkpoint. The Cache-Aware FastConformer-RNNT architecture processes each audio frame exactly once, eliminating the redundant recomputation that conventional buffered-streaming systems apply across overlapping audio windows. Latency is configurable at deployment time between 80 milliseconds and 1.12 seconds with no retraining required. Weights ship on Hugging Face under an open model license and run on Apple Silicon, NVIDIA GPU, or CPU hardware.

NVIDIA reports 17 times the concurrent stream capacity of buffered alternatives on an H100. The Hugging Face release includes a complete fine-tuning recipe covering data preparation, training, evaluation, and deployment; NVIDIA reports a 31% relative word error rate drop on Greek and 32% on Bulgarian from fine-tuning on domain-specific audio. Teams building voice agents for regulated applications, including medical transcription and compliance recording, can fine-tune the model on their own audio without routing a single second of that data to an external server. A NIM (NVIDIA Inference Microservices) deployment package with gRPC streaming support is planned for later this month.

Other speech-related releases from the same week:

  • DictaFlow appeared in the Microsoft Store for Windows, offering push-to-talk dictation and screen intelligence for editing text by voice across Windows apps, including remote desktop environments.
  • AssemblyAI updated its Universal-3 Pro model, reporting a 19% reduction in errors on multilingual tests and a 30% drop in latency.
  • Manus released a meeting notes tool that records, transcribes, and summarizes conversations with automatic speaker identification.
  • Google Gboard integrated Rambler, a Gemini-powered feature for real-time speech interpretation and noise suppression on Pixel devices.

The Deployment Choice That Benchmarks Cannot Make

Mordor Intelligence’s 2026 voice recognition market report puts global market value at $22.5 billion this year on a trajectory toward $61.7 billion by 2031, a 22.38% compound annual growth rate. Cloud deployment captured 67.9% of 2025 market revenue, per the same report. The fastest-growing deployment subsegment is on-device and edge voice AI, forecast at 22.96% compound annual growth through 2031. Healthcare providers, at 12.8% of end-user vertical revenue in 2025, are forecast to grow at 23.94% annually, the highest projected rate in the study, with patient-data residency regulations cited as a driver of local-processing adoption across the vertical.

For regulated industries, the decision between cloud and on-device transcription is primarily a compliance determination. A hospital under strict patient-audio regulations cannot route speech data through an external cloud regardless of benchmark rank. European financial institutions subject to data residency requirements generate audio that cannot travel to non-local cloud infrastructure without additional legal safeguards. Local processing sidesteps the routing question entirely. Running locally also means English-only availability for now and a hardware dependency that cloud APIs don’t impose on the buyer.

Microsoft’s cloud API, with broad multilingual coverage and keyword biasing for specialized vocabulary, integrates through Teams and Copilot as a credible backend for customer service, legal document processing, and enterprise communications, so long as Azure’s data residency terms satisfy the compliance requirement. Google’s macOS app addresses a different buyer: one where the device boundary is non-negotiable and waiting on more languages is an acceptable interim cost.

Google has committed to adding more languages to Eloquent but hasn’t said when. Microsoft hasn’t published a general availability date for either new speech model, both currently in limited preview.

Continue Reading
Click to comment

Leave a Reply

Your email address will not be published. Required fields are marked *

Trending