OpenAI’s gpt-realtime Promises New Era for Enterprise Voice AI 

New releases make voice agents more capable through access to additional tools and context.

With OpenAI making its Realtime API generally available with new features and releasing its “most advanced” speech-to-speech model, gpt-realtime, developers and enterprises can now build reliable, production-ready voice agents that sound more natural and expressive. 

The API now supports remote Model Context Protocol (MCP) servers, image inputs, and even phone calling through the Session Initiation Protocol (SIP), OpenAI announced.
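For developers, attaching a remote MCP server is meant to be a matter of pointing the session configuration at its URL. The sketch below is illustrative only: the field names follow OpenAI's announcement but should be checked against the current documentation, and the server label and URL are placeholders rather than real endpoints.

```python
# Illustrative Realtime session payload that attaches a remote MCP server.
# Field names follow OpenAI's announcement; verify against the current docs.
# The label and URL below are placeholders, not real endpoints.
mcp_session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "mcp",
                "server_label": "internal-orders",        # placeholder label
                "server_url": "https://example.com/mcp",  # placeholder URL
                "require_approval": "never",              # auto-approve tool calls
            }
        ]
    },
}
```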

The company claimed that gpt-realtime is better at interpreting system messages and developer prompts—whether that’s reading disclaimer scripts word-for-word on a support call, repeating back alphanumerics, or switching seamlessly between languages mid-sentence. 

While traditional voice AI pipelines chain separate models for speech-to-text (STT) and text-to-speech (TTS) conversion, OpenAI’s Realtime API is said to process and generate audio directly through a single model and API, resulting in reduced latency and more natural, expressive responses.
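In practice, a single-model voice turn runs over one streaming connection rather than three chained services. The Python sketch below shows the general shape of such a session, assuming the WebSocket endpoint and event names published in OpenAI's Realtime API documentation; exact field and event names may differ by API version, so treat it as a sketch rather than a drop-in client.

```python
# Minimal sketch of one audio turn against OpenAI's Realtime API.
# Endpoint, headers, and event names follow the published docs at the time of
# writing but may differ by version -- treat as illustrative, not canonical.
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}


async def one_turn(pcm16_audio: bytes) -> bytes:
    out = bytearray()
    # Older versions of the websockets library call this kwarg "extra_headers".
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # One session both listens and speaks; no separate STT/TTS hop.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"instructions": "You are a concise support agent."},
        }))
        # Stream the caller's audio, then ask the model for a spoken reply.
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_audio).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))

        async for raw in ws:
            event = json.loads(raw)
            if event["type"].endswith("audio.delta"):  # audio arrives as base64 chunks
                out.extend(base64.b64decode(event["delta"]))
            elif event["type"] == "response.done":     # turn finished
                break
    return bytes(out)


# asyncio.run(one_turn(open("caller.pcm", "rb").read()))
```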

OpenAI also shared testimonials from customers using the new speech-to-speech model in OpenAI’s Realtime API, which include Zillow, T-Mobile, StubHub, Oscar Health, and Lemonade. 

Kirby Thornton, senior product manager of AI at T-Mobile, said it’s the first voice stack they’ve used that pairs a great experience with the reliability needed in production. “The model follows instructions reliably and, with function calling, securely connects to our pricing, inventory, and promotions.” 
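The function-calling setup Thornton describes maps to declaring tools in the session. The sketch below is hypothetical: the tool name, parameters, and plan-price lookup are invented for illustration, while the outer tools structure follows OpenAI's published format for function tools.

```python
import json

# Hypothetical function tool a carrier might expose to the model; the name,
# parameters, and use case are invented for illustration. The surrounding
# "session.update"/tools structure follows OpenAI's published format.
pricing_tool = {
    "type": "function",
    "name": "get_plan_price",  # hypothetical back-end lookup
    "description": "Look up the current monthly price for a plan by its ID.",
    "parameters": {
        "type": "object",
        "properties": {"plan_id": {"type": "string"}},
        "required": ["plan_id"],
    },
}

tool_session_update = json.dumps({
    "type": "session.update",
    "session": {"tools": [pricing_tool], "tool_choice": "auto"},
})
# When the model emits a function-call event, the application runs the lookup
# against its own systems and returns the result before the next response,
# keeping pricing and inventory data behind the company's own APIs.
```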

Reduced Latency, Refined Response 

OpenAI states that the new model can capture non-verbal cues, such as laughs, switch languages mid-sentence, and adapt to varying tones. 

Traditional voice stacks chain speech-to-text, an LLM, and text-to-speech, each adding hundreds of milliseconds (for example, 200 to 300 ms for STT and additional LLM/TTS processing). 

“Those milliseconds add up quickly, turning natural conversation into awkward exchanges where users wonder if their agent is still listening,” read a blog post from AssemblyAI.

OpenAI’s gpt-realtime avoids most of that by generating and consuming audio natively in a single model/API, eliminating cross-service hops and cutting turn lag. 
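A rough latency budget makes the difference concrete. In the sketch below, the STT figure reflects the 200 to 300 ms range cited above; every other number is an illustrative guess, not a benchmark.

```python
# Back-of-the-envelope time to first audible reply, in milliseconds.
# The STT figure reflects the 200-300 ms range cited above; the other
# numbers are illustrative guesses, not measurements.
chained_pipeline_ms = {
    "speech_to_text": 250,
    "llm_first_token": 400,
    "text_to_speech_first_audio": 200,
    "cross_service_hops": 3 * 50,  # network hop between each stage
}
single_model_ms = {
    "realtime_first_audio": 500,   # one model, one API call (illustrative)
}

print(sum(chained_pipeline_ms.values()))  # ~1000 ms before the caller hears anything
print(sum(single_model_ms.values()))      # ~500 ms on the single-model path
```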

Google’s Gemini Live API, meanwhile, remains in preview as the company tests AI phone calls for consumers. OpenAI’s Realtime API, now generally available with SIP and MCP support, looks like the more enterprise-ready choice at present.

Brooke Hopkins, founder of Coval, which offers simulation and evaluation services for AI agents, stated that in testing, she found there is a “great leap forward in terms of instruction following,” which she claims has been the biggest pain point of the OpenAI Realtime models so far. 

Ankur Edkie, founder of Murf AI, a voice AI platform serving enterprise clients including Pfizer, Cisco, Honeywell, and VMware, told AIM that latency has been one of the main issues for enterprises in deploying voice agents. “It’s just in the past one, one-and-a-half years, where it has become possible for people to get sub-second latency with a meaningful response,” he said, indicating the progress of the voice AI ecosystem. 

Enterprises Want to Go All in on Voice AI

In April, Deepgram, an enterprise voice AI platform, surveyed 400 business leaders across North America; 83% hailed from large enterprises generating over $100 million in annual revenue, and 36% came from organisations with revenue exceeding $1 billion. 

The survey found that 97% of respondents are currently using voice technology for speech recognition or speech agents/analytics, and 67% now view voice technology as ‘foundational’ to their strategy. 

Among them, 41% developed their own voice technology solutions in-house, and 57% relied on external service providers. The survey also revealed that over 84% of respondents plan to increase their voice technology budgets over the coming year. 

And customer support is one of the most significant use cases.

For instance, Meesho, one of India’s largest e-commerce platforms, employs a GenAI-driven customer support system. 

The voice-based agent is available in six Indian languages and is reported to resolve around 90% of queries at just one-fifth of the original cost. The voice-bot handles approximately 60,000 calls per day.

Edkie stated that the adoption of voice AI agents among enterprises was driven by the cumbersome user experience provided by traditional IVR systems, where users had to type in numbers on a keypad to interact with customer support.

Human-First Customer Support as a USP?

This raises the obvious question about the fate of all humans working in the customer support industry. 

Contrary to the fear, Edkie said that the current state involves augmenting human capabilities rather than replacing them entirely, as handoff to a human agent is required for anything complex and unplanned, regardless of how good the bot is. 

He said that the companies trying to deploy it now want to augment the capacity of their teams to scale their operations. “I think the reason we are seeing a lot more success in that sphere than actually trying to replace an agent is because there is a lot more enterprise energy behind it.”

“Everyone is supportive of the idea that you don’t want to have mundane activities done by humans, and you want bots to take care of them. There’s no point in getting humans to do certain activities — like 24/7 calls, interactions on the weekend, and so on,” Edkie added. 

Many also think that maintaining human interaction with customers helps build stronger trust in the brand. “Way too many companies overinvest in acquiring customers and underinvest in customer support. We have reached a point where when I can talk with a human, that brand becomes memorable to me,” said a user on X.

Another X user remarked, “Human-first customer support might become a moat/USP that brands showcase on their homepage or marketing pages in the near future.” 

For context, a survey conducted this year by Kinsta revealed that 5% of customers “frequently” require escalation, 61% have received inaccurate information from AI, and 71% say AI struggles with complex issues. 

It will be interesting to see how recent advances such as OpenAI’s latest models, once widely adopted, enhance the capabilities of AI agents in customer service. 
