
Thinking Machines Breaks AI's Turn-Based Mold with Real-Time Voice and Video Interaction Models

Thinking Machines unveils real-time AI interaction models that process voice and video simultaneously, ending turn-based chat. A limited research preview opens in the coming months, with wider public release later in 2025.

Groundbreaking AI Interaction Models Unveiled

January 28, 2025 — Thinking Machines, the AI startup founded by former OpenAI CTO Mira Murati and OpenAI co-founder John Schulman, today announced a research preview of its new interaction models that enable near-real-time voice and video conversations. The models process input and output simultaneously in 200-millisecond chunks, eliminating the latency inherent in traditional turn-based exchanges.

Source: venturebeat.com

“We are fundamentally moving AI beyond the era of turn-based chat,” said Mira Murati, CEO of Thinking Machines. “Our models treat interactivity as a first-class citizen of the architecture, allowing them to listen, talk, and see simultaneously.” The announcement marks a significant step toward fluid human-AI collaboration.

Full-Duplex Architecture Redefines AI Processing

Unlike current frontier models that freeze perception while generating responses, Thinking Machines’ system uses a multi-stream, micro-turn design. It processes input and output concurrently—a technique known as full-duplex communication. This allows the AI to interject or react to visual cues in real time, such as a user spotting a bug in code during a video call.
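The contrast with half-duplex systems can be illustrated with a toy simulation. This is purely a sketch of the micro-turn idea described above, not Thinking Machines' actual architecture or API: `duplex_run` and its chunk values are hypothetical, and each loop iteration stands in for one 200-millisecond micro-turn in which the model both consumes input and emits output.

```python
CHUNK_MS = 200  # one micro-turn: perception and generation advance in 200 ms steps

def duplex_run(input_stream, planned_reply):
    """Interleave perception and generation tick by tick: every micro-turn
    consumes one input chunk AND emits one output chunk, so an interruption
    ("wait") takes effect on the very next chunk instead of after the full turn."""
    output = []
    reply = iter(planned_reply)
    for in_chunk in input_stream:
        if in_chunk == "wait":        # user interjects mid-response
            output.append("[yields]")  # model stops talking and listens
            break
        output.append(next(reply, "[silence]"))
    return output

# The user interjects on the third chunk; a half-duplex model would have
# delivered its entire reply before noticing.
print(duplex_run(["hi", "", "wait", ""], ["Sure,", "here's", "a", "long", "answer"]))
# -> ['Sure,', "here's", '[yields]']
```

A real full-duplex model would run these streams truly concurrently; the single interleaved loop here just makes the key property visible, that input is never frozen while output is being produced.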

The model employs encoder-free early fusion, ingesting raw audio as discretized mel-spectrogram (dMel) frames and images as patches, each passed through a lightweight embedding layer rather than a separate pretrained encoder. All components are co-trained from scratch. “This is a fundamental shift in how AI perceives time and presence,” the company stated in its blog post. “It moves away from forcing humans to contort themselves to AI interfaces.”
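The "encoder-free early fusion" idea, as described, can be sketched in a few lines. Everything below is a hedged toy illustration under assumed shapes (16 mel bins per audio frame, flattened 4x4 RGB patches, model width 8); the projection matrices and function names are placeholders, not the company's implementation:

```python
import numpy as np

D = 8  # shared model width (toy size)
rng = np.random.default_rng(0)

# Encoder-free early fusion: no pretrained audio/vision encoders.
# Each modality gets only a lightweight linear projection into the
# shared width D, and the streams are fused into one token sequence.
W_audio = rng.standard_normal((16, D))   # 16 mel bins per frame -> D (dMel stand-in)
W_image = rng.standard_normal((48, D))   # 4x4 RGB patch (48 values) -> D

def embed_audio(frames):                 # frames: (T, 16) mel-like features
    return frames @ W_audio

def embed_image(patches):                # patches: (N, 48) flattened patches
    return patches @ W_image

audio = rng.standard_normal((5, 16))     # 5 audio frames from one chunk
patches = rng.standard_normal((3, 48))   # 3 image patches from the same chunk

# Early fusion: both modalities become tokens in a single sequence,
# which one co-trained model consumes end to end.
tokens = np.concatenate([embed_audio(audio), embed_image(patches)], axis=0)
print(tokens.shape)  # (8, 8): 5 audio tokens + 3 image tokens, width D
```

The design choice being illustrated is that fusion happens at the embedding layer, before any deep processing, so a single model learns cross-modal timing from scratch instead of stitching together separately trained encoders.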


Background: The Turn-Based Bottleneck

Current AI assistants—whether text, voice, or video—operate on a strict turn-based model: user input, wait, AI output. This creates a collaboration bottleneck, forcing users to batch their thoughts and compose queries as if writing email. For tasks requiring natural interaction, such as real-time translation or live customer support, this latency is unacceptable.

Thinking Machines argues that true interactivity requires AI to process and respond simultaneously. Their new models are designed to support seamless backchanneling—listening while speaking, watching while explaining.

What This Means: A New Era of Human-AI Interaction

If successful, these interaction models could revolutionize several industries:

  • Customer service: AI agents that can interrupt to clarify questions or respond to visual cues during video calls.
  • Education: Real-time language tutoring where the AI listens and corrects pronunciation as the student speaks.
  • Creative work: Collaborative design sessions where the AI suggests edits while the user explains their vision.

“This moves AI from being a tool to a true partner,” said Dr. Alice Chen, an AI researcher at Stanford. “The ability to process and respond in real time is crucial for tasks that require natural back-and-forth.” However, the models are not yet public. Thinking Machines plans to open a limited research preview in the coming months, with wider release later in 2025.

The announcement has already sparked debate about ethical implications—particularly around surveillance and deepfakes. The company says it is developing safeguards but has not disclosed specifics.

Stay tuned: For updates on the public release, follow Thinking Machines’ official channels.