In the annals of technological progress, few trajectories have been as frustrating as that of voice AI. For years, we've endured the ineptitude of Siri and the infuriation of automated phone systems, wondering if the promise of conversational AI would ever be realised.
But the tides are turning.
If you've seen the GPT-4o demo or spent 30 seconds cloning your voice on Play.ai, you know: voice AI has levelled up. The question is no longer whether it works, but how it will reshape our world.
Two distinct camps have emerged in this new landscape:
The Horizontalists: The platform players
The Verticalists: The industry specialists
In one corner, we have the Horizontalists: startups racing to build the definitive voice AI platform. But this space is already hyper-competitive, with well-funded startups jockeying for position. More importantly, does anyone want to bet against OpenAI or AWS swooping in and dominating this market? The economies of scale and existing customer base give tech giants a formidable advantage.
In the other, we find a more intriguing cohort: the Verticalists. These are the companies crafting AI workers with deep domain expertise. They're not just adding a voice layer to existing software; they're engineering digital workers that truly understand context and nuance.
Imagine:
Car dealerships where AI agents handle 80% of initial inquiries, schedule test drives, and even begin price negotiations.
Logistics companies employing voice agents to manage carrier communications, provide real-time shipment tracking, and dynamically optimise routes, reducing manual call time by 60%.
This isn't speculative. Companies like Toma and Happy Robot are already bringing these visions to life.
It isn’t really about voice at all; it’s about new software that’s multi-modal and uses AI to complete workflows only a person could have completed before.
Why does this approach matter? Three reasons:
Bridging the knowledge gap. Most businesses lack the technical expertise to implement and customise generic AI solutions. Vertical products offer immediate value without extensive onboarding or the need for in-house AI talent.
Employee empowerment. Just as accounting software allowed bookkeepers to evolve into strategic financial advisors, AI workers will free up employees to focus on high-value, relationship-driven work. The result? Increased job satisfaction and better outcomes for customers.
Budgetary sleight-of-hand. Instead of requiring a new "AI budget," these solutions slot into existing line items. "We can do the work of an employee, but cheaper and 24/7" is a much easier sell than "invest in the future of AI."
The Road Ahead
Building these systems isn’t easy. It requires deep industry knowledge, robust integrations with existing systems, and a willingness to start with an 80% solution, relying on early adopters to refine the offering.
Finding a compelling insertion point matters. We expect that industries where talent is hard to find will be quickest to adopt technology that enhances their existing workforce and improves productivity.
It will also be easiest to gain traction in industries where the incumbent software tools either don’t capture the information required to build differentiated models or lack the engineering talent to ship new features quickly — Gong is much more likely to ship AI workers than the dealer management software at the car dealership.
But for those who can thread this needle, the opportunity is immense. We're looking at potential category-defining companies emerging in multiple industries.
If you’re building industry-specific software incorporating voice AI, I want to hear from you. If you disagree with any part of this analysis, let’s debate! This is the first sketch of an idea — it’s designed to be improved with feedback.
Another aspect of voice is often overlooked: voice is not just text read aloud. Voice carries far more information than the typed word, intonation included. We write and read so much text that we forget it is a (terribly convenient!) abstraction of richer communication.
In the 'only Apple' spirit, i.e., asking what a technology enables that was impossible before, one might ask: "what can *only* voice do?" State of mind, areas of focus, satisfaction or dissatisfaction are all encoded in the sounds we make.
And, of course, voice is additive. We've all experienced the frustration of Siri or Alexa not understanding when we're *not* talking to it, or what we're talking to it about. From voice alone, that's a hard problem even for a human. But when our computers' new-found voice abilities add to, or you might say multiply with, other senses and modalities like vision, we start getting a robust understanding of a human's intention, focus, and mental state.