Probably bets $9M that AI hallucinations can be stopped before users ever see them

Probably raised $9M to build AI reliability tools that prevent hallucinations with deterministic validators and smaller models.

Hallucinations remain one of artificial intelligence’s most stubborn weaknesses. Even as large language models have improved dramatically, they still make mistakes that can be persuasive, fast and expensive. For businesses that want AI systems to handle sensitive work, that gap between fluency and accuracy is becoming a major operational problem.

San Francisco startup Probably believes the answer is not simply a better model, but a better system around the model. The company has raised $9 million in seed funding from Andreessen Horowitz to build what it sees as a more reliable form of AI — one designed to block factual errors before they reach the end user.

Founded by Peter Elias, Probably is taking aim at a central tension in the AI industry: the most powerful models are often the most brittle in production, while the most practical enterprise use cases demand precision. Rather than chasing bigger models, the startup is building an engineering layer that constrains model behavior, checks outputs against deterministic rules and reduces the chance that an answer is wrong in the first place.

The company’s first product is a data science tool that generates quick answers from complex datasets. It is built to return citations and an audit trail, giving users a way to trace where a result came from and how it was assembled. But Probably says the real innovation is the infrastructure underneath: a validation system that acts as a guardrail between model output and the user.
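To make "citations and an audit trail" concrete, here is a minimal sketch of what such a result might look like as a data structure. The article does not describe Probably's actual schema, so every name here (`Citation`, `AuditedAnswer`, `mean_with_audit`) is a hypothetical illustration: the answer carries the source rows it was computed from and a log of the steps taken.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Citation:
    """Points at the exact source cells an answer was derived from (hypothetical schema)."""
    table: str
    rows: Tuple[int, ...]
    column: str


@dataclass
class AuditedAnswer:
    """A numeric answer plus the evidence and steps needed to retrace it."""
    value: float
    citations: List[Citation] = field(default_factory=list)
    steps: List[str] = field(default_factory=list)


def mean_with_audit(
    table_name: str, rows: List[Dict[str, Optional[float]]], column: str
) -> AuditedAnswer:
    """Compute a column mean while recording which rows were used and why."""
    used = [i for i, row in enumerate(rows) if row.get(column) is not None]
    if not used:
        raise ValueError(f"no non-null values in column {column!r}")

    values = [rows[i][column] for i in used]
    answer = AuditedAnswer(value=sum(values) / len(values))
    answer.steps.append(
        f"selected {len(used)} of {len(rows)} rows with non-null {column!r}"
    )
    answer.steps.append(f"computed mean over {len(values)} values")
    answer.citations.append(Citation(table_name, tuple(used), column))
    return answer


# Hypothetical data: the second row is skipped and the audit trail says so.
orders = [{"amount": 10.0}, {"amount": None}, {"amount": 30.0}]
result = mean_with_audit("orders", orders, "amount")
```

A reader of `result` can see the value, the rows that produced it and the filtering decision that excluded the null row, which is the kind of traceability the product description implies.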

The strategy reflects a growing shift in enterprise AI. As more companies deploy chatbots, assistants and internal copilots, the question is no longer whether a model can produce an answer. It is whether the answer can be trusted, repeated and defended.

Why hallucinations are still the industry’s hardest problem

LLMs are good at generating plausible language, but that same strength creates a weakness. When they do not know something, they can still produce an answer that sounds confident. In consumer-facing settings, that may be annoying. In business settings, it can be disastrous.

Companies that use AI for analytics, finance, medicine, operations or compliance cannot afford frequent errors. A single faulty data summary can distort a decision. A mistaken accounting output can create a reconciliation headache. A medical error can become a liability issue. That is why the market is increasingly interested in systems that can prove correctness rather than merely suggest it.

Probably’s premise is straightforward: if you want near-deterministic performance, you need to design for it from the ground up. That means the model itself is only one part of the stack. The other part is everything that shapes the model’s context, constrains its output and validates the result.

In Elias’s view, that is the only way to approach the kind of reliability people expect from conventional software.

Peter Elias says the company’s goal is to prevent hallucinations and basic factual errors from ever making it to the user, with a long-term aim of reaching the kind of 99.99% accuracy people associate with deterministic systems.

A “mech suit” for the model

Probably’s core idea is to wrap the language model in a dense layer of checks and feedback loops. Elias has described that setup as a kind of “data science mech suit” — a metaphor that captures both the complexity and the intent of the system.

Instead of letting the model produce a response and sending it directly to the user, Probably runs the output through a deterministic validator. If the answer does not line up with the underlying dataset, it gets rejected and sent back for correction. In effect, the AI is not allowed to improvise beyond the boundaries of the source material.
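The generate, check and retry flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not Probably's implementation: the dataset, the `Claim` type, the `validate` rule and the stub model are all hypothetical, and a real system would call a language model where the stub is.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical source data: region -> order amounts.
DATASET: Dict[str, List[float]] = {
    "north": [120.0, 80.0, 50.0],
    "south": [200.0, 40.0],
}


@dataclass
class Claim:
    """A structured answer the model proposes: total sales for a region."""
    region: str
    total: float


def validate(claim: Claim, dataset: Dict[str, List[float]]) -> Optional[str]:
    """Deterministic check: recompute the figure from the data.

    Returns None when the claim matches, otherwise a rejection reason.
    """
    if claim.region not in dataset:
        return f"unknown region {claim.region!r}"
    expected = sum(dataset[claim.region])
    if abs(claim.total - expected) > 1e-9:
        return f"total {claim.total} does not match recomputed {expected}"
    return None


def answer_with_validation(
    generate: Callable[[str, Optional[str]], Claim],
    question: str,
    max_attempts: int = 3,
) -> Optional[Claim]:
    """Only return a claim that passed validation; otherwise return None."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        claim = generate(question, feedback)
        feedback = validate(claim, DATASET)
        if feedback is None:
            return claim
    return None  # refuse rather than show an unverified answer


def stub_model(question: str, feedback: Optional[str]) -> Claim:
    """Stand-in for a model: wrong on the first try, corrected after feedback."""
    if feedback is None:
        return Claim("north", 260.0)
    return Claim("north", 250.0)


def stubborn_model(question: str, feedback: Optional[str]) -> Claim:
    """Stand-in for a model that never corrects itself."""
    return Claim("north", 999.0)


verified = answer_with_validation(stub_model, "Total sales in north?")
refused = answer_with_validation(stubborn_model, "Total sales in north?")
```

The design choice worth noting is the final `return None`: when the model cannot produce an answer that passes the check, the sketch withholds the answer instead of passing an unverified one to the user.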

The company says it also trains the model against the validator itself, so the model learns how to stay within the rules. That approach turns the system into a tighter loop: the model generates, the validator checks, the training adjusts and the final answer becomes less ambiguous over time.
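The article does not say how the model is trained against the validator. One common way to use a deterministic checker as a training signal is to sample several outputs per prompt and let the checker sort them into accepted and rejected examples for later fine-tuning. The sketch below assumes that approach; the function names, the toy arithmetic task and the stub sampler are hypothetical.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, model output)


def collect_training_examples(
    prompts: List[str],
    sample: Callable[[str, int], str],
    is_valid: Callable[[str, str], bool],
    samples_per_prompt: int = 4,
) -> Tuple[List[Example], List[Example]]:
    """Label sampled outputs with the validator's verdict.

    Accepted examples could feed supervised fine-tuning; accepted and
    rejected pairs could feed a preference-based method. Neither training
    step is shown here.
    """
    accepted: List[Example] = []
    rejected: List[Example] = []
    for prompt in prompts:
        for i in range(samples_per_prompt):
            output = sample(prompt, i)
            bucket = accepted if is_valid(prompt, output) else rejected
            bucket.append((prompt, output))
    return accepted, rejected


def sum_is_valid(prompt: str, output: str) -> bool:
    """Toy deterministic validator: the prompt is 'a+b', the output must be the sum."""
    left, right = prompt.split("+")
    try:
        return int(output) == int(left) + int(right)
    except ValueError:
        return False


def stub_sampler(prompt: str, i: int) -> str:
    """Stand-in for model sampling: correct on even draws, off by one on odd draws."""
    left, right = prompt.split("+")
    return str(int(left) + int(right) + (i % 2))


good, bad = collect_training_examples(["2+3", "10+5"], stub_sampler, sum_is_valid)
```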

The result is a product that leans less on raw model scale and more on structural discipline. In a market obsessed with frontier models, Probably is making a contrarian argument: sometimes the best way to get better performance is to ask less of the model and more of the system around it.

Smaller models, lower costs

One of the clearest advantages of this approach is cost. Elias says Probably’s current system runs on a model that is roughly four tiers below the most advanced frontier systems. Because the architecture is built to minimize ambiguity, the model does not need to be as large or as powerful to perform well.

That matters for two reasons. First, smaller models are cheaper to run. Second, they can often be deployed on local hardware rather than in a large cloud data center, which cuts token spend and can reduce operational overhead.

For companies watching AI budgets tighten, that is a meaningful proposition. Token costs have become a major line item as more organizations move from experiments to full production use. Many teams are now reassessing which workloads deserve premium model access and which can be served by a leaner system.

Probably is positioning itself squarely around that second kind of workload, serving customers who want the most dependable answer rather than the flashiest one.

What local deployment changes

Running on smaller models can reshape the economics of enterprise AI in several ways:

  • It reduces inference costs for high-volume workflows.
  • It makes certain applications possible on desktop-class hardware.
  • It can improve data control by limiting dependence on large remote infrastructure.
  • It often makes latency more predictable for fast-turnaround tasks.

That combination is especially attractive for precision-sensitive jobs where consistency matters more than creative breadth.

Beyond analytics: a broader enterprise play

Although Probably’s first product is focused on data science, Elias says the underlying architecture is meant to travel. The company sees the same verification-first engine being adapted to other categories where accuracy is critical, including accounting and medical services.

That framing matters because it places Probably in a larger category of enterprise AI startups trying to solve not just a single workflow, but a class of workflows. The company is not merely selling a dashboard or a chatbot. It is selling a reliability layer that could be embedded into many different applications.

In practice, that could mean tools for:

  1. financial analysis and reconciliation
  2. compliance and audit workflows
  3. clinical or administrative medical support
  4. operations and logistics reporting
  5. internal knowledge systems that require citations

The emphasis on “precision-sensitive” use cases is important. Probably is not targeting creative writing or open-ended brainstorming, where a little unpredictability may be tolerated. It is aiming at environments where the system must be right, or at least provably grounded.

Why the big model labs may not prioritize this

Elias also made a pointed argument about the incentives of major AI labs. In his view, the largest model developers are not especially motivated to eliminate this kind of error at the system level, because the current product model rewards repeated corrections and ongoing interaction.

That is a provocative claim, but it reflects a broader debate in the industry. Model providers often optimize for engagement, latency, breadth and scale. Enterprise buyers, by contrast, are asking for traceability, cost efficiency and confidence. Those priorities are not always aligned.

The startup’s thesis is that reliability is not just a technical challenge; it is also a product design choice. If the goal is to create a system that behaves like software rather than a conversational toy, the architecture has to be built around correctness from the start.

Elias argues that the biggest AI labs have not seriously pursued this kind of precision-first design, saying their business incentives may favor systems that require more user correction.

Whether or not that critique holds broadly, it points to a real market opening. As AI deployments mature, companies are becoming less swayed by polished demos and more focused on measurable performance. The next wave of winners may be the companies that can make AI boring in the best possible way: reliable, predictable and inexpensive.

How the verification model fits into the AI market

Probably’s fundraising comes at a moment when the industry is increasingly split between two paths. One path is scale: larger models, broader capabilities, multimodal systems and expensive infrastructure. The other is control: smaller models, specialized workflows, grounding and guardrails.

Probably is firmly in the second camp, but that does not make it a small idea. In fact, reliability could become one of the defining battlegrounds for enterprise AI over the next several years. Customers are no longer satisfied with generative systems that are merely useful in a demo. They want systems that can survive audits, support regulated decisions and keep operating at scale without constant human intervention.

The startup is betting that the most valuable AI systems will not necessarily be the ones that answer the most questions, but the ones that answer the right questions with the fewest mistakes.

Where this could matter most

Several sectors stand out as likely adopters of verification-heavy AI:

  • Finance: reconciliations, reporting and analysis
  • Healthcare: administrative workflows and clinical support
  • Legal and compliance: document review and policy interpretation
  • Operations: structured reporting and decision support
  • Data teams: summaries drawn from large, messy datasets

In each case, the central value proposition is not creativity. It is reduced ambiguity.

Fundraising signals confidence in reliability-first AI

The $9 million seed round led by Andreessen Horowitz gives Probably the resources to keep developing its core system and push beyond its first product category. While seed-stage funding is still early, the involvement of one of the most prominent venture firms in AI suggests that reliability tooling is becoming a meaningful investment theme.

Investors have spent the last few years pouring capital into foundational model companies, infrastructure providers and application startups. More recently, attention has shifted toward products that help companies actually use those models safely and economically. Verification, observability, governance and evaluation are all gaining prominence as businesses move from experimentation to production.

Probably fits neatly into that trend. It is not trying to out-frontier the frontier models. It is trying to make them more useful.

Key item | Details
Company | Probably
Founder | Peter Elias
Funding raised | $9 million seed round
Lead investor | Andreessen Horowitz
Initial product | Data science tool with citations and audit trail
Core technique | Deterministic validator system and harness engineering
Target outcome | Near-deterministic accuracy and fewer hallucinations
Potential expansion | Accounting, medical services and other precision-sensitive workflows

The bigger question: can AI be engineered for certainty?

Probably’s approach raises a broader question about the future of AI. Is the long-term answer to hallucinations simply more powerful models, or is it better engineering discipline around the models we already have?

The startup’s answer is the latter. It suggests that a great deal of current AI weakness comes not from the inability of models to generate information, but from the failure of systems to constrain and verify that information well enough before it is delivered.

That idea may sound incremental, but it could have significant consequences. If the industry can learn to build AI systems that are more like rigorously controlled software components and less like freewheeling conversational engines, the addressable market expands dramatically. More enterprises may be willing to adopt AI when the downside risk is reduced.

At the same time, the work is hard. Validation layers need to be robust, dataset-aware and fast enough not to destroy the user experience. Training a model to cooperate with a deterministic checker can also introduce new engineering tradeoffs. The system has to be both strict and useful, which is not always easy to balance.

Still, the direction is clear. As the AI market matures, the premium may shift from raw capability to dependable execution.

What Probably’s raise says about the next phase of AI

Probably’s funding round arrives as companies across the sector are being forced to confront the limits of model-centric thinking. The early hype around generative AI focused on scale and novelty. The next phase is likely to be shaped by reliability, unit economics and integration into real business workflows.

That is where a startup like Probably could matter. If it succeeds, it may help define a new category of AI infrastructure: systems designed not to wow users with endless flexibility, but to ensure every answer stays inside the lines.

For customers operating in high-stakes environments, that is not a small improvement. It is the difference between a toy and a tool.

And for an industry still learning how to manage the risks of generative AI, that distinction may turn out to be one of the most important in the market.

Timeline of Probably’s current pitch

The company’s story is still early, but the sequence of events around its first public funding announcement helps clarify its strategy.

Stage | What happened | Why it matters
Problem identified | LLM hallucinations remained a persistent production issue | Created demand for more reliable AI systems
Product built | Probably developed a data science tool with citations and audit trails | Demonstrated the company’s verification approach
System designed | Deterministic validators and harness engineering were added | Reduced ambiguity and improved output trustworthiness
Efficiency improved | The tool ran on a smaller model than frontier alternatives | Lowered compute and token costs
Seed funding closed | Andreessen Horowitz backed the startup with $9 million | Gave the company capital to expand its platform

For now, Probably is making a focused bet that the future of enterprise AI belongs to systems that can prove what they know. If that bet pays off, the company may not just reduce hallucinations. It may help redefine what “trustworthy AI” looks like in production.
