AI Safety Research Has Gaps That Enterprise Leaders Should Know About
AI safety research has made genuine progress over the past two years. Techniques like constitutional AI, reinforcement learning from human feedback, red teaming, and automated adversarial testing have all improved. But if you’re an enterprise leader deploying AI systems in production, you need to know that significant gaps remain — gaps that directly affect the reliability and trustworthiness of the systems your teams are building.
This isn’t an alarmist take. I’m not worried about superintelligent AI going rogue. I’m worried about the mundane but consequential failure modes that current safety research hasn’t adequately addressed for enterprise use cases.
The Hallucination Problem Is Not Solved
Let’s start with the most obvious gap. Despite what some marketing materials suggest, hallucination in large language models remains a fundamental problem, not a bug that’s being incrementally fixed.
Yes, retrieval-augmented generation (RAG) reduces hallucination rates. Yes, grounding techniques help. Yes, reasoning models with chain-of-thought are more reliable. But “reduced” is not “eliminated,” and for enterprise applications where factual accuracy matters — legal analysis, medical information, financial reporting, regulatory compliance — even a 2-3% hallucination rate is a serious problem.
The research community is split on whether hallucination can be fully solved within the current transformer architecture, or whether it’s an inherent property of how these models generate text. For enterprise leaders, the practical implication is the same either way: you need verification layers, human review processes, and confidence scoring in any production system that generates factual claims.
Adversarial Robustness in Business Contexts
Most AI safety research on adversarial attacks focuses on dramatic scenarios — jailbreaking chatbots, generating harmful content, bypassing content filters. These are important, but they’re not the adversarial risks most enterprises actually face.
The real threats for business AI systems are subtler. Competitors or bad actors feeding misleading data into your RAG pipeline. Employees gaming AI-powered performance evaluation systems. Customers manipulating AI-driven pricing algorithms. Suppliers submitting invoices designed to exploit weaknesses in your AI-powered accounts payable system.
Research on these “mundane adversarial” scenarios is surprisingly thin. The security community and the AI safety community don’t overlap enough, and most published research focuses on attacks against the models themselves rather than attacks against the business systems built around them.
If you’re deploying AI in any context where there’s a financial incentive for someone to manipulate the system, you need to invest in security testing that goes beyond standard AI safety benchmarks. This means adversarial testing designed around your specific business logic, not generic red teaming. The teams at firms like Anthropic are doing important foundational work here, but translating that into enterprise-specific security practices is still largely a DIY exercise.
The Evaluation Gap
Here’s a gap that doesn’t get discussed enough: we don’t have good methods for evaluating AI system behaviour in complex, long-running enterprise workflows.
Standard AI evaluation relies on benchmarks — curated test sets with known correct answers. This works fine for isolated tasks. But enterprise AI deployments involve chains of actions, accumulated context, and decisions that depend on previous decisions. An AI system that scores 95% on a benchmark might still make correlated errors in production that cascade through a multi-step workflow, causing problems that no benchmark would predict.
The MLCommons AI Safety benchmark is a good start, but it’s designed for evaluating models, not systems. What enterprises need is tooling for evaluating end-to-end AI workflows under realistic conditions, including edge cases, adversarial inputs, and failure scenarios specific to their domain.
Bias Is Getting Better, But Slowly
Bias detection and mitigation have improved, but mostly for the biases that are easy to measure — gender bias in language, racial bias in image generation, age bias in resume screening. These are important, and the progress is real.
But bias in enterprise AI often manifests in ways that standard fairness metrics don’t capture. A lending model might be technically fair across protected categories while still systematically disadvantaging businesses in certain industries or geographic areas. A hiring AI might pass fairness audits while subtly penalising candidates with non-traditional career paths.
Detecting these domain-specific biases requires deep understanding of the business context, the data, and the specific ways that unfairness can manifest in a particular industry. This is fundamentally a human expert problem, not an automated testing problem.
What Enterprises Should Do
I’m not suggesting that enterprises should avoid deploying AI until all safety problems are solved — that would mean never deploying AI, since some of these problems may take a decade or more to fully resolve.
Instead, the practical approach is:
Build verification and human review into every production AI system, especially for high-stakes decisions. The cost of human review is real, but it’s almost always less than the cost of a consequential AI error.
Invest in monitoring that goes beyond model accuracy metrics. Track downstream business outcomes, look for patterns in AI system behaviour over time, and build alerting for anomalies that standard metrics would miss.
Test adversarially with your specific business context in mind. Generic AI safety testing is necessary but not sufficient.
Be honest with stakeholders about what AI can and cannot do reliably. Overpromising on AI safety leads to under-investment in the human oversight that keeps systems trustworthy.
The safety research will catch up eventually. But the gap between where the research is today and what enterprise deployments require is real, and pretending otherwise doesn’t serve anyone.