Where AI Agents Are Actually Working (Not Just Demoing)


Every AI company is building agents. Every demo shows autonomous systems handling complex workflows. Every pitch deck promises human-level task completion.

The reality is more nuanced. AI agents work well in some contexts and poorly in others. Understanding the difference matters more than following the hype.

What We Mean by “Agents”

The term gets used loosely. For this analysis, an “AI agent” is a system that:

  • Takes actions in external systems
  • Operates with some degree of autonomy (not just answering questions)
  • Handles multi-step tasks without continuous human guidance
  • Can adapt to unexpected situations within its domain

This excludes chatbots (they answer questions but don’t take actions) and simple automation (it triggers fixed responses but doesn’t adapt).
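
In code, that definition reduces to a loop: decide on an action, execute it against an external system, observe what happened, and adapt. Here is a minimal sketch; the rule-based `policy` and the stub `search` tool are illustrative stand-ins for the model and integrations a real agent would use, not any particular framework’s API:

```python
# Minimal agent loop: decide, act on an external system, observe the
# result, adapt. `policy` stands in for a model; `search` for a real tool.

def policy(goal: str, history: list) -> tuple[str, dict]:
    """Pick the next action. A real agent would call a model here."""
    if not history:
        return "search", {"query": goal}
    return "finish", {"summary": f"{len(history)} result(s) for: {goal}"}

def search(query: str) -> str:
    return f"stub results for '{query}'"  # stand-in for a live integration

TOOLS = {"search": search}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = []
    for _ in range(max_steps):
        action, args = policy(goal, history)
        if action == "finish":
            return args["summary"]
        observation = TOOLS[action](**args)          # take an action
        history.append((action, args, observation))  # adapt to the result
    return "escalated: step budget exhausted"        # know when to stop

print(run_agent("competitor pricing pages"))
```

A chatbot answers and stops; an automation fires the same response every time. What makes the loop above an agent is that its next step depends on what the last one did.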

Where Agents Are Working

Software Development

AI coding agents represent the most mature agent category. Examples:

  • GitHub Copilot Workspace (entire feature implementation)
  • Cursor and similar IDE-integrated agents
  • Various specialised debugging and testing agents

Why it works:

  • Clear success criteria (code compiles, tests pass)
  • Immediate feedback loops
  • Rich training data (billions of lines of public code)
  • Actions are reversible (version control)
  • Limited real-world consequences from errors

Production adoption is real. Engineering teams report 20-40% productivity gains on certain task types, though not on all of them: agents still struggle with architectural decisions, ambiguous requirements, and novel problem spaces.
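
Those properties combine into a simple outer loop: propose a change, check it against an objective signal, keep it only if the check passes. A hedged sketch, assuming a git repository and a pytest suite; `propose_patch` and `apply_patch` are hypothetical stand-ins for the model call and diff application:

```python
# The outer loop most coding agents share: propose a patch, run the
# tests, keep the change only if they pass. Version control makes
# every failed attempt cheap to discard.
import subprocess

def propose_patch(task: str) -> str:
    """Hypothetical model call: return a diff for the task (stubbed)."""
    return ""

def apply_patch(diff: str) -> None:
    """Apply the proposed diff to the working tree (stubbed)."""

def tests_pass() -> bool:
    # Objective success criterion: the suite either passes or it doesn't.
    return subprocess.run(["pytest", "-q"]).returncode == 0

def attempt(task: str, max_tries: int = 3) -> bool:
    for _ in range(max_tries):
        apply_patch(propose_patch(task))
        if tests_pass():
            subprocess.run(["git", "commit", "-am", f"agent: {task}"])
            return True
        # Reversibility: throw the failed attempt away and try again.
        subprocess.run(["git", "checkout", "--", "."])
    return False  # after repeated failures, escalate to a human
```

Note the design choice: the agent never grades its own work. The test suite and version control do.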

Customer Service Triage

Not full customer service automation—that remains problematic—but triage and routing:

  • Classifying incoming requests
  • Gathering initial information
  • Routing to appropriate human agents
  • Handling simple, well-defined inquiries

Why it works:

  • Large volumes of historical examples
  • Limited action space (classify, route, respond to simple queries)
  • Easy escalation path to humans (see the sketch below)
  • Errors are recoverable (human reviews and corrects)
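
A minimal sketch of that shape. Keyword rules stand in for whatever classifier a production system would use, and the queue names and confidence threshold are illustrative assumptions:

```python
# Triage sketch: a small action space (auto-reply, route, escalate) and
# an explicit path back to humans whenever confidence is low.

QUEUES = {"billing": "billing-team", "bug": "support-eng", "howto": "tier1"}
CANNED = {"howto": "Our docs cover this: <link>"}

def classify(ticket: str) -> tuple[str, float]:
    """Return (category, confidence). Keyword rules stand in for a model."""
    for word, cat in (("invoice", "billing"), ("crash", "bug"), ("how do", "howto")):
        if word in ticket.lower():
            return cat, 0.9
    return "unknown", 0.2

def triage(ticket: str) -> str:
    category, confidence = classify(ticket)
    if confidence < 0.7 or category not in QUEUES:
        return "escalate: human review"           # easy path to a person
    if category in CANNED:
        return f"auto-reply: {CANNED[category]}"  # simple, well-defined inquiry
    return f"route to {QUEUES[category]}"         # humans handle the rest

print(triage("My invoice is wrong"))  # route to billing-team
print(triage("How do I export?"))     # auto-reply: ...
print(triage("Something feels off"))  # escalate: human review
```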

Data Processing and Analysis

Agents that clean, transform, and analyse data with natural language instructions:

  • Extracting information from documents
  • Reconciling data across systems
  • Generating reports and visualisations
  • Identifying anomalies

Why it works:

  • Operations are verifiable (see the sketch below)
  • Undo is typically possible
  • Training data is abundant
  • Accuracy metrics are clear
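
Verifiability is the crux, and it is straightforward to make concrete: check the agent’s output against explicit rules before anything downstream consumes it, and hold failures for human review rather than writing them silently. A minimal sketch; the invoice fields and validation rules are illustrative assumptions:

```python
# Verifiable extraction: check the agent's output against explicit rules
# before anything downstream consumes it; hold failures for review.
import re

def extract_invoice(text: str) -> dict:
    """Stub extractor; a real agent would call a model here."""
    number = re.search(r"INV-\d+", text)
    amount = re.search(r"\$([\d,]+\.\d{2})", text)
    return {
        "invoice": number.group(0) if number else None,
        "amount": float(amount.group(1).replace(",", "")) if amount else None,
    }

def verify(record: dict) -> list[str]:
    """Explicit accuracy checks, applied before the record is accepted."""
    errors = []
    if not record["invoice"]:
        errors.append("missing invoice number")
    if record["amount"] is None or record["amount"] <= 0:
        errors.append("missing or non-positive amount")
    return errors

doc = "Invoice INV-1042: total due $1,284.50"
record = extract_invoice(doc)
problems = verify(record)
# Accept verified records; queue everything else for a human.
print(record if not problems else f"hold for review: {problems}")
```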

Research and Information Gathering

Agents that search, compile, and summarise information:

  • Competitive analysis
  • Market research compilation
  • Literature review assistance
  • Due diligence data gathering

Why it works:

  • Output is advisory, not operational
  • Errors are identifiable through review
  • Complements rather than replaces human judgment
  • Low consequence of mistakes

Where Agents Struggle

High-Stakes Single Decisions

Agents that make irreversible, consequential decisions without human review:

  • Direct financial transactions
  • Medical treatment selection
  • Legal document finalisation
  • Personnel decisions

Why it fails:

  • Error cost too high
  • No undo option
  • Regulatory and liability concerns
  • Edge cases are dangerous

Dynamic Physical Environments

Agents controlling physical systems in unpredictable settings:

  • General-purpose robotics
  • Autonomous driving (edge cases)
  • Complex manufacturing with variability

Why it fails:

  • Physical world is messier than digital
  • Novel situations common
  • Sensor limitations
  • Safety criticality

Long-Horizon Planning

Agents handling projects spanning weeks or months:

  • Strategic planning
  • Complex project management
  • Relationship management over time

Why it fails:

  • Goal drift over time
  • Context windows insufficient
  • Need for human judgment on trade-offs
  • Accountability for long-term outcomes

Creative/Judgment Tasks

Agents making decisions requiring taste, ethics, or social judgment:

  • Content moderation (edge cases)
  • Design decisions
  • Interpersonal communication
  • Strategy formulation

Why it fails:

  • No ground truth to train against
  • Subjective evaluation
  • Cultural and contextual nuance
  • Consequences of misalignment

Patterns for Success

Examining where agents work reveals common characteristics:

Clear success criteria. Agents need to know when they’ve succeeded. “Did the code compile?” is clear. “Is this marketing creative good?” is not.

Reversibility. Successful agent domains typically allow undoing mistakes. Version control for code, draft states for documents, confirmation loops for actions.

Abundant training data. Agents perform better on tasks with rich historical examples than in novel domains.

Narrow scope. Successful agents do one thing well rather than handling broad task categories.

Human-in-the-loop design. Even successful agents typically include checkpoints where humans review and approve.

Graceful degradation. Good agent systems recognise their limits and escalate appropriately.
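
Several of these patterns compose in a single structure: the agent proposes an action, a human approves it at a checkpoint, and every applied action records how to reverse itself. A minimal sketch; the toy inventory and the `approve` callback are illustrative assumptions:

```python
# Human-in-the-loop checkpoint with reversibility: the agent proposes,
# a person approves, and every applied action records how to undo it.

applied = []  # stack of (description, undo_fn): the undo path

def checkpoint(description: str, do_fn, undo_fn, approve) -> bool:
    """Apply an action only if the reviewer approves, and remember the undo."""
    if not approve(description):
        return False  # graceful degradation: declined actions simply don't run
    do_fn()
    applied.append((description, undo_fn))
    return True

def undo_last() -> None:
    """Reverse the most recent applied action."""
    if applied:
        description, undo_fn = applied.pop()
        undo_fn()
        print(f"reverted: {description}")

# Illustrative usage against a toy in-memory system.
inventory = {"widgets": 10}
checkpoint(
    "decrement widgets by 2",
    do_fn=lambda: inventory.update(widgets=inventory["widgets"] - 2),
    undo_fn=lambda: inventory.update(widgets=inventory["widgets"] + 2),
    approve=lambda description: True,  # stand-in for a real review UI
)
print(inventory)  # {'widgets': 8}
undo_last()
print(inventory)  # {'widgets': 10}
```

The `approve` callback stands in for a real review step. The structural point is that nothing happens silently and nothing applied is unrecoverable.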

Business Implications

For organisations evaluating agent adoption:

Start with constrained domains. Internal tools, development workflows, data processing—areas where agents have proven track records.

Design for human oversight. Assume agents will make mistakes. Build systems that catch and correct errors.

Measure carefully. Agent demos are impressive. Production metrics often tell different stories. Pilot extensively before committing.

Plan for hybrid workflows. The most effective approaches typically combine agent automation with human judgment at key points.

Watch the vendors. Claims about autonomous agent capabilities often exceed reality. Verify with reference customers and your own testing.

Looking Forward

Agent capabilities will improve. More tasks will become automatable. But the fundamental patterns—where autonomous systems work well versus poorly—will likely persist.

For businesses building AI strategy, the path forward isn’t waiting for universal agent capability. It’s identifying specific workflows where current agent capabilities align with business needs, implementing carefully, and expanding from proven success.

The hype suggests agents will do everything. The reality is they’ll do specific things well. Understanding the difference determines whether agent investments pay off.