Where AI Agents Are Actually Working (Not Just Demoing)


Every AI company is building agents. Every demo shows autonomous systems handling complex workflows. Every pitch deck promises human-level task completion.

The reality is more nuanced. AI agents work well in some contexts and poorly in others. Understanding the difference matters more than following the hype.

What We Mean by “Agents”

The term gets used loosely. For this analysis, an “AI agent” is a system that:

  • Takes actions in external systems
  • Operates with some degree of autonomy (not just answering questions)
  • Handles multi-step tasks without continuous human guidance
  • Can adapt to unexpected situations within its domain

This excludes chatbots (they answer questions but don’t take actions) and simple automation (it triggers fixed responses but doesn’t adapt).
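
In code, that definition reduces to a loop: decide on an action, execute it against an external system, observe what happened, and adapt. Here is a minimal sketch; the rule-based `policy` and the stub `search` tool are illustrative stand-ins for the model and integrations a real agent would use, not any particular framework’s API:

```python
# Minimal agent loop: decide, act on an external system, observe the
# result, adapt. `policy` stands in for a model; `search` for a real tool.

def policy(goal: str, history: list) -> tuple[str, dict]:
    """Pick the next action. A real agent would call a model here."""
    if not history:
        return "search", {"query": goal}
    return "finish", {"summary": f"{len(history)} result(s) for: {goal}"}

def search(query: str) -> str:
    return f"stub results for '{query}'"  # stand-in for a live integration

TOOLS = {"search": search}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = []
    for _ in range(max_steps):
        action, args = policy(goal, history)
        if action == "finish":
            return args["summary"]
        observation = TOOLS[action](**args)          # take an action
        history.append((action, args, observation))  # adapt to the result
    return "escalated: step budget exhausted"        # know when to stop

print(run_agent("competitor pricing pages"))
```

A chatbot answers and stops; an automation fires the same response every time. What makes the loop above an agent is that its next step depends on what the last one did.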

Where Agents Are Working

Software Development

AI coding agents represent the most mature agent category. Examples:

  • GitHub Copilot Workspace (entire feature implementation)
  • Cursor and similar IDE-integrated agents
  • Various specialised debugging and testing agents

Why it works:

  • Clear success criteria (code compiles, tests pass)
  • Immediate feedback loops
  • Rich training data (billions of lines of public code)
  • Actions are reversible (version control)
  • Limited real-world consequences from errors

Production adoption is real. Engineering teams report 20-40% productivity gains on certain task types, though not on all of them: agents still struggle with architectural decisions, ambiguous requirements, and novel problem spaces.
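
Those properties combine into a simple outer loop: propose a change, check it against an objective signal, keep it only if the check passes. A hedged sketch, assuming a git repository and a pytest suite; `propose_patch` and `apply_patch` are hypothetical stand-ins for the model call and diff application:

```python
# The outer loop most coding agents share: propose a patch, run the
# tests, keep the change only if they pass. Version control makes
# every failed attempt cheap to discard.
import subprocess

def propose_patch(task: str) -> str:
    """Hypothetical model call: return a diff for the task (stubbed)."""
    return ""

def apply_patch(diff: str) -> None:
    """Apply the proposed diff to the working tree (stubbed)."""

def tests_pass() -> bool:
    # Objective success criterion: the suite either passes or it doesn't.
    return subprocess.run(["pytest", "-q"]).returncode == 0

def attempt(task: str, max_tries: int = 3) -> bool:
    for _ in range(max_tries):
        apply_patch(propose_patch(task))
        if tests_pass():
            subprocess.run(["git", "commit", "-am", f"agent: {task}"])
            return True
        # Reversibility: throw the failed attempt away and try again.
        subprocess.run(["git", "checkout", "--", "."])
    return False  # after repeated failures, escalate to a human
```

Note the design choice: the agent never grades its own work. The test suite and version control do.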

Customer Service Triage

Not full customer service automation—that remains problematic—but triage and routing:

  • Classifying incoming requests
  • Gathering initial information
  • Routing to appropriate human agents
  • Handling simple, well-defined inquiries

Why it works:

  • Large volumes of historical examples
  • Limited action space (classify, route, respond to simple queries)
  • Easy escalation path to humans (see the sketch below)
  • Errors are recoverable (human reviews and corrects)
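
A minimal sketch of that shape. Keyword rules stand in for whatever classifier a production system would use, and the queue names and confidence threshold are illustrative assumptions:

```python
# Triage sketch: a small action space (auto-reply, route, escalate) and
# an explicit path back to humans whenever confidence is low.

QUEUES = {"billing": "billing-team", "bug": "support-eng", "howto": "tier1"}
CANNED = {"howto": "Our docs cover this: <link>"}

def classify(ticket: str) -> tuple[str, float]:
    """Return (category, confidence). Keyword rules stand in for a model."""
    for word, cat in (("invoice", "billing"), ("crash", "bug"), ("how do", "howto")):
        if word in ticket.lower():
            return cat, 0.9
    return "unknown", 0.2

def triage(ticket: str) -> str:
    category, confidence = classify(ticket)
    if confidence < 0.7 or category not in QUEUES:
        return "escalate: human review"           # easy path to a person
    if category in CANNED:
        return f"auto-reply: {CANNED[category]}"  # simple, well-defined inquiry
    return f"route to {QUEUES[category]}"         # humans handle the rest

print(triage("My invoice is wrong"))  # route to billing-team
print(triage("How do I export?"))     # auto-reply: ...
print(triage("Something feels off"))  # escalate: human review
```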

Data Processing and Analysis

Agents that clean, transform, and analyse data with natural language instructions:

  • Extracting information from documents
  • Reconciling data across systems
  • Generating reports and visualisations
  • Identifying anomalies

Why it works:

  • Operations are verifiable (see the sketch below)
  • Undo is typically possible
  • Training data is abundant
  • Accuracy metrics are clear
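
Verifiability is the crux, and it is straightforward to make concrete: check the agent’s output against explicit rules before anything downstream consumes it, and hold failures for human review rather than writing them silently. A minimal sketch; the invoice fields and validation rules are illustrative assumptions:

```python
# Verifiable extraction: check the agent's output against explicit rules
# before anything downstream consumes it; hold failures for review.
import re

def extract_invoice(text: str) -> dict:
    """Stub extractor; a real agent would call a model here."""
    number = re.search(r"INV-\d+", text)
    amount = re.search(r"\$([\d,]+\.\d{2})", text)
    return {
        "invoice": number.group(0) if number else None,
        "amount": float(amount.group(1).replace(",", "")) if amount else None,
    }

def verify(record: dict) -> list[str]:
    """Explicit accuracy checks, applied before the record is accepted."""
    errors = []
    if not record["invoice"]:
        errors.append("missing invoice number")
    if record["amount"] is None or record["amount"] <= 0:
        errors.append("missing or non-positive amount")
    return errors

doc = "Invoice INV-1042: total due $1,284.50"
record = extract_invoice(doc)
problems = verify(record)
# Accept verified records; queue everything else for a human.
print(record if not problems else f"hold for review: {problems}")
```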

Research and Information Gathering

Agents that search, compile, and summarise information:

  • Competitive analysis
  • Market research compilation
  • Literature review assistance
  • Due diligence data gathering

Why it works:

  • Output is advisory, not operational
  • Errors are identifiable through review
  • Complements rather than replaces human judgment
  • Low consequence of mistakes

Where Agents Struggle

High-Stakes Single Decisions

Agents that make irreversible, consequential decisions without human review:

  • Direct financial transactions
  • Medical treatment selection
  • Legal document finalisation
  • Personnel decisions

Why it fails:

  • Error cost too high
  • No undo option
  • Regulatory and liability concerns
  • Edge cases are dangerous

Dynamic Physical Environments

Agents controlling physical systems in unpredictable settings:

  • General-purpose robotics
  • Autonomous driving (edge cases)
  • Complex manufacturing with variability

Why it fails:

  • Physical world is messier than digital
  • Novel situations common
  • Sensor limitations
  • Safety criticality

Long-Horizon Planning

Agents handling projects spanning weeks or months:

  • Strategic planning
  • Complex project management
  • Relationship management over time

Why it fails:

  • Goal drift over time
  • Context windows insufficient
  • Need for human judgment on trade-offs
  • Accountability for long-term outcomes

Creative/Judgment Tasks

Agents making decisions requiring taste, ethics, or social judgment:

  • Content moderation (edge cases)
  • Design decisions
  • Interpersonal communication
  • Strategy formulation

Why it fails:

  • No ground truth to train against
  • Subjective evaluation
  • Cultural and contextual nuance
  • Consequences of misalignment

Patterns for Success

Examining where agents work reveals common characteristics:

Clear success criteria. Agents need to know when they’ve succeeded. “Did the code compile?” is clear. “Is this marketing creative good?” is not.

Reversibility. Successful agent domains typically allow undoing mistakes. Version control for code, draft states for documents, confirmation loops for actions.

Abundant training data. Agents perform better on tasks with rich historical examples than in novel domains.

Narrow scope. Successful agents do one thing well rather than handling broad task categories.

Human-in-the-loop design. Even successful agents typically include checkpoints where humans review and approve.

Graceful degradation. Good agent systems recognise their limits and escalate appropriately.
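
Several of these patterns compose in a single structure: the agent proposes an action, a human approves it at a checkpoint, and every applied action records how to reverse itself. A minimal sketch; the toy inventory and the `approve` callback are illustrative assumptions:

```python
# Human-in-the-loop checkpoint with reversibility: the agent proposes,
# a person approves, and every applied action records how to undo it.

applied = []  # stack of (description, undo_fn): the undo path

def checkpoint(description: str, do_fn, undo_fn, approve) -> bool:
    """Apply an action only if the reviewer approves, and remember the undo."""
    if not approve(description):
        return False  # graceful degradation: declined actions simply don't run
    do_fn()
    applied.append((description, undo_fn))
    return True

def undo_last() -> None:
    """Reverse the most recent applied action."""
    if applied:
        description, undo_fn = applied.pop()
        undo_fn()
        print(f"reverted: {description}")

# Illustrative usage against a toy in-memory system.
inventory = {"widgets": 10}
checkpoint(
    "decrement widgets by 2",
    do_fn=lambda: inventory.update(widgets=inventory["widgets"] - 2),
    undo_fn=lambda: inventory.update(widgets=inventory["widgets"] + 2),
    approve=lambda description: True,  # stand-in for a real review UI
)
print(inventory)  # {'widgets': 8}
undo_last()
print(inventory)  # {'widgets': 10}
```

The `approve` callback stands in for a real review step. The structural point is that nothing happens silently and nothing applied is unrecoverable.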

Business Implications

For organisations evaluating agent adoption:

Start with constrained domains. Internal tools, development workflows, data processing—areas where agents have proven track records.

Design for human oversight. Assume agents will make mistakes. Build systems that catch and correct errors.

Measure carefully. Agent demos are impressive. Production metrics often tell different stories. Pilot extensively before committing.

Plan for hybrid workflows. The most effective approaches typically combine agent automation with human judgment at key points.

Watch the vendors. Claims about autonomous agent capabilities often exceed reality. Verify with reference customers and your own testing.

Looking Forward

Agent capabilities will improve. More tasks will become automatable. But the fundamental patterns—where autonomous systems work well versus poorly—will likely persist.

For businesses building AI strategy, the path forward isn’t waiting for universal agent capability. It’s identifying specific workflows where current agent capabilities align with business needs, implementing carefully, and expanding from proven success.

The hype suggests agents will do everything. The reality is they’ll do specific things well. Understanding the difference determines whether agent investments pay off.