Advanced Patterns

Advanced implementation patterns for BiModal Design, including defensive safety constraints, Human-in-the-Loop confirmation, and bypassing live DOM navigation.

These advanced patterns address complex agent interactions identified by benchmarks like ST-WebAgentBench, τ-bench, and WebVoyager, and provide standard mechanisms for secure desktop execution and MCP web discovery.

Secure Desktop Execution (Local MCP)

Anthropic's Claude Cowork architecture demonstrates that powerful desktop agents require a robust security boundary. By executing code and tool calls inside a Virtual Machine (with only the user's selected folder mounted) and using Local MCP servers as the bridge to the host file system, it prevents malicious or unintended system-wide modifications.

The agent loop itself runs outside the VM so it can recover if the VM restarts, while execution stays contained. Local MCP servers are treated as user-installed software with strictly scoped access, solving the containment problem.

MCP Server Discovery

<!-- Layer 5: MCP Server Discovery -->
<link rel="alternate" type="application/mcp+json" href="/mcp-server" />

Pattern 6: Defensive Form Constraints (Safety)

Agents evaluated in benchmarks like ST-WebAgentBench can make destructive mistakes if inputs aren't constrained. Use standard HTML5 validation to proactively guide agent tool use:

<form id="transfer-funds-form">
  <!-- The pattern and min/max explicitly constrain agent behavior before submission -->
  <input
    type="text"
    name="account"
    required
    pattern="[0-9]{10}"
    aria-label="10-digit Account Number"
  />
  <input
    type="number"
    name="amount"
    required
    min="1"
    max="5000"
    aria-label="Transfer Amount (Max 5000)"
  />
  <button type="submit" aria-label="Confirm Transfer">Transfer</button>
</form>
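HTML constraint validation only runs in the browser, so an agent that posts to the endpoint directly never encounters it. The sketch below mirrors the same rules on the server, assuming the fields arrive as strings. `validateTransfer` and its result shape are hypothetical names, not part of any framework, and the amount check is slightly stricter than the browser's (whole numbers written as plain digits, matching the default `step` of 1).

```typescript
// Hypothetical server-side mirror of the transfer form's constraints.
type TransferCheck =
  | { ok: true; account: string; amount: number }
  | { ok: false; errors: string[] };

function validateTransfer(fields: { account?: string; amount?: string }): TransferCheck {
  const errors: string[] = [];
  const account = fields.account ?? '';
  const rawAmount = (fields.amount ?? '').trim();
  const amount = Number(rawAmount);

  // Mirrors required + pattern="[0-9]{10}" (the pattern must match the whole value)
  if (!/^[0-9]{10}$/.test(account)) {
    errors.push('account must be exactly 10 digits');
  }

  // Mirrors required + type="number" with the default step of 1
  if (!/^[0-9]+$/.test(rawAmount)) {
    errors.push('amount must be a whole number');
  } else if (amount < 1 || amount > 5000) {
    // Mirrors min="1" and max="5000"
    errors.push('amount must be between 1 and 5000');
  }

  return errors.length > 0 ? { ok: false, errors } : { ok: true, account, amount };
}
```

Returning every violated rule at once, rather than failing on the first, gives an agent enough information to correct its input in a single retry.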

Pattern 7: Human-in-the-Loop Confirmation (τ-bench)

When an agent attempts a critical action, use <dialog aria-modal="true"> and schema.org/ConfirmAction to pause execution and prompt the human. This satisfies multi-turn interaction requirements highlighted by τ-bench.

<!-- The dialog blocks the rest of the UI, explicitly pausing the agent -->
<dialog
  id="confirmation-modal"
  aria-modal="true"
  aria-labelledby="confirm-title"
  itemscope
  itemtype="https://schema.org/ConfirmAction"
>
  <h2 id="confirm-title" itemprop="name">Confirm Transfer</h2>
  <p itemprop="description">
    Are you sure you want to transfer $5,000 to Account 1234567890?
  </p>

  <form method="dialog">
    <!-- The human or agent must interact with these explicit controls -->
    <button value="cancel" aria-label="Cancel transfer">Cancel</button>
    <button value="confirm" aria-label="Confirm transfer">Confirm</button>
  </form>
</dialog>

<script>
  // Show the modal explicitly to trigger a focus change and pause flow
  document.getElementById('confirmation-modal').showModal();
</script>
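After the dialog closes, the page still has to act on the human's choice. With `<form method="dialog">`, the clicked button's `value` becomes the dialog's `returnValue`, and dismissing with Esc leaves it at its previous value (an empty string by default). A minimal sketch of reading that decision follows. `DialogLike`, `decisionFrom`, and `onDecision` are hypothetical names, and the interface is a cut-down stand-in for the dialog element so the logic can be exercised without a browser.

```typescript
// Minimal shape of the <dialog> members this sketch relies on.
interface DialogLike {
  returnValue: string;
  addEventListener(type: 'close', listener: () => void): void;
}

type Decision = 'confirm' | 'cancel';

// Only an explicit "confirm" proceeds. Anything else counts as a cancel,
// including the empty returnValue left behind by an Esc dismissal.
function decisionFrom(returnValue: string): Decision {
  return returnValue === 'confirm' ? 'confirm' : 'cancel';
}

// Hypothetical wiring: the critical action runs only once the choice is known.
function onDecision(dialog: DialogLike, handle: (decision: Decision) => void): void {
  dialog.addEventListener('close', () => handle(decisionFrom(dialog.returnValue)));
}
```

Defaulting to cancel means an agent that dismisses the dialog by any route other than the Confirm button cannot trigger the transfer.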

Pattern 8: Bypass Live DOM with Structured Data (WebVoyager)

WebVoyager highlights that dynamic, multi-step UI navigation on live pages is a major failure point for agents. Avoid forcing agents to navigate dynamic DOM components (like custom dropdowns or modals) by exposing direct, parameterized routing via schema.org potentialAction.

<head>
  <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "WebSite",
      "url": "https://www.example.com/",
      "potentialAction": {
        "@type": "SearchAction",
        "target": "https://www.example.com/search?q={search_term}",
        "query-input": "required name=search_term"
      }
    }
  </script>
</head>
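On the agent side, using this markup comes down to filling the `{search_term}` placeholder in `target` and requesting the resulting URL. The sketch below shows one way to do that expansion. `expandTarget` is a hypothetical helper, and it assumes every placeholder is a simple `{name}` token as in the example above.

```typescript
// Expands a SearchAction "target" template such as
// "https://www.example.com/search?q={search_term}".
function expandTarget(template: string, values: { [name: string]: string }): string {
  return template.replace(/\{([^}]+)\}/g, (_match: string, name: string) => {
    if (!Object.prototype.hasOwnProperty.call(values, name)) {
      // Fail loudly rather than request a URL with an unfilled placeholder
      throw new Error('Missing value for template variable: ' + name);
    }
    // Percent-encode so spaces and reserved characters survive in the query string
    return encodeURIComponent(values[name]);
  });
}

const searchUrl = expandTarget('https://www.example.com/search?q={search_term}', {
  search_term: 'wireless headphones',
});
// searchUrl === 'https://www.example.com/search?q=wireless%20headphones'
```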

Pattern 10: MCP Async Tasks for Long-Running Operations

Benchmarks like OSWorld-Human (arXiv:2506.16042) show computer-use agents lose most of their time to end-to-end latency, with even top performers taking 1.4-2.7x more steps than a human reference trajectory. Forcing an agent through a long synchronous tool call makes this worse and risks the transport timing out before the work completes.

The MCP Tasks primitive (introduced in the 2025-11-25 spec, SEP-1686, currently experimental) addresses this with a "call-now, fetch-later" flow. Tasks are an augmentation of an existing tools/call, not a separate kind of tool you register: the client augments the call with a task, the server returns a task handle immediately, and the client polls tasks/get for status (working, then completed, failed, or cancelled) before retrieving the CallToolResult via tasks/result.

The Tasks API is experimental and may change. Pin to the 2025-11-25 spec version.

A server opts in by declaring the tasks capability and marking a tool task-augmentable via its execution field. Task state is persisted through a TaskStore (InMemoryTaskStore ships for reference; use a durable store in production).

import {
  InMemoryTaskStore,
  isTerminal,
  Server,
} from '@modelcontextprotocol/server';
import type {
  CallToolResult,
  CreateTaskOptions,
  CreateTaskResult,
  GetTaskPayloadResult,
  GetTaskResult,
  Tool,
} from '@modelcontextprotocol/server';

const taskStore = new InMemoryTaskStore();

const server = new Server(
  { name: 'enterprise-reporting-agent', version: '2.1.0' },
  {
    capabilities: {
      tools: {},
      tasks: { requests: { tools: { call: {} } } }, // declare task support
    },
  }
);

server.setRequestHandler(
  'tools/list',
  async (): Promise<{ tools: Tool[] }> => ({
    tools: [
      {
        name: 'generate_comprehensive_audit',
        description:
          'Runs a long audit across departments and returns a report',
        inputSchema: {
          type: 'object',
          properties: {
            startDate: { type: 'string' },
            endDate: { type: 'string' },
            departments: { type: 'array', items: { type: 'string' } },
          },
          required: ['startDate', 'endDate', 'departments'],
        },
        execution: { taskSupport: 'required' }, // 'optional' also allows sync calls
      },
    ],
  })
);

// Create the task, return the handle immediately, run the work in the background.
server.setRequestHandler(
  'tools/call',
  async (request, ctx): Promise<CallToolResult | CreateTaskResult> => {
    const { name, arguments: args } = request.params;
    const taskParams = (request.params._meta?.task ?? request.params.task) as
      | { ttl?: number; pollInterval?: number }
      | undefined;
    if (!taskParams) throw new Error(`Tool ${name} requires task mode`);

    const options: CreateTaskOptions = {
      ttl: taskParams.ttl,
      pollInterval: taskParams.pollInterval ?? 2000,
    };
    const task = await taskStore.createTask(
      options,
      ctx.mcpReq.id,
      request,
      ctx.sessionId
    );

    void (async () => {
      try {
        await taskStore.updateTaskStatus(task.taskId, 'working', 'Working...');
        const report = await runAudit(args); // your long-running work
        await taskStore.storeTaskResult(task.taskId, 'completed', {
          content: [{ type: 'text', text: report.summary }],
        });
      } catch (error) {
        await taskStore.storeTaskResult(task.taskId, 'failed', {
          content: [{ type: 'text', text: `Audit failed: ${String(error)}` }],
          isError: true,
        });
      }
    })();

    return { task }; // the handle, not the result
  }
);

server.setRequestHandler(
  'tasks/get',
  async (request): Promise<GetTaskResult> => {
    const task = await taskStore.getTask(request.params.taskId);
    if (!task) throw new Error(`Task ${request.params.taskId} not found`);
    return task;
  }
);

server.setRequestHandler(
  'tasks/result',
  async (request): Promise<GetTaskPayloadResult> => {
    const task = await taskStore.getTask(request.params.taskId);
    if (!task) throw new Error(`Task ${request.params.taskId} not found`);
    if (!isTerminal(task.status)) {
      throw new Error(
        `Task ${request.params.taskId} not finished; keep polling`
      );
    }
    return taskStore.getTaskResult(request.params.taskId);
  }
);

Transport and session wiring use standard Streamable HTTP; see the full runnable version in the MCP Async Tasks example. The SDK also exposes a higher-level helper, server.experimental.tasks.registerToolTask(...), documented in the MCP TypeScript SDK server guide (also experimental).
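The client half of the "call-now, fetch-later" flow can be sketched independently of any SDK. In the block below, `send` is a hypothetical transport function that issues one JSON-RPC request and resolves with its result. The message shapes follow the description in this section rather than a verified client API, the `ttl` value and `maxPolls` cap are illustrative, and only the four statuses named above are handled.

```typescript
// Hypothetical transport: sends one JSON-RPC request and resolves with its result.
type Send = (method: string, params: Record<string, unknown>) => Promise<any>;

interface PollOptions {
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real timers
  maxPolls?: number; // safety cap, an assumption of this sketch
}

const TERMINAL_STATUSES = ['completed', 'failed', 'cancelled'];

async function callToolAsTask(
  send: Send,
  name: string,
  args: Record<string, unknown>,
  options: PollOptions = {}
): Promise<unknown> {
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((done) => setTimeout(done, ms)));
  const maxPolls = options.maxPolls ?? 300;

  // 1. Augment an ordinary tools/call with a task; the server replies with a handle.
  const created = await send('tools/call', { name, arguments: args, task: { ttl: 600000 } });
  let task = created.task;
  const taskId: string = task.taskId;

  // 2. Poll tasks/get until the status is terminal, honoring the server's pollInterval.
  for (let polls = 0; TERMINAL_STATUSES.indexOf(task.status) === -1; polls++) {
    if (polls >= maxPolls) {
      throw new Error(`Task ${taskId} still ${task.status} after ${maxPolls} polls`);
    }
    await sleep(task.pollInterval ?? 2000);
    task = await send('tasks/get', { taskId });
  }

  // 3. Retrieve the stored CallToolResult (a failed task's result carries isError).
  return send('tasks/result', { taskId });
}
```

Injecting `sleep` keeps the loop testable, and the poll cap stops a stalled task from holding the agent indefinitely.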

Pattern 11: Agent-Accessible Web Components via ElementInternals

As applications increasingly rely on Web Components and the Shadow DOM for encapsulation, AI agents encounter a new interaction barrier. Shadow DOM hides the semantic meaning and state of custom elements from the Accessibility Object Model (AOM), turning components into "black boxes" for Level 2 browser-automation agents and Level 3 computer-use agents.

The ElementInternals API is the standards-first answer. attachInternals() lets a custom element communicate its role, accessible name, and ARIA state (like ariaExpanded or ariaChecked) directly to the AOM — no light-DOM ARIA scaffolding, no data-agent-* workarounds, no leaking encapsulation. Agents read the same semantic tree humans get through assistive technology.

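class AgentToggle extends HTMLElement {
  // Opt in to form participation so the internals can report a form value
  static get formAssociated() {
    return true;
  }

  constructor() {
    super();
    // Render the light-DOM text (the visible label) through a default slot
    this.attachShadow({ mode: 'open' }).innerHTML = '<slot></slot>';

    // 1. Attach internals to communicate with the AOM
    this._internals = this.attachInternals();

    // 2. Set the semantic role directly on the AOM
    this._internals.role = 'switch';

    // 3. Toggle on click and on Space/Enter so the switch is actually operable
    this.addEventListener('click', () => {
      this.checked = !this.checked;
    });
    this.addEventListener('keydown', (event) => {
      if (event.key === ' ' || event.key === 'Enter') {
        event.preventDefault();
        this.checked = !this.checked;
      }
    });
  }

  connectedCallback() {
    // 4. Ensure keyboard focusability. A custom element constructor must not
    //    add attributes, so this is done once the element is connected.
    if (!this.hasAttribute('tabindex')) {
      this.setAttribute('tabindex', '0');
    }

    // 5. Expose the accessible name to the AOM
    this._internals.ariaLabel = this.textContent.trim();
    this._updateState();
  }

  // Reflect the boolean `checked` attribute as a property
  get checked() {
    return this.hasAttribute('checked');
  }

  set checked(value) {
    this.toggleAttribute('checked', Boolean(value));
    this._updateState();
  }

  _updateState() {
    // 6. Keep the AOM state and the submitted form value in sync with `checked`
    this._internals.ariaChecked = this.checked ? 'true' : 'false';
    this._internals.setFormValue(this.checked ? 'on' : null);
  }
}
customElements.define('agent-toggle', AgentToggle);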

The element shows up in the light DOM as <agent-toggle>Enable Notifications</agent-toggle> — deceptively simple — but Level 2 and Level 3 agents querying the AOM find a fully-formed switch with a label and a current value. This is graceful degradation done with the platform, not against it.

Pattern 12: Trajectory Efficiency via Relational Navigation (Odysseys)

The Odysseys benchmark ("Benchmarking Web Agents on Realistic Long Horizon Tasks," arXiv:2604.24964) evaluates agents on multi-site, long-horizon workflows derived from real user browsing. Even top frontier models achieve only a 44.5% success rate — and their trajectory efficiency (how close their step count is to the human reference) sits at just 1.15%. Agents succeed by wandering, not by routing.

Standard HTML <link> relations in the document head give agents a deterministic shortcut. Instead of visually parsing a "Next" button buried in the rendered UI, an agent reads the topology directly from the head of the document.

<head>
  <!-- Expose logical topology to agents in the head -->
  <link rel="prev"       href="/checkout/step1"   title="Return to Cart" />
  <link rel="next"       href="/checkout/step3"   title="Proceed to Payment" />
  <link rel="collection" href="/account/orders"   title="View All Orders" />
</head>

Agents bypass visual parsing of the UI entirely to find the next step — they programmatically extract the URL from rel="next". The same applies to rel="prev" for backtracking and rel="collection" for jumping to the index page. Trajectory efficiency goes up; compounded navigation failures over long-running tasks go down.

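None of this is new machinery: rel="next" and rel="prev" are link types defined in the HTML standard, rel="collection" is an IANA-registered link relation (RFC 6573), and user agents have understood <link rel> for decades. The cost to implementers is one line per relation. The benefit to agents is removing an entire category of UI-traversal failures.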
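To make "programmatically extract" concrete, the sketch below pulls a rel-to-href map out of a page's markup. `readLinkRelations` is a hypothetical helper. It uses regular expressions and assumes double-quoted attributes so that it runs without a DOM, whereas an agent with a real parser would query `link[rel]` elements instead.

```typescript
// Collects rel -> href pairs from <link> tags in a document's markup.
function readLinkRelations(html: string): { [rel: string]: string } {
  const relations: { [rel: string]: string } = {};
  const linkTag = /<link\b([^>]*)>/gi;
  let tag: RegExpExecArray | null;

  while ((tag = linkTag.exec(html)) !== null) {
    // Gather this tag's attributes (double-quoted values only, by assumption)
    const attrs: { [name: string]: string } = {};
    const attr = /([\w-]+)\s*=\s*"([^"]*)"/g;
    let pair: RegExpExecArray | null;
    while ((pair = attr.exec(tag[1])) !== null) {
      attrs[pair[1].toLowerCase()] = pair[2];
    }

    if (attrs.rel && attrs.href) {
      // rel is a space-separated token list, so one tag can declare several relations
      for (const rel of attrs.rel.toLowerCase().split(/\s+/)) {
        if (rel) relations[rel] = attrs.href;
      }
    }
  }
  return relations;
}
```

With the checkout markup above, `readLinkRelations(html).next` yields `/checkout/step3`, which the agent can request directly instead of locating and clicking a button.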

Looking further ahead: the OSWorld benchmark shows computer-use agents (CUAs) moving from ~6% task completion sixteen months ago to ~45% today. As that trajectory continues toward solved UI execution, the bottleneck shifts from "can the agent click this" to "is the agent making the right decision." Patterns like relational navigation and ElementInternals prepare interfaces for that transition by giving agents the semantic structure they need to operate reliably once execution stops being the hard part.

Pattern 13: DOM Optimization for Browser-Use Agents

The 2026 surge in open-source browser automation — browser-use alone is past 97,000 GitHub stars — has surfaced a new bottleneck: context-window cost. When an LLM-driven agent connects to a browser, it must parse the DOM (or accessibility tree) into its context window before it can make decisions. Modern apps with deeply nested <div> soup, excessive wrappers, and inline SVGs without labels burn through that context window before the agent gets to the meaningful elements. Failure rates and latency climb together.

BiModal Design's answer is DOM pruning and semantic density: structure the page so that the markup an agent sees is short, semantic, and dense with meaning. Keep the tree shallow. Use real elements (<button>, <nav>, <main>) rather than divs with classes. Label everything once with ARIA, not via class name conventions agents have to learn.

✗ Bloated — wastes context

<div class="product-wrapper">
  <div class="product-inner">
    <div class="product-content">
      <div class="title-container">
        <span class="product-title">
          Wireless Headphones
        </span>
      </div>
      <div class="price-container">
        <div class="price-inner">
          <span>$99.99</span>
        </div>
      </div>
      <div class="action-container">
        <div class="btn-wrapper">
          <div class="btn" role="button"
               onclick="addToCart()">
            Add to Cart
          </div>
        </div>
      </div>
    </div>
  </div>
</div>

✓ Pruned — dense and parseable

<article aria-label="Wireless Headphones">
  <h2>Wireless Headphones</h2>
  <p>$99.99</p>
  <button type="button"
          aria-label="Add Wireless Headphones to cart">
    Add to Cart
  </button>
</article>

The pruned version uses roughly a third of the tokens, exposes the same semantic information, and tells browser-use-style agents exactly what kind of element they're looking at and what action it performs. Same human UX, drastically lower agent cost.
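A team that wants to track this in CI could start with two crude proxies: an approximate token count and the deepest element nesting. The sketch below is one such measurement under loud assumptions. The four-characters-per-token divisor is a rule of thumb, not the behavior of any particular tokenizer, and the regex-based tag scan ignores edge cases such as `>` inside attribute values. Both function names are hypothetical.

```typescript
// Rough size proxy: collapse whitespace, then assume about 4 characters per token.
function estimateTokens(markup: string): number {
  const collapsed = markup.replace(/\s+/g, ' ').trim();
  return Math.ceil(collapsed.length / 4);
}

// Deepest element nesting, skipping common void elements that never close.
function maxDepth(markup: string): number {
  const voidElements = ['br', 'hr', 'img', 'input', 'link', 'meta'];
  const tag = /<(\/?)([a-zA-Z][\w-]*)[^>]*>/g;
  let depth = 0;
  let deepest = 0;
  let match: RegExpExecArray | null;

  while ((match = tag.exec(markup)) !== null) {
    if (voidElements.indexOf(match[2].toLowerCase()) !== -1) continue;
    if (match[1] === '/') {
      depth -= 1; // closing tag
    } else {
      depth += 1; // opening tag
      deepest = Math.max(deepest, depth);
    }
  }
  return deepest;
}
```

Comparing both numbers for a component before and after a refactor gives a cheap regression signal, even though the absolute values mean little on their own.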

Pattern 14: Hybrid Agent Handoff (UI → MCP)

The most reliable agents in production aren't pure-pixel or pure-DOM — they're hybrid. They use UI traversal (Level 2 / Level 3) for discovery and reasoning, then switch to API or MCP calls (Level 4 / Level 5) for the actual transaction. Hybrid agents outperform pure-UI agents on both accuracy and latency, and they sidestep the production failure mode known as DOM selector drift — brittle CSS classes or structural rearrangements that break automation overnight.

Real-world evaluations show a stark gap between benchmark scores (~78% on WebArena) and production success rates (~22% in the wild) — selector drift is a primary cause. The BiModal Design answer is to make the UI a discovery surface and let the agent hand off the execution to a stable protocol. Three layers do the work together:

<head>
  <!-- LAYER 5 DISCOVERY: agent reads this and knows it can switch to MCP -->
  <link rel="alternate"
        type="application/mcp+json"
        href="https://api.acme.com/mcp" />

  <!-- LAYER 3 STRUCTURED DATA: agent gets the exact entity without DOM parsing -->
  <script type="application/ld+json">
    {
      "@context": "https://schema.org/",
      "@type": "Product",
      "sku": "SRV-PRO-99X",
      "name": "Acme Pro Server",
      "offers": {
        "@type": "Offer",
        "price": "1999.00",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock"
      }
    }
  </script>
</head>
<body>
  <!-- LAYER 2 SEMANTIC STRUCTURE: resilience if the agent stays in the UI.
       No reliance on classes like .btn-primary that selector drift will break. -->
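  <main role="main" aria-label="Product Details: Acme Pro Server">
    <!-- Text elements expose their own content. An aria-label on the heading
         would replace "Acme Pro Server" with the label in the accessibility tree. -->
    <h1>Acme Pro Server</h1>
    <p>Price: $1,999.00</p>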
    <!-- ... -->
  </main>
</body>

A hybrid agent reads the rel="alternate" type="application/mcp+json" in the head, recognizes the protocol surface, and routes the actual purchase or write through the MCP tool call instead of clicking the "Buy" button. The semantic markup is still there as a resilient fallback — ARIA labels rather than class names — so if the agent stays in the UI, it survives a CSS refactor or a class rename. Defense-in-depth where every layer protects against a different failure mode.
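The handoff decision itself is small: look for the MCP discovery link and fall back to the UI when it is absent. A minimal sketch follows, assuming the `application/mcp+json` convention used in this document and double-quoted attributes. `chooseExecutionPath` and `attributeOf` are hypothetical names, and regex scanning stands in for a real HTML parser.

```typescript
type ExecutionPath = { mode: 'mcp'; endpoint: string } | { mode: 'ui' };

// Reads one double-quoted attribute value out of the inside of a tag.
function attributeOf(tagBody: string, name: string): string | undefined {
  const match = new RegExp('(?:^|\\s)' + name + '\\s*=\\s*"([^"]*)"', 'i').exec(tagBody);
  return match ? match[1] : undefined;
}

// Prefer the protocol surface when the page advertises one; otherwise stay in the UI.
function chooseExecutionPath(html: string): ExecutionPath {
  const linkTag = /<link\b([^>]*)>/gi;
  let tag: RegExpExecArray | null;

  while ((tag = linkTag.exec(html)) !== null) {
    const rel = attributeOf(tag[1], 'rel');
    const type = attributeOf(tag[1], 'type');
    const href = attributeOf(tag[1], 'href');
    if (rel === 'alternate' && type === 'application/mcp+json' && href) {
      return { mode: 'mcp', endpoint: href };
    }
  }
  return { mode: 'ui' };
}
```

Returning a tagged union forces the caller to handle the UI fallback explicitly, which matches the defense-in-depth framing above.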

New benchmarks to know about: MCP-Universe evaluates LLM agents on real-world multi-server MCP toolsets, a more realistic test for Layer 5 implementations than synthetic MCP fixtures. WorkArena++ (NeurIPS 2024) extends WorkArena into enterprise knowledge-work tasks (ServiceNow-flavored IT tickets, inventory, orchestration) where Tool-UI interoperability is the whole game: a pure API often isn't exposed, so agents must navigate dense GUIs and use Layer 2 semantic structure to complete high-stakes workflows safely.

Pattern 15: Code Execution with MCP (Context Efficiency)

As the number of connected MCP servers grows, passing every tool definition and every intermediate tool result through the LLM's context window becomes a severe latency and token bottleneck. Anthropic's research on code execution with MCP (Nov 2025) shows a different model: expose MCP servers as code APIs and let the agent write scripts to interact with them. Context token usage drops by up to 98.7% — in one example, from 150,000 tokens to 2,000.

Three properties make this work:

  • Progressive disclosure — instead of loading every tool definition upfront, the agent discovers tools dynamically by exploring a filesystem-like structure of MCP servers and only loads what the task needs.
  • Context-efficient data filtering — rather than streaming a 10,000-row dataset back through the model, the agent runs a script that fetches, filters, and summarizes the data in the execution environment and passes only the final insight back into context.
  • Privacy-preserving operations — intermediate results stay in the execution environment. Sensitive payloads can be processed and forwarded without ever entering the model's context window.
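The data-filtering property is the easiest of the three to illustrate. In the sketch below, `fetchOrders` is a hypothetical stand-in for an MCP tool exposed as a code API. It fabricates 10,000 deterministic rows so the example is self-contained, where a real wrapper would call the server. The point is that the script reduces the rows inside the execution environment and only the small summary would re-enter the model's context.

```typescript
interface OrderRow {
  region: string;
  status: 'paid' | 'refunded';
  total: number;
}

// Hypothetical tool wrapper: fabricated rows stand in for a real MCP call.
function fetchOrders(): OrderRow[] {
  const regions = ['emea', 'amer', 'apac'];
  const rows: OrderRow[] = [];
  for (let i = 0; i < 10000; i++) {
    rows.push({
      region: regions[i % regions.length],
      status: i % 10 === 0 ? 'refunded' : 'paid',
      total: (i % 50) + 1,
    });
  }
  return rows;
}

interface RefundSummary {
  region: string;
  refunds: number;
  amount: number;
}

// Runs in the execution environment: filter and aggregate before anything
// is handed back to the model.
function summarizeRefunds(rows: OrderRow[]): RefundSummary[] {
  const byRegion: { [region: string]: RefundSummary } = {};
  for (const row of rows) {
    if (row.status !== 'refunded') continue;
    const entry =
      byRegion[row.region] ||
      (byRegion[row.region] = { region: row.region, refunds: 0, amount: 0 });
    entry.refunds += 1;
    entry.amount += row.total;
  }
  return Object.keys(byRegion).map((region) => byRegion[region]);
}

// Only this per-region summary, not the 10,000 rows, would return to context.
const refundSummary = summarizeRefunds(fetchOrders());
```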

For BiModal Design's Layer 5 (Protocol-Native Agents), this changes how implementations should expose MCP servers. Tool surfaces designed as code APIs — with discoverable structure, idempotent operations, and explicit input/output schemas — let hybrid agents (see Pattern 14) hand off to MCP without paying the per-call context tax.

Reference: Anthropic — "Code execution with MCP: Building more efficient agents" (Nov 2025).