
LLM Test Oracles + Self-Healing Locators: The Combination That Actually Works

Self-healing fixes broken locators. LLM oracles catch what changed behind them. Together they eliminate an entire class of silent regressions that neither approach catches alone.


Where We Left Off

In the previous post I laid out three tiers of self-healing intelligence — from reactive locator patching all the way to intent-based resolution. The conclusion was that Tier 3, intent-based locators, is the architectural north star.

But there's a gap that even Tier 3 doesn't close.

A locator can resolve perfectly — finding exactly the right element — while the content behind it has regressed silently. The button is there. The form submits. The confirmation message appears. And yet the user experience is subtly, meaningfully broken in a way no locator-level check will ever catch.

That's where LLM-powered test oracles come in. And the combination of both is the real unlock.

The Silent Regression Problem

Consider a checkout confirmation screen. Your test does this:

await page.getByRole('heading', { name: /confirmation/i }).waitFor();
await expect(page.getByTestId('order-summary')).toBeVisible();

Both assertions pass. The heading is there. The order summary is visible.

But a developer refactored the summary component and the line-item breakdown now shows incorrect quantities. Or the confirmation email address displayed on screen is wrong. Or the total has a rounding error that only appears for certain currency locales.

Your test has no idea. It checked structure, not meaning.

What LLM Oracles Actually Do

An LLM oracle doesn't replace your locators — it sits alongside them and evaluates semantic correctness rather than structural presence.

import { LLMOracle } from './oracle';

const oracle = new LLMOracle({
  model: 'claude-sonnet',
  systemPrompt: `You are a QA evaluator. Given a page snapshot and an 
    intent description, determine if the page correctly satisfies the intent.
    Respond with JSON: { passed: boolean, reason: string, confidence: number }`
});

test('checkout confirmation is semantically correct', async ({ page }) => {
  await checkout.completeOrder({ items: cart, total: 49.99 });

  // Structural assertion — locator-level
  await expect(page.getByRole('heading', { name: /confirmation/i })).toBeVisible();

  // Semantic assertion — LLM oracle
  const snapshot = await page.content();
  const result = await oracle.evaluate({
    intent: `Order confirmation page showing 2 items totalling $49.99, 
             with a valid confirmation number and delivery estimate`,
    content: snapshot
  });

  expect(result.passed).toBe(true);
  // If failed: result.reason explains the semantic gap in plain English
});

The oracle reads the actual rendered page content and asks: does this satisfy what we actually intended to show the user?
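To make that concrete, here is a minimal sketch of what an `LLMOracle` like the one above could look like. The class name and result shape follow this post; everything else is illustrative. In particular, the model call is injected as a plain async function (`complete`) so the class stays transport-agnostic, and an unparseable verdict fails closed.

```typescript
// oracle.ts — illustrative sketch, not a library API.

export type OracleResult = {
  passed: boolean;
  reason: string;
  confidence: number; // 0.0 – 1.0
};

type CompleteFn = (system: string, user: string) => Promise<string>;

const DEFAULT_SYSTEM_PROMPT =
  'You are a QA evaluator. Respond with JSON: ' +
  '{ "passed": boolean, "reason": string, "confidence": number }';

export class LLMOracle {
  constructor(
    private opts: { model: string; systemPrompt?: string },
    private complete: CompleteFn
  ) {}

  async evaluate(input: { intent: string; content: string }): Promise<OracleResult> {
    const user = `INTENT:\n${input.intent}\n\nPAGE CONTENT:\n${input.content}`;
    const raw = await this.complete(this.opts.systemPrompt ?? DEFAULT_SYSTEM_PROMPT, user);
    try {
      // The system prompt demands strict JSON; coerce defensively anyway.
      const parsed = JSON.parse(raw);
      return {
        passed: parsed.passed === true,
        reason: String(parsed.reason ?? ''),
        confidence: Number(parsed.confidence ?? 0)
      };
    } catch {
      // Fail closed: an unparseable verdict is a failed evaluation.
      return { passed: false, reason: `Unparseable oracle response: ${raw}`, confidence: 0 };
    }
  }
}
```

Injecting `complete` also makes the oracle trivially testable: a fake completion function stands in for the model in unit tests.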

Integrating Both Layers

The architecture that works in practice has three distinct layers, each catching a different failure class:

Layer 1 — Structural (locators)
  ↓ catches: element missing, selector broken, layout collapse

Layer 2 — Self-Healing (intent locators)
  ↓ catches: locator drift, DOM refactor, ID/class rename

Layer 3 — Semantic (LLM oracle)
  ↓ catches: content regression, wrong data, logic errors,
             tone/compliance violations, subtle UX breakdowns

None of these layers is redundant. A failure at Layer 1 means the page is structurally broken. A failure at Layer 3, with Layers 1 and 2 passing, means the page looks right but means something wrong.
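The layered flow can be sketched as a tiny runner: each layer is a named async predicate, and the first failing layer identifies the failure class. The runner and layer names here are illustrative, not part of any library.

```typescript
export type Layer = { name: string; check: () => Promise<boolean> };

export async function runLayers(
  layers: Layer[]
): Promise<{ ok: boolean; failedAt?: string }> {
  for (const layer of layers) {
    // Layers are ordered cheapest-first; stop at the first failure so a
    // structural break never pays for an LLM call.
    if (!(await layer.check())) return { ok: false, failedAt: layer.name };
  }
  return { ok: true };
}
```

In a real suite, layer 1 wraps plain locator assertions, layer 2 the self-healing resolver, and layer 3 the oracle's `evaluate` call.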

A Practical Implementation Pattern

Here's how to structure this in a Playwright project without making every test slow and expensive:

// oracle.config.ts
export const ORACLE_CONFIG = {
  // Only run semantic checks on critical user journeys
  enabledSuites: ['checkout', 'auth', 'account-management'],
  
  // Skip oracle on fast feedback loops; run on staging/pre-prod
  enabledEnvs: ['staging', 'preprod'],
  
  // Confidence below this threshold routes to human review queue
  humanReviewThreshold: 0.75
};

// base-test.ts
import { test as base } from '@playwright/test';
import { LLMOracle } from './oracle';
import { ORACLE_CONFIG } from './oracle.config';

export const test = base.extend<{ oracle: LLMOracle }>({
  oracle: async ({}, use) => {
    const oracle = new LLMOracle({ model: 'claude-sonnet' });
    await use(oracle);
  }
});

// checkout.spec.ts
test('order summary is semantically accurate', async ({ page, oracle }) => {
  const order = await placeTestOrder({ items: 3, total: 124.50 });

  const result = await oracle.evaluate({
    intent: `Confirmation page for order ${order.id}: 
             3 items, total $124.50, with shipping estimate`,
    content: await page.content()
  });

  if (result.confidence < ORACLE_CONFIG.humanReviewThreshold) {
    await humanReviewQueue.push({ result, url: page.url() });
    test.skip(true, 'Low confidence — routed to human review');
  }

  expect(result.passed, result.reason).toBe(true);
});

The Confidence Score Is Non-Negotiable

One mistake teams make when adopting LLM oracles: treating them as binary pass/fail systems. They're not. The model can be wrong, especially for edge cases and ambiguous UI states.

Always surface the confidence score and build a review path for low-confidence evaluations:

type OracleResult = {
  passed: boolean;
  reason: string;        // Plain English explanation
  confidence: number;    // 0.0 – 1.0
  evidence: string[];    // Specific elements that informed the verdict
};

In practice, a well-calibrated oracle operating on clear intents will return confidence > 0.9 for the vast majority of evaluations. Low confidence is itself a signal — it usually means your intent description is ambiguous or the page state is genuinely unusual and worth human eyes.
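The routing rule can be written as a three-way triage rather than a boolean: below the review threshold, the verdict is trusted in neither direction, so it goes to a human. The function name here is illustrative.

```typescript
export type Verdict = 'pass' | 'fail' | 'review';

export function triage(
  result: { passed: boolean; confidence: number },
  reviewThreshold = 0.75 // matches humanReviewThreshold in the config above
): Verdict {
  // Low confidence overrides the model's verdict in both directions:
  // a low-confidence "pass" is just as suspect as a low-confidence "fail".
  if (result.confidence < reviewThreshold) return 'review';
  return result.passed ? 'pass' : 'fail';
}
```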

What This Combination Catches That Nothing Else Does

Here are failure classes that this combined approach has surfaced in real production pipelines, all of which traditional testing completely missed:

Content drift after A/B test bleed-through — a variant's copy leaked into the control group. Locators passed, structure was correct, but the wrong message was being shown to users.

Localisation regression — a currency formatter change caused totals to display without decimal places in certain locales. Structural tests saw a number. The oracle saw the wrong number.

Compliance language removal — a legal disclaimer was accidentally removed from a terms acceptance flow during a component refactor. Every locator test passed. The oracle flagged it immediately.

Tone regression in generated content — an AI-generated product description started producing outputs that violated brand tone guidelines. Rule-based assertions had no mechanism to detect this.

The Architecture North Star, Revisited

In the self-healing post, I defined the north star as reducing Mean Time to Locator Recovery (MTTLR) to zero through proactive prevention.

This post adds the second dimension: reducing Mean Time to Semantic Regression Detection (MTSRD) — the time between when the UI says the wrong thing and when a test catches it.

With locator-level self-healing alone, MTSRD is infinite. The test never fires because the structure never broke.

With LLM oracles in the critical path, MTSRD collapses to the next CI run.

That's the combination that actually works.
