Methodology

How Defendr measures failed large language model calls.

The method starts with what can be independently observed: request, response, timing, usage, cache, and charge evidence. Signals that need a baseline, threshold, or ground truth are labeled that way.

Measurement modes

Passive in path

Signals derived from live requests, responses, usage fields, and Defendr timing.

Active probe

Approved prompts or health checks used when a reference baseline is required.

Not observable

Provider internal behavior and arbitrary factual truth need proxies, thresholds, or external ground truth.

Measurement principle

Start with observable evidence.

Defendr can measure useful failure classes without guessing why a model failed. It keeps the call envelope, provider outcome, timing, delivered content, usage, cache, and charge evidence together, then labels the confidence of each signal.

Observe

Capture the request path, response, timing, usage, and charge context in one event.

Validate

Run checks such as schema validation, output completion, timing thresholds, and billing comparisons.

Bound

Separate objective failures from thresholded signals and from flags that are for review only.

Failure taxonomy

Three buckets keep the report honest.

Objective failures

Eligible in Managed when evidenced

Downtime, timeouts, empty output, truncated output, invalid structured output, invalid tool call arguments with a schema, and material billing or cache anomalies.

Threshold based signals

Reported until a threshold is agreed

Latency, model drift, over refusal, and quality regressions need a customer approved threshold, baseline, or service expectation before they can drive a Managed service credit.

Review only flags

Useful, but not automatic

Factuality without ground truth, security and policy signals, ambiguous refusals, and provider internal behavior are surfaced for review with their limits visible.
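
The three buckets can be pictured as a lookup from failure type to bucket. The following is a minimal sketch, not Defendr's implementation: the names `Bucket`, `FAILURE_BUCKETS`, and `bucket_for` are hypothetical, and the failure type keys are illustrative labels taken from the lists above. Unknown types fall back to review only so that nothing becomes credit eligible by accident.

```python
from enum import Enum


class Bucket(Enum):
    OBJECTIVE = "objective"      # eligible in Managed when evidenced
    THRESHOLD = "threshold"      # reported until a threshold is agreed
    REVIEW_ONLY = "review_only"  # useful, but not automatic


# Hypothetical failure type labels, mirroring the three lists above.
FAILURE_BUCKETS = {
    "downtime": Bucket.OBJECTIVE,
    "timeout": Bucket.OBJECTIVE,
    "empty_output": Bucket.OBJECTIVE,
    "truncated_output": Bucket.OBJECTIVE,
    "invalid_structured_output": Bucket.OBJECTIVE,
    "invalid_tool_arguments": Bucket.OBJECTIVE,
    "billing_anomaly": Bucket.OBJECTIVE,
    "cache_anomaly": Bucket.OBJECTIVE,
    "latency": Bucket.THRESHOLD,
    "model_drift": Bucket.THRESHOLD,
    "over_refusal": Bucket.THRESHOLD,
    "quality_regression": Bucket.THRESHOLD,
    "factuality": Bucket.REVIEW_ONLY,
    "security_policy": Bucket.REVIEW_ONLY,
    "ambiguous_refusal": Bucket.REVIEW_ONLY,
    "provider_internal": Bucket.REVIEW_ONLY,
}


def bucket_for(failure_type: str) -> Bucket:
    """Return the bucket for a failure type, defaulting to review only."""
    return FAILURE_BUCKETS.get(failure_type, Bucket.REVIEW_ONLY)
```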

Failure by failure

What can be observed, how it is measured, and where the boundary is.

Model drift

The same model name starts behaving differently from the approved behavior your workflow depends on.

What can be observed

Requested model, returned behavior, prompt family, schema results, refusal changes, length shifts, and probe outcomes.

How we measure

Active probes replay approved prompts against a trusted baseline, then compare semantic and objective checks under configured tolerances.
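
As an illustration of comparing probe outcomes under configured tolerances, here is a minimal sketch that covers only objective checks (schema pass rate, refusal rate, output length). The `ProbeResult` fields, the metric names, and the tolerance format are assumptions made for this example, and semantic comparison is left out.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ProbeResult:
    schema_valid: bool
    refused: bool
    output_tokens: int


def summarize(results):
    """Reduce a batch of probe results to a few objective metrics."""
    return {
        "schema_pass_rate": mean(1.0 if r.schema_valid else 0.0 for r in results),
        "refusal_rate": mean(1.0 if r.refused else 0.0 for r in results),
        "mean_output_tokens": mean(r.output_tokens for r in results),
    }


def breached_metrics(baseline, current, tolerances):
    """Return {metric: delta} for metrics that moved beyond their tolerance."""
    base, cur = summarize(baseline), summarize(current)
    breaches = {}
    for metric, tolerance in tolerances.items():
        delta = cur[metric] - base[metric]
        if abs(delta) > tolerance:
            breaches[metric] = delta
    return breaches
```

A single breach from one probe run would not be enough on its own; under the limits described for this card, repeated evidence against an approved baseline is still required.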

What the report shows

Baseline, breached threshold, affected workflow, sample evidence, and whether it is report only or threshold eligible.

Limits

Drift is not automatically bad. It needs an approved baseline, configured threshold, and repeated evidence before any Managed credit status applies.

Billing integrity

Usage or charge records can diverge from delivered output, request identity, or independently checkable token counts.

What can be observed

Output text, usage totals, request identifiers, observed charges, token fields, and provider specific billing metadata.

How we measure

Recount visible output when an independent tokenizer is available and reconcile repeated request identities against charge observations.
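
The two checks above can be sketched as follows. This is an illustration under assumptions: `count_tokens` stands in for whatever independent tokenizer is available for the model, and the function names and the `(request_id, amount)` charge format are hypothetical.

```python
from collections import Counter


def recount_mismatch(output_text, reported_output_tokens, count_tokens, tolerance=0):
    """Return reported minus recounted tokens, or 0 when within tolerance.

    count_tokens is a caller supplied function (assumed here) that applies
    an independent tokenizer to the visible output text.
    """
    recount = count_tokens(output_text)
    diff = reported_output_tokens - recount
    return diff if abs(diff) > tolerance else 0


def repeated_request_charges(charges):
    """Return {request_id: charge_count} for ids charged more than once.

    charges is an iterable of (request_id, amount) pairs. A repeated id is
    a candidate for reconciliation, not proof of a double charge.
    """
    counts = Counter(request_id for request_id, _ in charges)
    return {rid: n for rid, n in counts.items() if n > 1}
```

The tolerance parameter exists because an independent recount may legitimately differ by a few tokens, for example when a provider counts special tokens that are not visible in the output.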

What the report shows

Mismatch type, event count, estimated cost, provider evidence, and whether the row is objective, thresholded, or review only.

Limits

Provider fields differ. Some providers do not expose enough information for a tight independent recount, so those rows are labeled as limited checks.

Cache integrity

A prompt cache hit or cache eligible request should be reflected in observable usage and charge evidence.

What can be observed

Cache read and write fields, prompt length, provider usage metadata, observed charges, and repeated prefix evidence where available.

How we measure

Separate cached from uncached input when fields allow it, compute expected charge, and compare with observed charge records.
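
A worked version of the expected charge calculation might look like the sketch below. The `Usage` and `Prices` shapes are assumptions, the per token prices in the usage note are made up for illustration, and the sketch assumes `input_tokens` is the total input including cached portions, which is not true for every provider.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Usage:
    input_tokens: int        # assumed to include cache reads and writes
    cache_read_tokens: int
    cache_write_tokens: int
    output_tokens: int


@dataclass
class Prices:
    """Per token prices (hypothetical structure)."""
    input: Decimal
    cache_read: Decimal
    cache_write: Decimal
    output: Decimal


def expected_charge(usage: Usage, prices: Prices) -> Decimal:
    """Price uncached input, cache reads, cache writes, and output separately."""
    uncached = usage.input_tokens - usage.cache_read_tokens - usage.cache_write_tokens
    return (
        uncached * prices.input
        + usage.cache_read_tokens * prices.cache_read
        + usage.cache_write_tokens * prices.cache_write
        + usage.output_tokens * prices.output
    )


def charge_gap(usage: Usage, prices: Prices, observed: Decimal,
               tolerance: Decimal = Decimal("0")) -> Decimal:
    """Return observed minus expected charge, or 0 when within tolerance."""
    gap = observed - expected_charge(usage, prices)
    return gap if abs(gap) > tolerance else Decimal("0")
```

For example, with 1,000 input tokens of which 800 were cache reads, 100 output tokens, and illustrative prices of 0.000003 (input), 0.0000003 (cache read), and 0.000015 (output) per token, the expected charge is 0.0006 + 0.00024 + 0.0015 = 0.00234. An observed charge of 0.0045, which is what the call would cost with no cache discount, leaves a gap of 0.00216.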

What the report shows

Cache anomaly type, event count, estimated cost, evidence fields, and remedy status.

Limits

Internal provider cache behavior is not observable. Defendr reports API visible cache evidence, not provider infrastructure internals.

Broken, truncated, or empty output

The response is malformed, cut off, missing, or unusable under a declared response contract.

What can be observed

Delivered content, finish reason, stream completion, output token usage, JSON parse result, and schema validation result.

How we measure

Parse response content, validate declared schemas, normalize truncation signals, and compare empty content against output usage.
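
The structural checks above can be sketched with the standard library alone. This is a simplified illustration: the function name and return labels are hypothetical, the truncation values are examples of finish reasons some providers use rather than a complete list, and schema validation is omitted.

```python
import json

# Example finish reasons that indicate the output hit a length limit.
# Providers differ, so this set is an assumption to be extended per provider.
TRUNCATION_REASONS = {"length", "max_tokens"}


def classify_output(content, finish_reason, output_tokens, expect_json=False):
    """Return 'billed_empty', 'empty', 'truncated', 'broken', or 'ok'."""
    if not content or not content.strip():
        # Empty content with reported output usage is the billed empty case.
        return "billed_empty" if output_tokens > 0 else "empty"
    if finish_reason in TRUNCATION_REASONS:
        return "truncated"
    if expect_json:
        try:
            json.loads(content)
        except json.JSONDecodeError:
            return "broken"
    return "ok"
```

Truncation is checked before parsing so that a cut off JSON document is reported as truncated rather than as a generic parse failure.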

What the report shows

Broken output, truncated output, or billed empty rows with event count, cost, and evidence.

Limits

The check proves structural failure. It does not show whether a structurally valid answer is factually correct.

Refusals

A model blocks a workflow instead of answering. Some refusals are appropriate, so refusal evidence must be separated from other output failures.

What can be observed

Provider refusal fields, content filter outcomes, refusal text, usage records, and configured business context.

How we measure

Normalize native refusal signals, classify text only refusals, and keep billed refusal rows separate from over refusal probes.
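
One way to picture the normalization step is the sketch below. It is an assumption heavy illustration: the function name, the argument names, and the phrase list are hypothetical, `content_filter` is used as an example of a provider finish reason, and a real text classifier would need far more than a short regular expression.

```python
import re

# A deliberately small, illustrative set of refusal openers.
REFUSAL_PATTERN = re.compile(
    r"^\s*(i'?m sorry|i cannot|i can'?t|i am unable|i'?m unable)\b",
    re.IGNORECASE,
)


def refusal_signal(native_refusal, finish_reason, text):
    """Return 'native', 'content_filter', 'text_only', or None.

    native_refusal: a provider refusal field, if the provider exposes one.
    finish_reason:  the provider finish reason for the response.
    text:           the delivered response text.
    """
    if native_refusal:
        return "native"
    if finish_reason == "content_filter":
        return "content_filter"
    if text and REFUSAL_PATTERN.match(text):
        return "text_only"
    return None
```

Keeping the source of the signal in the result lets the report treat text only matches as lower confidence than native refusal fields.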

What the report shows

Refused and billed events, refusal rate, example evidence, and review status.

Limits

A refusal is not automatically wrong. Benign prompt thresholds and review labels matter before Managed credit status applies.

Downtime

The provider cannot serve the request, times out, drops the connection, or returns a provider side availability error.

What can be observed

Status codes, error class, timeout, transport failure, capacity signal, timing, and charge evidence.

How we measure

Classify provider 5xx errors, timeouts, transport failures, and provider capacity throttles while excluding customer auth and malformed request errors.
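
A minimal sketch of that classification follows. The function and parameter names are hypothetical. In particular, `capacity_throttle` is assumed to be set by upstream logic that can tell a provider capacity throttle apart from a customer rate limit, since both can arrive as HTTP 429.

```python
def is_downtime(status_code=None, timed_out=False, transport_error=False,
                capacity_throttle=False):
    """Return True when the attempt counts as provider side downtime."""
    if timed_out or transport_error:
        return True
    if status_code is None:
        return False
    if 500 <= status_code <= 599:
        return True
    if status_code == 429:
        # Only provider capacity throttles count, not customer rate limits.
        return capacity_throttle
    # Other 4xx responses, such as auth failures (401, 403) and malformed
    # requests (400), are customer side and excluded.
    return False
```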

What the report shows

Downtime event count, affected provider path, estimated cost where billed, and objective failure status.

Limits

An unbilled failed attempt can still count as downtime, but it does not add cost to the loss report.

Latency

The model eventually answers, but the response arrives too late for the product workflow.

What can be observed

Gateway timing, upstream attempt duration, streaming milestones when available, model path, and usage evidence.

How we measure

Time the provider attempt and compare it with the configured threshold for that route, workload, or service expectation.
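
The timing comparison can be sketched as below, assuming a hypothetical `call` function that performs the provider attempt. The names and the result fields are illustrative. A monotonic clock is used so that wall clock adjustments do not distort the measurement.

```python
import time


def evaluate_latency(elapsed_seconds, threshold_seconds=None):
    """Compare an observed duration with a configured threshold, if any."""
    if threshold_seconds is None:
        # Without a threshold the call is timed but not labeled slow.
        return {"threshold_configured": False, "slow": False, "delay_seconds": 0.0}
    delay = max(0.0, elapsed_seconds - threshold_seconds)
    return {"threshold_configured": True, "slow": delay > 0, "delay_seconds": delay}


def timed_attempt(call, threshold_seconds=None):
    """Run call(), time it, and return (result, latency evaluation)."""
    start = time.monotonic()
    result = call()
    elapsed = time.monotonic() - start
    return result, evaluate_latency(elapsed, threshold_seconds)
```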

What the report shows

Slow call count, threshold, delay amount, estimated cost, and whether the threshold was configured.

Limits

Latency needs thresholds. Provider queues, batching, and infrastructure internals are inferred from observed timing, not directly seen.

Tool call validity

Agentic workflows depend on valid tool names, valid JSON, required fields, and schema compliant arguments.

What can be observed

Requested tool schema, emitted tool name, argument payload, JSON parse result, and schema validation result.

How we measure

When the customer provides a schema or contract, validate emitted tool call arguments against that contract before treating the output as usable.
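
To illustrate the validation order (tool name, JSON parse, required fields, argument types), here is a minimal sketch using only the standard library. It checks a small subset of JSON Schema (`required` and top level `properties` types) and is not a full validator; the function name, the return labels, and the `tool_schemas` mapping are assumptions for this example.

```python
import json

JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate_tool_call(name, raw_arguments, tool_schemas):
    """Return None when the call is valid, else a short failure reason."""
    schema = tool_schemas.get(name)
    if schema is None:
        return "unknown_tool"
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(args, dict):
        return "arguments_not_object"
    for field in schema.get("required", []):
        if field not in args:
            return f"missing_field:{field}"
    for field, spec in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(spec.get("type"))
        if field not in args or expected is None:
            continue
        value = args[field]
        # bool is a subclass of int in Python, so reject it explicitly
        # unless the schema really asks for a boolean.
        if isinstance(value, bool) and spec.get("type") != "boolean":
            return f"wrong_type:{field}"
        if not isinstance(value, expected):
            return f"wrong_type:{field}"
    return None
```

Returning the first failure reason keeps the report row simple; a fuller validator would collect every violation and cover nested schemas.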

What the report shows

Invalid tool call count, failed field or parse reason, estimated cost, and objective failure status when applicable.

Limits

Without a declared schema or objective contract, Defendr cannot label a tool argument as invalid with the same confidence.

Limits

Some signals need more than a gateway can see.

Factuality needs ground truth

A gateway cannot know arbitrary truth from request and response alone. Known answer checks need reference data.

Drift needs thresholds

Behavioral change can be expected or useful. Defendr reports drift against approved baselines and configured thresholds.

Latency needs service expectations

A slow call is only a service credit candidate when a threshold or service expectation has been set.

Provider fields differ

Defendr records what is independently observable and labels limited checks where provider fields do not support a tighter measurement.

From measurement to report

Measured failure becomes evidence, category, count, cost, and remedy status.

The Accountability page shows the full sample scorecard and explains how Bring Your Own Keys differs from Managed.

Evidence

The call fields and checks that support the row.

Cost

The estimated cost of calls attached to that failure type.

Status

Reported, eligible for Managed service credit, needs threshold, or needs review.

Next action

Engineering, product, and finance get a shared record to evaluate.
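
As a rough picture of how measured events could roll up into report rows, the sketch below groups events by category and status, then totals count and cost. The event fields (`category`, `status`, `cost`, `request_id`) and the function name are hypothetical, and the actual scorecard layout may differ.

```python
from collections import defaultdict
from decimal import Decimal


def build_report_rows(events):
    """Group failure events into rows with count, cost, and evidence."""
    grouped = defaultdict(lambda: {"count": 0, "cost": Decimal("0"), "evidence": []})
    for event in events:
        row = grouped[(event["category"], event["status"])]
        row["count"] += 1
        row["cost"] += event["cost"]
        # Keep a pointer back to the call that supports the row.
        row["evidence"].append(event["request_id"])
    return [
        {"category": category, "status": status, **row}
        for (category, status), row in sorted(grouped.items(), key=lambda kv: kv[0])
    ]
```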

Stop paying for failure you can't see.