Methodology
The method starts with what can be independently observed: request, response, timing, usage, cache, and charge evidence. Signals that need a baseline, threshold, or ground truth are labeled that way.
Measurement modes
Passive in path
Signals derived from live requests, responses, usage fields, and Defendr timing.
Active probe
Approved prompts or health checks used when a reference baseline is required.
Not observable
Provider internal behavior and arbitrary factual truth need proxies, thresholds, or external ground truth.
Measurement principle
To measure useful failure classes, Defendr does not need to guess why a model failed. It keeps the call envelope, provider outcome, timing, delivered content, usage, cache, and charge evidence together, then labels the confidence of each signal.
Observe
Capture the request path, response, timing, usage, and charge context in one event.
Validate
Run checks such as schema validation, output completion, timing thresholds, and billing comparisons.
Bound
Separate objective failures from thresholded signals and from flags that are for review only.
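The Observe and Validate steps above can be sketched as a single event record that carries its own check results. This is a minimal illustration, not Defendr's actual data model: the `CallEvent` class, its field names, the `validate` function, and the `"stop"` finish reason are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CallEvent:
    """One observed call. All field names here are hypothetical."""
    request_path: str
    status_code: Optional[int]       # None when no HTTP response arrived
    content: str                     # delivered output text
    finish_reason: Optional[str]
    latency_ms: float
    output_tokens: int
    charged_usd: float
    checks: dict = field(default_factory=dict)  # check name -> passed?


def validate(event: CallEvent, latency_threshold_ms: Optional[float] = None) -> CallEvent:
    """Run a few illustrative checks and record the results on the event."""
    event.checks["non_empty_output"] = bool(event.content.strip())
    # Assumes the provider reports "stop" for a normally completed response.
    event.checks["completed"] = event.finish_reason == "stop"
    if latency_threshold_ms is not None:
        # The latency check only exists when a threshold has been configured.
        event.checks["within_latency"] = event.latency_ms <= latency_threshold_ms
    return event
```

Keeping the checks on the same record as the evidence means a later report row can point back to the exact fields that produced it.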
Failure taxonomy
Objective failures
Downtime, timeouts, empty output, truncated output, invalid structured output, invalid tool call arguments with a schema, and material billing or cache anomalies.
Threshold based signals
Latency, model drift, over refusal, and quality regressions need a customer approved threshold, baseline, or service expectation before they can drive a Managed service credit.
Review only flags
Factuality without ground truth, security and policy signals, ambiguous refusals, and provider internal behavior are surfaced for review with their limits visible.
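One way to picture the taxonomy is as a lookup from signal name to confidence tier. The sketch below is illustrative only: the signal names, the `Tier` enum, and both helper functions are hypothetical, and the grouping simply mirrors the three lists above.

```python
from enum import Enum


class Tier(Enum):
    OBJECTIVE = "objective"
    THRESHOLDED = "thresholded"
    REVIEW_ONLY = "review_only"


# Hypothetical signal names, grouped as the taxonomy above describes.
# "invalid_tool_arguments" is objective only when a schema was declared.
OBJECTIVE_SIGNALS = {
    "downtime", "timeout", "empty_output", "truncated_output",
    "invalid_structured_output", "invalid_tool_arguments",
    "billing_anomaly", "cache_anomaly",
}
THRESHOLDED_SIGNALS = {"latency", "model_drift", "over_refusal", "quality_regression"}


def tier_for(signal: str) -> Tier:
    """Map a signal name to its tier; anything unlisted is review only."""
    if signal in OBJECTIVE_SIGNALS:
        return Tier.OBJECTIVE
    if signal in THRESHOLDED_SIGNALS:
        return Tier.THRESHOLDED
    return Tier.REVIEW_ONLY


def needs_threshold(signal: str, threshold_configured: bool) -> bool:
    """A thresholded signal cannot be acted on until a threshold exists."""
    return tier_for(signal) is Tier.THRESHOLDED and not threshold_configured
```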
Failure by failure
The same model name starts behaving differently from the approved behavior your workflow depends on.
What can be observed
Requested model, returned behavior, prompt family, schema results, refusal changes, length shifts, and probe outcomes.
How we measure
Active probes replay approved prompts against a trusted baseline, then compare semantic and objective checks under configured tolerances.
What the report shows
Baseline, breached threshold, affected workflow, sample evidence, and whether it is report only or threshold eligible.
Limits
Drift is not automatically bad. It needs an approved baseline, configured threshold, and repeated evidence before any Managed credit status applies.
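As a rough sketch of the objective half of a drift check, the code below compares a set of current probe results with baseline results and reports which rates moved beyond a configured tolerance. The `ProbeResult` fields, the function names, and the default tolerances are assumptions for illustration. Semantic comparison and the requirement for repeated evidence are left out, and both result lists are assumed to be non-empty.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of replaying one approved prompt (hypothetical fields)."""
    prompt_id: str
    schema_valid: bool
    refused: bool
    output_chars: int


def _rate(results, predicate):
    return sum(1 for r in results if predicate(r)) / len(results)


def _mean_length(results):
    return sum(r.output_chars for r in results) / len(results)


def drift_breaches(baseline, current, tolerance=0.05, length_tolerance=0.25):
    """Return the names of metrics that moved beyond the configured tolerance."""
    breaches = []
    schema_shift = _rate(current, lambda r: r.schema_valid) - _rate(baseline, lambda r: r.schema_valid)
    if abs(schema_shift) > tolerance:
        breaches.append("schema_pass_rate")
    refusal_shift = _rate(current, lambda r: r.refused) - _rate(baseline, lambda r: r.refused)
    if abs(refusal_shift) > tolerance:
        breaches.append("refusal_rate")
    base_len = _mean_length(baseline)
    if base_len > 0 and abs(_mean_length(current) - base_len) / base_len > length_tolerance:
        breaches.append("mean_length")
    return breaches
```

An empty list means no configured tolerance was breached in this run. It does not mean the model is unchanged.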
Usage or charge records can diverge from delivered output, request identity, or independently checkable token counts.
What can be observed
Output text, usage totals, request identifiers, observed charges, token fields, and provider specific billing metadata.
How we measure
Recount visible output when an independent tokenizer is available and reconcile repeated request identities against charge observations.
What the report shows
Mismatch type, event count, estimated cost, provider evidence, and whether the row is objective, thresholded, or review only.
Limits
Provider fields differ. Some providers do not expose enough information for a tight independent recount, so those rows are labeled as limited checks.
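The two measurements described above, a token recount and a reconciliation of repeated request identities, might look like the following. This is a sketch under stated assumptions: the function names are hypothetical, the tokenizer is passed in as a plain callable because the right tokenizer depends on the provider, and a missing tokenizer yields `None` to represent a limited check.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple


def recount_difference(output_text: str,
                       reported_output_tokens: int,
                       tokenizer: Optional[Callable[[str], int]]) -> Optional[int]:
    """Return reported minus recounted tokens, or None for a limited check."""
    if tokenizer is None:
        # No independent tokenizer available: the row can only be a limited check.
        return None
    return reported_output_tokens - tokenizer(output_text)


def repeated_charges(charges: Iterable[Tuple[str, float]]) -> list:
    """Return request ids that appear in more than one charge observation."""
    counts = Counter(request_id for request_id, _amount in charges)
    return sorted(rid for rid, n in counts.items() if n > 1)
```

A repeated request id is a candidate for reconciliation, not proof of a double charge, since retries can be billed legitimately.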
A prompt cache hit or cache eligible request should be reflected in observable usage and charge evidence.
What can be observed
Cache read and write fields, prompt length, provider usage metadata, observed charges, and repeated prefix evidence where available.
How we measure
Separate cached from uncached input when fields allow it, compute expected charge, and compare with observed charge records.
What the report shows
Cache anomaly type, event count, estimated cost, evidence fields, and remedy status.
Limits
Internal provider cache behavior is not observable. Defendr reports API visible cache evidence, not provider infrastructure internals.
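The expected charge comparison can be illustrated with a small calculation. Everything here is an assumption for the example: the function names, the per-token prices, and the 1% relative tolerance. Real providers price cache reads and writes differently, so the inputs would come from the provider's published rates and usage fields.

```python
from typing import Optional


def expected_input_charge(uncached_tokens: int, cached_tokens: int,
                          price_per_token: float, cached_price_per_token: float) -> float:
    """Expected input charge when cached and uncached tokens are priced separately."""
    return uncached_tokens * price_per_token + cached_tokens * cached_price_per_token


def cache_anomaly(observed_charge: float, uncached_tokens: int, cached_tokens: int,
                  price_per_token: float, cached_price_per_token: float,
                  rel_tolerance: float = 0.01) -> Optional[dict]:
    """Return a mismatch record, or None when observed is within tolerance of expected."""
    expected = expected_input_charge(uncached_tokens, cached_tokens,
                                     price_per_token, cached_price_per_token)
    delta = observed_charge - expected
    if abs(delta) <= rel_tolerance * expected:
        return None
    return {"expected": expected, "observed": observed_charge, "delta": delta}
```

For example, with 1,000 uncached tokens at 0.000002 and 4,000 cached tokens at 0.0000005, the expected charge is 0.004. An observed charge of 0.010 would suggest the cached tokens were billed at the full rate.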
The response is malformed, cut off, missing, or unusable under a declared response contract.
What can be observed
Delivered content, finish reason, stream completion, output token usage, JSON parse result, and schema validation result.
How we measure
Parse response content, validate declared schemas, normalize truncation signals, and compare empty content against output usage.
What the report shows
Broken output, truncated output, or billed empty rows with event count, cost, and evidence.
Limits
The check proves structural failure. It does not prove whether a syntactically valid answer is factually correct.
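A minimal sketch of these structural checks, using only the standard library, is shown below. The function name and flag names are hypothetical, and the set of truncation finish reasons is an assumption that would need to match each provider's actual values.

```python
import json
from typing import Optional

# Assumed finish reasons that indicate the output hit a length limit.
TRUNCATION_REASONS = {"length", "max_tokens"}


def classify_output(content: str, finish_reason: Optional[str],
                    stream_completed: bool, output_tokens: int,
                    expects_json: bool = False) -> list:
    """Return structural flags for one response; an empty list means no flag."""
    flags = []
    has_content = bool(content.strip())
    if not has_content:
        # Empty content with non-zero output usage is the "billed empty" case.
        flags.append("billed_empty" if output_tokens > 0 else "empty")
    if finish_reason in TRUNCATION_REASONS or not stream_completed:
        flags.append("truncated")
    if expects_json and has_content:
        try:
            json.loads(content)
        except json.JSONDecodeError:
            flags.append("invalid_json")
    return flags
```

Schema validation would follow the parse step when the customer has declared a schema. It is omitted here to keep the example short.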
A model blocks a workflow instead of answering. Some refusals are appropriate, so refusal evidence must be separated from other output failures.
What can be observed
Provider refusal fields, content filter outcomes, refusal text, usage records, and configured business context.
How we measure
Normalize native refusal signals, classify text only refusals, and keep billed refusal rows separate from over refusal probes.
What the report shows
Refused and billed events, refusal rate, example evidence, and review status.
Limits
A refusal is not automatically wrong. Benign prompt thresholds and review labels matter before Managed credit status applies.
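The separation between native and text only refusals could be sketched as follows. The function name, the `"content_filter"` finish reason, and the phrase pattern are assumptions for illustration. A pattern match is a weak signal, which is why every result carries a review status rather than a verdict.

```python
import re
from typing import Optional

# Illustrative phrases only; a real classifier would need a reviewed phrase list.
REFUSAL_PATTERN = re.compile(
    r"\b(?:i can(?:['’]t|not) (?:help|assist)|i(?:['’]m| am) unable to)\b",
    re.IGNORECASE,
)


def classify_refusal(native_refusal: Optional[str], finish_reason: Optional[str],
                     content: str, charged_usd: float) -> Optional[dict]:
    """Return a refusal record, or None when no refusal signal is present."""
    if native_refusal or finish_reason == "content_filter":
        source = "native"        # the provider reported the refusal itself
    elif REFUSAL_PATTERN.search(content):
        source = "text_only"     # inferred from the wording of the output
    else:
        return None
    return {"source": source, "billed": charged_usd > 0, "status": "needs_review"}
```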
The provider cannot serve the request, times out, drops the connection, or returns a provider side availability error.
What can be observed
Status codes, error class, timeout, transport failure, capacity signal, timing, and charge evidence.
How we measure
Classify provider 5xx errors, timeouts, transport failures, and provider capacity throttles as downtime, and exclude customer auth and malformed request errors.
What the report shows
Downtime event count, affected provider path, estimated cost where billed, and objective failure status.
Limits
An unbilled failed attempt can still count as downtime, but it does not add cost to the loss report.
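The classification rule above can be sketched as a small predicate. The function name and parameters are hypothetical, and treating HTTP 429 as downtime only when it is flagged as a provider capacity throttle is an assumption about how that distinction would be supplied.

```python
from typing import Optional


def is_provider_downtime(status_code: Optional[int],
                         timed_out: bool = False,
                         transport_error: bool = False,
                         capacity_throttle: bool = False) -> bool:
    """Decide whether one failed attempt counts as provider downtime."""
    if timed_out or transport_error:
        return True
    if status_code is None:
        return False
    if 500 <= status_code <= 599:
        return True
    if status_code == 429:
        # Only counted when the throttle reflects provider capacity,
        # not the customer's own rate limit.
        return capacity_throttle
    # 400, 401, 403 and similar are customer side errors and are excluded.
    return False


def downtime_cost(attempts) -> float:
    """Sum charges over (is_downtime, charged_usd) pairs; unbilled attempts add 0."""
    return sum(charged for is_down, charged in attempts if is_down)
```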
The model eventually answers, but the response arrives too late for the product workflow.
What can be observed
Gateway timing, upstream attempt duration, streaming milestones when available, model path, and usage evidence.
How we measure
Time the provider attempt and compare it with the configured threshold for that route, workload, or service expectation.
What the report shows
Slow call count, threshold, delay amount, estimated cost, and whether the threshold was configured.
Limits
Latency needs thresholds. Provider queues, batching, and infrastructure internals are inferred from observed timing, not directly seen.
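A minimal sketch of the timing comparison is shown below, assuming hypothetical function names and a per-route threshold in milliseconds. When no threshold is configured, the call is not judged slow or fast; it is marked as needing a threshold.

```python
import time
from typing import Callable, Optional


def time_attempt(call: Callable[[], object]):
    """Run one provider attempt and return (result, duration in milliseconds)."""
    start = time.perf_counter()
    result = call()
    return result, (time.perf_counter() - start) * 1000.0


def evaluate_latency(duration_ms: float, threshold_ms: Optional[float]) -> Optional[dict]:
    """Return a slow call record, a needs-threshold record, or None when within threshold."""
    if threshold_ms is None:
        return {"status": "needs_threshold", "threshold_ms": None, "delay_ms": None}
    delay_ms = duration_ms - threshold_ms
    if delay_ms <= 0:
        return None
    return {"status": "slow", "threshold_ms": threshold_ms, "delay_ms": delay_ms}
```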
Agentic workflows depend on valid tool names, valid JSON, required fields, and schema compliant arguments.
What can be observed
Requested tool schema, emitted tool name, argument payload, JSON parse result, and schema validation result.
How we measure
When the customer provides a schema or contract, validate emitted tool call arguments against that contract before treating the output as usable.
What the report shows
Invalid tool call count, failed field or parse reason, estimated cost, and objective failure status when applicable.
Limits
Without a declared schema or objective contract, Defendr cannot label a tool argument as invalid with the same confidence.
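The validation order described above (tool name, JSON parse, required fields, argument types) can be sketched with the standard library. This is a deliberately simplified stand-in for a full JSON Schema validator: the function name, the reason strings, and the reduced schema shape with only `required` and `properties` are assumptions for the example.

```python
import json
from typing import Optional

# Simplified mapping from schema type names to Python types.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float),
            "boolean": bool, "object": dict, "array": list}


def validate_tool_call(name: str, raw_arguments: str, tools: dict) -> Optional[str]:
    """Return a failure reason, or None when the call passes these checks."""
    schema = tools.get(name)
    if schema is None:
        return "unknown_tool"
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(args, dict):
        return "arguments_not_object"
    for field_name in schema.get("required", []):
        if field_name not in args:
            return f"missing_field:{field_name}"
    for field_name, spec in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(spec.get("type"))
        if field_name not in args or expected is None:
            continue
        value = args[field_name]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if isinstance(value, bool) and spec["type"] != "boolean":
            return f"wrong_type:{field_name}"
        if not isinstance(value, expected):
            return f"wrong_type:{field_name}"
    return None
```

Returning the first failure reason keeps the report row simple: one invalid call maps to one failed field or parse reason.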
Limits as trust copy
A gateway cannot know arbitrary truth from request and response alone. Known answer checks need reference data.
Behavioral change can be expected or useful. Defendr reports drift against approved baselines and configured thresholds.
A slow call is only a service credit candidate when a threshold or service expectation has been set.
Defendr records what is independently observable and labels limited checks where provider fields do not support a tighter measurement.
From measurement to report
The Accountability page shows the full sample scorecard and explains how Bring Your Own Keys differs from Managed.
Evidence
The call fields and checks that support the row.
Cost
The estimated cost of calls attached to that failure type.
Status
Reported, eligible for Managed service credit, needs threshold, or needs review.
Next action
Engineering, product, and finance get a shared record to evaluate.