we steer mercury into agents that feel human

Mercury, by Inception Labs[1], is a fast diffusion language model: sub-200 ms first-token latency on warm production traffic, with output tokens generated in parallel rather than one at a time. like every other commercially available LLM, raw Mercury drifts in whatever direction the conversation pushes it. that drift is the failure mode the rest of the AI industry has been quietly working around with system prompts and fine-tunes. it is also the leverage. an orchestration layer above the model can hold the steering against pressure instead of caving to it, and the combination of a fast model and a steering layer is what produces an agent that feels human across a long conversation.

this page is about Mercury under that orchestration layer: what Mercury is at the model level, how the orchestration layer above it works, how Mercury behaves under structured steering, and where it is running in production today.

mercury at the model level

three properties matter for the orchestration layer above it.

first-token latency below the human conversational floor. in our current production voice deployment Mercury's first token arrives in 196 ms (p50) and the full voice loop completes in 488 ms end-to-end[2]. the first-token figure sits just under the ~200 ms median gap humans take between conversational turns across cultures[3]. that latency band is what lets the AI sit inside the moment rather than running behind it. a face trigger, a screen change, or a tone shift in the user's voice gets a response inside the window where the event still matters.

whole-answer generation, paired with a frontier reasoning fallback. Mercury produces its output tokens in parallel rather than streaming them one at a time, so it finishes a turn in roughly the time a frontier reasoning model spends planning. our orchestration layer uses this to run a hybrid pattern: Mercury handles the conversational default; a frontier reasoning model (we call it the oracle in our routing layer) runs alongside it on substantive turns, pre-computing its answer inside the speculative window. routine turns stay in rhythm on Mercury. substantive turns route to the oracle when the conversation state demands it.
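a minimal sketch of the hybrid pattern described above, assuming both models are exposed as async functions. the names `answer_turn`, `is_substantive`, and `window_s`, and the fall-back-to-Mercury behaviour when the oracle misses the window, are illustrative assumptions and not the production router.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical signature: a model is an async function from user text to reply text.
Model = Callable[[str], Awaitable[str]]


async def answer_turn(
    user_text: str,
    fast_model: Model,
    oracle: Model,
    is_substantive: Callable[[str], bool],
    window_s: float = 2.0,  # assumed length of the speculative window
) -> tuple[str, str]:
    """Return (source, reply) for one conversational turn."""
    if not is_substantive(user_text):
        # Routine turn: stay in rhythm on the fast model, never touch the oracle.
        return "fast", await fast_model(user_text)

    # Substantive turn: the oracle starts pre-computing while the fast model answers.
    oracle_task = asyncio.create_task(oracle(user_text))
    fast_reply = await fast_model(user_text)
    try:
        return "oracle", await asyncio.wait_for(oracle_task, timeout=window_s)
    except asyncio.TimeoutError:
        # Assumed policy: if the oracle misses the window, keep the fast reply.
        return "fast", fast_reply
```

the design choice worth noting is that the oracle is started before the fast model is awaited, so its planning time overlaps with the fast reply instead of adding to it.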

cost-per-correct-answer at production scale. on rule-following workloads, Mercury under our orchestration scaffold reaches the application-acceptable accuracy threshold at approximately five times lower cost per trajectory than frontier reasoning models under the same scaffold[4]. at April 2026 list prices the gap is 12–13× (Inception $0.25 / $0.75 per million input / output tokens vs GPT-5.2 $1.75 / $14.00). after Mercury's ~2.6× verbosity (Artificial Analysis third-party measurement), the production-realized advantage is roughly 5× cost-per-correct-answer at the 71% pass¹ acceptance threshold on Sierra τ²-bench airline.

[figure: cost-per-correct-answer vs accuracy · τ²-bench airline · 200 simulations per condition at k=4 · pricing April 2026]
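as a rough check on the cost figures above, the realized ratio can be approximated by dividing the list-price ratio by the verbosity multiplier. that single division is a simplifying assumption (verbosity mainly inflates output tokens, and the token mix behind the 12–13× blend is not stated), so treat this as back-of-envelope arithmetic and not the per-trajectory analysis itself.

```python
# Prices per million tokens (input, output), as quoted for April 2026.
MERCURY = (0.25, 0.75)
GPT_5_2 = (1.75, 14.00)

LIST_PRICE_RATIO = (12.0, 13.0)  # blended list-price range quoted in the text
VERBOSITY = 2.6                  # approximate verbosity multiplier quoted in the text

# Per-component list-price ratios; the quoted blend sits between them.
input_ratio = GPT_5_2[0] / MERCURY[0]    # 7.0
output_ratio = GPT_5_2[1] / MERCURY[1]   # about 18.7


def realized_ratio(list_ratio: float, verbosity: float) -> float:
    """Assumed simplification: verbosity scales Mercury's whole bill."""
    return list_ratio / verbosity


low, high = (realized_ratio(r, VERBOSITY) for r in LIST_PRICE_RATIO)
print(f"input {input_ratio:.1f}x, output {output_ratio:.1f}x")
print(f"realized {low:.1f}x to {high:.1f}x")  # realized 4.6x to 5.0x
```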

the orchestration layer above it

the orchestration layer runs two classes of steering move, identified across the two measurement programmes the lab has published[5]: prior healthcare-voice persona-steering on Mercury (Class 1) and the Jigu τ²-bench work (Class 2, deterministic intermediation on tool-use rule-following). the two classes engage different failure modes and live in different parts of the request path. a third class (reasoning / retrieval) is identified but not addressed by the current scaffolds.

Class 1, prompt-level steering. a deterministic composer (we call it the NOW-window harness) assembles each turn's prompt from a small set of typed, source-graded observation primitives drawn from a larger inventory: behavioural markers, registers, repair patterns, refusal cadences, lived-experience anchors. each primitive is independently authored, versioned, and traceable to a recorded source. the composer runs in under 1 ms per turn[6]. Mercury receives a pre-structured prompt that binds the relevant primitive cards, the active role overlay, the memory partition, and the calibration anchors for the detected channel; the model emits one turn against that bound state. the persona is composition, not a fine-tune. this is the mechanism the steering-coherence numbers in the next section measure.
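a minimal sketch of what a deterministic composer of this kind could look like. the `Primitive` fields, the section headers, and the sort order are all hypothetical; the only property taken from the text is that composition involves no model call and no randomness, so identical inputs give an identical prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    """One observation primitive (all field names are hypothetical)."""
    kind: str      # e.g. "register", "repair_pattern", "refusal_cadence"
    text: str
    version: str
    source: str    # pointer to the recorded source it was drawn from


def compose_prompt(inventory, wanted_kinds, role_overlay, memory, anchors):
    """Assemble one turn's prompt from typed primitives.

    Sorting the selected cards makes the output independent of the order
    in which the inventory happens to arrive.
    """
    cards = sorted(
        (p for p in inventory if p.kind in wanted_kinds),
        key=lambda p: (p.kind, p.version, p.text),
    )
    sections = [
        "## role\n" + role_overlay,
        "## primitives\n" + "\n".join(
            f"- [{p.kind} v{p.version} src={p.source}] {p.text}" for p in cards
        ),
        "## memory\n" + "\n".join(memory),
        "## calibration\n" + "\n".join(anchors),
    ]
    return "\n\n".join(sections)
```

because each card carries its version and source into the prompt, any turn can be traced back to the primitives that produced it.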

Class 2, deterministic intermediation at the tool-call boundary. some failure modes are not behavioural drift; they are policy-conformance failures. the model knows the rule but misapplies it on multi-criteria checks (the +11 pp lift on τ²-bench airline for Mercury, 2.4× the lift GPT-5.2 gets, is the production-scale measurement of this). for those cases we intercept tool calls and tool results in the orchestration layer: pre-flight validators check eligibility before the call is dispatched, result enrichers compute policy-relevant fields the model would otherwise have to infer, and regex routing forces canonical formats on free-text fields where format variance was the cause of intermittent failure. Mercury's system prompt is byte-identical to the unscaffolded baseline; the leverage is at the API boundary, not in the instructions[7].
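the three interception moves named above can be sketched as plain functions sitting between the model and the tools. everything concrete here is hypothetical: the six-character reservation format, the cancellation rules, and the baggage table are stand-ins chosen for illustration, not the Jigu scaffold's actual rules.

```python
import re

# Hypothetical canonical format: six uppercase alphanumerics.
RESERVATION_ID = re.compile(r"[A-Z0-9]{6}")

# Hypothetical policy table keyed on (cabin class, membership tier).
FREE_BAGS = {
    ("economy", "regular"): 1,
    ("economy", "gold"): 3,
    ("business", "regular"): 2,
    ("business", "gold"): 3,
}


def canonical_reservation_id(raw: str) -> str:
    """Regex routing: force a free-text ID into one canonical form."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    if not RESERVATION_ID.fullmatch(cleaned):
        raise ValueError(f"unrecognised reservation id: {raw!r}")
    return cleaned


def preflight_cancel(reservation: dict) -> tuple[bool, str]:
    """Pre-flight validator: decide eligibility before the call is dispatched."""
    if reservation["hours_since_booking"] <= 24:
        return True, "within 24h of booking"
    if reservation["cabin"] == "business":
        return True, "business cabin"
    if reservation["has_insurance"]:
        return True, "travel insurance"
    return False, "no cancellation rule applies"


def enrich_reservation(result: dict, tier: str) -> dict:
    """Result enricher: materialise a field the model would otherwise infer."""
    enriched = dict(result)  # leave the raw tool result untouched
    enriched["free_bags_per_passenger"] = FREE_BAGS[(result["cabin"], tier)]
    return enriched
```

none of these functions touches the system prompt, which is the point of the class: the multi-criteria check runs in code, and the model only sees the outcome.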

the orchestration layer composes both classes. our customer-service voice operator runs Class 1 on a Mercury substrate (five role overlays, one shared character) with a memory partition per overlay; our healthcare voice persona adds Class 1 on top of Mercury for tone-and-redirect holding; the long-arc EA stack adds Class 2 validators at the calendar and email-send boundaries on top of the Class 1 persona layer. each role overlay carries its own systemInject, voice hints, tool allowlist, and memory partition. the shared character (backstory, sensory anchors, warmth profile, anti-model comparison examples) feeds all overlays. switching role is a configuration change on the composer; it is not a prompt swap and it is not a model change.
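one way to picture "switching role is a configuration change" is a composer that holds one shared character and a table of overlays, and only changes which overlay is active. the class and field names below are hypothetical (`system_inject` stands in for the `systemInject` field mentioned above); this is a sketch of the shape, not the production schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedCharacter:
    backstory: str
    sensory_anchors: tuple[str, ...]
    warmth_profile: str


@dataclass(frozen=True)
class RoleOverlay:
    name: str
    system_inject: str             # corresponds to `systemInject` in the text
    voice_hints: tuple[str, ...]
    tool_allowlist: frozenset[str]
    memory_partition: str


@dataclass
class Composer:
    character: SharedCharacter     # shared by every overlay
    overlays: dict[str, RoleOverlay]
    active: str

    def switch_role(self, name: str) -> None:
        """Change the active overlay; the character and the model stay put."""
        if name not in self.overlays:
            raise KeyError(f"unknown role: {name}")
        self.active = name

    def tool_allowed(self, tool: str) -> bool:
        return tool in self.overlays[self.active].tool_allowlist
```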

how mercury behaves under steering

every LLM is steerable. the question is how steerable, against what kinds of pressure, and whether the layer above the model holds. we built a structured eval to measure this on Mercury alongside 12 other models: 30-turn conversations with 20 paraphrase forks every 5 turns, scored by a 3-judge cross-family panel (Sonnet 4.6 + GPT-5.2 + Gemini 2.5 Pro). it covers 66 runs across 13 models, 3 topics, and 5 conversational scripts (pro / contra / neutral / adversarial / confrontational)[8]. the metrics are opinion-slope (drift of the model's stated position across the conversation), paraphrase variance (stability across reworded prompts), steering absorption (how far the script pulled the model), response coherence (cosine of fork embeddings), and judge agreement (Krippendorff's alpha). the statistical infrastructure was built by Christian Yongwhan Lim[9].
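three of the metrics above are simple enough to sketch in a few lines. the text says only that the slope estimator is robust; Theil-Sen (median of pairwise slopes) is used here as an assumed stand-in, population variance is an assumed choice for paraphrase variance, and mean pairwise cosine is an assumed reading of "cosine of fork embeddings".

```python
import math
from itertools import combinations
from statistics import median, pvariance


def opinion_slope(scores):
    """Theil-Sen slope of stance score against turn index.

    Assumption: the eval's robust estimator is not named in the text;
    the median of pairwise slopes is one standard robust choice.
    """
    points = list(enumerate(scores))
    return median(
        (y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in combinations(points, 2)
    )


def paraphrase_variance(fork_scores):
    """Spread of stance scores across reworded versions of one prompt."""
    return pvariance(fork_scores)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def response_coherence(embeddings):
    """Mean pairwise cosine similarity between fork embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

the reason to prefer a median-based slope here is that a single outlier turn (one off-script answer) should not register as drift across the whole conversation.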

raw Mercury-2 absorbs the conversational script almost completely. opinion swings from −1.50 under a contra-script to +0.63 under a pro-script on a −2 to +2 scale, a 2.13-point spread on the same topic, with paraphrase variance ~0.59 (above the 0.4 stability threshold the eval flags as “spinning”). that makes it moderately to highly steerable: more than GPT-5.2 or Sonnet 4.6, less than Grok-4. add a persona under our orchestration layer and the absorption delta lands at +1.13 to +2.03 in the direction the persona was designed for[10]. those deltas were measured on Claude Sonnet 4.6; the same stabilization pattern holds under the orchestration layer on Mercury. the orchestration layer holds the steering instead of letting the script carry it. that is the measurement that backs the architectural claim.

[figure: contra vs pro · raw model, no orchestration · same topic · n=45 runs · updated March 2026]
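the spread, the threshold check, and the persona deltas quoted above are plain differences on the −2 to +2 scale. the snippet below just reproduces that arithmetic from the published figures; the variable names are illustrative.

```python
# Raw Mercury-2, same topic, figures as quoted in the text.
CONTRA_SCORE, PRO_SCORE = -1.50, 0.63
PARAPHRASE_VARIANCE = 0.59
SPINNING_THRESHOLD = 0.4

spread = round(PRO_SCORE - CONTRA_SCORE, 2)
is_spinning = PARAPHRASE_VARIANCE > SPINNING_THRESHOLD

# Persona deltas (baseline -> with overlay), as quoted in the endnotes.
persona_runs = {
    "iran-us-strikes": (-1.03, 1.00),
    "sam-altman": (-1.00, 0.13),
}
deltas = {topic: round(after - before, 2) for topic, (before, after) in persona_runs.items()}

print(spread, is_spinning)  # 2.13 True
print(deltas)               # {'iran-us-strikes': 2.03, 'sam-altman': 1.13}
```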

mercury in production

four production deployments run Mercury under our orchestration, each in a different domain, each measured against a published frontier baseline where one exists. all numbers below are warm-streaming production traffic; the VERA and τ²-bench rows are pinned to late April 2026, the long-arc EA row to early May, the customer-service voice loop to May 2026.

deployment	what we ship	result	frontier comparison
healthcare voice persona	Mercury + prompt-level orchestration, primary-care outreach	10% → 80% accuracy at 503 ms	published sub-2s frontier ceiling ~11% (Lin 2025)
customer-service voice ops	Mercury, five operator roles on one shared character	89.5/100 production eval, 488 ms voice loop	no comparable multi-role frontier baseline
tool-use rule-following (τ²-bench)	Mercury + deterministic validators, airline + retail policy	+11pp lift; 2.4× the lift GPT-5.2 gets	trails frontier ~14pp on absolute; ~5× cheaper at threshold
long-arc EA persona	Mercury, 25-turn durational eval, EA workflows	1.8× lower latency, 6× tighter p95 vs hybrid	matches a Mercury+frontier hybrid on accuracy and tone

the load-bearing claim across the four rows: Mercury under our orchestration occupies a corner of the speed-versus-accuracy frontier that the published research reported empty[11]. that is the region the rest of the stack is currently trying to reach.

mercury running five operator roles on one shared character

one production deployment runs one Mercury instance carrying five distinct operator roles on one shared character: agent on the customer call, coach whispering in the junior's ear, supervisor on the post-call debrief, scorer on the transcript, and a co-watcher seat alongside the user. the five differ in configuration, not in training. on the production eval suite the persona holds a composite of 89.5 / 100, with the co-watcher role at 94.7 / 100 and zero recorded character breaks across its 19 multi-turn drift checks[12]. the voice loop runs at 488 ms end-to-end with first-token at 196 ms.

co-watcher is the role that depends most on Mercury's latency. Mercury renders a take in the window between something happening on screen and the user's attention moving on. at frontier latency that window does not exist. cross-genre ingest is holding without retraining between genres: six minutes of a drama clip the system had never seen came back with a critic-grade synthesis on what the show was doing to its women (chase, then drain; held in place rather than pushed around for a joke)[13]. character names were stripped from the input. same orchestration, different surface.

where mercury fits today

five application classes where Mercury under our orchestration is the production choice today[14]:

voice / sub-2-second conversational interfaces: the speed-and-accuracy region autoregressive frontier models do not reach.
long-arc persona consistency (durational EA, coaching, multi-month engagements): highest tone composite at the lowest latency we have measured.
tool-use rule-following at ~70–75% pass¹ acceptance, cost-bound: application-acceptable accuracy at ~5× lower cost than frontier + same scaffold.
healthcare voice persona (HIPAA-aware): live in primary-care production.
cost-sensitive scale (millions of trajectories per day): the cost advantage holds at production scale.

not the right model today for rule-following requiring ≥85% pass¹, multimodal workloads, banking-domain tool use, snapshot-pinned regulated workflows, or long-form generation above 2,000 tokens. our orchestration runs on frontier models too; for those cases the substrate choice is frontier and the cost arithmetic is different.

endnotes

1. Mercury-2 by Inception Labs. OpenAI-compatible chat completions at https://api.inceptionlabs.ai/v1. Diffusion-class language model producing all output tokens in parallel rather than autoregressively. Competitive on aggregate frontier benchmarks at substantially lower cost per token. Pricing at April 2026: $0.25 / $0.75 per million input / output tokens.
2. Production voice loop, May 2026. Mercury TTFB 196 ms p50 / Cartesia TTS 287 ms p50 / total session 488 ms p50 / 715 ms p95. Measured after the reasoning_effort=instant + realtime=true tuning landed (3.29× speedup over the earlier baseline). The April 2026 healthcare voice deployment, measured pre-tuning, ran Mercury at 503 ms architected / 513 ms raw.
3. Stivers, T., Enfield, N. J., et al. (2009). Universals and cultural variation in turn-taking in conversation. PNAS 106(26), 10587–10592. Median between-turn gap ~200 ms across 10 typologically diverse languages.
4. Per-trajectory cost analysis. List-price ratio 12–13× at April 2026 pricing (Inception $0.25 / $0.75 per million input / output tokens vs GPT-5.2 $1.75 / $14.00). Realized ratio after Mercury's documented ~2.6× verbosity (Artificial Analysis third-party measurement) is approximately 5× cost-per-correct-answer at the 71% pass¹ acceptance threshold on Sierra τ²-bench airline.
5. The two-class theory of harnessing: Class 1 (prompt-level steering) addresses behavioural / stylistic drift; Class 2 (deterministic intermediation at the tool-call boundary) addresses policy-conformance failures on rule-following domains. Class 3 (reasoning / search / retrieval) is identified but not addressed by current scaffolds. Developed across our healthcare-voice persona-steering work and Jigu τ²-bench measurement programmes. The current Mercury+orchestration stack runs Class 1 and Class 2 in composition.
6. The NOW-window composer assembles each turn's prompt from a typed, source-graded inventory of observation primitives. The architectural lineage is in Bach's published cognitive-modelling work (Principles of Synthetic Intelligence, OUP 2009; perceptual binding state 2019). The field-research programme that produces the primitive inventory is the Behaviour Census, owned by the lab's head of research, Réane Delaunoy.
7. Jigu scaffold mechanics: pre-flight regex routing on tool-call arguments, eligibility computation keyed on cabin class and membership tier, enriched tool results that materialise rule preconditions. Measured against τ²-bench airline (50 tasks, 200 simulations at k=4) and retail (114 tasks, 456 simulations at k=4) on Mercury-2 and GPT-5.2. Pass¹ lift on airline +11pp Mercury / +4.5pp GPT-5.2; pass⁴ (consistency) lift larger than pass¹.
8. Opinion Coherence Eval. 66 runs across 13 models (Claude Sonnet 4.6, Gemini 2.5/3.1 Pro, GPT-5.2, Grok-4 family, Mercury, Mercury-2, others), 3 topics (Elon Musk, Sam Altman, Iran-US strikes), 5 conversational scripts (pro / contra / neutral / adversarial / confrontational). 30-turn conversations with 20 paraphrase forks every 5 turns, scored by a 3-judge cross-family panel (Sonnet 4.6 + GPT-5.2 + Gemini 2.5 Pro). Metrics: opinion-slope, paraphrase variance, steering absorption, response coherence (cosine of fork embeddings), judge agreement (Krippendorff's alpha). Live since April 2026.
9. Christian Yongwhan Lim, eval research and statistical infrastructure. Robust slope estimator and Krippendorff's alpha inter-rater reliability, with test coverage. Companion production-side infrastructure: daily-drift monitoring pipeline, cross-session persona-replay tests, model-confidence measurement from log-probabilities, entailment-based judging.
10. Persona deltas measured on Claude Sonnet 4.6 across the same topics and scripts. With one persona overlay on iran-us-strikes: −1.03 baseline → +1.00 (Δ +2.03). With another on sam-altman: −1.00 baseline → +0.13 (Δ +1.13). The same stabilization pattern holds under the orchestration layer on Mercury. The persona effect is direction as well as resistance: one overlay argues back with specifics, another processes the critique analytically; both are predictable. The baseline is a random walk.
11. Lin et al. 2025, Voice Evaluation of Reasoning Ability (VERA), arXiv:2509.26542. The published frontier on sub-2-second TTFB voice systems clustered below 11.3% accuracy at the time of publication. GPT-realtime: 11.3% at 2.69 s. LiveAnswer cascade (GPT-5 reasoning + Llama narration): 27.0% at 10.5 s. The harnessed-Mercury point at 80% / 503 ms occupies a previously-empty corner of that frontier; the current customer-service voice measurement at 488 ms end-to-end is further into it.
12. Production eval baseline (2026-05-29). 27 scenarios across five roles, 61 individual checks of which 21 are multi-turn. Deterministic scoring layer plus three-judge LLM ensemble (Sonnet 4.6 + GPT-4o-mini + Haiku 4.5). Composite by role: agent on call 91.9, co-watcher 94.7 (0 character breaks across 19 multi-turn checks), coach 84.3, supervisor 82.8, scorer 82.8.
13. Cross-genre ingest, June 2026. Drama clip (S02E05 of a prestige series): composite 74.5 first iteration. Comedy clip (S08 of a long-running sitcom): composite 69.4. Character-floor lens holds at 7.0/10 across all takes on both genres. Character names stripped from input.
14. Decision matrix across ten application classes; five where Mercury+harness wins today (in the body), five where frontier wins today: rule-following at ≥85% pass¹, multimodal workloads, banking-domain tool use, snapshot-pinned regulated workflows, long-form >2000-token generation.