we steer mercury into agents that feel human

Mercury, by Inception Labs [1], is a fast diffusion language model: sub-200 ms first-token latency on warm production traffic, with output tokens generated in parallel rather than streamed one at a time. like every other commercially available LLM, raw Mercury drifts in whatever direction the conversation pushes it. that drift is the failure mode the rest of the AI industry has been quietly working around with system prompts and fine-tunes. it is also the leverage. an orchestration layer above the model can hold the steering against pressure instead of caving to it, and the combination of a fast model and a steering layer is what produces an agent that feels human across a long conversation.

this page is about Mercury under that orchestration layer: what Mercury is at the model level, how the orchestration layer above it works, how Mercury behaves under structured steering, and where it is running in production today.

mercury at the model level

three properties matter for the orchestration layer above it.

first-token latency below the human conversational floor. on our current Vincent deployment, Mercury's first token arrives in 196 ms (p50) and the full voice loop closes in 488 ms end-to-end (p50) [2]. the first-token figure sits just under the ~200 ms median gap humans leave between conversational turns across languages and cultures [3]. that latency band is what lets the AI sit inside the moment rather than running behind it. a face trigger, a screen change, a tone shift in the user's voice gets a response inside the window where the event still matters.

whole-answer generation, paired with a frontier reasoning fallback. Mercury produces its output tokens in parallel rather than one at a time, so it finishes a turn in roughly the time a frontier reasoning model is still planning. our orchestration layer uses this to run a hybrid pattern: Mercury handles the conversational default; a frontier reasoning model (we call it the oracle in our routing layer) runs alongside it on substantive turns, pre-computing its answer inside the speculative window. routine turns stay in rhythm on Mercury. substantive turns route to the oracle when the conversation state demands it.
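the hybrid pattern can be sketched as follows. this is a minimal illustration under stated assumptions, not our routing layer: `fast_model`, `oracle`, and `is_substantive` are hypothetical stand-ins, latencies are simulated with `asyncio.sleep`, and the holding reply on substantive turns is one possible way to keep rhythm while the oracle finishes.

```python
import asyncio


async def fast_model(turn: str) -> str:
    # Hypothetical stand-in for the fast conversational model.
    await asyncio.sleep(0.01)
    return f"fast:{turn}"


async def oracle(turn: str) -> str:
    # Hypothetical stand-in for the slower frontier reasoning model.
    await asyncio.sleep(0.05)
    return f"oracle:{turn}"


def is_substantive(turn: str) -> bool:
    # Assumed heuristic, for illustration only: long turns count as substantive.
    # A real router would read conversation state instead of counting words.
    return len(turn.split()) > 8


async def respond(turn: str) -> list[str]:
    """Return the messages to deliver, in order, for one user turn."""
    if not is_substantive(turn):
        return [await fast_model(turn)]
    # Start the oracle before awaiting the fast model, so the slow answer is
    # being computed while the fast reply keeps the conversation moving.
    oracle_task = asyncio.create_task(oracle(turn))
    holding_reply = await fast_model(turn)
    return [holding_reply, await oracle_task]


if __name__ == "__main__":
    print(asyncio.run(respond("how are you?")))
```

the design point the sketch makes is ordering: the oracle task is created before the fast reply is awaited, so the two overlap instead of running back to back.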

cost-per-correct-answer at production scale. on rule-following workloads, Mercury under our orchestration scaffold reaches the application-acceptable accuracy threshold at approximately five times lower cost per trajectory than frontier reasoning models under the same scaffold [4]. the list-price ceiling is 12–13× (Mercury at $0.25 / $0.75 per million input / output tokens against GPT-5.2 at $1.75 / $14.00, April 2026 pricing); after Mercury's ~2.6× verbosity, the production-realized advantage is roughly 5×.
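a rough way to reproduce that arithmetic, as a sketch only. the list prices and the ~2.6× verbosity figure come from the text above; the input:output token mixes are our assumption for illustration (the per-trajectory mix is not stated here), and dividing the list-price ratio by verbosity is a simplification of the full per-trajectory cost analysis.

```python
# List prices in USD per million tokens (input, output), April 2026.
MERCURY = (0.25, 0.75)
GPT_5_2 = (1.75, 14.00)

# Mercury's reported verbosity multiplier relative to the frontier model.
VERBOSITY = 2.6


def trajectory_cost(prices, input_tokens, output_tokens):
    """Cost in USD of one trajectory at the given list prices."""
    price_in, price_out = prices
    return (price_in * input_tokens + price_out * output_tokens) / 1_000_000


def list_price_ratio(input_tokens, output_tokens):
    """How many times more the frontier model costs for the same token counts."""
    frontier = trajectory_cost(GPT_5_2, input_tokens, output_tokens)
    mercury = trajectory_cost(MERCURY, input_tokens, output_tokens)
    return frontier / mercury


# Assumed token mixes (input:output); hypothetical, chosen for illustration.
ratio_4_to_1 = list_price_ratio(4000, 1000)  # about 12.0
ratio_3_to_1 = list_price_ratio(3000, 1000)  # about 12.8

# Simplified realized ratio: list-price advantage discounted by verbosity.
realized_low = ratio_4_to_1 / VERBOSITY   # about 4.6
realized_high = ratio_3_to_1 / VERBOSITY  # about 4.9
```

under these assumed mixes the list-price ratio lands in the 12–13× band and the discounted ratio lands just under 5×, which is consistent with the figures quoted above.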

the orchestration layer above it

the orchestration layer runs two classes of steering move, identified across the two measurement programmes the lab has published [5]. the two classes engage different failure modes and live in different parts of the request path.

Class 1, prompt-level steering. a deterministic composer (we call it the NOW-window harness) assembles each turn's prompt from a small set of typed, source-graded observation primitives drawn from a larger inventory: behavioural markers, registers, repair patterns, refusal cadences, lived-experience anchors. each primitive is independently authored, versioned, and traceable to a recorded source. the composer runs in under 1 ms per turn [6]. Mercury receives a pre-structured prompt that binds the relevant primitive cards, the active role overlay, the memory partition, and the calibration anchors for the detected channel; the model emits one turn against that bound state. the persona is composition, not a fine-tune. this is the mechanism the steering-coherence numbers in the next section measure.
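a minimal sketch of what a deterministic composer of this kind could look like. every name here (`Primitive`, `compose_prompt`, the section headings, the card format) is hypothetical and chosen for illustration; it is not the NOW-window harness itself. the property it demonstrates is that the same inputs always produce a byte-identical prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    primitive_id: str  # stable identifier for the card
    kind: str          # e.g. "marker", "register", "refusal_cadence"
    version: str       # each primitive is versioned independently
    source: str        # the recorded source it traces back to
    text: str          # the observation itself


def compose_prompt(inventory, wanted_kinds, role_overlay, memory, anchors):
    """Bind primitives, role overlay, memory and anchors into one prompt.

    Selection and ordering depend only on the arguments, so the output is
    deterministic: no sampling, no model call.
    """
    cards = sorted(
        (p for p in inventory if p.kind in wanted_kinds),
        key=lambda p: (p.kind, p.primitive_id),
    )
    sections = [
        "## role\n" + role_overlay,
        "## primitives\n"
        + "\n".join(f"[{p.primitive_id}@{p.version}] {p.text}" for p in cards),
        "## memory\n" + "\n".join(memory),
        "## calibration\n" + "\n".join(anchors),
    ]
    return "\n\n".join(sections)
```

sorting on `(kind, primitive_id)` is what makes the result independent of inventory order, and carrying `version` into each card keeps the prompt traceable to the exact primitive that produced it.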

Class 2, deterministic intermediation at the tool-call boundary. some failure modes are not behavioural drift; they are policy-conformance. the model knows the rule but mis-applies it on multi-criteria checks (the τ²-bench airline +11 pp lift on Mercury, 2.4× the lift GPT-5.2 gets, is the production-scale measurement of this). for those cases we intercept tool calls and tool results in the orchestration layer: pre-flight validators check eligibility before the call is dispatched, result enrichers compute policy-relevant fields the model would otherwise have to infer, and regex routing forces canonical formats on free-text fields where format variance was the cause of intermittent failure. Mercury's system prompt is byte-identical to the unscaffolded baseline; the leverage is at the API boundary, not in the instructions [7].
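the three moves (regex routing, pre-flight validation, result enrichment) can be sketched in a few lines. this is an illustration under stated assumptions, not the Jigu scaffold: the reservation-ID format, the tool names, the bag-allowance table and the basic-economy rule are all hypothetical.

```python
import re

# Assumed ID format, for illustration: six alphanumerics containing a digit.
_RESERVATION_ID = re.compile(r"\b(?=[A-Za-z0-9]*\d)[A-Za-z0-9]{6}\b")

# Hypothetical policy table: free checked bags keyed on (tier, cabin).
FREE_BAGS = {
    ("regular", "basic_economy"): 0,
    ("regular", "economy"): 1,
    ("gold", "basic_economy"): 2,
    ("gold", "economy"): 3,
}


def canonical_reservation_id(raw: str):
    """Regex routing: pull a canonical upper-case ID out of free text."""
    match = _RESERVATION_ID.search(raw)
    return match.group(0).upper() if match else None


def preflight(tool_name: str, reservation: dict):
    """Pre-flight validator: decide eligibility before the call is dispatched."""
    # Hypothetical rule: basic economy reservations cannot change flights.
    if tool_name == "update_flights" and reservation["cabin"] == "basic_economy":
        return False, "basic economy reservations cannot change flights"
    return True, ""


def enrich(result: dict, reservation: dict) -> dict:
    """Result enricher: materialise a field the model would otherwise infer."""
    key = (reservation["tier"], reservation["cabin"])
    return {**result, "free_bags_allowed": FREE_BAGS.get(key, 0)}


def intercept(tool_call: dict, reservations: dict, dispatch):
    """Run one tool call through canonicalise -> validate -> dispatch -> enrich."""
    args = dict(tool_call["arguments"])
    rid = canonical_reservation_id(args.get("reservation_id", ""))
    if rid is None or rid not in reservations:
        return {"error": "unknown reservation id"}
    args["reservation_id"] = rid
    ok, reason = preflight(tool_call["name"], reservations[rid])
    if not ok:
        return {"error": reason}  # blocked: dispatch is never called
    return enrich(dispatch(tool_call["name"], args), reservations[rid])
```

nothing in the sketch touches the system prompt: all three moves sit between the model's tool call and the backend, which is the point of working at the API boundary.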

the orchestration layer composes both classes. Vincent runs Class 1 on a Mercury substrate (five role overlays, one shared character) with a memory partition per overlay; the Alyssa healthcare voice persona adds Class 1 on top of Mercury for tone-and-redirect holding; the Vera EA durational stack adds Class 2 validators at the calendar and email-send boundaries on top of the Class 1 persona layer. each role overlay carries its own systemInject, voice hints, tool allowlist, and memory partition. the shared character (backstory, sensory anchors, warmth profile, anti-model comparison examples) feeds all overlays. switching role is a configuration change on the composer; it is not a prompt swap and it is not a model change.
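a sketch of role switching as a configuration change. `systemInject`, voice hints, tool allowlist and memory partition are the per-overlay fields named above; the class names, the `may_call` helper and the example values are hypothetical and only illustrate the shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedCharacter:
    backstory: str
    warmth_profile: str


@dataclass(frozen=True)
class RoleOverlay:
    name: str
    system_inject: str              # the overlay's systemInject text
    voice_hints: tuple[str, ...]
    tool_allowlist: frozenset[str]  # tools this role is permitted to call
    memory_partition: str           # key of this overlay's own memory store


@dataclass(frozen=True)
class BoundAgent:
    character: SharedCharacter
    overlay: RoleOverlay

    def may_call(self, tool: str) -> bool:
        """Enforce the active overlay's tool allowlist."""
        return tool in self.overlay.tool_allowlist


def switch_role(agent: BoundAgent, overlay: RoleOverlay) -> BoundAgent:
    # Same character, different overlay: no new prompt text is authored and
    # the underlying model is unchanged.
    return BoundAgent(character=agent.character, overlay=overlay)


# Hypothetical example values.
character = SharedCharacter(backstory="ten years on the floor", warmth_profile="dry, kind")
agent_role = RoleOverlay("agent", "you are on the customer call", ("steady pace",),
                         frozenset({"lookup_order"}), "mem/agent")
coach_role = RoleOverlay("coach", "you whisper to the junior", ("low volume",),
                         frozenset({"read_transcript"}), "mem/coach")

on_call = BoundAgent(character, agent_role)
coaching = switch_role(on_call, coach_role)
```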

how mercury behaves under steering

every LLM is steerable. the question is how steerable, against what kinds of pressure, and whether the layer above the model holds. we built a structured eval to measure this on Mercury alongside 12 other models. each run is a 30-turn conversation with 20 paraphrase forks every 5 turns, scored by a 3-judge cross-family panel (Sonnet 4.6 + GPT-5.2 + Gemini 2.5 Pro). the eval covers 66 runs across 13 models, 3 topics, and 5 conversational scripts (pro / contra / neutral / adversarial / confrontational) [8]. the metrics are opinion-slope (Theil-Sen robust regression for drift with the script), paraphrase variance (stability across reworded prompts), steering absorption (how far the script pulled the model), response coherence (cosine of fork embeddings), and judge agreement (Krippendorff's alpha). the statistical infrastructure was built by Christian Yongwhan Lim [9].
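for readers unfamiliar with the opinion-slope metric, the Theil-Sen estimator is the median of the slopes between every pair of points, which makes it robust to a single outlying turn. the sketch below is a plain standard-library version for illustration, not the eval's own implementation, and the scores are made-up example data.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(xs, ys):
    """Median of the slopes between every pair of points with distinct x."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    return median(slopes)


# Hypothetical opinion scores on the -2..+2 scale, sampled every 5 turns.
turns = [5, 10, 15, 20, 25, 30]
scores = [0.0, 0.1, 0.2, 2.0, 0.4, 0.5]  # turn 20 is a one-off outlier

slope = theil_sen_slope(turns, scores)  # about 0.02 per turn, despite the outlier
```

ten of the fifteen pairwise slopes are untouched by the outlier, so the median stays on the underlying trend; an ordinary least-squares fit on the same data would be pulled toward the spike at turn 20.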

raw Mercury-2 absorbs the conversational script almost completely. opinion swings from −1.50 under a contra-script to +0.63 under a pro-script on a −2 to +2 scale, a 2.13-point spread on the same topic, with paraphrase variance ~0.59 (above the 0.4 stability threshold the eval flags as “spinning”). that makes it moderately to highly steerable: more than GPT-5.2 or Sonnet 4.6, less than Grok-4. add a persona under our orchestration layer and the absorption delta lands at +1.13 to +2.03 in the direction the persona was designed for; those deltas were measured on Claude Sonnet 4.6, and the same stabilization pattern holds under the orchestration layer on Mercury [10]. the orchestration layer holds the steering instead of letting the script carry it. that is the measurement that backs the architectural claim.

[figure: mean absorption on a −2 to +2 scale, contra script / pro script. grok-4: −2.00 / +1.58. mercury-2: −1.50 / +0.63 (2.13-point swing). gpt-5.2: −0.35 / +0.91. sonnet-4-6: −1.01 under the pro script (script ignored, prior conviction holds).]
contra vs pro · raw model, no orchestration · same topic · n=45 runs · updated march 2026
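the ranking in the text follows directly from the figure's numbers: the spread is the pro-script mean minus the contra-script mean. a short check, using only the values reported above (sonnet-4-6 is left out because only its pro-script value is shown):

```python
# (contra, pro) mean absorption on the -2..+2 scale, raw models.
absorption = {
    "grok-4": (-2.00, 1.58),
    "mercury-2": (-1.50, 0.63),
    "gpt-5.2": (-0.35, 0.91),
}

# Spread = how far the script moved the model between contra and pro.
spread = {
    model: round(pro - contra, 2)
    for model, (contra, pro) in absorption.items()
}

# Most steerable first.
ranked = sorted(spread, key=spread.get, reverse=True)
```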

mercury in production

four production deployments running Mercury under our orchestration, each in a different domain, each measured against a published frontier baseline where one exists. all numbers below are warm-streaming production traffic; the VERA and τ²-bench rows are pinned to late April 2026, the Vera EA durational row to early May, the Vincent voice loop to May 2026.

deployment | what we ship | result | frontier comparison
healthcare voice (Alyssa) | Mercury + prompt-level orchestration, primary-care outreach | 10% → 80% accuracy at 503 ms | published sub-2s frontier ceiling ~11% (Lin 2025)
customer-service ops (Vincent) | Mercury, five operator roles on one shared character | 89.5/100 production eval, 488 ms voice loop | no comparable multi-role frontier baseline
tool-use rule-following (τ²-bench) | Mercury + deterministic validators, airline + retail policy | +11pp lift; 2.4× the lift GPT-5.2 gets | trails frontier ~14pp on absolute; ~5× cheaper at threshold
long-arc persona (Vera EA) | Mercury, 25-turn durational eval, EA workflows | 1.8× lower latency, 6× tighter p95 vs hybrid | matches a Mercury+frontier hybrid on accuracy and tone

the load-bearing claim across the four rows: Mercury under our orchestration occupies a corner of the speed-versus-accuracy frontier the published research reported empty [11]. that is the region the rest of the stack is currently trying to reach.

vincent: mercury running five operator roles

Vincent is one Mercury instance carrying five distinct operator roles on one shared character: agent on the customer call, coach whispering in the junior's ear, supervisor on the post-call debrief, scorer on the transcript, and a co-watcher seat alongside the user. the five differ in configuration, not in training. on the production eval suite Vincent holds a composite of 89.5 / 100, with co-watcher at 94.7 / 100 and zero captured failures across 19 multi-turn cells [12]. the voice loop runs at 488 ms end-to-end with first-token at 196 ms.

co-watcher is the role that depends most on Mercury's latency. Mercury renders a take in the window between something happening on screen and the user's attention moving on. at frontier latency that window does not exist. cross-genre ingest is holding without retraining between genres: Vincent watching six minutes of Euphoria came back with a critic-grade synthesis on what the show was doing to its women (chase, then drain; held in place rather than pushed around for a joke) [13]. character names were stripped from the input. same orchestration, different surface.

the partnership with Inception

Inception built Mercury. we built the orchestration layer above it. joint research is dual-credited, customer-facing materials carry both logos, and the partnership is non-exclusive: our orchestration layer runs on Mercury, on frontier reasoning models in our hybrid pattern, and (in flight) on Hermes 4 70B. two operational asks have come out of our production measurements that would extend Mercury into more application classes: dated model snapshots (the cleanest mitigation for the endpoint-variance window we documented in April 2026) and seed parameter support (for benchmark-grade reproducibility) [14]. five research drafts are in the co-draft pipeline as of June 2026: Mercury at Production Scale (3-domain, external draft), the cross-substrate Jigu extension on Hermes 4 70B (in flight), the Vera EA case study, the VERA-Healthcare findings, and the Mercury substrate-behaviors operational note.

where mercury fits today

five application classes where Mercury under our orchestration is the production choice today [15]:

  • voice / sub-2-second conversational interfaces — the speed-and-accuracy region autoregressive frontier does not reach.
  • long-arc persona consistency (durational EA, coaching, multi-month engagements) — highest tone composite at lowest latency we have measured.
  • tool-use rule-following at ~70–75% pass¹ acceptance, cost-bound — application-acceptable accuracy at ~5× lower cost than frontier + same scaffold.
  • healthcare voice persona (HIPAA-aware) — the Alyssa pattern, live in primary-care production at MD Well.
  • cost-sensitive scale (millions of trajectories per day) — the cost advantage holds at production scale.

not the right model today for rule-following requiring ≥85% pass¹, multimodal workloads, banking-domain tool use, snapshot-pinned regulated workflows, or long-form generation above 2000 tokens. our orchestration runs on frontier models too; for those cases the substrate choice is frontier and the cost arithmetic is different.

endnotes

  1. Mercury-2 by Inception Labs. OpenAI-compatible chat completions at https://api.inceptionlabs.ai/v1. Diffusion-class language model producing all output tokens in parallel rather than autoregressively. Pricing at April 2026: $0.25 / $0.75 per million input / output tokens.
  2. Vincent voice loop, May 2026. Mercury TTFB 196 ms p50 / Cartesia TTS 287 ms p50 / total session 488 ms p50 / 715 ms p95. Measured after the reasoning_effort=instant + realtime=true tuning landed (3.29× speedup over earlier baseline). The April 2026 Alyssa healthcare deployment, measured pre-tuning, ran Mercury at 503 ms architected / 513 ms raw.
  3. Stivers, T., Enfield, N. J., et al. (2009). Universals and cultural variation in turn-taking in conversation. PNAS 106(26), 10587–10592. Median between-turn gap ~200 ms across 10 typologically diverse languages.
  4. Per-trajectory cost analysis. List-price ratio 12–13× at April 2026 pricing. Realized ratio after Mercury's documented ~2.6× verbosity (Artificial Analysis third-party) is approximately 5× cost-per-correct-answer at the 71% pass¹ acceptance threshold on Sierra τ²-bench airline.
  5. The two-class theory of harnessing: Class 1 (prompt-level steering) addresses behavioural / stylistic drift; Class 2 (deterministic intermediation at the tool-call boundary) addresses policy-conformance failures on rule-following domains. Class 3 (reasoning / search / retrieval) is identified but not addressed by current scaffolds. Developed across our Alyssa healthcare-voice and Jigu τ²-bench measurement programmes.
  6. The NOW-window composer assembles each turn's prompt from a typed, source-graded inventory of observation primitives. The architectural lineage is in Bach's published cognitive-modelling work (Principles of Synthetic Intelligence, OUP 2009; perceptual binding state 2019). The field-research programme that produces the primitive inventory is the Behaviour Census, owned by the lab's head of research, Réane Delaunoy.
  7. Jigu scaffold mechanics: pre-flight regex routing on tool-call arguments, eligibility computation keyed on cabin class and membership tier, enriched tool results that materialise rule preconditions. Measured against τ²-bench airline (50 tasks, 200 sims at k=4) and retail (114 tasks, 456 sims at k=4) on Mercury-2 and GPT-5.2. Pass¹ lift on airline +11pp Mercury / +4.5pp GPT-5.2; pass⁴ (consistency) lift larger than pass¹.
  8. Opinion Coherence Eval. 66 runs across 13 models (Claude Sonnet 4.6, Gemini 2.5/3.1 Pro, GPT-5.2, Grok-4 family, Mercury, Mercury-2, others), 3 topics (Elon Musk, Sam Altman, Iran-US strikes), 5 conversational scripts (pro / contra / neutral / adversarial / confrontational). 30-turn conversations with 20 paraphrase forks every 5 turns, scored by a 3-judge cross-family panel (Sonnet 4.6 + GPT-5.2 + Gemini 2.5 Pro). Live since April 2026.
  9. Christian Yongwhan Lim, eval research and statistical infrastructure. Theil-Sen robust regression and Krippendorff's alpha inter-rater reliability, with test coverage. Companion production-side infrastructure: daily-drift monitoring pipeline on Vera, cross-session persona-replay tests, logprob-based confidence endpoint, NLI-based judging.
  10. Persona deltas measured on Claude Sonnet 4.6 across the same topics and scripts. Craig on iran-us-strikes: −1.03 baseline → +1.00 with Craig (Δ +2.03). Priya on sam-altman: −1.00 baseline → +0.13 with Priya (Δ +1.13). The same stabilization pattern holds under the orchestration layer on Mercury. The persona effect is direction as well as resistance: Craig argues back with specifics, Remi processes the critique analytically, both are predictable. The baseline is a random walk.
  11. Lin et al. 2025, Voice Evaluation of Reasoning Ability (VERA), arXiv:2509.26542. The published frontier on sub-2-second TTFB voice systems clustered below 11.3% accuracy at the time of publication. GPT-realtime: 11.3% at 2.69 s. LiveAnswer cascade (GPT-5 reasoning + Llama narration): 27.0% at 10.5 s. The harnessed-Mercury point at 80% / 503 ms occupies a previously-empty corner of that frontier; the Vincent measurement at 488 ms end-to-end is further into it.
  12. Vincent production eval baseline (2026-05-29). 27 scenarios, 61 cells, 21 multi-turn cells. Deterministic scoring layer plus three-judge LLM ensemble (Sonnet 4.6 + GPT-4o-mini + Haiku 4.5). Composite per role: agent_caller 91.9, co_watcher 94.7 (0 captured failures across 19 multi-turn cells), coach_junior 84.3, supervisor_debrief 82.8, qa_scorer 82.8. Iteration trajectory: 74.7 → 83.5 → 88.4 → 89.1 → 90.7 → 89.5.
  13. Vincent ingest on Euphoria S02E05 (drama; composite 74.5 first iteration) and Friends S08 (comedy; composite 69.4), June 2026. Character-floor lens holds at 7.0/10 across all takes on both genres. Character names stripped from input.
  14. Inception collaboration. Joint research notes dual-credited; no NDA on public materials; no commercial terms tied to the collaboration plan. Five research drafts in the co-draft pipeline as of June 2026.
  15. Decision matrix across ten application classes; five where Mercury+harness wins today (in the body), five where frontier wins today.