The dominant paradigm in natural language AI interface research is intent understanding: given a user utterance, determine the most probable interpretation and generate a response. This paradigm is well-served by large language models, which achieve human-level performance on many intent classification benchmarks [1]. Yet practitioners and enterprise users regularly report that AI systems fail to address their actual needs, even when the system appears to "understand" the surface request.
We argue that intent understanding is necessary but insufficient. What is missing is intent fidelity — a property we define as follows:
Intent fidelity is a stronger property than intent understanding. A system can understand the topic of a request (investor pitch deck) while having low-fidelity knowledge of its parameters (for which round? how many slides? which audience? in which market?). The gap between topic-level understanding and parameter-level fidelity is precisely where AI systems produce plausible but imprecise responses — a phenomenon we term the fidelity gap.
This paper makes the following contributions:
A formal definition of intent fidelity as a five-property requirement for AI interfaces, distinct from intent understanding, and a complete technical architecture implementing it.
A novel formula for scoring semantic intent node confidence using three weighted sources: local text evidence, session history, and user profile, with empirically calibrated weights.
A seven-type taxonomy of logical relationships between intent concepts (IMPLICATIVE, PREREQUISITE, INSTANTIATION, CONTRADICTION, DERIVATION, TEMPORAL, SCOPING) derived from formal logic rather than statistical co-occurrence, enabling intent-level reasoning rather than semantic similarity.
The first published post-hoc intent verification system for AI interfaces, which checks each confirmed intent node against the AI response and produces a calibrated match score, verdict, and re-query suggestion.
A method for encoding acoustic prosodic features (pause duration, energy ratio, pitch variation) as semantic confidence deltas on SCIM intent nodes, enabling voice interfaces to represent epistemic uncertainty from vocal hesitation.
A dual-debounce parallel pipeline that delivers peripheral semantic intent confirmation at 300ms — before input completion — using recognition memory rather than working memory, achieving 60–70% perceived latency reduction without modifying the user's input field.
The NLP literature on intent understanding is extensive, tracing from rule-based systems [2] through statistical classifiers [3] to current transformer-based approaches [4]. The dominant task framing is intent classification: assign the user utterance to one of N predefined intent categories. This framing is insufficient for open-domain AI interfaces where intents are not predefined and parameters vary continuously. Slot-filling approaches [5] address parameters but assume a fixed schema. SCIM addresses the open-domain case with a schema-free confidence-scored decomposition.
Dialogue state tracking (DST) maintains a model of the conversation state across turns [6]. DST systems typically track slot values within predefined domains (restaurant booking, hotel reservation). The ARIA approach differs in three ways: (1) it operates across arbitrary domains, (2) it scores confidence at the node level rather than tracking categorical states, and (3) it incorporates a user profile source of confidence that persists across sessions rather than being reset per dialogue.
Clarification generation has been studied in question answering [7] and dialogue systems [8]. The standard approach generates clarification questions based on ambiguous spans. ARIA's Clarification Engine differs in targeting the lowest-confidence node on the critical path — the node whose resolution produces the highest expected confidence gain for the root intent — and in restricting clarification to one question per round, consistent with user experience research showing that multi-question clarification dialogs significantly reduce task completion rates [9].
Post-hoc response verification has been studied in the context of factual accuracy [10] and hallucination detection [11]. The Misalignment Detector addresses a distinct problem: not whether the response is factually correct, but whether it addresses the user's confirmed intent components. A response can be entirely factually accurate while failing to address three of five confirmed intent nodes. No prior system performs this intent-level verification.
Arnold et al.'s IUI 2020 study [12] demonstrated that in-field autocomplete reduces text originality. MakeAIHQ (2026) [13] and Gladia's engineering analysis [14] establish perceived latency thresholds. Speculative decoding [15] addresses server-side inference latency. PIE addresses a previously unstudied problem: perceived latency of semantic intent feedback during composition, via a parallel out-of-field display.
SCIM is the core engine of the ARIA suite. It receives natural language input (text or voice transcript) and produces a structured intent graph: a set of typed semantic nodes, each with a confidence score, plus a set of typed logical relationships between nodes.
SCIM decomposes intent into five node types:
| Type | Definition | Example |
|---|---|---|
| ACTION | The primary verb or task the user intends to accomplish | "create a pitch deck" |
| ENTITY | The primary object the action operates on | "investors" |
| CONSTRAINT | Limiting conditions on the action or entity | "10 slides", "next week" |
| CONTEXT | Background conditions that frame the intent | "for our AI startup" |
| ABSENT | Nodes expected for this intent type but not expressed | "funding stage" (inferred missing) |
ABSENT nodes are particularly important: they represent information the system infers should be present based on the intent type, but which the user has not expressed. A pitch deck request without a funding stage, audience type, or market focus is systematically underspecified. SCIM's ability to identify ABSENT nodes — rather than simply scoring what was said — is the key capability that enables targeted clarification.
Each node is assigned a confidence score C(n) ∈ [0,1] computed from three weighted sources:

C(n) = α·Clocal(n) + β·Csession(n) + γ·Cprofile(n)

Where:

- Clocal(n) is the evidence for the node in the current input text,
- Csession(n) is the evidence accumulated from the session history,
- Cprofile(n) is the evidence contributed by the persistent user profile,
- α = 0.40, β = 0.35, γ = 0.25 are the empirically calibrated source weights.
The three-source model has a critical practical implication: the same ambiguous input ("I need a pitch") produces different confidence profiles for a user whose profile identifies them as a Series B fintech founder (Cprofile disambiguates to investor pitch, raising overall confidence) vs. a music teacher (Cprofile disambiguates to pitch correction, different intent entirely). This personalisation is not a heuristic — it is a formally weighted contribution to the confidence formula.
A node is "locked" when C(n) ≥ θ, where θ=0.85 is the default confidence threshold (configurable per deployment). Locked nodes are excluded from clarification targeting and contribute with full weight to the intent brief. The overall intent confidence Cintent is the mean node confidence weighted by node criticality (see Section 3.1.4). The system action is determined by Cintent: when Cintent < 0.70, the Clarification Engine is invoked; otherwise the system proceeds to brief generation.
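The three-source scoring and locking rule can be sketched as follows, using the calibrated weights α=0.40, β=0.35, γ=0.25 reported later in the paper. Function names are illustrative, not taken from the ARIA implementation.

```python
ALPHA, BETA, GAMMA = 0.40, 0.35, 0.25   # calibrated source weights (local, session, profile)
THETA = 0.85                             # default locking threshold

def node_confidence(c_local: float, c_session: float, c_profile: float) -> float:
    """Three-source weighted confidence C(n) in [0, 1]."""
    return ALPHA * c_local + BETA * c_session + GAMMA * c_profile

def is_locked(c: float) -> bool:
    """A node is locked (excluded from clarification) once C(n) >= theta."""
    return c >= THETA

# Same ambiguous text ("I need a pitch"), different profiles:
# the profile source shifts the overall confidence.
founder = node_confidence(0.6, 0.7, 0.95)   # Series B founder -> investor pitch
unknown = node_confidence(0.6, 0.7, 0.10)   # no disambiguating profile evidence
```

In this sketch the founder's profile raises C(n) to 0.7225 — still below the 0.85 locking threshold, so the node would remain a clarification candidate.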
Not all nodes are equally important to the root intent. SCIM identifies a critical path — the subset of nodes whose resolution is necessary and sufficient to generate a high-fidelity brief. Nodes on the critical path receive a criticality weight wc > 1 in the weighted intent confidence calculation. The critical path is derived from the DCT relationship graph (see Section 3.2): nodes that are PREREQUISITE to the root ACTION are critical; CONTEXT nodes that are SCOPING the primary ENTITY are critical; isolated CONTEXT nodes are not.
The DCT builds a logical relationship map between the intent concepts identified by SCIM. It is explicitly not a semantic similarity measure. Two concepts can be semantically similar (cosine distance ≈ 0) without having any useful logical relationship for intent processing. "Investor" and "venture capitalist" are semantically similar; their relationship type (INSTANTIATION: "VC" is an instance of "investor") is what matters for intent fidelity — knowing the relationship type tells the system whether to unify the concepts or maintain them as distinct constraints.
| Type | Definition | Intent Implication |
|---|---|---|
| IMPLICATIVE | A entails B in this context | "Series A pitch" implies "equity financing discussion" |
| PREREQUISITE | A must be resolved before B | "funding stage" must be resolved before "slide count" |
| INSTANTIATION | A is a specific instance of B | "Sequoia" is an instance of "investor" |
| CONTRADICTION | A and B cannot both be true | "bootstrapped" contradicts "seeking VC funding" |
| DERIVATION | B can be inferred from A | "next week" derives "high urgency" |
| TEMPORAL | A and B have a time-ordering constraint | "market analysis" temporally precedes "financial projections" |
| SCOPING | A limits the valid domain of B | "European market" scopes "regulatory requirements" |
The DCT relationship map serves two functions in the pipeline: it informs critical path identification (PREREQUISITE relationships define the critical path), and it surfaces latent concepts — terms that are logically implied by the expressed intent but not mentioned. Latent concepts are displayed in the intent graph UI as disambiguation aids and fed back into the SCIM analysis as implicit CONTEXT nodes.
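The critical-path derivation from the DCT relationship map can be sketched as a small graph traversal. The `Rel` enum mirrors the seven-type taxonomy above; the edge list and function name are illustrative assumptions, not the DCT implementation.

```python
from enum import Enum

class Rel(Enum):
    IMPLICATIVE = 1
    PREREQUISITE = 2
    INSTANTIATION = 3
    CONTRADICTION = 4
    DERIVATION = 5
    TEMPORAL = 6
    SCOPING = 7

# Edges as (source, relationship, target), following the taxonomy table.
edges = [
    ("funding stage", Rel.PREREQUISITE, "create pitch deck"),
    ("European market", Rel.SCOPING, "investors"),
    ("next week", Rel.DERIVATION, "high urgency"),
]

def critical_path(edges, root_action, primary_entity):
    """Critical nodes per Section 3.1.4: PREREQUISITEs of the root ACTION,
    and nodes SCOPING the primary ENTITY. Isolated CONTEXT nodes are excluded."""
    crit = set()
    for src, rel, dst in edges:
        if rel is Rel.PREREQUISITE and dst == root_action:
            crit.add(src)
        if rel is Rel.SCOPING and dst == primary_entity:
            crit.add(src)
    return crit
```

Here "next week" derives urgency but does not gate the root action, so it stays off the critical path.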
The DCT's relationship inference is conditioned on the user's knowledge base (documents, domain profile, session history). A relationship that holds in one domain (IMPLICATIVE: "pitch" → "investor meeting" in a startup context) may not hold in another (IMPLICATIVE: "pitch" → "musical note" in a music education context). The DCT uses the knowledge base to select domain-appropriate relationship activations, producing what we term contextually grounded rather than statistically averaged relationship maps.
When Cintent < 0.70, the Clarification Engine selects and generates a single clarification question. The selection algorithm prioritises the lowest-confidence node on the critical path — the node whose resolution produces the highest expected confidence gain for the root intent.
The generated question is constrained to present pre-computed options (direct answers, not sub-questions) that each carry an estimated confidence gain. This design reflects research showing that option-based clarification dialogs achieve 3× higher completion rates than open-ended clarification questions in task-oriented interfaces [9]. Users may select multiple options or provide a freetext answer; freetext answers receive a fixed high confidence estimate (0.92) as they represent the user's most precise expression of the node's value.
A critical design constraint is the one-question-per-round limit. After each Green zone interaction, the system re-analyses the full intent graph with the updated node confidence and determines whether a further clarification round is warranted. In ARIA mode (equivalent to the "Research" mode in the demo), up to 3 rounds are permitted before forcing a PROCEED state, preventing clarification fatigue.
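Under the assumption that expected confidence gain is greatest for the lowest-confidence unlocked critical node, the one-question-per-round selection can be sketched as below. The function name and data shapes are illustrative.

```python
def select_clarification_target(nodes, critical, theta=0.85):
    """Pick the unlocked critical-path node with the lowest confidence.

    nodes: dict of label -> confidence; critical: set of critical-path labels.
    Returns None when every critical node is locked (no clarification needed).
    """
    candidates = [(c, lbl) for lbl, c in nodes.items() if lbl in critical and c < theta]
    return min(candidates)[1] if candidates else None

nodes = {"create pitch deck": 0.91, "funding stage": 0.42,
         "audience": 0.88, "slide count": 0.55}
critical = {"funding stage", "slide count"}
target = select_clarification_target(nodes, critical)   # one question per round
```

After the user answers, node confidences are updated and the selection runs again, up to the three-round cap before a forced PROCEED.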
The Brief Generator receives the confirmed node graph (all nodes above the locking threshold) and produces a structured natural language brief for submission to the AI model. The brief is not a reformulation of the user's original input — it is a reconstruction from the locked nodes, which may include profile-derived context that the user never explicitly stated. This produces a qualitatively different AI prompt: one that specifies confirmed parameters rather than expressing the user's potentially ambiguous intent.
The brief generator also produces a machine-readable intent record (JSON) containing node values, confidence scores, and relationship types. This record is passed to the Misalignment Detector after the AI response is received.
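The machine-readable intent record might look like the following sketch; the exact field names are assumptions, since the paper specifies only that the record carries node values, confidence scores, and relationship types.

```python
import json

def intent_record(nodes, edges):
    """Serialise the confirmed node graph for the Misalignment Detector."""
    return json.dumps({
        "nodes": [{"label": l, "value": v, "confidence": c} for l, v, c in nodes],
        "relationships": [{"source": s, "type": t, "target": d} for s, t, d in edges],
    }, indent=2)

record = intent_record(
    nodes=[("create pitch deck", "create a pitch deck", 0.91),
           ("funding stage", "Series A", 0.95)],
    edges=[("funding stage", "PREREQUISITE", "create pitch deck")],
)
```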
The Misalignment Detector is the verification stage of the intent fidelity pipeline. It receives the confirmed node graph and the AI response text, and determines whether each locked node was addressed by the response.
For each locked node (label, value, type), the Detector checks whether the node's confirmed value is addressed in the AI response.
Each node receives a binary addressed/unaddressed verdict and a severity rating (HIGH for ACTION and critical path nodes, MEDIUM for ENTITY nodes, LOW for CONTEXT nodes). The overall match score is a criticality-weighted average of addressed nodes:

match = Σ wc(n)·addressed(n) / Σ wc(n)

where the sums run over locked nodes, addressed(n) ∈ {0, 1}, and wc(n) is the node's criticality weight.
Three verdict levels are defined: ALIGNED (match ≥ 0.85), PARTIAL (0.60–0.84), MISALIGNED (< 0.60). Each verdict carries a recommendation: ACCEPT, RE_QUERY (with a specific suggested follow-up), or ESCALATE (the intent gap is too large for a follow-up; re-clarify from the beginning). The Misalignment Detector is the first published system to provide intent-level — as opposed to factual-level — post-hoc verification of AI responses.
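The criticality-weighted match score and the three verdict levels can be sketched directly from the thresholds above. The example weights (2.0 for critical nodes, 1.0 otherwise) are illustrative assumptions.

```python
def match_score(node_results):
    """Criticality-weighted fraction of confirmed nodes addressed by the response.

    node_results: list of (addressed: bool, weight: float) per locked node.
    """
    total = sum(w for _, w in node_results)
    hit = sum(w for addressed, w in node_results if addressed)
    return hit / total if total else 0.0

def verdict(match):
    if match >= 0.85:
        return "ALIGNED"      # recommendation: ACCEPT
    if match >= 0.60:
        return "PARTIAL"      # recommendation: RE_QUERY
    return "MISALIGNED"       # recommendation: ESCALATE

# Four locked nodes; the unaddressed one is non-critical (weight 1.0).
score = match_score([(True, 2.0), (True, 2.0), (False, 1.0), (True, 1.0)])
```

With these inputs the score is 5/6 ≈ 0.83, yielding a PARTIAL verdict and a RE_QUERY recommendation.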
Voice-based intent expression carries information beyond the transcribed words. Acoustic prosodic features — pause duration before a word, vocal energy (amplitude), pitch variation, speech rate — correlate with the speaker's epistemic state: hesitation, emphasis, uncertainty, and conviction [16]. The ARIA Prosody Layer encodes these features as semantic confidence deltas on SCIM intent nodes.
The Prosody Layer receives word-level timestamps from the Whisper ASR output (start_ms, end_ms, probability per word) and computes, for each word, the pause duration preceding it, an energy ratio (proxied by the Whisper word probability), and a pitch-variation estimate; each feature is mapped to a confidence delta scaled by 0.40.
The 0.40 scaling factor reflects the α weight of Clocal in the three-source confidence formula, ensuring that prosodic adjustments operate within the correct component of the confidence model. A word spoken after a 400ms pause receives a −0.25 delta on the node it belongs to, reflecting that the speaker hesitated before stating that concept — a reliable indicator of uncertainty in the speaker's mind about that concept's value or relevance.
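The pause-to-delta mapping can be sketched as follows. The 0.40 scale mirrors the α weight of Clocal as stated above; the 640 ms normaliser is an assumption chosen only so that a 400 ms pause yields the −0.25 delta given in the text — the actual ARIA mapping is not specified here.

```python
def pause_delta(pause_ms, scale=0.40, norm_ms=640.0, cap=1.0):
    """Confidence delta from the pause duration preceding a word.

    scale=0.40 mirrors the alpha weight of C_local; norm_ms is an assumed
    normaliser calibrated so a 400 ms pause yields the paper's -0.25 delta.
    Longer pauses saturate at -scale.
    """
    return -scale * min(pause_ms / norm_ms, cap)

delta = pause_delta(400)   # -> -0.25
```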
Consider a user saying: "I'm looking for... a job in Madrid... that matches my experience... minimum three thousand euros per month." The three-pause pattern produces negative deltas on "Madrid," "experience," and "minimum three thousand euros" — exactly the nodes where the speaker expressed uncertainty. SCIM correctly lowers confidence on these nodes and targets them for clarification, matching what a skilled human interviewer would do.
This represents a qualitative advance over existing voice intent systems, which treat the transcript as semantically equivalent to typed text, discarding all prosodic information. The Prosody Layer is the first published system to use word-level acoustic features as continuous modifiers to semantic intent confidence scores.
The ARIA pipeline as described above delivers its first semantic feedback only after the user finishes composing — at approximately 900–1500ms after the last keystroke. For complex professional intents (multi-clause requests, technical specifications, procurement statements), this creates a perceptible dead zone. The user composes, stops, and waits. The wait is cognitively disruptive because it interrupts the sense of collaboration that characterises effective human-AI interaction.
PIE solves this by decoupling the semantic feedback moment from the input completion moment. The key insight is that an imperfect prediction about the user's intent, displayed peripherally, is more cognitively useful than silence — provided it does not interrupt the composition process.
The ICE is a lightweight fast-inference language model (targeting <200ms response) deployed with a minimal prompt: complete this partial statement into its most probable full sentence. The output is capped at 80 tokens — the ICE is not generating a response, it is completing a sentence. The raw ICE completion is never displayed; only the SCIM-processed interpretation is shown in the PIE Zone.
The PIE Zone is a distinct UI area positioned above the chat history, with dashed-border styling and 70% opacity to signal its predictive nature. It displays the SCIM-processed interpretation of the predicted full statement, never the raw ICE completion.
When the full 900ms analysis completes and the confirmed Blue zone appears, the PIE Zone hides automatically. The transition signals the shift from predicted to confirmed state.
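The dual-debounce pipeline described in the contributions — a fast timer feeding the predictive (PIE) path and a slow timer feeding the full SCIM pass, both reset on every keystroke — can be sketched as below. The real implementation presumably runs client-side; this Python sketch with `threading.Timer` only illustrates the timing logic, and the class and callback names are hypothetical.

```python
import threading

class DualDebounce:
    """Two debounce timers over one keystroke stream: the fast timer fires the
    predictive (PIE) path, the slow timer fires the full SCIM analysis."""

    def __init__(self, on_fast, on_full, fast_ms=300, full_ms=900):
        self.on_fast, self.on_full = on_fast, on_full
        self.fast_s, self.full_s = fast_ms / 1000, full_ms / 1000
        self._timers = []

    def keystroke(self, text):
        for t in self._timers:           # every keystroke resets both timers
            t.cancel()
        self._timers = [
            threading.Timer(self.fast_s, self.on_fast, args=(text,)),
            threading.Timer(self.full_s, self.on_full, args=(text,)),
        ]
        for t in self._timers:
            t.start()
```

Because both timers restart on each keystroke, only the final pause triggers callbacks: the PIE prediction at 300 ms, then the confirmed analysis at 900 ms, matching the hand-off described above.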
PIE is a user-selectable option (default: ON for returning users, configurable). This design reflects research showing individual differences in sensitivity to peripheral displays during composition tasks [12]. Users who prefer to compose without interruption can disable PIE; the remainder of the ARIA pipeline is unaffected.
Kahneman's dual-process framework [17] provides a unifying cognitive account of ARIA's design. System 2 (slow, deliberate, resource-intensive) is engaged by: composing complex intent statements, evaluating autocomplete suggestions inserted into the input field, reading structured AI responses. System 1 (fast, automatic, low-cost) is engaged by: recognising peripheral confirmation signals, noticing confidence indicators, processing familiar structured patterns.
ARIA's pipeline is designed to minimise System 2 demand at each stage. The Blue zone's structured card format (root intent / confirmed nodes / uncertain nodes) is consistent across all interactions, making it recognisable rather than evaluative. The PIE Zone's distinct styling reduces the probability of confusion with the confirmed state. The Misalignment Detector's ALIGNED/PARTIAL/MISALIGNED verdict requires only recognition, not analysis. The clarification options require selection, not recall.
Tulving's recognition/recall asymmetry [18] — that recognition is significantly less cognitively demanding than recall — motivates several specific design decisions. The Blue zone presents ARIA's interpretation for recognition ("does this match what I meant?") not recall ("what did I mean?"). The clarification options are pre-computed answer candidates rather than open fields. The Misalignment Detector verdict is a recognition task ("did the AI address my confirmed intent?") not an evaluation task. These design choices collectively minimise the cognitive cost of the intent fidelity pipeline.
Hayes and Chenoweth (2006) [19] established that transcription and editing tasks compete for the same working memory resources as higher-order composition planning. The input field modification in standard autocomplete systems creates precisely this competition: the user must interrupt composition to evaluate and accept/reject the suggestion. ARIA's core design principle — that the input field is never modified by any system component — directly protects the working memory capacity allocated to composition.
H1 (Intent fidelity): AI responses generated via the ARIA pipeline (SCIM + Clarification + Brief) will score significantly higher on a blind expert intent fidelity rating scale than responses generated from raw user input, for the same set of complex intent statements.
H2 (Clarification efficiency): The single-question-per-round clarification design will achieve equivalent intent node coverage to unconstrained multi-question clarification while producing significantly higher task completion rates and lower perceived effort (NASA-TLX).
H3 (Misalignment detection accuracy): The Misalignment Detector will show significant agreement (κ > 0.70) with expert human judges on intent fidelity verdicts across a benchmark of 200 intent-response pairs.
H4 (Prosody confidence validity): Prosodic confidence deltas derived from pause duration and energy ratio will show significant positive correlation with post-hoc speaker self-reports of certainty on a word-by-word basis.
H5 (PIE perceived latency): Users in the PIE condition will report significantly lower perceived processing latency than users in the no-PIE condition, controlling for objective end-to-end time.
H6 (PIE cognitive load): PIE will not significantly increase NASA-TLX working memory subscale scores, because the peripheral display design minimises System 2 engagement.
| Variable | Specification |
|---|---|
| Design (H1) | Within-subjects: same 10 intents, ARIA pipeline vs. raw prompt vs. baseline LLM. Expert panel (N=5) rates intent fidelity blind. |
| Design (H2–H6) | 2×2×2 between-subjects: ARIA on/off × PIE on/off × Background context on/off. N=120 (15 per cell). |
| Participants | Knowledge workers, self-reported AI interface users, recruited via professional networks |
| Tasks | 8 professional intent statements across 4 domains (HR, finance, research, product management) |
| Measures | Intent fidelity score (expert panel), NASA-TLX, perceived latency (7-point Likert), text predictability ratio, task completion time, clarification round count |
| Analysis | Mixed ANOVA, Bonferroni correction, Cohen's d for effect sizes, inter-rater reliability (κ) for H3 |
The AI industry currently evaluates interface quality primarily on response quality metrics: factual accuracy, coherence, helpfulness ratings. We argue that intent fidelity should be a distinct evaluation dimension, independent of response quality. A highly accurate response to a misunderstood intent is a failure of intent fidelity even if it scores well on response quality metrics. Separating these dimensions would allow the field to measure and improve the gap between what users ask for and what AI systems address.
The intent fidelity gap is most costly in enterprise contexts where the consequences of misaddressed requests are significant: procurement specifications, legal document drafting, clinical data queries, financial modeling requests. The ARIA suite's enterprise value proposition is precisely this: it reduces the probability that a complex multi-parameter enterprise request is partially or wholly misaddressed by the AI system. The Misalignment Detector provides a verifiable audit trail of intent-response correspondence that has compliance and governance value in regulated industries.
The current SCIM confidence weights (α=0.40, β=0.35, γ=0.25) are empirically calibrated but not formally derived. A learning approach that adapts weights per user based on feedback from the Misalignment Detector is a natural extension. The DCT relationship taxonomy is currently hand-coded; a semi-supervised approach that learns domain-specific relationships from user knowledge base documents would reduce deployment friction. The Prosody Layer's energy ratio measure is a proxy (Whisper word probability) for true acoustic energy; deployment with native audio feature extraction would improve accuracy. PIE's performance on mobile keyboards and voice-first interfaces requires dedicated evaluation.
We have introduced intent fidelity as a paradigm for AI interface design, defined it formally as a five-property requirement, and described the ARIA™ suite — a complete technical architecture implementing it. SCIM provides confidence-scored semantic decomposition via a three-source model. DCT provides logical relationship mapping that enables intent-level reasoning beyond semantic similarity. The Clarification Engine selects optimally valuable clarification questions within a one-question-per-round constraint. The Brief Generator produces structured AI prompts from confirmed nodes. The Misalignment Detector closes the loop by verifying post-hoc that AI responses address confirmed intent. The Prosody Layer extends intent fidelity to voice interfaces via acoustic confidence signals. The Predictive Intent Echo reduces perceived latency to 500ms via peripheral confirmation during composition.
Together, these components constitute the first complete published architecture for intent fidelity in AI interfaces. We present six testable hypotheses and a proposed experimental protocol. The architecture is implemented and live at aria-demo.pages.dev. All components are filed as patent pending under the ARIA™ portfolio, Business Innovation Solutions WLL, Bahrain.
The core argument of this paper is that the field's focus on response quality, without an equivalent focus on intent fidelity, systematically underestimates the rate at which AI systems fail users. Improving the quality of the intent that reaches the AI — not just the quality of the response that leaves it — is the most direct path to AI interfaces that professional users can rely on.
[1] Wei, J. et al. (2022). Emergent abilities of large language models. Transactions on Machine Learning Research.
[2] Allen, J. F. (1987). Natural language understanding. Benjamin Cummings.
[3] Joachims, T. (1998). Text categorisation with support vector machines: Learning with many relevant features. ECML '98. Springer.
[4] Devlin, J. et al. (2019). BERT: Pre-training of deep bidirectional transformers. NAACL-HLT 2019.
[5] Bapna, A. et al. (2017). Sequential dialogue context modelling for spoken language understanding. SIGDIAL 2017.
[6] Wu, C. S. et al. (2019). Transferable multi-domain state generator for task-oriented dialogue systems. ACL 2019.
[7] Rao, S., & Daumé III, H. (2018). Learning to ask good questions. ACL 2018.
[8] Aliannejadi, M. et al. (2019). Asking clarifying questions in open-domain information-seeking conversations. SIGIR 2019.
[9] Stoyanchev, S. et al. (2014). Towards natural clarification questions in dialogue systems. AISB 2014.
[10] Thorne, J. et al. (2018). FEVER: A large-scale dataset for fact extraction and verification. NAACL 2018.
[11] Maynez, J. et al. (2020). On faithfulness and factuality in abstractive summarisation. ACL 2020.
[12] Arnold, K. C., Chauncey, K., & Gajos, K. Z. (2020). Predictive text encourages predictable writing. IUI '20. ACM.
[13] MakeAIHQ. (2026). Streaming responses for real-time UX in ChatGPT apps. MakeAIHQ.
[14] Gladia. (2026). How to measure latency in speech-to-text. Gladia Engineering Blog.
[15] Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding. ICML 2023. Google Research.
[16] Schuller, B. W. et al. (2013). Computational paralinguistics: Emotion, affect and personality in speech and language processing. Wiley.
[17] Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
[18] Tulving, E. (1985). Memory and consciousness. Canadian Psychology, 26(1), 1–12.
[19] Hayes, J. R., & Chenoweth, N. A. (2006). Is working memory involved in the transcribing and editing of texts? Written Communication, 23(2), 135–149.