Small Models, Big Results: Superior Intent Extraction via Decomposition

As AI technologies advance, truly helpful agents will better anticipate user needs. For mobile experiences to be genuinely helpful, underlying models must understand user interactions. By comprehending current and past tasks, models gain context to predict potential next actions. For example, if a user searched for music festivals in Europe and now seeks a flight to London, an agent can offer to find London festivals on those dates.

Large multimodal LLMs currently excel at understanding user intent from UI trajectories. However, employing LLMs for this task often involves slow, costly server communication and risks exposing sensitive information.

Our recent paper, “Small Models, Big Results: Achieving Superior Intent Extraction Through Decomposition”, presented at EMNLP 2025, investigates using small multimodal LLMs (MLLMs) for on-device understanding of web and mobile user interaction sequences. We decompose intent understanding into two stages: first summarizing each screen individually, then extracting intent from the sequence of generated summaries. This split makes the task more manageable for small models. We also formalize evaluation metrics and show that the decomposed approach achieves results comparable to those of much larger models, which highlights its potential for on-device applications. This research builds upon our team's prior work in user intent understanding.

Research Highlights: Decomposed Intent Extraction

We introduce a decomposed workflow for understanding user intent from interactions. At inference time, the workflow runs two steps: first, a small model independently summarizes each individual screen interaction; second, a fine-tuned model takes these summaries as an event sequence and predicts the overall intent of the UI trajectory.

Individual Screen Summarization

The first stage leverages a small multimodal LLM to summarize every individual interaction.

Utilizing a sliding window of three screens (previous, current, next), the model addresses these critical questions:

What is the relevant screen context? Provide a concise list of salient details from the current screen.
What did the user just do? List the actions the user performed during this interaction.
Speculate: What is the user attempting to achieve with this interaction?
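
The sliding window and the three questions above can be sketched as follows. This is a minimal illustration, not the paper's implementation: screens are represented as plain strings, the names `sliding_windows` and `build_summary_prompt` are hypothetical, and padding the first and last screen with `None` is an assumption, since the text does not say how trajectory edges are handled.

```python
from typing import List, Optional, Tuple

# The three questions from the text, used verbatim as prompt lines.
QUESTIONS = [
    "What is the relevant screen context? Provide a concise list of salient details from the current screen.",
    "What did the user just do? List the actions the user performed during this interaction.",
    "Speculate: What is the user attempting to achieve with this interaction?",
]

# (previous, current, next); neighbours are None at the trajectory edges (assumption).
Window = Tuple[Optional[str], str, Optional[str]]


def sliding_windows(screens: List[str]) -> List[Window]:
    """Return one three-screen window per screen in the trajectory."""
    windows = []
    for i, current in enumerate(screens):
        previous = screens[i - 1] if i > 0 else None
        following = screens[i + 1] if i + 1 < len(screens) else None
        windows.append((previous, current, following))
    return windows


def build_summary_prompt(window: Window) -> str:
    """Assemble a hypothetical text prompt for the first-stage summarizer."""
    previous, current, following = window
    lines = [
        f"Previous screen: {previous or '(none)'}",
        f"Current screen: {current}",
        f"Next screen: {following or '(none)'}",
        "",
    ]
    lines.extend(QUESTIONS)
    return "\n".join(lines)
```

In a real system each screen would be a screenshot paired with the user's action rather than a string, and the prompt would go to a multimodal model.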

Diagram illustrating a user's intent extraction workflow. The user chooses from a travel grid. Text labels detail screen context, user actions, and speculative travel goals.

The initial stage of the decomposed workflow involves examining surrounding screens for each screenshot-action pair. We query screen context, user actions, and user intentions. The example at the bottom demonstrates a potential LLM-generated summary answering these three questions, serving as input for the second stage.

Intent Extraction from Summaries

The second stage employs a fine-tuned small model to extract a single, concise intent statement from the screen summaries.

We identified several techniques that significantly enhance performance:

Fine-tuning: Providing examples of effective intent statements guides the model to prioritize crucial information within summaries and discard irrelevant details. We utilize publicly available automation datasets for training, as they offer robust examples pairing intents with action sequences.
Label Preparation: Summaries may omit details. To prevent the model from hallucinating missing information, we preprocess training intents by removing any elements not present in the summaries, using a separate LLM call.
Dropping Speculations: While speculations aid first-stage summary completeness, they can confuse the second-stage intent extractor. Therefore, we exclude them during the second stage. This strategy, though seemingly counterintuitive, demonstrably improves performance.
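
The "Dropping Speculations" step can be illustrated with a short sketch. It assumes each first-stage summary is stored with three fields matching the three questions; the `ScreenSummary` class, its field names, and the `build_extractor_input` function are hypothetical, as is the exact text layout passed to the second-stage model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScreenSummary:
    """One first-stage summary (field names are illustrative)."""
    context: str      # salient details of the current screen
    actions: str      # what the user did
    speculation: str  # guessed goal; produced in stage one, dropped in stage two


def build_extractor_input(summaries: List[ScreenSummary]) -> str:
    """Serialize summaries for the intent extractor, omitting speculations."""
    steps = []
    for index, summary in enumerate(summaries, start=1):
        steps.append(
            f"Step {index}\n"
            f"Context: {summary.context}\n"
            f"Actions: {summary.actions}"
        )
    return "\n\n".join(steps)
```

Keeping the speculation field in the stored summary but leaving it out of the serialized input mirrors the description above: the speculation helps the first stage produce a complete summary but is withheld from the extractor.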

Flowchart: User action summaries feed into a 'Model finetuning' stage, yielding a 'Cleaned gold' intent statement.

The second stage of our decomposed workflow utilizes a fine-tuned model that processes summaries from the first stage to output a concise intent statement. We exclude all speculative content and meticulously clean training labels to prevent hallucinations.

Rigorous Evaluation Approach

We employ the Bi-Fact approach to assess the quality of a predicted intent against a reference intent. This method uses a separate LLM call to break each intent down into indivisible "atomic facts." For instance, "a one-way flight" is an atomic fact, while "a flight from London to Kigali" comprises two. We then count how many reference facts are entailed by the predicted intent and how many predicted facts are entailed by the reference intent. The first count gives recall (the share of reference facts that the prediction captures), the second gives precision (the share of predicted facts that the reference supports), and the two combine into an F1 score.
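
As a rough sketch of the arithmetic, the function below computes precision, recall, and F1 for a single intent pair from per-fact entailment judgments. In the approach described above those judgments come from an LLM; here they are supplied as booleans. The function name `bifact_scores` and the handling of empty fact lists are assumptions, and the sketch does not cover how scores are aggregated across a dataset.

```python
from typing import List, Tuple


def bifact_scores(
    reference_fact_covered: List[bool],
    predicted_fact_supported: List[bool],
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one reference/predicted intent pair.

    reference_fact_covered[i]   -- is reference fact i entailed by the predicted intent?
    predicted_fact_supported[j] -- is predicted fact j entailed by the reference intent?
    """
    # Empty lists score 0.0 here; this convention is an assumption.
    recall = (
        sum(reference_fact_covered) / len(reference_fact_covered)
        if reference_fact_covered else 0.0
    )
    precision = (
        sum(predicted_fact_supported) / len(predicted_fact_supported)
        if predicted_fact_supported else 0.0
    )
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, if the prediction captures 3 of 4 reference facts and 2 of its 3 own facts are supported by the reference, recall is 3/4, precision is 2/3, and F1 is 12/17, roughly 0.71.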

Side-by-side comparison of Reference and Predicted flight booking facts, using checkmarks and X's for accuracy evaluation.

Fact Coverage Analysis assesses how effectively reference facts are captured in predicted intents (left) and if predicted facts align with the reference intent (right).

Analyzing atomic facts also reveals how different stages of the decomposed approach contribute to errors. Below, we illustrate how we track missed details and hallucinations at each stage by examining fact flow through the system.

Flowcharts analyzing Recall and Precision to pinpoint where facts are missed or hallucinated during extraction.

Error propagation analysis tracks recall and precision across both model stages.
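
One way to picture the recall side of this analysis is to record, for each reference fact, whether it appears in the first-stage summaries and whether it appears in the final predicted intent, then attribute each miss to a stage. The sketch below is an assumption about how such bookkeeping could look, not the paper's procedure; the function name and category labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def attribute_recall_errors(facts: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count where reference facts are kept or lost.

    Each item is (in_summaries, in_prediction) for one reference fact.
    """
    counts = Counter()
    for in_summaries, in_prediction in facts:
        if in_prediction:
            counts["captured"] += 1
        elif not in_summaries:
            # Never reached the extractor: lost during screen summarization.
            counts["missed_in_summarization"] += 1
        else:
            # Present in the summaries but dropped by the intent extractor.
            counts["missed_in_extraction"] += 1
    return counts
```

A mirrored tally over predicted facts (is the fact supported by the summaries, and is it supported by the reference?) would separate hallucinations introduced during summarization from those introduced during extraction.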

Exceptional Performance Results

Our decomposed approach, which summarizes screens independently before extracting intent, significantly aids small models. Compared to standard methods like chain-of-thought prompting (CoT) and end-to-end fine-tuning (E2E), our method achieves superior results on both mobile and web trajectories, performing well with both Gemini and Qwen2 base models. Notably, the decomposed approach with the Gemini 1.5 Flash 8B model delivers performance comparable to Gemini 1.5 Pro at a significantly lower cost and higher speed. Explore additional experiments in the full paper.

Bar charts comparing BiFact F1 scores for Gemini 1.5 and Qwen2 models across Android and Web trajectories.

Our decomposed method consistently outperforms the chain-of-thought prompting (CoT) and end-to-end fine-tuning (E2E) baselines in Bi-Fact F1 score. On mobile datasets, its performance rivals that of the much larger Gemini 1.5 Pro model.

Conclusion and Future Outlook

We demonstrated that a decomposed approach to trajectory summarization effectively enhances intent understanding with small models. As AI models advance and mobile device processing power increases, we anticipate on-device intent understanding becoming a foundational element for numerous assistive mobile features.

Acknowledgments

We express our gratitude to our coauthors: Noam Kahlon, Joel Oren, Omri Berkovitch, Sapir Caduri, Ido Dagan, and Anatoly Efros.
