Problem: Smartphones have become indispensable in modern life, yet navigating complex, multi-step tasks on mobile devices often remains frustrating and time-consuming. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments on behalf of users. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences.
Method: To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By “hierarchical,” we refer to an explicit separation of high-level planning and low-level action execution through the structured assignment of five agents: a Manager and four subordinate agents—Perceptor, Operator, Action Reflector, and Notetaker. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. We also introduce Mobile-Eval-E, a new benchmark featuring challenging real-world mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones on diverse tasks.
Manager: Large multimodal model (LMM)-based reasoning agent that creates high-level plans containing decomposed subgoals for the user's request. The Manager also considers available Shortcuts from the long-term memory to guide planning. When an error first occurs, the Operator attempts to address it on its own; however, if the model observes consecutive failed actions, an Error Escalation Flag is raised to notify the Manager, who reviews the recent errors and decides on higher-level adjustments to resolve the issue.
Perceptor: A pure vision-based perception module containing three tools: an OCR model, an icon grounding model, and an icon captioning model. The output contains a fine-grained list of texts and icons, along with their coordinates on the screen.
Operator: An LMM-based reasoning agent that decides the next immediate action, such as Tap(x, y), based on the high-level plan from the Manager. The Operator also considers the Tips from the long-term memory to guide decision-making. The action space is defined to contain not only Atomic Operations but also Shortcuts, which can evolve across tasks.
Action Reflector: An LMM-based reasoning agent that verifies whether the previous action achieved the expected outcome by comparing the before and after screenshots. If the action succeeds, the Action Reflector logs the current progress; otherwise, it provides additional error feedback.
Notetaker: An LMM-based reasoning agent that aggregates important information encountered while navigating a task, such as the price of a product or the phone number of a restaurant.
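The interplay of the five agents above can be sketched as a control loop. This is a minimal illustrative sketch, not the paper's implementation: the agent objects, method names, and the escalation threshold of two consecutive failures are assumptions standing in for LMM-backed components.

```python
ESCALATION_THRESHOLD = 2  # consecutive failures before raising the Error Escalation Flag

def run_task(manager, perceptor, operator, reflector, notetaker, env, max_steps=20):
    """One task episode: Manager plans, subordinate agents perceive/act/verify/record."""
    plan = manager.plan(env.user_request)            # high-level subgoals
    consecutive_failures = 0
    for _ in range(max_steps):
        percept = perceptor.perceive(env.screenshot())  # texts and icons with coordinates
        action = operator.decide(plan, percept)         # e.g. Tap(x, y) or a Shortcut
        before = env.screenshot()
        env.execute(action)
        ok, feedback = reflector.verify(before, env.screenshot(), action)
        if ok:
            consecutive_failures = 0
            notetaker.record(percept, action)           # aggregate key information
        else:
            consecutive_failures += 1
            if consecutive_failures >= ESCALATION_THRESHOLD:
                # Error Escalation Flag: the Manager reviews recent errors
                # and decides on higher-level plan adjustments.
                plan = manager.revise(plan, feedback)
                consecutive_failures = 0
        if env.done():
            break
```

The key design point is the two-tier error handling: the Operator absorbs isolated failures, while repeated failures escalate to the Manager for a plan-level adjustment.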
We maintain a persistent long-term memory consisting of two key types of knowledge, Tips and Shortcuts, which aim to enhance both the performance and efficiency of the agent. Two dedicated LMM-based agents, called Experience Reflectors, update the Tips and Shortcuts at the end of each task based on the interaction history.
Tips: Tips are defined as general guidance on effective interactions and lessons learned from previous errors, akin to the episodic memory in human cognition.
Shortcuts: Shortcuts are defined as reusable, executable functions composed of sequences of atomic operations tailored for recurring subroutines. Shortcuts are akin to procedural knowledge, which allows humans to perform well-practiced tasks efficiently and often subconsciously. We explicitly include a precondition in the definition of a Shortcut and require the Operator to verify that the current state satisfies the precondition before using the Shortcut.
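A Shortcut as described above can be sketched as a small data structure: a named sequence of atomic operations guarded by a precondition that must hold before expansion. The field names and the (operation, arguments) encoding here are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Shortcut:
    name: str
    description: str
    precondition: str                   # natural-language state requirement
    operations: List[Tuple[str, dict]]  # ordered (atomic_operation, arguments) pairs

def apply_shortcut(shortcut: Shortcut,
                   state_check: Callable[[str], bool],
                   execute: Callable[[str, dict], None]) -> bool:
    """Expand a Shortcut only if the current state satisfies its precondition.

    `state_check` stands in for the Operator's (LMM-based) judgment of the
    precondition; `execute` runs one atomic operation on the device.
    """
    if not state_check(shortcut.precondition):
        return False                    # precondition unmet: fall back to atomic actions
    for op, args in shortcut.operations:
        execute(op, args)
    return True

# A hypothetical evolved Shortcut for a recurring search subroutine.
search_in_app = Shortcut(
    name="Search_in_App",
    description="Tap the search bar, type a query, and submit.",
    precondition="A search bar is visible on the current screen.",
    operations=[("Tap", {"target": "search_bar"}),
                ("Type", {"text": "<query>"}),
                ("Enter", {})],
)
```

Gating execution on the precondition is what keeps reusable subroutines safe: a Shortcut learned in one context is only replayed when the current screen actually supports it.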
See Figure 4 for an example self-evolution step as well as the agent generated Tips and Shortcuts.
Existing dynamic mobile benchmarks (AppAgent, Mobile-Agent, Mobile-Agent-v2) primarily focus on short-horizon, straightforward tasks, on which performance has already saturated. To address this limitation, we propose a challenging benchmark, Mobile-Eval-E, which emphasizes reasoning-intensive, long-horizon, multi-app tasks. Mobile-Eval-E comprises 25 manually crafted tasks spanning 5 real-world scenarios: "Restaurant Recommendation", "Information Searching", "Online Shopping", "What's Trending", and "Travel Planning". As shown in Table 1, Mobile-Eval-E significantly surpasses previous benchmarks in complexity, featuring more than 2x the number of expected operations per task. Mobile-Eval-E also encompasses a broader range of apps, with 76% of the tasks requiring interactions with multiple apps.
We introduce a new evaluation metric called the Satisfaction Score (SS) to address the challenge posed by real-world tasks that often lack a binary success flag or a ground truth trajectory. This metric is computed based on human-written rubrics that account for both milestone completion, such as "opened Maps," and exploratory behaviors, such as "viewed more than one review." This approach offers a reliable measure of agent performance aligned with human preferences. We further propose a Satisfaction Score vs Steps (SSS) curve to better evaluate and visualize the efficiency of mobile agents. Additionally, we include Action Accuracy (AA) and Reflection Accuracy (RA) as metrics to evaluate action-level performance, and Termination Error (TE) to reflect the agent's robustness.
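The Satisfaction Score can be illustrated with a short sketch. The equal-weight averaging over rubric items is an assumption for illustration; the paper's rubrics may weight milestones and exploratory behaviors differently.

```python
def satisfaction_score(rubric, observed):
    """Fraction of human-written rubric items (milestone completions and
    exploratory behaviors) that are satisfied by the observed trajectory."""
    if not rubric:
        return 0.0
    satisfied = sum(1 for item in rubric if item in observed)
    return satisfied / len(rubric)

# Hypothetical rubric mixing a milestone and exploratory behaviors.
rubric = ["opened Maps",
          "searched for nearby restaurants",
          "viewed more than one review"]
observed = {"opened Maps", "viewed more than one review"}
score = satisfaction_score(rubric, observed)  # 2 of 3 items satisfied
```

Recording this score after each step, rather than only at termination, yields the Satisfaction Score vs Steps (SSS) curve used to visualize efficiency.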
@misc{wang2025mobileagenteselfevolvingmobileassistant,
title={Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks},
author={Zhenhailong Wang and Haiyang Xu and Junyang Wang and Xi Zhang and Ming Yan and Ji Zhang and Fei Huang and Heng Ji},
year={2025},
eprint={2501.11733},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.11733},
}