AgentKit: A Technical Vision for Building Universal AI Automation for Human-Computer Interaction Based on Rust

Introduction

This week I came across a project called Droidrun, which allows you to control your Android phone through natural language commands.

When I first saw this project, I didn't think much of it. Today, after seeing the news about the project being open-sourced, I became curious about how it works, so I looked at the code to understand the principles behind it.

What I found was truly fascinating.

Just this Monday, I had come across Accesskit.dev, a cross-platform, cross-language Rust abstraction layer that encapsulates the native accessibility service APIs of different operating systems (like Windows, macOS, Linux/Unix, Android), such as UIA, NSAccessibility, AT-SPI, and the Android Accessibility Framework. At that time, I was thinking that if large language models were to act as humans, they would essentially be like people with disabilities (no derogatory meaning intended). This API set would be perfect for building AI Agents.

And today, I discovered that the core mechanism of the Droidrun project is built using Android's accessibility service API. This is what made me feel that the world is truly amazing: while I was still at the idea stage, someone else had already implemented it.

Unfortunately, it's not a cross-platform app; its limitation is that it only supports Android phones. Coincidentally, I am a die-hard fan of the Rust language, and I know that Rust is particularly well suited to cross-platform development.

I started thinking, could we take the approach from the Droidrun project, combine it with the Rust language, and implement a universal AI automation kit that not only supports Android phones but also iOS, desktop platforms, and even any smart terminal? This article was born from this idea, and AgentKit is the name I've given to this universal AI automation kit.

Therefore, this article will start with Droidrun.ai, an AI automation practice on the Android platform, and analyze its implementation mechanism and limitations in depth. We will then explore the key role of the cross-platform accessibility infrastructure AccessKit. Finally, I will propose a detailed vision for the universal AI control framework AgentKit, including its architecture design, its collaborative relationship with existing protocols, potential application scenarios, and a development roadmap, aiming to outline a future AI-driven automation infrastructure that transcends digital boundaries.

Table of Contents

  • The Future of Applications in the AI Era
  • Analysis: Droidrun AI's Pioneering Exploration of Android Automation
  • Foundation of the AgentKit Vision: Cross-Platform Capabilities of AccessKit
  • AgentKit: Universal AI Automation Framework Concept
  • Complementary Collaboration Between AgentKit and Claude MCP / Google A2A Protocols
  • Conclusion

The Future of Applications in the AI Era

Before thinking about this universal AI automation kit, the first question that came to my mind was: Do we still need apps in the AI era? After all, if we don't need apps anymore, we wouldn't need any AI automation kit either.

Fortunately, I understand an objective principle of this world: castles in the air don't exist.

So, let's think about this question starting from the evolution history of computer interfaces. Human-computer interaction has undergone several major paradigm shifts:

  1. Command Line Interface: Required precise syntax and a good memory from the user; not something an average person could operate (I've heard that the father of Genshin Impact could open game programs in DOS at the age of five).
  2. Graphical User Interface (GUI): Introduced visual metaphors and direct manipulation concepts (which saved ordinary people like me).
  3. Mobile Touch Interface: Brought computing power into our palms, based on gesture interaction (iPhone is great in every way, just a bit expensive).
  4. Voice Assistants: Started moving toward natural language interaction (AI is quite clever).

And now we are entering an era of the so-called "AI Intermediary Interface", where large language models act as intermediaries between human intent and computing resources. This transition is indeed revolutionizing the way we interact with technology, but it doesn't mean apps will completely disappear.

Despite AI's improving language understanding capabilities, I believe applications will transform rather than disappear, for several reasons:

  1. In terms of cognitive and perceptual efficiency, AI cannot replace humans. Humans process visual information with extreme efficiency. Our brains can instantly understand complex spatial relationships, hierarchical structures, and patterns, while describing these using language (text or speech) might require lengthy passages. Imagine editing a photo: describing the precise adjustments you want ("increase brightness by 15% in the upper right quadrant while reducing saturation in the blue channel") is much more complex and cognitively demanding than simply dragging sliders or using visual tools. Language, though powerful, is not the optimal choice for all interaction scenarios. Just as we use both verbal and non-verbal communication (gestures, expressions) in real life, human-computer interaction will maintain a state where multiple modes coexist.

  2. Some professional domains have special requirements. Many fields need specialized interfaces that match their conceptual models, such as:

  • Creative applications: Music production, video editing, and 3D modeling involve spatial and temporal relationships that are naturally expressed through visual interfaces
  • Data visualization: Understanding complex data often requires interactive exploration, which is difficult to navigate through language alone
  • Games: Interactive entertainment relies on immediate feedback and spatial awareness
  • Professional tools: Fields like medical imaging and architectural design require highly visual and precise control

For these two fundamental reasons, I believe applications won't disappear but will deeply integrate AI while retaining visual elements:

  • AI-enhanced interfaces: Traditional visual elements enhanced by AI capabilities, like Photoshop's generative fill feature
  • Multimodal interaction: Combining voice, touch, eye movement, and gestures to form a richer interactive experience
  • Adaptive interfaces: User interfaces that automatically reconfigure based on user behavior and AI understanding (this idea might be somewhat ahead of its time)

From a cognitive science perspective, human thinking processes both linguistic and non-linguistic information simultaneously, and our technological interfaces will reflect this duality. Language models provide a natural instruction layer, while visual interfaces provide the interaction modes required for spatial and visual thinking.

Therefore, universal AI automation kits like AgentKit may become even more important in this evolving process:

  1. Need for bridging technology. As interface paradigms diversify, we need systems that can work across different interaction models and device platforms. AgentKit's vision of providing a unified layer for AI to interact with various interfaces becomes more valuable, not less. This bridging role is similar to the relationship between operating systems and hardware. When hardware becomes more diverse, the operating system's abstraction layer doesn't become less important; rather, it becomes more critical as it needs to coordinate more types of devices.
  2. Managing complexity through abstraction. Even if frontend interfaces change significantly, the basic need to interact with complex systems remains. AgentKit provides an abstraction layer that allows users and AI to avoid focusing too much on implementation details, which is very important. We can draw an analogy to the evolution of programming languages: although high-level languages continue to evolve, the complexity of underlying systems still exists; we just manage this complexity through better abstractions. Similarly, the evolution of AI interfaces won't eliminate the complexity of underlying systems but will create better abstractions to manage it.
  3. Supporting the transition period. The shift to AI-first computing will be gradual and uneven. Systems like AgentKit that can connect traditional and emerging paradigms are crucial during this long transition period. History shows that technological transitions usually take longer than we expect. For example, command-line interfaces still exist today, decades after graphical interfaces became popular, and maintain their advantages in certain domains. Similarly, traditional application design will coexist with the AI era for a long time.

I believe we will see layered user experiences: at the high level, users can express their intent through natural language ("edit my photo from yesterday to make it look warmer"), while AI is responsible for translating this intent into specific operations, or opening appropriate visual interfaces when needed for fine-tuning or intuitive feedback.

A characteristic of technological evolution is that new paradigms rarely completely replace their predecessors. As I mentioned earlier, castles in the air don't exist in this world, and AI is no exception. Instead, new paradigms usually create new ecological niches, while old methods continue to exist in their areas of expertise.

Therefore, the future is unlikely to be a binary choice between traditional applications and pure AI interfaces, but rather a rich continuum where different interaction styles coexist, complement, and co-evolve.

Analysis: Droidrun AI's Pioneering Exploration of Android Automation

Droidrun.AI has opened a window into this new human-computer interaction paradigm. I believe it is an important attempt in the field of Android automation, demonstrating how to achieve intelligent device control based on natural language using AI and system-level APIs.

Droidrun Overall Architecture and Working Principles

The Droidrun App consists of two independent projects: droidrun (a Python agent) and droidrun-portal (an Android application). The overall workflow of the application is described below.

Workflow, step by step:

  1. User Startup: The user enters a natural language task through the Droidrun CLI.
  2. Environment Setup: The CLI parses parameters, obtains the device serial number, and sets the environment variable DROIDRUN_DEVICE_SERIAL.
  3. Agent Startup: The ReActAgent is initialized.
  4. ReAct Loop:
    • Thinking: The Agent sends the current goal and history steps to the LLM Reasoner.
    • Decision: The LLM returns the thinking process (thought), the action to be executed (action), and parameters.
    • Action: The Agent calls the corresponding function in tools.actions to execute the action.
      • Get UI State (Key Interaction): If the action is get_clickables or get_all_elements:
        • The Python tool sends a broadcast command to the Portal Service on the device via ADB.
        • The Portal Service uses the accessibility API to scan the UI, build a hierarchy, assign indices, write the results to a JSON file, and print the file path via Logcat.
        • The Python tool polls Logcat to get the file path, then pulls the JSON file via ADB, parses it, and returns it as an observation result to the Agent.
      • Execute Device Operations: If the action is tap, swipe, input_text, etc.:
        • The Python tool looks up the element in the cache (if it's tap(index=...)), calculates coordinates, or constructs a command.
        • It sends the corresponding input or am shell command directly to the device via ADB.
        • It returns a message of operation success or failure as the observation result.
      • Complete Task: If the action is complete, the task is marked as completed.
    • Observation: The Agent receives the result of the action execution (UI data or operation status).
    • Loop/End: If the task is not completed and the maximum number of steps has not been reached, return to the "thinking" stage with the new observation result; otherwise, end the loop.
  5. Result Output: The CLI displays the execution process or final result to the user.

Next, let's look at the code for these two parts to understand their working mechanisms.

Droidrun (Python Agent)

This Python package is the core control end (analogous to the brain and hands) of the Droidrun system, responsible for:

  1. Agent Logic Implementation: Contains intelligent decision-making and task execution processes.
  2. Device Interaction Tools: Provides tools for communicating with and controlling Android devices, mainly relying on ADB and the Droidrun Portal application.
  3. Command Line Interface: Provides an entry point for user interaction with the Agent.

Key Points of Core Components:

  1. adb Package:
    • Provides asynchronous encapsulation and execution of standard adb commands (such as shell, pull, install).
    • The Device class represents a device, encapsulating specific operations such as tap, swipe, input_text, take_screenshot, etc., which are ultimately completed by executing the corresponding adb shell commands.
    • DeviceManager is used to manage multiple device connections.
  2. tools Package:
    • Key Interface: This is the main bridge between the Python Agent and the Android Portal application.
    • UI Perception (get_clickables / get_all_elements), sketched in code after this list:
      • Triggers the Portal application to scan the UI via adb shell am broadcast.
      • Core Interaction Mechanism: Polls adb shell logcat -d to listen for the log line in which Portal prints the JSON file path.
      • After obtaining the path, uses adb pull to fetch the JSON file from the device.
      • Parses the JSON (flattening the nested structure) and caches the results (CLICKABLE_ELEMENTS_CACHE).
    • UI Operations (tap, etc.):
      • tap (based on index) looks up the element in the cache, calculates coordinates, and ultimately executes adb shell input tap.
      • Other actions (swipe, input_text, etc.) are wrappers around adb.Device methods, with added error handling and specific logic (such as escaping and chunking for input_text).
    • complete Tool: Used by the Agent to signal to the ReAct loop that the task has been completed.
    • Depends on the DROIDRUN_DEVICE_SERIAL environment variable to specify the target device.
  3. agent Package:
    • Core Logic: Implements the ReAct (Reasoning + Acting) agent pattern.
    • LLMReasoner: Encapsulates interaction with LLMs (OpenAI, Anthropic, Gemini), responsible for building detailed system prompts (with tool signatures) and user prompts (with history), handling API calls (supporting visual input), parsing the JSON returned by the LLM (thought, action, parameters), and tracking token limits.
    • ReActAgent: Orchestrates the ReAct loop and maintains the execution step history (ReActStep). Its run method drives the entire process: call LLMReasoner to think -> record thought/action -> call execute_tool to execute the action -> record the observation result -> check whether the task is complete. execute_tool maps the action name returned by the LLM to a function in tools.actions and executes it.
  4. cli Package:
    • Uses the click library to provide a command-line interface.
    • The run command is the main entry point, responsible for parsing parameters, setting the device serial number environment variable, and launching the ReActAgent.
    • Provides auxiliary commands for device management (devices, connect) and Portal application installation/setup (setup).
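
To make the get_clickables handoff above concrete, here is a minimal Rust sketch of the same ADB round trip: trigger a scan via broadcast, poll logcat for the JSON file path, then pull the file. Droidrun itself implements this in Python; the broadcast action name and the log marker below are placeholders of my own, not Droidrun's actual identifiers.

use std::process::Command;
use std::{thread, time::Duration};

/// Run an adb command against a specific device and return its stdout.
fn adb(serial: &str, args: &[&str]) -> std::io::Result<String> {
    let out = Command::new("adb").arg("-s").arg(serial).args(args).output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}

fn fetch_ui_snapshot(serial: &str) -> std::io::Result<Option<String>> {
    // 1. Trigger the on-device portal service to scan the UI.
    //    ("com.example.portal.GET_ELEMENTS" is a placeholder action name.)
    adb(serial, &["shell", "am", "broadcast", "-a", "com.example.portal.GET_ELEMENTS"])?;

    // 2. Poll `logcat -d` until a line announces where the JSON snapshot was written.
    //    ("UI_SNAPSHOT_PATH:" is a placeholder log marker.)
    for _ in 0..20 {
        let log = adb(serial, &["shell", "logcat", "-d"])?;
        if let Some(line) = log.lines().rev().find(|l| l.contains("UI_SNAPSHOT_PATH:")) {
            let remote_path = line.rsplit("UI_SNAPSHOT_PATH:").next().unwrap().trim();
            // 3. Pull the JSON file from the device and read it locally.
            adb(serial, &["pull", remote_path, "ui_snapshot.json"])?;
            return Ok(Some(std::fs::read_to_string("ui_snapshot.json")?));
        }
        thread::sleep(Duration::from_millis(250));
    }
    Ok(None) // timed out without seeing the marker
}

fn main() -> std::io::Result<()> {
    if let Some(json) = fetch_ui_snapshot("emulator-5554")? {
        println!("snapshot bytes: {}", json.len());
    }
    Ok(())
}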

ReAct Mechanism

ReAct (Reasoning + Acting) is a core working mode used by DroidRun to control Android devices. It allows AI to complete complex automation tasks by combining thinking and action, similar to how humans work.

Core Idea: ReAct mimics the human problem-solving cycle.

  1. Reasoning: AI (LLM) analyzes the task goal (like "open settings and enable dark mode"), combines it with the current screen state, and thinks about what needs to be done.
  2. Acting: AI chooses a specific action (like "click the 'Display' button") and executes it on the Android device through tools.
  3. Observing: AI sees the result after executing the action (like the screen jumping to the display settings page).
  4. Think again...Act again...: AI continues to think about what to do next based on the observed results, until the task is completed.

The ReAct Loop in DroidRun:

  1. Set Goal: The user provides a natural language task.
  2. Reason: The LLM analyzes the task and current screen to decide on steps.
  3. Choose Action: The Agent selects an available action (such as click, input, swipe, analyze UI elements).
  4. Execute: Execute the action on the Android device.
  5. Observe: The Agent obtains the execution result (such as new screen content).
  6. Reason Again: The Agent evaluates progress and decides on the next step.
  7. Repeat until the task is completed or the maximum step limit is reached.

Key Features (in DroidRun):

  1. Available Actions: Include UI interaction (click, input), application management (launch, install), UI analysis (get elements), and task management (complete).
  2. Visual Capabilities (Optional): When vision=True is enabled, the Agent can analyze screenshots to better understand complex or non-standard UIs.
  3. Token Tracking: Records Token consumption in LLM interactions for cost management and performance optimization.
  4. Step Recording: The Agent records the type of each step (thinking, action, observation, etc.) for easy tracking and debugging.

It's worth mentioning some details about the Agent package in the DroidRun Python library:

  1. Multi-LLM Support: The LLMReasoner class encapsulates OpenAI, Anthropic, and Gemini API calls.
  2. Prompt Engineering: Carefully designed system prompts guide the LLM to analyze UI, think step by step, select tools, and return strict JSON.
  3. Multimodal Capabilities (Optional): Supports the take_screenshot tool and passes image data to vision models (like GPT-4o, Claude 3). Screenshots serve as supplementary information for handling scenarios that the accessibility API cannot cover (games, custom drawing, WebView internal details, disambiguation).
  4. History Recording and Truncation: Maintains ReAct step history and implements simple token budget truncation.
  5. Tool Abstraction: The LLM only needs to output tool names and parameters, and ReActAgent is responsible for calling the specific implementations in tools.actions.

Simply put, ReAct allows the AI agent to complete tasks on Android devices step by step, like a person with a brain who can act, observe, and reflect.
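
As a rough illustration of this loop, here is a minimal, model-agnostic ReAct driver sketched in Rust. It is not Droidrun's actual code (which is Python); the action set and the stubbed LLM/tool calls are purely illustrative.

/// One step the LLM can ask the agent to take (a simplified action set).
#[allow(dead_code)]
#[derive(Debug)]
enum AgentAction {
    GetClickables,
    Tap { index: usize },
    InputText { text: String },
    Complete { reason: String },
}

/// What the LLM returns on each turn of the loop.
struct Decision {
    thought: String,
    action: AgentAction,
}

/// Stand-in for the LLM call: it would receive the goal plus the step
/// history and return a parsed decision (thought + action + parameters).
fn ask_llm(_goal: &str, _history: &[String]) -> Decision {
    Decision {
        thought: "The settings app is open; tap the 'Display' entry.".into(),
        action: AgentAction::Tap { index: 3 },
    }
}

/// Stand-in for the tool layer: executes the action and returns an observation.
fn execute(action: &AgentAction) -> String {
    format!("executed {:?}", action)
}

fn run_react(goal: &str, max_steps: usize) {
    let mut history: Vec<String> = Vec::new();
    for step in 0..max_steps {
        let decision = ask_llm(goal, &history);                 // Reason
        history.push(format!("thought: {}", decision.thought));
        if let AgentAction::Complete { reason } = &decision.action {
            println!("done after {} steps: {}", step, reason);
            return;
        }
        let observation = execute(&decision.action);            // Act
        history.push(format!("observation: {}", observation));  // Observe
    }
    println!("stopped: reached the maximum number of steps");
}

fn main() {
    run_react("open settings and enable dark mode", 10);
}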

Droidrun Portal (Android Application)

This Android application is the perception layer (analogous to eyes and ears) of the Droidrun system, with the main functions:

  1. Core Mechanism: Runs as an Accessibility Service, using system APIs to read screen UI structure without screenshots or root access.
  2. UI Scanning: Periodically (not responding in real-time to each event) scans the UI elements of the current active window (AccessibilityNodeInfo tree).
  3. Element Processing:
    • Extracts information of visible elements (boundaries, text, class name, ID).
    • Builds parent-child hierarchy based on spatial containment relationships.
    • Assigns sequential indices (clickableIndex) to interactive/text elements for reference by the Python Agent.
  4. Data Provision:
    • When receiving commands (such as GET_ELEMENTS) sent by the Python Agent via ADB broadcast:
      • Serializes the processed UI element list (including hierarchy) into nested JSON.
      • Writes the JSON to a file on the device.
      • Key Point: Prints (Logs) the path of the JSON file through Android Logcat, informing the Python Agent where to get the data.
  5. Visualization (Optional): Provides visual feedback through OverlayManager by drawing rectangles with indices and heat map colors (based on detection time) as overlay layers on the screen.
  6. Data Model (ElementNode.kt): Defines the structure for storing UI element information.
  7. Configuration: Declares services and required permissions/capabilities through AndroidManifest.xml and the accessibility service configuration file.

Simply put, Droidrun Portal is an "assistant" running on an Android device that uses the accessibility service to "see" the content on the screen, organizes it into structured JSON data, and then tells the Python Agent where the data is hidden through an "agreement" (file path in Logcat). It can also optionally draw frames on the screen to show what it sees.

Foundation of the AgentKit Vision: Cross-Platform Capabilities of AccessKit

Droidrun appears (I haven't run it locally) to run smoothly on Android phones, but from a universality perspective, it has obvious limitations. It is precisely these limitations that inspired me to continue thinking about AgentKit, a universal cross-platform, cross-device AI automation kit.

One of the huge challenges in implementing cross-platform AI automation is the vastly different accessibility APIs, UI frameworks, and development languages across platforms.

Fortunately, in the Rust ecosystem, there is AccessKit, a cross-platform accessibility infrastructure library based on Rust, designed to solve the above challenges and provide an ideal foundation for AgentKit.

Its design inspiration partly comes from Chromium's multi-process accessibility architecture, adopting a push model: UI toolkits actively push complete accessibility tree information and subsequent incremental updates to AccessKit's platform adapters, rather than waiting for adapters to pull on demand. This model is particularly suitable for toolkits that render UI elements themselves (including immediate mode GUI). Some of AccessKit's code (especially design concepts and data structures) also comes from Chromium.

As a bridge between UI toolkits and native platform APIs, AccessKit's core abstractions are as follows:

  • Data Model: Defines a set of data structures (nodes, roles, attributes, states, etc.) to describe the UI element hierarchy tree (Accessibility Tree). This model is platform-independent.
    • common (accesskit crate): Defines platform-independent UI tree data models (Node, Role, Action, TreeUpdate) and geometric types.
    • consumer (accesskit_consumer crate): Maintains complete tree states, handles updates, provides high-level APIs (traversal, query).
  • Push Model: UI toolkits actively push complete initial tree states and subsequent incremental TreeUpdate to adapters. Adapters maintain internal states, directly responding to AT queries without frequent callbacks to applications. This is in contrast to the "pull model" of traditional AT APIs.
    • Advantages: Decoupling, good performance (especially for immediate mode GUI), and asynchronous friendliness (async-io is supported by default, with an optional tokio feature to adapt to different asynchronous runtimes).
  • Platform Adapters (platforms/* crates): Bridge the AccessKit model to native APIs (Windows UIA, macOS NSAccessibility, Unix AT-SPI, Android Accessibility Framework).
  • Cross-Language Support: The core library is implemented in Rust, bringing memory safety, high performance, zero-cost abstraction, concurrency safety, excellent FFI, and cross-platform compilation capabilities. It also provides bindings for languages like C and Python to facilitate integration with non-Rust toolkits.

Pull Model vs Push Model

Pull Model: This is a traditional or more direct approach. Imagine a screen reader (or other assistive technology, AT) wants to know the name of a button. It sends a request to your application through the platform's accessibility API: "Please tell me the name of the button with ID X." Your application (or UI toolkit) receives this request, looks up the corresponding information, and returns the name to the platform API, which eventually passes it to the screen reader. In this model, the information flow is actively pulled by the AT or platform API.

Push Model: AccessKit adopts this approach. Your application (or UI toolkit) doesn't wait for passive queries but actively pushes the entire or partial state of the accessibility tree to AccessKit's platform adapter. When the UI changes, your application calculates the differences and pushes these changes to the adapter again. The platform adapter is responsible for maintaining a complete, up-to-date internal representation of the accessibility tree. When a screen reader queries information through the platform API (e.g., "What is the name of the node with ID Y?"), the platform adapter can look up and return the information directly from its own maintained tree state, usually without needing to go back and ask your application.
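
To make the push model concrete, here is a deliberately simplified Rust sketch. The types are my own stand-ins rather than accesskit's actual API: the toolkit pushes tree updates, the adapter keeps the latest state, and an AT-style query is answered from the adapter's copy without a callback into the application.

use std::collections::HashMap;

// Simplified stand-ins for an accessibility node and an incremental update.
// (Illustrative only; the real accesskit crate has richer Node/Role/TreeUpdate types.)
#[derive(Clone, Debug)]
struct Node {
    role: &'static str,
    label: String,
}

struct TreeUpdate {
    changed_nodes: Vec<(u64, Node)>, // (node id, new state)
}

/// The platform adapter: owns the pushed tree state and answers queries from it.
#[derive(Default)]
struct Adapter {
    nodes: HashMap<u64, Node>,
}

impl Adapter {
    /// Push model: the UI toolkit calls this whenever the UI changes.
    fn apply(&mut self, update: TreeUpdate) {
        for (id, node) in update.changed_nodes {
            self.nodes.insert(id, node);
        }
    }

    /// An AT-style query ("what is the name of node Y?") is served locally,
    /// without a callback into the application.
    fn name_of(&self, id: u64) -> Option<&str> {
        self.nodes.get(&id).map(|n| n.label.as_str())
    }
}

fn main() {
    let mut adapter = Adapter::default();

    // Initial full push from the toolkit...
    adapter.apply(TreeUpdate {
        changed_nodes: vec![(1, Node { role: "button", label: "Send".into() })],
    });
    // ...and an incremental update after the UI changes.
    adapter.apply(TreeUpdate {
        changed_nodes: vec![(1, Node { role: "button", label: "Sending...".into() })],
    });

    println!("{:?}", adapter.nodes.get(&1));
    assert_eq!(adapter.name_of(1), Some("Sending..."));
}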

AccessKit's core goal is to enable UI toolkits to provide necessary information to the operating system's accessibility APIs so that various assistive technologies (ATs) can understand and manipulate the UI. These assistive technologies include but are not limited to:

  • Screen Readers: Such as NVDA, JAWS on Windows, VoiceOver on macOS and iOS, TalkBack on Android, and Orca on Linux. They read aloud the text and UI element information on the screen, or output it to braille displays.
  • Screen Magnifiers: Such as Windows Magnifier, macOS Zoom. They magnify parts of the screen and may need to track focus or mouse position.
  • Voice Control Software: Such as Dragon NaturallySpeaking, Windows Voice Access, macOS Voice Control. Users interact with the UI through voice commands.
  • Switch Devices: Used for users with severe mobility impairments, scanning and selecting UI elements through one or more switches.
  • Tools for Reading and Writing Difficulties: May need to highlight text, change fonts or colors, etc.

AccessKit's platform adapters are responsible for effectively conveying this information to each platform's native APIs and assistive technologies, thereby simplifying the implementation of cross-platform accessibility.

So, compared with Droidrun Portal: both provide structured UI perception capabilities, but AccessKit is cross-platform and offers a more general abstraction model, while Droidrun Portal is limited to Android.

Therefore, it's clear that AccessKit's key to implementing universal AI automation control is:

  • Structured and Semantic UI Understanding: Provides deeper interface understanding (roles, states, attributes, relationships) than screenshots.
  • Standardized Interaction Interface: Defines a unified Action enumeration, providing a stable programming interface for simulating interactions.
  • Cross-Platform Consistency: Core models and action definitions are consistent across platforms, facilitating the reuse of control logic.

For applications that have integrated AccessKit, it provides an extremely ideal foundation for perception and control. AI Agents can:

  1. Precise Perception: Understand UI structure, element roles, states, and attributes by querying the AccessKit tree.
  2. Reliable Execution: Execute interactions by requesting standard AccessKit Actions.
  3. Cross-Platform Operation: Reuse most control logic.

Of course, there is a prerequisite here: the target application must integrate AccessKit. For applications that haven't integrated it, AI Agents still need to rely on other technologies (native APIs, visual automation, etc.).

For applications built using AccessKit, AccessKit provides an extremely powerful and ideal infrastructure for AI Agents to perceive and control. It provides a structured, semantic, cross-platform, and relatively stable way to understand UI states and execute interactive actions, which has clear advantages over many existing automation technologies (especially visual automation).

However, its biggest limitation is the adoption rate of applications. Currently, AccessKit is still a relatively new project, and there aren't many UI toolkits and applications that have adopted it. Therefore, a universal AI Agent aimed at controlling arbitrary desktop or mobile applications cannot widely rely on AccessKit yet and must have the capability to use other automation strategies. But as the AccessKit ecosystem develops, it has the potential to become a very valuable tool in the AI control field.

AgentKit: Universal AI Automation Framework Concept

Based on the powerful foundation provided by AccessKit, we can envision a grander, more universal AI control framework—AgentKit. AgentKit's goal is to become a modular, extensible AI automation solution for all platforms.

AgentKit Architecture Macro Overview

AgentKit's design philosophy is layered, modular, and framework-agnostic:

┌────────────────────────────────────────────────────────┐
│                        AI Agent                        │
│ (LLM/Model, Task Planning, Decision Logic - Pluggable) │
└───────────────────────────┬────────────────────────────┘
                            │ (Unified Instructions/States)
┌───────────────────────────▼────────────────────────────┐
│                     AgentKit Core                      │
│ (State Management, Coordinator, Unified Action Model,  │
│                  Security Management)                  │
└──────┬───────────────────┬───────────────────┬─────────┘
       │                   │                   │
┌──────▼──────┐     ┌──────▼──────┐     ┌──────▼──────┐
│  AccessKit  │     │   WebKit    │     │   Device    │
│   Bridge    │     │   Bridge    │     │   Bridge    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
┌──────▼──────┐     ┌──────▼──────┐     ┌──────▼──────┐
│  AccessKit  │     │  Browser /  │     │  Platform   │
│  Adapters   │     │   WebView   │     │  Specific   │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
┌──────▼───────────────────▼───────────────────▼─────────┐
│                   Applications & UIs                   │
└─────────────────────────────────────────────────────────┘

Key Architecture Notes:

  1. Three-Bridge Parallel Strategy:
    • AccessKit Bridge: Used preferentially; obtains structured UI information and executes standard actions.
    • WebKit Bridge: Handles Web content (browser/WebView) and can execute JS. This addresses the WebView interaction challenge: a WebView exposes its internal Web accessibility tree to the system (visible to AccessKit), but JS operations are more powerful and direct, so this bridge provides JS execution capability.
    • Device Bridge: Handles other non-phone smart terminal devices (Raspberry Pi, Jetson, Orange Pi, and other embedded Linux platforms).
  2. Coordinator Pattern: The Core layer dynamically selects the most appropriate bridging method and handles mixed scenarios.
  3. Logical/Physical Separation: Prioritizes semantics-based logical actions (ClickElement), falling back to physical actions (ClickPosition) when necessary; see the sketch after this list.
  4. Efficient Data Exchange: Drawing on the Droidrun experience, but replacing the file-system-plus-Logcat handoff with shared memory, memory-mapped files, or other zero-copy mechanisms plus a notification channel, improving efficiency.
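
Here is a hedged sketch of the "logical action first, physical fallback" idea from note 3. The AgentAction enum, Bridge trait, and Coordinator below are illustrative names for a design that does not exist yet, not a finalized AgentKit API.

/// A unified action model: logical actions reference elements by semantic id,
/// physical actions fall back to raw coordinates.
#[derive(Debug, Clone)]
enum AgentAction {
    ClickElement { element_id: String },
    ClickPosition { x: f64, y: f64 },
}

/// Every bridge (AccessKit, WebKit, Device) exposes the same narrow interface.
trait Bridge {
    fn name(&self) -> &'static str;
    fn can_handle(&self, action: &AgentAction) -> bool;
    fn execute(&self, action: &AgentAction) -> Result<(), String>;
}

/// The coordinator tries bridges in priority order and degrades from a
/// logical action to a physical one when no bridge can resolve the element.
struct Coordinator {
    bridges: Vec<Box<dyn Bridge>>,
}

impl Coordinator {
    fn execute(&self, action: AgentAction) -> Result<(), String> {
        for bridge in &self.bridges {
            if bridge.can_handle(&action) {
                println!("dispatching {:?} via {}", action, bridge.name());
                return bridge.execute(&action);
            }
        }
        // Fallback: translate ClickElement into ClickPosition using cached
        // element geometry (the lookup is omitted in this sketch).
        if let AgentAction::ClickElement { element_id } = &action {
            let fallback = AgentAction::ClickPosition { x: 120.0, y: 540.0 };
            println!("no bridge resolved '{element_id}', falling back to {fallback:?}");
            return self.execute(fallback);
        }
        Err(format!("no bridge could handle {action:?}"))
    }
}

struct NullAccessKitBridge; // stand-in: pretends no element ids can be resolved

impl Bridge for NullAccessKitBridge {
    fn name(&self) -> &'static str { "accesskit-bridge" }
    fn can_handle(&self, action: &AgentAction) -> bool {
        matches!(action, AgentAction::ClickPosition { .. }) // only raw clicks here
    }
    fn execute(&self, action: &AgentAction) -> Result<(), String> {
        println!("{} executed {:?}", self.name(), action);
        Ok(())
    }
}

fn main() {
    let coordinator = Coordinator { bridges: vec![Box::new(NullAccessKitBridge)] };
    coordinator
        .execute(AgentAction::ClickElement { element_id: "send-button".into() })
        .unwrap();
}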

A core advantage of AgentKit's design will be supporting all UI frameworks that have implemented AccessKit, including but not limited to:

  1. Rust Native UI Frameworks:
    • Makepad: Hardware-accelerated cross-platform UI framework
    • Druid: Data-driven Rust UI toolkit
    • Iced: Simple cross-platform GUI library
    • Vizia: Declarative Rust GUI library, focusing on audio applications
    • Slint: Lightweight UI toolkit, suitable for embedded systems
    • Bevy UI: Game UI system based on the Bevy engine
  2. Cross-Language UI Frameworks:
    • Flutter: Can integrate AccessKit through Rust bindings
    • GTK: Supported through AccessKit adapters
    • Qt: Can provide support through AccessKit bridging
    • wxWidgets: Can integrate AccessKit to provide cross-platform accessibility support
  3. Experimental or Potentially Supported Frameworks in the Future:
    • Xilem: Experimental Rust GUI framework
    • Egui: Immediate mode GUI library
    • Custom rendering engines: Custom rendering systems used by games and professional applications

For framework support, AgentKit adopts a plugin architecture, implementing flexible integration through adapter interfaces, fully leveraging the advantages of Rust language engineering capabilities.

Using dora-rs to Implement Automated Control of Non-Phone Smart Terminals

Enabling AgentKit to control Raspberry Pi, Jetson, and other smart devices is a completely feasible and extremely valuable extension direction, one that can expand AgentKit from GUI automation into a more universal AI control platform that interacts with the physical world. This corresponds to the Device Bridge part of the architecture above.

Core Mechanism:

  1. Device-Side Nodes: Run dora-rs nodes on target smart devices, acting as device drivers/proxies, encapsulating interaction logic with specific hardware (GPIO, sensors, serial ports, etc.) or software (SDKs, APIs).
  2. Central Brain: AgentKit Core (running on PC/server) is responsible for decision-making.
  3. Network Bridge: AgentKit Core communicates with device-side dora-rs nodes through the network.
  4. Standardized Data Flow: Use dora-rs message passing to standardize instructions (such as SetPinHigh, ReadTemperature), states/data (such as PinState, TemperatureReading), and capability descriptions.

Main Advantages:

  • Abstract Decoupling: AgentKit Core doesn't need to care about device details, only interacting with standardized interfaces.
  • Leveraging dora-rs Advantages: Fully utilizing its high performance, low latency, and data flow processing capabilities on embedded Linux.
  • Easy to Extend: Adding new devices only requires developing corresponding dora-rs nodes.
  • Distributed Intelligence: Can implement some local processing and control logic on the device side.
  • Unified Framework: Use one set of AgentKit to orchestrate complex workflows involving GUI, Web, and physical devices.

An example application can be envisioned (a Raspberry Pi temperature-controlled LED):

  1. Raspberry Pi runs sensor_node (temperature and humidity reading node) and led_node (LED control node).
  2. AgentKit Core (PC) instructs sensor_node to read the temperature.
  3. sensor_node reads and sends back TemperatureReading(32.5) to Core.
  4. Core determines the temperature is over the limit (>30) and instructs led_node to turn on the LED.
  5. led_node executes and returns LedStatus(on).

In this way, by running dora-rs nodes on smart devices, AgentKit can reliably and efficiently extend its automation capabilities to the physical world, becoming a more powerful universal AI automation platform.
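
To show what the standardized data flow could look like in practice, here is a minimal Rust sketch of the temperature-controlled LED example. It only models the message types and the decision logic; the actual dora-rs node APIs and transport are deliberately left out, and all names here (DeviceInstruction, DeviceState, the node functions) are my own assumptions.

/// Instructions that AgentKit Core can send to device-side nodes.
#[derive(Debug)]
enum DeviceInstruction {
    ReadTemperature,
    SetLed { on: bool },
}

/// States/data that device-side nodes report back to AgentKit Core.
#[derive(Debug)]
enum DeviceState {
    TemperatureReading { celsius: f32 },
    LedStatus { on: bool },
}

/// Stand-in for the sensor node running on the Raspberry Pi.
fn sensor_node(instruction: &DeviceInstruction) -> DeviceState {
    match instruction {
        DeviceInstruction::ReadTemperature => DeviceState::TemperatureReading { celsius: 32.5 },
        DeviceInstruction::SetLed { .. } => unreachable!("sensor node does not drive the LED"),
    }
}

/// Stand-in for the LED node running on the Raspberry Pi.
fn led_node(instruction: &DeviceInstruction) -> DeviceState {
    match instruction {
        DeviceInstruction::SetLed { on } => DeviceState::LedStatus { on: *on },
        DeviceInstruction::ReadTemperature => unreachable!("LED node has no sensor"),
    }
}

fn main() {
    // AgentKit Core's decision logic for the example workflow.
    let reading = sensor_node(&DeviceInstruction::ReadTemperature);
    if let DeviceState::TemperatureReading { celsius } = reading {
        println!("temperature: {celsius} °C");
        if celsius > 30.0 {
            let status = led_node(&DeviceInstruction::SetLed { on: true });
            println!("LED status: {status:?}");
        }
    }
}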

AgentKit's design has significant advantages over platform-specific solutions (like Droidrun.ai):

  1. Truly Cross-Platform: Develop AI Agent logic once, and it can control applications on Windows, macOS, Linux, Android, iOS (through frameworks like Flutter), Web, and more.
  2. Framework Independence: As long as the UI framework implements AccessKit (for native) or provides Web access interfaces, it can be controlled by AgentKit, allowing developers to freely choose their technology stack.
  3. Code Reuse and Maintenance: Core AI logic and control models only need to be maintained as one set, significantly reducing maintenance costs.
  4. Unified User Experience: Users can get consistent AI assistance or automation experiences across different devices and applications.
  5. Leveraging Native Capabilities: Through AccessKit and Web APIs, AgentKit deeply leverages native platform capabilities, achieving precise, efficient perception and control, avoiding fragile image recognition.
  6. Modularity and Extensibility: Can easily add support for new frameworks, new platforms, or new AI models, ensuring the system can evolve with technological developments.
  7. Performance Optimization: Rust-based implementation provides excellent performance and memory safety guarantees, particularly suitable for real-time UI control scenarios.

AgentKit Dedicated AI Gateway Service (Optional Enhancement)

To better support cross-platform deployment, centralized management, and cloud inference capabilities, AgentKit can introduce a dedicated AI Gateway service as a bridge between clients and AI providers. This architecture is not necessary but can bring significant advantages.

  1. Abstraction and Unification: Provides clients with a single, stable API interface, shielding differences and changes in underlying AI providers (OpenAI, Anthropic, Gemini, etc.). Clients don't need to adapt to multiple API formats and authentication methods.
  2. Intelligent Routing and Load Balancing: Automatically selects the most appropriate AI provider or model for inference based on request type (such as requiring visual capabilities), real-time load, cost, provider health status, and other factors.
  3. Security Enhancement: Centrally manages API keys for all AI providers, eliminating the need for clients to directly store these sensitive credentials, reducing the risk of leakage. The Gateway is responsible for authenticating client identity.
  4. Request Optimization and Transformation:
    • Compression: Intelligently compresses large data such as UI snapshots and history records, reducing network transmission volume.
    • Format Conversion: Converts AgentKit's internal request format to the format required by specific AI providers.
    • Context Management: May intelligently truncate or summarize too-long history records.
  5. Response Caching: Caches deterministic or high-frequency requests (e.g., requests with low temperature settings) to avoid repeated calls to expensive AI APIs, reducing costs and latency (a caching sketch follows this list).
  6. Monitoring, Analysis, and Billing: Centrally collects metrics such as API call count, Token consumption, latency, error rate, facilitating analysis of performance, cost, user behavior patterns, and enabling unified billing.
  7. Rate Limiting and Quota Management: Implements unified rate limiting and quota management for clients to prevent abuse.
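
As a sketch of the caching rule mentioned above (only near-deterministic, low-temperature requests are cached), here is one possible shape in Rust. The types, the key scheme, and the 0.1 threshold are illustrative, not a real gateway implementation.

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

struct InferenceRequest {
    model: String,
    prompt: String,
    temperature: f32,
}

#[derive(Default)]
struct ResponseCache {
    entries: HashMap<u64, String>,
}

impl ResponseCache {
    /// Only near-deterministic requests are worth caching.
    fn cacheable(req: &InferenceRequest) -> bool {
        req.temperature <= 0.1
    }

    /// Cache key over everything that determines the answer.
    fn key(req: &InferenceRequest) -> u64 {
        let mut h = DefaultHasher::new();
        req.model.hash(&mut h);
        req.prompt.hash(&mut h);
        req.temperature.to_bits().hash(&mut h); // f32 is not Hash; hash its bits
        h.finish()
    }

    fn get_or_call(
        &mut self,
        req: &InferenceRequest,
        call_provider: impl Fn(&InferenceRequest) -> String,
    ) -> String {
        if !Self::cacheable(req) {
            return call_provider(req); // e.g. creative generations: never cached
        }
        let key = Self::key(req);
        if let Some(hit) = self.entries.get(&key) {
            return hit.clone(); // cache hit: no paid API call
        }
        let response = call_provider(req);
        self.entries.insert(key, response.clone());
        response
    }
}

fn main() {
    let mut cache = ResponseCache::default();
    let req = InferenceRequest {
        model: "some-model".into(),
        prompt: "Summarize this UI snapshot ...".into(),
        temperature: 0.0,
    };
    // The closure stands in for a provider call; the second lookup hits the cache.
    let a = cache.get_or_call(&req, |_req: &InferenceRequest| "summary".to_string());
    let b = cache.get_or_call(&req, |_req: &InferenceRequest| "summary (recomputed)".to_string());
    println!("{a} / {b}");
}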

The AI Gateway system architecture design is roughly as follows:

┌───────────────────────┐
│    AgentKit Client    │
│   (Windows, macOS,    │
│    Linux, Mobile)     │
└───────────┬───────────┘
            │
            │ HTTPS / WebSockets (Encrypted Communication)
            │ (AgentKit Client <-> Gateway)
            ▼
┌─────────────────────────────────────────────────┐
│               AgentKit AI Gateway               │
│                                                 │
│  ┌─────────────────┐     ┌─────────────────┐    │
│  │ Authentication  │◀───▶│ Request Routing │    │
│  │ & Authorization │     │ & Load Balancing│    │
│  │  (Client Auth)  │     └────────┬────────┘    │
│  └────────┬────────┘              │             │
│           │                       │             │
│           ▼                       ▼             │
│  ┌─────────────────┐     ┌─────────────────┐    │
│  │ Request Optimiz.│────▶│ Response Cache  │    │
│  │ & Transformation│     └────────┬────────┘    │
│  └────────┬────────┘              │             │
│           │                       │             │
│           ▼                       ▼             │
│  ┌─────────────────┐     ┌─────────────────┐    │
│  │ AI Provider I/F │     │  Monitoring &   │    │
│  │ (Unified I/F)   │     │    Analysis     │    │
│  └────────┬────────┘     └─────────────────┘    │
└───────────┼─────────────────────────────────────┘
            │ HTTPS (Gateway <-> AI Provider)
            │ (Including Provider API Key)
            ▼
 ┌────────────┐  ┌────────────┐  ┌───────────────────┐
 │ OpenAI API │  │ Claude API │  │ Other AI Providers│
 └────────────┘  └────────────┘  └───────────────────┘

Core Component Implementation (Conceptual Explanation)

  • Authentication and Authorization Module: Verifies client API Key or Token, manages client permissions.
  • Request Routing and Load Balancing: Selects backend AI Provider based on rules (such as request characteristics, provider status, cost).
  • Request Optimization and Transformation: Implements compression, format adaptation, context processing logic.
  • Response Cache Management: Uses Redis or memory cache to store cacheable responses.
  • AI Provider Interface: Defines a unified trait AiProvider and implements a specific adapter for each provider, encapsulating its API calls and key management (a minimal sketch follows this list).
  • Monitoring and Analysis: Integrates Prometheus/Grafana or similar systems to record key metrics.
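
Here is a hedged sketch of what that unified AiProvider trait and a simple capability-based router could look like. It is synchronous for brevity (a real gateway would be async), and every name in it is illustrative rather than an existing API.

struct InferenceRequest {
    prompt: String,
    needs_vision: bool,
}

struct InferenceResponse {
    text: String,
    tokens_used: u32,
}

/// The unified interface every provider adapter implements.
trait AiProvider {
    fn name(&self) -> &'static str;
    fn supports_vision(&self) -> bool;
    fn complete(&self, request: &InferenceRequest) -> Result<InferenceResponse, String>;
}

/// A real adapter would wrap a concrete HTTP API and its key management.
struct DummyProvider {
    name: &'static str,
    vision: bool,
}

impl AiProvider for DummyProvider {
    fn name(&self) -> &'static str { self.name }
    fn supports_vision(&self) -> bool { self.vision }
    fn complete(&self, request: &InferenceRequest) -> Result<InferenceResponse, String> {
        Ok(InferenceResponse {
            text: format!("[{}] answer to: {}", self.name, request.prompt),
            tokens_used: 42,
        })
    }
}

/// Routing: pick the first provider that satisfies the request's requirements.
/// (A real gateway would also weigh load, cost, and health checks.)
fn route<'a>(
    providers: &'a [Box<dyn AiProvider>],
    request: &InferenceRequest,
) -> Option<&'a dyn AiProvider> {
    providers
        .iter()
        .map(|p| p.as_ref())
        .find(|p| !request.needs_vision || p.supports_vision())
}

fn main() {
    let providers: Vec<Box<dyn AiProvider>> = vec![
        Box::new(DummyProvider { name: "text-only", vision: false }),
        Box::new(DummyProvider { name: "multimodal", vision: true }),
    ];
    let request = InferenceRequest { prompt: "Describe this screenshot".into(), needs_vision: true };
    let provider = route(&providers, &request).expect("no suitable provider");
    let response = provider.complete(&request).unwrap();
    println!("routed to {}: {} ({} tokens)", provider.name(), response.text, response.tokens_used);
}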

Complementary Collaboration Between AgentKit and Claude MCP / Google A2A Protocols

Recently, two AI-Agent-related protocols, MCP and A2A, have started to gain traction, so I think it's worth comparing them with AgentKit.

First, let's clarify the core problems that each of these three technologies solves:

Technology        Type      Core Positioning            Main Problem Solved
MCP (Anthropic)   Protocol  Tool Connection Layer       "How AI can call external tools and data sources"
A2A (Google)      Protocol  Agent Collaboration Layer   "How different AI agents can securely communicate with each other"
AgentKit          Tool      UI Interaction Layer        "How AI can perceive and operate user interfaces across platforms"

Relationship Analysis Between AgentKit and MCP

MCP provides a standardized way to connect AI models with external tools, while AgentKit focuses on providing cross-platform UI control capabilities. This relationship is naturally complementary.

MCP can handle "tool calls" (such as getting weather, searching for information, accessing files), while AgentKit handles "UI interactions" (such as clicking buttons, inputting text, parsing interface structures).

AgentKit can serve as an MCP tool provider: it can be implemented as an MCP server that exposes a standardized set of UI interaction tools. For example:

# Example of AgentKit implemented as an MCP service
from mcp.server.fastmcp import FastMCP

# Create MCP server
mcp = FastMCP("AgentKit UI Automation")

@mcp.tool()
def click_element(element_id: str) -> bool:
    """Click a UI element with the specified ID."""
    # AgentKit internal implementation of cross-platform clicking
    return agent_kit.bridges.coordinator.execute_action(
        AgentAction.ClickElement(element_id=element_id)
    )
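
# Note: 'agent_kit' above stands for a hypothetical Python binding to AgentKit
# Core; it is not an existing package. A real server would then typically be
# started with mcp.run().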

The serious security challenges that MCP currently faces (such as tool poisoning attacks) also provide important warnings for AgentKit:

  1. Tool Isolation: AgentKit needs to implement a sandbox-like mechanism to limit the permission scope of UI interaction operations
  2. Clear Permission Model: AgentKit should adopt fine-grained permission control, distinguishing between reading the UI (low risk) and executing operations (high risk); a sketch of such a model follows this list
  3. Secondary Confirmation Mechanism: Key operations (such as submitting forms, deleting content) should require explicit user confirmation
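
A minimal sketch of the fine-grained permission model and confirmation rule from points 2 and 3 might look like this; the names and risk levels are illustrative assumptions, not a specification.

/// Risk classification for UI interaction capabilities.
#[derive(Debug, PartialEq)]
enum Permission {
    ReadUi,          // low risk: inspect the accessibility tree
    ExecuteAction,   // high risk: click, type, swipe
    SubmitOrDelete,  // critical: submitting forms, deleting content
}

struct Policy {
    granted: Vec<Permission>,
}

impl Policy {
    fn allows(&self, needed: &Permission) -> bool {
        self.granted.contains(needed)
    }

    /// Critical operations always require an explicit user confirmation,
    /// even when the permission itself has been granted.
    fn needs_confirmation(&self, needed: &Permission) -> bool {
        *needed == Permission::SubmitOrDelete
    }
}

fn main() {
    let policy = Policy { granted: vec![Permission::ReadUi, Permission::ExecuteAction] };
    for action in [Permission::ReadUi, Permission::ExecuteAction, Permission::SubmitOrDelete] {
        println!(
            "{:?}: allowed = {}, confirm = {}",
            action,
            policy.allows(&action),
            policy.needs_confirmation(&action)
        );
    }
}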

Relationship Analysis Between AgentKit and A2A

The A2A protocol focuses on collaboration and communication between AI Agents, while AgentKit focuses on UI interaction capabilities, which are also highly complementary. AgentKit can be implemented as an A2A-compatible agent, dedicated to handling UI interaction tasks, thereby enhancing AI Agent division of labor and collaboration. A2A's security mechanisms (such as AgentCard, authentication) can also enhance AgentKit's security.

For example:

// Example of AgentKit as an A2A agent's AgentCard
{
  "name": "UI Interaction Agent",
  "description": "Specialized in cross-platform user interface interaction",
  "provider": {
    "organization": "AgentKit.org"
  },
  "skills": [
    {
      "id": "ui_discovery",
      "name": "UI Element Discovery",
      "description": "Discover and analyze interactive elements on the interface"
    },
    {
      "id": "ui_interaction",
      "name": "UI Element Interaction",
      "description": "Interact with interface elements (click, input, etc.)"
    }
  ],
  "authentication": {
    "schemes": ["bearer"]
  }
}

AgentCard: A public metadata file in the A2A protocol (usually located at /.well-known/agent.json), describing the agent's capabilities, skills, endpoint URLs, and authentication requirements. Clients use this file for discovery.

Three-Layer Architecture Integration Scheme

Based on the above analysis, a three-layer architecture integrating MCP, A2A, and AgentKit can be constructed:

┌─────────────────────────────────────────────────────────────┐
│                   A2A Collaboration Layer                   │
│ (Agent Coordination, Task Assignment, Secure Communication) │
└──────────────────────────────┬──────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────┐
│                       MCP Tool Layer                         │
│         (Standardized Tool Calls, Resource Access)          │
└────────────────┬───────────────────────────────────┬────────┘
                 │                                   │
┌────────────────▼────────────────┐   ┌──────────────▼──────────────┐
│        AgentKit UI Layer        │   │   Other Specialized Tools   │
│ (Cross-platform UI Interaction) │   │ (Search, Calculation, etc.) │
└─────────────────────────────────┘   └─────────────────────────────┘

In this architecture:

  1. The A2A Layer handles task assignment and collaboration between agents
  2. The MCP Layer provides a standardized tool calling interface
  3. The AgentKit Layer serves as a specialized tool for MCP, providing cross-platform UI interaction capabilities

Conclusion

Above is my technical vision for AgentKit, a universal AI automation framework (or kit).

We are in an era where AI technology is profoundly influencing human-computer interaction. I foresee that future applications will not be completely replaced by AI, but will form a diversified interaction ecosystem: AI agents, traditional graphical applications, and hybrid systems that combine both will coexist.

User experience will also become layered: at the high level, users can conveniently express their intentions through natural language; at the low level, AI is responsible for translating intentions into specific operations, and when necessary, calling appropriate graphical interfaces for users to perform fine control or receive intuitive feedback.

This evolutionary trend highlights the need for new infrastructure. My AgentKit concept is deeply inspired by Droidrun.ai's exploration of automation on the Android platform. Droidrun.ai's practice verifies the feasibility of AI-driven device control but also exposes its single-platform limitations.

Therefore, the core goal of AgentKit is to transcend platform silos and achieve truly unified cross-platform automation. The key technical support for this vision is AccessKit. It provides a standardized, cross-platform accessibility interface that allows AI Agents to "understand" and "operate" user interfaces under different operating systems and application frameworks in a consistent way.

Through this technical vision of AgentKit, I see a clear possible path for AI-driven automation to move from platform silos to cross-platform unification. Rust's safety, performance, and cross-platform capabilities, combined with AccessKit's unified accessibility interface and the understanding capabilities of modern LLMs, together provide a solid technical foundation for this vision.

Thank you for reading.