Note: This repository is for research purposes only and not for commercial use.
Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automate human-computer workflows. However, they face an efficiency challenge: they process long sequences of high-resolution screenshots while solving long-horizon tasks, which makes inference slow, costly, and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are sub-optimal because they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames' keys onto the current frame's key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines, closely matching full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
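Conceptually, the two scoring ideas can be sketched in a few lines of PyTorch. The snippet below is an illustration only, not the repository's implementation: the helper names, single-head shapes, and normalization choices are assumptions made for clarity.

import torch

def spatial_saliency_scores(attn, hidden, alpha=2.0):
    # Augment per-token attention scores (attn: [num_tokens]) with the L2 norm
    # of the corresponding hidden states (hidden: [num_tokens, d_model]).
    saliency = hidden.norm(dim=-1)
    attn = attn / (attn.sum() + 1e-6)            # scale both terms comparably (assumed choice)
    saliency = saliency / (saliency.sum() + 1e-6)
    return attn + alpha * saliency

def temporal_redundancy_scores(prev_keys, curr_keys):
    # Score how much each previous-frame key (prev_keys: [n_prev, d_head]) lies in
    # the subspace spanned by the current frame's keys (curr_keys: [n_curr, d_head]).
    q, _ = torch.linalg.qr(curr_keys.T)          # orthonormal basis of the current key subspace
    projected = prev_keys @ q @ q.T              # projection onto that subspace
    # Higher ratio -> more redundant with the current frame -> prune first.
    return projected.norm(dim=-1) / (prev_keys.norm(dim=-1) + 1e-6)

if __name__ == "__main__":
    print(spatial_saliency_scores(torch.rand(16), torch.randn(16, 64)).shape)        # torch.Size([16])
    print(temporal_redundancy_scores(torch.randn(32, 64), torch.randn(8, 64)).shape) # torch.Size([32])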
GUI-KV_GH/
├── gui_kv/ # Core GUI-KV cache compression utilities
│ └── gui_kv_utils.py
└── eval/ # Benchmark evaluation pipelines
├── *_eval.py # Python entry points
├── *_eval.sh # Orchestration scripts (preferred entry)
├── attention_helpers.py # Attention overrides and helpers
├── process_utils.py # Shared helpers for data / metrics
└── ... # Benchmark-specific utilities
- Python ≥ 3.10
- Linux with CUDA-enabled GPUs (scripts target 7B-class multimodal models)
- CUDA 12.x (for PyTorch and flash-attention)
- Install transformers from source (required):
  pip install git+https://github.com/huggingface/transformers.git@bbca9782ca1b8b358cc832a1b821aa1b450850da
  Note: The code requires a specific development version of transformers (4.54.0.dev0) from commit bbca9782.
- Install other core dependencies:
  pip install -r requirements.txt
- Install flash-attention (if not already installed):
  pip install flash-attn --no-build-isolation
  Note: Flash-attention requires CUDA and may take several minutes to compile. See the flash-attention repository for troubleshooting. A quick optional sanity check for the installation is sketched right after this list.
- Benchmark datasets:
  - AgentNetBench, AndroidControl, ScreenSpot v2 / Pro, Multimodal Mind2Web
  - See Dataset Configuration below for how to configure paths
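If the installation above succeeded, the following optional check (run in a Python shell on the GPU machine) should print the pinned transformers version and import flash-attn without errors:

# Optional sanity check for the pinned transformers build, CUDA, and flash-attn.
import torch
import transformers

print(transformers.__version__)   # expected: 4.54.0.dev0
print(torch.cuda.is_available())  # expected: True on a CUDA-enabled machine

import flash_attn                 # raises ImportError if the build failed
print(flash_attn.__version__)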
Before running benchmarks, configure the dataset paths. You can do this in two ways:
Option 1: Export the required environment variables before running the scripts:
# AgentNetBench
export AGENTNETBENCH_IMGS=/path/to/AgentNetBench/test_data/images
export AGENTNETBENCH_DATA=/path/to/AgentNetBench/test_data
# AndroidControl
export ANDROIDCONTROL_IMGS=/path/to/AndroidControl/images
export ANDROIDCONTROL_TEST=/path/to/AndroidControl/data
# Multimodal Mind2Web
export MM_MIND2WEB_IMGS=/path/to/Multimodal-Mind2Web/release_images
export MM_MIND2WEB_TEST=/path/to/Multimodal-Mind2Web/data/samples
# ScreenSpot v2
export SCREENSPOT_IMGS=/path/to/ScreenSpot-v2/screenspotv2_image
export SCREENSPOT_TEST=/path/to/ScreenSpot-v2
# ScreenSpot Pro
export SCREENSPOTPRO_IMGS=/path/to/ScreenSpot-Pro/images
export SCREENSPOTPRO_TEST=/path/to/ScreenSpot-Pro/annotations
Option 2: Open the desired shell script (e.g., eval/agentnetbench_eval.sh) and modify the dataset path defaults near the top of the file.
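Either way, you can optionally confirm that the configured paths exist before launching a sweep. The snippet below is a small, hypothetical helper; the variable names match the AgentNetBench exports above, so adjust them for other benchmarks:

# Optional: confirm the exported dataset paths exist before running a benchmark.
import os

for var in ("AGENTNETBENCH_IMGS", "AGENTNETBENCH_DATA"):
    path = os.environ.get(var)
    if not path or not os.path.isdir(path):
        raise SystemExit(f"{var} is unset or does not point to an existing directory: {path!r}")
print("Dataset paths look good.")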
The recommended entry points are the shell scripts under eval/. Each script sweeps KV cache budgets, launches background jobs with nohup, and writes logs and results under logs/ and results/.
All shell scripts support:
Models:
- ByteDance-Seed/UI-TARS-1.5-7B
- xlangai/OpenCUA-7B
KV Cache Methods:
original
- No compression (full cache)pyramid_kv
- Pyramid KVvl_cache
- VL-Cachesnap_kv
- SnapKVgui_kv
- GUI-KV (ours)
You can switch models or cache methods by editing the model_path
and kv_cache
variables at the top of each shell script.
| Shell script | Benchmark | Required environment variables |
|---|---|---|
| eval/agentnetbench_eval.sh | AgentNetBench | AGENTNETBENCH_IMGS, AGENTNETBENCH_DATA |
| eval/androidcontrol_eval.sh | AndroidControl | ANDROIDCONTROL_IMGS, ANDROIDCONTROL_TEST |
| eval/multimodal_mind2web_eval.sh | Multimodal Mind2Web | MM_MIND2WEB_IMGS, MM_MIND2WEB_TEST |
| eval/screenspotv2_eval.sh | ScreenSpot v2 | SCREENSPOT_IMGS, SCREENSPOT_TEST |
| eval/screenspotpro_eval.sh | ScreenSpot Pro | SCREENSPOTPRO_IMGS, SCREENSPOTPRO_TEST |
- Configure dataset paths using environment variables (see Dataset Configuration above) or by editing the shell script directly.
- Adjust kv_cache_budgets, num_gpus, alpha, temperature, or window_size in the shell script if you want to explore other settings.
- Run the script, for example:
  bash eval/agentnetbench_eval.sh
  Each budget value is dispatched in the background. Monitor progress with the printed log paths (for example, tail -f logs/agentnetbench_eval/...).
- Aggregated metrics and detailed JSON traces will appear under results/<benchmark>/... (an optional listing helper is sketched after this list).
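Once a sweep finishes, a small helper like the one below can list what was written. It is an optional convenience rather than part of the evaluation code; the results/agentnetbench prefix follows the layout above and should be changed per benchmark:

# Optional: list the JSON result artifacts produced by a sweep.
from pathlib import Path

for path in sorted(Path("results/agentnetbench").rglob("*.json")):
    print(path)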
If you prefer not to spawn multiple jobs, call the Python entry point directly:
python eval/agentnetbench_eval.py \
--model_path ByteDance-Seed/UI-TARS-1.5-7B \
--agentnetbench_imgs /path/to/AgentNetBench/test_data/images \
--agentnetbench_data /path/to/AgentNetBench/test_data \
--kv_cache gui_kv \
--kv_cache_budget 10 \
--alpha 2 \
--window_size 8 \
--temperature 3.5 \
--attention_implementation flash_attention_2 \
--model_dtype bfloat16 \
--results_dir results/agentnetbench/custom_run \
--task all
All Python drivers share a similar argument set (--kv_cache, --kv_cache_budget, --alpha, --window_size, --temperature, etc.), so you can adapt the command for other datasets.
Advanced users can plug GUI-KV into new experiments via gui_kv/gui_kv_utils.py:
from gui_kv.gui_kv_utils import init_gui_kv

# Example hook: register GUI-KV on top of an existing attention module.
# model is an already-loaded vision-language model (see the sketch below).
kv_config = dict(
    max_capacity_prompt=320,
    window_size=8,
    alpha=2.0,
    temperature=3.5,
    pooling="avgpool",
)
init_gui_kv(model, **kv_config)
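For context, an end-to-end sketch of the hook might look like the following. This is an assumption-laden illustration rather than an excerpt from the repository: it assumes the UI-TARS-1.5-7B checkpoint loads with the Qwen2.5-VL class from the pinned transformers build, and it reuses the hyperparameters shown above.

import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from gui_kv.gui_kv_utils import init_gui_kv

# Load a GUI agent backbone. Assumes the checkpoint loads with the Qwen2.5-VL
# class; swap in the class and checkpoint you actually use.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "ByteDance-Seed/UI-TARS-1.5-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# Register GUI-KV with the same hyperparameters as the snippet above; subsequent
# generation calls should then use the compressed KV cache.
init_gui_kv(
    model,
    max_capacity_prompt=320,
    window_size=8,
    alpha=2.0,
    temperature=3.5,
    pooling="avgpool",
)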
See eval/attention_helpers.py for complete integration examples with the Qwen2.5-VL and OpenCUA backends.
@article{huang2025guikv,
title = {GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness},
author = {Huang, Kung-Hsiang and Qiu, Haoyi and Dai, Yutong and Xiong, Caiming and Wu, Chien-Sheng},
journal = {arXiv preprint arXiv:2510.00536},
year = {2025}
}
Distributed under the terms of the repository’s LICENSE.txt. Evaluate the license alongside any external model or dataset licenses before use.
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.