Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Pomilon-Intelligence-Lab/DVOC

Open more actions menu

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DVOC — Damped Visual Offset Correction

License: MIT Paper

Making weak vision models viable for pixel-precise UI grounding through iterative visual feedback.

DVOC is a training-free recursive refinement protocol for pixel-accurate UI grounding with weak VLMs. Instead of one-shot coordinate prediction, it runs a closed kinematic loop: render a visual overlay (crosshair + uncertainty circle) at the model's previous estimate, ask for a relative offset, apply the correction with exponential damping, and repeat.

Key result: 53–67% raw success (63–100% protocol) vs. 0–14% for naive one-shot prediction across 4 VLMs and 11 UI tasks. A 26B model under DVOC achieves 100% protocol convergence.


Quick Start

git clone https://github.com/Pomilon-Intelligence-Lab/DVOC
cd DVOC
pip install -e dvoc-core/
pip install -e dvoc-py/
export GEMINI_API_KEY="your-key-here"
from dvoc import DVOC

# Desktop backend (OS-level clicks)
agent = DVOC(
    backend="os",
    model="gemini/gemini-3.1-flash-lite",
    epsilon=30.0,
    manual_target=(960, 540),
)

result = agent.run("Click the center of the screen")
print(result["success"])      # True
print(result["final_point"])  # point model converged to
print(result["iterations"])   # iterations needed

Web Backend

from dvoc import DVOC
from dvoc_web import WebBackend

backend = WebBackend(headless=True)
backend.start()
backend.goto("https://example.com")

agent = DVOC(
    backend=backend,  # pass pre-configured backend directly
    model="gemini/gemini-3.1-flash-lite",
    epsilon=30.0,
)
result = agent.run("Click the 'More information' link")
backend.stop()

Architecture

Planner → Backend (capture) → Renderer (overlay) → Verifier (VLM) → Corrector → Policy → ACT

Six modules, one responsibility each:

Module Responsibility
State Immutable per-iteration snapshots with full replay
Planner Initial coordinate guess (manual or VLM)
Renderer Draw crosshair + circle + labels on screenshot
Verifier VLM estimates offset from crosshair to target
Corrector Damped correction with exponential alpha decay
Policy Convergence gating: REFINE / ACT / ABORT

The Verifier uses a ReAct-style prompt — the model reasons step-by-step, sees its own previous iterations, then calls report_offset(dx, dy, confidence) via tool calling.


Repository Structure

dvoc-protocol/
├── dvoc-core/src/dvoc_core/     # Loop engine (10 modules)
│   ├── _types.py                # Point, Vector, Geometry, Action
│   ├── _state.py                # Immutable state model
│   ├── _backend.py              # Backend ABC
│   ├── _planner.py              # Manual or model-based planner
│   ├── _verifier.py             # Simulated or model-based verifier
│   ├── _corrector.py            # Damped correction
│   ├── _renderer.py             # Visual overlay (crosshair + circle)
│   ├── _policy.py               # Convergence policy
│   ├── _loop.py                 # LoopEngine orchestrator
│   └── _model.py                # LiteLLM-based VLM client
├── dvoc-os/                     # Desktop backend (auto-detects X11 / Wayland / Windows)
├── dvoc-web/                    # Web backend (Playwright)
├── dvoc-py/                     # Public SDK
├── examples/
│   ├── benchmark/               # Benchmark engine (CLI, config, tasks, reporter)
│   ├── run_ablation.py          # Ablation experiments
│   ├── run_alpha_ablation.py    # Alpha damping ablation
│   ├── run_predict_ablation.py  # Predict method ablation
│   ├── test_with_model.py       # Synthetic model test harness
│   └── test_web_model.py        # Real web page model test
├── artifacts/
│   ├── charts/                  # Benchmark result charts
│   └── results/                 # Aggregated trial results (JSON)
├── PAPER.md                     # Technical whitepaper
├── Makefile                     # Test runner
└── docs/                        # Design specs and plans

Tests

make test-all
# 80 tests across all packages (61 core + 3 os + 3 py + 13 web)

Or test individual packages:

make test-core    # 61 core engine tests
make test-web     # 13 web backend tests

Research

See PAPER.md for the full technical whitepaper, including benchmark methodology, results, and analysis.

Paper: research.pomilon.xyz/papers/dvoc


License

MIT

About

DVOC (Damped Visual Offset Correction) is a training-free, recursive refinement protocol that enables weak vision-language models (VLMs) to achieve pixel-precise UI grounding through iterative visual feedback.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Morty Proxy This is a proxified and sanitized view of the page, visit original site.