Giving a Blind LLM Eyes: Desktop Control Without a Vision Model

The Hard Ceiling on Local AI

Most local LLMs share a fundamental limitation: no vision. The Qwen, Gemma, and DeepSeek models commonly run for local agent work are text-only. If you want an AI agent that can actually click around your desktop, it needs to "see" the screen somehow.

The standard answer is a vision model, but that burns VRAM you don't have: a 24 GB RTX 4090 realistically holds one sizeable model at a time. The alternative is sending screenshots to an API, which costs 1,000–4,000 tokens per image plus network latency. ASCII art representations run around 16,000 tokens and still lose structural information.

desktop-agent sidesteps the entire problem. It doesn't teach models to see images. It converts the screen into text they already understand.

Two Data Sources, One Description

The command desktop-agent analyze --json runs in about 3 seconds and produces roughly 500 tokens of structured output. That output comes from two independent pipelines running simultaneously:

AT-SPI walks the Linux accessibility tree: every button, menu, input field, and tab that applications expose through the accessibility bus. For each element it captures the role, name, position, and whether the element is interactive. Native GTK/Qt apps expose rich trees. Firefox and Electron apps generally don't, but the apps that do cover a surprising amount of the desktop.

RapidOCR works from a screenshot and runs PaddleOCR models through ONNX Runtime, giving phrase-level text detection with 90%+ confidence. This catches everything AT-SPI misses: canvas-rendered text, custom widgets, browser content, terminal output.
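One way to picture how two pipelines can feed a single description: run both concurrently, then drop any OCR region that merely repeats the label of an accessibility element it overlaps. The sketch below uses stub functions in place of the real AT-SPI walk and OCR pass. The function names, the dictionary fields, and the overlap rule are all assumptions for illustration, not desktop-agent's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def walk_accessibility_tree():
    """Stub standing in for the AT-SPI walk (hypothetical data and field names)."""
    return [
        {"role": "push button", "name": "Save", "bbox": (10, 10, 80, 30), "interactive": True},
        {"role": "text", "name": "Search", "bbox": (100, 10, 200, 30), "interactive": True},
    ]


def run_ocr():
    """Stub standing in for the OCR pass (hypothetical data and field names)."""
    return [
        {"text": "Save", "bbox": (12, 14, 40, 20), "confidence": 0.97},
        {"text": "Chapter 1", "bbox": (50, 200, 120, 24), "confidence": 0.93},
    ]


def overlaps(a, b):
    """True if two (x, y, w, h) rectangles intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def merge(elements, regions):
    """Keep every element; keep OCR text only where no overlapping element already carries it."""
    extra_text = [
        r for r in regions
        if not any(
            overlaps(r["bbox"], e["bbox"]) and r["text"].lower() == e["name"].lower()
            for e in elements
        )
    ]
    return {"elements": elements, "text": extra_text}


def analyze():
    """Run both sources concurrently and combine the results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        elements_future = pool.submit(walk_accessibility_tree)
        regions_future = pool.submit(run_ocr)
        return merge(elements_future.result(), regions_future.result())


if __name__ == "__main__":
    print(analyze())
```

In this toy data the OCR hit for "Save" sits on top of the Save button and is dropped, while "Chapter 1" has no matching widget and survives as plain text.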

The Loop: Analyze → Act → Verify

This is the core workflow:

desktop-agent analyze --json    # see the screen as structured text
desktop-agent click @e3         # click an element by reference
desktop-agent analyze --diff    # verify: what changed?
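For an agent harness that shells out from Python, the three commands above could be wrapped roughly as follows. This is a minimal sketch: only the command lines themselves come from this article. The `run_cli` helper and the injectable `runner` parameter are assumptions made for illustration, and output is returned as raw text because its exact format is not specified here.

```python
import subprocess


def run_cli(args):
    """Run desktop-agent with the given arguments and return its stdout as text."""
    completed = subprocess.run(
        ["desktop-agent", *args], capture_output=True, text=True, check=True
    )
    return completed.stdout


def analyze_act_verify(ref, runner=run_cli):
    """One pass of the loop: analyze, click the element reference, then ask what changed.

    `runner` is injectable so the loop can be exercised without the real CLI.
    """
    before = runner(["analyze", "--json"])    # see the screen as structured text
    runner(["click", ref])                    # act on a reference such as "@e3"
    changes = runner(["analyze", "--diff"])   # verify: what changed?
    return before, changes
```

Calling `analyze_act_verify("@e3")` returns the description the model acted on and the diff it can use to decide the next step.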

Element references persist to disk through an element cache, so @e3 works from a separate process and no in-memory state has to survive between CLI invocations. --diff compares against the previous analyze run and reports which windows, elements, and text regions changed.
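To see why a disk-backed cache makes both cross-process references and diffing possible, consider this minimal sketch. The file location, the JSON layout, and every function name here are hypothetical; desktop-agent's actual cache format is not described in this article.

```python
import json
import tempfile
from pathlib import Path


def save_snapshot(path, elements):
    """Write the latest analyze result to disk, keyed by element reference."""
    Path(path).write_text(json.dumps({e["ref"]: e for e in elements}))


def load_snapshot(path):
    """Read the previous snapshot, or an empty dict if none has been saved yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def resolve_ref(path, ref):
    """Turn a reference like '@e3' into the centre point of its cached bounding box."""
    x, y, w, h = load_snapshot(path)[ref]["bbox"]
    return x + w // 2, y + h // 2


def diff_snapshots(old, new):
    """Report which references were added, removed, or changed between two snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


if __name__ == "__main__":
    cache = Path(tempfile.mkdtemp()) / "elements.json"
    save_snapshot(cache, [
        {"ref": "@e1", "name": "File", "bbox": [0, 0, 40, 20]},
        {"ref": "@e3", "name": "Save", "bbox": [100, 50, 80, 30]},
    ])
    # A later process needs only the file to find where "@e3" is.
    print(resolve_ref(cache, "@e3"))
    previous = load_snapshot(cache)
    current = {**previous, "@e3": {"ref": "@e3", "name": "Saved", "bbox": [100, 50, 80, 30]}}
    print(diff_snapshots(previous, current))
```

The design point is that the file is the only shared state: whichever process runs next can resolve a reference or compute a diff without talking to the one that wrote it.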

Detail levels: --quick uses AT-SPI only (no screenshot, sub-second), while --deep grabs more elements and text. Region targeting (--region top, --region x,y,w,h) restricts analysis to a specific portion of the screen, which speeds up OCR on large displays.
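Region targeting amounts to a rectangle filter applied before the expensive work. The sketch below parses the two forms shown above and keeps only elements inside the region. It assumes that `top` means the top half of the screen and that the `x,y,w,h` form is in pixels; neither detail is confirmed by this article, and the function names are hypothetical.

```python
def parse_region(spec, screen_w, screen_h):
    """Turn a region spec into an (x, y, w, h) rectangle.

    Assumption: the keyword "top" means the top half of the screen.
    """
    if spec == "top":
        return (0, 0, screen_w, screen_h // 2)
    parts = [int(p) for p in spec.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 'top' or 'x,y,w,h', got {spec!r}")
    return tuple(parts)


def inside(bbox, region):
    """True if bbox lies entirely within region (both are x, y, w, h)."""
    x, y, w, h = bbox
    rx, ry, rw, rh = region
    return x >= rx and y >= ry and x + w <= rx + rw and y + h <= ry + rh


def filter_to_region(elements, region):
    """Keep only the elements whose bounding box falls inside the region."""
    return [e for e in elements if inside(e["bbox"], region)]


if __name__ == "__main__":
    elements = [
        {"name": "menu bar", "bbox": (0, 0, 1920, 30)},
        {"name": "status bar", "bbox": (0, 1050, 1920, 30)},
    ]
    region = parse_region("top", 1920, 1080)
    print(filter_to_region(elements, region))
```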

What This Enables

The kind of thing that becomes possible: "find the Spotify window, extract the current playlist, and save the track names to a file." A text-only model running that through desktop-agent can do it. A text-only model without it is blind.
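As a rough illustration of the last step in that task, the sketch below takes already-parsed analyze output, picks the window whose title mentions Spotify, and writes its text lines to a file. The structure used here (`windows`, `title`, `text`) is an assumed shape chosen for this example, not desktop-agent's documented schema, and the track names are placeholders.

```python
import tempfile
from pathlib import Path


def window_text(analysis, title_fragment):
    """Return text lines from the first window whose title contains the fragment."""
    for window in analysis.get("windows", []):
        if title_fragment.lower() in window.get("title", "").lower():
            return [region["text"] for region in window.get("text", [])]
    return []


def save_lines(lines, path):
    """Write one line per entry and return how many were written."""
    Path(path).write_text(("\n".join(lines) + "\n") if lines else "")
    return len(lines)


if __name__ == "__main__":
    # Hypothetical parsed output; real output may be shaped differently.
    analysis = {
        "windows": [
            {"title": "Terminal", "text": [{"text": "$ ls"}]},
            {"title": "Spotify Premium", "text": [{"text": "Track One"}, {"text": "Track Two"}]},
        ]
    }
    out_file = Path(tempfile.mkdtemp()) / "playlist.txt"
    count = save_lines(window_text(analysis, "spotify"), out_file)
    print(f"wrote {count} lines to {out_file}")
```

In practice the model would still need to separate track names from other text in the window, which is exactly the kind of judgment a text-only LLM can apply once the screen arrives as text.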

desktop-agent can't interpret icons, images, graphs, or other non-text canvas content, and that's a real limitation. RapidOCR reads text; AT-SPI reads widget structure. Neither gives semantic understanding of visual content. For a lot of desktop automation that doesn't matter: the buttons are labeled, the fields have names, the text is text.

desktop-agent works with Claude Code, QuetzaCodetl, OpenCode, or any other agent harness that can invoke Bash commands.

What You Get

Full desktop-agent tool — modular architecture, element caching, change detection, detail levels, region targeting
Installation script — one command to set up with all dependencies
Integration guides — Claude Code skill config, QuetzaCodetl setup, generic Bash tool setup for any agent
Task-caching system — record, search, and replay desktop automation workflows
Comprehensive docs — architecture overview, API reference, troubleshooting

Get Desktop Agent

One-time purchase. Instant download. Linux only (requires AT-SPI).

$29 AUD

Secure checkout via Polar. Merchant of Record — tax handled automatically.

Written by Indra's Mirror — building tools that let local AI actually interact with the world.

Tags: desktop automation, AI agents, Linux, AT-SPI, OCR, screen understanding, Claude Code, local LLM, accessibility