AI Blog

JUST DROPPED Research CUDA Flash Attention May 9, 2026

Fused TBQ4 Flash Attention: 82 tok/s with Lossless 4-bit KV at 200K

We fused TurboQuant dequantization directly into the flash attention kernel, which now reads raw TBQ4 blocks inline via centroid lookup in the FWHT-rotated domain. The result: 82+ tok/s with lossless 4.25 bpv (bits per value) KV at 200K context on an RTX 4090. We are not aware of anyone else who has done this.

Fused FA TBQ4 llama.cpp CUDA Kernel MTP Qwen3.6

Read Article →

NEW Benchmarks Research May 5, 2026

Turbo4 KV Cache: Better Than Q8 at Half the VRAM

Trellis-Coded Quantization benchmark: Turbo4 scores 100/100 vs Q8_0's 91 on a hardened agentic benchmark. Lossless FP16 quality, 40 t/s, and 256K context on an RTX 4090 24GB.

KV Cache TCQ TurboQuant RTX 4090 Qwen3.6

Read Article →

NEW Benchmarks Optimization February 20, 2026

Q4 KV Cache: Surprisingly Viable

Benchmarks show Q4 KV cache generating code faster, and at higher quality, than Q8. The conventional wisdom that Q4 is unusable may be wrong.

KV Cache Q4 Quantization llama.cpp RTX 4090

Read Article →

NEW Benchmarks February 15, 2026

IQ2 vs IQ3 Quantization: 2x Speed, Same Quality

Comprehensive RTX 4090 benchmark: IQ2_XS hits 86 t/s vs IQ3_XXS at 44 t/s. Quality testing across coding, debugging, and agentic tasks shows negligible difference.

Quantization RTX 4090 Benchmarks Qwen3-Coder-Next

Read Article →

Benchmarks Updated Feb 15, 2026

Maxing Out Qwen3-Coder-Next Abliterated: 94 t/s

The abliterated version hits 94 t/s at 168K context with Q8 KV cache — over 5x faster than the base model. Complete optimization guide with red teaming verification.

Abliterated Q8 KV Cache Local AI RTX 4090

Read Article →

Benchmarks February 9, 2026

Qwen3-Coder-Next: IQ2 vs IQ3 Benchmarks

IQ2_XXS achieves 22 t/s on RTX 4090 at 200K context — 85% faster than IQ3 with no measurable quality loss. Full benchmark data and configuration.

Local AI Benchmarks IQ2 Quantization RTX 4090

Read Article →

Tutorial February 2, 2026

Running Uncensored AI Locally: My PRISM Setup

How I set up GLM-4.7-Flash locally with web search and vision capabilities. No content filters, no API subscription, just my hardware doing what I tell it to.

Local AI GLM-4.7 Uncensored Claude Code

Read Article →

NEW Prompt Engineering Local Models July 3, 2026

The Fable Distillation: Making a $3 Model Work Like a $30 Model

A behavioral overlay that brings frontier-model discipline to local and budget models — goal-driven work, self-verification, honest state reporting. No fine-tuning required. Available now as a downloadable prompt pack.

Prompt Engineering Local AI Agent Reliability Claude Code

Read Article →

NEW Desktop Automation Linux July 3, 2026

Giving a Blind LLM Eyes: Desktop Control Without a Vision Model

desktop-agent turns accessibility trees and OCR into a 500-token screen description that local models can actually use. No vision model, no API costs, no VRAM burned on image tokens.

Desktop Automation AT-SPI OCR Linux AI Agents

Read Article →

NEW Self-Hosted Search July 3, 2026

I Built My Own Search Engine Because Google Stopped Being One

Odin: a self-hosted discovery engine that maps the internet's authoritative hubs and fetches content live. 250K+ pages across 10K+ domains, MCP server for AI agents, zero tracking, no censorship.

Search Engine Self-Hosted MCP Server Privacy AI Search

Read Article →
