AI Blog

Insights, tutorials, and benchmarks from the world of AI

JUST DROPPED Research CUDA Flash Attention May 9, 2026

Fused TBQ4 Flash Attention: 82 tok/s with Lossless 4-bit KV at 200K

We fused TurboQuant dequantization directly into the flash attention kernel, which now reads raw TBQ4 blocks inline via centroid lookup in the FWHT-rotated domain. The result: 82+ tok/s with lossless 4.25 bits-per-value (bpv) KV cache at 200K context on an RTX 4090. To our knowledge, nobody else has done this.

Fused FA TBQ4 llama.cpp CUDA Kernel MTP Qwen3.6
Read Article →
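
To illustrate the idea behind the teaser above, here is a minimal Python sketch of why attention scores can be computed directly in the rotated domain. It assumes keys are stored as 4-bit centroid indices of their FWHT-rotated values plus one scale per block. The codebook values, the block layout, and the names fwht, quantize, and fused_dot are illustrative assumptions, not the actual TBQ4 format or the fused CUDA kernel.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [x * norm for x in out]


# Hypothetical 16-entry codebook (one 4-bit code per value), evenly spaced on [-3, 3].
# A real format would likely ship tuned centroids; this grid is only an assumption.
CENTROIDS = [-3.0 + 0.4 * i for i in range(16)]


def quantize(rotated):
    """Return (scale, codes): one shared scale and a nearest-centroid index per value."""
    scale = max(abs(x) for x in rotated) / 3.0 or 1.0
    codes = [
        min(range(16), key=lambda c, x=x: abs(x / scale - CENTROIDS[c]))
        for x in rotated
    ]
    return scale, codes


def fused_dot(q_rot, scale, codes):
    """Dot product against quantized keys via centroid lookup, without a dequantized copy."""
    return scale * sum(q * CENTROIDS[c] for q, c in zip(q_rot, codes))


if __name__ == "__main__":
    random.seed(0)
    q = [random.gauss(0, 1) for _ in range(64)]
    k = [random.gauss(0, 1) for _ in range(64)]

    exact = sum(a * b for a, b in zip(q, k))
    q_rot, k_rot = fwht(q), fwht(k)
    rotated = sum(a * b for a, b in zip(q_rot, k_rot))
    scale, codes = quantize(k_rot)

    print("q.k in the original domain:", exact)
    print("q.k in the rotated domain: ", rotated)
    print("q.k via 4-bit lookup:      ", fused_dot(q_rot, scale, codes))
```

Because the normalized transform is orthogonal, rotating both the query and the key leaves their dot product unchanged. A kernel can therefore rotate the query once and accumulate scores straight from the stored codes. The only error comes from the codebook rounding, which this toy grid does not attempt to minimize.
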
NEW Benchmarks Research May 5, 2026

Turbo4 KV Cache: Better Than Q8 at Half the VRAM

Trellis-Coded Quantization (TCQ) benchmark: Turbo4 scores 100/100 vs Q8_0's 91 on a hardened agentic benchmark. Lossless relative to FP16, 40 t/s, and 256K context on an RTX 4090 24GB.

KV Cache TCQ TurboQuant RTX 4090 Qwen3.6
Read Article →
NEW Benchmarks Optimization February 20, 2026

Q4 KV Cache: Surprisingly Viable

Benchmarks show Q4 KV cache producing faster, higher-quality code than Q8. The conventional wisdom about Q4 being unusable may be wrong.

KV Cache Q4 Quantization llama.cpp RTX 4090
Read Article →
NEW Benchmarks February 15, 2026

IQ2 vs IQ3 Quantization: 2x Speed, Same Quality

Comprehensive RTX 4090 benchmark: IQ2_XS hits 86 t/s vs IQ3_XXS at 44 t/s. Quality testing across coding, debugging, and agentic tasks shows negligible difference.

Quantization RTX 4090 Benchmarks Qwen3-Coder-Next
Read Article →
Benchmarks Updated February 15, 2026

Maxing Out Qwen3-Coder-Next Abliterated: 94 t/s

The abliterated version hits 94 t/s at 168K context with Q8 KV cache — over 5x faster than the base model. Complete optimization guide with red teaming verification.

Abliterated Q8 KV Cache Local AI RTX 4090
Read Article →
Benchmarks February 9, 2026

Qwen3-Coder-Next: IQ2 vs IQ3 Benchmarks

IQ2_XXS achieves 22 t/s on RTX 4090 at 200K context — 85% faster than IQ3 with no measurable quality loss. Full benchmark data and configuration.

Local AI Benchmarks IQ2 Quantization RTX 4090
Read Article →
Tutorial February 2, 2026

Running Uncensored AI Locally: My PRISM Setup

How I set up GLM-4.7-Flash locally with web search and vision capabilities. No content filters, no API subscription, just my hardware doing what I tell it to.

Local AI GLM-4.7 Uncensored Claude Code
Read Article →

NEW Prompt Engineering Local Models July 3, 2026

The Fable Distillation: Making a $3 Model Work Like a $30 Model

A behavioral overlay that brings frontier-model discipline to local and budget models — goal-driven work, self-verification, honest state reporting. No fine-tuning required. Available now as a downloadable prompt pack.

Prompt Engineering Local AI Agent Reliability Claude Code
Read Article →
NEW Desktop Automation Linux July 3, 2026

Giving a Blind LLM Eyes: Desktop Control Without a Vision Model

desktop-agent turns accessibility trees and OCR into a 500-token screen description that local models can actually use. No vision model, no API costs, no VRAM burned on image tokens.

Desktop Automation AT-SPI OCR Linux AI Agents
Read Article →
NEW Self-Hosted Search July 3, 2026

I Built My Own Search Engine Because Google Stopped Being One

Odin: a self-hosted discovery engine that maps the internet's authoritative hubs and fetches content live. 250K+ pages across 10K+ domains, MCP server for AI agents, zero tracking, no censorship.

Search Engine Self-Hosted MCP Server Privacy AI Search
Read Article →

More Articles Coming Soon

We're preparing additional tutorials and guides:

  • Stable Diffusion Mastery — Advanced prompting and workflow optimization
  • LoRA Training from Scratch — Create custom models for any subject
  • LLM Fine-tuning Guide — Adapt open-source models for your domain
  • AI Automation Pipelines — Build end-to-end workflows with n8n
  • Claude Code Pro Tips — Advanced usage and MCP development

Have a topic you'd like us to cover? Let us know.