Back in late June, a Discord conversation with Ixtrix kicked off what turned into the homelab's first proper vision RAG pipeline. The premise was simple: I run DeepSeek-V4-Flash-DSpark on 2 of my 3 RTX PRO 6000s (96 GB each), and that 3rd GPU — a ~72 GB Blackwell card — was sitting completely idle. What could it do?
The Candidate: Chandra OCR-2
Ixtrix pointed me at Chandra OCR-2 by Datalab — currently the best model on the olmOCR benchmark at 85.8% (vs Gemini 2.5 Flash at 67.3% and GPT-4o at 69.9%). It outputs native HTML with data-bbox bounding boxes — meaning tables, handwriting, multi-column layouts, and 90 languages all come back as structured markup rather than a flat text blob. For the "give DeepSeek eyes" goal, that was the right shape.
The NVFP4A16 quant from dangvansam/chandra-ocr-2-NVFP4A16 clocks in at ~5.4 GB — barely a dent in a 72 GB card — and runs 2.5× faster than the bf16 baseline thanks to Blackwell's native FP4 tensor cores. Deployed it as a vLLM container (scope-chandra-ocr) on PVE01's vllm01 LXC, port 8020, GPU 0.
The Vision RAG Stack
OCR alone isn't RAG. The vision from Ixtrix's recommendations was a pipeline that preserved document structure through every stage:
PDF/Image → Chandra OCR-2 → Qwen3-VL-Embedding-8B-FP8
↓
ChromaDB (vector store)
↓
Qwen3-VL-Reranker-2B
↓
Qwen3.6-35B-A3B MoE NVFP4
↓
FastAPI (ingest/search/chat)
Chandra OCR-2 turns raw document pages into structured HTML. The Qwen3-VL-Embedding-8B-FP8 (~8 GB) creates vision-aware embeddings — it understands tables, charts, and layouts, not just raw text. Qwen3-VL-Reranker-2B (~4 GB) re-ranks top candidates for precision. Both are small enough to share GPU 1 alongside the LLM.
The LLM itself is Qwen3.6-35B-A3B MoE NVFP4 (~20 GB) — a 35B-parameter MoE model where only ~3B params activate per token. In NVFP4 quant it fits comfortably on a single RTX PRO 6000 with headroom for the embedding model.
The Backend Pattern
The FastAPI service follows a three-endpoint pattern from the rag-backend skill:
POST /ingest — upload a file, Chandra OCRs it, chunks it, embeds it, stores in ChromaDB.
POST /search — query in → retrieve top chunks → return results (no LLM call).
POST /chat — query in → retrieve → ask Qwen 3.6 → return an answer with sources.
ChromaDB runs embedded in the Python process — no separate container — persisting to disk. Microsoft Entra ID SSO validates JWTs from the existing frontend. In dev mode (no env vars set), auth is bypassed for local testing.
Hardware Fit
The 3rd GPU (RTX PRO 6000 Blackwell, 72 GB usable) splits cleanly:
GPU 0: Chandra OCR-2 (~6 GB) + Qwen3-VL-Reranker-2B (~4 GB) = ~10 GB, leaving ~62 GB free for batch OCR workloads.
GPU 1: Qwen3.6-35B-A3B MoE NVFP4 (~20 GB) + Qwen3-VL-Embedding-8B-FP8 (~8 GB) = ~28 GB, leaving ~44 GB free.
Total: ~38 GB utilized out of 144 GB across both GPUs. Plenty of headroom for concurrent requests, larger batch sizes, or adding TTS/STT models later.
Why Chandra Over the Alternatives
Ixtrix's advice was blunt: "ignore Paddle, Chandra is better." The benchmarks back it up:
| Model | Score | Notes |
|---|---|---|
| Chandra OCR-2 | 85.8% | Open source, structured HTML output |
| DeepSeek OCR | 75.4% | Good tables, closed source |
| Gemini 2.5 Flash | 67.3% | Paid API, no self-host |
| GPT-4o (Anchored) | 69.9% | Paid API |
Chandra wins on quality, runs locally, costs nothing beyond the electricity, and the NVFP4 quant makes it trivially small for any Blackwell card.
What's Next
The backend is ready. The frontend — a clean single-page chat interface with upload, search, and browsing — is something I'm working through myself. Long-term, the pipeline handles anything from scanned PDFs to office documents, handwritten notes to multi-language financial statements. The "give the LLM eyes" goal is met; now it's about making that access seamless.
Full write-up on the RAG backend pattern and deploy process lives in the rag-backend skill. If you're running Blackwell hardware, Chandra OCR-2 in NVFP4 is a no-brainer addition to any idle GPU.