Blog

Notes, projects, and random discoveries

2026-07-03 · ◈ inference

DIY Vision RAG — Chandra OCR-2 on a Spare GPU

Turning an idle RTX PRO 6000 into a self-hosted vision RAG pipeline with Chandra OCR-2, Qwen3-VL embeddings, and ChromaDB.

rag chandra ocr vision homelab pve01 vllm

Back in late June, a Discord conversation with Ixtrix kicked off what turned into the homelab's first proper vision RAG pipeline. The premise was simple: I run DeepSeek-V4-Flash-DSpark on two of my three RTX PRO 6000s (96 GB each), and the third GPU, a Blackwell card with ~72 GB usable, was sitting completely idle. What could it do?

The Candidate: Chandra OCR-2

Ixtrix pointed me at Chandra OCR-2 by Datalab — currently the best model on the olmOCR benchmark at 85.8% (vs Gemini 2.5 Flash at 67.3% and GPT-4o at 69.9%). It outputs native HTML with data-bbox bounding boxes — meaning tables, handwriting, multi-column layouts, and 90 languages all come back as structured markup rather than a flat text blob. For the "give DeepSeek eyes" goal, that was the right shape.
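
To make the "structured markup rather than a flat text blob" point concrete, here is a minimal sketch of pulling text blocks and their boxes out of HTML that carries data-bbox attributes. The sample markup and the "four whitespace-separated integers" box encoding are assumptions for illustration; real Chandra OCR-2 output may differ.

```python
from html.parser import HTMLParser

# Illustrative sample only; not actual Chandra OCR-2 output.
SAMPLE_HTML = (
    '<div data-bbox="10 20 300 60"><h1>Invoice</h1></div>'
    '<div data-bbox="10 80 300 200"><table><tr><td>Total</td>'
    '<td>42.00</td></tr></table></div>'
)

# Tags that never get a closing tag, so they must not touch the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class BBoxExtractor(HTMLParser):
    """Collect one {bbox, text} record per element that has data-bbox."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished records, in closing order
        self._stack = []   # one entry per open tag: record dict or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        bbox = dict(attrs).get("data-bbox")
        if bbox is None:
            self._stack.append(None)
        else:
            # Assumption: "x0 y0 x1 y1" as whitespace-separated integers.
            self._stack.append({"bbox": [int(v) for v in bbox.split()], "parts": []})

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attach the text to the nearest enclosing element with a bbox.
        for entry in reversed(self._stack):
            if entry is not None:
                entry["parts"].append(text)
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        entry = self._stack.pop()
        if entry is not None:
            self.blocks.append({"bbox": entry["bbox"], "text": " ".join(entry["parts"])})


def extract_blocks(html):
    parser = BBoxExtractor()
    parser.feed(html)
    return parser.blocks


if __name__ == "__main__":
    for block in extract_blocks(SAMPLE_HTML):
        print(block)
```

Keeping the box next to the text is what lets later stages cite a region of the page rather than just a page number.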

The NVFP4A16 quant from dangvansam/chandra-ocr-2-NVFP4A16 clocks in at ~5.4 GB, barely a dent in a 72 GB card, and runs 2.5× faster than the bf16 baseline thanks to Blackwell's native FP4 tensor cores. I deployed it as a vLLM container (scope-chandra-ocr) on PVE01's vllm01 LXC, port 8020, GPU 0.

The Vision RAG Stack

OCR alone isn't RAG. The plan that came out of Ixtrix's recommendations was a pipeline that preserves document structure through every stage:

PDF/Image → Chandra OCR-2 → Qwen3-VL-Embedding-8B-FP8
                                          ↓
                                   ChromaDB (vector store)
                                          ↓
                               Qwen3-VL-Reranker-2B
                                          ↓
                               Qwen3.6-35B-A3B MoE NVFP4
                                          ↓
                                   FastAPI (ingest/search/chat)

Chandra OCR-2 turns raw document pages into structured HTML. Qwen3-VL-Embedding-8B-FP8 (~8 GB) creates vision-aware embeddings: it understands tables, charts, and layouts, not just raw text. Qwen3-VL-Reranker-2B (~4 GB) re-ranks the top candidates for precision. Both are small enough to share a GPU with the larger models.

The LLM itself is Qwen3.6-35B-A3B MoE NVFP4 (~20 GB) — a 35B-parameter MoE model where only ~3B params activate per token. In NVFP4 quant it fits comfortably on a single RTX PRO 6000 with headroom for the embedding model.

The Backend Pattern

The FastAPI service follows a three-endpoint pattern from the rag-backend skill:

POST /ingest — upload a file, Chandra OCRs it, chunks it, embeds it, stores in ChromaDB.
POST /search — query in → retrieve top chunks → return results (no LLM call).
POST /chat — query in → retrieve → ask Qwen 3.6 → return an answer with sources.
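
The three endpoints above map onto three small operations. The sketch below shows that flow in memory, using only the standard library: a bag-of-words counter stands in for Qwen3-VL-Embedding, a Python list stands in for ChromaDB, and the llm argument is any callable. The names (TinyRag, embed, cosine) are hypothetical and not taken from the rag-backend skill.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for the embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyRag:
    """In-memory stand-in for the ingest / search / chat pattern."""

    def __init__(self):
        self.rows = []  # each row: {"doc", "text", "vec"}

    def ingest(self, doc_id, text, chunk_words=40):
        """Chunk already-OCRed text by word count, embed, and store it.

        Returns the total number of chunks stored so far.
        """
        words = text.split()
        for start in range(0, len(words), chunk_words):
            chunk = " ".join(words[start:start + chunk_words])
            self.rows.append({"doc": doc_id, "text": chunk, "vec": embed(chunk)})
        return len(self.rows)

    def search(self, query, k=3):
        """Return the k most similar chunks; no LLM call involved."""
        qv = embed(query)
        scored = [(cosine(qv, row["vec"]), row) for row in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [
            {"doc": row["doc"], "text": row["text"], "score": score}
            for score, row in scored[:k]
        ]

    def chat(self, query, llm, k=3):
        """Retrieve context, ask the LLM, and return the answer with sources."""
        hits = self.search(query, k)
        context = "\n".join(hit["text"] for hit in hits)
        answer = llm(f"Context:\n{context}\n\nQuestion: {query}")
        return {"answer": answer, "sources": [hit["doc"] for hit in hits]}
```

Keeping search separate from chat makes retrieval quality easy to inspect on its own before an LLM is in the loop.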

ChromaDB runs embedded in the Python process, with no separate container, and persists to disk. The service validates Microsoft Entra ID JWTs passed along by the existing frontend. In dev mode (no env vars set), auth is bypassed for local testing.
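
The dev-mode bypass can be sketched as a small gate in front of the token check. The environment variable names below are hypothetical (the post does not say which ones the service reads), and validate_jwt is a placeholder for whatever Entra ID validation the real service performs. The sketch also fails closed on a half-configured environment, which is a design choice of this example rather than something the post describes.

```python
# Hypothetical variable names, for illustration only.
REQUIRED_VARS = ("ENTRA_TENANT_ID", "ENTRA_CLIENT_ID")


def auth_mode(env):
    """Return "dev" when no auth vars are set, "entra" when all are set."""
    present = [name for name in REQUIRED_VARS if env.get(name)]
    if not present:
        return "dev"
    if len(present) < len(REQUIRED_VARS):
        # Refuse to guess: a partial config should not silently disable auth.
        raise RuntimeError("partial auth configuration")
    return "entra"


def authorize(headers, env, validate_jwt):
    """Return an identity dict, or raise PermissionError."""
    if auth_mode(env) == "dev":
        return {"user": "dev", "dev_mode": True}
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer" or not token:
        raise PermissionError("missing bearer token")
    # validate_jwt is a placeholder callable supplied by the caller.
    return validate_jwt(token)
```

In a real service, os.environ would be passed as env; taking it as a parameter keeps the gate easy to test.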

Hardware Fit

At 72 GB usable per RTX PRO 6000 Blackwell card, the stack splits cleanly across two GPUs:

GPU 0: Chandra OCR-2 (~6 GB) + Qwen3-VL-Reranker-2B (~4 GB) = ~10 GB, leaving ~62 GB free for batch OCR workloads.
GPU 1: Qwen3.6-35B-A3B MoE NVFP4 (~20 GB) + Qwen3-VL-Embedding-8B-FP8 (~8 GB) = ~28 GB, leaving ~44 GB free.

Total: ~38 GB utilized out of 144 GB across both GPUs. Plenty of headroom for concurrent requests, larger batch sizes, or adding TTS/STT models later.
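
The budget above is simple enough to check in a few lines. This sketch uses the approximate per-model sizes quoted in the post and ignores KV cache and activation memory, so it is a floor on usage rather than a measurement.

```python
USABLE_GB_PER_GPU = 72

# Approximate weights-only footprints, as quoted in the post.
PLACEMENT = {
    "GPU 0": {"Chandra OCR-2": 6, "Qwen3-VL-Reranker-2B": 4},
    "GPU 1": {"Qwen3.6-35B-A3B MoE NVFP4": 20, "Qwen3-VL-Embedding-8B-FP8": 8},
}


def summarize(placement, usable_per_gpu):
    """Return per-GPU used/free figures plus overall totals, in GB."""
    per_gpu = {}
    for gpu, models in placement.items():
        used = sum(models.values())
        per_gpu[gpu] = {"used": used, "free": usable_per_gpu - used}
    total_used = sum(entry["used"] for entry in per_gpu.values())
    total_capacity = usable_per_gpu * len(placement)
    return per_gpu, total_used, total_capacity


if __name__ == "__main__":
    per_gpu, used, capacity = summarize(PLACEMENT, USABLE_GB_PER_GPU)
    for gpu, entry in per_gpu.items():
        print(f"{gpu}: {entry['used']} GB used, {entry['free']} GB free")
    print(f"Total: {used} GB of {capacity} GB")
```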

Why Chandra Over the Alternatives

Ixtrix's advice was blunt: "ignore Paddle, Chandra is better." The benchmarks back it up:

Model	olmOCR score	Notes
Chandra OCR-2	85.8%	Open source, structured HTML output
DeepSeek OCR	75.4%	Good tables, closed source
GPT-4o (Anchored)	69.9%	Paid API
Gemini 2.5 Flash	67.3%	Paid API, no self-host

Chandra wins on quality, runs locally, costs nothing beyond the electricity, and the NVFP4 quant makes it trivially small for any Blackwell card.

What's Next

The backend is ready. The frontend — a clean single-page chat interface with upload, search, and browsing — is something I'm working through myself. Long-term, the pipeline handles anything from scanned PDFs to office documents, handwritten notes to multi-language financial statements. The "give the LLM eyes" goal is met; now it's about making that access seamless.

Full write-up on the RAG backend pattern and deploy process lives in the rag-backend skill. If you're running Blackwell hardware, Chandra OCR-2 in NVFP4 is a no-brainer addition to any idle GPU.

2026-07-03 · ⊛ infrastructure

Palworld Dedicated Server — v0.7 → v1.0

Standing up a Palworld dedicated server on the homelab. Proxmox LXC, LinuxGSM, Caddy reverse proxy, and a live status dashboard with dark themes.

palworld proxmox lxc linuxgsm

Palworld's v1.0 drops July 10, and the homelab needed a dedicated server for a friends-and-family playthrough. Here's the stack:

Container: Ubuntu 24.04 LXC on PVE01 (4 cores, 8 GB RAM, 40 GB disk). Static DHCP lease at 192.168.0.153, port-forwarded through the AT&T router.

Server: LinuxGSM's pwserver wrapper handles install, updates, and lifecycle. PalServer-Linux-Shipping runs in a tmux session with -publiclobby and query port 27015.

Web: Flask app with a JSON API (/api/services) powers the dashboard. PVE01 host stats come from node_exporter:9100 metrics. Caddy terminates TLS and reverse-proxies everything.

Dashboard: Single-page landing at dooner.tech — 4 themes (dark, console, light, amber), live server status, expandable hardware cards, and a countdown to v1.0. Copy-to-clipboard with visual feedback.
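
The Web layer above reads host stats from node_exporter, which serves plain-text metrics in the Prometheus exposition format. A minimal sketch of turning that text into dashboard numbers follows; the sample values are made up, and the host_summary shape is an assumption rather than what /api/services actually returns.

```python
# Made-up sample in the Prometheus text format; not real PVE01 data.
SAMPLE_METRICS = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 1.37438953472e+11
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 1.03079215104e+11
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""


def parse_metrics(text):
    """Map each 'name{labels}' string to its float value, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamps assumed).
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples


def host_summary(samples):
    """Reduce raw samples to the handful of numbers a status card needs."""
    total = samples["node_memory_MemTotal_bytes"]
    available = samples["node_memory_MemAvailable_bytes"]
    return {
        "load1": samples["node_load1"],
        "mem_used_pct": round(100 * (1 - available / total), 1),
    }


if __name__ == "__main__":
    print(host_summary(parse_metrics(SAMPLE_METRICS)))
```

In the live dashboard the text would come from an HTTP GET against the exporter's /metrics endpoint instead of a constant.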

2026-07-03 · ◈ inference

DeepSeek-V4 Flash on PVE03: Notes From a Homelab Run

Tuning and benchmarking DeepSeek-V4-Flash-DSpark on a Proxmox host with two 96GB-class NVIDIA GPUs. DSpark v8-style vLLM, PCIe topology findings, decode throughput data, and the remaining bottlenecks.

deepseek vllm dspark inference pve03

I spent some time tuning and benchmarking DeepSeek-V4-Flash-DSpark on pve03, a Proxmox host with two 96GB-class NVIDIA GPUs on an older Intel Xeon Scalable platform. The goal was to see how close we could get to the current DSpark/vLLM recipe while keeping the system stable for real agent usage.

The current run is based on the DSpark v8-style vLLM image and serves the model as DeepSeek-V4. The important launch choices are:

gpu_memory_utilization: 0.94
max_model_len: 1,048,576
max_num_seqs: 32
max_num_batched_tokens: 4096
max_cudagraph_capture_size: 216
kv_cache_dtype: fp8
attention_backend: FLASHINFER_MLA_SPARSE_DSV4
moe_backend: flashinfer_cutlass
speculative decoding: dspark, 5 speculative tokens
reasoning_effort: high
thinking: true

With that config, vLLM reported about 1.345M KV-cache tokens available, enough for one full 1M-token request plus some headroom. For normal use, the model is more likely to run many smaller agent requests than several huge-context requests at once.
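
A quick back-of-the-envelope check of that headroom, using the figures above. This is a token-count sketch only: it treats the KV cache as one flat pool and ignores block granularity and anything specific to the sparse attention backend.

```python
KV_CACHE_TOKENS = 1_345_000   # "about 1.345M", as reported by vLLM
MAX_MODEL_LEN = 1_048_576
MAX_NUM_SEQS = 32


def headroom_after_full_request(kv_tokens=KV_CACHE_TOKENS, max_len=MAX_MODEL_LEN):
    """Tokens left in the cache while one maximum-length request is resident."""
    return kv_tokens - max_len


def max_concurrent(context_len, kv_tokens=KV_CACHE_TOKENS, max_seqs=MAX_NUM_SEQS):
    """How many requests of a given context length fit, capped by max_num_seqs."""
    return min(kv_tokens // context_len, max_seqs)


if __name__ == "__main__":
    print("headroom after one 1M request:", headroom_after_full_request())
    for ctx in (32_768, 131_072, MAX_MODEL_LEN):
        print(f"{ctx:>9} tokens each -> {max_concurrent(ctx)} concurrent")
```

By this estimate, at 32k tokens per request the scheduler limit of 32 sequences binds before the cache does, which fits the "many smaller agent requests" usage pattern.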

Performance was solid but not perfect. Prefill landed around 5.8k–5.9k tok/s after clearing ACS redirect bits. Earlier sustained decode testing, at 1 to 32 concurrent requests (C1 through C32), showed roughly:

C1:   ~159 tok/s
C2:   ~304 tok/s
C4:   ~412 tok/s
C8:   ~602 tok/s
C16:  ~866 tok/s
C32: ~1049 tok/s
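
To see how throughput scales, the figures above can be converted into a per-stream rate and a scaling efficiency relative to C1. This sketch assumes the listed numbers are aggregate tok/s across all concurrent requests.

```python
# Decode throughput from the list above: concurrency -> tok/s (assumed aggregate).
DECODE_TOK_S = {1: 159, 2: 304, 4: 412, 8: 602, 16: 866, 32: 1049}


def scaling_table(results):
    """Return (concurrency, total, per_stream, efficiency) rows.

    Efficiency is total throughput divided by perfect linear scaling
    from the single-request figure.
    """
    base = results[1]
    rows = []
    for concurrency, total in sorted(results.items()):
        per_stream = total / concurrency
        efficiency = total / (base * concurrency)
        rows.append((concurrency, total, per_stream, efficiency))
    return rows


if __name__ == "__main__":
    for concurrency, total, per_stream, efficiency in scaling_table(DECODE_TOK_S):
        print(f"C{concurrency:<3} {total:>5} tok/s total  "
              f"{per_stream:6.1f} tok/s per stream  {efficiency:5.1%} of linear")
```

Under that assumption, the per-stream rate falls from ~159 tok/s at C1 to ~33 tok/s at C32, and efficiency drops from about 96% of linear at C2 to about 21% at C32.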

The biggest hardware finding was PCIe topology. The two GPUs showed up at NODE distance, meaning traffic crosses PCIe host bridges within a NUMA node, rather than sitting behind the same local PCIe switch or root path. P2P worked, but bandwidth was weaker than expected: roughly 9.8 GB/s 1:1 peer copy, versus about 14.2 GB/s on another comparable system. Interestingly, our all-reduce result was not terrible, but prefill still looked lower than what people report on newer or cleaner PCIe setups.

ACS was also enabled on several bridges. Clearing ReqRedir and CmpltRedir helped slightly, but not dramatically. The likely next physical tuning step is moving both GPUs into slots behind the same PEX8747 switch group on the Supermicro X11SPA board.
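
Finding which bridges still have the redirect bits set is easy to script. The sketch below scans text in the style of lspci -vvv output for ACSCtl lines with ReqRedir+ or CmpltRedir+. The sample is invented, not captured from pve03, and the exact lspci layout can vary between pciutils versions, so treat the parsing as an assumption to verify locally.

```python
import re

# Invented sample in the style of `lspci -vvv`; not real pve03 output.
SAMPLE_LSPCI = """\
17:00.0 PCI bridge: Example Vendor Root Port
\t\tACSCap:\tSrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
65:00.0 PCI bridge: Example Vendor Root Port
\t\tACSCtl:\tSrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
"""


def bridges_with_redirect(lspci_text):
    """Map device address -> sorted list of enabled ACS redirect flags."""
    flagged = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented lines start a new device record: "<addr> <class>: ..."
            device = line.split()[0]
        elif "ACSCtl:" in line and device:
            # Only the control register matters; ACSCap just lists capabilities.
            flags = set(re.findall(r"(ReqRedir|CmpltRedir)\+", line))
            if flags:
                flagged[device] = sorted(flags)
    return flagged


if __name__ == "__main__":
    for address, flags in bridges_with_redirect(SAMPLE_LSPCI).items():
        print(address, "still redirects:", ", ".join(flags))
```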

We also checked a "model down or laggy?" event. LiteLLM was healthy, vLLM was alive, and pve03 did not show CPU, disk, or iowait pressure. Grafana showed no meaningful vLLM queue backlog at the time. The more likely culprit was network behavior: an offsite Backblaze backup was running while the PVE hosts were on Wi-Fi. Even if pve03 itself was not saturated, shared Wi-Fi or WAN upload saturation can make Tailscale, HTTP, and SSH feel broken.

The short version: DeepSeek-V4-Flash-DSpark is running well on pve03, but the remaining bottlenecks look more like platform and network issues than vLLM issues. The biggest wins left are likely better PCIe placement, wired networking, and rate-limiting offsite backups.

2026-06-30 · ◈ inference

Hermes + gbrain: Running an AI That Actually Remembers

Most AI assistants start each conversation from scratch. With Hermes Agent and a persistent knowledge base, mine knows the homelab topology, config quirks, and deployment workflows across sessions.

hermes ai gbrain memory homelab llm

Most AI assistants start each conversation with a blank slate. Ask a question, get an answer — done. But when you manage a homelab with dozens of services, 3 GPU hosts, 14 TB of storage, game servers, monitoring stacks, custom API endpoints, and a growing set of automation scripts, a stateless assistant is useless after the first turn.

That’s where Hermes Agent with gbrain — a persistent knowledge base — changes the game.

How It Works

Hermes is an open-source, tool-calling AI agent by Nous Research. It connects to any LLM backend (I route through LiteLLM to vLLM on PVE03), and comes with a suite of built-in tools: terminal, file system, web search, browser automation, image generation, and more. But the killer feature is gbrain, an MCP server that acts as long-term memory.

gbrain is a knowledge base of markdown pages — interlinked, taggable, searchable. Every durable fact I learn about the homelab goes in there: IPs, credentials (with safe storage), config quirks, port numbers, deployment workflows, troubleshooting notes. When Hermes starts a task, it queries gbrain first for relevant context before even touching the tools.

Real-World Example

When I asked Hermes to rebuild this website’s landing page, it didn’t ask me for the server IP, which container it was in, or how to deploy files. It queried gbrain, found the homelab topology, the Caddy config pattern, and the deploy workflow — and got it done in one shot. Days later, when I asked for a new blog post about a different project, it already knew the site structure, the Flask route layout, and the theme system.

Not Just Memory — A Brain

gbrain supports bidirectional links ([[page links]]), full-text search, and a graph traversal that lets it find connections across topics. If I add a note about “vLLM GPU config” that links to “PVE03 hardware specs” and “LXC passthrough notes”, Hermes can follow those links automatically when troubleshooting a crash.
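
The link-following idea can be illustrated with a small graph walk. This is a generic sketch of [[wiki link]] parsing plus breadth-first traversal, not gbrain's actual implementation; the page titles echo the example above, and the page bodies are invented.

```python
import re
from collections import deque

LINK_RE = re.compile(r"\[\[([^\]]+)\]\]")

# Invented pages for illustration.
PAGES = {
    "vLLM GPU config": "Depends on [[PVE03 hardware specs]] and [[LXC passthrough notes]].",
    "PVE03 hardware specs": "Two GPUs; see [[PCIe topology]].",
    "LXC passthrough notes": "Device nodes are bind-mounted into the container.",
    "PCIe topology": "The GPUs sit at NODE distance.",
}


def build_graph(pages):
    """Build an undirected graph so every link also acts as a backlink."""
    graph = {title: set() for title in pages}
    for title, body in pages.items():
        for target in LINK_RE.findall(body):
            if target in graph:          # ignore links to missing pages
                graph[title].add(target)
                graph[target].add(title)
    return graph


def related(graph, start, max_hops=2):
    """Return {page: hop_count} for pages within max_hops of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if seen[page] == max_hops:
            continue
        for neighbour in sorted(graph[page]):
            if neighbour not in seen:
                seen[neighbour] = seen[page] + 1
                queue.append(neighbour)
    return {page: hops for page, hops in seen.items() if page != start}


if __name__ == "__main__":
    graph = build_graph(PAGES)
    print(related(graph, "vLLM GPU config", max_hops=2))
```

Because links are stored in both directions, starting from "PCIe topology" also surfaces the hardware page that mentions it.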

It also has ambient signal capture — every message goes through a “signal detector” that checks whether what you just said deserves to be remembered. No explicit “save this” command needed.

The Bottom Line

Persistent memory transforms an AI assistant from a clever chatbot into something closer to a system administrator who never forgets what you told them last week. For homelabs where complexity grows fast, that continuity is worth more than any model upgrade.

2026-06-26 · ⚙ devops

Proxmox LXC: Why Containers Beat VMs for Most of My Services

GPU passthrough, near-zero overhead, and pct tooling make LXC containers the obvious choice for most homelab workloads. Here’s why I use them over VMs.

When I built PVE01 — a dual-Xeon box with an RTX PRO 6000, a second consumer GPU, and 128 GB of RAM — I knew I’d be running a mix of services: web apps, game servers, model inference, databases. The question was VMs or containers.

Proxmox LXC containers won handily, and here’s why.

Containers vs. VMs for a Homelab

LXC containers share the host kernel, which means near-zero overhead for CPU, memory, and I/O. On a host where every watt of GPU compute and every GB of RAM counts, that matters. A VM running the same web stack would burn 2–4 GB on the guest OS alone before running anything useful. LXC cuts that to essentially zero — my web container uses about 180 MB at idle.

The trade-off: you can’t run a different kernel or Windows in LXC. But for Linux-only workloads — Flask apps, game servers, vLLM, Caddy, PostgreSQL — that’s not a constraint.

GPU Passthrough: The Real Differentiator

vLLM serving DeepSeek-V4 needs direct GPU access. With a VM, you’d need full PCIe passthrough: binding the GPU to vfio-pci, making sure it sits in its own IOMMU group, and giving the card up entirely to that VM. With LXC, you just mount the NVIDIA devices and libraries into the container:

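# In /etc/pve/lxc/CT_ID.conf:
# Allow the NVIDIA character devices (major 195) and nvidia-uvm.
# The nvidia-uvm major (509 here) is assigned dynamically; confirm with: ls -l /dev/nvidia-uvm
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 509:* rwm
# Bind-mount the device nodes from the host into the container.
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.environment: NVIDIA_VISIBLE_DEVICES=all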

The host keeps its display and can still use the GPU for node_exporter, monitoring, or lighter tasks. The container gets full CUDA access. It’s the best of both worlds.
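
For a container that needs more than one GPU, the same config lines can be generated rather than hand-edited. A minimal sketch follows: the function name is hypothetical, and the nvidia-uvm major number has to be read from the host (it is 509 in the config above, but it is not fixed).

```python
def lxc_gpu_config(gpu_indices, uvm_major):
    """Return the LXC config lines that expose the given NVIDIA GPUs.

    gpu_indices: host GPU numbers, e.g. [0] for /dev/nvidia0.
    uvm_major:   major number of /dev/nvidia-uvm on this host.
    """
    lines = [
        "lxc.cgroup2.devices.allow: c 195:* rwm",          # /dev/nvidia* and nvidiactl
        f"lxc.cgroup2.devices.allow: c {uvm_major}:* rwm",  # /dev/nvidia-uvm
    ]
    devices = [f"nvidia{index}" for index in gpu_indices] + ["nvidia-uvm", "nvidiactl"]
    for device in devices:
        lines.append(
            f"lxc.mount.entry: /dev/{device} dev/{device} none bind,optional,create=file"
        )
    lines.append("lxc.environment: NVIDIA_VISIBLE_DEVICES=all")
    return lines


if __name__ == "__main__":
    # Two GPUs passed into one container.
    print("\n".join(lxc_gpu_config([0, 1], uvm_major=509)))
```

Calling it with [0] and 509 reproduces the six config lines shown above.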

The LXC Toolchain

Proxmox’s pct command makes day-to-day management dead simple — push files, exec commands, snapshot, resize. My deploy workflow is literally:

# From the workstation: copy the file to the PVE host.
scp file root@pve01:/tmp/
# On the PVE host: push it into container 240.
pct push 240 /tmp/file /var/www/file

No SSH setup inside the container, no IP lookups — just direct host-level operations.
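
The two-step deploy is easy to wrap so it can run from the workstation in one go. The sketch below builds the scp and pct push commands and, by default, only prints them. The helper names are hypothetical, and running pct over ssh is an assumption about how the second step reaches the host.

```python
import shlex
import subprocess


def deploy_commands(local_path, ct_id, dest_path, host="root@pve01"):
    """Build the argv lists for staging a file on the host and pushing it in."""
    name = local_path.rsplit("/", 1)[-1]
    staged = f"/tmp/{name}"
    return [
        ["scp", local_path, f"{host}:{staged}"],
        # Assumption: pct is invoked on the PVE host through ssh.
        ["ssh", host, "pct", "push", str(ct_id), staged, dest_path],
    ]


def deploy(local_path, ct_id, dest_path, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    commands = deploy_commands(local_path, ct_id, dest_path)
    for command in commands:
        if dry_run:
            print(shlex.join(command))
        else:
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    deploy("site/index.html", 240, "/var/www/index.html")
```

Paths containing spaces would need extra quoting on the remote side; the sketch assumes simple file names.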

When I’d Still Use VMs

Running anything non-Linux (unlikely here)
Needing kernel-level isolation for multi-tenant security
Testing custom kernels or OS-level configs

For everything else in a single-admin homelab, LXC is faster, leaner, and easier to manage.

2026-06-22 · ✦ networking

Homelab Networking on a Locked-Down AT&T Gateway

Self-hosting behind AT&T fiber means working with a gateway that gives you almost no control. Here’s how DNS, Caddy, and a single public IP make it work.

When AT&T is your only fiber option, you take what you can get, and what you get is a locked-down gateway that only lets you port-forward to devices listed in its DHCP client dropdown. No static DHCP leases, no custom DNS override, no split-tunnel options. That makes self-hosting at home a game of working with the limitations rather than fighting them.

The Setup

The setup is a single public IP (99.74.254.214), Cloudflare in grey-cloud (DNS-only) mode so records resolve straight to the origin IP, and every self-hosted service running behind Caddy for automated TLS termination. Ports 80 and 443 forward from the AT&T gateway to the web container on 192.168.0.153. Everything else (game servers, SSH, anything non-HTTP) gets its own port forward to the right internal IP.

Why Grey-Cloud DNS

Cloudflare’s proxied (orange-cloud) mode is great for static sites, but it can get in the way of long-lived WebSocket connections, only proxies a fixed set of HTTP/HTTPS ports, and hides your real IP at the cost of making Cloudflare the TLS terminator. For a homelab where you control every service, grey-cloud + Caddy means you own the cert chain end-to-end and nothing breaks unexpectedly when you add a new subpath.

The AT&T Gateway Tax

The most frustrating limitation: you can only forward ports to devices the gateway has assigned a DHCP lease to. Since the router’s DHCP table is an opaque dropdown list with no static-lease option, a container reboot can change its IP if the lease timing shifts. The fix? A long lease reservation (24 hours+) and a monitoring script that alerts if the internal IP changes. I’ve also been eyeing a UDM Pro Max in IP passthrough mode to bypass the gateway entirely — that moves DHCP, DNS, and firewall into one real appliance.
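
The monitoring script mentioned above can be very small. A minimal sketch, assuming the container should stay on 192.168.0.153 and that the gateway sits at 192.168.0.1 (the gateway address is a guess, and how the alert gets delivered is left out):

```python
import socket

EXPECTED_IP = "192.168.0.153"


def current_lan_ip(probe_addr=("192.168.0.1", 80)):
    """Return the local address the OS would use to reach the gateway.

    Connecting a UDP socket sends no packets; it only selects a route.
    The probe address is an assumption about the gateway's LAN IP.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect(probe_addr)
        return sock.getsockname()[0]


def check_lease(expected, current):
    """Return an alert message if the IP drifted, or None if it is unchanged."""
    if current == expected:
        return None
    return (
        f"ALERT: internal IP changed from {expected} to {current}; "
        "the gateway port forward needs updating"
    )
```

Run from cron inside the container, the check would be check_lease(EXPECTED_IP, current_lan_ip()), with any non-None result handed to whatever notifier is already in place.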

The DNS Layer

Cloudflare handles DNS with a simple A record pointing to the public IP. Internal traffic stays on Tailscale, so latency-sensitive services (Palworld at ~2ms, vLLM inference) don’t hairpin through the public internet. External visitors hit the AT&T gateway → port 80/443 → Caddy → Flask app → done. Clean, minimal, and cheap.

Takeaways

Grey-cloud DNS + Caddy is the right combo for a dynamic homelab with many sub-services.
Reserved DHCP leases with fallback monitoring are mandatory on locked-down ISP gateways.
An IP-passthrough capable router (UDM Pro Max, pfSense box) is the eventual upgrade that fixes everything in one move.
