← back to dooner.tech
⚙ AI Benchmarks
homelab model benchmarks and inference notes — running on PVE03
PVE03 · Supermicro X11SPA-TF · Xeon Gold 6240L (18C/36T) · 192GB RAM
1,048,576
Max context tokens
Chat / LLM
DeepSeek-V4 (DSparK Speculation)
active · 2x GPU (tensor-parallel=2) · :8000
GPU0 96.9 GB
used of 97.9 GB
GPU1 96.8 GB
used of 97.9 GB
vLLM launch args:
vllm serve /models/DeepSeek-V4-Flash-DSpark \
--served-model-name DeepSeek-V4 \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.94 \
--max-model-len 1048576 \
--kv-cache-dtype fp8 \
--block-size 256 \
--trust-remote-code \
--max-num-seqs 32 \
--enable-chunked-prefill \
--enable-flashinfer-autotune \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--reasoning-parser deepseek_v4 \
--enable-auto-tool-choice \
--attention-backend FLASHINFER_MLA_SPARSE_DSV4 \
--speculative-config '{"model":"...DeepSeek-V4-Flash-DSpark","method":"dspark","num_speculative_tokens":5,"draft_sample_method":"probabilistic"}'
Key env vars:
VLLM_ENABLE_PCIE_ALLREDUCE=1
VLLM_USE_FLASHINFER_SAMPLER=1
VLLM_USE_AOT_COMPILE=1
VLLM_USE_MEGA_AOT_ARTIFACT=1
VLLM_CACHE_DIR=/cache/vllm
VLLM_NCCL_SO_PATH=/opt/libnccl-local-inference.so.2.30.4
Speech
Speaches (TTS / STT)
active · CPU · :8012
docker run -d --name speaches-cpu \
-p 8012:8000 \
ghcr.io/speaches-ai/speaches:latest-cpu \
uvicorn --factory speaches.main:create_app
Whisper speech-to-text + Kokoro text-to-speech, all CPU-based.
Embeddings
Qwen3 Embedding (0.6B)
active · CPU · :8010
docker run -d --name tei-embed-fast \
-p 8010:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id Qwen/Qwen3-Embedding-0.6B \
--max-batch-tokens 2048 \
--max-client-batch-size 8 \
--tokenization-workers 8
BGE-M3 (Multilingual)
active · CPU · :8013
docker run -d --name tei-bge-m3 \
-p 8013:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id BAAI/bge-m3 \
--max-batch-tokens 2048 \
--max-client-batch-size 8 \
--tokenization-workers 8
Rerankers
BGE Reranker Base (Fast)
active · CPU · :8014
docker run -d --name tei-rerank-fast \
-p 8014:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id BAAI/bge-reranker-base \
--max-batch-tokens 2048 \
--max-client-batch-size 8 \
--tokenization-workers 8
BGE Reranker Large (Better)
active · CPU · :8016
docker run -d --name tei-rerank-better \
-p 8016:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id BAAI/bge-reranker-large \
--max-batch-tokens 2048 \
--max-client-batch-size 8 \
--tokenization-workers 8
Architecture
┌────────────────────────────────────────────────━┐
Hermes Agent / Caddy
(calls DeepSeek-V4 via API on PVE03)
└───────────────────┬────────────────────────────────┘
Tailscale / LAN
┌───────────────────┴────────────────────────────────┐
PVE03 — Inference Host
DeepSeek-V4 Speaches Embeddings
:8000 (GPU x2) :8012 (CPU) :8010 Qwen3
:8013 BGE-M3
Rerankers
:8014 Base
:8016 Large
└─────────────────────────────────────────────────────────┘
← back to dooner.tech