AI Benchmarks — dooner.tech

DeepSeek-V4 (DSparK Speculation)

active · 2x GPU (tensor-parallel=2) · :8000

GPU0 96.9 GB

used of 97.9 GB

GPU1 96.8 GB

used of 97.9 GB

~99%

VRAM utilization

vLLM launch args:

vllm serve /models/DeepSeek-V4-Flash-DSpark \
--served-model-name DeepSeek-V4 \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.94 \
--max-model-len 1048576 \
--kv-cache-dtype fp8 \
--block-size 256 \
--trust-remote-code \
--max-num-seqs 32 \
--enable-chunked-prefill \
--enable-flashinfer-autotune \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--reasoning-parser deepseek_v4 \
--enable-auto-tool-choice \
--attention-backend FLASHINFER_MLA_SPARSE_DSV4 \
--speculative-config '{"model":"...DeepSeek-V4-Flash-DSpark","method":"dspark","num_speculative_tokens":5,"draft_sample_method":"probabilistic"}'

Key env vars:

VLLM_ENABLE_PCIE_ALLREDUCE=1
VLLM_USE_FLASHINFER_SAMPLER=1
VLLM_USE_AOT_COMPILE=1
VLLM_USE_MEGA_AOT_ARTIFACT=1
VLLM_CACHE_DIR=/cache/vllm
VLLM_NCCL_SO_PATH=/opt/libnccl-local-inference.so.2.30.4

Speech

Speaches (TTS / STT)

active · CPU · :8012

docker run -d --name speaches-cpu \
-p 8012:8000 \
ghcr.io/speaches-ai/speaches:latest-cpu \
uvicorn --factory speaches.main:create_app

Whisper speech-to-text + Kokoro text-to-speech, all CPU-based.

Embeddings

Qwen3 Embedding (0.6B)

active · CPU · :8010

docker run -d --name tei-embed-fast \
-p 8010:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id Qwen/Qwen3-Embedding-0.6B \
--max-batch-tokens 2048 \
--max-client-batch-size 8 \
--tokenization-workers 8

BGE-M3 (Multilingual)

active · CPU · :8013

docker run -d --name tei-bge-m3 \
-p 8013:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id BAAI/bge-m3 \
--max-batch-tokens 2048 \
--max-client-batch-size 8 \
--tokenization-workers 8

Rerankers

BGE Reranker Base (Fast)

active · CPU · :8014

docker run -d --name tei-rerank-fast \
-p 8014:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id BAAI/bge-reranker-base \
--max-batch-tokens 2048 \
--max-client-batch-size 8 \
--tokenization-workers 8

BGE Reranker Large (Better)

active · CPU · :8016

docker run -d --name tei-rerank-better \
-p 8016:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id BAAI/bge-reranker-large \
--max-batch-tokens 2048 \
--max-client-batch-size 8 \
--tokenization-workers 8

Architecture

┌────────────────────────────────────────────────━┐
Hermes Agent / Caddy
(calls DeepSeek-V4 via API on PVE03)
└───────────────────┬────────────────────────────────┘
Tailscale / LAN
┌───────────────────┴────────────────────────────────┐
PVE03 — Inference Host

DeepSeek-V4 Speaches Embeddings
:8000 (GPU x2) :8012 (CPU) :8010 Qwen3
:8013 BGE-M3
Rerankers
:8014 Base
:8016 Large
└─────────────────────────────────────────────────────────┘

⚙ AI Benchmarks

Chat / LLM

Speech

Embeddings

Rerankers

Architecture