Tightwad pools your mixed CUDA + ROCm GPUs into a single OpenAI-compatible endpoint.
Speculative decoding proxy: draft fast, verify smart, stream everything.
Same output quality. 2‑3× faster. Zero cloud bill.
$ pip install tightwad
$ tightwad proxy start
✓ Draft:  Qwen3-8B  @ 192.168.1.50:11434  (RTX 2070 — the dusty one)
✓ Target: Qwen3-32B @ 192.168.1.100:11434 (RTX 4070 Ti — the good one)
✓ Proxy listening on http://localhost:8088
→ Acceptance rate: 73.2% | Tokens saved this session: 14,891
Pick your poison. Or run both. Tightwad doesn't judge — it just saves you cash.
Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another — Tightwad doesn't care. It distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint.
[OpenAI Client]
|
v
+-------------------+
| Tightwad | <-- One endpoint to rule them all
| Coordinator :8080|
+--------+----------+
| distributes layers
+----+----+
v v
+--------+ +--------+
| Worker | | Worker |
| NVIDIA | | AMD |
| 4070Ti | | 7900XTX|
| 16 GB | | 24 GB |
+--------+ +--------+
72B model: covered ✓
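How does a 72B model fit? The coordinator assigns layers to workers roughly in proportion to their VRAM. The sketch below illustrates that split with the two cards from the diagram; the 80-layer count and the `split_layers` helper are made up for the example and are not Tightwad's actual placement code.

```python
# Rough sketch of proportional-by-VRAM layer placement, the idea behind the
# diagram above. Illustrative only; not Tightwad's real scheduler.

def split_layers(total_layers: int, vram_gb: dict[str, int]) -> dict[str, int]:
    """Assign transformer layers to workers in proportion to their VRAM."""
    total_vram = sum(vram_gb.values())
    split = {name: round(total_layers * gb / total_vram) for name, gb in vram_gb.items()}
    # Fix rounding drift so every layer is assigned exactly once.
    last = list(split)[-1]
    split[last] += total_layers - sum(split.values())
    return split

if __name__ == "__main__":
    # 80 layers is a stand-in figure; use your model's real layer count.
    print(split_layers(80, {"RTX 4070 Ti (16 GB)": 16, "7900 XTX (24 GB)": 24}))
    # -> {'RTX 4070 Ti (16 GB)': 32, '7900 XTX (24 GB)': 48}
```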
Your cheap GPU isn't slow — it's a draft engine. A fast small model guesses tokens. A big model batch-verifies them. Same output quality as running the big model alone. Ships token IDs (bytes), not 100–300 MB of tensor data over the wire.
[Your App / OpenAI SDK]
|
v
+--------------------------+
| Tightwad Proxy :8088 |
| |
| 1. Draft 8 tokens ------+--> Qwen3-8B
| (~100 tok/s, cheap) | RTX 2070 (the dusty one)
| |
| 2. Verify batch --------+--> Qwen3-72B
| (one forward pass) | 4070Ti / Cloud API
| |
| 3. Accept/reject <------+
| 4. Stream to client |
+--------------------------+
Output = identical to 72B alone ✓
Small model blazes through 8 candidate tokens at ~100+ tok/s. Fast and cheap.
Big model checks all 8 candidates in one forward pass, for roughly the cost of generating a single token. The batch is basically free.
Keep every token both models agree on. Take the big model's token at the first disagreement.
Accepted tokens stream to your app instantly. Repeat until done. Output is mathematically identical to running the big model alone.
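If you prefer the loop in code, here's a minimal sketch of one draft-and-verify round under greedy verification. `draft` and `verify` are hypothetical stand-ins for calls to the draft and target backends; the real proxy also handles sampling, streaming, and fallback, which are omitted here.

```python
from typing import Callable, List

# Hypothetical stand-ins for the two backends. In the proxy these are HTTP
# calls to the draft and target servers; here they are just callables.
DraftFn = Callable[[List[int], int], List[int]]         # (context, k) -> k proposed token IDs
VerifyFn = Callable[[List[int], List[int]], List[int]]  # (context, drafts) -> k+1 target token IDs

def speculative_step(context: List[int], k: int,
                     draft: DraftFn, verify: VerifyFn) -> List[int]:
    """One round: draft k tokens, verify them in one batch, return tokens to stream."""
    proposed = draft(context, k)          # 1. small model guesses k tokens
    checked = verify(context, proposed)   # 2. big model's pick at every position, one forward pass
    accepted: List[int] = []
    for guess, truth in zip(proposed, checked):
        if guess != truth:                # 3. first disagreement: keep the big model's token, stop
            accepted.append(truth)
            return accepted
        accepted.append(guess)            #    agreement: the drafted token is free
    accepted.append(checked[-1])          # every draft accepted: keep the target's bonus token
    return accepted                       # 4. stream these, extend the context, repeat until done
```

With `max_draft_tokens: 8`, each round costs one cheap draft pass plus one target pass, however many tokens it ends up yielding.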
Tested with Qwen3-8B (RTX 2070) drafting for Qwen3-32B (RTX 4070 Ti Super) across 130 prompts. Real hardware. Real numbers. No cherry-picking.
| Prompt Type | Prompts | Verdict |
|---|---|---|
| Reasoning | 32 | Math is deterministic. Love it. |
| Code | 34 | Syntax is law. Both models agree. |
| Factual | 18 | Decent. Facts don't lie. |
| List | 40 | Phrasing varies. Still worthwhile. |
| Creative | 6 | Many valid outputs. Expected. |
| ⚡ Average | 26 | 58% of tokens = free. |
Do the math on your monthly cloud inference waste. Then stop doing that.
* Based on 58.3% avg acceptance rate. Draft model runs on your local GPU (electricity cost = rounding error). Cloud API calls reduced by ~58%. Your mileage may vary, but it'll vary in your favor.
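For the back-of-the-envelope version of that footnote: if roughly 58.3% of output tokens come from accepted drafts, that fraction of the paid work moves onto your local GPU. A toy estimate; your bill, rates, and workload will differ.

```python
# Toy savings estimate based on the 58.3% average acceptance rate above.
# Electricity for the local draft GPU is ignored, as in the footnote.
def monthly_savings(cloud_bill_usd: float, acceptance_rate: float = 0.583) -> float:
    """Work the cloud target no longer does, expressed in dollars."""
    return cloud_bill_usd * acceptance_rate

print(f"${monthly_savings(200.0):.2f} saved on a $200/month bill")  # $116.60 saved on a $200/month bill
```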
Tightwad was built by and for people who think paying $200/month for inference is genuinely offensive.
You have a 2070 in the old desktop, a 4070 in the main rig, and a Quadro doing nothing in the server rack. They're all lazy freeloaders. Tightwad makes them a team.
You're still paying OpenAI/Anthropic for some tasks. Fine. But why let them do the easy parts? Draft locally, verify via API. 58% fewer API calls. Same answers.
You want 72B model quality on a $600 GPU budget. That's not a dream — that's just math. RPC mode distributes layers across whatever you've got collecting dust.
You bought AMD because it was on sale and NVIDIA for the CUDA ecosystem. Now they won't cooperate. Tightwad makes CUDA and ROCm serve the same model together. Finally.
No Docker Compose files with 300 environment variables. No Kubernetes YAML. Just Python and one config file.
$ git clone https://github.com/akivasolutions/tightwad.git
$ cd tightwad
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install -e .
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: 8
  fallback_on_draft_failure: true

draft:
  url: http://192.168.1.50:11434    # Your cheap GPU (Ollama)
  model_name: qwen3:8b
  backend: ollama

target:
  url: http://192.168.1.100:11434   # Your big GPU (Ollama)
  model_name: qwen3:32b
  backend: ollama
$ tightwad proxy start
✓ Draft model healthy
✓ Target model healthy
✓ Proxy listening on http://localhost:8088

# Test it (drop-in for any OpenAI SDK call)
$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Check acceptance rate stats
$ tightwad proxy status
→ Acceptance rate: 73.2% | Rounds: 34 | Tokens saved: 14,891
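The same request through the official OpenAI Python SDK is just a `base_url` change. A minimal sketch; the model name mirrors the config above (adjust if your proxy expects something else) and the API key is a placeholder, since local backends generally ignore it.

```python
from openai import OpenAI

# Point the stock OpenAI SDK at the Tightwad proxy instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8088/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3:32b",  # the target model from the config above
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=50,
)
print(resp.choices[0].message.content)
```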
# Or use scripts/install-worker.sh
$ cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
$ cmake --build build --config Release
$ build/bin/rpc-server -p 50052   # GPU 0
coordinator:
  host: 0.0.0.0
  port: 8080
  backend: hip              # or cuda
  gpus:
    - name: "7900 XTX #0"
      vram_gb: 24

workers:
  - host: 192.168.1.100     # NVIDIA box
    gpus:
      - name: "RTX 4070 Ti Super"
        vram_gb: 16
        rpc_port: 50052

models:
  qwen3-72b:
    path: /models/Qwen3-72B-Q4_K_M.gguf
    ctx_size: 8192
    flash_attn: true
    default: true
$ tightwad start
✓ Coordinator started
✓ Worker @ 192.168.1.100:50052 online
✓ Model qwen3-72b loaded across 40 GB VRAM

# Hot-swap to a different model anytime
$ tightwad swap deepseek-r1-70b

# Run the benchmark
$ tightwad benchmark
Built for terminal people who hate bloat as much as they hate cloud bills.
Drop-in replacement for any OpenAI SDK. Change one URL. That's it. No code changes required.
tightwad swap model-name — swap the model while workers keep running. Zero downtime.
Full streaming support on all endpoints. Tokens flow as they're accepted. No buffering. See the streaming sketch below.
tightwad start, tightwad proxy start, tightwad status. Simple commands for complex infrastructure.
One file describes your entire hardware topology. Version control it. Share it. Ship it.
tightwad benchmark — test your cluster's throughput and acceptance rates on real prompts.
Ollama for quick setup. llama.cpp for maximum performance. Switch per-model in the config.
Draft server down? fallback_on_draft_failure: true routes straight to target. Never breaks.
NVIDIA CUDA + AMD ROCm on the same model, same cluster, same endpoint. No compromises.
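Streaming works the same way as against any OpenAI-compatible server. A minimal sketch against the quick-start proxy on localhost:8088, with the same placeholder model name and key as above.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8088/v1", api_key="not-needed")

# Chunks arrive as the verify step accepts tokens; print each delta as it lands.
stream = client.chat.completions.create(
    model="qwen3:32b",
    messages=[{"role": "user", "content": "Explain speculative decoding in one paragraph."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```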