NestAI Docs

GPU SERVERS

NestAI offers GPU-accelerated servers powered by NVIDIA GPUs via TensorDock. GPU servers deliver 5-10x faster inference than CPU servers: running 7B models at 60-90 tokens per second feels like using ChatGPT.

GPU plans are standalone: they include the server, the GPU, and the full managed stack (Ollama, Open WebUI, SSL, monitoring). No separate plan purchase is needed.

GPU Tiers

Tier        | GPU       | VRAM  | vCPU | RAM   | Storage | Price   | Best for
GPU Starter | RTX A4000 | 16 GB | 4    | 16 GB | 150 GB  | $119/mo | 7B-14B models at ~60 tok/s
GPU Pro     | RTX 4090  | 24 GB | 8    | 32 GB | 200 GB  | $349/mo | 14B-32B models at ~25-35 tok/s
GPU Max     | A100 80GB | 80 GB | 16   | 64 GB | 300 GB  | $999/mo | 70B models at ~30-50 tok/s

Speed Comparison — CPU vs GPU

Model           | CPU (shared) | CPU (dedicated) | GPU Starter       | GPU Pro
Qwen 3.5 4B     | ~15-20 tok/s | ~15-20 tok/s    | ~80+ tok/s        | ~90+ tok/s
Mistral 7B      | ~10-15 tok/s | ~12-15 tok/s    | ~55-65 tok/s      | ~70-80 tok/s
Phi-4 14B       | ~4-6 tok/s   | ~7-10 tok/s     | ~30-40 tok/s      | ~40-50 tok/s
DeepSeek R1 32B | Too slow     | ~3-5 tok/s      | Won't fit (16 GB) | ~15-20 tok/s
Llama 3.3 70B   | Too slow     | ~2-3 tok/s      | Won't fit         | Won't fit
For Llama 3.3 70B at usable speed, you need GPU Max (A100 80GB), which runs it at ~30-50 tok/s.
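You can verify throughput figures like those above on your own server: Ollama's /api/generate endpoint reports eval_count (tokens generated) and eval_duration (time spent generating, in nanoseconds) in its final response object. A minimal sketch of the calculation, shown here against a fabricated sample payload rather than a live server:

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from an Ollama /api/generate response.

    Ollama reports eval_count (tokens generated) and eval_duration
    (nanoseconds spent generating them) in the final response object.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Fabricated sample payload: 128 tokens generated in 2 seconds.
sample = {"eval_count": 128, "eval_duration": 2_000_000_000}
print(tokens_per_second(sample))  # 64.0
```

Run the same calculation against a real response from your server to see where your model lands in the table above.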

How GPU Servers Work

1. Choose GPU tier during onboarding

Select "GPU" infrastructure at step 2, then pick your GPU tier based on the models you want to run.

2. Server is provisioned

NestAI provisions a dedicated VM on TensorDock with NVIDIA drivers, CUDA, Ollama (GPU mode), Open WebUI, and SSL. Provisioning takes about 5-10 minutes.

3. Start chatting

Your team accesses the AI at yourteam.nestai.chirai.dev. Same interface as CPU servers, but responses are 5-10x faster.

Which Model Fits Which GPU?

VRAM             | Models that fit            | GPU Tier
16 GB (A4000)    | 3B, 4B, 7B, 8B, 12B (Q4)   | GPU Starter
24 GB (RTX 4090) | 3B-14B, 32B (Q4)           | GPU Pro
80 GB (A100)     | All models up to 70B (Q4)  | GPU Max

VRAM determines the largest model you can load. All models use Q4_K_M quantization by default, which gives the best balance of quality and speed.
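A rough rule of thumb behind the table above: Q4_K_M averages about 4.5 bits per weight, and the runtime needs extra headroom for the KV cache and CUDA overhead. The sketch below uses these approximations for illustration; the 4.5 bits/weight and 2 GB overhead figures are assumptions, not NestAI-published numbers:

```python
def fits_in_vram(params_billion: float, vram_gb: float,
                 bits_per_weight: float = 4.5, overhead_gb: float = 2.0) -> bool:
    """Rough check whether a Q4-quantized model fits in a GPU's VRAM.

    Assumes Q4_K_M averages ~4.5 bits per weight; overhead_gb reserves
    room for the KV cache and CUDA runtime. Both figures are estimates.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb

print(fits_in_vram(7, 16))   # True  — 7B fits on the 16 GB A4000
print(fits_in_vram(32, 16))  # False — matches "Won't fit (16 GB)"
print(fits_in_vram(70, 80))  # True  — 70B Q4 fits on the A100 80GB
```

The estimate matches the tiers above: a 32B model needs ~20 GB (GPU Pro's 24 GB), and a 70B model needs ~41 GB (only GPU Max has room).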

GPU Regions

GPU servers are deployed on TensorDock's global network across 100+ locations in 20+ countries. During onboarding, you select a region preference (Europe, US, or Asia-Pacific) and NestAI automatically selects the best available datacenter.

GPU availability varies by region and model. If your preferred GPU is not available in your region, NestAI will suggest an alternative location.

GPU vs CPU — When to Choose What

Factor           | CPU (from $39/mo)                             | GPU (from $119/mo)
Speed (7B)       | 10-20 tok/s                                   | 60-90 tok/s
Best for         | Light usage, internal tools, document processing | Real-time chat, demos, production, multiple users
Concurrent users | 1-2 comfortable                               | 4-8 comfortable
70B models       | Unusable (2-3 tok/s)                          | Usable on GPU Max (30-50 tok/s)
Region options   | Germany, US, Singapore                        | Global (20+ countries)
Deploy time      | 20-25 min                                     | 5-10 min
Infrastructure   | Hetzner Cloud                                 | TensorDock
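The decision factors above can be condensed into a simple sizing sketch. This is illustrative only: the function name and thresholds are made up for this example (loosely mirroring the published tables), not an official NestAI sizing tool:

```python
def suggest_tier(model_params_b: float, concurrent_users: int) -> str:
    """Suggest a NestAI tier from model size and team concurrency.

    Illustrative thresholds based on the comparison tables above;
    not an official NestAI sizing API.
    """
    if model_params_b >= 33:
        return "GPU Max"      # 70B-class models need the A100 80GB
    if model_params_b >= 15 or concurrent_users > 4:
        return "GPU Pro"      # 14B-32B models, or heavier concurrency
    if concurrent_users > 2:
        return "GPU Starter"  # real-time chat for small teams
    return "CPU"              # light usage, internal tools

print(suggest_tier(7, 2))   # CPU
print(suggest_tier(7, 6))   # GPU Pro
print(suggest_tier(70, 1))  # GPU Max
```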