The right bare metal GPU, sized correctly to your workload, is the difference between a predictable monthly infrastructure line and a finance conversation every quarter. The decision is simpler than most spec sheets make it look. Start with your workload, not the hardware.
## Match your workload to the right bare metal GPU
| If you are doing this | Start with |
|---|---|
| Inference on 7B–30B models, video analytics, VDI | NVIDIA L4 |
| Mixed AI + 3D rendering, digital twins, or VDI from one GPU pool | NVIDIA RTX Pro 6000 BSE |
| Cost-sensitive training or inference, stepping up from CPU | NVIDIA A100 or L40S |
| Fine-tuning 30B–70B models, high-throughput inference APIs | NVIDIA H100 NVL |
| Long-context inference (64k+ tokens), large RAG, memory-bound HPC | NVIDIA H200 NVL |
| FP64-heavy HPC, data analytics, deep learning on ROCm | AMD Instinct MI210 |
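If you encode this sizing rule in provisioning or capacity-planning scripts, the table collapses to a lookup. The workload keys below are illustrative labels, not an official taxonomy:

```python
# Starting-point GPU per workload class, mirroring the table above.
# Keys are illustrative; map your own workload names onto them.
GPU_FOR_WORKLOAD = {
    "inference_7b_30b": "NVIDIA L4",
    "mixed_ai_3d_vdi": "NVIDIA RTX Pro 6000 BSE",
    "cost_sensitive_step_up": "NVIDIA A100 or L40S",
    "finetune_30b_70b": "NVIDIA H100 NVL",
    "long_context_inference": "NVIDIA H200 NVL",
    "fp64_hpc_rocm": "AMD Instinct MI210",
}

def starting_gpu(workload: str) -> str:
    """Return the starting-point GPU for a workload class, if one matches."""
    return GPU_FOR_WORKLOAD.get(workload, "no direct match")

print(starting_gpu("finetune_30b_70b"))  # → NVIDIA H100 NVL
```

A lookup is only a starting point; the three questions at the end of this piece settle the cases the table does not.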
## What each bare metal GPU actually does well
### L4 (24 GB, 72 W)
The density GPU. Runs in any server, handles inference on mid-size models efficiently, and doubles as a video transcoding accelerator. If your workload fits in 24 GB and you want maximum jobs-per-rack, this is the starting point.
### RTX Pro 6000 BSE (96 GB GDDR7)
Built on Blackwell with fifth-generation Tensor Cores and full ray-tracing hardware. The only GPU in this list that handles AI inference, 3D rendering, and VDI from the same node. Worth considering when you need one GPU pool to serve both ML engineers and design or simulation teams without splitting infrastructure.
### A100 (40/80 GB HBM2e)
Still a solid choice for teams with existing A100-optimised workflows or tighter budgets. Strong across training and inference, but a generation behind the H100 on raw throughput. If your stack runs well on A100 today, the switching cost may not be worth it unless you are hitting memory or throughput ceilings.
### L40S (48 GB GDDR6)
The middle tier for teams who need more VRAM than the L4 without committing to H100-class pricing. Handles inference well on models in the 13B–30B range and suits mixed graphics and AI workloads where the H100’s compute overhead is unnecessary.
### H100 NVL (94 GB HBM3)
The current fine-tuning standard for most teams. 1,671 TFLOPS of BF16 throughput (with sparsity) and NVLink support make it the practical choice for 30B–70B model training and production inference APIs with strict latency SLAs. Most Llama 3 70B and Mixtral 8x7B fine-tuning work lands here.
### H200 NVL (141 GB HBM3e)
The same Hopper architecture as the H100, with 1.5x more VRAM and roughly 1.2x more memory bandwidth. The performance difference is real, but only shows up when your workload is genuinely memory-bound: long-context inference above 64k tokens, large KV caches, embedding tables that spill out of 94 GB, or HPC grids with significant data movement. If your H100 is not hitting VRAM limits, the H200 will not change your throughput.
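To see when the extra HBM actually matters, the KV-cache footprint can be estimated with straight arithmetic. The sketch below uses illustrative Llama-3-70B-class shapes (80 layers, 8 grouped KV heads of dimension 128, FP16 cache); the exact values depend on the model and serving stack, so treat the defaults as assumptions:

```python
def kv_cache_bytes(seq_len: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Per-sequence KV cache size: a K and a V tensor for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len

GIB = 1024 ** 3
# With these shapes, one 64k-token sequence needs ~20 GiB of cache,
# so a handful of concurrent long-context requests outgrows 94 GB
# before the weights are even counted.
print(kv_cache_bytes(64 * 1024) / GIB)  # → 20.0
```

The cache grows linearly with context length and batch size, which is why the H200's advantage appears only on genuinely memory-bound workloads.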
### AMD Instinct MI210 (64 GB HBM2e)
The FP64 option in this list. At 22.6 TFLOPS peak FP64 and 1.6 TB/s memory bandwidth, the MI210 is built for double-precision compute: data analytics, scientific simulation, and deep learning workloads where FP64 precision is the priority over raw BF16 throughput.
It runs on AMD ROCm, not CUDA. PyTorch and TensorFlow both have ROCm support. If your stack includes CUDA-specific libraries or custom CUDA kernels, those will need porting before you can run them on MI210.
## Three questions that settle most GPU decisions
### Does your model fit in the GPU’s VRAM with room for activations and KV cache?
If no, move up. If yes by a wide margin, you are paying for memory that sits idle every month. A Mistral 7B inference workload does not need 94 GB of HBM3.
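A first-pass fit check is just arithmetic. The sketch below uses a common rule of thumb: weights equal parameter count times bytes per parameter, plus a flat allowance for activations and runtime overhead. The 20% allowance is an assumption, not a measured figure, and it ignores KV cache, which long contexts add on top:

```python
def fits_in_vram(params_billions: float, bytes_per_param: int,
                 vram_gb: float, overhead: float = 0.20) -> bool:
    """Rough single-GPU fit check: weights plus a flat overhead allowance.

    `overhead` is an assumed fudge factor for activations and runtime;
    KV cache for long contexts is not included.
    """
    weights_gb = params_billions * bytes_per_param
    return weights_gb * (1 + overhead) <= vram_gb

print(fits_in_vram(7, 2, 24))   # Mistral 7B in FP16 on an L4 → True
print(fits_in_vram(70, 2, 94))  # 70B in FP16 on one H100 NVL → False
```

A check like this catches the two failure modes the question describes: a model that will not load at all, and a model rattling around in VRAM you are paying for.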
### Are you training across multiple GPUs?
If yes, NVLink matters. Use H100 NVL or H200 NVL. NCCL performance over NVLink is substantially better than over PCIe for multi-GPU training, and the difference compounds as you scale. If you are running single-GPU inference, PCIe is fine.
### What is your GPU utilisation right now?
Below 50% means you are paying for capacity that sits idle every month. Consistently above 80% means you are likely leaving throughput on the table — the next model size up will pay for itself faster than you expect.
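These thresholds reduce to a trivial decision rule. The 50% and 80% cut-offs are the ones used above; tune them against your own billing and monitoring data:

```python
def sizing_signal(utilisation_pct: float) -> str:
    """Map sustained GPU utilisation to a sizing action.

    Thresholds (50% / 80%) follow the rule of thumb in the text.
    """
    if utilisation_pct < 50:
        return "downsize: paying for idle capacity"
    if utilisation_pct > 80:
        return "upsize: likely leaving throughput on the table"
    return "keep: sized about right"

print(sizing_signal(35))  # → downsize: paying for idle capacity
```

The input should be sustained utilisation over weeks, not a momentary `nvidia-smi` reading; a single spike tells you nothing about sizing.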
When all three answers point to the same GPU, you have your answer. When they conflict, the multi-GPU question usually wins.