Everyone is building AI products. Startups, enterprises, research labs: the demand for GPU capacity has turned Nvidia into one of the most valuable companies on earth and made "H100 waitlist" a legitimate phrase people say with a straight face.
But what does training an AI model actually cost? The range is genuinely enormous, from a few dollars for a small fine-tuned model to hundreds of millions for a frontier LLM. Where you land depends on model size, data volume, hardware choice, and a handful of factors most people don't think about until the bill arrives.
Let's break it down.
The main cost driver: GPU hours
Training a neural network is fundamentally matrix multiplication run billions of times across massive datasets. GPUs are designed for exactly this: they can run thousands of parallel operations simultaneously. The more parameters in your model and the more training data you have, the more GPU hours you need.
GPU rental prices on major cloud providers in early 2025:
- NVIDIA A100 (80GB): ~$3–4/hour on AWS, GCP, Azure on-demand; ~$1.50–2.50/hour spot
- NVIDIA H100 (80GB): ~$5–8/hour on-demand; ~$2–4/hour spot
- NVIDIA A10G: ~$1.20–1.80/hour; a good fit for smaller models
Training typically uses clusters of GPUs in parallel, not single cards. A mid-size training run might use 8–64 GPUs simultaneously for days or weeks.
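The arithmetic here is simple enough to sketch: cluster cost is just GPUs × hours × hourly rate. A minimal back-of-envelope helper (the rate used below is the illustrative early-2025 on-demand A100 figure from the list above, not a live price):

```python
def training_cost(num_gpus: int, hours: float, rate_per_gpu_hour: float) -> float:
    """Back-of-envelope cluster cost: GPUs x hours x hourly rate."""
    return num_gpus * hours * rate_per_gpu_hour

# 32 A100s for two weeks at ~$3/hour on-demand
cost = training_cost(num_gpus=32, hours=14 * 24, rate_per_gpu_hour=3.0)
print(f"${cost:,.0f}")  # $32,256
```

Even a modest two-week run on a small cluster lands in the tens of thousands of dollars, which is why the optimizations later in this post matter.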
What different model sizes actually cost
Small models (under 1B parameters) – fine-tuning: If you're fine-tuning an existing model like Llama or Mistral on a custom dataset, costs are modest: a few hours on a single A10G or A100, for a total of $20–200 depending on dataset size and number of epochs.
Medium models (1B–7B parameters) – training from scratch: Now we're talking multi-day runs on GPU clusters. Training a 7B-parameter model on a quality dataset of ~1T tokens requires roughly 100,000–200,000 GPU hours. At a $3/hour average on A100s, that's $300k–$600k.
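That GPU-hour figure can be sanity-checked with the common ~6 × parameters × tokens FLOPs approximation for transformer training. A sketch, assuming A100 bf16 peak throughput of 312 TFLOPS and ~40% hardware utilization (both are assumptions; real utilization varies widely):

```python
def estimated_gpu_hours(params: float, tokens: float,
                        peak_flops: float = 312e12,  # A100 bf16 peak (assumed)
                        mfu: float = 0.4) -> float:  # ~40% utilization (assumed)
    """Approximate training compute via the ~6*N*D FLOPs rule of thumb."""
    total_flops = 6 * params * tokens
    return total_flops / (peak_flops * mfu) / 3600

hours = estimated_gpu_hours(params=7e9, tokens=1e12)
print(f"{hours:,.0f} GPU hours")  # ~93,000, consistent with the 100k-200k range
```

Lower utilization or retries push the estimate toward the top of the quoted range, which is why the range is wide.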
Large models (13B–70B parameters): Meta's Llama 2 70B reportedly required approximately 1.7 million GPU hours on A100s. At $3/hour spot pricing, that's roughly $5 million; on-demand, significantly more.
Frontier models (100B+ parameters): GPT-4-class training runs are estimated at $50–100M+. They require specialized infrastructure, custom networking, and thousands of GPUs running for months. This is not a startup exercise.
The hidden costs people forget
Data preparation. Training data doesn't arrive clean. Filtering, deduplication, formatting, and quality scoring can cost as much as the compute itself for large runs. Common Crawl is free; making it usable isn't.
Experimentation and failed runs. The first training run rarely goes as planned. Hyperparameter tuning, debugging divergence, adjusting the learning rate schedule: expect to burn 20–50% of your budget on runs that don't produce the final model.
Storage and data transfer. Training checkpoints for large models are massive. Storing and transferring them across cloud regions adds up. A 70B parameter model checkpoint is roughly 140GB in fp16; you'll save dozens of checkpoints during training.
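The 140GB figure is just parameter count times bytes per parameter, and the storage bill follows directly. A quick sketch (the $0.023/GB-month rate is an assumed S3-standard-like price, and real training checkpoints are larger once optimizer state is included):

```python
def checkpoint_size_gb(params: float, bytes_per_param: int = 2) -> float:
    """fp16/bf16 stores 2 bytes per parameter (optimizer state excluded)."""
    return params * bytes_per_param / 1e9

size = checkpoint_size_gb(70e9)   # ~140 GB per checkpoint
monthly = 30 * size * 0.023       # 30 checkpoints at ~$0.023/GB-month (assumed rate)
print(f"{size:.0f} GB each, ~${monthly:,.0f}/month for 30 checkpoints")
```

Storage is small next to compute, but cross-region egress on files this size is not, so keep checkpoints close to the cluster.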
Inference infrastructure. Training is a one-time cost; serving the model is ongoing. A 7B model serving thousands of users requires dedicated GPU capacity indefinitely. Many teams discover the training bill was the smaller number.
How to actually reduce your training costs
Use spot/preemptible instances. Cloud providers offer unused GPU capacity at 50–70% discounts. Your job gets interrupted if demand spikes, but with proper checkpointing every 30–60 minutes you lose minimal progress. Most serious training jobs use spot.
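The checkpoint-and-resume pattern can be sketched framework-agnostically: save state on a timer, and on startup load the latest checkpoint if one exists. A simplified sketch (the `train_step` callback, file path, and interval are placeholders; a real job would save model and optimizer state via its framework):

```python
import os
import pickle
import time

CKPT = "checkpoint.pkl"       # placeholder path
CHECKPOINT_EVERY = 30 * 60    # seconds; i.e. every 30 minutes

def run(total_steps: int, train_step):
    # Resume from the last checkpoint if this is a restart after preemption.
    state = {"step": 0}
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            state = pickle.load(f)

    last_save = time.monotonic()
    while state["step"] < total_steps:
        train_step(state)     # placeholder: one optimizer step, mutates state
        state["step"] += 1
        if time.monotonic() - last_save >= CHECKPOINT_EVERY:
            with open(CKPT + ".tmp", "wb") as f:
                pickle.dump(state, f)
            os.replace(CKPT + ".tmp", CKPT)  # atomic swap: a kill mid-write can't corrupt the checkpoint
            last_save = time.monotonic()
    return state
```

The write-to-temp-then-rename step is the detail that matters on spot: a preemption during the save leaves the previous checkpoint intact.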
Mixed precision training. Training in fp16 or bf16 instead of fp32 cuts memory requirements roughly in half, letting you use fewer GPUs or fit larger batch sizes. The speedup is typically 2–3x with no meaningful accuracy loss.
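The memory halving is easy to see directly: fp16 uses 2 bytes per value versus 4 for fp32. A tiny NumPy illustration (this shows storage only; real mixed-precision training adds machinery like loss scaling, which frameworks handle for you):

```python
import numpy as np

n = 1_000_000
fp32 = np.zeros(n, dtype=np.float32)
fp16 = np.zeros(n, dtype=np.float16)

print(fp32.nbytes)  # 4000000 bytes
print(fp16.nbytes)  # 2000000 bytes: half the memory for the same tensor shape
```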
Gradient checkpointing. Trades compute for memory: some activations are recomputed during the backward pass instead of being stored. Lets you train larger models on the same hardware. Standard practice now.
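The idea can be sketched without a framework: instead of keeping every layer's activation alive for the backward pass, keep only every k-th one and recompute the rest from the nearest checkpoint when needed. A toy illustration that just counts stored activations (real implementations live in the framework, e.g. PyTorch's activation checkpointing utilities):

```python
def forward_store_all(x, layers):
    """Vanilla forward: keep every activation for the backward pass."""
    acts = [x]
    for layer in layers:
        acts.append(layer(acts[-1]))
    return acts                       # memory grows linearly with depth

def forward_checkpointed(x, layers, every=4):
    """Keep only every `every`-th activation; the rest get recomputed later."""
    saved = {0: x}
    h = x
    for i, layer in enumerate(layers, start=1):
        h = layer(h)
        if i % every == 0:
            saved[i] = h
    return saved                      # memory grows with depth / every

layers = [lambda v: v + 1] * 16       # 16 toy "layers"
print(len(forward_store_all(0, layers)))     # 17 stored activations
print(len(forward_checkpointed(0, layers)))  # 5: only the checkpoints
```

The backward pass then re-runs each segment forward from its checkpoint, so you pay roughly one extra forward pass in compute for a large reduction in activation memory.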
Start with fine-tuning, not pretraining. If your use case can be served by fine-tuning an existing open-source model, that's almost always the right call economically. You get 90%+ of the capability at 1–5% of the cost.
Optimize your data. More data isn't always better. The Chinchilla paper showed that many large models were undertrained relative to their size: you can often get better performance from a smaller model trained on more high-quality data. Quality filters and deduplication improve efficiency dramatically.
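Exact deduplication, one of the cheapest of those filters, can be as simple as hashing normalized text and keeping the first occurrence. A minimal sketch (production pipelines typically add fuzzy matching such as MinHash on top of this):

```python
import hashlib

def dedupe(docs):
    """Drop exact duplicates, comparing whitespace- and case-normalized text."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["Hello world", "hello   WORLD", "Different document"]
print(dedupe(docs))  # ['Hello world', 'Different document']
```

Every duplicate you drop is a token you don't pay to train on, so this pays for itself almost immediately at scale.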
The democratization story
Here's the thing: training costs are falling fast. GPT-3 (175B parameters) reportedly cost around $5M to train in 2020. Models with comparable capability can be trained for a fraction of that today, thanks to better architectures, more efficient training techniques, and hardware improvements.
The models that cost $5M to train in 2025 will likely be replicable for $500k in 2027. The frontier keeps moving, but access to capable AI is genuinely expanding every year.
For most teams, the practical question isn't "can we afford to train a frontier model" โ it's "can we build the product we want by fine-tuning something that already exists?" Usually the answer is yes.