Not all "AI Chips" do the same thing
Double-clicking on AI Infrastructure
In the first article, I wrote about why “GPU vs. CPU” is the wrong frame for AI infrastructure in 2026, and why the right question is which workload, which wall, which bet? This post is the next step: a walkthrough of every important type of AI accelerator on the market.
To make the examples consistent, I’ll use the same scenario throughout: serving Llama 3.3 70B - Meta’s open-weight model with 70 billion parameters that takes about 140 GB of memory to hold in 16-bit precision. When a user sends it a prompt, the model reads the prompt (prefill) and generates a response one token at a time (decode).
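The 140 GB figure is just parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming a round 70 billion parameters and decimal gigabytes (the function name is my own, for illustration only):

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


LLAMA_70B_PARAMS = 70e9  # "70 billion parameters", rounded as in the text

# Compare 16-bit weights with two common quantized precisions.
for bits in (16, 8, 4):
    gb = weight_memory_gb(LLAMA_70B_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gb:.0f} GB")
```

This counts weights only. A real deployment also needs room for the KV cache and activations, so treat these numbers as a floor rather than a sizing target.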
I’ll show how each chip handles this. Eight chips on the tour.
GPU - Nvidia H100/B200, AMD MI300X/MI350X
A general-purpose parallel processor with specialized hardware units (tensor cores) for matrix math. Originally designed for graphics, evolved over fifteen years to handle AI workloads. Each chip has hundreds of small processors, tens to hundreds of gigabytes of fast memory (High Bandwidth Memory - HBM), and high-speed links to other GPUs in the same server.
Llama 3.3 70B example: The 140 GB model doesn’t fit on a single H100 (80 GB), though it does fit on a single B200 (192 GB). On H100s, you split the model across at least two GPUs and connect them with NVLink. The two GPUs work in parallel during prefill and pass activations to each other during decode. A typical setup serves dozens of concurrent users at 50-100 tokens per second per user. The same hardware can also fine-tune the model by running a backward pass to compute gradients and update the weights.
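Two quick calculations sit behind this example: how many H100s it takes to hold the weights, and how fast a single stream can decode if every token has to read every weight from HBM once. The sketch below is a back-of-envelope estimate, not a benchmark. The 3.35 TB/s bandwidth figure is my assumption, taken from Nvidia's published H100 SXM spec rather than from this article, and the bound ignores the KV cache, interconnect overhead, and batching.

```python
import math

WEIGHTS_GB = 140.0           # Llama 3.3 70B at 16-bit, from the text
H100_HBM_GB = 80.0           # from the text
H100_HBM_TB_PER_SEC = 3.35   # assumption: published H100 SXM spec


def gpus_needed(weights_gb: float, hbm_gb_per_gpu: float) -> int:
    """Minimum GPU count whose combined HBM can hold the weights alone."""
    return math.ceil(weights_gb / hbm_gb_per_gpu)


def decode_upper_bound_tokens_per_sec(
    weights_gb: float, num_gpus: int, hbm_tb_per_sec_per_gpu: float
) -> float:
    """Single-stream ceiling: aggregate HBM bandwidth divided by weight bytes,
    since each decoded token streams every weight through the chip once."""
    aggregate_gb_per_sec = num_gpus * hbm_tb_per_sec_per_gpu * 1000
    return aggregate_gb_per_sec / weights_gb


n = gpus_needed(WEIGHTS_GB, H100_HBM_GB)
bound = decode_upper_bound_tokens_per_sec(WEIGHTS_GB, n, H100_HBM_TB_PER_SEC)
print(f"H100s needed for weights: {n}")
print(f"Single-stream decode ceiling: {bound:.1f} tokens/sec")
```

Under these assumptions the ceiling for two H100s lands a little under 50 tokens/sec, near the low end of the range above. Faster per-user rates generally come from more GPUs, lower precision, or decoding tricks, which is why decode is usually described as memory-bandwidth-bound.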
Strengths:
Runs anything — training, inference, classical computing, custom research code.
The mature software ecosystem (CUDA, PyTorch) is fifteen years deep.
Weaknesses:
Expensive per chip and per watt.
TPU - Google v5p, Trillium, Ironwood
A chip built around a systolic array — a grid of arithmetic units wired directly to their neighbors, where data flows through the grid in a wave pattern and emerges as results. Optimized for dense matrix multiplication, which is the dominant operation in neural networks. Designed by Google specifically for their internal workloads and the Google Cloud platform.
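To see what “data flows through the grid in a wave” means, here is a toy cycle-by-cycle simulation in plain Python. It models one textbook variant (output-stationary, where each cell accumulates one entry of the result) and is not a description of how any particular TPU generation is wired. All names are my own.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of cells."""
    n, k_dim, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # each cell's running sum (one result entry)
    a_reg = [[0] * m for _ in range(n)]  # A value currently passing through each cell
    b_reg = [[0] * m for _ in range(n)]  # B value currently passing through each cell

    # The last useful product reaches the bottom-right cell at cycle n + m + k - 3.
    for t in range(n + m + k_dim - 2):
        # Visit cells from bottom-right to top-left so each cell reads its
        # neighbours' values from the previous cycle before they are overwritten.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    # Left edge: row i of A enters, delayed (skewed) by i cycles.
                    k = t - i
                    a_in = A[i][k] if 0 <= k < k_dim else 0
                else:
                    a_in = a_reg[i][j - 1]  # A values move left to right
                if i == 0:
                    # Top edge: column j of B enters, delayed (skewed) by j cycles.
                    k = t - j
                    b_in = B[k][j] if 0 <= k < k_dim else 0
                else:
                    b_in = b_reg[i - 1][j]  # B values move top to bottom
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in    # multiply-accumulate in place
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The point of the design is visible in the inner loop: a cell only ever talks to its left and upper neighbours, so operands are fetched from memory once at the edges and then reused as they travel across the grid.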
Llama 3.3 70B example: On a TPU v5p, the model is split across multiple chips connected by Google’s interconnect (ICI) in a 3D torus topology. Google’s compiler (XLA) handles the splitting automatically. Running Llama 3.3 specifically on TPUs requires PyTorch-XLA, which works but is less mature than running it on a GPU.
Strengths:
Very efficient at dense matrix multiplication.
Scales to thousands of chips as one logical machine.
Weaknesses:
Available exclusively on Google Cloud.
Software ecosystem outside the JAX AI Stack is thinner than CUDA’s.
LPU - Groq
A chip that holds all model weights in fast on-chip SRAM rather than HBM. SRAM is roughly 10-100× faster than HBM but much smaller - Groq has 230 MB per chip versus 192 GB for a Blackwell GPU. To compensate, Groq shards the model across hundreds of chips, with each chip holding a slice. The compiler determines ahead of time exactly what every chip does on every cycle, including cross-chip data movement.
Llama 3.3 70B example: Holding the 140 GB of weights takes more than 600 Groq chips in aggregate, with each chip storing about 230 MB. The chips operate in lockstep, passing activations to one another at SRAM speeds. As a result, single-stream generation speed rises from ~50–100 tokens/sec on a GPU to 250–500+ tokens/sec on Groq, which means lower latency per token. Decode dominates the experience of using an LLM, so this is a meaningful difference for latency-sensitive applications.
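The chip count follows directly from dividing the weight footprint by per-chip SRAM. A small sketch of that division, using only the 140 GB and 230 MB figures above and assuming decimal units (the function name is illustrative):

```python
import math


def chips_to_hold_weights(weights_gb: float, sram_mb_per_chip: float) -> int:
    """Smallest number of chips whose combined SRAM can hold the weights alone."""
    weights_mb = weights_gb * 1000  # decimal units: 1 GB = 1000 MB
    return math.ceil(weights_mb / sram_mb_per_chip)


SRAM_MB_PER_CHIP = 230  # per-chip SRAM, from the text

print(chips_to_hold_weights(140, SRAM_MB_PER_CHIP))  # 16-bit weights -> 609
print(chips_to_hold_weights(70, SRAM_MB_PER_CHIP))   # 8-bit weights  -> 305
```

This is a floor for weights only. Activations and the KV cache also need SRAM, so an actual deployment would likely use more chips than this. The second line shows why lower precision matters so much on this architecture: halving the bytes per weight roughly halves the hardware.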
Strengths:
Extremely low latency on token generation.
Predictable performance because the compiler schedules everything.
Weaknesses:
High chip count for large models.
Designed primarily for inference, not training.
Smaller software ecosystem than CUDA.
WSE - Cerebras
A chip the size of an entire silicon wafer - roughly 46,000 square millimeters, about 57× larger than a Blackwell die. The Wafer Scale Engine 3 has 900,000 cores and 44 GB of on-chip SRAM, all on one piece of silicon. There’s no off-chip memory in the standard sense; everything stays on-die.
Llama 3.3 70B example: The 140 GB of weights doesn’t fit in the wafer’s 44 GB SRAM, so Cerebras streams weights from a separate appliance (MemoryX) into the wafer during inference. For models that do fit (smaller models, or quantized versions of Llama 3.3 70B), the wafer acts as one giant chip - no chip-to-chip interconnect bottleneck, all communication is on-die.
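Whether a model “fits” comes down to how many bits per parameter the 44 GB of SRAM can afford. A rough check, counting weights only and assuming decimal gigabytes (helper names are mine):

```python
WSE3_SRAM_GB = 44.0       # on-chip SRAM, from the text
LLAMA_70B_PARAMS = 70e9   # rounded parameter count


def max_bits_per_param(capacity_gb: float, num_params: float) -> float:
    """Largest average bits per parameter that still fits in the given capacity."""
    return capacity_gb * 1e9 * 8 / num_params


def fits_on_die(num_params: float, bits_per_param: int,
                capacity_gb: float = WSE3_SRAM_GB) -> bool:
    """True if the weights alone fit in on-die SRAM at this precision."""
    return num_params * bits_per_param / 8 <= capacity_gb * 1e9


print(f"Budget: {max_bits_per_param(WSE3_SRAM_GB, LLAMA_70B_PARAMS):.2f} bits per parameter")
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit fits on die: {fits_on_die(LLAMA_70B_PARAMS, bits)}")
```

With these inputs the budget works out to about 5 bits per parameter, so a 4-bit quantized 70B model fits on a single wafer while 8-bit and 16-bit versions do not. Leaving space for activations and the KV cache would tighten that further.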
Strengths:
Eliminates the chip-to-chip interconnect problem for models that fit on-die.
Simple programming model when the model fits.
Weaknesses:
System cost is high, at roughly $1.56 million per CS-3 node.
Software ecosystem is small.
DPU - Nvidia BlueField, AWS Nitro
A specialized chip that handles infrastructure work - networking, storage protocols, security, multi-tenant isolation - that would otherwise consume host CPU cycles. Sits in the server alongside the CPU and GPU. Has its own ARM cores, network silicon, and accelerators for crypto and packet processing.
Llama 3.3 70B example: When users send prompts to your Llama 3.3 70B service, packets arrive at the server’s network interface. A DPU processes them - TLS termination, routing, tenant isolation - and hands them to the host CPU, which orchestrates the GPU. Without the DPU, the host CPU would spend significant cycles on this infrastructure work, reducing the rate at which it can feed the GPU. At thousand-GPU scale, this becomes a bottleneck. The DPU also enables direct GPU-to-storage transfers (GPUDirect) so model weights and datasets can load without round-tripping through host memory.
Strengths:
Frees host CPU for application work.
Enables high-bandwidth I/O patterns that pure CPUs can’t sustain.
Weaknesses:
Adds complexity to the server architecture.
Doesn’t accelerate AI computation directly - it handles everything around it.
NPU - Apple Neural Engine, Qualcomm Hexagon
A small, low-power AI accelerator integrated into mobile and laptop SoCs. Optimized for INT8/INT4 inference of small-to-medium models within tight power budgets (watts, not kilowatts). Found in phones, tablets, laptops, cars, and cameras.
Llama 3.3 70B example: The full Llama 3.3 70B doesn’t fit on any current NPU. But a smaller model in the same family - Llama 3.2 3B with 4-bit quantization - fits on a flagship phone’s NPU and can generate ~10–30 tokens/sec on-device. A common architecture is to run the small model locally for privacy and latency, falling back to a cloud-hosted Llama 3.3 70B only for queries the small model can’t handle.
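The local-first, cloud-fallback pattern can be sketched in a few lines. Everything here is hypothetical: the `Answer` type, the confidence score, the 0.7 threshold, and both stub models are placeholders I made up to show the control flow, not a real NPU or Llama API. How an app actually decides that the small model “can’t handle” a query is a design choice this article does not specify.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # hypothetical score in [0, 1]; real apps may use another signal
    source: str        # "on-device" or "cloud"


Model = Callable[[str], Answer]


def route(prompt: str, local_model: Model, cloud_model: Model,
          min_confidence: float = 0.7) -> Answer:
    """Try the on-device model first; call the cloud only if it is not confident."""
    local = local_model(prompt)
    if local.confidence >= min_confidence:
        return local              # prompt never leaves the device
    return cloud_model(prompt)    # fallback to the large hosted model


# Stub models for illustration only.
def fake_local_model(prompt: str) -> Answer:
    # Pretend the small model is only confident on short prompts.
    confidence = 0.9 if len(prompt.split()) <= 8 else 0.3
    return Answer(f"[local] {prompt}", confidence, "on-device")


def fake_cloud_model(prompt: str) -> Answer:
    return Answer(f"[cloud] {prompt}", 0.95, "cloud")


print(route("What time is it?", fake_local_model, fake_cloud_model).source)
long_prompt = "Compare the tax implications of these three contracts in detail for me"
print(route(long_prompt, fake_local_model, fake_cloud_model).source)
```

The ordering is the design choice that matters: the cheap, private path runs first, and the network round trip is only paid when the local answer falls below the threshold.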
Strengths:
Runs AI on battery power without draining it.
Enables on-device inference for privacy and latency.
Weaknesses:
Memory and compute too small for frontier models.
Trainium and Inferentia - AWS
AWS’s custom silicon, with a clear division:
Trainium chips are designed for training; Inferentia chips are designed for inference.
Both use a systolic-array-based architecture similar to TPUs, but with a different software stack (the AWS Neuron SDK) and a different interconnect (NeuronLink).
The current generation is Trainium3 on a 3nm process.
Llama 3.3 70B example: On AWS, you can serve Llama 3.3 70B on Trainium2 instances using the Neuron SDK. The Trn2 UltraServer connects 64 Trainium2 chips through NeuronLink, presenting them as one logical machine - large enough to hold the model and serve many concurrent users. Inferentia2 instances are the lower-cost option specifically optimized for inference serving. For training or fine-tuning Llama-class models, Trainium2 supports the workload at scale.
Strengths:
Often cheaper per token than Nvidia hardware within AWS.
Tight integration with the rest of AWS (S3, EFA networking, EKS).
Weaknesses:
Neuron SDK has a smaller community than CUDA.
Other hyperscaler custom silicon - Microsoft Maia, Meta MTIA
Microsoft’s Maia deploys in Azure with custom liquid-cooled rack designs. The second-generation Maia 200 (announced Jan 2026) specializes in high-efficiency, low-precision computing (FP4/FP8) for Copilot, GPT-5.2, and agentic AI. Meta’s MTIA (Meta Training and Inference Accelerator) started on recommendation and ranking inference and is extending to GenAI inference, with cost optimization as the goal.
Llama 3.3 70B example: Most third-party Llama 3.3 70B serving currently runs on Nvidia hardware, with Maia handling specific first-party Microsoft and OpenAI workloads. MTIA isn’t generally available to external customers - it’s primarily for Meta’s internal serving needs.
Strengths:
Cost efficiency.
Weaknesses:
Less external availability than the other chips discussed here.
A simple way to think about it
Every chip on this list is a different answer to the same question: given the three walls (compute, memory capacity, memory bandwidth) and the three workloads (training, prefill, decode), which trade-off do we make?
GPU - optimized for generality
TPU - optimized for dense matrix multiplication at scale
LPU (Groq) - optimized for decode latency, at the cost of a high chip count for large models
WSE (Cerebras) - optimized for on-die bandwidth, at the cost of a high system price
DPU - optimized for I/O offload; by design, it doesn’t accelerate AI compute directly
NPU - optimized for edge inference, but too small for frontier models
Trainium / Inferentia - optimized for cost efficiency within AWS
Maia / MTIA - optimized for first-party hyperscaler workloads
When you’re sizing infrastructure, the right question is which trade-off your workload can absorb.
If you’re serving a frontier model with a thousand concurrent users on a multi-cloud strategy, you probably want GPUs.
If you’re serving a popular open-weight model where token generation latency is the SLA, Groq or Cerebras Inference are worth a look.
If you’re training a Llama-class model from scratch on AWS, Trainium deserves the cost comparison.
If you’re shipping consumer apps, you’ll touch NPUs.
There isn’t a best chip. There’s a chip whose trade-offs match your workload.
What’s next
In the next article, I’ll cover what happens when you take these chips and try to build a supercomputer out of them - the memory next to the chip (HBM), how chips talk to each other (NVLink, ICI, NeuronLink), and how racks talk to other racks (InfiniBand, AI-grade Ethernet).
