This is commonly cited, but not strictly true.
Prompt processing is completely compute limited. And at high batch sizes, where the weights are read once and reused across many tokens generated in parallel, token generation is also quite compute limited. Obviously you want enough bandwidth to keep the compute fed, but it's very compute heavy.
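To make the "weights read once, reused for many tokens" point concrete, here is a rough roofline-style sketch. All the numbers (bytes per weight, the 100 FLOPs/byte ridge point) are illustrative assumptions, not measurements of any particular chip:

```python
# Back-of-the-envelope roofline sketch: how arithmetic intensity grows with
# batch size. Every weight is read from memory once per forward pass, while
# the FLOPs scale with the number of tokens processed in that pass.
# All numbers below are assumed for illustration.

def arithmetic_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """FLOPs per byte of weights read, per forward pass.

    Each weight contributes ~2 FLOPs (multiply + add) per token it is
    applied to, and is streamed from memory once per pass.
    """
    flops_per_weight = 2.0 * batch_size
    return flops_per_weight / bytes_per_weight

# Hypothetical hardware "ridge point": peak FLOP/s divided by memory bandwidth.
# e.g. a GPU with ~100 TFLOP/s and ~1 TB/s sits around 100 FLOPs/byte.
ridge_point = 100.0

for bs in (1, 8, 64, 512):
    ai = arithmetic_intensity(bs)
    regime = "compute-bound" if ai >= ridge_point else "bandwidth-bound"
    print(f"batch {bs:4d}: {ai:6.1f} FLOPs/byte -> {regime}")
```

Prompt processing behaves like the large-batch case, since all the prompt tokens go through the weights in one pass.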
You can see this for yourself: run ~10 prompts in parallel on a CPU in llama.cpp and it will slow to a crawl, while a GPU with a narrow memory bus won't slow down much.
Training is a bit more complicated, but that’s not doable on CPUs anyway.
Now, local inference (i.e. batch size 1) is, past prompt processing, heavily bandwidth limited. This is why hybrid CPU/GPU inference works alright. But this doesn't really apply to servers, which process many users in parallel with each "pass".
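For the batch-size-1 case, a quick back-of-the-envelope bound shows why bandwidth is the ceiling: every weight has to be streamed from memory once per generated token, so tokens/s can't exceed bandwidth divided by model size. The bandwidth and model-size figures below are assumed, round numbers, not benchmarks:

```python
# Rough ceiling on single-user (batch size 1) decode speed, assuming the whole
# model must be read from memory once per generated token.

def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s when decode is purely bandwidth-limited."""
    return bandwidth_bytes_per_s / model_bytes

model_bytes = 7e9 * 2   # e.g. a ~7B-parameter model in 16-bit weights (assumed)
cpu_bw      = 80e9      # ~80 GB/s dual-channel desktop DDR5 (assumed)
gpu_bw      = 1000e9    # ~1 TB/s high-end GPU (assumed)

print(f"CPU ceiling: ~{max_tokens_per_second(model_bytes, cpu_bw):.1f} tok/s")
print(f"GPU ceiling: ~{max_tokens_per_second(model_bytes, gpu_bw):.1f} tok/s")
```

Once many users share each pass, the same weight read is amortized across all of them and the bottleneck shifts back to compute, which is the server case above.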