This is commonly cited, but not strictly true.
Prompt processing is completely compute limited. And at high batch sizes, where the weights are read once and reused across many tokens generated in parallel, token generation is also quite compute limited. Obviously you want enough bandwidth to keep the compute fed, but it's very compute heavy.
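To make the "weights read once, reused for many tokens" point concrete, here is a rough roofline-style sketch. All the numbers (bytes per weight, the 100 FLOPs/byte ridge point) are illustrative assumptions, not measurements of any particular chip:

```python
# Back-of-the-envelope roofline sketch: how arithmetic intensity grows with
# batch size. Every weight is read from memory once per forward pass, while
# the FLOPs scale with the number of tokens processed in that pass.
# All numbers below are assumed for illustration.

def arithmetic_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """FLOPs per byte of weights read, per forward pass.

    Each weight contributes ~2 FLOPs (multiply + add) per token it is
    applied to, and is streamed from memory once per pass.
    """
    flops_per_weight = 2.0 * batch_size
    return flops_per_weight / bytes_per_weight

# Hypothetical hardware "ridge point": peak FLOP/s divided by memory bandwidth.
# e.g. a GPU with ~100 TFLOP/s and ~1 TB/s sits around 100 FLOPs/byte.
ridge_point = 100.0

for bs in (1, 8, 64, 512):
    ai = arithmetic_intensity(bs)
    regime = "compute-bound" if ai >= ridge_point else "bandwidth-bound"
    print(f"batch {bs:4d}: {ai:6.1f} FLOPs/byte -> {regime}")
```

Prompt processing behaves like the large-batch case, since all the prompt tokens go through the weights in one pass.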
You can see this for yourself: run ~10 prompts in parallel on a CPU in llama.cpp and it will slow to a crawl, while a GPU with a narrow memory bus won't slow down much.
Training is a bit more complicated, but that’s not doable on CPUs anyway.
Now, local inference (i.e. batch size 1) is, past prompt processing, heavily bandwidth limited. This is why hybrid CPU/GPU inference works alright. But this doesn't really apply to servers, which process many users in parallel with each "pass".
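For the batch-size-1 case, a quick back-of-the-envelope bound shows why bandwidth is the ceiling: every weight has to be streamed from memory once per generated token, so tokens/s can't exceed bandwidth divided by model size. The bandwidth and model-size figures below are assumed, round numbers, not benchmarks:

```python
# Rough ceiling on single-user (batch size 1) decode speed, assuming the whole
# model must be read from memory once per generated token.

def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s when decode is purely bandwidth-limited."""
    return bandwidth_bytes_per_s / model_bytes

model_bytes = 7e9 * 2   # e.g. a ~7B-parameter model in 16-bit weights (assumed)
cpu_bw      = 80e9      # ~80 GB/s dual-channel desktop DDR5 (assumed)
gpu_bw      = 1000e9    # ~1 TB/s high-end GPU (assumed)

print(f"CPU ceiling: ~{max_tokens_per_second(model_bytes, cpu_bw):.1f} tok/s")
print(f"GPU ceiling: ~{max_tokens_per_second(model_bytes, gpu_bw):.1f} tok/s")
```

Once many users share each pass, the same weight read is amortized across all of them and the bottleneck shifts back to compute, which is the server case above.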