

Not to speak of the new attention scheme and the (IIRC) MLP changes.
I’m very much looking forward to ik_llama.cpp implementing it. I don’t think I can quite fit Flash on my rig (hence no KTransformers for me), but with a little quantization of the sparse layers it’d be perfect.

It is! 143GB last I checked. I’m on 128GB RAM + a 3090, 1 NUMA node, so I think it’s juuust barely too tight. But it should be perfect with a few of the “sparsest” MoE layers quantized.
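For what it’s worth, here’s the back-of-envelope I’m doing in my head. Only the 143GB figure and my 128GB RAM + 24GB 3090 are real numbers; the expert-weight share, bits-per-weight, and runtime overhead below are placeholder guesses, not measurements:

```python
# Rough fit check: can the model squeeze into RAM + VRAM after re-quantizing
# the sparsest MoE layers? All constants besides 143 GB and 128 + 24 GB are guesses.
GiB = 1024 ** 3

total_model = 143 * GiB              # published size of the quant I'd run
ram, vram = 128 * GiB, 24 * GiB      # my rig (single NUMA node + 3090)
overhead = 12 * GiB                  # OS + KV cache + activations (guess)

# Assume ~90% of the weights sit in MoE expert tensors, and the sparsest third
# of those layers get re-quantized from ~4.5 down to ~3.5 bits per weight.
expert_bytes = 0.90 * total_model
requant_share = 1 / 3
savings = expert_bytes * requant_share * (1 - 3.5 / 4.5)

budget = ram + vram
print(f"saved ~{savings / GiB:.1f} GiB")
print("fits as-is:", total_model + overhead <= budget)
print("fits after re-quant:", total_model - savings + overhead <= budget)
```

With those made-up numbers it comes out to roughly a 9–10GB saving, which is about the margin I’m missing, hence “juuust barely too tight”.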
If KTransformers supports something like that, I may have to finally check it out, since v4 won’t need many esoteric features.