  • This is commonly cited, but not strictly true.

    Prompt processing is completely compute limited. And at high batch sizes, where the weights are read once for many tokens generated in parallel, token generation is also quite compute limited. Obviously you want enough bandwidth to match the compute, but it’s very compute heavy.

    You can see this for yourself. Try ~10 prompts in parallel on a CPU in llama.cpp, and it will slow to a crawl, while a GPU with a narrow bus won’t slow down much.

    Training is a bit more complicated, but that’s not doable on CPUs anyway.

    Now, local inference (aka batch size 1), past prompt processing, is heavily bandwidth limited. This is why hybrid inference works alright on CPUs. But this doesn’t really apply to servers, which process many users in parallel with each “pass”. (Rough numbers in the sketch below.)
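
    Here’s a minimal back-of-the-envelope sketch of that tradeoff. All the numbers (model size, bandwidth, FLOP/s) are assumptions for illustration, not benchmarks: at batch size 1 every generated token has to stream all of the weights from memory, while at higher batch sizes the same weight read serves many tokens and compute takes over as the bottleneck.

```python
# Toy roofline estimate: when is decode bandwidth-bound vs compute-bound?
# Every figure below is an assumption for illustration, not a measurement.

PARAMS = 8e9             # hypothetical 8B-parameter dense model
BYTES_PER_WEIGHT = 2     # fp16/bf16 weights
BANDWIDTH = 50e9         # ~50 GB/s, a rough desktop-CPU memory bandwidth
COMPUTE = 1e12           # ~1 TFLOP/s of sustained compute (assumed)

WEIGHT_BYTES = PARAMS * BYTES_PER_WEIGHT

def decode_tokens_per_s(batch: int) -> tuple[float, str]:
    """One decode step streams the weights once and does ~2*PARAMS FLOPs per token."""
    t_mem = WEIGHT_BYTES / BANDWIDTH          # time to read the weights once
    t_flops = 2 * PARAMS * batch / COMPUTE    # time for the batch's matmuls
    bound = "bandwidth" if t_mem > t_flops else "compute"
    return batch / max(t_mem, t_flops), bound

for b in (1, 4, 16, 64):
    rate, bound = decode_tokens_per_s(b)
    print(f"batch {b:>2}: ~{rate:6.1f} tok/s ({bound}-bound)")
```

    With these made-up numbers the crossover lands around batch 20: below that the memory bus is the limit, above it the matmuls are, which matches the CPU-vs-GPU behavior described above.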



  • No. Not even close. Non-US models are trained (and run) on peanuts compared to big US models, because they don’t have mega GPU farms and have no other option. Deepseek in particular went all-in on software architecture efficiency.

    …Ironically, the Nvidia GPU embargo was the best thing that ever happened to the Chinese devs. It made them thrifty.

    Many tried to warn US regulators of this, but they had AI Bros whispering in their ears. The US tech system is just too screwed up, I guess.







  • I mean, I use every alternative I can. Vapoursynth scripts, libraw-based projects, random GitHub repos, DaVinci…

    But there are some features I just can’t get great support for outside of definitely-not-high-seas Lightroom Classic:

    • Good lens profiles for weird lenses.

    • Proper HDR PQ/HLG editing and AVIF/JXL export support.

    • RAW support for newer cameras, like my little R50V.

    I have yet to try DaVinci’s photo editing mode though. That’s very interesting.


  • On a technical level, that makes zero sense.

    AI “agents” are basically just fancy prompts with a tool-calling harness. They are infinitely replicable, at zero cost, with no intrinsic value; the real cost is the generic CPU host and the API calls to GPU servers, databases, or whatever else, all of which are centralized anyway. (A minimal sketch of such a harness is below.)
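
    To make that concrete, here’s a minimal sketch of what such a harness amounts to. The call_model stub and the add tool are hypothetical stand-ins for a real LLM API and real tools; the point is that the “agent” itself is just a prompt plus a short loop.

```python
import json

# Toy "agent": a system prompt plus a loop that executes whatever tool call
# the model asks for and feeds the result back in. call_model() is a stub
# standing in for a real hosted LLM API; the single tool is a placeholder.

TOOLS = {"add": lambda a, b: a + b}

def call_model(messages: list[dict]) -> dict:
    """Stub LLM: requests one tool call, then answers once it sees the result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"answer": "The sum is 5."}

def run_agent(user_prompt: str, max_steps: int = 5) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful agent with tools."},
        {"role": "user", "content": user_prompt},
    ]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "answer" in reply:                             # model is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])    # run the tool call
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "step limit reached"

print(run_agent("What is 2 + 3?"))
```

    Everything of value (the weights, the GPUs serving them) sits behind call_model; the harness itself is a few dozen lines anyone can copy.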


    Wanna hear a dirty secret?

    “AI” cost is going to zero.

    Model capabilities aren’t scaling, but inference efficiency is exploding, thanks to resource-constrained labs and breakthroughs published in papers. The endgame of the current bubble is mediocre-but-useful tools anyone can host themselves, dirt cheap. Maybe a bit more reliable and refined than what we have now, but about as “intelligent.”

    And guess what?

    Microsoft can’t profit off that. None of the Tech Bros can.

    Point being, this exec is either delusional or jawboning so the world doesn’t realize that “AI” is a dumb utility/aid that they can’t make any profit off of.


  • To illustrate what I mean more clearly, look at the top comments/replies for the NASA Artemis posts, as an example.

    …It’s basically all conspiracy theorists, and government skeptics.

    Twitter steers the Artemis posts toward them because that’s what they want to see and what’s most engaging for them.

    In the EFF’s case, I’m not just talking about Musk’s influence. The algorithm will only show the EFF to users who would be highly engaged by it. E.g., angry skeptics who wouldn’t be swayed by the EFF anyway, or fans who already agree with the EFF. It’s literally not going to show the EFF to people who need to see it, as Twitter’s metrics would show it as unengaging.


    This is the “false image” I keep trying to dispel. Twitter is less and less the “even spread” of exposure people think it is (and that it sort of used to be), and more and more a hyper-focused bubble of what you want to hear, and only what you want to hear. All the changes Musk is making amplify that. Maybe that’s fine for some orgs, but there’s no point in the EFF staying in that kind of environment, regardless of ethics.








  • They seem to have held back the “big” locally runnable model.

    It’s also kinda conservative/old, architecture-wise: 16-bit weights, sliding-window attention interleaved with global attention (toy sketch of that pattern below). No MTP, no QAT (yet), no tightly integrated vision, no hybrid mamba like Qwen/Deepseek, nothing weird like that. That’s especially glaring since we know Google is using an exotic architecture for Gemini and has basically infinite resources for experimentation.
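
    For reference, here’s a toy sketch of what “sliding window attention interleaved with global attention” means. The even/odd layer split and the window size are made up for illustration, not the model’s actual config:

```python
import numpy as np

# Toy interleaving: even layers use a sliding-window causal mask, odd layers
# use full (global) causal attention. Window size and the even/odd split are
# illustrative assumptions, not the model's real configuration.

def causal_mask(seq_len: int, layer: int, window: int = 4) -> np.ndarray:
    q = np.arange(seq_len)[:, None]   # query positions
    k = np.arange(seq_len)[None, :]   # key positions
    mask = k <= q                     # causal: keys no later than the query
    if layer % 2 == 0:                # "local" layer: limited look-back
        mask &= (q - k) < window
    return mask

print(causal_mask(6, layer=0).astype(int))  # banded (sliding window)
print(causal_mask(6, layer=1).astype(int))  # full lower-triangular (global)
```

    Local layers keep the KV cache small, but it’s a well-trodden design at this point, which is the “conservative” part.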

    It also feels kinda “deep fried” like GPT-OSS to me, see: https://github.com/ikawrakow/ik_llama.cpp/issues/1572

    It acts erratic: it can’t do anything coherent without the proper chat template, or it goes off the rails.


    IMO it’s not very interesting, especially with so many other models that run really well on desktops.