

Not to speak of the new attention scheme and the (IIRC) MLP changes.
I’m very much looking forward to ik_llama.cpp implementing it. I don’t think I can quite fit Flash on my rig (hence no KTransformers for me), but with a little quantization of the sparse layers it’d be perfect.

It is! 143GB last I checked. I’m on 128GB RAM + a 3090, 1 NUMA node, so I think it’s juuust barely too tight. But it should be perfect with a few of the “sparsest” MoE layers quantized.
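For what it’s worth, here’s the back-of-envelope I’m doing in my head. Only the 143GB figure and my 128GB RAM + 24GB 3090 are real numbers; the expert-weight share, bits-per-weight, and runtime overhead below are placeholder guesses, not measurements:

```python
# Rough fit check: can the model squeeze into RAM + VRAM after re-quantizing
# the sparsest MoE layers? All constants besides 143 GB and 128 + 24 GB are guesses.
GiB = 1024 ** 3

total_model = 143 * GiB              # published size of the quant I'd run
ram, vram = 128 * GiB, 24 * GiB      # my rig (single NUMA node + 3090)
overhead = 12 * GiB                  # OS + KV cache + activations (guess)

# Assume ~90% of the weights sit in MoE expert tensors, and the sparsest third
# of those layers get re-quantized from ~4.5 down to ~3.5 bits per weight.
expert_bytes = 0.90 * total_model
requant_share = 1 / 3
savings = expert_bytes * requant_share * (1 - 3.5 / 4.5)

budget = ram + vram
print(f"saved ~{savings / GiB:.1f} GiB")
print("fits as-is:", total_model + overhead <= budget)
print("fits after re-quant:", total_model - savings + overhead <= budget)
```

With those made-up numbers it comes out to roughly a 9–10GB saving, which is about the margin I’m missing, hence “juuust barely too tight”.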
If KTransformers supports something like that, I may have to finally check it out, since v4 won’t need many esoteric features.