this post was submitted on 14 Oct 2024

Guide to Self Hosting LLMs Faster/Better than Ollama (lemmy.world)

submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]

I see a lot of talk of Ollama here, which I personally don't like because:

The quantizations they use tend to be suboptimal
It abstracts away llama.cpp in a way that, frankly, leaves a lot of performance and quality on the table.
It abstracts away things that you should really know for hosting LLMs.
I don't like some things about the devs. I won't rant, but I especially don't like the hint they're cooking up something commercial.

So, here's a quick guide to get away from Ollama.

First step is to pick your OS. Windows is fine, but if you're setting up something new, Linux is best. I favor CachyOS in particular, for its great Python performance. If you use Windows, be sure to enable hardware-accelerated GPU scheduling and disable shared memory.
Ensure the latest version of CUDA (or ROCm, if using AMD) is installed. Linux is great for this, as many distros package them for you.
Install Python 3.11.x, 3.12.x, or at least whatever your distro supports, and git. If on Linux, also install your distro's "build tools" package.

Now for actually installing the runtime. There are a great number of inference engines supporting different quantizations. Forgive the Reddit link, but see: https://old.reddit.com/r/LocalLLaMA/comments/1fg3jgr/a_large_table_of_inference_engines_and_supported/

As far as I am concerned, 3 matter to "home" hosters on consumer GPUs:

Exllama (and by extension TabbyAPI), as a very fast, very memory efficient "GPU only" runtime, supports AMD via ROCm and Nvidia via CUDA: https://github.com/theroyallab/tabbyAPI
Aphrodite Engine. While not strictly as VRAM efficient, it's much faster with parallel API calls, reasonably efficient at very short context, and supports just about every quantization under the sun and more exotic models than exllama. AMD/Nvidia only: https://github.com/PygmalionAI/Aphrodite-engine
This fork of kobold.cpp, which supports more fine grained kv cache quantization (we will get to that). It supports CPU offloading and I think Apple Metal: https://github.com/Nexesenex/croco.cpp

Now, there are also reasons I don't like llama.cpp, but one of the big ones is that sometimes its model implementations have... quality degrading issues, or odd bugs. Hence I would generally recommend TabbyAPI if you have enough vram to avoid offloading to CPU, and can figure out how to set it up. So:

Open a terminal, run git clone https://github.com/theroyallab/tabbyAPI.git
cd tabbyAPI
Follow this guide for setting up a python venv and installing pytorch and tabbyAPI: https://github.com/theroyallab/tabbyAPI/wiki/01.-Getting-Started#installing

This can go wrong. If anyone gets stuck, I can help with that.

Next, figure out how much VRAM you have.
Figure out how much "context" you want, aka how much text the LLM can ingest. If a model has a context length of, say, "8K", that means it can support 8K tokens as input, or less than 8K words. Not all tokenizers are the same: some, like Qwen 2.5's, can fit nearly a word per token, while others are more in the ballpark of half a word per token or less.
Keep in mind that the actual context length of many models is an outright lie, see: https://github.com/hsiehjackson/RULER
Exllama has a feature called "kv cache quantization" that can dramatically shrink the VRAM the "context" of an LLM takes up. Unlike llama.cpp's, its Q4 cache is basically lossless, and on a model like Command-R, an 80K+ context can take up less than 4GB! It's essential to enable Q4 or Q6 cache to squeeze as much LLM as you can into your GPU.
With that in mind, you can search huggingface for your desired model. Since we are using tabbyAPI, we want to search for "exl2" quantizations: https://huggingface.co/models?sort=modified&search=exl2
There are all sorts of finetunes... and a lot of straight-up garbage. But I will post some general recommendations based on total vram:
4GB: A very small quantization of Qwen 2.5 7B. Or maybe Llama 3B.
6GB: IMO llama 3.1 8B is best here. There are many finetunes of this depending on what you want (horny chat, tool usage, math, whatever). For coding, I would recommend Qwen 7B coder instead: https://huggingface.co/models?sort=trending&search=qwen+7b+exl2
8GB-12GB: Qwen 2.5 14B is king! Unlike its 7B counterpart, I find the 14B version of the model incredible for its size, and it will squeeze into this VRAM pool (albeit with very short context/tight quantization for the 8GB cards). I would recommend trying Arcee's new distillation in particular: https://huggingface.co/bartowski/SuperNova-Medius-exl2
16GB: Mistral 22B, Mistral Coder 22B, and very tight quantizations of Qwen 2.5 32B are possible. Honorable mention goes to InternLM 2.5 20B, which is alright even at 128K context.
20GB-24GB: Command-R 2024 35B is excellent for "in context" work, like asking questions about long documents, continuing long stories, anything involving working "with" the text you feed to an LLM rather than pulling from its internal knowledge pool. It's also quite good at longer contexts, out to 64K-80K more-or-less, all of which fits in 24GB. Otherwise, stick to Qwen 2.5 32B, which still has a very respectable 32K native context, and a rather mediocre 64K "extended" context via YaRN: https://huggingface.co/DrNicefellow/Qwen2.5-32B-Instruct-4.25bpw-exl2
32GB: same as 24GB, just with a higher bpw quantization. But this is also the threshold where lower bpw quantizations of Qwen 2.5 72B (at short context) start to make sense.
48GB: Llama 3.1 70B (for longer context) or Qwen 2.5 72B (for 32K context or less)
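To make the "kv cache quantization" point above concrete, here is a rough back-of-the-envelope sketch of how cache size scales with context length and cache precision. The layer count, KV head count and head dimension are illustrative assumptions, not values from this post (read the real ones from your model's config.json), and real quantized caches also store scaling factors, so actual usage is somewhat higher.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bits_per_element):
    """Approximate KV cache size: one key and one value vector per layer per token."""
    elements_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = keys + values
    return elements_per_token * context_tokens * bits_per_element / 8


# Illustrative architecture values (assumptions, check your model's config.json).
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
CONTEXT = 80 * 1024  # an "80K" context

for label, bits in [("FP16", 16), ("Q8", 8), ("Q6", 6), ("Q4", 4)]:
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bits) / 1024**3
    print(f"{label:>4} cache at {CONTEXT} tokens: {gib:.2f} GiB")
```

With these assumed numbers, a Q4 cache comes out at a quarter of the FP16 size, which is the whole reason a long context can fit next to the weights on a consumer card.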

Again, browse huggingface and pick an exl2 quantization that will cleanly fill your vram pool + the amount of context you want to specify in TabbyAPI. Many quantizers such as bartowski will list how much space they take up, but you can also just look at the available filesize.
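If a quantizer doesn't list sizes, a rough estimate is parameter count times bits per weight. The sketch below is only that arithmetic: the 1 GiB overhead figure and the 3 GiB cache figure are assumptions for illustration, and real exl2 files differ a little because some layers are kept at higher precision.

```python
def weights_gib(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3


def fits(vram_gib, params_billion, bits_per_weight, cache_gib, overhead_gib=1.0):
    """Return (fits?, GiB needed). overhead_gib is an assumed allowance for
    the runtime, activations and whatever else is using the GPU."""
    needed = weights_gib(params_billion, bits_per_weight) + cache_gib + overhead_gib
    return needed <= vram_gib, needed


# Example: a 32B model at 4.25 bpw with an assumed 3 GiB of cache.
for vram in (16, 24):
    ok, needed = fits(vram, 32, 4.25, cache_gib=3.0)
    print(f"{vram} GiB card: need about {needed:.1f} GiB -> {'fits' if ok else 'does not fit'}")
```

Treat the result as a first filter only. The real test is loading the model and watching VRAM usage.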

Now... you have to download the model. Bartowski has instructions here, but I prefer to use this nifty standalone tool instead: https://github.com/bodaay/HuggingFaceModelDownloader
Put it in your TabbyAPI models folder, and follow the documentation on the wiki.
There are a lot of options. Some to keep in mind are chunk_size (higher than 2048 will process long contexts faster but take up lots of VRAM, less will save a little VRAM), cache_mode (use Q4 for long context, Q6/Q8 for short context if you have room), max_seq_len (this is your context length), tensor_parallel (for faster inference with 2 identical GPUs), and max_batch_size (parallel processing if you have multiple users hitting the tabbyAPI server, but more VRAM usage).
Now... pick your frontend. The tabbyAPI wiki has a good compilation of community projects, but Open Web UI is very popular right now: https://github.com/open-webui/open-webui I personally use exui: https://github.com/turboderp/exui
And be careful with your sampling settings when using LLMs. Different models behave differently, but one of the most common mistakes people make is using "old" sampling parameters for new models. In general, keep temperature very low (<0.1, or even zero) and rep penalty low (1.01?) unless you need long, creative responses. If available in your UI, enable DRY sampling to tamp down repetition without "dumbing down" the model with too much temperature or repetition penalty. Always use a MinP of 0.05 or higher and disable other samplers. This is especially important for Chinese models like Qwen, as MinP cuts out "wrong language" answers from the response.
Now, once this is all set up and running, I'd recommend throttling your GPU, as it simply doesn't need its full core speed to maximize its inference speed while generating. For my 3090, I use something like sudo nvidia-smi -pl 290, which throttles it down from 420W to 290W.
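For anyone wondering what MinP actually does, here is a minimal sketch of the general idea: drop every candidate token whose probability is below min_p times the top token's probability, then renormalize. This is not exllama's implementation, and the token distribution is made up for illustration.

```python
def min_p_filter(probs, min_p=0.05):
    """Keep tokens with probability >= min_p * (top probability), then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Made-up next-token distribution with two unlikely "junk" candidates.
probs = {"the": 0.60, "a": 0.25, "this": 0.12, "的": 0.02, "zzz": 0.01}

filtered = min_p_filter(probs, min_p=0.05)
for tok, p in filtered.items():
    print(f"{tok!r}: {p:.3f}")
```

Because the cutoff scales with the top token, it trims the unlikely tail (like a stray wrong-language token) when the model is confident, but leaves more options open when the distribution is flat.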

Sorry for the wall of text! I can keep going, discussing kobold.cpp/llama.cpp, Aphrodite, exotic quantization and other niches like that if anyone is interested.

[–] [email protected] 1 points 1 month ago (1 children)

I just can't get ROCm / gpu generation to work on Bazzite, like at all. It seems completely cursed. I tried koboldcpp through a Fedora distrobox and it didn't even show any hardware options. Tried through an Arch AUR package through distrobox and the ROCm option is there but ends with a CUDA error. lol The Vulkan option works but seems to still use the CPU more than the GPU and is consequently still kinda slow and I struggle to find a good model for my 8GB card. Fimbulvetr-10.7B-v1-Q5_K_M for example was still too slow to be practical.

Tried LM Studio directly in Bazzite and it also just uses the CPU. It is also very obtuse about how to connect to it with SillyTavern, as it asks for an API key? I managed it once in the past, though I can't remember how, and it ended up no longer generating anything after a few replies.

Krita's diffusion also only runs on the CPU, which is abysmally slow, but I'm not sure if they expect Krita to be built directly on the system for ROCm support to work.

I'm not even trying to get SDXL or something to run at this point, since that seems to be still complicated enough even on a regular distro.

[–] [email protected] 2 points 1 month ago* (last edited 1 month ago) (1 children)

I don't like Fedora because its CUDA support is third party, and AFAIK they don't natively package ROCm. And it's too complex to use through something like distrobox... I don't want to tell you to switch OSes, but you'd have a much better time with CachyOS, which is also optimized for Steam gaming.

Alternatively, you could try installing ROCm images through docker (but you have to make sure GPU passthrough is working).

It also depends on your GPU. If you are on an RX 580, you can basically kiss rocm support goodbye, and might want to investigate mlc-llm's vulkan backend.

Fimbulvetr is ancient now. Your go-to models are Qwen 2.5 14B at short context or llama 3.1 8B/Qwen 2.5 7B at longer context.

[–] [email protected] 1 points 1 month ago (1 children)

I distrohopped so much after each previous distro eventually broke and me clearly not being smart enough to recover. I'm honestly kinda sick of it, even if the immutable nature also annoys the shit out of me.

My GPU is a 6650 XT, which should in principle work with ROCm.

Which model specifically are you recommending? Llama-3.1-8B-Lexi-Uncensored-V2-GGUF? Because the original meta-llama ones are censored to all hell and Huggingface is not particularly easy to navigate, on top of figuring out the right model size & quantization being extremely confusing.

[–] [email protected] 1 points 1 month ago* (last edited 1 month ago) (1 children)

Depends what you mean by censored. I never have a problem with Qwen or llama as long as I give them the right prompt and system prompt. It's not like an API model; they have to continue whatever response you give them.

And... For what? If you are just looking for like ERP, check out drummer's finetunes. Otherwise I tend to avoid "uncensored" finetunes as they dumb the model down a bit, but take your pick: https://huggingface.co/models?sort=modified&search=14B

But you are going to struggle if you can't get rocm working beyond very small context, as that means no flash attention anywhere.

Also, assuming you end up using kobold.cpp-rocm instead, I would use an IQ3_M or IQ3_XS GGUF quantization of a 14B model.

[–] [email protected] 1 points 1 month ago (2 children)

Well, anything remotely raunchy gets an "I cannot participate in explicit content" default reply.

I am using the rocm install of koboldcpp but as said, the ROCm option errors out with a CUDA error for some reason.

[–] [email protected] 1 points 1 month ago* (last edited 1 month ago)

Oh, and again, for raunchy, there are explicit "RP" finetunes, like: https://huggingface.co/TheDrummer

But you just need to set a good system prompt or start a reply with "Sure," and plain qwen or llama will write out unspeakable things.

[–] [email protected] 1 points 1 month ago (1 children)

What's the error? Did you manually override your architecture as an environment variable?

https://old.reddit.com/r/ROCm/comments/18z29l6/comment/kgeuguq/

https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU?tab=readme-ov-file#additional-information--installation-tips

You are gfx1032
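A minimal sketch of what "override your architecture as an environment variable" could look like when launching from Python. HSA_OVERRIDE_GFX_VERSION is the ROCm variable usually meant here. The value 10.3.0 is an assumption based on what is commonly reported for RDNA2 cards like gfx1032, not something confirmed in this thread, and the launch command in the comment is hypothetical.

```python
import os


def rocm_env(gfx_override, base_env=None):
    """Return a copy of the environment with the ROCm architecture override set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = gfx_override
    return env


# Assumed override value for a gfx1032 card; check the links above for your GPU.
env = rocm_env("10.3.0")
print("HSA_OVERRIDE_GFX_VERSION =", env["HSA_OVERRIDE_GFX_VERSION"])

# Hypothetical launch; adjust the command to however you start koboldcpp:
# import subprocess
# subprocess.run(["python", "koboldcpp.py"], env=env)
```

Setting it per process like this avoids changing the variable system-wide, which matters if other ROCm software on the machine should see the real architecture.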

[–] [email protected] 1 points 1 month ago (1 children)

ggml_cuda_compute_forward: ADD failed
CUDA error: shared object initialization failed
  current device: 0, in function ggml_cuda_compute_forward at ggml/src/ggml-cuda.cu:2365
  err
ggml/src/ggml-cuda.cu:107: CUDA error

I didn't do anything past using yay to install the AUR koboldcpp-hipblas package, and customtkinter, since the UI wouldn't work otherwise. The koboldcpp-rocm page very specifically does not mention any other steps in the Arch section and the AUR page only mentions the UI issue.

[–] [email protected] 1 points 1 month ago* (last edited 1 month ago) (1 children)

Mmmm, I would not use the AUR version, especially on Fedora. It probably relies on a bunch of Arch system packages, among other things.

Try installing the rocm fork directly, with its script: https://github.com/YellowRoseCx/koboldcpp-rocm?tab=readme-ov-file#linux

EDIT: There does seem to be a specific quirk related to Fedora.

[–] [email protected] 1 points 1 month ago

I'm not using Fedora, I'm using Bazzite, which is immutable and based on Silverblue. I use an Arch distrobox for this since I can't really install anything directly into the system. The script is what I tried originally in a Fedora distrobox, which did not work at all.