this post was submitted on 09 Jul 2025
0 points

LocalLLaMA

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.

Rules:

Rule 1 - No harassment or personal character attacks on community members, i.e. no name-calling, no generalizing about entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to that of maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text prediction like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

I have an unused Dell OptiPlex 7010 I wanted to use as a base for an inference rig.

My idea was to get a 3060, a PCIe riser, and a 500 W power supply just for the GPU. Mechanically speaking, I had the idea of making a backpack of sorts on the side panel to fit both the GPU and the extra power supply, since unfortunately it's an SFF machine.

What's making me wary of going through with it is the specs of the 7010 itself: it's a DDR3 system with a 3rd-gen i7-3770. I have the feeling that as soon as it ends up offloading some of the model into system RAM, it's going to slow down to a crawl. (Using koboldcpp, if that matters.)

Do you think it's even worth going through with?

[–] [email protected] 0 points 1 day ago (5 children)

I just got a second PSU just for powering multiple cards on a single bifurcated PCIe slot for a homelab-type thing. A snag I hit that you might be able to learn from: PSUs need to be turned on by the motherboard before they can power a GPU. You need a ~$15 relay board that passes the power-on signal from the motherboard to the second PSU, or it won't work.

It's gonna be slow as molasses partially offloaded onto regular RAM no matter what; it's not like DDR4 vs DDR3 is that different speed-wise, maybe a 10-15% increase if that. If you're partially offloading and not doing some weird optimized MoE type of offloading, expect 1-5 tokens per second (really more like 2-3).
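
Rough napkin math if you want a feel for where those numbers come from (the bandwidth figures are ballpark spec-sheet assumptions, not something I measured):

```python
# Napkin math: token generation is mostly memory-bandwidth bound, since every
# generated token has to stream the resident weights once.
# Bandwidth figures below are ballpark assumptions, not measurements.

model_gb = 8.0    # size of a quantized model file, e.g. ~8 GB
gpu_bw = 360.0    # RTX 3060 GDDR6, GB/s (spec-sheet ballpark)
ram_bw = 25.0     # dual-channel desktop RAM, GB/s (ballpark)

def tok_per_s(fraction_on_gpu: float) -> float:
    """Upper bound on tokens/s when a fraction of the weights sits in VRAM.
    The GPU part and the RAM part are read back to back each token,
    so the times add up and the slow part dominates."""
    gpu_time = (model_gb * fraction_on_gpu) / gpu_bw
    ram_time = (model_gb * (1.0 - fraction_on_gpu)) / ram_bw
    return 1.0 / (gpu_time + ram_time)

for frac in (1.0, 0.75, 0.5, 0.0):
    print(f"{frac:>4.0%} in VRAM -> ~{tok_per_s(frac):.1f} tok/s ceiling")
```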

If you're doing real inferencing work and need speed, then VRAM is king: you want to fit it all within the GPU. How much VRAM does the 3060 you're looking at have?

[–] [email protected] 0 points 1 day ago (4 children)

You need a ~$15 relay board that passes the power-on signal from the motherboard to the second PSU, or it won't work.

If you're talking about something like the add2PSU boards that jump the PS_ON line of the secondary power supply when the 12 V line of the primary one is ready, then I'm already on it the DIY way. Thanks for the heads-up though :-).

expect 1-5 tokens per second (really more like 2-3).

5 tokens per second would be wonderful compared to what I'm using right now, since it averages ~1.5 tok/s with 13B models (koboldcpp through Vulkan on a Steam Deck). My main reasons for upgrading are bigger context/models plus trying to speed up prompt processing, but I feel like the latter will also be handicapped by offloading to RAM.

How much VRAM does the 3060 you're looking at have?

I'm looking at the 12 GB version. I'm also giving myself space to add another one (most likely through an x1 mining riser) if I manage to save up for another card in the future, to bump it up to 24 GB with parallel processing, though I doubt I'll manage.

Sorry for the wall of text, and thanks for the help.

[–] [email protected] 0 points 1 day ago (3 children)

No worries :) A model fully loaded into the 12 GB of VRAM on a 3060 will give you a huge boost, around 15-30 TPS depending on the 3060's memory bandwidth and tensor cores. It's really such a big difference that once you get a properly fitting quantized model you're happy with, you probably won't be thinking of offloading to RAM again if you just want LLM inferencing. Check to make sure your motherboard supports PCIe bifurcation before you make any multi-GPU plans. I got super lucky with my motherboard allowing 4x4x4x4 bifurcation for potentially 4 GPUs, but I could have been screwed easily if it didn't.
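
If you want to sanity check whether a given quant will actually fit before downloading it, a rule of thumb like this is usually close enough (the context and overhead allowances are assumptions from my own setups, not exact accounting):

```python
# Rough "will it fit" check for a quantized model on a 12 GB card.
# The context and overhead allowances are rule-of-thumb assumptions.

def fits_in_vram(model_file_gb: float,
                 vram_gb: float = 12.0,
                 context_gb: float = 2.0,    # KV cache allowance, grows with context length
                 overhead_gb: float = 1.0):  # CUDA buffers, scratch space, desktop use
    """True if the model file plus context cache plus runtime overhead fits."""
    return model_file_gb + context_gb + overhead_gb <= vram_gb

for quant_gb in (4.9, 7.0, 8.5, 10.5):
    print(f"{quant_gb:>4} GB quant -> fits on 12 GB: {fits_in_vram(quant_gb)}")
```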

[–] [email protected] 0 points 23 hours ago* (last edited 22 hours ago) (1 children)

Is bifurcation necessary because of how CUDA works, or because of bandwidth constraints? Mostly asking because the secondary card will be limited by the x1 link mining risers have (and also because unfortunately both machines lack that capability :'-) ).

Also, if I offload layers to the GPU manually, so that only the context needs to overflow into RAM, will that be less of a slowdown, or will it be comparable to letting model layers spill into RAM? (Sorry for the question bombing, I'm trying to understand how much I can realistically push the setup before I pull the trigger.)

[–] [email protected] 0 points 18 hours ago* (last edited 15 hours ago) (1 children)

Oh, I LOVE to talk, so I hope you don't mind if I respond with my own wall of text :) It got really long, so I broke it up with headers. It's all hand-written by me though—no AI slop.

TLDR: Bifurcation is needed because of how fitting multiple GPUs on one PCIe x16 lane works and consumer CPU PCIe lane management limits. Context offloading is still partial offloading, so you'll still get hit with the same speed penalty—with the exception of one specific advanced partial offloading inference strategy involving MoE models.

CUDA

To be clear about CUDA, it's an API optimized for software to use NVIDIA cards. When you use an NVIDIA card with Kobold or another engine, you tell it to use CUDA as an API to optimally use the GPU for compute tasks. In Kobold's case, you tell it to use cuBLAS for CUDA.

The PCIe bifurcation stuff is a separate issue when trying to run multiple GPUs on limited hardware. However, CUDA has an important place in multi-GPU setups. Using CUDA with multiple NVIDIA GPUs is the gold standard for homelabs because it's the most supported for advanced PyTorch fine-tuning, post-training, and cutting-edge academic work.

But it's not the only way to do things, especially if you just want inference on Kobold. Vulkan is a universal API that works on both NVIDIA and AMD cards, so you can actually combine them (like a 3060 and an AMD RX) to pool their VRAM. The trade-off is some speed compared to a full NVIDIA setup on CUDA/cuBLAS.
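
For a concrete picture, here's roughly what picking one backend or the other looks like when launching Kobold from a script. The flag spellings are from memory and drift between releases, and the model path is made up, so check `--help` on your copy before trusting anything:

```python
# Sketch: picking the koboldcpp backend from a launcher script.
# Flag spellings are from memory; verify against `python koboldcpp.py --help`.
import subprocess

MODEL = "models/llama-3.1-8b-instruct.Q6_K.gguf"  # hypothetical path

common = ["--model", MODEL, "--gpulayers", "99", "--contextsize", "16384"]

# NVIDIA-only, fastest path: CUDA via cuBLAS
cublas_cmd = ["python", "koboldcpp.py", "--usecublas"] + common

# Vendor-neutral path: Vulkan, works across NVIDIA and AMD cards
vulkan_cmd = ["python", "koboldcpp.py", "--usevulkan"] + common

subprocess.run(cublas_cmd, check=True)  # or vulkan_cmd on a mixed/AMD box
```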

PCIe Bifurcation

Bifurcation is necessary in my case mainly because of physical PCIe port limits on the board and consumer CPU lane handling limits. Most consumer desktops only have one x16 PCIe slot on the motherboard, which typically means only one GPU-type device can fit nicely. Most CPUs only have 24 PCIe lanes, which is just enough to manage one x16 slot GPU, a network card, and some M.2 storage.

There are motherboards with multiple physical x16 PCIe slots and multiple CPU sockets for special server-class CPUs like Threadrippers with huge PCIe lane counts. These can handle all those PCIe devices directly at max speeds, but they're purpose-built server-class components that cost $1,000+ USD just for the motherboard. When you see people on homelab forums running dozens of used server-class GPUs, rest assured they have an expensive motherboard with 8+ PCIe x16 slots, two Threadripper CPUs, and lots of bifurcation. (See the bottom for parts examples.)

Information on this stuff and which motherboards support it is spotty—it's incredibly niche hobbyist territory with just a couple of forum posts to reference. To sanity check, really dig into the exact board manufacturer's spec PDF and look for mentions of PCIe features to be sure bifurcation is supported. Don't just trust internet searches. My motherboard is an MSI B450M Bazooka (I'll try to remember to get exact numbers later). It happened to have 4x4x4x4 compatibility—I didn't know any of this going in and got so lucky!

For multiple GPUs (or other PCIe devices!) to work together on a modest consumer desktop motherboard + CPU sharing a single PCIe x16, you have to:

  1. Get a motherboard that allows you to intelligently split one x16 PCIe lane address into several smaller-sized addresses in the BIOS
  2. Get a bifurcation expansion card meant for the specific splitting (4x4x4x4, 8x8, 8x4x4)
  3. Connect it all together cable-wise and figure out mounting/case modification (or live with server parts thrown together on a homelab table)

A secondary reason I'm bifurcating: the used server-class GPU I got for inferencing (Tesla P100 16GB) has no display output, and my old Ryzen CPU has no integrated graphics either. So my desktop refuses to boot with just the server card—I need at least one display-output GPU too. You won't have this problem with the 3060. In my case, I was planning a multi-GPU setup eventually anyway, so going the extra mile to figure this out was an acceptable learning premium.

Bifurcation cuts into bandwidth, but it's actually not that bad. Going from x16 to x4 only results in about 15% speed decrease, which isn't bad IMO. Did you say you're using a x1 riser though? That splits it to a sixteenth of the bandwidth—maybe I'm misunderstanding what you mean by x1.
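
If you do end up on a riser, you don't have to guess what link you actually got: on Linux the negotiated width and speed are exposed in sysfs. A quick sketch (these are standard kernel attributes; the class-code check just filters for display adapters):

```python
# Print the PCIe link each GPU actually negotiated, straight from sysfs.
from pathlib import Path

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    if not (dev / "current_link_width").exists():
        continue  # device without PCIe link attributes
    if not (dev / "class").read_text().startswith("0x03"):
        continue  # 0x03xxxx = display controller class
    width = (dev / "current_link_width").read_text().strip()
    speed = (dev / "current_link_speed").read_text().strip()
    max_w = (dev / "max_link_width").read_text().strip()
    print(f"{dev.name}: x{width} @ {speed} (device supports up to x{max_w})")
```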

I wouldn't obsess over multi-GPU setups too hard. You don't need to shoot for a data center at home right away, especially when you're still getting a feel for this stuff. It's a lot of planning, money, and time to get a custom homelab figured out right. Just going from Steam Deck inferencing to a single proper GPU will be night and day. I started with my decade-old ThinkPad inferencing Llama 3.1 8B at about 1 TPS, and it inspired me enough to dig out the old gaming PC sitting in the basement and squeeze every last megabyte of VRAM out of it. My 8GB 1070 Ti held me for over a year until I started doing enough professional-ish work to justify a proper multi-GPU upgrade.

Offloading Context

Offloading context is still partial offloading, so you'll hit the same speed issues. You want to use a model that leaves enough memory for context completely within your GPU VRAM. Let's say you use a quantized 8B model that's around 8GB on your 12GB card—that leaves 4GB for context, which I'd say is easily about 16k tokens. That's what most lower-parameter local models can realistically handle anyway. You could partially offload into RAM, but it's a bad idea—cutting speed to a tenth just to add context capability you don't need. If you're doing really long conversations, handling huge chunks of text, or want to use a higher-parameter model and don't care about speed, it's understandable. But once you get a taste of 15-30 TPS, going back to 1-3 TPS is... difficult.
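
If you're curious where the "4GB is easily about 16k tokens" ballpark comes from, the KV cache cost per token falls straight out of the model's shape. The numbers below are the published Llama 3.1 8B dimensions with an fp16 cache, so treat it as an estimate:

```python
# Ballpark KV cache sizing. Shape numbers are Llama 3.1 8B (32 layers,
# 8 KV heads via GQA, head dim 128); assumes an fp16 cache.
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2          # fp16
ctx = 16384                 # target context length in tokens

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
total_gib = per_token * ctx / 1024**3

print(f"{per_token // 1024} KiB per token -> {total_gib:.1f} GiB at {ctx} tokens")
```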

MoE

Note that if you're dead set on partial offloading, there's a popular way to squeeze performance through Mixture of Experts (MoE) models. It's all a little advanced and nerdy for my taste, but the gist is that you can use clever partial offloading strategies with your inferencing engine. You split up the different expert layers that make up the model between RAM and VRAM to improve performance—the unused experts live in RAM while the active expert layers live in VRAM. Or something like that.
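
The knob people use for this lives in llama.cpp (which Kobold builds on): a tensor-override option that pins the big expert FFN weights to CPU RAM while attention and the shared layers stay in VRAM. The flag spelling and the regex below are from memory and the model path is made up, so double-check against your build's help output before copying anything:

```python
# Sketch of the MoE partial-offload trick with a llama.cpp server build:
# keep attention + shared layers in VRAM, pin expert FFN tensors to CPU RAM.
# Flag names and the tensor regex are from memory; verify with --help.
import subprocess

cmd = [
    "llama-server",
    "--model", "models/some-moe-model.gguf",    # hypothetical path
    "--n-gpu-layers", "99",                     # offload everything by default...
    "--override-tensor", r"ffn_.*_exps.*=CPU",  # ...except the expert FFN weights
    "--ctx-size", "16384",
]
subprocess.run(cmd, check=True)
```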

I like to talk (in case you haven't noticed). Feel free to keep the questions coming—I'm happy to help and maybe save you some headaches.

Oh, in case you want to fantasize about parts shopping for a multi-GPU server-class setup, here are some links I have saved for reference. GPUs used for ML can be fine on 8 PCI lanes (https://www.reddit.com/r/MachineLearning/comments/jp4igh/d_does_x8_lanes_instead_of_x16_lanes_worsen_rtx/)

A Threadripper Pro has 128 PCI lanes: (https://www.amazon.com/AMD-Ryzen-Threadripper-PRO-3975WX/dp/B08V5H7GPM)

You can get dual sWRX8 motherboards: (https://www.newegg.com/p/pl?N=100007625+601362102)

You can get a PCIe 4x expansion card on Amazon: (https://www.amazon.com/JMT-PCIe-Bifurcation-x4x4x4x4-Expansion-20-2mm/dp/B0C9WS3MBG)

All together, that's 256 PCI lanes per machine, as many PCIe slots as you need. At that point, all you need to figure out is power delivery.

[–] [email protected] 0 points 12 hours ago* (last edited 12 hours ago)

Did you say you’re using a x1 riser though? That splits it to a sixteenth of the bandwidth—maybe I’m misunderstanding what you mean by x1.

Not exactly. What I mean by an x1 riser is one of these bad boys: they're basically extension cords for an x1 PCIe link, no bifurcation. The ThinkCentre has one x16 slot and two x1 slots. My idea for the whole setup was to have the 3060 I'm getting now in the x16 slot of the motherboard, so it can be used for other tasks as well if need be, while the second 3060 would go in one of the x1 slots via the riser, since from what I've managed to read that should only affect the time to first load the model. But the fact that you only mentioned the x16 slot does make me worry there's some handicap to the other two x1 slots.
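
I did do some napkin math to convince myself the one-time load over x1 is livable (per-lane figures are the theoretical PCIe numbers; a real riser will do a bit worse):

```python
# Napkin math: one-time model load over different links.
# Per-lane figures are theoretical PCIe throughput; cheap mining risers
# are usually PCIe gen2 or gen3 x1 electrically.
model_gb = 8.0

links_gb_s = {
    "PCIe 2.0 x1": 0.5,
    "PCIe 3.0 x1": 1.0,
    "PCIe 3.0 x16": 16.0,
}

for name, bw in links_gb_s.items():
    print(f"{name}: ~{model_gb / bw:.1f} s to push an {model_gb:.0f} GB model")
```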

Of course, the second card will come down the line; I don't have nearly enough money for two cards and the ThinkCentre :-P.

started with my decade-old ThinkPad inferencing Llama 3.1 8B at about 1 TPS

Pretty much the same story, but with the OptiPlex and the Steam Deck. Come to think of it, I do need to polish and share the scripts I wrote for the Steam Deck, since I designed them to be used without a dock; they're a wonderful gateway drug to this hobby :-).

there’s a popular way to squeeze performance through Mixture of Experts (MoE) models.

Yeah, that's a little too out of scope for me; I'm more practical with the hardware side of things, mostly due to lacking the hardware to really get into the more involved stuff. Though it's not out of the question for the future :-).

Tesla P100 16GB

I am somewhat familiar with these bad boys; we have an older PowerEdge server full of them at work, where it's used for fluid simulation (I'd love to see how it's set up, but can't risk bricking the workhorse). But the need to figure out a cooling system for these cards, plus the higher power draw, made it not really feasible on my budget, unfortunately.
