Wow. Doing some spring-cleaning? I might have one of those on my own small pile of e-waste. Can't even remember what kind of bandwidth the PCI bus had... probably enough to fill 128MB.
LocalLLaMA
Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open source neural network technology together.
Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.
As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.
Lol I was digging for other parts in my "hardware archive" and came across this. I had actually forgotten about non-express PCI and thought it was AGP for a minute LMAO
What is it? Oh I see the sticker now :-) yes, quite the beastly graphics card, so much VRAM!
This is a good time to ask: I want to use AI on a local server (DeepSeek maybe, image generators like Flux, ...). Is there a cheaper alternative to flagship Nvidia cards that can do it?
It's all about RAM and VRAM. You can buy some cheap RAM sticks, get your system to like 128GB of RAM, and run a low quant of the full DeepSeek. It won't be fast, but it will work. Now if you want fast, you need to be able to get the model into graphics card VRAM, ideally all of it. That's where the high-end Nvidia stuff comes in: getting 24GB of VRAM all on the same card at maximum bandwidth. Some people prefer Macs or data center cards. You can use AMD cards too, it's just not as well supported.
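If you want to see what the offloading knob actually looks like, here's a minimal sketch with llama-cpp-python (assuming that's your runtime; the GGUF filename is hypothetical, and n_gpu_layers is whatever fits on your card):

```python
# Minimal sketch: partial GPU offload with llama-cpp-python.
# The model path is hypothetical; point it at whatever quantized GGUF you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-32b-model-q4_k_m.gguf",  # hypothetical local file
    n_gpu_layers=30,  # layers pushed to VRAM; 0 = pure CPU/RAM, -1 = all layers
    n_ctx=4096,       # context window; bigger contexts eat more memory
)

out = llm("Explain GGUF quantization in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```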
LocalLLaMA users tend to use smaller models than the full DeepSeek R1 that fit on older cards. A 32B model partially offloaded between an older graphics card and RAM sticks is around the limit of what a non-dedicated hobbyist can achieve with their existing home hardware. Most are really happy with the performance of Mistral Small, Qwen QwQ, and the DeepSeek distills. Those that want more have the money to burn on multiple Nvidia GPUs and a server rack.
LLM-wise, your phone can run 1-4B models, your laptop 4-8B, and your older gaming desktop with a 4-8GB VRAM card around 8-32B. Beyond that you need the big expensive 24GB cards, and further beyond needs multiples of them.
Stable Diffusion models in my experience are very compute-intensive. Quantization degradation is much more apparent, so you want VRAM, a high-quant model, and a canvas size kept as low as tolerable.
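To make that concrete, here's a rough sketch with Hugging Face diffusers, assuming a CUDA card with limited VRAM; the checkpoint name is just an example, swap in whatever model you actually use:

```python
# Rough sketch: image generation on a VRAM-limited card with diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example checkpoint, not a recommendation
    torch_dtype=torch.float16,          # half precision keeps VRAM use down
)
pipe = pipe.to("cuda")
pipe.enable_attention_slicing()         # trades a little speed for lower peak VRAM

# Keep the canvas small; 512x512 is the sweet spot for SD 1.5 class models.
image = pipe("a watercolor of a rack server", height=512, width=512).images[0]
image.save("out.png")
```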
Hopefully we will get cheaper devices meant for AI hosting, like cheaper versions of Strix Halo and DIGITS.
Assuming you haven't ruled this out already, test your plans out now using whatever computer you already own. At the hobbyist level you can do a lot with 8GB of RAM and no graphics card. 7B LLMs are really good now and they're only going to get better.
From my reading, if you don't mind sacrificing speed (tokens/sec), you can run models in system RAM. To be usable though, you'd need at a minimum a dual-proc server/workstation for multichannel RAM and enough RAM to fit the model.
So for something like DS R1, you'd need like >512GB RAM
You are correct in your understanding. However, the last part of your comment needs a big asterisk. It's important to consider quantization.
The full F16 DeepSeek R1 GGUF from Unsloth requires 1.34TB of RAM. Good luck getting the RAM sticks and channels for that.
The Q4_K_M mid-range quant is 404GB, which would theoretically fit inside 512GB of RAM with leftover room for context.
512GB of RAM is still a lot; theoretically you could run a lower quant of R1 with 256GB of RAM. Not super desirable, but totally doable.
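If you want to sanity-check those numbers, the back-of-the-envelope math is just parameter count times bits per weight (the bits-per-weight values below are rough averages, and this ignores context/KV cache overhead):

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8.
# DeepSeek R1 has ~671B parameters; bits-per-weight figures are approximate averages.
PARAMS = 671e9

def approx_size_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

print(f"F16    ~{approx_size_gb(16):.0f} GB")   # ~1342 GB, i.e. the 1.34TB figure
print(f"Q4_K_M ~{approx_size_gb(4.8):.0f} GB")  # ~403 GB, close to the 404GB figure
print(f"Q2_K   ~{approx_size_gb(2.6):.0f} GB")  # ~218 GB, why a low quant squeezes into 256GB
```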
Depends on your goals. For raw tokens per second, yeah you want an Nvidia card with enough™ memory for your target model(s).
But if you don't care so much about speed beyond a certain point, or you're okay sacrificing some speed for economy, the AMD RX 7900 XT/XTX or 9070 both work pretty well for small to mid-sized local models.
Otherwise you can look at the SoC-type solutions like AMD Strix Halo or Nvidia DGX for more model size at the cost of speed, but always look for reputable benchmarks showing 'enough' speed for your use case.