LocalLLaMA

3306 readers
1 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped about the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members, i.e. no name-calling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text predictors like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago
MODERATORS
1
2
0
Homelab upgrade WIP (lemmy.world)
submitted 4 days ago* (last edited 4 days ago) by [email protected] to c/[email protected]
 
 

There's a lot more to this stuff than I thought there would be when starting out. I spent the day familiarizing myself with how to take apart my PC and swap GPUs, trying to piece everything together. Apparently, in order for a PC to start up right it needs a graphics driver. I thought the existence of an HDMI port on the motherboard implied the existence of onboard graphics, but apparently only special CPUs have that capability. My Ryzen 5 2600 doesn't. The Tesla P100 does not have display output capabilities. So I've hit a snag.

I'm going to try to run multiple cards together, and I hope I can mix an AMD RX 580 and an Nvidia Tesla on the same board. Fingers crossed, please work. My motherboard thankfully supports 4x4x4x4 PCIe x16 bifurcation (strangely not 8x8 or 8x4x4, for some reason), so I'm planning to get a bifurcation board, plug both cards in, and hope the AMD one is recognized! According to one source, the performance loss from using x4 lanes for the kind of GPU compute I'm doing is 10-15%, which is surprisingly tolerable actually.
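As a sanity check on that 10-15% figure, here's some back-of-envelope math on x16 vs x4 link bandwidth. This assumes PCIe 3.0 at roughly 0.985 GB/s of usable one-way bandwidth per lane (after encoding overhead); the per-lane number is approximate, and the penalty presumably stays small because most traffic happens once at model-load time.

```python
# Approximate usable one-way PCIe 3.0 bandwidth per lane, in GB/s
# (assumed figure after 128b/130b encoding overhead).
GBPS_PER_LANE = 0.985

def link_bandwidth(lanes: int) -> float:
    """Approximate usable one-way bandwidth in GB/s for a PCIe 3.0 link."""
    return lanes * GBPS_PER_LANE

print(f"x16: {link_bandwidth(16):.1f} GB/s")   # full-width slot
print(f"x4:  {link_bandwidth(4):.1f} GB/s")    # one leg of a 4x4x4x4 split
```

So each card on the bifurcation board gets roughly a quarter of the bus, which mostly just slows initial model loading rather than inference itself.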

I never really had to think about how pcie lanes work or how to allocate them properly before.

For now I'm using two power supplies: the one built into the desktop and a new Corsair 850e. I chose this one as it should work with 2-3 GPUs while being in my price range. Also, the new 12V-2x6 port supports like 600 W and comes with a dual PCIe split, which was required for the Tesla's adapter, so it all worked out nicely for a clean wire solution.

Sadly I fucked up a little. The plastic PCIe release latch on the motherboard was brittle, and I fat-thumbed it too hard while having problems removing the GPU initially, so it snapped off. I don't know if that's something fixable. Fortunately it doesn't seem to affect the security of the connection too badly. I intend to get a PCIe riser extension cable and have the GPU and bifurcation board laid out nicely on the homelab table while testing.

I need to figure out an external GPU mount system. I see people use server racks or nut-and-bolt metal chassis. Could I get a thin plate of copper the size of the desktop's glass window as a base/heatsink?

3
 
 

Hey fellow llama enthusiasts! Great to see that not all of Lemmy is AI-sceptical.

I'm in the process of upgrading my server with a bunch of GPUs. I'm really excited about the new Mistral / Magistral Small 3.2 models and would love to serve them for me and a couple of friends. My research led me to vLLM, with which I was able to double inference speed compared to Ollama, at least for qwen3-32b-awq.

Now, sadly, the most common quantization methods (GGUF, EXL, BNB) are either not fully (GGUF) or not at all (EXL) supported in vLLM, or don't support multi-GPU inference through tensor parallelism (BNB). And especially for new models, it's hard to find pre-quantized models in the other, more broadly supported formats (AWQ, GPTQ).

Do any of you face a similar problem? Do you quantize models yourselves? Are there any up-to-date guides you would recommend? Or did I completely overlook another, obvious solution?
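For what it's worth, once you do find (or make) an AWQ or GPTQ repo, serving it across two GPUs with vLLM is a one-liner. This is just a sketch; the model ID is an example and the context length is something I picked arbitrarily:

```shell
# Serve an AWQ-quantized model with tensor parallelism across 2 GPUs.
# Model ID is illustrative -- substitute whatever AWQ/GPTQ repo you find.
vllm serve Qwen/Qwen3-32B-AWQ \
    --tensor-parallel-size 2 \
    --max-model-len 16384
```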

It feels like when I've researched something yesterday, it's already outdated again today, since the landscape is evolving so rapidly.

Anyways, thank you for reading and sharing your thoughts or experience if you feel like it.

4
 
 

I've recently been writing fiction and using an AI as a critic/editor to help me tighten things up (as I'm not a particularly skilled prose writer myself). Currently the two ways I've been trying are writing text in a basic editor and then either saving files to feed to a hosted LLM or copy-pasting into a local one, or using PyCharm and AI integration plugins for it.

Neither is particularly satisfactory, and I'm wondering if anyone knows of a good setup for this (preferably open source, but not necessary); integration with at least one of Ollama or OpenRouter would be needed.
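In the meantime, one DIY option is a small script that sends a file straight to Ollama's REST API. This is only a sketch: the endpoint is Ollama's default, but the function names and the editor prompt wording are my own invention.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, manuscript: str) -> dict:
    """Wrap a chapter in an editor-style prompt (wording is illustrative)."""
    prompt = (
        "You are a fiction editor. Critique the following passage for "
        "prose tightness, pacing, and clarity. Suggest concrete edits.\n\n"
        + manuscript
    )
    return {"model": model, "prompt": prompt, "stream": False}

def critique(model: str, path: str) -> str:
    """Send a text file to a local Ollama model and return its critique."""
    with open(path, encoding="utf-8") as f:
        payload = build_payload(model, f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

You could point it at each saved chapter file instead of copy-pasting, and swap the URL for OpenRouter's if you go hosted.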

5
0
submitted 2 weeks ago* (last edited 2 weeks ago) by [email protected] to c/[email protected]
 
 

I'm looking to locally generate voiceovers from text and also try to generate audiobooks. Does anyone have experience with sherpa-onnx? There also appear to be two separate frontends for Kokoro specifically dedicated for audiobook creation, but they appear to both be abandoned. Or am I barking up the completely wrong tree?
Thanks!

6
 
 

It seems Mistral finally released their own version of Small 3.1 2503 with a CoT reasoning pattern baked in. Before this, the best CoT finetune of Small was DeepHermes, using DeepSeek's R1 distill patterns. According to the technical report, Mistral trained their own reasoning patterns for this one, so it's not just another DeepSeek distill finetune.

HuggingFace

Blog

Magistral technical report

7
 
 

I'm limited to 24GB of VRAM, and I need pretty large context for my use-case (20k+). I tried "Qwen3-14B-GGUF:Q6_K_XL," but it doesn't seem to like calling tools more than a couple times, no matter how I prompt it.

Tried using "SuperThoughts-CoT-14B-16k-o1-QwQ-i1-GGUF:Q6_K" and "DeepSeek-R1-Distill-Qwen-14B-GGUF:Q6_K_L," but Ollama or LangGraph gives me an error saying these don't support tool calling.
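In case it helps anyone debugging similar setups: when a model does support tools, Ollama's /api/chat endpoint expects them in the OpenAI-style function schema. A minimal sketch of building one (the roll_dice tool itself is made up for illustration):

```python
def make_tool(name: str, description: str, params: dict, required: list) -> dict:
    """Build an OpenAI-style tool definition as accepted by Ollama's /api/chat."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": params,
                "required": required,
            },
        },
    }

# Hypothetical example tool:
roll_dice = make_tool(
    "roll_dice",
    "Roll N dice with a given number of sides.",
    {"count": {"type": "integer"}, "sides": {"type": "integer"}},
    ["count", "sides"],
)
```

Models that weren't trained for tool use will reject or ignore this no matter how it's formatted, which is likely what the error about unsupported tool calling means.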

8
0
submitted 3 weeks ago* (last edited 3 weeks ago) by [email protected] to c/[email protected]
 
 

Sorry team, flipped the URLs around to prevent overflow from lemmy.world users

https://fly.io/blog/youre-all-nuts/

9
 
 

Hey everybody. I'm just getting into LLMs. Total noob. I started using llama-server's web interface, but I'm experimenting with a frontend called SillyTavern. It looks much more powerful, but there's still a lot I don't understand about it, and some design choices I found confusing.

I'm trying the Harbinger-24B model to act as a D&D-style DM, and to run one party character while I control another. I tried several general purpose models too, but I felt the Harbinger purpose-built adventure model was noticeably superior for this.

I'll write a little about my experience with it, and then some thoughts about LLMs and D&D. (Or D&D-ish. I'm not fussy about the exact thing, I just want that flavour of experience).

General Experience

I've run two scenarios. My first try was a 4/10 for my personal satisfaction, and the 2nd was 8/10. I made no changes to the prompts or anything between, so that's all due to the story the model settled into. I'm trying not to give the model any story details, so it makes everything up, and I won't know about it in advance. The first story the model invented was so-so. The second was surprisingly fun. It had historical intrigue, a tie-in to a dark family secret from ancestors of the AI-controlled char, and the dungeon-diving mattered to the overarching story. Solid marks.

My suggestion for others trying this is, if you don't get a story you like out of the model, try a few more times. You might land something much better.

The Good

Harbinger provided a nice mixture of combat and non-combat. I enjoy combat, but I also like solving mysteries and advancing the plot by talking to NPCs or finding a book in the town library, as long as it feels meaningful.

It writes fairly nice descriptions of areas you encounter, and thoughts for the AI-run character.

It seems to know D&D spells and abilities. It lets you use them in creative but very reasonable ways you could do in a pen and paper game, but can't do in a standard CRPG engine. It might let you get away with too much, so you have to keep yourself honest.

The Bad

You may have to try multiple times until the RNG gives you a nice story. You could also inject a story in the base prompt, but I want the LLM to act as a DM for me, where I'm going in completely blind. Also, in my first 4/10 game, the LLM forced really bad "main character syndrome" on me. The whole thing was about me, me, me, I'm special! I found that off-putting, but the 2nd 8/10 attempt wasn't like that at all.

As an LLM, it's loosey-goosey about things like inventory, spells, rules, and character progression.

I had a difficult time giving the model OOC instructions. OOC tended to be "heard" by other characters.

Thoughts about fantasy-adventure RP and LLMs

I feel like the LLM is very good at providing descriptions, situations, and locations. It's also very good at understanding how you're trying to be creative with abilities and items, and it lets you solve problems in creative ways. It's more satisfying than a normal CRPG engine in this way.

As an LLM though, it lets you steer things in ways you shouldn't be able to in an RPG with fixed rules; it's bad at things like disallowing a spell you don't know, or remembering how many feet of rope you're carrying. I enjoy the character leveling and crunchy stats part of pen-and-paper or CRPGs, and I haven't found a good way to get the LLM to do that without just handling everything manually and whacking it into the context.

That leads me to think that using an LLM for creativity inside a non-LLM framework that enforces rules, stats, spells, inventory, and abilities might be phenomenal. Maybe AI Dungeon does that? Never tried it, and anyway I want local. A hybrid system like that might be scriptable somehow, but I'm too much of a noob to know.
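A hybrid like that could start very small: a script owns the crunchy state and the dice, and the LLM only narrates. A toy sketch of the rules layer (all names and numbers here are simplified inventions, not any real system):

```python
import random

class Party:
    """Minimal rules layer: the LLM narrates, this code owns the numbers."""

    def __init__(self, known_spells, inventory):
        self.known_spells = set(known_spells)
        self.inventory = dict(inventory)  # item -> quantity

    def cast(self, spell: str) -> bool:
        """Disallow spells the character never learned."""
        return spell in self.known_spells

    def use(self, item: str, qty: int = 1) -> bool:
        """Track consumables like rope so the LLM can't hand-wave them."""
        if self.inventory.get(item, 0) < qty:
            return False
        self.inventory[item] -= qty
        return True

def attack_roll(modifier: int, armor_class: int) -> bool:
    """d20 + modifier vs armor class, rolled in code rather than by the LLM."""
    return random.randint(1, 20) + modifier >= armor_class
```

The glue would be the hard part: intercepting the player's action, checking it against this state, and only then handing the validated outcome to the LLM to describe.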

10
 
 

Hey everybody, brand new to running local LLMs, so I'm learning as I go. Also brand new to lemmy.

I have a 16 GB VRAM card, and I was running some models that would overflow 16GB by using the CPU+RAM to run some of the layers. It worked, but was very slow, even for only a few layers.

Well, I noticed llama.cpp has an rpc-server feature, so I tried it. It was very easy to use. Linux here, but probably similar on Windows or Mac. I had an older gaming rig sitting around with a GTX 1080 in it. Much slower than my 4080, but using it to run a few layers is still FAR faster than using the CPU. Night and day, almost.

The main drawbacks I've experienced so far are:

  • By default it tries to split the model evenly between machines. That's fine if you have the same card in all of them, but I wanted to put as much of the model as possible on the fastest card. You can do that using the --tensor-split parameter, but it requires some experimenting to get it right.

  • It loads the rpc machine's part of the model across the network every time you start the server, which can be slow on a 1 gigabit network. I didn't see any way to tell rpc-server to load the model from a local copy. It makes my startup go from 1-2 seconds up to like 30-50 sec.

  • Q8 quantized KV cache works, but Q4 does not.
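For concreteness, the setup looks roughly like this; the IP, port, and the 3,1 split ratio are illustrative, and the split needs the experimenting mentioned above:

```shell
# On the remote box with the slower card, start llama.cpp's RPC backend:
rpc-server --host 0.0.0.0 --port 50052

# On the main machine, point llama-server at it and weight the
# tensor split toward the faster local card (ratio is illustrative):
llama-server -m model.gguf \
    --rpc 192.168.1.50:50052 \
    --tensor-split 3,1 \
    -ngl 99
```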

Lots of people may not be able to run 2 or 3 GPUs in one PC, but might have another PC they can add over the network. Worth a try, I'd say, if you want more VRAM space.

11
 
 

AI bros won't hype this up in the news for sure, but 480x energy doesn't sound optimistic enough for replacement.

12
 
 

WOAH

13
14
 
 

So, I've never really used RAG before. I kept hearing about it and understood it in concept, so I wanted to play around with it. I play a tabletop map-drawing game called DELVE that has lots of rules to keep track of, requiring lots of manual page flips on my slow ereader.

I wanted to see if it was possible to teach my local model about DELVE through my engine's basic RAG system, which works by providing a big text file and doing some keyword searches in it based off user input. The engine pulls in local chunks of the text document around important keywords as context. This is different from more complex and sophisticated versions of RAG that work off fancy vector-space math or something like that. Whatever, kobold is my engine of choice and I'm happy to have something that half works and is simple enough to grasp.

So the first step was converting the game manual PDF to a text file. I did this using Calibre's ebook-convert tool. Then I looked through and tried to clean up some glaring issues; inevitably there's gonna be jank converting PDF unicode or tables.

sudo apt install calibre

ebook-convert delve.pdf delve.txt

Once it's cleaned up a little, I threw it into Kobold Lite's Context > TextDB tab and enabled it.

It's cool how such a simple process manages to let the model leverage local datasets. I was shocked when it correctly recited proper game rules.
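The keyword-window idea is simple enough to sketch in a few lines of Python. This is my own toy approximation of that kind of lookup, not Kobold's actual code; window size and keyword filter are arbitrary choices:

```python
def retrieve_chunks(text: str, query: str, window: int = 300, max_chunks: int = 3):
    """Toy keyword-window RAG: find query keywords in the document and
    return the surrounding text spans to use as prompt context."""
    keywords = [w.lower() for w in query.split() if len(w) > 3]
    lowered = text.lower()
    chunks = []
    for kw in keywords:
        idx = lowered.find(kw)
        if idx != -1:
            start = max(0, idx - window)
            chunks.append(text[start:idx + window])
        if len(chunks) >= max_chunks:
            break
    return chunks
```

The retrieved chunks then just get prepended to the prompt before the user's question, which is more or less what happens below.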

Full input/output shown in screenshot.

Input: How to play DELVE?

Output:

Okay, I need to figure out how to explain how to play DELVE based on the provided info snippets. Let's start by breaking down the key components mentioned.

First, the materials required are pencil, eraser, grid paper or blank paper, notepad for resources and trade goods, a deck of standard playing cards without jokers initially, and tokens for various purposes like dwarves, enemies, etc. The starting setup involves drawing a stairwell at the top center as the entrance. Resources (hearts ♥) and Trade Goods (diamonds ♦) start at 20 each. Then there's mention of Soldiers being placed at the entrance. Jokers are removed from the deck at first but added later at depth 5.

The turn structure seems to involve the following steps: prepare, explore, retreat. Each turn has phases like exploring, checking threats, downtime. There's talk about expanding the hold by adding rooms, handling threats through combat or avoidance, managing resources and trade goods.

The Bloodrunes are optional challenges that offer boons but also transport to a nightmare layer upon choosing them. Each rune has specific effects. For example, the Rune of Nightmares immediately starts the player in a nightmare layer with rewards for defeating ancient monsters.

There's a part about using cards instead of dice. Drawing suits corresponds to D4 results, while colors correspond to D2. That's important because the game uses these rolls frequently.

Rooms have various functions. Smelters let you convert ore into metal, Armories provide soldiers, Quarries generate stone, etc. Each room has costs and benefits. Some advanced rooms like Deep Delve or Burglar Alarms have special abilities.

Threats include monsters, thieves, and other events. Combat uses Strength and Defense with armor effects. Thieves steal half your trade goods unless you have burglar alarms. Monsters can cause damage or drain resources.

Trade is done by visiting the Marketplace on certain turns where you exchange goods. The Marketplace deck has different cards affecting trade values. You can sell excess goods or buy needed items at varying rates.

Ancient Ones are powerful enemies encountered as you delve deeper. They have high stats and unique abilities. Beating them gives significant rewards but poses risks if not prepared.

Endgame conditions involve reaching depth 10, accumulating wealth, or being overrun by threats. If you win, you get to set up a new hold with better starting conditions.

I need to structure this explanation step-by-step, covering setup, turn phases, actions in each phase, room building, handling threats, trading, and endgame conditions. Also mention the Bloodrunes as optional challenges with their own rules. Highlight how card draws replace dice rolls for D4 and D2. Maybe note that grid paper isn't mandatory, but it helps visualize the hold layout.

How to Play DELVE: Step-by-Step Guide

  1. Setup

    Materials Needed:
    Pencil/eraser, grid/notebook paper, notebook (for tracking resources), standard playing cards (Jokers removed initially), tokens (coins, dice, etc.).
    Start Your Hold:
    Draw a central stairwell on the top row of grid paper as the entrance.
    Resources:
    Begin with 20 Resources (♥) and 20 Trade Goods (♦) in your notebook. Add 5 Soldiers (tokenized) at the entrance.

  2. Deck & Game Layers

    Remove Jokers from the deck. At Depth 5, re-add them for "Legendary Finds."
    Use cards instead of dice:
    Roll 1D4: Suit = ♠=1, ♣=2, ♦=3, ♥=4.
    Roll 1D2: Color = Black=1, Red=2.

  3. Turn Structure

    Prepare Phase:

     Spend resources to build rooms (e.g., Smelter, Armory) or upgrade defenses.  
     Manage soldiers and prepare for threats.  
    

    Explore Phase:

     Draw two cards from the deck: choose one to reveal a room, threat, or resource.  
     Build rooms by placing them on grid paper (cost varies by type).  
    

    Retreat Phase:

     Resolve combat, theft, or other threats. Sold-out rooms generate income (e.g., Quarry → Stone).  
    
  4. Optional Challenges: Bloodrunes

    Place a Bloodrune (e.g., "Rune of Nightmares") on the first row before starting. Each rune triggers immediate effects (e.g., starting in a nightmare layer) but offers boons like combat bonuses.

  5. Rooms & Upgrades

    Basic Rooms:
    Smelter (3♥): Convert ore to metal.
    Armory (5♥): Add soldiers (1 per turn).
    Quarry (2♥): Gather stone.
    Advanced Rooms:
    Deep Delve (10♥): Explore deeper layers faster.
    Burglar Alarms (7♥): Mitigate thief losses.

  6. Threats

    Combat: Roll strength vs. enemy defense (armor reduces damage). Soldiers protect rooms.
    Thieves: Steal half your Trade Goods unless you have Burglar Alarms.
    Ancient Ones: Powerful bosses at deeper levels (e.g., "The Devourer") with unique abilities and high rewards.

  7. Trading & Marketplace

    Visit the Marketplace every 3 turns by drawing cards: adjust supply/demand for resources. Sell excess goods or buy needed items at fluctuating rates.

  8. Victory Conditions

Win by reaching Depth 10, accumulating 200+ Trade Goods, or surviving until the deck runs out. Lose if overrun by threats, depleted of resources, or defeated in combat.

  1. Bloodrune Example: Rune of Nightmares

    Start immediately on a nightmare layer without access to the Marketplace. Defeat ancient monsters for powerful rewards but risk doom.

Tips:

Use grid paper to visualize your hold’s layout.  
Balance room upgrades with threat mitigation.  
Bloodrunes add difficulty but enhance replayability.  

Enjoy delving into the depths! 🌋⚔️🗡️

15
0
submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]
 
 

Devstral is an agentic LLM for software engineering tasks built through a collaboration between Mistral AI and All Hands AI 🙌. Devstral excels at using tools to explore codebases, editing multiple files, and powering software engineering agents. The model achieves remarkable performance on SWE-bench, which positions it as the #1 open source model on this benchmark.

It is finetuned from Mistral-Small-3.1, so it has a long context window of up to 128k tokens. As a coding agent, Devstral is text-only; the vision encoder was removed from Mistral-Small-3.1 before fine-tuning.

Learn more about Devstral in our blog post. Key Features:

Agentic coding: Devstral is designed to excel at agentic coding tasks, making it a great choice for software engineering agents.

Lightweight: with its compact size of just 24 billion parameters, Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM, making it an appropriate model for local deployment and on-device use.

Apache 2.0 License: Open license allowing usage and modification for both commercial and non-commercial purposes.

Context Window: A 128k context window.

Tokenizer: Utilizes a Tekken tokenizer with a 131k vocabulary size.


https://mistral.ai/news/devstral
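The "single RTX 4090 / 32 GB Mac" claim roughly checks out with back-of-envelope weight-memory math; the bit-widths below are common quantization levels assumed for illustration, and KV cache and runtime overhead are not counted:

```python
def weight_memory_gb(params_b: float, bits: int) -> float:
    """Approximate memory for model weights alone (no KV cache/overhead)."""
    return params_b * 1e9 * bits / 8 / 1e9

# 24B parameters at a few quantization levels:
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(24, bits):.0f} GB")
```

At 4-bit the weights fit comfortably in a 4090's 24 GB, and 8-bit fits in a 32 GB Mac's unified memory with room for context.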

16
 
 

If you are an agent builder, these three protocols should be all you need

  • MCP gives agents tools
  • A2A allows agents to communicate with other agents
  • AG-UI brings your agents to the frontend, so they can engage with users.

Is there anything I'm missing?

17
 
 

"While the B60 is designed for powerful 'Project Battlematrix' AI workstations sold as full systems ranging from $5,000 to $10,000, it will carry a roughly $500 per-unit price tag."

18
19
20
 
 

Let's go! Lossless CPU inference