I want to host some LLMs locally and use more advanced models. Since new hardware is out of the question, I think I should be able to pull something off by buying some yesteryear equipment on eBay etc. Has anybody attempted such a project? Does it scale horizontally? (I.e., can I connect two boxes to overcome single-box slowness?)

  • robber@lemmy.ml
    English
    arrow-up
    4
    ·
    edit-2
    4 hours ago

    To add some practical advice:

    It depends on what you mean by more advanced models. I run Qwen3.6-27b on 48GB VRAM across 3 cards (RTX 2000e Ada), and with the recent software optimizations merged into llama.cpp (tensor parallelism & MTP) I get around 30 tokens per second in generation. I use the model through openwebui for (agentic) web research and simple Q&A mostly and I’m quite happy with what it can do.

    If you want something similar, maybe look at one or two second-hand V100 PCIe 32GB cards. Or something from the Intel Arc Pro series, if you don’t mind the software support lagging behind a bit (as in less optimized).

    Also, it might be worth reading up on the difference between dense and MoE models, if you’re new to that. For MoE models, if your system RAM is fast enough, it’s often viable to offload the “experts” (the largest parts of such models) to RAM, reducing VRAM capacity needs. Note that server motherboards with e.g. octa-channel RAM have a huge advantage over consumer boards (making DDR4 interesting despite its slower speed per module).

    And to address your last question: while I have no direct experience, I’ve seen posts online about people connecting Strix Halo or DGX Spark devices, but usually via a 10+ Gbit/s switch, as the interconnect is crucial (unless you just want to load balance).

    Self-hosting LLMs is a very fun thing to do, but also a time- and money-consuming rabbit hole. You might wanna check out the LocalLlama community over at sh.itjust.works.

    Edit: typos
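
To put rough numbers on the octa-channel point above, here is a minimal sketch of theoretical peak RAM bandwidth. It assumes the usual 64-bit (8-byte) channel width; the function name and the two example configurations are illustrative choices, not measurements from anyone in this thread, and real-world throughput will be lower than these peaks.

```python
def peak_ram_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


# Illustrative consumer board: dual-channel DDR5-5600
consumer = peak_ram_bandwidth_gbs(channels=2, mega_transfers_per_s=5600)
# Illustrative server board: octa-channel DDR4-3200
server = peak_ram_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)

print(f"dual-channel DDR5-5600: {consumer:.1f} GB/s")
print(f"octa-channel DDR4-3200: {server:.1f} GB/s")
```

Under these assumptions the older DDR4 server platform comes out at more than twice the peak bandwidth of the newer dual-channel setup, which is why channel count matters so much for offloaded experts.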

  • Barbecue Cowboy@lemmy.dbzer0.com
    English
    arrow-up
    2
    ·
    8 hours ago

    With older hardware, once you accumulate enough VRAM to run it, your problem is going to shift to memory bandwidth, and your question is going to shift from ‘Can I run this model?’ to ‘Can I run this model at an acceptable speed?’.

  • sobchak@programming.dev
    English
    arrow-up
    4
    ·
    10 hours ago

    The trend I see is Mac Minis with a lot of unified memory. The people buying these are typically very well off, though. Prices for even old GPUs like 3090s are ridiculous now. I don’t think connecting 2 machines over Ethernet would work well, but putting 2 GPUs in a single machine does.

  • xylogx@lemmy.world
    English
    arrow-up
    3
    ·
    11 hours ago

    What size model? I can run 8-billion-parameter models on my GeForce 3070 with 8GB of VRAM. Bigger models need more memory. For $1-2k you can upgrade to a 16 or 32GB video card. For $3k you can get a Framework Desktop with 128GB of unified memory. For $6k you can get a DGX Spark with a Blackwell chip and 128GB of unified memory. A Mac mini or Mac Studio is also a good choice in this price range.
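
As a rough way to answer "what size model fits?", the weights alone take about parameters times bits per weight divided by 8. The sketch below assumes that rule of thumb; the `overhead_gb` allowance for context and runtime buffers is a made-up placeholder, not a figure from any inference engine.

```python
def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(vram_gb: float, params_billion: float, bits_per_weight: float,
                 overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed overhead allowance fit in VRAM."""
    return weights_size_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


# An 8B model at different precisions on an 8 GB card
for bits in (16, 8, 4):
    size = weights_size_gb(8, bits)
    print(f"8B @ {bits}-bit: {size:.1f} GB, fits in 8 GB: {fits_in_vram(8, 8, bits)}")
```

By this estimate an 8B model only fits on an 8GB card once it is quantized to around 4 bits per weight.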

    • Barbecue Cowboy@lemmy.dbzer0.com
      English
      arrow-up
      2
      ·
      8 hours ago

      The 64GB Framework Desktop runs just barely over 2k when configured minimally. I went that route because I thought it was a better option than a discrete 32GB video card, but there are tradeoffs with compatibility. Something to think about in the 2k-but-not-quite-3k range, though.

  • iceberg314@slrpnk.net
    English
    arrow-up
    3
    ·
    12 hours ago

    For really lightweight models, Qwen3:8b is pretty good for an 8gb graphics card.

    Gemma 4:e4b is pretty good too. I usually run that on my 16gb gpu.

    Obviously the little ones aren’t as good as big ones, but you can always rely on real intelligence to fill in the gaps

  • ejs@piefed.social
    English
    arrow-up
    26
    ·
    19 hours ago

    Honestly, you’re a few months late to the whole buying-GPUs-for-local-LLMs party, so expect exorbitant prices even for older cards.

    The name of the game is VRAM. For the most part, more is better. If you can get your hands on multiple matching (same model) cards with 24GB or more (within your price range), you’re golden.

    Going for more than 2 GPUs can become challenging because of motherboard PCIe slot spacing, so make sure either your cards aren’t too thick or your PCIe slots are widely spaced.

    For inference, speed (tokens/second) is limited by memory bandwidth. Go for cards with higher memory bandwidth if you can afford it (e.g. GDDR6 will be faster than GDDR5).

    Also, with multiple GPUs you will need an adequate power supply and a large enough case.

    If you want to be a bit eccentric and load huge models, you can also go the CPU route and fill up a motherboard with 256GB of RAM, because then you’re in several-hundred-billion-parameter model territory, which could, depending on your use case, be better than faster inference on smaller/quantized models. Even then, high-clocked DDR5 is still way slower than GPU memory.
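
The bandwidth limit can be turned into a back-of-the-envelope ceiling: for a dense model, every generated token has to read roughly all the weights once, so tokens per second cannot exceed bandwidth divided by model size. This is a minimal sketch under that assumption; the bandwidth figures and the 4.5 bits per weight are illustrative inputs, and real throughput will land below the ceiling.

```python
def max_tokens_per_s(bandwidth_gbs: float, params_billion: float, bits_per_weight: float) -> float:
    """Upper bound on generation speed for a dense model that is purely bandwidth-bound."""
    gb_read_per_token = params_billion * bits_per_weight / 8
    return bandwidth_gbs / gb_read_per_token


# Illustrative inputs: a 27B dense model quantized to ~4.5 bits per weight
gpu_ceiling = max_tokens_per_s(bandwidth_gbs=936, params_billion=27, bits_per_weight=4.5)
cpu_ceiling = max_tokens_per_s(bandwidth_gbs=90, params_billion=27, bits_per_weight=4.5)

print(f"GPU-class memory (936 GB/s): at most {gpu_ceiling:.0f} tokens/s")
print(f"dual-channel DDR5 (~90 GB/s): at most {cpu_ceiling:.0f} tokens/s")
```

The same model is roughly ten times slower from system RAM in this estimate, which is the tradeoff behind the CPU route.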

  • BlameThePeacock@lemmy.ca
    English
    arrow-up
    13
    ·
    19 hours ago

    I’m running Qwen 3.6 35B A3B (the MoE model) on an Nvidia GPU with 8GB of VRAM, plus 32GB of system RAM. With tweaking (and Turboquant) I’ve got it up to 30-40 tokens per second and a 260k context. It’s very usable. I’ve seen people report success with dual 3060 cards, but you’re still talking $1000-1500 for that kind of setup even if you have parts of it already.

  • ryokimball@infosec.pub
    English
    arrow-up
    6
    ·
    19 hours ago

    How old are we talking? I personally wouldn’t go further back than the RTX 2000 series. A friend has had good luck with Intel GPUs for ‘cheap’.

    No, you absolutely cannot scale horizontally for speed. VRAM is king; system RAM can stand in for it, but with major speed penalties. SSD is even slower than that, and all of those are orders of magnitude faster than any Ethernet you’ll be connecting boxes together with. That’s not to say clustering isn’t an option, just that speed is going to get worse the more you scale out like that.
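
To get a feel for the Ethernet gap, a small sketch that converts link speeds to GB/s and computes how long it takes to move 1 GB. The Ethernet figures follow directly from the line rate (8 bits per byte, ignoring protocol overhead); the PCIe 4.0 x16 number is an approximate round figure included only for comparison.

```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a line rate in Gbit/s to GB/s, ignoring protocol overhead."""
    return gbit_per_s / 8


def transfer_seconds(payload_gb: float, link_gbyte_per_s: float) -> float:
    """Time to move a payload over a link at its peak rate."""
    return payload_gb / link_gbyte_per_s


links = {
    "1 GbE": gbit_to_gbyte_per_s(1),
    "10 GbE": gbit_to_gbyte_per_s(10),
    "PCIe 4.0 x16 (approx.)": 32.0,  # rough peak, for comparison only
}

for name, rate in links.items():
    print(f"{name}: {rate:.3f} GB/s, 1 GB in {transfer_seconds(1, rate):.3f} s")
```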

    • droopy4096@lemmy.caOP
      English
      arrow-up
      2
      ·
      19 hours ago

      I’ve got some circa-2010 cards lying about, plus a 32GB server that already has 8GB carved out for TrueNAS, so essentially I could squeeze 16-24GB out of it, but it has an older Intel i5 CPU.

      • robber@lemmy.ml
        English
        arrow-up
        1
        ·
        4 hours ago

        Your biggest issue with 2010 cards will be software (inference engine) support, I assume.

  • civ@lemmy.civl.cc
    English
    arrow-up
    5
    ·
    19 hours ago

    I’m running Qwen 3.6 35B A3B on my Ryzen 8700G and it runs pretty well, but the bigger problem there is probably the cost of RAM.

  • droopy4096@lemmy.caOP
    English
    arrow-up
    4
    ·
    18 hours ago

    Thank you, folks. Your input gives me a decent starting point. I’ll start digging based on the info and experiences shared; maybe I can find someone locally selling an old GPU with enough RAM for cheap.

  • solrize@lemmy.ml
    English
    arrow-up
    4
    ·
    18 hours ago

    Unless you’re going to really run a lot, this is an area where vast.ai is probably more affordable than mucking with hardware.

  • tal@lemmy.today
    English
    arrow-up
    3
    ·
    18 hours ago

    If you’re willing to wait until 2028, when memory prices are expected to drop, and to buy new hardware once they do, I’d give real consideration to waiting until then. There’ll also probably be better hardware and better models by then.

  • tal@lemmy.today
    English
    arrow-up
    3
    ·
    19 hours ago

    If you can constrain yourself to MoE-based LLMs, they’ll generally cope better, from a performance standpoint, with not entirely fitting in VRAM than non-MoE LLMs do, as experts may not get loaded into VRAM at all.
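
A toy model of why MoE tolerates spilling out of VRAM: per token, only the active parameters are read, so the time per token is the data read from VRAM divided by VRAM bandwidth plus the data read from system RAM divided by RAM bandwidth. Everything numeric below (bandwidths, the VRAM/RAM split, 4.5 bits per weight) is a hypothetical input chosen for illustration, not a benchmark.

```python
def tokens_per_s_split(gb_read_from_vram: float, gb_read_from_ram: float,
                       vram_gbs: float, ram_gbs: float) -> float:
    """Bandwidth-bound tokens/s when each token reads some data from VRAM and some from RAM."""
    seconds_per_token = gb_read_from_vram / vram_gbs + gb_read_from_ram / ram_gbs
    return 1 / seconds_per_token


VRAM_GBS, RAM_GBS = 450.0, 60.0  # assumed bandwidths

# Dense 27B at ~4.5 bits: ~15.19 GB read per token, 8 GB of it resident in VRAM
dense_gb = 27 * 4.5 / 8
dense = tokens_per_s_split(8.0, dense_gb - 8.0, VRAM_GBS, RAM_GBS)

# MoE with ~3B active parameters at ~4.5 bits: ~1.69 GB read per token,
# assuming 0.5 GB of that (shared layers) sits in VRAM and the rest in RAM
moe_gb = 3 * 4.5 / 8
moe = tokens_per_s_split(0.5, moe_gb - 0.5, VRAM_GBS, RAM_GBS)

print(f"dense 27B, partly offloaded: ~{dense:.1f} tokens/s")
print(f"MoE with 3B active, experts in RAM: ~{moe:.1f} tokens/s")
```

In this sketch the partly offloaded dense model is dominated by the slow RAM reads, while the MoE model stays usable because so little data is touched per token.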

  • theunknownmuncher@lemmy.world
    English
    arrow-up
    2
    ·
    18 hours ago

    I use Instinct MI60 GPUs. They offer pretty decent performance for local LLMs. Connecting multiple computers is going to be impractical because of the severe bandwidth bottleneck.

  • worhui@lemmy.world
    English
    arrow-up
    2
    ·
    19 hours ago

    RAM is a big driver of what models you can run, with VRAM at a premium. Equipping 2 separate boxes with enough RAM to load advanced models may be more expensive than just equipping one faster machine.

    On the larger models, even with SSD swap, I can’t even get them to fully load with my 16GB of RAM.

    • droopy4096@lemmy.caOP
      English
      arrow-up
      2
      ·
      19 hours ago

      Well, I intend to scavenge for parts, as I can’t really afford today’s prices. And since I don’t really know what I should grab as minimum specs, I don’t even know what to look for. I could look for old(er) gaming rigs people are selling, or maybe there are some business workstations being sold in bulk. Either way, knowing the minimum viable set of specs for running Qwen or Claude locally would be helpful.

      • timochka@lemmy.zip
        English
        arrow-up
        5
        arrow-down
        1
        ·
        16 hours ago

        Word of advice - don’t scavenge old /server/ hardware if you plan to put GPUs in - unless you really like heat and noise. Those machines from the likes of HP will take one look at a consumer graphics card in a PCIe slot and decide they need to run all the fans at 100% to ward off evil spirits. Not to mention the general ballache of proprietary PCIe riser cards, PCIe power cables, etc. etc.

        You’re definitely better off with taking someone’s old gaming rig off their hands.

        In terms of specs - value VRAM above everything else. A slow, old 3000 series card with 24GB of VRAM is much more useful than a brand new 5000 series card with 16GB. If you can find old RTX 3090 24GB cards, they’re kinda ideal.

        The one thing I will say for modern cards though is that they’re much better for power efficiency - and in particular idle power (which is important if you’re running the thing always on.) For my main LLM machine I have two RTX5060Ti (32GB total), which at the time was the sweet spot for price/performance/power, and it’s very nice that they idle around 3 or 4 watts. I bought them before the world went crazy and prices went mad though, so they may not be the sweet spot any more.

        Once you’re in 32GB VRAM type territory, you can run really really good dense models like Qwen-3.6-27b at a decent quant, decent context size, and good performance for things like coding, or bigger MoE models for more general use (particularly if you have a good CPU and regular RAM for offloading to CPU). For use as an assistant (i.e. not an OpenClaw fully automated slop machine), I use 3.6-27b as a daily driver in Claude Code, and basically never use Sonnet.
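
The idle-power point is easy to quantify for an always-on box. A minimal sketch: yearly energy is watts times hours per year, and cost is that times your electricity price. The 0.30 per kWh price and the 30 W "older card" idle figure are placeholders to swap for your own numbers; only the 3 to 4 W idle figure comes from the comment above.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_kwh(idle_watts: float) -> float:
    """Energy used per year by a device idling at a constant power draw."""
    return idle_watts * HOURS_PER_YEAR / 1000


def yearly_cost(idle_watts: float, price_per_kwh: float) -> float:
    """Yearly electricity cost of that idle draw."""
    return yearly_kwh(idle_watts) * price_per_kwh


PRICE = 0.30  # assumed price per kWh, in your local currency

for label, watts in (("two cards idling at 4 W each", 8), ("hypothetical older card at 30 W", 30)):
    print(f"{label}: {yearly_kwh(watts):.1f} kWh/year, about {yearly_cost(watts, PRICE):.2f} per year")
```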

      • mierdabird@lemmy.dbzer0.com
        English
        arrow-up
        3
        ·
        18 hours ago

        It can vary a lot based on what qwen model you want to run, but generally the 27b dense or 35b MoE are currently the best balance between size and capability afaik.

        If you can run two 16GB cards you can pretty much max out the context on the 27b model, but a single card like the 3060 12gb could still work well on the 35b MoE model with the excess spilling into system memory.

        I saw in another comment that you have cards from the 2010s, but if they don’t have at least 8GB I wouldn’t even bother.
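
Context length costs memory too, on top of the weights. For a standard transformer the KV cache grows linearly with context: 2 (keys and values) times layers times KV heads times head dimension times bytes per value, per token. The sketch below assumes that standard layout; the layer and head counts are hypothetical, not the real configuration of any Qwen model, and engines that quantize the cache or use other attention schemes will need less.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB for a standard attention layout."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1e9


# Hypothetical model shape: 48 layers, 8 KV heads, head dimension 128, 16-bit cache
for context in (8_192, 32_768, 131_072):
    size = kv_cache_gb(n_layers=48, n_kv_heads=8, head_dim=128, context_tokens=context)
    print(f"{context:>7} tokens of context: ~{size:.1f} GB of KV cache")
```

With a shape like this, a long context can take as much memory as the quantized weights themselves, which is why the second 16GB card helps with maxing out context.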