this post was submitted on 08 Feb 2025
25 points (96.3% liked)
Asklemmy
At this point, retail devices with 96 GB of memory aren't too difficult to find, if your budget allows, but how can one enter the TB zone?
96 GB+ of RAM is relatively easy to get, but for LLM inference you want VRAM. You can get there on a consumer PC by splitting the model across multiple GPUs, although performance won't be as good as a single GPU with 96 GB of VRAM, and swapping out to system RAM during inference slows things down a lot.
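To make the multi-GPU route concrete, here's a minimal sketch using Hugging Face transformers with accelerate's `device_map="auto"` sharding; the model name and per-device memory caps are placeholders picked for illustration, not anything from this thread.

```python
# Minimal sketch: shard one model across every visible GPU, spilling to CPU RAM
# only as a last resort. Model name and memory caps are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical example model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                                    # spread layers across all visible GPUs
    max_memory={0: "24GiB", 1: "24GiB", "cpu": "64GiB"},  # anything that doesn't fit lands in RAM
)

inputs = tokenizer("Why is VRAM the bottleneck?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```

The layers that land on the CPU in that `max_memory` map are exactly the ones that make inference crawl, which is why pooled VRAM matters more than total RAM.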
On architectures with unified memory (like Apple's latest machines), the CPU and GPU share the same memory pool, so you can actually get a system with a very large amount of memory directly accessible to the GPU. Mac Pros can be configured with up to 192 GB, although I doubt it'd be worth it as the GPU probably isn't powerful enough.
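For the Apple side, here's a tiny sketch of what "directly accessible to the GPU" means in practice, assuming PyTorch's MPS backend; the tensor size is arbitrary.

```python
# Minimal sketch: on Apple Silicon the GPU uses the same unified memory as the
# CPU, exposed in PyTorch via the "mps" backend. Tensor size is arbitrary.
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(4096, 4096, dtype=torch.float16, device=device)  # lives in unified memory
print(f"running on {x.device}, ~{x.element_size() * x.nelement() / 1e6:.0f} MB allocated")
```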
Also, the 83 GB number I gave was for a hypothetical 1-bit quantization of DeepSeek R1, which (if it's even possible) would probably be really shitty, maybe even shittier than Llama 7B.
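If you want to see where a number like that comes from, here's the back-of-envelope math, assuming DeepSeek R1's published ~671B parameter count and counting weights only (no KV cache or runtime overhead):

```python
# Back-of-envelope weight memory for a ~671B-parameter model (DeepSeek R1's
# published size) at different bit widths. Weights only; KV cache and runtime
# overhead are ignored.
PARAMS = 671e9

for bits in (32, 16, 8, 4, 1):
    gb = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit: {gb:7.1f} GB")
```

1-bit works out to roughly 84 GB, in line with the 83 GB above, and 32-bit to roughly 2.7 TB, the same ballpark as the 2.6 TB figure below.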
Data centers use NVLink to connect multiple Nvidia GPUs. Idk what the limits are, but it lets you pool resources across GPUs far more efficiently and at a much larger scale than is possible on consumer hardware. A single Nvidia H200 has 141 GB of VRAM, so you can link a bunch of them up to build some monster data centers.
Nvidia also sells prebuilt machines like the HGX B200, which can have 1.4 TB of memory in a single system. That's less than the 2.6 TB for unquantized DeepSeek, but for inference-only applications you could definitely quantize it enough to fit within that limit with little to no quality loss... so if you're really interested and really rich, you could probably buy one of those for your home lab.
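Rough sizing sketch, using the 141 GB per H200 and ~1.4 TB per HGX B200 figures from above; the ~671B parameter count is the same assumption as before, weights only.

```python
# Rough sizing: how many NVLinked H200s (141 GB each) each bit width needs,
# and whether the weights fit in an HGX B200's ~1.4 TB pool. GPU memory figures
# are from the comment above; the rest is plain arithmetic, weights only.
import math

PARAMS = 671e9        # assumed DeepSeek R1 parameter count
H200_GB = 141
HGX_B200_GB = 1400

for bits in (16, 8, 4):
    need_gb = PARAMS * bits / 8 / 1e9
    fits = "fits" if need_gb <= HGX_B200_GB else "does not fit"
    print(f"{bits:>2}-bit: {need_gb:6.0f} GB -> {math.ceil(need_gb / H200_GB)} x H200, {fits} in one HGX B200")
```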