this post was submitted on 21 Feb 2024
88 points (90.0% liked)
Technology
What kind of website is that? It's super slow and doesn't work without WebAssembly. Do you really need that for a simple interface?
It's not about their frontend; they are running custom LPUs that can process LLM tokens at 500 tokens/sec, which is insanely impressive.
For reference, with a max size of 2k tokens, my dual Xeon Silver 4114 processors take 2-3 minutes.
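To put those two numbers side by side, here is a back-of-the-envelope sketch. It assumes all 2,000 tokens count as processed tokens and that the whole 2-3 minutes of wall-clock time is spent on them; prompt processing and generation are lumped together, so treat the result as a rough ratio, not a benchmark.

```python
# Rough throughput comparison using only the numbers quoted in this thread.
# Assumption: all 2k tokens are processed in the full 2-3 minute window,
# with no split between prompt processing and generation.

TOKENS = 2000               # "max size of 2k tokens"
LPU_TOKENS_PER_SEC = 500    # rate quoted for the custom LPUs


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average throughput over a run."""
    return tokens / seconds


def seconds_needed(tokens: int, rate: float) -> float:
    """Time to process `tokens` at a constant `rate` tokens/sec."""
    return tokens / rate


cpu_fast = tokens_per_second(TOKENS, 2 * 60)   # best case: 2 minutes
cpu_slow = tokens_per_second(TOKENS, 3 * 60)   # worst case: 3 minutes
lpu_time = seconds_needed(TOKENS, LPU_TOKENS_PER_SEC)

print(f"dual Xeon: {cpu_slow:.1f}-{cpu_fast:.1f} tokens/sec")
print(f"LPU: {lpu_time:.1f} s for {TOKENS} tokens")
print(f"speedup: {LPU_TOKENS_PER_SEC / cpu_fast:.0f}x-"
      f"{LPU_TOKENS_PER_SEC / cpu_slow:.0f}x")
```

Under those assumptions the CPUs manage roughly 11-17 tokens/sec, while the same 2k tokens would take about 4 seconds at 500 tokens/sec, a 30-45x gap.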
No, I got what you meant, but the site is weird if it isn't doing anything on its own.
Is that with an fp16 model? Don't be afraid to try even 4-bit quantization; you'd be surprised at how little is lost and how much quicker it is.
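For anyone curious what "4-bit quantization" does, here is a minimal sketch of symmetric absmax quantization over one group of weights, plus the weight-memory arithmetic. It is an illustration under stated assumptions, not how any particular tool does it: one shared scale per group, integer levels restricted to -7..7, and a hypothetical 7B-parameter model. Real formats (llama.cpp's quant types, GPTQ, etc.) use more elaborate grouping and rounding.

```python
# Minimal sketch of symmetric 4-bit (absmax) quantization.
# Assumptions: one float scale per group of weights, integer levels in
# [-7, 7] (the -8 level of a signed 4-bit int is left unused for symmetry).

from typing import List, Tuple


def quantize_4bit(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-7, 7] using a single shared scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 7 if absmax > 0 else 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_4bit(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the 4-bit integers and the scale."""
    return [v * scale for v in q]


def weight_gib(n_params: float, bits: int) -> float:
    """Memory for the weights alone (no KV cache or activations), in GiB."""
    return n_params * bits / 8 / 2**30


# A small made-up group of weights to show the round trip.
group = [0.12, -0.53, 0.31, 0.07, -0.22, 0.49, -0.01, 0.18]
q, scale = quantize_4bit(group)
restored = dequantize_4bit(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print("quantized:", q)
print(f"max abs error: {max_err:.4f} (scale {scale:.4f})")

N = 7e9  # hypothetical 7B-parameter model
print(f"fp16 weights:  {weight_gib(N, 16):.1f} GiB")
print(f"4-bit weights: {weight_gib(N, 4):.1f} GiB")
```

Each weight's error is bounded by half the scale, and the weights shrink to a quarter of their fp16 size. Since CPU token generation is largely limited by how fast weights can be read from memory, reading fewer bytes per token is a big part of why quantized models run noticeably quicker.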
Aren't those the ones that cost $2,000 per 250 MB of memory? That means you'd need about 350 of them to load any half-decent model.
Not sure how they're doing it, but it was actually $20k, not $2k, for 250 MB of memory on the card. I suspect the models are cached in system memory.
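The arithmetic behind the last two comments, sketched under the thread's own assumptions: 250 MB of on-card memory per card, the whole model held in that memory, and the price being per card. The 350-card figure is the estimate quoted above; `cards_needed` is a hypothetical helper for sizing any other model.

```python
import math

CARD_MEMORY_MB = 250  # on-card memory quoted in the thread


def cards_needed(model_mb: float, card_mb: float = CARD_MEMORY_MB) -> int:
    """Cards required if the whole model must sit in on-card memory."""
    return math.ceil(model_mb / card_mb)


def total_cost(cards: int, price_per_card: float) -> float:
    """Hardware cost, assuming the quoted price is per card."""
    return cards * price_per_card


cards = 350  # estimate from the thread
memory_gb = cards * CARD_MEMORY_MB / 1000  # decimal GB
print(f"{cards} cards x {CARD_MEMORY_MB} MB = {memory_gb:.1f} GB")

# Originally quoted price vs. the corrected one.
for price in (2_000, 20_000):
    print(f"at ${price:,} per card: ${total_cost(cards, price):,.0f}")

# Example: a hypothetical 14,000 MB model
print(cards_needed(14_000), "cards for a 14,000 MB model")
```

So 350 cards give about 87.5 GB in total, which would cost roughly $700,000 at $2k per card or $7 million at $20k per card, if the whole model really had to live in on-card memory.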