sh.itjust.works
https://github.com/vllm-project/vllm

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels

vLLM is flexible and easy to use with:

- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
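Since the server is OpenAI-compatible, any client that can speak the standard `/v1/completions` JSON format can talk to it. A minimal sketch of building such a request in Python, using only the standard library (the model name, port, and parameter values here are assumptions for illustration, not taken from the post):

```python
import json

def build_completion_request(prompt, model="facebook/opt-125m",
                             max_tokens=64, temperature=0.8):
    """Build the JSON body for an OpenAI-style /v1/completions call.

    The field names follow the OpenAI completions schema that vLLM's
    API server mirrors; the default model is a hypothetical example.
    """
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

body = build_completion_request("San Francisco is a")
print(json.dumps(body, indent=2))

# Assuming a vLLM server is listening locally on port 8000, the same
# body could be sent with:
#   curl http://localhost:8000/v1/completions \
#     -H "Content-Type: application/json" \
#     -d "$(python this_script.py)"
```

Because the request shape matches OpenAI's, existing OpenAI client libraries can usually be pointed at a vLLM server just by changing the base URL.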

YouTube video describing it: https://youtu.be/1RxOYLa69Vw
