this post was submitted on 07 Jul 2023

LocalLLaMA


Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, and get hyped about the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.


https://github.com/vllm-project/vllm

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels

vLLM is flexible and easy to use with:

- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
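
As a concrete illustration of the HuggingFace integration and parallel sampling listed above, here is a minimal sketch of vLLM's offline inference API. The model name `facebook/opt-125m` is just an illustrative choice; any supported HuggingFace model should work:

```python
from vllm import LLM, SamplingParams

# Load a HuggingFace model; weights are fetched from the Hub if not cached.
llm = LLM(model="facebook/opt-125m")

# Parallel sampling: n=2 requests two completions per prompt.
sampling_params = SamplingParams(n=2, temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["The capital of France is", "Large language models are"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    for candidate in output.outputs:
        print(f"  -> {candidate.text!r}")
```

Because of continuous batching, the prompts above are scheduled together rather than processed one at a time, which is where the throughput gains come from.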

YouTube video describing it: https://youtu.be/1RxOYLa69Vw
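
The OpenAI-compatible API server from the list above can be queried with the stock `openai` client. A sketch, assuming the server was started with `python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m` and the pre-1.0 `openai` Python package is installed:

```python
import openai  # pre-1.0 openai client, pointed at the local vLLM server

openai.api_base = "http://localhost:8000/v1"  # vLLM's default listen address
openai.api_key = "EMPTY"  # the server does not validate keys by default

# A plain completion request, served by vLLM instead of the OpenAI backend.
completion = openai.Completion.create(
    model="facebook/opt-125m",  # must match the model the server was started with
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```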
