Vllm

App in the BluixApps catalog

What it is

vLLM is the highest-throughput open-source LLM inference engine — built around PagedAttention for memory efficiency. Offers OpenAI-compatible REST API, tensor parallelism for multi-GPU, and serves all major modern LLMs (Llama 3.x, Mistral, Qwen, DeepSeek, etc.) at 5-10× the throughput of vanilla HuggingFace.

Used inside Anthropic's, Bedrock's, and many production LLM platforms — vLLM is the canonical choice for production LLM serving.

What it's for

Production LLM API — serve Llama 3.x, Mistral, Qwen at scale
OpenAI-compatible endpoints — drop-in replacement for OpenAI API in clients
High-throughput batching — continuous batching for many parallel users
Memory-efficient — PagedAttention enables larger batch sizes
Tensor parallelism — split big models across multiple GPUs
Embedding inference — serve embedding models too

Who it's for

AI app developers serving LLM in production
Startups building OpenAI-API replacement infrastructure
Enterprises running internal LLM for compliance / cost reasons
AI agencies offering LLM API to clients
Hosting providers selling LLM-as-a-service

Why teams pick vLLM over alternatives

Apache 2.0 — fully open
Highest throughput in production LLM benchmarks
PagedAttention = more efficient VRAM than competitors
OpenAI-compatible API = trivial client integration
Active development by UC Berkeley + Anyscale
Industry adoption — used in Bedrock, Anthropic infra, many startups
Tensor parallelism — scales to 70B+ models across 4-8 GPUs

Integrations

OpenAI-compatible REST: /v1/models, /v1/chat/completions, /v1/completions, /v1/embeddings
Pair with: OpenWebUI (UI), AnythingLLM, LangChain, LlamaIndex, LiteLLM (multi-model gateway)
HF model auto-download — gated models need HF_TOKEN
Quantization: AWQ, GPTQ, FP8, INT8
Multi-GPU: --tensor-parallel-size N
Multi-node: Ray cluster support for cross-node inference

Notable users & community

33k+ GitHub stars
UC Berkeley Sky Computing Lab + Anyscale
Used inside Bedrock, Anthropic, OpenAI competitor stacks
Active development with weekly releases
Production deployments at thousands of companies

Tips & operations

VRAM by model:
- 7B fp16: 16 GB; INT8 8 GB; AWQ 5 GB
- 13B fp16: 26 GB; AWQ 8 GB
- 70B fp16: 140 GB (needs 8× A100 80GB); AWQ ~40 GB
HF_TOKEN: required for Llama, Gemma (gated models on HF)
Max context: --max-model-len 8192 configurable per model
Quantization for cheaper hosting:
- AWQ: best quality-to-size
- GPTQ: fast inference
- FP8: H100/L40s only
Production: reverse proxy + auth + rate limiting + monitoring (Prometheus metrics built-in)

What we ship in BluixApps

Docker (vllm/vllm-openai:latest)
Default model: meta-llama/Meta-Llama-3.1-8B-Instruct (configurable via /opt/vllm/.env)
Persistent volume: /opt/vllm/models (HF cache)
Port 8000 (standard OpenAI port)
--max-model-len 8192 default
Install report at /root/bluixapps/vllm.txt
Recommended model list by VRAM tier
Pairing suggestions (OpenWebUI, AnythingLLM, LiteLLM)
HF_TOKEN environment variable for gated models
GPU pre-flight check via bluixapps_ensure_nvidia_runtime
Backup hook covers model cache

Read this app's deep dive on bluix.app ↗

Get this app — pick a BluixApps plan

Same catalog. Scaling tenant isolation, white-label and support tier.

Tier	Tenants	Catalog	Support	White-label	Monthly
Stacks	1	19 curated stacks	Standard	—	$19/mo	Detail Deploy
Starter	10	Full catalog	Standard	+$15–25/mo	$49/mo	Detail Deploy
Pro	25	Full catalog	Priority bugfix	+$15–25/mo	$149/mo	Detail Deploy
Growth	100	Full catalog	Priority bugfix	+$15–25/mo	$349/mo	Detail Deploy
Scale	500	Full catalog	7-day window	+$15–25/mo	$799/mo	Detail Deploy
Enterprise	Unlimited	Full catalog	Priority 7-day	Bundled	$1,499/mo	Detail Deploy

Vllm

What it is

What it's for

Who it's for

Why teams pick vLLM over alternatives

Integrations

Notable users & community

Tips & operations

What we ship in BluixApps

Get this app — pick a BluixApps plan

BluixApps Stacks — entry tier, single VPS managed

What's included

What's NOT in this tier

Best for

Plan facts

BluixApps Starter — full catalog, up to 10 isolated tenants

What's included

Best for

Where to upgrade from here

Plan facts

BluixApps Pro — 25 isolated tenants, priority bugfix lane

What's included on top of Starter

Best for

Plan facts

BluixApps Growth — 100 tenants, scale-up reseller toolkit

What's included on top of Pro

Best for

Plan facts

BluixApps Scale — 500 tenants, 7-day support window

What's included on top of Growth

Best for

Where to upgrade from here

Plan facts

BluixApps Enterprise — unlimited tenants, white-label bundled

What's included on top of Scale

Best for

Plan facts

Vllm

What it is

What it's for

Who it's for

Why teams pick vLLM over alternatives

Integrations

Notable users & community

Tips & operations

What we ship in BluixApps

Get this app — pick a BluixApps plan

BluixApps Stacks — entry tier, single VPS managed

What's included

What's NOT in this tier

Best for

Plan facts

BluixApps Starter — full catalog, up to 10 isolated tenants

What's included

Best for

Where to upgrade from here

Plan facts

BluixApps Pro — 25 isolated tenants, priority bugfix lane

What's included on top of Starter

Best for

Plan facts

BluixApps Growth — 100 tenants, scale-up reseller toolkit

What's included on top of Pro

Best for

Plan facts

BluixApps Scale — 500 tenants, 7-day support window

What's included on top of Growth

Best for

Where to upgrade from here

Plan facts

BluixApps Enterprise — unlimited tenants, white-label bundled

What's included on top of Scale

Best for

Plan facts

Generate Password

Generate Password