Vllm

App in the BluixApps catalog

What it is

vLLM is the highest-throughput open-source LLM inference engine — built around PagedAttention for memory efficiency. Offers OpenAI-compatible REST API, tensor parallelism for multi-GPU, and serves all major modern LLMs (Llama 3.x, Mistral, Qwen, DeepSeek, etc.) at 5-10× the throughput of vanilla HuggingFace.

Used inside Anthropic's, Bedrock's, and many production LLM platforms — vLLM is the canonical choice for production LLM serving.

What it's for

  • Production LLM API — serve Llama 3.x, Mistral, Qwen at scale
  • OpenAI-compatible endpoints — drop-in replacement for OpenAI API in clients
  • High-throughput batching — continuous batching for many parallel users
  • Memory-efficient — PagedAttention enables larger batch sizes
  • Tensor parallelism — split big models across multiple GPUs
  • Embedding inference — serve embedding models too

Who it's for

  • AI app developers serving LLM in production
  • Startups building OpenAI-API replacement infrastructure
  • Enterprises running internal LLM for compliance / cost reasons
  • AI agencies offering LLM API to clients
  • Hosting providers selling LLM-as-a-service

Why teams pick vLLM over alternatives

  • Apache 2.0 — fully open
  • Highest throughput in production LLM benchmarks
  • PagedAttention = more efficient VRAM than competitors
  • OpenAI-compatible API = trivial client integration
  • Active development by UC Berkeley + Anyscale
  • Industry adoption — used in Bedrock, Anthropic infra, many startups
  • Tensor parallelism — scales to 70B+ models across 4-8 GPUs

Integrations

  • OpenAI-compatible REST: /v1/models, /v1/chat/completions, /v1/completions, /v1/embeddings
  • Pair with: OpenWebUI (UI), AnythingLLM, LangChain, LlamaIndex, LiteLLM (multi-model gateway)
  • HF model auto-download — gated models need HF_TOKEN
  • Quantization: AWQ, GPTQ, FP8, INT8
  • Multi-GPU: --tensor-parallel-size N
  • Multi-node: Ray cluster support for cross-node inference

Notable users & community

  • 33k+ GitHub stars
  • UC Berkeley Sky Computing Lab + Anyscale
  • Used inside Bedrock, Anthropic, OpenAI competitor stacks
  • Active development with weekly releases
  • Production deployments at thousands of companies

Tips & operations

  • VRAM by model:
    • 7B fp16: 16 GB; INT8 8 GB; AWQ 5 GB
    • 13B fp16: 26 GB; AWQ 8 GB
    • 70B fp16: 140 GB (needs 8× A100 80GB); AWQ ~40 GB
  • HF_TOKEN: required for Llama, Gemma (gated models on HF)
  • Max context: --max-model-len 8192 configurable per model
  • Quantization for cheaper hosting:
    • AWQ: best quality-to-size
    • GPTQ: fast inference
    • FP8: H100/L40s only
  • Production: reverse proxy + auth + rate limiting + monitoring (Prometheus metrics built-in)

What we ship in BluixApps

  • Docker (vllm/vllm-openai:latest)
  • Default model: meta-llama/Meta-Llama-3.1-8B-Instruct (configurable via /opt/vllm/.env)
  • Persistent volume: /opt/vllm/models (HF cache)
  • Port 8000 (standard OpenAI port)
  • --max-model-len 8192 default
  • Install report at /root/bluixapps/vllm.txt
  • Recommended model list by VRAM tier
  • Pairing suggestions (OpenWebUI, AnythingLLM, LiteLLM)
  • HF_TOKEN environment variable for gated models
  • GPU pre-flight check via bluixapps_ensure_nvidia_runtime
  • Backup hook covers model cache
Read this app's deep dive on bluix.app ↗

Get this app — pick a BluixApps plan

Same catalog. Scaling tenant isolation, white-label and support tier.

TierTenantsCatalogSupportWhite-labelMonthly
Stacks119 curated stacksStandard$19/moDetailDeploy
Starter10Full catalogStandard+$15–25/mo$49/moDetailDeploy
Pro25Full catalogPriority bugfix+$15–25/mo$149/moDetailDeploy
Growth100Full catalogPriority bugfix+$15–25/mo$349/moDetailDeploy
Scale500Full catalog7-day window+$15–25/mo$799/moDetailDeploy
EnterpriseUnlimitedFull catalogPriority 7-dayBundled$1,499/moDetailDeploy

Powered by WHMCompleteSolution