Tgi

App in the BluixApps catalog

What it is

TGI (Text Generation Inference) is HuggingFace's official production LLM serving stack — continuous batching, tensor parallelism, quantization (bitsandbytes, GPTQ, AWQ, EETQ), streaming, and OpenAI-compatible API. The HF ecosystem-native answer to vLLM.

If your stack already uses HuggingFace models + Spaces + Inference Endpoints, TGI is the natural self-hosted equivalent.

What it's for

  • HF-native LLM serving — swap to any HF model with one config change
  • Production-grade inference — continuous batching, streaming
  • Broad quantization support — bitsandbytes, GPTQ, AWQ, EETQ
  • OpenAI-compatible API (newer versions)
  • Multi-shard inference — tensor parallelism for big models
  • Tested HF models — TGI tests every major HF model on release

Who it's for

  • Teams already on HF stack — Spaces, Inference Endpoints users
  • AI startups wanting wide model compatibility
  • Researchers needing to swap models frequently
  • Production teams valuing HF's testing + maintenance commitment
  • Hosting providers offering HF-aligned LLM tier

Why teams pick TGI over alternatives

  • Apache 2.0 — fully open
  • HF integration — every new HF model tested on release
  • Best quantization breadth — more formats than vLLM
  • Streaming — first-class server-sent events
  • Simpler model swap than vLLM (any HF model path works)
  • HF backing — long-term maintenance commitment

Integrations

  • OpenAI v1 endpoints: /v1/chat/completions, /v1/completions
  • TGI native: /generate, /generate_stream
  • HF Hub direct model loading
  • Pair with: LangChain (TGI client), LlamaIndex, OpenWebUI
  • Multi-shard: --num-shard N for tensor parallelism
  • Quantization flags:
    • --quantize bitsandbytes (4-bit, simple)
    • --quantize gptq (4-bit, best perf)
    • --quantize awq (4-bit, alternative)
    • --quantize eetq (8-bit, less accuracy loss)

Notable users & community

  • 9k+ GitHub stars
  • HuggingFace corporate backing
  • Used inside HF Inference Endpoints (production at scale)
  • Used by enterprises wanting HF compatibility
  • Featured in HF model card "Use in TGI" buttons

Tips & operations

  • VRAM by model: similar to vLLM (16 GB for 7B fp16, 26 GB for 13B)
  • HF_TOKEN: required for gated models
  • Max input/total tokens: configurable per startup
  • Quantization choice: AWQ usually best balance
  • Multi-GPU: --num-shard 2 enables tensor parallel
  • Streaming: SSE format compatible with OpenAI client streaming
  • Production: reverse proxy + auth + monitoring (Prometheus metrics built-in)
  • vs vLLM: TGI for HF-aligned teams, vLLM for raw peak throughput

What we ship in BluixApps

  • Docker (ghcr.io/huggingface/text-generation-inference:latest)
  • Default model: meta-llama/Meta-Llama-3.1-8B-Instruct (configurable via /opt/tgi/.env)
  • Persistent volume: /opt/tgi/data
  • Port 8001 (separate from vLLM if co-installed)
  • --max-input-length 4096 --max-total-tokens 8192 defaults
  • Install report at /root/bluixapps/tgi.txt
  • Quantization options documented
  • TGI vs vLLM positioning explained
  • HF_TOKEN environment variable for gated models
  • GPU pre-flight check via bluixapps_ensure_nvidia_runtime
  • Backup hook covers model cache
Read this app's deep dive on bluix.app ↗

Get this app — pick a BluixApps plan

Same catalog. Scaling tenant isolation, white-label and support tier.

TierTenantsCatalogSupportWhite-labelMonthly
Stacks119 curated stacksStandard$19/moDetailDeploy
Starter10Full catalogStandard+$15–25/mo$49/moDetailDeploy
Pro25Full catalogPriority bugfix+$15–25/mo$149/moDetailDeploy
Growth100Full catalogPriority bugfix+$15–25/mo$349/moDetailDeploy
Scale500Full catalog7-day window+$15–25/mo$799/moDetailDeploy
EnterpriseUnlimitedFull catalogPriority 7-dayBundled$1,499/moDetailDeploy

Powered by WHMCompleteSolution