Infinity

App in the BluixApps catalog

What it is

Infinity Embedding is a high-throughput embedding inference server: a REST API that serves text and image embeddings from models such as BGE, E5, Jina, Cohere, and more. Its OpenAI-compatible /v1/embeddings endpoint makes it a drop-in replacement for the OpenAI embeddings API.

It delivers 5-20× higher embedding throughput than HuggingFace Inference and is designed for production RAG pipelines.
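
To make the "drop-in replacement" claim concrete, here is a minimal sketch of a client that talks to the /v1/embeddings route in the OpenAI request format. It assumes the BluixApps port (7884) and default model named on this page; the helper names (build_embedding_request, parse_embeddings, embed) are illustrative and not part of Infinity itself.

```python
import json
import urllib.request

# Assumption: the server listens on localhost:7884 and exposes the
# OpenAI-style route at /v1/embeddings, as described on this page.
BASE_URL = "http://localhost:7884"


def build_embedding_request(texts, model="BAAI/bge-large-en-v1.5", base_url=BASE_URL):
    """Build (but do not send) a POST request in the OpenAI embeddings format."""
    payload = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_body):
    """Return the embedding vectors ordered by their 'index' field."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, **kwargs):
    """Send the request and return one vector per input text."""
    request = build_embedding_request(texts, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_embeddings(json.load(response))
```

Calling embed(["first text", "second text"]) against a running instance should return two vectors. Existing OpenAI client code can usually be pointed at the same base URL instead of using a hand-written client like this one.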

What it's for

  • Embedding inference at scale — for RAG, search, recommendations
  • OpenAI-compatible API — drop-in replacement for OpenAI embeddings
  • Multi-model serving — multiple embedding models in one container
  • High throughput — batching + tensor parallelism
  • Long-document embeddings — Jina v3 supports 8k+ tokens
  • Multilingual embeddings — BGE-M3, multilingual-e5
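
When embedding a large corpus, clients commonly send inputs in chunks rather than one text per request. The sketch below shows that client-side pattern only; it is separate from whatever batching the server performs internally. The names batched and embed_corpus, and the batch size of 32, are illustrative assumptions, and embed_fn stands for any function that maps a list of texts to a list of vectors.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_corpus(texts, embed_fn, batch_size=32):
    """Embed `texts` chunk by chunk and return the vectors in input order.

    `embed_fn` is a caller-supplied function (hypothetical here) that takes
    a list of strings and returns one vector per string.
    """
    vectors = []
    for chunk in batched(list(texts), batch_size):
        vectors.extend(embed_fn(chunk))
    return vectors
```

The right batch size depends on the model, text length, and available memory, so treat 32 as a starting point to tune rather than a recommendation.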

Who it's for

  • RAG pipeline builders needing embeddings at scale
  • Search teams building semantic search
  • AI app developers integrating embeddings in their stack
  • AI agencies offering embedding services to clients
  • Hosting providers selling an embedding API tier

Why teams pick Infinity over alternatives

  • MIT license — fully open
  • 5-20× faster than HuggingFace Inference
  • OpenAI-compatible — works with LangChain, LlamaIndex, etc.
  • Multi-model — serve multiple embedding models simultaneously
  • Active development: maintained by Michael Feil
  • Production-tested — used by AI startups in prod
  • GPU + CPU — gracefully degrades to CPU

Integrations

  • OpenAI v1: /v1/embeddings endpoint
  • Reranker support — rerank documents post-retrieval
  • Pair with: Qdrant / Weaviate / Chroma (vector stores)
  • Pair with: vLLM / Ollama (RAG completion)
  • Pair with: LangChain / LlamaIndex (orchestration)
  • Swagger UI at /docs
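
To illustrate what the vector-store pairing does with the embeddings, here is a small in-memory sketch of cosine-similarity retrieval. It is a stand-in for the idea only: Qdrant, Weaviate, and Chroma use their own indexes and APIs, and the function names below are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k=3):
    """Return (doc_index, score) pairs for the k most similar documents."""
    scored = [
        (cosine_similarity(query_vector, vector), index)
        for index, vector in enumerate(doc_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(index, score) for score, index in scored[:k]]
```

In a RAG pipeline the top-k documents found this way would typically be passed to a reranker and then to the completion model.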

Notable users & community

  • 2k+ GitHub stars (newer but rapidly growing)
  • Michael Feil + contributors
  • Featured in production RAG roundups
  • Active community feedback + integrations
  • Multiple AI startups in production

Tips & operations

  • Recommended models by use case:
    • General English: BAAI/bge-large-en-v1.5 (1024 dim, default)
    • General English (lighter): BAAI/bge-base-en-v1.5 (768 dim)
    • Multilingual (100+ languages): intfloat/multilingual-e5-large or BAAI/bge-m3
    • Code: Salesforce/codet5p-110m-embedding
    • Tiny + fast: sentence-transformers/all-MiniLM-L6-v2
    • Long docs (8k+ tokens): jinaai/jina-embeddings-v3
  • Multi-model: start with --model-id A --model-id B to serve both models in parallel
  • VRAM: 4 GB minimum for distilled; 8 GB for large; 16 GB for jina-v3
  • Speed: 5-20× higher throughput than vanilla HF
  • vs OpenAI API: free + private + no rate limit + multi-model
  • vs sentence-transformers: 10× faster batch processing
  • Production: reverse proxy + auth + monitoring (Prometheus metrics)
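
The multi-model tip can be sketched as a helper that assembles a docker run command with one --model-id flag per model. This is an illustration under assumptions, not the BluixApps install script: the host cache path is hypothetical, and the "v2" subcommand and /app/.cache mount point follow the upstream Infinity README and should be verified against the image version in use.

```python
import shlex


def infinity_docker_command(
    model_ids,
    host_port=7884,                  # port used by the BluixApps install
    container_port=7997,             # Infinity's upstream default port
    cache_dir="/opt/infinity/cache", # hypothetical host path for the HF cache
    use_gpu=True,
):
    """Return the argument list for a multi-model `docker run` invocation."""
    if not model_ids:
        raise ValueError("at least one model id is required")
    args = ["docker", "run", "--rm"]
    if use_gpu:
        args += ["--gpus", "all"]
    args += [
        "-v", f"{cache_dir}:/app/.cache",
        "-p", f"{host_port}:{container_port}",
        "michaelf34/infinity:latest",
        "v2",
    ]
    for model_id in model_ids:
        # One --model-id flag per model to be served.
        args += ["--model-id", model_id]
    args += ["--port", str(container_port)]
    return args


def as_shell(args):
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(args)
```

For example, as_shell(infinity_docker_command(["BAAI/bge-large-en-v1.5", "BAAI/bge-m3"])) produces a single command line that serves both models from one container; pass use_gpu=False for a CPU-only host.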

What we ship in BluixApps

  • Docker (michaelf34/infinity:latest)
  • Default model: BAAI/bge-large-en-v1.5 (configurable via /opt/infinity/.env)
  • Persistent volume: HF cache (~1-2 GB per model)
  • Port 7884 (Infinity's upstream default is 7997)
  • Swagger UI at /docs
  • Install report at /root/bluixapps/infinity.txt
  • Recommended model list by use case
  • Multi-model serving guide
  • Infinity vs alternatives comparison
  • Use case examples (BluixApps catalog search, RAG pipelines)
  • Pairing suggestions (Qdrant + vLLM + LangChain)
  • HF_TOKEN environment variable for gated models
  • GPU pre-flight check via bluixapps_ensure_nvidia_runtime
  • Backup hook covers HF cache

Get this app — pick a BluixApps plan

Same catalog. Tiers scale tenant isolation, white-label, and support level.

Tier        Tenants    Catalog            Support          White-label  Monthly
Stacks      1          19 curated stacks  Standard         -            $19/mo
Starter     10         Full catalog       Standard         +$15–25/mo   $49/mo
Pro         25         Full catalog       Priority bugfix  +$15–25/mo   $149/mo
Growth      100        Full catalog       Priority bugfix  +$15–25/mo   $349/mo
Scale       500        Full catalog       7-day window     +$15–25/mo   $799/mo
Enterprise  Unlimited  Full catalog       Priority 7-day   Bundled      $1,499/mo
