Ollama

App in the BluixApps catalog

What it is

Ollama is the de-facto standard for running local large language models on your own hardware. A single binary + REST API that pulls models from a public registry (Llama 3.3, Mistral, Qwen, DeepSeek, Phi-4 and dozens more), handles quantization, GPU offload, and exposes a simple /api/generate and /api/chat interface that's API-compatible with the OpenAI SDK.

It's the boring, reliable engine that every other self-hosted AI tool ends up integrating against — Open WebUI, AnythingLLM, LibreChat, Flowise, LangChain, LiteLLM, n8n.

What it's for

  • Private chat assistants — internal company chat that never sends prompts to OpenAI
  • GDPR-compliant LLM access — EU customers in healthcare, legal, finance who can't push prompts to US clouds
  • Cost control at scale — predictable per-month VPS bill vs metered API spend
  • Air-gapped inference — on-prem or restricted-network environments
  • AI app development backbone — local dev loop for engineers building on top of LLMs

Who it's for

  • AI developers & ML engineers — fast local dev loop, no API rate limits, no $ per token while building
  • Privacy-bound enterprises — legal, healthcare, finance, gov teams forbidden from US-hosted LLM APIs
  • Hosting providers — resellers offering "private AI VPS" to their customers as a higher-margin SKU
  • Researchers & academics — evaluating open models without paying OpenAI / Anthropic per experiment
  • Indie SaaS founders — predictable per-month VPS cost beats unpredictable per-token bills as traffic grows

Why teams pick Ollama over alternatives

  • OpenAI-compatible API — most existing client code works with a URL change
  • Massive model catalog with one-command pulls (ollama pull llama3.3)
  • Apache 2.0 license — commercial use unencumbered
  • CPU-capable for small models (TinyLlama, Phi-3) — runs on a $7/mo VPS for testing
  • GPU optional but supported (CUDA, ROCm, Apple Metal) when you scale to 7B+ models
  • Single binary — operational simplicity, no Python venv hell

Integrations

  • Chat UIs — Open WebUI, AnythingLLM, LibreChat, Khoj all detect Ollama as first-class backend
  • Workflow builders — n8n + Flowise + Langflow + Typebot have native Ollama nodes
  • LLM SDKs — LangChain, LlamaIndex, Semantic Kernel, Haystack all support Ollama natively
  • OpenAI-proxy gateways — LiteLLM proxies Ollama as if it were OpenAI for legacy clients
  • IDE assistants — Continue.dev, Aider, Cline, Cody let devs hit local Ollama for code completion
  • Model formats — pulls Hugging Face GGUF directly; Modelfile lets you fork & customize
  • Embeddings endpoint/api/embeddings works with Chroma, Qdrant, pgvector RAG stacks

Notable users & community

  • 110k+ GitHub stars, top of awesome-selfhosted AI category
  • Integrated by Continue.dev, Cline, Aider, LangChain, LlamaIndex, LiteLLM, OpenWebUI as a first-class backend
  • Active Discord, weekly model drops, strong macOS and Linux maintainer community
  • Backed by ollama.ai company — sustainable dev model with permissive Apache 2.0 license
  • Cited in countless "self-host your AI stack" guides on r/selfhosted, r/LocalLLaMA

Tips & operations

  • Pre-pull models before exposing the service — first request triggers a multi-GB download that times out user calls
  • Tune OLLAMA_KEEP_ALIVE — default unloads model after 5 min idle; set 1h for warm latency, -1 to keep forever
  • Verify GPU detection with ollama ps — if model says "100% CPU" you're not using your GPU; check NVIDIA drivers + CUDA toolkit
  • Never expose Ollama directly — no built-in auth; always behind nginx + basic auth, OAuth proxy, or a chat-UI gateway
  • Memory budget rule — 7B Q4 ≈ 5 GB, 13B Q4 ≈ 9 GB, 70B Q4 ≈ 40 GB; size VPS accordingly
  • Disk cleanupollama list then ollama rm unused; models silently accumulate in /usr/share/ollama/.ollama/models

What we ship in BluixApps

  • Docker compose stack: Ollama server + GPU passthrough config (off by default)
  • Pre-allocated model storage volume at /var/lib/ollama for persistence across upgrades
  • Pinned ollama/ollama:0.5.4 image, tracked weekly against upstream
  • HTTP-only by default on 127.0.0.1:11434; SSL + auth via Nginx Proxy Manager when paired with Open WebUI
  • Sizing guidance shipped in customer docs: 8 GB RAM minimum for 3B models, 16 GB recommended for 7B, GPU required for 13B+
  • Backup hook captures /var/lib/ollama before each update (models can be 4-20 GB — opt-in)
Read this app's deep dive on bluix.app ↗

Get this app — pick a BluixApps plan

Same catalog. Scaling tenant isolation, white-label and support tier.

TierTenantsCatalogSupportWhite-labelMonthly
Stacks119 curated stacksStandard$19/moDetailDeploy
Starter10Full catalogStandard+$15–25/mo$49/moDetailDeploy
Pro25Full catalogPriority bugfix+$15–25/mo$149/moDetailDeploy
Growth100Full catalogPriority bugfix+$15–25/mo$349/moDetailDeploy
Scale500Full catalog7-day window+$15–25/mo$799/moDetailDeploy
EnterpriseUnlimitedFull catalogPriority 7-dayBundled$1,499/moDetailDeploy

Powered by WHMCompleteSolution