
Ollama

Run open-source LLMs locally — get up and running with DeepSeek, Qwen, Gemma, Llama, and more with a single command.

Self-hosted AI inference, honestly reviewed. No marketing fluff, just what you actually get when you run models on your own hardware.


TL;DR

  • What it is: Open-source (MIT) runtime for running large language models locally or on your own server — no cloud API calls, no per-token billing, no data leaving your infrastructure [3].
  • Who it’s for: Developers, privacy-conscious founders, and homelab enthusiasts who want to run AI inference without paying cloud API bills or sending code/data to third parties [2][3].
  • Cost savings: OpenAI API costs $0.005 per 100K tokens at the low end [1]; Ollama local inference costs $0 in API fees but requires hardware that can range from a $5/mo VPS to a $4,000+ GPU rig [3][4].
  • Key strength: Dead-simple model management (ollama pull llama3.2, done), a clean REST API that drops in wherever you’d call OpenAI, and 165,755 GitHub stars — the largest active community around local LLM tooling [merged profile].
  • Key weakness: Model quality and speed depend entirely on your hardware. On modest hardware, local models are noticeably slower and less accurate than hosted APIs [1]. This is a hardware problem, not an Ollama problem — but Ollama doesn’t paper over it.

What is Ollama

Ollama is a runtime that makes running open-weight language models on your own hardware as simple as a one-line shell command. You pull a model, run it, and interact with it through a REST API on localhost:11434. The company describes its mission as “build with open models while keeping your data safe,” which is the cleaner way to say: everything runs on your infrastructure, nothing phones home [merged profile][3].

Under the hood, Ollama is built on llama.cpp — Georgi Gerganov’s C++ inference engine that made running LLMs on consumer hardware viable in the first place [README]. Ollama wraps llama.cpp with automatic model management, GPU detection, memory optimization, and a REST API surface that’s intentionally OpenAI-compatible. That last part matters: most tools that support OpenAI’s API can be pointed at Ollama with a one-line config change [3].
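
To see how small that change is, here is a minimal sketch that points the standard OpenAI Python client at a local Ollama server. It assumes pip install openai and that llama3.2 has already been pulled; the model name is just an example:

# Point the stock OpenAI client at Ollama's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible surface
    api_key="ollama",                      # required by the client library, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize what Ollama does in one sentence."}],
)
print(response.choices[0].message.content)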

The model library at ollama.com/library covers the major open-weight families: Llama, Mistral, Gemma, DeepSeek, Qwen, Phi, and dozens more. As of this review, the GitHub README leads with Kimi-K2.5, GLM-5, and MiniMax as current highlights — the library tracks frontier open releases fairly quickly [README].

GitHub: 165,755 stars. MIT license. Install in one command on macOS, Windows, or Linux.


Why people choose it

The appeal breaks down along a few axes that keep coming up in real-world usage.

Privacy and data sovereignty. This is the primary driver for the developer audience. The Contabo review [3] frames it plainly: “Running large language models used to mean cloud APIs, per-token billing, and trusting third parties with your data. Ollama changed all that.” For teams doing AI-assisted code review, the concern is concrete — sending your source code to CodeRabbit, GitHub Copilot, or CodeAnt means proprietary code hitting servers you don’t control [2]. With Ollama, the inference happens on your infrastructure. “Your prompts never hit external APIs, and model files live on your disk. Responses generate locally, never touching someone else’s servers” [3].

Cost structure. OpenAI’s API charges per token. Ollama’s API charges nothing after setup. For high-volume use cases — code review on every commit, document analysis pipelines, RAG over internal data — the math shifts in Ollama’s favor once you’ve amortized the hardware [2][3].

Developer control. The REST API is clean and documented. Python and JavaScript client libraries are first-party. You can run multiple models simultaneously and switch per request — something cloud APIs make painful through separate API keys and pricing tiers [3].
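
As a minimal sketch of that per-request switching with the first-party Python client (model names are examples; both would need to be pulled first):

# pip install ollama -- the first-party Python client.
import ollama

# A general question goes to a small general-purpose model...
general = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain retrieval-augmented generation in two sentences."}],
)
print(general["message"]["content"])

# ...while a coding question targets a code-tuned model, chosen per request.
coding = ollama.chat(
    model="qwen2.5-coder",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(coding["message"]["content"])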

The honest counterargument. One Medium review [1] documents the opposite trajectory — someone who ran local models for weeks, got frustrated, and switched back to OpenAI. The complaints are real: “I’d wait anywhere from 10 to 30 seconds just to get a basic output” and “Math problems? Sometimes it straight-up did the wrong calculations.” The reviewer eventually concluded: “I was spending just 0.5 cents for 100,000 tokens [on OpenAI]. That’s absurdly affordable for production-level quality.” [1]

Both experiences are true. They’re just true for different hardware setups and different model sizes. A 7B model on a CPU-only VPS will be slow and less accurate. A 70B model on four RTX 3090s will not. Ollama doesn’t solve the hardware problem — it just removes all the other friction.


Features

Model management:

  • ollama pull <model> downloads a model from the library. ollama run <model> starts an interactive session [README]. Both are also scriptable through the first-party Python client (see the sketch after this list).
  • Models are stored locally under ~/.ollama/models/ on Linux — each model is several gigabytes depending on size and quantization [3].
  • Multiple models can run simultaneously; Ollama manages memory and unloads models under memory pressure automatically [3].
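
A minimal sketch of the programmatic equivalent, using the Python client (model name is an example):

# Manage models through the Python client instead of the CLI.
import ollama

ollama.pull("llama3.2")            # equivalent to `ollama pull llama3.2`

for m in ollama.list()["models"]:  # enumerate models stored on local disk
    print(m)                       # each entry describes a model (name, size, modified time)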

REST API:

  • Local endpoint at http://localhost:11434. Accepts JSON, returns completions [README][3].
  • Streaming responses supported — tokens arrive progressively rather than waiting for full completion [3] (see the streaming sketch after this list).
  • OpenAI-compatible endpoint format means tools like n8n, AnythingLLM, Dify, and Open WebUI connect with minimal configuration [README][3].
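
A minimal streaming sketch against the local API, using the first-party Python client (model name is an example):

# Stream tokens as they are generated instead of waiting for the full reply.
import ollama

stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "List three reasons to run an LLM locally."}],
    stream=True,
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)  # partial tokens arrive per chunk
print()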

GPU and hardware:

  • GPU acceleration is automatic — if you have a supported GPU, Ollama detects and uses it [3].
  • CPU-only inference works for smaller models (7B and under) but slows down sharply as model size grows [3][4].
  • On GPU hardware: a single RTX 3060 12GB is a common entry point. The homelab guide runs four RTX 3090s for heavy workloads [4].

Integrations:

  • Python library (pip install ollama) and JavaScript library (npm i ollama) are first-party [README].
  • The website claims 40,000+ community integrations — this is a stretched number that includes anything built on top of the OpenAI-compatible API, not just Ollama-specific integrations [website scrape].
  • Native integrations in the README include: Open WebUI, LibreChat, AnythingLLM, Dify, n8n, LangChain, LlamaIndex, and dozens more [README]. The LangChain sketch after this list shows the typical shape.
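
A hedged sketch of one of those integrations, assuming the langchain-ollama package and a locally pulled llama3.2 model:

# pip install langchain-ollama -- LangChain's Ollama chat-model wrapper.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2", temperature=0.2)
reply = llm.invoke("In one sentence, what does a reverse proxy do?")
print(reply.content)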

New cloud tier (as of 2026):

  • Ollama now offers a cloud hosting option — their infrastructure, their GPU, their API key. This is new and isn’t the reason most people come to Ollama [pricing page].

Modelfile:

  • Custom model packaging format. You can define system prompts, temperature defaults, and the base model in a Modelfile and publish it as a named model [README docs], as in the sketch below.
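
A minimal Modelfile sketch (base model, parameter value, and system prompt are example choices, not recommendations):

FROM llama3.2
PARAMETER temperature 0.3
SYSTEM """You are a terse code-review assistant. Flag bugs and security issues; skip style nitpicks."""

Build and run it with ollama create code-reviewer -f Modelfile, then ollama run code-reviewer [README docs].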

Pricing: SaaS vs self-hosted math

Ollama Cloud (their new managed option):

  • Free: local usage unlimited + light cloud access. No per-token pricing.
  • Pro: $20/mo — 50× more cloud usage than Free, up to 3 cloud models at a time.
  • Max: $100/mo — 5× more than Pro, up to 10 concurrent cloud models.

Usage is measured in GPU-time, not tokens — “Shorter requests and prompts that share cached context use less” [pricing page]. The Free plan exists primarily for evaluation; Pro is positioned for daily development use.

Self-hosted (what most users actually want):

  • Ollama software: $0 (MIT license).
  • A minimal VPS (2 vCPU, 8GB RAM) handles 7B models CPU-only for low-frequency use: ~$5–8/mo on Hetzner or Contabo [3].
  • A GPU-equipped server changes the math entirely: a used RTX 3060 is ~$300, a used RTX 3090 ~$700. Cloud VMs with an NVIDIA GPU run $50–200/mo depending on spec.
  • For serious inference: a 4× RTX 3090 homelab rig runs ~$5,000 upfront but generates zero ongoing API fees [4].

Versus cloud APIs:

  • OpenAI GPT-4o: roughly $2.50 per 1M input tokens. At 10 million tokens/month (a busy dev environment), that’s $25/mo input alone.
  • OpenAI o3: $10 per 1M input tokens — for reasoning tasks, costs escalate fast.
  • Ollama self-hosted at similar quality (a 70B Llama or DeepSeek model on adequate hardware): $0 in API fees after hardware [3].

The honest math: if your inference volume is low and your hardware is modest, cloud APIs are cheaper when you factor in hardware amortization, electricity, and setup time. If your inference volume is high, you handle sensitive data, or you need guaranteed privacy, the calculus flips [1][2][3]. One developer review [1] concludes cloud is better for production at low volume. The Contabo review [3] concludes self-hosted is better for AI-powered workflows where “there are no usage caps slowly draining your budget.” Both are right. The break-even point is roughly 5–10 million tokens/month on a paid GPU, which is achievable for anything running automated code review or document pipelines at scale.
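
One way to ground that break-even figure is to compare a rented GPU VM against the per-token prices above. The sketch below reuses the numbers quoted in this section; the specific VM price, and the fact that output tokens and electricity are ignored, are simplifying assumptions:

# Rough break-even arithmetic; replace the inputs with your own numbers.
gpu_vm_per_month = 75.0        # mid-range of the $50-200/mo cloud GPU VM figure above
o3_per_million = 10.0          # o3 input pricing quoted above, $/1M tokens
gpt4o_per_million = 2.5        # GPT-4o input pricing quoted above, $/1M tokens

breakeven_vs_o3 = gpu_vm_per_month / o3_per_million        # = 7.5M tokens/month
breakeven_vs_gpt4o = gpu_vm_per_month / gpt4o_per_million  # = 30M tokens/month

print(f"Break-even vs o3-class pricing:     {breakeven_vs_o3:.1f}M tokens/month")
print(f"Break-even vs GPT-4o-class pricing: {breakeven_vs_gpt4o:.1f}M tokens/month")

On those assumptions, the break-even against o3-class pricing lands in the 5–10M range quoted above; against cheaper GPT-4o-class pricing the bar is several times higher, which is part of why low-volume users tend to stay on hosted APIs [1].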


Deployment reality check

Getting Ollama running is genuinely simple — arguably the simplest part of local LLM infrastructure.

Linux install:

curl -fsSL https://ollama.com/install.sh | sh

Done. The service starts, models download on first pull [README].

Docker:

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Official Docker Hub image. Works [README].
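
Whichever install path you take, a quick sanity check is to ask the local API which models it has. A minimal sketch using only the Python standard library, assuming the default bind address:

# List locally installed models via the native API.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    data = json.load(resp)

for model in data.get("models", []):
    print(model.get("name"), model.get("size"))  # model name and size in bytes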

What can go sideways:

The security risk is real and under-documented in most setup guides. The Redfox Security analysis [5] is blunt: “Most blog posts walking you through Ollama setup focus entirely on getting the model running and the JWT copied. None of them talk about what happens when that JWT leaks, when the API endpoint is misconfigured.” By default, Ollama listens on port 11434 with no authentication. If that port is exposed to the internet — which is easy to do accidentally — “anyone who can reach the IP can query your model, exfiltrate your prompts, or flood the endpoint with requests that exhaust your server resources” [5].

The check is one command: nmap -sV -p 11434 <your_server_ip>. If it comes back open, that’s a critical finding [5]. The fix is a firewall rule blocking external access to 11434, with a reverse proxy (nginx/Caddy) handling authenticated HTTPS in front.
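
Once the proxy is in place, clients should call the authenticated HTTPS endpoint rather than port 11434 directly. A minimal sketch of what that looks like, assuming a hypothetical proxy at https://ollama.example.internal that checks a bearer token (the hostname and token scheme belong to the proxy, not to Ollama, which ships no authentication of its own):

# Call Ollama's native /api/generate endpoint through an authenticating reverse proxy.
import json
import urllib.request

req = urllib.request.Request(
    "https://ollama.example.internal/api/generate",   # hypothetical proxied endpoint
    data=json.dumps({
        "model": "llama3.2",
        "prompt": "Say hello.",
        "stream": False,
    }).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <token-issued-by-your-proxy>",  # enforced by the proxy; placeholder value
    },
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])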

Hardware reality:

The homelab guide [4] is useful for setting expectations. CPU-only works but “at slow speed and smaller models… this would impact your experience with various models and sizes” [4]. For anything approaching the quality of a cloud API response, you need a GPU: 8GB+ of VRAM for 7B models, 12–16GB for the 13B class, and 48GB or more (in practice, multiple 24GB cards) for 70B models. Consumer gaming GPUs (RTX 3060, 3080, 3090) work well. Apple Silicon Macs with unified memory are also viable for 7–13B models.
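
A rough way to sanity-check those numbers is to estimate weight memory from parameter count and quantization. This is back-of-the-envelope only; it ignores KV cache and runtime overhead:

# Back-of-the-envelope VRAM estimate: parameters x bytes per weight.
def est_weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9  # gigabytes of weights

for size_b in (7, 13, 70):
    print(f"{size_b}B model @ 4-bit quantization: ~{est_weights_gb(size_b):.1f} GB of weights")
# ~3.5 GB, ~6.5 GB, ~35 GB respectively -- which is why 70B models want multiple 24GB cards.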

Time estimate for a technical user: 15 minutes to a working local instance. For a production VPS deployment with HTTPS reverse proxy and firewall rules: 1–2 hours. For a non-technical founder: this requires a technical person’s help or a deployment service. There is no admin UI — everything is CLI and API.


Pros and cons

Pros

  • Privacy by architecture. Inference is local. Prompts, completions, and your data never leave your infrastructure [2][3]. For teams with compliance requirements or sensitive IP, this is the point.
  • Genuinely simple model management. ollama pull and ollama run work exactly as advertised. The barrier to trying a new model is one command [README][3].
  • OpenAI-compatible API. Drop-in replacement for any tool that supports OpenAI’s API format. n8n, AnythingLLM, Open WebUI, LangChain — they all work without Ollama-specific code [3][README].
  • MIT license, real. You can embed it in your product, build a service on top of it, fork it. No license negotiation [merged profile].
  • 165,755 GitHub stars. The community produces integrations, Modelfiles, guides, and bug reports at a scale that means obscure problems are usually already documented [merged profile].
  • Automatic GPU utilization. No CUDA configuration required — Ollama handles it [3][4].
  • Model variety. Llama, Mistral, Gemma, DeepSeek, Qwen, Phi — the library tracks frontier open-weight releases [README].

Cons

  • Quality ceiling set by hardware, not software. On a CPU-only VPS, small models produce slow, sometimes unreliable output [1]. The Medium review [1] documents real output failures — wrong math, repetitive generation, inconsistent structured output — that don’t happen on comparable cloud models. These are hardware-constrained problems, not Ollama bugs, but they’re your problem to solve.
  • No authentication on the Ollama API by default. Port 11434 requires manual firewall configuration to stay private. Easy to misconfigure, with real consequences [5].
  • Not a turnkey solution for non-technical founders. There is no web UI in Ollama itself — that’s a separate tool (Open WebUI, etc.). Setup requires comfort with a terminal and understanding of networking basics.
  • Disk space. 7B models are 4–8GB each. Running five or six models for different tasks means 30–50GB of model storage minimum [3].
  • Memory management is automatic but not magic. Loading a 70B model on a machine with 32GB RAM will be slow or fail. You need to match model size to hardware carefully [3][4].
  • Security requires active hardening. The attack surface of a self-hosted Ollama deployment includes the model API, the integration layer, and any automation that acts on model outputs. Each layer has real vulnerabilities [5].

Who should use this / who shouldn’t

Use Ollama if:

  • You’re building AI tools that process sensitive code, documents, or customer data and cannot use cloud APIs for privacy or compliance reasons [2][3].
  • You’re running high-volume inference where per-token cloud costs become significant (document pipelines, code review on every commit, RAG over large internal corpora) [3].
  • You have a technical person on the team — or you are the technical person — and can manage a VPS, set up a reverse proxy, and harden firewall rules [5].
  • You want to experiment with different open-weight models without per-model API pricing [README].

Skip it if:

  • You’re a non-technical founder who has never touched a Linux server. Ollama itself is easy; securing and maintaining the surrounding infrastructure is not.
  • Your inference volume is low. If you’re spending under $20/mo on OpenAI, the hardware and maintenance overhead of self-hosting doesn’t pay off [1].
  • You need frontier model quality (GPT-4o, Claude Opus, Gemini Pro-level). Open-weight models at 7–13B parameters on consumer hardware do not match hosted frontier models on complex tasks [1].
  • You need guaranteed uptime. A self-hosted model server is only as reliable as you make it.

Alternatives worth considering

  • LM Studio — desktop app for running local models with a GUI. Good for personal use, not for server deployment or API exposure.
  • Jan — similar to LM Studio, desktop-first, open-source.
  • vLLM — production-grade inference engine, faster than Ollama for high-throughput serving, significantly more complex to set up [4].
  • llama.cpp (direct) — what Ollama is built on. Gives you more control, requires more work.
  • OpenAI API — the obvious alternative. Simpler, more reliable, better quality on complex tasks, costs money per token, your data goes to their servers [1].
  • Anthropic API — same trade-offs as OpenAI, different model family.
  • Groq / Together AI / Fireworks — cloud inference for open-weight models. You get speed and model variety without managing infrastructure, but your data still leaves your servers.

For a non-technical founder who just wants AI in their product: the hosted APIs are still the path of least resistance. For a developer team with privacy requirements and meaningful inference volume: Ollama is the obvious first stop.


Bottom line

Ollama solves exactly one problem: making local LLM inference not annoying to set up. It does that job extremely well. The REST API is clean, the model library is comprehensive, the install takes two minutes, and the community of 165,755 GitHub stars means integrations exist for almost everything. What it doesn’t solve is the underlying hardware reality — local models on modest hardware are slower and less accurate than hosted frontier models, and at least one developer with real-world experience found the trade-off not worth it [1]. The privacy and cost arguments are genuine for the right workloads. If you’re processing sensitive code, running high-volume pipelines, or building on open-weight models for compliance reasons, Ollama is the right runtime. If you’re a non-technical founder trying to bolt AI onto a product, start with a cloud API and revisit once the infrastructure overhead makes sense.


Sources

  1. W Shamim, Medium, “Why I Stopped Using Ollama and Local Models (And Switched Back to OpenAI)” (Apr 6, 2025). https://medium.com/@Shamimw/why-i-stopped-using-ollama-and-local-models-and-switched-back-to-openai-2d125f303e1c

  2. Shrijith, DEV Community, “Secure, Self-Hosted AI Code Review Powered by Ollama”. https://dev.to/shrsv/secure-self-hosted-ai-code-review-powered-by-ollama-2p55

  3. Contabo Blog, “What Is Ollama and How To Use it with n8n”. https://contabo.com/blog/what-is-ollama-and-how-to-use-it-with-n8n/

  4. Digital Spaceport, “How To Setup an AI Server Homelab Beginners Guides – Ollama + OWUI Proxmox 9 LXC”. https://digitalspaceport.com/how-to-setup-an-ai-server-homelab-beginners-guides-ollama-and-openwebui-on-proxmox-lxc/

  5. Karan Patel, Redfox Security, “Self-Hosted AI Code Review Security Risks: What Startups Must Know Before Deploying Ollama” (Apr 6, 2026). https://www.redfoxsec.com/blog/self-hosted-ai-code-review-with-ollama-security-risks-hardening-strategies-and-what-startups-get-wrong
