AI Development · 5 min read

Run LLMs Locally with Ollama: Zero API Costs, Full Privacy, Production-Ready

Ollama lets you run Llama 3, Mistral, DeepSeek, Phi-3, and 100+ open-weight models on your own hardware — zero API costs, data never leaves your servers. This guide covers setup, API integration, embedding models, and when local beats cloud.

Gurpreet Singh
March 31, 2026

Why Run LLMs Locally?

Every call to GPT-4o or Claude costs money. At scale — a thousand user sessions a day, each generating 5,000 output tokens — API fees become a significant operational expense. A customer support chatbot running on GPT-4o can easily cost $3,000–$10,000 per month at modest traffic once input and output tokens are both counted.
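The arithmetic is easy to sketch. A minimal back-of-envelope calculator — the per-million-token prices below are illustrative assumptions, so check current pricing before relying on them:

```python
def monthly_cost(sessions_per_day: int, in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float,
                 days: int = 30) -> float:
    """Rough monthly bill for an API priced in dollars per million tokens."""
    cost_per_session = in_tokens * in_price_per_m + out_tokens * out_price_per_m
    return sessions_per_day * days * cost_per_session / 1_000_000

# 1,000 sessions/day, 20k input + 5k output tokens per session,
# at assumed prices of $2.50/M input and $10/M output:
print(monthly_cost(1000, 20_000, 5_000, 2.50, 10.0))  # 3000.0
```

Plug in your own traffic and pricing; the point is that costs scale linearly with volume, while local inference costs stay flat once the hardware exists.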

Beyond cost, there is the data privacy question. Every prompt you send to OpenAI, Anthropic, or Google transits their servers. For businesses handling sensitive client data, medical records, legal documents, or proprietary business intelligence, sending that data to a third-party API is a compliance risk and a competitive one.

Ollama is an open-source tool that makes running LLMs locally as simple as ollama run llama3. It handles model downloading, quantisation, serving, and provides an API compatible with the OpenAI API spec — so your existing code works unchanged, just pointed at a different URL.

How Ollama Works

Ollama wraps the llama.cpp inference engine — a highly optimised C++ implementation of LLM inference that runs on CPU, Apple Silicon (Metal), and NVIDIA/AMD GPUs (CUDA/ROCm). It handles:

  • Model management: Download, store, version, and delete models from the Ollama Hub (similar to Docker Hub for models)
  • Quantisation: Models are distributed in GGUF format with multiple quantisation levels (Q4_K_M, Q5_K_M, Q8_0, F16) — 4-bit quantisation reduces a 70B model from 140GB to 40GB with minimal quality loss
  • Serving: Runs a local HTTP server on localhost:11434 with REST API endpoints
  • Hardware acceleration: Automatically uses GPU layers when a compatible GPU is detected, falling back to CPU
  • Context management: Handles KV cache, context window management, and batching across concurrent requests
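In practice this means the whole stack is reachable over plain HTTP on localhost. A minimal sketch of a request to the native /api/generate endpoint using only the standard library (the actual POST is commented out because it requires a running Ollama instance with llama3.1 pulled):

```python
import json
import urllib.request

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Request body for Ollama's native /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

payload = build_generate_payload("llama3.1", "Why is the sky blue?")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment with a local Ollama server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```

With stream set to True, the server instead returns one JSON object per line as tokens are generated.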

Setup and Running Your First Model

# Install Ollama (Linux — on macOS, use Homebrew or the installer from ollama.com)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run Llama 3.1 8B (4.7GB download)
ollama run llama3.1

# Or pull without running
ollama pull llama3.1:70b     # 40GB — needs 48GB+ RAM or GPU VRAM
ollama pull mistral           # 4.1GB — excellent quality-to-size ratio
ollama pull deepseek-r1:7b   # 4.7GB — strong reasoning
ollama pull phi3:mini         # 2.2GB — runs on CPU only, fast
ollama pull nomic-embed-text  # 274MB — embedding model for RAG

The API: Drop-in OpenAI Replacement

Ollama exposes two API formats: the native Ollama API (/api/generate, /api/chat) and an OpenAI-compatible API at /v1/chat/completions. This means you can use the official OpenAI Python SDK pointed at your local Ollama instance — zero code changes:

from openai import OpenAI

# Change base_url — everything else stays identical
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain RAG in 3 sentences."}
    ],
    temperature=0.7,
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Your existing Laravel application using the OpenAI PHP client works the same way — just override the base URL in your config:

// config/openai.php
"base_uri" => env("OPENAI_BASE_URI", "http://localhost:11434/v1"),

// .env
OPENAI_BASE_URI=http://ollama:11434/v1
OPENAI_API_KEY=ollama

Local Embeddings for RAG (Zero Cost per Vector)

Embedding 10 million chunks (at roughly a thousand tokens each) with OpenAI's text-embedding-3-small costs around $200. Running the same embeddings locally with nomic-embed-text costs nothing beyond electricity.

import ollama

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    embeddings = []
    for chunk in chunks:
        response = ollama.embeddings(
            model="nomic-embed-text",
            prompt=chunk
        )
        embeddings.append(response["embedding"])
    return embeddings

# Or via the OpenAI-compatible API
response = client.embeddings.create(
    model="nomic-embed-text",
    input="What is the return policy for damaged items?"
)
vector = response.data[0].embedding  # 768-dimensional vector

Nomic-embed-text produces 768-dimensional vectors and benchmarks competitively with OpenAI's ada-002. For RAG pipelines over internal documents where data privacy matters, local embeddings paired with local vector storage (pgvector) mean your data never leaves your infrastructure.
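Once chunks are embedded, retrieval is just nearest-neighbour search over those vectors. A minimal in-memory sketch using cosine similarity — at scale pgvector does this in SQL, and the 3-dimensional vectors here are toy stand-ins for real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query_vec: list[float], chunk_vecs: list[list[float]],
          chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query."""
    scored = sorted(zip(chunks, chunk_vecs),
                    key=lambda pair: cosine_similarity(query_vec, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in scored[:k]]

# Toy stand-ins for embedded document chunks
chunks = ["returns policy", "shipping times", "warranty terms"]
vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.9, 0.2, 0.1]]
query = [1.0, 0.0, 0.0]  # would come from embedding the user's question

print(top_k(query, vecs, chunks))  # ['returns policy', 'warranty terms']
```

The retrieved chunks then go into the prompt as context for the chat model — the standard RAG loop, entirely on local hardware.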

Modelfiles: Custom System Prompts and Personas

A Modelfile lets you create a custom model variant — baking in a system prompt, sampling parameters, and stop sequences — and give it a name:

FROM llama3.1

SYSTEM """
You are an AI assistant for Acme Corp customer support.
You have access to our return policy and product documentation.
Always be concise. Never make up information not in the provided context.
If unsure, say "Let me check that for you" and use the search tool.
"""

PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

# Build and run
# ollama create acme-support -f Modelfile
# ollama run acme-support

Hardware Requirements and Model Selection

The primary bottleneck for local LLM inference is memory bandwidth, not raw compute. The model must fit in RAM (CPU inference) or VRAM (GPU inference):

  • 8GB RAM / No GPU: phi3:mini (2.2GB), llama3.2:1b (1.3GB) — basic tasks, fast
  • 16GB RAM: llama3.1:8b Q4 (4.7GB), mistral:7b (4.1GB) — solid general purpose
  • 32GB RAM: llama3.1:8b Q8, gemma2:9b — higher quality responses
  • 64GB RAM / RTX 4090 (24GB VRAM): llama3.1:70b Q4 (40GB) with CPU+GPU split — near GPT-3.5 quality
  • 2× RTX 4090 or A100: llama3.1:70b Q8, full GPU inference — GPT-4 competitive

For business deployments, a dedicated server with 2× NVIDIA A100 80GB GPUs (available on AWS as p4de instances) can run llama3.1:70b at full precision with 150+ tokens/second throughput — enterprise-grade performance at a fraction of OpenAI API costs at scale.
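A rough rule of thumb for whether a model fits: parameter count × bytes per weight, plus headroom for the KV cache and runtime. A back-of-envelope sketch — the ~20% overhead factor is my assumption, not an Ollama figure:

```python
def approx_model_memory_gb(params_billions: float, bits_per_weight: int,
                           overhead: float = 1.2) -> float:
    """Approximate RAM/VRAM needed: weight bytes scaled by an overhead factor."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# 8B model at 4-bit quantisation vs 70B at 16-bit (FP16)
print(round(approx_model_memory_gb(8, 4), 1))    # 4.8
print(round(approx_model_memory_gb(70, 16), 1))  # 168.0
```

This matches the table above: an 8B Q4 model lands near 5GB, while a 70B FP16 model needs more memory than any single consumer GPU offers.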

When Local Beats Cloud — And When It Doesn't

Use local LLMs when: You have sensitive data (legal, medical, financial) that cannot leave your servers. You have high-volume, predictable workloads where API costs are significant. You need low-latency inference without internet dependency. You want to fine-tune on proprietary data without data leaving your infrastructure.

Use cloud APIs when: You need the absolute best model quality (GPT-4o, Claude 3.5 Sonnet). You have unpredictable or bursty traffic (cloud scales instantly, local hardware doesn't). You're in early development and don't want to manage GPU infrastructure. The task requires multimodal inputs (vision, audio) that open-weight models don't yet match.

Best practice: Use a hybrid approach. Run a local 8B model for high-volume, straightforward tasks (classification, extraction, summarisation). Fall back to GPT-4o or Claude for complex reasoning, nuanced writing, or edge cases. This hybrid architecture reduces API costs by 70–85% while maintaining quality where it matters.
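One way to implement that split is a thin router that sends easy, high-volume task types to the local model and escalates everything else. A minimal sketch — the task categories, length threshold, and model names are illustrative choices, not a prescribed taxonomy:

```python
LOCAL_MODEL = "llama3.1"   # served by Ollama at localhost:11434
CLOUD_MODEL = "gpt-4o"     # fallback for hard cases

SIMPLE_TASKS = {"classification", "extraction", "summarisation"}

def pick_model(task_type: str, prompt: str, max_local_chars: int = 8000) -> str:
    """Route cheap, well-bounded tasks locally; escalate the rest to the cloud."""
    if task_type in SIMPLE_TASKS and len(prompt) <= max_local_chars:
        return LOCAL_MODEL
    return CLOUD_MODEL

print(pick_model("classification", "Is this email spam? ..."))       # llama3.1
print(pick_model("reasoning", "Draft a nuanced analysis of ..."))    # gpt-4o
```

Because both endpoints speak the OpenAI API format, the rest of the pipeline only needs to swap the base URL and model name per request.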

#Ollama #Local LLM #Llama 3 #Mistral #Open Source AI #RAG #Python #Laravel #Privacy
Gurpreet Singh

Senior Full Stack Developer — Laravel, Vue.js, Nuxt.js & AI. Available for freelance projects.
