How to Set Up Ollama on Your Own Server: A Complete Step-by-Step Guide

Velocity Software Solutions · May 4, 2026 · 16 min read

Running large language models on your own server gives you something no cloud API can: complete control over your data, zero per-token costs, and the ability to run inference 24/7 without worrying about rate limits or API outages. We set this up on our own infrastructure at Velsof — a bare-metal server with an NVIDIA RTX 4090 — and it now powers features like AI-powered code explanation on our blog, internal document analysis, and development tooling.

This guide walks you through the entire process of setting up Ollama on a Linux server, from installation to production-ready configuration. Every command here has been tested on Ubuntu 24.04, but the steps work on most Linux distributions with minor adjustments.

Why Self-Host LLMs?

Before we get into the how, here is why you might want to run your own LLM server instead of using OpenAI, Anthropic, or Google APIs:

  • Data privacy. Your prompts and responses never leave your network. For companies handling sensitive code, client data, or regulated information, this is often a hard requirement.
  • Cost at scale. API calls add up fast. If you are making thousands of inference calls per day — for code review, document processing, or AI automation — a one-time GPU investment pays for itself within months.
  • Latency. Local inference on a good GPU is often faster than API round-trips, especially for smaller models. Our Qwen3 1.7B model responds in under 200ms for short prompts.
  • No rate limits. Your server, your rules. Run as many concurrent requests as your hardware can handle.
  • Offline capability. The server works without internet access once models are downloaded.

Prerequisites

Here is what you need before starting:

| Requirement    | Minimum                              | Recommended                    |
|----------------|--------------------------------------|--------------------------------|
| OS             | Ubuntu 20.04+ / Debian 11+ / RHEL 8+ | Ubuntu 24.04 LTS               |
| RAM            | 8 GB (for 7B models)                 | 32+ GB                         |
| Storage        | 20 GB free                           | 100+ GB SSD (models are large) |
| GPU (optional) | NVIDIA GPU with 6+ GB VRAM           | NVIDIA RTX 4090 (24 GB VRAM)   |
| CUDA (if GPU)  | CUDA 11.8+                           | CUDA 12.x                      |
| Network        | SSH access to the server             | Dedicated IP or domain         |

Ollama can run on CPU-only servers, but inference will be significantly slower. For production use, a GPU is strongly recommended. Even an older NVIDIA GTX 1080 Ti (11 GB VRAM) can run 7B parameter models comfortably.
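
A quick way to confirm your server meets these requirements, using standard Linux tools (nvidia-smi is only available once the NVIDIA driver is installed):

Bash
# Check available RAM
free -h

# Check free space on the root partition
df -h /

# Check whether an NVIDIA GPU is present (works without the driver)
lspci | grep -i nvidia

# If the driver is installed, show VRAM and the supported CUDA version
nvidia-smi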

Step 1: Install Ollama

Ollama provides a one-line installer that handles everything — downloading the binary, creating a dedicated system user, and setting up the systemd service.

SSH into your server and run:

Bash
curl -fsSL https://ollama.com/install.sh | sh

You should see output similar to this:

Bash
>>> Installing ollama to /usr/local/bin...
>>> Downloading Linux amd64 bundle...
######################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> Downloading NVIDIA cuDNN...
######################################## 100.0%
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.

The installer automatically:

  • Downloads the Ollama binary to /usr/local/bin/ollama
  • Creates a dedicated ollama system user and group
  • Sets up a systemd service at /etc/systemd/system/ollama.service
  • Detects NVIDIA GPUs and downloads the necessary CUDA libraries
  • Starts the Ollama service immediately

Verify the Installation

Check that Ollama is running:

Bash
# Check the service status
sudo systemctl status ollama

# Expected output:
# ● ollama.service - Ollama Service
#      Loaded: loaded (/etc/systemd/system/ollama.service; enabled)
#      Active: active (running)
#    Main PID: 1640 (ollama)

# Check the version
ollama --version
# ollama version is 0.11.6

# Test the API endpoint
curl http://localhost:11434
# Ollama is running

If the service is not running, start it manually:

Bash
sudo systemctl start ollama
sudo systemctl enable ollama  # ensure it starts on boot

Step 2: Download Your First Model

Ollama uses a Docker-like pull mechanism for models. You specify the model name and optionally a size variant, and Ollama downloads it from the model library.

Start with a smaller model to verify everything works:

Bash
# Pull Llama 3.2 (3B parameters, ~2 GB download)
ollama pull llama3.2

# You'll see download progress:
# pulling manifest
# pulling 4fc21a39517b... 100%  ████████████████████ 2.0 GB
# pulling 966de95ca8a6... 100%  ████████████████████ 1.4 KB
# pulling fcc5a6bec9da... 100%  ████████████████████ 7.7 KB
# pulling a70ff7e570d9... 100%  ████████████████████ 6.0 KB
# pulling 56bb8bd477a5... 100%  ████████████████████  96 B
# pulling 34bb5ab01051... 100%  ████████████████████  561 B
# verifying sha256 digest
# writing manifest
# success

Now test it with a quick prompt:

Bash
# Run an interactive chat
ollama run llama3.2 "What is the capital of France?"

# Expected response:
# The capital of France is Paris.

If you see a response, your setup is working. The model loaded into GPU memory (or RAM if no GPU), processed the prompt, and returned the result.

Choosing the Right Model

The model you choose depends on your hardware and use case. Here is a practical guide based on what we have tested on our RTX 4090 (24 GB VRAM):

| Model           | Parameters | Size   | VRAM Needed | Best For                                                      |
|-----------------|------------|--------|-------------|---------------------------------------------------------------|
| qwen3:1.7b      | 1.7B       | 1.4 GB | ~2 GB       | Fast responses, simple tasks, code completion                 |
| llama3.2        | 3B         | 2.0 GB | ~3 GB       | General chat, summarization                                   |
| deepseek-r1:8b  | 8B         | 4.9 GB | ~6 GB       | Reasoning, code generation                                    |
| codellama:13b   | 13B        | 7.4 GB | ~10 GB      | Code analysis, programming tasks                              |
| gemma3:12b      | 12B        | 8.1 GB | ~10 GB      | General purpose, multilingual                                 |
| qwen3:14b       | 14B        | 9.3 GB | ~12 GB      | Complex reasoning, long context                               |
| deepseek-r1:32b | 32B        | 19 GB  | ~22 GB      | Advanced reasoning (needs 24 GB GPU)                          |
| mixtral:8x7b    | 47B (MoE)  | 26 GB  | ~28 GB      | High quality general purpose (needs large GPU or CPU offload) |

Pull multiple models to have them available on demand — Ollama only loads a model into memory when you actually use it:

Bash
# Pull several models for different use cases
ollama pull qwen3:1.7b        # fast, lightweight
ollama pull deepseek-r1:14b    # strong reasoning
ollama pull codellama:13b      # code-focused
ollama pull nomic-embed-text   # text embeddings (for RAG)

# List all downloaded models
ollama list

# NAME                   ID              SIZE      MODIFIED
# qwen3:1.7b             8f68893c685c    1.4 GB    12 days ago
# deepseek-r1:14b        ea35dfe18182    9.0 GB    2 months ago
# codellama:13b          9f438cb9cd58    7.4 GB    3 months ago
# nomic-embed-text       0a109f422b47    274 MB    6 months ago
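
Before committing VRAM to a model, you can inspect its architecture, parameter count, default context length, and license with ollama show:

Bash
# Show model details
ollama show llama3.2

# Show the Modelfile the model was built from (system prompt, parameters)
ollama show llama3.2 --modelfile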

Step 3: Understand the Systemd Service

The installer creates a systemd service file that manages the Ollama server process. Here is what the default service file looks like:

Bash
# /etc/systemd/system/ollama.service

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/cuda/bin"

[Install]
WantedBy=default.target

Key things to note:

  • Ollama runs as a dedicated ollama user — not root — which is good security practice
  • Restart=always means the service will restart automatically if it crashes
  • The PATH includes /usr/local/cuda/bin for GPU support
  • By default, Ollama listens on 127.0.0.1:11434 — only accessible from the local machine

Common service management commands:

Bash
# Start / stop / restart the service
sudo systemctl start ollama
sudo systemctl stop ollama
sudo systemctl restart ollama

# View logs (last 50 lines)
sudo journalctl -u ollama -n 50 --no-pager

# Follow logs in real-time (useful for debugging)
sudo journalctl -u ollama -f

# Check if enabled on boot
sudo systemctl is-enabled ollama
# enabled

Step 4: Configure Ollama for Production

The default configuration works for local testing, but production use requires a few adjustments.

4a. Allow Remote Access

By default, Ollama only listens on localhost. If you need other servers (like a web application) to access it, you need to change the bind address.

Edit the systemd service file:

Bash
sudo systemctl edit ollama

# This opens an editor. Add the following:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

# Save and exit, then restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Verify it's listening on all interfaces:
sudo ss -tlnp | grep 11434
# LISTEN  0  4096  *:11434  *:*  users:(("ollama",pid=1640,fd=3))

Security warning: If you expose Ollama to the network, make sure you have proper firewall rules in place. Ollama has no built-in authentication. We will set up an Nginx reverse proxy with authentication in Step 6.
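
As a stopgap until that proxy is in place, you can restrict port 11434 at the firewall. A minimal sketch using ufw, where 10.0.0.5 is a placeholder for your application server's address:

Bash
# Allow only the application server to reach Ollama directly
sudo ufw allow from 10.0.0.5 to any port 11434 proto tcp

# Block everyone else (the more specific allow rule above matches first)
sudo ufw deny 11434/tcp

# Review the resulting rules
sudo ufw status numbered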

4b. Change the Model Storage Location

By default, models are stored in /usr/share/ollama/.ollama/models/. Large models can fill up your root partition quickly. If you have a larger secondary drive, redirect model storage there.

On our server, we symlinked the model blobs directory to a 2 TB disk because the root partition was running low on space:

Bash
# Option 1: Symlink the blobs directory to a larger drive
sudo systemctl stop ollama
sudo mkdir -p /mnt/data/ollama-blobs/blobs
sudo mv /usr/share/ollama/.ollama/models/blobs/* /mnt/data/ollama-blobs/blobs/
sudo rmdir /usr/share/ollama/.ollama/models/blobs
sudo ln -s /mnt/data/ollama-blobs/blobs /usr/share/ollama/.ollama/models/blobs
sudo chown -R ollama:ollama /mnt/data/ollama-blobs
sudo systemctl start ollama

# Verify the symlink
ls -la /usr/share/ollama/.ollama/models/
# blobs -> /mnt/data/ollama-blobs/blobs

# Option 2: Set OLLAMA_MODELS environment variable
sudo systemctl edit ollama
# Add:
# [Service]
# Environment="OLLAMA_MODELS=/mnt/data/ollama/models"
sudo systemctl daemon-reload
sudo systemctl restart ollama

4c. Configure GPU and Memory Settings

If your server has an NVIDIA GPU, Ollama detects it automatically. You can verify the GPU is actually being used by running a model and then checking nvidia-smi:

Bash
# Check GPU detection
nvidia-smi
# Shows your GPU, VRAM, and CUDA version

# Check Ollama is using the GPU (run a model, then check)
ollama run qwen3:1.7b "hello" &>/dev/null
nvidia-smi
# You should see an 'ollama' process using GPU memory

# Useful environment variables for GPU control:
sudo systemctl edit ollama

[Service]
# Limit Ollama to specific GPUs (useful on multi-GPU servers)
Environment="CUDA_VISIBLE_DEVICES=0"

# Keep models loaded in memory for faster subsequent requests.
# The default is 5 minutes; -1 keeps models loaded indefinitely.
Environment="OLLAMA_KEEP_ALIVE=-1"

# Note: in current Ollama releases, partial GPU offloading is controlled
# per request via the "num_gpu" option (0 = CPU only, -1 = all layers
# on GPU) rather than an environment variable. Pass it in the "options"
# object of an API call, as shown in Step 5.

4d. Set Context Window Size

The default context window varies by model but is typically 2048 or 4096 tokens. For tasks that require processing longer documents, you can increase it:

Bash
# Set context size interactively in a chat session
ollama run llama3.2
>>> /set parameter num_ctx 8192

# Or via the API (num_ctx parameter)
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize this document...",
  "options": {
    "num_ctx": 8192
  }
}'

Be aware that larger context windows use more VRAM. An 8K context on a 13B model can use 2-3 GB more VRAM than the default 4K context.
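
To see how much memory a loaded model (weights plus context) actually consumes, and whether it landed on the GPU, check ollama ps:

Bash
ollama ps

# Illustrative output:
# NAME        ID              SIZE      PROCESSOR    UNTIL
# llama3.2    a80c4f17acd5    4.0 GB    100% GPU     4 minutes from now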

Step 5: Using the Ollama API

Ollama exposes a REST API that is compatible with many AI frameworks. This is how you integrate it with your web applications, scripts, and automation workflows.

5a. Generate Text (Completion API)

Bash
# Simple text generation
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:1.7b",
  "prompt": "Explain what a REST API is in 2 sentences.",
  "stream": false
}' | python3 -m json.tool

# Response:
# {
#     "model": "qwen3:1.7b",
#     "response": "A REST API is an interface that allows...",
#     "done": true,
#     "total_duration": 1283947200,
#     "load_duration": 10234100,
#     "prompt_eval_count": 18,
#     "eval_count": 52,
#     "eval_duration": 482913000
# }
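
Streaming is the default: without "stream": false, the endpoint returns newline-delimited JSON, one chunk per generated token, which is what you want for chat-style UIs. A quick look at the raw stream:

Bash
curl -N http://localhost:11434/api/generate -d '{
  "model": "qwen3:1.7b",
  "prompt": "Count to three."
}'

# Each line is a JSON object like:
# {"model":"qwen3:1.7b","response":"One","done":false}
# The final line has "done": true plus the timing statistics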

5b. Chat API (Conversational)

Bash
# Multi-turn conversation
curl -s http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful DevOps assistant."},
    {"role": "user", "content": "How do I check disk usage on Linux?"}
  ],
  "stream": false
}' | python3 -m json.tool

# The chat API maintains conversation format,
# making it easy to build chatbot applications
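
The server itself is stateless: to continue a conversation, you send the full history back, appending the assistant's previous reply before the new user turn. A sketch of the follow-up request:

Bash
curl -s http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful DevOps assistant."},
    {"role": "user", "content": "How do I check disk usage on Linux?"},
    {"role": "assistant", "content": "Use df -h for filesystems and du -sh for directories."},
    {"role": "user", "content": "And how do I find the largest directories?"}
  ],
  "stream": false
}' | python3 -m json.tool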

5c. Generate Embeddings

Ollama can generate text embeddings for RAG (retrieval-augmented generation) applications:

Bash
# First, pull an embedding model
ollama pull nomic-embed-text

# Generate embeddings
curl -s http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "What is the best way to deploy a Node.js application?"
}' | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Dimensions: {len(d[\"embeddings\"][0])}')"

# Output: Dimensions: 768

5d. List and Manage Models

Bash
# List all downloaded models via API
curl -s http://localhost:11434/api/tags | python3 -m json.tool

# Pull a model via API
curl -s http://localhost:11434/api/pull -d '{"name": "gemma3:12b"}'

# Delete a model
curl -s -X DELETE http://localhost:11434/api/delete -d '{"name": "old-model:latest"}'

# Show model details (parameters, template, license)
curl -s http://localhost:11434/api/show -d '{"name": "llama3.2"}'

5e. Using the API from Python

For Python applications, you can use the official Ollama library or call the API directly with requests:

Python
# Install the official Ollama Python library first:
#   pip install ollama

# --- Using the official library ---
import ollama

# Simple generation
response = ollama.generate(
    model='qwen3:1.7b',
    prompt='Write a Python function to calculate factorial'
)
print(response['response'])

# Chat with conversation history
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'You are a senior Python developer.'},
        {'role': 'user', 'content': 'Review this code for bugs: def add(a, b): return a - b'}
    ]
)
print(response['message']['content'])

# Generate embeddings
response = ollama.embed(
    model='nomic-embed-text',
    input='Machine learning model deployment'
)
print(f'Embedding dimensions: {len(response["embeddings"][0])}')

# --- Or using requests directly (no extra library needed) ---
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'qwen3:1.7b',
    'prompt': 'Explain Docker in one sentence.',
    'stream': False
})
print(response.json()['response'])

Step 6: Set Up Nginx Reverse Proxy (Recommended for Production)

If you need to access Ollama from other servers or applications, do not expose port 11434 directly. Instead, set up an Nginx reverse proxy with optional authentication.

Bash
# Install Nginx if not already installed
sudo apt update && sudo apt install -y nginx apache2-utils

# Create a password file for basic authentication
sudo htpasswd -c /etc/nginx/.ollama_htpasswd ollama_user
# Enter a strong password when prompted

# Create the Nginx configuration
sudo tee /etc/nginx/sites-available/ollama <<'EOF'
# Rate limiting zone: this must live in the http context. Files in
# sites-enabled are included there, so define it before the server block.
limit_req_zone $binary_remote_addr zone=ollama:10m rate=10r/s;

server {
    listen 8080;
    server_name your-server-ip;

    # Basic authentication
    auth_basic "Ollama API";
    auth_basic_user_file /etc/nginx/.ollama_htpasswd;

    location / {
        limit_req zone=ollama burst=20 nodelay;
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Increase timeouts for long-running inference
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
        proxy_connect_timeout 30s;

        # Disable buffering for streaming responses
        proxy_buffering off;
    }

    # Health check endpoint (no auth required). The trailing slash maps
    # /health to Ollama's root path, which answers "Ollama is running".
    location = /health {
        auth_basic off;
        proxy_pass http://127.0.0.1:11434/;
    }
}
EOF

# Enable the site
sudo ln -sf /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx

# Test with authentication
curl -u ollama_user:your_password http://your-server-ip:8080/api/tags

This configuration adds three important production features:

  • Basic authentication — prevents unauthorized access to your inference API
  • Rate limiting — 10 requests per second per IP, with burst allowance of 20
  • Increased timeouts — large model inference can take 30+ seconds; default Nginx timeouts would cut the connection
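
If the server has a public domain, it is worth adding TLS on top of this. A minimal sketch using certbot's Nginx plugin, assuming a DNS record such as ollama.example.com points at the server and server_name in the config above is set to that domain:

Bash
# Install certbot and its Nginx plugin
sudo apt install -y certbot python3-certbot-nginx

# Obtain a certificate and let certbot adjust the Nginx config
sudo certbot --nginx -d ollama.example.com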

Step 7: Create Custom Models with Modelfiles

Modelfiles let you create customized versions of base models with specific system prompts, parameters, and behavior. This is useful for building specialized AI agents.

Bash
# Create a Modelfile for a code review assistant
cat > Modelfile-code-reviewer <<'EOF'
FROM codellama:13b

SYSTEM """You are a senior code reviewer. When given code, you:
1. Identify bugs and security vulnerabilities
2. Suggest performance improvements
3. Check for best practices and clean code principles
4. Rate the overall code quality from 1-10

Be specific and cite line numbers. Be constructive, not harsh."""

# Lower temperature for more consistent reviews
PARAMETER temperature 0.3

# Increase context for reviewing longer files
PARAMETER num_ctx 8192

# Stop sequences
PARAMETER stop "<|endoftext|>"
PARAMETER stop "Human:"
EOF

# Build the custom model
ollama create code-reviewer -f Modelfile-code-reviewer

# Test it
ollama run code-reviewer "Review this Python code:
def get_user(id):
    query = f'SELECT * FROM users WHERE id = {id}'
    return db.execute(query)"

# The model will flag the SQL injection vulnerability
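
The custom model is served through the same API as any base model, so applications can target it by name:

Bash
curl -s http://localhost:11434/api/generate -d '{
  "model": "code-reviewer",
  "prompt": "Review this code: def div(a, b): return a / b",
  "stream": false
}' | python3 -c "import sys, json; print(json.load(sys.stdin)['response'])"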

Step 8: Monitor and Maintain

Once Ollama is running in production, you will want to monitor resource usage and keep things updated.

Monitor GPU and Memory Usage

Bash
# Real-time GPU monitoring (updates every 1 second)
watch -n 1 nvidia-smi

# Check which models are currently loaded in memory
curl -s http://localhost:11434/api/ps | python3 -m json.tool

# Monitor Ollama logs for errors
sudo journalctl -u ollama -f --no-pager

# Check disk usage of model storage
du -sh /usr/share/ollama/.ollama/models/

# Or if using a custom path:
du -sh /mnt/data/ollama/models/
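
For unattended operation, a small watchdog can restart the service if the API stops responding. A minimal sketch suitable for cron; the script path and schedule are assumptions to adapt:

Bash
#!/usr/bin/env bash
# /usr/local/bin/ollama-healthcheck.sh (hypothetical path)
# Restart Ollama if the API does not answer within 5 seconds
if ! curl -sf --max-time 5 http://localhost:11434/ > /dev/null; then
    logger -t ollama-healthcheck "Ollama API unresponsive, restarting"
    systemctl restart ollama
fi

# Example root cron entry, running every 5 minutes:
# */5 * * * * /usr/local/bin/ollama-healthcheck.sh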

Update Ollama

Bash
# Update Ollama to the latest version (same install command)
curl -fsSL https://ollama.com/install.sh | sh

# The installer detects the existing installation and updates in place
# The service restarts automatically

# Verify the new version
ollama --version

Clean Up Unused Models

Bash
# List all models with sizes
ollama list

# Remove models you no longer need
ollama rm mixtral:8x7b    # frees 26 GB
ollama rm codellama:7b    # frees 3.8 GB

# Check freed disk space
df -h /usr/share/ollama/

Step 9: Integrate With Your Application

Here is a practical example of how we integrated Ollama into a WordPress site to power an “Explain This Code” feature on our blog. The WordPress plugin sends code snippets to our Ollama server, which returns a plain-English explanation.

PHP
// PHP example: calling Ollama from a WordPress plugin
function explain_code_with_ollama($code, $language, $context) {
    $ollama_url = 'http://your-ollama-server:11434/api/generate';
    
    $prompt = sprintf(
        "Explain this %s code in simple terms. Context: %s\n\nCode:\n```%s\n%s\n```",
        $language,
        $context,
        $language,
        $code
    );
    
    $response = wp_remote_post($ollama_url, array(
        'timeout' => 60,
        'headers' => array('Content-Type' => 'application/json'),
        'body' => json_encode(array(
            'model' => 'qwen3:1.7b',
            'prompt' => $prompt,
            'stream' => false,
            'options' => array(
                'temperature' => 0.3,
                'num_ctx' => 4096,
            ),
        )),
    ));
    
    if (is_wp_error($response)) {
        return 'Error: ' . $response->get_error_message();
    }
    
    $body = json_decode(wp_remote_retrieve_body($response), true);
    return $body['response'] ?? 'Could not generate explanation.';
}

For Node.js applications, the integration is equally straightforward:

JavaScript
// Node.js example: calling the Ollama API
// Node 18+ ships a global fetch; on older versions, `npm install node-fetch`
// and uncomment the next line:
// const fetch = require('node-fetch');

async function askOllama(prompt, model = 'qwen3:1.7b') {
    const response = await fetch('http://your-ollama-server:11434/api/generate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            model,
            prompt,
            stream: false,
            options: { temperature: 0.7 }
        })
    });
    
    const data = await response.json();
    return data.response;
}

// Usage (top-level await requires an ES module; otherwise wrap in an async function)
const explanation = await askOllama(
    'Explain the difference between async/await and Promises in JavaScript.'
);
console.log(explanation);

Troubleshooting Common Issues

Here are the problems we encountered during setup and how we fixed them:

| Problem                            | Cause                                    | Solution                                                           |
|------------------------------------|------------------------------------------|--------------------------------------------------------------------|
| Model fails to load                | Insufficient VRAM or RAM                 | Use a smaller model or set the num_gpu option to 0 (CPU-only mode) |
| Slow inference                     | Model running on CPU instead of GPU      | Check nvidia-smi, ensure CUDA is installed, restart Ollama         |
| “connection refused” on port 11434 | Ollama not running or bound to localhost | Check systemctl status ollama, set OLLAMA_HOST=0.0.0.0             |
| Disk space errors                  | Models filling the root partition        | Move storage with the OLLAMA_MODELS env var or a symlink           |
| Timeout on large prompts           | Default Nginx timeout too short          | Increase proxy_read_timeout to 300s+                               |
| “CUDA out of memory”               | Model too large for GPU                  | Use a smaller quantization (Q4 vs Q8) or partially offload to CPU  |
| Slow first response                | Model loading from disk into GPU         | Set OLLAMA_KEEP_ALIVE=-1 to keep the model in memory               |

Frequently Asked Questions

How much does it cost to run Ollama vs using OpenAI API?

The main cost is hardware. An NVIDIA RTX 4090 costs around $1,600-2,000. If you are spending $200-500/month on API calls, the GPU pays for itself in 4-8 months. After that, inference is essentially free (just electricity). For a startup making 10,000+ inference calls per day, self-hosting is almost always cheaper within the first year.

Can Ollama run multiple models simultaneously?

Yes, but each model consumes VRAM when loaded. Ollama automatically loads and unloads models based on demand. If you have 24 GB VRAM, you could keep a 7B model (6 GB) and a 3B model (3 GB) loaded simultaneously, with room for the context window overhead. Use OLLAMA_KEEP_ALIVE to control how long idle models stay in memory.
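
Two environment variables govern this in current Ollama releases: OLLAMA_MAX_LOADED_MODELS (how many models may be resident at once) and OLLAMA_NUM_PARALLEL (parallel requests per loaded model). Set them in the same systemd override as before:

Bash
sudo systemctl edit ollama

# Add:
# [Service]
# Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Environment="OLLAMA_NUM_PARALLEL=4"

sudo systemctl daemon-reload
sudo systemctl restart ollama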

Is Ollama suitable for production applications?

For internal tools, developer workflows, and moderate-traffic applications — absolutely. We run it in production for AI-powered features on our website. For high-traffic public-facing applications, you would want to add load balancing, health checks, and failover, which is beyond what Ollama handles alone but achievable with Nginx and standard infrastructure tooling.

Can I fine-tune models with Ollama?

Ollama does not support fine-tuning directly. You fine-tune models using tools like Unsloth, Axolotl, or Hugging Face’s transformers library, export the result as a GGUF file, and then import it into Ollama. Modelfiles let you customize system prompts and parameters without fine-tuning, which covers most customization needs.

Does Ollama support AMD GPUs?

Yes. Ollama supports AMD GPUs via ROCm. The install script detects AMD GPUs and installs the necessary ROCm libraries. Performance is generally comparable to NVIDIA on supported hardware (RX 6000 and 7000 series).

What Next?

Once Ollama is running on your server, the possibilities open up quickly. Here are a few things you might want to explore:

  • Build a RAG pipeline — combine Ollama with a vector database like Qdrant or ChromaDB to create AI assistants that can answer questions about your own documents and codebase. Check out our guide on RAG solutions.
  • Create custom AI agents — use Modelfiles to build specialized agents for code review, document analysis, customer support, or any domain-specific task. Read more about our custom AI agent development.
  • Connect to development tools — integrate Ollama with VS Code, JetBrains IDEs, or CI/CD pipelines for AI-assisted development without sending code to external APIs.
  • Deploy Open WebUI — a popular open-source chat interface that connects to Ollama and provides a ChatGPT-like experience for your team; a minimal Docker-based install is sketched below this list.
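
For the Open WebUI option, a common deployment is a single Docker container pointed at the local Ollama instance. This follows the project's published quick start; verify the exact flags against the Open WebUI docs for your version:

Bash
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Then open http://your-server-ip:3000 in a browser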

If you need help setting up self-hosted AI infrastructure, building AI-powered applications, or integrating LLMs into your existing software systems, our team at Velsof has done this for companies across the US and Europe. Get in touch to discuss your project.
