How to Run Local LLM with Ollama: Complete Setup Guide
Running Large Language Models locally gives you privacy, no API costs, and offline access. Ollama makes this incredibly simple. This guide covers everything from installation to running advanced models on your machine.
Why Run LLMs Locally?
- Privacy: Your data never leaves your machine
- No API Costs: Free unlimited usage
- Offline Access: Works without internet
- Speed: No network latency
- Customization: Fine-tune and modify models
System Requirements
Minimum Requirements
- 8GB RAM (for 7B models)
- Modern CPU (Intel/AMD)
- 10GB disk space
Recommended for Best Performance
- 16GB+ RAM (for larger models)
- NVIDIA GPU with 8GB+ VRAM
- SSD storage
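Not sure what your machine offers? On Linux, a couple of standard commands cover the basics (macOS and Windows have equivalent tools):
# Available RAM
free -h
# NVIDIA GPU and VRAM (if present)
nvidia-smi
# Free disk space
df -h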
Installing Ollama
macOS
# Using Homebrew
brew install ollama
# Or download the macOS app from ollama.com/download (the install script below is Linux-only)
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com/download
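Once the installer finishes, you can verify the install from PowerShell or Command Prompt:
ollama --version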
Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
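With the Docker setup, the CLI lives inside the container, so models are pulled and run through docker exec (using the container name ollama from the command above):
docker exec -it ollama ollama run llama3.2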
Starting Ollama
# Start the Ollama service
ollama serve
The service runs on http://localhost:11434
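If you installed the desktop app or used the Linux install script, the service is usually started for you. Either way, a quick request confirms it is up (the root endpoint simply replies that Ollama is running):
curl http://localhost:11434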
Downloading and Running Models
Popular Models
# Llama 3.2 (Meta's lightweight 1B/3B models)
ollama pull llama3.2
# Mistral (Fast and efficient)
ollama pull mistral
# CodeLlama (Optimized for coding)
ollama pull codellama
# Phi-2 (Microsoft's small but powerful model)
ollama pull phi
# Gemma (Google's open model)
ollama pull gemma
Model Sizes
| Model | Size | RAM Required | Best For |
|---|---|---|---|
| phi | 2.7B | 4GB | Quick tasks, low resources |
| mistral | 7B | 8GB | General purpose |
| llama3.2 | 3B | 4GB | Fast general chat |
| llama3.3:70b | 70B | 48GB+ | Maximum quality |
| codellama | 7B | 8GB | Code generation |
Running a Model
# Interactive chat
ollama run llama3.2
# Run with a prompt
ollama run mistral "Explain quantum computing in simple terms"
# List downloaded models
ollama list
# Remove a model
ollama rm modelname
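Inside an interactive ollama run session, a few slash commands are worth knowing (available in recent Ollama releases):
>>> /show info                      # model details: parameters, context length, quantization
>>> /set parameter temperature 0.2  # adjust a runtime parameter for this session
>>> /bye                            # exit the session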
Using Ollama API
Basic API Call
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What is machine learning?",
  "stream": false
}'
Streaming Response
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a poem about coding",
  "stream": true
}'
Chat API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
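Generation settings such as temperature or context size can be passed per request through the options field, for example:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}],
  "options": {"temperature": 0.2, "num_ctx": 4096},
  "stream": false
}'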
Python Integration
Using Official Ollama Library
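# Install the client first: pip install ollama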
import ollama
# Simple generation
response = ollama.generate(
    model='llama3.2',
    prompt='Explain Python decorators'
)
print(response['response'])
# Chat
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Why is the sky blue?'}
    ]
)
print(response['message']['content'])
# Streaming
for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
Using LangChain with Ollama
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Initialize
llm = Ollama(model="llama3.2")
# Simple chain
prompt = ChatPromptTemplate.from_template("Explain {topic} simply.")
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"topic": "neural networks"})
print(result)
OpenAI-Compatible API
Ollama provides an OpenAI-compatible endpoint:
from openai import OpenAI
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Required but unused
)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)
Creating Custom Models
Modelfile Basics
# Modelfile
FROM llama3.2
# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
# Set system prompt
SYSTEM """You are a helpful coding assistant. You provide clear,
well-commented code examples and explain your reasoning."""
Create and Run Custom Model
# Create the model
ollama create codingassistant -f Modelfile
# Run it
ollama run codingassistant
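To check that the parameters and system prompt were picked up, recent Ollama versions can print the stored Modelfile back out:
# Inspect the model Ollama created
ollama show codingassistant --modelfile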
Advanced Modelfile
FROM llama3.2
# Custom parameters
PARAMETER temperature 0.8
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192
# System prompt
SYSTEM """You are an expert Python developer. Follow these rules:
1. Always write clean, PEP 8 compliant code
2. Include docstrings and type hints
3. Add comments for complex logic
4. Suggest optimizations when relevant"""
# Template customization (the markers must match the chat format the base
# model expects; Ollama fills in the Go template variables)
TEMPLATE """<|system|>
{{ .System }}<|end|>
<|user|>
{{ .Prompt }}<|end|>
<|assistant|>
"""
GPU Acceleration
NVIDIA GPU Setup
# Check if GPU is detected
nvidia-smi
# Ollama automatically uses GPU if available
ollama run llama3.2
Specify GPU Layers
The number of model layers offloaded to the GPU is controlled by the num_gpu option, which can be set per API request (via "options") or inside an interactive session:
# Force CPU-only inference by offloading zero layers
ollama run llama3.2
>>> /set parameter num_gpu 0
AMD GPU (ROCm)
# Install ROCm version
docker run -d --device /dev/kfd --device /dev/dri \
-v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama:rocm
Web UI Options
Open WebUI (Recommended)
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Access at http://localhost:3000
Text Generation WebUI
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
python server.py
Performance Optimization
Memory Management
# Set the context size (affects memory use) for an interactive session
ollama run llama3.2
>>> /set parameter num_ctx 2048
# In Modelfile
PARAMETER num_ctx 4096
Quantization
Ollama pulls a 4-bit quantized build by default, which is a good balance of quality and memory use. Other quantization levels are published as tags on each model's page at ollama.com, for example:
# Default 4-bit build
ollama pull llama3.2
# Higher quality, more memory (Q5_K_M)
ollama pull llama3.2:3b-instruct-q5_K_M
# Lowest memory, reduced quality (Q2_K)
ollama pull llama3.2:3b-instruct-q2_K
Building an Application
Simple Chatbot
import ollama
class LocalChatbot:
    def __init__(self, model="llama3.2"):
        self.model = model
        self.history = []

    def chat(self, message):
        self.history.append({"role": "user", "content": message})
        response = ollama.chat(
            model=self.model,
            messages=self.history
        )
        assistant_message = response['message']['content']
        self.history.append({"role": "assistant", "content": assistant_message})
        return assistant_message

    def clear_history(self):
        self.history = []
# Usage
bot = LocalChatbot()
print(bot.chat("Hello! What can you help me with?"))
print(bot.chat("Can you write some Python code?"))
RAG with Local LLM
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
# Setup components
llm = Ollama(model="llama3.2")
embeddings = OllamaEmbeddings(model="llama3.2")
# Create vector store (assume docs are loaded)
vectorstore = Chroma.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever()
# RAG chain
template = """Answer based on context:
{context}
Question: {question}"""
prompt = ChatPromptTemplate.from_template(template)
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
)
answer = chain.invoke("What is the main topic?")
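The example assumes docs already exists. A sketch of producing it from a local text file (the file name is just a placeholder), using the splitter imported above:
from langchain_community.document_loaders import TextLoader

# Load a plain-text file and split it into overlapping chunks for retrieval
loader = TextLoader("notes.txt")  # placeholder path
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(loader.load())
For embeddings, a dedicated model such as nomic-embed-text (ollama pull nomic-embed-text) generally retrieves better than reusing a chat model.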
Troubleshooting
Model Won’t Load
# Check available memory
free -h
# Try smaller model
ollama run phi
# Reduce the context size
ollama run llama3.2
>>> /set parameter num_ctx 1024
Slow Performance
- Use GPU if available
- Choose quantized models
- Reduce context window
- Close other applications
Connection Errors
# Restart Ollama
ollama serve
# Check if running
curl http://localhost:11434/api/tags
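If the server still will not respond, the logs usually explain why; on a Linux install (where Ollama runs as a systemd service), for example:
# Inspect recent Ollama server logs
journalctl -u ollama --no-pager | tail -n 50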
Comparison: Local vs Cloud
| Aspect | Local (Ollama) | Cloud (OpenAI) |
|---|---|---|
| Cost | Free | Pay per token |
| Privacy | Complete | Data sent to servers |
| Speed | Depends on hardware | Consistent |
| Quality | Good (varies by model) | Best available |
| Offline | Yes | No |
Conclusion
Ollama makes running powerful LLMs locally accessible to everyone. Whether you need privacy, want to avoid API costs, or just want to experiment, local LLMs are now a viable option.
Start with smaller models like Phi or Mistral 7B, then scale up as you understand your hardware’s capabilities.
Last updated: January 2025
