LangChain + Ollama: Build Local AI Applications on Linux Without API Costs (2026)
🎯 Key Takeaways
- Ollama runs open models locally on Linux, so there are no API costs and no data leaves your server
- LangChain's ChatOllama is a drop-in replacement for ChatOpenAI – no API key needed
- A fully local RAG pipeline needs only Ollama, LangChain, and Chroma
- A single environment variable can switch the same code between local and cloud models
📋 Table of Contents
- Why Run AI Locally?
- Step 1 – Install Ollama on Linux
- Step 2 – Pull Your First Models
- Which Model to Use?
- Step 3 – Test Ollama Directly
- Step 4 – Connect LangChain to Ollama
- Step 5 – Build a Local Linux Assistant
- Step 6 – Build a Local RAG System (Q&A from Your Docs)
- Step 7 – Expose Ollama to Your Network
- Performance Tips for Linux Servers
- Switching Between Local and Cloud Models
- Conclusion
Running AI models locally on your Linux server means no API costs, complete data privacy, no internet dependency, and unlimited usage. With Ollama and LangChain working together, you can build production-quality AI applications that run entirely on your own hardware. This guide covers everything from installation to building a fully local RAG application.
Why Run AI Locally?
| Aspect | Cloud API (OpenAI) | Local (Ollama) |
|---|---|---|
| Cost | Pay per token | Free after hardware |
| Privacy | Data sent to OpenAI | Data never leaves server |
| Internet | Required | Not needed |
| Latency | Network dependent | Local hardware speed |
| Limits | Rate limits apply | Unlimited requests |
| Model quality | Best (GPT-4o) | Very good (7B–14B models) |
For sysadmin use cases – log analysis, runbook Q&A, internal tooling – local models are often more than sufficient, and the privacy and cost benefits are significant.
Step 1 – Install Ollama on Linux
# One-line install (works on Ubuntu, Debian, RHEL, Rocky)
curl -fsSL https://ollama.com/install.sh | sh
# Ollama runs as a systemd service automatically
systemctl status ollama
# Verify it is working
ollama --version
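You can also verify the service over HTTP, since Ollama's REST API is up as soon as the daemon runs. A minimal sketch using the requests library (pip install requests), assuming the default port 11434:
import requests

# /api/tags lists the models Ollama has downloaded locally
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
for model in resp.json().get("models", []):
    print(model["name"])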
Step 2 – Pull Your First Models
# Mistral 7B – best all-round model for 8GB+ RAM
ollama pull mistral
# Llama 3.2 3B – very fast, good for simple tasks
ollama pull llama3.2
# Phi-3 Mini – excellent for coding tasks, only 3.8B
ollama pull phi3:mini
# Llama 3.1 8B – best quality for 16GB RAM systems
ollama pull llama3.1:8b
# List downloaded models
ollama list
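Models can also be pulled programmatically, which helps when provisioning several servers. A sketch against Ollama's /api/pull endpoint, which streams one JSON status line at a time while the download runs:
import json
import requests

# Stream download progress from Ollama's pull endpoint
with requests.post("http://localhost:11434/api/pull",
                   json={"model": "mistral"}, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(json.loads(line).get("status"))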
Which Model to Use?
| Model | RAM Needed | Best For | Speed |
|---|---|---|---|
| llama3.2:1b | 2 GB | Simple tasks, testing | Very fast |
| llama3.2:3b | 3 GB | General use | Fast |
| phi3:mini | 4 GB | Coding, reasoning | Fast |
| mistral | 5 GB | Best all-round 7B | Good |
| llama3.1:8b | 6 GB | Best quality 8B | Good |
| qwen2.5:14b | 10 GB | Best local quality | Moderate |
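If you provision machines with scripts, you can pick a model from the table automatically. A rough sketch – the thresholds mirror the table above and are estimates, not official requirements:
# Suggest a model based on MemAvailable in /proc/meminfo (Linux only)
def suggest_model() -> str:
    with open("/proc/meminfo") as f:
        meminfo = dict(line.split(":", 1) for line in f)
    avail_gb = int(meminfo["MemAvailable"].strip().split()[0]) / 1024 / 1024
    if avail_gb >= 10:
        return "qwen2.5:14b"
    if avail_gb >= 6:
        return "llama3.1:8b"
    if avail_gb >= 5:
        return "mistral"
    return "llama3.2:3b"

print(suggest_model())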
Step 3 – Test Ollama Directly
# Chat with a model from the terminal
ollama run mistral
# Ask a question and exit
ollama run mistral "What are the top 5 Linux commands every sysadmin must know?"
# Query the native REST API (Ollama also serves an OpenAI-compatible API under /v1)
curl http://localhost:11434/api/generate \
-d '{"model": "mistral", "prompt": "Explain SELinux in simple terms", "stream": false}' \
| python3 -m json.tool
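The same request from Python, for when you want Ollama in a script without LangChain – a minimal sketch with requests:
import requests

# Same call as the curl example above, via the native /api/generate endpoint
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Explain SELinux in simple terms", "stream": False},
    timeout=120,
)
print(resp.json()["response"])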
Step 4 – Connect LangChain to Ollama
# Install LangChain with Ollama support
source ~/langchain-projects/lc-env/bin/activate
pip install langchain langchain-ollama langchain-community
nano ollama_test.py
from langchain_ollama import ChatOllama
# Connect to local Ollama – no API key needed!
llm = ChatOllama(model="mistral", temperature=0)
# Same interface as ChatOpenAI
response = llm.invoke("What is the difference between a process and a thread in Linux?")
print(response.content)
python3 ollama_test.py
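For longer answers you do not have to wait for the whole reply – ChatOllama also supports token streaming via .stream():
from langchain_ollama import ChatOllama

llm = ChatOllama(model="mistral", temperature=0)

# Print tokens as they arrive instead of waiting for the full response
for chunk in llm.stream("Summarise what systemd does in three sentences"):
    print(chunk.content, end="", flush=True)
print()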
Step 5 – Build a Local Linux Assistant
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
llm = ChatOllama(model="mistral", temperature=0)
# Create a Linux-focused assistant
prompt = ChatPromptTemplate.from_messages([
("system", """You are an expert Linux sysadmin assistant.
- Give concise, practical answers
- Always include actual commands when helpful
- Warn about dangerous operations
- Assume the user is running Ubuntu or RHEL"""),
("human", "{question}")
])
chain = prompt | llm
# Test it
questions = [
"How do I find files larger than 1GB on my server?",
"My server load average is 8.5 with 4 CPUs, what should I check?",
"How do I check which process is using the most memory?"
]
for q in questions:
print(f"\nQ: {q}")
print(f"A: {chain.invoke({'question': q}).content}")
print("-" * 60)
Step 6 – Build a Local RAG System (Q&A from Your Docs)
RAG (Retrieval-Augmented Generation) lets the AI answer questions from your own documents – runbooks, wikis, configuration files – all locally.
# Install RAG dependencies (embeddings come from Ollama, so no extra embedding library is needed)
pip install langchain-chroma chromadb
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
# Step 1: Load your documents (runbooks, wikis, etc.)
# loader = TextLoader("/path/to/your/runbook.txt")
# Or load an entire directory:
# loader = DirectoryLoader("/path/to/docs/", glob="**/*.txt")
# docs = loader.load()
# For this example, create sample documents
from langchain_core.documents import Document
docs = [
Document(page_content="To restart nginx: systemctl restart nginx. Check status with: systemctl status nginx. Logs at: /var/log/nginx/error.log", metadata={"source": "nginx-runbook"}),
Document(page_content="Database backup procedure: Run pg_dump dbname > backup.sql. Schedule with cron: 0 2 * * * pg_dump mydb > /backups/mydb_$(date +%Y%m%d).sql", metadata={"source": "db-runbook"}),
Document(page_content="When disk is full: Check large files with 'du -sh /* | sort -rh | head -20'. Clean logs with 'journalctl --vacuum-size=1G'. Remove old Docker images with 'docker system prune'", metadata={"source": "disk-runbook"}),
]
# Step 2: Split documents into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# Step 3: Create embeddings using Ollama (local, free).
# A dedicated embedding model (ollama pull nomic-embed-text) usually retrieves better.
embeddings = OllamaEmbeddings(model="llama3.2")
# Step 4: Store in local vector database
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./local_db")
# Step 5: Create retrieval chain
llm = ChatOllama(model="mistral", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
return_source_documents=True
)
# Step 6: Ask questions about your documents
questions = [
"How do I restart nginx?",
"What should I do when disk is full?",
"How do I backup the database?"
]
for q in questions:
result = qa_chain.invoke({"query": q})
print(f"\nQ: {q}")
print(f"A: {result['result']}")
sources = [doc.metadata['source'] for doc in result['source_documents']]
print(f"Sources: {', '.join(set(sources))}")
Step 7 – Expose Ollama to Your Network
By default Ollama only listens on localhost. To access it from other machines (like your laptop), bind it to all interfaces. Note that Ollama has no built-in authentication, so restrict access to trusted networks with your firewall:
# Edit Ollama systemd service
sudo systemctl edit ollama
# Add these lines in the editor
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Verify it is listening on all interfaces
ss -tlnp | grep 11434
Now connect from LangChain on another machine:
# Connect to remote Ollama server
llm = ChatOllama(
model="mistral",
base_url="http://your-server-ip:11434"
)
Performance Tips for Linux Servers
# Check GPU acceleration (much faster if available)
journalctl -u ollama | grep -i gpu
# Or, with a model loaded, check the PROCESSOR column (GPU vs CPU)
ollama ps
# Monitor resource usage while model runs
htop # or
watch -n 1 'free -h && echo "---" && cat /proc/loadavg'
# Handle concurrent requests (OLLAMA_NUM_PARALLEL sets parallel request slots, not threads)
# Stop the systemd service first if you run ollama serve by hand
OLLAMA_NUM_PARALLEL=2 ollama serve
# Check model loading time
time ollama run mistral "hi" --nowordwrap
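You can also benchmark from Python: langchain-ollama passes Ollama's timing statistics through on response_metadata. The field names below (eval_count, eval_duration in nanoseconds) are what current versions appear to expose – treat them as an assumption and print the dict if yours differ:
from langchain_ollama import ChatOllama

llm = ChatOllama(model="mistral", temperature=0)
msg = llm.invoke("Explain inodes in one paragraph")

meta = msg.response_metadata  # raw Ollama stats; durations are in nanoseconds
tokens = meta.get("eval_count", 0)
seconds = meta.get("eval_duration", 1) / 1e9
print(f"{tokens} tokens in {seconds:.2f}s = {tokens / seconds:.1f} tokens/sec")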
Switching Between Local and Cloud Models
One of the best features of LangChain is that you can swap between Ollama and OpenAI just by changing the model constructor:
import os
# Choose model based on environment variable
USE_LOCAL = os.getenv("USE_LOCAL_AI", "true").lower() == "true"
if USE_LOCAL:
from langchain_ollama import ChatOllama
llm = ChatOllama(model="mistral", temperature=0)
print("Using local Ollama model")
else:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
print("Using OpenAI API")
# The rest of your code is identical regardless
response = llm.invoke("Explain Linux namespaces")
print(response.content)
Conclusion
Running LangChain with Ollama on Linux gives you a powerful, private, and free AI stack. For sysadmin use cases – internal documentation Q&A, log analysis, automation assistance – local 7B–14B models perform excellently. You get the full power of modern AI without sending sensitive infrastructure data to external APIs, without ongoing costs, and without internet dependency. As hardware gets cheaper and models keep improving, local AI is becoming the default choice for privacy-conscious organisations.