The AI landscape has evolved rapidly, but there are growing concerns about sending sensitive data to cloud services, unpredictable API costs, and losing functionality without internet connectivity. Enter Ollama, a solution that brings powerful language models to your local machine, paired perfectly with .NET’s robust ecosystem.
Why Local AI Development Matters
Privacy and Security: Your data never leaves your machine. This is crucial for industries dealing with sensitive information, legal documents, healthcare records, or proprietary business data.
Cost Control: No per-token charges, no surprise bills. Once you’ve downloaded a model, inference is free beyond your electricity costs.
Offline Capability: Build applications that work without internet connectivity—essential for field work, air-gapped environments, or regions with unreliable connectivity.
Development Flexibility: Experiment freely without worrying about API rate limits or costs during development and testing.
What is Ollama?
Ollama is an open-source tool that makes it easy to run large language models locally. Think of it as Docker for AI models—it handles model downloads, manages resources, and provides a simple API interface.
Supported Models:
- Llama - Meta’s flagship models (Llama 3.1, 3.2, 3.3, 4 with vision)
- Qwen - Alibaba’s high-performing multilingual models (Qwen 3, Qwen 2.5)
- Mistral - Efficient models with long context (Mistral Small, Large, Nemo)
- DeepSeek - Reasoning models (DeepSeek-R1, DeepSeek-V3)
- Phi - Microsoft’s compact models (Phi-3, Phi-4)
- Gemma - Google’s open models (Gemma 3, CodeGemma)
- CodeLlama / Devstral - Specialized for code generation
- Specialized Models - vision, multimodal, reasoning, and embedding models
- And hundreds more
Browse the complete library at ollama.com/library to explore all available models, including specialized versions for coding, reasoning, multilingual support, and vision capabilities.
Setting Up Ollama
Installation
Windows: Download the installer from ollama.com/download and run it, or use winget:
winget install Ollama.Ollama
Ollama runs as a background service.
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Downloading Your First Model
ollama pull llama3.2
This downloads the default Llama 3.2 model (3B parameters, roughly 2 GB). Start an interactive session with:
ollama run llama3.2
Testing the API
Ollama exposes a REST API on http://localhost:11434. Test it:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?",
"stream": false
}'
Integrating Ollama with .NET
Using HttpClient
For simple scenarios, use .NET’s built-in HttpClient:
using System.Net.Http.Json;
using System.Text.Json;

public class OllamaClient
{
    private readonly HttpClient _httpClient;
    private const string BaseUrl = "http://localhost:11434";

    public OllamaClient()
    {
        // Reuse a single HttpClient instance (or inject one via IHttpClientFactory in real apps)
        _httpClient = new HttpClient { BaseAddress = new Uri(BaseUrl) };
    }

    public async Task<string> GenerateAsync(string model, string prompt)
    {
        // stream = false returns the full completion in a single JSON response
        var request = new
        {
            model,
            prompt,
            stream = false
        };

        var response = await _httpClient.PostAsJsonAsync("/api/generate", request);
        response.EnsureSuccessStatusCode();

        var result = await response.Content.ReadFromJsonAsync<OllamaResponse>();
        return result?.Response ?? string.Empty;
    }
}

public record OllamaResponse(string Response, string Model, bool Done);
Usage:
var client = new OllamaClient();
var answer = await client.GenerateAsync("llama3.2", "Explain dependency injection in C#");
Console.WriteLine(answer);
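The client above waits for the complete answer before returning. If you want tokens to appear as they are generated, the same endpoint accepts stream = true and then returns newline-delimited JSON objects, each carrying a fragment of the answer. A possible streaming method for the same class (a sketch; the method name is mine and error handling is minimal):

public async IAsyncEnumerable<string> GenerateStreamingAsync(string model, string prompt)
{
    var request = new HttpRequestMessage(HttpMethod.Post, "/api/generate")
    {
        Content = JsonContent.Create(new { model, prompt, stream = true })
    };

    // Read headers first so the body can be consumed as it arrives
    using var response = await _httpClient.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
    response.EnsureSuccessStatusCode();

    using var stream = await response.Content.ReadAsStreamAsync();
    using var reader = new StreamReader(stream);

    // Each line is a JSON object with a "response" fragment; the last one has "done": true
    while (await reader.ReadLineAsync() is { } line)
    {
        if (string.IsNullOrWhiteSpace(line)) continue;

        var chunk = JsonSerializer.Deserialize<OllamaResponse>(line,
            new JsonSerializerOptions(JsonSerializerDefaults.Web));

        if (chunk is null) continue;
        yield return chunk.Response;
        if (chunk.Done) yield break;
    }
}

Consume it with await foreach and write each fragment to the console as it arrives.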
Using OllamaSharp Library
For production use, consider OllamaSharp:
dotnet add package OllamaSharp
using OllamaSharp;
var ollama = new OllamaApiClient("http://localhost:11434", "llama3.2");
var chat = new Chat(ollama);
await foreach (var response in chat.SendAsync("What are SOLID principles?"))
{
Console.Write(response);
}
Building a Document Q&A System
Here’s a practical example combining Ollama with document processing:
public class DocumentQAService
{
private readonly OllamaClient _ollama;
private readonly Dictionary<string, string> _documents = new();
public DocumentQAService(OllamaClient ollama)
{
_ollama = ollama;
}
public void AddDocument(string id, string content)
{
_documents[id] = content;
}
public async Task<string> AskQuestionAsync(string question)
{
// Combine all documents as context
var context = string.Join("\n\n", _documents.Values);
var prompt = $"""
Context:
{context}
Question: {question}
Answer based only on the context provided above.
""";
return await _ollama.GenerateAsync("llama3.2", prompt);
}
}
Usage:
var qa = new DocumentQAService(new OllamaClient());
qa.AddDocument("policy", "Our refund policy allows returns within 30 days...");
qa.AddDocument("shipping", "We ship worldwide with DHL. Delivery takes 3-5 days...");
var answer = await qa.AskQuestionAsync("What is your refund policy?");
Console.WriteLine(answer);
Choosing the Right Model
Different models suit different needs:
| Model | Size | Best For | Speed |
|---|---|---|---|
| Phi-4 | 14B | Fast reasoning & coding | Fast |
| Llama 3.2 | 1B-3B | Quick tasks, chat | Very Fast |
| Qwen 2.5 | 7B-32B | General purpose, coding | Medium |
| Llama 3.1 | 8B-70B | Complex reasoning | Medium-Slow |
| DeepSeek-R1 | 7B-671B | Advanced reasoning | Slower |
For development, start with Qwen 2.5 7B or Llama 3.2 3B (excellent balance of speed and quality). For coding tasks, DeepSeek-Coder or Devstral are specialized choices.
Best Practices
1. Prompt Engineering
Be specific and provide context:
// ❌ Vague
var result = await ollama.GenerateAsync("qwen2.5", "Write code");
// ✅ Specific
var result = await ollama.GenerateAsync("qwen2.5",
"Write a C# method that validates email addresses using regex. " +
"Include error handling and XML documentation comments.");
2. Temperature Control
Control randomness with temperature (0.0 = deterministic, 1.0 = creative):
var request = new
{
model = "qwen2.5",
prompt = "Generate a creative story",
options = new
{
temperature = 0.3 // Lower value = more deterministic
}
};
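The simple OllamaClient shown earlier doesn’t pass options through yet; one way to support this is an extra overload that forwards them (an illustrative addition, not part of any official API):

public async Task<string> GenerateAsync(string model, string prompt, double temperature)
{
    var request = new
    {
        model,
        prompt,
        stream = false,
        options = new { temperature } // forwarded as-is to /api/generate
    };

    var response = await _httpClient.PostAsJsonAsync("/api/generate", request);
    response.EnsureSuccessStatusCode();

    var result = await response.Content.ReadFromJsonAsync<OllamaResponse>();
    return result?.Response ?? string.Empty;
}

Other Ollama options such as top_p or num_ctx can be forwarded the same way.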
3. Context Windows
Models have token/context limits that vary by model and configuration (many modern models support from 8K up to 128K tokens or more). For longer documents, implement chunking:
public async Task<string> SummarizeLongDocument(string document)
{
const int chunkSize = 2000; // characters
var chunks = SplitIntoChunks(document, chunkSize);
var summaries = new List<string>();
foreach (var chunk in chunks)
{
var summary = await _ollama.GenerateAsync("llama3.2",
$"Summarize this text:\n{chunk}");
summaries.Add(summary);
}
// Final summary of summaries
return await _ollama.GenerateAsync("llama3.2",
$"Create a final summary from these summaries:\n{string.Join("\n", summaries)}");
}
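SplitIntoChunks is not shown above; a naive character-based version could look like this (a real implementation would split on sentence or paragraph boundaries so ideas aren’t cut in half):

private static IEnumerable<string> SplitIntoChunks(string text, int chunkSize)
{
    // Fixed-size slices; good enough for a demo, crude for production
    for (var i = 0; i < text.Length; i += chunkSize)
    {
        yield return text.Substring(i, Math.Min(chunkSize, text.Length - i));
    }
}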
4. Model Management
Check available models:
ollama list
Remove unused models to save space:
ollama rm mistral
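The same information is available from .NET through the /api/tags endpoint, which returns the locally installed models. A small sketch (the record names are mine) that checks whether a model is present before calling it:

public record ModelInfo(string Name, long Size);
public record ModelList(List<ModelInfo> Models);

public async Task<bool> IsModelAvailableAsync(string model)
{
    // GET /api/tags lists installed models, e.g. "llama3.2:latest"
    var list = await _httpClient.GetFromJsonAsync<ModelList>("/api/tags");
    return list?.Models.Any(m => m.Name.StartsWith(model)) ?? false;
}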
Performance Considerations
Hardware Requirements:
- Minimum: 8GB RAM, modern CPU
- Recommended: 16GB+ RAM, GPU with CUDA/ROCm support
- Optimal: 32GB RAM, NVIDIA RTX series GPU
GPU Acceleration:
Ollama automatically uses your GPU if available. Check with:
ollama ps
Memory Usage:
Monitor with:
# Windows
Get-Process ollama
# Linux/Mac
ps aux | grep ollama
Common Use Cases
1. Code Review Assistant
var review = await ollama.GenerateAsync("devstral",
$"Review this C# code for issues:\n{code}");
2. Email Drafting
var email = await ollama.GenerateAsync("qwen2.5",
"Draft a professional email declining a meeting request politely.");
3. Data Extraction
var extracted = await ollama.GenerateAsync("qwen2.5",
$"Extract names, dates, and amounts from this invoice:\n{invoice}");
4. Translation
var translation = await ollama.GenerateAsync("qwen2.5",
$"Translate to French: {englishText}");
Limitations and Considerations
Model Quality: Local models may not match GPT-4o’s quality for complex tasks. Choose based on your requirements.
Resource Intensive: Larger models need significant RAM and benefit greatly from GPUs.
No Internet Knowledge: Models only know information from their training data. They can’t access current events or web content.
Hallucinations: All LLMs can generate plausible-sounding but incorrect information. Always validate critical outputs.
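One practical guard, especially for extraction scenarios like the invoice example above, is to request JSON output (the API accepts "format": "json") and reject anything that doesn’t parse into the expected shape. A sketch, assuming it lives in the same OllamaClient class; the InvoiceData record is purely illustrative:

public record InvoiceData(string? CustomerName, string? Date, decimal? Total);

public async Task<InvoiceData?> ExtractInvoiceAsync(string invoiceText)
{
    var request = new
    {
        model = "qwen2.5",
        prompt = $"Extract customerName, date and total from this invoice as JSON:\n{invoiceText}",
        format = "json",  // asks Ollama to constrain the output to valid JSON
        stream = false
    };

    var response = await _httpClient.PostAsJsonAsync("/api/generate", request);
    response.EnsureSuccessStatusCode();
    var result = await response.Content.ReadFromJsonAsync<OllamaResponse>();

    try
    {
        // The model's answer is itself a JSON string inside the "response" field
        return JsonSerializer.Deserialize<InvoiceData>(result?.Response ?? "{}",
            new JsonSerializerOptions(JsonSerializerDefaults.Web));
    }
    catch (JsonException)
    {
        return null; // unparseable output is treated as "no answer" instead of being trusted
    }
}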
Next Steps
- Start Small: Try Phi-4 or Llama 3.2 for initial experiments
- Build Prototypes: Create simple chat or document processing apps
- Optimize Prompts: Experiment with different prompt structures
- Add RAG: Combine with vector databases for better context
- Monitor Performance: Profile your application under load
The combination of Ollama and .NET opens up possibilities for privacy-focused, cost-effective AI applications. Whether you’re building internal tools, prototyping ideas, or creating offline-capable software, local AI gives you control and flexibility.
Explore, experiment, and innovate—the future of AI is in your hands, locally.
This post was created with the assistance of AI.