Local AI Development with Ollama and .NET

The AI landscape has evolved rapidly, but so have concerns about sending sensitive data to cloud services, unpredictable API costs, and applications that stop working without internet connectivity. Enter Ollama, a tool that brings powerful language models to your local machine and pairs neatly with .NET’s robust ecosystem.

Why Local AI Development Matters

Privacy and Security: Your data never leaves your machine. This is crucial for industries dealing with sensitive information, legal documents, healthcare records, or proprietary business data.

Cost Control: No per-token charges, no surprise bills. Once you’ve downloaded a model, inference is free beyond your electricity costs.

Offline Capability: Build applications that work without internet connectivity—essential for field work, air-gapped environments, or regions with unreliable connectivity.

Development Flexibility: Experiment freely without worrying about API rate limits or costs during development and testing.

What is Ollama?

Ollama is an open-source tool that makes it easy to run large language models locally. Think of it as Docker for AI models—it handles model downloads, manages resources, and provides a simple API interface.

Supported Models:

  • Llama - Meta’s flagship models (Llama 3.1, 3.2, 3.3, 4 with vision)
  • Qwen - Alibaba’s high-performing multilingual models (Qwen 3, Qwen 2.5)
  • Mistral - Efficient models with long context (Mistral Small, Large, Nemo)
  • DeepSeek - Reasoning models (DeepSeek-R1, DeepSeek-V3)
  • Phi - Microsoft’s compact models (Phi-3, Phi-4)
  • Gemma - Google’s open models (Gemma 3, CodeGemma)
  • CodeLlama / Devstral - Specialized for code generation
  • Specialized Models - Vision models, multimodal, reasoning, embedding models
  • And hundreds more

Browse the complete library at ollama.com/library to explore all available models, including specialized versions for coding, reasoning, multilingual support, and vision capabilities.

Setting Up Ollama

Installation

Windows: Download the installer from ollama.com/download and run it, or use winget:

winget install Ollama.Ollama

Ollama runs as a background service.

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh
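
Whichever platform you are on, confirm the install and that the local server is up:

ollama --version

If the background service is not running (more common on Linux), start it manually with ollama serve.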

Downloading Your First Model

ollama pull llama3.2

This downloads the default Llama 3.2 model (3B parameters, roughly 2GB). Start an interactive chat session with:

ollama run llama3.2

Testing the API

Ollama exposes a REST API on http://localhost:11434. Test it:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Integrating Ollama with .NET

Using HttpClient

For simple scenarios, use .NET’s built-in HttpClient:

using System.Net.Http.Json;
using System.Text.Json;

public class OllamaClient
{
    private readonly HttpClient _httpClient;
    private const string BaseUrl = "http://localhost:11434";

    public OllamaClient()
    {
        _httpClient = new HttpClient { BaseAddress = new Uri(BaseUrl) };
    }

    public async Task<string> GenerateAsync(string model, string prompt)
    {
        var request = new
        {
            model,
            prompt,
            stream = false
        };

        var response = await _httpClient.PostAsJsonAsync("/api/generate", request);
        response.EnsureSuccessStatusCode();

        var result = await response.Content.ReadFromJsonAsync<OllamaResponse>();
        return result?.Response ?? string.Empty;
    }
}

public record OllamaResponse(string Response, string Model, bool Done);

Usage:

var client = new OllamaClient();
var answer = await client.GenerateAsync("llama3.2", "Explain dependency injection in C#");
Console.WriteLine(answer);
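
The /api/generate endpoint is stateless. For multi-turn conversations over raw HttpClient, Ollama also exposes /api/chat, which takes the whole message history on each call. Here is a minimal sketch that could sit alongside GenerateAsync in the same class (ChatAsync, ChatMessage, and ChatResponse are illustrative names, not library types):

public record ChatMessage(string Role, string Content);
public record ChatResponse(ChatMessage Message, bool Done);

public async Task<string> ChatAsync(string model, IEnumerable<ChatMessage> messages)
{
    // /api/chat expects the full conversation so far; the caller keeps the history.
    var request = new { model, messages, stream = false };

    var response = await _httpClient.PostAsJsonAsync("/api/chat", request);
    response.EnsureSuccessStatusCode();

    var result = await response.Content.ReadFromJsonAsync<ChatResponse>();
    return result?.Message.Content ?? string.Empty;
}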

Using OllamaSharp Library

For production use, consider OllamaSharp:

dotnet add package OllamaSharp

Then stream a chat response:

using OllamaSharp;

var ollama = new OllamaApiClient("http://localhost:11434", "llama3.2");

var chat = new Chat(ollama);
await foreach (var response in chat.SendAsync("What are SOLID principles?"))
{
    Console.Write(response);
}
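
Chat keeps the conversation history for you, so follow-up prompts can refer back to earlier turns, for example:

await foreach (var token in chat.SendAsync("Which of those principles is hardest to apply in practice?"))
{
    Console.Write(token);
}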

Building a Document Q&A System

Here’s a practical example combining Ollama with document processing:

public class DocumentQAService
{
    private readonly OllamaClient _ollama;
    private readonly Dictionary<string, string> _documents = new();

    public DocumentQAService(OllamaClient ollama)
    {
        _ollama = ollama;
    }

    public void AddDocument(string id, string content)
    {
        _documents[id] = content;
    }

    public async Task<string> AskQuestionAsync(string question)
    {
        // Combine all documents as context
        var context = string.Join("\n\n", _documents.Values);
        
        var prompt = $"""
            Context:
            {context}
            
            Question: {question}
            
            Answer based only on the context provided above.
            """;

        return await _ollama.GenerateAsync("llama3.2", prompt);
    }
}

Usage:

var qa = new DocumentQAService(new OllamaClient());
qa.AddDocument("policy", "Our refund policy allows returns within 30 days...");
qa.AddDocument("shipping", "We ship worldwide with DHL. Delivery takes 3-5 days...");

var answer = await qa.AskQuestionAsync("What is your refund policy?");
Console.WriteLine(answer);

Choosing the Right Model

Different models suit different needs:

Model        | Size     | Best For                 | Speed
Phi-4        | 14B      | Fast reasoning & coding  | Fast
Llama 3.2    | 1B-3B    | Quick tasks, chat        | Very Fast
Qwen 2.5     | 7B-32B   | General purpose, coding  | Medium
Llama 3.1    | 8B-70B   | Complex reasoning        | Medium-Slow
DeepSeek-R1  | 7B-671B  | Advanced reasoning       | Slower

For development, start with Qwen 2.5 7B or Llama 3.2 3B (excellent balance of speed and quality). For coding tasks, DeepSeek-Coder or Devstral are specialized choices.
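
Most library entries come in several parameter sizes, selected with a tag when pulling:

ollama pull llama3.2:1b
ollama pull qwen2.5:7b
ollama pull qwen2.5:32b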

Best Practices

1. Prompt Engineering

Be specific and provide context:

// ❌ Vague
var result = await ollama.GenerateAsync("qwen2.5", "Write code");

// ✅ Specific
var result = await ollama.GenerateAsync("qwen2.5", 
    "Write a C# method that validates email addresses using regex. " +
    "Include error handling and XML documentation comments.");

2. Temperature Control

Control randomness with temperature (0.0 = deterministic, 1.0 = creative):

var request = new
{
    model = "qwen2.5",
    prompt = "Generate a creative story",
    options = new
    {
        temperature = 0.3 // Lower value = more deterministic
    }
};
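
If you are using the HttpClient wrapper from earlier, one way to expose this is an overload with a temperature parameter. This is a sketch; options.temperature is part of Ollama’s generate payload, and the rest mirrors the existing method:

public async Task<string> GenerateAsync(string model, string prompt, double temperature)
{
    var request = new
    {
        model,
        prompt,
        stream = false,
        options = new { temperature } // forwarded to the model's sampling settings
    };

    var response = await _httpClient.PostAsJsonAsync("/api/generate", request);
    response.EnsureSuccessStatusCode();

    var result = await response.Content.ReadFromJsonAsync<OllamaResponse>();
    return result?.Response ?? string.Empty;
}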

3. Context Windows

Models have token/context limits that vary by model and configuration (many modern models support from 8K up to 128K tokens or more). For longer documents, implement chunking:

public async Task<string> SummarizeLongDocument(string document)
{
    const int chunkSize = 2000; // characters
    var chunks = SplitIntoChunks(document, chunkSize);
    var summaries = new List<string>();

    foreach (var chunk in chunks)
    {
        var summary = await _ollama.GenerateAsync("llama3.2", 
            $"Summarize this text:\n{chunk}");
        summaries.Add(summary);
    }

    // Final summary of summaries
    return await _ollama.GenerateAsync("llama3.2",
        $"Create a final summary from these summaries:\n{string.Join("\n", summaries)}");
}
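
SplitIntoChunks is not shown above; a naive character-based version could look like the following (splitting on paragraph or sentence boundaries usually yields better summaries):

private static IEnumerable<string> SplitIntoChunks(string text, int chunkSize)
{
    // Walk the text in fixed-size windows; the final chunk may be shorter.
    for (var i = 0; i < text.Length; i += chunkSize)
    {
        yield return text.Substring(i, Math.Min(chunkSize, text.Length - i));
    }
}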

4. Model Management

Check available models:

ollama list

Remove unused models to save space:

ollama rm mistral
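
Inspect a model’s parameter count, context length, and license with:

ollama show llama3.2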

Performance Considerations

Hardware Requirements:

  • Minimum: 8GB RAM, modern CPU
  • Recommended: 16GB+ RAM, GPU with CUDA/ROCm support
  • Optimal: 32GB RAM, NVIDIA RTX series GPU

GPU Acceleration:

Ollama automatically uses your GPU if available. Check with:

ollama ps

Memory Usage:

Monitor with:

# Windows
Get-Process ollama

# Linux/Mac
ps aux | grep ollama

Common Use Cases

1. Code Review Assistant

var review = await ollama.GenerateAsync("devstral",
    $"Review this C# code for issues:\n{code}");

2. Email Drafting

var email = await ollama.GenerateAsync("qwen2.5",
    "Draft a professional email declining a meeting request politely.");

3. Data Extraction

var extracted = await ollama.GenerateAsync("qwen2.5",
    $"Extract names, dates, and amounts from this invoice:\n{invoice}");

4. Translation

var translation = await ollama.GenerateAsync("qwen2.5",
    $"Translate to French: {englishText}");

Limitations and Considerations

Model Quality: Local models may not match GPT-4o’s quality for complex tasks. Choose based on your requirements.

Resource Intensive: Larger models need significant RAM and benefit greatly from GPUs.

No Internet Knowledge: Models only know information from their training data. They can’t access current events or web content.

Hallucinations: All LLMs can generate plausible-sounding but incorrect information. Always validate critical outputs.

Next Steps

  1. Start Small: Try Phi-4 or Llama 3.2 for initial experiments
  2. Build Prototypes: Create simple chat or document processing apps
  3. Optimize Prompts: Experiment with different prompt structures
  4. Add RAG: Combine with vector databases for better context (see the embedding sketch below)
  5. Monitor Performance: Profile your application under load
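
For the RAG step, Ollama can also serve embedding models (for example, nomic-embed-text, pulled like any other model): embed your document chunks, store the vectors, and embed each question to retrieve the most similar chunks before prompting. A minimal call against the embeddings endpoint, again a sketch with illustrative names that could sit in the same OllamaClient class:

public record EmbeddingResponse(float[] Embedding);

public async Task<float[]> EmbedAsync(string model, string text)
{
    // POST /api/embeddings returns a single vector for the given prompt.
    var response = await _httpClient.PostAsJsonAsync("/api/embeddings",
        new { model, prompt = text });
    response.EnsureSuccessStatusCode();

    var result = await response.Content.ReadFromJsonAsync<EmbeddingResponse>();
    return result?.Embedding ?? Array.Empty<float>();
}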

The combination of Ollama and .NET opens up possibilities for privacy-focused, cost-effective AI applications. Whether you’re building internal tools, prototyping ideas, or creating offline-capable software, local AI gives you control and flexibility.

Explore, experiment, and innovate—the future of AI is in your hands, locally.


This post was created with the assistance of AI.

