The AI landscape has evolved rapidly, but there are growing concerns about sending sensitive data to cloud services, unpredictable API costs, and losing functionality without internet connectivity. Enter Ollama, a solution that brings powerful language models to your local machine, paired perfectly with .NET’s robust ecosystem.
Why Local AI Development Matters
Privacy and Security: Your data never leaves your machine. This is crucial for industries dealing with sensitive information, legal documents, healthcare records, or proprietary business data.
Cost Control: No per-token charges, no surprise bills. Once you’ve downloaded a model, inference is free beyond your electricity costs.
Offline Capability: Build applications that work without internet connectivity—essential for field work, air-gapped environments, or regions with unreliable connectivity.
Development Flexibility: Experiment freely without worrying about API rate limits or costs during development and testing.
What is Ollama?
Ollama is an open-source tool that makes it easy to run large language models locally. Think of it as Docker for AI models—it handles model downloads, manages resources, and provides a simple API interface.
Supported Models:
- Llama - Meta’s flagship models (Llama 3.1, 3.2, 3.3, 4 with vision)
- Qwen - Alibaba’s high-performing multilingual models (Qwen 3, Qwen 2.5)
- Mistral - Efficient models with long context (Mistral Small, Large, Nemo)
- DeepSeek - Reasoning models (DeepSeek-R1, DeepSeek-V3)
- Phi - Microsoft’s compact models (Phi-3, Phi-4)
- Gemma - Google’s open models (Gemma 3, CodeGemma)
- CodeLlama / Devstral - Specialized for code generation
- Specialized Models - vision, multimodal, reasoning, and embedding models
- And hundreds more
Browse the complete library at ollama.com/library to explore all available models, including specialized versions for coding, reasoning, multilingual support, and vision capabilities.
Setting Up Ollama
Installation
Windows: Download the installer from ollama.com/download and run it, or use winget:
winget install Ollama.Ollama
Ollama runs as a background service.
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Downloading Your First Model
ollama pull llama3.2
This downloads the default Llama 3.2 model (3B parameters, roughly 2 GB). Start an interactive session with:
ollama run llama3.2
Testing the API
Ollama exposes a REST API on http://localhost:11434. Test it:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?",
"stream": false
}'
Integrating Ollama with .NET
Using HttpClient
For simple scenarios, use .NET’s built-in HttpClient:
using System.Net.Http.Json;
using System.Text.Json;

public class OllamaClient
{
    private readonly HttpClient _httpClient;
    private const string BaseUrl = "http://localhost:11434";

    public OllamaClient()
    {
        // Reuse a single HttpClient instance (or inject one via IHttpClientFactory in real apps)
        _httpClient = new HttpClient { BaseAddress = new Uri(BaseUrl) };
    }

    public async Task<string> GenerateAsync(string model, string prompt)
    {
        // stream = false returns the full completion in a single JSON response
        var request = new
        {
            model,
            prompt,
            stream = false
        };

        var response = await _httpClient.PostAsJsonAsync("/api/generate", request);
        response.EnsureSuccessStatusCode();

        var result = await response.Content.ReadFromJsonAsync<OllamaResponse>();
        return result?.Response ?? string.Empty;
    }
}

public record OllamaResponse(string Response, string Model, bool Done);
Usage:
var client = new OllamaClient();
var answer = await client.GenerateAsync("llama3.2", "Explain dependency injection in C#");
Console.WriteLine(answer);
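The client above waits for the complete answer before returning. If you want tokens to appear as they are generated, the same endpoint accepts stream = true and then returns newline-delimited JSON objects, each carrying a fragment of the answer. A possible streaming method for the same class (a sketch; the method name is mine and error handling is minimal):

public async IAsyncEnumerable<string> GenerateStreamingAsync(string model, string prompt)
{
    var request = new HttpRequestMessage(HttpMethod.Post, "/api/generate")
    {
        Content = JsonContent.Create(new { model, prompt, stream = true })
    };

    // Read headers first so the body can be consumed as it arrives
    using var response = await _httpClient.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
    response.EnsureSuccessStatusCode();

    using var stream = await response.Content.ReadAsStreamAsync();
    using var reader = new StreamReader(stream);

    // Each line is a JSON object with a "response" fragment; the last one has "done": true
    while (await reader.ReadLineAsync() is { } line)
    {
        if (string.IsNullOrWhiteSpace(line)) continue;

        var chunk = JsonSerializer.Deserialize<OllamaResponse>(line,
            new JsonSerializerOptions(JsonSerializerDefaults.Web));

        if (chunk is null) continue;
        yield return chunk.Response;
        if (chunk.Done) yield break;
    }
}

Consume it with await foreach and write each fragment to the console as it arrives.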
Using OllamaSharp Library
For production use, consider OllamaSharp:
dotnet add package OllamaSharp
using OllamaSharp;
var ollama = new OllamaApiClient("http://localhost:11434", "llama3.2");
var chat = new Chat(ollama);
await foreach (var response in chat.SendAsync("What are SOLID principles?"))
{
Console.Write(response);
}
Building a Document Q&A System
Here’s a practical example combining Ollama with document processing:
public class DocumentQAService
{
private readonly OllamaClient _ollama;
private readonly Dictionary<string, string> _documents = new();
public DocumentQAService(OllamaClient ollama)
{
_ollama = ollama;
}
public void AddDocument(string id, string content)
{
_documents[id] = content;
}
public async Task<string> AskQuestionAsync(string question)
{
// Combine all documents as context
var context = string.Join("\n\n", _documents.Values);
var prompt = $"""
Context:
{context}
Question: {question}
Answer based only on the context provided above.
""";
return await _ollama.GenerateAsync("llama3.2", prompt);
}
}
Usage:
var qa = new DocumentQAService(new OllamaClient());
qa.AddDocument("policy", "Our refund policy allows returns within 30 days...");
qa.AddDocument("shipping", "We ship worldwide with DHL. Delivery takes 3-5 days...");
var answer = await qa.AskQuestionAsync("What is your refund policy?");
Console.WriteLine(answer);
Choosing the Right Model
Different models suit different needs:
| Model | Size | Best For | Speed |
|---|---|---|---|
| Phi-4 | 14B | Fast reasoning & coding | Fast |
| Llama 3.2 | 1B-3B | Quick tasks, chat | Very Fast |
| Qwen 2.5 | 7B-32B | General purpose, coding | Medium |
| Llama 3.1 | 8B-70B | Complex reasoning | Medium-Slow |
| DeepSeek-R1 | 7B-671B | Advanced reasoning | Slower |
For development, start with Qwen 2.5 7B or Llama 3.2 3B (excellent balance of speed and quality). For coding tasks, DeepSeek-Coder or Devstral are specialized choices.
Best Practices
1. Prompt Engineering
Be specific and provide context:
// ❌ Vague
var result = await ollama.GenerateAsync("qwen2.5", "Write code");
// ✅ Specific
var result = await ollama.GenerateAsync("qwen2.5",
"Write a C# method that validates email addresses using regex. " +
"Include error handling and XML documentation comments.");
2. Temperature Control
Control randomness with temperature (0.0 = deterministic, 1.0 = creative):
var request = new
{
model = "qwen2.5",
prompt = "Generate a creative story",
options = new
{
temperature = 0.3 // Lower value = more deterministic
}
};
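The simple OllamaClient shown earlier doesn’t pass options through yet; one way to support this is an extra overload that forwards them (an illustrative addition, not part of any official API):

public async Task<string> GenerateAsync(string model, string prompt, double temperature)
{
    var request = new
    {
        model,
        prompt,
        stream = false,
        options = new { temperature } // forwarded as-is to /api/generate
    };

    var response = await _httpClient.PostAsJsonAsync("/api/generate", request);
    response.EnsureSuccessStatusCode();

    var result = await response.Content.ReadFromJsonAsync<OllamaResponse>();
    return result?.Response ?? string.Empty;
}

Other Ollama options such as top_p or num_ctx can be forwarded the same way.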
3. Context Windows
Models have token/context limits that vary by model and configuration (many modern models support from 8K up to 128K tokens or more). For longer documents, implement chunking:
public async Task<string> SummarizeLongDocument(string document)
{
const int chunkSize = 2000; // characters
var chunks = SplitIntoChunks(document, chunkSize);
var summaries = new List<string>();
foreach (var chunk in chunks)
{
var summary = await _ollama.GenerateAsync("llama3.2",
$"Summarize this text:\n{chunk}");
summaries.Add(summary);
}
// Final summary of summaries
return await _ollama.GenerateAsync("llama3.2",
$"Create a final summary from these summaries:\n{string.Join("\n", summaries)}");
}
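SplitIntoChunks is not shown above; a naive character-based version could look like this (a real implementation would split on sentence or paragraph boundaries so ideas aren’t cut in half):

private static IEnumerable<string> SplitIntoChunks(string text, int chunkSize)
{
    // Fixed-size slices; good enough for a demo, crude for production
    for (var i = 0; i < text.Length; i += chunkSize)
    {
        yield return text.Substring(i, Math.Min(chunkSize, text.Length - i));
    }
}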
4. Model Management
Check available models:
ollama list
Remove unused models to save space:
ollama rm mistral
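The same information is available from .NET through the /api/tags endpoint, which returns the locally installed models. A small sketch (the record names are mine) that checks whether a model is present before calling it:

public record ModelInfo(string Name, long Size);
public record ModelList(List<ModelInfo> Models);

public async Task<bool> IsModelAvailableAsync(string model)
{
    // GET /api/tags lists installed models, e.g. "llama3.2:latest"
    var list = await _httpClient.GetFromJsonAsync<ModelList>("/api/tags");
    return list?.Models.Any(m => m.Name.StartsWith(model)) ?? false;
}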
Performance Considerations
Hardware Requirements:
- Minimum: 8GB RAM, modern CPU
- Recommended: 16GB+ RAM, GPU with CUDA/ROCm support
- Optimal: 32GB RAM, NVIDIA RTX series GPU
GPU Acceleration:
Ollama automatically uses your GPU if available. Check with:
ollama ps
Memory Usage:
Monitor with:
# Windows
Get-Process ollama
# Linux/Mac
ps aux | grep ollama
Common Use Cases
1. Code Review Assistant
var review = await ollama.GenerateAsync("devstral",
$"Review this C# code for issues:\n{code}");
2. Email Drafting
var email = await ollama.GenerateAsync("qwen2.5",
"Draft a professional email declining a meeting request politely.");
3. Data Extraction
var extracted = await ollama.GenerateAsync("qwen2.5",
$"Extract names, dates, and amounts from this invoice:\n{invoice}");
4. Translation
var translation = await ollama.GenerateAsync("qwen2.5",
$"Translate to French: {englishText}");
Limitations and Considerations
Model Quality: Local models may not match GPT-4o’s quality for complex tasks. Choose based on your requirements.
Resource Intensive: Larger models need significant RAM and benefit greatly from GPUs.
No Internet Knowledge: Models only know information from their training data. They can’t access current events or web content.
Hallucinations: All LLMs can generate plausible-sounding but incorrect information. Always validate critical outputs.
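One practical guard, especially for extraction scenarios like the invoice example above, is to request JSON output (the API accepts "format": "json") and reject anything that doesn’t parse into the expected shape. A sketch, assuming it lives in the same OllamaClient class; the InvoiceData record is purely illustrative:

public record InvoiceData(string? CustomerName, string? Date, decimal? Total);

public async Task<InvoiceData?> ExtractInvoiceAsync(string invoiceText)
{
    var request = new
    {
        model = "qwen2.5",
        prompt = $"Extract customerName, date and total from this invoice as JSON:\n{invoiceText}",
        format = "json",  // asks Ollama to constrain the output to valid JSON
        stream = false
    };

    var response = await _httpClient.PostAsJsonAsync("/api/generate", request);
    response.EnsureSuccessStatusCode();
    var result = await response.Content.ReadFromJsonAsync<OllamaResponse>();

    try
    {
        // The model's answer is itself a JSON string inside the "response" field
        return JsonSerializer.Deserialize<InvoiceData>(result?.Response ?? "{}",
            new JsonSerializerOptions(JsonSerializerDefaults.Web));
    }
    catch (JsonException)
    {
        return null; // unparseable output is treated as "no answer" instead of being trusted
    }
}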
Next Steps
- Start Small: Try Phi-4 or Llama 3.2 for initial experiments
- Build Prototypes: Create simple chat or document processing apps
- Optimize Prompts: Experiment with different prompt structures
- Add RAG: Combine with vector databases for better context
- Monitor Performance: Profile your application under load
The combination of Ollama and .NET opens up possibilities for privacy-focused, cost-effective AI applications. Whether you’re building internal tools, prototyping ideas, or creating offline-capable software, local AI gives you control and flexibility.
Explore, experiment, and innovate—the future of AI is in your hands, locally.
This post was created with the assistance of AI.