Local RAG with Ollama, LiteLLM, and Qdrant


Part 2 of 2 in the “Local AI with Ollama and .NET” series: Part 1 – Local AI Development | πŸ‡«πŸ‡· Version

This post shows how to build a local RAG pipeline in .NET using Ollama for models, LiteLLM as a thin API proxy, and Qdrant for vector search.

A complete, working example is available here: mongeon/code-examples Β· local-rag-ollama-litellm.

How RAG Works

RAG (Retrieval-Augmented Generation) augments an LLM with external knowledge by retrieving relevant documents and feeding them to the model as context. Instead of relying only on the model’s training data, RAG grounds answers in your specific documents, which improves accuracy, freshness, and traceability.

The flow breaks into two phases:

Indexing (Ingestion)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Your Documents                          β”‚
β”‚              (PDFs, markdown, code, etc.)                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   Clean & Split β”‚  (Remove headers,
                    β”‚   into Chunks   β”‚   normalize text)
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                   β”‚  Convert to Vectors  β”‚  (Embedding model:
                   β”‚  (via Ollama/        β”‚   nomic-embed-text)
                   β”‚   LiteLLM)           β”‚
                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚  Store in Vector DB β”‚  (Qdrant)
                    β”‚                     β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Indexing runs once (or whenever documents change). You load your knowledge base, break documents into manageable chunks (200–500 tokens each), and convert each chunk into a dense vector using an embedding model. These vectors capture semantic meaning: similar concepts cluster together in vector space. Store them in a vector database for fast similarity search later.

This is a one-time cost per document set. You can run it as a batch job, overnight, or as part of your CI/CD pipeline. Once indexed, your knowledge base is ready for queries.

Query

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   User Question                              β”‚
β”‚            "How do embeddings work?"                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                   β”‚  Convert Q to Vector β”‚  (Same embedding
                   β”‚                      β”‚   model)
                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                   β”‚  Find Top-k Similar  β”‚  (Cosine similarity
                   β”‚  Chunks in Vector DB β”‚   search)
                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚ Retrieved Chunks (Grounding Context)      β”‚
          β”‚ - "Embeddings are dense vectors..."       β”‚
          β”‚ - "They capture semantic meaning..."      β”‚
          β”‚ - ...                                     β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ Build Prompt with Context                       β”‚
    β”‚ "Context: [chunks]\n\nQuestion: How do...?"     β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚   Generate Answer via     β”‚  (Ollama + LiteLLM,
              β”‚   Local LLM (qwen2.5)     β”‚   temperature: 0.2)
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚ Grounded Answer with Citation        β”‚
        β”‚ "Based on the retrieved context,     β”‚
        β”‚  embeddings are... [cited]"          β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Query runs every time a user asks a question. The question is converted to a vector using the same embedding model, then the vector database finds the most similar chunks (typically 3–5). These chunks become the context fed to your LLM, which generates an answer grounded in your documents, not from its training data.

This process is fast (typically under 500 ms for retrieval plus generation on local hardware) and consistent: the same question with the same retrieved chunks yields very similar answers. Because indexing and retrieval are separate, you can grow the knowledge base without retraining models or slowing down queries.

Key Benefits

  • Accuracy: Answers grounded in real documents, not hallucinations.
  • Traceability: Users see which chunks were used, citations included.
  • Freshness: Update knowledge by reindexing; no model retraining.
  • Offline: Everything runs locally; no cloud API calls.

Architecture

The implementation stack:

  1. Ingest & Index: Load documents β†’ clean β†’ chunk β†’ embed β†’ store vectors.
  2. Retrieval: At query time, embed question β†’ find top-k similar chunks in vector DB.
  3. Generation: Compose prompt with context β†’ answer via Ollama (through LiteLLM proxy).
  4. Evaluation: Validate grounding, measure relevance, and track latency.

Prerequisites

  • Ollama installed with the models pulled (e.g., ollama pull nomic-embed-text for embeddings, and qwen2.5 or llama3.2 for generation).
  • LiteLLM running as a proxy in front of Ollama.
  • Qdrant running locally (e.g., via Docker: docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant).

Sample LiteLLM config (litellm.yaml)

LiteLLM acts as a unified API proxy between your .NET code and Ollama. Instead of calling Ollama directly, you point your application to LiteLLM, which routes requests to the right model. This keeps your code consistent whether you use local models (via Ollama) or eventually switch to cloud APIs.

The config below defines two models:

  • nomic-embed-text: A lightweight embedding model that converts text into 768-dimensional dense vectors.
  • qwen2.5: A general-purpose LLM for answer generation.

Both point to your local Ollama instance on port 11434.

model_list:
  - model_name: nomic-embed-text
    litellm_params:
      model: ollama/nomic-embed-text
      api_base: http://localhost:11434
  - model_name: qwen2.5
    litellm_params:
      model: ollama/qwen2.5
      api_base: http://localhost:11434

Run: litellm --config litellm.yaml --port 8080 (the C# code below assumes the proxy listens on port 8080)

Ingestion and Chunking (C#)

Why chunking? Large documents won’t fit into a single embedding. Chunking breaks documents into overlapping segments (~200–500 tokens) that are semantically meaningful (whole sentences or paragraphs). This balance ensures:

  • Semantic completeness: Chunks represent coherent ideas.
  • Reasonable size: Short enough to embed quickly, large enough to be useful as context.
  • Overlap: A 20–30% overlap reduces information loss at chunk boundaries.

The Ingest.Chunk() method:

  1. Normalizes whitespace (collapses extra spaces and newlines).
  2. Splits on sentence boundaries (periods, exclamation marks, question marks).
  3. Groups sentences until a chunk reaches ~400 words (a rough token proxy), then yields it and starts a new one. For simplicity this version skips overlap; a possible extension is sketched after the code.

using System.Text.RegularExpressions;

public static class Ingest
{
    public static IEnumerable<string> Chunk(string text, int maxTokens = 400)
    {
        // Normalize whitespace
        var cleaned = Regex.Replace(text, "\\s+", " ").Trim();

        // Split by sentence
        var sentences = cleaned.Split(['.', '!', '?'], StringSplitOptions.RemoveEmptyEntries);

        var current = new List<string>();

        foreach (var sentence in sentences)
        {
            var trimmed = sentence.Trim();
            if (string.IsNullOrWhiteSpace(trimmed)) continue;

            var candidate = string.Join(" ", current.Append(trimmed));
            // Approximate token count by word count (good enough for sizing chunks)
            var tokenCount = candidate.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

            if (tokenCount > maxTokens)
            {
                if (current.Count > 0)
                {
                    yield return string.Join(" ", current);
                }
                current.Clear();
            }

            current.Add(trimmed);
        }

        if (current.Count > 0)
        {
            yield return string.Join(" ", current);
        }
    }
}
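
The prose above mentions a 20–30% overlap, which this simple version leaves out. A minimal sketch of one way to add a sliding-window overlap (an illustrative helper, not part of the sample repo):

using System.Linq;
using System.Text.RegularExpressions;

public static class OverlappingChunker
{
    // Word-based chunking with a configurable overlap fraction (default 25%).
    // Words approximate tokens well enough for sizing chunks.
    public static IEnumerable<string> Chunk(string text, int maxTokens = 400, double overlap = 0.25)
    {
        var words = Regex.Replace(text, "\\s+", " ").Trim()
            .Split(' ', StringSplitOptions.RemoveEmptyEntries);

        // Advance by less than a full chunk so consecutive chunks share a boundary region.
        var step = Math.Max(1, (int)(maxTokens * (1 - overlap)));

        for (int start = 0; start < words.Length; start += step)
        {
            var length = Math.Min(maxTokens, words.Length - start);
            yield return string.Join(" ", words.Skip(start).Take(length));
            if (start + length >= words.Length) break;
        }
    }
}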

Embeddings via LiteLLM

What are embeddings? An embedding is a dense vector (list of floats, typically 384–768 dimensions) that represents the semantic meaning of text. Similar texts produce similar vectors, measured by cosine distance. This enables fast similarity search in vector databases.
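
To make “similar vectors” concrete: cosine similarity is just a normalized dot product. You never compute it yourself in this pipeline (Qdrant does it internally), but a tiny illustrative helper shows what the search is measuring:

public static class VectorMath
{
    // Cosine similarity of two equal-length vectors: 1.0 = same direction, 0 = unrelated.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}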

The EmbeddingClient sends text to LiteLLM, which forwards it to Ollama’s embedding model. The model returns a float[] representing that text in vector space. Later, when you query, you embed the question using the same model and search the vector database for nearest neighbors.

using LocalRag.Core.Models;
using System.Net.Http.Json;

public class EmbeddingClient(HttpClient httpClient)
{
    public async Task<float[]> EmbedAsync(string text, string model = "nomic-embed-text")
    {
        try
        {
            var request = new EmbedRequest(text, model);
            var response = await httpClient.PostAsJsonAsync("/v1/embeddings", request);
            response.EnsureSuccessStatusCode();

            var embedResponse = await response.Content.ReadFromJsonAsync<EmbedResponse>();
            return embedResponse?.Data?.FirstOrDefault()?.Embedding ?? [];
        }
        catch (Exception ex)
        {
            Console.WriteLine($"❌ Embedding error: {ex.Message}");
            throw;
        }
    }
}

public record EmbedRequest(string Input, string Model);
public record EmbedResponse(List<EmbedData> Data);
public record EmbedData(float[] Embedding);

Storing Vectors in Qdrant

Why a vector database? Storing embeddings in plain memory or a text file makes similarity search slow. Vector databases like Qdrant are built for this: approximate nearest-neighbor indexes (Qdrant uses HNSW) find the closest vectors in milliseconds, even with millions of them.

QdrantClient provides two core operations:

  • Upsert: Store (or update) a chunk with its embedding and metadata (e.g., source document, chunk text).
  • Search: Given a query embedding and k, return the k nearest chunks by cosine similarity.

The payload dictionary carries metadata alongside the vector, enabling rich context and traceability.

using LocalRag.Core.Models;
using Qdrant.Client.Grpc;

public class QdrantClient(string host, int port = 6334, string collection = "docs")
{
    private readonly Qdrant.Client.QdrantClient _client = new(host, port);

    public async Task UpsertAsync(Guid id, float[] vector, Dictionary<string, object> payload)
    {
        var point = new PointStruct
        {
            Id = new PointId { Uuid = id.ToString() },
            Vectors = vector,
            Payload = { }
        };

        foreach (var kvp in payload)
        {
            point.Payload[kvp.Key] = kvp.Value switch
            {
                string s => s,
                int i => i,
                long l => l,
                double d => d,
                bool b => b,
                _ => kvp.Value.ToString() ?? string.Empty
            };
        }

        await _client.UpsertAsync(collection, [point]);
    }

    public async Task<IReadOnlyList<SearchHit>> SearchAsync(float[] vector, int k = 4)
    {
        var results = await _client.SearchAsync(
            collectionName: collection,
            vector: vector,
            limit: (ulong)k,
            payloadSelector: true
        );

        return [.. results.Select(r => new SearchHit(
            Id: Guid.Parse(r.Id.Uuid),
            Score: r.Score,
            Payload: r.Payload.ToDictionary(
                kvp => kvp.Key,
                kvp => ConvertValue(kvp.Value)
            )
        ))];
    }

    private static object ConvertValue(Value value)
    {
        return value.KindCase switch
        {
            Value.KindOneofCase.StringValue => value.StringValue,
            Value.KindOneofCase.IntegerValue => value.IntegerValue,
            Value.KindOneofCase.DoubleValue => value.DoubleValue,
            Value.KindOneofCase.BoolValue => value.BoolValue,
            _ => value.ToString()
        };
    }
}

// SearchHit lives in LocalRag.Core.Models in the full example; shown here for completeness.
public record SearchHit(Guid Id, float Score, Dictionary<string, object> Payload);

Wiring Ingestion β†’ Qdrant

The Indexer class orchestrates the ingestion pipeline. Given a document ID and text:

  1. Chunks the text (using Ingest.Chunk).
  2. Embeds each chunk (via EmbeddingClient).
  3. Upserts each embedding + metadata to Qdrant.

This is a batch operation: you typically run it once per document set, or on a schedule when documents change.

using LocalRag.Core.Utils;

public class Indexer(EmbeddingClient embeddingClient, QdrantClient qdrantClient)
{
    public async Task IndexDocumentAsync(string docId, string text)
    {
        var chunks = Ingest.Chunk(text).ToList();
        Console.WriteLine($"πŸ“„ Document '{docId}' split into {chunks.Count} chunks");

        int chunkIndex = 0;
        foreach (var chunk in chunks)
        {
            try
            {
                var embedding = await embeddingClient.EmbedAsync(chunk);

                var payload = new Dictionary<string, object>
                {
                    ["content"] = chunk,
                    ["docId"] = docId,
                    ["chunkIndex"] = chunkIndex
                };

                var id = Guid.NewGuid();
                await qdrantClient.UpsertAsync(id, embedding, payload);
                chunkIndex++;
            }
            catch (Exception ex)
            {
                Console.WriteLine($"❌ Error indexing chunk {chunkIndex}: {ex.Message}");
            }
        }

        Console.WriteLine($"βœ… Indexed {chunkIndex} chunks for '{docId}'\n");
    }
}

Qdrant collection creation (one-time):

curl -X PUT http://localhost:6333/collections/docs \
    -H "Content-Type: application/json" \
    -d '{
        "vectors": {
            "size": 768,
            "distance": "Cosine"
        }
    }'
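
If you prefer to keep the one-time setup in .NET, the same collection can be created with the Qdrant gRPC client. A small sketch, equivalent to the curl call above (note it targets the gRPC port 6334):

using Qdrant.Client.Grpc;

// One-time collection creation: 768 dimensions to match nomic-embed-text, cosine distance.
var client = new Qdrant.Client.QdrantClient("localhost", 6334);
await client.CreateCollectionAsync("docs",
    new VectorParams { Size = 768, Distance = Distance.Cosine });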

Retrieval and Answering

The query phase. When a user asks a question, RagService performs three steps:

  1. Embed the question using the same embedding model as indexing (crucial: consistency ensures correct similarity).
  2. Retrieve top-k chunks from Qdrant (typically 3–5; tradeoff between context size and relevance).
  3. Compose a prompt with the chunks as context and send to the LLM via LiteLLM.

The LLM generates an answer grounded in the retrieved context rather than its training data alone, which reduces hallucinations and improves accuracy.

using LocalRag.Core.Models;
using System.Net.Http.Json;

public class RagService(EmbeddingClient embeddingClient, QdrantClient qdrantClient, HttpClient httpClient)
{
    private readonly EmbeddingClient _embeddingClient = embeddingClient;
    private readonly QdrantClient _qdrantClient = qdrantClient;
    private readonly HttpClient _httpClient = httpClient;

    public async Task<string> AskAsync(string question)
    {
        try
        {
            Console.WriteLine($"πŸ” Question: {question}");

            // Step 1: Embed the question
            var questionVector = await _embeddingClient.EmbedAsync(question);
            Console.WriteLine("βœ“ Question embedded");

            // Step 2: Retrieve relevant chunks
            var hits = await _qdrantClient.SearchAsync(questionVector, 4);
            Console.WriteLine($"βœ“ Retrieved {hits.Count} chunks");

            if (hits.Count == 0)
            {
                return "⚠️ No relevant documents found in the knowledge base.";
            }

            // Display retrieved chunks and build the grounding context
            var context = new System.Text.StringBuilder();
            int i = 1;
            foreach (var hit in hits)
            {
                var content = hit.Payload["content"]?.ToString() ?? string.Empty;
                Console.WriteLine($"  [{i}] (Score: {hit.Score:F4}) {content[..Math.Min(80, content.Length)]}...");
                context.AppendLine(content);
                i++;
            }

            // Step 3: Generate response with context
            var messages = new List<ChatMessage>
            {
                new("system", "You are a helpful assistant. Answer questions based on the provided context. Always cite the context."),
                new("user", $"Context:\n{context}\n\nQuestion: {question}")
            };

            var chatRequest = new ChatRequest("qwen2.5", messages, 0.2);
            var response = await _httpClient.PostAsJsonAsync("http://localhost:8080/v1/chat/completions", chatRequest);
            response.EnsureSuccessStatusCode();

            var chatResponse = await response.Content.ReadFromJsonAsync<ChatResponse>();
            var answer = chatResponse?.Choices?.FirstOrDefault()?.Message?.Content ?? "No response generated.";

            Console.WriteLine($"\nπŸ’‘ Answer: {answer}\n");
            return answer;
        }
        catch (Exception ex)
        {
            Console.WriteLine($"❌ Error in RAG query: {ex.Message}");
            throw;
        }
    }
}

public record ChatMessage(string Role, string Content);
public record ChatRequest(string Model, List<ChatMessage> Messages, double Temperature = 0.2);
public record ChatResponse(List<ChatChoice> Choices);
public record ChatChoice(ChatMessage Message);

Complete Ingestion Example (Program.cs)

Tying it all together. Below is a full console application that:

  1. Sets up dependency injection for HTTP clients (LiteLLM, Qdrant) and the Indexer.
  2. Scans a documents/ folder for markdown files.
  3. Ingests each file: chunks β†’ embeds β†’ stores in Qdrant.
  4. Runs a few test queries to confirm retrieval and generation work end to end.

This is your entry point: run it with dotnet run to populate the vector database and exercise the full pipeline.

using LocalRag.Core.Services;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateDefaultBuilder(args);

builder.ConfigureServices((context, services) =>
{
    // HttpClient for LiteLLM (Embedding + Chat)
    services.AddHttpClient<EmbeddingClient>(c =>
    {
        c.BaseAddress = new Uri("http://localhost:8080");
        c.Timeout = TimeSpan.FromSeconds(30);
    });

    // Qdrant Client (using gRPC SDK)
    services.AddSingleton<QdrantClient>(sp =>
        new QdrantClient("localhost", 6334, "docs"));

    // RAG Service (needs both embedding and qdrant, plus a general http client)
    services.AddHttpClient<RagService>(c =>
    {
        c.Timeout = TimeSpan.FromSeconds(60);
    });

    services.AddScoped<Indexer>();
});

var host = builder.Build();

// Run the RAG pipeline
using var scope = host.Services.CreateScope();
var indexer = scope.ServiceProvider.GetRequiredService<Indexer>();
var ragService = scope.ServiceProvider.GetRequiredService<RagService>();

// Create documents folder if it doesn't exist
var docFolder = "documents";
if (!Directory.Exists(docFolder))
{
    Directory.CreateDirectory(docFolder);
}

// Create sample documents for testing
var sampleFiles = new Dictionary<string, string>
{
    ["embeddings.md"] = @"Embeddings are dense vectors that represent the semantic meaning of text.
            Each embedding is typically 384-768 dimensions long.
            Similar concepts cluster together in vector space.
            Embeddings are created by machine learning models specialized in semantic encoding.
            They enable fast similarity search in vector databases.",

    ["rag.md"] = @"RAG stands for Retrieval-Augmented Generation.
            It combines information retrieval with language model generation.
            RAG improves accuracy by grounding responses in retrieved documents.
            The process has two phases: indexing (offline) and querying (online).
            RAG reduces hallucinations in large language models.",

    ["ollama.md"] = @"Ollama is a framework for running large language models locally.
            It supports models like Llama, Mistral, Qwen, and many others.
            Ollama can run on consumer hardware with reasonable performance.
            It provides an OpenAI-compatible API endpoint on port 11434.
            Ollama is completely free and open source."
};

Console.WriteLine("πŸ“š Setting up sample documents...\n");
foreach (var (filename, content) in sampleFiles)
{
    var filepath = Path.Combine(docFolder, filename);
    if (!File.Exists(filepath))
    {
        await File.WriteAllTextAsync(filepath, content);
        Console.WriteLine($"βœ“ Created {filename}");
    }
}

Console.WriteLine("\n⏳ Indexing documents into Qdrant...\n");

// Index all documents
foreach (var file in Directory.GetFiles(docFolder, "*.md"))
{
    var docId = Path.GetFileNameWithoutExtension(file);
    var text = await File.ReadAllTextAsync(file);
    await indexer.IndexDocumentAsync(docId, text);
}

Console.WriteLine("βœ… Indexing complete!\n");
Console.WriteLine("=".PadRight(60, '='));
Console.WriteLine("\nπŸš€ Starting RAG Query Tests\n");
Console.WriteLine("=".PadRight(60, '=') + "\n");

// Test queries
var testQueries = new[]
{
    "What are embeddings?",
    "How does RAG work?",
    "What is Ollama?",
    "How can I use embeddings for similarity search?"
};

foreach (var query in testQueries)
{
    try
    {
        await ragService.AskAsync(query);
        Console.WriteLine("-".PadRight(60, '-') + "\n");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"❌ Query failed: {ex.Message}\n");
    }
}

Console.WriteLine("βœ… All tests completed!");

Key points:

  • Dependency Injection: Register EmbeddingClient, QdrantClient, and Indexer in the service container.
  • Document Folder: Place .md files in a documents/ folder; the example creates a sample if it doesn’t exist.
  • Indexing Loop: Reads each file, chunks it, embeds chunks, and stores them in Qdrant.
  • Test Queries: Demonstrate embedding each question and retrieving the top 4 similar chunks before generation.

Run it with dotnet run: it indexes the sample documents and then answers the test queries in one pass.

Evaluation Basics

  • Grounding: Check that the answer references the retrieved chunks (a simple regex for citations or explicit chunk IDs).
  • Relevance: Check whether the retrieved chunks contain the question’s keywords; log the top-k scores.
  • Quality rubric: Ask a local judge model: “Rate answer correctness 1–5 and note missing facts”.
  • Latency: Track each stage separately: embedding, search, and generation.
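
A rough sketch of how these checks can look in code (the helper names, shared-word threshold, and keyword heuristic are illustrative assumptions, not part of the sample repo):

using System.Diagnostics;
using System.Linq;

public static class RagEval
{
    // Grounding: crude check that the answer reuses material from at least one retrieved chunk.
    public static bool IsGrounded(string answer, IEnumerable<string> chunks) =>
        chunks.Any(c => c.Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Count(w => answer.Contains(w, StringComparison.OrdinalIgnoreCase)) >= 5);

    // Relevance: fraction of question keywords that appear somewhere in the retrieved chunks.
    public static double KeywordRecall(string question, IEnumerable<string> chunks)
    {
        var keywords = question.Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Where(w => w.Length > 3)
            .Select(w => w.ToLowerInvariant())
            .Distinct()
            .ToList();
        if (keywords.Count == 0) return 1.0;

        var text = string.Join(" ", chunks).ToLowerInvariant();
        return keywords.Count(k => text.Contains(k)) / (double)keywords.Count;
    }

    // Latency: time a single stage (embedding, search, or generation).
    public static async Task<(T Result, TimeSpan Elapsed)> TimeAsync<T>(Func<Task<T>> stage)
    {
        var sw = Stopwatch.StartNew();
        var result = await stage();
        return (result, sw.Elapsed);
    }
}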

Tips

  • Pin models and LiteLLM config in source control; document model sizes and RAM needs.
  • Keep chunks small (200–500 tokens) and overlap ~20–30% for better recall.
  • Normalize text (lowercase, strip headers/footers) before embedding.
  • Add caching for embeddings of unchanged documents (a content-hash sketch follows below).
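
For the caching tip, keying embeddings by a content hash is usually enough. A hedged sketch (the CachedEmbeddingClient wrapper and its JSON cache file are assumptions for illustration, not part of the sample repo), wrapping the EmbeddingClient shown earlier:

using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public class CachedEmbeddingClient(EmbeddingClient inner, string cachePath = "embeddings-cache.json")
{
    // Load any previously cached embeddings, keyed by SHA-256 of the chunk text.
    private readonly Dictionary<string, float[]> _cache =
        File.Exists(cachePath)
            ? JsonSerializer.Deserialize<Dictionary<string, float[]>>(File.ReadAllText(cachePath))
              ?? new Dictionary<string, float[]>()
            : new Dictionary<string, float[]>();

    public async Task<float[]> EmbedAsync(string text)
    {
        var key = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));
        if (_cache.TryGetValue(key, out var cached)) return cached;

        // Only unchanged-and-unseen text reaches the embedding model.
        var embedding = await inner.EmbedAsync(text);
        _cache[key] = embedding;

        // Naive persistence: rewrite the whole cache file after each new embedding.
        await File.WriteAllTextAsync(cachePath, JsonSerializer.Serialize(_cache));
        return embedding;
    }
}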

You now have a local RAG stack: ingest and embed documents, store vectors, retrieve relevant context, and generate grounded answers, all without leaving your machine.


This post was created with the assistance of AI.

