
How to Create a Knowledge Base for AI: Complete Setup Guide

A knowledge base is what separates a generic AI chatbot from one that actually knows your product. Learn how to build, structure, and maintain an AI knowledge base that gives accurate, grounded answers.

Gurpreet Singh
March 18, 2026

What Is an AI Knowledge Base?

An AI knowledge base is a curated collection of documents — product documentation, support articles, internal policies, past conversations, FAQs — that has been processed into a format AI can search and reason over.

It is the difference between asking ChatGPT "How do I reset my password in your app?" (it has no idea, so it guesses or refuses) and asking your own AI assistant the same question (it retrieves the exact steps from your help docs and answers accurately).

The technical mechanism — Retrieval-Augmented Generation, or RAG — is covered in detail in my companion article on building a RAG chatbot with Laravel and pgvector. This article focuses on the knowledge side: what goes in, how to structure it, how to keep it current, and how to measure whether it is actually working.

I have built knowledge bases for three production AI chatbots. The largest contains 4,200 documents across a SaaS product's entire help centre, internal runbooks, and six months of resolved support tickets. The chatbot it powers deflects 74% of inbound support queries — but only because the knowledge base was built correctly from day one. A poorly structured knowledge base produces an AI that sounds confident but gives wrong answers, which is worse than no AI at all.

What to Put in Your AI Knowledge Base

Not everything belongs in a knowledge base. Irrelevant or low-quality content degrades retrieval quality — the AI spends its context window on noise instead of signal.

High-Value Content (Always Include)

  • Product documentation: Feature descriptions, how-to guides, API references. The authoritative source on what your product does and how to use it.
  • Resolved support tickets: Real questions real users asked, with the answers that actually solved their problem. This is gold — it covers the long tail of edge cases your documentation never explicitly addresses.
  • FAQ pages: Already structured as question-answer pairs, which is exactly the format AI retrieval is optimised for.
  • Release notes and changelogs: Users constantly ask "when did you add X?" or "how does the new Y work?" — release notes answer these directly.
  • Error messages and their explanations: "Error 4032: Webhook signature mismatch" — what does it mean, why does it happen, how do you fix it? Document every error code your product emits.

Medium-Value Content (Include Selectively)

  • Blog posts and tutorials: Useful if they are accurate and up to date. Old blog posts with deprecated instructions actively hurt your knowledge base quality.
  • Internal runbooks: For internal-facing chatbots (employee support, IT helpdesk), runbooks are essential. For customer-facing chatbots, keep them out — customers should not see internal procedures.
  • Video transcripts: If your product has tutorial videos, transcribe them with Whisper and include the transcripts. Video content is otherwise invisible to AI retrieval.

What to Exclude

  • Marketing copy and landing pages — persuasive language trained to make things sound good, not to accurately describe how things work
  • Outdated documentation that has been superseded — old versions create contradictions that confuse the AI
  • Duplicate content — multiple sources saying the same thing wastes vector space and can cause the AI to retrieve the same information three times in one response
  • Confidential information not intended for the chatbot's audience — pricing negotiations, internal strategy, personnel information

How to Structure Documents for AI Retrieval

AI retrieval works by finding passages that are semantically similar to the user's question. How you write and structure your documents determines whether those passages are findable.

Lead with the Answer

Traditional documentation often starts with context before getting to the answer. AI retrieval punishes this pattern — if the answer is buried in paragraph four, the retrieval system may find the chunk containing the context but miss the chunk containing the actual answer.

// BAD — context before answer
"The Webhooks feature, introduced in version 3.2, allows external services
to be notified when events occur in your account. To configure webhooks,
you must first enable the feature in your account settings. Once enabled,
navigate to Settings > Integrations > Webhooks..."

// GOOD — answer first
"To set up a webhook: Go to Settings > Integrations > Webhooks > Add Webhook.
Enter your endpoint URL, select the events to listen for, and click Save.
Your endpoint will receive POST requests with a JSON payload signed with
your webhook secret whenever the selected events occur."

Write for Questions, Not Topics

Users ask questions. Your documents should answer them. Structure articles around the questions they answer, not around the product features they describe:

// Topic-first (harder to retrieve)
# Billing Settings
The billing section of your account allows you to manage your subscription,
payment methods, and invoices...

// Question-first (easier to retrieve)
# How do I change my billing plan?
# How do I update my payment method?
# Where can I download my invoices?
# What happens to my data if I cancel?

Each question becomes its own self-contained section. When a user asks "how do I update my credit card?", the retrieval system finds the exact answer, not a general billing overview.

Include Synonyms and Common Misspellings

Users say "cancel subscription," "stop subscription," "unsubscribe," and "delete my account" — all meaning the same thing. Your documentation probably uses one consistent term. Add a synonyms line at the end of each article:

// Append to articles to improve retrieval recall
*Also searched as: cancel account, stop billing, unsubscribe, end trial, delete subscription*

Chunk Your Documents Deliberately

When your documents are ingested into the vector store, they are typically split into chunks of roughly 300–500 tokens each. If a chunk boundary falls in the middle of a procedure, the chunk is useless — it starts at step 4 of a 6-step process with no context about what is being done.

Structure documents so that each logical unit (a step-by-step procedure, a concept explanation, a FAQ answer) fits within a single chunk. Keep sections short and self-contained. Use H2/H3 headings as natural chunk boundaries.
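The heading-boundary rule can be sketched as a plain function. This is an illustrative sketch — the name `chunkByHeadings` and the H2/H3-only rule are assumptions, and a production splitter would also enforce the token budget on oversized sections:

```php
/**
 * Split a Markdown document on H2/H3 headings so each section
 * becomes one self-contained chunk (heading plus its body).
 * Sketch only: a real pipeline would also cap chunk length in tokens.
 */
function chunkByHeadings(string $markdown): array
{
    // Split at a zero-width position before every line starting "## " or "### ".
    $parts = preg_split('/^(?=#{2,3} )/m', $markdown);

    // Drop empty fragments and trim surrounding whitespace.
    return array_values(array_filter(
        array_map('trim', $parts),
        fn (string $part) => $part !== ''
    ));
}
```

Each returned chunk starts with its own heading, so the heading text travels with the body into the vector store and improves retrieval.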

Technical Implementation: Ingesting Your Knowledge Base

Setting Up the Document Model

// app/Models/KnowledgeDocument.php
use Illuminate\Database\Eloquent\Model;
use Illuminate\Database\Eloquent\Relations\HasMany;

class KnowledgeDocument extends Model
{
    // Allow mass assignment from the source adapters below.
    protected $fillable = [
        'external_id', 'source', 'title', 'content',
        'metadata', 'ingested_at', 'is_active',
    ];

    protected function casts(): array
    {
        return [
            'metadata'    => 'array',
            'ingested_at' => 'datetime',
            'is_active'   => 'boolean',
        ];
    }

    public function chunks(): HasMany
    {
        return $this->hasMany(DocumentChunk::class, 'document_id');
    }

    public function needsReIngestion(): bool
    {
        return is_null($this->ingested_at) ||
               $this->updated_at->isAfter($this->ingested_at);
    }
}
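A migration to back this model might look like the following. The column set is inferred from the fields the model uses, not a prescribed schema — adjust types and indexes to your own needs:

```php
// database/migrations/xxxx_create_knowledge_documents_table.php (sketch)
Schema::create('knowledge_documents', function (Blueprint $table) {
    $table->id();
    $table->string('external_id')->unique();   // ID in the source system
    $table->string('source');                  // notion, support_tickets, ...
    $table->string('title');
    $table->longText('content');
    $table->json('metadata')->nullable();
    $table->timestamp('ingested_at')->nullable();
    $table->boolean('is_active')->default(true);
    $table->timestamps();
});
```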

Supporting Multiple Source Types

Your knowledge base will come from multiple places — a Notion workspace, a Confluence space, a folder of Markdown files, a database of resolved tickets. Build a source adapter for each:

// app/Contracts/KnowledgeSource.php
interface KnowledgeSource
{
    public function documents(): Generator; // yields KnowledgeDocument
    public function lastModified(string $externalId): Carbon;
}

// app/Services/Knowledge/NotionSource.php
class NotionSource implements KnowledgeSource
{
    public function __construct(private string $databaseId) {}

    public function documents(): Generator
    {
        $pages = $this->fetchNotionPages($this->databaseId);
        foreach ($pages as $page) {
            yield new KnowledgeDocument([
                'external_id'   => $page['id'],
                'source'        => 'notion',
                'title'         => $page['properties']['Name']['title'][0]['text']['content'] ?? 'Untitled',
                'content'       => $this->extractPageContent($page['id']),
                'metadata'      => ['url' => $page['url'], 'last_edited' => $page['last_edited_time']],
            ]);
        }
    }
}

// app/Services/Knowledge/SupportTicketSource.php
class SupportTicketSource implements KnowledgeSource
{
    public function documents(): Generator
    {
        // Iterate with foreach rather than each(): a yield inside a
        // closure would turn the closure into its own generator instead
        // of emitting documents from this method.
        $tickets = SupportTicket::resolved()
            ->where('is_public', true)
            ->with('replies')
            ->lazy();

        foreach ($tickets as $ticket) {
            $content = "Question: {$ticket->subject}\n\n";
            $content .= "Context: {$ticket->body}\n\n";
            $content .= "Resolution: " . $ticket->resolvedReply->body;

            yield new KnowledgeDocument([
                'external_id' => "ticket-{$ticket->id}",
                'source'      => 'support_tickets',
                'title'       => $ticket->subject,
                'content'     => $content,
                'metadata'    => ['ticket_id' => $ticket->id, 'category' => $ticket->category],
            ]);
        }
    }
}
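The folder-of-Markdown-files case follows the same pattern. The sketch below is simplified to yield plain arrays so it stands alone; a real adapter would implement `KnowledgeSource` and yield `KnowledgeDocument` models like the sources above:

```php
/**
 * Simplified Markdown-folder source. Illustrative: class and field
 * names are assumptions, not part of the framework.
 */
class MarkdownDirectorySource
{
    public function __construct(private string $directory) {}

    /** Yields one ['external_id', 'title', 'content'] array per .md file. */
    public function documents(): Generator
    {
        foreach (glob($this->directory . '/*.md') as $path) {
            $content = file_get_contents($path);

            // Use the first "# Heading" as the title, else the filename.
            $title = preg_match('/^# (.+)$/m', $content, $m)
                ? trim($m[1])
                : basename($path, '.md');

            yield [
                'external_id' => 'md-' . basename($path, '.md'),
                'title'       => $title,
                'content'     => $content,
            ];
        }
    }
}
```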

The Sync Command

// app/Console/Commands/SyncKnowledgeBase.php
class SyncKnowledgeBase extends Command
{
    protected $signature = 'kb:sync
        {--source= : Sync only this source}
        {--force : Re-ingest all documents}';

    public function handle(): void
    {
        $sources = $this->getSources();

        foreach ($sources as $sourceName => $source) {
            if ($this->option('source') && $this->option('source') !== $sourceName) {
                continue;
            }

            $this->info("Syncing {$sourceName}...");
            $synced = 0;
            $skipped = 0;

            foreach ($source->documents() as $document) {
                $existing = KnowledgeDocument::where('external_id', $document->external_id)->first();

                if ($existing && !$this->option('force') && !$existing->needsReIngestion()) {
                    $skipped++;
                    continue;
                }

                $doc = $existing ? $existing->fill($document->toArray()) : $document;
                $doc->save();

                IngestDocument::dispatch($doc);
                $synced++;
            }

            $this->info("  Queued {$synced} documents for ingestion, skipped {$skipped} unchanged.");
        }
    }
}

Schedule this to run nightly:

// routes/console.php
Schedule::command('kb:sync')->dailyAt('02:00')->withoutOverlapping();

Keeping Your Knowledge Base Accurate Over Time

A knowledge base is not a one-time project. It degrades over time as your product changes and the documents go stale. Build maintenance into your workflow from the start.

Track Document Freshness

// Flag documents that have not been updated in 90 days
KnowledgeDocument::where('source', 'manual')
    ->where('updated_at', '<', now()->subDays(90))
    ->each(function ($doc) {
        // Send a Slack notification to the document owner
        Notification::route('slack', config('services.slack.kb_channel'))
            ->notify(new StaleDocumentAlert($doc));
    });

Learn from Failed Retrievals

When your chatbot cannot answer a question (no chunks above the relevance threshold), log it. These are gaps in your knowledge base — real questions your users are asking that your documents do not cover. Review them weekly and create articles to fill the gaps.

// In your RetrievalService, when chunks are empty:
if ($chunks->isEmpty()) {
    UnansweredQuestion::create([
        'question'   => $question,
        'asked_at'   => now(),
        'user_id'    => auth()->id(),
        'session_id' => session()->getId(),
    ]);
}

Monitor Answer Quality with Feedback Loops

Add a thumbs up/thumbs down to every chatbot response. Track the ratio per source document and per chunk. Documents with consistently low-quality retrievals need to be rewritten — the content is there, but the structure is making it hard to find.

// app/Models/ChatFeedback.php
// After each response, record:
ChatFeedback::create([
    'chat_log_id'    => $chatLog->id,
    'rating'         => $request->input('rating'), // positive | negative
    'chunk_ids'      => $retrievedChunkIds,         // which chunks were used
    'document_ids'   => $retrievedDocumentIds,
    'submitted_at'   => now(),
]);

A monthly report grouping negative feedback by source document tells you exactly which parts of your knowledge base need the most attention.
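The grouping behind that report can be sketched in plain PHP. In the app itself this would be an Eloquent or SQL aggregation over `ChatFeedback`; the function below just shows the logic on plain arrays:

```php
/**
 * Compute the negative-feedback share per source document.
 * Pure-array sketch of the monthly report logic.
 *
 * @param array<array{document_id: int, rating: string}> $rows
 * @return array<int, float> document_id => negative ratio, worst first
 */
function negativeFeedbackByDocument(array $rows): array
{
    $totals = [];
    $negatives = [];

    foreach ($rows as $row) {
        $id = $row['document_id'];
        $totals[$id] = ($totals[$id] ?? 0) + 1;
        if ($row['rating'] === 'negative') {
            $negatives[$id] = ($negatives[$id] ?? 0) + 1;
        }
    }

    $report = [];
    foreach ($totals as $id => $total) {
        $report[$id] = ($negatives[$id] ?? 0) / $total;
    }
    arsort($report); // documents needing the most attention first

    return $report;
}
```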

Measuring Knowledge Base Effectiveness

Three metrics tell you whether your knowledge base is working:

  • Containment rate: What percentage of questions does the chatbot answer (vs. say "I don't have that information")? Below 60% means your knowledge base has significant coverage gaps.
  • User satisfaction rate: Thumbs up / (thumbs up + thumbs down). Below 70% means the answers being retrieved are inaccurate or poorly structured.
  • Deflection rate: Of questions answered by the chatbot, what percentage did not result in a human support ticket? This is your ROI metric.

Target benchmarks after 90 days of operation: 75%+ containment, 80%+ satisfaction, 60%+ deflection. If any of these are significantly below target, the fix is almost always in the knowledge base content — not in the AI model or retrieval algorithm.
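For reference, the three rates reduce to simple ratios over your chat logs. The function name and parameters below are illustrative:

```php
/**
 * The three knowledge-base health metrics, from raw counts.
 * Sketch: wire these counts to your own chat and ticket logs.
 */
function kbMetrics(int $asked, int $answered, int $thumbsUp, int $thumbsDown, int $escalated): array
{
    return [
        'containment'  => $answered / max($asked, 1),
        'satisfaction' => $thumbsUp / max($thumbsUp + $thumbsDown, 1),
        // Of answered questions, the share that never became a human ticket.
        'deflection'   => ($answered - $escalated) / max($answered, 1),
    ];
}
```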

Frequently Asked Questions

How many documents do I need to start?

Start with 20–50 high-quality documents covering your most common support queries. You do not need to boil the ocean before launching. Identify your top 20 most-asked support questions (look at your ticket history), write a clear, self-contained article for each, and launch. Expand from there based on the unanswered question log.

How do I handle information that changes frequently?

For frequently changing information (pricing, feature availability, integration status), do not put it in the static knowledge base. Instead, build a lookup tool the AI can call — a function that queries your live database for current pricing or feature flags. This way the AI always gets current data, not a stale snapshot from the last ingestion.
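A minimal sketch of such a lookup tool, assuming the common JSON-schema "function tool" convention used by major AI APIs — the `get_current_price` name, plan list, and prices here are all illustrative:

```php
// Tool definition the model can call instead of relying on stale documents.
$pricingTool = [
    'name'        => 'get_current_price',
    'description' => 'Look up the current monthly price for a billing plan.',
    'parameters'  => [
        'type'       => 'object',
        'properties' => [
            'plan' => ['type' => 'string', 'enum' => ['starter', 'pro', 'enterprise']],
        ],
        'required' => ['plan'],
    ],
];

// Handler the backend runs when the model calls the tool.
function getCurrentPrice(string $plan): array
{
    // In production this would query the live plans table or billing API.
    $prices = ['starter' => 19, 'pro' => 49, 'enterprise' => 199];

    return [
        'plan'          => $plan,
        'monthly_price' => $prices[$plan] ?? null,
        'currency'      => 'USD',
        'retrieved_at'  => date('c'),
    ];
}
```

The model receives the handler's return value as the tool result and answers from live data rather than from the last ingestion run.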

Can I use private customer data in the knowledge base?

No. Your knowledge base is retrieved wholesale into AI prompts. If a support ticket containing a customer's personal information is in the knowledge base, that information may appear in responses to completely unrelated questions. Only include resolved tickets after stripping all personally identifiable information — customer names, email addresses, account IDs, and any specific data about their configuration.
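A regex pass is a reasonable first line of defence for the stripping step. The sketch below catches emails, phone-like numbers, and account IDs with an assumed `acct_` prefix — it is not a complete anonymiser; names and free-text details still need NER or manual review:

```php
/**
 * Strip obvious PII patterns from ticket text before ingestion.
 * First-pass scrub only — not a substitute for human review.
 */
function scrubPii(string $text): string
{
    $patterns = [
        '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/' => '[email]',
        '/\b\+?\d[\d\s().-]{7,}\d\b/'                      => '[phone]',
        '/\bacct_[A-Za-z0-9]+\b/'                          => '[account-id]', // assumes acct_ prefix
    ];

    return preg_replace(array_keys($patterns), array_values($patterns), $text);
}
```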

How long does it take to build an effective knowledge base?

The initial setup — infrastructure, ingestion pipeline, first 50 documents — takes 2–3 weeks of engineering time plus 1–2 weeks of content work. A knowledge base that achieves 75%+ containment on a typical SaaS support use case requires 150–300 documents. At one article per hour of writing time, that is 150–300 hours of content work — spread across 2–3 months of ongoing effort, not all upfront. Plan for knowledge base maintenance to be an ongoing part of your support team's workflow, not a one-time project.

#AI #KnowledgeBase #RAG #Laravel #Chatbot #Documentation #VectorSearch