Creating Knowledge Units

Learn how to create and configure Knowledge Units that define embedding and indexing settings for data pipelines.

Overview

A Knowledge Unit is a central configuration resource that defines how data should be embedded and indexed. It connects data pipeline stages to vector databases and specifies embedding models and dimensions. Knowledge Units ensure consistency between embedding generation and vector storage.

Prerequisites

  • Access to FoundationaLLM Management Portal
  • Required permissions: FoundationaLLM.KnowledgeUnit/knowledgeUnits/write
  • At least one Vector Database configured (see Creating Vector Databases)
  • Understanding of embedding models and dimensions

What is a Knowledge Unit?

A Knowledge Unit is a resource that:

  • Links embedding and indexing pipeline stages to a vector database
  • Defines which embedding model to use
  • Specifies embedding dimensions
  • Configures index settings and partitions
  • Ensures consistency between embedding and storage

Key Concept: When you configure embedding and indexing stages in a data pipeline, you reference a Knowledge Unit rather than directly specifying embedding models or vector databases. This provides centralized, consistent configuration.

Creating a Knowledge Unit

Step 1: Navigate to Knowledge Units

  1. Open the FoundationaLLM Management Portal
  2. Navigate to Data > Knowledge Units
  3. Click Create New Knowledge Unit

Step 2: Configure Basic Settings

Name: A unique identifier for the knowledge unit

  • Use descriptive names (e.g., product-docs-unit, support-kb-unit)
  • Follow naming convention: lowercase with hyphens
  • Example: company-knowledge-base-unit
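The naming convention above can be checked before a name is submitted. The following is a minimal sketch, assuming "lowercase with hyphens" means lowercase letters and digits in groups joined by single hyphens; the pattern and the function name `is_valid_unit_name` are illustrative, and the portal's actual validation rules may differ.

```python
import re

# Assumed pattern (not taken from the product): lowercase letters and digits
# in groups separated by single hyphens, with no leading or trailing hyphen.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_unit_name(name: str) -> bool:
    """Return True if the name follows the lowercase-with-hyphens convention."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ("company-knowledge-base-unit", "Company_KB", "docs--unit"):
    print(candidate, is_valid_unit_name(candidate))
```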

Display Name: Human-readable name shown in the UI

  • Example: "Company Knowledge Base Unit"

Description: Purpose of this knowledge unit

  • Document which pipelines use it
  • Note the data types it handles
  • Example: "Knowledge unit for company internal documentation with 2048-dimension embeddings"

Step 3: Select Vector Database

Choose the vector database where indexed data will be stored.

Vector Database Object ID: Reference to a vector database resource

  • Format: instances/{instanceId}/providers/FoundationaLLM.VectorDatabase/vectorDatabases/{vectorDatabaseName}
  • Example: instances/default/providers/FoundationaLLM.VectorDatabase/vectorDatabases/company-kb
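When object IDs are assembled in scripts, a small helper keeps the format above in one place. This is a sketch only: the helper names are hypothetical, and the assumption that neither the instance ID nor the database name contains a slash is mine, not the product's.

```python
import re

# Mirrors the documented format. Assumes neither segment contains "/".
VECTOR_DB_ID = re.compile(
    r"instances/(?P<instance_id>[^/]+)"
    r"/providers/FoundationaLLM\.VectorDatabase"
    r"/vectorDatabases/(?P<name>[^/]+)"
)


def build_vector_database_object_id(instance_id: str, name: str) -> str:
    """Assemble an object ID in the documented format."""
    return (
        f"instances/{instance_id}"
        f"/providers/FoundationaLLM.VectorDatabase/vectorDatabases/{name}"
    )


def parse_vector_database_object_id(object_id: str) -> dict:
    """Split an object ID into its instance ID and database name."""
    match = VECTOR_DB_ID.fullmatch(object_id)
    if match is None:
        raise ValueError(f"Not a vector database object ID: {object_id!r}")
    return match.groupdict()


object_id = build_vector_database_object_id("default", "company-kb")
print(object_id)
print(parse_vector_database_object_id(object_id))
```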

The selected vector database must:

  • Already be created and configured
  • Have an active connection to the vector storage service
  • Have an index schema compatible with the embedding dimensions

Step 4: Configure Embedding Settings

Embedding Model: The model used to generate vector embeddings

Common models:

  • text-embedding-3-large: High quality, default for most use cases
  • text-embedding-3-small: Faster, lower resource usage
  • text-embedding-ada-002: Legacy model, widely compatible

Embedding Dimensions: Size of the vector embeddings

Common dimension options:

  • 1536: Standard for text-embedding-ada-002
  • 2048: Recommended setting for text-embedding-3-large (shortened from the model's native 3072)
  • 3072: Maximum quality for text-embedding-3-large
  • 512 or 1024: Lower dimensions for faster search

Important: The embedding dimensions must match the vector field dimensions in the target vector database index schema.
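The dimension rule can be checked before a knowledge unit is saved. The sketch below assumes the native output sizes of the three OpenAI models listed above (3072 for text-embedding-3-large, 1536 for the other two) and that only the text-embedding-3 models accept a shortened size. The function name and the `index_dimensions` argument are hypothetical; `index_dimensions` stands for the vector field size you read from your own index schema.

```python
# Native output sizes of the models listed above.
NATIVE_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}
# Models whose output can be shortened below the native size.
RESIZABLE_MODELS = {"text-embedding-3-large", "text-embedding-3-small"}


def check_embedding_dimensions(model: str, dimensions: int, index_dimensions: int) -> list[str]:
    """Return a list of problems; an empty list means the settings are consistent."""
    problems = []
    native = NATIVE_DIMENSIONS.get(model)
    if native is None:
        problems.append(f"unknown embedding model: {model}")
    elif model in RESIZABLE_MODELS:
        if not 1 <= dimensions <= native:
            problems.append(f"{model} supports 1 to {native} dimensions, got {dimensions}")
    elif dimensions != native:
        problems.append(f"{model} always returns {native} dimensions, got {dimensions}")
    if dimensions != index_dimensions:
        problems.append(
            f"knowledge unit uses {dimensions} dimensions "
            f"but the index vector field has {index_dimensions}"
        )
    return problems


print(check_embedding_dimensions("text-embedding-3-large", 2048, 2048))
print(check_embedding_dimensions("text-embedding-ada-002", 2048, 1536))
```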

Step 5: Configure Index Settings

Index Partition Name: Logical partition within the index

  • Enables multiple data sets in one physical index
  • Allows filtering by partition during search
  • Example: main, tenant-a, product-docs

Index Configuration: Additional index-specific settings

  • Field mappings
  • Search parameters
  • Filtering rules

Step 6: Example Configuration

{
  "name": "product-docs-unit",
  "display_name": "Product Documentation Unit",
  "description": "Knowledge unit for product documentation with high-quality embeddings",
  "vector_database_object_id": "instances/default/providers/FoundationaLLM.VectorDatabase/vectorDatabases/product-docs-search",
  "embedding_profile": {
    "embedding_model": "text-embedding-3-large",
    "embedding_dimensions": 2048
  },
  "index_configuration": {
    "partition_name": "main"
  }
}
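If knowledge unit definitions are generated by a script rather than typed by hand, the same structure can be built from a few arguments. This is a sketch under the assumption that the field names are exactly those in the example above; the function `make_knowledge_unit` is hypothetical and is not part of any FoundationaLLM SDK.

```python
import json


def make_knowledge_unit(name, display_name, description, vector_database_name,
                        model, dimensions, partition_name, instance_id="default"):
    """Build a knowledge unit definition shaped like the example configuration."""
    return {
        "name": name,
        "display_name": display_name,
        "description": description,
        "vector_database_object_id": (
            f"instances/{instance_id}/providers/"
            f"FoundationaLLM.VectorDatabase/vectorDatabases/{vector_database_name}"
        ),
        "embedding_profile": {
            "embedding_model": model,
            "embedding_dimensions": dimensions,
        },
        "index_configuration": {"partition_name": partition_name},
    }


unit = make_knowledge_unit(
    name="product-docs-unit",
    display_name="Product Documentation Unit",
    description="Knowledge unit for product documentation with high-quality embeddings",
    vector_database_name="product-docs-search",
    model="text-embedding-3-large",
    dimensions=2048,
    partition_name="main",
)
print(json.dumps(unit, indent=2))
```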

Step 7: Review and Create

  1. Verify vector database is properly configured
  2. Confirm embedding model and dimensions match your needs
  3. Check partition name is correct
  4. Click Create
  5. The knowledge unit is now available for use in data pipelines

Using Knowledge Units in Data Pipelines

Once created, reference the knowledge unit in both embedding and indexing pipeline stages:

In Embedding Stage

{
  "name": "Embed",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/GatewayTextEmbeddingDataPipelineStagePlugin",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "KnowledgeUnitObjectId",
        "type": "resource-object-id"
      },
      "default_value": "instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/product-docs-unit"
    }
  ]
}

In Indexing Stage

{
  "name": "Index",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/AzureAISearchIndexingDataPipelineStagePlugin",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "KnowledgeUnitObjectId",
        "type": "resource-object-id"
      },
      "default_value": "instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/product-docs-unit"
    }
  ]
}

Best Practice: Always use the same knowledge unit for both embedding and indexing stages to ensure consistency.
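This practice can be verified mechanically on a pipeline definition. The sketch below assumes stage definitions shaped like the two examples above; the helper names are hypothetical, and the checker expects to receive only the embedding and indexing stages, since other stages may not carry a KnowledgeUnitObjectId parameter.

```python
PLUGIN_PREFIX = "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/"


def make_stage(name: str, plugin_name: str, knowledge_unit_id: str) -> dict:
    """Build a stage definition shaped like the Embed and Index examples."""
    return {
        "name": name,
        "plugin_object_id": PLUGIN_PREFIX + plugin_name,
        "plugin_parameters": [
            {
                "parameter_metadata": {
                    "name": "KnowledgeUnitObjectId",
                    "type": "resource-object-id",
                },
                "default_value": knowledge_unit_id,
            }
        ],
    }


def knowledge_unit_ids(stages: list[dict]) -> dict[str, str]:
    """Map each stage name to the knowledge unit it references."""
    ids = {}
    for stage in stages:
        for param in stage.get("plugin_parameters", []):
            if param["parameter_metadata"]["name"] == "KnowledgeUnitObjectId":
                ids[stage["name"]] = param["default_value"]
    return ids


def share_one_knowledge_unit(stages: list[dict]) -> bool:
    """True if every given stage references a knowledge unit and all agree."""
    ids = knowledge_unit_ids(stages)
    return len(ids) == len(stages) and len(set(ids.values())) == 1


unit_id = ("instances/{instanceId}/providers/"
           "FoundationaLLM.KnowledgeUnit/knowledgeUnits/product-docs-unit")
stages = [
    make_stage("Embed", "GatewayTextEmbeddingDataPipelineStagePlugin", unit_id),
    make_stage("Index", "AzureAISearchIndexingDataPipelineStagePlugin", unit_id),
]
print(share_one_knowledge_unit(stages))
```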

Best Practices

Naming Conventions

  • Purpose-based names: product-docs-unit, support-kb-unit
  • Include key attributes: high-quality-embeddings-unit, fast-search-unit
  • Environment prefixes: dev-docs-unit, prod-docs-unit

Configuration Strategy

Dimension Selection:

  • 2048 dimensions: Recommended for most production use cases
  • 1536 dimensions: Good balance of quality and performance
  • 3072 dimensions: Maximum quality for critical applications
  • 1024 or lower: Fast search with acceptable quality trade-off
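One way to weigh these options is to estimate the raw storage each one needs. The sketch below assumes every vector component is stored as a 4-byte float and ignores index structures, metadata, and any compression the vector database applies, so treat the figures as a lower bound rather than a sizing guide.

```python
BYTES_PER_FLOAT32 = 4  # assumption: components are stored as 32-bit floats


def raw_vector_bytes(vector_count: int, dimensions: int) -> int:
    """Bytes needed for the vector values alone, without index overhead."""
    return vector_count * dimensions * BYTES_PER_FLOAT32


for dims in (1024, 1536, 2048, 3072):
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{dims} dimensions: {gib:.2f} GiB per million vectors")
```

Raw storage grows linearly with the dimension count, so moving from 1024 to 3072 dimensions triples it.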

Model Selection:

  • text-embedding-3-large: Best quality, higher cost
  • text-embedding-3-small: Good quality, lower cost
  • text-embedding-ada-002: Legacy compatibility

Reusability

When to Create Multiple Knowledge Units:

  • Different embedding models or dimensions needed
  • Different vector databases for different data types
  • Different partition strategies
  • Development vs. production environments

When to Reuse a Knowledge Unit:

  • Same embedding requirements across pipelines
  • Same vector database and partition
  • Consistent search quality requirements

Consistency

Maintain Consistency Between:

  • Embedding model dimensions in knowledge unit
  • Vector field dimensions in vector database schema
  • Embedding and indexing stages in pipelines

Common Scenarios

Scenario 1: Standard Document Indexing

Single knowledge unit for general purpose:

  • Model: text-embedding-3-large
  • Dimensions: 2048
  • Vector Database: company-docs
  • Partition: main

Scenario 2: Multiple Data Types

Separate knowledge units for different data types:

  • Technical Docs Unit: 3072 dimensions, high quality
  • Support KB Unit: 2048 dimensions, balanced
  • Chat History Unit: 1024 dimensions, fast search

Scenario 3: Multi-Tenant Application

Knowledge units per tenant:

  • Same vector database, different partitions
  • Consistent embedding configuration
  • Tenant isolation through partitions
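The per-tenant pattern can be sketched as deriving one knowledge unit definition per tenant from a shared base. The field names follow the example configuration in Step 6; the function name and the naming scheme (base name plus tenant suffix) are assumptions made for illustration.

```python
import copy


def units_for_tenants(base_unit: dict, tenants: list[str]) -> list[dict]:
    """Return one knowledge unit per tenant, differing only in name and partition."""
    units = []
    for tenant in tenants:
        unit = copy.deepcopy(base_unit)  # keep the base definition untouched
        unit["name"] = f"{base_unit['name']}-{tenant}"  # hypothetical naming scheme
        unit["index_configuration"]["partition_name"] = tenant
        units.append(unit)
    return units


base = {
    "name": "support-kb-unit",
    "vector_database_object_id": (
        "instances/default/providers/"
        "FoundationaLLM.VectorDatabase/vectorDatabases/company-kb"
    ),
    "embedding_profile": {
        "embedding_model": "text-embedding-3-large",
        "embedding_dimensions": 2048,
    },
    "index_configuration": {"partition_name": "main"},
}

for unit in units_for_tenants(base, ["tenant-a", "tenant-b"]):
    print(unit["name"], unit["index_configuration"]["partition_name"])
```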

Scenario 4: Development and Production

Separate knowledge units for environments:

  • Dev: Lower dimensions (1024), faster iteration
  • Staging: Production-like (2048), testing
  • Production: Optimized configuration (2048 or 3072)

Troubleshooting

Knowledge Unit Creation Fails

Problem: Unable to create knowledge unit

Solutions:

  • Verify you have required permissions
  • Check that vector database reference is valid
  • Ensure the name is unique
  • Verify embedding model name is correct

Dimension Mismatch Error

Problem: Pipeline fails with dimension mismatch

Solutions:

  • Verify knowledge unit dimensions match vector database schema
  • Update vector database schema to match dimensions
  • Or update knowledge unit dimensions to match schema
  • Ensure embedding model supports the specified dimensions

Embedding Stage Fails

Problem: Pipeline fails at embedding stage

Solutions:

  • Verify knowledge unit is properly configured
  • Check embedding model is available
  • Verify API endpoint for embeddings is configured
  • Check rate limits and quotas

Indexing Stage Fails

Problem: Data doesn't get indexed

Solutions:

  • Verify knowledge unit vector database connection
  • Check that index exists in vector database
  • Verify partition name is correct
  • Ensure schema allows the data structure

Inconsistent Search Results

Problem: Search quality is poor

Solutions:

  • Verify same knowledge unit is used for embedding and indexing
  • Check embedding dimensions haven't changed
  • Verify vector database contains the expected data
  • Review embedding model choice