Creating Knowledge Units

Learn how to create and configure Knowledge Units that define embedding and indexing settings for data pipelines.

Overview

A Knowledge Unit is a central configuration resource that defines how data should be embedded and indexed. It connects data pipeline stages to vector databases and specifies embedding models and dimensions. Knowledge Units ensure consistency between embedding generation and vector storage.

Prerequisites

Access to FoundationaLLM Management Portal
Required permissions: FoundationaLLM.KnowledgeUnit/knowledgeUnits/write
At least one Vector Database configured (see Creating Vector Databases)
Understanding of embedding models and dimensions

What is a Knowledge Unit?

A Knowledge Unit is a resource that:

Links embedding and indexing pipeline stages to a vector database
Defines which embedding model to use
Specifies embedding dimensions
Configures index settings and partitions
Ensures consistency between embedding and storage

Key Concept: When you configure embedding and indexing stages in a data pipeline, you reference a Knowledge Unit rather than directly specifying embedding models or vector databases. This provides centralized, consistent configuration.

Creating a Knowledge Unit

Step 1: Navigate to Knowledge Units

Open the FoundationaLLM Management Portal
Navigate to Data → Knowledge Units
Click Create New Knowledge Unit

Step 2: Configure Basic Settings

Name: A unique identifier for the knowledge unit

Use descriptive names (e.g., product-docs-unit, support-kb-unit)
Follow naming convention: lowercase with hyphens
Example: company-knowledge-base-unit
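
The naming convention above can be checked before a name is submitted. The following is a minimal sketch, assuming that "lowercase with hyphens" means lowercase letters and digits joined by single hyphens; the function name is hypothetical and the portal may enforce different rules.

```python
import re

# Assumption: lowercase alphanumeric segments joined by single hyphens.
# The Management Portal's actual validation rules may differ.
_NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_unit_name(name: str) -> bool:
    """Return True if `name` is lowercase segments joined by single hyphens."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

Under this assumption, company-knowledge-base-unit passes, while names containing spaces, uppercase letters, or doubled hyphens are rejected.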

Display Name: Human-readable name shown in the UI

Example: "Company Knowledge Base Unit"

Description: Purpose of this knowledge unit

Document which pipelines use it
Note the data types it handles
Example: "Knowledge unit for company internal documentation with 2048-dimension embeddings"

Step 3: Select Vector Database

Choose the vector database where indexed data will be stored.

Vector Database Object ID: Reference to a vector database resource

Format: instances/{instanceId}/providers/FoundationaLLM.VectorDatabase/vectorDatabases/{vectorDatabaseName}
Example: instances/default/providers/FoundationaLLM.VectorDatabase/vectorDatabases/company-kb
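
To avoid typos when assembling the object ID by hand, the format above can be filled in programmatically. This is a small sketch; the helper name and its input checks are illustrative and not part of FoundationaLLM.

```python
PROVIDER = "FoundationaLLM.VectorDatabase"
RESOURCE_TYPE = "vectorDatabases"


def vector_database_object_id(instance_id: str, name: str) -> str:
    """Fill in the documented object ID format for a vector database."""
    if not instance_id or not name:
        raise ValueError("instance_id and name must be non-empty")
    if "/" in instance_id or "/" in name:
        raise ValueError("instance_id and name must not contain '/'")
    return f"instances/{instance_id}/providers/{PROVIDER}/{RESOURCE_TYPE}/{name}"
```

Calling vector_database_object_id("default", "company-kb") reproduces the example ID shown above.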

The selected vector database must:

Already be created and configured
Have an active connection to the vector storage service
Have an index schema compatible with the embedding dimensions

Step 4: Configure Embedding Settings

Embedding Model: The model used to generate vector embeddings

Common models:

text-embedding-3-large: High quality, default for most use cases
text-embedding-3-small: Faster, lower resource usage
text-embedding-ada-002: Legacy model, widely compatible

Embedding Dimensions: Size of the vector embeddings

Common dimension options:

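1536: Fixed output size of text-embedding-ada-002 (also the native size of text-embedding-3-small)
2048: Recommended setting for text-embedding-3-large, shortened from its native 3072
3072: Native output size of text-embedding-3-large, for maximum quality
512 or 1024: Lower dimensions for faster search (text-embedding-3 models only)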

Important: The embedding dimensions must match the vector field dimensions in the target vector database index schema.
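
The dimension rule can be expressed as a quick pre-flight check. The sketch below is illustrative only: the function name is hypothetical, the native sizes are those published for the OpenAI embedding models, and index_field_dimensions stands for whatever value you read from your own index schema.

```python
# Native output sizes published for the OpenAI embedding models.
NATIVE_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}
# Models that accept a shortened output size; ada-002 is fixed at 1536.
SUPPORTS_SHORTENING = {"text-embedding-3-large", "text-embedding-3-small"}


def check_dimensions(model: str, dimensions: int, index_field_dimensions: int) -> list:
    """Return a list of problems; an empty list means the settings agree."""
    problems = []
    native = NATIVE_DIMENSIONS.get(model)
    if native is None:
        problems.append(f"unknown embedding model: {model}")
    elif model not in SUPPORTS_SHORTENING and dimensions != native:
        problems.append(f"{model} only produces {native} dimensions")
    elif not 0 < dimensions <= native:
        problems.append(f"dimensions must be between 1 and {native} for {model}")
    if dimensions != index_field_dimensions:
        problems.append(
            f"knowledge unit uses {dimensions} dimensions but the index "
            f"vector field has {index_field_dimensions}"
        )
    return problems
```

An empty result for check_dimensions("text-embedding-3-large", 2048, 2048) indicates that the model, the knowledge unit, and the index schema are consistent.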

Step 5: Configure Index Settings

Index Partition Name: Logical partition within the index

Enables multiple data sets in one physical index
Allows filtering by partition during search
Example: main, tenant-a, product-docs

Index Configuration: Additional index-specific settings

Field mappings
Search parameters
Filtering rules

Step 6: Example Configuration

{
  "name": "product-docs-unit",
  "display_name": "Product Documentation Unit",
  "description": "Knowledge unit for product documentation with high-quality embeddings",
  "vector_database_object_id": "instances/default/providers/FoundationaLLM.VectorDatabase/vectorDatabases/product-docs-search",
  "embedding_profile": {
    "embedding_model": "text-embedding-3-large",
    "embedding_dimensions": 2048
  },
  "index_configuration": {
    "partition_name": "main"
  }
}

Step 7: Review and Create

Verify vector database is properly configured
Confirm embedding model and dimensions match your needs
Check partition name is correct
Click Create
The knowledge unit is now available for use in data pipelines
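
Part of the review step can be automated when the configuration is kept as JSON. The sketch below uses the field names from the example configuration; treating these particular fields as required is an assumption, and the function name is hypothetical.

```python
# Field names come from the example configuration above. Treating all of
# them as required is an assumption of this sketch.
REQUIRED_PATHS = [
    ("name",),
    ("vector_database_object_id",),
    ("embedding_profile", "embedding_model"),
    ("embedding_profile", "embedding_dimensions"),
    ("index_configuration", "partition_name"),
]


def missing_fields(config: dict) -> list:
    """Return dotted paths of required fields that are absent or empty."""
    missing = []
    for path in REQUIRED_PATHS:
        node = config
        for key in path:
            # Walk one level down; stop if the parent is not an object.
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node is None or node == "":
            missing.append(".".join(path))
    return missing
```

Loading the configuration with json.load and passing the resulting dictionary to missing_fields lists anything that still needs to be filled in before you click Create.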

Using Knowledge Units in Data Pipelines

Once created, reference the knowledge unit in both embedding and indexing pipeline stages:

In Embedding Stage

{
  "name": "Embed",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/GatewayTextEmbeddingDataPipelineStagePlugin",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "KnowledgeUnitObjectId",
        "type": "resource-object-id"
      },
      "default_value": "instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/product-docs-unit"
    }
  ]
}

In Indexing Stage

{
  "name": "Index",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/AzureAISearchIndexingDataPipelineStagePlugin",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "KnowledgeUnitObjectId",
        "type": "resource-object-id"
      },
      "default_value": "instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/product-docs-unit"
    }
  ]
}

Best Practice: Always use the same knowledge unit for both embedding and indexing stages to ensure consistency.
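
This best practice can be verified against a pipeline definition. The sketch below assumes the stage structure shown in the two examples above (a plugin_parameters list whose entries carry parameter_metadata.name and default_value); the function names are hypothetical.

```python
from typing import Optional

PARAM_NAME = "KnowledgeUnitObjectId"


def knowledge_unit_id(stage: dict) -> Optional[str]:
    """Return the stage's KnowledgeUnitObjectId default value, if present."""
    for param in stage.get("plugin_parameters", []):
        if param.get("parameter_metadata", {}).get("name") == PARAM_NAME:
            return param.get("default_value")
    return None


def stages_share_knowledge_unit(embed_stage: dict, index_stage: dict) -> bool:
    """True only when both stages reference the same, non-empty knowledge unit."""
    embed_id = knowledge_unit_id(embed_stage)
    return embed_id is not None and embed_id == knowledge_unit_id(index_stage)
```

Running stages_share_knowledge_unit on the parsed Embed and Index stages before deploying a pipeline catches the case where one stage was updated and the other was not.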

Best Practices

Naming Conventions

Purpose-based names: product-docs-unit, support-kb-unit
Include key attributes: high-quality-embeddings-unit, fast-search-unit
Environment prefixes: dev-docs-unit, prod-docs-unit

Configuration Strategy

Dimension Selection:

2048 dimensions: Recommended for most production use cases
1536 dimensions: Good balance of quality and performance
3072 dimensions: Maximum quality for critical applications
1024 or lower: Fast search with acceptable quality trade-off

Model Selection:

text-embedding-3-large: Best quality, higher cost
text-embedding-3-small: Good quality, lower cost
text-embedding-ada-002: Legacy compatibility

Reusability

When to Create Multiple Knowledge Units:

Different embedding models or dimensions needed
Different vector databases for different data types
Different partition strategies
Development vs. production environments

When to Reuse a Knowledge Unit:

Same embedding requirements across pipelines
Same vector database and partition
Consistent search quality requirements
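
The reuse criteria above amount to comparing a handful of settings. As a rough sketch, assuming knowledge units are represented with the field names from the example configuration (the helper names are hypothetical):

```python
def _signature(unit: dict) -> tuple:
    """Collect the settings that decide whether two units are interchangeable."""
    profile = unit.get("embedding_profile", {})
    index = unit.get("index_configuration", {})
    return (
        profile.get("embedding_model"),
        profile.get("embedding_dimensions"),
        unit.get("vector_database_object_id"),
        index.get("partition_name"),
    )


def can_reuse(existing: dict, needed: dict) -> bool:
    """True when an existing unit already matches what a new pipeline needs."""
    return _signature(existing) == _signature(needed)
```

If can_reuse returns False, at least one of the model, dimensions, vector database, or partition differs, which matches the cases listed under creating multiple knowledge units.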

Consistency

Maintain Consistency Between:

Embedding model dimensions in knowledge unit
Vector field dimensions in vector database schema
Embedding and indexing stages in pipelines

Common Scenarios

Scenario 1: Standard Document Indexing

A single knowledge unit for general-purpose use:

Model: text-embedding-3-large
Dimensions: 2048
Vector Database: company-docs
Partition: main

Scenario 2: Multiple Data Types

Separate knowledge units for different data types:

Technical Docs Unit: 3072 dimensions, high quality
Support KB Unit: 2048 dimensions, balanced
Chat History Unit: 1024 dimensions, fast search

Scenario 3: Multi-Tenant Application

Knowledge units per tenant:

Same vector database, different partitions
Consistent embedding configuration
Tenant isolation through partitions
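
One way to sketch this scenario is to derive each tenant's knowledge unit from a shared base configuration, changing only the name and the partition. The field names follow the example configuration; the function name and the tenant-prefixed naming scheme are assumptions.

```python
import copy


def tenant_knowledge_units(base: dict, tenants: list) -> list:
    """Clone `base` once per tenant, varying only the name and partition."""
    units = []
    for tenant in tenants:
        unit = copy.deepcopy(base)  # leave the base configuration untouched
        unit["name"] = f"{tenant}-{base['name']}"  # assumed naming scheme
        unit.setdefault("index_configuration", {})["partition_name"] = tenant
        units.append(unit)
    return units
```

Because every unit is copied from the same base, the vector database and embedding settings stay identical across tenants while the partition name provides the isolation.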

Scenario 4: Development and Production

Separate knowledge units for environments:

Dev: Lower dimensions (1024), faster iteration
Staging: Production-like (2048), testing
Production: Optimized configuration (2048 or 3072)

Troubleshooting

Knowledge Unit Creation Fails

Problem: Unable to create knowledge unit

Solutions:

Verify you have required permissions
Check that vector database reference is valid
Ensure the name is unique
Verify embedding model name is correct

Dimension Mismatch Error

Problem: Pipeline fails with dimension mismatch

Solutions:

Verify knowledge unit dimensions match vector database schema
Either update the vector database schema to match the knowledge unit dimensions, or update the knowledge unit dimensions to match the schema
Ensure embedding model supports the specified dimensions

Embedding Stage Fails

Problem: Pipeline fails at embedding stage

Solutions:

Verify knowledge unit is properly configured
Check embedding model is available
Verify API endpoint for embeddings is configured
Check rate limits and quotas

Indexing Stage Fails

Problem: Data doesn't get indexed

Solutions:

Verify knowledge unit vector database connection
Check that index exists in vector database
Verify partition name is correct
Ensure schema allows the data structure

Inconsistent Search Results

Problem: Search results are inconsistent or of poor quality

Solutions:

Verify same knowledge unit is used for embedding and indexing
Check embedding dimensions haven't changed
Verify vector database contains the expected data
Review embedding model choice
