Creating Knowledge Units
Learn how to create and configure Knowledge Units that define embedding and indexing settings for data pipelines.
Overview
A Knowledge Unit is a central configuration resource that defines how data should be embedded and indexed. It connects data pipeline stages to vector databases and specifies embedding models and dimensions. Knowledge Units ensure consistency between embedding generation and vector storage.
Prerequisites
- Access to FoundationaLLM Management Portal
- Required permissions: FoundationaLLM.KnowledgeUnit/knowledgeUnits/write
- At least one Vector Database configured (see Creating Vector Databases)
- Understanding of embedding models and dimensions
What is a Knowledge Unit?
A Knowledge Unit is a resource that:
- Links embedding and indexing pipeline stages to a vector database
- Defines which embedding model to use
- Specifies embedding dimensions
- Configures index settings and partitions
- Ensures consistency between embedding and storage
Key Concept: When you configure embedding and indexing stages in a data pipeline, you reference a Knowledge Unit rather than directly specifying embedding models or vector databases. This provides centralized, consistent configuration.
Creating a Knowledge Unit
Step 1: Navigate to Knowledge Units
- Open the FoundationaLLM Management Portal
- Navigate to Data → Knowledge Units
- Click Create New Knowledge Unit
Step 2: Configure Basic Settings
Name: A unique identifier for the knowledge unit
- Use descriptive names (e.g., product-docs-unit, support-kb-unit)
- Follow naming convention: lowercase with hyphens
- Example: company-knowledge-base-unit
Display Name: Human-readable name shown in the UI
- Example: "Company Knowledge Base Unit"
Description: Purpose of this knowledge unit
- Document which pipelines use it
- Note the data types it handles
- Example: "Knowledge unit for company internal documentation with 2048-dimension embeddings"
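The naming convention described under Name (lowercase with hyphens) can be checked before a name is submitted. The following is a minimal sketch: the regular expression and the function name `is_valid_unit_name` are illustrative assumptions, and the portal's own validation rules may be stricter or looser.

```python
import re

# Assumed interpretation of "lowercase with hyphens": lowercase letters or
# digits, grouped into words separated by single hyphens. This is NOT taken
# from the portal's validation code.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_unit_name(name: str) -> bool:
    """Return True if the whole name matches the assumed convention."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("product-docs-unit", "Support_KB Unit", "docs--unit"):
        print(candidate, is_valid_unit_name(candidate))
```

Uniqueness of the name still has to be checked against existing knowledge units; a pattern check only catches formatting problems.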
Step 3: Select Vector Database
Choose the vector database where indexed data will be stored.
Vector Database Object ID: Reference to a vector database resource
- Format: instances/{instanceId}/providers/FoundationaLLM.VectorDatabase/vectorDatabases/{vectorDatabaseName}
- Example: instances/default/providers/FoundationaLLM.VectorDatabase/vectorDatabases/company-kb
The selected vector database must:
- Already be created and configured
- Have an active connection to the vector storage service
- Have an index schema compatible with the embedding dimensions
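Because the Vector Database Object ID follows a fixed path format, it can be assembled and checked programmatically. The sketch below assumes only the format shown above; the helper names `build_vector_database_object_id` and `parse_vector_database_object_id` are hypothetical and not part of any FoundationaLLM SDK.

```python
from typing import Tuple

PROVIDER = "FoundationaLLM.VectorDatabase"
RESOURCE_TYPE = "vectorDatabases"


def build_vector_database_object_id(instance_id: str, name: str) -> str:
    """Assemble an object ID in the documented path format."""
    return f"instances/{instance_id}/providers/{PROVIDER}/{RESOURCE_TYPE}/{name}"


def parse_vector_database_object_id(object_id: str) -> Tuple[str, str]:
    """Return (instance_id, vector_database_name) or raise ValueError."""
    parts = object_id.split("/")
    well_formed = (
        len(parts) == 6
        and parts[0] == "instances"
        and parts[2] == "providers"
        and parts[3] == PROVIDER
        and parts[4] == RESOURCE_TYPE
        and parts[1] != ""
        and parts[5] != ""
    )
    if not well_formed:
        raise ValueError(f"not a vector database object ID: {object_id!r}")
    return parts[1], parts[5]


if __name__ == "__main__":
    object_id = build_vector_database_object_id("default", "company-kb")
    print(object_id)
    print(parse_vector_database_object_id(object_id))
```

A format check like this does not confirm that the vector database exists or is reachable; it only guards against typos in the reference.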
Step 4: Configure Embedding Settings
Embedding Model: The model used to generate vector embeddings
Common models:
- text-embedding-3-large: High quality, default for most use cases
- text-embedding-3-small: Faster, lower resource usage
- text-embedding-ada-002: Legacy model, widely compatible
Embedding Dimensions: Size of the vector embeddings
Common dimension options:
- 1536: Standard for text-embedding-ada-002
- 2048: Default for text-embedding-3-large (recommended)
- 3072: Maximum quality for text-embedding-3-large
- 512 or 1024: Lower dimensions for faster search
Important: The embedding dimensions must match the vector field dimensions in the target vector database index schema.
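Not every model can produce every dimension count, so it can help to check a model and dimension pair before saving the knowledge unit. The sketch below uses the output sizes published for these models (3072 for text-embedding-3-large, 1536 for text-embedding-3-small, and a fixed 1536 for text-embedding-ada-002). The table `MODEL_LIMITS` and the function `check_embedding_settings` are illustrative, and whether FoundationaLLM itself enforces these limits is an assumption.

```python
from typing import List

# Native (maximum) output sizes per the models' public documentation.
# "adjustable" marks models that accept a smaller requested dimension count.
MODEL_LIMITS = {
    "text-embedding-3-large": {"max_dimensions": 3072, "adjustable": True},
    "text-embedding-3-small": {"max_dimensions": 1536, "adjustable": True},
    "text-embedding-ada-002": {"max_dimensions": 1536, "adjustable": False},
}


def check_embedding_settings(model: str, dimensions: int) -> List[str]:
    """Return a list of problems; an empty list means the pair looks usable."""
    limits = MODEL_LIMITS.get(model)
    if limits is None:
        return [f"unknown model '{model}'; add it to MODEL_LIMITS first"]
    maximum = limits["max_dimensions"]
    if dimensions <= 0:
        return ["dimensions must be a positive integer"]
    if dimensions > maximum:
        return [f"{model} produces at most {maximum} dimensions, got {dimensions}"]
    if not limits["adjustable"] and dimensions != maximum:
        return [f"{model} always produces {maximum} dimensions, got {dimensions}"]
    return []


if __name__ == "__main__":
    print(check_embedding_settings("text-embedding-3-large", 2048))
    print(check_embedding_settings("text-embedding-3-small", 2048))
```

For example, pairing text-embedding-3-small with 2048 dimensions is reported as a problem, because that model's native output size is 1536.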
Step 5: Configure Index Settings
Index Partition Name: Logical partition within the index
- Enables multiple data sets in one physical index
- Allows filtering by partition during search
- Examples: main, tenant-a, product-docs
Index Configuration: Additional index-specific settings
- Field mappings
- Search parameters
- Filtering rules
Step 6: Example Configuration
{
  "name": "product-docs-unit",
  "display_name": "Product Documentation Unit",
  "description": "Knowledge unit for product documentation with high-quality embeddings",
  "vector_database_object_id": "instances/default/providers/FoundationaLLM.VectorDatabase/vectorDatabases/product-docs-search",
  "embedding_profile": {
    "embedding_model": "text-embedding-3-large",
    "embedding_dimensions": 2048
  },
  "index_configuration": {
    "partition_name": "main"
  }
}
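A configuration like the one above can be checked for missing fields before it is submitted. This is a rough sketch: the field names come from the example configuration, but treating them as required is an assumption, and `find_missing_fields` is a hypothetical helper rather than a FoundationaLLM API.

```python
import json
from typing import List

# Fields assumed to be required, based only on the example configuration.
REQUIRED_FIELDS = {
    "name": (),
    "vector_database_object_id": (),
    "embedding_profile": ("embedding_model", "embedding_dimensions"),
    "index_configuration": ("partition_name",),
}

EXAMPLE = """
{
  "name": "product-docs-unit",
  "display_name": "Product Documentation Unit",
  "vector_database_object_id": "instances/default/providers/FoundationaLLM.VectorDatabase/vectorDatabases/product-docs-search",
  "embedding_profile": {
    "embedding_model": "text-embedding-3-large",
    "embedding_dimensions": 2048
  },
  "index_configuration": {"partition_name": "main"}
}
"""


def find_missing_fields(config: dict) -> List[str]:
    """Return dotted paths of assumed-required fields absent from config."""
    missing = []
    for field, nested in REQUIRED_FIELDS.items():
        if field not in config:
            missing.append(field)
            continue
        for key in nested:
            # Only look inside the field when it is an object.
            if not isinstance(config[field], dict) or key not in config[field]:
                missing.append(f"{field}.{key}")
    return missing


if __name__ == "__main__":
    print(find_missing_fields(json.loads(EXAMPLE)))
```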
Step 7: Review and Create
- Verify vector database is properly configured
- Confirm embedding model and dimensions match your needs
- Check partition name is correct
- Click Create
- The knowledge unit is now available for use in data pipelines
Using Knowledge Units in Data Pipelines
Once created, reference the knowledge unit in both embedding and indexing pipeline stages:
In Embedding Stage
{
  "name": "Embed",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/GatewayTextEmbeddingDataPipelineStagePlugin",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "KnowledgeUnitObjectId",
        "type": "resource-object-id"
      },
      "default_value": "instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/product-docs-unit"
    }
  ]
}
In Indexing Stage
{
  "name": "Index",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/AzureAISearchIndexingDataPipelineStagePlugin",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "KnowledgeUnitObjectId",
        "type": "resource-object-id"
      },
      "default_value": "instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/product-docs-unit"
    }
  ]
}
Best Practice: Always use the same knowledge unit for both embedding and indexing stages to ensure consistency.
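This best practice can be verified mechanically on a pipeline definition. The sketch below assumes stages are available as a list of dictionaries shaped like the two stage examples in this section; `knowledge_unit_ids`, `stages_share_knowledge_unit`, and `example_stage` are hypothetical helpers written for illustration.

```python
from typing import Dict, List, Optional

PARAMETER_NAME = "KnowledgeUnitObjectId"


def knowledge_unit_ids(stages: List[dict]) -> Dict[str, Optional[str]]:
    """Map each stage name to the knowledge unit it references, if any."""
    found = {}
    for stage in stages:
        for parameter in stage.get("plugin_parameters", []):
            if parameter.get("parameter_metadata", {}).get("name") == PARAMETER_NAME:
                found[stage["name"]] = parameter.get("default_value")
    return found


def stages_share_knowledge_unit(stages: List[dict]) -> bool:
    """True only if at least one stage references a unit and all agree."""
    return len(set(knowledge_unit_ids(stages).values())) == 1


def example_stage(name: str, unit_object_id: str) -> dict:
    """Build a minimal stage dictionary for demonstration purposes."""
    return {
        "name": name,
        "plugin_parameters": [
            {
                "parameter_metadata": {"name": PARAMETER_NAME, "type": "resource-object-id"},
                "default_value": unit_object_id,
            }
        ],
    }


if __name__ == "__main__":
    unit = "instances/default/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/product-docs-unit"
    print(stages_share_knowledge_unit([example_stage("Embed", unit), example_stage("Index", unit)]))
```

Note that the check returns False when no stage references a knowledge unit at all, which is usually also worth investigating.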
Best Practices
Naming Conventions
- Purpose-based names: product-docs-unit, support-kb-unit
- Include key attributes: high-quality-embeddings-unit, fast-search-unit
- Environment prefixes: dev-docs-unit, prod-docs-unit
Configuration Strategy
Dimension Selection:
- 2048 dimensions: Recommended for most production use cases
- 1536 dimensions: Good balance of quality and performance
- 3072 dimensions: Maximum quality for critical applications
- 1024 or lower: Fast search with acceptable quality trade-off
Model Selection:
- text-embedding-3-large: Best quality, higher cost
- text-embedding-3-small: Good quality, lower cost
- text-embedding-ada-002: Legacy compatibility
Reusability
When to Create Multiple Knowledge Units:
- Different embedding models or dimensions needed
- Different vector databases for different data types
- Different partition strategies
- Development vs. production environments
When to Reuse a Knowledge Unit:
- Same embedding requirements across pipelines
- Same vector database and partition
- Consistent search quality requirements
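The reuse criteria above amount to comparing a handful of settings. As a rough sketch, the comparison could look like this; the `UnitRequirements` class and `blocking_differences` function are illustrative names, not part of FoundationaLLM.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class UnitRequirements:
    """The settings that decide whether a knowledge unit can be shared."""
    embedding_model: str
    embedding_dimensions: int
    vector_database: str
    partition_name: str


def blocking_differences(existing: UnitRequirements, needed: UnitRequirements) -> List[str]:
    """Return the names of settings that differ; empty means reuse is possible."""
    return [
        field.name
        for field in fields(UnitRequirements)
        if getattr(existing, field.name) != getattr(needed, field.name)
    ]


if __name__ == "__main__":
    existing = UnitRequirements("text-embedding-3-large", 2048, "company-docs", "main")
    needed = UnitRequirements("text-embedding-3-large", 3072, "company-docs", "main")
    print(blocking_differences(existing, needed))
```

An empty result suggests the existing knowledge unit can be reused; any listed setting is a reason to create a separate one.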
Consistency
Maintain Consistency Between:
- Embedding model dimensions in knowledge unit
- Vector field dimensions in vector database schema
- Embedding and indexing stages in pipelines
Common Scenarios
Scenario 1: Standard Document Indexing
Single knowledge unit for general purpose:
- Model: text-embedding-3-large
- Dimensions: 2048
- Vector Database: company-docs
- Partition: main
Scenario 2: Multiple Data Types
Separate knowledge units for different data types:
- Technical Docs Unit: 3072 dimensions, high quality
- Support KB Unit: 2048 dimensions, balanced
- Chat History Unit: 1024 dimensions, fast search
Scenario 3: Multi-Tenant Application
Knowledge units per tenant:
- Same vector database, different partitions
- Consistent embedding configuration
- Tenant isolation through partitions
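The multi-tenant pattern above can be sketched as a small generator that produces one configuration per tenant. The field names follow the example configuration in Step 6; the function `tenant_knowledge_units`, the `{tenant}-unit` naming scheme, and the use of the tenant ID as the partition name are assumptions made for illustration.

```python
from typing import Iterable, List


def tenant_knowledge_units(
    tenants: Iterable[str],
    vector_database_object_id: str,
    embedding_model: str = "text-embedding-3-large",
    embedding_dimensions: int = 2048,
) -> List[dict]:
    """Build one knowledge unit configuration per tenant.

    All units share the vector database and embedding settings; only the
    name and the partition differ.
    """
    units = []
    for tenant in tenants:
        units.append(
            {
                "name": f"{tenant}-unit",  # hypothetical naming scheme
                "display_name": f"{tenant} Knowledge Unit",
                "vector_database_object_id": vector_database_object_id,
                "embedding_profile": {
                    "embedding_model": embedding_model,
                    "embedding_dimensions": embedding_dimensions,
                },
                "index_configuration": {"partition_name": tenant},
            }
        )
    return units


if __name__ == "__main__":
    database = "instances/default/providers/FoundationaLLM.VectorDatabase/vectorDatabases/company-kb"
    for unit in tenant_knowledge_units(["tenant-a", "tenant-b"], database):
        print(unit["name"], unit["index_configuration"]["partition_name"])
```

Generating the configurations from one template keeps the embedding settings identical across tenants, which is the consistency this scenario depends on.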
Scenario 4: Development and Production
Separate knowledge units for environments:
- Dev: Lower dimensions (1024), faster iteration
- Staging: Production-like (2048), testing
- Production: Optimized configuration (2048 or 3072)
Troubleshooting
Knowledge Unit Creation Fails
Problem: Unable to create knowledge unit
Solutions:
- Verify you have required permissions
- Check that vector database reference is valid
- Ensure the name is unique
- Verify embedding model name is correct
Dimension Mismatch Error
Problem: Pipeline fails with dimension mismatch
Solutions:
- Verify knowledge unit dimensions match vector database schema
- Update the vector database schema to match the knowledge unit dimensions, or update the knowledge unit dimensions to match the schema
- Ensure embedding model supports the specified dimensions
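When tracking down a dimension mismatch, it helps to compare three numbers side by side: the length of an actual embedding, the dimensions set on the knowledge unit, and the dimensions of the vector field in the index. The sketch below assumes you have already obtained those values by other means (for example from a sample pipeline run and the index definition); `diagnose_dimensions` is a hypothetical helper.

```python
from typing import List, Sequence


def diagnose_dimensions(
    vector: Sequence[float],
    unit_dimensions: int,
    index_dimensions: int,
) -> List[str]:
    """Describe every disagreement between the three dimension values."""
    findings = []
    if unit_dimensions != index_dimensions:
        findings.append(
            f"knowledge unit is set to {unit_dimensions} dimensions "
            f"but the index vector field is defined with {index_dimensions}"
        )
    if len(vector) != unit_dimensions:
        findings.append(
            f"embedding has {len(vector)} values "
            f"but the knowledge unit is set to {unit_dimensions}"
        )
    return findings


if __name__ == "__main__":
    sample = [0.0] * 1536  # stand-in for a real embedding
    for finding in diagnose_dimensions(sample, unit_dimensions=2048, index_dimensions=2048):
        print(finding)
```

An empty result means all three values agree, so the failure likely has a different cause.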
Embedding Stage Fails
Problem: Pipeline fails at embedding stage
Solutions:
- Verify knowledge unit is properly configured
- Check embedding model is available
- Verify API endpoint for embeddings is configured
- Check rate limits and quotas
Indexing Stage Fails
Problem: Data doesn't get indexed
Solutions:
- Verify knowledge unit vector database connection
- Check that index exists in vector database
- Verify partition name is correct
- Ensure schema allows the data structure
Inconsistent Search Results
Problem: Search quality is poor
Solutions:
- Verify same knowledge unit is used for embedding and indexing
- Check embedding dimensions haven't changed
- Verify vector database contains the expected data
- Review embedding model choice