Creating Knowledge Sources
Learn how to create and configure Knowledge Sources that connect data pipelines to source data and knowledge units.
Overview
A Knowledge Source is a resource that represents a logical collection of data to be processed by data pipelines. It connects:
- Data sources (where the raw data comes from)
- Knowledge units (how the data should be embedded and indexed)
- Data pipelines (how to process the data)
Knowledge Sources provide a high-level abstraction for managing the entire knowledge ingestion workflow.
Prerequisites
- Access to FoundationaLLM Management Portal
- Required permissions:
FoundationaLLM.KnowledgeSource/knowledgeSources/write
- At least one Data Source configured (Azure Data Lake, SharePoint, etc.)
- At least one Knowledge Unit configured (see Creating Knowledge Units)
- Understanding of your data pipeline requirements
What is a Knowledge Source?
A Knowledge Source is a resource that:
- Links raw data sources to processed knowledge units
- Defines what data should be processed
- Specifies how data should be transformed (via data pipelines)
- Tracks the relationship between source data and indexed content
- Enables end-to-end knowledge management
Relationship:
Data Source → Knowledge Source → Data Pipeline → Knowledge Unit → Vector Database
     ↓               ↓                 ↓               ↓                 ↓
 Raw Files     Configuration      Processing      Embeddings          Storage
Creating a Knowledge Source
Step 1: Navigate to Knowledge Sources
- Open the FoundationaLLM Management Portal
- Navigate to Data → Knowledge Sources
- Click Create New Knowledge Source
Step 2: Configure Basic Settings
Name: A unique identifier for the knowledge source
- Use descriptive names (e.g., product-documentation, support-articles)
- Follow naming convention: lowercase with hyphens
- Example: company-internal-docs
Display Name: Human-readable name shown in the UI
- Example: "Company Internal Documentation"
Description: Purpose and scope of this knowledge source
- Document what data it includes
- Note any specific requirements or constraints
- Example: "Internal company documentation from SharePoint, processed for semantic search"
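The naming convention in this step can be checked before a name is typed into the portal. The sketch below is a minimal illustration that assumes "lowercase with hyphens" means lowercase letters and digits in groups separated by single hyphens; the portal's actual validation rule may differ, and both helper functions are hypothetical.

```python
import re

# Assumed reading of "lowercase with hyphens": lowercase letters and digits,
# in groups separated by single hyphens. The portal's real rule may differ.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_name(name: str) -> bool:
    """Return True if the name matches the assumed naming convention."""
    return NAME_PATTERN.fullmatch(name) is not None


def suggest_name(display_name: str) -> str:
    """Derive a candidate name from a display name (hypothetical helper)."""
    lowered = display_name.strip().lower()
    # Collapse every run of characters outside a-z and 0-9 into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", lowered).strip("-")
```

For example, suggest_name("Company Internal Documentation") yields a hyphenated lowercase candidate that can then be shortened by hand.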
Step 3: Select Data Source
Choose where the raw data comes from.
Data Source Types:
- Azure Data Lake: Files stored in Azure Data Lake Storage Gen2
- SharePoint Online: Documents from SharePoint document libraries
- Context File: Specific files provided via API
- Other: Custom data source plugins
Data Source Object ID: Reference to a data source resource
- Format: instances/{instanceId}/providers/FoundationaLLM.DataSource/dataSources/{dataSourceName}
- Example: instances/default/providers/FoundationaLLM.DataSource/dataSources/company-sharepoint
See the Data Sources section for details on configuring specific data source types.
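Object IDs in the format shown above are easy to mistype, so it can help to assemble them from their parts. The following is a small sketch, not part of any FoundationaLLM SDK: the function names are hypothetical, and the default instance ID of "default" is taken from the example above.

```python
def build_object_id(instance_id: str, provider: str, resource_type: str, name: str) -> str:
    """Assemble an object ID in the instances/.../providers/... layout."""
    for segment in (instance_id, provider, resource_type, name):
        # A slash or an empty value would silently change the path structure.
        if not segment or "/" in segment:
            raise ValueError(f"invalid object ID segment: {segment!r}")
    return f"instances/{instance_id}/providers/{provider}/{resource_type}/{name}"


def data_source_object_id(name: str, instance_id: str = "default") -> str:
    """Build the object ID of a data source (hypothetical helper)."""
    return build_object_id(instance_id, "FoundationaLLM.DataSource", "dataSources", name)
```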
Step 4: Configure Data Selection
Specify which data from the source should be processed.
For Azure Data Lake:
{
  "data_source_parameters": {
    "Folders": [
      "/documentation/product",
      "/documentation/api"
    ]
  }
}
For SharePoint Online:
{
  "data_source_parameters": {
    "DocumentLibraries": [
      "Shared Documents",
      "Product Documentation"
    ]
  }
}
File Filters (optional):
- File extensions to include/exclude
- Date ranges
- Metadata filters
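The two selection payloads above differ only in the parameter key, so they can be generated from one helper. This is a sketch under assumptions: the source-type labels "azure_data_lake" and "sharepoint_online" are invented for illustration, while the "Folders" and "DocumentLibraries" keys come from the examples above. File filter keys are not shown in this guide, so they are left out.

```python
import json

# Parameter key per data source type. The keys come from the examples above;
# the type labels on the left are hypothetical.
SELECTION_KEYS = {
    "azure_data_lake": "Folders",
    "sharepoint_online": "DocumentLibraries",
}


def build_data_selection(source_type: str, locations: list) -> dict:
    """Return a data_source_parameters payload for the given source type."""
    if source_type not in SELECTION_KEYS:
        raise ValueError(f"no selection key known for {source_type!r}")
    if not locations:
        raise ValueError("at least one folder or library is required")
    return {"data_source_parameters": {SELECTION_KEYS[source_type]: list(locations)}}


selection = build_data_selection("azure_data_lake", ["/documentation/product", "/documentation/api"])
selection_json = json.dumps(selection, indent=2)
```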
Step 5: Link Knowledge Unit
Connect to the knowledge unit that defines embedding and indexing settings.
Knowledge Unit Object ID: Reference to a knowledge unit resource
- Format: instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/{knowledgeUnitName}
- Example: instances/default/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/company-docs-unit
This knowledge unit determines:
- Which embedding model to use
- What embedding dimensions to generate
- Which vector database stores the results
- What index partition to use
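Because a knowledge source can only be as correct as the object ID it points to, a quick structural check on the knowledge unit reference may catch copy-and-paste mistakes early. The sketch below only inspects the string layout described in this step; it does not contact the platform, and the function names are hypothetical.

```python
import re

# Layout: instances/{instanceId}/providers/{provider}/{resourceType}/{name}
OBJECT_ID_PATTERN = re.compile(
    r"instances/(?P<instance_id>[^/]+)"
    r"/providers/(?P<provider>[^/]+)"
    r"/(?P<resource_type>[^/]+)"
    r"/(?P<resource_name>[^/]+)"
)


def parse_object_id(object_id: str) -> dict:
    """Split an object ID into its named parts, or raise ValueError."""
    match = OBJECT_ID_PATTERN.fullmatch(object_id)
    if match is None:
        raise ValueError(f"not a well-formed object ID: {object_id!r}")
    return match.groupdict()


def is_knowledge_unit_id(object_id: str) -> bool:
    """Return True if the ID has the knowledge unit provider and type."""
    try:
        parts = parse_object_id(object_id)
    except ValueError:
        return False
    return (
        parts["provider"] == "FoundationaLLM.KnowledgeUnit"
        and parts["resource_type"] == "knowledgeUnits"
    )
```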
Step 6: Configure Processing Options
Refresh Strategy:
- Manual: Run pipeline manually when needed
- Scheduled: Run on a defined schedule (e.g., daily)
- Incremental: Only process new/changed files
Processing Options:
- Content Safety: Enable content filtering
- Knowledge Graph: Extract structured relationships
- Custom Stages: Additional processing steps
Step 7: Example Configuration
{
  "name": "product-documentation",
  "display_name": "Product Documentation",
  "description": "Technical documentation for products from SharePoint",
  "data_source_object_id": "instances/default/providers/FoundationaLLM.DataSource/dataSources/company-sharepoint",
  "data_source_parameters": {
    "DocumentLibraries": [
      "Product Documentation"
    ]
  },
  "knowledge_unit_object_id": "instances/default/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/product-docs-unit",
  "refresh_strategy": "scheduled",
  "schedule": {
    "frequency": "daily",
    "time": "02:00"
  }
}
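A configuration like the one above can be sanity-checked locally before it is submitted. The sketch below is illustrative only: which fields are mandatory and which refresh_strategy values are accepted are assumptions inferred from this guide, not a published schema.

```python
import re

# Assumptions for illustration; the platform's real schema may differ.
ASSUMED_REQUIRED = ("name", "data_source_object_id", "knowledge_unit_object_id")
ASSUMED_STRATEGIES = {"manual", "scheduled", "incremental"}
TIME_PATTERN = re.compile(r"([01]\d|2[0-3]):[0-5]\d")  # 24-hour HH:MM


def find_config_problems(config: dict) -> list:
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    for key in ASSUMED_REQUIRED:
        if not config.get(key):
            problems.append(f"missing or empty field: {key}")
    strategy = config.get("refresh_strategy")
    if strategy not in ASSUMED_STRATEGIES:
        problems.append(f"unrecognized refresh_strategy: {strategy!r}")
    if strategy == "scheduled":
        schedule = config.get("schedule") or {}
        if not schedule.get("frequency"):
            problems.append("scheduled refresh needs schedule.frequency")
        if not TIME_PATTERN.fullmatch(str(schedule.get("time", ""))):
            problems.append("schedule.time should look like HH:MM")
    return problems
```

An empty list only means the structural checks passed; whether the referenced data source and knowledge unit exist still has to be verified in the portal.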
Step 8: Review and Create
- Verify data source is accessible
- Confirm knowledge unit is properly configured
- Review data selection parameters
- Check processing options
- Click Create
Using Knowledge Sources with Data Pipelines
After creating a knowledge source, you can:
Create a Data Pipeline
- Navigate to Data → Data Pipelines
- Create a new pipeline
- Reference the knowledge source in the data source plugin parameters
- Use the same knowledge unit in embedding and indexing stages
Run the Pipeline
- Trigger a run manually
- Or configure a pipeline trigger (manual or scheduled) to start runs
- Monitor progress in Data Pipeline Runs
Best Practices
Naming and Organization
Naming Conventions:
- Use descriptive, purpose-driven names
- Include data type or source: sharepoint-product-docs
- Add environment prefix if needed: prod-customer-support
Organization:
- Group related knowledge sources
- Document dependencies between sources
- Maintain consistent naming across data sources, knowledge units, and knowledge sources
Data Selection
Start Small:
- Begin with a subset of data
- Verify processing works correctly
- Scale up gradually
File Filtering:
- Exclude irrelevant file types
- Filter by date to process recent data first
- Use metadata to select relevant content
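To preview what a set of file filters would keep, the filtering rules can be tried against a file listing before they are configured. This is a local sketch only: the keep_file function and its parameters are hypothetical and do not correspond to a documented filter syntax.

```python
from datetime import date
from pathlib import PurePosixPath


def keep_file(path: str, modified: date, include_extensions: set, modified_after: date = None) -> bool:
    """Return True if a file passes the extension and date filters."""
    # Compare extensions case-insensitively, so ".PDF" matches ".pdf".
    extension = PurePosixPath(path).suffix.lower()
    if extension not in include_extensions:
        return False
    # Files modified before the cutoff are skipped when a cutoff is given.
    if modified_after is not None and modified < modified_after:
        return False
    return True
```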
Incremental Processing:
- Enable incremental updates for large datasets
- Only process changed files to save resources
- Track last processed timestamp
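The idea behind incremental processing can be illustrated with a last-processed timestamp, sometimes called a watermark. The sketch below shows the general technique only; the FileInfo type and both functions are hypothetical, and the platform's own change detection may work differently.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class FileInfo:
    path: str
    modified: datetime


def select_changed(files: List[FileInfo], last_processed: Optional[datetime]) -> List[FileInfo]:
    """Return files modified after the last processed timestamp."""
    if last_processed is None:
        # First run: nothing has been processed yet, so take everything.
        return list(files)
    return [f for f in files if f.modified > last_processed]


def next_watermark(files: List[FileInfo], last_processed: Optional[datetime]) -> Optional[datetime]:
    """Return the timestamp to store after a successful run."""
    newest = max((f.modified for f in files), default=None)
    if newest is None:
        return last_processed
    if last_processed is None:
        return newest
    return max(newest, last_processed)
```

Storing the watermark only after a run succeeds avoids skipping files when a run fails partway through.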
Refresh Strategy
Scheduled Refresh:
- Daily: For regularly updated content
- Weekly: For less frequently changing data
- Hourly: For near-real-time requirements (use a frequent schedule, because event triggers are not yet available)
Manual Refresh:
- For one-time data loads
- During development and testing
- For ad-hoc data updates
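When choosing a schedule, it can be useful to work out when the next run would start. The sketch below computes this for daily and hourly schedules using a frequency and an HH:MM time like those in the example configuration; the function is hypothetical, it ignores time zones, and weekly schedules are omitted because this guide does not show how the weekday is specified.

```python
from datetime import datetime, timedelta

INTERVALS = {"hourly": timedelta(hours=1), "daily": timedelta(days=1)}


def next_run(now: datetime, frequency: str, time_of_day: str = "00:00") -> datetime:
    """Return the first scheduled start strictly after `now`."""
    if frequency not in INTERVALS:
        raise ValueError(f"unsupported frequency: {frequency!r}")
    hour, minute = (int(part) for part in time_of_day.split(":"))
    if frequency == "hourly":
        # Hourly runs only use the minute part of the configured time.
        candidate = now.replace(minute=minute, second=0, microsecond=0)
    else:
        candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # Step forward until the candidate lies in the future.
    while candidate <= now:
        candidate += INTERVALS[frequency]
    return candidate
```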
Knowledge Unit Alignment
Consistency:
- Use the same knowledge unit for related knowledge sources
- Ensures consistent embedding and search quality
- Simplifies management
Separation:
- Use different knowledge units for different data types
- Allows different embedding models or dimensions
- Enables different vector databases
Common Scenarios
Scenario 1: Company Documentation
Single knowledge source for all company docs:
- Data Source: SharePoint Online (multiple libraries)
- Knowledge Unit: Standard 2048-dimension embeddings
- Refresh: Daily at night
- Use Case: General company knowledge search
Scenario 2: Multi-Source Product Documentation
Multiple knowledge sources, same knowledge unit:
- Source 1: SharePoint (user guides)
- Source 2: Azure Data Lake (API docs)
- Knowledge Unit: Shared product-docs unit
- Use Case: Unified product documentation search
Scenario 3: Departmental Knowledge Bases
Separate knowledge sources per department:
- HR Docs: HR SharePoint, hr-knowledge unit
- Engineering Docs: Confluence export, eng-knowledge unit
- Sales Docs: Sales SharePoint, sales-knowledge unit
- Use Case: Department-specific search with different configurations
Scenario 4: Multi-Tenant Application
Knowledge sources per tenant:
- Same vector database, different partitions (via knowledge units)
- Separate data sources per tenant
- Tenant isolation through knowledge source + knowledge unit pairing
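The per-tenant pairing can be sketched as a function that derives one knowledge source definition per tenant, plus a check that no two tenants share a resource. The naming scheme ({tenant}-docs, {tenant}-data, {tenant}-unit) is purely an assumption for illustration; the field names follow the example configuration earlier in this guide.

```python
from collections import Counter


def tenant_knowledge_source(tenant: str, instance_id: str = "default") -> dict:
    """Build a knowledge source definition for one tenant (hypothetical naming)."""
    base = f"instances/{instance_id}/providers"
    return {
        "name": f"{tenant}-docs",
        "data_source_object_id": f"{base}/FoundationaLLM.DataSource/dataSources/{tenant}-data",
        "knowledge_unit_object_id": f"{base}/FoundationaLLM.KnowledgeUnit/knowledgeUnits/{tenant}-unit",
    }


def shared_resources(configs: list) -> set:
    """Return object IDs referenced by more than one knowledge source."""
    counts = Counter()
    for config in configs:
        counts[config["data_source_object_id"]] += 1
        counts[config["knowledge_unit_object_id"]] += 1
    return {object_id for object_id, count in counts.items() if count > 1}
```

An empty result from shared_resources suggests that each tenant has its own data source and knowledge unit, which is the pairing this scenario relies on for isolation.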
Troubleshooting
Knowledge Source Creation Fails
Problem: Unable to create knowledge source
Solutions:
- Verify you have required permissions
- Check that data source reference is valid
- Ensure knowledge unit reference is valid
- Verify the name is unique
Data Source Not Accessible
Problem: Pipeline fails to read from data source
Solutions:
- Verify data source configuration and credentials
- Check network connectivity
- Ensure permissions are granted to access the data
- Test data source connection independently
No Data Processed
Problem: Pipeline runs but no data is indexed
Solutions:
- Verify data selection parameters (folders, libraries)
- Check that files exist in specified locations
- Review file type filters
- Check pipeline logs for errors
Incorrect Knowledge Unit
Problem: Data indexed with wrong embeddings or to wrong database
Solutions:
- Verify knowledge unit object ID is correct
- Check knowledge unit configuration
- Ensure knowledge unit references correct vector database
- Update knowledge source with correct knowledge unit
Performance Issues
Problem: Pipeline takes too long to process
Solutions:
- Enable incremental processing
- Reduce scope of data selection
- Optimize knowledge unit embedding dimensions
- Schedule during off-peak hours