Creating Knowledge Sources

Learn how to create and configure Knowledge Sources that connect data pipelines to source data and knowledge units.

Overview

A Knowledge Source is a resource that represents a logical collection of data to be processed by data pipelines. It connects:

Data sources (where the raw data comes from)
Knowledge units (how the data should be embedded and indexed)
Data pipelines (how to process the data)

Knowledge Sources provide a high-level abstraction for managing the entire knowledge ingestion workflow.

Prerequisites

Access to the FoundationaLLM Management Portal
Required permissions: FoundationaLLM.KnowledgeSource/knowledgeSources/write
At least one Data Source configured (Azure Data Lake, SharePoint, etc.)
At least one Knowledge Unit configured (see Creating Knowledge Units)
Understanding of your data pipeline requirements

What is a Knowledge Source?

A Knowledge Source is a resource that:

Links raw data sources to processed knowledge units
Defines what data should be processed
Specifies how data should be transformed (via data pipelines)
Tracks the relationship between source data and indexed content
Enables end-to-end knowledge management

Relationship:

Data Source → Knowledge Source → Data Pipeline → Knowledge Unit → Vector Database
     ↓              ↓                  ↓                ↓               ↓
  Raw Files   Configuration      Processing      Embeddings        Storage

Creating a Knowledge Source

Step 1: Navigate to Knowledge Sources

Open the FoundationaLLM Management Portal
Navigate to Data → Knowledge Sources
Click Create New Knowledge Source

Step 2: Configure Basic Settings

Name: A unique identifier for the knowledge source

Use descriptive names (e.g., product-documentation, support-articles)
Follow naming convention: lowercase with hyphens
Example: company-internal-docs
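
The naming convention above can be checked before a name is submitted. The following is a minimal sketch, assuming the convention means lowercase letters and digits in groups separated by single hyphens; the pattern and both helper functions are illustrative and are not part of the platform, whose actual validation rules may differ.

```python
import re

# Assumed pattern: lowercase letters and digits in groups joined by single hyphens.
# The platform's real validation rules may be stricter or looser.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_name(name: str) -> bool:
    """Return True if the whole name matches the assumed convention."""
    return NAME_PATTERN.fullmatch(name) is not None


def suggest_name(display_name: str) -> str:
    """Derive a candidate name from a display name (hypothetical helper)."""
    # Lowercase, then collapse every run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", display_name.lower())
    return slug.strip("-")
```

For instance, suggest_name("Company Internal Documentation") yields a hyphenated lowercase candidate that can then be shortened by hand.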

Display Name: Human-readable name shown in the UI

Example: "Company Internal Documentation"

Description: Purpose and scope of this knowledge source

Document what data it includes
Note any specific requirements or constraints
Example: "Internal company documentation from SharePoint, processed for semantic search"

Step 3: Select Data Source

Choose where the raw data comes from.

Data Source Types:

Azure Data Lake: Files stored in Azure Data Lake Storage Gen2
SharePoint Online: Documents from SharePoint document libraries
Context File: Specific files provided via API
Other: Custom data source plugins

Data Source Object ID: Reference to a data source resource

Format: instances/{instanceId}/providers/FoundationaLLM.DataSource/dataSources/{dataSourceName}
Example: instances/default/providers/FoundationaLLM.DataSource/dataSources/company-sharepoint

See the Data Sources section for details on configuring specific data source types.
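
Object IDs follow a fixed path shape, so they can be assembled and checked programmatically. The sketch below assumes only the format shown above; build_object_id and parse_object_id are hypothetical helpers, not platform functions.

```python
def build_object_id(instance_id: str, provider: str, resource_type: str, name: str) -> str:
    """Assemble an object ID in the documented path format."""
    return f"instances/{instance_id}/providers/{provider}/{resource_type}/{name}"


def parse_object_id(object_id: str) -> dict:
    """Split an object ID into its parts; raise ValueError on an unexpected shape."""
    parts = object_id.strip("/").split("/")
    if len(parts) != 6 or parts[0] != "instances" or parts[2] != "providers":
        raise ValueError(f"Unexpected object ID format: {object_id!r}")
    return {
        "instance_id": parts[1],
        "provider": parts[3],
        "resource_type": parts[4],
        "name": parts[5],
    }


if __name__ == "__main__":
    object_id = build_object_id(
        "default", "FoundationaLLM.DataSource", "dataSources", "company-sharepoint"
    )
    print(object_id)
    print(parse_object_id(object_id)["name"])
```

Parsing a reference before saving it is a cheap way to catch a missing segment or a misspelled provider early.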

Step 4: Configure Data Selection

Specify which data from the source should be processed.

For Azure Data Lake:

{
  "data_source_parameters": {
    "Folders": [
      "/documentation/product",
      "/documentation/api"
    ]
  }
}

For SharePoint Online:

{
  "data_source_parameters": {
    "DocumentLibraries": [
      "Shared Documents",
      "Product Documentation"
    ]
  }
}

File Filters (optional):

File extensions to include/exclude
Date ranges
Metadata filters
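
To make the effect of file filters concrete, the sketch below models extension and date-range filtering in plain Python. The FileFilter class and its field names are assumptions made for illustration; they do not reflect the platform's actual filter schema, and metadata filters are left out.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class FileFilter:
    """Hypothetical filter settings; not the platform's actual schema."""
    include_extensions: Optional[set] = None      # None means "all extensions"
    exclude_extensions: frozenset = frozenset()   # always rejected
    modified_after: Optional[datetime] = None     # inclusive lower bound
    modified_before: Optional[datetime] = None    # inclusive upper bound

    def matches(self, path: str, modified: datetime) -> bool:
        """Return True if the file passes every configured filter."""
        ext = PurePosixPath(path).suffix.lower()
        if ext in self.exclude_extensions:
            return False
        if self.include_extensions is not None and ext not in self.include_extensions:
            return False
        if self.modified_after is not None and modified < self.modified_after:
            return False
        if self.modified_before is not None and modified > self.modified_before:
            return False
        return True
```

Exclusions are checked first in this sketch, so an extension listed in both sets is rejected.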

Step 5: Link Knowledge Unit

Connect to the knowledge unit that defines embedding and indexing settings.

Knowledge Unit Object ID: Reference to a knowledge unit resource

Format: instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/{knowledgeUnitName}
Example: instances/default/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/company-docs-unit

This knowledge unit determines:

Which embedding model to use
What embedding dimensions to generate
Which vector database stores the results
What index partition to use

Step 6: Configure Processing Options

Refresh Strategy:

Manual: Run pipeline manually when needed
Scheduled: Run on a defined schedule (e.g., daily)
Incremental: Only process new/changed files

Processing Options:

Content Safety: Enable content filtering
Knowledge Graph: Extract structured relationships
Custom Stages: Additional processing steps

Step 7: Example Configuration

{
  "name": "product-documentation",
  "display_name": "Product Documentation",
  "description": "Technical documentation for products from SharePoint",
  "data_source_object_id": "instances/default/providers/FoundationaLLM.DataSource/dataSources/company-sharepoint",
  "data_source_parameters": {
    "DocumentLibraries": [
      "Product Documentation"
    ]
  },
  "knowledge_unit_object_id": "instances/default/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/product-docs-unit",
  "refresh_strategy": "scheduled",
  "schedule": {
    "frequency": "daily",
    "time": "02:00"
  }
}
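
A configuration like the one above can be sanity-checked before it is submitted. The following sketch assumes which fields are required and which refresh strategy spellings are accepted; only "scheduled" appears in the example, so the other values and the validate_knowledge_source function itself are hypothetical.

```python
import json

# Assumption: these three fields are mandatory. The platform may require others.
REQUIRED_FIELDS = ("name", "data_source_object_id", "knowledge_unit_object_id")
# Only "scheduled" appears in the example configuration; the other spellings are assumed.
REFRESH_STRATEGIES = ("manual", "scheduled", "incremental")


def validate_knowledge_source(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not config.get(field):
            problems.append(f"missing required field: {field}")
    # Assumption: a configuration without a refresh strategy is treated as manual.
    strategy = config.get("refresh_strategy", "manual")
    if strategy not in REFRESH_STRATEGIES:
        problems.append(f"unknown refresh_strategy: {strategy}")
    if strategy == "scheduled" and "schedule" not in config:
        problems.append("refresh_strategy is 'scheduled' but no schedule is given")
    return problems


if __name__ == "__main__":
    draft = json.loads('{"name": "product-documentation", "refresh_strategy": "scheduled"}')
    for problem in validate_knowledge_source(draft):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets all issues be reported in a single pass.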

Step 8: Review and Create

Verify data source is accessible
Confirm knowledge unit is properly configured
Review data selection parameters
Check processing options
Click Create

Using Knowledge Sources with Data Pipelines

After creating a knowledge source, you can:

Create a Data Pipeline

Navigate to Data → Data Pipelines
Create a new pipeline
Reference the knowledge source in the data source plugin parameters
Use the same knowledge unit in embedding and indexing stages

Run the Pipeline

Manually trigger the pipeline
Or configure a trigger on the pipeline (manual or scheduled)
Monitor progress in Data Pipeline Runs

Best Practices

Naming and Organization

Naming Conventions:

Use descriptive, purpose-driven names
Include data type or source: sharepoint-product-docs
Add environment prefix if needed: prod-customer-support

Organization:

Group related knowledge sources
Document dependencies between sources
Maintain consistent naming across data sources, knowledge units, and knowledge sources

Data Selection

Start Small:

Begin with a subset of data
Verify processing works correctly
Scale up gradually

File Filtering:

Exclude irrelevant file types
Filter by date to process recent data first
Use metadata to select relevant content

Incremental Processing:

Enable incremental updates for large datasets
Only process changed files to save resources
Track last processed timestamp
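
The idea behind incremental processing can be sketched as a watermark comparison: keep the timestamp of the last processed change, and select only files modified after it. This is an illustration of the general technique under the assumption that each file exposes a last-modified time; it is not a description of how the platform tracks changes, and select_changed is a hypothetical helper.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple


def select_changed(
    files: Dict[str, datetime], last_processed: Optional[datetime]
) -> Tuple[List[str], Optional[datetime]]:
    """Return the paths modified after the watermark, plus the new watermark.

    `files` maps each path to its last-modified time.
    """
    if last_processed is None:
        # First run: nothing has been processed yet, so take everything.
        changed = sorted(files)
    else:
        changed = sorted(p for p, m in files.items() if m > last_processed)

    # Advance the watermark to the newest modification time seen so far.
    candidates = list(files.values())
    if last_processed is not None:
        candidates.append(last_processed)
    new_watermark = max(candidates) if candidates else None
    return changed, new_watermark
```

Persisting the returned watermark only after a run succeeds avoids skipping files when a run fails partway through.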

Refresh Strategy

Scheduled Refresh:

Daily: For regularly updated content
Weekly: For less frequently changing data
Hourly: For near-real-time requirements (event triggers are not yet available, so use a frequent schedule instead)
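
When reasoning about a daily schedule such as the "frequency": "daily", "time": "02:00" pair in the example configuration, it can help to compute when the next run would fall. The sketch below assumes the time is an HH:MM string in the same time zone as the supplied clock; next_daily_run is a hypothetical helper and says nothing about how the platform's scheduler works internally.

```python
from datetime import datetime, timedelta


def next_daily_run(now: datetime, time_of_day: str) -> datetime:
    """Return the next occurrence of HH:MM strictly after `now`."""
    hour, minute = (int(part) for part in time_of_day.split(":"))
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now), so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    print(next_daily_run(datetime.now(), "02:00"))
```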

Manual Refresh:

For one-time data loads
During development and testing
For ad-hoc data updates

Knowledge Unit Alignment

Consistency:

Use the same knowledge unit for related knowledge sources
Ensures consistent embedding and search quality
Simplifies management

Separation:

Use different knowledge units for different data types
Allows different embedding models or dimensions
Enables different vector databases

Common Scenarios

Scenario 1: Company Documentation

Single knowledge source for all company docs:

Data Source: SharePoint Online (multiple libraries)
Knowledge Unit: Standard 2048-dimension embeddings
Refresh: Nightly (daily schedule)
Use Case: General company knowledge search

Scenario 2: Multi-Source Product Documentation

Multiple knowledge sources, same knowledge unit:

Source 1: SharePoint (user guides)
Source 2: Azure Data Lake (API docs)
Knowledge Unit: Shared product-docs unit
Use Case: Unified product documentation search

Scenario 3: Departmental Knowledge Bases

Separate knowledge sources per department:

HR Docs: HR SharePoint, hr-knowledge unit
Engineering Docs: Confluence export, eng-knowledge unit
Sales Docs: Sales SharePoint, sales-knowledge unit
Use Case: Department-specific search with different configurations

Scenario 4: Multi-Tenant Application

Knowledge sources per tenant:

Same vector database, different partitions (via knowledge units)
Separate data sources per tenant
Tenant isolation through knowledge source + knowledge unit pairing
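
The pairing described above can be sketched as one knowledge source definition per tenant, each pointing at its own data source and knowledge unit. The resource names below (such as the "-documents", "-data", and "-unit" suffixes) are invented for illustration, and both functions are hypothetical helpers rather than platform features.

```python
def tenant_knowledge_source(tenant: str, instance_id: str = "default") -> dict:
    """Build a knowledge source definition for one tenant (names are hypothetical)."""
    prefix = f"instances/{instance_id}/providers"
    return {
        "name": f"{tenant}-documents",
        "data_source_object_id": (
            f"{prefix}/FoundationaLLM.DataSource/dataSources/{tenant}-data"
        ),
        "knowledge_unit_object_id": (
            f"{prefix}/FoundationaLLM.KnowledgeUnit/knowledgeUnits/{tenant}-unit"
        ),
    }


def assert_isolated(sources: list) -> None:
    """Raise ValueError if two knowledge sources share a data source or knowledge unit."""
    for key in ("data_source_object_id", "knowledge_unit_object_id"):
        values = [source[key] for source in sources]
        if len(values) != len(set(values)):
            raise ValueError(f"tenants share a {key}")


if __name__ == "__main__":
    sources = [tenant_knowledge_source(t) for t in ("contoso", "fabrikam")]
    assert_isolated(sources)
    for source in sources:
        print(source["name"], "->", source["knowledge_unit_object_id"])
```

A check like assert_isolated is useful in deployment scripts, where a copy-and-paste mistake could otherwise point two tenants at the same knowledge unit.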

Troubleshooting

Knowledge Source Creation Fails

Problem: Unable to create knowledge source

Solutions:

Verify you have required permissions
Check that data source reference is valid
Ensure knowledge unit reference is valid
Verify the name is unique

Data Source Not Accessible

Problem: Pipeline fails to read from data source

Solutions:

Verify data source configuration and credentials
Check network connectivity
Ensure permissions are granted to access the data
Test data source connection independently

No Data Processed

Problem: Pipeline runs but no data is indexed

Solutions:

Verify data selection parameters (folders, libraries)
Check that files exist in specified locations
Review file type filters
Check pipeline logs for errors

Incorrect Knowledge Unit

Problem: Data indexed with wrong embeddings or to wrong database

Solutions:

Verify knowledge unit object ID is correct
Check knowledge unit configuration
Ensure knowledge unit references correct vector database
Update knowledge source with correct knowledge unit

Performance Issues

Problem: Pipeline takes too long to process

Solutions:

Enable incremental processing
Reduce scope of data selection
Optimize knowledge unit embedding dimensions
Schedule during off-peak hours
