Creating Knowledge Sources

Learn how to create and configure Knowledge Sources that connect data pipelines to source data and knowledge units.

Overview

A Knowledge Source is a resource that represents a logical collection of data to be processed by data pipelines. It connects:

  • Data sources (where the raw data comes from)
  • Knowledge units (how the data should be embedded and indexed)
  • Data pipelines (how to process the data)

Knowledge Sources provide a high-level abstraction for managing the entire knowledge ingestion workflow.

Prerequisites

  • Access to FoundationaLLM Management Portal
  • Required permissions: FoundationaLLM.KnowledgeSource/knowledgeSources/write
  • At least one Data Source configured (Azure Data Lake, SharePoint, etc.)
  • At least one Knowledge Unit configured (see Creating Knowledge Units)
  • Understanding of your data pipeline requirements

What is a Knowledge Source?

A Knowledge Source is a resource that:

  • Links raw data sources to processed knowledge units
  • Defines what data should be processed
  • Specifies how data should be transformed (via data pipelines)
  • Tracks the relationship between source data and indexed content
  • Enables end-to-end knowledge management

Relationship:

Data Source → Knowledge Source → Data Pipeline → Knowledge Unit → Vector Database
     ↓              ↓                  ↓                ↓               ↓
  Raw Files   Configuration      Processing      Embeddings        Storage

Creating a Knowledge Source

Step 1: Navigate to Knowledge Sources

  1. Open the FoundationaLLM Management Portal
  2. Navigate to Data > Knowledge Sources
  3. Click Create New Knowledge Source

Step 2: Configure Basic Settings

Name: A unique identifier for the knowledge source

  • Use descriptive names (e.g., product-documentation, support-articles)
  • Follow naming convention: lowercase with hyphens
  • Example: company-internal-docs

Display Name: Human-readable name shown in the UI

  • Example: "Company Internal Documentation"
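
The naming convention above can be checked before you fill in the form. The following is a minimal Python sketch: the regular expression and both helper names (`is_valid_name`, `to_resource_name`) are assumptions made for illustration, and the portal's own validation rules may be stricter or looser.

```python
import re

# Assumed pattern: lowercase letters and digits, joined by single hyphens.
# This mirrors the convention described above, not a documented platform rule.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_name(name: str) -> bool:
    """Return True if the name is lowercase words joined by hyphens."""
    return NAME_PATTERN.fullmatch(name) is not None


def to_resource_name(display_name: str) -> str:
    """Derive a candidate name from a display name (hypothetical helper)."""
    # Replace every run of characters outside a-z and 0-9 with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", display_name.lower())
    return slug.strip("-")
```

For example, `to_resource_name("Company Internal Documentation")` yields `company-internal-documentation`, which passes `is_valid_name`.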

Description: Purpose and scope of this knowledge source

  • Document what data it includes
  • Note any specific requirements or constraints
  • Example: "Internal company documentation from SharePoint, processed for semantic search"

Step 3: Select Data Source

Choose where the raw data comes from.

Data Source Types:

  • Azure Data Lake: Files stored in Azure Data Lake Storage Gen2
  • SharePoint Online: Documents from SharePoint document libraries
  • Context File: Specific files provided via API
  • Other: Custom data source plugins

Data Source Object ID: Reference to a data source resource

  • Format: instances/{instanceId}/providers/FoundationaLLM.DataSource/dataSources/{dataSourceName}
  • Example: instances/default/providers/FoundationaLLM.DataSource/dataSources/company-sharepoint

See the Data Sources section for details on configuring specific data source types.
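
Object IDs follow a fixed path shape, so they are easy to build and check programmatically. The sketch below is illustrative only: `build_object_id` and `parse_object_id` are hypothetical helpers, and the assumption that each segment may contain any character except `/` comes from the format shown above rather than from a documented rule.

```python
import re

# Segment layout taken from the format shown above:
# instances/{instanceId}/providers/{provider}/{resourceType}/{resourceName}
OBJECT_ID_PATTERN = re.compile(
    r"instances/(?P<instance_id>[^/]+)"
    r"/providers/(?P<provider>[^/]+)"
    r"/(?P<resource_type>[^/]+)"
    r"/(?P<name>[^/]+)"
)


def build_object_id(instance_id: str, provider: str, resource_type: str, name: str) -> str:
    """Assemble an object ID from its four segments."""
    return f"instances/{instance_id}/providers/{provider}/{resource_type}/{name}"


def parse_object_id(object_id: str) -> dict:
    """Split an object ID into its segments, or raise ValueError if malformed."""
    match = OBJECT_ID_PATTERN.fullmatch(object_id)
    if match is None:
        raise ValueError(f"not a valid object ID: {object_id!r}")
    return match.groupdict()


data_source_id = build_object_id(
    "default", "FoundationaLLM.DataSource", "dataSources", "company-sharepoint"
)
```

Because the provider and resource type are parameters, the same helpers also cover knowledge unit references, which use the same shape with a different provider.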

Step 4: Configure Data Selection

Specify which data from the source should be processed.

For Azure Data Lake:

{
  "data_source_parameters": {
    "Folders": [
      "/documentation/product",
      "/documentation/api"
    ]
  }
}

For SharePoint Online:

{
  "data_source_parameters": {
    "DocumentLibraries": [
      "Shared Documents",
      "Product Documentation"
    ]
  }
}

File Filters (optional):

  • File extensions to include/exclude
  • Date ranges
  • Metadata filters
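
To make the effect of these filters concrete, the sketch below applies the three filter kinds to a single file in plain Python. It is not the platform's filter syntax: the function name, parameter names, and matching rules (case-insensitive extensions, exact metadata matches) are all assumptions for illustration.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import Dict, Optional, Set


def matches_filters(
    path: str,
    modified: datetime,
    metadata: Dict[str, str],
    include_extensions: Optional[Set[str]] = None,
    exclude_extensions: Optional[Set[str]] = None,
    modified_after: Optional[datetime] = None,
    required_metadata: Optional[Dict[str, str]] = None,
) -> bool:
    """Return True if a file passes every filter that was supplied."""
    extension = PurePosixPath(path).suffix.lower()
    # Extension filters: an include set is an allow-list, an exclude set a deny-list.
    if include_extensions is not None and extension not in include_extensions:
        return False
    if exclude_extensions is not None and extension in exclude_extensions:
        return False
    # Date filter: keep only files modified at or after the cutoff.
    if modified_after is not None and modified < modified_after:
        return False
    # Metadata filter: every required key must be present with the same value.
    for key, value in (required_metadata or {}).items():
        if metadata.get(key) != value:
            return False
    return True
```

A filter left as `None` is treated as "not configured", so a call with no filters accepts every file.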

Step 5: Select Knowledge Unit

Connect to the knowledge unit that defines embedding and indexing settings.

Knowledge Unit Object ID: Reference to a knowledge unit resource

  • Format: instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/{knowledgeUnitName}
  • Example: instances/default/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/company-docs-unit

This knowledge unit determines:

  • Which embedding model to use
  • What embedding dimensions to generate
  • Which vector database stores the results
  • What index partition to use

Step 6: Configure Processing Options

Refresh Strategy:

  • Manual: Run pipeline manually when needed
  • Scheduled: Run on a defined schedule (e.g., daily)
  • Incremental: Only process new/changed files

Processing Options:

  • Content Safety: Enable content filtering
  • Knowledge Graph: Extract structured relationships
  • Custom Stages: Additional processing steps

Step 7: Example Configuration

{
  "name": "product-documentation",
  "display_name": "Product Documentation",
  "description": "Technical documentation for products from SharePoint",
  "data_source_object_id": "instances/default/providers/FoundationaLLM.DataSource/dataSources/company-sharepoint",
  "data_source_parameters": {
    "DocumentLibraries": [
      "Product Documentation"
    ]
  },
  "knowledge_unit_object_id": "instances/default/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/product-docs-unit",
  "refresh_strategy": "scheduled",
  "schedule": {
    "frequency": "daily",
    "time": "02:00"
  }
}
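
A configuration like the one above can be sanity-checked locally before it is submitted. The following Python sketch uses the field names from the example; the choice of required fields and the `HH:MM` time check are assumptions, not documented platform validation, and `validate_config` is a hypothetical helper.

```python
import json
from datetime import datetime
from typing import List

# Assumed to be mandatory for illustration; the platform may require more or fewer.
REQUIRED_KEYS = ("name", "data_source_object_id", "knowledge_unit_object_id")


def validate_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in REQUIRED_KEYS:
        if not config.get(key):
            problems.append(f"missing required field: {key}")
    if config.get("refresh_strategy") == "scheduled":
        schedule = config.get("schedule") or {}
        if "frequency" not in schedule:
            problems.append("scheduled refresh needs schedule.frequency")
        try:
            # Assumes a 24-hour HH:MM time, as in the example above.
            datetime.strptime(schedule.get("time", ""), "%H:%M")
        except (TypeError, ValueError):
            problems.append("schedule.time must use HH:MM format")
    return problems


example = json.loads("""
{
  "name": "product-documentation",
  "data_source_object_id": "instances/default/providers/FoundationaLLM.DataSource/dataSources/company-sharepoint",
  "knowledge_unit_object_id": "instances/default/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/product-docs-unit",
  "refresh_strategy": "scheduled",
  "schedule": {"frequency": "daily", "time": "02:00"}
}
""")
print(validate_config(example))  # prints []
```

Returning a list of problems rather than raising on the first one lets you report every issue in a single pass.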

Step 8: Review and Create

  1. Verify data source is accessible
  2. Confirm knowledge unit is properly configured
  3. Review data selection parameters
  4. Check processing options
  5. Click Create

Using Knowledge Sources with Data Pipelines

After creating a knowledge source, you can:

Create a Data Pipeline

  1. Navigate to Data > Data Pipelines
  2. Create a new pipeline
  3. Reference the knowledge source in the data source plugin parameters
  4. Use the same knowledge unit in embedding and indexing stages

Run the Pipeline

  1. Trigger the pipeline manually, or configure a scheduled trigger to run it automatically
  2. Monitor progress in Data Pipeline Runs

Best Practices

Naming and Organization

Naming Conventions:

  • Use descriptive, purpose-driven names
  • Include data type or source: sharepoint-product-docs
  • Add environment prefix if needed: prod-customer-support

Organization:

  • Group related knowledge sources
  • Document dependencies between sources
  • Maintain consistent naming across data sources, knowledge units, and knowledge sources

Data Selection

Start Small:

  • Begin with a subset of data
  • Verify processing works correctly
  • Scale up gradually

File Filtering:

  • Exclude irrelevant file types
  • Filter by date to process recent data first
  • Use metadata to select relevant content

Incremental Processing:

  • Enable incremental updates for large datasets
  • Only process changed files to save resources
  • Track last processed timestamp
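
The idea behind incremental processing can be sketched as a comparison against a stored timestamp. The Python below is an illustration under stated assumptions, not the platform's implementation: `select_changed_files` is a hypothetical helper, and it assumes each file exposes a last-modified time.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


def select_changed_files(
    files: Dict[str, datetime],
    last_processed: Optional[datetime],
) -> Tuple[List[str], Optional[datetime]]:
    """Pick files modified after the last run and compute the new timestamp.

    files maps a file path to its last-modified time. A last_processed value
    of None means no previous run, so every file is selected.
    """
    if last_processed is None:
        changed = sorted(files)
    else:
        changed = sorted(p for p, modified in files.items() if modified > last_processed)
    # Advance the timestamp to the newest file just selected; keep it if nothing changed.
    newest = max((files[p] for p in changed), default=last_processed)
    return changed, newest


files = {
    "/documentation/api/auth.md": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "/documentation/api/errors.md": datetime(2024, 5, 3, tzinfo=timezone.utc),
}
first_run, watermark = select_changed_files(files, None)

files["/documentation/api/auth.md"] = datetime(2024, 5, 4, tzinfo=timezone.utc)
second_run, watermark = select_changed_files(files, watermark)
```

Here the first run selects both files, and the second run selects only `auth.md`, because it is the only file modified after the stored timestamp.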

Refresh Strategy

Scheduled Refresh:

  • Daily: For regularly updated content
  • Weekly: For less frequently changing data
  • Hourly: For near-real-time requirements (event triggers are not yet available, so use a frequent schedule instead)

Manual Refresh:

  • For one-time data loads
  • During development and testing
  • For ad-hoc data updates

Knowledge Unit Alignment

Consistency:

  • Use the same knowledge unit for related knowledge sources
  • Ensures consistent embedding and search quality
  • Simplifies management

Separation:

  • Use different knowledge units for different data types
  • Allows different embedding models or dimensions
  • Enables different vector databases

Common Scenarios

Scenario 1: Company Documentation

Single knowledge source for all company docs:

  • Data Source: SharePoint Online (multiple libraries)
  • Knowledge Unit: Standard 2048-dimension embeddings
  • Refresh: Daily at night
  • Use Case: General company knowledge search

Scenario 2: Multi-Source Product Documentation

Multiple knowledge sources, same knowledge unit:

  • Source 1: SharePoint (user guides)
  • Source 2: Azure Data Lake (API docs)
  • Knowledge Unit: Shared product-docs unit
  • Use Case: Unified product documentation search

Scenario 3: Departmental Knowledge Bases

Separate knowledge sources per department:

  • HR Docs: HR SharePoint, hr-knowledge unit
  • Engineering Docs: Confluence export, eng-knowledge unit
  • Sales Docs: Sales SharePoint, sales-knowledge unit
  • Use Case: Department-specific search with different configurations

Scenario 4: Multi-Tenant Application

Knowledge sources per tenant:

  • Same vector database, different partitions (via knowledge units)
  • Separate data sources per tenant
  • Tenant isolation through knowledge source + knowledge unit pairing
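
The per-tenant pairing can be sketched as a small generator plus an isolation check. Everything tenant-specific here is hypothetical: the resource names (`{tenant}-docs`, `{tenant}-datalake`, `{tenant}-unit`) and both helper functions are assumptions, and only the field names and object ID shape come from the examples earlier in this guide.

```python
from typing import Dict, List


def tenant_knowledge_source(tenant: str, instance_id: str = "default") -> Dict[str, str]:
    """Build a knowledge source config that pairs one tenant's data source
    with that tenant's own knowledge unit (hypothetical naming scheme)."""
    prefix = f"instances/{instance_id}/providers"
    return {
        "name": f"{tenant}-docs",
        "data_source_object_id": f"{prefix}/FoundationaLLM.DataSource/dataSources/{tenant}-datalake",
        "knowledge_unit_object_id": f"{prefix}/FoundationaLLM.KnowledgeUnit/knowledgeUnits/{tenant}-unit",
    }


def shared_references(configs: List[Dict[str, str]]) -> List[str]:
    """Return object IDs referenced by more than one config.

    An empty result means no two tenants share a data source or knowledge unit.
    """
    seen = set()
    shared = []
    for config in configs:
        for key in ("data_source_object_id", "knowledge_unit_object_id"):
            object_id = config[key]
            if object_id in seen and object_id not in shared:
                shared.append(object_id)
            seen.add(object_id)
    return shared


configs = [tenant_knowledge_source(t) for t in ("contoso", "fabrikam")]
```

Running `shared_references(configs)` on the generated list returns an empty list, since each tenant gets its own data source and knowledge unit.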

Troubleshooting

Knowledge Source Creation Fails

Problem: Unable to create knowledge source

Solutions:

  • Verify you have required permissions
  • Check that data source reference is valid
  • Ensure knowledge unit reference is valid
  • Verify the name is unique

Data Source Not Accessible

Problem: Pipeline fails to read from data source

Solutions:

  • Verify data source configuration and credentials
  • Check network connectivity
  • Ensure permissions are granted to access the data
  • Test data source connection independently

No Data Processed

Problem: Pipeline runs but no data is indexed

Solutions:

  • Verify data selection parameters (folders, libraries)
  • Check that files exist in specified locations
  • Review file type filters
  • Check pipeline logs for errors

Incorrect Knowledge Unit

Problem: Data indexed with wrong embeddings or to wrong database

Solutions:

  • Verify knowledge unit object ID is correct
  • Check knowledge unit configuration
  • Ensure knowledge unit references correct vector database
  • Update knowledge source with correct knowledge unit

Performance Issues

Problem: Pipeline takes too long to process

Solutions:

  • Enable incremental processing
  • Reduce scope of data selection
  • Optimize knowledge unit embedding dimensions
  • Schedule during off-peak hours