Data Pipeline Concepts
This document provides visual representations of the core concepts related to data pipelines, knowledge sources, and knowledge units in FoundationaLLM.
Knowledge Source and Knowledge Unit Relationship
A Knowledge Source represents a collection of related data, while Knowledge Units are the actual indexed data that can be queried. Each Knowledge Source can have multiple Knowledge Units associated with it.
graph LR
KS[Knowledge Source<br/>Contains metadata and<br/>references to knowledge units]
KU1[Knowledge Unit 1<br/>Points to vector database<br/>and optional knowledge graph]
KU2[Knowledge Unit 2<br/>Points to vector database<br/>and optional knowledge graph]
KU3[Knowledge Unit 3<br/>Points to vector database<br/>and optional knowledge graph]
VDB1[(Vector Database 1<br/>Stores embeddings)]
VDB2[(Vector Database 2<br/>Stores embeddings)]
VDB3[(Vector Database 3<br/>Stores embeddings)]
KG1[(Knowledge Graph<br/>Vector Database 1)]
KG3[(Knowledge Graph<br/>Vector Database 3)]
KS -->|knowledge_unit_object_ids| KU1
KS -->|knowledge_unit_object_ids| KU2
KS -->|knowledge_unit_object_ids| KU3
KU1 -->|vector_database_object_id| VDB1
KU2 -->|vector_database_object_id| VDB2
KU3 -->|vector_database_object_id| VDB3
KU1 -->|knowledge_graph_vector_database_object_id| KG1
KU3 -->|knowledge_graph_vector_database_object_id| KG3
style KS fill:#e1f5ff
style KU1 fill:#fff4e1
style KU2 fill:#fff4e1
style KU3 fill:#fff4e1
style VDB1 fill:#e8f5e9
style VDB2 fill:#e8f5e9
style VDB3 fill:#e8f5e9
style KG1 fill:#f3e5f5
style KG3 fill:#f3e5f5
Key Points:
- Each Knowledge Source contains a list of
knowledge_unit_object_idsreferencing its Knowledge Units - Each Knowledge Unit has a required
vector_database_object_idpointing to a vector database for semantic search - Each Knowledge Unit can optionally have a
knowledge_graph_vector_database_object_idfor graph-based queries - Knowledge Units also have an optional
vector_store_idwithin the vector database
Data Pipeline Structure
A Data Pipeline consists of three main components: Data Source, Stages (structured as a forest), and Triggers.
graph TD
DP[Data Pipeline Definition<br/>Configuration and metadata]
DS[Data Pipeline Data Source<br/>Plugin-based data ingestion]
DSR[(Data Source Resource<br/>e.g., Azure Data Lake,<br/>SharePoint Online)]
DSP[Data Source Plugin<br/>Implements data reading]
SS1[Starting Stage 1<br/>e.g., Text Extraction]
SS2[Starting Stage 2<br/>e.g., Image Processing]
S1[Stage: Partition<br/>Split text into chunks]
S2[Stage: Embed<br/>Generate embeddings]
S3[Stage: Index<br/>Store in vector DB]
S4[Stage: Alternative Path<br/>Different processing]
P1[Stage Plugin<br/>Implements logic]
P2[Dependency Plugin<br/>Helper functionality]
T1[Trigger: Schedule<br/>Cron-based execution]
T2[Trigger: Event<br/>File change detection]
T3[Trigger: Manual<br/>User-initiated]
DP --> DS
DP --> SS1
DP --> SS2
DP --> T1
DP --> T2
DP --> T3
DS -->|data_source_object_id| DSR
DS -->|plugin_object_id| DSP
SS1 -->|next_stages| S1
SS2 -->|next_stages| S4
S1 -->|next_stages| S2
S2 -->|next_stages| S3
S1 -->|plugin_object_id| P1
S1 -->|plugin_dependencies| P2
T1 -.->|provides parameter_values| DS
T1 -.->|provides parameter_values| S1
T1 -.->|provides parameter_values| S2
T1 -.->|provides parameter_values| S3
style DP fill:#e1f5ff
style DS fill:#fff4e1
style DSR fill:#e8f5e9
style DSP fill:#f3e5f5
style SS1 fill:#fff4e1
style SS2 fill:#fff4e1
style S1 fill:#fff4e1
style S2 fill:#fff4e1
style S3 fill:#fff4e1
style S4 fill:#fff4e1
style P1 fill:#f3e5f5
style P2 fill:#f3e5f5
style T1 fill:#ffebee
style T2 fill:#ffebee
style T3 fill:#ffebee
Key Points:
- Data Source: References a FoundationaLLM DataSource resource and uses a plugin to read data
- Stages: Organized as a forest (multiple starting points), each stage has a plugin and optional dependencies
- Triggers: Schedule, Event, or Manual triggers that provide parameter values to execute the pipeline
- Plugins: Both data sources and stages use plugins for implementation, with configurable parameters
Data Pipeline Execution Flow
This diagram shows the lifecycle of a data pipeline execution from trigger to completion.
sequenceDiagram
participant T as Trigger<br/>(Schedule/Event/Manual)
participant API as DataPipelineAPI
participant Engine as DataPipelineEngine
participant Runner as DataPipelineRunner
participant State as DataPipelineStateService
participant Queue as Azure Storage Queue
participant Worker as DataPipelineWorker
participant Stage as DataPipelineStageRunner
participant Plugin as Stage Plugin
participant VDB as Vector Database
T->>API: Trigger pipeline execution<br/>(with parameter values)
API->>Engine: Create DataPipelineRun
Engine->>State: Initialize run state
State-->>Engine: State initialized
Engine->>Runner: InitializeNew(run, contentItems)
Runner->>State: InitializeDataPipelineRunState
State-->>Runner: State created
loop For each starting stage
Runner->>Stage: CreateStageRunner(stage, workItems)
Stage->>Queue: Enqueue work items
end
Runner-->>Engine: Pipeline run initialized
Engine-->>API: Run ID returned
loop Processing work items
Worker->>Queue: Dequeue work item
Queue-->>Worker: Work item message
Worker->>Stage: Process work item
Stage->>Plugin: Execute plugin logic
Plugin->>Plugin: Transform content<br/>(extract/partition/embed/index)
alt If indexing stage
Plugin->>VDB: Store embeddings
VDB-->>Plugin: Stored successfully
end
Plugin-->>Stage: Processing complete
Stage->>State: Update content item state
alt If stage has next stages
Stage->>Queue: Enqueue items for next stage
end
Stage-->>Worker: Work item complete
end
Worker->>State: Check pipeline completion
State-->>Worker: All stages complete
Worker->>Engine: Mark run as completed
Key Points:
- Triggers initiate pipeline runs with parameter values
- Engine creates a DataPipelineRun and initializes state
- Runner creates stage runners for starting stages
- Work items are enqueued to Azure Storage Queues
- Workers dequeue and process items through stage plugins
- State service tracks progress of content items through stages
- Upon completion, items may be enqueued for subsequent stages
- Pipeline completes when all stages have processed all items
Data Pipeline and Knowledge Unit Integration
This diagram shows how data pipelines populate knowledge units that are referenced by knowledge sources.
graph TB
subgraph "Data Sources"
ADL[(Azure Data Lake<br/>Raw documents)]
SPO[(SharePoint Online<br/>Documents and files)]
PS[(Private Storage<br/>Custom data)]
end
subgraph "Data Pipeline Execution"
DP[Data Pipeline Definition<br/>Extract → Partition → Embed → Index]
DS[Data Source Stage<br/>Read files from storage]
EX[Extraction Stage<br/>Extract text from PDFs, DOCX, etc.]
PA[Partition Stage<br/>Split into chunks]
EM[Embedding Stage<br/>Generate vector embeddings]
IX[Indexing Stage<br/>Store in vector database]
KGE[Knowledge Graph Stages<br/>Optional: Extract entities,<br/>relationships, consolidate]
DP --> DS
DS --> EX
EX --> PA
PA --> EM
EM --> IX
EM --> KGE
end
subgraph "Knowledge Storage"
VDB[(Vector Database<br/>Azure AI Search<br/>Stores embeddings<br/>and metadata)]
KGVDB[(Knowledge Graph<br/>Vector Database<br/>Stores graph embeddings)]
end
subgraph "Knowledge Organization"
KU[Knowledge Unit<br/>Points to vector DB<br/>and optional KG DB]
KS[Knowledge Source<br/>Collection of<br/>knowledge units]
end
subgraph "Agent Access"
Agent[FoundationaLLM Agent<br/>Uses knowledge source<br/>for RAG]
User[User Query]
end
ADL -->|files| DS
SPO -->|files| DS
PS -->|files| DS
IX -->|writes embeddings| VDB
KGE -->|writes graph data| KGVDB
VDB -.->|referenced by<br/>vector_database_object_id| KU
KGVDB -.->|referenced by<br/>knowledge_graph_vector_database_object_id| KU
KU -->|knowledge_unit_object_ids| KS
KS -->|configured in agent| Agent
User -->|queries| Agent
Agent -->|searches| VDB
Agent -->|queries| KGVDB
style ADL fill:#e8f5e9
style SPO fill:#e8f5e9
style PS fill:#e8f5e9
style DP fill:#e1f5ff
style DS fill:#fff4e1
style EX fill:#fff4e1
style PA fill:#fff4e1
style EM fill:#fff4e1
style IX fill:#fff4e1
style KGE fill:#fff4e1
style VDB fill:#e8f5e9
style KGVDB fill:#f3e5f5
style KU fill:#fff4e1
style KS fill:#e1f5ff
style Agent fill:#ffebee
style User fill:#f5f5f5
Key Points:
- Data pipelines read from various data sources (Azure Data Lake, SharePoint, etc.)
- Pipeline stages transform data: Extract → Partition → Embed → Index
- Optional knowledge graph stages can extract entities and relationships
- Embeddings are stored in vector databases (e.g., Azure AI Search)
- Knowledge graph data is stored in separate graph vector databases
- Knowledge Units reference the vector databases where data is stored
- Knowledge Sources aggregate multiple Knowledge Units
- Agents use Knowledge Sources for Retrieval-Augmented Generation (RAG)
- Users query agents, which in turn search the vector databases
Complete End-to-End Flow
This diagram provides a comprehensive view of how data flows from source systems to user queries.
flowchart TD
Start([Raw Data in Storage])
subgraph "Phase 1: Pipeline Definition"
DefineDP[Define Data Pipeline<br/>- Configure data source<br/>- Define stages<br/>- Set up triggers]
DefineDS[Configure Data Source<br/>- Azure Data Lake<br/>- SharePoint Online<br/>- Private Storage]
DefineKU[Create Knowledge Unit<br/>- Set vector DB reference<br/>- Optional: KG DB reference]
DefineKS[Create Knowledge Source<br/>- Add knowledge unit refs]
end
subgraph "Phase 2: Pipeline Execution"
Trigger[Trigger Fires<br/>Schedule/Event/Manual]
Execute[Execute Pipeline<br/>Process content items]
Stage1[Stage 1: Extract Text<br/>PDF, DOCX, XLSX, Images]
Stage2[Stage 2: Partition<br/>Token or Semantic chunking]
Stage3[Stage 3: Embed<br/>Generate vector embeddings]
Stage4[Stage 4: Index<br/>Store in vector database]
KGStage[Optional KG Stages<br/>Extract, Consolidate,<br/>Summarize, Embed, Index]
end
subgraph "Phase 3: Data Storage"
VectorDB[(Vector Database<br/>Searchable embeddings)]
GraphDB[(Knowledge Graph DB<br/>Entity relationships)]
end
subgraph "Phase 4: Agent Configuration"
ConfigAgent[Configure Agent<br/>with Knowledge Source]
Agent[Agent Ready<br/>for User Queries]
end
subgraph "Phase 5: User Interaction"
UserQuery[User Submits Query]
Search[Agent Searches<br/>Vector DB & KG]
Retrieve[Retrieve Relevant<br/>Documents & Entities]
Generate[Generate Response<br/>with LLM]
Response([Return Answer to User])
end
Start --> DefineDP
DefineDP --> DefineDS
DefineDS --> DefineKU
DefineKU --> DefineKS
DefineKS --> Trigger
Trigger --> Execute
Execute --> Stage1
Stage1 --> Stage2
Stage2 --> Stage3
Stage3 --> Stage4
Stage3 --> KGStage
Stage4 --> VectorDB
KGStage --> GraphDB
VectorDB -.->|referenced by| DefineKU
GraphDB -.->|referenced by| DefineKU
DefineKS --> ConfigAgent
ConfigAgent --> Agent
Agent --> UserQuery
UserQuery --> Search
Search --> Retrieve
Retrieve --> Generate
Generate --> Response
style Start fill:#e8f5e9
style DefineDP fill:#e1f5ff
style DefineDS fill:#e1f5ff
style DefineKU fill:#fff4e1
style DefineKS fill:#fff4e1
style Trigger fill:#ffebee
style Execute fill:#ffebee
style Stage1 fill:#fff4e1
style Stage2 fill:#fff4e1
style Stage3 fill:#fff4e1
style Stage4 fill:#fff4e1
style KGStage fill:#f3e5f5
style VectorDB fill:#e8f5e9
style GraphDB fill:#f3e5f5
style ConfigAgent fill:#e1f5ff
style Agent fill:#ffebee
style UserQuery fill:#f5f5f5
style Search fill:#fff4e1
style Retrieve fill:#fff4e1
style Generate fill:#fff4e1
style Response fill:#e8f5e9
Complete Flow Summary:
Phase 1: Setup & Definition
- Define data pipeline with stages and triggers
- Configure data source connections
- Create knowledge units pointing to target databases
- Create knowledge sources aggregating knowledge units
Phase 2: Data Processing
- Pipeline trigger fires (schedule, event, or manual)
- Content items flow through stages: Extract → Partition → Embed → Index
- Optional knowledge graph stages extract entities and relationships
- Each stage transforms data and passes to next stage
Phase 3: Persistent Storage
- Vector embeddings stored in vector databases (Azure AI Search)
- Knowledge graph data stored in graph databases
- Knowledge units reference these storage locations
Phase 4: Agent Setup
- Agents configured with knowledge sources
- Knowledge sources provide access to all associated knowledge units
- Agent ready to serve user queries
Phase 5: Query & Response
- User submits natural language query
- Agent searches vector databases and knowledge graphs
- Relevant documents and entities retrieved
- LLM generates contextual response using retrieved information
- Response returned to user
Key Concepts Summary
| Concept | Description | Purpose |
|---|---|---|
| Knowledge Source | Collection of knowledge units | Aggregates multiple knowledge units for agent use |
| Knowledge Unit | References to indexed data | Points to vector DB and optional knowledge graph |
| Data Pipeline | Automated data processing workflow | Transforms raw data into searchable knowledge |
| Data Source | Source of raw data | Provides input files/documents for processing |
| Pipeline Stage | Single processing step | Performs specific transformation (extract, partition, embed, index) |
| Pipeline Trigger | Execution initiator | Starts pipeline on schedule, event, or manual action |
| Plugin | Pluggable implementation | Provides specific functionality for data sources and stages |
| Vector Database | Embedding storage | Enables semantic search over processed data |
| Knowledge Graph | Entity relationship storage | Enables graph-based queries and reasoning |
Parameter Naming Convention
When triggers provide parameter values to execute pipelines, they follow a specific naming convention:
Data Source Parameters:
DataSource.{DataSourceName}.{ParameterName}- Example:
DataSource.VGDataLake.Folders
- Example:
Stage Parameters:
Stage.{StageName}.{ParameterName}- Example:
Stage.Partition.PartitioningStrategy
- Example:
Stage Dependency Parameters:
Stage.{StageName}.Dependency.{DependencyPluginName}.{ParameterName}- Example:
Stage.Partition.Dependency.TokenContentTextPartitioning.PartitionSizeTokens
- Example:
This convention allows triggers to provide all necessary configuration values for pipeline execution without user interaction.