# Configuring Pipeline Stages

Learn how to configure data pipeline stages to transform and process your data.

## Overview

Pipeline stages define the processing steps that transform raw data into indexed, searchable content. Each stage performs a specific transformation and can be configured with parameters to control its behavior.
## Common Pipeline Stages

### Text Extraction Stage

Extracts text content from various document formats.

**Purpose:** Converts binary files (PDF, DOCX, XLSX, images) into plain text

**Plugin:** `TextExtractionDataPipelineStage`

**Common Parameters:**

- None required for basic operation
- The plugin automatically selects the appropriate extractor based on file type
**Supported Formats:**

- PDF documents
- Microsoft Word (`.docx`)
- Microsoft Excel (`.xlsx`)
- Microsoft PowerPoint (`.pptx`)
- Images (with OCR)
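As a mental model, the automatic extractor selection can be pictured as an extension-to-extractor lookup. This is a sketch only: the extractor names and the `select_extractor` function are illustrative, not part of the FoundationaLLM API; the plugin performs this selection internally.

```python
from pathlib import Path

# Hypothetical mapping of file extensions to extractors, mirroring the
# supported formats listed above. Names are illustrative only.
EXTRACTORS = {
    ".pdf": "pdf-extractor",
    ".docx": "word-extractor",
    ".xlsx": "excel-extractor",
    ".pptx": "powerpoint-extractor",
    ".png": "ocr-extractor",
    ".jpg": "ocr-extractor",
}

def select_extractor(filename: str) -> str:
    """Pick an extractor based on the file extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"Unsupported file format: {suffix}")
    return EXTRACTORS[suffix]
```

Unsupported formats fail fast, which matches the troubleshooting advice later in this document: verify the file format is supported before digging deeper.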
**Example Configuration:**

```json
{
  "name": "Extract",
  "description": "Extract text from documents",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TextExtractionDataPipelineStage",
  "plugin_parameters": null
}
```
### Text Partitioning Stage

Splits extracted text into smaller, manageable chunks.

**Purpose:** Breaks large documents into appropriately sized segments for embedding and retrieval

**Plugin:** `TextPartitioningDataPipelineStage`

**Key Parameters:**

| Parameter | Type | Description | Default | Recommended |
|---|---|---|---|---|
| `PartitioningStrategy` | string | Strategy to use: `Token` or `Semantic` | `Token` | `Token` for most cases |
| `PartitionSizeTokens` | int | Size of each partition in tokens | 400 | 300-500 |
| `PartitionOverlapTokens` | int | Overlap between partitions | 100 | 50-150 |
**Partitioning Strategies:**

**Token-Based**: Splits text based on token count

- Fast and predictable
- Good for general-purpose use
- Consistent chunk sizes

**Semantic**: Splits based on semantic boundaries

- Preserves meaning better
- May produce variable chunk sizes
- Better for complex documents
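The token-based strategy with overlap amounts to a sliding window over the token stream. The following is a minimal sketch, not the plugin's implementation: whitespace splitting stands in for the model-specific tokenizer, and `partition_tokens` is a hypothetical helper.

```python
def partition_tokens(text: str, partition_size: int = 400,
                     overlap: int = 100) -> list[str]:
    """Split text into overlapping token windows (overlap < partition_size)."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = partition_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + partition_size]))
        if start + partition_size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

With the defaults (400-token partitions, 100-token overlap), each chunk repeats the last 100 tokens of its predecessor, so context that straddles a chunk boundary still appears whole in at least one chunk.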
**Example Configuration:**

```json
{
  "name": "Partition",
  "description": "Split text into chunks",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TextPartitioningDataPipelineStage",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "PartitioningStrategy",
        "type": "string"
      },
      "default_value": "Token"
    }
  ],
  "plugin_dependencies": [
    {
      "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TokenContentTextPartitioning",
      "plugin_parameters": [
        {
          "parameter_metadata": {
            "name": "PartitionSizeTokens",
            "type": "int"
          },
          "default_value": 400
        },
        {
          "parameter_metadata": {
            "name": "PartitionOverlapTokens",
            "type": "int"
          },
          "default_value": 100
        }
      ]
    }
  ]
}
```
### Text Embedding Stage

Generates vector embeddings for text chunks.

**Purpose:** Converts text into numerical vectors for semantic search

**Plugin:** `GatewayTextEmbeddingDataPipelineStage`

**Key Parameters:**

| Parameter | Type | Description | Required |
|---|---|---|---|
| `KnowledgeUnitObjectId` | resource-object-id | Reference to the Knowledge Unit that defines embedding settings | Yes |
**Configuration Through Knowledge Units:**

The embedding model and dimensions are not configured directly on this stage. Instead, they are derived from the Knowledge Unit specified in the `KnowledgeUnitObjectId` parameter.

To configure embedding settings:

1. Create or edit a Knowledge Unit (see Creating Knowledge Units)
2. Set the embedding model and dimensions in the Knowledge Unit configuration
3. Reference that Knowledge Unit in this stage's parameters
**Common Embedding Models** (configured in the Knowledge Unit):

- `text-embedding-3-large`: High quality, larger vectors
- `text-embedding-3-small`: Faster, smaller vectors
- `text-embedding-ada-002`: Legacy model
**Dimension Options** (configured in the Knowledge Unit):
- Higher dimensions (2048, 3072): Better accuracy, more storage, slower search
- Lower dimensions (256, 512, 1024): Faster search, less storage, may reduce accuracy
**Example Configuration:**

```json
{
  "name": "Embed",
  "description": "Generate vector embeddings",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/GatewayTextEmbeddingDataPipelineStagePlugin",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "KnowledgeUnitObjectId",
        "type": "resource-object-id"
      },
      "default_value": "instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/my-knowledge-unit"
    }
  ]
}
```
**Note:** The Knowledge Unit referenced must be properly configured with an embedding profile and vector database before the pipeline runs.
### Indexing Stage

Stores embeddings in a vector database for search.

**Purpose:** Persists processed data to make it searchable by agents

**Plugin:** `AzureAISearchIndexingDataPipelineStagePlugin`

**Key Parameters:**

| Parameter | Type | Description | Required |
|---|---|---|---|
| `KnowledgeUnitObjectId` | resource-object-id | Reference to the Knowledge Unit that defines indexing settings | Yes |
**Configuration Through Knowledge Units:**

The vector database (e.g., an Azure AI Search index), index name, and index settings are not configured directly on this stage. Instead, they are derived from the Knowledge Unit specified in the `KnowledgeUnitObjectId` parameter.

To configure indexing settings:

1. Create or edit a Vector Database (see Creating Vector Databases)
2. Create or edit a Knowledge Unit that references that Vector Database (see Creating Knowledge Units)
3. Reference that Knowledge Unit in this stage's parameters

The Knowledge Unit contains:

- A reference to the Vector Database to use
- The index name and partition configuration
- Schema and field mappings
- Search and indexing parameters
**Example Configuration:**

```json
{
  "name": "Index",
  "description": "Store in vector database",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/AzureAISearchIndexingDataPipelineStagePlugin",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "KnowledgeUnitObjectId",
        "type": "resource-object-id"
      },
      "default_value": "instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/my-knowledge-unit"
    }
  ]
}
```
**Note:** The Knowledge Unit must be properly configured with a Vector Database reference before the pipeline runs. The same Knowledge Unit should be used for both the embedding and indexing stages to ensure consistency.
## Advanced Stages

### Knowledge Graph Stages

For extracting structured information and relationships:

- **Knowledge Extraction**: Extracts entities and relationships
- **Knowledge Consolidation**: Merges and deduplicates entities
- **Knowledge Summarization**: Generates summaries
- **Knowledge Graph Embedding**: Creates embeddings for entities
- **Knowledge Graph Indexing**: Indexes graph data

**When to Use:** When you need structured knowledge extraction beyond simple text search
### Content Safety Stage

Filters or flags inappropriate content.

**Purpose:** Identifies and handles potentially harmful content

**Plugin:** `AzureAIContentSafetyShieldingDataPipelineStage`
## Stage Configuration Best Practices

### Ordering Stages

Typical pipeline flow:

1. **Extract** → Get text from documents
2. **Partition** → Split into chunks
3. **Embed** → Generate vectors
4. **Index** → Store in the database
### Parameter Tuning

**For Large Documents:**

- Increase `PartitionSizeTokens` to 500-800
- Use higher overlap (150-200) for better context

**For Short Documents:**

- Decrease `PartitionSizeTokens` to 200-300
- Lower overlap is acceptable (50-75)

**For Better Quality:**

- Use the `text-embedding-3-large` model
- Use higher dimensions (2048-3072)
- Use semantic partitioning

**For Better Performance:**

- Use the `text-embedding-3-small` model
- Use lower dimensions (512-1024)
- Use token partitioning
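The tuning guidance above can be summarized as a small set of lookup profiles. The numeric ranges come directly from this document; the profile names and the `TUNING_PROFILES` structure are illustrative, not a FoundationaLLM construct.

```python
# Tuning guidance from this document, expressed as (low, high) ranges.
# Profile names are illustrative only.
TUNING_PROFILES = {
    "large_documents": {
        "PartitionSizeTokens": (500, 800),
        "PartitionOverlapTokens": (150, 200),
    },
    "short_documents": {
        "PartitionSizeTokens": (200, 300),
        "PartitionOverlapTokens": (50, 75),
    },
    "quality": {
        "model": "text-embedding-3-large",
        "dimensions": (2048, 3072),
        "strategy": "Semantic",
    },
    "performance": {
        "model": "text-embedding-3-small",
        "dimensions": (512, 1024),
        "strategy": "Token",
    },
}
```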
### Dependencies Between Stages

Some stages require specific inputs:

- Partitioning requires extracted text
- Embedding requires partitioned text
- Indexing requires embeddings

Configure stages in dependency order using the `next_stages` property.
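As an illustration, the typical four-stage flow could be wired together like this. This is a sketch only: the exact shape of `next_stages` (stage names versus nested stage objects) and the `stages` wrapper property are assumptions, not confirmed schema; stage bodies are elided and follow the examples above.

```json
{
  "stages": [
    { "name": "Extract",   "next_stages": ["Partition"] },
    { "name": "Partition", "next_stages": ["Embed"] },
    { "name": "Embed",     "next_stages": ["Index"] },
    { "name": "Index",     "next_stages": [] }
  ]
}
```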
### Knowledge Units for Embedding and Indexing

The embedding and indexing stages derive their configuration from Knowledge Units:

**Why Knowledge Units?**

- Centralizes embedding and index configuration
- Ensures consistency between embedding generation and index storage
- Allows reuse of the same configuration across multiple pipelines
- Enables vector database selection and embedding dimensionality to be managed independently

**Best Practice:**

- Use the same Knowledge Unit for both the embedding and indexing stages
- Create separate Knowledge Units for different embedding requirements (e.g., different models or dimensions)
- Properly configure Vector Databases before creating Knowledge Units

See: Creating Knowledge Units and Creating Vector Databases.
## Common Scenarios

### Scenario 1: Basic Document Processing

For standard document indexing:

Extract → Partition (Token, 400) → Embed (large, 2048) → Index

### Scenario 2: High-Volume Processing

For large document sets where speed matters:

Extract → Partition (Token, 300) → Embed (small, 1024) → Index

### Scenario 3: High-Quality Search

For maximum search quality:

Extract → Partition (Semantic, 500) → Embed (large, 3072) → Index

### Scenario 4: Knowledge Graph Pipeline

For structured knowledge extraction:

Extract → Partition → Knowledge Extraction → Knowledge Consolidation → Knowledge Summarization → Knowledge Graph Embedding → Knowledge Graph Indexing
## Troubleshooting Stage Configuration

### Extraction Fails

**Problem:** Files can't be processed

**Solutions:**

- Verify the file format is supported
- Check that the file isn't corrupted
- Ensure sufficient memory for large files

### Partitioning Issues

**Problem:** Chunks are too large or too small

**Solutions:**

- Adjust `PartitionSizeTokens`
- Check the token counting method
- Review the document structure

### Embedding Errors

**Problem:** Embedding fails or times out

**Solutions:**

- Verify the Knowledge Unit is properly configured
- Check the Vector Database connection in the Knowledge Unit
- Verify the embedding model is available
- Check rate limits
- Reduce the batch size

### Indexing Failures

**Problem:** Data doesn't appear in the index

**Solutions:**

- Verify the Knowledge Unit is properly configured
- Check the Vector Database configuration in the Knowledge Unit
- Verify the index exists in the Vector Database
- Check API endpoint credentials
- Ensure the index schema matches the data structure