Configuring Pipeline Stages

Learn how to configure data pipeline stages to transform and process your data.

Overview

Pipeline stages define the processing steps that transform raw data into indexed, searchable content. Each stage performs a specific transformation and can be configured with parameters to control its behavior.

Common Pipeline Stages

Text Extraction Stage

Extracts text content from various document formats.

Purpose: Converts binary files (PDF, DOCX, XLSX, PPTX, images) into plain text

Plugin: TextExtractionDataPipelineStage

Common Parameters:

  • None required for basic operation
  • Plugin automatically selects appropriate extractor based on file type

Supported Formats:

  • PDF documents
  • Microsoft Word (.docx)
  • Microsoft Excel (.xlsx)
  • Microsoft PowerPoint (.pptx)
  • Images (with OCR)
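The automatic extractor selection mentioned above can be pictured as a simple dispatch on file extension. The sketch below is illustrative only; the extractor names are hypothetical, not FoundationaLLM types.

```python
from pathlib import Path

# Hypothetical mapping from file extension to extractor; the names
# here are illustrative stand-ins, not FoundationaLLM plugin types.
EXTRACTORS = {
    ".pdf":  "PDFTextExtractor",
    ".docx": "DOCXTextExtractor",
    ".xlsx": "XLSXTextExtractor",
    ".pptx": "PPTXTextExtractor",
    ".png":  "ImageOCRTextExtractor",
    ".jpg":  "ImageOCRTextExtractor",
}

def select_extractor(file_name: str) -> str:
    """Pick an extractor based on the file's extension (case-insensitive)."""
    suffix = Path(file_name).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file format: {suffix}")
```

Unsupported formats fail fast with a clear error rather than producing empty text downstream.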

Example Configuration:

{
  "name": "Extract",
  "description": "Extract text from documents",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TextExtractionDataPipelineStage",
  "plugin_parameters": null
}

Text Partitioning Stage

Splits extracted text into smaller, manageable chunks.

Purpose: Breaks large documents into appropriately sized segments for embedding and retrieval

Plugin: TextPartitioningDataPipelineStage

Key Parameters:

Parameter               Type    Description                              Default  Recommended
PartitioningStrategy    string  Strategy to use: "Token" or "Semantic"   Token    Token for most cases
PartitionSizeTokens     int     Size of each partition in tokens         400      300-500
PartitionOverlapTokens  int     Overlap between partitions in tokens     100      50-150

Partitioning Strategies:

  1. Token-Based: Splits text based on token count

    • Fast and predictable
    • Good for general-purpose use
    • Consistent chunk sizes
  2. Semantic: Splits based on semantic boundaries

    • Preserves meaning better
    • May result in variable chunk sizes
    • Better for complex documents
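The token-based strategy can be sketched as fixed-size windows advanced by an overlapping stride. This is a minimal illustration; whitespace splitting stands in for a real model tokenizer, which is an assumption of the sketch.

```python
# Minimal sketch of token-based partitioning with overlap. Whitespace
# splitting stands in for a real tokenizer (e.g., a model-specific BPE
# tokenizer), which is an assumption of this sketch.

def partition_tokens(text, size_tokens=400, overlap_tokens=100):
    """Split text into overlapping chunks of at most size_tokens tokens."""
    tokens = text.split()  # stand-in for real tokenization
    step = size_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size_tokens]))
        if start + size_tokens >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

# A 1,000-token document with size 400 / overlap 100 yields 3 chunks:
# tokens 0-400, 300-700, and 600-1000.
chunks = partition_tokens("token " * 1000)
```

The overlap means the last 100 tokens of each chunk are repeated at the start of the next, so context spanning a boundary is preserved in at least one chunk.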

Example Configuration:

{
  "name": "Partition",
  "description": "Split text into chunks",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TextPartitioningDataPipelineStage",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "PartitioningStrategy",
        "type": "string"
      },
      "default_value": "Token"
    }
  ],
  "plugin_dependencies": [
    {
      "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TokenContentTextPartitioning",
      "plugin_parameters": [
        {
          "parameter_metadata": {
            "name": "PartitionSizeTokens",
            "type": "int"
          },
          "default_value": 400
        },
        {
          "parameter_metadata": {
            "name": "PartitionOverlapTokens",
            "type": "int"
          },
          "default_value": 100
        }
      ]
    }
  ]
}

Text Embedding Stage

Generates vector embeddings for text chunks.

Purpose: Converts text into numerical vectors for semantic search

Plugin: GatewayTextEmbeddingDataPipelineStagePlugin

Key Parameters:

Parameter              Type                Description                                                      Required
KnowledgeUnitObjectId  resource-object-id  Reference to the Knowledge Unit that defines embedding settings  Yes

Configuration Through Knowledge Units:

The embedding model and dimensions are not configured directly on this stage. Instead, they are derived from the Knowledge Unit specified in the KnowledgeUnitObjectId parameter.

To configure embedding settings:

  1. Create or edit a Knowledge Unit (see Creating Knowledge Units)
  2. Set the embedding model and dimensions in the Knowledge Unit configuration
  3. Reference that Knowledge Unit in this stage's parameters

Common Embedding Models (configured in Knowledge Unit):

  • text-embedding-3-large: High quality, larger vectors
  • text-embedding-3-small: Faster, smaller vectors
  • text-embedding-ada-002: Legacy model

Dimension Options (configured in Knowledge Unit):

  • Higher dimensions (2048, 3072): Better accuracy, more storage, slower search
  • Lower dimensions (256, 512, 1024): Faster search, less storage, may reduce accuracy
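The storage side of this trade-off is easy to estimate: index size scales linearly with dimensionality. A rough calculation follows; float32 precision (4 bytes per dimension) and no index compression are assumptions.

```python
# Back-of-the-envelope index storage for embedding vectors, assuming
# float32 (4 bytes per dimension) and no index compression.

def vector_storage_mb(num_chunks, dimensions, bytes_per_dim=4):
    """Approximate raw vector storage in MB for a set of embeddings."""
    return num_chunks * dimensions * bytes_per_dim / (1024 ** 2)

# 100,000 chunks: 3072 dims needs roughly 6x the storage of 512 dims.
large = vector_storage_mb(100_000, 3072)
small = vector_storage_mb(100_000, 512)
```

Search latency and memory pressure grow with the same factor, which is why lower dimensions speed up search.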

Example Configuration:

{
  "name": "Embed",
  "description": "Generate vector embeddings",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/GatewayTextEmbeddingDataPipelineStagePlugin",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "KnowledgeUnitObjectId",
        "type": "resource-object-id"
      },
      "default_value": "instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/my-knowledge-unit"
    }
  ]
}

Note: The Knowledge Unit referenced must be properly configured with an embedding profile and vector database before the pipeline runs.

Indexing Stage

Stores embeddings in a vector database for search.

Purpose: Persists processed data to make it searchable by agents

Plugin: AzureAISearchIndexingDataPipelineStagePlugin

Key Parameters:

Parameter              Type                Description                                                     Required
KnowledgeUnitObjectId  resource-object-id  Reference to the Knowledge Unit that defines indexing settings  Yes

Configuration Through Knowledge Units:

The vector database (e.g., Azure AI Search index), index name, and index settings are not configured directly on this stage. Instead, they are derived from the Knowledge Unit specified in the KnowledgeUnitObjectId parameter.

To configure indexing settings:

  1. Create or edit a Vector Database (see Creating Vector Databases)
  2. Create or edit a Knowledge Unit that references that Vector Database (see Creating Knowledge Units)
  3. Reference that Knowledge Unit in this stage's parameters

The Knowledge Unit contains:

  • Reference to the Vector Database to use
  • Index name and partition configuration
  • Schema and field mappings
  • Search and indexing parameters

Example Configuration:

{
  "name": "Index",
  "description": "Store in vector database",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/AzureAISearchIndexingDataPipelineStagePlugin",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "KnowledgeUnitObjectId",
        "type": "resource-object-id"
      },
      "default_value": "instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/my-knowledge-unit"
    }
  ]
}

Note: The Knowledge Unit must be properly configured with a Vector Database reference before the pipeline runs. The same Knowledge Unit should be used for both the embedding and indexing stages to ensure consistency.

Advanced Stages

Knowledge Graph Stages

For extracting structured information and relationships:

  1. Knowledge Extraction: Extract entities and relationships
  2. Knowledge Consolidation: Merge and deduplicate entities
  3. Knowledge Summarization: Generate summaries
  4. Knowledge Graph Embedding: Create embeddings for entities
  5. Knowledge Graph Indexing: Index graph data

When to Use: When you need structured knowledge extraction beyond simple text search

Content Safety Stage

Filters or flags inappropriate content.

Plugin: AzureAIContentSafetyShieldingDataPipelineStage

Purpose: Identify and handle potentially harmful content

Stage Configuration Best Practices

Ordering Stages

Typical pipeline flow:

  1. Extract → Get text from documents
  2. Partition → Split into chunks
  3. Embed → Generate vectors
  4. Index → Store in database

Parameter Tuning

For Large Documents:

  • Increase PartitionSizeTokens to 500-800
  • Use higher overlap (150-200) for better context

For Short Documents:

  • Decrease PartitionSizeTokens to 200-300
  • Lower overlap acceptable (50-75)
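When tuning these values, it helps to estimate how many chunks a document will produce: with overlapping partitions, each chunk after the first advances by size minus overlap tokens. A quick estimate, assuming fixed-size token partitioning:

```python
import math

# Rough chunk-count estimate for fixed-size token partitioning:
# after the first chunk, each additional chunk advances by
# (size - overlap) tokens.

def estimate_chunks(doc_tokens, size_tokens, overlap_tokens):
    """Approximate number of chunks a document of doc_tokens produces."""
    if doc_tokens <= size_tokens:
        return 1
    step = size_tokens - overlap_tokens
    return math.ceil((doc_tokens - overlap_tokens) / step)

# A 10,000-token document:
estimate_chunks(10_000, 400, 100)  # (10000-100)/300 -> 33 chunks
estimate_chunks(10_000, 800, 200)  # (10000-200)/600 -> 17 chunks
```

Larger partitions with higher overlap therefore produce fewer, but more redundant, chunks per document.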

For Better Quality:

  • Use text-embedding-3-large model
  • Use higher dimensions (2048-3072)
  • Use semantic partitioning

For Better Performance:

  • Use text-embedding-3-small model
  • Use lower dimensions (512-1024)
  • Use token partitioning

Dependencies Between Stages

Some stages require specific inputs:

  • Partitioning requires extracted text
  • Embedding requires partitioned text
  • Indexing requires embeddings

Configure stages in dependency order using the next_stages property.
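As a sketch, a four-stage pipeline wired in dependency order might look like the following. The stages wrapper and the exact shape of the next_stages values are illustrative assumptions; the plugin IDs follow the example configurations above.

```json
{
  "stages": [
    {
      "name": "Extract",
      "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TextExtractionDataPipelineStage",
      "next_stages": ["Partition"]
    },
    {
      "name": "Partition",
      "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TextPartitioningDataPipelineStage",
      "next_stages": ["Embed"]
    },
    {
      "name": "Embed",
      "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/GatewayTextEmbeddingDataPipelineStagePlugin",
      "next_stages": ["Index"]
    },
    {
      "name": "Index",
      "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/AzureAISearchIndexingDataPipelineStagePlugin",
      "next_stages": []
    }
  ]
}
```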

Knowledge Units for Embedding and Indexing

The embedding and indexing stages derive their configuration from Knowledge Units:

Why Knowledge Units?

  • Centralizes embedding and index configuration
  • Ensures consistency between embedding generation and index storage
  • Allows reuse of the same configuration across multiple pipelines
  • Enables vector database selection and embedding dimensionality to be managed independently

Best Practice:

  • Use the same Knowledge Unit for both embedding and indexing stages
  • Create separate Knowledge Units for different embedding requirements (e.g., different models or dimensions)
  • Properly configure Vector Databases before creating Knowledge Units

See Creating Vector Databases and Creating Knowledge Units for details.

Common Scenarios

Scenario 1: Basic Document Processing

For standard document indexing:

Extract → Partition (Token, 400) → Embed (large, 2048) → Index

Scenario 2: High-Volume Processing

For large document sets where speed matters:

Extract → Partition (Token, 300) → Embed (small, 1024) → Index

Scenario 3: Maximum Search Quality

For maximum search quality:

Extract → Partition (Semantic, 500) → Embed (large, 3072) → Index

Scenario 4: Knowledge Graph Pipeline

For structured knowledge extraction:

Extract → Partition → Knowledge Extraction → 
Knowledge Consolidation → Knowledge Summarization → 
Knowledge Graph Embedding → Knowledge Graph Indexing

Troubleshooting Stage Configuration

Extraction Fails

Problem: Files can't be processed

Solutions:

  • Verify file format is supported
  • Check file isn't corrupted
  • Ensure sufficient memory for large files

Partitioning Issues

Problem: Chunks too large or too small

Solutions:

  • Adjust PartitionSizeTokens
  • Check token counting method
  • Review document structure

Embedding Errors

Problem: Embedding fails or times out

Solutions:

  • Verify Knowledge Unit is properly configured
  • Check Vector Database connection in Knowledge Unit
  • Verify embedding model is available
  • Check rate limits
  • Reduce batch size
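For rate limits and timeouts, retrying with exponential backoff is a common mitigation. The sketch below is generic; embed_batch is a placeholder for whatever embedding client your pipeline uses, not a FoundationaLLM API.

```python
import random
import time

def embed_with_retry(embed_batch, chunks, max_attempts=5, base_delay=1.0):
    """Call embed_batch(chunks), retrying transient failures with backoff."""
    for attempt in range(max_attempts):
        try:
            return embed_batch(chunks)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Exponential backoff with jitter: base, 2*base, 4*base, ...
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Combining this with a smaller batch size usually resolves intermittent rate-limit failures.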

Indexing Failures

Problem: Data doesn't appear in index

Solutions:

  • Verify Knowledge Unit is properly configured
  • Check Vector Database configuration in Knowledge Unit
  • Verify index exists in the Vector Database
  • Check API endpoint credentials
  • Ensure index schema matches data structure