Configuring Pipeline Stages

Learn how to configure data pipeline stages to transform and process your data.

Overview

Pipeline stages define the processing steps that transform raw data into indexed, searchable content. Each stage performs a specific transformation and can be configured with parameters to control its behavior.

Common Pipeline Stages

Text Extraction Stage

Extracts text content from various document formats.

Purpose: Converts binary files (PDF, DOCX, XLSX, PPTX, images) into plain text

Plugin: TextExtractionDataPipelineStage

Common Parameters:

  • None required for basic operation
  • Plugin automatically selects appropriate extractor based on file type

Supported Formats:

  • PDF documents
  • Microsoft Word (.docx)
  • Microsoft Excel (.xlsx)
  • Microsoft PowerPoint (.pptx)
  • Images (with OCR)
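The automatic extractor selection mentioned above can be pictured as a simple dispatch on file extension. The sketch below is illustrative only; the extractor names are hypothetical, not FoundationaLLM types.

```python
from pathlib import Path

# Hypothetical mapping from file extension to extractor; the names
# here are illustrative stand-ins, not FoundationaLLM plugin types.
EXTRACTORS = {
    ".pdf":  "PDFTextExtractor",
    ".docx": "DOCXTextExtractor",
    ".xlsx": "XLSXTextExtractor",
    ".pptx": "PPTXTextExtractor",
    ".png":  "ImageOCRTextExtractor",
    ".jpg":  "ImageOCRTextExtractor",
}

def select_extractor(file_name: str) -> str:
    """Pick an extractor based on the file's extension (case-insensitive)."""
    suffix = Path(file_name).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file format: {suffix}")
```

Unsupported formats fail fast with a clear error rather than producing empty text downstream.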

Example Configuration:

{
  "name": "Extract",
  "description": "Extract text from documents",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TextExtractionDataPipelineStage",
  "plugin_parameters": null
}

Text Partitioning Stage

Splits extracted text into smaller, manageable chunks.

Purpose: Breaks large documents into appropriately sized segments for embedding and retrieval

Plugin: TextPartitioningDataPipelineStage

Key Parameters:

Parameter               Type    Description                              Default  Recommended
PartitioningStrategy    string  Strategy to use: "Token" or "Semantic"   Token    Token for most cases
PartitionSizeTokens     int     Size of each partition in tokens         400      300-500
PartitionOverlapTokens  int     Overlap between partitions in tokens     100      50-150

Partitioning Strategies:

  1. Token-Based: Splits text based on token count

    • Fast and predictable
    • Good for general-purpose use
    • Consistent chunk sizes
  2. Semantic: Splits based on semantic boundaries

    • Preserves meaning better
    • May result in variable chunk sizes
    • Better for complex documents
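The token-based strategy can be sketched as fixed-size windows advanced by an overlapping stride. This is a minimal illustration; whitespace splitting stands in for a real model tokenizer, which is an assumption of the sketch.

```python
# Minimal sketch of token-based partitioning with overlap. Whitespace
# splitting stands in for a real tokenizer (e.g., a model-specific BPE
# tokenizer), which is an assumption of this sketch.

def partition_tokens(text, size_tokens=400, overlap_tokens=100):
    """Split text into overlapping chunks of at most size_tokens tokens."""
    tokens = text.split()  # stand-in for real tokenization
    step = size_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size_tokens]))
        if start + size_tokens >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

# A 1,000-token document with size 400 / overlap 100 yields 3 chunks:
# tokens 0-400, 300-700, and 600-1000.
chunks = partition_tokens("token " * 1000)
```

The overlap means the last 100 tokens of each chunk are repeated at the start of the next, so context spanning a boundary is preserved in at least one chunk.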

Example Configuration:

{
  "name": "Partition",
  "description": "Split text into chunks",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TextPartitioningDataPipelineStage",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "PartitioningStrategy",
        "type": "string"
      },
      "default_value": "Token"
    }
  ],
  "plugin_dependencies": [
    {
      "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TokenContentTextPartitioning",
      "plugin_parameters": [
        {
          "parameter_metadata": {
            "name": "PartitionSizeTokens",
            "type": "int"
          },
          "default_value": 400
        },
        {
          "parameter_metadata": {
            "name": "PartitionOverlapTokens",
            "type": "int"
          },
          "default_value": 100
        }
      ]
    }
  ]
}

Text Embedding Stage

Generates vector embeddings for text chunks.

Purpose: Converts text into numerical vectors for semantic search

Plugin: GatewayTextEmbeddingDataPipelineStagePlugin

Key Parameters:

Parameter              Type                Description                                                      Required
KnowledgeUnitObjectId  resource-object-id  Reference to the Knowledge Unit that defines embedding settings  Yes

Configuration Through Knowledge Units:

The embedding model and dimensions are not configured directly on this stage. Instead, they are derived from the Knowledge Unit specified in the KnowledgeUnitObjectId parameter.

To configure embedding settings:

  1. Create or edit a Knowledge Unit (see Creating Knowledge Units)
  2. Set the embedding model and dimensions in the Knowledge Unit configuration
  3. Reference that Knowledge Unit in this stage's parameters

Common Embedding Models (configured in Knowledge Unit):

  • text-embedding-3-large: High quality, larger vectors
  • text-embedding-3-small: Faster, smaller vectors
  • text-embedding-ada-002: Legacy model

Dimension Options (configured in Knowledge Unit):

  • Higher dimensions (2048, 3072): Better accuracy, more storage, slower search
  • Lower dimensions (256, 512, 1024): Faster search, less storage, may reduce accuracy
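The storage side of this trade-off is easy to estimate: index size scales linearly with dimensionality. A rough calculation follows; float32 precision (4 bytes per dimension) and no index compression are assumptions.

```python
# Back-of-the-envelope index storage for embedding vectors, assuming
# float32 (4 bytes per dimension) and no index compression.

def vector_storage_mb(num_chunks, dimensions, bytes_per_dim=4):
    """Approximate raw vector storage in MB for a set of embeddings."""
    return num_chunks * dimensions * bytes_per_dim / (1024 ** 2)

# 100,000 chunks: 3072 dims needs roughly 6x the storage of 512 dims.
large = vector_storage_mb(100_000, 3072)
small = vector_storage_mb(100_000, 512)
```

Search latency and memory pressure grow with the same factor, which is why lower dimensions speed up search.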

Example Configuration:

{
  "name": "Embed",
  "description": "Generate vector embeddings",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/GatewayTextEmbeddingDataPipelineStagePlugin",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "KnowledgeUnitObjectId",
        "type": "resource-object-id"
      },
      "default_value": "instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/my-knowledge-unit"
    }
  ]
}

Note: The Knowledge Unit referenced must be properly configured with an embedding profile and vector database before the pipeline runs.

Indexing Stage

Stores embeddings in a vector database for search.

Purpose: Persists processed data to make it searchable by agents

Plugin: AzureAISearchIndexingDataPipelineStagePlugin

Key Parameters:

Parameter              Type                Description                                                     Required
KnowledgeUnitObjectId  resource-object-id  Reference to the Knowledge Unit that defines indexing settings  Yes

Configuration Through Knowledge Units:

The vector database (e.g., Azure AI Search index), index name, and index settings are not configured directly on this stage. Instead, they are derived from the Knowledge Unit specified in the KnowledgeUnitObjectId parameter.

To configure indexing settings:

  1. Create or edit a Vector Database (see Creating Vector Databases)
  2. Create or edit a Knowledge Unit that references that Vector Database (see Creating Knowledge Units)
  3. Reference that Knowledge Unit in this stage's parameters

The Knowledge Unit contains:

  • Reference to the Vector Database to use
  • Index name and partition configuration
  • Schema and field mappings
  • Search and indexing parameters

Example Configuration:

{
  "name": "Index",
  "description": "Store in vector database",
  "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/AzureAISearchIndexingDataPipelineStagePlugin",
  "plugin_parameters": [
    {
      "parameter_metadata": {
        "name": "KnowledgeUnitObjectId",
        "type": "resource-object-id"
      },
      "default_value": "instances/{instanceId}/providers/FoundationaLLM.KnowledgeUnit/knowledgeUnits/my-knowledge-unit"
    }
  ]
}

Note: The Knowledge Unit must be properly configured with a Vector Database reference before the pipeline runs. The same Knowledge Unit should be used for both the embedding and indexing stages to ensure consistency.

Advanced Stages

Knowledge Graph Stages

For extracting structured information and relationships:

  1. Knowledge Extraction: Extract entities and relationships
  2. Knowledge Consolidation: Merge and deduplicate entities
  3. Knowledge Summarization: Generate summaries
  4. Knowledge Graph Embedding: Create embeddings for entities
  5. Knowledge Graph Indexing: Index graph data

When to Use: When you need structured knowledge extraction beyond simple text search

Content Safety Stage

Filters or flags inappropriate content.

Plugin: AzureAIContentSafetyShieldingDataPipelineStage

Purpose: Identify and handle potentially harmful content

Stage Configuration Best Practices

Ordering Stages

Typical pipeline flow:

  1. Extract → Get text from documents
  2. Partition → Split into chunks
  3. Embed → Generate vectors
  4. Index → Store in database

Parameter Tuning

For Large Documents:

  • Increase PartitionSizeTokens to 500-800
  • Use higher overlap (150-200) for better context

For Short Documents:

  • Decrease PartitionSizeTokens to 200-300
  • Lower overlap acceptable (50-75)
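When tuning these values, it helps to estimate how many chunks a document will produce: with overlapping partitions, each chunk after the first advances by size minus overlap tokens. A quick estimate, assuming fixed-size token partitioning:

```python
import math

# Rough chunk-count estimate for fixed-size token partitioning:
# after the first chunk, each additional chunk advances by
# (size - overlap) tokens.

def estimate_chunks(doc_tokens, size_tokens, overlap_tokens):
    """Approximate number of chunks a document of doc_tokens produces."""
    if doc_tokens <= size_tokens:
        return 1
    step = size_tokens - overlap_tokens
    return math.ceil((doc_tokens - overlap_tokens) / step)

# A 10,000-token document:
estimate_chunks(10_000, 400, 100)  # (10000-100)/300 -> 33 chunks
estimate_chunks(10_000, 800, 200)  # (10000-200)/600 -> 17 chunks
```

Larger partitions with higher overlap therefore produce fewer, but more redundant, chunks per document.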

For Better Quality:

  • Use text-embedding-3-large model
  • Use higher dimensions (2048-3072)
  • Use semantic partitioning

For Better Performance:

  • Use text-embedding-3-small model
  • Use lower dimensions (512-1024)
  • Use token partitioning

Dependencies Between Stages

Some stages require specific inputs:

  • Partitioning requires extracted text
  • Embedding requires partitioned text
  • Indexing requires embeddings

Configure stages in dependency order using the next_stages property.
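As a sketch, a four-stage pipeline wired in dependency order might look like the following. The stages wrapper and the exact shape of the next_stages values are illustrative assumptions; the plugin IDs follow the example configurations above.

```json
{
  "stages": [
    {
      "name": "Extract",
      "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TextExtractionDataPipelineStage",
      "next_stages": ["Partition"]
    },
    {
      "name": "Partition",
      "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/TextPartitioningDataPipelineStage",
      "next_stages": ["Embed"]
    },
    {
      "name": "Embed",
      "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/GatewayTextEmbeddingDataPipelineStagePlugin",
      "next_stages": ["Index"]
    },
    {
      "name": "Index",
      "plugin_object_id": "instances/{instanceId}/providers/FoundationaLLM.Plugin/plugins/AzureAISearchIndexingDataPipelineStagePlugin",
      "next_stages": []
    }
  ]
}
```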

Knowledge Units for Embedding and Indexing

The embedding and indexing stages derive their configuration from Knowledge Units:

Why Knowledge Units?

  • Centralizes embedding and index configuration
  • Ensures consistency between embedding generation and index storage
  • Allows reuse of the same configuration across multiple pipelines
  • Enables vector database selection and embedding dimensionality to be managed independently

Best Practice:

  • Use the same Knowledge Unit for both embedding and indexing stages
  • Create separate Knowledge Units for different embedding requirements (e.g., different models or dimensions)
  • Properly configure Vector Databases before creating Knowledge Units

See Creating Vector Databases and Creating Knowledge Units for details.

Common Scenarios

Scenario 1: Basic Document Processing

For standard document indexing:

Extract → Partition (Token, 400) → Embed (large, 2048) → Index

Scenario 2: High-Volume Processing

For large document sets where speed matters:

Extract → Partition (Token, 300) → Embed (small, 1024) → Index

Scenario 3: Maximum Search Quality

For maximum search quality:

Extract → Partition (Semantic, 500) → Embed (large, 3072) → Index

Scenario 4: Knowledge Graph Pipeline

For structured knowledge extraction:

Extract → Partition → Knowledge Extraction → 
Knowledge Consolidation → Knowledge Summarization → 
Knowledge Graph Embedding → Knowledge Graph Indexing

Troubleshooting Stage Configuration

Extraction Fails

Problem: Files can't be processed

Solutions:

  • Verify file format is supported
  • Check file isn't corrupted
  • Ensure sufficient memory for large files

Partitioning Issues

Problem: Chunks too large or too small

Solutions:

  • Adjust PartitionSizeTokens
  • Check token counting method
  • Review document structure

Embedding Errors

Problem: Embedding fails or times out

Solutions:

  • Verify Knowledge Unit is properly configured
  • Check Vector Database connection in Knowledge Unit
  • Verify embedding model is available
  • Check rate limits
  • Reduce batch size
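For rate limits and timeouts, retrying with exponential backoff is a common mitigation. The sketch below is generic; embed_batch is a placeholder for whatever embedding client your pipeline uses, not a FoundationaLLM API.

```python
import random
import time

def embed_with_retry(embed_batch, chunks, max_attempts=5, base_delay=1.0):
    """Call embed_batch(chunks), retrying transient failures with backoff."""
    for attempt in range(max_attempts):
        try:
            return embed_batch(chunks)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Exponential backoff with jitter: base, 2*base, 4*base, ...
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Combining this with a smaller batch size usually resolves intermittent rate-limit failures.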

Indexing Failures

Problem: Data doesn't appear in index

Solutions:

  • Verify Knowledge Unit is properly configured
  • Check Vector Database configuration in Knowledge Unit
  • Verify index exists in the Vector Database
  • Check API endpoint credentials
  • Ensure index schema matches data structure