Creating Data Pipelines

Learn how to create and configure data pipelines for processing and indexing your data.

Overview

Data pipelines define the processing workflow for transforming raw data into indexed, searchable content that agents can use. A pipeline consists of:

  • Data Source: Where to read data from
  • Stages: Processing steps (text extraction, splitting, embedding, indexing)
  • Configuration: Parameters for each stage

Accessing Data Pipelines

  1. In the Management Portal sidebar, click Data Pipelines under the Data section
  2. The pipelines list shows all configured pipelines

Pipeline List

The table displays:

Column       Description
------       -----------
Name         Pipeline identifier
Description  Purpose of the pipeline
Active       Whether the pipeline is enabled
Edit         Settings icon to modify configuration
Run          Execute the pipeline manually
Delete       Remove the pipeline

Creating a Pipeline

  1. Click Create Pipeline at the top right
  2. Configure the pipeline settings

Basic Information

Field          Description
-----          -----------
Pipeline Name  Unique identifier (letters, numbers, dashes, underscores)
Display Name   User-friendly name
Description    Purpose and what data it processes

Select Data Source

Choose an existing data source from the dropdown. This determines where the pipeline reads input data.

After selecting a data source:

Field                    Description
-----                    -----------
Data Source Name         Override name for this pipeline's data reference
Data Source Description  Additional description
Data Source Plugin       Plugin for processing this data source type

Configure Data Source Plugin

The plugin determines how data is read. Configure default values for plugin parameters:

Parameter Type             Input Method
--------------             ------------
String/Int/Float/DateTime  Text input
Boolean                    Toggle switch
Array                      Chips (comma-separated values)
Resource Object ID         Dropdown selection
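
To make the input methods concrete, here is a minimal sketch (in Python, with illustrative type names, not the product's actual API) of how raw UI values of each kind might be coerced into typed parameter values:

```python
# Hypothetical sketch: coercing raw UI input into typed plugin parameters.
# The type names and coercion rules here are illustrative assumptions.

def coerce_parameter(raw: str, param_type: str):
    """Convert a raw UI value into the declared parameter type."""
    if param_type in ("string", "resource_object_id"):
        return raw                      # text input / dropdown selection
    if param_type == "int":
        return int(raw)
    if param_type == "float":
        return float(raw)
    if param_type == "boolean":
        return raw.strip().lower() in ("true", "on", "1")   # toggle switch
    if param_type == "array":
        # Chips are entered as comma-separated values
        return [item.strip() for item in raw.split(",") if item.strip()]
    raise ValueError(f"Unknown parameter type: {param_type}")

print(coerce_parameter("a, b, c", "array"))   # ['a', 'b', 'c']
print(coerce_parameter("true", "boolean"))    # True
```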

Pipeline Stages

Stages define the processing workflow. Common stages include:

Stage Types

Stage              Purpose
-----              -------
Text Extraction    Extract text content from documents
Text Partitioning  Split text into chunks
Text Embedding     Generate vector embeddings
Indexing           Store in vector database

Configuring Stages

For each stage:

  1. Select Plugin: Choose the processing plugin
  2. Configure Parameters: Set stage-specific settings
  3. Define Outputs: Specify where results go

Stage Parameters

Common parameters by stage type:

Text Partitioning:

Parameter     Description
---------     -----------
Chunk Size    Maximum size of each text chunk
Overlap Size  Characters overlapping between chunks
Tokenizer     How to count tokens
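
The interaction of Chunk Size and Overlap Size can be sketched as follows. This is an illustrative Python sketch, not the stage's actual implementation; whitespace-separated tokens stand in for a real tokenizer to keep the example self-contained:

```python
# Illustrative token-based partitioning with overlap.

def partition(tokens, chunk_size, overlap):
    """Split a token list into chunks of at most chunk_size tokens,
    with `overlap` tokens repeated at each chunk boundary."""
    if overlap >= chunk_size:
        raise ValueError("Overlap Size must be smaller than Chunk Size")
    step = chunk_size - overlap  # each chunk starts `step` tokens later
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = [f"t{i}" for i in range(10)]
chunks = partition(tokens, chunk_size=4, overlap=1)
# The last token of each chunk is repeated as the first token of the next,
# so no sentence is cut cleanly in half at a boundary without context.
```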

Text Embedding:

Parameter   Description
---------   -----------
Model       Embedding model to use
Dimensions  Vector size
Batch Size  Documents per batch
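
Batch Size bounds how many chunks are sent to the embedding model per request: larger batches mean fewer round trips but higher peak memory use. A minimal sketch, where `embed_batch` is a stand-in for a real embedding call rather than any specific API:

```python
# Sketch of batched embedding; embed_batch is a hypothetical placeholder.

def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_batch(batch):
    # Placeholder: pretend each chunk embeds to a fixed-size vector.
    return [[0.0] * 8 for _ in batch]

def embed_all(chunks, batch_size=50):
    vectors = []
    for batch in batched(chunks, batch_size):
        # One model call per batch.
        vectors.extend(embed_batch(batch))
    return vectors
```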

Indexing:

Parameter     Description
---------     -----------
Vector Store  Target vector database
Index Name    Name for the index

Stage Ordering

Stages can be reordered:

  • Drag and drop stages to change order
  • Use the visual connectors to see flow
  • Ensure logical progression (extraction → partitioning → embedding → indexing)
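
The "logical progression" rule above can be expressed as a dependency check: each stage type must come after every type it consumes output from. This sketch uses an assumed dependency map for illustration, not a product rule set:

```python
# Hypothetical stage-order validation. DEPENDS_ON is an illustrative
# assumption about which stage types feed which.

DEPENDS_ON = {
    "extraction": set(),
    "partitioning": {"extraction"},
    "embedding": {"partitioning"},
    "indexing": {"embedding"},
}

def validate_order(stages):
    seen = set()
    for stage in stages:
        missing = DEPENDS_ON[stage] - seen
        if missing:
            raise ValueError(f"'{stage}' requires input from: {sorted(missing)}")
        seen.add(stage)
    return True

validate_order(["extraction", "partitioning", "embedding", "indexing"])  # OK
# validate_order(["embedding", "extraction"]) raises: can't embed
# before extracting.
```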

Advanced Configuration

Trigger Settings

Configure when the pipeline runs:

Trigger Type  Description             Status
------------  -----------             ------
Manual        Run on demand only      ✅ Available
Scheduled     Run on a cron schedule  ✅ Available
Event-driven  Run when data changes   ⚠️ Not currently available

Note: Triggers are defined as part of the pipeline configuration. Event-driven triggers are planned for a future release. See Setting Up Triggers for details.
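
Assuming standard five-field cron syntax (verify the exact syntax your deployment accepts), the schedules used in the examples later in this guide would look like:

```
0 2 * * *      daily at 2:00 AM
*/15 * * * *   every 15 minutes
0 0 * * 0      weekly on Sunday at midnight
0 * * * *      hourly, on the hour
```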

Prompt Configuration

Some stages use prompts for processing:

  1. Select a prompt from the available options
  2. Configure prompt parameters
  3. Set the prompt role (e.g., main_prompt)

Saving the Pipeline

  1. Review all configuration sections
  2. Click Create Pipeline or Save Changes
  3. Wait for validation to complete and the pipeline to save

Best Practices

Design

  • Start Simple: Begin with basic Extract → Partition → Embed → Index flow
  • Use Descriptive Names: Choose clear names like "customer-docs-pipeline" not "pipeline1"
  • Document Purpose: Write clear descriptions of what data the pipeline processes
  • Plan for Scale: Consider future data volume when configuring

Naming Conventions

Pipeline Names:

Good:
- customer-documents-pipeline
- sharepoint-knowledge-base
- financial-reports-indexer

Bad:
- pipeline1
- test
- my-pipeline
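
The Pipeline Name field accepts letters, numbers, dashes, and underscores. That character rule can be sketched as a simple check (descriptiveness, of course, cannot be automated; a valid name like "pipeline1" can still be a bad name):

```python
import re

# Sketch of the Pipeline Name character rule: letters, numbers,
# dashes, underscores. Checks validity only, not descriptiveness.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")

def is_valid_name(name: str) -> bool:
    return bool(NAME_PATTERN.match(name))

print(is_valid_name("customer-documents-pipeline"))  # True
print(is_valid_name("my pipeline!"))                 # False: space and '!'
```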

Stage Names:

Good:
- Extract
- Partition
- Embed-Large-Model
- Index-Customer-Data

Bad:
- stage1
- process
- do-stuff

Configuration

Chunk Size Guidelines:

Content Type    Recommended Size  Reasoning
------------    ----------------  ---------
Technical docs  400-600 tokens    Dense information
General text    300-500 tokens    Balanced
Chat/dialogue   200-300 tokens    Conversational
Code            500-800 tokens    Need context
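
Given a chunk size and overlap, you can estimate how many chunks a document will produce, since each new chunk starts (chunk size minus overlap) tokens after the previous one. A quick back-of-the-envelope sketch:

```python
import math

# Rough chunk-count estimate, assuming each new chunk starts
# (chunk_size - overlap) tokens after the previous one.

def estimated_chunks(total_tokens, chunk_size, overlap):
    step = chunk_size - overlap
    return max(1, math.ceil(total_tokens / step))

# A 10,000-token technical document at 500-token chunks with 100 overlap:
print(estimated_chunks(10_000, 500, 100))  # 25 chunks
```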

Embedding Model Selection:

Use Case         Model                   Dimensions
--------         -----                   ----------
General purpose  text-embedding-3-large  1536
High accuracy    text-embedding-3-large  3072
Cost-effective   text-embedding-3-small  1536
Fast processing  text-embedding-3-small  512

Performance

  • Batch Sizes: Start with 50-100 items, adjust based on memory
  • Parallel Processing: Enable where supported for better throughput
  • Monitor First Run: Watch first execution closely for issues
  • Incremental Processing: Use filters to process only new/changed files
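
One common form of incremental processing is filtering by modification time, so a run only picks up files changed since the last successful run. A minimal sketch (the folder path and cutoff are illustrative; real pipelines would track the last-run timestamp in their own state):

```python
import os

# Sketch of incremental file selection by modification time.

def changed_since(folder, last_run_epoch):
    """Return files under `folder` modified after last_run_epoch."""
    changed = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) > last_run_epoch:
                changed.append(path)
    return changed

# e.g. recent = changed_since("/documents/company-policies",
#                             last_successful_run_epoch)
```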

Maintenance

  • Test First: Always test with sample data before full runs
  • Create Snapshots: Snapshot configuration before major changes
  • Monitor Regularly: Check pipeline runs daily initially
  • Review Logs: Investigate warnings and errors promptly
  • Update Documentation: Keep pipeline documentation current

Common Pitfalls

Configuration Errors

Problem: Missing required parameters

Error: "Required parameter 'Folders' not provided"

Solution: Check trigger configuration has all data source parameters

Problem: Wrong plugin selected

Error: "Plugin does not support this data source type"

Solution: Verify plugin matches data source (e.g., Azure Data Lake plugin for Azure Data Lake source)

Problem: Invalid stage order

Error: "Stage requires input from previous stage"

Solution: Ensure stages are in logical order (can't embed before extracting)

Performance Issues

Problem: Pipeline runs too slowly

Causes:

  • Chunks too small (too many to process)
  • Batch size too small
  • Sequential instead of parallel processing

Solution: See Optimizing Pipeline Performance

Problem: Out-of-memory errors

Causes:

  • Batch size too large
  • Processing very large files
  • Insufficient worker memory

Solution: Reduce batch size, split large files, or increase worker resources

Data Quality Issues

Problem: Poor search result quality

Causes:

  • Chunks too large (lose precision)
  • Wrong embedding model
  • Low dimensions

Solution: Experiment with partition size, try larger model or higher dimensions

Problem: Missing content

Causes:

  • Files filtered out
  • Unsupported formats
  • Extraction failures

Solution: Check logs for skipped files, verify formats, review extraction errors
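
When checking logs for skipped files, a small script can summarize them. The log line format here ("SKIPPED <path>: <reason>") is hypothetical; adapt the parsing to whatever your pipeline actually emits:

```python
# Sketch of summarizing skipped files from run logs.
# The "SKIPPED <path>: <reason>" format is an assumption for illustration.

def skipped_files(log_lines):
    skipped = {}
    for line in log_lines:
        if line.startswith("SKIPPED "):
            rest = line[len("SKIPPED "):]
            path, _, reason = rest.partition(": ")
            skipped[path] = reason.strip()
    return skipped

log = [
    "PROCESSED /docs/a.pdf",
    "SKIPPED /docs/b.xyz: unsupported format",
    "SKIPPED /docs/c.pdf: extraction failed",
]
print(skipped_files(log))
# {'/docs/b.xyz': 'unsupported format', '/docs/c.pdf': 'extraction failed'}
```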

Real-World Examples

Example 1: Basic Document Indexing

Scenario: Index PDF documents from Azure Data Lake

Configuration:

Pipeline: company-docs-pipeline
Data Source: Azure Data Lake (company-storage)
Folders: /documents/company-policies

Stages:
1. Extract (TextExtractionDataPipelineStage)
2. Partition (Token, 400 tokens, 100 overlap)
3. Embed (text-embedding-3-large, 1536 dimensions)
4. Index (Azure AI Search, company-docs-index, default partition)

Trigger: Schedule, daily at 2 AM

Why This Works:

  • Simple, reliable flow
  • Standard chunk size for documents
  • High-quality embeddings
  • Regular refresh schedule

Example 2: High-Volume Processing

Scenario: Process thousands of support tickets daily

Configuration:

Pipeline: support-tickets-pipeline
Data Source: Azure Data Lake (support-storage)
Folders: /tickets/new

Stages:
1. Extract
2. Partition (Token, 300 tokens, 50 overlap)
3. Embed (text-embedding-3-small, 1024 dimensions)
4. Index (Azure AI Search, support-index, tickets partition)

Trigger: Schedule, every 15 minutes

Why This Works:

  • Frequent scheduled trigger for near-real-time processing
  • Smaller chunks for faster processing
  • Efficient embedding model for speed
  • Lower cost for high volume

Example 3: Multi-Language Content

Scenario: Index international documents in multiple languages

Configuration:

Pipeline: global-content-pipeline
Data Source: SharePoint Online (international-sites)

Stages:
1. Extract
2. Partition (Semantic, 500 tokens, 150 overlap)
3. Embed (text-embedding-3-large, 3072 dimensions)
4. Index (Azure AI Search, global-index, by-language partitions)

Trigger: Schedule, weekly on Sunday

Why This Works:

  • Semantic partitioning preserves meaning across languages
  • High-dimensional embeddings for better multilingual support
  • Partitions separate by language for better search
  • Weekly refresh sufficient for less dynamic content

Example 4: Code Documentation

Scenario: Index code repositories and technical documentation

Configuration:

Pipeline: code-docs-pipeline
Data Source: Azure Data Lake (code-repos)
Folders: /repos/*/docs

Stages:
1. Extract
2. Partition (Token, 800 tokens, 200 overlap)
3. Embed (text-embedding-3-large, 3072 dimensions)
4. Index (Azure AI Search, code-docs-index, by-repo partitions)

Trigger: Schedule, hourly

Why This Works:

  • Larger chunks preserve code context
  • Higher overlap maintains continuity
  • Hourly schedule for timely updates
  • Partitions separate by repository

Troubleshooting Creation Issues

Cannot Create Pipeline

Problem: Create button disabled or fails

Solutions:

  • Check that you have the PipelineAdministrator or PipelineDeveloper role
  • Verify the pipeline name is unique
  • Ensure all required fields are filled in
  • Check that the data source exists and is accessible

Configuration Not Saved

Problem: Changes don't persist

Solutions:

  • Check for validation errors (shown in red text)
  • Verify all required fields are completed
  • Check the browser console for errors
  • Try refreshing the page and re-entering the configuration

Preview/Test Not Working

Problem: Cannot test pipeline before saving

Solutions:

  • Ensure the data source is accessible
  • Check that sample data is available
  • Verify authentication is configured
  • Review test data permissions