Creating Data Pipelines

Learn how to create and configure data pipelines for processing and indexing your data.

Overview

Data pipelines define the processing workflow for transforming raw data into indexed, searchable content that agents can use. A pipeline consists of:

  • Data Source: Where to read data from
  • Stages: Processing steps (text extraction, splitting, embedding, indexing)
  • Configuration: Parameters for each stage

Accessing Data Pipelines

  1. In the Management Portal sidebar, click Data Pipelines under the Data section
  2. The pipelines list shows all configured pipelines

Pipeline List

The table displays:

Column       Description
------       -----------
Name         Pipeline identifier
Description  Purpose of the pipeline
Active       Whether the pipeline is enabled
Edit         Settings icon to modify configuration
Run          Execute the pipeline manually
Delete       Remove the pipeline

Creating a Pipeline

  1. Click Create Pipeline at the top right
  2. Configure the pipeline settings

Basic Information

Field          Description
-----          -----------
Pipeline Name  Unique identifier (letters, numbers, dashes, underscores)
Display Name   User-friendly name
Description    Purpose and what data it processes

Select Data Source

Choose an existing data source from the dropdown. This determines where the pipeline reads input data.

After selecting a data source:

Field                    Description
-----                    -----------
Data Source Name         Override name for this pipeline's data reference
Data Source Description  Additional description
Data Source Plugin       Plugin for processing this data source type

Configure Data Source Plugin

The plugin determines how data is read. Configure default values for plugin parameters:

Parameter Type             Input Method
--------------             ------------
String/Int/Float/DateTime  Text input
Boolean                    Toggle switch
Array                      Chips (comma-separated values)
Resource Object ID         Dropdown selection
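
To make the input methods concrete, here is a minimal sketch (in Python, with illustrative type names, not the product's actual API) of how raw UI values of each kind might be coerced into typed parameter values:

```python
# Hypothetical sketch: coercing raw UI input into typed plugin parameters.
# The type names and coercion rules here are illustrative assumptions.

def coerce_parameter(raw: str, param_type: str):
    """Convert a raw UI value into the declared parameter type."""
    if param_type in ("string", "resource_object_id"):
        return raw                      # text input / dropdown selection
    if param_type == "int":
        return int(raw)
    if param_type == "float":
        return float(raw)
    if param_type == "boolean":
        return raw.strip().lower() in ("true", "on", "1")   # toggle switch
    if param_type == "array":
        # Chips are entered as comma-separated values
        return [item.strip() for item in raw.split(",") if item.strip()]
    raise ValueError(f"Unknown parameter type: {param_type}")

print(coerce_parameter("a, b, c", "array"))   # ['a', 'b', 'c']
print(coerce_parameter("true", "boolean"))    # True
```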

Pipeline Stages

Stages define the processing workflow. Common stages include:

Stage Types

Stage              Purpose
-----              -------
Text Extraction    Extract text content from documents
Text Partitioning  Split text into chunks
Text Embedding     Generate vector embeddings
Indexing           Store in vector database

Configuring Stages

For each stage:

  1. Select Plugin: Choose the processing plugin
  2. Configure Parameters: Set stage-specific settings
  3. Define Outputs: Specify where results go

Stage Parameters

Common parameters by stage type:

Text Partitioning:

Parameter     Description
---------     -----------
Chunk Size    Maximum size of each text chunk
Overlap Size  Characters overlapping between chunks
Tokenizer     How to count tokens
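
The interaction of Chunk Size and Overlap Size can be sketched as follows. This is an illustrative Python sketch, not the stage's actual implementation; whitespace-separated tokens stand in for a real tokenizer to keep the example self-contained:

```python
# Illustrative token-based partitioning with overlap.

def partition(tokens, chunk_size, overlap):
    """Split a token list into chunks of at most chunk_size tokens,
    with `overlap` tokens repeated at each chunk boundary."""
    if overlap >= chunk_size:
        raise ValueError("Overlap Size must be smaller than Chunk Size")
    step = chunk_size - overlap  # each chunk starts `step` tokens later
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = [f"t{i}" for i in range(10)]
chunks = partition(tokens, chunk_size=4, overlap=1)
# The last token of each chunk is repeated as the first token of the next,
# so no sentence is cut cleanly in half at a boundary without context.
```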

Text Embedding:

Parameter   Description
---------   -----------
Model       Embedding model to use
Dimensions  Vector size
Batch Size  Documents per batch
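
Batch Size bounds how many chunks are sent to the embedding model per request: larger batches mean fewer round trips but higher peak memory use. A minimal sketch, where `embed_batch` is a stand-in for a real embedding call rather than any specific API:

```python
# Sketch of batched embedding; embed_batch is a hypothetical placeholder.

def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_batch(batch):
    # Placeholder: pretend each chunk embeds to a fixed-size vector.
    return [[0.0] * 8 for _ in batch]

def embed_all(chunks, batch_size=50):
    vectors = []
    for batch in batched(chunks, batch_size):
        # One model call per batch.
        vectors.extend(embed_batch(batch))
    return vectors
```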

Indexing:

Parameter     Description
---------     -----------
Vector Store  Target vector database
Index Name    Name for the index

Stage Ordering

Stages can be reordered:

  • Drag and drop stages to change order
  • Use the visual connectors to see flow
  • Ensure logical progression (extraction → partitioning → embedding → indexing)
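
The "logical progression" rule above can be expressed as a dependency check: each stage type must come after every type it consumes output from. This sketch uses an assumed dependency map for illustration, not a product rule set:

```python
# Hypothetical stage-order validation. DEPENDS_ON is an illustrative
# assumption about which stage types feed which.

DEPENDS_ON = {
    "extraction": set(),
    "partitioning": {"extraction"},
    "embedding": {"partitioning"},
    "indexing": {"embedding"},
}

def validate_order(stages):
    seen = set()
    for stage in stages:
        missing = DEPENDS_ON[stage] - seen
        if missing:
            raise ValueError(f"'{stage}' requires input from: {sorted(missing)}")
        seen.add(stage)
    return True

validate_order(["extraction", "partitioning", "embedding", "indexing"])  # OK
# validate_order(["embedding", "extraction"]) raises: can't embed
# before extracting.
```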

Advanced Configuration

Trigger Settings

Configure when the pipeline runs:

Trigger Type  Description             Status
------------  -----------             ------
Manual        Run on demand only      ✅ Available
Scheduled     Run on a cron schedule  ✅ Available
Event-driven  Run when data changes   ⚠️ Not currently available

Note: Triggers are defined as part of the pipeline configuration. Event-driven triggers are planned for a future release. See Setting Up Triggers for details.
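
Assuming standard five-field cron syntax (verify the exact syntax your deployment accepts), the schedules used in the examples later in this guide would look like:

```
0 2 * * *      daily at 2:00 AM
*/15 * * * *   every 15 minutes
0 0 * * 0      weekly on Sunday at midnight
0 * * * *      hourly, on the hour
```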

Prompt Configuration

Some stages use prompts for processing:

  1. Select a prompt from the available options
  2. Configure prompt parameters
  3. Set the prompt role (e.g., main_prompt)

Saving the Pipeline

  1. Review all configuration sections
  2. Click Create Pipeline or Save Changes
  3. Wait for validation to complete and the pipeline to save

Best Practices

Design

  • Start Simple: Begin with basic Extract → Partition → Embed → Index flow
  • Use Descriptive Names: Choose clear names like "customer-docs-pipeline" not "pipeline1"
  • Document Purpose: Write clear descriptions of what data the pipeline processes
  • Plan for Scale: Consider future data volume when configuring

Naming Conventions

Pipeline Names:

Good:
- customer-documents-pipeline
- sharepoint-knowledge-base
- financial-reports-indexer

Bad:
- pipeline1
- test
- my-pipeline
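
The Pipeline Name field accepts letters, numbers, dashes, and underscores. That character rule can be sketched as a simple check (descriptiveness, of course, cannot be automated; a valid name like "pipeline1" can still be a bad name):

```python
import re

# Sketch of the Pipeline Name character rule: letters, numbers,
# dashes, underscores. Checks validity only, not descriptiveness.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")

def is_valid_name(name: str) -> bool:
    return bool(NAME_PATTERN.match(name))

print(is_valid_name("customer-documents-pipeline"))  # True
print(is_valid_name("my pipeline!"))                 # False: space and '!'
```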

Stage Names:

Good:
- Extract
- Partition
- Embed-Large-Model
- Index-Customer-Data

Bad:
- stage1
- process
- do-stuff

Configuration

Chunk Size Guidelines:

Content Type    Recommended Size  Reasoning
------------    ----------------  ---------
Technical docs  400-600 tokens    Dense information
General text    300-500 tokens    Balanced
Chat/dialogue   200-300 tokens    Conversational
Code            500-800 tokens    Need context
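
Given a chunk size and overlap, you can estimate how many chunks a document will produce, since each new chunk starts (chunk size minus overlap) tokens after the previous one. A quick back-of-the-envelope sketch:

```python
import math

# Rough chunk-count estimate, assuming each new chunk starts
# (chunk_size - overlap) tokens after the previous one.

def estimated_chunks(total_tokens, chunk_size, overlap):
    step = chunk_size - overlap
    return max(1, math.ceil(total_tokens / step))

# A 10,000-token technical document at 500-token chunks with 100 overlap:
print(estimated_chunks(10_000, 500, 100))  # 25 chunks
```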

Embedding Model Selection:

Use Case         Model                   Dimensions
--------         -----                   ----------
General purpose  text-embedding-3-large  1536
High accuracy    text-embedding-3-large  3072
Cost-effective   text-embedding-3-small  1536
Fast processing  text-embedding-3-small  512

Performance

  • Batch Sizes: Start with 50-100 items, adjust based on memory
  • Parallel Processing: Enable where supported for better throughput
  • Monitor First Run: Watch first execution closely for issues
  • Incremental Processing: Use filters to process only new/changed files
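
One common form of incremental processing is filtering by modification time, so a run only picks up files changed since the last successful run. A minimal sketch (the folder path and cutoff are illustrative; real pipelines would track the last-run timestamp in their own state):

```python
import os

# Sketch of incremental file selection by modification time.

def changed_since(folder, last_run_epoch):
    """Return files under `folder` modified after last_run_epoch."""
    changed = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) > last_run_epoch:
                changed.append(path)
    return changed

# e.g. recent = changed_since("/documents/company-policies",
#                             last_successful_run_epoch)
```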

Maintenance

  • Test First: Always test with sample data before full runs
  • Create Snapshots: Snapshot configuration before major changes
  • Monitor Regularly: Check pipeline runs daily initially
  • Review Logs: Investigate warnings and errors promptly
  • Update Documentation: Keep pipeline documentation current

Common Pitfalls

Configuration Errors

Problem: Missing required parameters

Error: "Required parameter 'Folders' not provided"

Solution: Check trigger configuration has all data source parameters

Problem: Wrong plugin selected

Error: "Plugin does not support this data source type"

Solution: Verify plugin matches data source (e.g., Azure Data Lake plugin for Azure Data Lake source)

Problem: Invalid stage order

Error: "Stage requires input from previous stage"

Solution: Ensure stages are in logical order (can't embed before extracting)

Performance Issues

Problem: Pipeline runs too slowly

Causes:

  • Chunks too small (too many to process)
  • Batch size too small
  • Sequential instead of parallel processing

Solution: See Optimizing Pipeline Performance

Problem: Out-of-memory errors

Causes:

  • Batch size too large
  • Processing very large files
  • Insufficient worker memory

Solution: Reduce batch size, split large files, or increase worker resources

Data Quality Issues

Problem: Poor search result quality

Causes:

  • Chunks too large (lose precision)
  • Wrong embedding model
  • Low dimensions

Solution: Experiment with partition size, try larger model or higher dimensions

Problem: Missing content

Causes:

  • Files filtered out
  • Unsupported formats
  • Extraction failures

Solution: Check logs for skipped files, verify formats, review extraction errors
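
When checking logs for skipped files, a small script can summarize them. The log line format here ("SKIPPED <path>: <reason>") is hypothetical; adapt the parsing to whatever your pipeline actually emits:

```python
# Sketch of summarizing skipped files from run logs.
# The "SKIPPED <path>: <reason>" format is an assumption for illustration.

def skipped_files(log_lines):
    skipped = {}
    for line in log_lines:
        if line.startswith("SKIPPED "):
            rest = line[len("SKIPPED "):]
            path, _, reason = rest.partition(": ")
            skipped[path] = reason.strip()
    return skipped

log = [
    "PROCESSED /docs/a.pdf",
    "SKIPPED /docs/b.xyz: unsupported format",
    "SKIPPED /docs/c.pdf: extraction failed",
]
print(skipped_files(log))
# {'/docs/b.xyz': 'unsupported format', '/docs/c.pdf': 'extraction failed'}
```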

Real-World Examples

Example 1: Basic Document Indexing

Scenario: Index PDF documents from Azure Data Lake

Configuration:

Pipeline: company-docs-pipeline
Data Source: Azure Data Lake (company-storage)
Folders: /documents/company-policies

Stages:
1. Extract (TextExtractionDataPipelineStage)
2. Partition (Token, 400 tokens, 100 overlap)
3. Embed (text-embedding-3-large, 1536 dimensions)
4. Index (Azure AI Search, company-docs-index, default partition)

Trigger: Schedule, daily at 2 AM

Why This Works:

  • Simple, reliable flow
  • Standard chunk size for documents
  • High-quality embeddings
  • Regular refresh schedule

Example 2: High-Volume Processing

Scenario: Process thousands of support tickets daily

Configuration:

Pipeline: support-tickets-pipeline
Data Source: Azure Data Lake (support-storage)
Folders: /tickets/new

Stages:
1. Extract
2. Partition (Token, 300 tokens, 50 overlap)
3. Embed (text-embedding-3-small, 1024 dimensions)
4. Index (Azure AI Search, support-index, tickets partition)

Trigger: Schedule, every 15 minutes

Why This Works:

  • Frequent scheduled trigger for near-real-time processing
  • Smaller chunks for faster processing
  • Efficient embedding model for speed
  • Lower cost for high volume

Example 3: Multi-Language Content

Scenario: Index international documents in multiple languages

Configuration:

Pipeline: global-content-pipeline
Data Source: SharePoint Online (international-sites)

Stages:
1. Extract
2. Partition (Semantic, 500 tokens, 150 overlap)
3. Embed (text-embedding-3-large, 3072 dimensions)
4. Index (Azure AI Search, global-index, by-language partitions)

Trigger: Schedule, weekly on Sunday

Why This Works:

  • Semantic partitioning preserves meaning across languages
  • High-dimensional embeddings for better multilingual support
  • Partitions separate by language for better search
  • Weekly refresh sufficient for less dynamic content

Example 4: Code Documentation

Scenario: Index code repositories and technical documentation

Configuration:

Pipeline: code-docs-pipeline
Data Source: Azure Data Lake (code-repos)
Folders: /repos/*/docs

Stages:
1. Extract
2. Partition (Token, 800 tokens, 200 overlap)
3. Embed (text-embedding-3-large, 3072 dimensions)
4. Index (Azure AI Search, code-docs-index, by-repo partitions)

Trigger: Schedule, hourly

Why This Works:

  • Larger chunks preserve code context
  • Higher overlap maintains continuity
  • Hourly schedule for timely updates
  • Partitions separate by repository

Troubleshooting Creation Issues

Cannot Create Pipeline

Problem: Create button disabled or fails

Solutions:

  • Check that you have the PipelineAdministrator or PipelineDeveloper role
  • Verify the pipeline name is unique
  • Ensure all required fields are filled in
  • Check that the data source exists and is accessible

Configuration Not Saved

Problem: Changes don't persist

Solutions:

  • Check for validation errors (shown in red text)
  • Verify all required fields are completed
  • Check the browser console for errors
  • Try refreshing the page and re-entering the configuration

Preview/Test Not Working

Problem: Cannot test pipeline before saving

Solutions:

  • Ensure the data source is accessible
  • Check that sample data is available
  • Verify authentication is configured
  • Review test data permissions