Creating Data Pipelines
Learn how to create and configure data pipelines for processing and indexing your data.
Overview
Data pipelines define the processing workflow for transforming raw data into indexed, searchable content that agents can use. A pipeline consists of:
- Data Source: Where to read data from
- Stages: Processing steps (text extraction, splitting, embedding, indexing)
- Configuration: Parameters for each stage
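As a mental model, these three parts can be sketched as a small data structure. The field names below are illustrative only, not the product's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    """One processing step (e.g., extraction, partitioning)."""
    name: str
    plugin: str
    parameters: dict = field(default_factory=dict)

@dataclass
class Pipeline:
    """A data source plus an ordered list of stages."""
    name: str
    data_source: str
    stages: list[Stage] = field(default_factory=list)

pipeline = Pipeline(
    name="company-docs-pipeline",
    data_source="company-storage",
    stages=[
        Stage("Extract", "TextExtraction"),
        Stage("Partition", "TokenPartition", {"chunk_size": 400, "overlap": 100}),
    ],
)
```

The rest of this page walks through how the portal collects each of these pieces.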
Accessing Data Pipelines
- In the Management Portal sidebar, click Data Pipelines under the Data section
- The pipelines list shows all configured pipelines
Pipeline List
The table displays:
| Column | Description |
|---|---|
| Name | Pipeline identifier |
| Description | Purpose of the pipeline |
| Active | Whether the pipeline is enabled |
| Edit | Settings icon to modify configuration |
| Run | Execute the pipeline manually |
| Delete | Remove the pipeline |
Creating a Pipeline
- Click Create Pipeline at the top right
- Configure the pipeline settings
Basic Information
| Field | Description |
|---|---|
| Pipeline Name | Unique identifier (letters, numbers, dashes, underscores) |
| Display Name | User-friendly name |
| Description | Purpose and what data it processes |
Select Data Source
Choose an existing data source from the dropdown. This determines where the pipeline reads input data.
After selecting a data source:
| Field | Description |
|---|---|
| Data Source Name | Override name for this pipeline's data reference |
| Data Source Description | Additional description |
| Data Source Plugin | Plugin for processing this data source type |
Configure Data Source Plugin
The plugin determines how data is read. Configure default values for plugin parameters:
| Parameter Type | Input Method |
|---|---|
| String/Int/Float/DateTime | Text input |
| Boolean | Toggle switch |
| Array | Chips (comma-separated values) |
| Resource Object ID | Dropdown selection |
Pipeline Stages
Stages define the processing workflow. Common stages include:
Stage Types
| Stage | Purpose |
|---|---|
| Text Extraction | Extract text content from documents |
| Text Partitioning | Split text into chunks |
| Text Embedding | Generate vector embeddings |
| Indexing | Store in vector database |
Configuring Stages
For each stage:
- Select Plugin: Choose the processing plugin
- Configure Parameters: Set stage-specific settings
- Define Outputs: Specify where results go
Stage Parameters
Common parameters by stage type:
Text Partitioning:
| Parameter | Description |
|---|---|
| Chunk Size | Maximum size of each text chunk |
| Overlap Size | Characters overlapping between chunks |
| Tokenizer | How to count tokens |
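To illustrate how chunk size and overlap interact, here is a minimal sketch of token-window partitioning. It treats each list element as one token; a real tokenizer would do the counting:

```python
def partition(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a token list into chunks of at most chunk_size tokens,
    with `overlap` tokens repeated between consecutive chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

words = [f"t{i}" for i in range(950)]          # 950 pseudo-tokens
chunks = partition(words, chunk_size=400, overlap=100)
# step = 300 → 3 chunks: [0:400], [300:700], [600:950]
```

Note that larger overlap means more total tokens to embed, which trades cost for continuity between chunks.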
Text Embedding:
| Parameter | Description |
|---|---|
| Model | Embedding model to use |
| Dimensions | Vector size |
| Batch Size | Documents per batch |
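The batch-size parameter controls how many chunks are sent to the embedding model per call. A hedged sketch of that batching loop, with a stand-in `embed_fn` in place of a real model call:

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

def embed_all(chunks, embed_fn, batch_size=64):
    """Embed chunks in batches; embed_fn stands in for a real
    embedding-model call and must return one vector per input."""
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Toy embed_fn: 3-dim vector derived from text length (illustration only).
fake_embed = lambda batch: [[float(len(t)), 0.0, 0.0] for t in batch]
vecs = embed_all([f"chunk {i}" for i in range(130)], fake_embed, batch_size=50)
```

With 130 chunks and a batch size of 50, the model is called three times (50, 50, 30).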
Indexing:
| Parameter | Description |
|---|---|
| Vector Store | Target vector database |
| Index Name | Name for the index |
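The indexing stage writes each chunk plus its vector into the target store. The document shape below is purely illustrative (field names are hypothetical, loosely modeled on vector-store upload payloads, not a specific product schema):

```python
import json

def to_index_doc(chunk_id: str, text: str, vector: list[float], source: str) -> dict:
    """Assemble one chunk into an index document. Field names are
    illustrative; real vector stores define their own schemas."""
    return {
        "id": chunk_id,
        "content": text,
        "contentVector": vector,
        "source": source,
    }

doc = to_index_doc(
    "doc-001-chunk-0",
    "First chunk of text.",
    [0.1, 0.2, 0.3],
    "/documents/company-policies/handbook.pdf",
)
payload = json.dumps({"value": [doc]})  # batch upload body, one doc here
```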
Stage Ordering
Stages can be reordered:
- Drag and drop stages to change order
- Use the visual connectors to see flow
- Ensure logical progression (extraction → partitioning → embedding → indexing)
Advanced Configuration
Trigger Settings
Configure when the pipeline runs:
| Trigger Type | Description | Status |
|---|---|---|
| Manual | Run on demand only | ✅ Available |
| Scheduled | Run on a cron schedule | ✅ Available |
| Event-driven | Run when data changes | ⚠️ Not currently available |
Note: Triggers are defined as part of the pipeline configuration. Event-driven triggers are planned for a future release. See Setting Up Triggers for details.
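For a scheduled trigger like "daily at 2 AM" (cron expression `0 2 * * *`), the scheduler effectively computes the next run time from the current time. A minimal sketch of that calculation, assuming a fixed daily schedule:

```python
from datetime import datetime, timedelta

def next_daily_run(now: datetime, hour: int, minute: int = 0) -> datetime:
    """Next occurrence of a daily schedule, e.g. hour=2 for
    the cron expression '0 2 * * *'."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)   # today's slot already passed
    return candidate

run = next_daily_run(datetime(2024, 5, 1, 14, 30), hour=2)
# 14:30 is past 02:00, so the next run falls on the following day
```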
Prompt Configuration
Some stages use prompts for processing:
- Select a prompt from the available options
- Configure prompt parameters
- Set the prompt role (e.g., main_prompt)
Saving the Pipeline
- Review all configuration sections
- Click Create Pipeline or Save Changes
- Wait for validation and saving to complete
Best Practices
Design
- Start Simple: Begin with basic Extract → Partition → Embed → Index flow
- Use Descriptive Names: Choose clear names like "customer-docs-pipeline" not "pipeline1"
- Document Purpose: Write clear descriptions of what data the pipeline processes
- Plan for Scale: Consider future data volume when configuring
Naming Conventions
Pipeline Names:
Good:
- customer-documents-pipeline
- sharepoint-knowledge-base
- financial-reports-indexer
Bad:
- pipeline1
- test
- my-pipeline
Stage Names:
Good:
- Extract
- Partition
- Embed-Large-Model
- Index-Customer-Data
Bad:
- stage1
- process
- do-stuff
Configuration
Chunk Size Guidelines:
| Content Type | Recommended Size | Reasoning |
|---|---|---|
| Technical docs | 400-600 tokens | Dense information |
| General text | 300-500 tokens | Balanced |
| Chat/dialogue | 200-300 tokens | Conversational |
| Code | 500-800 tokens | Need context |
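The guidelines above can be encoded as a simple lookup if you want to set chunk sizes programmatically. This is only a convenience sketch mirroring the table, defaulting to the general-text range:

```python
# Ranges copied from the guidelines table above (in tokens).
CHUNK_SIZES = {
    "technical": (400, 600),
    "general": (300, 500),
    "dialogue": (200, 300),
    "code": (500, 800),
}

def recommended_chunk_size(content_type: str) -> int:
    """Midpoint of the recommended range; falls back to general text."""
    low, high = CHUNK_SIZES.get(content_type, CHUNK_SIZES["general"])
    return (low + high) // 2
```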
Embedding Model Selection:
| Use Case | Model | Dimensions |
|---|---|---|
| General purpose | text-embedding-3-large | 1536 |
| High accuracy | text-embedding-3-large | 3072 |
| Cost-effective | text-embedding-3-small | 1536 |
| Fast processing | text-embedding-3-small | 512 |
Performance
- Batch Sizes: Start with 50-100 items, adjust based on memory
- Parallel Processing: Enable where supported for better throughput
- Monitor First Run: Watch first execution closely for issues
- Incremental Processing: Use filters to process only new/changed files
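One common way to implement incremental processing is to compare file modification times against the last successful run. A hedged sketch (the listing dict stands in for whatever file metadata your data source exposes):

```python
from datetime import datetime, timezone

def files_changed_since(files: dict[str, datetime], last_run: datetime) -> list[str]:
    """Return paths modified after the last run, so only new or
    changed files are reprocessed."""
    return sorted(path for path, mtime in files.items() if mtime > last_run)

listing = {
    "/docs/a.pdf": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "/docs/b.pdf": datetime(2024, 5, 3, tzinfo=timezone.utc),
}
changed = files_changed_since(listing, datetime(2024, 5, 2, tzinfo=timezone.utc))
```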
Maintenance
- Test First: Always test with sample data before full runs
- Create Snapshots: Snapshot configuration before major changes
- Monitor Regularly: Check pipeline runs daily initially
- Review Logs: Investigate warnings and errors promptly
- Update Documentation: Keep pipeline documentation current
Common Pitfalls
Configuration Errors
Problem: Missing required parameters
Error: "Required parameter 'Folders' not provided"
Solution: Check trigger configuration has all data source parameters
Problem: Wrong plugin selected
Error: "Plugin does not support this data source type"
Solution: Verify plugin matches data source (e.g., Azure Data Lake plugin for Azure Data Lake source)
Problem: Invalid stage order
Error: "Stage requires input from previous stage"
Solution: Ensure stages are in logical order (can't embed before extracting)
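A stage-order check like the validator behind this error can be sketched as follows. The canonical ordering here is an assumption for illustration; real pipelines may allow optional stages, but the relative order must hold:

```python
# Assumed canonical ordering (optional stages may be omitted,
# but the remaining ones must keep this relative order).
ORDER = ["extract", "partition", "embed", "index"]

def validate_stage_order(stages: list[str]) -> bool:
    """True if the given stages appear in a logically valid order."""
    positions = [ORDER.index(s) for s in stages]
    return positions == sorted(positions)

ok = validate_stage_order(["extract", "partition", "embed", "index"])
bad = validate_stage_order(["embed", "extract"])  # can't embed before extracting
```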
Performance Issues
Problem: Pipeline too slow
Causes:
- Chunks too small (too many to process)
- Batch size too small
- Sequential instead of parallel processing
Solution: See Optimizing Pipeline Performance
Problem: Out of memory errors
Causes:
- Batch size too large
- Processing very large files
- Insufficient worker memory
Solution: Reduce batch size, split large files, or increase worker resources
Data Quality Issues
Problem: Search results poor quality
Causes:
- Chunks too large (lose precision)
- Wrong embedding model
- Low dimensions
Solution: Experiment with partition size, try larger model or higher dimensions
Problem: Missing content
Causes:
- Files filtered out
- Unsupported formats
- Extraction failures
Solution: Check logs for skipped files, verify formats, review extraction errors
Real-World Examples
Example 1: Basic Document Indexing
Scenario: Index PDF documents from Azure Data Lake
Configuration:
Pipeline: company-docs-pipeline
Data Source: Azure Data Lake (company-storage)
Folders: /documents/company-policies
Stages:
1. Extract (TextExtractionDataPipelineStage)
2. Partition (Token, 400 tokens, 100 overlap)
3. Embed (text-embedding-3-large, 1536 dimensions)
4. Index (Azure AI Search, company-docs-index, default partition)
Trigger: Schedule, daily at 2 AM
Why This Works:
- Simple, reliable flow
- Standard chunk size for documents
- High-quality embeddings
- Regular refresh schedule
Example 2: High-Volume Processing
Scenario: Process thousands of support tickets daily
Configuration:
Pipeline: support-tickets-pipeline
Data Source: Azure Data Lake (support-storage)
Folders: /tickets/new
Stages:
1. Extract
2. Partition (Token, 300 tokens, 50 overlap)
3. Embed (text-embedding-3-small, 1024 dimensions)
4. Index (Azure AI Search, support-index, tickets partition)
Trigger: Schedule, every 15 minutes
Why This Works:
- Frequent scheduled trigger for near-real-time processing
- Smaller chunks for faster processing
- Efficient embedding model for speed
- Lower cost for high volume
Example 3: Multi-Language Content
Scenario: Index international documents in multiple languages
Configuration:
Pipeline: global-content-pipeline
Data Source: SharePoint Online (international-sites)
Stages:
1. Extract
2. Partition (Semantic, 500 tokens, 150 overlap)
3. Embed (text-embedding-3-large, 3072 dimensions)
4. Index (Azure AI Search, global-index, by-language partitions)
Trigger: Schedule, weekly on Sunday
Why This Works:
- Semantic partitioning preserves meaning across languages
- High-dimensional embeddings for better multilingual support
- Partitions separate by language for better search
- Weekly refresh sufficient for less dynamic content
Example 4: Code Documentation
Scenario: Index code repositories and technical documentation
Configuration:
Pipeline: code-docs-pipeline
Data Source: Azure Data Lake (code-repos)
Folders: /repos/*/docs
Stages:
1. Extract
2. Partition (Token, 800 tokens, 200 overlap)
3. Embed (text-embedding-3-large, 3072 dimensions)
4. Index (Azure AI Search, code-docs-index, by-repo partitions)
Trigger: Schedule, hourly
Why This Works:
- Larger chunks preserve code context
- Higher overlap maintains continuity
- Hourly schedule for timely updates
- Partitions separate by repository
Troubleshooting Creation Issues
Cannot Create Pipeline
Problem: Create button disabled or fails
Solutions:
- Check you have PipelineAdministrator or PipelineDeveloper role
- Verify pipeline name is unique
- Ensure required fields filled
- Check data source exists and is accessible
Configuration Not Saved
Problem: Changes don't persist
Solutions:
- Check for validation errors (red text)
- Verify all required fields completed
- Check browser console for errors
- Try refreshing and re-entering
Preview/Test Not Working
Problem: Cannot test pipeline before saving
Solutions:
- Ensure data source accessible
- Check sample data available
- Verify authentication configured
- Review test data permissions