Troubleshooting Data Pipelines

Learn how to diagnose and resolve common data pipeline issues.

Overview

This guide helps you troubleshoot data pipeline problems through systematic diagnosis and resolution. Follow the troubleshooting workflow to identify and fix issues efficiently.

Troubleshooting Workflow

  1. Identify the Problem: Determine what's not working
  2. Gather Information: Collect logs, error messages, and context
  3. Analyze Symptoms: Match symptoms to known issues
  4. Apply Solutions: Follow recommended fixes
  5. Verify Resolution: Confirm the issue is resolved
  6. Document: Record the problem and solution

Common Issues by Category

Pipeline Execution Issues

Pipeline Won't Start

Symptoms:

  • Run button disabled or does nothing
  • Error: "Cannot start pipeline"
  • Pipeline stays in "Pending" status

Common Causes:

Cause                      | Check                 | Solution
Pipeline inactive          | Pipeline status       | Activate the pipeline
Insufficient permissions   | User role             | Request Operator or Administrator role
Another run active         | Pipeline runs list    | Wait for completion or cancel
Missing trigger parameters | Trigger configuration | Add required parameters
System maintenance         | Platform status       | Wait and retry

Detailed Diagnosis:

  1. Check Pipeline Status:

    Navigate to Data Pipelines → Find your pipeline
    Status should show "Active" not "Inactive"
    
  2. Verify Permissions:

    Check your assigned roles
    Need: PipelineOperator or PipelineAdministrator
    
  3. Review Active Runs:

    Go to Data Pipeline Runs
    Filter by your pipeline name
    Check for "Running" status
    

Pipeline Fails Immediately

Symptoms:

  • Pipeline starts but fails within seconds
  • Status changes directly to "Failed"
  • No data processed

Common Causes:

1. Data Source Connection Issues

Symptoms:

  • Error: "Unable to connect to data source"
  • Error: "Access denied"

Solutions:

  • Verify data source configuration
  • Check authentication credentials
  • Test network connectivity
  • Verify permissions on source
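
For the network connectivity check, a quick TCP probe from the machine that runs the pipeline can rule out firewall or DNS problems. This is a minimal sketch using only the Python standard library; the helper name is hypothetical, and the host and port you pass in depend on your data source.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A successful probe only shows that the endpoint is reachable. "Access denied" errors still need the credential and permission checks listed above.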

2. Missing Configuration

Symptoms:

  • Error: "Required parameter not found"
  • Error: "Invalid configuration"

Solutions:

  • Review trigger parameter values
  • Verify all required parameters provided
  • Check parameter naming convention
  • Validate parameter types
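
The parameter checks above can be scripted before a run is triggered. The following sketch assumes you keep a mapping of required parameter names to expected Python types; the function name is hypothetical, and the parameter names in the usage note are examples only.

```python
def validate_parameters(provided, required):
    """Compare provided trigger parameters against {name: expected_type}.

    Names are compared case-sensitively.
    """
    problems = []
    for name, expected_type in required.items():
        if name not in provided:
            problems.append(f"missing: {name}")
        elif not isinstance(provided[name], expected_type):
            actual = type(provided[name]).__name__
            problems.append(
                f"wrong type: {name} is {actual}, expected {expected_type.__name__}"
            )
    return problems
```

For example, passing "400" (a string) where an integer is expected for PartitionSizeTokens is reported as a type problem rather than surfacing later as "Invalid configuration".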

3. Plugin Not Found

Symptoms:

  • Error: "Plugin not registered"
  • Error: "Unknown plugin"

Solutions:

  • Verify plugin object ID
  • Check plugin deployment
  • Confirm plugin version compatibility

Pipeline Hangs or Times Out

Symptoms:

  • Pipeline runs for extended period
  • No progress updates
  • Eventually times out

Common Causes:

Cause               | Typical Duration | Solution
Large dataset       | Hours            | Normal, monitor progress
Network issues      | Varies           | Check connectivity
Resource contention | Varies           | Check system load
Infinite loop       | Forever          | Cancel and review config

Diagnostic Steps:

  1. Check Expected Duration:

    • Review similar past runs
    • Calculate based on data volume
    • Compare against baselines
  2. Monitor Progress:

    • Check items processed
    • Review stage completion
    • Look for incremental updates
  3. Inspect Logs:

    • Look for repeated errors
    • Check for warnings
    • Identify stuck stage
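
The first two diagnostic steps reduce to simple arithmetic: estimate the expected duration from data volume and a baseline rate, then check whether the processed count is still increasing. This is a sketch under the assumption that you record progress as (minute, items_processed) samples; both function names are hypothetical.

```python
def expected_minutes(item_count, baseline_items_per_minute):
    """Estimate run time from data volume and a baseline rate from past runs."""
    if baseline_items_per_minute <= 0:
        raise ValueError("baseline rate must be positive")
    return item_count / baseline_items_per_minute

def is_stalled(samples, stall_minutes):
    """samples: list of (minute, items_processed) in time order.

    Returns True if the processed count has stayed the same for at least
    `stall_minutes` up to the most recent sample.
    """
    if not samples:
        return False
    last_minute, last_count = samples[-1]
    unchanged_since = last_minute
    for minute, count in reversed(samples):
        if count != last_count:
            break
        unchanged_since = minute
    return last_minute - unchanged_since >= stall_minutes
```

A run that is far past its expected duration but still making progress points to a large dataset or resource contention. A run whose count has stopped moving is a better candidate for cancelling and reviewing the configuration.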

Stage-Specific Issues

Text Extraction Fails

Symptoms:

  • Error: "Unable to extract text"
  • Files skipped
  • Empty content output

Common Causes:

1. Unsupported File Format

Solution:

Verify file extension matches supported types:
- PDF: .pdf
- Word: .docx (not .doc)
- Excel: .xlsx (not .xls)
- PowerPoint: .pptx (not .ppt)
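
To find unsupported files before a run, the extension list above can be checked in bulk. This is a minimal sketch; the function name is hypothetical, and treating extensions as case-insensitive is an assumption that may not match how the extraction stage behaves.

```python
from pathlib import Path

SUPPORTED = {".pdf", ".docx", ".xlsx", ".pptx"}
LEGACY = {".doc": ".docx", ".xls": ".xlsx", ".ppt": ".pptx"}

def check_extension(filename):
    """Return 'ok', or a short reason the file would likely be skipped."""
    ext = Path(filename).suffix.lower()  # assumption: case-insensitive match
    if ext in SUPPORTED:
        return "ok"
    if ext in LEGACY:
        return f"legacy format {ext}; convert to {LEGACY[ext]}"
    return f"unsupported extension {ext or '(none)'}"
```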

2. Corrupted Files

Solution:

- Test file opens normally
- Try re-downloading source file
- Check file size isn't zero
- Verify file isn't password-protected
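
Some of these checks can be automated cheaply. The sketch below relies on two well-known format facts: a PDF contains a "%PDF-" header near the start of the file, and .docx, .xlsx, and .pptx files are ZIP containers. The function name is hypothetical, and passing these checks does not prove that a file is intact.

```python
import zipfile
from pathlib import Path

def basic_file_check(path):
    """Return a list of problems found by cheap, local checks."""
    path = Path(path)
    if path.stat().st_size == 0:
        return ["file is empty (0 bytes)"]
    problems = []
    ext = path.suffix.lower()
    if ext == ".pdf":
        with path.open("rb") as f:
            # Be lenient: accept the header anywhere in the first 1024 bytes.
            if b"%PDF-" not in f.read(1024):
                problems.append("missing %PDF- header")
    elif ext in {".docx", ".xlsx", ".pptx"}:
        if not zipfile.is_zipfile(path):
            problems.append("not a valid ZIP container")
    return problems
```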

3. Large File Issues

Solution:

- Check file size limits
- Increase memory allocation
- Split large files
- Use streaming extraction if available

Partitioning Produces Wrong Results

Symptoms:

  • Chunks too large or too small
  • Text cut mid-sentence
  • Overlaps incorrect

Diagnosis:

  1. Review Parameters:

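    {
      "PartitionSizeTokens": 400,
      "PartitionOverlapTokens": 100
    }

    PartitionSizeTokens is the target chunk size in tokens; PartitionOverlapTokens is the number of tokens repeated between consecutive chunks.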
    
  2. Check Token Counting:

    • Different models count tokens differently
    • Verify tokenizer matches embedding model
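
To see how the two parameters interact, it can help to reproduce the windowing on a small input. The following is an illustrative sketch of fixed-size partitioning with overlap over an already tokenized list. It is not the platform's partitioning implementation, and the function name is hypothetical.

```python
def partition(tokens, size, overlap):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With a size of 400 and an overlap of 100, each chunk starts 300 tokens after the previous one, so a 1000-token document yields three chunks. If the counts you observe differ widely from this kind of estimate, a tokenizer mismatch is a likely cause.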

Solutions:

Problem           | Current Setting | Recommended Setting
Chunks too large  | 800+ tokens     | 300-500 tokens
Chunks too small  | <200 tokens     | 400-600 tokens
Poor overlap      | <50 tokens      | 100-150 tokens
Excessive overlap | >200 tokens     | 50-100 tokens

Embedding Failures

Symptoms:

  • Error: "Embedding failed"
  • Error: "Rate limit exceeded"
  • Timeout errors

Common Causes:

1. API Rate Limits

Symptoms:

  • Error: "429 Too Many Requests"
  • Intermittent failures

Solutions:

  • Reduce batch size
  • Add retry logic
  • Check quota limits
  • Throttle requests to stay under the rate limit
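
Retry logic for rate limits is commonly implemented as exponential backoff with jitter. The sketch below shows the pattern in isolation: RateLimitError is a hypothetical stand-in for whatever exception your embedding client raises on HTTP 429, and the default delays are illustrative.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for your client's HTTP 429 error."""

def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on RateLimitError wait, then retry with a doubled delay."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

If the service returns a Retry-After header, honoring it is usually preferable to a computed delay.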

2. Invalid Model Name

Symptoms:

  • Error: "Model not found"
  • All embedding requests fail

Solution:

Verify the model name exactly matches one of:
- text-embedding-3-large
- text-embedding-3-small
- text-embedding-ada-002

Common typos:
✗ text-embedding-3-Large (wrong case)
✗ text-embedding-large (missing version)
✓ text-embedding-3-large (correct)
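
A configuration check can catch these typos before a run. This is a minimal sketch using the standard library's difflib to suggest the closest valid name. The function name is hypothetical, and the model list should be replaced with whatever is deployed in your environment.

```python
import difflib

KNOWN_MODELS = [
    "text-embedding-3-large",
    "text-embedding-3-small",
    "text-embedding-ada-002",
]

def check_model_name(name):
    """Return (is_valid, suggestion). Matching is exact and case-sensitive."""
    if name in KNOWN_MODELS:
        return True, None
    # Lowercase only for the suggestion, so wrong-case names still get a hint.
    matches = difflib.get_close_matches(name.lower(), KNOWN_MODELS, n=1, cutoff=0.6)
    return False, (matches[0] if matches else None)
```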

3. Text Too Long

Symptoms:

  • Error: "Input too long"
  • Specific chunks fail

Solutions:

  • Reduce partition size
  • Check for malformed chunks
  • Verify token limits for model

Indexing Issues

Symptoms:

  • Data doesn't appear in index
  • Partial data indexed
  • Search doesn't find content

Common Causes:

1. Index Not Found

Symptoms:

  • Error: "Index does not exist"

Solutions:

1. Verify index name spelling
2. Check index exists in Azure AI Search
3. Confirm API endpoint configuration
4. Verify permissions on index

2. Index Full

Symptoms:

  • Error: "Insufficient storage"
  • Partial success

Solutions:

1. Check Azure AI Search quota
2. Review index size
3. Consider index partitioning
4. Clean up old data

3. Schema Mismatch

Symptoms:

  • Error: "Field not found"
  • Data transformation fails

Solutions:

1. Review index schema
2. Verify field names match
3. Check data types
4. Update schema if needed

Data Quality Issues

Missing Data

Symptoms:

  • Expected files not processed
  • Gaps in indexed content
  • Fewer items than source

Diagnosis Checklist:

  • [ ] Check data source filter/path
  • [ ] Verify file permissions
  • [ ] Review error logs for skipped files
  • [ ] Confirm trigger parameter values
  • [ ] Check for duplicate detection

Investigation:

  1. Compare Source vs. Processed:

    Source: 1000 files
    Processed: 950 files
    Missing: 50 files → Check logs for reasons
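
When you can export both file lists, the comparison is a set difference. A minimal sketch, assuming each list holds the same kind of identifier (for example, relative paths); the function name is hypothetical.

```python
def missing_files(source_files, processed_files):
    """Return source files that never appear in the processed list, sorted."""
    return sorted(set(source_files) - set(processed_files))
```

Searching the pipeline log for each returned name usually shows whether the file was skipped, unsupported, or blocked by permissions.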
    
  2. Review Pipeline Logs:

    • Look for "skipped" messages
    • Check for unsupported formats
    • Identify permission errors

Incorrect Data

Symptoms:

  • Text doesn't match source
  • Garbled characters
  • Wrong encoding

Common Causes:

1. Encoding Issues

Symptoms:

  • Special characters display wrong
  • Non-English text garbled

Solutions:

  • Verify source file encoding
  • Check UTF-8 support
  • Test with sample files
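
To verify a source file's encoding, try decoding a sample as UTF-8 and see whether it fails. The sketch below does that and then falls back to other candidates. The fallback list is an assumption (common Western European encodings) and should be adjusted to the encodings your sources actually use.

```python
def decode_with_fallback(raw, fallbacks=("cp1252", "latin-1")):
    """Decode bytes as UTF-8 first; otherwise try each fallback in order.

    Returns (text, encoding_used).
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    for enc in fallbacks:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode with any candidate encoding")
```

If a sample file does not decode as UTF-8, that is a strong hint that garbled output comes from the source encoding and not from the extraction stage. A fallback that decodes without error can still produce wrong characters, so inspect the result.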

2. Extraction Errors

Symptoms:

  • PDF text out of order
  • Tables not parsed correctly
  • Images missing

Solutions:

  • Review PDF structure
  • Check extraction plugin version
  • Consider OCR for images
  • Validate output samples

Performance Issues

Slow Processing

Symptoms:

  • Pipeline takes much longer than expected
  • Processing rate decreases over time
  • Timeouts occur

Performance Benchmarks:

Stage        | Typical Rate        | Slow If
Extraction   | 10-50 docs/min      | <5 docs/min
Partitioning | 100-500 chunks/min  | <50 chunks/min
Embedding    | 50-200 chunks/min   | <25 chunks/min
Indexing     | 100-1000 chunks/min | <50 chunks/min
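
To apply these benchmarks to a specific run, divide the items a stage processed by its elapsed time and compare the result with the "Slow If" threshold. A minimal sketch with the thresholds copied from the table above; the function name is hypothetical.

```python
SLOW_THRESHOLDS = {  # items per minute, from the "Slow If" column
    "Extraction": 5,
    "Partitioning": 50,
    "Embedding": 25,
    "Indexing": 50,
}

def is_slow(stage, items_processed, elapsed_minutes):
    """Return True if the observed rate is below the stage's slow threshold."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    rate = items_processed / elapsed_minutes
    return rate < SLOW_THRESHOLDS[stage]
```

For example, 200 chunks embedded in 10 minutes is 20 chunks/min, which falls below the 25 chunks/min threshold for the Embedding stage.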

Optimization Strategies:

  1. Increase Batch Sizes (if memory allows):

    Current: 10 items/batch
    Recommended: 50-100 items/batch
    
  2. Optimize Partitioning:

    Increase chunk size: 300 → 400 tokens
    Larger chunks reduce the total number of chunks to embed and index
    
  3. Use Faster Embedding Model:

    Current: text-embedding-3-large
    Faster: text-embedding-3-small
    (Trade-off: slightly lower quality)
    
  4. Parallel Processing:

    • Increase worker instances
    • Enable parallel stages
    • Use multiple pipelines for different data sets

High Resource Usage

Symptoms:

  • High memory consumption
  • CPU at 100%
  • Storage filling quickly

Solutions:

Memory:

- Reduce batch sizes
- Process smaller files
- Enable streaming if available
- Increase worker memory limits

CPU:

- Reduce parallelism
- Optimize chunk sizes
- Schedule during off-peak
- Consider async processing

Storage:

- Clean up old runs
- Archive completed data
- Implement retention policies
- Monitor growth trends

Log Analysis

Finding Logs

  1. Navigate to Data Pipeline Runs
  2. Click on specific run
  3. View Execution Log or Details

Reading Log Messages

Log Levels:

  • ERROR: Failures requiring attention
  • WARNING: Issues that may cause problems
  • INFO: Normal operations
  • DEBUG: Detailed diagnostic information

Common Error Patterns:

ERROR: Connection timeout
→ Network or firewall issue

ERROR: Unauthorized access
→ Permission or authentication problem

ERROR: Resource not found
→ Configuration error (index, model, etc.)

ERROR: Invalid parameter value
→ Configuration or data format issue

WARNING: Retrying operation
→ Transient error, may resolve

WARNING: Deprecated feature
→ Update configuration
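
For long logs, the patterns above can be counted automatically to show which cause dominates. This sketch assumes each log line starts with its level (ERROR or WARNING), as in the examples above. The actual log format may differ, and the function name is hypothetical.

```python
LOG_HINTS = [
    ("connection timeout", "Network or firewall issue"),
    ("unauthorized access", "Permission or authentication problem"),
    ("resource not found", "Configuration error (index, model, etc.)"),
    ("invalid parameter value", "Configuration or data format issue"),
    ("retrying operation", "Transient error, may resolve"),
    ("deprecated feature", "Update configuration"),
]

def summarize_log(lines):
    """Count likely causes for ERROR and WARNING lines matching a known pattern."""
    counts = {}
    for line in lines:
        if not line.startswith(("ERROR", "WARNING")):
            continue
        lowered = line.lower()
        for pattern, hint in LOG_HINTS:
            if pattern in lowered:
                counts[hint] = counts.get(hint, 0) + 1
                break  # count each line once
    return counts
```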

Error Message Reference

Error Code | Meaning                 | Solution
DPS-001    | Data source unreachable | Check connectivity
DPS-002    | Authentication failed   | Verify credentials
DPS-003    | Plugin error            | Check plugin configuration
DPS-004    | Timeout                 | Increase timeout or reduce load
DPS-005    | Invalid configuration   | Review settings
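
If error messages embed these codes, a small lookup can attach the meaning and solution to each failure. A minimal sketch, assuming codes appear in messages as "DPS-" followed by three digits, which matches the table but is not confirmed for every message. The function name is hypothetical.

```python
import re

ERROR_CODES = {
    "DPS-001": ("Data source unreachable", "Check connectivity"),
    "DPS-002": ("Authentication failed", "Verify credentials"),
    "DPS-003": ("Plugin error", "Check plugin configuration"),
    "DPS-004": ("Timeout", "Increase timeout or reduce load"),
    "DPS-005": ("Invalid configuration", "Review settings"),
}

def explain(message):
    """Return (code, meaning, solution) for the first known DPS code, else None."""
    match = re.search(r"\bDPS-\d{3}\b", message)
    if not match or match.group() not in ERROR_CODES:
        return None
    code = match.group()
    return (code, *ERROR_CODES[code])
```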

State Inspection

Checking Pipeline State

  1. Configuration State:

    • Review pipeline definition
    • Verify all stages configured
    • Check trigger parameters
  2. Execution State:

    • Check run status
    • Review stage completion
    • Inspect work item progress
  3. Data State:

    • Verify source data availability
    • Check index contents
    • Validate embedding quality

Recovery Procedures

Restarting Failed Pipeline

  1. Review failure reason in logs
  2. Fix underlying issue
  3. Run pipeline again (full or incremental)
  4. Monitor for successful completion

Cleaning Up Partial Runs

If a pipeline failed mid-execution:
1. Identify what was processed
2. Determine whether the partial data is acceptable
3. Clean up or complete processing
4. Document for audit trail

Rollback Procedures

For incorrect data indexed:

  1. Stop any active runs
  2. Remove incorrect data from index
  3. Reprocess from source
  4. Verify correct data loaded

Prevention Best Practices

Pre-Flight Checks

Before running pipeline:

  • [ ] Verify data source accessible
  • [ ] Check target has capacity
  • [ ] Review configuration
  • [ ] Test with small sample
  • [ ] Monitor first few minutes

Monitoring

  • Set up alerts for failures
  • Monitor run durations
  • Track success rates
  • Review logs regularly

Documentation

  • Document pipeline configurations
  • Record common issues and solutions
  • Maintain runbook
  • Keep change log

Getting Help

Information to Gather

When seeking support, provide:

  1. Pipeline name and ID
  2. Run ID and timestamp
  3. Error messages (complete text)
  4. Recent changes made
  5. What you've tried

Escalation Path

  1. Self-Service: Use this guide
  2. Team Lead: Escalate if unresolved
  3. Platform Team: For infrastructure issues
  4. Vendor Support: For third-party components