Azure Data Lake as a Knowledge Source
Learn how to configure Azure Data Lake Storage Gen2 as a knowledge source for FoundationaLLM agents.
Overview
Azure Data Lake Storage Gen2 (ADLS) provides scalable, secure storage for large volumes of documents. It's ideal for:
- Large document repositories
- Structured data files (CSV, JSON, Parquet)
- Enterprise content management
- Multi-team data sharing
Prerequisites
Before configuring ADLS as a knowledge source:
- Azure Data Lake Storage Gen2 account with hierarchical namespace enabled
- Container created for your documents
- Access permissions configured (see Authentication Configuration)
- Network connectivity from FoundationaLLM services to storage account
- Firewall rules allowing access (if enabled)
- User with Data Sources Contributor role configured
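The hierarchical namespace prerequisite can be checked from the command line before you open the Management Portal. The following is a minimal sketch, assuming the Azure CLI is installed and signed in. `hns_enabled` and `check_hns` are hypothetical helper names, and the account and resource group arguments are placeholders.

```shell
# Hypothetical helper: succeeds only when the reported value means "enabled".
# Both casings are accepted because the CLI's output format can affect casing.
hns_enabled() {
  case "$1" in
    true|True) return 0 ;;
    *)         return 1 ;;
  esac
}

# Hypothetical wrapper: reads the account's isHnsEnabled property.
# $1 = storage account name, $2 = resource group (both placeholders).
check_hns() {
  value=$(az storage account show --name "$1" --resource-group "$2" \
    --query isHnsEnabled --output tsv) || return 2
  if hns_enabled "$value"; then
    echo "hierarchical namespace: enabled"
  else
    echo "hierarchical namespace: NOT enabled"
    return 1
  fi
}
```

Calling `check_hns mystorageaccount my-resource-group` would report whether the account meets the first prerequisite.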
Configuring ADLS as a Data Source
Step 1: Navigate to Data Sources
- In the Management Portal sidebar, click Data Sources
- Click Create Data Source
- Select Azure Data Lake from the type dropdown
NOTE: If you don’t see the Create Data Source button, the most likely reason is your account doesn’t have the “Data Sources Contributor” role assigned at the instance scope. Use the Security, Access Control feature in the Management Portal to add your user account to that role (or ask an authorized administrator to do so).
Step 2: Enter Basic Information
| Field | Description |
|---|---|
| Data Source Name | Unique identifier (e.g., corporate-docs-adls) |
| Description | Purpose and contents (e.g., "Corporate policy documents") |
Step 3: Select Authentication Type
Choose the authentication method:
| Type | Description | Best For |
|---|---|---|
| Azure Identity | Managed Identity authentication | Production deployments |
| Account Key | Storage account access key | Development, testing |
| Connection String | Full connection string | Legacy configurations |
Step 4: Configure Connection Details
For Azure Identity (Managed Identity)
| Field | Value |
|---|---|
| Authentication Type | Azure Identity |
| Account Name | Your storage account name (e.g., mystorageaccount) |
For Account Key
| Field | Value |
|---|---|
| Authentication Type | Account Key |
| API Key | Storage account access key |
| Endpoint | Storage endpoint URL (e.g., https://mystorageaccount.dfs.core.windows.net) |
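The Endpoint value follows a fixed pattern, so it can be derived from the account name. This sketch assumes the Azure public cloud suffix (`core.windows.net`); sovereign clouds use different suffixes. `adls_endpoint` is a hypothetical helper name.

```shell
# Hypothetical helper: builds the Data Lake (dfs) endpoint for an account.
# Storage account names are 3-24 characters, lowercase letters and digits only.
adls_endpoint() (
  LC_ALL=C            # make the a-z range byte-wise in every locale
  name=$1
  case "$name" in
    ''|*[!a-z0-9]*)
      echo "invalid storage account name: $name" >&2
      exit 1 ;;
  esac
  if [ "${#name}" -lt 3 ] || [ "${#name}" -gt 24 ]; then
    echo "storage account names are 3-24 characters: $name" >&2
    exit 1
  fi
  printf 'https://%s.dfs.core.windows.net\n' "$name"
)

adls_endpoint mystorageaccount   # prints https://mystorageaccount.dfs.core.windows.net
```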
For Connection String
| Field | Value |
|---|---|
| Authentication Type | Connection String |
| Connection String | Full storage connection string |
Step 5: Specify Folders
Enter the container and folder paths containing your documents:
- Type the path (e.g., mycontainer/documents/policies)
- Press Enter or , to add it
- Repeat for additional paths
Path Format Examples:
- Container root: mycontainer
- Single folder: mycontainer/documents
- Nested folder: mycontainer/departments/hr/policies
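In each of these formats the first segment is the container and the remainder is the folder inside it. The sketch below splits a path that way. `path_container` and `path_folder` are hypothetical helper names, and the sketch assumes paths have no leading slash, as in the examples.

```shell
# Hypothetical helper: everything before the first "/" is the container.
path_container() {
  printf '%s\n' "${1%%/*}"
}

# Hypothetical helper: everything after the first "/" is the folder path.
path_folder() {
  case "$1" in
    */*) printf '%s\n' "${1#*/}" ;;
    *)   printf '\n' ;;            # container root: no folder part
  esac
}

path_container mycontainer/departments/hr/policies   # prints mycontainer
path_folder    mycontainer/departments/hr/policies   # prints departments/hr/policies
```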
Step 6: Save the Data Source
Click Create Data Source to save the configuration.
Authentication Configuration
Using Managed Identity (Recommended)
Managed Identity provides secure, credential-free access:
- Ensure the FoundationaLLM deployment has a managed identity
- Grant the managed identity the Storage Blob Data Reader role on the storage account
- Select Azure Identity authentication in the data source configuration
Azure CLI Example:
az role assignment create \
--assignee <managed-identity-object-id> \
--role "Storage Blob Data Reader" \
--scope /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage-account>
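The scope string is easy to mistype, so it can help to build it from its parts. The sketch below wraps the same role assignment in hypothetical helpers (`storage_scope`, `grant_blob_reader`). The optional fourth argument narrows the scope to a single container, which is an assumption about how you might want to limit access, not a FoundationaLLM requirement.

```shell
# Hypothetical helper: builds the ARM scope for a storage account, or for one
# container inside it when a fourth argument is supplied.
# $1 = subscription ID, $2 = resource group, $3 = storage account, $4 = container (optional)
storage_scope() {
  scope="/subscriptions/$1/resourceGroups/$2/providers/Microsoft.Storage/storageAccounts/$3"
  if [ -n "${4:-}" ]; then
    scope="$scope/blobServices/default/containers/$4"
  fi
  printf '%s\n' "$scope"
}

# Hypothetical wrapper around the role assignment command.
# $1 = managed identity object ID, $2 = scope produced by storage_scope
grant_blob_reader() {
  az role assignment create \
    --assignee "$1" \
    --role "Storage Blob Data Reader" \
    --scope "$2"
}
```

For example, `grant_blob_reader <managed-identity-object-id> "$(storage_scope <sub-id> <rg> <storage-account>)"` reproduces the account-level assignment.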
Using Account Keys
For development or when managed identity isn't available:
- Navigate to your storage account in Azure Portal
- Go to Access Keys
- Copy a key value
- Enter it in the API Key field
Warning: Account keys grant full access to the entire storage account. Rotate them regularly and use managed identity in production.
Using Connection Strings
Connection strings are useful for quick configuration. Because they embed an account key, handle them with the same care as the key itself:
- Navigate to your storage account in Azure Portal
- Go to Access Keys
- Copy the connection string
- Enter it in the Connection String field
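A copied connection string is sometimes truncated. The sketch below checks for the parts a key-based storage connection string normally contains (`DefaultEndpointsProtocol`, `AccountName`, `AccountKey`). `conn_string_missing` is a hypothetical helper name, and the check assumes a key-based string rather than one built on a shared access signature.

```shell
# Hypothetical helper: prints each expected key that is missing from the
# connection string passed as $1. Prints nothing when all keys are present.
conn_string_missing() {
  for key in DefaultEndpointsProtocol AccountName AccountKey; do
    case ";$1;" in
      *";$key="*) ;;                     # key is present
      *) printf '%s\n' "$key" ;;         # key is missing
    esac
  done
}

# Example with a placeholder key value; the missing key is reported.
conn_string_missing 'DefaultEndpointsProtocol=https;AccountName=mystorageaccount'   # prints AccountKey
```

The helper only inspects key names, so it never echoes the secret itself.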
Folder Structure Best Practices
Organizing Documents
mycontainer/
├── departments/
│   ├── hr/
│   │   ├── policies/
│   │   └── procedures/
│   └── finance/
│       ├── reports/
│       └── guidelines/
└── shared/
    └── templates/
Path Selection Tips
- Specific paths reduce processing time and improve relevance
- Multiple paths can be specified for different content types
- Avoid very deep nesting, which complicates management
Using in Data Pipelines
After creating the data source:
- Navigate to Data Pipelines
- Create a new pipeline
- Select your ADLS data source
- Configure processing stages
- Run the pipeline to index documents
Supported File Types
ADLS data sources typically support:
| Category | Extensions |
|---|---|
| Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS |
| Text | TXT, MD, HTML, XML, JSON |
| Data | CSV, TSV, Parquet |
Specific support depends on the data pipeline configuration.
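To get a quick inventory of what a folder contains before running a pipeline, file names can be grouped by the categories in the table. This sketch only mirrors that table; it is not FoundationaLLM's own detection logic, and `file_category` is a hypothetical helper name.

```shell
# Hypothetical helper: maps a file name to a category from the table above,
# based on its extension (case-insensitive).
file_category() {
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf|docx|doc|pptx|ppt|xlsx|xls) echo Documents ;;
    txt|md|html|xml|json)           echo Text ;;
    csv|tsv|parquet)                echo Data ;;
    *)                              echo Other ;;   # not listed in the table
  esac
}

file_category Handbook.PDF    # prints Documents
file_category metrics.parquet # prints Data
```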
Troubleshooting
Authentication Failures
- Verify the managed identity has the Storage Blob Data Reader role on the storage account
- Check that the account key hasn't been rotated since it was entered
- Ensure the connection string is complete and correct
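The managed identity check can be scripted by listing the identity's role assignments and looking for a role that grants blob data read access. This is a sketch assuming the Azure CLI is signed in with permission to read role assignments. `has_blob_read_role` and `check_identity_roles` are hypothetical helper names.

```shell
# Hypothetical helper: reads role names on stdin (one per line) and succeeds
# if any of them grants read access to blob data.
has_blob_read_role() {
  grep -q -x -e 'Storage Blob Data Reader' \
             -e 'Storage Blob Data Contributor' \
             -e 'Storage Blob Data Owner'
}

# Hypothetical wrapper: lists the identity's roles at the given scope.
# $1 = managed identity object ID, $2 = storage account resource ID
check_identity_roles() {
  az role assignment list --assignee "$1" --scope "$2" --include-inherited \
    --query "[].roleDefinitionName" --output tsv | has_blob_read_role
}
```

Note that the general-purpose Reader role does not match: it covers management operations only and does not grant access to the data itself.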
Access Denied Errors
- Verify folder paths are correct
- Check container exists and is accessible
- Review storage account firewall settings
Network Connectivity Issues
- Ensure FoundationaLLM services can reach the storage endpoint
- Check VNet configurations if using private endpoints
- Review firewall allow lists
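A simple probe can separate network problems from permission problems. The sketch below assumes `curl` is available and that it runs from a host on the same network as the FoundationaLLM services; run from elsewhere, it says nothing about their connectivity. `endpoint_host` and `probe_endpoint` are hypothetical helper names.

```shell
# Hypothetical helper: extracts the host name from an endpoint URL.
endpoint_host() {
  host=${1#*://}                  # drop the scheme, if any
  printf '%s\n' "${host%%/*}"     # drop any path
}

# Hypothetical probe: prints the HTTP status returned by the endpoint.
# Any status, even 4xx, shows the endpoint is reachable on the network;
# a DNS failure or timeout surfaces as a curl error instead.
probe_endpoint() {
  curl --silent --show-error --max-time 10 --output /dev/null \
    --write-out '%{http_code}\n' "https://$(endpoint_host "$1")/"
}
```

For example, `probe_endpoint https://mystorageaccount.dfs.core.windows.net` printing a status code suggests the remaining problem is authentication or firewall rules rather than routing or DNS.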