Troubleshooting Guide
This guide provides structured approaches to diagnosing and resolving common issues in FoundationaLLM deployments.
Quick Diagnostics
Health Check Commands
# AKS deployment - check all pods
kubectl get pods -n fllm
# AKS deployment - check services
kubectl get svc -n fllm
# Quick Start - check container apps
az containerapp list -g <resource-group> -o table
Expected Healthy State
| Component | Status |
|---|---|
| All pods | Running |
| All services | Active |
| All containers | Running |
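To compare an AKS deployment against the table above without scanning the output by eye, a small filter can list only the pods that are not healthy. This is a minimal sketch: `unhealthy_pods` is a hypothetical helper name, and it assumes the default `kubectl get pods` table layout, with the status in the third column.

```shell
# Print "<pod> <status>" for every pod that is not Running or Completed.
# Reads the table printed by `kubectl get pods` on stdin (header line included).
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (commented out so the snippet can be sourced safely):
# kubectl get pods -n fllm | unhealthy_pods
```

No output means every pod reported Running or Completed.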
Common Issues
1. Authentication Failures
Symptoms
- Unable to log in to portals
- "Invalid token" errors
- Redirect loops after login
Diagnosis
// Check Entra ID sign-in logs
SigninLogs
| where TimeGenerated > ago(1h)
| where ResultType != "0"
| project TimeGenerated, UserPrincipalName, AppDisplayName, ResultType, ResultDescription
Solutions
| Issue | Solution |
|---|---|
| Invalid redirect URI | Update redirect URIs in App Registration |
| Missing scopes | Configure API permissions |
| Client secret expired | Generate new secret, update Key Vault |
| Wrong tenant | Verify tenant ID in App Configuration |
Verify App Registration:
- Open Azure Portal > Microsoft Entra ID > App registrations
- Select the application
- Check Authentication > Redirect URIs
- Verify the URI matches your deployment URL followed by /signin-oidc
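The redirect URI check can also be done from the command line. The sketch below builds the expected URI from a deployment URL and looks for an exact match in a list of registered URIs. The function names are hypothetical, and the usage line assumes the URIs are registered under the app's `web.redirectUris` property (single-page apps may store them under `spa.redirectUris` instead).

```shell
# Build the redirect URI expected for a deployment URL (a trailing slash is tolerated).
expected_redirect_uri() {
  printf '%s/signin-oidc\n' "${1%/}"
}

# Succeed if the expected URI appears as an exact line in the list on stdin.
redirect_uri_registered() {
  grep -Fxq "$(expected_redirect_uri "$1")"
}

# Usage (<app-id> and <deployment-url> are placeholders):
# az ad app show --id <app-id> --query "web.redirectUris[]" -o tsv \
#   | redirect_uri_registered "<deployment-url>" && echo "redirect URI registered"
```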
See Authentication Setup for detailed configuration.
2. Missing App Configuration Values
Symptoms
- Services fail to start
- Configuration-related errors in logs
- "Key not found" errors
Diagnosis
# Check App Configuration
az appconfig kv list --name <app-config-name> -o table
# Check specific key
az appconfig kv show --name <app-config-name> --key "FoundationaLLM:Instance:Id"
Solutions
Verify Key Vault References
# Check Key Vault secret exists
az keyvault secret show --vault-name <vault-name> --name <secret-name>
Check Managed Identity Permissions
- App Configuration Data Reader on App Config
- Key Vault Secrets User on Key Vault
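One way to confirm these assignments is to list the role names granted to the managed identity and report any that are missing. This is a sketch under assumptions: `missing_roles` is a hypothetical helper, the role names are the Azure built-in roles `App Configuration Data Reader` and `Key Vault Secrets User`, and `<principal-id>` stands for the identity's object ID.

```shell
# Print each required role that is absent from the role names on stdin
# (one role name per line). Prints nothing when both roles are present.
missing_roles() {
  assigned=$(cat)
  for role in "App Configuration Data Reader" "Key Vault Secrets User"; do
    printf '%s\n' "$assigned" | grep -Fxq "$role" || printf '%s\n' "$role"
  done
}

# Usage (<principal-id> is a placeholder):
# az role assignment list --assignee <principal-id> --all \
#   --query "[].roleDefinitionName" -o tsv | missing_roles
```

The check only compares role names. It does not verify that each role is scoped to the correct App Configuration store or Key Vault.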
Re-run Configuration Script
# Quick Start
cd deploy/quick-start
../common/scripts/Set-AzdEnvEntra.ps1
3. Container Crashes
Symptoms
- Pods in CrashLoopBackOff state
- Services intermittently unavailable
- Container restarts
Diagnosis
# Get pod status
kubectl get pods -n fllm
# Describe failing pod
kubectl describe pod <pod-name> -n fllm
# Get logs from crashed container
kubectl logs <pod-name> -n fllm --previous
// Quick Start (Container Apps) - query for container crashes
ContainerAppConsoleLogs
| where TimeGenerated > ago(24h)
| where Log contains "exception" or Log contains "fatal" or Log contains "crash"
| project TimeGenerated, ContainerAppName, Log
| order by TimeGenerated desc
Solutions
| Issue | Solution |
|---|---|
| Out of memory | Increase memory limits in Helm values |
| Missing config | Check environment variables and App Config |
| Dependency unavailable | Verify dependent services are running |
| Image pull error | Check registry access and image tag |
Check Resource Usage:
kubectl top pods -n fllm
Update Resource Limits:
# In Helm values
resources:
  limits:
    memory: "2Gi"
    cpu: "1000m"
  requests:
    memory: "1Gi"
    cpu: "500m"
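Before raising memory limits, it can help to confirm that the container was actually terminated for exceeding them. The sketch below looks for the `OOMKilled` reason in `kubectl describe pod` output. `was_oom_killed` is a hypothetical helper name, and the pattern assumes the usual `Reason: OOMKilled` line under the container's last state.

```shell
# Succeed if the `kubectl describe pod` output on stdin reports an OOMKilled container.
was_oom_killed() {
  grep -Eq 'Reason:[[:space:]]+OOMKilled'
}

# Usage (<pod-name> is a placeholder):
# kubectl describe pod <pod-name> -n fllm | was_oom_killed \
#   && echo "Container was OOMKilled: consider raising the memory limit"
```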
4. API Errors
Symptoms
- 500 errors from APIs
- Timeout errors
- Incomplete responses
Diagnosis
// API error analysis
AppRequests
| where TimeGenerated > ago(1h)
| where Success == false
| summarize count() by Name, ResultCode
| order by count_ desc
# Check API pod logs
kubectl logs -n fllm deployment/core-api --tail=200
Solutions
| Error Code | Likely Cause | Solution |
|---|---|---|
| 401 | Authentication | Check token and permissions |
| 403 | Authorization | Verify RBAC roles |
| 500 | Server error | Check logs for details |
| 502 | Bad gateway | Check pod health |
| 503 | Service unavailable | Check service endpoints |
| 504 | Timeout | Check dependencies, increase timeout |
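For scripted checks, the table above can be expressed as a lookup from status code to first action. This is only an illustrative sketch: `first_check` is a hypothetical helper, and `<api-url>` in the usage line is a placeholder for an endpoint of your deployment.

```shell
# Map an HTTP status code to the first action suggested in the table above.
first_check() {
  case "$1" in
    401) echo "Check token and permissions" ;;
    403) echo "Verify RBAC roles" ;;
    500) echo "Check logs for details" ;;
    502) echo "Check pod health" ;;
    503) echo "Check service endpoints" ;;
    504) echo "Check dependencies, increase timeout" ;;
    *)   echo "No entry in the table for status $1" ;;
  esac
}

# Usage: fetch only the status code with curl, then look it up.
# first_check "$(curl -s -o /dev/null -w '%{http_code}' "<api-url>")"
```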
5. Azure OpenAI Errors
Symptoms
- "Model deployment not found"
- Quota exceeded errors
- Timeout on completions
Diagnosis
# Check OpenAI deployment
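az cognitiveservices account deployment list \
  --name <openai-account> \
  --resource-group <resource-group>
# Check account details (SKU, endpoint, provisioning state)
az cognitiveservices account show \
  --name <openai-account> \
  --resource-group <resource-group>
# Check quota usage for the account
az cognitiveservices account list-usage \
  --name <openai-account> \
  --resource-group <resource-group>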
Solutions
| Issue | Solution |
|---|---|
| Deployment not found | Create model deployment in Azure Portal |
| Quota exceeded | Request quota increase or use different region |
| Wrong deployment name | Update App Configuration with correct name |
| Region unavailable | Check model availability by region |
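A "deployment not found" error and a wrong deployment name look the same from the client side, so it is worth comparing the configured name against what the account actually exposes. The sketch below does that comparison. `check_deployment` is a hypothetical helper, and `<deployment-name>` stands for the value stored in App Configuration.

```shell
# Check whether a deployment name appears in the list of names on stdin.
# Prints the available names and returns 1 when the name is missing.
check_deployment() {
  names=$(cat)
  if printf '%s\n' "$names" | grep -Fxq "$1"; then
    echo "found: $1"
  else
    echo "not found: $1"
    echo "available:"
    printf '%s\n' "$names" | sed 's/^/  /'
    return 1
  fi
}

# Usage (placeholders in angle brackets):
# az cognitiveservices account deployment list \
#   --name <openai-account> --resource-group <resource-group> \
#   --query "[].name" -o tsv | check_deployment "<deployment-name>"
```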
6. Vector Search Issues
Symptoms
- No results from knowledge queries
- "Index not found" errors
- Slow search responses
Diagnosis
# Check AI Search service
az search service show --name <search-service> --resource-group <resource-group>
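# Get an admin key for the data-plane call below
az search admin-key show --service-name <search-service> --resource-group <resource-group>
# List indexes (Azure AI Search REST API)
curl -s -H "api-key: <admin-key>" 'https://<search-service>.search.windows.net/indexes?api-version=2023-11-01&$select=name'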
Solutions
- Verify Index Exists - Check in Azure Portal
- Check Permissions - Managed Identity needs Search Index Data Reader
- Re-run Indexing - Trigger data pipeline
7. Network Connectivity
Symptoms
- Services can't communicate
- DNS resolution failures
- Timeout between services
Diagnosis (AKS)
# Check service endpoints
kubectl get endpoints -n fllm
# Test DNS resolution
kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup core-api.fllm.svc.cluster.local
# Check network policies
kubectl get networkpolicies -n fllm
Solutions
| Issue | Solution |
|---|---|
| No endpoints | Check pod selector labels |
| DNS failure | Restart CoreDNS pods |
| Network policy blocking | Review network policy rules |
| Private endpoint issues | Check NSG and private DNS zones |
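The "No endpoints" case can be spotted directly from the endpoints listing. The sketch below prints the services whose endpoint column is empty. `services_without_endpoints` is a hypothetical helper, and it assumes the default `kubectl get endpoints` table layout, where an empty endpoint list is shown as `<none>`.

```shell
# Print the name of every service whose ENDPOINTS column is "<none>".
# Reads the table printed by `kubectl get endpoints` on stdin (header line included).
services_without_endpoints() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}

# Usage:
# kubectl get endpoints -n fllm | services_without_endpoints
```

For each service listed, compare the service's selector with the labels on the pods it is supposed to target.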
Diagnostic Tools
Log Analytics Queries
Save these queries for quick access:
// Error summary by service
ContainerAppConsoleLogs
| where TimeGenerated > ago(24h)
| where Log contains "error"
| summarize ErrorCount = count() by ContainerAppName
| order by ErrorCount desc
// Request latency percentiles
AppRequests
| where TimeGenerated > ago(1h)
| summarize
    p50 = percentile(DurationMs, 50),
    p95 = percentile(DurationMs, 95),
    p99 = percentile(DurationMs, 99)
    by Name
Health Check Script
# Quick health check script
$namespace = "fllm"
Write-Host "=== Pod Status ===" -ForegroundColor Cyan
kubectl get pods -n $namespace
Write-Host "`n=== Service Status ===" -ForegroundColor Cyan
kubectl get svc -n $namespace
Write-Host "`n=== Recent Events ===" -ForegroundColor Cyan
kubectl get events -n $namespace --sort-by='.lastTimestamp' | Select-Object -Last 10
Write-Host "`n=== Resource Usage ===" -ForegroundColor Cyan
kubectl top pods -n $namespace
Reporting Issues
If issues persist after troubleshooting:
1. Check Existing Issues
Search GitHub Issues for similar problems.
2. Gather Information
Collect before opening an issue:
- Deployment type (Quick Start/Standard)
- Version/release tag
- Error messages and logs
- Steps to reproduce
- Configuration (sanitized)
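Sanitizing configuration by hand is error-prone, so a rough first pass can be automated. The sketch below redacts the value on any line whose key contains "secret", "password", "key", or "token". `sanitize_config` is a hypothetical helper, and the key patterns are assumptions rather than a complete list. It deliberately over-redacts (for example, a key named `KeyVaultUri` is also masked), and the output should still be reviewed before posting.

```shell
# Replace the value of any "name: value" or "name=value" line whose name
# contains Secret, Password, Key, or Token with "<redacted>".
sanitize_config() {
  sed -E 's/(([Ss]ecret|[Pp]assword|[Kk]ey|[Tt]oken)[A-Za-z_]*[[:space:]]*[:=][[:space:]]*).*/\1<redacted>/'
}

# Usage (config-dump.txt is a placeholder for your exported settings):
# sanitize_config < config-dump.txt > config-sanitized.txt
```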
3. Open GitHub Issue
- Navigate to GitHub Issues
- Click New Issue
- Select appropriate template
- Provide detailed information
- Add relevant labels
Issue Template
## Description
Brief description of the issue
## Environment
- Deployment Type: [Quick Start / Standard]
- Version: [e.g., 0.9.0]
- Azure Region: [e.g., eastus2]
## Steps to Reproduce
1. Step 1
2. Step 2
3. Step 3
## Expected Behavior
What should happen
## Actual Behavior
What actually happens
## Logs/Screenshots
[Attach relevant logs or screenshots]
## Additional Context
Any other relevant information