Troubleshooting Guide

This guide provides structured approaches to diagnosing and resolving common issues in FoundationaLLM deployments.

Quick Diagnostics

Health Check Commands

# AKS deployment - check all pods
kubectl get pods -n fllm

# AKS deployment - check services
kubectl get svc -n fllm

# Quick Start - check container apps
az containerapp list -g <resource-group> -o table

Expected Healthy State

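Component	Expected state
All pods (AKS)	`Running`, with every container ready
All services (AKS)	Listed with a cluster IP (`kubectl get svc` shows no status column)
All container apps (Quick Start)	`Running`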
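
To spot unhealthy pods without scanning the full listing, the pod output can be filtered. The following is a minimal sketch, not part of the FoundationaLLM tooling: the function name `unhealthy_pods` is hypothetical, and the definition of "healthy" (`Running` with all containers ready, or `Completed`) is an assumption.

```shell
#!/usr/bin/env bash
# Reads `kubectl get pods --no-headers` output on stdin and prints every pod
# that is not healthy, as "<name> <ready> <status>".
# Assumption: healthy means STATUS is Running with all containers ready,
# or STATUS is Completed (a finished job).
unhealthy_pods() {
  awk '
    NF >= 3 {
      split($2, ready, "/")   # READY column looks like "1/1"
      ok = ($3 == "Running" && ready[1] == ready[2]) || $3 == "Completed"
      if (!ok) print $1, $2, $3
    }
  '
}

# Usage (requires cluster access):
#   kubectl get pods -n fllm --no-headers | unhealthy_pods
```

An empty result suggests that every pod matches the expected state.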

Common Issues

1. Authentication Failures

Symptoms

Unable to log in to portals
"Invalid token" errors
Redirect loops after login

Diagnosis

// Check Entra ID sign-in logs (requires sign-in logs to be sent to Log Analytics)
SigninLogs
| where TimeGenerated > ago(1h)
| where ResultType != "0"   // ResultType is a string; "0" means success
| project TimeGenerated, UserPrincipalName, AppDisplayName, ResultType, ResultDescription

Solutions

Issue	Solution
Invalid redirect URI	Update redirect URIs in App Registration
Missing scopes	Configure API permissions
Client secret expired	Generate new secret, update Key Vault
Wrong tenant	Verify tenant ID in App Configuration

Verify App Registration:

1. Open Azure Portal > Microsoft Entra ID > App registrations
2. Select the application
3. Check Authentication > Redirect URIs
4. Verify that the redirect URI equals your deployment URL followed by /signin-oidc
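
The redirect URI comparison can also be scripted. This is a rough sketch under assumptions: the function names are hypothetical, and the usage comment assumes the application is registered as a web platform (SPA registrations keep their URIs under `spa.redirectUris` instead).

```shell
#!/usr/bin/env bash
# Builds the redirect URI expected for a deployment URL: the URL with any
# trailing slash removed, followed by /signin-oidc.
expected_redirect_uri() {
  local base="${1%/}"
  printf '%s/signin-oidc\n' "$base"
}

# Succeeds when the expected URI appears, as an exact line, in the list of
# registered redirect URIs read from stdin (one URI per line).
redirect_uri_registered() {
  grep -Fxq -- "$(expected_redirect_uri "$1")"
}

# Usage (assumption: web platform registration):
#   az ad app show --id <app-id> --query "web.redirectUris" -o tsv \
#     | redirect_uri_registered "https://<portal-host>" && echo "redirect URI OK"
```

The match is exact on purpose: a trailing slash or a different scheme in the registered URI is enough to cause a redirect failure.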

See Authentication Setup for detailed configuration.

2. Missing App Configuration Values

Symptoms

Services fail to start
Configuration-related errors in logs
"Key not found" errors

Diagnosis

# Check App Configuration
az appconfig kv list --name <app-config-name> -o table

# Check specific key
az appconfig kv show --name <app-config-name> --key "FoundationaLLM:Instance:Id"
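
When several keys are suspect, checking them one at a time is slow. The sketch below compares a list of required keys against the keys that exist. The function name `missing_keys` is hypothetical, and the only key taken from this guide is `FoundationaLLM:Instance:Id`; any other required key is a placeholder you must supply.

```shell
#!/usr/bin/env bash
# Prints every required key that is absent from the key list on stdin.
# Arguments: the required keys. Stdin: one existing key per line.
missing_keys() {
  local existing key
  existing="$(cat)"
  for key in "$@"; do
    printf '%s\n' "$existing" | grep -Fxq -- "$key" || printf '%s\n' "$key"
  done
}

# Usage:
#   az appconfig kv list --name <app-config-name> --all --query "[].key" -o tsv \
#     | missing_keys "FoundationaLLM:Instance:Id" "<another-required-key>"
```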

Solutions

Verify Key Vault References

# Check Key Vault secret exists
az keyvault secret show --vault-name <vault-name> --name <secret-name>

Check Managed Identity Permissions
- App Configuration Data Reader on App Config
- Key Vault Secrets User on Key Vault

Re-run Configuration Script

# Quick Start
cd deploy/quick-start
../common/scripts/Set-AzdEnvEntra.ps1

3. Container Crashes

Symptoms

Pods in CrashLoopBackOff state
Services intermittently unavailable
Container restarts

Diagnosis

# Get pod status
kubectl get pods -n fllm

# Describe failing pod
kubectl describe pod <pod-name> -n fllm

# Get logs from crashed container
kubectl logs <pod-name> -n fllm --previous

// Quick Start (Container Apps): query console logs for crash indicators
ContainerAppConsoleLogs
| where TimeGenerated > ago(24h)
| where Log contains "exception" or Log contains "fatal" or Log contains "crash"
| project TimeGenerated, ContainerAppName, Log
| order by TimeGenerated desc

Solutions

Issue	Solution
Out of memory	Increase memory limits in Helm values
Missing config	Check environment variables and App Config
Dependency unavailable	Verify dependent services are running
Image pull error	Check registry access and image tag

Check Resource Usage:

kubectl top pods -n fllm

Update Resource Limits:

# In Helm values
resources:
  limits:
    memory: "2Gi"
    cpu: "1000m"
  requests:
    memory: "1Gi"
    cpu: "500m"
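
To judge whether a limit is too tight, compare the memory reported by `kubectl top pods` with the configured limit. The following sketch does that arithmetic. The function names are hypothetical, and it handles only whole-number quantities with `Ki`, `Mi`, or `Gi` suffixes.

```shell
#!/usr/bin/env bash
# Converts a Kubernetes memory quantity with a Ki, Mi, or Gi suffix to whole MiB.
# Other suffixes (for example "M" or "G") and fractions are not handled.
mem_to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Prints memory usage as an integer percentage of the limit (rounded down).
mem_usage_pct() {
  local used limit
  used="$(mem_to_mib "$1")" || return 1
  limit="$(mem_to_mib "$2")" || return 1
  echo $(( used * 100 / limit ))
}

# Usage: pass the MEMORY value from `kubectl top pods -n fllm` and the limit
# from the Helm values, for example:
#   mem_usage_pct 1843Mi 2Gi   # prints 89
```

A pod that regularly sits near 100% of its limit is a likely candidate for out-of-memory restarts.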

4. API Errors

Symptoms

500 errors from APIs
Timeout errors
Incomplete responses

Diagnosis

// API error analysis
AppRequests
| where TimeGenerated > ago(1h)
| where Success == false
| summarize count() by Name, ResultCode
| order by count_ desc

# Check API pod logs
kubectl logs -n fllm deployment/core-api --tail=200

Solutions

Error Code	Likely Cause	Solution
401	Authentication	Check token and permissions
403	Authorization	Verify RBAC roles
500	Server error	Check logs for details
502	Bad gateway	Check pod health
503	Service unavailable	Check service endpoints
504	Timeout	Check dependencies, increase timeout
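
When failures have been exported as a plain list of status codes, the table above can be applied in bulk. This is an illustrative sketch only: both function names are hypothetical, and the input format (one status code per line, for example the `ResultCode` column exported from Log Analytics) is an assumption.

```shell
#!/usr/bin/env bash
# Maps an HTTP status code to the first check from the table above.
first_check_for_status() {
  case "$1" in
    401) echo "Check token and permissions" ;;
    403) echo "Verify RBAC roles" ;;
    500) echo "Check logs for details" ;;
    502) echo "Check pod health" ;;
    503) echo "Check service endpoints" ;;
    504) echo "Check dependencies, increase timeout" ;;
    *)   echo "No entry for status $1" ;;
  esac
}

# Reads one status code per line on stdin and prints tab-separated
# "<count> <code> <first check>" rows, most frequent code first.
summarize_failures() {
  sort | uniq -c | sort -rn | while read -r count code; do
    printf '%s\t%s\t%s\n' "$count" "$code" "$(first_check_for_status "$code")"
  done
}

# Usage: summarize_failures < status-codes.txt
```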

5. Azure OpenAI Errors

Symptoms

"Model deployment not found"
Quota exceeded errors
Timeout on completions

Diagnosis

# Check OpenAI deployment
az cognitiveservices account deployment list \
  --name <openai-account> \
  --resource-group <resource-group>

# Check quota usage for the region
az cognitiveservices usage list \
  --location <region> \
  -o table

Solutions

Issue	Solution
Deployment not found	Create model deployment in Azure Portal
Quota exceeded	Request quota increase or use different region
Wrong deployment name	Update App Configuration with correct name
Region unavailable	Check model availability by region

6. Vector Search Issues

Symptoms

No results from knowledge queries
"Index not found" errors
Slow search responses

Diagnosis

# Check AI Search service
az search service show --name <search-service> --resource-group <resource-group>

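# List indexes (the Azure CLI has no index commands, so call the REST API)
az search admin-key show --service-name <search-service> --resource-group <resource-group>
curl -s -H "api-key: <admin-key>" \
  'https://<search-service>.search.windows.net/indexes?api-version=2023-11-01&$select=name'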

Solutions

1. Verify the index exists: check in the Azure portal.
2. Check permissions: the managed identity needs the Search Index Data Reader role.
3. Re-run indexing: trigger the data pipeline.

7. Network Connectivity

Symptoms

Services can't communicate
DNS resolution failures
Timeout between services

Diagnosis (AKS)

# Check service endpoints
kubectl get endpoints -n fllm

# Test DNS resolution
kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup core-api.fllm.svc.cluster.local

# Check network policies
kubectl get networkpolicies -n fllm

Solutions

Issue	Solution
No endpoints	Check pod selector labels
DNS failure	Restart CoreDNS pods
Network policy blocking	Review network policy rules
Private endpoint issues	Check NSG and private DNS zones
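
The "No endpoints" case can be detected directly from the endpoints listing. This is a small sketch with a hypothetical function name. It relies on `kubectl` printing `<none>` in the ENDPOINTS column when no ready pod matches a service selector.

```shell
#!/usr/bin/env bash
# Reads `kubectl get endpoints --no-headers` output on stdin and prints the
# name of every service whose ENDPOINTS column is "<none>".
services_without_endpoints() {
  awk '$2 == "<none>" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get endpoints -n fllm --no-headers | services_without_endpoints
```

For each service it reports, compare the service selector with the labels on the pods that are expected to back it.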

Diagnostic Tools

Log Analytics Queries

Save these queries for quick access:

// Error summary by service
ContainerAppConsoleLogs
| where TimeGenerated > ago(24h)
| where Log contains "error"   // contains is case-insensitive
| summarize ErrorCount = count() by ContainerAppName
| order by ErrorCount desc

// Request latency percentiles
AppRequests
| where TimeGenerated > ago(1h)
| summarize 
    p50 = percentile(DurationMs, 50),
    p95 = percentile(DurationMs, 95),
    p99 = percentile(DurationMs, 99)
    by Name

Health Check Script

# Quick health check script
$namespace = "fllm"

Write-Host "=== Pod Status ===" -ForegroundColor Cyan
kubectl get pods -n $namespace

Write-Host "`n=== Service Status ===" -ForegroundColor Cyan
kubectl get svc -n $namespace

Write-Host "`n=== Recent Events ===" -ForegroundColor Cyan
kubectl get events -n $namespace --sort-by='.lastTimestamp' | Select-Object -Last 10

Write-Host "`n=== Resource Usage ===" -ForegroundColor Cyan
kubectl top pods -n $namespace

Reporting Issues

If issues persist after troubleshooting:

1. Check Existing Issues

Search GitHub Issues for similar problems.

2. Gather Information

Collect before opening an issue:

Deployment type (Quick Start/Standard)
Version/release tag
Error messages and logs
Steps to reproduce
Configuration (sanitized)
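
Sanitizing configuration by hand is error-prone, so a first pass can be automated. The following is a sketch under assumptions: the function name `sanitize_config` is hypothetical, the input is assumed to be `KEY=VALUE` lines, and the list of sensitive words is illustrative rather than complete.

```shell
#!/usr/bin/env bash
# Redacts the value of every KEY=VALUE line whose key contains a sensitive
# word (secret, password, token, key, connectionstring), case-insensitively.
# All other lines pass through unchanged.
sanitize_config() {
  awk '
    {
      eq = index($0, "=")   # position of the first "=" (0 if none)
      if (eq > 0 && tolower(substr($0, 1, eq - 1)) ~ /secret|password|token|key|connectionstring/) {
        print substr($0, 1, eq) "<redacted>"
      } else {
        print
      }
    }
  '
}

# Usage: sanitize_config < my-settings.env > my-settings.sanitized.env
```

Pattern-based redaction can miss secrets stored under unexpected key names, so review the output before attaching it to an issue.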

3. Open GitHub Issue

1. Navigate to GitHub Issues
2. Click New Issue
3. Select the appropriate template
4. Provide detailed information
5. Add relevant labels

Issue Template

## Description
Brief description of the issue

## Environment
- Deployment Type: [Quick Start / Standard]
- Version: [e.g., 0.9.0]
- Azure Region: [e.g., eastus2]

## Steps to Reproduce
1. Step 1
2. Step 2
3. Step 3

## Expected Behavior
What should happen

## Actual Behavior
What actually happens

## Logs/Screenshots
[Attach relevant logs or screenshots]

## Additional Context
Any other relevant information
