API and Token Limits
Reference documentation for API rate limits, token usage limits, and quota enforcement in FoundationaLLM.
Overview
FoundationaLLM enforces various limits to ensure fair usage, protect system stability, and manage costs. Understanding these limits helps you build applications that work reliably within the platform's constraints.
Types of Limits
| Limit Type | Purpose | Scope |
|---|---|---|
| API Rate Limits | Prevent excessive API calls | Per user or global |
| Token Usage Limits | Control model consumption | Per request/agent/user |
| Quota Limits | Manage resource allocation | Per agent or platform |
API Rate Limits
Overview
API rate limits restrict how many requests can be made within a time window. When limits are exceeded, the API returns a 429 (Too Many Requests) response.
Rate Limit Structure
Rate limits are defined by:
- Metric Limit: Maximum number of requests
- Time Window: Period over which requests are counted (typically 60 seconds)
- Lockout Duration: How long you're blocked after exceeding limits
- Partition: Whether limits apply globally, per user, or per agent
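To make the interaction of these four properties concrete, here is a minimal in-memory sketch. It is an illustration only, not FoundationaLLM's enforcement code: the `WindowedLimiter` class, its method names, and the sliding-window counting are assumptions made for this example.

```python
from collections import defaultdict, deque


class WindowedLimiter:
    """Toy model of a rate limit: metric limit, time window, lockout, partition."""

    def __init__(self, metric_limit, window_seconds, lockout_seconds):
        self.metric_limit = metric_limit
        self.window_seconds = window_seconds
        self.lockout_seconds = lockout_seconds
        self._hits = defaultdict(deque)  # partition key -> request timestamps
        self._locked_until = {}          # partition key -> lockout end time

    def allow(self, partition_key, now):
        """Return True if a request at time `now` (in seconds) is allowed."""
        if now < self._locked_until.get(partition_key, 0.0):
            return False  # still inside the lockout period
        hits = self._hits[partition_key]
        # Drop timestamps that have fallen out of the time window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        if len(hits) >= self.metric_limit:
            # Limit exceeded: start the lockout and reject this request.
            self._locked_until[partition_key] = now + self.lockout_seconds
            return False
        hits.append(now)
        return True


# Two requests per 60 seconds, 30-second lockout, partitioned per user.
limiter = WindowedLimiter(metric_limit=2, window_seconds=60, lockout_seconds=30)
results = [limiter.allow("alice", t) for t in (0, 1, 2, 20, 70)]
other_user = limiter.allow("bob", 2)  # a different partition has its own counter
```

In this run the third request from `alice` triggers the lockout, the request at 20 seconds is rejected because the lockout is still active, and the request at 70 seconds succeeds because both the lockout and the window have passed.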
Common Rate Limits
| Controller | Typical Limit | Description |
|---|---|---|
| Completions | 100/minute/user | Chat completion requests |
| CompletionsStatus | 500/minute/user | Polling for async completion status |
| Sessions | 60/minute/user | Session management operations |
| Files | 30/minute/user | File upload operations |
Rate Limit Response
When you exceed a rate limit:
{
  "quota_exceeded": true,
  "quota_name": "CoreAPICompletionsRateLimit",
  "retry_after_seconds": 60,
  "message": "Rate limit exceeded. Try again later."
}
Best Practices
- Implement retry logic with exponential backoff
- Monitor 429 responses in your application
- Use async endpoints for long operations to reduce polling
- Cache responses where appropriate
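The first practice, retrying with exponential backoff, can be sketched as a delay generator. This is one common variant (capped exponential growth with full jitter), not a scheme prescribed by FoundationaLLM; the function name and its default values are assumptions.

```python
import random


def backoff_delays(max_retries=5, base_seconds=1.0, cap_seconds=60.0, rng=None):
    """Yield one wait time per retry: exponential growth, capped, with full jitter."""
    if rng is None:
        rng = random.Random()
    for attempt in range(max_retries):
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        # Full jitter: a random delay in [0, ceiling] keeps clients from retrying in lockstep.
        yield rng.uniform(0.0, ceiling)


# A seeded generator makes the sequence reproducible for testing.
delays = list(backoff_delays(max_retries=4, rng=random.Random(0)))
```

When a 429 response carries `retry_after_seconds`, waiting for the larger of that value and the computed delay is a reasonable way to combine the two.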
Token Usage Limits
What Are Tokens?
Tokens are the units AI models use to process text:
- 1 token ≈ 4 characters of English text
- 1 token ≈ ¾ of an English word
- 100 tokens ≈ 75 words
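These rules of thumb translate directly into a rough estimator. The helper names below are hypothetical, and the results are approximations only: real counts depend on the model's tokenizer and on the language of the text.

```python
import math


def estimate_tokens(text):
    """Rough token estimate using the 4-characters-per-token rule of thumb."""
    return math.ceil(len(text) / 4)


def estimate_tokens_from_words(word_count):
    """Rough token estimate using the 0.75-words-per-token rule of thumb."""
    return math.ceil(word_count / 0.75)


chars_estimate = estimate_tokens("Explain quantum computing")  # 25 characters -> 7
words_estimate = estimate_tokens_from_words(75)                # 75 words -> 100
```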
Token Counting
Token usage for each completion request is counted in three categories:
| Category | Description |
|---|---|
| Prompt Tokens | Tokens in your input (system prompt + user message + context) |
| Completion Tokens | Tokens in the model's response |
| Total Tokens | Sum of prompt and completion tokens |
Per-Request Token Limits
The max_new_tokens parameter limits response length:
{
  "user_prompt": "Explain quantum computing",
  "settings": {
    "model_parameters": {
      "max_new_tokens": 2000
    }
  }
}
| Parameter | Typical Range | Default |
|---|---|---|
| max_new_tokens | 1-4000+ | Model-dependent |
Model Context Limits
Different models have different context window sizes:
| Model | Context Window | Notes |
|---|---|---|
| GPT-4 | 8K-128K tokens | Varies by deployment |
| GPT-4o | 128K tokens | Large context support |
| GPT-3.5-turbo | 4K-16K tokens | Varies by version |
| Claude 3 | 200K tokens | Extended context |
If your request exceeds the context window, the API returns an error.
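A pre-flight check can catch oversized requests before they are sent. The sketch below assumes the context window is shared between prompt tokens and the completion budget (`max_new_tokens`), which is commonly the case but should be confirmed for your deployment. The function names are hypothetical, and 8192 is used as an assumed value for an "8K" window.

```python
def fits_context_window(prompt_tokens, max_new_tokens, context_window):
    """True if the prompt plus the requested completion budget fits the window."""
    return prompt_tokens + max_new_tokens <= context_window


def max_prompt_tokens(max_new_tokens, context_window):
    """Largest prompt that still leaves room for max_new_tokens of output."""
    return max(0, context_window - max_new_tokens)


fits = fits_context_window(prompt_tokens=6000, max_new_tokens=2000, context_window=8192)
too_big = fits_context_window(prompt_tokens=7000, max_new_tokens=2000, context_window=8192)
prompt_budget = max_prompt_tokens(max_new_tokens=2000, context_window=8192)
```

Here a 6000-token prompt fits, a 7000-token prompt does not, and the prompt budget for a 2000-token response is 6192 tokens.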
Token Limit Errors
When token limits are exceeded:
{
  "error": {
    "code": "TokenLimitExceeded",
    "message": "Request exceeds maximum token limit for this model."
  }
}
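Client code can recognize this error by its code and react, for example by shortening the prompt and retrying. The helper below is a sketch that assumes the error body has exactly the shape shown above; the function name is hypothetical.

```python
def is_token_limit_error(body):
    """Return True if a parsed error body carries the TokenLimitExceeded code."""
    error = body.get("error") if isinstance(body, dict) else None
    return isinstance(error, dict) and error.get("code") == "TokenLimitExceeded"


sample = {
    "error": {
        "code": "TokenLimitExceeded",
        "message": "Request exceeds maximum token limit for this model.",
    }
}
detected = is_token_limit_error(sample)
```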
API Key Limits
Agent Access Token Limits
Agent Access Tokens have configurable constraints:
| Limit | Description | Configuration |
|---|---|---|
| Expiration | Token validity period | Set at creation time |
| Agent Scope | Which agent(s) can be accessed | Bound to single agent |
| Permission Scope | What actions are allowed | Defined by role assignments |
Creating Tokens with Appropriate Limits
When creating Agent Access Tokens:
- Set appropriate expiration dates (shorter = more secure)
- Limit scope to necessary agents only
- Review and rotate tokens periodically
Quota Configuration
Quota Definition Structure
Quotas are defined with these properties:
{
  "name": "CoreAPICompletionsRateLimit",
  "description": "100 requests per minute per user",
  "context": "CoreAPI:Completions",
  "type": "RawRequestRateLimit",
  "metric_partition": "UserPrincipalName",
  "metric_limit": 100,
  "metric_window_seconds": 60,
  "lockout_duration_seconds": 60,
  "distributed_enforcement": false
}
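For tooling that reads or validates quota definitions, the properties above map naturally onto a small typed record. The `QuotaDefinition` class below is a hypothetical client-side sketch, not a FoundationaLLM type; the defaults for the last three fields are assumptions made so that shorter definitions still load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaDefinition:
    name: str
    context: str
    type: str
    metric_partition: str
    metric_limit: int
    metric_window_seconds: int
    lockout_duration_seconds: int = 0       # assumed default
    description: str = ""                   # assumed default
    distributed_enforcement: bool = False   # assumed default

    def requests_per_second(self):
        """Average sustained request rate the quota allows."""
        return self.metric_limit / self.metric_window_seconds


raw = {
    "name": "CoreAPICompletionsRateLimit",
    "description": "100 requests per minute per user",
    "context": "CoreAPI:Completions",
    "type": "RawRequestRateLimit",
    "metric_partition": "UserPrincipalName",
    "metric_limit": 100,
    "metric_window_seconds": 60,
    "lockout_duration_seconds": 60,
    "distributed_enforcement": False,
}
quota = QuotaDefinition(**raw)
```

Because the field names match the JSON keys, an unexpected or misspelled key raises a `TypeError` at construction time, which makes the record usable as a light validation step.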
Quota Types
| Type | Description |
|---|---|
| RawRequestRateLimit | Limits total API requests |
| AgentRequestRateLimit | Limits requests to specific agents |
Partition Strategies
| Partition | Description | Use Case |
|---|---|---|
| None | Global limit for all users | System protection |
| UserPrincipalName | Per-user limits | Standard user access |
| UserIdentifier | Per-identity limits | Service accounts |
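One way to picture a partition strategy is as the choice of key under which requests are counted. The sketch below illustrates that idea only; the function, its parameters, and the `"__global__"` key are hypothetical and do not describe how the platform stores counters.

```python
def partition_key(metric_partition, user_principal_name=None, user_identifier=None):
    """Pick the counter key for a request under the given partition strategy."""
    if metric_partition == "None":
        return "__global__"  # one shared counter for all callers
    if metric_partition == "UserPrincipalName":
        return user_principal_name
    if metric_partition == "UserIdentifier":
        return user_identifier
    raise ValueError(f"Unknown partition strategy: {metric_partition!r}")


global_key = partition_key("None", "alice@example.com", "svc-42")
per_user_key = partition_key("UserPrincipalName", "alice@example.com", "svc-42")
per_identity_key = partition_key("UserIdentifier", "alice@example.com", "svc-42")
```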
Agent-Specific Limits
Per-Agent Rate Limits
Specific agents can have individual rate limits:
{
  "name": "KnowledgeAgentRateLimit",
  "context": "CoreAPI:Completions:knowledge-agent",
  "type": "AgentRequestRateLimit",
  "metric_partition": "UserPrincipalName",
  "metric_limit": 50,
  "metric_window_seconds": 60
}
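Comparing this definition with the `CoreAPI:Completions` context used elsewhere in this document, the `context` value appears to follow an `API:Controller[:agent-name]` pattern. That pattern is inferred from the two examples rather than from a specification, so the parser below is a sketch under that assumption, with a hypothetical function name.

```python
def parse_quota_context(context):
    """Split a quota context into API, controller, and optional agent name."""
    parts = context.split(":", 2)
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"Unexpected quota context: {context!r}")
    return {
        "api": parts[0],
        "controller": parts[1],
        "agent": parts[2] if len(parts) == 3 else None,
    }


agent_scoped = parse_quota_context("CoreAPI:Completions:knowledge-agent")
controller_scoped = parse_quota_context("CoreAPI:Completions")
```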
Agent Token Configuration
Agents can be configured with:
- Maximum response tokens
- Temperature constraints
- Context window usage limits
Monitoring and Managing Limits
Viewing Current Usage
Users can monitor token consumption in the User Portal:
- Enable Show Message Tokens in agent configuration
- Token counts appear on each message
- Review consumption patterns over time
Administrator Monitoring
Administrators can:
- Review quota definitions in storage
- Monitor API logs for 429 responses
- Adjust quotas based on usage patterns
Adjusting Limits
To modify quota limits:
- Access the quota configuration in the main storage account
- Edit the quota-store.json file in the quota container
- Update metric limits, windows, or partitions
- Changes take effect on next API service restart
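As an illustration of the edit step, the sketch below changes one quota's `metric_limit` in a local copy of the file's contents. It assumes `quota-store.json` holds a JSON array of quota definitions shaped like the examples in this document, which is not confirmed here; the function name is hypothetical, and downloading from and uploading to the storage account are left out.

```python
import json


def set_metric_limit(store_text, quota_name, new_limit):
    """Return updated JSON text with the named quota's metric_limit changed."""
    quotas = json.loads(store_text)
    matches = [q for q in quotas if q.get("name") == quota_name]
    if not matches:
        raise KeyError(f"No quota named {quota_name!r}")
    for quota in matches:
        quota["metric_limit"] = new_limit
    return json.dumps(quotas, indent=2)


original = json.dumps([
    {"name": "CoreAPICompletionsRateLimit", "metric_limit": 100, "metric_window_seconds": 60},
    {"name": "KnowledgeAgentRateLimit", "metric_limit": 50, "metric_window_seconds": 60},
])
updated = set_metric_limit(original, "KnowledgeAgentRateLimit", 75)
```

Raising an error when the name is not found avoids silently writing back an unchanged file after a typo in the quota name.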
Limit Recommendations
For Developers
| Scenario | Recommendation |
|---|---|
| Interactive chat | Use default limits, implement retry |
| Batch processing | Implement rate limiting in your code |
| High-volume API | Request quota increase from admin |
| Long operations | Use async endpoints |
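For the batch-processing row, client-side rate limiting can be as simple as spacing calls evenly. The `Throttle` class below is a hypothetical sketch, not part of any FoundationaLLM SDK; the clock and sleep functions are injectable so the pacing logic can be tested without real waiting.

```python
import time


class Throttle:
    """Space calls so that at most `max_calls` start per `period_seconds`."""

    def __init__(self, max_calls, period_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period_seconds / max_calls
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


# Stay just under a 100 requests/minute limit by pacing to 90 per minute.
throttle = Throttle(max_calls=90, period_seconds=60)
# In a batch loop, call throttle.wait() immediately before each API request.
```

Pacing slightly below the published limit leaves headroom for other clients that share the same user partition.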
For Administrators
| Goal | Configuration |
|---|---|
| Prevent abuse | Set per-user limits with lockouts |
| Protect critical agents | Add agent-specific rate limits |
| Allow high-volume integrations | Consider higher limits for service accounts |
| Cost management | Monitor and set token-based quotas |
Error Handling Examples
Handling Rate Limits (Python)
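import time
import requests


def call_api_with_retry(url, headers, data, max_retries=3):
    """POST to the API, waiting and retrying when a 429 is returned."""
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=data)
        if response.status_code != 429:
            return response
        try:
            # Use the server-provided wait time when the body includes one.
            retry_after = response.json().get("retry_after_seconds", 60)
        except ValueError:
            retry_after = 60  # Body was not valid JSON; fall back to the default.
        if attempt < max_retries - 1:
            # Only wait when another attempt will follow.
            print(f"Rate limited. Waiting {retry_after} seconds...")
            time.sleep(retry_after)
    raise RuntimeError("Max retries exceeded")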
Handling Token Limits
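def truncate_to_token_limit(text, max_tokens=4000):
    """Trim text to roughly max_tokens using the 4-characters-per-token estimate."""
    # Approximate: 4 chars per token
    max_chars = max_tokens * 4
    if len(text) > max_chars:
        # Reserve three characters for the ellipsis so the result stays within max_chars.
        return text[:max_chars - 3] + "..."
    return text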
Related Topics
- Quotas Reference — Detailed quota configuration
- Core API Reference — API endpoint details
- Monitoring Token Consumption — User token monitoring