API and Token Limits

Reference documentation for API rate limits, token usage limits, and quota enforcement in FoundationaLLM.

Overview

FoundationaLLM enforces various limits to ensure fair usage, protect system stability, and manage costs. Understanding these limits helps you build applications that work reliably within the platform's constraints.

Types of Limits

Limit Type	Purpose	Scope
API Rate Limits	Prevent excessive API calls	Per user or global
Token Usage Limits	Control model consumption	Per request/agent/user
Quota Limits	Manage resource allocation	Per agent or platform

API Rate Limits

Overview

API rate limits restrict how many requests can be made within a time window. When limits are exceeded, the API returns a 429 (Too Many Requests) response.

Rate Limit Structure

Rate limits are defined by:

Metric Limit: Maximum number of requests
Time Window: Period over which requests are counted (typically 60 seconds)
Lockout Duration: How long you're blocked after exceeding limits
Partition: Whether limits apply globally, per user, or per agent

Common Rate Limits

Controller	Typical Limit	Description
Completions	100/minute/user	Chat completion requests
CompletionsStatus	500/minute/user	Polling for async completion status
Sessions	60/minute/user	Session management operations
Files	30/minute/user	File upload operations

Rate Limit Response

When you exceed a rate limit:

{
  "quota_exceeded": true,
  "quota_name": "CoreAPICompletionsRateLimit",
  "retry_after_seconds": 60,
  "message": "Rate limit exceeded. Try again later."
}

Best Practices

Implement retry logic with exponential backoff
Monitor 429 responses in your application
Use async endpoints for long operations to reduce polling
Cache responses where appropriate

Token Usage Limits

What Are Tokens?

Tokens are the units AI models use to process text:

~1 token ≈ 4 characters in English
~1 token ≈ ¾ of a word
100 tokens ≈ 75 words

Token Counting

Each completion request consumes tokens in three categories:

Category	Description
Prompt Tokens	Tokens in your input (system prompt + user message + context)
Completion Tokens	Tokens in the model's response
Total Tokens	Sum of prompt and completion tokens

Per-Request Token Limits

The max_new_tokens parameter limits response length:

{
  "user_prompt": "Explain quantum computing",
  "settings": {
    "model_parameters": {
      "max_new_tokens": 2000
    }
  }
}

Parameter	Typical Range	Default
`max_new_tokens`	1-4000+	Model-dependent

Model Context Limits

Different models have different context window sizes:

Model	Context Window	Notes
GPT-4	8K-128K tokens	Varies by deployment
GPT-4o	128K tokens	Large context support
GPT-3.5-turbo	4K-16K tokens	Varies by version
Claude 3	200K tokens	Extended context

If your request exceeds the context window, the API returns an error.

Token Limit Errors

When token limits are exceeded:

{
  "error": {
    "code": "TokenLimitExceeded",
    "message": "Request exceeds maximum token limit for this model."
  }
}

API Key Limits

Agent Access Token Limits

Agent Access Tokens have configurable constraints:

Limit	Description	Configuration
Expiration	Token validity period	Set at creation time
Agent Scope	Which agent(s) can be accessed	Bound to single agent
Permission Scope	What actions are allowed	Defined by role assignments

Creating Tokens with Appropriate Limits

When creating Agent Access Tokens:

Set appropriate expiration dates (shorter = more secure)
Limit scope to necessary agents only
Review and rotate tokens periodically

Quota Configuration

Quota Definition Structure

Quotas are defined with these properties:

{
  "name": "CoreAPICompletionsRateLimit",
  "description": "100 requests per minute per user",
  "context": "CoreAPI:Completions",
  "type": "RawRequestRateLimit",
  "metric_partition": "UserPrincipalName",
  "metric_limit": 100,
  "metric_window_seconds": 60,
  "lockout_duration_seconds": 60,
  "distributed_enforcement": false
}

Quota Types

Type	Description
`RawRequestRateLimit`	Limits total API requests
`AgentRequestRateLimit`	Limits requests to specific agents

Partition Strategies

Partition	Description	Use Case
`None`	Global limit for all users	System protection
`UserPrincipalName`	Per-user limits	Standard user access
`UserIdentifier`	Per-identity limits	Service accounts

Agent-Specific Limits

Per-Agent Rate Limits

Specific agents can have individual rate limits:

{
  "name": "KnowledgeAgentRateLimit",
  "context": "CoreAPI:Completions:knowledge-agent",
  "type": "AgentRequestRateLimit",
  "metric_partition": "UserPrincipalName",
  "metric_limit": 50,
  "metric_window_seconds": 60
}

Agent Token Configuration

Agents can be configured with:

Maximum response tokens
Temperature constraints
Context window usage limits

Monitoring and Managing Limits

Viewing Current Usage

Users can monitor token consumption in the User Portal:

Enable Show Message Tokens in agent configuration
Token counts appear on each message
Review consumption patterns over time

Administrator Monitoring

Administrators can:

Review quota definitions in storage
Monitor API logs for 429 responses
Adjust quotas based on usage patterns

Adjusting Limits

To modify quota limits:

Access the quota configuration in the main storage account
Edit the quota-store.json file in the quota container
Update metric limits, windows, or partitions
Changes take effect on next API service restart

Limit Recommendations

For Developers

Scenario	Recommendation
Interactive chat	Use default limits, implement retry
Batch processing	Implement rate limiting in your code
High-volume API	Request quota increase from admin
Long operations	Use async endpoints

For Administrators

Goal	Configuration
Prevent abuse	Set per-user limits with lockouts
Protect critical agents	Add agent-specific rate limits
Allow high-volume integrations	Consider higher limits for service accounts
Cost management	Monitor and set token-based quotas

Error Handling Examples

Handling Rate Limits (Python)

import time
import requests

def call_api_with_retry(url, headers, data, max_retries=3):
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=data)
        
        if response.status_code == 429:
            retry_after = response.json().get('retry_after_seconds', 60)
            print(f"Rate limited. Waiting {retry_after} seconds...")
            time.sleep(retry_after)
            continue
            
        return response
    
    raise Exception("Max retries exceeded")

Handling Token Limits

def truncate_to_token_limit(text, max_tokens=4000):
    # Approximate: 4 chars per token
    max_chars = max_tokens * 4
    if len(text) > max_chars:
        return text[:max_chars] + "..."
    return text

Quotas Reference — Detailed quota configuration
Core API Reference — API endpoint details
Monitoring Token Consumption — User token monitoring

Table of Contents

API and Token Limits

Overview

Types of Limits

API Rate Limits

Overview

Rate Limit Structure

Common Rate Limits

Rate Limit Response

Best Practices

Token Usage Limits

What Are Tokens?

Token Counting

Per-Request Token Limits

Model Context Limits

Token Limit Errors

API Key Limits

Agent Access Token Limits

Creating Tokens with Appropriate Limits

Quota Configuration

Quota Definition Structure

Quota Types

Partition Strategies

Agent-Specific Limits

Per-Agent Rate Limits

Agent Token Configuration

Monitoring and Managing Limits

Viewing Current Usage

Administrator Monitoring

Adjusting Limits

Limit Recommendations

For Developers

For Administrators

Error Handling Examples

Handling Rate Limits (Python)

Handling Token Limits

Related Topics