Rate Limiting and Quotas for Multi-Tenant MCP Servers
Essential strategies for protecting shared MCP infrastructure from resource exhaustion and ensuring fair resource allocation across tenants.
Multi-tenant Model Context Protocol servers are powerful tools for scaling AI agent infrastructure, but without proper rate limiting and quota mechanisms, a single noisy tenant can degrade service for everyone. In this post, we'll explore practical strategies for implementing fair resource allocation in shared MCP deployments.
Why Rate Limiting Matters for MCP Servers
MCP servers expose tool capabilities to multiple agents and applications simultaneously. Without guardrails, a single client calling expensive operations in rapid succession can:
- Exhaust database connections
- Monopolize GPU resources (if applicable)
- Trigger cascading latency across all tenants
- Enable abuse or accidental denial-of-service scenarios
Rate limiting isn't just about preventing malicious actors—it's about maintaining predictable, fair performance for legitimate users.
Quota Dimensions in MCP Deployments
Effective quota management requires thinking beyond simple "requests per second." Consider these dimensions:
Request-level quotas: Traditional rate limiting per tenant or API key. Cap requests per minute or hour. Works well when each request costs roughly the same, as with short CPU-bound operations.
Time-window quotas: Sliding window counters or token bucket algorithms. Token bucket is particularly elegant—each tenant gets a "bucket" that refills at a known rate, allowing occasional bursts while maintaining average throughput.
Concurrency limits: Enforce maximum simultaneous tool invocations per tenant. Prevents resource hoarding even if the request rate is acceptable.
Cost-based quotas: Weight expensive operations differently. A single database query might consume 10 units; a vector similarity search might consume 50. Tenants get a "cost budget" rather than a simple request count.
Long-tail protection: Cap maximum execution time for individual tool calls. Prevents runaway operations from blocking thread pools.
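As a rough illustration of the sliding-window option mentioned above, here is a minimal in-memory sketch. The class name `SlidingWindowLimiter` and its parameters are hypothetical, and it assumes a single process; it keeps one timestamp per accepted request, so memory grows with the limit.

```typescript
// Sliding-window log limiter: remembers the timestamp of each accepted request
// and allows a new one only if fewer than `limit` fall inside the window.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;    // max accepted requests per window
  private readonly windowMs: number; // window length in milliseconds

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `nowMs` is injectable so the behavior can be checked without real waiting.
  tryAcquire(tenantId: string, nowMs: number = Date.now()): boolean {
    const cutoff = nowMs - this.windowMs;
    // Drop timestamps that have aged out of the window.
    const recent = (this.hits.get(tenantId) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(nowMs);
    this.hits.set(tenantId, recent);
    return allowed;
  }
}
```

With a limit of 2 per 1000 ms, requests at 0 ms and 100 ms pass, a third at 200 ms is rejected, and a request at 1001 ms passes again because the first timestamp has left the window.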
Implementing Token Bucket Rate Limiting
The token bucket algorithm is well-suited to MCP workloads. Here's the concept:
- Each tenant has a bucket with capacity C tokens
- Tokens are added at rate R tokens per second
- Each request costs T tokens
- Requests proceed if sufficient tokens exist; otherwise, they're rejected or queued
class TokenBucket {
  private tokens: number;
  private readonly capacity: number;
  private readonly refillRate: number;
  private lastRefill: number = Date.now();

  constructor(capacity: number, refillRate: number) {
    this.capacity = capacity;
    this.refillRate = refillRate;
    this.tokens = capacity;
  }

  tryConsume(cost: number = 1): boolean {
    this._refill();
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }

  private _refill(): void {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000; // seconds
    const newTokens = elapsed * this.refillRate;
    this.tokens = Math.min(this.capacity, this.tokens + newTokens);
    this.lastRefill = now;
  }
}
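To make the burst behavior concrete, the same refill-then-consume logic can be written as a pure function with the clock passed in. This is only a sketch for reasoning and testing; `BucketState`, `BucketConfig`, and `consume` are hypothetical names, not part of the class above.

```typescript
interface BucketState {
  tokens: number;       // tokens currently available
  lastRefillMs: number; // timestamp of the last refill, in milliseconds
}

interface BucketConfig {
  capacity: number;   // maximum tokens the bucket can hold
  refillRate: number; // tokens added per second
}

// Refills the bucket up to `nowMs`, then tries to take `cost` tokens.
// Returns the decision together with the new state; the input is not mutated.
function consume(
  state: BucketState,
  cfg: BucketConfig,
  cost: number,
  nowMs: number
): { allowed: boolean; state: BucketState } {
  const elapsedSec = Math.max(0, nowMs - state.lastRefillMs) / 1000;
  const tokens = Math.min(cfg.capacity, state.tokens + elapsedSec * cfg.refillRate);
  if (tokens >= cost) {
    return { allowed: true, state: { tokens: tokens - cost, lastRefillMs: nowMs } };
  }
  return { allowed: false, state: { tokens, lastRefillMs: nowMs } };
}
```

With a capacity of 10 and a refill rate of 2 tokens per second, a burst costing 10 tokens drains a full bucket, another request at the same instant is rejected, and 500 ms later exactly one token is available again. After a long idle period the bucket holds 10 tokens, never more.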
For distributed MCP servers (multiple processes or regions), use a shared backend like Redis to track bucket state:
// Redis-backed token bucket for distributed rate limiting
async function checkRateLimit(
  client: RedisClient,
  tenantId: string,
  cost: number = 1
): Promise<boolean> {
  const key = `rate-limit:${tenantId}`;
  const capacity = 1000; // tokens
  const refillRate = 100; // tokens per second
  // Lua script makes refill + check + decrement atomic.
  // Bucket state lives in a single hash so the script only touches KEYS[1].
  const script = `
    local capacity = tonumber(ARGV[1])
    local refillRate = tonumber(ARGV[2])
    local cost = tonumber(ARGV[3])
    local t = redis.call('TIME') -- { seconds, microseconds }
    local now = tonumber(t[1]) * 1000 + math.floor(tonumber(t[2]) / 1000)
    local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
    -- A tenant with no stored state starts with a full bucket
    local tokens = tonumber(state[1]) or capacity
    local lastRefill = tonumber(state[2]) or now
    local elapsed = math.max(0, now - lastRefill) / 1000
    local refilled = math.min(capacity, tokens + elapsed * refillRate)
    if refilled >= cost then
      redis.call('HSET', KEYS[1], 'tokens', refilled - cost, 'ts', now)
      return 1
    end
    return 0
  `;
  const result = await client.eval(
    script,
    1,
    key,
    capacity.toString(),
    refillRate.toString(),
    cost.toString()
  );
  return result === 1;
}
Cost-Based Quotas for Heterogeneous Operations
Not all MCP tools are equal. A simple tool might return data from cache; a vector search might scan millions of embeddings. Assign operation costs based on resource impact:
const toolCosts: Record<string, number> = {
  'get_user_profile': 1,
  'search_embeddings': 50,
  'generate_report': 100,
  'train_model': 500,
};

async function invokeTool(
  tool: string,
  args: unknown,
  tenant: string
): Promise<unknown> {
  const cost = toolCosts[tool] ?? 10; // default cost
  const allowed = await checkRateLimit(redisClient, tenant, cost);
  if (!allowed) {
    throw new QuotaExceededError(
      `Tenant quota exceeded. Tool "${tool}" costs ${cost} units.`
    );
  }
  return await executeTool(tool, args);
}
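The token bucket above bounds the rate at which cost units are spent. If tenants also get a fixed cost budget per billing period, that can be tracked as a separate check. The following is a minimal in-memory sketch; `CostBudget` and `ChargeResult` are hypothetical names, and a real deployment would presumably keep the counters in shared storage and reset them on a schedule.

```typescript
interface ChargeResult {
  allowed: boolean;
  remaining: number; // units left in the current period
}

// Tracks cost units spent per tenant against a fixed per-period limit.
class CostBudget {
  private readonly used = new Map<string, number>();
  private readonly limitPerPeriod: number;

  constructor(limitPerPeriod: number) {
    this.limitPerPeriod = limitPerPeriod;
  }

  // Charges `cost` units only if the tenant has enough budget left.
  charge(tenant: string, cost: number): ChargeResult {
    const spent = this.used.get(tenant) ?? 0;
    if (spent + cost > this.limitPerPeriod) {
      return { allowed: false, remaining: this.limitPerPeriod - spent };
    }
    this.used.set(tenant, spent + cost);
    return { allowed: true, remaining: this.limitPerPeriod - spent - cost };
  }

  // Intended to be called at the start of each period (for example, daily).
  resetAll(): void {
    this.used.clear();
  }
}
```

Returning the remaining units alongside the decision makes it easy to surface current usage to the tenant.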
Enforcing Concurrency Limits
Rate limiting controls throughput, but concurrency limits prevent resource hoarding:
class ConcurrencyLimiter {
  private activeCount = 0;
  private readonly maxConcurrent: number;
  private queue: Array<() => void> = [];

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  async run<T>(fn: () => Promise<T>): Promise<T> {
    while (this.activeCount >= this.maxConcurrent) {
      await new Promise(resolve => this.queue.push(resolve));
    }
    this.activeCount++;
    try {
      return await fn();
    } finally {
      this.activeCount--;
      const resolve = this.queue.shift();
      if (resolve) resolve();
    }
  }
}

// Usage: enforce max 5 concurrent invocations per tenant
const concurrencyLimiters = new Map<string, ConcurrencyLimiter>();

async function invokeTool(
  tool: string,
  args: unknown,
  tenant: string
): Promise<unknown> {
  if (!concurrencyLimiters.has(tenant)) {
    concurrencyLimiters.set(tenant, new ConcurrencyLimiter(5));
  }
  const limiter = concurrencyLimiters.get(tenant)!;
  return limiter.run(() => executeTool(tool, args));
}
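A concurrency limit bounds how many calls run at once, but not how long each one holds its slot. The long-tail protection described earlier can be sketched as a timeout wrapper around the tool call. `withTimeout` and `ToolTimeoutError` are hypothetical names, and the sketch assumes a runtime with a global `AbortController` (modern Node.js or a browser).

```typescript
class ToolTimeoutError extends Error {
  constructor(ms: number) {
    super(`Tool call exceeded ${ms} ms`);
    this.name = "ToolTimeoutError";
  }
}

// Runs `fn` and rejects with ToolTimeoutError if it has not settled within `ms`.
// The AbortSignal is aborted on timeout so cooperative tools can stop their work.
async function withTimeout<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  ms: number
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new ToolTimeoutError(ms));
    }, ms);
  });
  try {
    return await Promise.race([fn(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // avoid leaving a pending timer after the call settles
  }
}
```

Note that `Promise.race` only stops waiting; it does not cancel the underlying work. A tool that ignores the signal keeps consuming resources after its caller has given up, so the timeout is most effective when tools check the signal.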
Monitoring and Observability
Effective quota enforcement requires visibility into tenant behavior:
async function invokeTool(
  tool: string,
  args: unknown,
  tenant: string
): Promise<unknown> {
  const cost = toolCosts[tool] ?? 10;

  // Check the quota outside the try block so a rejection is recorded once,
  // as quota_exceeded, and not a second time as a failed tool invocation.
  const allowed = await checkRateLimit(redisClient, tenant, cost);
  if (!allowed) {
    recordMetric('quota_exceeded', { tenant, tool });
    throw new QuotaExceededError(`Quota exceeded for ${tenant}`);
  }

  const startTime = Date.now();
  try {
    const result = await executeTool(tool, args);
    recordMetric('tool_invocation', {
      tenant,
      tool,
      duration: Date.now() - startTime,
      cost,
      status: 'success',
    });
    return result;
  } catch (error) {
    recordMetric('tool_invocation', {
      tenant,
      tool,
      duration: Date.now() - startTime,
      cost,
      status: 'error',
    });
    throw error;
  }
}
Expose metrics for:
- Requests per tenant per second
- Cost consumption trending
- Quota rejection rate
- p99 latency by tenant
- Peak concurrency
Alert when a tenant consistently hits rate limits or when rejection rates exceed acceptable thresholds. Either can indicate a workload change or a need for quota adjustment.
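One way to derive the rejection-rate signal is to count allowed and rejected decisions per tenant and flag tenants above a threshold. The sketch below is in-memory and illustrative only; `RejectionTracker` is a hypothetical name, and a production setup would more likely compute this in the metrics backend.

```typescript
interface TenantCounters {
  allowed: number;
  rejected: number;
}

class RejectionTracker {
  private readonly counters = new Map<string, TenantCounters>();

  // Records one rate-limit decision for a tenant.
  record(tenant: string, allowed: boolean): void {
    const c = this.counters.get(tenant) ?? { allowed: 0, rejected: 0 };
    if (allowed) c.allowed++;
    else c.rejected++;
    this.counters.set(tenant, c);
  }

  // Fraction of decisions that were rejections (0 when nothing was recorded).
  rejectionRate(tenant: string): number {
    const c = this.counters.get(tenant);
    if (!c) return 0;
    const total = c.allowed + c.rejected;
    return total === 0 ? 0 : c.rejected / total;
  }

  // Tenants whose rejection rate exceeds `threshold`, ignoring tenants with
  // fewer than `minSamples` decisions to avoid alerting on noise.
  tenantsOverThreshold(threshold: number, minSamples: number): string[] {
    const flagged: string[] = [];
    this.counters.forEach((c, tenant) => {
      const total = c.allowed + c.rejected;
      if (total >= minSamples && c.rejected / total > threshold) {
        flagged.push(tenant);
      }
    });
    return flagged;
  }
}
```

The `minSamples` guard is a design choice: without it, a tenant whose only request was rejected would show a 100% rejection rate.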
Graceful Degradation and Client Strategies
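Rate limits should fail predictably. Return a consistent, documented error and include retry guidance; on HTTP-based transports, status 429 (Too Many Requests) with a Retry-After header is the standard choice: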
if (!allowed) {
  const resetTime = calculateResetTime(tenant);
  throw new QuotaExceededError(
    `Rate limit exceeded. Reset in ${resetTime}s`,
    { retryAfter: resetTime }
  );
}
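The snippet above leaves `calculateResetTime` undefined. For a token bucket, one plausible definition is the time until the bucket has refilled enough to cover the request. The helper below is a sketch under that assumption; `secondsUntilAvailable` is a hypothetical name, and it takes the bucket's numbers directly instead of looking them up by tenant.

```typescript
// Seconds (rounded up) until a bucket holding `tokens` can cover `cost`,
// given a refill of `refillRate` tokens per second.
// Returns Infinity when the request can never succeed (cost above capacity,
// or a bucket that does not refill).
function secondsUntilAvailable(
  tokens: number,
  cost: number,
  refillRate: number,
  capacity: number
): number {
  if (tokens >= cost) return 0;
  if (cost > capacity || refillRate <= 0) return Infinity;
  return Math.ceil((cost - tokens) / refillRate);
}
```

For an empty bucket refilling at 100 tokens per second, a request costing 250 units has to wait 2.5 seconds, which is reported as 3. Rounding up keeps clients from retrying slightly too early.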
Clients should implement exponential backoff with jitter:
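// Client-side helper. invokeTool here is the client's own call wrapper, so it
// takes no tenant argument (the server is assumed to identify the tenant).
async function invokeWithRetry(
  tool: string,
  args: unknown,
  maxRetries: number = 3
): Promise<unknown> {
  let delay = 100; // base delay in milliseconds
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await invokeTool(tool, args);
    } catch (error) {
      if (error instanceof QuotaExceededError && i < maxRetries - 1) {
        const jitter = Math.random() * delay; // spreads out retries from many clients
        await sleep(delay + jitter);
        delay *= 2; // exponential backoff
      } else {
        throw error;
      }
    }
  }
  // Only reachable when maxRetries <= 0
  throw new Error('invokeWithRetry requires maxRetries >= 1');
}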
Best Practices
- Set realistic quotas: Monitor baseline usage before enforcing limits. Set quotas too tight and you'll frustrate legitimate users; too loose and you lose protection.
- Differentiate by tier: Offer multiple service tiers with different quota allowances. Developers pay for capacity, not surprises.
- Communicate clearly: Document quota limits prominently. Include current usage in API responses.
- Plan for elasticity: In cloud environments, consider dynamic quota scaling during predictable peak hours.
- Implement circuit breakers: If a tenant's failure rate exceeds thresholds, temporarily degrade their access to prevent cascading failures.
- Monitor cost attribution: Combine rate limiting with chargeback systems. Tenants should see exactly what they're consuming and why.
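As an illustration of the circuit-breaker practice, here is a minimal per-tenant sketch. It is a simplification: it trips on consecutive failures instead of a failure rate, and `TenantCircuitBreaker` and its parameters are hypothetical names.

```typescript
// Opens after `failureThreshold` consecutive failures, then lets calls
// through again once `cooldownMs` has passed (a simple half-open trial).
class TenantCircuitBreaker {
  private failures = 0;
  private openedAtMs: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // True if the tenant's calls should be let through at `nowMs`.
  allow(nowMs: number = Date.now()): boolean {
    if (this.openedAtMs === null) return true;
    return nowMs - this.openedAtMs >= this.cooldownMs;
  }

  // A success closes the breaker and clears the failure streak.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAtMs = null;
  }

  // A failure extends the streak; at the threshold the breaker (re)opens.
  recordFailure(nowMs: number = Date.now()): void {
    this.failures++;
    if (this.failures >= this.failureThreshold) {
      this.openedAtMs = nowMs;
    }
  }
}
```

One breaker per tenant, stored in a map alongside the rate limiters, keeps a misbehaving tenant's failures from affecting decisions for anyone else.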
Proper rate limiting and quota management transform shared MCP infrastructure from a fragile shared resource into a predictable, fair platform that scales confidently as your agent ecosystem grows.