8 min read · BitAtlas

Rate Limiting and Quotas for Multi-Tenant MCP Servers

Essential strategies for protecting shared MCP infrastructure from resource exhaustion and ensuring fair resource allocation across tenants.

Tags: MCP server, rate limiting, quotas, resource management, fairness, multi-tenant, API throttling

Multi-tenant Model Context Protocol servers are powerful tools for scaling AI agent infrastructure, but without proper rate limiting and quota mechanisms, a single noisy tenant can degrade service for everyone. In this post, we'll explore practical strategies for implementing fair resource allocation in shared MCP deployments.

Why Rate Limiting Matters for MCP Servers

MCP servers expose tool capabilities to multiple agents and applications simultaneously. Without guardrails, a single client calling expensive operations in rapid succession can:

  • Exhaust database connections
  • Monopolize GPU resources (if applicable)
  • Trigger cascading latency across all tenants
  • Enable abuse or accidental denial-of-service scenarios

Rate limiting isn't just about preventing malicious actors—it's about maintaining predictable, fair performance for legitimate users.

Quota Dimensions in MCP Deployments

Effective quota management requires thinking beyond simple "requests per second." Consider these dimensions:

Request-level quotas: Traditional rate limiting per tenant or API key. Cap requests per minute or hour. Works well when requests cost roughly the same, such as short CPU-bound operations.

Time-window quotas: Sliding window counters or token bucket algorithms. Token bucket is particularly elegant—each tenant gets a "bucket" that refills at a known rate, allowing occasional bursts while maintaining average throughput.

Concurrency limits: Enforce maximum simultaneous tool invocations per tenant. Prevents resource hoarding even if the request rate is acceptable.

Cost-based quotas: Weight expensive operations differently. A single database query might consume 10 units; a vector similarity search might consume 50. Tenants get a "cost budget" rather than a simple request count.

Long-tail protection: Cap maximum execution time for individual tool calls. Prevents runaway operations from blocking thread pools.
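Of the dimensions above, the sliding window counter is the one this post does not otherwise show in code. The following is a minimal in-process sketch, not a definitive implementation: the class name `SlidingWindowCounter` and its parameters are hypothetical, and the injectable `now` argument exists only so the behaviour can be checked deterministically. It estimates the request count over the trailing window by weighting the previous fixed window's count by how much of it still overlaps.

```typescript
// Sliding-window counter (sketch). Approximates the number of requests in
// the last `windowMs` milliseconds from two fixed-window counters.
class SlidingWindowCounter {
  private readonly limit: number; // max requests per window
  private readonly windowMs: number; // window length in milliseconds
  private currentStart = 0; // start (ms) of the current fixed window
  private currentCount = 0;
  private previousCount = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `now` defaults to the wall clock; tests can pass explicit timestamps.
  tryAcquire(now: number = Date.now()): boolean {
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    if (windowStart !== this.currentStart) {
      // Roll over. If more than one full window elapsed, nothing carries over.
      const adjacent = windowStart - this.currentStart === this.windowMs;
      this.previousCount = adjacent ? this.currentCount : 0;
      this.currentCount = 0;
      this.currentStart = windowStart;
    }
    // Fraction of the previous window still inside the sliding window.
    const overlap = 1 - (now - windowStart) / this.windowMs;
    const estimated = this.previousCount * overlap + this.currentCount;
    if (estimated + 1 > this.limit) {
      return false;
    }
    this.currentCount++;
    return true;
  }
}
```

Compared with a token bucket, this approach stores only two counters per tenant but gives an approximate count rather than an exact one.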

Implementing Token Bucket Rate Limiting

The token bucket algorithm is well-suited to MCP workloads. Here's the concept:

  1. Each tenant has a bucket with capacity C tokens
  2. Tokens are added at rate R tokens per second
  3. Each request costs T tokens
  4. Requests proceed if sufficient tokens exist; otherwise, they're rejected or queued

A single-process implementation in TypeScript:

class TokenBucket {
  private tokens: number;
  private readonly capacity: number;
  private readonly refillRate: number;
  private lastRefill: number = Date.now();

  constructor(capacity: number, refillRate: number) {
    this.capacity = capacity;
    this.refillRate = refillRate;
    this.tokens = capacity;
  }

  tryConsume(cost: number = 1): boolean {
    this._refill();
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }

  private _refill(): void {
    const now = Date.now();
    // Date.now() is not monotonic; clamp so a clock step backwards never removes tokens
    const elapsed = Math.max(0, now - this.lastRefill) / 1000; // seconds
    const newTokens = elapsed * this.refillRate;
    this.tokens = Math.min(this.capacity, this.tokens + newTokens);
    this.lastRefill = now;
  }
}
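
In a multi-tenant server each tenant needs its own bucket. One way to arrange that, sketched under assumptions: the names `TenantBucket`, `TenantRateLimiter`, and `Clock` are hypothetical, and the bucket is a compact variant of the class above with an injectable clock so the refill arithmetic can be verified without waiting.

```typescript
type Clock = () => number; // returns the current time in milliseconds

// Compact token bucket with an injectable clock (illustrative variant).
class TenantBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillRate: number; // tokens per second
  private readonly clock: Clock;

  constructor(capacity: number, refillRate: number, clock: Clock) {
    this.capacity = capacity;
    this.refillRate = refillRate;
    this.clock = clock;
    this.tokens = capacity; // new tenants start with a full bucket
    this.lastRefill = clock();
  }

  tryConsume(cost: number = 1): boolean {
    const now = this.clock();
    const elapsedSeconds = Math.max(0, now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillRate);
    this.lastRefill = now;
    if (this.tokens < cost) {
      return false;
    }
    this.tokens -= cost;
    return true;
  }
}

// Lazily creates one bucket per tenant so tenants cannot drain each other.
class TenantRateLimiter {
  private readonly buckets = new Map<string, TenantBucket>();
  private readonly capacity: number;
  private readonly refillRate: number;
  private readonly clock: Clock;

  constructor(capacity: number, refillRate: number, clock: Clock = Date.now) {
    this.capacity = capacity;
    this.refillRate = refillRate;
    this.clock = clock;
  }

  tryConsume(tenantId: string, cost: number = 1): boolean {
    let bucket = this.buckets.get(tenantId);
    if (!bucket) {
      bucket = new TenantBucket(this.capacity, this.refillRate, this.clock);
      this.buckets.set(tenantId, bucket);
    }
    return bucket.tryConsume(cost);
  }
}
```

With a capacity of 10 and a refill rate of 5 tokens per second, a tenant that spends all 10 tokens can spend 5 more one second later. The map in this sketch grows with the number of tenants seen; a long-running server would likely want to evict idle entries.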

For distributed MCP servers (multiple processes or regions), use a shared backend like Redis to track bucket state:

// Redis-backed token bucket for distributed rate limiting.
// Assumes an ioredis-style client: eval(script, numKeys, ...keys, ...args).
async function checkRateLimit(
  client: RedisClient,
  tenantId: string,
  cost: number = 1
): Promise<boolean> {
  // The {tenantId} hash tag keeps both keys in one Redis Cluster slot
  const tokensKey = `rate-limit:{${tenantId}}`;
  const refillKey = `${tokensKey}:refill`;
  const capacity = 1000; // tokens
  const refillRate = 100; // tokens per second

  // Lua script: refill, check, and decrement run atomically on the server
  const script = `
    local capacity = tonumber(ARGV[1])
    local rate = tonumber(ARGV[2])
    local cost = tonumber(ARGV[3])

    -- TIME returns { seconds, microseconds }; convert to milliseconds
    local t = redis.call('time')
    local now = tonumber(t[1]) * 1000 + math.floor(tonumber(t[2]) / 1000)

    -- A tenant with no stored state starts with a full bucket
    local current = tonumber(redis.call('get', KEYS[1]) or capacity)
    local lastRefill = tonumber(redis.call('get', KEYS[2]) or now)

    local elapsed = math.max(0, now - lastRefill) / 1000
    local refilled = math.min(capacity, current + elapsed * rate)

    if refilled >= cost then
      redis.call('set', KEYS[1], refilled - cost)
      redis.call('set', KEYS[2], now)
      return 1
    end
    return 0
  `;

  const result = await client.eval(
    script,
    2,
    tokensKey,
    refillKey,
    capacity.toString(),
    refillRate.toString(),
    cost.toString()
  );

  return result === 1;
}

Cost-Based Quotas for Heterogeneous Operations

Not all MCP tools are equal. A simple tool might return data from cache; a vector search might scan millions of embeddings. Assign operation costs based on resource impact:

const toolCosts: Record<string, number> = {
  'get_user_profile': 1,
  'search_embeddings': 50,
  'generate_report': 100,
  'train_model': 500,
};

async function invokeTool(
  tool: string,
  args: unknown,
  tenant: string
): Promise<unknown> {
  const cost = toolCosts[tool] ?? 10; // default cost

  const allowed = await checkRateLimit(redisClient, tenant, cost);
  if (!allowed) {
    throw new QuotaExceededError(
      `Tenant quota exceeded. Tool "${tool}" costs ${cost} units.`
    );
  }

  return await executeTool(tool, args);
}

Enforcing Concurrency Limits

Rate limiting controls throughput, but concurrency limits prevent resource hoarding:

class ConcurrencyLimiter {
  private activeCount = 0;
  private readonly maxConcurrent: number;
  private queue: Array<() => void> = [];

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  async run<T>(fn: () => Promise<T>): Promise<T> {
    while (this.activeCount >= this.maxConcurrent) {
      await new Promise(resolve => this.queue.push(resolve));
    }

    this.activeCount++;
    try {
      return await fn();
    } finally {
      this.activeCount--;
      const resolve = this.queue.shift();
      if (resolve) resolve();
    }
  }
}

// Usage: enforce max 5 concurrent invocations per tenant
const concurrencyLimiters = new Map<string, ConcurrencyLimiter>();

async function invokeTool(
  tool: string,
  args: unknown,
  tenant: string
): Promise<unknown> {
  if (!concurrencyLimiters.has(tenant)) {
    concurrencyLimiters.set(tenant, new ConcurrencyLimiter(5));
  }

  const limiter = concurrencyLimiters.get(tenant)!;
  return limiter.run(() => executeTool(tool, args));
}
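
A concurrency slot is only released when the call settles, so a hung tool call holds its slot indefinitely. The long-tail protection described earlier addresses this with a per-call deadline. The following is a sketch under assumptions: `withTimeout` and `ToolTimeoutError` are hypothetical names, and the wrapped function is assumed to accept an `AbortSignal` and stop work when it fires.

```typescript
class ToolTimeoutError extends Error {
  constructor(ms: number) {
    super(`Tool call exceeded ${ms} ms`);
    this.name = "ToolTimeoutError";
  }
}

// Rejects with ToolTimeoutError if `fn` has not settled within `ms`.
// The signal lets `fn` cancel cooperatively; Promise.race alone does not
// stop the underlying work.
async function withTimeout<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  ms: number
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // ask the tool to stop
      reject(new ToolTimeoutError(ms));
    }, ms);
  });

  try {
    return await Promise.race([fn(controller.signal), deadline]);
  } finally {
    clearTimeout(timer); // avoid a stray timer once the race is decided
  }
}
```

Wrapping the tool call inside the limiter, for example `limiter.run(() => withTimeout((signal) => callTool(signal), 30_000))` with a hypothetical `callTool`, means a timed-out call frees its slot. A tool that ignores the signal keeps consuming resources in the background even though its slot has been released.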

Monitoring and Observability

Effective quota enforcement requires visibility into tenant behavior:

async function invokeTool(
  tool: string,
  args: unknown,
  tenant: string
): Promise<unknown> {
  const startTime = Date.now();
  const cost = toolCosts[tool] ?? 10;

  try {
    const allowed = await checkRateLimit(redisClient, tenant, cost);
    if (!allowed) {
      recordMetric('quota_exceeded', { tenant, tool });
      throw new QuotaExceededError(`Quota exceeded for ${tenant}`);
    }

    const result = await executeTool(tool, args);
    const duration = Date.now() - startTime;

    recordMetric('tool_invocation', {
      tenant,
      tool,
      duration,
      cost,
      status: 'success',
    });

    return result;
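  } catch (error) {
    recordMetric('tool_invocation', {
      tenant,
      tool,
      duration: Date.now() - startTime,
      // Keep quota rejections distinct from genuine tool failures
      status: error instanceof QuotaExceededError ? 'rejected' : 'error',
      cost,
    });
    throw error;
  }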
}

Expose metrics for:

  • Requests per tenant per second
  • Cost consumption trending
  • Quota rejection rate
  • p99 latency by tenant
  • Peak concurrency

Alert when a tenant consistently hits rate limits or when rejection rates exceed acceptable thresholds. Either can indicate a workload change or a need to adjust quotas.
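
One way to turn "rejection rate exceeds a threshold" into something an alert can evaluate is a rolling window of recent outcomes per tenant. This is a sketch only: `RejectionRateTracker`, `shouldAlert`, and the minimum-sample guard are hypothetical, and a production system would more likely derive this from its metrics backend.

```typescript
// Tracks accepted and rejected requests over a rolling time window.
class RejectionRateTracker {
  private readonly windowMs: number;
  private events: Array<{ at: number; rejected: boolean }> = [];

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  record(rejected: boolean, now: number = Date.now()): void {
    this.prune(now);
    this.events.push({ at: now, rejected });
  }

  sampleCount(now: number = Date.now()): number {
    this.prune(now);
    return this.events.length;
  }

  // Fraction of requests in the window that were rejected (0 when empty).
  rate(now: number = Date.now()): number {
    this.prune(now);
    if (this.events.length === 0) {
      return 0;
    }
    const rejected = this.events.filter((e) => e.rejected).length;
    return rejected / this.events.length;
  }

  // Drop events that have aged out of the window.
  private prune(now: number): void {
    const cutoff = now - this.windowMs;
    while (this.events.length > 0 && this.events[0].at <= cutoff) {
      this.events.shift();
    }
  }
}

// Alert only with enough samples, so one rejected request out of two
// does not page anyone.
function shouldAlert(
  tracker: RejectionRateTracker,
  threshold: number,
  minSamples: number,
  now: number = Date.now()
): boolean {
  return tracker.sampleCount(now) >= minSamples && tracker.rate(now) > threshold;
}
```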

Graceful Degradation and Client Strategies

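Rate limits should fail predictably. Use the transport's standard status codes where they exist (over HTTP, 429 Too Many Requests with a Retry-After header) and include retry guidance in the error itself: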

if (!allowed) {
  const resetTime = calculateResetTime(tenant);
  throw new QuotaExceededError(
    `Rate limit exceeded. Reset in ${resetTime}s`,
    { retryAfter: resetTime }
  );
}
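
The snippet above relies on an error type and a reset-time calculation that the post does not define. For a token bucket the wait follows from the deficit: a tenant holding `tokens` that needs `cost` must wait `(cost - tokens) / refillRate` seconds. The sketch below is one possible shape, with hypothetical names (`secondsUntilAffordable`) and an assumed error constructor signature that matches the call above.

```typescript
// Error carrying retry guidance, in seconds.
class QuotaExceededError extends Error {
  readonly retryAfter: number;

  constructor(message: string, options: { retryAfter: number }) {
    super(message);
    this.name = "QuotaExceededError";
    this.retryAfter = options.retryAfter;
  }
}

// Whole seconds until a bucket holding `tokens` can afford `cost`.
// Returns Infinity when the request can never fit in the bucket.
function secondsUntilAffordable(
  tokens: number,
  cost: number,
  refillRate: number, // tokens per second
  capacity: number
): number {
  if (cost > capacity || refillRate <= 0) {
    return Infinity;
  }
  if (tokens >= cost) {
    return 0;
  }
  return Math.ceil((cost - tokens) / refillRate);
}
```

For example, with a refill rate of 100 tokens per second, a tenant holding 10 tokens that requests a 500-unit operation would be told to retry in 5 seconds, since ceil(490 / 100) = 5.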

Clients should implement exponential backoff with jitter:

async function invokeWithRetry(
  tool: string,
  args: unknown,
  tenant: string,
  maxRetries: number = 3
): Promise<unknown> {
  let delay = 100; // initial backoff in milliseconds
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await invokeTool(tool, args, tenant);
    } catch (error) {
      // Retry only quota rejections, and only while attempts remain
      if (error instanceof QuotaExceededError && i < maxRetries - 1) {
        const jitter = Math.random() * delay; // spread out synchronized retries
        await sleep(delay + jitter);
        delay *= 2; // exponential backoff
      } else {
        throw error;
      }
    }
  }
  throw new Error('maxRetries must be at least 1');
}

Best Practices

  1. Set realistic quotas: Monitor baseline usage before enforcing limits. Set quotas too tight and you'll frustrate legitimate users; too loose and you lose protection.

  2. Differentiate by tier: Offer multiple service tiers with different quota allowances. Developers pay for capacity, not surprises.

  3. Communicate clearly: Document quota limits prominently. Include current usage in API responses.

  4. Plan for elasticity: In cloud environments, consider dynamic quota scaling during predictable peak hours.

  5. Implement circuit breakers: If a tenant's failure rate exceeds thresholds, temporarily degrade their access to prevent cascading failures.

  6. Monitor cost attribution: Combine rate limiting with chargeback systems. Tenants should see exactly what they're consuming and why.
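
Of the practices above, the circuit breaker is the least obvious to implement. A minimal per-tenant sketch follows, with stated simplifications: it trips on consecutive failures rather than a true failure rate, all names (`TenantCircuitBreaker`, `failureThreshold`, `cooldownMs`) are hypothetical, and timestamps are injectable for testing.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Per-tenant circuit breaker (sketch). Opens after `failureThreshold`
// consecutive failures, then lets a single probe through after `cooldownMs`.
class TenantCircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Returns true if the tenant's next call may proceed.
  allow(now: number = Date.now()): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // admit exactly one probe call
      return true;
    }
    return this.state === "closed";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(now: number = Date.now()): void {
    this.failures++;
    // A failed probe reopens immediately; otherwise wait for the threshold.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A server would keep one breaker per tenant, call `allow()` before running a tool, and report the outcome afterwards. Quota rejections are probably best excluded from the failure count so that rate limiting and circuit breaking do not amplify each other.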

Proper rate limiting and quota management transform shared MCP infrastructure from a fragile shared resource into a predictable, fair platform that scales confidently as your agent ecosystem grows.
