AI Infrastructure

Rate Limiting

A control method that caps API request volume over a time window to protect stability and cost


What is rate limiting?

Rate limiting is an operational control that restricts how many requests a client can send to a service within a given time window.

For example, if a service allows 60 requests per minute, any request beyond the 60th in that minute is delayed or rejected (over HTTP, typically with a 429 Too Many Requests response) so the service is not overloaded.
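As a rough illustration of the 60-requests-per-minute example, the sketch below counts requests per one-minute bucket and refuses anything over the limit. The class name `FixedWindowLimiter`, its parameters, and the injectable `clock` argument are assumptions made for this example, not part of any particular library, and the sketch is single-process and not thread-safe.

```python
import time


class FixedWindowLimiter:
    """Hypothetical helper: allow at most `limit` requests per time bucket."""

    def __init__(self, limit=60, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.current_bucket = int(clock() // window_seconds)
        self.count = 0

    def allow(self):
        """Return True if the request fits in the current bucket, else False."""
        bucket = int(self.clock() // self.window_seconds)
        # A new bucket has started: reset the counter.
        if bucket != self.current_bucket:
            self.current_bucket = bucket
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        # Over the limit: the caller decides whether to delay or reject.
        return False
```

A caller would check `limiter.allow()` before handling each request and, when it returns False, either queue the request or reject it.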

Why does it matter?

In AI and API-heavy systems, sudden traffic spikes can cause failed requests, high latency, and cost surges.

Rate limiting is a foundational safeguard for keeping reliability and cost under control.

Common implementation patterns

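  • Fixed Window: counts requests in fixed time buckets and resets the counter at each bucket boundary; simple, but a burst that straddles a boundary can briefly exceed the intended rate
  • Sliding Window: evaluates the limit over a rolling window that ends at the current moment, which smooths out the boundary bursts of a fixed window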
  • Token Bucket: allows short bursts while controlling long-term average throughput
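To make the difference between the patterns more concrete, here is a minimal sketch of a sliding-window limiter (the timestamp-log variant) and a token bucket. The class names, parameters, and the injectable `clock` are hypothetical choices for this example. Both are single-process, not thread-safe, and only one of several reasonable ways to implement each pattern.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Sliding-window log: at most `limit` requests in any trailing window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.timestamps = deque()  # accepted request times, oldest first

    def allow(self):
        now = self.clock()
        # Drop timestamps that have fallen out of the trailing window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False


class TokenBucketLimiter:
    """Token bucket: bursts up to `capacity`, long-run average of `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The sliding-window log stores one timestamp per accepted request, so its memory use grows with the limit. The token bucket keeps only two numbers, which is one reason it is a common choice when short bursts are acceptable.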
