API Rate Limits, Pagination, and Backoff: A Developer's Field Guide

Rate limits and pagination are the two things that make a demo integration fail in production. Every API has them; most integrations ignore them until they hit a 429 at 2am on a Tuesday.

Every API integration looks clean in a demo. You fetch twenty records, they come back instantly, the data is right, and you ship it. Then production data arrives — a few thousand records, multiple consumers, peak traffic hours — and within a week your integration is throwing 429s, skipping records, and silently dropping data.

Rate limits and pagination are the two constraints that separate a demo integration from a production one. They're not edge cases — they're fundamental properties of every external API you'll ever call. This guide covers both in detail: how they work, how to handle them correctly, and the patterns that experienced engineers use to build integrations that survive at scale.

Understanding Rate Limits

A rate limit is a cap on how many requests you can make to an API within a given time window. The cap exists to protect the API provider's infrastructure from being overwhelmed, to ensure fair access across customers, and to enforce tier boundaries between pricing plans.

Rate limits come in several varieties, and many APIs enforce more than one simultaneously:

Per-second limits — burst protection. Even if you have a large per-minute quota, you can't fire 1,000 requests in the first second.
Per-minute limits — the most common. 100 requests per minute, 600 per minute, 1,000 per minute.
Per-day limits — usually on expensive or data-heavy endpoints. Common in analytics APIs, enrichment services, and anything that does heavy server-side computation.
Per-IP limits — less common for authenticated APIs, more common for public endpoints. Can surprise you if you're running multiple services on shared infrastructure.
Per-user or per-org limits — relevant when you're building a multi-tenant product that calls an API on behalf of your customers. Each customer has their own quota.
Burst vs. sustained limits — some APIs allow brief bursts above the sustained limit using a token bucket model. You might be able to do 200 requests in a burst, but your sustained rate is 60 per minute.

How are rate limits communicated? Mostly through HTTP response headers. The most common ones are:

X-RateLimit-Limit — your total quota for the window
X-RateLimit-Remaining — how many requests you have left in the current window
X-RateLimit-Reset — Unix timestamp when the window resets
Retry-After — seconds to wait before retrying (returned on 429 responses)

Not all APIs use these headers consistently. Some use different header names, some report the reset as seconds remaining rather than a Unix timestamp, and some include the information only in the 429 response body, not in every response. Read the documentation carefully, then test empirically, because the documentation is often incomplete.
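
Because header formats vary, it helps to normalise them in one place. The sketch below parses a Retry-After value into seconds, accepting either a number or an HTTP date. The function name parse_retry_after and its now parameter are illustrative assumptions, not part of any particular client library.

```python
import time
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the number of seconds to wait, or None if the value is unusable.

    Accepts delta-seconds ("120") or an HTTP date
    ("Wed, 21 Oct 2015 07:28:00 GMT").
    """
    if value is None:
        return None
    now = time.time() if now is None else now
    value = value.strip()
    try:
        # Most APIs send a plain number of seconds.
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # neither a number nor a parseable date
    # A date in the past means "retry now", so never return a negative wait.
    return max(0.0, when.timestamp() - now)
```

Returning None for unparseable values lets the caller fall back to its own backoff policy instead of crashing on an unexpected header.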

The Right Way to Handle 429s

A 429 Too Many Requests response means you've exceeded a rate limit. The naive handling is to sleep for a fixed interval and retry. This works in development and fails reliably under load.

The correct handling starts with reading the Retry-After header, which tells you how long to wait: usually a number of seconds, though the HTTP specification also allows an HTTP date. Use it. Don't guess and don't use a fixed sleep; use the value the API gives you.

When Retry-After is absent or inconsistent, use exponential backoff with jitter. Exponential backoff means your wait time doubles with each consecutive failure: first retry after 1 second, second after 2, third after 4, fourth after 8, and so on up to a cap. The cap matters — don't let backoff grow indefinitely.

The jitter part is equally important. If you have 10 concurrent processes all hitting the same rate limit at the same time, they'll all back off for the same duration and then retry simultaneously — causing another rate limit. Jitter randomises the backoff duration (typically within ±50% of the calculated value) so that retries are spread out over time rather than thundering in together.

A simple Python implementation looks like:

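import random
import time

def backoff_sleep(attempt, base=1.0, cap=60.0):
    # Exponential delay: base, 2*base, 4*base, ... for attempt = 0, 1, 2, ...
    delay = base * (2 ** attempt)
    # Apply +/-50% jitter, then clamp so the final sleep never exceeds the cap.
    delay = min(cap, delay * random.uniform(0.5, 1.5))
    time.sleep(delay)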
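
Putting the two rules together (prefer Retry-After, fall back to jittered backoff), a retry wrapper might look like the sketch below. The send callable and its (status, headers, body) return shape are assumptions made for illustration; adapt them to whatever HTTP client you use. The sleep parameter is injectable so the function can be tested without real delays.

```python
import random
import time


def call_with_retries(send, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    send is a hypothetical callable returning (status_code, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts; no point sleeping before giving up
        retry_after = headers.get("Retry-After", "").strip()
        if retry_after.isdigit():
            delay = float(retry_after)  # trust the server's hint
        else:
            # Capped exponential backoff with +/-50% jitter.
            delay = min(cap, base * (2 ** attempt) * random.uniform(0.5, 1.5))
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

This sketch only handles Retry-After given as whole seconds; an HTTP-date value falls through to the backoff branch.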

Beyond individual retries, 429s are a signal to add a circuit breaker to your integration. A circuit breaker tracks error rates and, when they exceed a threshold, stops sending requests for a period — rather than hammering a rate-limited API with a storm of retries. Libraries like Resilience4j (Java), Polly (.NET), and pybreaker (Python) implement this pattern. For TypeScript/Node integrations, Cockatiel is solid.
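
To make the idea concrete, here is a deliberately minimal circuit breaker. It is a simplification: it counts consecutive failures rather than tracking an error rate, and it is not the API of any of the libraries named above. The class name, method names, and the injectable clock are all illustrative assumptions.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial request after a cool-down."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cool-down.
            self.opened_at = self.clock()
```

A caller checks allow_request() before each call, then reports the outcome with record_success() or record_failure(). A failed trial request re-opens the circuit because the failure count is still at or above the threshold.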

Proactive Rate Limit Management

The best approach to rate limits isn't better retry handling — it's not hitting them in the first place. This requires pre-emptive throttling: shaping your request volume to stay within limits before you get a 429, rather than reacting after.

The token bucket algorithm is the standard approach. You maintain a bucket that fills at the rate limit speed (e.g., 10 tokens per second for a 600-per-minute API). Each request consumes one token. If the bucket is empty, the request waits until a token is available. This smooths your request rate to stay within limits without requiring any 429 handling for normal operations.
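
A single-process token bucket can be sketched in a few lines. The class and method names here are illustrative, and the clock is injectable purely so the behaviour can be tested deterministically; a production version would also need locking if shared across threads.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate                # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)   # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available; return False if the caller must wait."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def wait_time(self):
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        return max(0.0, (1 - self.tokens) / self.rate)
```

For the 600-per-minute example, TokenBucket(rate=10, capacity=10) would allow a burst of ten requests and then one request every 100 ms.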

When multiple services in your stack call the same API, you need to budget the rate limit across all consumers. If your rate limit is 1,000 requests per minute and you have three services that could each consume 1,000 per minute, you have a problem. A centralised rate limiter (Redis-backed for distributed systems) ensures that all services share a single quota correctly.

Another useful technique is spreading non-urgent requests over time. If you have a batch job that needs to make 10,000 API calls but they're not time-sensitive, schedule them to run at a rate well below the limit over an extended window. There's no reason to hit 80% of your daily quota in the first ten minutes.

Pagination Patterns

Pagination is how APIs split large result sets across multiple responses. Every API that returns lists uses some form of it; the differences between patterns matter significantly for correctness at scale.

Offset pagination is the simplest. You pass ?page=2&per_page=100 or ?offset=200&limit=100. It's intuitive to implement and easy to jump to a specific page. It's also broken at scale. If records are added or deleted while you're paginating, you'll get duplicates or skip records. And offset-based queries are expensive on large datasets — a database has to count past the first N rows to find your page. Avoid using offset pagination for large syncs if the API supports anything better.
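
The skipped-record problem is easy to reproduce with an in-memory list standing in for the API. In this toy example, fetch_offset_page is a hypothetical stand-in for a request with ?offset=...&limit=..., and a record is deleted between the first and second page.

```python
def fetch_offset_page(rows, offset, limit):
    """Stand-in for an API call using ?offset=...&limit=..."""
    return rows[offset:offset + limit]


rows = list(range(1, 11))                 # records with IDs 1..10
seen = fetch_offset_page(rows, 0, 5)      # first page: IDs 1..5

rows.remove(3)                            # a record is deleted mid-sync

seen += fetch_offset_page(rows, 5, 5)     # second page now starts at ID 7
missing = sorted(set(rows) - set(seen))   # records that still exist but were never returned
print(missing)
```

Record 6 shifted from position 5 to position 4 when record 3 was deleted, so the second page starts one record too late and the sync never sees it.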

Cursor-based pagination is the correct solution for most use cases. The API returns an opaque cursor (usually a base64-encoded ID or timestamp) alongside each page of results, and you pass it back to get the next page. Cursors are stable: they don't drift when records are added or deleted. The underlying queries are also efficient because they use an index condition rather than a row count. Many modern APIs offer cursor-style pagination, including Stripe (starting_after), Shopify, and GitHub's GraphQL API; GitHub's REST API still uses page numbers on many endpoints.

Keyset pagination is similar to cursor pagination but uses explicit values from the data (like an ID or timestamp) rather than an opaque cursor. You pass ?after_id=12345 or ?since=2026-01-01T00:00:00Z. It's transparent and debuggable. It works well for chronological or sequential data but requires that you know the sort order and have an appropriate index.
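
A toy version of keyset pagination shows why it tolerates deletions. Here fetch_after_id is a hypothetical stand-in for a request with ?after_id=...&limit=..., and the list is assumed to be sorted by ID.

```python
def fetch_after_id(rows, after_id, limit):
    """Stand-in for an API call using ?after_id=...&limit=... on rows sorted by ID."""
    return [r for r in rows if r > after_id][:limit]


rows = list(range(1, 11))                     # records with IDs 1..10
page1 = fetch_after_id(rows, 0, 5)            # IDs 1..5

rows.remove(3)                                # a record is deleted mid-sync

page2 = fetch_after_id(rows, page1[-1], 5)    # continues from ID 5, not from a position
missing = sorted(set(rows) - set(page1 + page2))
print(missing)
```

Because the second request asks for "IDs greater than 5" rather than "rows starting at position 5", the deletion does not shift the page boundary and no surviving record is skipped.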

Page-based pagination (as distinct from offset pagination) uses a page number without an explicit offset, often backed by a more efficient implementation. It still has the same drift problems as offset pagination if records change while you are paginating.

To identify which pattern an API uses: look for cursor, next_cursor, or after parameters in the response — that's cursor-based. Look for page or offset — that's offset or page-based. Look for a links.next URL in the response body — that's typically cursor-based with the cursor embedded in the URL.

Implementing Robust Pagination

The correct structure for paginating through an API is a while loop that terminates when there's no next page signal. Here's the pattern:

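# `api` and `process` are placeholders for your HTTP client and record handler.
cursor = None
while True:
    params = {"limit": 100}
    if cursor:
        params["after"] = cursor  # omitted on the first request

    response = api.get("/records", params=params)
    records = response["data"]

    process(records)

    # No next_cursor means this was the last page.
    cursor = response.get("next_cursor")
    if not cursor:
        break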

Several failure modes to defend against:

Infinite loops. If the API returns a cursor on the last page that points to itself, your loop never terminates. Add a maximum page count guard, and treat a next cursor identical to the one you just sent as an error rather than following it.
Empty pages. Some APIs return an empty page before returning no cursor. Handle this by checking both the cursor and the record count.
Resumable syncs. For large syncs that could fail partway through, checkpoint your progress by storing the last successful cursor in your database. On restart, resume from the checkpoint rather than the beginning. For an API that returns 500,000 records and fails on page 3,000, re-fetching from the start is expensive and slow.
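
The three defences above can be combined in one loop, sketched here. All four callables are hypothetical hooks supplied by the caller, and the response shape ("data" plus an optional "next_cursor") is an assumption carried over from the basic pattern; a real API may name these fields differently.

```python
def sync_all(fetch_page, process, load_checkpoint, save_checkpoint, max_pages=10_000):
    """Page through an API with loop guards and a resumable checkpoint.

    fetch_page(cursor) is assumed to return a dict with "data" and an
    optional "next_cursor"; load_checkpoint() returns None on a fresh run.
    """
    cursor = load_checkpoint()
    for _ in range(max_pages):
        response = fetch_page(cursor)
        records = response.get("data", [])
        if records:                      # tolerate empty intermediate pages
            process(records)
        next_cursor = response.get("next_cursor")
        if not next_cursor:
            return                       # normal termination: no more pages
        if next_cursor == cursor:
            raise RuntimeError("API returned the same cursor twice")
        cursor = next_cursor
        save_checkpoint(cursor)          # persist only after the page was processed
    raise RuntimeError(f"gave up after {max_pages} pages")
```

Saving the checkpoint after processing means a crash can cause one page to be processed twice on restart, so process should be idempotent. Saving before processing would risk skipping a page instead.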

Bulk APIs vs Individual Record APIs

Many APIs offer batch endpoints alongside individual record endpoints. A batch endpoint accepts an array of records and creates, updates, or retrieves all of them in a single request. Using batch endpoints where available is almost always the right choice:

It reduces request count (and therefore rate limit consumption) by orders of magnitude.
It reduces latency — one round trip for 100 operations instead of 100 round trips.
It often reduces cost on APIs that charge per request.

The tradeoff is handling partial failures. A batch request that creates 100 records might succeed for 97 and fail for 3. The API response will typically include per-record success/failure indicators. Your code needs to handle this correctly — don't treat a 200 response on a batch endpoint as a guarantee that all records were processed.
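
A small helper can separate the records that succeeded from those that need a retry or manual attention. The response shape used here, a "results" list in submission order with a "status" and optional "error" per item, is a hypothetical example; check how your API actually reports per-record outcomes.

```python
def split_batch_results(records, response):
    """Pair each submitted record with its per-record result.

    Assumes a hypothetical response shape {"results": [{"status": "ok"}, ...]}
    with results in the same order as the submitted records.
    """
    results = response["results"]
    if len(results) != len(records):
        raise ValueError("result count does not match submitted record count")
    succeeded, failed = [], []
    for record, result in zip(records, results):
        if result.get("status") == "ok":
            succeeded.append(record)
        else:
            failed.append((record, result.get("error", "unknown error")))
    return succeeded, failed
```

The failed list can then be routed to a retry queue or a dead-letter table, depending on whether the error looks transient.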

For very large operations, chunk your batches to the maximum size the API allows per request. Sending 10,000 records in a single request is often not supported and will return an error; sending 100 batches of 100 records is the correct approach.
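
Chunking itself is a one-function job. The sketch below assumes the records are already in a list and that you know the API's maximum batch size; the name chunked is illustrative.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each at most size long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


records = list(range(10_000))
batches = list(chunked(records, 100))   # 100 batches of 100 records each
```

The final batch is simply shorter when the total is not a multiple of the batch size.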

API Quotas vs Rate Limits

These are related but distinct concepts with different mitigation strategies. A rate limit is a short-window constraint, typically per second or per minute: you can make more requests once the window resets. A quota is a long-window allowance, typically per day or per month: once you've used your daily quota, you're blocked until tomorrow.

Hitting a rate limit means: slow down and retry. Hitting a quota means: stop completely and wait for reset, or upgrade your plan. Your error handling needs to distinguish between them. A 429 from a rate limit gets retried with backoff. A 429 from a quota exhaustion should trigger an alert and pause processing — retrying aggressively won't help and may just cost you money.
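
How you tell the two apart depends entirely on the API. The sketch below shows one possible heuristic, under loudly stated assumptions: a hypothetical "error_type" field in the response body, and the guess that a Retry-After of an hour or more indicates quota exhaustion rather than a short-window limit. Neither is a standard; treat this as a starting point to adapt.

```python
def classify_429(headers, body, quota_threshold_seconds=3600):
    """Return "quota" or "rate_limit" for a 429 response.

    Assumptions: body is the parsed JSON dict and may carry a hypothetical
    "error_type" field; a very long Retry-After is treated as a quota signal.
    """
    if body.get("error_type") == "quota_exceeded":
        return "quota"
    retry_after = headers.get("Retry-After", "").strip()
    if retry_after.isdigit() and int(retry_after) >= quota_threshold_seconds:
        return "quota"
    return "rate_limit"
```

A "rate_limit" result goes to the backoff path; a "quota" result should pause the job and raise an alert instead of retrying.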

Build quota tracking into your integrations for APIs with daily quotas. Check your remaining quota at startup and at regular intervals. Alert when you're approaching the limit (say, at 80%) so you can decide whether to slow down or arrange a quota increase.
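
The threshold check itself is simple arithmetic. In this sketch, used and limit would come from the API's usage endpoint or response headers (an assumption; where these numbers live varies by provider), and the 0.8 default mirrors the 80% figure above.

```python
def quota_status(used, limit, alert_ratio=0.8):
    """Classify quota consumption as "ok", "alert", or "exhausted"."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    if ratio >= 1.0:
        return "exhausted"
    if ratio >= alert_ratio:
        return "alert"
    return "ok"
```

Running this check at startup and on a timer gives you a chance to slow a batch job down before the quota runs out mid-run.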

Dealing with Undocumented Rate Limits

Most large APIs have secondary rate limits that are not documented in their official docs. These are typically:

Concurrent connection limits (how many requests can be in-flight simultaneously)
Per-endpoint rate limits that differ from the global rate limit
Rate limits on specific operations like bulk deletes or expensive searches
Rate limits that only apply when result sets are very large

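GitHub's secondary rate limits are a well-known example: you can stay within the primary limit of 5,000 requests per hour and still be throttled, with a 403 or a 429, by separate limits on concurrency and request bursts whose exact thresholds are easy to miss and not fully published.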

Discovering undocumented limits requires empirical testing. Log your response times — a sudden increase in latency often precedes a 429. Log every response header, including undocumented ones; some APIs include hints in headers they don't document publicly. Treat any 429 you can't explain from the documented limits as evidence of an undocumented secondary limit and search the provider's developer forums and GitHub issues.

Testing Your Rate Limit Handling

Rate limit handling is difficult to test against live APIs because you'd need to deliberately exhaust your quota. The right approach is to mock 429 responses in your test suite.

Write tests that simulate: a 429 with a Retry-After header, a 429 without a Retry-After header, consecutive 429s (to verify your exponential backoff), and a 429 followed by a successful response (to verify recovery). These scenarios cover the cases most likely to fail in production.
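
These scenarios can be scripted with unittest.mock, using side_effect to return a sequence of canned responses. In the sketch below, get_with_retry is a stand-in for your own client code (it has no jitter, so the expected sleeps are exact), and the (status, headers) tuple shape is an assumption for illustration.

```python
from unittest import mock


def get_with_retry(send, sleep, max_attempts=4):
    """Stand-in for the client under test; replace with your own retry wrapper."""
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        # Prefer Retry-After; otherwise back off 1, 2, 4, ... seconds.
        sleep(float(headers.get("Retry-After", 2 ** attempt)))
    return 429


def test_retry_after_is_honoured():
    send = mock.Mock(side_effect=[(429, {"Retry-After": "7"}), (200, {})])
    sleep = mock.Mock()
    assert get_with_retry(send, sleep) == 200
    sleep.assert_called_once_with(7.0)


def test_backoff_grows_without_retry_after():
    send = mock.Mock(side_effect=[(429, {}), (429, {}), (429, {}), (200, {})])
    sleep = mock.Mock()
    assert get_with_retry(send, sleep) == 200
    assert [call.args[0] for call in sleep.call_args_list] == [1.0, 2.0, 4.0]
```

If your real wrapper adds jitter, assert on ranges or inject a seeded random source rather than comparing exact sleep values.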

For load testing integration endpoints, tools like k6 or Locust can generate controlled request volume. Run your integration against a staging environment with a lower rate limit to verify your throttling logic activates at the right threshold.

For chaos testing, introduce artificial throttling in your local or staging environment by adding a middleware layer that randomly returns 429s with configurable frequency. This verifies that your retry and backoff logic is actually invoked during normal operation, not just in unit tests.
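
Such a middleware can be as small as a wrapper function. In this sketch, handler is a hypothetical callable that takes a request and returns (status_code, headers, body); the injected response body and the rng parameter are likewise illustrative choices, with rng exposed so test runs can be made reproducible.

```python
import random


def with_random_429s(handler, failure_rate=0.1, rng=None):
    """Wrap a request handler so a fraction of calls return a synthetic 429.

    handler is a hypothetical callable: request -> (status_code, headers, body).
    """
    rng = rng or random.Random()

    def wrapped(request):
        if rng.random() < failure_rate:
            # Short-circuit with a fake throttle response.
            return 429, {"Retry-After": "1"}, {"error": "injected throttle"}
        return handler(request)

    return wrapped
```

Pass a seeded random.Random to get the same sequence of injected failures on every run, which makes intermittent bugs much easier to reproduce.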

Real-World Patterns by Platform

Each major platform has idiosyncrasies worth knowing before you start:

Stripe documents a default limit on the order of 100 requests per second in live mode (lower in test mode), enforced in a way that tolerates short bursts. Do not assume the X-RateLimit-* headers described earlier are present on every response; check which headers a throttled Stripe response actually carries. Two idiosyncrasies are worth knowing. First, webhook delivery and its retries run on their own schedule, independent of your API calls, so your handlers may see the same event more than once. Second, Stripe supports idempotency keys on write requests: send one on every write operation to make retries safe.

HubSpot has a complex tiered rate limit system that varies by plan and by API product. Their CRM APIs have both daily and per-10-second limits. HubSpot's Retry-After header gives you the exact wait time but it's in seconds, not milliseconds — a common parsing bug. Their search APIs have separate, stricter limits than CRUD APIs, which surprises developers who assume search is just a read operation.

Salesforce uses a concurrent API request limit (a cap on simultaneous in-flight requests per org, which its documentation applies to long-running requests) in addition to a 24-hour limit. The concurrent limit is the one most integrations hit first, not the daily limit. Use Salesforce's Bulk API for large data operations: it is specifically designed for batch processing and has much more generous limits than the standard REST API.

Slack uses a tier-based rate limit system where different API methods have different limits. Tier 1 methods allow roughly 1 request per minute, while Tier 4 methods allow 100 or more per minute, and some methods (chat.postMessage, for example) carry special limits of their own. Common listing methods such as users.list sit in a middle tier rather than Tier 1, so check the tier on each method's reference page. Many developers assume one limit applies everywhere, which leads to surprising 429s on specific methods.

Building integrations that survive in production is mostly a matter of treating rate limits and pagination as first-class concerns from the start — not as afterthoughts to handle when they cause incidents. If you're working on an integration that needs to be production-grade, our team can help you build it right from day one.