Yesterday you read the rate-limit headers. Today you react to them. The pattern: sleep proportional to how low the remaining quota is.
import time

# Simulated headers from 5 successive calls (in real code these come from r.headers)
remaining_per_call = [80, 60, 40, 20, 5]
sleeps = []
for remaining in remaining_per_call:
    if remaining < 10:
        sleep_secs = 1.0  # near limit — slow down
    elif remaining < 30:
        sleep_secs = 0.2  # getting low — small pause
    else:
        sleep_secs = 0.0  # plenty of headroom
    sleeps.append(sleep_secs)
    # time.sleep(sleep_secs)  # would do this in real code
print(sleeps)

Expected: [0.0, 0.0, 0.0, 0.2, 1.0]. Three calls run free, the fourth pauses 200 ms, the fifth pauses 1 second.
Why proportional? Why not always sleep the same amount?
Most calls don't need to slow down. If you sleep 200ms between every call when you're at 80% remaining, you waste 80% of your wall-clock time. Adaptive sleeps mean fast when you can, slow when you must.
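To put a number on that, count the pauses under the thresholds above for one hypothetical run: assume remaining counts down from 99 to 0 across 100 calls (a sketch scenario, not any particular API), and compare against a flat 200 ms sleep before every call.

```python
def sleep_for_remaining(remaining):
    if remaining < 10:
        return 1.0
    if remaining < 30:
        return 0.2
    return 0.0

# remaining falls 99, 98, ..., 0 over 100 calls
sleeps = [sleep_for_remaining(r) for r in range(99, -1, -1)]
free = sum(1 for s in sleeps if s == 0.0)
fixed_total = 0.2 * len(sleeps)      # flat 200 ms before every call
adaptive_total = sum(sleeps)
print(free, round(fixed_total, 1), round(adaptive_total, 1))  # → 70 20.0 14.0
```

Seventy of the hundred calls run with no pause at all; the total sleep time drops too, and all of it is concentrated where it actually protects you.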
And when remaining is zero?
You stop. Use the X-RateLimit-Reset timestamp — sleep until that time, then resume. We're not implementing that today; the threshold pattern is enough.
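For reference, the reset wait can be sketched in a few lines, assuming X-RateLimit-Reset carries a Unix timestamp in seconds (as it does on, e.g., GitHub's API); the helper name is mine:

```python
import time

def wait_until_reset(reset_epoch, now=None):
    """Sleep until the reset timestamp; no-op if it is already in the past."""
    now = time.time() if now is None else now
    delay = max(0.0, float(reset_epoch) - now)
    time.sleep(delay)
    return delay
```

You would call it only when remaining hits zero, with `int(r.headers["X-RateLimit-Reset"])` as the argument.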
After each API call, read the rate-limit headers and decide whether to sleep before the next call:
import time
import requests

def sleep_for_remaining(remaining):
    if remaining < 10:
        return 1.0  # near limit — slow down
    if remaining < 30:
        return 0.2  # getting low — small pause
    return 0.0  # plenty of headroom

for item in items:
    r = requests.get(...)
    remaining = int(r.headers.get("X-RateLimit-Remaining", 0))
    time.sleep(sleep_for_remaining(remaining))

The numbers above (10/30, 0.2 s/1.0 s) are reasonable defaults. Tune them to your API's window: an API allowing 100 calls per minute needs tighter thresholds than one allowing 100 calls per hour.
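The thresholds form a step function. If you want the sleep literally proportional to how depleted the quota is, one possible sketch (the quadratic ramp is a design choice of mine, not anything the API mandates):

```python
def proportional_sleep(remaining, limit, max_sleep=1.0):
    """Ramp from 0 (full quota) to max_sleep (empty), growing faster near the end."""
    used = 1.0 - (remaining / limit)
    return max_sleep * used ** 2
```

At 80/100 remaining this sleeps 0.04 s; at 5/100, about 0.9 s. Squaring keeps sleeps near zero while headroom is plentiful, which is the same shape the thresholds approximate.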
A common mistake: read remaining, decide "I'm fine", then send a call that comes back with a lower remaining than expected. Each response's headers already count the call that produced it. So:

# A call that returned X-RateLimit-Remaining: 1 means YOU just spent the second-to-last token.
# The next call will hit zero.

Reading the previous call's response and sleeping based on it is the right rhythm.
When you do get a 429 anyway, honor the Retry-After header:

for attempt in range(3):
    r = requests.get(url, timeout=10)
    if r.status_code == 429:
        retry_after = int(r.headers.get("Retry-After", 5))
        time.sleep(retry_after)
        continue
    break

The two patterns layer: adaptive sleep on remaining plus explicit Retry-After on 429. The first prevents hitting the limit; the second catches you when you do.
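One wrinkle worth knowing: per RFC 9110, Retry-After may be either delta-seconds ("120") or an HTTP date ("Wed, 21 Oct 2015 07:28:00 GMT"). A sketch that handles both (parse_retry_after is my name, not a library function):

```python
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def parse_retry_after(value, default=5.0):
    """Return seconds to wait from a Retry-After header value."""
    if value is None:
        return default
    try:
        return float(value)  # delta-seconds form
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
```

Most APIs send plain seconds, so `int(r.headers.get("Retry-After", 5))` is usually fine; this is insurance for the date form.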
Server-side 429s are slow (round-trip + parse) and cost you a real call against the limit. Client-side rate limiting (sleep before sending) is free — your script self-throttles before the API has to push back.
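A minimal client-side throttle can be sketched as a helper that enforces a floor between consecutive calls (the class name and interval are my invention, not part of any library):

```python
import time

class MinIntervalLimiter:
    """Client-side throttle: guarantee a minimum gap between consecutive calls."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last is not None and now - self._last < self.min_interval:
            time.sleep(self.min_interval - (now - self._last))
        self._last = time.monotonic()
```

Call `limiter.wait()` before each request; the first call passes instantly, and later calls pause only if they arrive too soon.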
Multiple instances of your script (or multiple users of the same API key) can synchronize their sleeps. Add a small random jitter to spread retries:

import random

time.sleep(sleep_secs + random.uniform(0, 0.1))

That is for production. For learning, deterministic sleeps keep the pattern legible.