RAXXO Studios, 10 min read

Rate Limiting Without Redis: 3 Patterns I Use in Serverless

Backend
TLDR
  • Durable counters give exact limits but cost per request
  • Edge config token buckets trade precision for near-zero latency
  • Signed-window limits need no storage at all
  • At small scale, signed windows handle most abuse cases for free

Rate limiting in serverless broke my mental model the first time I tried it. There is no long-lived process to hold a counter. Every request might land on a fresh instance with empty memory. I ended up using three different patterns depending on how strict the limit needs to be, and none of them require running Redis.

Why The Usual Advice Fails In Serverless

Every tutorial tells you to use Redis with an INCR and an EXPIRE. That works great when you have a server process that lives for weeks and a Redis instance sitting next to it. In serverless I had neither.

My functions spin up cold, handle a request, and die. There is no shared memory between two invocations. I tested this directly: I stored a counter in a module-level variable, deployed it, and hammered the endpoint. The counter reset constantly because the platform was spinning up new instances under load. Sometimes I would see 1, sometimes 14, never a number I could trust.
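A small simulation makes the failure concrete. This is only a sketch: the `FunctionInstance` class and the round-robin routing are my stand-ins for the platform, not how any real scheduler picks an instance.

```python
class FunctionInstance:
    """Stands in for one serverless instance with its own private memory."""

    def __init__(self) -> None:
        # Equivalent to a module-level variable: invisible to other instances.
        self.counter = 0

    def handle(self) -> int:
        self.counter += 1
        return self.counter


def round_robin(requests: int, instances: int) -> list[int]:
    """Spread requests across instances and record the count each one reports.

    Round-robin is an assumption made for a repeatable demo. Real platforms
    route less predictably and also recycle instances.
    """
    pool = [FunctionInstance() for _ in range(instances)]
    return [pool[i % instances].handle() for i in range(requests)]


# One long-lived process counts correctly.
single = round_robin(requests=8, instances=1)   # [1, 2, 3, 4, 5, 6, 7, 8]

# Four instances each see a quarter of the traffic and disagree on the total.
spread = round_robin(requests=8, instances=4)   # [1, 1, 1, 1, 2, 2, 2, 2]
```

Eight requests arrived, but no instance ever reports more than 2, so a limit of 5 would never trigger.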

So the in-memory approach is out the moment the platform runs more than one instance, which it will under any concurrent load. That leaves external state. The obvious answer is a managed Redis like Upstash, and for a while I paid for one. It cost me around 10 EUR a month and added 30 to 60 milliseconds of latency per call because the request had to travel to the Redis region and back. For an endpoint getting 200 requests a day, that felt absurd.

The deeper problem is that Redis solves a coordination problem I did not actually have at small scale. I was not trying to enforce a global limit across thousands of nodes. I was trying to stop one IP address or one API key from sending 500 requests in a minute. That is a much smaller problem, and smaller problems have cheaper solutions.

I broke my needs into three buckets. First, strict limits where exact counts matter, like a paid API tier capped at 1000 calls a day. Second, soft limits where roughly-right is fine, like throttling a contact form. Third, abuse prevention where I just want to make scripted floods expensive without storing anything.

Each of those maps to a different pattern. If you want the broader picture of how I keep infrastructure lean as a one-person studio, the Claude Blueprint covers the decision framework I use. Below are the three patterns and the tradeoffs I hit with each at real, small scale.

Pattern One: Durable-Object Style Counters

When I need an exact count, I reach for a durable object. Cloudflare Workers calls them Durable Objects. The idea is a single addressable instance that holds state and processes requests for one key one at a time. Think of it as a tiny single-threaded actor that owns a counter.

Here is the shape. A request comes in with an API key. I derive a durable object ID from that key. Every request for that key routes to the same object, no matter which edge location served the function. Inside the object I read the counter from its storage, compare it against the limit, and either increment it and allow the request or reject the request without counting it.

The win is correctness. Because one object handles one key serially, there are no race conditions. Two requests arriving at the same millisecond get queued and counted in order. I get an exact number every time. For a metered paid tier this matters, because a customer paying for 1000 calls a day should get exactly that, not 1043 because of a race.
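The serial, one-owner-per-key idea can be sketched without any platform at all. This is not the Durable Objects API. `KeyCounter`, `CounterNamespace`, and `hammer` are hypothetical names, and a lock stands in for the one-request-at-a-time delivery the platform would provide.

```python
import threading

SECONDS_PER_DAY = 86_400


class KeyCounter:
    """Owns the count for exactly one API key and handles requests serially."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.count = 0
        self.day = None  # day number the current count belongs to
        self._lock = threading.Lock()

    def try_consume(self, now_s: float) -> bool:
        today = int(now_s // SECONDS_PER_DAY)
        with self._lock:  # only one request at a time touches the counter
            if today != self.day:
                self.day, self.count = today, 0  # new day, fresh quota
            if self.count >= self.limit:
                return False  # reject without counting
            self.count += 1
            return True


class CounterNamespace:
    """Maps each API key to its single owning counter (one object per key)."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._objects: dict[str, KeyCounter] = {}
        self._lock = threading.Lock()

    def get(self, api_key: str) -> KeyCounter:
        with self._lock:
            if api_key not in self._objects:
                self._objects[api_key] = KeyCounter(self._limit)
            return self._objects[api_key]


def hammer(ns: CounterNamespace, api_key: str, requests: int, now_s: float) -> int:
    """Fire concurrent requests at one key and return how many were allowed."""
    results: list[bool] = []

    def worker() -> None:
        results.append(ns.get(api_key).try_consume(now_s))

    threads = [threading.Thread(target=worker) for _ in range(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)


namespace = CounterNamespace(limit=10)
allowed = hammer(namespace, "key-a", requests=50, now_s=0.0)
```

With 50 concurrent requests against a limit of 10, `allowed` comes out as exactly 10, because every check-and-increment happens inside the one owner of that key.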

The cost is real but small. Each durable object request bills separately from the worker request. On my traffic, an endpoint doing 50000 metered calls a month added roughly 1 EUR. That is fine for paid usage where the calls are tied to money coming in. It would be wasteful for a free contact form.

Latency was better than Redis for me, around 5 to 15 milliseconds, because the object lives at the edge near the request rather than in a single Redis region. The first request to a cold object is slower, maybe 40 milliseconds, then it warms.

The tradeoff to watch is the single-instance bottleneck. Because all requests for one key serialize through one object, a single very hot key can become a queue. For per-user limits this is fine because no single user generates enough traffic to matter. If you tried to route all of global traffic through one object you would create a chokepoint. I keep the partition key narrow, one object per API key, never a shared global object.

I reach for this pattern only when the count has to be exact and the calls have value attached. For everything else the next two patterns are cheaper.

Pattern Two: Edge Config Token Buckets

For soft limits I use a token bucket backed by edge config storage. Vercel calls it Edge Config, Cloudflare has KV, and both give you a key-value store that reads in single-digit milliseconds because the data is replicated near the function.

A token bucket works like this. Each key gets a bucket with a capacity, say 60 tokens, that refills at a rate, say 1 token per second. Every request takes a token. If the bucket is empty, the request is rejected. The clever part is I do not store a ticking counter. I store two numbers: the token count at last update and the timestamp of that update. On each request I calculate how many tokens should have refilled since then based on elapsed time, add them, cap at capacity, then subtract one.

That math means I only write on actual requests, not on a timer. A bucket that nobody touches for an hour just sits there, and the next request computes the full refill in one step. This keeps writes low, which matters because edge config writes are slower and sometimes metered while reads are cheap and fast.
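The refill math fits in a few lines. This is a minimal sketch with no storage layer. `Bucket` and `take_token` are illustrative names, and the caller is assumed to read the bucket from the key-value store before the call and write the returned bucket back only when the request was allowed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    tokens: float      # token count at the last update
    updated_at: float  # timestamp of that update, in seconds


def take_token(
    bucket: Bucket,
    now: float,
    capacity: float = 60.0,
    refill_per_s: float = 1.0,
) -> tuple[bool, Bucket]:
    """Return (allowed, bucket_to_store) for one request at time `now`."""
    elapsed = max(0.0, now - bucket.updated_at)  # guard against clock skew
    tokens = min(capacity, bucket.tokens + elapsed * refill_per_s)
    if tokens < 1.0:
        # Rejected: hand back the stored state unchanged, so no write is needed.
        return False, bucket
    return True, Bucket(tokens=tokens - 1.0, updated_at=now)


# A bucket that sat empty for an hour refills to capacity in one step.
ok, refreshed = take_token(Bucket(tokens=0.0, updated_at=0.0), now=3600.0)
# ok is True, refreshed == Bucket(tokens=59.0, updated_at=3600.0)
```

Returning the old bucket on rejection is a design choice in this sketch: the next request recomputes the refill from the same timestamp, so the result stays correct while rejected requests cost no writes.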

The honest tradeoff is consistency. Edge config is eventually consistent. If two requests hit two different regions at the same instant, both might read the same stale bucket and both pass, and whichever write lands last overwrites the other. At low traffic this almost never happens, and even when it does the typical worst case is an extra request or two slipping through. For throttling a contact form or a search endpoint, allowing 61 requests instead of 60 once in a while is harmless.

I measured the real numbers. Reads came back in 2 to 8 milliseconds. The eventual-consistency window on writes was usually under a second. For an endpoint where exactness does not matter, this is the best balance of cost and speed I found. It is effectively free at the volumes a small studio handles.

I covered the general principle of picking eventually-consistent storage when strict correctness is not required in my notes on structuring backends, and the token bucket is the cleanest example of that tradeoff. If you want to schedule social posts on a similar low-write cadence rather than hammering an API, I run that through Buffer instead of building my own queue.

Pattern Three: Signed-Window Limits With No Storage

The third pattern stores nothing at all, and it is the one I use most. The trick is a signed time window. Instead of a server remembering how many requests you made, the client carries a signed token that encodes the window, and the math does the limiting.

Here is the core idea. I divide time into fixed windows, say 60-second blocks. For a given client and window I compute an HMAC signature using a server secret. The first time a client hits the endpoint in a window, I issue a small token that includes the window start and a request count, signed so it cannot be forged. The client sends that token back on the next request. I verify the signature, read the count, increment, re-sign, and return the new token.

Because the token is signed, the client cannot lie about its count. Because the window is encoded, an old token from a previous window is simply ignored and the count resets. I never write to any database. The entire state lives in the signed token traveling between client and edge.
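Here is a rough sketch of that verify, increment, re-sign loop. The token layout (base64 JSON plus a hex HMAC), the field names, and the `SECRET`, `WINDOW_S`, and `LIMIT` values are all assumptions for illustration, not a fixed format.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

SECRET = b"demo-secret-change-me"  # hypothetical; load a real secret from config
WINDOW_S = 60                      # fixed window length in seconds
LIMIT = 5                          # requests allowed per client per window


def _sign(payload: bytes) -> str:
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def issue(client_id: str, window: int, count: int) -> str:
    """Encode client, window, and count, then append the signature."""
    payload = json.dumps({"c": client_id, "w": window, "n": count}, sort_keys=True).encode()
    return base64.urlsafe_b64encode(payload).decode() + "." + _sign(payload)


def check(client_id: str, token: Optional[str], now: float) -> tuple[bool, str]:
    """Return (allowed, token_for_next_request) without touching any storage."""
    window = int(now // WINDOW_S)
    count = 0  # missing, forged, foreign, or stale tokens all start from zero
    if token:
        try:
            body, sig = token.rsplit(".", 1)
            payload = base64.urlsafe_b64decode(body.encode())
            if hmac.compare_digest(sig.encode(), _sign(payload).encode()):
                data = json.loads(payload)
                if data["c"] == client_id and data["w"] == window:
                    count = data["n"]
        except (ValueError, KeyError, TypeError):
            count = 0
    if count >= LIMIT:
        return False, issue(client_id, window, count)
    return True, issue(client_id, window, count + 1)


# A cooperating client replays the returned token on every request.
token = None
decisions = []
for _ in range(7):
    allowed, token = check("203.0.113.7", token, now=10.0)
    decisions.append(allowed)
# decisions == [True, True, True, True, True, False, False]
```

Note that replaying an earlier token from the same window lowers the count just as dropping the token does, so this sketch has the same cooperation requirement as the pattern itself.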

The obvious limit is that this only works when the client cooperates by sending the token back. A determined attacker can just drop the token and start fresh. So this is not a security boundary against motivated abuse. What it is good at is stopping accidental floods, misbehaving scripts, and casual abuse, all of which respect the request-response cycle. It made my contact form and newsletter signup quiet without any storage cost.

I also use a stateless variant for pure abuse slowing. I require the client to solve a tiny proof-of-work or include a signed nonce that expires after 10 seconds. This costs me one HMAC verification, which takes on the order of microseconds, and costs an attacker real CPU to flood. It will not stop a serious botnet but it makes a casual script not worth running.
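The sketch below combines both halves of that variant, a signed nonce with a 10-second lifetime and a small hash puzzle, although either can be used alone. The challenge format, `DIFFICULTY_BITS`, and all function names are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

SECRET = b"demo-secret-change-me"  # hypothetical server secret
TTL_S = 10                         # challenge lifetime in seconds
DIFFICULTY_BITS = 12               # assumed: about 4096 hashes on average to solve


def issue_challenge(now: float) -> str:
    """Create 'issued_at.random.signature' so the server stores nothing."""
    msg = f"{int(now)}.{secrets.token_hex(8)}"
    sig = hmac.new(SECRET, msg.encode(), hashlib.sha256).hexdigest()
    return f"{msg}.{sig}"


def verify_challenge(challenge: str, now: float) -> bool:
    """One HMAC check plus an age check."""
    parts = challenge.split(".")
    if len(parts) != 3:
        return False
    issued, rand, sig = parts
    expected = hmac.new(SECRET, f"{issued}.{rand}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    # Signature is valid, so `issued` is a number the server wrote itself.
    return 0 <= now - int(issued) <= TTL_S


def meets_difficulty(challenge: str, answer: int) -> bool:
    """True when SHA-256(challenge.answer) starts with DIFFICULTY_BITS zero bits."""
    digest = hashlib.sha256(f"{challenge}.{answer}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0


def solve(challenge: str) -> int:
    """Client side: brute-force an answer. This is the CPU cost a flood pays."""
    answer = 0
    while not meets_difficulty(challenge, answer):
        answer += 1
    return answer


def accept(challenge: str, answer: int, now: float) -> bool:
    """Server side: cheap verification of an expensive-to-produce answer."""
    return verify_challenge(challenge, now) and meets_difficulty(challenge, answer)


challenge = issue_challenge(now=100.0)
answer = solve(challenge)
```

Because nothing is stored, a solved challenge can be replayed until it expires. The short lifetime is what bounds that, which fits the goal of slowing casual scripts rather than stopping a determined attacker.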

The numbers that sold me: zero storage reads, zero writes, latency under 1 millisecond because it is all in-process crypto. On a 100000-request month this pattern cost me nothing beyond the function invocations I was already paying for. For voice or image endpoints where the real cost is the downstream call to a paid service like ElevenLabs or Magnific, a free front-line throttle that blocks obvious floods protects the expensive part without adding any of its own overhead.

Bottom Line

I do not use one rate limiter, I use three, and the choice comes down to how much the count matters and who is paying for the request. Durable-object counters when the number must be exact and the calls have money attached. Edge config token buckets when roughly-right is fine and I want near-zero latency. Signed-window limits when I want to stop accidental floods for free and store nothing.

The lesson I keep relearning is that Redis solves a coordination problem most small projects do not have. Before reaching for shared external state, I ask whether I actually need a global exact count or just need to make abuse expensive. Usually it is the second one, and the second one is nearly free.

If you are building lean serverless backends as a solo operator, the same restraint applies everywhere. Pick the cheapest pattern that meets the actual requirement, not the one the tutorial assumed. I walk through more of that decision-making in the Claude Blueprint, where I lay out how I keep a one-person studio running without a wall of paid infrastructure.

This article contains affiliate links. If you sign up through them, I may earn a small commission at no extra cost to you. (Ad)
