System Design – Designing a Rate Limiter

Where to implement a rate limiter depends mainly on factors such as our application, technology stack, and engineering team. There are usually three places to choose from: the client side, the server side, or a middleware.

The client is an unreliable place to enforce rate limiting, since malicious actors can easily forge client requests.

Rather than building the rate limiter into the API servers themselves, a better option is a rate limiter middleware that throttles requests before they reach our servers. So if you're using a microservices architecture and already have something like an authentication middleware in place, you can implement the rate limiter middleware alongside it.

There are many rate limiting algorithms to choose from; we'll cover a few of them, including the token bucket, leaky bucket, and sliding window algorithms.

Token Bucket Algorithm

Stripe [1] uses the token bucket algorithm to rate limit its API requests.

According to the Stripe tech blog:

We use the token bucket algorithm to enforce rate limiting. This algorithm has a central bucket host from which you take tokens with each request and slowly trickle more tokens into the bucket. If the bucket is empty, reject the request. In our case, every Stripe user has a bucket, and every time they make a request, we delete a token from that bucket. We use Redis to implement our rate limiter.

Buckets have a predefined token capacity. When a request arrives, it takes a token from the bucket and is processed further. If no tokens are available, the request is dropped and the user has to try again.

Example use case:

- We set the rate limiting rule to 3 requests per minute per user.
- User 1 makes a request at second 00. The bucket is full with 3 tokens, so the request is processed. The number of tokens in the bucket is now 2.
- User 1 makes a second request at second 10. There are 2 tokens in the bucket, so the request is processed. The number of tokens in the bucket is now 1.
- User 1 makes a third request at second 30. There is 1 token in the bucket, so the request is processed. The bucket is now empty until the next refill.
- User 1 makes a fourth request at second 55. There are no tokens in the bucket, so the request is throttled: the user receives a 429 Too Many Requests response and is asked to try again later. (The HTTP 429 status code indicates that the user has sent too many requests within a given period of time.)
- Once that minute completes, the token count is refreshed at the fixed rate, the bucket is full with 3 tokens again, and requests from that user can be processed again.

Simply put:

In the token bucket algorithm, each request consumes one token from the bucket. New tokens are put into the bucket at rate r, and the bucket can hold at most b tokens; tokens added once the bucket is full overflow and are discarded. If the bucket is empty when a request arrives, the request is dropped.

The token bucket algorithm requires two parameters:

- Bucket size: the maximum number of tokens the bucket can hold.
- Refill rate: the number of tokens put into the bucket per unit of time (per second in the code below).

Token Bucket Algorithm Code Example

package main


import (
    "fmt"
    "math"
    "time"
)


const (
    MAX_BUCKET_SIZE float64 = 3 // bucket size: the bucket holds at most 3 tokens
    REFILL_RATE     int     = 1 // refill rate: 1 token added per second
)


type TokenBucket struct {
    currentBucketSize   float64 // tokens currently available in the bucket
    lastRefillTimestamp int     // time of the last refill, in nanoseconds
}


func (tb *TokenBucket) allowRequest(tokens float64) bool {
    tb.refill() //refill of bucket happening at constant REFILL_RATE


    if tb.currentBucketSize >= tokens {
        tb.currentBucketSize -= tokens
        return true
    }


    return false
}


func getCurrentTimeInNanoseconds() int {
    return int(time.Now().UnixNano()) // nanoseconds since the Unix epoch
}


func (tb *TokenBucket) refill() {
    nowTime := getCurrentTimeInNanoseconds()


    // Tokens accumulated since the last refill, at REFILL_RATE tokens per second.
    tokensToAdd := float64(nowTime-tb.lastRefillTimestamp) * float64(REFILL_RATE) / 1e9

    tb.currentBucketSize = math.Min(tb.currentBucketSize+tokensToAdd, MAX_BUCKET_SIZE)
    tb.lastRefillTimestamp = nowTime
}


func main() {
    obj := TokenBucket{
        currentBucketSize: 3,
        lastRefillTimestamp: 0,
    }


    fmt.Printf("Request processed: %v\n", obj.allowRequest(1)) // true
    fmt.Printf("Request processed: %v\n", obj.allowRequest(1)) // true
    fmt.Printf("Request processed: %v\n", obj.allowRequest(1)) // true
    fmt.Printf("Request processed: %v\n", obj.allowRequest(1)) // false, request dropped
}

Leaky Bucket Algorithm

The leaky bucket is an easy and intuitive way to implement rate limiting using a queue. The bucket is a first-in-first-out (FIFO) queue: incoming requests are appended to the queue, and if there is no room left, new requests are dropped. Requests "leak" out of the queue at a constant rate as they are processed.

How the algorithm works:

- When a request arrives, the system checks whether the queue is full. If it is not full, the request is appended to the queue.
- Otherwise, the request is dropped.
- Requests are pulled from the queue and processed at regular intervals.

The leaky bucket algorithm accepts the following two parameters:

- Bucket size: equal to the queue size. The queue holds the requests to be processed at a fixed rate.
- Outflow rate: defines how many requests are processed per unit of time, usually per second.

The advantage of this algorithm is that it handles bursts of requests smoothly and processes them at roughly an average rate.
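
To make this more concrete, here is a minimal Go sketch of the queue-based approach described above (an illustration, not a production implementation). The bucket is modeled as a bounded channel acting as the FIFO queue, and a background worker drains it at the fixed outflow rate; the bucket size of 3 and the outflow rate of 1 request per second are arbitrary example values.

package main

import (
    "fmt"
    "time"
)

// LeakyBucket models the bucket as a bounded FIFO queue.
// Requests that do not fit into the queue are dropped.
type LeakyBucket struct {
    queue chan string // pending requests; capacity = bucket size
}

func NewLeakyBucket(bucketSize int) *LeakyBucket {
    return &LeakyBucket{queue: make(chan string, bucketSize)}
}

// Add appends a request to the queue if there is room, otherwise drops it.
func (lb *LeakyBucket) Add(request string) bool {
    select {
    case lb.queue <- request:
        return true // queued for processing
    default:
        return false // queue full, request dropped
    }
}

// Leak pulls requests from the queue at the fixed outflow rate (per second).
func (lb *LeakyBucket) Leak(outflowPerSecond int) {
    ticker := time.NewTicker(time.Second / time.Duration(outflowPerSecond))
    defer ticker.Stop()
    for range ticker.C {
        select {
        case req := <-lb.queue:
            fmt.Println("processing:", req)
        default:
            // queue is empty, nothing to process this tick
        }
    }
}

func main() {
    lb := NewLeakyBucket(3) // bucket size: 3
    go lb.Leak(1)           // outflow rate: 1 request per second

    // The first 3 requests fit into the queue; the remaining 2 are dropped.
    for i := 1; i <= 5; i++ {
        fmt.Printf("request %d accepted: %v\n", i, lb.Add(fmt.Sprintf("req-%d", i)))
    }

    time.Sleep(5 * time.Second) // give the worker time to drain the queue
}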

Sliding Window Algorithm

- The algorithm keeps track of request timestamps. Timestamp data is usually kept in a cache, such as a Redis sorted set.
- When a new request comes in, remove all expired timestamps. A timestamp is expired if it is older than the start of the current time window.
- Add the new request's timestamp to the log.
- If the log size is equal to or lower than the number of allowed requests, accept the request. Otherwise, reject it.

In the example below, the rate limiter allows 2 requests per minute. Requests exceeding this limit within the window are dropped.

Note: We use the hh:mm:ss format for readability, but in Redis we would store UNIX timestamps.

- When a new request arrives at 1:00:01, the log is empty. Therefore, the request is allowed.
- A new request arrives at 1:00:30, and the timestamp 1:00:30 is inserted into the log. After insertion, the log size is 2, not larger than the number of allowed requests. Therefore, the request is allowed.
- A new request arrives at 1:00:50, and its timestamp is inserted into the log. After insertion, the log size is 3, larger than the allowed size of 2. Therefore, the request is rejected, but its timestamp remains in the log.
- A new request arrives at 1:01:40. Requests in the range [1:00:40, 1:01:40) are within the current window, but requests sent before 1:00:40 are outdated. The two obsolete timestamps, 1:00:01 and 1:00:30, are removed from the log. After the delete operation, the log size becomes 2; therefore, the request is accepted.
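
Here is a minimal in-memory Go sketch of this sliding window log, mirroring what the Redis sorted set would hold for each user. It is an illustration rather than a production implementation; the user ID, window size, and timestamps are simply the example values from the walkthrough above.

package main

import (
    "fmt"
    "time"
)

// SlidingWindowLog keeps per-user request timestamps, mirroring what a
// Redis sorted set would store for each user.
type SlidingWindowLog struct {
    window time.Duration          // window length, e.g. 1 minute
    limit  int                    // allowed requests per window
    log    map[string][]time.Time // userID -> request timestamps
}

func NewSlidingWindowLog(window time.Duration, limit int) *SlidingWindowLog {
    return &SlidingWindowLog{window: window, limit: limit, log: make(map[string][]time.Time)}
}

// AllowRequest removes expired timestamps, records the new one, and accepts
// the request only if the log size stays within the limit.
func (s *SlidingWindowLog) AllowRequest(userID string, now time.Time) bool {
    windowStart := now.Add(-s.window)

    // Remove all timestamps older than the start of the current window.
    kept := s.log[userID][:0]
    for _, t := range s.log[userID] {
        if t.After(windowStart) {
            kept = append(kept, t)
        }
    }

    // Add the new request's timestamp to the log (it stays even if rejected).
    kept = append(kept, now)
    s.log[userID] = kept

    // Accept only if the log size is within the allowed number of requests.
    return len(kept) <= s.limit
}

func main() {
    limiter := NewSlidingWindowLog(time.Minute, 2) // 2 requests per minute

    base := time.Date(2023, 1, 1, 1, 0, 0, 0, time.UTC) // arbitrary example date
    fmt.Println(limiter.AllowRequest("user1", base.Add(1*time.Second)))   // true  (1:00:01)
    fmt.Println(limiter.AllowRequest("user1", base.Add(30*time.Second)))  // true  (1:00:30)
    fmt.Println(limiter.AllowRequest("user1", base.Add(50*time.Second)))  // false (1:00:50)
    fmt.Println(limiter.AllowRequest("user1", base.Add(100*time.Second))) // true  (1:01:40)
}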

Detailed design of the rate limiter:

We will use the following components:

- Configuration and data storage to store the rate limiter rules.
- Worker processes that frequently fetch the rules from disk and store them in the cache.
- The rate limiter middleware, which fetches rules from the cache. It also fetches data such as timestamps and counters from Redis. When a request comes in, it does the calculations and decides whether to process the request or rate limit it.
- If the request needs to be rate limited, there are two options:
  - Option 1: Reject the request and send status code 429 (Too Many Requests) to the client.
  - Option 2: Push the request onto a message queue to process it later.
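
For illustration only, a rule pulled from the configuration store might be represented by a small structure like the one below; the field names and example values are hypothetical, not a prescribed format.

package main

import "fmt"

// RateLimitRule is a hypothetical shape for a rule that is kept in the
// configuration store, cached by the workers, and read by the middleware.
type RateLimitRule struct {
    Domain          string // e.g. "auth" or "messaging"
    Key             string // what is being limited, e.g. "login" per user
    RequestsPerUnit int    // how many requests are allowed ...
    Unit            string // ... per this unit of time: "second", "minute", ...
}

func main() {
    // Example rule: each user may attempt to log in at most 5 times per minute.
    rule := RateLimitRule{Domain: "auth", Key: "login", RequestsPerUnit: 5, Unit: "minute"}
    fmt.Printf("%+v\n", rule)
}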

Now, we may need to extend the above system into a distributed environment to support multiple servers and concurrent threads. Two challenges may arise here:

Race condition

For each incoming request, the rate limiter must:

Read the counter value from Redis, check whether (counter + 1) exceeds the threshold, and, if it does not, increment the counter value in Redis by 1.

Suppose the counter value in Redis is 3. If two requests read the counter value concurrently before either of them writes the updated value back, each will increment the counter by 1 and write it back without checking the other thread. Both requests (threads) believe they have the correct counter value of 4. However, the correct counter value should be 5.

Locks can be used here, but they may hurt performance. A better alternative is a Redis Lua script [2]:

- It may result in better performance than locking.
- All steps in the script are executed atomically; while the script is executing, no other Redis commands can run.
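
As a rough sketch of what such a script might look like (not Stripe's or any particular library's implementation), here is an atomic check-and-increment written in Lua and embedded in a Go constant; the key name and the EVAL arguments shown in the comments are hypothetical.

package main

import "fmt"

// rateLimitScript sketches an atomic "check and increment" in Lua. Redis
// executes the whole script as a single atomic operation, so two concurrent
// requests can no longer read the same stale counter value.
const rateLimitScript = `
local counter = tonumber(redis.call('GET', KEYS[1]) or '0')
if counter + 1 > tonumber(ARGV[1]) then
    return 0 -- over the threshold, reject the request
end
counter = redis.call('INCR', KEYS[1])
if counter == 1 then
    -- first request in this window: start the window's expiry timer
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
return 1 -- within the threshold, allow the request
`

func main() {
    // The script would be sent to Redis with EVAL, for example:
    //   EVAL <script> 1 rate_limit:user_1 3 60
    // where rate_limit:user_1 is a hypothetical per-user counter key,
    // 3 is the allowed number of requests, and 60 is the window in seconds.
    fmt.Println(rateLimitScript)
}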

Synchronization

In a distributed environment, synchronization is another important factor to consider. To support millions of users, a single rate limiting server may not be able to handle all the traffic. When multiple rate limiter servers are used, they need to be synchronized.

One possible solution is to use sticky sessions, allowing clients to send traffic to the same rate limiter. This solution is neither scalable nor flexible. A better approach is to use a centralized data store like Redis.

Once everything is in place, we should also monitor the rate limiter to gather metrics on its performance, its rules, the effectiveness of the algorithm, and so on. For example, if a rate limiting rule is too strict, many valid requests will be dropped; in that case, we may want to relax the rule a bit. As another example, if the rate limiter becomes ineffective during a sudden surge in traffic, such as panic buying, we can switch to an algorithm that supports bursty traffic. The token bucket is a perfect fit here.

Reference links

[1] Stripe rate limiters: https://stripe.com/blog/rate-limiters
[2] Redis Lua scripting: https://www.freecodecamp.org/news/a-quick-guide-to-redis-lua-scripting/