Efficient paging with MySQL and Redis ordered sets

Paging API results with MySQL

Let’s say you’re building an API and need to fetch a bunch of products based on category. Your query will probably look something like this:

SELECT * FROM products WHERE category=1;

Now, this query might have thousands of results, and you’re certainly not going to return them all at once. You need to page your results. What are the possible solutions? Well, the obvious one is to use an offset:

SELECT * FROM products WHERE category=1 LIMIT 100 OFFSET 200;

This would return page=2 (if pages are zero-indexed and you return 100 results per page). But you’re going to start running into problems very quickly once you get into this sort of territory:

SELECT * FROM products WHERE category=1 LIMIT 100 OFFSET 4000;

As has been explained elsewhere, OFFSET actually needs to run through those 4000 rows before returning the ones you’re after.

There are a few other solutions, such as adding an id condition to the WHERE clause, like so:

SELECT * FROM products WHERE category=1 AND id > 12345 ORDER BY id LIMIT 100;

But this only works if:

  • We’re building a webpage with a bunch of page links, to which we can append something like “last_id_seen=456” as a querystring parameter and use that as our criterion (basically building all the page links in advance). This example is for an API, so that won’t work. Also, I don’t find this a very elegant solution.
  • Or if your ids are all perfectly consecutive, with no gaps, which will never happen.

Better paging with Redis ordered sets

The main idea here is to perform our DB query only once and cache the result (pretty standard). There is an important nuance, though: we cache the result in an ordered set rather than in one large serialized string, so that we can query only the parts we want (remember, the result set might be 10,000 rows; we don’t want to load the whole thing and then slice it). For this we’ll use two Redis keys, sketched right after this list:

  • products/12345 as an ordered set to hold the product data for category 12345. We pair it with a second key that tells us when this data needs refreshing.
  • valid:products/12345 as a normal key/value that tells us whether the cached data is still fresh. For this example, I store the length of the result set in this key (I reuse it later to report the total result count in the client response), but it could really be anything.
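
To make the scheme concrete, here’s a minimal sketch of the two keys in redis-py (toy data, a local Redis instance assumed; key names follow the example above):

import pickle

import redis

r = redis.Redis()
rows = [{"id": 1}, {"id": 2}, {"id": 3}]  # stand-ins for DB rows

# the ordered set: each pickled row is scored with its position
r.zadd("products/12345", {pickle.dumps(row): i for i, row in enumerate(rows)})

# the validity key: holds the total count and expires after an hour
r.set("valid:products/12345", len(rows), ex=3600)

# a "page" is just a score range; members come back ordered by score
first_two = r.zrange("products/12345", 0, 1)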

As mentioned earlier, we need to push data into the ordered set with consecutive indexes we can use to query specific ranges. Redis’ ordered sets fit this perfectly, as you can assign each element a “score” along with its value. Assuming a Python implementation with Flask, here’s what it might look like:

import pickle

from flask import current_app

import redisser  # our thin wrapper around redis-py; functions shown below

cache_key = "products/{0}".format(category)
cache_key_valid = "valid:{0}".format(cache_key)

# get querystring parameters from the request; parse both as ints.
# implement these as you wish.
count, page = filtered_params
offset = count * page

# we use a separate key to check whether we need to refresh the data in the set
key_is_valid = redisser.get_key_is_valid(current_app.redis_conn, cache_key_valid)

# cache has expired, grab the data from the db
if key_is_valid is None:
    # first make sure to delete the currently cached redis data
    redisser.delete_ordered_set(current_app.redis_conn, cache_key)

    # get everything from the DB, load it into redis
    products = current_app.db.get_products_for_category(category)
    redisser.add_elements_to_sorted_set(current_app.redis_conn, cache_key, products)

    # get the count of all elements, store it in the valid key
    cache_count = redisser.get_ordered_set_count(current_app.redis_conn, cache_key)
    redisser.add_key_valid(current_app.redis_conn, cache_key_valid, cache_count)

# the correct data is now in redis. fetch the right range from it
cached_products = redisser.get_products(current_app.redis_conn, cache_key, offset, count)
products = [pickle.loads(x) for x in cached_products]
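
For context, here’s a hedged sketch of how this logic might sit inside a Flask view, reusing the cached count in the response as mentioned above (the route and the fetch_products wrapper are assumptions, not part of the original code):

from flask import Flask, current_app, jsonify, request

app = Flask(__name__)

@app.route("/categories/<int:category>/products")
def list_products(category):
    # hypothetical wrapper around the caching logic shown above
    count = int(request.args.get("count", 100))
    page = int(request.args.get("page", 0))
    products = fetch_products(category, count, page)
    # reuse the valid key's value as the total result count
    total = int(current_app.redis_conn.get("valid:products/{0}".format(category)) or 0)
    return jsonify({"products": products, "total": total})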

As you’ll notice, we check whether our “valid” key has expired. If the value is None, we reload the data from the DB into an ordered set. We need to delete the set first; otherwise, if the new result set is smaller than the one previously stored under the same key name, stale members would be left over.
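
Here’s a quick demonstration of that leftover problem on a throwaway key (assuming a local Redis):

import redis

r = redis.Redis()
r.zadd("demo", {b"a": 0, b"b": 1, b"c": 2})
r.zadd("demo", {b"x": 0, b"y": 1})  # "refresh" with fewer members, no delete first
print(r.zrange("demo", 0, -1))      # b"c" from the old data is still in the set

For good measure, here is what the other functions could look like: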

import pickle


def get_ordered_set_count(r, key):
    # ZCOUNT over (-inf, +inf) counts every member; equivalent to ZCARD
    return r.zcount(key, "-inf", "+inf")


def add_key_valid(r, key, count):
    # the valid key doubles as the total result count and expires after an hour
    expiry = 60 * 60
    r.set(key, count, ex=expiry)


def get_key_is_valid(r, key):
    # returns None once the key has expired, which triggers a refresh
    return r.get(key)


def delete_ordered_set(r, key):
    return r.delete(key)


def get_products(r, key, offset, count):
    # ZRANGE boundaries are inclusive, hence the -1
    upper_boundary = offset + count - 1
    return r.zrange(key, offset, upper_boundary)


def add_elements_to_sorted_set(r, key, data):
    # score each pickled row with its position, 0 to len(data)-1
    # (redis-py 3.x expects a {member: score} mapping for zadd)
    mapping = {pickle.dumps(x): i for i, x in enumerate(data)}
    r.zadd(key, mapping)

The magic really happens in the get_products function, where we query the ordered set and return only a particular range. In add_elements_to_sorted_set we set each element’s score to its position in the result set, from 0 to len(data)-1. We’re basically rebuilding a small queryable table with consecutive ids, in memory.
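
To make the range arithmetic concrete, here’s a small worked example, assuming 100 results per page and a zero-indexed page number:

count, page = 100, 2
offset = count * page  # 200
# get_products(r, key, 200, 100) runs ZRANGE key 200 299,
# returning the pickled rows scored 200 through 299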

This solution still means you have to optimize your original SQL query to be reasonably fast; otherwise the first client to hit the server with no cache available will wait a very long time. I’ve managed to get such queries returning in under 1000 milliseconds, and everything else just hits Redis. You should also tune the expiry time of the “valid” key, depending on how fresh you need your data to be.