Implementing a fake filesystem using a trie in python

A trie (also sometimes called a prefix tree or radix tree) is an ordered tree data structure. I was looking at ways to implement such a thing the other day, and came up with 3 possible applications, off the top of my head:

  • a URL shortening service
  • a virtual filesystem
  • a search engine

The idea behind a trie is that you can store shared parts of data in an ordered fashion. If you have a URL shortening service that needs to keep track of the url “http://baseurl.com/ABCDE” and “http://baseurl.com/ABCDEF”, you’re storing almost 2 identical strings, with a 1 character difference. How could we do this better? This is what a trie looks like:

secondTrie

It’s basically a map (or dictionary in python) of nested maps. It can be handy to have a delimiter when a node is the last one, but this is not necessary. To exemplify how it works in practice, I’ve started coding a quick and dirty fake filesystem in python. The idea behind using a trie to hold data structures if the same: if I have a directory /var, a directory, /var/www and another directory /var/www/mysite, it’s not necessary to repeat the “var” part 3 times, nor the “www” twice.

There are tons of tricky things about filesystems that complicates matters such as hidden files, creating intermediate directories when they don’t exist, permissions, etc. but I’ve ignored them for the most part here. I mainly wanted to implement an “add” function to see if this worked with a trie, and it sort of looks like this:

    @_input_path_sanitizer
    def add(self, path):
        logging.info("Adding directory {}".format(path))
        pieces = self.get_pieces(path)
        temp_trie = self._paths
        for piece in pieces:
            # if we're at the end of the trie, take out the delimiter.
            if piece in temp_trie and self.__end in temp_trie[piece]:
                del temp_trie[piece][self.__end]
            # setdefault() tries to get the key, if it exists. if it doesn't set it to the second parameter
            temp_trie = temp_trie.setdefault(piece, {})
        # set the delimiter here in all cases
        temp_trie.setdefault(self.__end, self.__end)

The whole thing is on github, but this function does most of the heavy lifting. Then you can do stuff like :

fs = FakeFs()
fs.mkdir("/poo/woo/soo/doo/poo")
fs.mkdir("/etc")
fs.mkdir("/")
fs.mkdir("/tmp")
fs.mkdir("/tmp/whatever")
fs.mkdir("/var/www/mysite")
fs.dir_exists("/sadsad")
fs.dir_exists("/etc")
print(fs.paths)

Which yields something along the lines of:

2014-08-23 12:15:42,449 Adding directory /poo/woo/soo/doo/poo
2014-08-23 12:15:42,449 Adding directory /etc
2014-08-23 12:15:42,449 Adding directory /
2014-08-23 12:15:42,449 Adding directory /tmp
2014-08-23 12:15:42,450 Adding directory /tmp/whatever
2014-08-23 12:15:42,450 Adding directory /var/www/mysite
2014-08-23 12:15:42,450 /sadsad does not exist
2014-08-23 12:15:42,450 /etc exists
['/', '/etc', '/poo/woo/soo/doo/poo', '/tmp/whatever', '/var/www/mysite']

Python threads VS processes

I was revisiting Jeff Knupp’s great article from 2012, Python’s hardest problemIt talks about the Global Interpreter Lock, or GIL, in python. It basically explains how the GIL works, and why it’s such an important problem for python coders.

Probably the most notable consequence of the GIL is that python cannot do “pure” multi-threaded operations, in the sense that only one thread can execute at any time. The GIL prevents strange things from happening when you can have more than one thread write to the same chunk of memory. Knupp also wrote a follow-up to that article, Python’s hardest problem, revisitedwhere he advises people who want to do many things at the same time (parallelism) to use the multiprocessing module.

It’s great advice, I’ve used multiprocessing in the wild. It needs a bit more effort to communicate data between the processes (typically using queues), but it’s well worth the security that separate processes afford you. In essence, every process can then run it’s own thread without sharing any memory.

As I was reading down the article, I noticed he didn’t have any examples! So I started playing around, just for fun. Let’s make a very simple  program that appends integers from 0 to 999998 and discards the list, 50 times.

Version 1: simple single-threaded

import time
 
nb_repeat = 50
 
 
def a_complex_operation(*args):
    a = []
    for x in range(999999):
        a.append(x)
    return None
 
 
t1 = time.time()
for _ in range(nb_repeat):
    a_complex_operation()
print time.time()-t1

Running time: 4.82960796356 seconds

Version 2: with processes

from multiprocessing import Pool
nb_repeat = 50
 
def a_complex_operation(*args):
    a = []
    for x in range(999999):
        a.append(x)
    return None
pool = Pool(processes=nb_repeat)
results = pool.map(a_complex_operation, [None for _ in range(nb_repeat)])

Running time: 2.74916887283 seconds! Almost half the initial time.

Version 3: threaded version

from threading import Thread
import time
 
nb_repeat = 50
 
def a_complex_operation(*args):
    a = []
    for x in range(999999):
        a.append(x)
    return None
 
t1 = time.time()
threads = []
for _ in range(nb_repeat):
    threads.append(Thread(target=a_complex_operation))
 
[x.start() for x in threads]
[x.join() for x in threads]
print time.time()-t1

Running time: 14.0888431072 seconds!

Not extremely surprising, but quite interesting still. As expected, the version with processes is the fastest. But threading our program does not only not improve its running time, it actually slows it down quite a bit. This is probably due to the overhead involved in switching context between 50 threads. The multiprocess version is a nice little optimization, but it’s fair to say that the normal, single-threaded version is running pretty quickly too. When in doubt, keep it simple!

Efficient paging with mysql and redis ordered sets

Paging API results with mysql

Let’s say you’re building an API and need to fetch a bunch of products, based on category. Your query will probably look something like this:

SELECT * FROM products WHERE category=1;

Now, this query might have thousands of results, and you’re certainly not going to return them all at once. You need to page your results – what are the possible solutions? Well, the obvious solution is to use an offset :

SELECT * FROM products WHERE category=1 LIMIT 100 OFFSET 200;

This would return page=2, if you return 100 results per page. But then you’re going to start running into problems very quickly when you get into this sort of territory:

SELECT * FROM products WHERE category=1 LIMIT 100 OFFSET 4000;

As explained here, offset actually needs to run through those 4000 rows before returning the ones you’re after.

There are a few other solutions, such as adding an id in the where clause like such:

SELECT * FROM products WHERE category=1 WHERE id > 12345 LIMIT 100;

But this only works if:

  • We’re building a webpage with a bunch of links for pages, to which we can append something like “last_id_seen=456″ as a querystring parameter and use that as our criterion (basically building the page links all in advance). This example case is for an API, so that won’t work. Also this is not a very elegant solution, I find.
  • Or if your ids are all perfectly consecutive, with no gaps, which will never happen.

Better paging with redis ordered sets

The main idea here is to perform our DB query only once and cache the result (pretty standard). There is an important nuance though, we cache the result in an ordered set, and not a large serialized string, so that we can query only the parts that we want (remember, the result set might be 10 000 rows, we don’t want to load the whole thing, and then slice it). For this we’ll use 2 redis keys

  • products/12345 as an ordered set to hold the products data from category 12345. Ordered sets cannot expire, so we need another key to know when to refresh the data.
  • valid:products/12345 as a normal key/value to tell us whether the data should be expired or not. For this example, I push the length of the result set to this key (I reuse it later to indicate the total length of the result set to the client response), but it could be anything else, really.

As mentioned earlier we need to push data into the ordered set by having continuous indexes we can use to query certain ranges. Redis’ ordered sets fits this perfectly, as you can assign a “score” to each element, and the value that matches it. Assuming a python implementation with Flask, here’s what it might look like:

cache_key = "products/{0}".format(product_category)
cache_key_valid = "valid:{0}".format(cache_key)
 
# get querystring parameters from the query. implement these as you wish.
count, page = filtered_params
offset = count*page
 
# we use a separate key to check whether we need to refresh the data in set
key_is_valid = redisser.get_key_is_valid(current_app.redis_conn, cache_key_valid)
 
# cache has expired, grab the data from db
if key_is_valid is None:
    # first make sure to delete the current redis cached data
    redisser.delete_ordered_set(current_app.redis_conn, cache_key)
 
    # get everything from DB, load up in redis
    products = current_app.db.get_products_for_category(category)
    redisser.add_elements_to_sorted_set(current_app.redis_conn, cache_key, products)
 
    # get the count of all elements, put in valid key element of redis
    cache_count = redisser.get_ordered_set_count(current_app.redis_conn, cache_key)
    redisser.add_key_valid(current_app.redis_conn, cache_key_valid, cache_count)
 
# the correct data is now in redis. fetch the right range from it
cached_products = redisser.get_products(current_app.redis_conn, cache_key, offset, count)
products = [pickle.loads(x) for x in cached_products]

As you’ll notice, we check to see if our “valid” key has expired yet. If the value is None, we reload the data from DB into an ordered set. We need to delete the set first, otherwise we might have extra data leftover if we insert new data that is smaller than the original one that was there (assuming the same key name). For good measure, here is what the other functions could look like:

def get_ordered_set_count(r, key):
    return r.zcount(key, "-inf", "+inf")
 
 
def add_key_valid(r, key, count):
    expiry = 60*60
    r.set(key, count, expiry)
 
 
def get_key_is_valid(r, key):
    result = r.get(key)
    return result
 
 
def delete_ordered_set(r, key):
    return r.delete(key)
 
 
def get_products(r, key, offset, count):
    upper_boundary = offset+int(count)-1
    return r.zrange(key, offset, upper_boundary)
 
 
def add_elements_to_sorted_set(r, key, data):
    f1 = (i for i in range(len(data)))
    f2 = (pickle.dumps(x) for x in data)
    together = zip(f1, f2)
    merged_list = list(itertools.chain(*together))
    r.zadd(key, *merged_list)

The magic really happens in the get_products function, where we query the ordered set and return only a particular range. In add_elements_to_sorted_set we define the score of the elements to just be a number from 0 to len(data). We’re basically rebuilding a small queryable table with consecutive ids, in memory.

This solution still means you have to optimize your original SQL query to be reasonably fast, otherwise the first client to hit the server with no cache available will wait a very long time. I’ve managed to get such queries returning under 1000 milliseconds, and everything else just hits redis. You should also play with the expiry time of the “valid” key, depending on how fresh you need your data to be.

PHP vs consistency

People have written extensively about how much PHP sucks. I tend to agree, but it’s still possible to hide away and make it a livable experience to code with.

And then once in a while I come across stuff like this :

imagepng

bool imagepng ( resource $image [, string $filename [, int $quality [, int $filters ]]] )

Scroll down to “quality”
Compression level: from 0 (no compression) to 9.

imagejpeg

bool imagejpeg ( resource $image [, string $filename [, int $quality ]] )

Scroll down to “quality”
quality is optional, and ranges from 0 (worst quality, smaller file) to 100 (best quality, biggest file). The default is the default IJG quality value (about 75).

Aw man. C’mon, one goes 0-9 (best being 0), and the other is 0-100 (best being 100)?