Implementing a fake filesystem using a trie in python

A trie (also sometimes called a prefix tree or radix tree) is an ordered tree data structure. I was looking at ways to implement such a thing the other day, and came up with 3 possible applications, off the top of my head:

  • a URL shortening service
  • a virtual filesystem
  • a search engine

The idea behind a trie is that you can store shared parts of data in an ordered fashion. If you have a URL shortening service that needs to keep track of the url “http://baseurl.com/ABCDE” and “http://baseurl.com/ABCDEF”, you’re storing almost 2 identical strings, with a 1 character difference. How could we do this better? This is what a trie looks like:

secondTrie

It’s basically a map (or dictionary in python) of nested maps. It can be handy to have a delimiter when a node is the last one, but this is not necessary. To exemplify how it works in practice, I’ve started coding a quick and dirty fake filesystem in python. The idea behind using a trie to hold data structures if the same: if I have a directory /var, a directory, /var/www and another directory /var/www/mysite, it’s not necessary to repeat the “var” part 3 times, nor the “www” twice.

There are tons of tricky things about filesystems that complicates matters such as hidden files, creating intermediate directories when they don’t exist, permissions, etc. but I’ve ignored them for the most part here. I mainly wanted to implement an “add” function to see if this worked with a trie, and it sort of looks like this:

    @_input_path_sanitizer
    def add(self, path):
        logging.info("Adding directory {}".format(path))
        pieces = self.get_pieces(path)
        temp_trie = self._paths
        for piece in pieces:
            # if we're at the end of the trie, take out the delimiter.
            if piece in temp_trie and self.__end in temp_trie[piece]:
                del temp_trie[piece][self.__end]
            # setdefault() tries to get the key, if it exists. if it doesn't set it to the second parameter
            temp_trie = temp_trie.setdefault(piece, {})
        # set the delimiter here in all cases
        temp_trie.setdefault(self.__end, self.__end)

The whole thing is on github, but this function does most of the heavy lifting. Then you can do stuff like :

fs = FakeFs()
fs.mkdir("/poo/woo/soo/doo/poo")
fs.mkdir("/etc")
fs.mkdir("/")
fs.mkdir("/tmp")
fs.mkdir("/tmp/whatever")
fs.mkdir("/var/www/mysite")
fs.dir_exists("/sadsad")
fs.dir_exists("/etc")
print(fs.paths)

Which yields something along the lines of:

2014-08-23 12:15:42,449 Adding directory /poo/woo/soo/doo/poo
2014-08-23 12:15:42,449 Adding directory /etc
2014-08-23 12:15:42,449 Adding directory /
2014-08-23 12:15:42,449 Adding directory /tmp
2014-08-23 12:15:42,450 Adding directory /tmp/whatever
2014-08-23 12:15:42,450 Adding directory /var/www/mysite
2014-08-23 12:15:42,450 /sadsad does not exist
2014-08-23 12:15:42,450 /etc exists
['/', '/etc', '/poo/woo/soo/doo/poo', '/tmp/whatever', '/var/www/mysite']