A trie (also sometimes called a prefix tree or radix tree) is an ordered tree data structure. I was looking at ways to implement such a thing the other day, and came up with 3 possible applications, off the top of my head:
- a URL shortening service
- a virtual filesystem
- a search engine
The idea behind a trie is that you can store shared parts of data in an ordered fashion. If you have a URL shortening service that needs to keep track of the url “http://baseurl.com/ABCDE” and “http://baseurl.com/ABCDEF”, you’re storing almost 2 identical strings, with a 1 character difference. How could we do this better? This is what a trie looks like:
It’s basically a map (or dictionary in python) of nested maps. It can be handy to have a delimiter when a node is the last one, but this is not necessary. To exemplify how it works in practice, I’ve started coding a quick and dirty fake filesystem in python. The idea behind using a trie to hold data structures if the same: if I have a directory /var, a directory, /var/www and another directory /var/www/mysite, it’s not necessary to repeat the “var” part 3 times, nor the “www” twice.
There are tons of tricky things about filesystems that complicates matters such as hidden files, creating intermediate directories when they don’t exist, permissions, etc. but I’ve ignored them for the most part here. I mainly wanted to implement an “add” function to see if this worked with a trie, and it sort of looks like this:
@_input_path_sanitizer def add(self, path): logging.info("Adding directory {}".format(path)) pieces = self.get_pieces(path) temp_trie = self._paths for piece in pieces: # if we're at the end of the trie, take out the delimiter. if piece in temp_trie and self.__end in temp_trie[piece]: del temp_trie[piece][self.__end] # setdefault() tries to get the key, if it exists. if it doesn't set it to the second parameter temp_trie = temp_trie.setdefault(piece, {}) # set the delimiter here in all cases temp_trie.setdefault(self.__end, self.__end) |
The whole thing is on github, but this function does most of the heavy lifting. Then you can do stuff like :
fs = FakeFs() fs.mkdir("/poo/woo/soo/doo/poo") fs.mkdir("/etc") fs.mkdir("/") fs.mkdir("/tmp") fs.mkdir("/tmp/whatever") fs.mkdir("/var/www/mysite") fs.dir_exists("/sadsad") fs.dir_exists("/etc") print(fs.paths) |
Which yields something along the lines of:
2014-08-23 12:15:42,449 Adding directory /poo/woo/soo/doo/poo 2014-08-23 12:15:42,449 Adding directory /etc 2014-08-23 12:15:42,449 Adding directory / 2014-08-23 12:15:42,449 Adding directory /tmp 2014-08-23 12:15:42,450 Adding directory /tmp/whatever 2014-08-23 12:15:42,450 Adding directory /var/www/mysite 2014-08-23 12:15:42,450 /sadsad does not exist 2014-08-23 12:15:42,450 /etc exists ['/', '/etc', '/poo/woo/soo/doo/poo', '/tmp/whatever', '/var/www/mysite'] |