Singly and doubly linked lists in python

Singly-linked list implementation with python:

class Node(object):
 
    def __init__(self, data, next):
        self.data = data
        self.next = next
 
 
class SingleList(object):
 
    head = None
    tail = None
 
    def show(self):
        print "Showing list data:"
        current_node = self.head
        while current_node is not None:
            print current_node.data, " -> ",
            current_node = current_node.next
        print None
 
    def append(self, data):
        node = Node(data, None)
        if self.head is None:
            self.head = self.tail = node
        else:
            self.tail.next = node
        self.tail = node
 
    def remove(self, node_value):
        current_node = self.head
        previous_node = None
        while current_node is not None:
            if current_node.data == node_value:
                # if this is the first node (head)
                if previous_node is not None:
                    previous_node.next = current_node.next
                else:
                    self.head = current_node.next
 
            # needed for the next iteration
            previous_node = current_node
            current_node = current_node.next
 
 
s = SingleList()
s.append(31)
s.append(2)
s.append(3)
s.append(4)
 
s.show()
s.remove(31)
s.remove(3)
s.remove(2)
s.show()

Doubly-linked list implementation in python

class Node(object):
 
    def __init__(self, data, prev, next):
        self.data = data
        self.prev = prev
        self.next = next
 
 
class DoubleList(object):
 
    head = None
    tail = None
 
    def append(self, data):
        new_node = Node(data, None, None)
        if self.head is None:
            self.head = self.tail = new_node
        else:
            new_node.prev = self.tail
            new_node.next = None
            self.tail.next = new_node
            self.tail = new_node
 
    def remove(self, node_value):
        current_node = self.head
 
        while current_node is not None:
            if current_node.data == node_value:
                # if it's not the first element
                if current_node.prev is not None:
                    current_node.prev.next = current_node.next
                    current_node.next.prev = current_node.prev
                else:
                    # otherwise we have no prev (it's None), head is the next one, and prev becomes None
                    self.head = current_node.next
                    current_node.next.prev = None
 
            current_node = current_node.next
 
    def show(self):
        print "Show list data:"
        current_node = self.head
        while current_node is not None:
            print current_node.prev.data if hasattr(current_node.prev, "data") else None,
            print current_node.data,
            print current_node.next.data if hasattr(current_node.next, "data") else None
 
            current_node = current_node.next
        print "*"*50
 
 
d = DoubleList()
 
d.append(5)
d.append(6)
d.append(50)
d.append(30)
 
d.show()
 
d.remove(50)
d.remove(5)
 
d.show()

Implementing a fake filesystem using a trie in python

A trie (also sometimes called a prefix tree or radix tree) is an ordered tree data structure. I was looking at ways to implement such a thing the other day, and came up with 3 possible applications, off the top of my head:

  • a URL shortening service
  • a virtual filesystem
  • a search engine

The idea behind a trie is that you can store shared parts of data in an ordered fashion. If you have a URL shortening service that needs to keep track of the url “http://baseurl.com/ABCDE” and “http://baseurl.com/ABCDEF”, you’re storing almost 2 identical strings, with a 1 character difference. How could we do this better? This is what a trie looks like:

secondTrie

It’s basically a map (or dictionary in python) of nested maps. It can be handy to have a delimiter when a node is the last one, but this is not necessary. To exemplify how it works in practice, I’ve started coding a quick and dirty fake filesystem in python. The idea behind using a trie to hold data structures if the same: if I have a directory /var, a directory, /var/www and another directory /var/www/mysite, it’s not necessary to repeat the “var” part 3 times, nor the “www” twice.

There are tons of tricky things about filesystems that complicates matters such as hidden files, creating intermediate directories when they don’t exist, permissions, etc. but I’ve ignored them for the most part here. I mainly wanted to implement an “add” function to see if this worked with a trie, and it sort of looks like this:

    @_input_path_sanitizer
    def add(self, path):
        logging.info("Adding directory {}".format(path))
        pieces = self.get_pieces(path)
        temp_trie = self._paths
        for piece in pieces:
            # if we're at the end of the trie, take out the delimiter.
            if piece in temp_trie and self.__end in temp_trie[piece]:
                del temp_trie[piece][self.__end]
            # setdefault() tries to get the key, if it exists. if it doesn't set it to the second parameter
            temp_trie = temp_trie.setdefault(piece, {})
        # set the delimiter here in all cases
        temp_trie.setdefault(self.__end, self.__end)

The whole thing is on github, but this function does most of the heavy lifting. Then you can do stuff like :

fs = FakeFs()
fs.mkdir("/poo/woo/soo/doo/poo")
fs.mkdir("/etc")
fs.mkdir("/")
fs.mkdir("/tmp")
fs.mkdir("/tmp/whatever")
fs.mkdir("/var/www/mysite")
fs.dir_exists("/sadsad")
fs.dir_exists("/etc")
print(fs.paths)

Which yields something along the lines of:

2014-08-23 12:15:42,449 Adding directory /poo/woo/soo/doo/poo
2014-08-23 12:15:42,449 Adding directory /etc
2014-08-23 12:15:42,449 Adding directory /
2014-08-23 12:15:42,449 Adding directory /tmp
2014-08-23 12:15:42,450 Adding directory /tmp/whatever
2014-08-23 12:15:42,450 Adding directory /var/www/mysite
2014-08-23 12:15:42,450 /sadsad does not exist
2014-08-23 12:15:42,450 /etc exists
['/', '/etc', '/poo/woo/soo/doo/poo', '/tmp/whatever', '/var/www/mysite']

Python threads VS processes

I was revisiting Jeff Knupp’s great article from 2012, Python’s hardest problemIt talks about the Global Interpreter Lock, or GIL, in python. It basically explains how the GIL works, and why it’s such an important problem for python coders.

Probably the most notable consequence of the GIL is that python cannot do “pure” multi-threaded operations, in the sense that only one thread can execute at any time. The GIL prevents strange things from happening when you can have more than one thread write to the same chunk of memory. Knupp also wrote a follow-up to that article, Python’s hardest problem, revisitedwhere he advises people who want to do many things at the same time (parallelism) to use the multiprocessing module.

It’s great advice, I’ve used multiprocessing in the wild. It needs a bit more effort to communicate data between the processes (typically using queues), but it’s well worth the security that separate processes afford you. In essence, every process can then run it’s own thread without sharing any memory.

As I was reading down the article, I noticed he didn’t have any examples! So I started playing around, just for fun. Let’s make a very simple  program that appends integers from 0 to 999998 and discards the list, 50 times.

Version 1: simple single-threaded

import time
 
nb_repeat = 50
 
 
def a_complex_operation(*args):
    a = []
    for x in range(999999):
        a.append(x)
    return None
 
 
t1 = time.time()
for _ in range(nb_repeat):
    a_complex_operation()
print time.time()-t1

Running time: 4.82960796356 seconds

Version 2: with processes

from multiprocessing import Pool
nb_repeat = 50
 
def a_complex_operation(*args):
    a = []
    for x in range(999999):
        a.append(x)
    return None
pool = Pool(processes=nb_repeat)
results = pool.map(a_complex_operation, [None for _ in range(nb_repeat)])

Running time: 2.74916887283 seconds! Almost half the initial time.

Version 3: threaded version

from threading import Thread
import time
 
nb_repeat = 50
 
def a_complex_operation(*args):
    a = []
    for x in range(999999):
        a.append(x)
    return None
 
t1 = time.time()
threads = []
for _ in range(nb_repeat):
    threads.append(Thread(target=a_complex_operation))
 
[x.start() for x in threads]
[x.join() for x in threads]
print time.time()-t1

Running time: 14.0888431072 seconds!

Not extremely surprising, but quite interesting still. As expected, the version with processes is the fastest. But threading our program does not only not improve its running time, it actually slows it down quite a bit. This is probably due to the overhead involved in switching context between 50 threads. The multiprocess version is a nice little optimization, but it’s fair to say that the normal, single-threaded version is running pretty quickly too. When in doubt, keep it simple!

Efficient paging with mysql and redis ordered sets

Paging API results with mysql

Let’s say you’re building an API and need to fetch a bunch of products, based on category. Your query will probably look something like this:

SELECT * FROM products WHERE category=1;

Now, this query might have thousands of results, and you’re certainly not going to return them all at once. You need to page your results – what are the possible solutions? Well, the obvious solution is to use an offset :

SELECT * FROM products WHERE category=1 LIMIT 100 OFFSET 200;

This would return page=2, if you return 100 results per page. But then you’re going to start running into problems very quickly when you get into this sort of territory:

SELECT * FROM products WHERE category=1 LIMIT 100 OFFSET 4000;

As explained here, offset actually needs to run through those 4000 rows before returning the ones you’re after.

There are a few other solutions, such as adding an id in the where clause like such:

SELECT * FROM products WHERE category=1 WHERE id > 12345 LIMIT 100;

But this only works if:

  • We’re building a webpage with a bunch of links for pages, to which we can append something like “last_id_seen=456″ as a querystring parameter and use that as our criterion (basically building the page links all in advance). This example case is for an API, so that won’t work. Also this is not a very elegant solution, I find.
  • Or if your ids are all perfectly consecutive, with no gaps, which will never happen.

Better paging with redis ordered sets

The main idea here is to perform our DB query only once and cache the result (pretty standard). There is an important nuance though, we cache the result in an ordered set, and not a large serialized string, so that we can query only the parts that we want (remember, the result set might be 10 000 rows, we don’t want to load the whole thing, and then slice it). For this we’ll use 2 redis keys

  • products/12345 as an ordered set to hold the products data from category 12345. Ordered sets cannot expire, so we need another key to know when to refresh the data.
  • valid:products/12345 as a normal key/value to tell us whether the data should be expired or not. For this example, I push the length of the result set to this key (I reuse it later to indicate the total length of the result set to the client response), but it could be anything else, really.

As mentioned earlier we need to push data into the ordered set by having continuous indexes we can use to query certain ranges. Redis’ ordered sets fits this perfectly, as you can assign a “score” to each element, and the value that matches it. Assuming a python implementation with Flask, here’s what it might look like:

cache_key = "products/{0}".format(product_category)
cache_key_valid = "valid:{0}".format(cache_key)
 
# get querystring parameters from the query. implement these as you wish.
count, page = filtered_params
offset = count*page
 
# we use a separate key to check whether we need to refresh the data in set
key_is_valid = redisser.get_key_is_valid(current_app.redis_conn, cache_key_valid)
 
# cache has expired, grab the data from db
if key_is_valid is None:
    # first make sure to delete the current redis cached data
    redisser.delete_ordered_set(current_app.redis_conn, cache_key)
 
    # get everything from DB, load up in redis
    products = current_app.db.get_products_for_category(category)
    redisser.add_elements_to_sorted_set(current_app.redis_conn, cache_key, products)
 
    # get the count of all elements, put in valid key element of redis
    cache_count = redisser.get_ordered_set_count(current_app.redis_conn, cache_key)
    redisser.add_key_valid(current_app.redis_conn, cache_key_valid, cache_count)
 
# the correct data is now in redis. fetch the right range from it
cached_products = redisser.get_products(current_app.redis_conn, cache_key, offset, count)
products = [pickle.loads(x) for x in cached_products]

As you’ll notice, we check to see if our “valid” key has expired yet. If the value is None, we reload the data from DB into an ordered set. We need to delete the set first, otherwise we might have extra data leftover if we insert new data that is smaller than the original one that was there (assuming the same key name). For good measure, here is what the other functions could look like:

def get_ordered_set_count(r, key):
    return r.zcount(key, "-inf", "+inf")
 
 
def add_key_valid(r, key, count):
    expiry = 60*60
    r.set(key, count, expiry)
 
 
def get_key_is_valid(r, key):
    result = r.get(key)
    return result
 
 
def delete_ordered_set(r, key):
    return r.delete(key)
 
 
def get_products(r, key, offset, count):
    upper_boundary = offset+int(count)-1
    return r.zrange(key, offset, upper_boundary)
 
 
def add_elements_to_sorted_set(r, key, data):
    f1 = (i for i in range(len(data)))
    f2 = (pickle.dumps(x) for x in data)
    together = zip(f1, f2)
    merged_list = list(itertools.chain(*together))
    r.zadd(key, *merged_list)

The magic really happens in the get_products function, where we query the ordered set and return only a particular range. In add_elements_to_sorted_set we define the score of the elements to just be a number from 0 to len(data). We’re basically rebuilding a small queryable table with consecutive ids, in memory.

This solution still means you have to optimize your original SQL query to be reasonably fast, otherwise the first client to hit the server with no cache available will wait a very long time. I’ve managed to get such queries returning under 1000 milliseconds, and everything else just hits redis. You should also play with the expiry time of the “valid” key, depending on how fresh you need your data to be.

PHP vs consistency

People have written extensively about how much PHP sucks. I tend to agree, but it’s still possible to hide away and make it a livable experience to code with.

And then once in a while I come across stuff like this :

imagepng

bool imagepng ( resource $image [, string $filename [, int $quality [, int $filters ]]] )

Scroll down to “quality”
Compression level: from 0 (no compression) to 9.

imagejpeg

bool imagejpeg ( resource $image [, string $filename [, int $quality ]] )

Scroll down to “quality”
quality is optional, and ranges from 0 (worst quality, smaller file) to 100 (best quality, biggest file). The default is the default IJG quality value (about 75).

Aw man. C’mon, one goes 0-9 (best being 0), and the other is 0-100 (best being 100)?

It’s the small things

I few weeks ago, Jetbrains (the ones that make the excellent PHPStorm, IntelliJ, RubyMine and others) released the community version of PyCharm – and since I code python most of the day…woohoo! I had tried other IDEs but nothing really stuck so I just kept using Sublime for the last year or two, which worked so so. Sublime is just a pretty notepad, there isn’t that much more to it (I DO like writing my essays on it though).

After about 2 weeks on PyCharm, I can already feel a difference. I’m not very good with the shortcuts yet, but all the small annoyances that make up that stinky PEP-8 specification are starting to sink in (notice the nice space after every comma?). It also helps you with small idioms and repetitive chunks of code. Case in point:

if a < 10 and a > 20:
    dostuff()

Will yield a squiggly line that you can click and…

if 10 > a > 20:
    dostuff()

Boom! Simplified chained comparison. Of course it also tells you which vars are not used, or referenced before being set and other nice nifty things – but these little tricks add up. I cannot recommend this IDE enough (it’s fast too!), try it – your codebase will definitively benefit from it.

AWS autoscaling : avoid ELB routing to a provisioning instance

I was doing load testing on instances earlier this week to see how our autoscale policies worked out and noticed something strange. Whenever traffic was getting too high, a cloudwatch alarm kicked off the launch of a new instance (rightly so), and then all of a sudden, nearly ALL of the requests were being routed to this new instance, which wasn’t completely done provisioning yet. Nginx threw a flurry of 502s’ while our gateway was setting up, but the ELB, for some strange reason, threw nearly all our requests at that specific instance – instead of the 10+ others that were up – why?

After some digging, I realized it was a mix of 2 things – but first it’s useful to understand how AWS performs their health checks. The TCP health check is :

TCP is the default, specified as a TCP: port pair, for example “TCP:5000″. In this case a healthcheck simply attempts to open a TCP connection to the instance on the specified port.

Which means that if you have an API that’s public-facing the internet, this is no good. Suppose Nginx or Apache is done setting up, but Rails or Django is not ready yet? Well according to the description, this would pass the test. First thing to fix : create a specific testing url for health check (in nginx in my case – I created a /test/health path) to make sure we don’t put instances that are not ready yet behind the load balancer.

For HTTP or HTTPS protocol, the situation is different. You have to include a ping path in the string. HTTP is specified as a HTTP:port;/;PathToPing; grouping, for example “HTTP:80/weather/us/wa/seattle”. In this case, a HTTP GET request is issued to the instance on the given port and path. Any answer other than “200 OK” within the timeout period is considered unhealthy.

So the HTTP(S) health check actually need to return a 200 OK status code, which gives us much more control. So even if nginx is ready but our gateway isn’t, our 502 will cause the health check to fail.

So that explained why some requests failed, but how come nearly all of them did? First, here’s a few interesting things about ELBs :

  • They scale as your number of requests grow in time. Which means that for load-testing, you need to slowly increment the number of requests using a long ramp-up time.
  • They can only accommodate instances in the availability zones for which then were initially configured for. This means that you need to add an instance from the us-east-a and us-east-b if you plan of using both those zones later.
  • ELBs will attempt to distribute requests evenly, unless certain instances are busier than others. If an instance can process requests faster than others, more of them will be sent to this particular instance.

Ah ha! Since returning a 502 code is surely faster than anything else my other instances were doing, the ELB probably thought that this machine was way faster than all others and passed over all the requests to it – very bad.

Using a python queue to upload files to S3 using Boto

Not too long ago I wrote a quick article on how to upload files using boto and the multiprocessing module or the Threading module. Here’s yet another variant, using a Queue!

AWS_KEY = ""
AWS_SECRET = ""
 
import time
from boto.s3.connection import S3Connection
from Queue import *
from threading import Thread
 
number_workers = 4
q = Queue()
 
filenames = ['1.json', '2.json', '3.json', '4.json', '5.json', '6.json', '7.json', '8.json', '9.json', '10.json']
 
# the actual upload
def upload(myfile):
    conn = S3Connection(aws_access_key_id=AWS_KEY, aws_secret_access_key=AWS_SECRET)
    bucket = conn.get_bucket("parallel_upload_tests")
    key = bucket.new_key(myfile).set_contents_from_string('some content')
    print "Uploaded %s." % myfile
 
# each worker does this job
def pull_from_queue():
    while True:
        item = q.get()
        print "Found %s in queue" % item
        upload(item)
 
# init the workers
for i in range(number_workers):
    t = Thread(target=pull_from_queue)
    t.daemon = True
    t.start()
 
# put files in the queue
for fname in filenames:
    q.put(fname)
 
while True:
    time.sleep(1)

Servers of Hacker News

I did a little experiment:

  • For about a month I scraped the front page of NH
  • For every link of the front page I looked at the “server” header value and stored it for every unique URL.
  • Repeat, every hour.

Goals of experiment, in order of importance

  • Have fun
  • Satisfy my curiosity
  • Try to find servers I have never heard of (achievement unlocked!)

Results

v1

 

Full breakdown of server count (total count = 2415)

Apache total : 995

Apache : 558
Apache-Coyote/1.1 : 66
Apache/1.3.41 : 2
Apache/1.3.42 : 2
Apache/2 : 4
Apache/2.0.52 (Red Hat) : 5
Apache/2.2 : 18
Apache/2.2.11 (Unix) mod_ssl/2.2.11 OpenSSL/0.9.8e-fips-rhel5 PHP/5.2.14 : 17
Apache/2.2.12 (Ubuntu) : 3
Apache/2.2.14 (Ubuntu) : 27
Apache/2.2.15 (CentOS) : 51
Apache/2.2.15 (Red Hat) : 9
Apache/2.2.15 (Red Hat) mod_ssl/2.2.15 OpenSSL/1.0.0-fips PHP/5.3. : 28
Apache/2.2.15 (Scientific Linux) : 2
Apache/2.2.16 (Debian) : 32
Apache/2.2.17 (Ubuntu) : 6
Apache/2.2.22 : 16
Apache/2.2.22 (Debian) : 12
Apache/2.2.22 (Ubuntu) : 50
Apache/2.2.22 (Unix) FrontPage/5.0.2.2635 : 7
Apache/2.2.23 (Amazon) : 6
Apache/2.2.24 : 6
Apache/2.2.24 (Amazon) : 4
Apache/2.2.24 (Unix) mod_ssl/2.2.24 OpenSSL/1.0.0-fips mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 : 5
Apache/2.2.3 (CentOS) : 33
Apache/2.2.3 (Red Hat) : 26

Nginx total : 703

nginx total : 435
nginx/0.7.65 : 6
nginx/0.7.67 : 9
nginx/0.8.53 : 7
nginx/0.8.54 : 10
nginx/0.8.55 : 5
nginx/1.0.11 + Phusion Passenger 3.0.11 (mod_rails/mod_rack) : 10
nginx/1.0.15 : 5
nginx/1.1.19 : 66
nginx/1.2.1 : 21
nginx/1.2.3 : 7
nginx/1.2.4 : 6
nginx/1.2.6 : 64
nginx/1.2.7 : 12
nginx/1.2.8 : 4
nginx/1.2.9 : 3
nginx/1.3.11 : 3
nginx/1.4.1 : 30

GitHub.com total : 207

GSE total : 90

Microsoft-IIS/ (6.0, 7.0, 7.5, 8.0) total : 72

Others total : 

cloudflare-nginx : 62
AmazonS3 : 30
Sun-Java-System-Web-Server/7.0 : 25
HTTP server (unknown) : 22
Google Frontend : 19
WP Engine/4.0 : 19
lighttpd/ (1.4.18, 1.4.31-devel-783a962) : 19
tfe : 17
YTS (1.19.11, 1.20.10, 1.20.27, 1.20.28) : 16
Economist Web Server : 12
WP Engine/1.2.0 : 11
TheAtlantic Web Server : 11
gunicorn/0.14.3 : 11
SSWS : 11
WEBrick/1.3.1 : 10
LiteSpeed : 10
Resin/4.0.34 : 8
thin 1.5.1 codename Straight Razor : 6
gwiseguy/2.0 : 6
ECD (dca/24FD) : 4
gws : 3
NPG Webhosting/1.0 : 3
Varnish : 3
IBM_HTTP_Server : 2
Oracle-Application-Server-11g Oracle-Web-Cache-11g/11.1.1.6.0 : 1
TornadoServer/3.0.2 : 1
Oracle-iPlanet-Web-Server/7.0 : 1
PCX : 1
publicfile : 1
PWS/8.0.15 : 1
QRATOR : 1
R2D2 : 1

Critical analysis : what does this mean?

Not much! Because… (read on)

Notes

  • I’m aware that the “server” header is not 100% reliable to determine the server type.
  • If 5 articles from the same organization’s website (github.com, cnn.com or google.com, etc.) made the front page in a day, that’s 5 “counts” for a single server.
  • I didn’t count all the single, weird server instances – including empty header values.
  • I didn’t scrape much metadata with this yet, so it’s hard to see it in full context.