R-C : The worst national news website in the world? Slow, insecure, buggy.

Meet Radio-Canada.ca, Canada’s French national news website. Essentially the cheap copy of CBC.ca, the sexy English counterpart. Being one of ~7.5 million francophones of Quebec, I’ve used this site a lot, and it’s been frustrating. I’m not sure if it’s actually the worst national news website in the world, but if not it must be close. This site has a bad URL, it’s painfully slow, extremely buggy, insecure and uses outdated technology. Don’t believe me? I decided to poke around for about an hour :

1. www.radio-canada.ca

Look at this URL, just look at it. Now, R-C also controls national radio, and every time they spell the URL out loud I choke on my coffee and die a little bit inside. Is it that hard to get a shorter, hyphen-less URL? Please? (EDIT : I found one! see “bonus” section at the end!)

2. This site is SLOW.

Here’s a screenshot of the network panel from Chrome’s developer tools :

6.42 seconds! Usually my initial page load is 10-15 seconds.
201 requests (!!!)
1.3 MB transferred (!!!)

As if this wasn’t enough, have a look at this screenshot from Firefox’s debug pane :

There are more JS errors in there than the scrollbar can show me. Seriously?

I counted over 30 JS files, even someone who has been coding for 115 days knows that you have to condense all of those into a single file, avoiding 30+ requests!
There are probably 100+ images in that list (image sprites, maybe?)
The rest of the files are HTML, JSON, SWF and other random files thrown in there.
I’m pretty sure they could cut down the request count by ~80%, if they even tried.
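If you want to redo this count without squinting at the network panel, here is a rough sketch that tallies requests by MIME type from a HAR export (Chrome's network panel can save one). The `tally_requests` helper and the three-entry sample are made up for illustration. The only thing assumed about the file is the standard HAR path `log.entries[].response.content.mimeType`.

```python
import json
from collections import Counter

def tally_requests(har_text):
    """Count requests per MIME type in a HAR document (given as a JSON string)."""
    har = json.loads(har_text)
    counts = Counter()
    for entry in har["log"]["entries"]:
        mime = entry["response"]["content"].get("mimeType", "unknown")
        # drop parameters such as "; charset=utf-8"
        counts[mime.split(";")[0].strip()] += 1
    return counts

# Hypothetical three-entry sample, NOT real data from the site
SAMPLE_HAR = json.dumps({"log": {"entries": [
    {"response": {"content": {"mimeType": "application/javascript"}}},
    {"response": {"content": {"mimeType": "application/javascript; charset=utf-8"}}},
    {"response": {"content": {"mimeType": "image/png"}}},
]}})

counts = tally_requests(SAMPLE_HAR)
print(counts.most_common())
```

Run against a real export, the JS and image buckets are the ones to watch.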

All these things slow the site down to a crawl. But as if it wasn’t enough, R-C loves to use 3rd-party scripts to engage their readers! A quick list of SOME of the script files downloaded :

scorecardresearch
jquery (on google code’s repos)
facebook.com (multiple)
tou.tv
googleadservices.com
ping.chartbeat.net
twitter.com
g+

Why not wait until I actually click an article before throwing all this crap at my browser?

Even google’s PageSpeed score (77) makes this site look like it was hosted on geocities (or neocities now?). Check out the list of redirects! :

http://altfarm.mediaplex.com/.../12308-180242-27534-0?...
http://mp.apmebf.com/.../12308-180242-27534-0?...
http://altfarm.mediaplex.com/.../12308-180242-27534-0?...
http://img.mediaplex.com/.../1662053_119543_CA_SM_NA_FY14Q2_W13-OA...?...
http://altfarm.mediaplex.com/.../12308-180242-27534-1?...
http://mp.apmebf.com/.../12308-180242-27534-1?...
http://altfarm.mediaplex.com/.../12308-180242-27534-1?...
http://img.mediaplex.com/.../1662055_119543_CA_SM_NA_FY14Q2_W13-OA...?...
http://b.scorecardresearch.com/b?...
http://b.scorecardresearch.com/b2?...
http://radiocanada.122.2o7.net/.../s51798515080008?...
http://radiocanada.122.2o7.net/.../s51798515080008?...
http://sitelife.radio-canada.ca/ver1.0/Direct/DirectProxy
http://sitelife.radio-canada.ca/.../directproxyfast.js
http://www.google.com/pagead/drt/ui
https://googleads.g.doubleclick.net/.../si?...

Mental! And none of the static images have an expiry header! (100+)

3. Article comments

Comments are nested, but only 1 level deep. So if you feel like conversing with someone, forget it.

They are posted, manually reviewed one by one, then approved. Comments can be arbitrarily discarded if R-C doesn’t feel like publishing them.
Anonymous comments are not allowed, you MUST sign up through their crappy portal.
CBC.ca does not have any of these issues – you can post anonymously, comments appear instantly (not 1h after posting them because the guy reviewing them is taking his lunch break), and you can reply to replies! C’mon R-C.
I won’t start with the quality of the comments, but let’s say it makes Youtube’s comment section look like a Shakespearean novel. At least CBC’s average comment is not that much better…

4. Security

I’m NOT a security professional, but I’m pretty sure this site is messed up.

1. insecure API:

Browse some more through console URLs…OH! “api.radio-canada.ca”, looks like they have an API (at least internal?), let’s click it!

http://api.radio-canada.ca/pluck/ShowComments.aspx?articleId=ghtml-624483&numberofcomments=10000&userId=650736&displayPage=1&sorting=descending&type=&ts=1374712883795
Looks like a GET, with no AUTH whatsoever. Public data I assume?
numberofcomments=100? 1000? 2000? yes this works
Full comment objects! Let’s look at a little piece more closely…

{
    "Key": "ghtml-624610",
    "ObjectType": "Models.External.ExternalResourceKey"
}, "ScoreCount": 0,
"AbsoluteScore": 0,
"DeltaScore": 0,
"PositiveScore": 0,
"PositiveCount": 0,
"NegativeScore": 0,
"NegativeCount": 0,
"CurrentUserHasScored": false,
"CurrentUserScore": 0,
"ObjectType": "Models.Reactions.ItemScore"
}],
"ObjectType": "Models.Reactions.Comment"
},
{
    "CommentKey": {
        "Key": "CommentKey:d7ce4130-6ff5-417d-97fa-cce2609c2a1f",
        "ObjectType": "Models.Reactions.CommentKey"
    },
    "Owner": {
        "Age": "",
        "Sex": "None",
        "AboutMe": "",
        "Location": "",
        "ExtendedProfile": [{
            "Key": "Username",
            "Value": "nostalgie1944",
            "ObjectType": "Models.Common.SiteLifeKeyValuePair"
        }, {
            "Key": "First Name",
            "Value": "Aline",
            "ObjectType": "Models.Common.SiteLifeKeyValuePair"
        }, {
            "Key": "Last Name",
            "Value": "Morin",
            "ObjectType": "Models.Common.SiteLifeKeyValuePair"
        }, {
            "Key": "Province",
            "Value": "Québec",
            "ObjectType": "Models.Common.SiteLifeKeyValuePair"
        }, {
            "Key": "City",
            "Value": "Ferme-Neuve",
            "ObjectType": "Models.Common.SiteLifeKeyValuePair"
        }, {
            "Key": "u",
            "Value": "394173",
            "ObjectType": "Models.Common.SiteLifeKeyValuePair"
        }],
        "CustomAnswers": [],
        "NumberOfMessages": 0,
        "NumberOfFriends": 0,
        "NumberOfPendingFriends": 0,
        "NumberOfComments": 317,
        "MessagesOpenToEveryone": false,
        "PersonaPrivacyMode": 0,
        "DateOfBirth": "\/Date(-62135578800000)\/",
        "CommentsTabVisible": true,
        "PhotosTabVisible": true,
        "IsEmailNotificationsEnabled": true,
        "SelectedStyleId": "",
        "Signature": null,
        "AwardStatus": {
            "Badges": [],
            "LeaderboardRankings": [],
            "Activities": [],
            "ObjectType": "Models.PointsAndBadging.AwardStatus"
        },
        "AbuseCounts": {
            "AbuseReportCount": 0,
            "CurrentUserHasReportedAbuse": false,
            "ContentIsStandardAbuse": true,
            "ContentExceedsAbuseThreshold": false,
            "ObjectType": "Models.Moderation.AbuseCount"
        },
        "RecommendationCounts": {
            "NumberOfRecommendations": 0,
            "CurrentUserHasRecommended": false,
            "ObjectType": "Models.Reactions.RecommendationCount"
        },
        "ImageId": "00000000-0000-0000-0000-000000000000",
        "IsAnonymous": false,
        "AvatarPhotoUrl": "http://sitelife.radio-canada.ca/ver1.0/Content/images/no-user-image.gif",
        "AvatarPhotoID": "00000000-0000-0000-0000-000000000000",
        "UserKey": {
            "Key": "394173",
            "ObjectType": "Models.Users.UserKey"
        }

Interesting. So regardless of whether you check the “J’accepte que mes nom(s) et prénom(s) soient publiés” (I agree that my first name and last name are published) box, anyone can see every commenter’s first name, last name, NumberOfFriends, NumberOfComments, AbuseReportCount, city, user id – just to name a few. It appears that R-C is using a larger framework that awards badges, points and whatnot for interaction on the site, and some of these features are not in use (everyone has the same negative birthday timestamp). In any case, this is terrible.
Every article you load up fetches a file like this to return all comment objects.
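To make the exposure concrete, here is a small sketch that flattens the `ExtendedProfile` pairs into a plain dict and decodes the odd `DateOfBirth` value. The `owner` dict mirrors the structure of the object above but uses placeholder values, and both helper names are mine, not part of the API. The `/Date(...)/` string is assumed to hold milliseconds since 1970-01-01 UTC.

```python
import re
from datetime import datetime, timedelta

# Same shape as the "Owner" object above, with PLACEHOLDER values
owner = {
    "ExtendedProfile": [
        {"Key": "Username", "Value": "example_user"},
        {"Key": "First Name", "Value": "Jane"},
        {"Key": "Last Name", "Value": "Doe"},
        {"Key": "City", "Value": "Exampleville"},
    ],
    "NumberOfComments": 317,
    "DateOfBirth": "/Date(-62135578800000)/",  # value as seen in the response
}

def flatten_profile(owner):
    """Collapse the Key/Value pair list into an ordinary dict."""
    return {pair["Key"]: pair["Value"] for pair in owner["ExtendedProfile"]}

def parse_ms_date(value):
    """Decode '/Date(<ms since 1970-01-01 UTC>)/' into a naive UTC datetime."""
    millis = int(re.match(r"/Date\((-?\d+)\)/", value).group(1))
    return datetime(1970, 1, 1) + timedelta(milliseconds=millis)

profile = flatten_profile(owner)
birthday = parse_ms_date(owner["DateOfBirth"])
print(profile["First Name"], profile["Last Name"], profile["City"])
print(birthday)  # 0001-01-01 05:00:00
```

The birthday decodes to the first day of year 1, which would fit an unset default date rather than a real birthday, though that part is a guess.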

2. some important sections are not https-enforced:

Register page : non-https and https are actually both valid. Same with the “user” page, which can be found https here or non-https here (with an added bonus, for whatever reason, you get the banner included on the non-https version!). I mean…how hard is it to redirect port 80 to https? I can only imagine all the nasty things you could do with sslstrip!

3. Outdated tech stack?

Choosing technologies for your website is always difficult, but it seems like R-C settled on a combo of Microsoft’s IIS 7.5 and aspx pages. I’ve never used IIS myself but I have done asp programming, and I know it’s a goddamn mess. I work in the startup world, and I’ve never heard of anyone using IIS, but there might be a good reason for them to do so? Or maybe not, I don’t really know.

BONUS!

Javascript easter eggs!
Once in a while I click in the comments, open my dev console, close it (not sure) and I get this pop-up!

I’m pretty sure this feature has been there for a while, I just can’t figure out the correct combo. Help?

#2 CNAMED urls! : I was playing around checking what their hostnames were resolving to, and found my answer to complaint #1 (shitty URL), radio-canada.ca actually points to an IP which resolves to http://simondurivage.ca/ (an old R-C anchor)! They just need to start advertising this URL instead of radio-canada now! (no but seriously, WTF is going on???)

True can be False and False can be True

In Python 2, you can actually do :

In [1]: True=False
In [2]: True is False
Out[2]: True

Seems a bit dangerous, as you modify the behavior of True/False, but as this post points out, you can do stuff like this:

>>> something=True=False
>>> myVariable=something
>>> myOtherVariable=True
>>> print myVariable
False
>>> print myOtherVariable
False
>>> print True
False
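Python 3 closed this door: True and False are keywords there, so the assignment is rejected at compile time. A quick way to check what your interpreter does, without actually breaking True, is to compile the statement and not run it. The helper name `can_rebind_true` is made up for this sketch.

```python
import sys

def can_rebind_true():
    """Return True if this interpreter accepts an assignment to the name True."""
    try:
        # compile only, never executed, so the real True stays intact
        compile("True = False", "<demo>", "exec")
    except SyntaxError:
        return False
    return True

print(sys.version_info[0], can_rebind_true())
```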

Why I stopped using Facebook

I’ve had a very tumultuous relationship with Facebook. I “deleted” (deactivated, of course) my account two or three times, for different reasons. I did it again about 5 months ago and I find it hard to explain to friends why anyone would want to voluntarily stay away from something that’s so useful and free! There are good things about Facebook and once in a while I miss them, but here are a few that keep me away :

1. Facebook is a for-profit, limited liability, publicly traded company.

Most people I know aren’t aware that Facebook has shareholders (your average Joe does not follow tech news like you and me), and it’s pretty clear by now that Facebook’s primary objective is to satisfy those shareholders. Making money is what Facebook needs to grow, and they’re pretty good at it. A lot of people either believe that Facebook is some kind of public service that we are now entitled to (a bit like Google’s search engine), or simply choose not to care about the implications this has. Wanting to make money doesn’t make you evil in itself, but when you do AND hold my very personal data on your centralized servers, I’m doubtful about your intentions.

2. Facebook is trying to redefine certain privacy norms which I find hard to agree with.

Mark Zuckerberg thinks our world has changed and people are more open about posting personal information on the Internet. He might be right but 1) it’s difficult to take his statement very seriously, considering his business grows every time someone creates a new account 2) supposing he’s right – I don’t agree that it’s necessarily a good thing either. I would argue that the Internet has made the need for privacy even more important, and I find it dangerous that sites like Facebook are working so hard to convince people that posting everything about themselves is “normal” and “ok”. I think it’s OK if you want to reveal intimate details of your life to others, but I cannot stand the argument that goes : “why not? everybody else does it anyways”.

3. Facebook is trying to enhance, complement or replace my social interactions, and it’s doing a very shitty job at it.

I always found it weird that a distant relative, whom I’ve seen perhaps once in the last 10 years, would have access to the same information my girlfriend would on Facebook – just because we’re “friends”. It turns out that my social circles are not that simple. In real life, some people are friends of friends I don’t care for. Some people I will meet, and deliberately try to make them go away. Some people I really like and care for. Facebook just flattens all of this into a single group and has no real understanding of all the subtle variations inherent to relationships. The friends concept completely ignores the complexity of human interactions. I know I can make lists, hide certain posters, block others, etc. I’ve tried and it doesn’t work. You end up like a paranoiac lunatic, “viewing your page as :” to make sure you’re showing this status to her, but not to him, and this picture to him, but not to her. No thanks.

4. Facebook is a megaphone to the loud, attention-craving people.

We’ve all witnessed this. And then on some morning, you wake up to what you think is your perfectly curated Facebook feed, only to realize after a few minutes that they changed the front page algorithm and you’re once again drowning in high school material, baby pictures and duckface selfies. I had maxed out at ~100 friends, and after you silence the loud individuals, the political zealots, the life coaches, the social media cross-posters and the frustrated loners, there isn’t that much I feel like checking on anymore.

5. Facebook is not a company I feel deserves my trust.

- I seem not to be the only one with a trust problem. Long story short : Facebook deletes nothing (how surprising, right?).

- I’ve visited Facebook and met some of their people. They were all brilliant, extremely knowledgeable and fully deserved the job they were given there. But none of them struck me as being particularly interested in privacy or ethics, which I found a bit odd, given the nature of the data which drives Facebook (data about people!).

- Their lack of commitment to produce a set of clear, unambiguous privacy settings shows they have their priorities elsewhere. These are at best confusing, and some days I felt like Facebook was trying to make me give up (of course Zuckerberg’s argument is that nobody cares about privacy anymore, so maybe that figures).

Bruce Schneier: Talks at Google

There are a lot of important things mentioned in this talk, but my favorite, at 36:40:

In a lot of ways, the Internet is a fortuitous accident. It was a combination of : lack of commercial interest, government neglect, some military requirements for survivability and resilience and computer engineers with vaguely libertarian leanings doing what made technical sense. That was kinda the stew of the Internet. And that stew is gone.

Parallel S3 uploads using Boto and threads in python

A typical setup

Uploading multiple files to S3 can take a while if you do it sequentially, that is, waiting for every operation to be done before starting another one. S3 latency can also vary, and you don’t want one slow upload to back up everything else. Here’s a typical setup for uploading files – it’s using Boto for python :

AWS_KEY = "your_aws_key"
AWS_SECRET = "your_aws_secret"
 
from boto.s3.connection import S3Connection
 
filenames = ['1.json', '2.json', '3.json', '4.json', '5.json', '6.json', '7.json', '8.json', '9.json', '10.json']
conn = S3Connection(aws_access_key_id=AWS_KEY, aws_secret_access_key=AWS_SECRET)
for fname in filenames:
    bucket = conn.get_bucket("parallel_upload_tests")
    key = bucket.new_key(fname).set_contents_from_string('some content')
    print "uploaded file %s" % fname

Nothing fancy, this works fine, and it reuses the same S3Connection object. If I print the execution time though, it’s around 1.3 seconds.

How to speed this up

A) Using the multiprocessing module’s ThreadPool (concurrency)

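Python has a multiprocessing module, which allows you to “side-step the Global Interpreter Lock by using subprocesses instead of threads”. It also ships a lesser-known ThreadPool (in multiprocessing.pool) that has the same API as Pool but is backed by threads rather than processes. Uploads spend most of their time waiting on the network, and the GIL is released during that wait, so threads are enough here. Here’s an example using a ThreadPool: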

AWS_KEY = "your_aws_key"
AWS_SECRET = "your_aws_secret"
 
from boto.s3.connection import S3Connection
from multiprocessing.pool import ThreadPool
 
filenames = ['1.json', '2.json', '3.json', '4.json', '5.json', '6.json', '7.json', '8.json', '9.json', '10.json']
conn = S3Connection(aws_access_key_id=AWS_KEY, aws_secret_access_key=AWS_SECRET)
 
def upload(myfile):
        bucket = conn.get_bucket("parallel_upload_tests")
        key = bucket.new_key(myfile).set_contents_from_string('some content')
        return myfile
 
pool = ThreadPool(processes=10)
pool.map(upload, filenames)

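Execution time? 0.3 seconds! That’s about 4X faster than our previous example. I’m running this example on a 4-CPU ThinkPad, although the speedup comes from overlapping the network waits rather than from the extra CPUs. Note that there’s an overhead cost of starting a ThreadPool with 10 worker threads as opposed to doing everything in the main thread. Also note that we’re reusing our S3Connection across the pool’s threads here. That worked in this test, but it is at odds with the thread-safety concern in the next example, so creating one connection per worker is the safer choice.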
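To see the effect without an AWS account, here is a sketch that swaps the real upload for a `time.sleep()` stand-in. Everything in it (the `fake_upload` function, the 50 ms latency, the `timed` helper) is invented for the demo. `ThreadPool` lives in `multiprocessing.pool` and runs its workers as threads in the current process.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_upload(name):
    """Stand-in for an S3 upload: just waits, like a network call would."""
    time.sleep(0.05)  # made-up latency
    return name

def run_sequential(names):
    return [fake_upload(n) for n in names]

def run_pooled(names, workers=10):
    pool = ThreadPool(processes=workers)
    try:
        return pool.map(fake_upload, names)  # results keep the input order
    finally:
        pool.close()
        pool.join()

def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.time()
    result = fn(*args)
    return result, time.time() - start

names = ["%d.json" % i for i in range(1, 11)]
seq_result, seq_time = timed(run_sequential, names)
pool_result, pool_time = timed(run_pooled, names)
print("sequential %.2fs, pooled %.2fs" % (seq_time, pool_time))
```

The pooled run should take roughly as long as one fake upload instead of ten, since the waits overlap.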

B) Using threads (parallelism)

This solution will effectively spawn new threads of control, which can be quite expensive. Also important to note, we can’t reuse our S3 connection here since Boto’s library isn’t thread-safe, apparently:

AWS_KEY = "your_aws_key"
AWS_SECRET = "your_aws_secret"
 
from boto.s3.connection import S3Connection
import threading
 
filenames = ['1.json', '2.json', '3.json', '4.json', '5.json', '6.json', '7.json', '8.json', '9.json', '10.json']
def upload(myfile):
    conn = S3Connection(aws_access_key_id=AWS_KEY, aws_secret_access_key=AWS_SECRET)
    bucket = conn.get_bucket("parallel_upload_tests")
    key = bucket.new_key(myfile).set_contents_from_string('some content')
    return myfile
 
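threads = [threading.Thread(target=upload, args=(fname,)) for fname in filenames]
for t in threads:
    t.start()
# wait for every upload to finish before moving on
for t in threads:
    t.join()

Execution time? 0.018 seconds, about 72X faster than our original script. Not bad at all – but take that number with a grain of salt: it was measured without the join() calls, so it most likely captures the time needed to start 10 threads rather than the time the uploads take to finish. With join() in place, expect something closer to the duration of the slowest upload. Don’t forget, we’re creating 10 threads here, uploading the files in parallel. Your threads will automatically die when the uploads finish, “when its run() method terminates” according to the docs. Please keep in mind that if you have tons of files to upload at once, this might not be the best approach – on this topic, here’s a good discussion on How Many Threads is Too Many?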
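When timing threaded code, it matters where the clock stops. The sketch below uses a sleep-based stand-in for the upload (the `fake_upload` function and its 50 ms latency are invented) and reports two numbers: how long it took to start the threads, and how long until they all finished.

```python
import threading
import time

def fake_upload(name, done):
    """Stand-in for an upload: wait a bit, then record the name."""
    time.sleep(0.05)   # made-up latency
    done.append(name)  # list.append is atomic in CPython

def upload_all(names):
    done = []
    threads = [threading.Thread(target=fake_upload, args=(n, done)) for n in names]
    start = time.time()
    for t in threads:
        t.start()
    spawn_time = time.time() - start  # clock stopped before the work is done
    for t in threads:
        t.join()
    total_time = time.time() - start  # clock stopped after every thread finished
    return done, spawn_time, total_time

names = ["%d.json" % i for i in range(1, 11)]
done, spawn_time, total_time = upload_all(names)
print("spawned in %.4fs, finished in %.4fs" % (spawn_time, total_time))
```

The first number is tiny no matter how slow the uploads are. Only the second one says anything about throughput.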

Linux monitoring tools

Some great slides from this talk, basically an overview of some good monitoring tools for linux.

I particularly enjoyed this slide, showing what looks after what.

There are lots of slides (115) but they are nicely separated into easy/intermediate/expert groups. One of the quintessentials, htop.

Nicer git log

Another really nice addition to git aliases – git log is particularly ugly. This one I found here

Basically turn this:

into this:

Using:

git log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit

Or as an alias in your ~/.bashrc or ~/.zshrc:

alias gitlg="git log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit"

python map()

A map is used to :

Apply function to every item of iterable and return a list of the results. [...] The iterable arguments may be a sequence or any iterable object; the result is always a list.

Let’s create a function that simply adds 1 to any number:

def addone(x): 
    return x+1

Let’s define a list of numbers. If we call the function addone() on every item of “numbers”, we get the following:

numbers = [0,1,2,3,4]
print map(addone, numbers)
>>> [1, 2, 3, 4, 5]

If we wanted, we could actually do this with a list comprehension – it’s just a little longer – something like this:

print [addone(x) for x in numbers]
>>> [1, 2, 3, 4, 5]

So map() is actually just an efficient shortcut for a loop that calls the function on every item:

added_numbers = []
for number in numbers:
    added_numbers.append(addone(number))
print added_numbers
>>> [1, 2, 3, 4, 5]

But now our code is getting out of hand…Let’s try something else. We can pass multiple iterables to map(), which should all be the same length. Otherwise, in Python 2, the shorter ones are padded with None.

Let’s say we have a multiplier function:

def multiply(x, y): 
    return x*y

Now let’s create two lists of numbers to be multiplied together, and use map() to do so:

numbers = [0,1,2,3,4]
multipliers = [10, 20, 30, 40, 50]
print map(multiply, numbers, multipliers)
>>> [0, 20, 60, 120, 200]

Same result, with a list comprehension (slightly longer) :

print [multiply(numbers[x], multipliers[x]) for x in xrange(len(numbers))]
>>> [0, 20, 60, 120, 200]

And just to be fully ridiculous, with the complete loop :

multiplied_numbers = []
for number_index in xrange(len(numbers)):
    multiplied_numbers.append(multiply(numbers[number_index], multipliers[number_index]))
print multiplied_numbers
>>> [0, 20, 60, 120, 200]

But really, don’t do this. Just use map()!
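One footnote for readers on Python 3, where map() returns an iterator instead of a list and xrange() is gone. The sketch below runs on both versions: zip() pairs the items up so the comprehension needs no index bookkeeping, and wrapping map() in list() gives the same list either way. The variable names `with_zip` and `with_map` are just for this example.

```python
def multiply(x, y):
    return x * y

numbers = [0, 1, 2, 3, 4]
multipliers = [10, 20, 30, 40, 50]

# zip() walks both lists in step, so no numbers[i] / multipliers[i] lookups
with_zip = [multiply(x, y) for x, y in zip(numbers, multipliers)]

# list() changes nothing on Python 2 and materializes the iterator on Python 3
with_map = list(map(multiply, numbers, multipliers))

print(with_zip)  # [0, 20, 60, 120, 200]
print(with_map)  # [0, 20, 60, 120, 200]
```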

Nginx and Apache on the same server

Need to run multiple projects on the same server – ruby, python, php, nodeJS projects, all at once? It’s possible to have Nginx and Apache running side by side. The preferred way to do this is to have one server in front of another – basically you must choose which server will accept the initial requests and proxy to the correct application or to the second server if needed.

Since nginx is a little bit simpler to configure and more flexible (I find), we’ll put nginx in front of apache. To keep it simple, let’s say we wanted to have

a django project (python) running at mydjangourl.com
a wordpress (php) site at my-php-project-url.com

Both servers are configured to listen to port 80, so that’s not going to work. Since we’re putting nginx in front of apache though, nginx will have port 80.

Configuring nginx for your django app

Go to /etc/nginx/conf.d
Add (touch) a new file called django_project.conf
Assuming your django app is already running on port 8000 (in a screen session, or preferably by a process supervisor), add something like this to the file:

upstream my_django_app {
    server 127.0.0.1:8000;
}
 
server {
   listen       80;
   server_name  mydjangourl.com; #put the domain here
   location / {
       proxy_pass_header Server;
       proxy_set_header Host $http_host;
       proxy_redirect off;
       proxy_set_header X-Real-IP $remote_addr;
       proxy_set_header X-Scheme $scheme;
       proxy_pass http://my_django_app;
   }
}

An upstream server is defined as “my_django_app”
All requests with the “host” as mydjangourl.com will be routed through this directive
Proxy pass these requests to my_django_app, at 127.0.0.1:8000
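If the django app isn't ready yet, a throwaway upstream can confirm that nginx is forwarding what you expect. This is only a sketch using the standard library's wsgiref. The `echo_app` and `serve` names are made up, and the three headers it reports are the ones set in the server block above.

```python
from wsgiref.simple_server import make_server

def echo_app(environ, start_response):
    """Report the headers the nginx server block is configured to forward."""
    lines = [
        "Host: %s" % environ.get("HTTP_HOST", "(missing)"),
        "X-Real-IP: %s" % environ.get("HTTP_X_REAL_IP", "(missing)"),
        "X-Scheme: %s" % environ.get("HTTP_X_SCHEME", "(missing)"),
    ]
    body = "\n".join(lines).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

def serve(port=8000):
    """Listen on 127.0.0.1:<port>, the address the upstream block points at."""
    make_server("127.0.0.1", port, echo_app).serve_forever()
```

Call serve() in one terminal, reload nginx, then request the site through port 80. A "(missing)" in the response means that header did not make it through the proxy.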

Configuring nginx for the wordpress site

Since nginx is listening to port 80, we need to grab the requests for our php app and route them to apache on a different port. You should still be in /etc/nginx/conf.d

Create a new file called php_project.conf (name doesn’t matter, except for the .conf extension)
Add this to it (make sure to change the domain and note the port number, in this case 8050):

server {
   listen 80;
   server_name  my-php-project-url.com;
   location / {
       proxy_pass http://127.0.0.1:8050;
   }
   proxy_set_header Host $host;
}

This means that all requests looking for my-php-project-url.com will be routed to port 8050 on our localhost
Make sure to reload nginx (sudo service nginx reload)

Configuring apache for the wordpress site

Now that we’re proxying requests from nginx to our port 8050, let’s make sure apache is listening on that one.
Firstly, go to /etc/apache2/ports.conf and add the following:

NameVirtualHost *:8050
Listen 8050

Then go to wherever you keep the virtualhost configuration (it should be in /etc/apache2/sites-available or /etc/apache2/conf.d) and change the VirtualHost, ServerName and DocumentRoot lines like this:

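<VirtualHost *:8050>
        ServerName my-php-project-url.com
        DocumentRoot /var/www/mysite
 
        # Directory paths are assumed here; match them to your DocumentRoot
        <Directory />
                Options FollowSymLinks
                AllowOverride All
        </Directory>
 
        <Directory /var/www/mysite>
                Options Indexes FollowSymLinks MultiViews
                AllowOverride All
                Order allow,deny
                allow from all
        </Directory>
 
        ...
</VirtualHost>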

Don’t forget to enable your site (sudo a2ensite mysite.conf) and reload apache (sudo service apache2 reload)!

Show all git branches, ordered by date

By default, “git branch -a” shows you both local and remote branches for your repository. Wondering when each branch was worked on last? It’s a bit hard to tell, especially since git makes 2 piles of branches – the local ones and the remote ones separately. To see them ordered by when they were worked on last, just use :

   for k in `git branch -a|perl -pe s/^..//`;do echo -e `git show --pretty=format:"%Cgreen%ci %Cblue%cr%Creset" $k|head -n 1`\\t$k;done|sort -r

If you’re like me and use this a lot, make it a function and add it as an alias to your .bashrc/.zshrc/.whateverrc

function gitbr {
   for k in `git branch -a|perl -pe s/^..//`;do echo -e `git show --pretty=format:"%Cgreen%ci %Cblue%cr%Creset" $k|head -n 1`\\t$k;done|sort -r
}
alias gitbr=gitbr
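If you'd rather have this as a script than a shell one-liner, here is a sketch in Python. It is not a port of the loop above: it swaps in git for-each-ref and lets git do the sorting with --sort=-committerdate. The function names and the tab separator are my own choices.

```python
import subprocess

FIELD_SEP = "\t"

def parse_refs(output):
    """Turn for-each-ref output into (date, relative_date, branch) tuples."""
    rows = []
    for line in output.splitlines():
        if line.strip():
            rows.append(tuple(line.split(FIELD_SEP, 2)))
    return rows

def branches_by_date(repo="."):
    """Local and remote branches, most recently committed first."""
    fmt = FIELD_SEP.join([
        "%(committerdate:iso8601)",
        "%(committerdate:relative)",
        "%(refname:short)",
    ])
    out = subprocess.check_output(
        ["git", "for-each-ref", "--sort=-committerdate",
         "--format=" + fmt, "refs/heads", "refs/remotes"],
        cwd=repo)
    return parse_refs(out.decode("utf-8"))
```

Calling branches_by_date() inside a repository returns the rows ready to print, for example with a loop that joins each tuple with tabs.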