Project Newman :: Further ideas not implemented

August 27th, 2006

Newman was meant to be a simple design that wouldn't take too long to build (it took me about 2 weeks of afternoons to write) and just focus on the issues that are simple to handle without making a complicated mess of it. There are all kinds of ways in which it could be improved and I'll mention some of the ideas that I decided to leave out.

Channel limits

Even in threads where it is deemed acceptable to post news articles, it isn't civil to post 20 articles a day in one thread. To overcome this problem, one could limit the number of stories that may be posted in one channel per day. This would require another cache, to keep track of how many stories have been posted in which channel already today.

This is something I thought I would do, but I haven't written it, because it doesn't seem to be a problem. Newman can post up to 5 stories in one thread (which is a bit much), but it will only do this in a couple of channels (the Real Madrid one tends to see a lot of news). So the amount of news posted in a channel reflects the amount of news about that club that day (and that seems quite fair).

A bandwidth monitor

Newman deals in receiving and sending data to web servers. This generates a fair bit of traffic. Web hosts tend to be somewhat sensitive about one client making lots of connections, because bandwidth isn't free. While Newman does not operate on a mass level and will not overload any server by itself, it may still be useful to know just how much bandwidth it generates.

I haven't looked deeply into this, so I'm not sure how to do it. Web servers send a Content-Length field in http headers, but this will only account for the traffic received. Perhaps a byte count of the uuencoded form input that Newman sends could be used to track outgoing traffic. But even this would not account for the low-level overhead in establishing socket connections.

User agent scrambling

Newman identifies itself as Firefox, so it looks like just another human client connecting to the web server. But the reporter retrieves a list of stories and then sequentially retrieves every story. This behavior is too systematic to be human (noone is interested in *every* story), and so a web admin who keeps track of logs, who sees that the user agent is always the same, will assume it's the same client making these connections.

To escape that kind of detection, we could easily scramble the user agent string and just make Newman report a different user agent for every connection. This would make it look like the connections are coming from a shared ip address, but from different clients (for instance different people in a school or company).

Client source scrambling

Closely related to the point above, a web admin that wants to block Newman can only do so by blocking the ip address the connections are coming from. This could be overcome by making Newman connect through anonymous proxies (or maybe use the Tor network?). And connect everytime through a different proxy - that would make blocking ip addresses a lot more difficult.

Running stories through online translator

This is a very silly idea, but sometimes people post stories from a non-English source, because the story talks about something that the English language sources haven't caught up with yet. Since is an English speaking forum, most people don't understand these articles. So the person who posts it will sometimes post an automatically generated translation of the text, using the notoriously bad Altavista Babelfish.

And so it could be an extra feature for Newman to retrieve articles from or Marca, translate them with Babelfish, and post them. This would further reduce the quality of the articles Newman is posting, but it certainly is something that human posters tend to do.

This entry is part of the series Project Newman.

:: random entries in this category ::