Archive for 2006

Six degrees of separation: nothing like a good play

August 30th, 2006

I haven't been to that many plays, and I could probably count the really good ones I've seen on the fingers of one hand. Even though the theatre is a big thing, a lot of what gets staged there is very mediocre. Still, there are good plays from time to time. The problem is that the way they are put on is terribly outdated. I mean that whole thing where the actor talks to the audience and "no one can hear him"? No one is buying that. The theatre just doesn't have the possibilities that movies have.

Which is precisely why taking a good play and making a movie of it can really bring out the bright points of that play, as in the case of Six degrees of separation. I can imagine what it looks like on stage, which is why I like the movie all the more. Its strongest point is the plot, which is very good. Without giving anything away, the plot is hard to pin down and full of surprises. It doesn't make you guess what's coming, and you don't feel compelled to. There's no way to know, either.

It is a bit of a cliché about rich people living hollow lives and having all kinds of petty problems, but it has enough depth that this aspect is never any more than a secondary concern. The story is what drives it forward, and it is complex enough to fill 2 hours without boring you at all. The characters are 'tasteful': vivid enough to be palpable, but subtle enough that you don't get sick of them. (This is something I get in theatres a lot: characters that are so dominating that their depth is exhausted long before the play breaks 30 minutes, and if they are annoying too, well...)

Banlieue 13: poetic senseless violence

August 29th, 2006

Luc Besson strikes again. The guy has a passion for stunts and martial arts, but this movie from 2004 is far better than the Transporters and doesn't make the slightest effort to be funny or charming, in stark contrast to the Taxis. It is a far more focused effort - focused on long, intense action sequences that stir your imagination.

How often do you see a truly violent movie that ends in a morality tale saying that violence is not the solution to our problems? If you can stand to ignore your instincts for a tight plot, solid casting and a proper escalation of the story, there is a chance you may really enjoy Banlieue 13. The mass killing scenes do get tedious at times, but the movie features some groundbreaking stunts based on parkour, the discipline of running at and jumping over urban obstacles at a high pace. It is a sight to behold, especially considering that most of these scenes were made without any kind of technical aid. So much cooler than yet another car chase.

Then, of course, comes the added benefit of bragging rights for watching classy European cinema ("a film, is what it is"), rather than the same old Hollywood productions remade over and over. :D Until your friends see it and call your bluff, that is. ;)

Project Newman :: An evaluation

August 29th, 2006

The thing about a project like Newman is that it's basically impossible to make it work perfectly. It has a difficult job, because there are so many potential sources of error. Servers may go offline, connections may fail, article formats may change and so on. It is as good as impossible to guarantee that Newman will do the right thing, because at the end of the day we are trying to analyze text, and computers are not good at doing that. Just look at spam filters - they have been improved upon for years, but everyone is still getting spam. Much less than before, of course, so the filters are definitely useful. And Newman too makes mistakes, but it does still succeed quite often.

Newman has been posting on Xtratime.org under the username Carsonne, a French female impersonator of Carson35, it would seem. :D Carsonne has averaged about 15 posts a day since July 30, which comes to a little over 350 posts in all: 350+ news stories posted. While I haven't been keeping score closely enough to present real statistics, I have kept a close eye on Carsonne and I would estimate that upwards of 90% of the stories posted were correctly parsed, formatted and classified. In fact, I recall about 10-15 misposts among the ones I've seen (which I think is most of them). That is still an error rate no human poster would have. At an estimated 95% success rate, Carsonne's error rate is at least an order of magnitude worse than a human poster's (i.e. I would claim that a human poster would have a >99.5% success rate at copy/pasting and classifying stories, fewer than 2 misposts in 350).
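To make the arithmetic behind those estimates easy to check, here is a small Python sketch. It only reuses the figures quoted above (350 posts, 10-15 recalled misposts, an estimated 95% success rate, and the assumed 99.5% for a human poster); the function and variable names are made up for illustration and none of this is part of Newman itself.

```python
def error_rate(misposts: int, posts: int) -> float:
    """Fraction of posts that went wrong."""
    return misposts / posts


POSTS = 350  # rough total quoted in the post

# The recalled range of misposts, expressed as error and success rates.
for misposts in (10, 15):
    rate = error_rate(misposts, POSTS)
    print(f"{misposts} misposts in {POSTS}: {rate:.1%} errors, {1 - rate:.1%} success")

# The round estimates from the post: 95% success for Carsonne versus an
# assumed 99.5% success for a human poster.
carsonne, human = 0.05, 0.005
print(f"Carsonne's estimated error rate is about {carsonne / human:.0f}x the human one")
```

Note that the recalled misposts by themselves would put the success rate a little above the round 95% estimate, so the 95% figure is the more pessimistic of the two.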

What about user input, then? Well, unfortunately Newman does come with a certain configuration cost; not everything can be automated. In particular, finding channels is something that would be wonderful to automate, given how quickly the forum climate changes. Newman also requires that sources be configured (and, if need be, updated) for the parsing to work. Of course, once that is in place, Newman can post at will. Even so, that is still quite a limited set of abilities.

The screenshot below shows a typical run of Newman. Quite a few stories were fetched, some were selected for posting, and then posted. It also shows how Newman is fault tolerant: a parsing error was handled gracefully, as was a timeout from the forum web server.

newman_running.png
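For readers wondering what "handled gracefully" might mean in code, here is a minimal Python sketch of the general idea: a failure on one story is recorded and the run carries on with the next. This is not Newman's actual code; `ParseError`, `run_once`, and the `parse` and `post` callables are all hypothetical names invented for the example.

```python
import socket


class ParseError(Exception):
    """Hypothetical error raised when an article cannot be parsed."""


def run_once(stories, parse, post):
    """Parse and post each story, skipping the ones that fail.

    `parse` and `post` are caller-supplied callables (assumptions for this
    sketch). Returns the list of posted stories and a list of
    (story, reason) pairs for the skipped ones.
    """
    posted, skipped = [], []
    for story in stories:
        try:
            article = parse(story)
            post(article)
        except ParseError as exc:
            # A changed article format should not stop the whole run.
            skipped.append((story, f"parse error: {exc}"))
        except (socket.timeout, TimeoutError) as exc:
            # The forum server did not answer in time; move on.
            skipped.append((story, f"timeout: {exc}"))
        else:
            posted.append(story)
    return posted, skipped
```

Catching only the expected errors keeps real bugs visible, while the two failure modes mentioned above no longer abort the run.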

After 20+ days on the forum, Carsonne has been active long enough to stir up some reactions about "her" ;) posting of news. Carson's long tenure has paved the way for posters like this, so Carsonne is seen by most as just another compulsive news poster. "She" has taken some heat over posting news in the wrong place (wrong classification), but beyond that it has been no worse than Carson gets daily.

So what have we learnt?

As is often the case, it seems that Project Newman has raised more questions than it has answered. Sure enough, it isn't too hard to automate posting on a forum, it isn't too hard to fetch stories from the web and parse them, and it certainly isn't hard to automate this beyond any human's ability to keep up. But it is hard to decide what text means, it is hard to decide which story is relevant to what thread, it is hard to decide whether a word in a sentence is a name and so on.

The question is just how to do these things in a reliable way.

Thus endeth Project Newman. Download the code from the code page if you're interested.

This entry is part of the series Project Newman.

how to spend a lovely weekend down south

August 28th, 2006

Sørlandet (literally "the southern land") is just the nicest place to be in the summer. Tourists flock from all over Europe to experience the Norwegian summer in this region. It's been years since I've spent summers there myself, but this weekend I had a chance to go back to Mandal, a lovely little town with the best beach in the country.

trondheim_stavanger.png

First, catch a flight down south to Stavanger ('cause it's waaaay too far to drive). Then get on the E39 in the direction of Kristiansand. It's about 200km to Mandal, so set aside 3-4h for this. It's quite a scenic drive through the mountains and valleys, but you won't go fast 'cause the roads are narrow and there's traffic.

stavanger_mandal.png

In Mandal, you will discover a small, charming touristy town with classic white wooden houses and a vacation kind of atmosphere. (Make sure you don't back into the car with the German plates as you're getting out of your parking spot.) Once in Mandal, you may want to go for a coffee (or indeed an ice cream) at the town's best located ice cream bar, right on the main street. Then, why not take a stroll through town and pick up some nice paella for dinner at the fish market.

Sjøsanden camping is right outside town. Bring your camping gear or hire a bungalow if you don't have any. Head for the greatest beach in Norway, which is literally at the foot of the campsite (I tried pulling up the Google Earth imagery for it, but as luck would have it, the imagery stinks for that particular place).

Once you're done with the beach and dusk sets in, have some dinner and head to Kristiansand, a mere 40km away. Kristiansand is *the* city in the south of the country. It's also a nice place to be in the summer: the old fort dominates the bay, but there are marinas and there's a beach too.

The next morning, go for a swim in the sea before breakfast. If the water feels chilly, then it probably means it's the hot time of year. On the drive back, stop off at the little town of Flekkefjord. It isn't so much of a tourist place; it's more the kind of small-town Norway that much of this country is. As you drive into the center, crossing the bridge over the fjord, on the left you'll see a place called Kaffebørsen. They have a wide selection of coffee and, in particular, their mocha is excellent.

Project Newman :: Further ideas not implemented

August 27th, 2006

Newman was meant to be a simple design that wouldn't take too long to build (it took me about 2 weeks of afternoons to write) and just focus on the issues that are simple to handle without making a complicated mess of it. There are all kinds of ways in which it could be improved and I'll mention some of the ideas that I decided to leave out.


Channel limits

Even in threads where it is deemed acceptable to post news articles, it isn't civil to post 20 articles a day in one thread. To overcome this problem, one could limit the number of stories that may be posted in one channel per day. This would require another cache, to keep track of how many stories have been posted in which channel already today.

This is something I thought I would do, but I haven't written it, because it doesn't seem to be a problem. Newman can post up to 5 stories in one thread (which is a bit much), but it will only do this in a couple of channels (the Real Madrid one tends to see a lot of news). So the amount of news posted in a channel reflects the amount of news about that club that day (and that seems quite fair).
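Had it been needed, the per-channel cache could look something like the Python sketch below. It is purely illustrative and not part of Newman: `ChannelLimiter`, `max_per_day` and `allow` are names made up here, and the cache simply lives in memory and resets when the date changes.

```python
from collections import Counter
from datetime import date


class ChannelLimiter:
    """Hypothetical daily cap on the number of stories posted per channel."""

    def __init__(self, max_per_day=5):
        self.max_per_day = max_per_day
        self._day = None          # the date the current counts belong to
        self._counts = Counter()  # channel name -> stories posted today

    def allow(self, channel, today=None):
        """Return True and count the post if the channel is under its cap."""
        today = today or date.today()
        if today != self._day:
            # A new day has started: forget yesterday's counts.
            self._day = today
            self._counts.clear()
        if self._counts[channel] >= self.max_per_day:
            return False
        self._counts[channel] += 1
        return True
```

The poster would call `allow("Real Madrid")` before each post and skip the story when it returns False. A real version would probably want to persist the counts to disk so that a restart does not reset the limit.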

A bandwidth monitor

Newman deals in receiving and sending data to web servers. This generates a fair bit of traffic. Web hosts tend to be somewhat sensitive about one client making lots of connections, because bandwidth isn't free. While Newman does not operate on a mass level and will not overload any server by itself, it may still be useful to know just how much bandwidth it generates.

I haven't looked deeply into this, so I'm not sure how to do it. Web servers send a Content-Length field in HTTP headers, but this will only account for the traffic received. Perhaps a byte count of the urlencoded form input that Newman sends could be used to track outgoing traffic. But even this would not account for the low-level overhead in establishing socket connections.
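As a rough illustration of that idea, here is a Python sketch that adds up Content-Length values for incoming traffic and the size of the URL-encoded form body for outgoing traffic. It is an assumption about how such a monitor might look, not something Newman does: `BandwidthMonitor`, `record_response` and `record_form` are invented names, and the headers are taken as a plain dict.

```python
from urllib.parse import urlencode


class BandwidthMonitor:
    """Hypothetical byte counter for request and response bodies only.

    It ignores HTTP headers and socket-level overhead, so the totals are a
    lower bound on the real traffic.
    """

    def __init__(self):
        self.received = 0
        self.sent = 0

    def record_response(self, headers, body=b""):
        """Count a response using Content-Length, or the body size if absent."""
        length = headers.get("Content-Length")
        self.received += int(length) if length is not None else len(body)

    def record_form(self, fields):
        """Encode form fields, count the bytes, and return the payload."""
        payload = urlencode(fields).encode("ascii")
        self.sent += len(payload)
        return payload
```

Returning the encoded payload from `record_form` means the same bytes that were counted can be handed straight to whatever sends the request.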

User agent scrambling

Newman identifies itself as Firefox, so it looks like just another human client connecting to the web server. But the reporter retrieves a list of stories and then sequentially retrieves every story. This behavior is too systematic to be human (no one is interested in *every* story), and so a web admin who keeps track of logs and sees that the user agent is always the same will assume it's the same client making these connections.

To escape that kind of detection, we could easily scramble the user agent string and just make Newman report a different user agent for every connection. This would make it look like the connections are coming from a shared IP address, but from different clients (for instance different people in a school or company).

Client source scrambling

Closely related to the point above, a web admin who wants to block Newman can only do so by blocking the IP address the connections are coming from. This could be overcome by making Newman connect through anonymous proxies (or maybe use the Tor network?), going through a different proxy every time. That would make blocking IP addresses a lot more difficult.

Running stories through online translator

This is a very silly idea, but sometimes people post stories from a non-English source, because the story talks about something that the English language sources haven't caught up with yet. Since Xtratime.org is an English speaking forum, most people don't understand these articles. So the person who posts it will sometimes post an automatically generated translation of the text, using the notoriously bad Altavista Babelfish.

And so it could be an extra feature for Newman to retrieve articles from gazzetta.it or Marca, translate them with Babelfish, and post them. This would further reduce the quality of the articles Newman is posting, but it certainly is something that human posters tend to do.

This entry is part of the series Project Newman.