Project Newman :: The editor

August 22nd, 2006

The editor is basically the "brain" of Newman. It's the most complicated part because it handles the most logic. Broadly speaking, it is the editor's job to figure out which news articles to post where. Once the target is set (i.e. Xtratime.org), the editor has to decide whether each of the articles delivered by the reporter should be published in any of the channels (i.e. threads) we have available. The illustration below shows the editor's role in the chain.

newman_schematic.png

Finding channels

But before we dive right into it, a small note on channels is in order. Let's start with a more basic question: how does a human post news articles? Carson35 will post articles wherever the particular story is relevant - either to the thread at large, or to the last few posts specifically. The question is whether Newman can imitate this behavior. Xtratime.org is divided into lots of forums, one for each club, where threads about that one club are found. Some of these forums have special threads active all through the season, for instance a "transfer rumours" thread. So what should Newman do to decide where to post an article? It could iterate over the threads in a certain forum to figure out "what this thread is about". But that is rather difficult to do for a bot. Given just one sentence, how do you mechanically establish such a terribly human observation - what it's "about"? If Newman could do this, it would be quite clever. But I must admit that I can't think of a method.

So the approach I took was to input a manually selected list of channels. Since I can't think of a way to establish what a post is about, or what a thread is about, or whether a certain story should be posted in a specific thread, I had to fall back on a human method and simply give Newman a list of threads it is allowed to post articles in.

The subject filter

So I've built Newman to do all the tedious work for me, but I've already had to produce the channels myself - it would be nice if Newman could do some of the work now. We have a bunch of stories and a bunch of channels - how do we match them? I've selected my channels so that I have one channel per club forum. One thread to post news articles in is already enough to aggravate sensitive forum people, so I'm not going to push my luck. This also means that I have to figure out whether a certain story is about a certain club or not. This I call the subject filter (i.e. it establishes the subject of the story).

If you think this is already getting hazy, unfortunately it doesn't get any better. I'm not at all interested in trying to deduce the meaning of English sentences (that would likely take me roughly forever to finish). Instead, I'm limiting myself to looking at individual words. So while a complete analysis might reveal that the phrase "the royal club" is talking about Real Madrid, I won't be getting into that. Now, it may seem sensible that in order to establish that a story talks about a certain club, it would help to look for the names of players who play for that club. But players change clubs all the time, so the list of players for every club would have to be updated constantly (and remember: we're trying to minimize the human input here). Worse still, half the stories in the papers about Real Madrid discuss possible signings of players who belong to other clubs. Adding every player merely linked with a club to that club's list would be madness.

So the only thing I will use is the one name that doesn't ever change: the name of the club itself, in its many incarnations. A story that mentions "Real Madrid" is one we probably want to classify as eligible for the Real Madrid channel. But it could also mention just Real or Madrid on their own, so we have to consider that too. Then again, Real could also refer to Real Zaragoza, so "Real" should be a weaker match than "Real Madrid". As you can probably see by now, this is going in the direction of a spam filter: searching for words and scoring them according to certain rules. In addition, it struck me that the position of a word in a text tends to mean something (if it appears at the beginning of the story, or in the title, it should give a higher score). Finally, the length of a story matters as well. In a typical story of the kind we like to analyze, the name of a club may appear two to five times. In a very short story, it may only appear once. In a very long story, it may appear more often, in the guise of nicknames and phrases like "the royal club". So a long story may mention Real Madrid three times and still actually be about Barcelona, which is why "Real Madrid" gets a lower score there than it would in a shorter story.
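
To make this a little more concrete, here is a minimal sketch of what such scoring might look like. The aliases, weights, bonuses, and normalization are all invented for illustration - they are not Newman's actual values:

    import re

    # Hypothetical aliases and weights for one channel - a full club
    # name is a stronger signal than a partial one.
    ALIASES = {"real madrid": 3.0, "madrid": 1.0, "real": 0.5}

    def subject_score(title, body):
        """Score how strongly a story seems to be about this club."""
        text = (title + " " + body).lower()
        score = 0.0
        for alias, weight in ALIASES.items():
            for match in re.finditer(re.escape(alias), text):
                score += weight
                if match.start() < len(title):
                    score += weight        # bonus: match in the title
                elif match.start() < len(text) / 4:
                    score += weight / 2    # bonus: early in the story
        # Normalize by story length, so a long story needs more
        # mentions to reach the same score as a short one.
        return 100.0 * score / max(len(text.split()), 1)

A story would then qualify for a channel if its score clears a threshold found by trial and error.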

Getting a bit hazy, isn't it? I thought it might. The subject filter works fairly well in most cases. There have been occasional whoopsies, like a story about Arsenal de Sarandí posted in the Arsenal (the English one) forum. And there was a story about Luis Valencia matching the Valencia channel. I have tried to catch this by searching for Valencia as part of a name (i.e. as part of a sequence of capitalized words) and penalizing that match on suspicion of being a person's name - but this kind of thing is very imprecise.
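
The name check itself can be done with a regular expression along these lines (again just a sketch, and the penalty factor is made up):

    import re

    def looks_like_person(text, club_word):
        """True if club_word appears inside a run of capitalized
        words, e.g. "Luis Valencia" - suspiciously like a name."""
        pattern = (r"[A-Z][a-z]+\s+" + club_word + r"\b"
                   r"|\b" + club_word + r"\s+[A-Z][a-z]+")
        return re.search(pattern, text) is not None

    # e.g. halve the score of a suspicious match:
    # score *= 0.5 if looks_like_person(text, "Valencia") else 1.0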

The topic filter

So far so good (do I sense hesitation?). For some channels it is enough to use the subject filter. But for others - those which have to do with transfer rumours - we should also decide whether a certain story is about possible transfers. (Incidentally, most soccer news is.) For this I created a separate filter, so that a story matching on "Real Madrid" then has to pass through the topic filter, which checks whether it looks like transfer news. Here I used a word list - a list of words that are highly relevant to transfers, such as contract and offer. I scored these words just like I did in the subject filter and, after some trial and error, set the threshold so that stories are filtered fairly reliably. In any case, I would rather wrongly filter out genuine transfer stories than post irrelevant news (just like a spam filter would rather let through some spam than risk losing your non-spam email). This way some transfer news didn't make the cut, but after all there are enough stories published every day to suffice.
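
A sketch of what the topic filter might look like - the word list, weights, and threshold here are invented for illustration:

    TRANSFER_WORDS = {"transfer": 2.0, "contract": 1.5, "offer": 1.5,
                      "bid": 1.5, "signing": 2.0, "fee": 1.0, "agent": 1.0}

    def is_transfer_news(title, body, threshold=3.0):
        """Score transfer words; when in doubt, reject the story."""
        words = (title + " " + body).lower().split()
        score = sum(TRANSFER_WORDS.get(w.strip(".,;:!?\"'"), 0.0)
                    for w in words)
        return score >= threshold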

Is that all?!?

So finally, after running every story through the subject filter, and if need be the topic filter, I would have a list of stories to publish in certain channels. A story would rarely qualify for more than two channels (a transfer from one club to another); most often it would qualify for just one. In addition, the editor filters out stories by date - any story older than 24 hours is marked outdated.
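
The date check is the simplest piece. Assuming the story's date has already been parsed into a datetime object, it amounts to something like:

    from datetime import datetime, timedelta

    def is_outdated(story_date, max_age=timedelta(hours=24)):
        """Any story older than 24 hours is marked outdated."""
        return datetime.now() - story_date > max_age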

The proper cherry on the cake would be to create a Channel Finder module - to find channels automatically. But after having thought about this for a few weeks, I still can't think of a way to do it that would assure any kind of half-decent success rate. Certainly not without trying to analyze the English language to some extent, which would be incredibly complicated and, even then, probably not very effective.

This entry is part of the series Project Newman.

Project Newman :: The reporter

August 20th, 2006

The reporter is the part of Newman which retrieves stories from various websites. The process is fairly straightforward (a sketch in code follows the list):

  1. Retrieve web page containing a list of the latest news stories.
  2. Read the list of stories and retrieve links to individual stories.
  3. Retrieve each story one by one.
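
In code, the three steps might look roughly like this. The urls and the link pattern are made up for illustration - as discussed below, every real site needs its own:

    import re
    import urllib.request

    def fetch(url):
        """Retrieve a page as text (error handling comes later)."""
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read().decode("iso-8859-1", errors="replace")

    # 1. Retrieve the page listing the latest news stories.
    listing = fetch("http://example.com/news/")

    # 2. Read the list and pull out the links to individual stories.
    story_urls = re.findall(r'<a href="(/news/\d+\.html)">', listing)

    # 3. Retrieve each story one by one.
    stories = [fetch("http://example.com" + path) for path in story_urls]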

This description is generic enough to be satisfied by every site I considered reporting from, notably Football Italia, Tribalfootball, Eurosport, and Goal. Every site has a list of stories and then individual stories on separate pages. But that doesn't mean there weren't a few challenges in making this work, notably:

  • Every site uses different html - we have to read the info we need out of the html source by using regular expressions.
  • The result from every story retrieval should be just plain text, no html tags or other code.
  • If the connection fails or times out, Newman should ignore the error and continue - it shouldn't crash (a sketch follows the list).
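
That last point just means wrapping every retrieval in a try/except, so that a dead site makes Newman shrug and move on rather than fall over:

    import urllib.request
    from urllib.error import URLError

    def fetch_safe(url):
        """Fetch a page; on any connection trouble, skip it quietly."""
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read().decode("iso-8859-1",
                                              errors="replace")
        except (URLError, OSError) as err:
            print("skipping %s: %s" % (url, err))
            return None   # the caller ignores stories that are None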

Out of every story we need the title, the date, and the body. The rest we can blissfully ignore. But even so, there are variations: Football Italia presents these three elements in the order we want, while Goal prints the date first, then the title and body - and it also divides the body into a summary and the rest. So these trivial variations had to be handled specifically for each site. Doing this requires analyzing the html code, which is not something Newman can do automatically. The image below shows a sample of html source and, below it, the regular expression needed to parse it.

parsing.png
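
To give the flavour of it, here is a hypothetical fragment of html and a regular expression of the same sort (this is not any real site's markup):

    import re

    html = """<h2 class="headline">Madrid swoop for keeper</h2>
    <span class="date">22 August 2006</span>
    <div class="story"><p>Real Madrid are close to ...</p></div>"""

    STORY_RE = re.compile(
        r'<h2 class="headline">(?P<title>.*?)</h2>.*?'
        r'<span class="date">(?P<date>.*?)</span>.*?'
        r'<div class="story">(?P<body>.*?)</div>',
        re.DOTALL)

    title, date, body = STORY_RE.search(html).group("title", "date", "body")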

One other point is that this parsing (text analysis) depends on the html looking the same way every time. So if one story has two <br> tags between the date and the body, but another story has three, the parsing is likely to fail (in fact the parsing is a bit smarter than that, but it only tolerates small variations). Even worse, should one of these sites do a redesign and change their whole html code, the whole analysis would have to be redone (this took me anything from 5 to 30 minutes per site).

Once the three elements of the story have been read, it all has to be cleaned up and formatted. We don't want any html tags anywhere, and we don't want any funny characters that will come out garbled. Anything retrieved from the web is by definition garbage, so we clean it up whether or not it needs it.

Once we've done that, we need to do some formatting. Again, we assume nothing about how the story is formatted when it comes in. For all we know there may be 14 spaces between each word (html collapses runs of whitespace when rendering, so the source can be full of them), 5 line breaks between paragraphs, and so on. Some things we can fix easily - for instance, there should never be a space before a comma - and some things we cannot do much about. It is difficult to determine whether a line break falls within a sentence, because it's hard to tell what is a sentence and what isn't (do sentences always begin with a capital letter? what if there is a typo in the story? if a name is capitalized, how do you know whether it starts the sentence or is just part of it? what if the previous sentence is missing a full stop? etc.).
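
The cleanup itself is mostly a pile of regular expression substitutions. A few representative ones (a sketch, nowhere near the full list):

    import re

    def clean(text):
        """Strip tags and normalize whitespace in a retrieved story."""
        text = re.sub(r"<[^>]+>", "", text)           # drop html tags
        text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces
        text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
        text = re.sub(r" +([,.;:!?])", r"\1", text)   # no space before a comma
        return text.strip()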

Ultimately, Newman is quite good at reporting stories. It tolerates connection errors, and it has a very high success rate in cleaning and formatting stories correctly. It does sometimes miss funky special characters, on account of web sites not declaring what character set they use (or declaring one and then encoding in another, or changing encodings from one story to the next, etc).

One last important job the reporter handles for us is the story cache. When the list of stories is retrieved, Newman stores each story's title and url in a cache, so that the next time it retrieves the list of stories, it knows which ones it has already retrieved in the past (making sure the same story won't be posted multiple times). This reduces the amount of bandwidth Newman uses (let's be nice to web hosts) and it speeds Newman up as well.
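
The cache doesn't have to be anything fancier than a list of urls remembered between runs. A minimal sketch (the file name is made up):

    import os

    CACHE_FILE = "seen_stories.txt"   # hypothetical location

    def load_cache():
        """Return the set of story urls we have fetched before."""
        if not os.path.exists(CACHE_FILE):
            return set()
        with open(CACHE_FILE) as f:
            return set(line.strip() for line in f)

    def remember(url):
        """Record a story url so we never fetch it twice."""
        with open(CACHE_FILE, "a") as f:
            f.write(url + "\n")

    seen = load_cache()
    story_urls = ["http://example.com/news/1.html"]   # from the listing step
    new_urls = [u for u in story_urls if u not in seen]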

This entry is part of the series Project Newman.

Project Newman :: An overview

August 19th, 2006

Getting started

As mentioned in the introduction, Project Newman is about building a newsbot - a robot to post news. Now that the purpose and basic idea have been drawn up, it's time to get into some specifics.

Newman would basically be doing three things, so it makes sense to design those three functions as separate parts:

  • the reporter will fetch news stories from various football news websites, which we call sources
  • the editor will edit the stories, deciding which ones to post and which to discard
  • the publisher will post stories on Xtratime.org (or theoretically other sites, which we call targets)

So that's the basic architecture. (If you think this smells too much of java-speak, don't worry - I only used OO where it was feasible; most of it is just python modules.)
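
In outline, the whole thing chains together like this (the function stubs are just my shorthand for the three modules, not Newman's actual names):

    def report():
        """The reporter: fetch stories from the sources."""
        return []                     # stub

    def edit(stories):
        """The editor: decide which story goes to which channel."""
        return []                     # stub: yields (story, channel) pairs

    def publish(story, channel):
        """The publisher: post a story to a channel on the target."""
        pass                          # stub

    for story, channel in edit(report()):
        publish(story, channel)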

And there is one rule in Project Newman:

  • Newman must run without any user interaction!

Project Newman :: An introduction

August 17th, 2006

I have been posting on Xtratime.org (a football forum) since sometime in 2000. The site has been through a lot in that time, but one thing that hasn't changed is a member called Carson35 posting news stories from various football news sites with astonishing regularity. He now has 74k+ posts, far more than anyone else, and most of those are copy/paste jobs of news stories. Over the years he's become a celebrity for his undaunted commitment to bringing the news, decorated with a special title - XT Post Number King. Some have jokingly suggested that he's a robot, programmed to do this one thing.

So I thought it would be fun to try and imitate Carson, as a tribute if you will. And, of course, I mean computationally, in an automated manner. The purpose of such a thing would be to satisfy my curiosity in certain areas:

  • how hard would it be to imitate Carson35 by posting news articles?
  • how closely could I reproduce his activity?
  • what are the biggest challenges in making this work without any user input?
  • just how automated could it be?
  • could I build a bot that would be accepted (or at least not hated as a spammer) by other members?

The project was first dubbed Carson36, as an increment of the Carson we all know. But then Erik suggested Newman - for a bot that brings the news - and I couldn't resist that name. :D

newman.jpg

While this is a technical topic, I'll try to do something I'm not good at - explain it in simple terms. That's what good technical writers do, and it would be nice to imitate them.

This entry is part of the series Project Newman.

charset wars

August 17th, 2006

Have you ever opened a web page and all you could see was garbled text? That was a charset conflict: the page had been written in one charset but displayed to you in another. If you look at this page, the Norwegian characters should display correctly, but if you do this:

charset.png

(i.e. change the charset manually), then the non-ascii characters will get messed up. Why? Because the file was written as utf-8 text, but is being read as iso-8859-1. Characters found in utf-8 which are not found in iso-8859-1 are "improvised" (in other words, wrongly translated) by the function that reads the text. Since utf-8 uses two bytes for these non-ascii characters and iso-8859-1 only uses one, each 'mis-translated' character shows up as two characters instead of one.
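
You can reproduce the effect in a couple of lines of python:

    text = "blåbærsyltetøy"       # Norwegian, with non-ascii characters
    garbled = text.encode("utf-8").decode("iso-8859-1")
    print(garbled)                # blÃ¥bÃ¦rsyltetÃ¸y - two characters each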

This is usually not a problem, because most websites (and most half-conscious web coders) have the decency to set the charset in the header of the page, like so:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

So much for the web. What's worse is when you get charset conflicts in terminals. Most modern linux distros now ship in full utf8 mode - that is, applications are set to use utf8 by default - precisely to avoid these problems. But then I log into a server and use nano or vim to edit files (if need be - emacs), and I get in trouble. The text I write is in utf8 (my terminal controls what characters are sent to the server). The server will most likely not support that (some of these server distributions are ancient and do *not* use utf8 by default), so when I type text in nano and save it, any non-ascii characters get garbled. vim supports utf8, so there the problem is much reduced. But in nano I basically have to save, then open the file again to see where the bugs are. This has to do with how the text is handled: characters are counted left to right, byte by byte, so if I type a utf8 character (which is two bytes) and try to erase it, nano will just erase one byte. So "half" the character is still there. And so on and so forth. Very annoying, I tell you.

So why bother with utf8? Because utf8 (and unicode in general) was designed to solve exactly these charset conflicts. ISO 8859 is a legacy standard, and through its various extensions it supports many different languages - but only one at a time. So if you write French text in a file, you cannot also put Russian text in it; no single iso-8859 charset supports both. Enter utf8, which supports pretty much _everything_. But as long as we still have piles of legacy systems that aren't designed to handle utf8 (or at least don't use it by default), we will keep running into these problems. Standards are only salvation insofar as they are applied - correctly, consistently, and universally. That much we have already learnt from IE vs the world in terms of web page rendering.