Archive for 2006

Project Newman :: An overview

August 19th, 2006

Project Newman table of contents

Getting started

As mentioned in the introduction, Project Newman is about building a newsbot - a robot to post news. Now that the purpose and basic idea have been drawn up, it's time to get into some specifics.

Newman will basically be doing three things, so it makes sense to design those three functions as separate parts:

  • the reporter will fetch news stories from various football news websites, which we call sources
  • the editor will edit the stories, deciding which ones to post and which to discard
  • the publisher will post stories on Xtratime.org (or theoretically other sites, which we call targets)

So that's the basic architecture. (If you think this smells too much of java-speak, don't worry - I only used OO where it was feasible; most of it is just python modules.)
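To make the division concrete, here is a minimal, self-contained sketch of how the three parts could fit together as plain python functions. Every name and the canned data here are my own illustration, not Newman's actual code:

# A hypothetical sketch of the reporter/editor/publisher split.
# All names are illustrative; Newman's real modules differ.

def reporter(sources):
    """Fetch stories from each source site (stubbed with canned data)."""
    return [{"source": src, "title": "Story from %s" % src, "body": "..."}
            for src in sources]

def editor(stories, seen_titles):
    """Decide which stories to post and which to discard."""
    return [s for s in stories if s["title"] not in seen_titles]

def publisher(stories):
    """Post each approved story to the target (just printed here)."""
    for s in stories:
        print("POST to target: %s" % s["title"])

if __name__ == "__main__":
    fresh = editor(reporter(["source-a", "source-b"]), seen_titles=set())
    publisher(fresh)  # runs start to finish without user interaction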

And there is one rule in Project Newman:

  • Newman must run without any user interaction!

Project Newman :: An introduction

August 17th, 2006

I have been posting on Xtratime.org (a football forum) since sometime in 2000. The site has been through a lot in that time, but one thing that hasn't changed is a member called Carson35 posting news stories from various football news sites with astonishing regularity. He now has 74k+ posts, far more than anyone else, and most of those are copy/paste jobs of news stories. Over the years he's become a celebrity for his undaunted commitment to bringing the news, decorated with a special title - XT Post Number King. Some have jokingly suggested that he's a robot, programmed to do this one thing.

So I thought it would be fun to try and imitate Carson, as a tribute if you will. And, of course, I mean computationally, in an automated manner. The purpose of such a thing would be to satisfy my curiosity in certain areas:

  • how hard would it be to imitate Carson35 by posting news articles?
  • how closely could I reproduce his activity?
  • what are the biggest challenges in making this work without any user input?
  • just how automated could it be?
  • could I build a bot that other members would accept (or at least not hate for spamming)?

The project was first dubbed Carson36, as an increment of the Carson we all know. But then Erik suggested Newman - for a bot that brings the news - and I couldn't resist that name. :D

[image: newman.jpg]

While this is a technical topic, I'll try to do something I'm not good at - explain it in simple terms. That's what good technical writers do, and it would be nice to imitate.

This entry is part of the series Project Newman.

charset wars

August 17th, 2006

Have you ever opened a web page and all you could see was garbled text? That was a charset conflict. The page had been written in one charset but displayed to you in another. If you look at this page, the Norwegian characters should display correctly, but if you do this:

[image: charset.png]

(ie. change the charset manually), then the non-ascii characters will get messed up. Why? Because the file was written as utf-8 text, but is being read in iso-8859-1 encoding. So characters found in utf-8 which are not found in iso-8859-1 are "improvised" (or in other words - wrongly translated) by the function that reads the text. And since utf-8 uses two bytes for these characters while iso-8859-1 only uses one, each mis-translated character shows up as two characters instead of one.
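You can reproduce the mis-translation in a couple of lines of python (a quick sketch in modern python, just to show the mechanics):

# Encode Norwegian text as utf-8, then read the bytes back as iso-8859-1.
text = "blåbærsyltetøy"
garbled = text.encode("utf-8").decode("iso-8859-1")
print(garbled)  # -> blÃ¥bÃ¦rsyltetÃ¸y
# Each two-byte utf-8 sequence comes out as two iso-8859-1 characters.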

This is usually not a problem, because most websites (and most half-conscious web coders) have the decency to set the charset in the header of the page, like so:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

So much for the web. What's worse is when you get charset conflicts in terminals. Most modern linux distros now ship in full utf8 mode, that is, applications are set to use utf8 by default to avoid all these problems. But then I log into a server and use nano or vim to edit files (if need be - emacs), and I get in trouble. The text I write is in utf8 (my terminal controls what characters are sent to the server). The server will most likely not support that (because some of these server distributions are ancient and do *not* use utf8 by default), so when I type the text in nano and save it, if I use non-ascii characters, the text will get garbled.

vim supports utf8, so there the problem is much reduced. But in nano, I basically have to save, then open the file again to see where the bugs are. This has to do with how the text is handled: characters are counted left to right, so if I type a utf8 character (which can be two bytes) and then try to erase it, nano will just erase one byte. So "half" the character is still there. And so on and so forth. Very annoying, I tell you.
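That half-erased character is easy to demonstrate (a hypothetical python sketch of the byte-level effect, not what nano actually does internally):

# 'å' is one character, but two bytes in utf-8.
data = "å".encode("utf-8")    # b'\xc3\xa5'
truncated = data[:-1]         # a byte-oriented delete removes just one byte
print(truncated.decode("utf-8", errors="replace"))  # -> '�' (broken sequence)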

So why bother with utf8? Because utf8 (and unicode in general) was designed to solve all these charset conflicts. ISO 8859 is a legacy standard, and with its various extensions it supports many different languages. But you can only use one at a time, so if you write text in French in one file, you cannot also use Russian text in there; no single part of the charset supports both. Enter utf8, which supports pretty much _everything_. But as long as we still have piles of legacy systems that aren't designed to handle utf8 (or at least don't use it by default), we will keep running into these problems. Standards are only salvation insofar as they are applied. Correctly, consistently and universally. That much we have already learnt from IE vs the world in terms of web page rendering.
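The one-at-a-time limitation is easy to see in python (another small sketch; the exact error text depends on your python version):

# French and Russian side by side: utf-8 covers both,
# but no single ISO 8859 part does.
mixed = "café, кофе"
print(mixed.encode("utf-8"))      # fine, every character is covered
try:
    mixed.encode("iso-8859-1")
except UnicodeEncodeError as err:
    print(err)                    # 'к' has no iso-8859-1 equivalent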

zealots rejoice

August 16th, 2006

In the linux community there is a segment of people who are extremely focused on what they perceive as Linux's battle against the Evil Empire. I have to say those kinds of opinions seem a little far-fetched to me; it gets to a point where people don't care much for the merits of the argument anymore, they simply want to have it their way no matter what. There have been times when I was inclined to do that myself, but I think with age pragmatism starts to set in. I think people should know the facts and use whatever works best for them.

And this shift has come about without political struggles; it has just arrived naturally, following the circumstances. Since nowadays I'm not home in Norway most of the time, my old desktop computer isn't of much use to me, I just take the laptop to Holland. Meanwhile, the desktop is perfectly functional and shouldn't go to waste, so I leave Ubuntu behind on this "family PC". I've also set up Ubuntu on another desktop, which is geared more toward personal use. I tried to make this happen over Easter, but at the time Ubuntu Hoary just wasn't up to it; it took too much work to get it running smoothly (and the default KDE setup was mind-bogglingly ugly, what the hell are the Kubuntu people thinking :wth: ). The recent improvements in usability have made Ubuntu Dapper a proper option for regular desktop users.

Downstairs, the file server has been dismantled on account of a hardware failure. The firewall still remains, so that's 2 Ubuntu desktops, one Gentoo laptop, one RedHat firewall and just one lonely WindowsXP desktop left (which will remain such on account of its user being very attached to it). I suppose I'm in a position to write one of those boring "I set up linux for my grandma and it's working really well" articles on newsforge.com now.

perceptions of programming

August 15th, 2006

I remember when I was much younger and first started programming, it was such an exciting thing to do. I mean nowadays I think of it as normal, but back then it was so unique, so unlike anything else I ever did. I wasn't too good at it; I'm not terribly gifted at coding to begin with. And I didn't do *that* much of it either, there were always distractions in the form of games, later internet forums and whatnot.

But it was fun to *create* something and I certainly tried a bunch of times back in high school, we even had sort of a club for it. I remember that I always wanted to create something brilliant, something really awesome. And I didn't know the first thing about it, I was clueless. So I would just sit down and start coding. Whatever came to mind that had to be done as part of my project, I started out with that. A program that encrypts files (encryption, that was a favorite theme back then)? Ok, I need to read the file first. So I would code that first, reading the contents of a file, without thinking any further at all. Then once I did that, I would start thinking about what I needed next. And so on. Working in that fashion, it's no big surprise that my code wasn't of the highest standard. I didn't read any books either, I just searched for code examples and used the help files for the IDE. (Borland Pascal had a really great integrated api doc, very handy.)

So when I started computer science in college, that was all I knew about coding. Basically just trying to put an idea into code, however it was done, just as long as it worked. And didn't take too long to do. And didn't crash. Mission accomplished. And we did a couple of those missions; most never got completed, 'cause we didn't know how, and some worked just fine, which was a great satisfaction, but ultimately we didn't solve too many problems that way.

I recall that my intuitive concept of "something awesome" was that great piece of inspiration. I knew intuitively that I wanted my code to be as good as it could be, as organized and structured and all that. But I didn't know _anything_ about how it *should* be done. I didn't know any theory, any principles, guidelines, practices. I sorta started from scratch. So college really changed that point of view: I learnt what developing systems means, what it entails, what people actually do, and how. But then I had assignments, and again I would try to write that great piece of code, with no idea of how it would even be judged good or bad.

My concept of great programming was that good idea. It's a bit like a painter rigging up a new canvas and just sitting there waiting for inspiration. How do I paint a masterpiece? I just wait for that amazing, great idea to come along. I will not settle for mediocre ones, I will ignore them, I won't compromise on this. And when that brilliant idea comes to me, I'll use it and I'll paint a masterpiece. That's how I thought I could write something awesome. This, of course, is the Big Design Upfront mentality, with pretty much the worst possible starting point (ie. complete ignorance and zero experience) for actually designing something good. But I really didn't understand anything about modularity. I would think to myself that "this encryption bit should be exported to a separate file", but I had no clue about how it should be interfaced, what logic it should and shouldn't have, only that it should be separate because that "seems like a good idea". I would start a lot and end up nowhere, just fumbling in the dark.

When I did have some "great idea" about how to solve a problem, it was always rather complicated. I had such faith in my scheme. I was confident that besides solving the basic problem, I had also solved other issues that other people "probably wouldn't have thought about". My first assignments in college were all like that, based on some idea of how to improve upon the "bare minimum" by making things more complicated than they had to be. And it would obscure the issue, because there was lots of unnecessary additional logic in it. I was disheartened, time after time, to learn that the simplest way of doing something was much simpler than what I had devised. But because I had this attitude, I felt it was a great achievement to write all that code.

Gradually, partly in college and partly on my own - reading articles, watching interviews and talks - I began to see how it actually works. And of course it's nothing like what I just described. For a long time, my biggest problem was not knowing where to start, what bits I would have to create. It would just seem that I had to follow that one thread of thinking, something I needed, and then take it from there. Years on, after I've been through quite a few projects, the reality of development is starting to approximate the theory that I learnt about how it's supposed to be.

I've done enough of this now to design a system, to see what the components should be, to consider different variants and see what the tradeoffs are. Nowadays I don't even start coding until I have a good mental picture of how it's going to work. And what wonderful things can be achieved with modular coding; now I actually see how the separate pieces should be divided. They can be classes (in the object oriented world), they can be modules exporting functions (python, c etc.), they can be higher order functions (functional programming) and so on. And I understand how an API works, how it's meant to work.

Nowadays, I'm much more likely to find the simplest solution from the beginning. And if I don't, I iterate my code to reach it. One basic theory about how to manage programmers is that they take great pride in their code and everyone wants their code to be in the final product, so if you remove their code because it's not necessary, they get mad. And I'm sure that's true. But on the way to finding the optimal solution, the more code I can remove, the happier I am. Often the simplest solution is 2-3x less code than what I started out with. It's really just a matter of expressing ideas in a very clear way. When the mind is terribly foggy, that's a huge problem.

It is such a huge shift in thinking. Coding isn't some impulsive, ingenious spur-of-the-moment scheme; it's a well-defined, highly organized, methodical way of building solutions. So when I look back at how I used to think about the subject, I'm almost amazed that I had so little natural intuition for it. I really had no idea what it's like.

Peter Norvig wrote "Teach Yourself Programming in Ten Years". It's been about ten years since I started out, and I feel I'm well grounded in the basics now. Like a Formula 1 driver knowing he can start his car without choking the engine 9 times out of 10.