charset wars

August 17th, 2006

Have you ever opened a web page and all you could see was garbled text? That was a charset conflict. The page had been written in one charset but was displayed to you in another. If you look at this page, the Norwegian characters should display correctly, but if you do this:

charset.png

(i.e. change the charset manually), then the non-ascii characters get messed up. Why? Because the file was written as utf-8 text, but is being read as iso-8859-1. The function that reads the text has no way of knowing that, so it simply treats every byte as one iso-8859-1 character and the result is wrongly translated. utf-8 stores ascii characters in one byte, but everything else takes two to four bytes, and the Norwegian letters take two. Since iso-8859-1 always uses one byte per character, each 'mis-translated' letter shows up as two characters instead of one.
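Here is a minimal Python sketch of that mix-up. The sample word is my own pick (Norwegian for "blueberry"), not something taken from the page; any text with non-ascii letters would behave the same way.

```python
# Sample text with two Norwegian letters (an arbitrary example word).
text = "blåbær"

# What gets saved to disk: the text encoded as utf-8 bytes.
raw = text.encode("utf-8")

# What the reader sees when the same bytes are read as iso-8859-1:
# every byte becomes one character, so each two-byte letter turns into two.
garbled = raw.decode("iso-8859-1")

print(garbled)                          # blÃ¥bÃ¦r
print(len(text), len(raw), len(garbled))  # 6 8 8

# Nothing is lost, only misread: reversing the wrong step restores the text.
restored = garbled.encode("iso-8859-1").decode("utf-8")
print(restored == text)                 # True
```

Note that the bytes themselves never change. Only the interpretation does, which is why switching the charset back in the browser fixes the display.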

This is usually not a problem, because most websites (and most half-conscious web coders) have the decency to set the charset in the head of the page, like so:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
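For the curious, this is roughly how a program could pick that declaration out of a page. It is only a sketch using Python's standard html.parser module; the class name MetaCharsetFinder is made up for this example, and real browsers follow a more involved procedure (they also look at the HTTP headers, for one).

```python
import re
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Hypothetical helper: remembers the charset from the first
    http-equiv="Content-Type" meta tag it sees."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Only look at meta tags, and stop after the first hit.
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        # The charset sits inside the content attribute, after "charset=".
        content = attributes.get("content") or ""
        match = re.search(r"charset=([\w-]+)", content, re.IGNORECASE)
        if match:
            self.charset = match.group(1).lower()


finder = MetaCharsetFinder()
finder.feed('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />')
print(finder.charset)  # utf-8
```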

So much for the web. What's worse is when you get charset conflicts in terminals. Most modern linux distros now ship in full utf8 mode, that is, applications are set to use utf8 by default to avoid all these problems. But then I log into a server and use nano or vim to edit files (or emacs, if need be), and I get into trouble. The text I write is in utf8, because my terminal controls what bytes are sent to the server. The server will most likely not expect that (some of these server distributions are ancient and do *not* use utf8 by default), so when I type non-ascii characters in nano and save, the text gets garbled. vim supports utf8, so the problem is much reduced. But in nano, I basically have to save, then open the file again to see where the bugs are. This has to do with how text is handled: the editor assumes one byte per character, so if I type a non-ascii utf8 character (two bytes for the Norwegian letters) and try to erase it, nano will erase just one byte. So "half" the character is still there. And so on and so forth. Very annoying, I tell you.
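The "half a character" effect is easy to reproduce outside an editor. The sketch below is not how nano is implemented; it just mimics a backspace that removes one byte instead of one character, with a sample word I picked for the purpose.

```python
# Sample word ending in a two-byte Norwegian letter (arbitrary example).
text = "på"
raw = text.encode("utf-8")
print(raw)                      # b'p\xc3\xa5'

# A byte-oriented backspace removes one byte, i.e. half of the last letter.
after_backspace = raw[:-1]
print(after_backspace)          # b'p\xc3'

# The leftover byte is not valid utf-8 on its own.
try:
    after_backspace.decode("utf-8")
    leftover_is_valid = True
except UnicodeDecodeError:
    leftover_is_valid = False
print(leftover_is_valid)        # False

# A lenient reader shows the stray byte as a replacement character.
shown = after_backspace.decode("utf-8", errors="replace")

# A character-aware backspace works on the decoded text instead.
correct = text[:-1]
print(correct)                  # p
```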

So why bother with utf8? Because utf8 (and unicode in general) was designed to solve all these charset conflicts. ISO 8859 is a legacy standard, and through its various parts (iso-8859-1, iso-8859-5 and so on) it supports many different languages. But you can only use one part at a time, so if you write text in French in one file, you cannot also use Russian text in there, because no single charset supports both. Enter utf8, which supports pretty much _everything_. But as long as we still have piles of legacy systems that aren't designed to handle utf8 (or don't use utf8 by default at least), we will continue to experience these problems. Standards are only salvation insofar as they are applied. Correctly, consistently and universally. That much we have already learnt from IE vs the world in terms of web page rendering.
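To make the French-plus-Russian point concrete, here is a small Python check. The sample string is invented for the example; the codec names are the ones Python's standard library uses for the Western European and Cyrillic parts of ISO 8859.

```python
# One line mixing a French word with Russian ones (made-up sample text).
sample = "café и чай"

results = {}
for codec in ("iso-8859-1", "iso-8859-5", "utf-8"):
    try:
        sample.encode(codec)
        results[codec] = True
    except UnicodeEncodeError:
        # The charset has no slot for at least one of the characters.
        results[codec] = False

for codec, ok in results.items():
    print(codec, "can store the text" if ok else "cannot store the text")
```

iso-8859-1 has no Cyrillic letters and iso-8859-5 has no é, so only utf-8 gets through.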


11 Responses to "charset wars"

  1. erik says:

    Good observation. That one can be quite the nicklepick (as in trying to pick a nickle off the floor, it's almost impossible. And, yes I just came up with that :D )

  2. numerodix says:

    Good pun :D

  3. ash says:

    How ironic...the first time I loaded this page the browser messed it up somehow and everything was all over the place. I wonder what caused it?

    Btw, still not sure if your feed's working. Bloglines doesn't seem to be picking it up, though it also isn't picking up andre's at "a beautiful revolution" so it might be the feed-reader

  4. numerodix says:

    Well, I don't know what the problem was. Maybe the css didn't load and the fonts were all Times New Roman? That happens occasionally. A simple reload will fix it.

    Never tried Bloglines, I've only used rss client programs to track blogs, like Liferea. But I tested it just now and it's WordPress's funky way of writing the url that's messing it up for you. Just delete the feed: prefix from feed:http://www.matusiak.eu/numerodix/blog/?feed=rss2, because apparently this new "protocol" is imaginary and no one knows it except WordPress.

    Btw, I have to say that Bloglines really looks like a nice service. I subscribed to my own blog and the viewer preserves formatting, which is more than I can say about all the semi-crappy feed readers I've tried in the gentoo repositories. I mean, what point is there in laying out text if the feed reader lumps it all together, stripping paragraphs and images?

  5. ash says:

    Thanks, that seems to have fixed things.
    Bloglines is good, though I'm not a fan of the frames layout. I don't actually read the feeds in Bloglines, I just use it to tell me when blogs have updated and then I click on the link to actually visit the site.
    I just prefer reading blogs at the actual site, with the template and layout and everything.

    Speaking of which, any chance of an RSS feed on your site Erik?

  6. numerodix says:

    Yeah, I don't either, not blogs. But articles I might. In any case, when the story looks all messed up it doesn't do much to promote the site.

    Erik is using old skool 90s technology on his blog, that was before rss was known. :D

  7. erik says:

    Make that 80s :D

  8. ash says:

    So....a typewriter then?

  9. erik says:

    I actually did learn to type (I have a diploma, wee) on an old typewriter, the kind that kinda required you to put your full body mass behind your finger or you couldn't press the button all the way down.

  10. ash says:

    Yeah, Martin (the Martin who comments on my site) learnt to type like that. Then he got a computer and whenever we were doing introductory IT lessons at school he was not only the fastest typer by far, but also the very loudest - you could hear him from anywhere in the room, hammering each and every key. It took him a year or two to adjust to the change.

  11. [...] We don’t want any html tags anywhere, and we don’t want any funny characters that will come out garbled. Anything retrieved from the web is by definition garbage, so we need to make sure that we clean it [...]