numerodix blog

Archive for August, 2007

If you live in Europe you're running Linux

August 21st, 2007

There is a certain large class of internet users who, despite knowing that the internet is global, only seem to be interested in the national parts of it. They will use webmail, but only some national provider in their language. They will read papers, but only papers of their country. This to me is a little weird, because why would you limit yourself to just that little part when you have the whole thing to choose from? Anyway, my point is that if you're one of these people you're not going to understand what I'm about to say.

If you are one of those people on the internet, of which there are many, who have a wider perspective than just their country, then you can appreciate how often it feels like you're a second class citizen. Want to buy something on amazon? Well too bad, because there is no Dutch amazon, so you have to buy it from the US and overpay for shipping (or the French, German or UK one, but those don't have nearly the same selection). Oh, and it takes two weeks instead of 3 days. Ebay? Same deal. More often than not, you get a worse deal when you don't have a US shipping address. A lot of things you can't get at all. And most sites default to their largest, US version as well. It's not discrimination on purpose, of course, it's just that that's their biggest market.

Well, that's kind of what it's like to run Linux. There is less software available, you don't get support for hardware, and basically the total number of services offered for Linux is much smaller, because the market is smaller. Either you can't get the same deals or you have to do more work to get them. But it's actually less painful than online shopping, so those who are switching will find it's not as bad as they thought. Basically if you can stand to live in Europe and shop online, Linux is a delight. :cap:

Posted in en, observations | No Comments »

Bill Maher

August 20th, 2007

Bill Maher is an atypical comic in that he's a public person as well. He isn't trying to play the part of a comic and nothing more, or to play several discrete roles and leave it at that. He's rather open to being engaged on anything. On his show he invites people whom he apparently respects a lot, but sometimes it really gets into a heated argument, and they don't shy away from it. There is a lot of authenticity about him, less theater. And rather than being a performer, he's more about just speaking his mind.

So by that he is a rather well rounded character, which I think helps him as a stand-up. And above all, he's very good at telling a story. And that's what he does on stage too, he tells stories in a very relaxed and natural way, it really doesn't seem scripted. They are funny, but they aren't really sound bites either, they aren't manufactured to be funny word-for-word, it's the content that counts. And that's a different way of telling a joke than comics usually do, because they put every effort into wording it best. And they do that because they don't have the wealth of material that Maher does, or at least it appears so. He can easily go off on a little tangent, be funny, then return to where he was and continue with ease. He has that kind of coherence and clarity that not many people have.

But the funniest thing about him is that he's not really trying to be funny on purpose. The stories he tells are funny, but they aren't really jokes, they are just stories, presumably true most of them too. He's not really playing for laughs, it seems, because what he has to say is funny enough. And that's pretty confusing, cause you can't quite figure out how it works.

Posted in comedy, en | 1 Comments »

whyfirefoxisblocked: adorable muppets

August 17th, 2007

If you're looking for a good laugh, look no further. :D

So what's funny about it?

Many site owners therefore install scripts that prevent people using ad blocking software from accessing their site.

Many? :howler: I've never seen this page before. If "many" were doing this, then surely at some point I would have noticed it.

Secondly, he kills his own argument already in the second paragraph.

Blocking FireFox is the only alternative. Demographics have shown that not only are FireFox users a somewhat small percentage of the internet, they actually are even smaller in terms of online spending, therefore blocking FireFox seems to have only minimal financial drawbacks, whereas ending resource theft has tremendous financial rewards for honest, hard-working website owners and developers..

So if you block all Firefox users from accessing a website, that only has minimal financial drawbacks. That would necessarily imply that Firefox users running Adblock would also be a minimal financial drawback, since the browser is a somewhat small percentage of the market. So it's not even a problem, is it? :D

But the central argument here is a morality tale. Appealing to our sense of decency and all that, by telling us that we're crooks. :cap: That's right, we're stealing from honest, hard-working website owners and developers. I love those implications btw, apparently every non-developer is lazy and dishonest. :D

But then he says..

Netscape users can simply set their browser to IE mode to continue to enjoy the site that sent you here. FireFox users can use Internet Explorer, Opera or Netscape (in IE mode) to access it. FireFox users also have the option of using the IE Tab plug-in which uses the IE rendering engine to display pages, but also disables the Ad Block Plus plug-in.

Careful, of course, to not mention that any Firefox extension that allows you to switch your User Agent string will also allow you to enjoy the site just like Netscape (in IE mode). :D (In fact, quite a few Firefox users run in IE mode by default, purely because some idiotic sites block non-IE browsers.)

The guy would also be a bit more convincing about his denouncements of The Firefox Cult and Firefox Fanboys if his website (btw I can load it just fine in Firefox :P ) didn't look like some sort of shrine to a certain monopolistic company we know. He even copied the layout and the font (isn't that stealing btw?) :D He also has a page comparing browsers, where shockingly IE is the editor's pick. :D

I agree zealots are annoying and cults are dangerous, but Firefox is hardly the most dangerous cult out there. It's mostly about freedom from a certain company and control of your own computer. That's hardly the most evil plot ever.

Stealing what exactly?

I love this stealing argument. It's the same argument the RIAA uses to complain about their record sales. "If people would buy more records, we would have more money, and so since they aren't, that means they steal from us." Isn't it wonderful to claim profits based on projected income? Or better yet, *desired* income.

The fact is that Firefox is a community driven project, and the features it has, and even more so the extensions it has, are a reflection of what people actually want. As opposed to a company telling them what they can have. The tv parallel is actually a very good one. If people had the option not to watch commercials, many of the garbage tv stations would be wiped out. Their whole existence is an excuse for mediocrity, because no one would actually pay for that content if they had to.

This may be a controversial view in the world of people who think they should be controlled by companies, but giving people the right to choose what they want to see is actually sort of the way it's supposed to be in a free society. Then they can decide for themselves if your content is something that a) they will only take for free or b) they are actually willing to pay for.

Another thing is that Ad Block wouldn't be so popular if web companies didn't allow their websites to become the ad infested crap (even if the content is decent) that they are. A lot of sites are unbearable without (also often with) Ad Block and it's the one extension I definitely would hate to lose the most.

But but but what about these thousands of honest, hard-working website owners and developers? Well, do you weep over SCO going bust? (Should be any day now.) Lots of companies, no scratch that, most companies are started on a business model that isn't sustainable. So then the plan is that companies that control our technology will enforce ads so that we can keep these other bad companies afloat.

While we're on the subject.. If you've been here before you may have noticed that I slapped on Google Adsense recently just to see if it would make any sort of difference. I wonder if that's some sort of double standard, but on the other hand for those who are willing to look at ads, I'm letting them. :D I never see it myself cause I use Ad Block, and I'm guessing almost all my visitors do too. :D

Posted in comedy, en | 14 Comments »

desktop hackery with grep

August 15th, 2007

Just as every self respecting Unix user knows (and every Mac user should know, but probably doesn't), grep is the tool supreme for finding stuff in text files.

Here I describe harvest, a tool similar to grep for searching in all kinds of files (and in things that aren't files).

Why on earth?

The original problem was rather contrived, admittedly. I had been resisting the facebook bandwagon for the longest time, but finally a friend talked me into trying it. If you know facebook, you know how the system works with adding "friends". And it's rather nice in how it imports your contacts from gmail and such. Of course, when you have your contacts elsewhere, you're left with a chunk of manual labor, searching and adding one-by-one. Not that it's a big problem, just a one time thing after all. But I was reluctant to undertake it, so I was clicking through facebook instead and found an import contacts from file feature, hm now that sounds more like it.

So I thought to myself: in my university email account I have a year's worth of email history.. wouldn't it be satisfying to scan the whole thing and produce a list of email addresses I can import straight into facebook? Yes, I'm quite aware that my train of thought is somewhat off the beaten path much of the time. :D

In case you didn't know, email is just text. You may get a different idea when your email reader hides all the technical bits and just shows you the body of the message, but a stack of messages is just a bunch of text files, so there's no reason you can't treat them as any other text. So I went along and downloaded the whole thing, some 30mb. Now for the fun part. :cool:

So I need a tool that will scan the huge chunk of text and extract all the email addresses, and print them out in csv format. Preferably also remove duplicates from the list. Since I'm riding the ruby wave these days, it was the obvious choice, not least because I like how it handles regular expressions natively. So I hack up a script to do this, calling it harvest. It gets the data from the standard input, scans it for matches, and spits out the email addresses, very simple. And it works like a charm on my huge hunk of email data.
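A minimal sketch of that first, read-everything-at-once idea, for illustration only. The simplified pattern and the names SIMPLE_EMAIL and harvest_all are made up for this example; they are not what harvest itself uses (the real script is listed at the end of the post).

```ruby
require "stringio"

# Hypothetical, simplified pattern; the real harvest pattern is more elaborate.
SIMPLE_EMAIL = /[A-Za-z0-9_.-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}/

# Read the whole input into memory, collect every match, drop duplicates
# (ignoring case), sort, and return the lines of a two-column csv.
def harvest_all(io)
  addresses = io.read.scan(SIMPLE_EMAIL)
  unique = addresses.uniq { |a| a.downcase }.sort_by { |a| a.downcase }
  ['"Email Address","Formatted Name"'] + unique.map { |a| %("#{a}","") }
end

sample = StringIO.new("From: anna@example.org\nTo: Bob@example.com, anna@example.org\n")
puts harvest_all(sample)
```

Passing STDIN instead of the StringIO would give the pipe-friendly behaviour described above, at the cost of holding the entire input in memory.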

At this point you'll be wondering why on earth not use grep? Because to my knowledge grep only matches line-by-line, whereas I wanted something more general than that. And of course it's also the case that once you actually code it up, you have all the freedom you could ask for, rather than being limited to what grep does and doesn't do.

Can you make this thing go any faster?

Later on I realized that I can run harvest on just any file, and it would still work. Not that I had just discovered a new continent: strings already extracts all text strings from any file, including binaries. But the difference was that I could search for things. So I found a nice test subject, pagefile.sys, which is the Windows swap file. :D I boot Windows once every few months, and when I do I rarely remember what I was doing last time. But apparently I had decided at some point that the swap file should be 1.5gb.

So I run harvest on it, while keeping an eye on things in htop. And ouch, harvest is consuming the entire file and reading it into memory. Next it's going to search for email addresses in a 1.5gb long string. :D Needless to say, that wasn't a success, the system started choking as it ran out of memory.

So I thought it would be better to buffer the file and read a chunk at a time. The only question is how do I still match for strings in a file of which there is only a chunk available? I wasn't exactly planning on matching super long strings, but then again there is the case where a string you want to find is part in the chunk currently in memory, and part in the next one, so how do you make sure you don't miss it? I tried an algorithm, and it was no good. It turns out for a long time I was barking up the wrong tree, and as a result I rewrote it about five times until I got it right. It is uncanny how the best solution is usually the simplest also.
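The boundary problem can be sketched roughly like this. It is a simplified take on the idea, not the exact algorithm in the script at the end of the post: a match that touches the end of the buffer is held back until more data arrives, and otherwise a short tail is carried over. The names each_match, EMAIL and max_len are invented for the example, and max_len is an assumed upper bound on how long a match can get.

```ruby
require "stringio"

# Hypothetical, simplified pattern used only for this sketch.
EMAIL = /[A-Za-z0-9_.-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}/

# Yield every match of pattern in io, reading chunk_size bytes at a time.
def each_match(io, pattern, chunk_size, max_len = 100)
  carry = ""
  loop do
    chunk = io.read(chunk_size)        # nil once the input is exhausted
    at_eof = chunk.nil?
    buf = carry + chunk.to_s
    pos = 0
    held = nil
    while (m = pattern.match(buf, pos))
      # A match touching the end of the buffer may continue in the next
      # chunk, so hold it back unless there is no next chunk.
      if m.end(0) == buf.length && !at_eof
        held = m.begin(0)
        break
      end
      yield m[0]
      pos = m.end(0)
    end
    break if at_eof
    # Carry over the held-back match, or else the last max_len characters
    # of the unmatched remainder, which may hold the start of a match.
    rest = buf[(held || pos)..-1]
    carry = held ? rest : (rest[-max_len..-1] || rest)
  end
end

data = "mail anna@example.org and bob@example.com now"
found = []
each_match(StringIO.new(data), EMAIL, 7) { |a| found << a }
p found   # prints ["anna@example.org", "bob@example.com"]
```

One limitation of this sketch: a match that ends just before the chunk boundary is reported as it stands, even if the next chunk could have extended it.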

To make sure it runs at an acceptable speed, I also experimented with the buffer size vis a vis speed and memory use, finding that a small buffer is actually better. When hunting for performance problems, it's often a good idea to run your app through a profiler just to be sure that it does what you think it will.

With a 10kb buffer (79s):

%   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 10.43     8.27      8.27   156723     0.05     0.05  Regexp#match
  8.73    15.19      6.92   618095     0.01     0.01  String#length
  6.87    20.64      5.45   153806     0.04     0.04  IO#read
  4.45    24.17      3.53   153810     0.02     0.02  String#+

This surprised me. I thought it should be spending far more time matching than say reading from disk. So I tried with a bigger buffer to see if I could marginalize disk io in the overall cost.

With a 10mb buffer (115s):

%   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 94.46   108.96    108.96     3057    35.64    35.64  Regexp#match
  2.52   111.87      2.91      152    19.14    19.14  IO#read
  1.31   113.38      1.51      156     9.68     9.68  String#+
  0.28   113.70      0.32       75     4.27    33.47  Kernel.require

This is more like what I expected, now almost all the time is spent matching. But it actually takes longer (and uses more memory as well, obviously), so there's nothing to gain by increasing the buffer unless the string we're searching for is so long that we need a buffer of megabytes. (Obviously, emails and urls are much shorter than that.)
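For a quick comparison without a full profiler run, something like the following could be used to time a scan at different buffer sizes. It is only a sketch: time_scan, EMAIL_RE and the synthetic test data are assumptions made for the example, matches that straddle a buffer boundary are simply ignored, and the timings will differ between Ruby versions and machines, so it won't necessarily reproduce the numbers above.

```ruby
require "benchmark"
require "stringio"

# Hypothetical, simplified pattern used only for this sketch.
EMAIL_RE = /[A-Za-z0-9_.-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}/

# Scan data one buffer at a time; return [seconds taken, matches found].
# Matches split across two buffers are not counted, this only measures cost.
def time_scan(data, buffer_size)
  io = StringIO.new(data)
  count = 0
  seconds = Benchmark.realtime do
    while (chunk = io.read(buffer_size))
      count += chunk.scan(EMAIL_RE).length
    end
  end
  [seconds, count]
end

# Made-up test data: filler text with an address at the end of every line.
data = ("lorem ipsum " * 50 + "someone@example.org\n") * 2_000
[1024, 10 * 1024, 1024 * 1024].each do |size|
  seconds, count = time_scan(data, size)
  printf("%8d byte buffer: %.3fs, %d matches\n", size, seconds, count)
end
```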

To profile your script in ruby, try:

ruby -rprofile ninja.rb

The script now runs pretty fast, scanning the Windows pagefile in a couple of minutes, which I'm quite satisfied with.

More fun than a bag of chips, but useful?

I'm sure you're still wondering if the facebook scheme was a success. It wasn't. :D It turns out that out of all the emails harvested, a single one was found on facebook. As popular as the site is in Norway, apparently it doesn't have any Dutch users. :confused:

But since I already had harvest, I thought I would add an option to find urls as well, just for the heck of it. I also discovered that I could run it on any kind of file, not just text files. For instance, if you visited some cool site and forgot to bookmark it, it's probably still in Firefox's history file, so you can do:

harvest.rb --dat < ~/.mozilla/firefox/<profile>/history.dat

And not just files, either. To take a rather unexpected use case.. say you had an important email address, like for a job interview at the chocolate tasting lab, and you lost it.. well maybe it was swapped out at some point. Use harvest to scan your swap for email addresses:

cat /dev/hdXY | harvest.rb --email

And you can run that on any filesystem actually. :cool: I don't know how to access live memory in the same way, but that would be fun to try also. :cap: Things like zip files won't work, of course, because the text is compressed, but otherwise (most of the time) you can read text out of any file whether it's a text file or not.
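To illustrate pulling text out of non-text data, here is a small sketch in the spirit of strings. The name printable_runs and the sample bytes are invented for the example. One assumption worth stating: on Ruby 1.9 and later, strings carry an encoding, and as far as I know matching a regexp against bytes that aren't valid in that encoding raises an error, so the sketch switches to a binary view of the data first.

```ruby
require "stringio"

# Return every run of at least min_len printable ASCII characters.
# String#b gives a binary-encoded copy, so the regexp does not trip over
# bytes that are not valid UTF-8.
def printable_runs(io, min_len = 4)
  io.read.b.scan(/[\x20-\x7E]{#{min_len},}/)
end

# Made-up "binary" data with a little text buried in it.
blob = "\x00\x01\xFFhello world\x00\x02ab\x00anna@example.org\xFE".b
p printable_runs(StringIO.new(blob))   # prints ["hello world", "anna@example.org"]
```

The runs that come back are plain ASCII, so they can then be searched with an ordinary pattern.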

So is it actually useful? Not really. :D

But the useful observation is that your data is right there, and though you may not see it directly, it doesn't take more than this to actually look through it.

#!/usr/bin/env ruby
#
# Author: Martin Matusiak <numerodix@gmail.com>
# Licensed under the GNU Public License, version 3.
#
# revision 3 - allow spaces in urls
# revision 2 - introduce buffering to handle large files out of memory
# revision 1 - performance hacking: output entries immediately, only sort on
# emailcsv


require "optparse"


email = /([a-zA-Z0-9_\.-])+@(([a-zA-Z0-9-])+\.)+([a-zA-Z0-9]{2,4})+/m
url_orig = /([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9\/](([A-Za-z0-9$_.+!*,;\/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;\/?:@&~=%-]{0,1000}))?)/m
url = /([A-Za-z][A-Za-z0-9+.-]{1,120}:\/\/(([A-Za-z0-9$_.+!*,;\/?:@&~(){}\[\]=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9 $_.+!*,;\/?:@&~(){}\[\]=%-]{0,1000}))?)/m

pattern=url
joinlines=false
emailcsv=false
buffer_size=10*1024
hardlimit=100


## parse options
OptionParser.new do |opts|
	opts.on("--url", "url format") do |v|
		pattern = url
	end
	opts.on("--dat", "firefox history.dat format = \\\\n in urls") do |v|
		joinlines = true
	end
	opts.on("--email", "email format") do |v|
		pattern = email
	end
	opts.on("--emailcsv", "csv output (facebook contact import)") do |v|
		pattern = email
		emailcsv = true
	end
end.parse!


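entries = []
previous = ""	# text carried over from the end of the previous chunk
# read a chunk at a time; stop once a read adds nothing new (end of input)
while string = previous + STDIN.read(buffer_size).to_s and string.length > previous.length do
	partial = ""
	# history.dat breaks lines with a backslash-newline, so join them up
	joinlines and string.gsub!(/\\\n/, "")
	while string and m = pattern.match(string) and m.size > 0 do
		# a match touching the end of the chunk may continue in the next
		# chunk, so hold it back instead of outputting it
		m.end(0) == string.length and partial = m.to_s
		if partial.empty?
			if emailcsv
				entries << m.to_s
			else
				puts m.to_s
			end
		end
		pos = m.end(0)
		string = string[pos..-1]
	end
	if !partial.empty?
		previous = partial
	else
		# keep only the tail, in case a match starts there
		if hardlimit < string.length
			previous = string[string.length-hardlimit..-1]
		else
			previous = string
		end
	end
end

# at end of input a held-back match is complete after all, so output it
if m = pattern.match(previous)
	if emailcsv
		entries << m.to_s
	else
		puts m.to_s
	end
end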

# special stuff for csv email output
if !entries.empty?
	entries = entries.sort{ |a, b| a.downcase <=> b.downcase }.uniq
	puts '"Email Address","Formatted Name"'
	entries.each { |i| puts '"' + i + '",""' }
end

Posted in code, en | 3 Comments »

bugs are better when fun

August 12th, 2007

Gentoo is fabled to be the maverick of a distro that blows up right, left and center. While I find that to be a gross exaggeration, it does occasionally crash in scary ways. :) Although I don't think it's any worse than the others: upgrading Ubuntu from one release to another is still broken (Edgy -> Feisty, check), and last time I tried the Fedora upgrade cycle it was busted as well. Obviously, I wouldn't have been running Gentoo these last five years if it were as fragile as some people think. I actually find it the most sane distro in many ways.

But yesterday I must say I had quite a bit of fun with it. I keep to a fairly recent upgrade path, so I rarely fall behind more than a week. The expat1->2 upgrade had been in the works for some time, but the einfo flew by without me seeing it, as it often does. Interestingly, half the packages on the system depend on expat, and all of them were suddenly broken. The first thing I noticed was Firefox behaving oddly.

Some things cannot readily be captured in a picture, an audio recording, or a video. Not even in a piece of text. A cool bug is one of those things. It has to be investigated, experimented upon, to understand the extent of it. I regret that I had no way to record this work of art for posterity. So Firefox was running happily along, as it had been doing for hours. Then at one point, I open a new tab and type fa in the location field to call up facebook from the drop down list. But I never got to a. Just as I hit f on the keyboard, Firefox crashed.

Of course, Firefox *does* love the crash, it crashes several times a day. But it almost always crashes over Adobe's frightfully stable flash plugin. This was not one of those times, I didn't have flash content in any of the open tabs. Well, never mind, it was probably a coincidence. So I start it up again, I get my tabs back, and again I open a new tab to load facebook. Again I never get to a. :D Hm, this is starting to get interesting.

It turns out that a single key press would crash Firefox, but it didn't have a problem with the mouse. And not just Firefox, any Gtk application. :cool: At this point I was quite amused.

Not surprisingly, bugs are more fun when you have an idea what to do about them. I did track down the expat problem after some digging, but it made me recompile a boat load of packages. All of them seem to work again, although revdep-rebuild hadn't been run for a while and exposed some other problems. One package that didn't recover, however, was digikam. I'm back on 0.8.2-r1 after running 0.9.2 happily for weeks. Odd.

The moral is this: if you're going to crash, find a fun way to crash. :party:

Posted in en, gentoo | 5 Comments »
