Archive for 2007

desktop hackery with grep

August 15th, 2007

As every self-respecting Unix user knows (and every Mac user should know, but probably doesn't), grep is the tool supreme for finding stuff in text files.

Here I describe harvest, a tool similar to grep for searching in all kinds of files (and things that aren't files).

Why on earth?
The original problem was rather contrived, admittedly. I had been resisting the facebook bandwagon for the longest time, but finally a friend talked me into trying it. If you know facebook, you know how the system works with adding "friends". And it's rather nice in how it imports your contacts from gmail and such. Of course, when you have your contacts elsewhere, you're left with a chunk of manual labor, searching and adding one by one. Not that it's a big problem, just a one-time thing after all. But I was reluctant to undertake it, so I was clicking through facebook instead and found an "import contacts from file" feature. Hm, now that sounds more like it.

So I thought to myself: in my university email account I have a year's worth of email history.. wouldn't it be satisfying to scan the whole thing and produce a list of email addresses I can import straight into facebook? Yes, I'm quite aware that my train of thought is somewhat off the beaten path much of the time. :D

In case you didn't know, email is just text. You may get a different idea when your email reader hides all the technical bits and just shows you the body of the message, but a stack of messages is just a bunch of text files, so there's no reason you can't treat them as any other text. So I went along and downloaded the whole thing, some 30mb. Now for the fun part. :cool:

So I need a tool that will scan the huge chunk of text and extract all the email addresses, and print them out in csv format. Preferably also remove duplicates from the list. Since I'm riding the ruby wave these days, it was the obvious choice, not least because I like how it handles regular expressions natively. So I hack up a script to do this, calling it harvest. It gets the data from the standard input, scans it for matches, and spits out the email addresses, very simple. And it works like a charm on my huge hunk of email data.
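Here's a minimal sketch of what such a first version might look like. It is not the actual script (that one is listed at the end of this post); the `harvest_csv` name and the simplified `EMAIL` pattern are made up for illustration, and it reads the whole input into memory in one go.

```ruby
require "stringio"

# Simplified address pattern, for illustration only. It will miss some valid
# addresses and accept some invalid ones.
EMAIL = /[a-zA-Z0-9_.-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,4}/

# Read everything from io at once, collect the unique addresses and return
# them as csv lines, header first.
def harvest_csv(io)
  addresses = io.read.scan(EMAIL).uniq.sort
  ['"Email Address","Formatted Name"'] + addresses.map { |a| %("#{a}","") }
end

# A StringIO stands in for STDIN here so the sketch runs on its own.
sample = StringIO.new("From: bob@example.org\nTo: alice@example.com\nCc: bob@example.org\n")
puts harvest_csv(sample)
```

Swapping the StringIO for STDIN would let you pipe a mailbox file into it.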

At this point you'll be wondering why on earth not use grep? Because to my knowledge grep only matches line-by-line, whereas I wanted something more general than that. And of course it's also the case that once you actually code it up, you have all the freedom you could ask for, rather than being limited to what grep does and doesn't do.

Can you make this thing go any faster?
Later on I realized that I can run harvest on just any file, and it would still work. Not that I had just discovered a new continent: strings already extracts all text strings from any file, including binaries. But the difference was that I could search for things. So I found a nice test subject, pagefile.sys, which is Windows's swap file. :D I boot Windows once every few months, and when I do I rarely remember what I was doing last time. But apparently I had decided at some point that the swap file should be 1.5gb.

So I run harvest on it, while keeping an eye on things in htop. And ouch, harvest is consuming the entire file and reading it into memory. Next it's going to search for email addresses in a 1.5gb long string. :D Needless to say, that wasn't a success, the system started choking as it ran out of memory.

So I thought it would be better to buffer the file and read a chunk at a time. The only question is how do I still match for strings in a file of which there is only a chunk available? I wasn't exactly planning on matching super long strings, but then again there is the case where a string you want to find is part in the chunk currently in memory, and part in the next one, so how do you make sure you don't miss it? I tried an algorithm, and it was no good. It turns out for a long time I was barking up the wrong tree, and as a result I rewrote it about five times until I got it right. It is uncanny how the best solution is usually the simplest also.
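The idea can be sketched roughly like this. It is a simplified illustration rather than the script itself: `each_match`, `chunk_size`, `overlap` and the short `EMAIL` pattern are names I'm assuming here. A match that ends exactly at the end of the buffer is held back, because it might continue in the next chunk. Otherwise the last `overlap` bytes are carried over, so a string that is cut in half at the boundary gets a second chance.

```ruby
require "stringio"

# Simplified address pattern, for illustration only.
EMAIL = /[a-zA-Z0-9_.-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,4}/

# Scan io for pattern, reading chunk_size bytes at a time and yielding each
# match. overlap is the longest string we expect to straddle a boundary.
def each_match(io, pattern, chunk_size: 10 * 1024, overlap: 100)
  carry = ""
  while (chunk = io.read(chunk_size))
    buffer = carry + chunk
    pos = 0
    pending = nil
    while (m = pattern.match(buffer, pos))
      if m.end(0) == buffer.length
        pending = m[0] # touches the end of the buffer, may continue
        break
      end
      yield m[0]
      pos = m.end(0)
    end
    if pending
      carry = pending
    else
      rest = buffer[pos..-1]
      carry = rest.length > overlap ? rest[-overlap..-1] : rest
    end
  end
  # Nothing more to read, so whatever was held back is final.
  last = pattern.match(carry)
  yield last[0] if last
end

text = "hello bob@example.com, meet alice@example.org"
each_match(StringIO.new(text), EMAIL, chunk_size: 8, overlap: 40) { |a| puts a }
```

The tiny chunk size in the usage line is only there to force matches across chunk boundaries.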

To make sure it runs at an acceptable speed, I also experimented with the buffer size vis-à-vis speed and memory use, finding that a small buffer is actually better. When hunting for performance problems, it's often a good idea to run your app through a profiler just to be sure that it does what you think it does.

With a 10kb buffer (79s):

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 10.43     8.27      8.27   156723     0.05     0.05  Regexp#match
  8.73    15.19      6.92   618095     0.01     0.01  String#length
  6.87    20.64      5.45   153806     0.04     0.04  IO#read
  4.45    24.17      3.53   153810     0.02     0.02  String#+

This surprised me. I thought it should be spending far more time matching than say reading from disk. So I tried with a bigger buffer to see if I could marginalize disk io in the overall cost.

With a 10mb buffer (115s):

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 94.46   108.96    108.96     3057    35.64    35.64  Regexp#match
  2.52   111.87      2.91      152    19.14    19.14  IO#read
  1.31   113.38      1.51      156     9.68     9.68  String#+
  0.28   113.70      0.32       75     4.27    33.47  Kernel.require

This is more like what I expected, now almost all the time is spent matching. But it actually takes longer (and uses more memory as well, obviously), so there's nothing to gain by increasing the buffer unless the string we're searching for is so long that we need a buffer of megabytes. (Obviously, emails and urls are much shorter than that.)
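If you want to repeat the buffer size experiment on your own machine, a rough sketch along these lines should do. It is not how the numbers above were produced (those came from the profiler). `count_matches`, the simplified `EMAIL` pattern and the sample data are made up for illustration, and since the data sits in memory it says nothing about disk io.

```ruby
require "benchmark"
require "stringio"

# Simplified address pattern, for illustration only.
EMAIL = /[a-zA-Z0-9_.-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,4}/

# Count matches in io, reading chunk_size bytes at a time. Matches that
# straddle a chunk boundary are ignored; this only compares speed.
def count_matches(io, pattern, chunk_size)
  count = 0
  while (chunk = io.read(chunk_size))
    count += chunk.scan(pattern).length
  end
  count
end

# Roughly 1 MB of made-up text with one address per line.
data = "lorem ipsum someone@example.com dolor sit amet\n" * 20_000

[1024, 10 * 1024, 1024 * 1024].each do |size|
  seconds = Benchmark.realtime { count_matches(StringIO.new(data), EMAIL, size) }
  puts format("%8d byte buffer: %.3fs", size, seconds)
end
```

The timings will differ from machine to machine, so treat them as a comparison between buffer sizes and nothing more.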

To profile your script in ruby, try:

ruby -rprofile ninja.rb

The script now runs pretty fast, scanning the Windows pagefile in a couple of minutes, which I'm quite satisfied with.

More fun than a bag of chips, but useful?
I'm sure you're still wondering if the facebook scheme was a success. It wasn't. :D It turns out that out of all the emails harvested, a single one was found on facebook. As popular as the site is in Norway, apparently it doesn't have any Dutch users. :confused:

But since I already had harvest, I thought I would add an option to find urls as well, just for the heck of it. I also discovered that I could run it on any kind of file, not just text files. For instance, if you visited some cool site and forgot to bookmark it, it's probably still in Firefox's history file, so you can do:

harvest.rb --dat < ~/.mozilla/firefox/<profile>/history.dat
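What --dat does in the script is simple: before matching, it deletes every backslash that is immediately followed by a newline, on the assumption that history.dat wraps long urls that way. A tiny illustration, with a made-up url:

```ruby
# A url as it might be stored, wrapped with a backslash at the end of the line.
raw = "http://example.com/some/very/lo\\\nng/path\n"

# Drop each backslash + newline pair so the url becomes one piece again.
joined = raw.gsub(/\\\n/, "")
puts joined
```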

And not just files, either. To take a rather unexpected use case.. say you had an important email address, like for a job interview at the chocolate tasting lab, and you lost it.. well maybe it was swapped out at some point. Use harvest to scan your swap for email addresses:

cat /dev/hdXY | harvest.rb --email

And you can run that on any filesystem actually. :cool: I don't know how to access live memory in the same way, but that would be fun to try also. :cap: Things like zip files won't work, of course, because the text is scrambled, but otherwise (most of the time) you can read text out of any file whether it's a text file or not.

So is it actually useful? Not really. :D

But the useful observation is that your data is right there, and though you may not see it directly, it doesn't take more than this to actually look through it.

#!/usr/bin/env ruby
#
# Author: Martin Matusiak <numerodix@gmail.com>
# Licensed under the GNU Public License, version 3.
#
# revision 3 - allow spaces in urls
# revision 2 - introduce buffering to handle large files out of memory
# revision 1 - performance hacking: output entries immediately, only sort on
# emailcsv


require "optparse"


email = /([a-zA-Z0-9_\.-])+@(([a-zA-Z0-9-])+\.)+([a-zA-Z0-9]{2,4})+/m
url_orig = /([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9\/](([A-Za-z0-9$_.+!*,;\/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;\/?:@&~=%-]{0,1000}))?)/m
url = /([A-Za-z][A-Za-z0-9+.-]{1,120}:\/\/(([A-Za-z0-9$_.+!*,;\/?:@&~(){}\[\]=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9 $_.+!*,;\/?:@&~(){}\[\]=%-]{0,1000}))?)/m

pattern=url
joinlines=false
emailcsv=false
buffer_size=10*1024
hardlimit=100


## parse options
OptionParser.new do |opts|
	opts.on("--url", "url format") do |v|
		pattern = url
	end
	opts.on("--dat", "firefox history.dat format = \\\\n in urls") do |v|
		joinlines = true
	end
	opts.on("--email", "email format") do |v|
		pattern = email
	end
	opts.on("--emailcsv", "csv output (facebook contact import)") do |v|
		pattern = email
		emailcsv = true
	end
end.parse!


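entries = []
# text carried over from the previous chunk, in case a match straddles the
# boundary between two chunks
previous = ""
# keep reading until a read adds nothing new
while string = previous + STDIN.read(buffer_size).to_s and string.length > previous.length do
	partial = ""
	# history.dat wraps long urls with backslash + newline, undo that
	joinlines and string.gsub!(/\\\n/, "")
	while string and m = pattern.match(string) and m.size > 0 do
		# a match that touches the end of the buffer may continue in the
		# next chunk, so hold it back instead of printing it
		m.end(0) == string.length and partial = m.to_s
		if partial.empty?
			if emailcsv
				entries << m.to_s
			else
				puts m.to_s
			end
		end
		pos = m.end(0)
		string = string[pos..-1]
	end
	if !partial.empty?
		previous = partial
	else
		# no pending match: carry over at most hardlimit characters
		if hardlimit < string.length
			previous = string[string.length-hardlimit..-1]
		else
			previous = string
		end
	end
end

# a match held back at the very end of the input is final, so output it
if m = pattern.match(previous)
	if emailcsv
		entries << m.to_s
	else
		puts m.to_s
	end
end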

# special stuff for csv email output
if !entries.empty?
	entries = entries.sort{ |a, b| a.downcase <=> b.downcase }.uniq
	puts '"Email Address","Formatted Name"'
	entries.each { |i| puts '"' + i + '",""' }
end

bugs are better when fun

August 12th, 2007

Gentoo is fabled to be the maverick of a distro that blows up right, left and center. While I find that to be a gross exaggeration, it does occasionally crash in scary ways. :) Although I don't think it's any worse than the others: upgrading Ubuntu from one release to another is still broken (Edgy -> Feisty, check), and last time I tried the Fedora upgrade cycle it was busted as well. Obviously, I wouldn't have been running Gentoo these last five years if it were as fragile as some people think. I actually find it the most sane distro in many ways.

But yesterday I must say I had quite a bit of fun with it. I keep to a fairly recent upgrade path, so I rarely fall behind more than a week. The expat1->2 upgrade had been in the works for some time, but the einfo flew by without me seeing it, as it often does. Interestingly, half the packages on the system depend on expat, and all of them were suddenly broken. The first thing I noticed was Firefox behaving oddly.

Some things cannot readily be captured in a picture, an audio recording, or a video. Not even in a piece of text. A cool bug is one of those things. It has to be investigated, experimented upon, to understand the extent of it. I regret that I had no way to record this work of art for posterity. So Firefox was running happily along, as it had been doing for hours. Then at one point, I open a new tab and type fa in the location field to call up facebook from the drop down list. But I never got to a. Just as I hit f on the keyboard, Firefox crashed.

Of course, Firefox *does* love the crash, it crashes several times a day. But it almost always crashes over Adobe's frightfully stable flash plugin. This was not one of those times, I didn't have flash content in any of the open tabs. Well, never mind, it was probably a coincidence. So I start it up again, I get my tabs back, and again I open a new tab to load facebook. Again I never get to a. :D Hm, this is starting to get interesting.

It turns out that a single key press would crash Firefox, but it didn't have a problem with the mouse. And not just Firefox, any Gtk application. :cool: At this point I was quite amused.

Not surprisingly, bugs are more fun when you have an idea what to do about them. I did track down the expat problem after some digging, but it made me recompile a boatload of packages. All of them seem to work again, although revdep-rebuild hadn't been run for a while and exposed some other problems. One package that didn't recover, however, was digikam. I'm back on 0.8.2-r1 after running 0.9.2 happily for weeks. Odd.

The moral is this: if you're going to crash, find a fun way to crash. :party:

Live Free or Die Hard: how forgettable

August 11th, 2007

As we all know, Die Hard with a vengeance was probably the best action movie of all time. So it's a lot to live up to. Unfortunately, Die Hard 4 falls at the first fence.

As I'm watching the opening sequence I can't even believe that it *is* Die Hard; I thought maybe it was a trailer. But then the title comes up. Basically, this is not a Die Hard movie. These people have no idea what they're doing. Die Hard is about a band of armed robbers with an ingenious plot to steal a ton of money. It's not a disaster movie, and it's not a computer cracker movie. Half an hour into it I was thinking enough with the computer crap, already!

And I wasn't the only one: every 15 minutes John McClane was asking wtf is going on, he had no idea. I mean the whole point of Die Hard is for John McClane, a clever cop cut way above the dumb-cops stereotype, to overturn the plot. But he had no clue what was going on here. If McClane can't figure it out, there's something very wrong with the story.

This movie is a story about computer crime retrofitted with John McClane.

The story isn't terrible, it has its merits. I think it's a quite acceptable computer-terrorist-takeover plot as they come. But since they call it Die Hard I'm going to continue discussing it on that premise. So let's focus on the bright points, cause we need to savor them.

Bruce Willis is indeed a bit gray haired for this role. But considering how estranged he is from the plot, he does a decent job. Basically he's the best feature this movie has. His cracker/stoner companion starts off very lame, but he comes along.

But now let me ask this. Did they not have any money for this movie? The casting is like a who's who of bad actors. Bowman is the worst FBI director of any movie ever. He's completely clueless, gutless, and worthless. He has no idea what's going on, and absolutely no concept of what to do. Then there's the villain Thomas Gabriel. Now if you know anything about Die Hard, you know that the whole success of the story is predicated upon a great bad guy. Simon Gruber (Jeremy Irons) was a *genius* in Die Hard with a vengeance, he made the story a success. Thomas Gabriel, meanwhile, is a puny security expert gone loco with a plot to steal billions of dollars. If Bowman is the least convincing character, Thomas Gabriel is a close second. Gruber was a psycho; Gabriel cries on the phone when he finds out his girlfriend is dead. This is supposed to be a Die Hard villain? He looks more like an insurance salesman.

Then there's Mai, Gabriel's right hand. How relieved was I when she was killed. It's like a contest of who can make the worst fit for their role. Then there's Gabriel's squad of French/Italian terrorists/crackers/soldiers. This is a very odd mix of outsourced personnel, they don't even speak English. Consequently they don't have any terrible lines either, so perhaps that's a plus. Cyril Raffaelli does a decent job with the parkour, but frankly if you want to see coolass parkour, you'll go see Banlieue 13, which is much better at that.

The action sequences are for the most part terribly misguided. Here's the thing: if you want to do an action sequence, you have to build up the plot first, so that it culminates in the action. In Die Hard 4 you just have a lot of very random action bits. Like the car falling into the elevator shaft - Jurassic Park already did that several times with a bus, enough already. Then there's the helicopter-assassinated-by-car idea, which is beyond ludicrous. Apparently a speeding car hitting a toll booth is supposed to elevate some 30m right into a chopper of killers. Can't you at least try to make it believable? The most complicated action sequence must be the fighter-jet-hunting-a-semitrailer. And it's not really that bad, it's just that you have to somehow buy the story to enjoy it, which is unlikely. Also, it's high time for Hollywood to stop telling us that you can drive a car at high speed and come to an instant stop without a scratch. No one is buying it.

I also don't buy the soldier/cracker idea. We've all seen so many action movies with terrorists where you have these really big chunks of muscles, expert with weapons and combat. But in this movie, they try to make them computer experts too, which isn't convincing.

Also, why are crackers always so shy and timid? Trey is Gabriel's geek-who-makes-it-all-possible friend, and he's like every cracker in every terrorist movie, full of scruples and hesitation. Theo, from the first Die Hard, was much better - he was actually evil. Sure he didn't kill anyone, but it didn't deter him either. He even tried to make away with the money when everyone else was taken out of action.

Saving grace that he is, John McClane struggles to fit into this plot. In Die Hard with a vengeance, he fought the bad guys, but he was just trying to catch them. In Die Hard 4 he actually announces his intention to "kill them all". This is not the John McClane we know.

A surprisingly well kept secret is that it's the story that drives a good action movie. That is why Die Hard with a vengeance was a masterpiece. If not, it's just boring combat and shooting. A fact lost on this guy and the 44,517 people on imdb.com. This review gets it right:

Boring characters, crappy script, interesting fight scenes. But fight scenes never make a movie. The worst die hard in my opinion, even worse than die hard 2. At least die hard 1 and 3 had interesting tutonic villianry.

At the end of the day, it's not such a terrible movie, it's just not Die Hard. Which is a shame, considering what they advertise on the movie poster. If they didn't do that, it would be a much quieter movie, with a fraction of the people coming to see it, and they'd probably be more satisfied than those of us who wanted a Die Hard movie.

parking authorities can kiss my ass

August 8th, 2007

There was a time when you could park for free in the city. Obviously not in the main street, but there were areas that if you knew about them you could still park pretty close to the center. But today it's basically impossible. They've covered every cm².

It's not even that parking costs money. Because sometimes it costs an arm and a leg, but let's leave that aside for the moment. It's how incredibly redundant it is. What exactly are we paying for? It's a completely pointless tax. People have cars so obviously they need to park. Do we park less because it costs money? I doubt it, I don't see rows of free parking spaces. In fact, often it's hard to find a spot even when you're paying. And it's getting worse. So what is it for? No one is prevented from buying a car, so what the hell do they expect us to do with them? If they want to reduce traffic then have some guts and actually make inner cities pedestrian-only.

And paid parking is an incredible annoyance, because it's not enforced either. It's an idiotic system. Basically you park and it's up to you. If you don't buy a ticket, there's every possibility that you will get away with it. Or you could be slapped with a fine. And the fines are obscene amounts of money. So people speculate - they don't pay when they're parking for a short time, or they only pay for as much as they think they need. Of course, the second your ticket expires the humble civil servant of a parking attendant is within his rights to slap that insane fine on you.

I could understand paid parking if it were a fee for something, like some expense they had to cover. Some of those things do make sense, like a toll to pay for a new bridge or tunnel. But parking fees are completely pointless. And to realize that, you only have to see how it works. First of all, they have never ever said that the fees go to some special important cause. It is merely money into the city coffers. And just how sensibly cities spend their money I think we've all witnessed. Secondly, it's not enforced at all. You can get away with not paying if you're lucky or you know a certain place isn't monitored as much. In fact, some places are crawling with parking attendants while some aren't. Again this says that it's not an organized process, it's just a contest to write the most fines they possibly can.

And finally, there is no regulation as to how much you pay, how much parking costs or what exactly it is you pay for. Parking fees vary wildly, and without any semblance of order. There is never a reason given as to why the fee has to be increased, it just goes up. And if you ask yourself what it is you're paying for, there is no obvious answer. If you park for 10 minutes and you pay for 5, you pay half price. But if you park for 30 minutes and it turns out you only need 10, you're royally ripped off. The whole system is rigged so that they get as much as they can from you. To be safe, you should pay for more than you need, just in case you need more time. So you end up with a spare 20 minutes. Now someone else comes around and takes your spot. If parking fees were some kind of real estate payment, that person would now get 20 minutes free parking. Because the spot has already been paid for. But no, it doesn't work that way, the new guy has to pay for the time I've already covered. So the payment isn't actually for anything, a product or service that costs a certain amount, it's just paying for the sake of paying.

It would make a lot more sense if you were somehow charged when pulling out (like in a parking garage). And then just paying for the time you parked. But you can't do this, can you? If you didn't pay and the parking attendant is writing you a ticket just as you get back, he won't accept that you pay for the time you used (which would be sensible), he will give you the fine anyway. Why? Because it's only paying for the sake of paying, and the more they can squeeze you for the happier the leeches are.

And what about parking attendants? This is the definition of redundant. It almost sounds like a scheme the government introduced to lower unemployment. An utterly pointless job that is entirely self serving. Think how depressed those people must be on the weekends. Here they landed a job and there isn't a single person in the world who thinks what they're doing is even a tiny bit useful. And yet it's our taxes that are paying their wages, isn't that amazing?

The Bourne Supremacy

August 6th, 2007

Robert Ludlum is just really good at these spy tales. Intriguing, complex, coherent plots that hold together. With plenty of skill and some gadgets thrown in. A touch of combat, but not all that much, it's all about the hunt, and about finding the truth. About messing with their heads. :D

Dare I say Matt Damon is quite good in this role. I like him a lot better than in The Bourne Identity. He's not an awkward lost kid, that just didn't fit well for a spy. He's a resolute character who doesn't hesitate or "need to think". The trench coat may be a bit much, he never takes it off.

It's the classical agency-that-lost-the-agent plot. But he's found, and he has to find out who and what. It's a neat story, and in less than two hours they wrap it up. I do wonder if there isn't more to the story, though. If Ludlum didn't in fact craft a more complicated tale. As far as I can see, there's nothing missing from the movie, everything fits. What is a bit unsettling is just how many consecutive car crashes Bourne [and his car!] can survive without a scratch. :/

I must say that these intelligence operatives do sound more impressive in a book than they do in a movie like this. In that scene in the Berlin square with the tram, they looked rather disoriented and helpless as he blended into the crowd and went unnoticed. And all that CIA surveillance was not much help, tssk.

John Powell did more than a good job on the score, it is very compelling. :thumbup: