numerodix blog

Archive for the ‘technology’ Category

welcome to the family, Trekstor

September 26th, 2006

I bought a usb stick 6 months ago. I was skeptical, as I didn't think I would really need it, and it turns out it was only useful for about 2 weeks. Now it's just lying on a shelf waiting to be used for something again. I bought the cheapest one they had at Media Markt, I believe it was about €22 for 256mb. The brand was Trekstor, a company I had never heard of before, but on the box they print that the stick is compatible with Linux, how unusual. So I bought it, and it was. After all, it's just a usb mass storage device, of course it's compatible, it's the most common type of usb device. But it's refreshing to get a break from the "requires Windows" mantra, for a company to have the guts to print "Windows/Mac/Linux" on the box.

So the other day I was looking for an external usb drive, cause my laptop drive isn't that big, and lo and behold, there's Trekstor again. Again they're the cheapest and they even have a penguin on the box. So I went Trekstor again, a nice quiet 200gb external drive is now the latest addition to my [very short] list of gadgets. The little printed manual doesn't mention Linux at all, but the instructions for Windows are exceedingly simple, and Linux users don't need that hand holding anyway, if there's a driver for the device somewhere on the internet, they will find it and figure out how to use it.

It turns out Trekstor also manufactures mp3 players. If and when my iRiver dies, I will seriously consider going with Trekstor. They may not have the strong audio focus of iRiver, but they support ogg (which so few companies do) and their players are based on... usb mass storage, just like the usb stick and the external hard drive. And since iRiver seem to have gone completely native with DRM, it's time to look for another vendor anyway.

Posted in en, reviews | 3 Comments »

cutting the fat off binaries

September 20th, 2006

What's amazing to me is that for every technical problem, there's already been lots of people who've thought about it and tried to solve it. Given the world population is at some 6 billion, that's not really surprising, nevertheless it's very satisfying. Like lately I've been having thoughts about writing a small application using the Qt library. I haven't even begun designing it, I've just been doing preliminary research. My concern is that the program should only be a single binary and it should be as small as possible. The reason for this aim is that I want to make it as accessible as possible - it should require only a small download, and no installation necessary. So if it's a single binary, that's the easiest way to accomplish this.

But, of course, using any library at all already adds filesize in the shape of dependencies. Since I only want a single binary, I'm looking to compile statically, which will include all the library code that I'm using. Qt in particular is a huge library. It probably adds up to about 15mb of library objects, and I don't want all of that in my "little" binary.

Let's do this by example. A few years ago I was dealing with High Dynamic Range (HDR) images, I even wrote a tutorial on how to produce them starting with pictures taken with a digital camera. I used Greg Ward's hdrgen utility for this. Greg's program is a single static binary. Now, if Greg had the same goal as I do, there are a couple of things he could have done.

$ ls -lh hdrgen
-rwxr-xr-x 1 alex users 8.7M Oct 24 2003 hdrgen

So the file is over 8mb in size, is there any way we can shrink it?

$ file hdrgen
hdrgen: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.0.0, statically linked, for GNU/Linux 2.0.0, not stripped

The part we're looking for is at the very end of the output: "not stripped". It means the file contains a bunch of symbols that aren't strictly necessary for it to run, symbols that ease debugging or relocating the binary. Also note that the binary is statically linked, which means it does not depend on any libraries to run.

The first thing we can do is strip the binary. Stripping removes symbols and leaves only the bare essentials.

$ strip -s hdrgen
$ file hdrgen
hdrgen: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.0.0, statically linked, for GNU/Linux 2.0.0, stripped
$ ls -lh hdrgen
-rwxr-xr-x 1 alex users 1.9M Sep 19 22:59 hdrgen

As expected, the binary is now stripped. Notice also that the filesize has been reduced from 8.7mb to just 1.9mb! That's pretty sweet.

But it doesn't end here. A further way to reduce filesize (for static binaries only!!), is to compress them. UPX is a way to compress binaries to reduce their size further. It is a lossless compression method (otherwise it would be useless, of course), which bundles the compressed binary, and everything needed to uncompress it, in a single file.

$ upx -9 hdrgen
                     Ultimate Packer for eXecutables
   Copyright (C) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004
UPX 1.25         Markus F.X.J. Oberhumer & Laszlo Molnar        Jun 29th 2004

        File size         Ratio      Format      Name
   --------------------   ------   -----------   -----------
   1889480 ->    694035   36.73%    linux/386    hdrgen

Packed 1 file.
$ file hdrgen
hdrgen: ELF 32-bit LSB executable, Intel 80386, version 1, statically linked, corrupted section header size
$ ls -lh hdrgen
-rwxr-xr-x 1 alex users 678K Sep 19 22:59 hdrgen

The binary is now compressed. As you can see, the file utility is having some problems understanding what it is, because of the added compression. But the binary is definitely smaller, down to just 678kb!
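To put numbers on the last step, here is a small Python sketch (my own addition, not part of the workflow above) that recomputes the ratio UPX reports from the two byte counts in its output. The function name ratio_percent is made up for illustration; the byte counts are the ones UPX printed.

```python
def ratio_percent(before, after):
    """Return the size after packing as a percentage of the size before."""
    return 100.0 * after / before


# Byte counts copied from the UPX output: stripped binary -> packed binary.
stripped_size = 1889480
packed_size = 694035

print("packed/stripped: %.2f%%" % ratio_percent(stripped_size, packed_size))
print("packed size in KiB: %.1f" % (packed_size / 1024.0))
```

The first line reproduces the 36.73% from the UPX table, and the second shows where the 678K in the ls listing comes from.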

Posted in en, technology | 1 Comments »

scanning for hosts on the local network

September 17th, 2006

One of the first things that piques my curiosity when I find myself in a new network environment is "what's around me?". To me it feels a bit like waking up from a dream and not remembering where I am or how I got there, so I want to look around a bit. I wrote a little script for this, and I can't say that it was terribly effective. It was based on using ping to send packets to every possible host on the current network (i.e. the one I'm connected to presently). The scan was sequential, so it would ping 10.0.0.1, then ping 10.0.0.2 and so on. Most of these addresses had no hosts bound to them, so the scan would take forever for the ping to time out and move on to the next host. It would actually take so long (10min+) that in a wireless network, clients would come and go between the start and the end of the scan.

I didn't use ping because it was such a great choice for this problem, just that it was the first thing that occurred to me. I did get the script to run a bit faster by parallelizing the pings, but this is a very silly thing to do, because with a Class C network, there are now 254 instances of ping running on the system. This would often drown out the packets from the hosts which were connected and the script would fail to report any hosts at all. I'm not sure why that is, but I improved the situation a bit by pausing for one second before starting every new thread.
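A different way to keep the pings from piling up is to cap how many run at once rather than spacing out the thread starts. The following is a minimal Python 3 sketch of that idea, not the script from this post: the names ping_once and scan, the worker count of 16 and the one second timeout are all assumptions of mine, and the ping flags are the ones from Linux iputils.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ping_once(host):
    """Return True if one ping to host gets a reply (Linux iputils flags)."""
    cmd = ["ping", "-c1", "-n", "-W1", host]
    return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0


def scan(hosts, probe=ping_once, workers=16):
    """Probe every host, with at most `workers` probes running at a time."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the results in the same order as the input hosts
        results = list(pool.map(probe, hosts))
    return [host for host, up in zip(hosts, results) if up]
```

Calling scan("10.0.0.%d" % i for i in range(1, 255)) would cover the same 254 addresses with never more than 16 ping processes alive. The probe is a parameter so the pool logic can be tried out without touching the network.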

Just the other day I stumbled upon a mention of using nmap to do this same thing. Sure enough, nmap was *designed* for this, so it should be the obvious choice. Somehow that never occurred to me. So I rewrote my little script to use nmap in place of ping. nmap does essentially the same thing as my script did, it pings hosts in parallel, but it does so without forking itself 254 times and it has some clever algorithms that monitor the state of the network to get the best throughput with the least congestion. To put that in plain English, here's a little comparison for a scan across 254 IP addresses:

parallel nmap: 0m 5.868s
parallel ping: 4m 15.912s

In other words, the ping method is absolutely rubbish. But, while I always have nmap available on my laptop, it's not an application that is installed by default on every system (unlike ping), so perhaps it would be handy to be able to fall back on the ping method, as lame as it is, if that's all we have.
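The fallback decision itself is tiny. As a sketch in Python 3 (the function name pick_scanner is made up here, and this is a stand-in for a cmd_exists-style check, not the code from the script), the standard library's shutil.which can answer "is nmap on the PATH?" without shelling out:

```python
import shutil


def pick_scanner(which=shutil.which):
    """Return 'nmap' if it is found on the PATH, otherwise fall back to 'ping'."""
    return "nmap" if which("nmap") else "ping"


print("Scanning with:", pick_scanner())
```

shutil.which returns the full path of the command or None, so there is no output of the which command to parse.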

Another small refinement is checking ifconfig for network info, so the user doesn't have to supply this manually. Again, this could fail (no privileges, no ifconfig), so it's made to be an option, not a requirement.

#!/usr/bin/env python
#
# Author: Martin Matusiak <numerodix@gmail.com>
# Licensed under the GNU Public License, version 2.
#
# revision 2 - add hostname lookup


import os, string, re, sys, time, thread


def main():
	network = None
	try:
		netinfo = check_network()
		(ip, mask) = netinfo
		network = ip + "/" + mask
	except:
		print "Warning: No network connection found, scan may fail."

	if len(sys.argv) > 1:
		network = sys.argv[1]

	if not network:
		print "Error: No network range given."
		print "Usage:\t" + sys.argv[0] + " 10.0.0.0/24"
		sys.exit(1)


	if cmd_exists("nmap"):
		nmap_scan(network)
	else:
		print "Warning: nmap not found, falling back on failsafe ping scan method."
		ping_scan(network)


def nmap_scan(network):
	try:
		print "Using network: " + network
		cmd = 'nmap -n -sP -T4 ' + network + ' 2>&1'
		res = invoke(cmd)
		lines = res.split('\n')
		for i in lines:
			m = find('Host\s+\(?([0-9\.]+)\)?\s+appears to be up.', i)
			if m:
				print m, "\t", nslookup(m)
	except: pass


def ping_scan(network):
	iprange = find('(\w+\.\w+\.\w+)', network)
	print "Using network: " + iprange + ".0/24"
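	for i in range(1, 255):  # hosts .1 through .254
		host = iprange + '.' + str(i)
		thread.start_new_thread(ping, (host, None))
		time.sleep(1)
	# give the last pings time to finish before the main thread exits
	time.sleep(5)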


def ping(host, dummy):
	try:
		cmd = 'ping -c3 -n -w300 ' + host + ' 2>&1'
		res = invoke(cmd)
		if "bytes from" in res: print host, "\t", nslookup(host)
	except: pass


def nslookup(ip):
	if cmd_exists("host"):
		cmd = 'host ' + ip + ' 2>&1'
		res = invoke(cmd)
		if "domain name pointer" in res:
			return res.split(" ")[4][:-2]
	return ""


def check_network():
	cmd = "/sbin/ifconfig"
	res = invoke(cmd)

	iface, ip, mask = None, None, None
	lines = res.split('\n')
	for i in lines:
		
		# find interface
		m = find('^(\w+)\s+', i)
		if m: iface = m
		
		# ignore loopback interface
		if iface and iface != "lo":
			
			# find ip address
			m = find('inet addr:([0-9\.]+)\s+', i)
			if m: ip = m
			
			# find net mask
			m = find('Mask:([0-9\.]+)$', i)
			if m: mask = m

	if ip and mask:
		mask = mask_numerical(mask)
		return (ip, mask)


def mask_numerical(mask):
	segs = find('(\w+)\.(\w+)\.(\w+)\.(\w+)', mask)
	mask = 0
	adds = (0, 128, 192, 224, 240, 248, 252, 254, 255)
	for i in segs:
		for j in range(0, len(adds)):
			if int(i) == adds[j]:
				mask += j
	return str( mask )


def find(needle, haystack):
	try:
		match = re.search(needle, haystack)
		if len(match.groups()) > 1:
			return match.groups()
		else: 
			return match.groups()[0]
	except: pass


def invoke(cmd):
	(sin, sout) = os.popen2(cmd)
	return sout.read()


def cmd_exists(cmd):
	if invoke("which " + cmd + " 2>&1").find("no " + cmd) == -1:
		return True
	return False



if __name__ == "__main__":
	main()

The output looks like this:

Using network: 192.168.2.119/24
192.168.2.1
192.168.2.119	james.home.lan

The first host listed, whose address ends in a 1, is often a router. Then there's the host transmitting the scan, that is localhost. At the time of the scan there were no other hosts connected on the network. Of course, beyond finding hosts, there's a lot more one can find out about them using.. *drumroll*.. nmap.
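As an aside on the "Using network" line: the script turns the dotted netmask from ifconfig into a prefix length by hand. In Python 3 the standard ipaddress module can do that conversion, and it also normalizes the address to the network's base address. A small sketch, assuming the address and netmask shown in the output above (the function name describe is mine):

```python
import ipaddress


def describe(ip, netmask):
    """Return (network in CIDR form, prefix length, number of usable hosts)."""
    # strict=False lets us pass a host address instead of the network address
    net = ipaddress.ip_network("%s/%s" % (ip, netmask), strict=False)
    return (str(net), net.prefixlen, sum(1 for _ in net.hosts()))


print(describe("192.168.2.119", "255.255.255.0"))
# → ('192.168.2.0/24', 24, 254)
```

The 254 usable hosts are the same 254 addresses that the scans above cover.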

Update: I added a name lookup feature so that if there is a nameserver on the network, you not only get ip addresses, but hostnames as well. :)

Posted in code, en | 2 Comments »

fixing greedy emoticon matching in kopete

September 13th, 2006

I have a lot of admiration for the KDE project. The way that things come together and integrate into a common desktop with KDE is quite extraordinary. And all the time there are people interested in improving just about every bit of it. Now, of course, it's all about KDE4, the long awaited upgrade will come at some point in 2006, I guess the date hasn't been set yet.

Anyway, the beauty of free software is that if there's a bug that gets to you, you can fix it yourself. And one such bug irks me in Kopete. I've been testing the xtorg emoticon theme and with a fairly rich set of emoticons (82 images, 117 replacement strings), it's quite a good testset and exposes certain problems. The emoticon theme comes with a file called test_suite.txt, which just lists all the emoticon replacement strings, so that you can paste them into a chat client and see if they come up correctly. The special thing about the xtorg theme is that I've made sure to include the most common Msn Messenger strings, so that Windows people can reuse the ones they're used to already. In Kopete 0.12.2, using the test suite gives this result.

So evidently, Kopete's parsing of emoticons is not as good as it could be. I have examined the issue and found that the problem lies in non-greedy matching. This means that if : s and : s t a r : are both defined as emoticon strings, and : s just happens to appear before : s t a r : in Kopete's internal list of emoticons, : s t a r : will be parsed as [: s] t a r :, not as [: s t a r :]. This is not what the user expects, having defined a list of replacement strings, the user expects all of them to work.
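The effect is easy to reproduce outside Kopete. Below is a toy matcher in Python, which is my own illustration and not Kopete's code: it tries the emoticon strings in list order at every position and takes the first one that fits. With the shorter string listed first, the longer one can never match; trying the strings longest first gives the greedy behaviour the user expects.

```python
def replace_emoticons(text, emoticons):
    """Bracket the first emoticon (in list order) that matches at each position."""
    out = []
    pos = 0
    while pos < len(text):
        for token in emoticons:
            if text.startswith(token, pos):
                out.append("[" + token + "]")
                pos += len(token)
                break
        else:
            # no emoticon starts here, copy the character through
            out.append(text[pos])
            pos += 1
    return "".join(out)


emoticons = [":s", ":star:"]
print(replace_emoticons("hi :star:", emoticons))
# → hi [:s]tar:
print(replace_emoticons("hi :star:", sorted(emoticons, key=len, reverse=True)))
# → hi [:star:]
```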

This is not the kind of bug that will affect a lot of users, because the average user does not use big emoticon styles like this one (and will probably never encounter the error). Thus if I were to report the bug, it's not likely to be very high priority. Meanwhile it does bother *me*, so I thought I would try and fix it myself. So after a little hacking, I wrote a patch for Kopete, and it now does this.

I've reported this on KDE Bugzilla and, Kopete developers willing, the fix will find its way into Kopete at some point. In the meantime, the patch is attached below.

diff -Naur kopete-0.12.2/kopete/libkopete/private/kopeteemoticons.cpp kopete-changed/kopete/libkopete/private/kopeteemoticons.cpp
--- kopete-0.12.2/kopete/libkopete/private/kopeteemoticons.cpp	2006-08-12 02:51:47.000000000 +0200
+++ kopete-changed/kopete/libkopete/private/kopeteemoticons.cpp	2006-09-13 07:20:28.000000000 +0200
@@ -48,6 +48,8 @@
 struct Emoticons::Emoticon
 {
 	Emoticon(){}
+	/* sort by longest to shortest matchText */
+	bool operator< (const Emoticon &e){ return matchText.length() > e.matchText.length(); }
 	QString matchText;
 	QString matchTextEscaped;
 	QString	picPath;
@@ -424,6 +426,7 @@
 		node = node.nextSibling();
 	}
 	mapFile.close();
+	sortEmoticons();
 }
 
 
@@ -492,9 +495,24 @@
 		node = node.nextSibling();
 	}
 	mapFile.close();
+	sortEmoticons();
 }
 
 
+void Emoticons::sortEmoticons()
+{
+	/* sort strings in order of longest to shortest to provide convenient input for
+		greedy matching in the tokenizer */
+	QValueList<QChar> keys = d->emoticonMap.keys();
+	for ( QValueList<QChar>::const_iterator it = keys.begin(); it != keys.end(); ++it )
+	{
+		QChar key = (*it);
+		QValueList<Emoticon> keyValues = d->emoticonMap[key];
+ 		qHeapSort(keyValues.begin(), keyValues.end());
+ 		d->emoticonMap[key] = keyValues;
+	}
+}
+
 
 
 
diff -Naur kopete-0.12.2/kopete/libkopete/private/kopeteemoticons.h kopete-changed/kopete/libkopete/private/kopeteemoticons.h
--- kopete-0.12.2/kopete/libkopete/private/kopeteemoticons.h	2006-08-12 02:51:47.000000000 +0200
+++ kopete-changed/kopete/libkopete/private/kopeteemoticons.h	2006-09-13 07:19:17.000000000 +0200
@@ -156,6 +156,12 @@
 	 * @see initEmoticons
 	 */
 	void initEmoticon_JEP0038( const QString & filename);
+	
+	/**
+	 * sorts emoticons for convenient parsing, which yields greedy matching on
+	 * matchText
+	 */
+	void sortEmoticons();
 
 
 	struct Emoticon;

EDIT: The original conclusion of this entry was that Gaim has parsing bugs too. This seems to be incorrect. A fresh new screenshot shows that Gaim handles the xtorg theme just fine.

UPDATE: The patch was accepted verbatim into kopete svn, so the next release (kde-3.5.5), whenever it will be, should have this problem fixed. :)

Posted in code, en | No Comments »

find sizes of installed packages

September 9th, 2006

Sometimes, especially when disk space is low (or when system backups grow unreasonably large), it's nice to know exactly how much space the biggest packages occupy. Obviously, OpenOffice is never above suspicion, but certain others can take up way more space than you would think.

I wrote a little script to print the size of all packages installed on the system. It uses the CONTENTS file for every installed ebuild to check the size of the files which belong to a package and give a sorted listing of packages by size.

#!/usr/bin/env python
#
# Author: Martin Matusiak <numerodix@gmail.com>
# Licensed under the GNU Public License, version 2.
#
# revision 1 - bugfix for paludis symlink in pkgdb

pkgdb = "/var/db/pkg"


import os, string, stat
from operator import itemgetter

sizes = {}

cats = os.listdir(pkgdb)
for c in cats:
	cpath = os.path.join(pkgdb, c)
	if os.path.isdir(cpath):
		cat = os.listdir(cpath)
		for p in cat:
			size = 0
			
			cont = os.path.join(pkgdb, c, p, "CONTENTS")
			fd = open(cont, 'r')
			
			strings = fd.readlines()
			for s in strings:
				line = string.split(s, " ")
				if line[0] == "obj" and os.path.exists(line[1]):
					size += os.path.getsize(line[1])
			
			fd.close()
			
			sizes[os.path.join(c, p)] = size

pkglist = sorted(sizes.items(), key=itemgetter(1))

for i in pkglist:
	(size, pkg) = ( str(i[1]), i[0] )
	print string.rjust(size, 11), " ", pkg

The output looks like this:

          0   virtual/x11-7.0-r2
         66   kde-base/kde-env-3-r4
        393   kde-base/kdebase-pam-6
        889   sys-apps/coldplug-20040920-r1
...
      94217   net-ftp/ftp-0.17-r6
      94642   kde-base/kcminit-3.5.3
      95629   sys-process/psmisc-22.2
      95931   sys-apps/ivman-0.6.12
      97358   app-admin/gnomesu-0.3.1
...
  122593614   dev-java/sun-jdk-1.5.0.08
  132864794   dev-lang/ghc-6.4.2
  145477793   app-text/tetex-2.0.2-r8
  221943002   sys-kernel/gentoo-sources-2.6.17-r7
  340336824   app-office/openoffice-bin-2.0.3

Unsurprisingly, OpenOffice claims victory, but this is a small reminder about how big kernel sources are. Tetex and GHC aren't minimalistic either.
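The raw byte counts get hard to read at the top end of the list. Here is a small helper, as a sketch in Python 3 and not part of the script above (the name human and the choice of binary units are mine), that formats the same numbers in KiB/MiB/GiB:

```python
def human(size):
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    units = ["B", "KiB", "MiB", "GiB"]
    value = float(size)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            break
        value /= 1024
    if unit == "B":
        return "%d B" % size
    return "%.1f %s" % (value, unit)


# Two entries taken from the listing above.
for size, pkg in [(889, "sys-apps/coldplug-20040920-r1"),
                  (340336824, "app-office/openoffice-bin-2.0.3")]:
    print(human(size).rjust(11), " ", pkg)
```

With this, OpenOffice shows up as 324.6 MiB instead of a nine digit number.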

UPDATE: Paludis bug fixed.

Posted in en, gentoo | 3 Comments »