Archive for the ‘technology’ Category

kwin leaks memory

May 27th, 2008

Something is very wrong here. Right after starting a KDE session everything looks normal.

But after running for a day we have a different story. I'm assuming this isn't the expected behavior (if so I didn't expect it).

This time I specifically took a screenshot to prove it, but I've seen it eat up as much as 1.3GB of my memory, which is rather unnerving.

kwin-kde4        4:4.0.4-0ubuntu1

Bug report.

sorting algorithms

May 12th, 2008

People always seem to say that data structures and algorithms are supposed to be embarrassingly obvious to anyone who's a coder. I wonder to what extent this is actually the case. I recall taking a course on this way back in college when I wasn't quite ready for it yet, and not a whole lot of it remains in memory.

So I thought it would be fun to revisit that stuff and write some of the algorithms. As a bonus, why not write them in C, a language I know almost nothing about, just to add some spice to the menu. The way I was doing it is I would look up the algorithm on 'pedia, figure out what it does, and then try to write it. And if I got totally stuck I would peer at the code to debug my broken mental model. This only happened once.

From the outset, I was thinking that it would be useful to compare them on performance as well, so timing would definitely be involved. I remember we did some timing back in college, which was pretty amateurish. Clocking runtimes and curve fitting, that was lame. It doesn't address the core of the problem. We don't actually care whether an algorithm is fast or slow, that's not the point. What we're really interested in is how much work it's doing. Speed is just a function of that. Naive algorithms do too much work, they go grocery shopping and leave the shopping list at home, so they have to go home and back for every item to see what it was.

So taking that concern into account, I devised a simple struct to accumulate some metrics as the algorithm is executing. cmp stores the number of comparisons made during execution, ie. comparing elements of the array against each other. ass stores the number of assignments of array elements. rec stores the number of function calls to the particular function (interesting when functions are recursive). Finally, unit and ulen are just type variables that define the type of array elements and array indexes (lengths) respectively.

typedef unsigned long unit;
typedef unsigned long ulen;

typedef struct {
	char *name;
	ulen cmp;
	ulen ass;
	ulen rec;
} Metrics;

With that out of the way, the data types should be clear (as clear as they'll ever be I guess). I actually did it this way so I could switch the type in just one place while I was playing with it. (It's a bit easier to eyeball whether a handful of short values is in order before you go to large datasets and want a wide value space.)

Before we move along to the algorithms themselves, one thing to consider is what sort of data they'll be sorting. Different algorithms have different performance characteristics depending on how much of the data is already sorted. I decided it would be most instructive to give them a real workout, so I'm feeding data from /dev/urandom. This is essentially the worst case scenario, the data is pretty damn random and the stats below reflect this.
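Concretely, producing the input might look something like this. A minimal sketch (my reconstruction, not necessarily the code in sorting.c), using the unit/ulen types from above to fill an array with bytes read straight from /dev/urandom:

#include <stdio.h>
#include <stdlib.h>

/* Fill a fresh array with random elements from /dev/urandom. */
unit *random_array(ulen length) {
	unit *arr = malloc(length * sizeof(unit));
	FILE *rnd = fopen("/dev/urandom", "rb");
	if (!arr || !rnd || fread(arr, sizeof(unit), length, rnd) != length) {
		fprintf(stderr, "could not read random data\n");
		exit(EXIT_FAILURE);
	}
	fclose(rnd);
	return arr;
}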

One last thing. In order to collect metrics the algorithms were mildly instrumented. I've tried to make this as unintrusive as possible, but it necessarily adds a line here and there. I'm not sure if the metrics are exactly right, but they should at least give you a decent order-of-magnitude idea.
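To show how the listings below might be produced, here is a sketch of a simple driver (my reconstruction, assuming a clock()-based timer and the Metrics struct above; the original harness may well differ):

#include <stdio.h>
#include <time.h>

/* Run one sorting function, time it, and print its metric listing. */
void report(Metrics (*sort)(unit *, ulen), unit *arr, ulen length) {
	clock_t start = clock();
	Metrics ms = sort(arr, length);
	double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
	printf("=== %s ===\n", ms.name);
	printf("calls      : %lu\n", ms.rec);
	printf("comparisons: %lu\n", ms.cmp);
	printf("assignments: %lu\n", ms.ass);
	printf("runtime    : %fs\n", secs);
}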

Bubble sort

The first algorithm making an appearance is bubble sort. What bubble sort does is start at the beginning of the array, and repeat the same action for each position in the array. It takes the element that is found there, and if it's larger than the element on the right, it swaps them. And then it keeps swapping up the array until it finds a neighbor that isn't smaller. Once that happens, it moves on to the next position in the array.

It turns out that when you do this for each position in the array, you get all the elements in order. If you're still skeptical think of it this way. The largest element will have been swapped all the way to the end, so it's definitely in order. The smallest element, no matter where it happened to start out, will have been swapped past by all the larger elements to its left, so it eventually ends up at the very beginning. And the same holds for all the elements in between.

Metrics bubblesort(unit *arr, ulen length) {
	Metrics ms = {"bubblesort", 0, 0, 1};
	ulen swaps = 1;
	unit swap;
	for(ulen i=0; i<length-1 && swaps > 0; i++) {
		swaps = 0;
		for (ulen j=0; j<length-1; j++, ms.cmp++) {
			if (arr[j] > arr[j+1]) {
				swaps += 3;
				swap = arr[j];
				arr[j] = arr[j+1];
				arr[j+1] = swap;
			}
		}
		ms.ass += swaps;
	}
	return ms;
}

=== bubblesort ===
n          : 100000
n*log(n)   : 1660964
n^2        : 10000000000
calls      : 1
comparisons: 9938500614
assignments: 7498934958
runtime    : 30.540000s

As you'll have spotted, ms is the variable carrying the metrics, so any reference to ms.something has to do with collecting metrics, the rest is algorithm logic. Below the code is the metric listing I computed. Just to keep it consistent all these runs were done on 100,000 item arrays. For your convenience, there are also the values for n² and n·log(n), useful reference points for sorting algorithms.

This algorithm stops working once it determines that no more sorting is necessary. In this case, you'll notice the number of comparisons made actually approaches n², which is a pretty good indication that the data was quite random indeed. The assignments counter shows that the swap in the innermost if block was executed once for every four comparisons. I would imagine that assignments are more expensive than comparisons, but I don't really know, and it's really an implementation detail. Obviously, the reason to make a comparison is to set the stage for an assignment, so the two numbers always stay within the same order of magnitude.

The runtime shows how long the algorithm took to execute, but it's a somewhat less useful measure, influenced as it is by how heavy the traffic was to the supermarket that day.

Selection sort

Selection sort works more the way that a human would sort things. It first scans the whole array for the smallest element, and then swaps it with the element in the first position. It then moves one slot over and again scans the rest of the array for the smallest element, swapping it into the second position. This obviously produces a sorted array.

Metrics selectionsort(unit *arr, ulen length) {
	Metrics ms = {"selectionsort", 0, 0, 1};
	unit swap;
	ulen cursor;
	for(ulen i=0; i<length-1; i++) {
		cursor = 0;
		for(ulen j=i+1; j<length; j++, ms.cmp+=2) {
			if (arr[i] > arr[j] && (cursor == 0 || arr[cursor] >= arr[j])) {
				cursor = j;
			}
		}
		if (cursor > 0) {
			swap = arr[i];
			arr[i] = arr[cursor];
			arr[cursor] = swap;
			ms.ass += 3;
		}
	}
	return ms;
}

=== selectionsort ===
n          : 100000
n*log(n)   : 1660964
n^2        : 10000000000
calls      : 1
comparisons: 9999900000
assignments: 299958
runtime    : 27.130000s

Sorting this way is more predictable, because after every pass in the sorting you know that you have a fully sorted sub array at the beginning that needs no further work. And therefore you know that the remainder of the array that is going to be scanned on every subsequent pass will shrink steadily. This is in contrast to bubble sort where even after coming half way you don't know how many comparisons will have to be made on the next pass.

But there are still a whole lot of comparisons to be made to find the smallest element each time. There are far fewer assignments, however, because each element, once located, is put straight into its final position.

Insertion sort

Insertion sort is rather different from the first two. Instead of working on the whole array and finding the global minimum, it sorts a growing part of the array on every pass. Starting at the beginning, it compares the element in the second position with the one in the first position and inserts it where it's supposed to go: either into the first position, or nowhere (it remains in the second position). It then looks at the third element and inserts it where it's supposed to go, and so on.

The difference here is that the portion that has been sorted isn't final yet, because there are a bunch of elements remaining that will be inserted somewhere in between those we have in order.

Metrics insertionsort(unit *arr, ulen length) {
	Metrics ms = {"insertionsort", 0, 0, 1};
	ulen j;
	unit swap;
	for(ulen i=1; i<length; i++, ms.ass+=2) {
		swap = arr[i];
		/* j counts down; ulen is unsigned, so test j>0 and look at arr[j-1]
		   to avoid wrapping around below zero at the front of the array */
		for(j=i; ms.cmp++, j>0 && arr[j-1] > swap; j--, ms.ass++) {
			arr[j] = arr[j-1];
		}
		arr[j] = swap;
	}
	return ms;
}

=== insertionsort ===
n          : 100000
n*log(n)   : 1660964
n^2        : 10000000000
calls      : 1
comparisons: 2499744985
assignments: 2499844984
runtime    : 3.860000s

Insertion sort is substantially more efficient, and the big win comes from not looking at the whole array to find the right element. Instead, elements are admitted in the order they stand, and put away in the right place. This dramatically reduces the number of comparisons and assignments alike. However, there is still the undesirable case of inserting a small element near the beginning of a long sorted sub array, which means moving all the elements to its right up the array by one.

But even though it is more clever, we can't get away from the fact that insertion sort is still an O(n²) algorithm, as the numbers show. (On random data each element travels about halfway back on average, which works out to roughly n²/4 comparisons, and that is just what the listing says.)

Quicksort

In order to break the n² barrier we have to look at algorithms that actually guarantee never having to look through the whole array, because that's what really takes a lot of work. Quicksort is an odd one, because it doesn't give this guarantee, but it's sufficiently unpredictable that it almost never ends up in that worst case.

Here's how it works. It picks a random element called the pivot. This can be any element really, in our case it will be the first one in the array. Now what we want to achieve is that the pivot is put in the right place. So we run through the whole array (but just the first time!) and assemble two new arrays, for those elements smaller than the pivot, and those greater. We don't know anything else about the elements in these two sub lists, only that they are on either side of the pivot. But by now we've already moved all the elements that belong to the left of the pivot over to that side, and vice versa. Which means that we can sort the two halves independently. So for each half we again choose a pivot, assemble two sub lists, and so on.

Eventually this division gives two sub arrays of just one element, with a pivot in the middle, which don't have to be sorted any further. So then we're all done, and we just collect all the sub results into one array.

Metrics quicksort(unit *arr, ulen length) {
	Metrics ms = {"quicksort", 0, 0, 1};
	if (length > 1) {
		unit pivot = arr[0];

		unit *left = malloc(length * sizeof(unit));
		unit *right = malloc(length * sizeof(unit));

		ulen l = 0, r = 0;
		for(ulen i=1; i<length; i++, ms.cmp++, ms.ass++) {
			if (arr[i] < pivot) {
				left[l++] = arr[i];
			} else {
				right[r++] = arr[i];
			}
		}

		arr[l] = pivot;
		memcpy(arr, left, l * sizeof(unit));
		memcpy(&arr[l+1], right, r * sizeof(unit));

		free(left);
		free(right);

		Metrics lms = quicksort(arr, l);
		Metrics rms = quicksort(&arr[l+1], r);

		ms.cmp += lms.cmp + rms.cmp;
		ms.ass += lms.ass + rms.ass + length + 1;
		ms.rec += lms.rec + rms.rec;
	}
	return ms;
}

=== quicksort ===
n          : 100000
n*log(n)   : 1660964
n^2        : 10000000000
calls      : 133599
comparisons: 1923150
assignments: 3979898
runtime    : 0.040000s

What makes the implementation a bit complicated is that it's an in-place sort, so every time we have a pivot and a pair of sub arrays ready, we copy them back into the original array, overwriting it. Quicksort is a bit simpler when it returns a new array, but just to make it consistent with the other algorithms (which are in-place), we do it this way.
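For contrast, the returns-a-new-array variant alluded to above might look roughly like this (an illustrative sketch of my own, without the metrics plumbing; it is not code from sorting.c):

#include <stdlib.h>
#include <string.h>

/* Out-of-place quicksort: returns a newly allocated sorted copy,
 * leaving the input array untouched. */
unit *quicksort_copy(const unit *arr, ulen length) {
	unit *out = malloc(length * sizeof(unit));
	if (length <= 1) {
		if (length == 1) out[0] = arr[0];
		return out;
	}
	unit pivot = arr[0];
	unit *left = malloc(length * sizeof(unit));
	unit *right = malloc(length * sizeof(unit));
	ulen l = 0, r = 0;
	for (ulen i = 1; i < length; i++) {
		if (arr[i] < pivot) left[l++] = arr[i];
		else right[r++] = arr[i];
	}
	unit *ls = quicksort_copy(left, l);
	unit *rs = quicksort_copy(right, r);
	memcpy(out, ls, l * sizeof(unit));
	out[l] = pivot;
	memcpy(&out[l+1], rs, r * sizeof(unit));
	free(left); free(right); free(ls); free(rs);
	return out;
}

Without the copy-back into arr, the partition-recurse-concatenate structure is easier to see, at the cost of allocating a new array at every level.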

Since this is a recursive algorithm, the number of times it gets called depends on the length of the array. We want this number to be high. That might seem strange at first, but we know that the more calls we have, the more evenly we're dividing the array each time, ie. the quicker it's going to get shorter. Recall that at every step we need to scan the entire current sub array so that we can move the pivot into the middle position. And that is a lot of work, so if the sub arrays are half as large each time, this won't be so bad, but if they decrease by only one element (ie. the pivot is always at the very end), it's going to be very tiresome. Notice that the number of comparisons is still a lot higher than the number of calls, so we can afford to maximize the calls. What we get out of it is more calls that each do less work, rather than fewer calls that each do more work.

To make this more concrete, let's look at it more carefully. If we divide the array evenly each time, we first get 1 call on the whole array, then 2 calls on the two halves, then 4 and so on. This gives a total of Sum(2^x, x=0..m), where m = log₂(n), ie. around 260,000 calls (with m = log₂(100,000) ≈ 17). This must also be the upper bound for the number of calls possible, because if you divide each piece up evenly then they cannot possibly be subdivided any more. On the other hand, if we were unfortunate enough to choose the pivot on the end of the array each time, we would have 100,000 calls, but each call would do much more work, and we'd be back in n² territory.
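Written out as a formula (my arithmetic, just making the estimate above explicit):

\sum_{x=0}^{m} 2^x = 2^{m+1} - 1 = 2^{18} - 1 = 262{,}143, \qquad m = \lceil \log_2 100{,}000 \rceil = 17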

This is why Quicksort isn't guaranteed to do n·log(n) work; it depends on the input data. Random data effectively picks the pivot at random for us, whereas on (mostly) sorted data the algorithm should take care to choose the pivot at random itself.
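Choosing the pivot at random is a small change. A sketch of one way to do it (my addition; the code above always takes the first element): swap a randomly chosen element into position 0 before partitioning.

#include <stdlib.h>

/* Swap a random element into position 0, so that taking arr[0]
 * as the pivot behaves well even on (mostly) sorted input.
 * Assumes srand() has been seeded once at startup. */
void randomize_pivot(unit *arr, ulen length) {
	ulen k = (ulen)rand() % length;
	unit tmp = arr[0];
	arr[0] = arr[k];
	arr[k] = tmp;
}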

Merge sort

What we've seen so far is that a recursive algorithm should maximize the number of calls to minimize the amount of work done on each call. Merge sort is an algorithm that takes this to heart. It's less chaotic and more structured than Quicksort, because it always subdivides into equal chunks.

Here's how it goes. First we divide the array in subsequent calls into halves. Once we have arrays of size 1, we start merging each pair of arrays into one. When we're merging we know that each half that needs to be merged is sorted internally. So we take the head of each half, compare the two elements, and stick the smallest one in the result array. We continue this as long as we have elements left in both halves. Once we run out, we know that the half that remains belongs on the end of the result array. And so we've merged two halves into one, which keeps on going until the whole array has been put back together.

Metrics mergesort(unit *arr, ulen length) {
	Metrics ms = {"mergesort", 0, 0, 1};
	if (length > 1) {
		ulen mid = length / 2;
		Metrics lms = mergesort(arr, mid);
		Metrics rms = mergesort(&arr[mid], length-mid);

		unit *temp = malloc(length * sizeof(unit)); 

		ulen i, l = 0, r = mid;
		for(i=0; i<length && l < mid && r < length; i++, ms.cmp++, ms.ass++) {
			if (arr[l] < arr[r]) {
				temp[i] = arr[l++];
			} else {
				temp[i] = arr[r++];
			}
		}
		if (l < mid)
			memcpy(&temp[i], &arr[l], (length-i) * sizeof(unit));
		if (r < length)
			memcpy(&temp[i], &arr[r], (length-i) * sizeof(unit));

		memcpy(arr, temp, length * sizeof(unit));
		free(temp);

		ms.cmp += lms.cmp + rms.cmp;
		ms.ass += lms.ass + rms.ass + (length - i) + length;
		ms.rec += lms.rec + rms.rec;
	}
	return ms;
}

=== mergesort ===
n          : 100000
n*log(n)   : 1660964
n^2        : 10000000000
calls      : 199999
comparisons: 1536114
assignments: 3337856
runtime    : 0.040000s

Here again we gradually overwrite the input array with chunks of sorted elements. This is safe, because at each stage we're working with half the previous array, which we know isn't affected by what happens in the other half. Once both halves are sorted, we merge them together into a temporary result array, and once that is done, write that back over the input array.

As with Quicksort before, rather than iterate over elements in order to copy them across from one array to another verbatim, we use the helpful memcpy function, which probably is a bit faster. But we still count the number of assignments in terms of how many actual elements are being copied when this happens.

Unlike Quicksort, we actually have more calls with this algorithm. There are two reasons for this: a) we don't have pivot elements, so there are more elements to subdivide, and b) the division is exactly symmetrical. In fact, the recursion bottoms out at n single-element arrays, which gives exactly 2n-1 calls (199,999 for n = 100,000, just as the listing says). This reduces the number of comparisons and gets us right under the n·log(n) limit, officially in n·log(n) territory.

So there you have it, 5 different sorting algorithms, each with their own special characteristic. Want to take a closer look? Here is the whole thing:

sorting.c

Build with:

gcc -std=c99 -lm sorting.c

Unfortunately, no one has yet been clever enough to discover a general sorting algorithm that would only do n work. But as should be plain by now, n·log(n) is a dramatic improvement over n². When in doubt just remember: lazy is better. If you had to do this by hand you'd be lazy and figure out a way to get paid the same money for less work, so let your program get the same deal. ;)

EDIT: Fixed a mistake in the selection sort algorithm.

UPDATE: I just realized this code doesn't run particularly well on x86, because the integer counters start to overflow; long just isn't long enough there. I didn't notice this as I was running on x86_64.
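One possible fix (my suggestion, untested against sorting.c): pin the types to 64 bits explicitly instead of relying on long, via stdint.h. The printf format strings then need the PRIu64 macros from inttypes.h to match.

#include <stdint.h>

/* 64 bits everywhere, regardless of what long means on the platform */
typedef uint64_t unit;
typedef uint64_t ulen;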

renaming sequentially

May 1st, 2008

If you've been dealing with files for a while you will have noticed that there is a slight semantic gap between how humans see files and how computers do. If you've ever seen a file list like this you know what I mean:

Lecture10.pdf
Lecture11.pdf
Lecture12.pdf
Lecture1.pdf
Lecture2.pdf
...

Numbering these files was done in good faith, and a user understands what it means, but the computer doesn't get it. Sorting in dictionary order produces the wrong order as far as the user is concerned. The reason is that the digits in these filenames are not treated and compared as integers, merely as strings. (Actually, . comes before 0 in ASCII, what's going on here?)

While we're not expecting our computers to wise up about this anytime soon, there is the obvious fix:

Lecture01.pdf
Lecture02.pdf
...
Lecture10.pdf
Lecture11.pdf
Lecture12.pdf

You've probably done this by hand once or twice, while cursing.

On the upside, this is very easy to fix with a few lines of code:

#!/usr/bin/env python
#
# Author: Martin Matusiak <numerodix@gmail.com>
# Licensed under the GNU Public License, version 3.
#
# revision 1 - support multiple digit runs in filenames

import os, string, glob, re, sys

def renseq():
    if (len(sys.argv) != 2):
        print "Usage:\t" + sys.argv[0] + " <num_digits>"
    else:
        ren_seq_files(sys.argv[1])


def ren_seq_files(num_digits):
    files = glob.glob("*")
    for filename in files:
        m = re.search("(.*)(\..*)", filename)
        ext = ""
        if m: (filename, ext) = m.groups()

        digit_runs = re.finditer("([0-9]+)", filename)
        spans = [m.span() for m in digit_runs]
        if spans:
            spans.reverse()
            arr = list(filename)
            for (s, e) in spans:
                arr[s:e] = string.zfill(str( int(filename[s:e]) ), int(num_digits))
            os.rename(filename+ext, "".join(arr)+ext)
    


if __name__ == "__main__":
    renseq()

This works on all the files in the current directory. Pass an integer to renseq.py and it will change all the numbers in a filename (if there are any) to the same numbers, padded with zeros if they have fewer digits than the amount you want. So, on the example above,

renseq.py 2

will turn the first list into the second list.

If, say, there are filenames with numbers of three digits and you pass 2 to renseq.py, the numbers will be preserved (so it's not a destructive rename); you'll just revert to the incorrect ordering you had in the beginning.

renseq.py will rewrite all the numbers in a filename, but not the extension. So mp3 won't become mp03. ;)

spiderfetch, part 2

April 27th, 2008

Note: If you haven't read part 1 you may be a little lost here.

So, the inevitable happened (as it always does, duh). I start out with a simple problem and not too many ambitions about what I want to accomplish. But once I reach that plateau, nay well before reaching it, I begin to ever so quietly ask myself the question "wait a second, what if x?" and "this looks specialized, I wonder if I could generalize...". And so before ever even reaching that hill top I've already, covertly, committed myself to taking it one step further. Not through a conscious decision, but through those lingering peripheral thoughts that you know won't disappear once they've struck. A bell cannot be unrung and all that.

I realized this was happening, but I didn't want to get into too much grubby stuff in the first blog entry, so I decided to keep that one simple and continue the story here. The first incarnation of spiderfetch had a couple of flaws that bugged me.

  1. No way to inspect how urls were being matched on the page, or even reason to believe this was happening correctly, other than giving an input and checking that all the expected urls were found. To make matching evident, I would need to be able to see the matches visually on the page.
    This has been addressed with a new option --dumpcolor, which dumps the index page and highlights the matches. This has made it much easier to verify that matching is done correctly.
  2. Matching wasn't sufficiently effective. The regex I had written would match urls inside tags, as long as they were in quotes. But this would still miss unquoted urls, and it also excluded all other urls on the page, which may or may not be of interest. I also realized that a single regex, no matter how refined, would be unlikely to match simultaneously all the urls that may be of interest.
    The obvious response is to add an option for multiple regexes, which is exactly what happened. This obviously adds another layer of complexity to debugging regexes, so the match highlighting was extended to colorize every match in a different color. Furthermore, where two regexes would match the same characters, the highlighting is in bold to indicate this.

With that, I was far happier with the ability to infer and verify correctness in the matching behavior. Surely now everything is hunky dory?

Or not? (As a classmate of mine likes to say after delivering a convincing argument, but graciously gives you the chance to state your objections anyway). Well, if you read part 1 of this adventure right to the end, noting the observation that spiderfetch could be run recursively, you may have thought what I thought. Well gosh, Bubba, this is starting to sound like wget --mirror. Since I've set up all this infrastructure already -- to spider a single page -- it wouldn't really take much to generalize it to run recursively.

There are a couple of problems to solve, however. Firstly, the operational model for spiderfetch was very simple: spider a page, then fetch all the urls that match the pattern. In terms of multiplicity: 1 page to spider, 1 pattern to match urls against, n urls to find. If we now take this a step further, in the next pass we have n urls to spider (obtained from the n urls found in the first step), and we may need 1 pattern to filter some of them. Next, we spider these pages, which produces (m1+m2+...) (or roughly, n*m) urls and so on. This becomes rather convoluted to explain in words, so let's visualize.

Starting at the url to be spidered (the top green node), we spider the page for urls. Each url found ends up in one of three categories:

  1. It matches the spider filter, so it becomes a url to spider in the next round (a black arrow).
  2. It matches the fetch filter, so it becomes a url to fetch (a blue arrow).
  3. It matches neither and is discarded (not shown).

In the next round, we gather up all the urls that are to be spidered (the black arrows starting at the top green node) and do the same thing for each one as we did with just the one page to begin with.

But this complicates matters quite a lot. We now have to deal with a bunch of new issues:

  1. How do we traverse the nodes? wget in mirror/spider modes goes depth-first, which I always thought was eccentric. I don't know why they do it this way, but I'm guessing to minimize memory use. If you go breadth-first then at every step you have to keep track of all the green nodes at the current level, which grows exponentially. Meanwhile, depth-first gives you linear growth, so that choice is well justified. But, on the other hand, the traversal order seems a bit unintuitive, because you "jump" from the deepest corner of your filesystem back to the top level and so on. I wonder if this turns out to be foolish (I don't expect spiderfetch to get the same kind of workout that wget does, obviously), but I've chosen the opposite approach, which I think also makes it easier to track what is happening underway.
  2. How deep do we want to go? Do we want to set an upper bound or (gasp) let it run until it stops?
  3. Until now we've only needed one filter (for the blue arrows at the top green node). Now we suddenly have a lot more arrows that we should be able to filter in some meaningful way. Obviously, we don't want a pair of filters for every single node. Not only would that be madness, but we don't know in advance how many nodes there will be.
    Our old friend wget only has one filter you can set for the whole site. But we want to be more specific than that, so there is a pair of filters (spider, fetch) for every level of the tree. This gives pretty decent granularity.

So how can we represent this cleanly? Well, it would be rather messy to have to input this as a command line parameter, besides which a scheme once written for a particular scenario could be reused later. So instead we introduce the idea of a recipe composed of rules. Starting from the top of the tree, each rule applies to the next level of the tree. And once we have no more rules -- or no more urls to spider -- we stop.

Let's take the asx example from part 1, where we had a custom made bash script to do the job. We can now rewrite it like this. First, the recipe is a list of rules, each rule is a hash. So starting from the top green node, we grab the first rule in the list, the one that contains the symbol :spider. This gives us the pattern to match urls on that page for spidering. There are no other patterns in there, so we spider these urls and then move on to the next step. We are now at the level below the top green node in the tree, with a bunch of pages from urls ending in .asx. We now grab the next rule in the recipe. This one gives a pattern for :dump, which means "dump these urls to the screen". So we find all the urls that match this pattern in all of our green nodes and dump them. Since there are no more rules left, this is where we stop.

module Recipe 
	RECIPE = [
		{ :spider => "\.asx$" },
		{ :dump => "^mms:\/\/" },
	]
end

So you would use it like this:

spiderfetch.rb --recipe asx http://www.something.com/somewhere

The options for patterns are :spider, :fetch, and :dump. If you want to repeat the same rule several times (for example to spider an image gallery with 10 pages, which are linked together with Next and Previous links), you can also set :depth to a positive integer value. This will descend in the tree the given number of times, using the same rule again and again.
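For instance, a recipe for that kind of gallery might look like this (a hypothetical example; the patterns are made up):

module Recipe
	RECIPE = [
		# follow the pagination links up to 10 levels deep,
		# fetching any jpg urls found along the way
		{ :spider => "page[0-9]+\.html$", :fetch => "\.jpg$", :depth => 10 },
	]
end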

And if you're feeling completely mental, you can even set :depth => -1, which will repeat the same rule until it runs out of urls to spider. You should probably combine this with --host, which will make sure you only spider the host (domain, to be exact) you started with, rather than the whole internet. (It will still allow :fetch and :dump to match urls on other hosts, so if you're spidering for images and they live on http://img.host.com rather than http://www.host.com, they will still be found.)

Lastly, as a heavy-handed arbitration measure, if you execute a recipe and pass either of --dump or --fetch, this will switch all your :fetch patterns to :dump or vice versa. It might be nice to check that the right urls are being found before you start fetching, for instance.

Download and go nuts:

spiderfetch-0.3.1.tar.gz

UPDATE: Paul Hawkins wrote to say that wget actually runs in breadth-first mode.

download all media links on a webpage

April 26th, 2008

This has probably happened to you. You come to a web page that has links to a bunch of pictures, or videos, or documents that you want to download. Not one or two, but all. How do you go about it? Personally, I use wget for anything that will take a while to download. It's wonderful: it accepts http, https, ftp etc., has options to resume and retry, and it never fails. I could just use Firefox, and if it's small files then I do just that: click all the links in one fell swoop and let them all download on their own. But with larger files that's not practical. You don't want to download 20 videos of 200mb each in parallel, that's no good. If Firefox crashes within the next few hours (which it probably will) then you'll likely end up with not even one file successfully downloaded. And Firefox doesn't have a resume function (there is a button but it doesn't do anything :rolleyes: ).

So there is a fallback option: copy all the links from Firefox and queue them up for wget: right click in document, Copy Link Location, right click in terminal window. This is painful and I last about 4-5 links before I get sick of it, download the web page and start parsing it instead. That always works, but I have to rig up a new chain of grep, sed, tr and xargs wget (or a for loop) for every page; I can never reuse it, so the effort doesn't go a long way.

There is another option. I could use a Firefox extension for this; there are several for this purpose. But that too is fraught with pain. Some of them don't work, some only work for some types of files, some still require a fair amount of manual effort to pick the right urls, and some don't support resuming a download after Firefox crashes. Not to mention that every new extension slows down Firefox and adds another upgrade cycle you have to worry about. Want to run Firefox 3? Oh sorry, your download extension isn't compatible. wget, in contrast, never stops working. Most limiting of all, these extensions aren't Unix-y. They assume they know what you want, and they take you from start to end. There's no way to plug in grep somewhere in the chain to filter out things you don't want, for example.

So the problem is eventually reduced to: how can I still use wget? Well, browsers being as lenient as they are, it's difficult to guarantee that you can parse every page, but you can at least try. spiderfetch, whose name describes its function: spider a page for links and then fetch them, attacks the common scenario. You find a page that links to a bunch of media files. So you feed the url to spiderfetch. It will download the page and find all the links (as best it can). It will then download the files one by one. Internally, it uses wget, so you still get the desired functionality and the familiar output.

If the urls on the page require additional post-processing, say they are .asx files you have to download one by one, grab the mms:// url inside, and mplayer -dumpstream, you at least get the first half of the chain. (Unlikely scenario? If you wanted to download these freely available lectures on compilers from the University of Washington, you have little choice. You could even chain spiderfetch to do both: first spider the index page, download all the .asx files, then spider each .asx file for the mms:// url, print it to the screen and let mplayer take it from there. No more grep or sed. :) )

Features

  • Spiders the page for anything that looks like a url.
  • Ability to filter urls for a regular expression (keep in mind this is still Ruby's regex, so .* to match any character, not * as in file globbing, (true|false) for choice and so on.)
  • Downloads all the urls serially, or just outputs to screen (with --dump) if you want to filter/sort/etc.
  • Can use an existing index file (with --useindex), but then if there are relative links among the urls, they will need post-processing, because the path of the index page on the server is not known after it has been stored locally.
  • Uses wget internally and relays its output as well. Supports http, https and ftp urls.
  • Semantics consistent with for url in urls; do wget $url... does not re-download completed files, resumes downloads, retries interrupted transfers.

Limitations

  • Not guaranteed to find every last url, although the matching is pretty lenient. If you can't match a certain url you're still stuck with grep and sed.
  • If you have to authenticate yourself somehow in the browser to be able to download your media files, spiderfetch won't be able to download them (as with wget in general). However, all is not lost. If the urls are ftp or the web server uses simple authentication, you can still post-process them to: ftp://username:password@the.rest.of.the.url, same for http.

Download spiderfetch:

Recipes

To make the usage a bit clearer, let's look at some concrete examples.

Recipe: Download the 2008 lectures from Fosdem:

spiderfetch.rb http://www.fosdem.org/2008/media/video 2008.*ogg

Here we use the pattern 2008.*ogg. If you first run spiderfetch with --dump, you'll see that all the urls for the lectures in 2008 contain the string 2008. Further, all the video files have the extension ogg. And whatever characters come in between those two things, we don't care.

Recipe: Download .asx => mms videos

Like it or not, sometimes you have to deal with ugly proprietary protocols. Video files exposed as .asx files are typically pointers to urls of the mms:// protocol. Microsoft calls them metafiles. This snippet illustrates how you can download them. First you spider for all the .asx urls, using the pattern \.asx$, which means "match strings that have .asx as their last characters". Then we spider each of those urls for actual urls to video files, which begin with mms. And for each one we use mplayer -dumpstream to actually download the video.

#!/bin/bash

mypath=$(cd $(dirname $0); pwd)
webpage="$1"

for url in $($mypath/spiderfetch.rb $webpage "\.asx$" --dump); do 
	video=$($mypath/spiderfetch.rb $url "^mms" --dump)
	mplayer -dumpstream $video -dumpfile $(basename $video)
done