download all media links on a webpage

April 26th, 2008

This has probably happened to you. You come to a web page that has links to a bunch of pictures, or videos, or documents that you want to download. Not one or two, but all of them. How do you go about it? Personally, I use wget for anything that will take a while to download. It's wonderful: it accepts http, https, ftp and so on, has options to resume and retry, and it never fails. I could just use Firefox, and if it's small files then I do just that: click all the links in one fell swoop and let them all download on their own. But with larger files that's not practical. You don't want to download 20 videos of 200mb each in parallel; that's no good. And if Firefox crashes within the next few hours (which it probably will), you'll likely end up with not even one file successfully downloaded. Firefox doesn't have a resume function either (there is a button, but it doesn't do anything :rolleyes: ).

So there is a fallback option: copy the links from Firefox one by one and queue them up for wget: right click in the document, Copy Link Location, right click in the terminal window. This is painful, and I last about 4-5 links before I get sick of it, download the web page and start parsing it instead. That always works, but I have to rig up a new chain of grep, sed, tr and xargs wget (or a for loop) for every page. I can never reuse it, so the effort doesn't go a long way.
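
For the record, this is roughly the kind of throwaway job it amounts to, sketched here in Ruby rather than the grep/sed chain just described (the filename and the .ogg extension are made-up placeholders):

# One-off: pull every .ogg link out of a saved page and queue it for wget.
# 'index.html' and the .ogg extension stand in for whatever page is at hand.
urls = File.read('index.html').scan(%r{https?://[^"'\s]+\.ogg})
urls.each { |url| system('wget', '-c', url) }

It works, but you end up writing a slightly different version of it every time, which is exactly the annoyance described above.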

There is another option: I could use a Firefox extension, and there are several for this purpose. But that too is fraught with pain. Some of them don't work, some only work for certain types of files, some still require a fair amount of manual effort to pick the right urls, and some don't support resuming a download after Firefox crashes. Not to mention that every new extension slows down Firefox and adds another upgrade cycle you have to worry about. Want to run Firefox 3? Oh sorry, your download extension isn't compatible. wget, in contrast, never stops working. Most limiting of all, these extensions aren't Unix-y. They assume they know what you want, and they take you from start to end. There's no way to plug in grep somewhere in the chain to filter out things you don't want, for example.

So the problem is eventually reduced to: how can I still use wget? Well, browsers being as lenient as they are, it's difficult to guarantee that you can parse every page, but you can at least try. spiderfetch, whose name describes its function (spider a page for links, then fetch them), attacks this common scenario. You find a page that links to a bunch of media files, so you feed the url to spiderfetch. It will download the page and find all the links (as best it can), then download the files one by one. Internally it uses wget, so you still get the desired functionality and the familiar output.

If the urls on the page require additional post-processing, say they are .asx files that you have to download one by one, grab the mms:// url inside, and feed to mplayer -dumpstream, you at least get the first half of the chain. (Unlikely scenario? If you wanted to download these freely available lectures on compilers from the University of Washington, you have little choice. You could even chain spiderfetch to do both: first spider the index page and download all the .asx files, then spider each .asx file for the mms:// url, print it to the screen and let mplayer take it from there. No more grep or sed. :) )

Features

  • Spiders the page for anything that looks like a url.
  • Ability to filter urls with a regular expression (keep in mind this is Ruby's regex syntax, so .* to match any sequence of characters, not * as in file globbing, (true|false) for alternation, and so on; see the short sketch after this list).
  • Downloads all the urls serially, or just outputs to screen (with --dump) if you want to filter/sort/etc.
  • Can use an existing index file (with --useindex), but then if there are relative links among the urls, they will need post-processing, because the path of the index page on the server is not known after it has been stored locally.
  • Uses wget internally and relays its output as well. Supports http, https and ftp urls.
  • Semantics consistent with for url in urls; do wget $url; done. Does not re-download completed files, resumes partial downloads, retries interrupted transfers.
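
To illustrate the regex filtering mentioned above, here is roughly how such a pattern behaves in Ruby (a sketch of the semantics, not spiderfetch's actual matching code):

# Filtering a list of urls with a Ruby regex (sketch with made-up urls).
urls = %w[
  http://example.com/lecture-2008-01.ogg
  http://example.com/lecture-2007-12.ogg
  http://example.com/notes.pdf
]
puts urls.grep(/2008.*ogg/)   # prints only the first url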

Limitations

  • Not guaranteed to find every last url, although the matching is pretty lenient. If you can't match a certain url you're still stuck with grep and sed.
  • If you have to authenticate yourself somehow in the browser to be able to download your media files, spiderfetch won't be able to download them (as with wget in general). However, all is not lost. If the urls are ftp, or the web server uses simple authentication, you can still post-process them into the form ftp://username:password@the.rest.of.the.url (same for http); see the sketch below.
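
A minimal sketch of that post-processing step in Ruby, with placeholder credentials and url:

# Embed credentials into a url for servers that accept simple authentication.
# 'alice' and 'secret' are placeholders, as is the url.
def with_credentials(url, user, password)
  url.sub(%r{^(https?|ftp)://}) { "#{$1}://#{user}:#{password}@" }
end

puts with_credentials('ftp://example.com/pub/file.ogg', 'alice', 'secret')
# => ftp://alice:secret@example.com/pub/file.ogg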


Recipes

To make the usage a bit clearer, let's look at some concrete examples.

Recipe: Download the 2008 lectures from Fosdem:

spiderfetch.rb http://www.fosdem.org/2008/media/video 2008.*ogg

Here we use the pattern 2008.*ogg. If you first run spiderfetch with --dump, you'll see that all the urls for the 2008 lectures contain the string 2008, and that all the video files have the extension ogg. Whatever characters come in between those two things, we don't care about.

Recipe: Download .asx => mms videos

Like it or not, sometimes you have to deal with ugly proprietary protocols. Video files exposed as .asx files are typically pointers to urls of the mms:// protocol; Microsoft calls them metafiles. This snippet illustrates how you can download them. First we spider for all the .asx urls, using the pattern \.asx$, which means "match strings that end in .asx". Then we spider each of those urls for actual urls to video files, which begin with mms. And for each one we use mplayer -dumpstream to actually download the video.

#!/bin/bash

# Locate spiderfetch.rb relative to this script.
mypath=$(cd "$(dirname "$0")"; pwd)
webpage="$1"

# Spider the page for .asx urls, then spider each .asx file for its mms:// url
# and hand that to mplayer, which dumps the stream to a local file.
for url in $("$mypath"/spiderfetch.rb "$webpage" "\.asx$" --dump); do
	video=$("$mypath"/spiderfetch.rb "$url" "^mms" --dump)
	mplayer -dumpstream "$video" -dumpfile "$(basename "$video")"
done

clocking jruby 1.1

April 21st, 2008

Did you hear the exciting news? JRuby 1.1 is out! For real, you can call your grandma with the great news. :party: Wow, that was quick.

Okay, so the big new thing in JRuby is a bytecode compiler. As you may know, up to 1.0 it was just a Ruby interpreter in Java. Now you can actually compile Ruby modules to Java classes and no one will know the difference, very devious. :cool: Sounds like Robin Hood in a way, doesn't it?

The JRuby guys are claiming that this puts JRuby on par with "regular Ruby" on performance, if not better. Hmm. Just to be on the safe side, what size shoes do you wear? Oh ouch, those are going to be tricky to fit in your mouth. :/ And Freud will say you're stuck in the oral stage. Too much? Okay.

So here is my completely unvetted, dirty, real world test. No laboratory conditions here, you're in the ghetto. First we need something *to* test. I don't have a great deal of Ruby code at my disposal, but this should do the trick. How does scanning the raw filesystem for urls sound? The old harvest script actually does a half decent job of turning up a bunch of findings.
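
I won't reproduce harvest.rb here, but as a rough sketch (my assumption, not the actual script), the job boils down to something like this: read raw bytes from stdin and print anything url-shaped.

# A rough sketch of a harvest-style url scanner (not the real harvest.rb).
# Reads raw bytes from stdin in chunks and prints anything that looks like a url.
URL_RE = %r{(?:https?|ftp)://[\w\-.]+[\w\-./?%&=+~#]*}
CHUNK = 1024 * 1024  # scan a megabyte at a time

STDIN.binmode
while (chunk = STDIN.read(CHUNK))
  chunk.scan(URL_RE) { |url| puts url }
end
# Simplification: a url straddling a chunk boundary gets cut in two.

Something of that shape is what both contenders will be running below.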

Now introducing the contenders. First up, his name is JRuby, you know him from occasional mentions on obscure blogs and the programming reddit past the top 500 entries. He promises to free all Java slaves by giving away free Rubies to everyone!

Aaand the incumbent, the famous... Ruby! You know him, your parents know him, every family would adopt him as their own child if they could. He's the destroyer of kingdoms and the creator of empires, he's bigger than Moses himself!

Our two drivers will be racing across a hostile territory. Your track is a 25gb ext3 live file system. During this time, I can promise you that only Firefox is likely to be writing new urls to disk, but I could be lying eheheh. Due to the unpredictable nature of this rally track, regulations allow only one racer at a time, but you will be clocked.

First up is the new kid on the block, Jay....Ruby. The Ruby code will not be compiled ahead of time; we'll let the just-in-time compiler do its thing.

$ time ( sudo cat /dev/sda5 | bin/jruby harvest.rb --url > /tmp/fsurls.jruby )
real 39m26.547s
user 37m19.072s
sys 1m28.406s

Not too shabby for a first run, but since this is a brand new venue, we have no frame of reference yet. Let's see how Ruby will do here.

$ time ( sudo cat /dev/sda5 | harvest.rb --url > /tmp/fsurls.ruby )
real 78m42.186s
user 62m12.537s
sys 2m18.721s

Well, look at that! The new kid is pretty slick, isn't he? Sure is giving the old man a run for his money. Let's see how they answered the questions.

$ lh
-rw-r--r-- 1 alex alex 86M 2008-04-21 18:29 fsurls.jruby
-rw-r--r-- 1 alex alex 8.6G 2008-04-21 20:58 fsurls.ruby

Yowza! No less than a hundred times more matches with Ruby. What is going on here? Did Jay just race to the finish line, dropping the vast majority of his parcels? Or did father Ruby see double and triple and quadruple, ending up with lots and lots of duplicates? Well, we don't really *know* how many urls exist in those 25gb of data, but it seems a little bit suspect that there would be in excess of 8gb of them.

One way or the other, it's pretty clear that the regular expression semantics are not entirely identical. In fact, you might be sweating a little right now if your code uses them heavily.

UPDATE: Squashing duplicates in both files actually produces two files of very similar size (13mb), in which the disparity of unique entries is only a very reasonable 4% (considering the file system was being written to in the process). The question still remains: how did Ruby produce 8gb of output?

what the heck is a closure?

April 20th, 2008

That's a question that's been bugging me for months now. It's so vexing to try to find something out and not get it. All the more so when you look it up in a couple of different places and the answers don't seem to have much to do with each other. Obviously, once you have the big picture, all those answers intersect in a meaningful place, but while you're still hunting for it, that's not helpful at all.

I put this question to a wizard and the answer was (not an exact quote):

A function whose free variables have been bound.

Don't you love getting a definition in terms of other terms you're not particularly comfortable with? Just like a math textbook. This answer confused me, because I couldn't think of a function I had seen where that wasn't the case, so I thought I must be missing something. The Python answer is very simple:

A nested function.

It's sad, but one good answer is enough. When you can't get that, sometimes you end up stacking up several unclear answers and hoping you can piece it all together. And that can very well fail.

I read a definition today that finally made it clear to me. It's not the simplest and far from the most intuitive description. In fact, it too reads like a math textbook. But it's simply what I needed to hear in words that would speak to me.

A lexical closure, often referred to just as a closure, is a function that can refer to and alter the values of bindings established by binding forms that textually include the function definition.

I read it about 3 times, forwards and backwards, carefully making sure that as I was lining up all the pieces in my mind, they were all in agreement with each other. And once I verified that, and double checked it, I felt so relieved. Finally!

I can't follow the Common Lisp example that follows on that page, but scroll down and you find a piece of code that is much simpler.

(define (foo x)
	(define (bar y)
		(+ x y))
	bar)

((foo 1) 5) => 6
((foo 2) 5) => 7

What's going on here? First there is a function being defined. Its name is foo and it takes a parameter x. Now, once we enter the body of this function foo, straight away we have another function definition - a nested function. This inner function is called bar and takes a parameter y. Then comes the body of the function bar, which says "add variables x and y". And then? Follow the indentation (or the parentheses). We have now exited the function definition of bar and we're back in the body of foo, which says "the value bar", so that's the return value of foo: the function bar.

In this example, bar is the closure. Just for a second, look back at how bar is defined in isolation, don't look at the other code. It adds two variables: y, which is the formal parameter to bar, and x. How does x receive its value? It doesn't. Not inside of bar! But if you look at foo in its entirety, you see that x is the formal parameter to foo. Aha! So the value of x, which is set inside of foo, carries through to the inner function bar.

Can we square this code with the answers quoted earlier? Let's try.

A function whose free variables have been bound. - A function, in this case bar. Free variables, in this case x. Bound, in this case defined as the formal parameter x to the function foo.

A nested function. - The function bar.

A lexical closure, often referred to just as a closure, is a function that can refer to and alter the values of bindings established by binding forms that textually include the function definition. - A function, in this case bar. That can refer to and alter, in this case bar refers to the variable x. values of bindings, in this case the value of the bound variable x. established by binding forms, in this case the body of the function foo. that textually include the function definition, in this case foo includes the function definition of bar.

So yes, they all make sense. If you understand what it's all about. :/

Let's return to the code example. We now call the function foo with argument 1. As we enter foo, x is bound to 1. We now define the function bar and return it, because that is the return value of foo. So now we have the function bar, which takes one argument. We give it the argument 5. As we enter bar, y is bound to 5. And x? Is it an undefined argument, since it's not defined inside bar? No, it's bound *from before*, from when foo was called. So now we add x and y.

In the second call, we call foo with a different argument, thus x inside of bar receives a different value, and once the call to bar is made, this is reflected in the return value.

Well, that was easy. And to think I had to wait so long to clarify such a simple idiom. So what is all the noise about anyway? Think of it as a way to split up the assignment of variables. Suppose you don't want to assign x and y at the same time, because y is a "more dynamic" variable whose value will be determined later. Meanwhile, x is a variable you can assign early, because you know it's not going to need to be changed.

So each time you call foo, you get a version of bar that has a value of x already set. In fact, from this point on, for as long as you use this version of bar, you can think of x as a constant that has the value that it was assigned when foo was called. You can now give this version of bar to someone and they can use it by passing in any value for y that they want. But x is already determined and can't be changed.
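
Since Ruby comes up a lot on this blog, here is the same foo/bar idiom as a Ruby lambda; a sketch for comparison, not part of the Scheme page quoted above.

# The same idiom in Ruby: the lambda returned by foo closes over x.
def foo(x)
  lambda { |y| x + y }
end

add_one = foo(1)   # x is fixed at 1 inside this closure
add_two = foo(2)

puts add_one.call(5)  # => 6
puts add_two.call(5)  # => 7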

when faced with ethical ickiness

April 16th, 2008

And by ickiness I mean a question that you don't have the answer to, but about which you nevertheless have a gut feeling one way or the other. For instance: should gay couples be allowed to adopt? Another example would be: should it be permitted to clone humans? Or how about the old favorite: should sex play in kindergarten be encouraged (which I have absolutely no answer to)?

These are questions which have no prior answer, because we've only just been faced with them for the first time (or, for that matter, only now been willing to consider them). There are many questions like this which have no answer (yet), but which nevertheless raise a certain instinctive feeling in us that makes us prone to lean to one side. This icky feeling is a fear within us that "something bad will happen" if this new thing is allowed to happen, without our knowing what we are really scared of.

Many such questions have received answers in the past. For example, the question of whether a brother and sister should be allowed to marry has been settled on the basis that children of such parents risk serious deformities. Therefore we have a rational answer, not merely a fear.

What not to do: alternative A

Do not take your unarticulated fear as proof that your instinct must be correct, and on that basis suggest banning or condemning the practice. This is a purely emotional response with no rational justification.

Do not further aim to strengthen your argument by associating yourself with a large group of people who share your unarticulated fear and have decided to "do something about it". The ignorance of a thousand people does not add up to wisdom, any more than the fact that the sun is the center of our solar system was discovered by popular opinion.

Those who would rather pretend that certain new possibilities had never been discovered will want to ban them, so that we can go back to believing these things are not possible. And once something is banned, no one will be doing it, so we can live in this illusion we've created for ourselves.

What to do: alternative B

Resign yourself to the fact that certain questions have no answer at the moment, and that at any given time there will always be such questions. Your pretty little head will resist this, because it makes certain things undecidable. But it is nevertheless the quickest path to happiness, as you will soon see.

What to do: alternative C

Pursue the answer intellectually, and aggressively. Read up on the science that is happening in this field and the discourse that is taking place between interested parties. Once you go in depth you will begin to understand not just the issue, but also your own fear and what it really is you're worried about. This will then prevent you from choosing the emotional answer of alternative A, because you will no longer be able to convince yourself that a rational answer is optional.

The final, undisputed answer to certain questions may not come for a long time, not even in the span of your lifetime. But with every step that you veer closer to the truth you will have a better idea of what it's likely to be. Until the truth is actually discovered, you will regularly find yourself faced with alternative B.

book signings - are they utterly stupid?

April 13th, 2008

So it turns out that authors have "book tours" (yeah, it sounds crazy, doesn't it?). You would think that everything they had to say was already in the book, but they do this to sell more books. They go around to various cities and they talk about their book and sometimes participate in panel discussions with other authors.

An integral part of this is the book signing. Now suppose you read a book that was very good, and you really appreciate this person's ideas and their ability to express them the way they have. What benefit could you possibly see in having the book signed by the author? First of all, their name is already on the book (on the cover, in fact), so it's redundant. So what do you gain from knowing that this person wrote their name on this paper? What difference does it make?

It's stupid celebrity worship every day of the week. I can sort of understand how people ask sportsmen for autographs, because when you meet an athlete you don't really have anything "of theirs" to keep. So even an autograph (which again is meaningless; who cares about the calligraphic skills of a sportsman? that's not what you admire them for) is something. With an author this is turned on its head, because the item being signed is the very work that you appreciate, so you already _have_ their best output in your hand.