Did you hear the exciting news? JRuby 1.1 is out! For real, you can call your grandma with the great news. :party: Wow, that was quick.
Okay, so the big new thing in JRuby is a bytecode compiler. As you may know, up to 1.0 it was just a Ruby interpreter in Java. Now you can actually compile Ruby modules to Java classes and no one will know the difference, very devious. :cool: Sounds like Robin Hood in a way, doesn't it?
The JRuby guys are claiming that this makes JRuby on par with "regular Ruby" on performance, if not better. Hmm. Just to be on the safe side, what size shoes do you wear? Oh ouch, those are going to be tricky to fit in your mouth. :/ And Freud will say you're stuck in the oral stage. Too much? Okay.
So here is my completely unvetted, dirty, real world test. No laboratory conditions here, you're in the ghetto. First we need something *to* test. I don't have a great deal of Ruby code at my disposal, but this should do the trick. How does scanning the raw filesystem for urls sound? The old harvest script actually does a half decent job of turning up a bunch of findings.
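The harvest script itself isn't reproduced in this post, so here is only a minimal sketch of what "scanning a raw byte stream for urls" could look like. The pattern `URL_PATTERN` and the method `scan_urls` are made up for illustration and are surely simpler than whatever harvest.rb really does.

```ruby
require "stringio"

# Hypothetical pattern: NOT the regex used by harvest.rb, which is not shown here.
URL_PATTERN = %r{https?://[\w.-]+(?:/[\w./?%&=+#~-]*)?}

# Walk an IO line by line and yield every url-looking match.
# Raw disk data is binary, so each line is treated as bytes (String#b)
# to avoid encoding errors on invalid UTF-8 sequences.
def scan_urls(io)
  return enum_for(:scan_urls, io) unless block_given?

  io.each_line do |line|
    line.b.scan(URL_PATTERN) { |url| yield url }
  end
end
```

Hooked up to a pipe like the ones in this post, it would be called as `scan_urls($stdin) { |url| puts url }`. Reading by line is an assumption of convenience; binary data can contain very long "lines", so a real tool might prefer fixed-size chunks.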
Now introducing the contenders. First up, his name is JRuby, you know him from occasional mentions on obscure blogs and the programming reddit past the top 500 entries. He promises to free all Java slaves by giving away free Rubies to everyone!
Aaand the incumbent, the famous... Ruby! You know him, your parents know him, every family would adopt him as their own child if they could. He's the destroyer of kingdoms and the creator of empires, he's bigger than Moses himself!
Our two drivers will be racing across a hostile territory. Your track is a 25gb ext3 live file system. During this time, I can promise you that only Firefox is likely to be writing new urls to disk, but I could be lying eheheh. Due to the unpredictable nature of this rally track, regulations allow only one racer at a time, but you will be clocked.
First up is the new kid on the block Jay....Ruby. The Ruby code will not be compiled before execution, we'll let the just-in-time compiler do its thing.
$ time ( sudo cat /dev/sda5 | bin/jruby harvest.rb --url > /tmp/fsurls.jruby )
real 39m26.547s
user 37m19.072s
sys 1m28.406s
Not too shabby for a first run, but since this is a brand new venue, we have no frame of reference yet. Let's see how Ruby will do here.
$ time ( sudo cat /dev/sda5 | harvest.rb --url > /tmp/fsurls.ruby )
real 78m42.186s
user 62m12.537s
sys 2m18.721s
Well, look at that! The new kid is pretty slick, isn't he? Sure is giving the old man a run for his money. Let's see how they answered the questions.
$ lh
-rw-r--r-- 1 alex alex 86M 2008-04-21 18:29 fsurls.jruby
-rw-r--r-- 1 alex alex 8.6G 2008-04-21 20:58 fsurls.ruby
Yowza! No less than a hundred times more matches with Ruby. What is going on here? Did Jay just race to the finish line, dropping the vast majority of his parcels? Or did father Ruby see double and triple and quadruple, ending up with lots and lots of duplicates? Well, we don't really *know* how many urls exist in those 25gb of data, but it seems a little bit suspect that there would be in excess of 8gb of them.
One way or the other, it's pretty clear that the regular expression semantics are not entirely identical. In fact, you might be sweating a little right now if your code uses them heavily.
UPDATE: Squashing duplicates in both files actually produces two files of very similar size (13mb), in which the disparity of unique entries is only a very reasonable 4% (considering the file system was being written to in the process). The question remains: how did Ruby produce 8gb of output?
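For the curious, squashing duplicates and comparing the two result sets can be sketched in a few lines of Ruby. This is not necessarily how the numbers above were produced; `unique_lines` and `compare` are hypothetical helper names, and the sketch assumes one url per line of output.

```ruby
require "set"
require "stringio"

# Collect the distinct lines of an IO into a Set, one entry per unique url.
# Only the unique entries are kept in memory, not the whole file.
def unique_lines(io)
  seen = Set.new
  io.each_line { |line| seen << line.chomp }
  seen
end

# Count the entries found by only one run and by both runs.
def compare(a, b)
  { only_a: (a - b).size, only_b: (b - a).size, common: (a & b).size }
end
```

With the files from this post it could be used as `compare(File.open("/tmp/fsurls.jruby") { |f| unique_lines(f) }, File.open("/tmp/fsurls.ruby") { |f| unique_lines(f) })`.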
Not bad for JRuby, but run with -J-server and see what sort of numbers you get. I'll wager it will be faster again.
Also, what Java version? Java 6 (or prerelease Java 7/OpenJDK builds) is recommended for best performance.
Hello Charlie, just last night I was watching your talk at Fosdem :)
I have java version 1.6.0_03.
Well, I reran the tests, and to make it a more controlled environment I dd'ed 1gb of disk data to search in. The results were quite different, and I ran it twice to make sure it wasn't a fluke:
jruby -J-server
real 3m56.149s
user 3m36.662s
sys 0m17.017s
jruby
real 3m47.196s
user 3m30.149s
sys 0m16.497s
ruby
real 1m11.586s
user 1m6.280s
sys 0m3.324s
On the upside, the output was identical this time in all cases. :D
Looks like you have a potential perf bug for us :) If you can sort out what's actually slower, we'd love to have it reported. My guess is that it might be IO, which we haven't done a lot to optimize. But if you've changed anything since the original results, like tweaked the regex or added a duplicate checker, that could be the culprit too.
At any rate, for a run this long, we should be faster. So you found a perf bug.
URL to report JRuby bugs: http://jira.codehaus.org/browse/JRUBY
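One rough way to sort out whether IO or the regex is the slow part is to time the two phases separately with the standard `Benchmark` module. The sketch below is an assumption-laden illustration, not the test that was actually run: `time_phases` and its default pattern are hypothetical, and urls that straddle a chunk boundary are simply missed, which is acceptable for timing but not for harvesting.

```ruby
require "benchmark"
require "stringio"

# Read an IO in fixed-size chunks and accumulate the wall-clock time spent
# reading versus the time spent matching. The default pattern is a stand-in,
# not the regex from harvest.rb.
def time_phases(io, pattern: %r{https?://[\w.-]+}, chunk_size: 1 << 20)
  read_time = 0.0
  scan_time = 0.0
  matches = 0

  loop do
    chunk = nil
    read_time += Benchmark.realtime { chunk = io.read(chunk_size) }
    break if chunk.nil? # IO#read(n) returns nil at end of stream

    scan_time += Benchmark.realtime { matches += chunk.b.scan(pattern).size }
  end

  { read: read_time, scan: scan_time, matches: matches }
end
```

Running the same script under both `ruby` and `jruby` with `time_phases($stdin)` on identical input would show which of the two numbers diverges.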
Right. I tried to apply a little more rigor and redo the test on slightly more data to get longer runtimes. Bug is filed: http://jira.codehaus.org/browse/JRUBY-2436