The Ultimate Java Versus C%23 Benchmark

Share the article!

20030520073353

The problem with Cameron Purdy’s benchmarks are:

  1. It’s not a “Real Life” benchmark.

  2. It’s overly simplistic.

  3. It emphasizes numerical computation over symbolic manipulation. 

  4. It doesn’t accentuate the performance advantage of Java over C%23.

So, without much adieu, I present the “Ultimate Java versus C%23 Benchmark”.  What program could be more “Real Life” than a program about the building blocks of Life itself (i.e. DNA).  This program matches a DNA sequence against a million DNA patterns expressed as regular expressions.   Definitely more complex a problem than computing the average salary of many people, more symbolic ( After all if you want to do computation do it in Fortran), best of all you’ll love the results.

The C%23 code:

using System.Text.RegularExpressions;

using System.Text;

using System;

namespace TestRegex

{

public class RegexBench2

{

private static String _doc =

“CGAATCTAAAAATAGATTCGGACGTGATGTAGTCGTACAAATGAAAAAGTAAGCC”;

private static int ITERATIONS = 1000000;

public static void Main()

{

long start = System.DateTime.Now.Ticks / 10000;

long end;

int length = 1;

for( int i = ITERATIONS; i <= ITERATIONS * 2; i++ )

{

length = (int) (Math.Log((double)i)/Math.Log(4));

String matchthis = generateWord(i, length + 1);

Regex regexpr = new Regex(matchthis, RegexOptions.Compiled);

Boolean b = regexpr.IsMatch(_doc);

if( b )

{

end = System.DateTime.Now.Ticks / 10000;

Console.WriteLine(“found {0} at {1} it took {2} miliseconds”,

matchthis, i, end – start );

}

}

end = System.DateTime.Now.Ticks / 10000;

Console.WriteLine(“.NET regex took {0} miliseconds”,end – start);

}

public static String generateWord(int value, int length )

{

StringBuilder buf = new StringBuilder();

int current = value;

for(int i = 0; i < length; i++ )

{

int v = current % 4;

current = current / 4;

buf.Append( convert(v) );

}

return buf.ToString();

}

private static String convert(int value)

{

switch(value)

{

case 0: return “A”;

case 1: return “G”;

case 2: return “T”;

case 3: return “C”;

default:

return “0″;

}

}

}

}

The Java code:

import java.text.*;

import java.util.regex.*;

public class RegexBench2

{

static String _doc = “CGAATCTAAAAATAGATTCGGACGTGATGTAGTCGTACAAATGAAAAAGTAAGCC”;

static int ITERATIONS = 1000000;

public static void main(String args[]) {

long start = System.currentTimeMillis();

int length = 1;

for( int i = ITERATIONS; i <= ITERATIONS * 2; i++ ) {

length = (int) (Math.log((double)i)/Math.log(4));

String matchthis = generateWord(i, length + 1);

Pattern regexpr = Pattern.compile(matchthis); Matcher matcher = regexpr.matcher(_doc);

boolean b = matcher.find();

if(b){

long end = System.currentTimeMillis();

System.out.println( MessageFormat.format(“found {0} at {1} it took {2} miliseconds”,

new Object[] {matchthis, “” + i, “” + (end-start) } )); }

}

long end = System.currentTimeMillis();

System.out.println(“Java regex took ” + (end – start) + ” miliseconds”);

}

static String generateWord(int value, int length ) {

StringBuffer buf = new StringBuffer(); int current = value;

for(int i = 0; i < length; i++ ) {

int v = current % 4; current = current / 4; buf.append( convert(v) );

}

return buf.toString();

}

static String convert(int value) {

switch(value) {

case 0: return “A”; case 1: return “G”; case 2: return “T”; case 3: return “C”; default: return “0″; }

}

}

The code can generally be applied in several real applications.  You could build a spam filter, a RSS categorization engine or even a rule based message broker.  Just to make sure nobody is cheating, matches must be found at 1000000 and 2000000. Also, don’t even think of replacing the regex match with a string search, that’s tantamount to dumbing down the requirements. 

Here are the results for the Java run:

found AAAGTAAGCC at 1000000 it took 0 miliseconds

found AAATGAAAAAG at 1048960 it took 578 miliseconds

found GAAAAAGTAAG at 1085441 it took 922 miliseconds

found TCTAAAAATAG at 1179694 it took 1844 miliseconds

found ACGTGATGTAG at 1204636 it took 2094 miliseconds

found AATAGATTCGG at 1548576 it took 5328 miliseconds

found TCGTACAAATG at 1576094 it took 5578 miliseconds

found CGGACGTGATG at 1599255 it took 5813 miliseconds

found ATTCGGACGTG at 1689064 it took 6657 miliseconds

found AGATTCGGACG at 1859204 it took 8235 miliseconds

found TGATGTAGTCG at 1984902 it took 9423 miliseconds

found AAATAGATTCG at 2000000 it took 9563 miliseconds

Java regex took 9563 miliseconds

If you run main() inside a loop using hotspot server, the time reduces 6516 miliseconds (oops! spelled that wrong again!).  Also, the memory footprint required for Java is slightly over 7MBytes.

The C%23 results? Well they’re just too embarrassing to post. Suffice to say, it’s several orders of magnitude slower! Furthermore, the memory usage exploded to who knows where! I can’t believe that people may be thinking of deploying mission critical applications on this virtual machine!

Just like Cameron’s benchmark, everything is self contained, you can cut and paste it, compile it and run it yourself.  Better yet, send it to your resident C%23 expert, have him tweak it as much as he can.  Bring a digital camera,  capturing the anguish in his face when he realizes the truth, would be priceless.

(BTW, did anyone notice? The Java version actually took less lines of code than the C%23 version, hmmm?)

[update] Cameron speculates about the numbers, he makes a guess that Java may be 100x faster.  The real reason why I didn’t post the numbers was that I couldn’t get the numbers in time for the post.  Well here it is:

found AAAGTAAGCC at 1000000 it took 31 miliseconds
found AAATGAAAAAG at 1048960 it took 4470117 miliseconds

Unhandled Exception: OutOfMemoryException.

In summary Java is at least 7,733x faster than C%23, sorry Cameron!  Better yet it completes the task and doesn’t gag of memory!


Share the article!

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>