Friday, July 17, 2015

The All Star Game Should Be an Exhibition

I was talking with a co-worker the other day who told me that viewership for the MLB All Star game is down nowadays.  I used to love watching the All Star game, but my interest has waned over the years.  I believe it's for several reasons:

A) The rosters have exploded, so being an All Star doesn't mean as much as it used to.  Including players who were injured, the NL and AL rosters each had 38 players, well in excess of the normal 25-man roster.  As a completely arbitrary comparison, the 2000 rosters had 35 and 34 players in the NL and AL respectively.  In 1990 both rosters had 29.

B) The game "meaning" something means that players can't be loose and have as much fun as they used to.

Some of the best moments I can recall in All Star history are just fun moments.  The best players in baseball coming together for one day to show off their talents.

Two of the most memorable moments I recall from the All Star game involved Randy Johnson: one in which he psyched out John Kruk, and another in which he threw behind the back of Larry Walker.

Another great scene was when Torii Hunter stole a home run from Barry Bonds and Bonds playfully tackled him in center field afterwards.

The 2015 game had several fun highlights, such as Jacob deGrom and Aroldis Chapman striking out the side in their innings.  But for some reason, it just doesn't seem quite the same nowadays compared to years past.

Saturday, July 11, 2015

HPC Clusters vs Big Data Clusters: Two Different Worlds

Recently, I was thinking about why it's so hard for "HPC" cluster users to understand why "Big Data" cluster users do what they do, and vice versa. 

I wrote down this chart with a comparison of the software sometimes/often used on each:


Software                     | HPC Clustering                 | Big Data Clustering
Schedulers/Resource Managers | Moab, Slurm, LSF, Torque, PBS  | YARN, Mesos
File Systems                 | Lustre, GPFS, pNFS, PVFS       | HDFS
API "Framework"              | MPI, OpenMP                    | MapReduce
Main Programming Languages   | C/C++, Fortran                 | Java, Scala
Interconnect                 | InfiniBand, Myrinet, ...       | GigE
Higher Level Scripting       | ???                            | Pig, Hive


I could probably go on, but hopefully you get the gist of things.

Basically, nothing listed under the "HPC Clustering" column is used in the "Big Data Clustering" column, and vice versa.

Herein lies the reason why the users of both don't understand each other.

I believe HPC cluster users look at the list on the right and immediately think things like:

  • "Why would you use HDFS, it's not a Posix file system."
  • "Why would you use Java, it's so slow."
  • "Why use GigE, that's so slow."
  • "Why did you write a whole new scheduler, why not use the schedulers HPC users developed years ago."

In contrast, Big Data cluster users think nearly the opposite:

  • "Why would you use a Posix file system, that API/interface is ancient."
  • "Why would you use a networked file system, it's so slow."
  • "Why waste money on Infiniband, it's completely unnecessary to spend money on unused bandwidth."
  • "Why use MPI, the API is so complex, you can't develop programs quickly."
  • "Why use C or C++, the programming language is so complex, you can't develop programs quickly."

The problems users face are so different that neither side can really understand why the other would even bother to use the software and hardware that they do.


So what happens when users in one world want to run in the other?  I think what often happens is you hear "Can you port your code/application to work here?"  The answer is likely "No, that's not reasonable."  I believe you get these answers because most people don't understand the differences captured in the chart above.

So for those who are trying to mix environments, I think the most important thing to do is accept the differences listed above and work toward solutions that bridge the two worlds.

To some extent, that is part of my goal in developing Magpie (github).  Accept that the traditional HPC world isn't going to change, and the Big Data world won't change either.  Better to try to get the Big Data world onto HPC clusters with as little change as possible.

Update: See the "Big Data vs HPC" follow-up.