Saturday, July 11, 2015

HPC Clusters vs Big Data Clusters: Two Different Worlds

Recently, I was thinking about why it's so hard for "HPC" cluster users to understand why "Big Data" cluster users do what they do, and vice versa. 

I wrote down this chart with a comparison of the software sometimes/often used on each:

Software                     | HPC Clustering                | Big Data Clustering
Schedulers/Resource Managers | Moab, Slurm, LSF, Torque, PBS | YARN, Mesos
File Systems                 | Lustre, GPFS, pNFS, PVFS      | HDFS
API "Framework"              | MPI, OpenMP                   | MapReduce
Main Programming Languages   | C/C++, Fortran                | Java, Scala
Interconnect                 | Infiniband, Myrinet, ...      | GigE
Higher Level Scripting       | ???                           | Pig, Hive

I could probably go on, but hopefully you get the gist of things.

Basically, nothing listed in the "HPC Clustering" column is used on Big Data clusters, and vice versa.

Herein lies the reason the users of the two don't understand each other.

I believe HPC cluster users look at the list on the right and immediately think things like:

  • "Why would you use HDFS? It's not a POSIX file system."
  • "Why would you use Java? It's so slow."
  • "Why use GigE? It's so slow."
  • "Why did you write a whole new scheduler? Why not use the schedulers HPC users developed years ago?"

In contrast, Big Data cluster users think nearly the opposite:

  • "Why would you use a POSIX file system? That API/interface is ancient."
  • "Why would you use a networked file system? It's so slow."
  • "Why waste money on Infiniband? You'd be paying for bandwidth you never use."
  • "Why use MPI? The API is so complex that you can't develop programs quickly."
  • "Why use C or C++? The languages are so complex that you can't develop programs quickly."
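To make the "develop programs quickly" point a bit more concrete, here is a minimal sketch of the MapReduce programming model in plain Python. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. The idea is that the programmer writes just the map and reduce steps, while grouping by key (and, on a real cluster, data movement and fault handling) is the framework's job.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "HPC clusters"]))
```

An equivalent MPI program would also have to handle partitioning the input across ranks and exchanging partial counts explicitly, which is roughly the complexity the Big Data side is complaining about.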

The problems the two groups face are so different that neither side can really understand why the other would even bother with the software and hardware it uses.

So what happens when users in one world want to run in the other world? I think what often happens is that you hear, "Can you port your code/application to work here?" The answer is likely, "No, that's not reasonable." I believe you get these answers because most people don't understand the differences between the two worlds shown in the chart above.

So for those who are trying to mix environments, I think the most important thing is to accept the differences listed above and work toward solutions that bridge the two worlds.

To some extent, that is part of my goal in developing Magpie (github): accept that the traditional HPC world isn't going to change and that the Big Data world won't either. It's better to try to get the Big Data world running on HPC clusters with as little change as possible.

Update: See "Big Data vs HPC" follow up.
