Thursday, April 28, 2016

HPC vs Big Data - Part 3

In several prior posts comparing Big Data to HPC, I asked how HPC and Big Data differ, and why it is difficult for HPC people to "get" Big Data and vice versa.

I had another idea of how to present it.

Earlier I stated that: 

HPC: Computation Time >> IO Time

Big Data: IO Time >> Computation Time

However, this by itself can sometimes be confusing.  Some algorithms are programmed in both the HPC world (e.g. MPI) and the Big Data world (e.g. Hadoop/Spark).  So how can this dichotomy be represented?

The above definition can be confusing because it suggests things are black and white, which is of course not true.  In reality, the trade-off between I/O time and computation time lies along a sliding scale.  To represent this, let's look at this simple (terribly drawn) diagram.

In it, we have increasing computation on the X axis going right and increasing I/O on the Y axis going up.  Based on my prior definition, we can consider HPC jobs to be those along the bottom right, where computation time is much greater than I/O time.  Big Data jobs are those in the upper left of this diagram, where I/O time is much greater than computation time.

Naturally, if your users and/or code sit significantly in one area rather than the other, you'll prioritize hardware, software, etc. differently than those in the other area.  This leads to the very different universes of cluster computing.

What (hopefully) clarifies some of the confusion on HPC vs Big Data is the fact that at some point in the middle these two worlds meet.  There are Data + Computation combinations where it's unclear which approach is superior.  A user might in fact be indifferent as to which hardware/software setup is better for them.  Users in the middle are far more likely to simply gravitate towards the system they are more familiar with (e.g. if you know Java, use Hadoop/Spark; if you know C, use MPI) or the one they have access to (if you are familiar with AWS, just use it; if you have access to an MPI cluster, just use it).
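To make the sliding scale a bit more concrete, here is a minimal Python sketch that places a job in one of the three regions described above.  The function name classify_workload and the cutoff factor of 10 are assumptions made up for illustration; nothing above says how much larger ">>" has to be, so treat the threshold as a knob rather than a definition.

```python
def classify_workload(compute_seconds, io_seconds, factor=10.0):
    """Place a job on the computation-vs-I/O scale.

    `factor` is a hypothetical stand-in for ">>": one time must be at
    least `factor` times the other before the job counts as clearly
    HPC or clearly Big Data. Everything else lands in the middle.
    """
    if compute_seconds < 0 or io_seconds < 0:
        raise ValueError("times must be non-negative")

    # Bottom right of the diagram: computation time >> I/O time.
    if compute_seconds > 0 and compute_seconds >= factor * io_seconds:
        return "HPC"

    # Upper left of the diagram: I/O time >> computation time.
    if io_seconds > 0 and io_seconds >= factor * compute_seconds:
        return "Big Data"

    # Neither dominates: the region where the two worlds meet.
    return "middle"


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    print(classify_workload(3600, 60))   # an hour of compute, a minute of I/O
    print(classify_workload(30, 1800))   # mostly reading and writing data
    print(classify_workload(600, 900))   # roughly balanced
```

A job that lands in "middle" is the case where familiarity or access, rather than the workload itself, is likely to decide between MPI and Hadoop/Spark.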

I hope this clears up the confusion around implementations that can be done in both worlds.
