Distributed Computing Models
Distributed computing involves processing data on multiple computers in parallel, which makes it possible to work with huge amounts of data. Without distributed computing models, search engines and many other services would not have been possible. When data is processed across thousands of machines, there is a good chance that a few of them go down and a couple of hard disks crash on any given day. Distributed computing models address not only how the data is processed in parallel, but also how these failure scenarios are handled in an automated fashion.
Before we look at the distributed computing models in more detail, let's look at the different database models most of us are familiar with and compare them with the distributed computing models. In the database world there are many models, and new ones are being invented continuously to address the challenges with the existing ones. Hierarchical, Object-Oriented, Relational and NoSQL are a few of the database models. These models are purely theoretical and describe how data can be stored, accessed, managed and updated in an efficient way. The relational model was invented by E. F. Codd at IBM in 1970, hence the name Codd's 12 rules. A link to the original paper can be found online.
As mentioned above, the models are theoretical and we need implementations to put them to use. MySQL and Informix are a few of the implementations of the Relational model. Below are a few of the database models and their corresponding implementations. A single model or implementation will not fit each and every requirement, so based on the requirements the appropriate model has to be chosen first and then the implementation. Along the same lines, below are a few of the distributed computing models and their corresponding implementations. As with the database models, new distributed processing models are being invented to address the limitations of the previous ones.
The MapReduce model was published by Google in 2004, and Hadoop was inspired by and implemented from this original paper. Google has a unique requirement to store and process huge amounts of data in a quick and cost-effective way, and so it spends a lot of effort on research in the distributed computing space. Once in a while, Google publishes this research work into the public domain. Following the research papers published by Google gives a good insight into what's happening in the Big Data space, where Google's efforts have been leading the way.
Their research papers led to ZooKeeper, HBase, Cassandra and a lot of other software in the Big Data space. The MapReduce implementations have been around for quite some time and are very stable. But users of MapReduce are running into the limitations of the model, and so other models like RDD (Resilient Distributed Datasets) and the corresponding implementation, Apache Spark, are getting a lot of traction. In the coming blogs, we will look in more detail at what the MapReduce model is all about and how the RDD model tries to address some of its limitations.
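To give a rough feel for what the MapReduce model looks like before the detailed blogs, here is a minimal word-count sketch in plain Python. It only mimics the map, shuffle and reduce phases on a single machine; it is not Hadoop code, and the function names are just illustrative.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: sum all the counts emitted for a given word
    return (word, sum(counts))

def word_count(lines):
    # Shuffle: group the mapped pairs by key (the word)
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)
    # Run the reduce phase on every group of values
    return [reduce_phase(word, counts) for word, counts in grouped.items()]

print(word_count(["the quick brown fox", "the lazy dog"]))
# [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]
```

For comparison, the same computation expressed against Spark's RDD API would read roughly as lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b), which hints at how the RDD model chains transformations instead of forcing everything into a single map and reduce step.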