Hadoop is no longer just Hadoop. Cloudera now promotes Kudu as an alternative to HDFS and says that Spark is at the center of its universe, which also means that Spark is replacing MapReduce everywhere. Hortonworks is leaning towards Spark as well, so the only item you can reliably find in a Hadoop cluster now is YARN. Even that has an exception: some Spark users choose Mesos over YARN. And Spark does not need HDFS.
But that doesn’t diminish the utility of a distributed file system. BI is a major use case, and Impala and Kudu are optimized for it. Spark is immensely useful for some tasks, but at times you need an MPP solution such as Impala. Hive, meanwhile, remains useful as a file-to-table management system. Even when your focus is on in-memory, real-time analytics with Spark rather than on Hadoop, you might still use a little Hadoop in places. Hadoop is not dead at all; that’s just not true. But it’s not just Hadoop anymore.
There is a lot that is new in the Hadoop and Spark world. Here are some points worth noting:
Spark
There is no doubt about Spark’s speed, but more important than speed is that the Spark API is easier to use and requires less code than earlier distributed computing paradigms. With IBM promising to train one million new Spark developers and to invest heavily in the project, Spark being touted as the center of everything, and whole-hearted support from Hortonworks, Spark is going to rule the roost.
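To make the "less code" point concrete, here is a minimal sketch in plain Python. It is not Spark or Hadoop code; it only contrasts the two programming styles on a word count. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample data are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict
from itertools import chain

lines = ["spark is fast", "hive is slow", "spark is popular"]


# MapReduce style: three explicit phases that the developer has to write.
def map_phase(line):
    """Emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


mapreduce_counts = reduce_phase(
    shuffle_phase(chain.from_iterable(map_phase(line) for line in lines))
)

# Chained style: the same result as a single expression, closer in spirit
# to a Spark pipeline of the form flatMap -> map -> reduceByKey.
chained_counts = dict(Counter(word for line in lines for word in line.split()))

print(mapreduce_counts == chained_counts)
```

Both versions compute the same counts; the difference is how much plumbing you have to write yourself.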
Spark also has economics on its side. There was a time when processing data in memory was too costly, but with cloud computing and greater elasticity, more workloads can now be loaded into memory. This does not mean all of your data, only the volume needed to produce the result.
Spark still needs some fine tuning, and its rough edges are most evident in a production environment, but the bumps in the road are worth it. Spark is much faster and better overall. Ironically, most of the noise around Spark is about streaming, which is its weakest area. It is adequate for about 80% of use cases, but you would still need to look at alternatives for sub-second or high-volume data ingestion.
Hive
Hive lets you run SQL queries against text or structured files. Those files usually live on HDFS; Hive catalogs the data and exposes the files as tables. You can use JDBC or ODBC to connect your favorite tool to Hive.
In short, Hive is a boring, slow, but useful tool. By default, it converts SQL into MapReduce jobs. You can switch it to the DAG-based Tez engine, which is much faster. You can also switch it to Spark, but that option is so immature that even the word “alpha” undersells how rough it is.
It is important to understand Hive because a large number of Hadoop projects begin with the idea of dumping the data somewhere, and Hive is the simplest way to do that. To query that data with good performance, you might need other tools such as Phoenix or Impala.
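The file-to-table idea described above can be illustrated with a small analogy that uses only the Python standard library. This is not Hive: SQLite copies the rows into its own store, whereas Hive leaves the files in place and applies the table definition when it reads them. The table name `click_log`, its columns, and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for a delimited text file that would normally live on HDFS.
raw_file = io.StringIO("name,clicks\nalice,3\nbob,5\nalice,4\n")

conn = sqlite3.connect(":memory:")

# Declare a table schema for the file's columns.
conn.execute("CREATE TABLE click_log (name TEXT, clicks INTEGER)")

# Load the file's rows into the table so they can be queried with SQL.
reader = csv.DictReader(raw_file)
conn.executemany(
    "INSERT INTO click_log (name, clicks) VALUES (?, ?)",
    ((row["name"], int(row["clicks"])) for row in reader),
)

# Once the data is exposed as a table, ordinary SQL applies.
totals = conn.execute(
    "SELECT name, SUM(clicks) FROM click_log GROUP BY name ORDER BY name"
).fetchall()

print(totals)
```

The point of the sketch is the workflow: raw files go in, a schema is declared over them, and any SQL-speaking tool can query the result.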
Kerberos
At present, Kerberos is the only fully implemented authentication mechanism for Hadoop. Tools like Ranger or Sentry can sometimes ease the pain, but for integrating with Active Directory you would still need Kerberos.
Ranger/Sentry
If you plan to skip Ranger or Sentry, keep in mind that each component of your big data platform will then handle authentication and authorization on its own. There will be no central control, and every component will have its own view of the rules.
The question is which one to choose. At present, Ranger looks a little ahead and more complete, but it is a Hortonworks product, while Sentry is Cloudera’s tool. Each supports only those parts of the Hadoop stack that its vendor supports. If you are not committed to either Cloudera or Hortonworks, Ranger seems the more viable option. However, if Cloudera’s announcements are to be believed, Sentry will gain ground soon as part of its one platform strategy.
Cognixia provides an excellent training program on Hadoop and Spark. Our Hadoop administrator training and Hadoop developer training are designed to walk you through the entire Hadoop ecosystem. These online courses cover tools such as Pig, Hive, MapReduce, and HDFS. Cognixia’s Spark training program is structured to give you a holistic view of the Spark environment. For further information, you can write to us.