Apache Spark Big Data Boot Camp

Overview

Apache Spark is a popular toolset for powering Big Data solutions with distributed cluster computing, owing to its speed, versatility and access to powerful APIs and libraries. Spark lets applications support data science workloads with R-style DataFrames, and its Big Data streaming helps overcome the time constraints of earlier Big Data solutions. This fast-paced three-day course provides a thorough, hands-on overview of the Apache Spark platform as well as the technologies and paradigms that form part of Spark. The course will help participants master the skills necessary to use Apache Spark in their own applications.

What You'll Learn

The origin of Apache Spark
Apache Spark vs. Apache Hadoop
Apache Spark use cases
Streaming architecture of Spark
SQL architecture in Spark
Apache Spark and Machine Learning
Machine Learning libraries
Apache Spark GraphX

Curriculum

Introduction to data analysis
Introduction to Big Data
Definition of Big Data
Introduce the techniques and challenges in Big Data
Introduce the techniques and challenges in Distributed Computing
Show how the functional programming approach is particularly useful in tackling these challenges
Short overview of previous solutions – Google’s MapReduce and Apache Hadoop
Introduction to Apache Spark
Exercise: Get exposure to Spark administration and setup
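
As a rough illustration of why the functional approach suits distributed computing, the sketch below counts words in the MapReduce style using plain Python rather than the Spark or Hadoop APIs. The input data and the function names `map_partition` and `merge_counts` are hypothetical, chosen only for this example. Each partition is processed without shared state, and the merge step is associative, so partial results can be combined in any order.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition, touching no shared state."""
    return Counter(word.lower() for line in lines for word in line.split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts (associative, so order is free)."""
    return left + right


# Hypothetical input, already split into two partitions (one per worker).
partitions = [
    ["spark is fast", "spark is general"],
    ["hadoop is batch oriented"],
]

# Each partition could be mapped on a different machine in a real cluster.
partial_counts = map(map_partition, partitions)
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts.most_common(3))
```

In this sketch the two steps never mutate shared data, which is the property that lets a cluster scheduler run them in parallel and retry failed pieces safely.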

Spark architecture in a cluster
Spark ecosystem and cluster management
Deploying Spark on a cluster
Deploying Spark on a standalone cluster
Deploying Spark on a Mesos cluster
Deploying Spark on a YARN cluster
Cloud-based deployment
Exercise: Learn to deploy and begin using Spark

Dig deeper into Apache Spark
Introduce Resilient Distributed Datasets (RDD)
Apache Spark installation
Introduce the Spark Shell
Actions and Transformations (Laziness)
Caching
Loading and saving data files from the file system
Exercise: Get hands-on with Spark Code and RDDs
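
The distinction between lazy transformations and eager actions can be sketched without a cluster. The example below is plain Python using generators, not the Spark RDD API; `lazy_map`, `lazy_filter`, `collect` and the `log` list are hypothetical names used only to show when work actually happens.

```python
log = []  # records each element as it is actually processed


def lazy_map(func, source):
    """Transformation: describes the work but performs none until iterated."""
    def generator():
        for item in source:
            log.append(f"map {item}")
            yield func(item)
    return generator()


def lazy_filter(predicate, source):
    """Transformation: also lazy, built on a generator expression."""
    return (item for item in source if predicate(item))


def collect(source):
    """Action: forces evaluation of the whole pipeline and returns a list."""
    return list(source)


numbers = range(1, 6)
squares = lazy_map(lambda x: x * x, numbers)
even_squares = lazy_filter(lambda x: x % 2 == 0, squares)

work_before_action = len(log)   # the pipeline is only a plan at this point
result = collect(even_squares)  # the action triggers the computation
work_after_action = len(log)

print(work_before_action, result, work_after_action)
```

Nothing is logged until `collect` runs, which mirrors the idea that transformations only build up a plan and an action is what executes it.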

Tailored RDD
Pair RDD
NewHadoopRDD
Aggregations
Partitioning
Broadcast variables
Accumulators
Exercise: Learn expanded RDD capabilities

SparkSQL and DataFrames
DataFrame and SQL API
DataFrame Schema
Datasets and Encoders
Loading and saving data
Aggregations
Joins
Exercise: Learn to use DataFrames, one of Spark’s most powerful features, for R-style data modelling backed by cluster computing

A brief introduction to streaming
Spark streaming
Discretized streams
Structured streaming
Stateful/stateless transformations
Checkpointing
Inter-operability with streaming platforms (Apache Kafka)
Exercise: Use Big Data streaming, one of the most exciting features of Spark 2.1, to overcome the time constraints of earlier Big Data solutions

Introduction to Machine Learning
Spark Machine Learning APIs
Feature extraction and transformation
Classification using logistic regression
Best practices in machine learning for practitioners
Exercise: Use Spark to make production-friendly calls to machine learning services for predictive analysis

Brief introduction to Graph theory
GraphX
Vertex and Edge RDDs
Graph operators
Pregel API
PageRank / Travelling salesman problem
Exercise: Get hands-on practice using GraphX
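
To give a feel for the PageRank topic listed above, here is a minimal power-iteration sketch in plain Python. It does not use GraphX or the Pregel API; the `pagerank` function, the three-page graph, and the damping factor of 0.85 are assumptions made for illustration. It also assumes every link target appears as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along outgoing links.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)

for page, rank in sorted(ranks.items()):
    print(page, round(rank, 3))
```

Each round is a "send rank to neighbours, then sum what arrives" step, which is the same vertex-centric message-passing pattern that graph processing frameworks distribute across a cluster.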

Testing in a distributed environment
Testing Spark applications
Debugging Spark applications
Exercise: Lab practice applying best practices for testing, debugging and handling day-to-day production issues in Spark solutions

Who should attend

This boot camp is highly recommended for:

Developers and team leads
Software engineers
Business analysts
System analysts
Data analysts
Data scientists
Operations and DevOps engineers
Java developers
Big Data engineers

Prerequisites

There are no mandatory prerequisites for this course; however, a basic understanding of Scala or Python would be beneficial. It is also recommended to complete the Fundamentals of DevOps course before taking the Apache Spark Big Data Boot Camp.

Curriculum

Module 1: Introduction to Big Data and Apache Spark

Module 2: Deploying and understanding Apache Spark architecture

Module 3: Spark Core, RDDs and Spark Shell

Module 4: Deep dive into RDD

Module 5: Spark SQL and DataFrames

Module 6: Spark Streaming

Module 7: Spark MLlib and ML

Module 8: GraphX

Module 9: Testing and debugging Spark
