Apache Spark for Data Scientists

Overview

This three-day course equips participants with the knowledge, skills, and tools to use Apache Spark effectively for data analysis. Participants receive practical training in Apache Spark and learn how to build comprehensive big data applications that integrate batch, streaming, and interactive analytics across all their data. The course covers common data-related tasks in Spark and shows how to write sophisticated parallel applications that support faster, better decisions and real-time actions across a wide variety of use cases, architectures, and industries.

What You'll Learn

Introduction to Spark
Spark and data science
Build comprehensive big data applications by integrating batch, streaming, and interactive analytics on all data
Use Apache Spark for common data-related activities
Understand DataFrames and Spark SQL
Work with Spark MLlib, Spark GraphX, and Spark Streaming
Understand memory management

Curriculum

Module 1: Spark and Data Science

Data Science: The state of the art
Hadoop, YARN, and Spark
Architectural overview
Spark and Storm
MLlib and Mahout
Distributed vs. local run modes
Hello, Spark

Module 2: Spark overview

Spark core
Spark SQL
Spark and Hive
MLlib
Mahout
Spark Streaming
Spark API

Module 3: DataFrames

DataFrames and resilient distributed datasets (RDDs)
Partitions
DataFrame types
DataFrame operations
Map/Reduce with DataFrames
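
The partitioning and map/reduce topics in this module can be previewed with a small sketch. The example below is plain Python, not Spark's API: the helper names (`split_into_partitions`, `count_words`, `merge_counts`) and the sample lines are hypothetical, chosen only to show how a map step runs independently on each partition and a reduce step merges the partial results.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Distribute records round-robin across a fixed number of partitions."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Map step: count the words found in a single partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine the partial counts from two partitions."""
    return left + right


# Hypothetical input data, standing in for a distributed dataset.
lines = [
    "spark makes big data simple",
    "big data needs big clusters",
    "spark runs on clusters",
]

partitions = split_into_partitions(lines, num_partitions=2)
# On a cluster, each partition would be processed by a separate task.
partial_counts = [count_words(p) for p in partitions]
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts.most_common(3))
```

Because the reduce step is associative, the partial counts can be merged in any order, which is what lets an engine such as Spark schedule the partitions independently.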

Module 4: Spark SQL

Spark SQL overview
Data stores: HDFS, Cassandra, HBase, Hive, and S3
Table definitions
ETL in Spark
Queries

Module 5: Spark MLlib

MLlib overview
MLlib algorithms overview

Module 6: Spark Streaming

Streaming overview
Real-time data ingestion
State
Window operations
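
As a rough illustration of the window operations listed above, the sketch below sums the most recent micro-batches each time the window slides. It is plain Python rather than the Spark Streaming API. The function name `windowed_sums` and the sample batches are hypothetical, and the window length and slide interval are counted in batches here, not in units of time.

```python
from collections import deque


def windowed_sums(batches, window_length, slide_interval):
    """Sum the latest `window_length` batches, emitting once every `slide_interval` batches."""
    window = deque(maxlen=window_length)  # the oldest batch drops out automatically
    results = []
    for index, batch in enumerate(batches, start=1):
        window.append(sum(batch))
        if index % slide_interval == 0:
            results.append(sum(window))
    return results


# Each inner list is one hypothetical micro-batch of event counts.
batches = [[1, 2], [3], [4, 1], [2, 2], [5], [1]]
sums = windowed_sums(batches, window_length=3, slide_interval=2)

print(sums)
```

The bounded deque is the only state kept between batches, which mirrors the idea that a windowed stream needs to remember just the batches still inside the window.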

Module 7: Spark GraphX

GraphX overview
ETL with GraphX
Graph computation
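
To give a flavour of iterative graph computation, here is a minimal plain-Python sketch of PageRank. It does not use GraphX or reproduce its implementation. The function name `pagerank`, the damping factor, the iteration count, and the three-node edge list are all assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Iteratively estimate PageRank scores for a directed graph given as (src, dst) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        incoming = {node: 0.0 for node in nodes}
        for src, targets in out_links.items():
            if targets:
                # Split this node's rank evenly among the nodes it links to.
                share = ranks[src] / len(targets)
                for dst in targets:
                    incoming[dst] += share
            else:
                # A node with no outgoing links spreads its rank over every node.
                for node in nodes:
                    incoming[node] += ranks[src] / len(nodes)
        ranks = {
            node: (1 - damping) / len(nodes) + damping * incoming[node]
            for node in nodes
        }
    return ranks


# Hypothetical graph: a and b both link to c, and c links back to a.
edges = [("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(edges)

print(max(ranks, key=ranks.get))
```

Each iteration only needs a node's current rank and its outgoing edges, which is why this kind of computation distributes well across a partitioned graph.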

Module 8: Performance and tuning

Broadcast variables
Accumulators
Memory management

Module 9: Cluster mode

Standalone cluster
Masters and workers
Configurations
Working with large data sets

Who should attend

This course is intended for system administrators, testers, and people in technical data-related roles who need to use Spark for data analysis or data processing.

This course is highly recommended for:

Big data and analytics consultants
Data scientists
Data engineers
System administrators
System testers
Data analysts

Prerequisites

There are no mandatory prerequisites for this course. However, completing the Foundations of Agile course beforehand would be beneficial.
