Apache Spark for Data Scientists

Live Classroom
Duration: 5 days
Live Virtual Classroom
Duration: 5 days

This course equips participants with the knowledge, skills, and tools to use Apache Spark effectively for their data analysis needs. Participants will gain practical training in Apache Spark and learn how to build comprehensive big data applications by integrating batch, streaming, and interactive analytics on all their data. The course explores using Apache Spark for common data-related activities. Participants will also learn how to write sophisticated parallel applications that support faster, better decisions and real-time actions across a wide variety of use cases, architectures, and industries.

What You'll Learn

  • Introduction to Spark
  • Spark and data science
  • Build comprehensive big data applications by integrating batch, streaming, and interactive analytics on all data
  • Use Apache Spark for common data-related activities
  • Understand DataFrames and Spark SQL
  • Work with Spark MLlib, Spark GraphX, and Spark Streaming
  • Understand memory management


  • Data Science: The state of the art
  • Hadoop, YARN, and Spark
  • Architectural overview
  • Spark and Storm
  • MLlib and Mahout
  • Distributed vs. local run modes
  • Hello, Spark

  • Spark core
  • Spark SQL
  • Spark and Hive
  • MLlib
  • Mahout
  • Spark Streaming
  • Spark API

  • DataFrames and resilient distributed datasets (RDDs)
  • Partitions
  • DataFrame types
  • DataFrame operations
  • Map/Reduce with DataFrames

  • Spark SQL overview
  • Data stores: HDFS, Cassandra, HBase, Hive, and S3
  • Table definitions
  • ETL in Spark
  • Queries

  • MLlib overview
  • MLlib algorithms overview

  • Streaming overview
  • Real-time data ingestion
  • State
  • Window operations

  • GraphX overview
  • ETL with GraphX
  • Graph computation

  • Broadcast variables
  • Accumulators
  • Memory management

  • Standalone cluster
  • Masters and workers
  • Configurations
  • Working with large data sets

Who should attend

This course is intended for systems administrators, testers, and people in technical data-related roles who need to learn to use Spark for data analysis or data processing.

This course is highly recommended for:

  • Big data and analytics consultants
  • Data scientists
  • Data engineers
  • System administrators
  • System testers
  • Data analysts


There are no mandatory prerequisites for this course. However, completing the Foundations of Agile course beforehand would be beneficial.
