Developing with Spark for Big Data | Enterprise-Grade Spark Programming for the Hadoop & Big Data Ecosystem

Live Classroom
Duration: 5 days
Live Virtual Classroom
Duration: 5 days


The Developing with Spark for Big Data course introduces participants to enterprise-grade Spark programming, covering intermediate and advanced concepts and enabling them to work with the key components of Apache Spark to develop data science solutions. The course equips participants with the skills and knowledge to work with Apache Spark in real-world enterprises and make effective data-driven decisions. Hands-on exercises throughout ensure that participants gain a thorough understanding of all the concepts covered.

What You'll Learn

  • Basics of Spark architecture and applications
  • Executing Spark programs
  • Creating and manipulating both RDDs (Resilient Distributed Datasets) and Unified DataFrames
  • Persisting and restoring DataFrames
  • Essential NoSQL access
  • Integrating machine learning into Spark applications
  • Using Spark Streaming and Kafka to create streaming applications


  • Hadoop ecosystem
  • Hadoop YARN vs. Mesos
  • Spark vs. MapReduce
  • Spark with MapReduce – Lambda Architecture
  • Spark in the enterprise data science architecture

  • Spark Shell
  • RDDs: Resilient Distributed Datasets
  • Data frames
  • Spark 2 unified DataFrames
  • Spark sessions
  • Functional programming
  • Spark SQL
  • MLlib
  • Structured streaming
  • Spark R
  • Spark and Python

  • Coding with RDDs
  • Transformations
  • Actions
  • Lazy evaluation and optimization
  • RDDs and MapReduce

  • RDDs vs. DataFrames
  • Unified DataFrames (UDF) in Spark 2.0
  • Partitioning

  • Spark sessions
  • Running applications
  • Logging

  • RDD persistence
  • DataFrame and Unified DataFrame persistence
  • Distributed persistence

  • Streaming overview
  • Streams
  • Structured streaming
  • DStreams and Apache Kafka

  • Ingesting data
  • Parquet files
  • Relational databases
  • Graph databases (Neo4J, GraphX)
  • Interacting with Hive
  • Accessing Cassandra data
  • Document databases (MongoDB, CouchDB)

  • MapReduce and Lambda integration
  • Camel integration
  • Drools and Spark

  • MLlib and Mahout
  • Classification
  • Clustering
  • Decision trees
  • Decompositions
  • Pipelines
  • Spark packages

  • Spark SQL
  • SQL and DataFrames
  • Spark SQL and Hive
  • Spark SQL and JDBC

  • Graph APIs
  • GraphX
  • ETL in GraphX
  • Exploratory analysis
  • Graph computation
  • Pregel API overview
  • GraphX algorithms
  • Neo4J as an alternative

  • Using web notebooks (Zeppelin, Jupyter)
  • R on Spark
  • Python on Spark
  • Scala on Spark

  • Parallelizing Spark applications
  • Clustering concerns for developers

  • Monitoring Spark performance
  • Tuning memory
  • Tuning CPU
  • Tuning data locality
  • Troubleshooting

Who should attend

The course is highly recommended for:
  • Developers
  • Architects
  • Big Data professionals
  • Hadoop professionals


Prerequisites

Participants need to have experience working in a development role and an understanding of the Big Data and Hadoop ecosystem.

Interested in this Course?

    Ready to recode your DNA for GenAI?
    Discover how Cognixia can help.

    Get in Touch