Overview
This course offers a broad overview of current technologies in the data science ecosystem, with an emphasis on Spark and related tools. It is structured for developers who want to strengthen their skills and learn enterprise-grade Spark programming. Topics range from core Spark features to hands-on practice with the related technologies.
What You'll Learn
- Basics of Spark architecture and applications
- Executing Spark programs
- Creating and manipulating both RDDs (Resilient Distributed Datasets) and unified DataFrames (the Spark 2.x DataFrame/Dataset API)
- Saving and restoring DataFrames
- Essential NoSQL access
- Integrating machine learning into Spark applications
- Using Spark Streaming and Kafka to create streaming applications
Curriculum
- Overview of Spark
- Hadoop ecosystem
- Hadoop YARN vs. Mesos
- Spark vs. MapReduce
- Spark: Lambda architecture
- Spark in the enterprise data science architecture
- Spark shell
- RDDs: Resilient distributed datasets
- DataFrames
- Spark 2 unified DataFrames
- Spark sessions
- Functional programming
- Spark SQL
- MLlib
- Structured streaming
- SparkR
- Spark and Python
- Exercise: Hello, Spark
- Coding with RDDs
- Transformations
- Actions
- Lazy evaluation and optimization
- RDDs in MapReduce
- Exercise: Working with RDDs
- RDDs vs. DataFrames
- Unified DataFrames in Spark 2.x
- Partitioning
- Exercise: Working with unified DataFrames
- RDD persistence
- DataFrame and unified DataFrame persistence
- Distributed persistence
- Exercise: Saving and restoring DataFrames
- Ingesting data
- Relational databases and Sqoop
- Interacting with Hive
- Graph data
- Accessing Cassandra data
- Exercise: NoSQL data access
- Spark SQL
- SQL and DataFrames
- Spark SQL and Hive
- Spark SQL and JDBC
- Exercise: Working with Spark SQL
- MLlib
- Mahout
- Exercise: Hello, MLlib
- Streaming overview
- Streams
- Structured streaming
- Lambda streaming
- Spark and Kafka
- Exercise: Hello, Spark Streaming
Who Should Attend
This course is geared toward experienced developers, and architects with development experience, who want to become proficient with Apache Spark in an enterprise data environment.
This course is highly recommended for:
- Hadoop/Spark developers
- Data scientists
- Data engineers
- Big Data engineers
- Java developers
- Application developers
- Full stack developers