Extending Hadoop for Data Science: Streaming, Spark, Storm, and Kafka


What you should know
Using the exercise files

1. Hadoop Core Fundamentals

Modern Hadoop
File system used with Hadoop
Apache and commercial Hadoop distributions
Hadoop libraries
Hadoop on Google Cloud Platform
Run a Hadoop job on GCP
Databricks on AWS

2. Setting Up a Hadoop Dev Environment

Set up IDE - VS Code + Python extension
Sign up for Databricks community edition
Add Hadoop libraries to your test environment
Your first cluster on Databricks Community Edition
Load data into tables

3. Hadoop Batch Processing

Processing options
Prerequisite understanding
Resource coordinators
Compare YARN vs. Standalone

4. Fast Hadoop Options

Fast Hadoop use cases
Big data streaming
Streaming options
Apache Spark basics
Spark use cases

5. Spark Basics

Apache Spark libraries
Spark data interfaces
Select your programming language
Spark session objects
Spark shell

6. Using Spark

Tour the Databricks environment
Tour the notebook
Import and export notebooks
Calculate pi on Spark
Run wordcount on Spark with Scala
Understand wordcount on Spark with Python
Import data
Transformations and actions
Caching and the DAG
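The section above walks through wordcount on Spark and the transformations-and-actions model. As a rough illustration of that pipeline (flatMap to split lines, then a key-based count, mirroring reduceByKey), here is a plain-Python sketch that runs without a Spark cluster; the sample `lines` data is made up for the example.

```python
from collections import Counter
from itertools import chain

def wordcount(lines):
    """Mirror Spark's wordcount pipeline in plain Python:
    flatMap(split) -> map((word, 1)) -> reduceByKey(add)."""
    # flatMap: split every line into individual words
    words = chain.from_iterable(line.split() for line in lines)
    # map + reduceByKey: count occurrences per word
    return dict(Counter(words))

lines = ["to be or not to be", "to see or not to see"]
print(wordcount(lines))  # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```

In Spark, each of these steps would be a lazy transformation on an RDD or DataFrame, only evaluated when an action (such as collect) triggers the DAG.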

7. Spark Libraries

Spark SQL
Spark ML: Preparing data
Spark ML: Building the model
Spark ML: Evaluating the model
Advanced machine learning on Spark
MXNet or TensorFlow
Spark with GraphX
Spark with ADAM for genomics

8. Spark Streaming

Re-examine streaming pipelines
Spark streaming
Streaming ingest services
Advanced Spark streaming with MLeap
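Spark Streaming processes data as micro-batches, often aggregated over a sliding window of recent batches. As a hedged sketch of that windowed-count idea (this is illustrative plain Python, not Spark's API; the class name, batch contents, and window length are invented for the example):

```python
from collections import deque, Counter

class WindowedCounter:
    """Toy micro-batch windowed count, loosely modeling the idea
    behind windowed operations in Spark Streaming: only the last
    N batches contribute to the current aggregate."""

    def __init__(self, window_batches):
        # oldest batch automatically drops out when the window is full
        self.window = deque(maxlen=window_batches)

    def add_batch(self, events):
        self.window.append(Counter(events))

    def counts(self):
        total = Counter()
        for batch in self.window:
            total.update(batch)
        return dict(total)

wc = WindowedCounter(window_batches=2)
wc.add_batch(["click", "view", "click"])
wc.add_batch(["view"])
wc.add_batch(["click"])   # the first batch slides out of the window
print(wc.counts())        # {'view': 1, 'click': 1}
```

The design point is that state is bounded: each new micro-batch evicts the oldest one, so the aggregate always reflects a fixed-length window of recent data.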

9. Hadoop Streaming

Pub/Sub on GCP
Apache Kafka
Kafka architecture
Apache Storm
Storm architecture
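Kafka's core abstraction is an append-only log per topic partition: producers append messages, and each consumer tracks its own read offset so independent consumers can replay the same data. As a rough sketch of that log-plus-offset model (plain Python for illustration, not the Kafka client API; all names here are invented):

```python
class TopicLog:
    """Toy append-only log modeling a single-partition Kafka topic:
    producers append messages; each consumer keeps its own offset."""

    def __init__(self):
        self.messages = []   # the append-only log
        self.offsets = {}    # consumer name -> next offset to read

    def produce(self, message):
        self.messages.append(message)

    def consume(self, consumer):
        """Return all messages this consumer has not yet seen."""
        start = self.offsets.get(consumer, 0)
        self.offsets[consumer] = len(self.messages)
        return self.messages[start:]

topic = TopicLog()
topic.produce("event-1")
topic.produce("event-2")
print(topic.consume("analytics"))  # ['event-1', 'event-2']
topic.produce("event-3")
print(topic.consume("analytics"))  # ['event-3']
print(topic.consume("audit"))      # ['event-1', 'event-2', 'event-3']
```

Because the broker stores only the log and each consumer owns its offset, a late-joining consumer (like "audit" above) reads the full history without affecting anyone else, which is the property that distinguishes Kafka's model from a classic queue.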

10. Modern Hadoop Architectures

Combine Hadoop libraries and more
Review batch architecture for ETL
Spark architecture for interactive analytics
Spark architecture for genomics
Spark Streaming architecture for IoT
Spark Streaming architecture for dynamic prediction


Next steps