Learning Path: Data Science With Apache Spark 2

£ 40 + VAT

Description

  • Type

    Course

  • Methodology

    Online

  • Start date

    Different dates available

Spark is one of the most widely used large-scale data processing engines and runs extremely fast. It is a framework whose tools are equally useful for application developers and data scientists.

This Learning Path begins with an introduction to Apache Spark. We first cover the basics of Spark, introduce SparkR, then look at the charting and plotting features of Python in conjunction with Spark data processing, and finally Spark's data processing libraries. We then develop a real-world Spark application. Next, we enable you to become comfortable and confident working with Spark for data science by exploring Spark's data science libraries on a dataset of tweets. Begin your journey into fast, large-scale, and distributed data processing using Spark with this Learning Path.

About the Authors

Rajanarayanan Thottuvaikkatumana, Raj, is a seasoned technologist with more than 23 years of software development experience at various multinational companies. He has lived and worked in India, Singapore, and the USA, and is presently based in the UK. His experience includes architecting, designing, and developing software applications. He has worked on various technologies including major databases, application development platforms, web technologies, and big data technologies. Since 2000, he has been working mainly in Java-related technologies, and does heavy-duty server-side programming in Java and Scala. He has worked on very highly concurrent, highly distributed, and high-transaction-volume systems. Currently, he is building a next-generation Hadoop YARN-based data processing platform and an application suite built with Spark using Scala.

Raj holds a master's degree in Mathematics and a master's degree in Computer Information Systems, and has many certifications in ITIL and cloud computing to his credit. Raj is the author of Cassandra Design Patterns - Second Edition, published by Packt.

Facilities

Location: Online

Start date: Different dates available. Enrolment now open.

About this course

  • Get to know the fundamentals of Spark 2.0 and the Spark programming model using Scala and Python
  • Know how to use Spark SQL and DataFrames using Scala and Python
  • Get an introduction to Spark programming using R
  • Perform Spark data processing, charting, and plotting using Python
  • Get acquainted with Spark stream processing using Scala and Python
  • Be introduced to machine learning with Spark using Scala and Python
  • Get started with graph processing with Spark using Scala
  • Develop a complete Spark application
  • Understand Spark and its ecosystem of packages in data science
  • Obtain and clean data before processing it
  • Use Spark's machine learning algorithms to build a simple pipeline
  • Work with interactive visualization packages alongside Spark
  • Apply data mining techniques on the available datasets
  • Build a recommendation engine

Subjects

  • Install
  • Programming
  • Quality Training
  • Systems
  • Web
  • Wine
  • Quality
  • SQL
  • Apache
  • Server
  • Java
  • Options

Course programme

Apache Spark 2 for Beginners (45 lectures, 05:38:51)

Apache Spark 2 for Beginners - The Course Overview
This video gives an overview of the entire course.

An Overview of Apache Hadoop
This video will take you through an overview of Apache Hadoop. You will also explore the Apache Hadoop framework and the MapReduce process.
  • Understand the Apache Hadoop architecture
  • Know the MapReduce programming model

Understanding Apache Spark
By the end of this video, you will learn in depth about Spark and its advantages. You will also go through the Spark libraries and then dive into the Spark programming paradigm.
  • Understand the Directed Acyclic Graph (DAG) engine
  • Explore the Spark programming paradigm and the Spark libraries

Installing Spark on Your Machines
In this video, you will learn how to install Python and R. Finally, you will be able to set up the Spark environment on your machine.
  • Install Python
  • Install R and its development tools

Functional Programming with Spark and Understanding Spark RDD
Side effects in program logic make it hard to get consistent results from a program or function, and they make many applications very complex; functional programming with Spark RDDs avoids them. A short Scala sketch follows the list below.
  • Construct a Spark RDD
  • Complete the data processing job
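
As a rough illustration (not from the course materials), here is a minimal Scala sketch of constructing an RDD with pure, side-effect-free transformations. It assumes the Spark shell (spark-shell), where spark and sc are predefined; the transaction records are hypothetical.

    // Hypothetical "accountNo,amount" records; parallelize turns the
    // local collection into a distributed RDD.
    val acTransList = Seq("SB10001,1000", "SB10002,1200", "SB10001,8000", "SB10004,400")
    val acTransRDD = sc.parallelize(acTransList)

    // Pure functions (no side effects) transform the RDD lazily.
    val goodTransRecords = acTransRDD
      .map(_.split(","))
      .filter(fields => fields(1).toDouble > 0)

    // Nothing runs until an action such as count() is called.
    println(goodTransRecords.count())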

Data Transformations and Actions with RDDs
Learn to process data using RDDs from the relevant data source, such as text files and NoSQL data stores. A sketch of transformations versus actions follows the list below.
  • Create Spark transformations and Spark actions
  • Calculate the account-level summary of the transactions from the RDD
  • Count the number of elements in the RDD
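
A hedged Scala sketch of the transformation/action split (spark-shell; data hypothetical): reduceByKey is a transformation that builds an account-level summary, while collect and count are actions that trigger the computation.

    // Hypothetical transaction records.
    val acTransRDD = sc.parallelize(Seq("SB10001,1000", "SB10002,1200", "SB10001,8000"))

    // Transformations: build (accountNo, amount) pairs, then sum per account.
    val acSummary = acTransRDD
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))
      .reduceByKey(_ + _)

    // Actions: materialize the summary and count the elements in the RDD.
    acSummary.collect().foreach(println)
    println(acTransRDD.count())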

Monitoring with Spark
Learn to handle the tools for monitoring the jobs running in a given Spark ecosystem. A hedged configuration sketch follows the list below.
  • Create an event logging mechanism
  • Edit the newly created spark-defaults.conf file
  • Customize the property spark.driver.memory to have a higher value
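
As a hedged sketch, the same standard monitoring properties the video puts in spark-defaults.conf are shown here on a SparkConf in Scala; the log directory path is hypothetical.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .setAppName("MonitoredApp")
      .setMaster("local[*]")
      .set("spark.eventLog.enabled", "true")                  // event logging mechanism
      .set("spark.eventLog.dir", "file:///tmp/spark-events")  // hypothetical log directory
    // Note: spark.driver.memory only takes effect if set before the driver
    // JVM starts, i.e. in spark-defaults.conf or on spark-submit, which is
    // why the video edits the file rather than the code.

    val spark = SparkSession.builder().config(conf).getOrCreate()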

The Basics of Programming with Spark
Learn to explain the core concepts of Spark programming and the elementary data items that the examples pick up. A short sketch of the aggregate calculations follows the list below.
  • Start the Scala REPL for Spark and make sure that it starts without any errors
  • Calculate the sum, maximum, and minimum of all transaction amounts from the good records
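
A minimal Scala sketch of those aggregate calculations (spark-shell; the "good records" here are hypothetical).

    // Extract the amounts from hypothetical good records, then aggregate.
    val goodAmounts = sc
      .parallelize(Seq("SB10001,1000", "SB10002,1200", "SB10003,8000"))
      .map(_.split(",")(1).toDouble)

    println(s"Sum: ${goodAmounts.sum()}")
    println(s"Max: ${goodAmounts.max()}")
    println(s"Min: ${goodAmounts.min()}")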

Creating RDDs from Files and Understanding the Spark Library Stack
Learn to choose the appropriate Spark connector program and the appropriate API for reading data. A short sketch follows the list below.
  • Understand the concept of stacking libraries on top of the core framework
  • Standardize the data processing toolset without vendor lock-in
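
A hedged sketch of creating an RDD from a file in spark-shell; the path is hypothetical, and each line of the file becomes one element of the RDD.

    // Read a local text file (hypothetical path); use an hdfs:// URI for HDFS data.
    val linesRDD = sc.textFile("file:///tmp/transactions.csv")
    val fieldsRDD = linesRDD.map(_.split(","))
    println(linesRDD.count())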

Understanding the Structure of Data and the Need for Spark SQL
What if you could not make use of the RDD-based Spark programming model because it requires some amount of functional programming? The solution is Spark SQL, which you will learn in this video.
  • Explore data types in RDBMS
  • Understand Hadoop Distributed File System (HDFS)

Anatomy of Spark SQL
This video will take you through the structure and internal workings of Spark SQL. A sketch of inspecting a query plan follows the list below.
  • Know the advantages of DataFrames
  • Learn about query planning and optimization in Spark SQL
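
As a hedged sketch of query planning, explain() prints the logical and physical plans that Spark SQL's optimizer produces (spark-shell; the data is hypothetical).

    import spark.implicits._

    // Hypothetical account data.
    val df = Seq(("SB10001", 1000.0), ("SB10002", 1200.0)).toDF("accNo", "amount")

    // Show the parsed, analyzed, and optimized logical plans plus the physical plan.
    df.filter($"amount" > 500).groupBy($"accNo").count().explain(true)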

DataFrame Programming
This video will demonstrate the two DataFrame programming models, one using SQL queries and the other using the DataFrame APIs for Spark. A short sketch of both follows the list below.
  • Create an RDD and convert it to a DataFrame
  • Calculate aggregates using a mix of DataFrame and RDD-like operations
  • Calculate the minimum using a mix of DataFrame and RDD-like operations
  • Use SQL to create another DataFrame containing the top 3 account detail records
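
A hedged sketch of both models side by side (spark-shell; schema and data hypothetical): the DataFrame API for aggregates and SQL over a temporary view for the top accounts.

    import spark.implicits._

    // Create an RDD (hypothetical data) and convert it to a DataFrame.
    val acTransDF = sc
      .parallelize(Seq(("SB10001", 1000.0), ("SB10002", 1200.0), ("SB10001", 8000.0)))
      .toDF("accNo", "tranAmount")

    // Model 1: the DataFrame API.
    acTransDF.groupBy("accNo").sum("tranAmount").show()

    // Model 2: SQL queries over a temporary view.
    acTransDF.createOrReplaceTempView("trans")
    spark.sql(
      "SELECT accNo, SUM(tranAmount) AS total FROM trans " +
      "GROUP BY accNo ORDER BY total DESC LIMIT 3").show()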

Understanding Aggregations and Multi-Datasource Joining with Spark SQL
Spark SQL allows the aggregation of data. Instead of running SQL statements on a single data source located on a single machine, you can use Spark SQL to do the same on distributed data sources. A short sketch follows the list below.
  • Define the case classes to use in conjunction with DataFrames
  • Create a DataFrame using the API for the account summary records
  • Register a temporary view of the DataFrame for use in SQL
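
A hedged Scala sketch of the steps in this list (spark-shell): case classes, DataFrames from two hypothetical sources, temporary views, and a join written in SQL.

    import spark.implicits._

    // Hypothetical record shapes for two data sources.
    case class AcctSummary(accNo: String, total: Double)
    case class AcctDetail(accNo: String, name: String)

    val summaryDF = Seq(AcctSummary("SB10001", 9000), AcctSummary("SB10002", 1200)).toDF()
    val detailDF  = Seq(AcctDetail("SB10001", "Thomas"), AcctDetail("SB10002", "Mark")).toDF()

    // Register temporary views so both DataFrames are visible to SQL.
    summaryDF.createOrReplaceTempView("summary")
    detailDF.createOrReplaceTempView("detail")

    spark.sql(
      "SELECT d.name, s.total FROM summary s " +
      "JOIN detail d ON s.accNo = d.accNo").show()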

Introducing Datasets and Understanding Data Catalogs
This video will show you the methods used to create a Dataset, along with its usage, the conversion of an RDD to a DataFrame, and the conversion of a DataFrame to a Dataset. You will also learn the usage of the Catalog API in Scala and Python. A short sketch follows the list below.
  • Apply a filter and create a Dataset of good and high-value transactions
  • Use Spark SQL to find invalid transaction records
  • Get the catalog object from the SparkSession object
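
A hedged sketch of a typed Dataset plus the Catalog API (spark-shell; the transaction values are hypothetical).

    import spark.implicits._

    // Hypothetical transaction records as a typed Dataset.
    case class Trans(accNo: String, tranAmount: Double)
    val transDS = Seq(Trans("SB10001", 1000), Trans("SB10002", -2)).toDS()

    // A typed filter keeps only good, high-value transactions.
    transDS.filter(t => t.tranAmount > 500).show()

    // The catalog object from the SparkSession lists tables and functions.
    spark.catalog.listTables().show()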

The Need for Spark and the Basics of the R Language
This video will help you understand the necessity of SparkR and the basic data types of the R language.
  • Explore the basics of the R language and the use of its data types
  • Create vectors and matrices
  • Extract values from a DataFrame

DataFrames in R and Spark
You may encounter several situations where you need to convert an R DataFrame to a Spark DataFrame or vice versa. Let’s see how to do it.
  • Use the faithful dataset
  • Convert an R DataFrame to a Spark DataFrame
  • Convert a Spark DataFrame to an R DataFrame

Spark DataFrame Programming with R
This video will show you how to write programs with the SQL and R DataFrame APIs.
  • Select a DataFrame using SQL
  • Create a DataFrame by taking the union of two DataFrames
  • Extract the DataFrame containing good account numbers

Understanding Aggregations and Multi-Datasource Joins in SparkR
In SQL, the aggregation of data is very flexible. The same is true in Spark SQL. Let’s see its use and the implementation of multi-datasource joins.
  • Pull the DataFrame containing account summary records using the API
  • Persist the data in the DataFrame into a Parquet file
  • Retrieve the DataFrame containing account detail records using SQL

Charting and Plotting Libraries and Setting Up a Dataset
This video will walk you through the charting and plotting libraries and give a brief description of the application stack. You will also learn how to set up a dataset with Spark in conjunction with Python, NumPy, SciPy, and matplotlib.
  • Use the NumPy and SciPy libraries
  • Download datasets from grouplens.org

Charts, Plots, and Histograms
There are several instances where you need to create various charts and plots to visually represent the various aspects of the dataset and then perform data processing, charting, and plotting. This video will enable you to do this with Spark.
  • Create the DataFrame of the user dataset
  • Plot the age distribution
  • Create the density plot

Bar Chart and Pie Chart
This video will let you explore more types of charts and plots, namely the stacked bar chart, donut chart, box plot, and vertical bar chart. So, let’s do it!
  • Define the x location of the group and the width of the bars
  • Plot the Bar chart and Pie chart
  • Display the Box plot

Scatter Plot and Line Graph
Through this video, you will learn in detail about scatter plots and line graphs using Spark. You will also see how to enhance a scatter plot in depth.
  • Draw points with proportionate area circles on the graph
  • Create Spark DataFrames for the number of action movies and drama movies

Data Stream Processing and Micro Batch Data Processing
Data sources generate data like a stream, and many real-world use cases require it to be processed in real time. This video will give you a deep understanding of stream processing in Spark. A short DStream sketch follows the list below.
  • Understand the data stream processing framework
  • Explore the production of Discretized Streams (DStreams)
  • Learn programming with DStreams
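
A hedged Scala sketch of DStream programming: the classic word count over a socket stream in 10-second micro-batches. The host, port, and interval are hypothetical; the StreamingContext is built from the spark-shell's existing SparkContext.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))       // 10-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical host/port

    // Each micro-batch is processed with RDD-like transformations.
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()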

A Log Event Processor
These days, it is very common for enterprises to have a central repository of application log events. Log events are also streamed live to data processing applications in order to monitor the performance of the running applications in real time. This video demonstrates the real-time processing of log events using a Spark Streaming data processing application.
  • Set up the Netcat server
  • Submit the jobs to Spark clusters
  • Compile and monitor the application

Windowed Data Processing and More Processing Options
This video will introduce the different processing options that you can pick up in Spark to work smartly with any data. A short windowing sketch follows the list below.
  • Count the number of log event messages in Scala and Python
  • Dive into other processing options
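
A hedged sketch of windowed processing: counting log-event lines over a 30-second window that slides every 10 seconds. The durations are hypothetical, both must be multiples of the batch interval, and window operations need a checkpoint directory.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("file:///tmp/ckpt")   // hypothetical checkpoint directory

    val lines = ssc.socketTextStream("localhost", 9999)

    // Count the messages seen in the last 30 seconds, every 10 seconds.
    lines.countByWindow(Seconds(30), Seconds(10)).print()

    ssc.start()
    ssc.awaitTermination()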

Kafka Stream Processing
Kafka is a publish-subscribe messaging system used by many IoT applications to process a huge number of messages. Let’s see how to use it! A short consumer sketch follows the list below.
  • Start ZooKeeper and Kafka
  • Perform the implementation of Kafka processing in Scala and Python
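
A hedged sketch of consuming Kafka with the direct stream API from the spark-streaming-kafka-0-10 connector; the broker address, consumer group, and topic name are all hypothetical.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val ssc = new StreamingContext(sc, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",   // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-demo",                // hypothetical consumer group
      "auto.offset.reset" -> "latest"
    )

    // Subscribe to a hypothetical "messages" topic and print the values.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("messages"), kafkaParams))

    stream.map(_.value).print()
    ssc.start()
    ssc.awaitTermination()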

Spark Streaming Jobs in Production
When a Spark Streaming application is processing the incoming data, it is very important to have uninterrupted data processing capability so that all of the data being ingested is processed. This video will take you through the tasks that enable you to achieve this goal. A short fault-tolerance sketch follows the list below.
  • Implement fault-tolerance in Spark Streaming data processing applications
  • Dive into structured streaming
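
A hedged sketch of the standard fault-tolerance pattern: checkpoint the streaming state and use StreamingContext.getOrCreate so that a restarted driver resumes from the checkpoint. The paths, host, and port are hypothetical.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "file:///tmp/log-ckpt"   // hypothetical

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(sc, Seconds(10))
      ssc.checkpoint(checkpointDir)
      ssc.socketTextStream("localhost", 9999).count().print()
      ssc
    }

    // Rebuilds the context from the checkpoint after a driver restart.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()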

Understanding Machine Learning and the Need for Spark
This video will introduce the basics of machine learning and show how Spark can achieve the goals of machine learning in an efficient manner.
  • Explore the overview of machine learning
  • Know the necessity of Spark for machine learning
  • Learn the terminology and concepts used in Spark Machine Learning

Wine Quality Prediction and Model Persistence
By the end of this video, you will be able to perform predictions on large datasets such as the wine quality dataset, which is widely used in data analysis. A short persistence sketch follows the list below.
  • Perform Wine Quality Prediction on Wine Quality dataset
  • Perform model persistence in Python and Scala
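
A hedged sketch of model persistence with Spark 2.x ML: fit a linear regression on wine-like feature vectors, save the model, and load it back. The numbers here are hypothetical, not the real wine quality dataset, and the path is illustrative.

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}

    // label = quality score, features = hypothetical wine measurements.
    val training = spark.createDataFrame(Seq(
      (5.0, Vectors.dense(7.4, 0.70, 9.4)),
      (6.0, Vectors.dense(7.8, 0.88, 9.8)),
      (7.0, Vectors.dense(6.3, 0.30, 11.0))
    )).toDF("label", "features")

    val model = new LinearRegression().setMaxIter(10).fit(training)

    // Model persistence: save the fitted model and reload it later.
    model.write.overwrite().save("file:///tmp/wine-lr-model")
    val reloaded = LinearRegressionModel.load("file:///tmp/wine-lr-model")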

Wine Classification
Let’s use Spark to perform wine classification by using various algorithms.
  • Model the relationship between the wine quality and the features of the wine
  • Use the Logistic Regression algorithm to train the model

Spam Filtering
Spam filtering is a very common use case, ubiquitous in e-mail applications, and one of the most widely used classification problems. This video will enable you to deal with this problem and show you the best approach to solve it in Spark. A short pipeline sketch follows the list below.
  • Split lines into words and transform the words using the HashingTF algorithm
  • Train a Logistic Regression model
  • Use the Pipeline abstraction and perform the prediction
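
A hedged sketch of the pipeline named in this list: Tokenizer, HashingTF, and LogisticRegression chained with the Pipeline abstraction. The toy training messages are hypothetical.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Hypothetical toy training data: label 1.0 = spam, 0.0 = ham.
    val training = spark.createDataFrame(Seq(
      ("win cash now", 1.0),
      ("meeting at noon", 0.0),
      ("free prize claim", 1.0),
      ("project status update", 0.0)
    )).toDF("message", "label")

    val tokenizer = new Tokenizer().setInputCol("message").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // fit() runs the whole pipeline and returns a reusable model.
    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)

    // Predict on a new, unseen message.
    val test = spark.createDataFrame(Seq(Tuple1("claim your free cash"))).toDF("message")
    model.transform(test).select("message", "prediction").show()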

Feature Algorithms and Finding Synonyms
It is not easy to get raw data into the appropriate form of features and labels for training a model. Through this video, you will be able to work with the raw data and use it efficiently for processing. A short transformer sketch follows the list below.
  • Perform tokenization to convert the sentences into words
  • Use regular expressions to remove gaps and stop words
  • Use the Word2Vec estimator
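
A hedged sketch of the feature transformers above: RegexTokenizer to split sentences into words, StopWordsRemover to drop stop words, and the Word2Vec estimator, whose model can suggest synonyms. The sentences are hypothetical, and minCount is lowered so the toy corpus trains.

    import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, Word2Vec}

    // Hypothetical sentences.
    val docs = spark.createDataFrame(Seq(
      Tuple1("spark makes big data processing simple"),
      Tuple1("spark processes big data in memory")
    )).toDF("sentence")

    val tokenizer = new RegexTokenizer()
      .setInputCol("sentence").setOutputCol("words").setPattern("\\W+")
    val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
    val words = remover.transform(tokenizer.transform(docs))

    val w2vModel = new Word2Vec()
      .setInputCol("filtered").setOutputCol("vector")
      .setVectorSize(16).setMinCount(0)
      .fit(words)

    w2vModel.findSynonyms("spark", 2).show()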

Understanding Graphs with Their Usage
Graphs are widely used in data analysis. Let’s explore some commonly used graphs and their usage.
  • Explore different types of graphs along with their usage

The Spark GraphX Library
Many graph processing libraries are available in the open source world. Giraph, Pregel, GraphLab, and Spark GraphX are some of them. Spark GraphX is one of the recent entrants into this space. Let’s dive into it!
  • Explore the GraphX library
  • Learn how to do graph partitioning

Graph Processing and Graph Structure Processing
Just like any other data structure, a graph also undergoes lots of changes because of changes in the underlying data. Let’s learn to process these changes. A short GraphX sketch follows the list below.
  • Create the graph using the vertices and edges
  • Create a new graph with the original vertices and the new edges
  • Print this graph
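
A hedged sketch of building and rebuilding a GraphX property graph (spark-shell; the users and relationships are hypothetical).

    import org.apache.spark.graphx.{Edge, Graph}

    // Hypothetical users and relationships.
    val vertices = sc.parallelize(Seq((1L, "Thomas"), (2L, "Mark"), (3L, "Jason")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    // Create the graph using the vertices and edges.
    val graph = Graph(vertices, edges)

    // Create a new graph with the original vertices and new (reversed) edges.
    val newGraph = Graph(vertices, edges.map(e => Edge(e.dstId, e.srcId, e.attr)))

    // Print this graph.
    newGraph.triplets.collect().foreach(println)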

Tennis Tournament Analysis
Since the basic graph processing fundamentals are in place, it is now time to take up a real-world use case that uses graphs. Let’s take the results of a tennis tournament for it.
  • Find different players and groups based on their performance
  • Print the list of players

Applying the PageRank Algorithm
When searching the web using Google, pages that are ranked highly by its algorithm are displayed. In the context of graphs, if vertices are ranked using the same algorithm instead of web pages, lots of new inferences can be made. Let’s jump right in and see how to do this. A short sketch follows the list below.
  • Define property classes to hold all the properties of the edges and vertices
  • Create a graph with the vertices and edges
  • Run the PageRank algorithm to calculate the rank of each vertex
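
A hedged sketch of PageRank over a small hypothetical GraphX graph; the tolerance controls when the iteration stops.

    import org.apache.spark.graphx.{Edge, Graph}

    // Hypothetical users and links between them.
    val users = sc.parallelize(Seq((1L, "Thomas"), (2L, "Mark"), (3L, "Jason")))
    val links = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph = Graph(users, links)

    // Run PageRank until the ranks converge within the tolerance.
    val ranks = graph.pageRank(0.0001).vertices

    ranks.join(users).collect().foreach { case (_, (rank, name)) =>
      println(s"$name has rank $rank")
    }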

Connected Component Algorithm
In a graph, finding a subgraph consisting of connected vertices is a very common requirement with tremendous applications. This video will enable you to find the connected vertices, making it easy for you to work on the given data. A short sketch follows the list below.
  • Create the RDD with users as the vertices and edges connecting the users
  • Create a graph and find the connected components of the graph
  • Extract the user names with their connected component ID
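
A hedged sketch of the connected components algorithm over a hypothetical user graph; each vertex is labeled with the lowest vertex ID in its component.

    import org.apache.spark.graphx.{Edge, Graph}

    // Hypothetical users as vertices and edges connecting the users.
    val users = sc.parallelize(Seq((1L, "Thomas"), (2L, "Mark"), (3L, "Jason"), (4L, "Eve")))
    val connections = sc.parallelize(Seq(Edge(1L, 2L, "friend"), Edge(3L, 4L, "friend")))

    val cc = Graph(users, connections).connectedComponents().vertices

    // Extract the user names with their connected component ID.
    cc.join(users).collect().foreach { case (_, (ccId, name)) =>
      println(s"$name is in component $ccId")
    }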

Understanding GraphFrames and Its Queries
GraphFrames is a new graph processing library available as an external Spark package developed by Databricks. Through this video, you will learn the concepts and queries used in GraphFrames. A short sketch follows the list below.
  • Apply a filter and select only the needed edges
  • Create a GraphFrame-based graph from the Spark GraphX-based graph
  • Convert the GraphFrame-based graph to a Spark GraphX-based graph
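
A hedged sketch using the external GraphFrames package, started for example with spark-shell --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11 (the version string is illustrative). It shows an edge filter plus the toGraphX/fromGraphX conversions documented in the GraphFrames user guide; the vertices and edges are hypothetical.

    import org.graphframes.GraphFrame

    // Hypothetical vertex and edge DataFrames; GraphFrames expects
    // an "id" column for vertices and "src"/"dst" columns for edges.
    val v = spark.createDataFrame(Seq((1L, "Thomas"), (2L, "Mark"))).toDF("id", "name")
    val e = spark.createDataFrame(Seq((1L, 2L, "follows"))).toDF("src", "dst", "relationship")
    val g = GraphFrame(v, e)

    // Apply a filter and select only the needed edges.
    g.edges.filter("relationship = 'follows'").show()

    // Convert the GraphFrame-based graph to a Spark GraphX-based graph and back.
    val gx = g.toGraphX
    val g2 = GraphFrame.fromGraphX(gx)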

Lambda Architecture
Application architecture is very important for any kind of software development. Lambda Architecture is a recent and popular architecture that is ideal for developing data processing applications. Let’s dive into it!
  • Explore the different layers of Lambda Architecture

Micro Blogging with Lambda Architecture
In recent years, the concept of microblogging has brought the general public into the culture of blogging. Let’s see how we can build one and have fun!
  • Understand the overview of SfbMicroblog
  • Dive into the different datasets in a blog
  • Set the data dictionary

Implementing Lambda Architecture and Working with Spark Applications
Since the Lambda Architecture is a technology-agnostic architecture framework, when designing applications with it, it is imperative to capture the technology choices used in the specific implementations...

Additional information

Requires basic knowledge of either Python or R
