Learning Path: Data Science With Apache Spark 2
Type: Course
Methodology: Online
Start date: Different dates available

Description
Spark is one of the most widely used large-scale data processing engines, and it runs extremely fast. It is a framework whose tools are equally useful for application developers and data scientists.

This Learning Path begins with an introduction to Apache Spark. We first cover the basics of Spark, introduce SparkR, then look at the charting and plotting features of Python in conjunction with Spark data processing, and finally Spark's data processing libraries. We then develop a real-world Spark application. Next, we enable you to become comfortable and confident working with Spark for data science by exploring Spark's data science libraries on a dataset of tweets. Begin your journey into fast, large-scale, and distributed data processing using Spark with this Learning Path.

About the Authors

Rajanarayanan Thottuvaikkatumana, Raj, is a seasoned technologist with more than 23 years of software development experience at various multinational companies. He has lived and worked in India, Singapore, and the USA, and is presently based in the UK. His experience includes architecting, designing, and developing software applications. He has worked on various technologies, including major databases, application development platforms, web technologies, and big data technologies. Since 2000, he has been working mainly in Java-related technologies, and does heavy-duty server-side programming in Java and Scala. He has worked on highly concurrent, highly distributed, and high-transaction-volume systems. Currently, he is building a next-generation Hadoop YARN-based data processing platform and an application suite built with Spark using Scala. Raj holds a master's degree in Mathematics and a master's degree in Computer Information Systems, and has many certifications in ITIL and cloud computing to his credit. Raj is the author of Cassandra Design Patterns - Second Edition, published by Packt.
About this course
Get to know the fundamentals of Spark 2.0 and the Spark programming model using Scala and Python
Know how to use Spark SQL and DataFrames using Scala and Python
Get an introduction to Spark programming using R
Perform Spark data processing, charting, and plotting using Python
Get acquainted with Spark stream processing using Scala and Python
Be introduced to machine learning with Spark using Scala and Python
Get started with graph processing with Spark using Scala
Develop a complete Spark application
Understand the Spark programming model and its ecosystem of packages for data science
Obtain and clean data before processing it
Understand Spark's machine learning algorithms and build a simple pipeline
Work with interactive visualization packages in Spark
Apply data mining techniques on the available data sets
Build a recommendation engine
Subjects
- Install
- Programming
- Quality Training
- Systems
- Web
- Wine
- Quality
- SQL
- Apache
- Server
- Java
- Options
Course programme
- Understand the Apache architecture
- Know the MapReduce application
- Understand the Directed Acyclic Graph (DAG) engine
- Explore Spark Programming Paradigm and Spark Libraries
- Install Python and R
- Install R and development tool
- Construct a Spark RDD
- Complete the data processing job
- Create Spark transformations and Spark actions
- Calculate the account-level summary of the transactions from the RDD (see the RDD sketch after this list)
- Count the number of elements in the RDD
- Create event logging mechanism
- Edit the newly created spark-defaults.conf file
- Customize the property, spark.driver.memory, to have a higher value
- Start the Scala REPL for Spark and make sure that it starts without any errors
- Calculate the sum, maximum, and minimum of all transaction amounts from the good records
- Understand the concept of stacking libraries on top of the core framework
- Standardize the data processing toolset without vendor lock-in
- Explore data types in RDBMS
- Understand Hadoop Distributed File System (HDFS)
- Know the advantages of DataFrame
- Learn about query planning and optimizations of Spark SQL
- Create RDD and convert it to DataFrame
- Calculate aggregates using a mix of DataFrame and RDD-like operations
- Calculate the minimum using a mix of DataFrame and RDD-like operations
- Use SQL to create another DataFrame containing the top 3 account detail records
- Define the case classes to use in conjunction with DataFrames
- Create DataFrame using the API for the account summary records
- Register a temporary view on the DataFrame for use in SQL (see the DataFrame and SQL sketch after this list)
- Apply a filter and create a Dataset of good and high-value transactions
- Use Spark SQL to find out invalid transaction records
- Get the catalog object from the SparkSession object
- Explore the basics of the R language and the use of its datatypes
- Create vectors and matrices
- Extract values from a DataFrame
- Use the dataset, faithful
- Convert an R DataFrame to a Spark DataFrame
- Convert a Spark DataFrame to an R DataFrame
- Select DataFrame using SQL
- Create a DataFrame by taking the union of two DataFrames
- Extract the DataFrame containing a good account number
- Pull the DataFrame containing account summary records using API
- Persist the data in the DataFrame to a Parquet file (a PySpark version of these steps is sketched after this list)
- Retrieve the DataFrame containing account detail records using SQL
- Use the NumPy and SciPy libraries
- Download datasets from grouplens.org
- Create the DataFrame of the user dataset
- Plot the age distribution (see the plotting sketch after this list)
- Create the density plot
- Define the x location of the group and the width of the bars
- Plot the Bar chart and Pie chart
- Display the Box plot
- Draw points with proportionate area circles on the graph
- Create Spark DataFrames for the number of action movies and drama movies
- Understand data stream processing framework
- Explore the production of Discretized Stream or DStreams
- Learn programming with DStreams
- Set the Netcat server
- Submit the jobs to Spark clusters
- Monitor and compile the application
- Count the number of log event messages in Scala and Python (see the DStream sketch after this list)
- Dive into other processing options
- Start ZooKeeper and Kafka
- Perform the implementation of Kafka processing in Scala and Python
- Implement fault-tolerance in Spark Streaming data processing applications
- Dive into structured streaming
- Explore the overview of machine learning
- Know the necessity of Spark for machine learning
- Learn the terminology and concepts used in Spark Machine Learning
- Perform Wine Quality Prediction on Wine Quality dataset
- Perform model persistence in Python and Scala
- Model the relationship between the wine quality and the features of the wine
- Use the Logistic Regression algorithm to train the model
- Split lines into words and transform words using the HashingTF algorithm
- Train a Logistic Regression model
- Use the Pipeline abstraction and perform the prediction (see the pipeline sketch after this list)
- Perform tokenization to convert the sentences into words
- Use regular expressions to remove gaps and stop words
- Use the Word2Vec estimator
- Explore different types of graphs along with their usage
- Explore the GraphX library
- Learn how to do graph partitioning
- Create the graph using the vertices and edges
- Create a new graph with the original vertices and the new edges
- Print this graph
- Find different players and groups based on their performance
- Print the list of players
- Define property classes to hold all the properties of the edges and vertices
- Create a graph with the vertices and edges
- Run the PageRank algorithm to calculate the rank of each vertex
- Create the RDD with users as the vertices and edges connecting the users
- Create a graph and find the connected components of the graph
- Extract the user names with their CC component ID
- Apply filter and select only the needed edges
- Create a GraphFrame-based graph from the Spark GraphX-based graph
- Convert the GraphFrame-based graph to a Spark GraphX-based graph (see the GraphFrames sketch after this list)
- Explore the different layers of Lambda Architecture
- Get an overview of SfbMicroblog
- Dive into the different datasets in a blog
- Set the data dictionary
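The sketches referenced in the list above follow. First, the RDD sketch: a minimal PySpark illustration of constructing an RDD, running transformations and actions, and computing the account-level summary. The record layout and the sample values are assumptions for illustration, not the course's actual dataset.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("AccountSummary").getOrCreate()
    sc = spark.sparkContext

    # Hypothetical records of the form (account number, transaction amount)
    records = [("SB10001", 1000.0), ("SB10002", 1200.0),
               ("SB10001", 8000.0), ("SB10002", 400.0)]
    rdd = sc.parallelize(records)

    # Transformation: account-level summary of the transaction amounts
    summary = rdd.reduceByKey(lambda a, b: a + b)

    # Actions: count the elements in the RDD and collect the summary
    print(rdd.count())
    print(summary.collect())

    # Sum, maximum, and minimum of all transaction amounts
    amounts = rdd.map(lambda rec: rec[1])
    print(amounts.sum(), amounts.max(), amounts.min())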
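Next, the DataFrame and SQL sketch: converting an RDD to a DataFrame, mixing DataFrame and RDD-like operations, and using a temporary view for a top-3 query. The column names are made up for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DataFrameSQL").getOrCreate()

    # Create an RDD and convert it to a DataFrame (columns are assumptions)
    data = [("SB10001", 1000.0), ("SB10002", 1200.0), ("SB10003", 8000.0)]
    df = spark.sparkContext.parallelize(data).toDF(["accNo", "tranAmount"])

    # Aggregates using a mix of DataFrame and RDD-like operations
    df.agg({"tranAmount": "min"}).show()
    print(df.rdd.map(lambda row: row.tranAmount).sum())

    # Register a temporary view and use SQL for the top 3 account records
    df.createOrReplaceTempView("trans")
    spark.sql("SELECT accNo, tranAmount FROM trans "
              "ORDER BY tranAmount DESC LIMIT 3").show()

    # The catalog object from the SparkSession lists the registered views
    print(spark.catalog.listTables())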
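The SparkR module itself is taught in R; to keep these sketches in one language, here is a PySpark version of the union, filter, and Parquet-persistence steps. The account-number rule and the output path are illustrative assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("UnionParquet").getOrCreate()

    df1 = spark.createDataFrame([("SB10001", 1000.0)], ["accNo", "tranAmount"])
    df2 = spark.createDataFrame([("SB10002", 1200.0)], ["accNo", "tranAmount"])

    # Take the union of two DataFrames with the same schema
    combined = df1.union(df2)

    # Keep only records with a good account number (illustrative rule)
    good = combined.filter(combined.accNo.startswith("SB"))

    # Persist the DataFrame to a Parquet file and read it back
    good.write.mode("overwrite").parquet("/tmp/account-summary.parquet")
    spark.read.parquet("/tmp/account-summary.parquet").show()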
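The plotting sketch: reading the MovieLens 100k user dataset from grouplens.org into a Spark DataFrame and plotting the age distribution with matplotlib. The local path is an assumption; the pipe-separated u.user layout follows the MovieLens 100k README.

    import matplotlib.pyplot as plt
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("AgeDistribution").getOrCreate()

    # MovieLens 100k u.user layout: user id | age | gender | occupation | zip
    users = (spark.read.csv("ml-100k/u.user", sep="|")
             .toDF("userId", "age", "gender", "occupation", "zip"))

    # The age column is small enough to collect to the driver for plotting
    ages = [int(row.age) for row in users.select("age").collect()]

    plt.hist(ages, bins=20)
    plt.xlabel("Age")
    plt.ylabel("Number of users")
    plt.title("Age distribution of MovieLens users")
    plt.show()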
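The DStream sketch: counting incoming log event messages per micro-batch from a Netcat server. Start the server first with "nc -lk 9999"; the host, port, and 10-second batch interval are assumptions.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # Two local threads: one to receive the stream, one to process it
    sc = SparkContext("local[2]", "LogEventCounter")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    # Read lines from the Netcat server
    lines = ssc.socketTextStream("localhost", 9999)

    # Count the number of log event messages in each batch
    lines.count().pprint()

    ssc.start()
    ssc.awaitTermination()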
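The pipeline sketch: splitting sentences into words with a Tokenizer, transforming the words with HashingTF, training a Logistic Regression model through the Pipeline abstraction, predicting, and persisting the model. The tiny labeled dataset is made up.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("TextPipeline").getOrCreate()

    # Made-up labeled sentences (1.0 = positive, 0.0 = negative)
    training = spark.createDataFrame([
        ("spark is fast and easy", 1.0),
        ("slow and hard to use", 0.0),
        ("great machine learning library", 1.0),
        ("buggy and painful", 0.0),
    ], ["text", "label"])

    # Tokenize, hash the words into feature vectors, then train the model
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
    model = pipeline.fit(training)

    # Perform the prediction and persist the fitted model
    test = spark.createDataFrame([("spark is great",)], ["text"])
    model.transform(test).select("text", "prediction").show()
    model.write().overwrite().save("/tmp/text-pipeline-model")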
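Finally, the GraphFrames sketch. The course teaches the graph module in Scala with GraphX; since GraphX has no Python API, this sketch uses the GraphFrames package referenced in the last two items (launch pyspark with --packages and a graphframes artifact matching your Spark version). The user graph is made up.

    from graphframes import GraphFrame
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("UserGraph").getOrCreate()

    # Made-up users as vertices, with "follows" edges connecting them
    vertices = spark.createDataFrame(
        [("1", "Thomas"), ("2", "Krish"), ("3", "Mathew")], ["id", "name"])
    edges = spark.createDataFrame(
        [("1", "2", "follows"), ("2", "3", "follows")],
        ["src", "dst", "relationship"])
    graph = GraphFrame(vertices, edges)

    # Run PageRank to calculate the rank of each vertex
    ranks = graph.pageRank(resetProbability=0.15, maxIter=10)
    ranks.vertices.select("name", "pagerank").show()

    # Connected components require a checkpoint directory
    spark.sparkContext.setCheckpointDir("/tmp/graph-checkpoint")
    cc = graph.connectedComponents()
    cc.select("name", "component").show()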