Learning PySpark

Course

Online

£ 150 + VAT

Description

  • Type

    Course

  • Methodology

    Online

  • Start date

    Different dates available

Building and deploying data-intensive applications at scale using Python and Apache Spark.

Apache Spark is an open-source distributed engine for querying and processing data. This tutorial provides a brief overview of Spark and its stack, and presents effective, time-saving techniques for leveraging the power of Python in the Spark ecosystem. You will start by gaining a firm understanding of the Apache Spark architecture and learning how to set up a Python environment for Spark. You'll learn about different techniques for collecting data, and distinguish between (and understand) techniques for processing data. Next, we provide an in-depth review of RDDs and contrast them with DataFrames. We give examples of how to read data from files and from HDFS, and how to specify schemas using reflection or programmatically (in the case of DataFrames). The concept of lazy execution is described, and we outline the various transformations and actions specific to RDDs and DataFrames. Finally, we show you how to use SQL to interact with DataFrames. By the end of this tutorial, you will know how to process data using Spark DataFrames and will have mastered distributed data collection and processing techniques.

About the Author

Tomasz Drabas is a Data Scientist working for Microsoft, currently residing in the Seattle area. He has over 12 years' international experience in data analytics and data science across numerous fields: advanced technology, airlines, telecommunications, finance, and consulting.
Tomasz started his career in 2003 with LOT Polish Airlines in Warsaw, Poland, while finishing his Master's degree in strategy management. In 2007, he moved to Sydney to pursue a doctoral degree in operations research at the University of New South Wales, School of Aviation; his research crossed the boundaries between discrete choice modeling and airline operations research.

Facilities

Location

Start date

Online

Start date

Different dates available

Enrolment now open

About this course

Learn about Apache Spark and the Spark 2.0 architecture
Understand schemas for RDD, lazy executions, and transformations
Explore the sorting and saving elements of RDD
Build and interact with Spark DataFrames using Spark SQL
Create and explore various APIs to work with Spark DataFrames
Learn how to change the schema of a DataFrame programmatically
Explore how to aggregate, transform, and sort data with DataFrames
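Spark SQL itself needs a Spark runtime, so the register-a-table-then-query workflow the course teaches (creating a temporary view from a DataFrame and querying it with SQL) cannot be shown standalone here. As an installation-free sketch of the same shape, the snippet below uses Python's built-in sqlite3 module; the table name, columns, and rows are invented for illustration.

```python
import sqlite3

# In PySpark you would register a DataFrame as a temp view and call
# spark.sql(...); here sqlite3 stands in for that register-then-query shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, delay INTEGER)")  # hypothetical data
conn.executemany("INSERT INTO flights VALUES (?, ?)",
                 [("SEA", 10), ("SFO", -2), ("SEA", 35)])

# Aggregate, group, and sort with plain SQL, just as the course does
# against DataFrames with spark.sql(...).
rows = conn.execute(
    "SELECT origin, AVG(delay) FROM flights GROUP BY origin ORDER BY origin"
).fetchall()
print(rows)  # [('SEA', 22.5), ('SFO', -2.0)]
```

The payoff the course aims at is the same: once tabular data is exposed to a SQL engine, aggregation, transformation, and sorting become declarative one-liners.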


Reviews

This centre's achievements

2021

All courses are up to date

The average rating is higher than 3.7

More than 50 reviews in the last 12 months

This centre has featured on Emagister for 6 years

Subjects

  • Project
  • Apache
  • Information Systems
  • Information Systems management
  • IT
  • IT Management
  • Management
  • Computing
  • Programming
  • Programme Planning

Course programme

A Brief Primer on PySpark (6 lectures, 14:55)

The Course Overview
This video gives an overview of the entire course.

Brief Introduction to Spark
The aim of this video is to explain Spark and its Python interface.
  • Learn about Spark
  • Explain its popularity
  • Touch upon the Python interface

Apache Spark Stack
The aim of this video is to provide a brief overview of the Apache Spark stack components.
  • Introduce the Apache Spark ecosystem
  • Introduce Spark SQL and DataFrames
  • Give an overview of the remaining components: MLlib, GraphX, and streaming

Spark Execution Process
The aim of this video is to briefly review the execution process.
  • Give an overview of the interactions between the driver and workers
  • Discuss how Spark decides which tasks run in parallel
  • Represent the job execution plan as a directed acyclic graph

Newest Capabilities of PySpark 2.0+
The aim of this video is to briefly review the newest features of Spark 2.0+.
  • Introduce phase two of Project Tungsten
  • Instantiate a SparkSession instead of three other contexts
  • Give a brief overview of Spark Structured Streaming

Cloning the GitHub Repository
The aim of this video is to clone the GitHub repository for the course. Doing this will set up everything we need for the following videos.
  • Find the repository on GitHub
  • Open a terminal
  • Clone the repository
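A key idea behind the execution process described above is that Spark records transformations into a plan (a directed acyclic graph) and runs nothing until an action requests results. Spark itself is not needed to get a feel for this: Python generators, sketched below, exhibit the same deferred execution, though this is only an analogy, not the PySpark API.

```python
# Deferred execution, generator-style: nothing runs until results are pulled,
# much as Spark defers a DAG of transformations until an action fires.
log = []

def numbers():
    for n in range(5):
        log.append(n)          # records when each element is actually produced
        yield n

doubled = (n * 2 for n in numbers())   # "transformation": builds a plan only
assert log == []                       # no element has been produced yet

result = list(doubled)                 # "action": triggers the whole pipeline
assert result == [0, 2, 4, 6, 8]
assert log == [0, 1, 2, 3, 4]          # work happened only at the action
```

This laziness is what lets Spark's driver inspect the whole plan and decide which tasks can run in parallel before any data moves.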
Resilient Distributed Datasets (11 lectures, 38:48)

Brief Introduction to RDDs
In this video, we provide a brief overview of one of the fundamental data structures of Spark: the RDD.
  • Explain the system requirements
  • Explain what RDDs are
  • Learn about Spark's Python API

Creating RDDs
In this video, we learn how to create RDDs in many different ways.
  • Parallelize a list of integers
  • Parallelize a list of tuples
  • Read from files

Schema of an RDD
In this video, we explore the advantages and disadvantages of an RDD's lack of schema.
  • Explain how RDDs handle structured data
  • Explain how RDDs handle unstructured data
  • Explain how RDDs handle semi-structured, highly heterogeneous data

Understanding Lazy Execution
Spark is lazy to process data. In this video, we learn why this is an advantage.
  • Execute a transformation
  • Execute an action
  • Track the execution process

Introducing Transformations: .map(…)
In this video, we introduce lambdas and the .map(…) transformation.
  • Define functions to use inside the .map(…) transformation
  • Use a lambda inside the .map(…) transformation
  • Explain what the .map(…) transformation is, after all

Introducing Transformations: .filter(…)
In this video, we learn how to filter data from RDDs.
  • Use the RDD from the .map(…) video
  • Subset the number of elements
  • Use .filter(…) to remove the labels row

Introducing Transformations: .flatMap(…)
In this video, we explain the difference between the .flatMap(…) and .map(…) transformations and learn to use .flatMap(…) to filter malformed records.
  • Explain the difference between .map(…) and .flatMap(…)
  • Explain how .flatMap(…) works
  • Use .flatMap(…) to filter malformed records

Introducing Transformations: .distinct(…)
In this video, we explore what the .distinct(…) transformation does.
  • Create an unordered list of integers
  • Explain the way distinct items are found
  • Use the .distinct(…) transformation

Introducing Transformations: .sample(…)
In this video, we learn how to sample data from RDDs.
  • Check how many records are in an RDD
  • Decide what proportion of records to return
  • Use the .sample(…) transformation to sample without replacement

Introducing Transformations: .join(…)
In this video, we learn how to join two RDDs.
  • Populate two RDDs with some random data
  • The first element of each record is the key; the remainder is the value
  • Use the .join(…) transformation to join two RDDs

Introducing Transformations: .repartition(…)
In this video, we explore how to effectively use repartitioning.
  • Use the data read earlier
  • Check how many partitions there are
  • Use the ...
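The transformations in this section all take the same kind of lambdas that plain Python functions do. As a Spark-free preview of their semantics (an analogy only, since the real methods live on PySpark RDDs), the built-in equivalents can be sketched like this; the sample data is invented.

```python
import random

# Plain-Python analogues of the RDD transformations covered above;
# each lambda is what you would pass to the corresponding PySpark method.
data = [1, 2, 2, 3, 4]

mapped = list(map(lambda x: x * 10, data))        # like rdd.map(...)
filtered = list(filter(lambda x: x > 1, data))    # like rdd.filter(...)
flat = [y for x in [[1, 2], [3]] for y in x]      # like rdd.flatMap(...): one level flattened
distinct = sorted(set(data))                      # like rdd.distinct()
sample = random.Random(0).sample(data, 2)         # like rdd.sample(False, ...), seeded

# rdd.join(...) pairs records sharing a key; a dict intersection shows the idea:
left = {"a": 1, "b": 2}
right = {"a": 10, "c": 30}
joined = {k: (left[k], right[k]) for k in left.keys() & right.keys()}

print(mapped)   # [10, 20, 20, 30, 40]
print(joined)   # {'a': (1, 10)}
```

One real difference worth keeping in mind: the Python versions run eagerly on one machine, whereas the RDD versions are lazy and distributed across partitions, which is exactly why .repartition(…) exists.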

Additional information

A firm understanding of Python
