Learning PySpark
Type: Course
Methodology: Online
Start date: Different dates available
Description
Building and deploying data-intensive applications at scale using Python and Apache Spark.
Apache Spark is an open-source distributed engine for querying and processing data. In this tutorial, we provide a brief overview of Spark and its stack. This tutorial presents effective, time-saving techniques on how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Apache Spark architecture and how to set up a Python environment for Spark.
You'll learn about different techniques for collecting data, and distinguish between (and understand) techniques for processing data. Next, we provide an in-depth review of RDDs and contrast them with DataFrames. We provide examples of how to read data from files and from HDFS and how to specify schemas using reflection or programmatically (in the case of DataFrames). The concept of lazy execution is described, and we outline various transformations and actions specific to RDDs and DataFrames.
Finally, we show you how to use SQL to interact with DataFrames. By the end of this tutorial, you will have learned how to process data using Spark DataFrames and mastered data collection techniques through distributed data processing.
About the Author
Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 12 years' international experience in data analytics and data science in numerous fields: advanced technology, airlines, telecommunications, finance, and consulting.
Tomasz started his career in 2003 with LOT Polish Airlines in Warsaw, Poland, while finishing his Master's degree in strategy management. In 2007, he moved to Sydney to pursue a doctoral degree in operations research at the University of New South Wales, School of Aviation; his research crossed boundaries between discrete choice modeling and airline operations research.
About this course
Learn about Apache Spark and the Spark 2.0 architecture
Understand schemas for RDDs, lazy execution, and transformations
Explore sorting and saving the elements of an RDD
Build and interact with Spark DataFrames using Spark SQL
Create and explore various APIs to work with Spark DataFrames
Learn how to change the schema of a DataFrame programmatically
Explore how to aggregate, transform, and sort data with DataFrames
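The lazy execution named in the list above (transformations are only recorded, and an action triggers the actual work) can be pictured with a small pure-Python model. This is a sketch under stated assumptions, not Spark's API: `LazyDataset`, `square`, and the `calls` log are hypothetical names introduced here for illustration, and the model ignores partitioning and distribution entirely.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not part of PySpark).

    Transformations only record a step; actions replay all recorded steps.
    """

    def __init__(self, source: Iterable[Any], steps: Optional[List[Step]] = None):
        self._source = source
        self._steps: List[Step] = list(steps or [])

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + [("map", fn)])

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + [("filter", pred)])

    def _iterate(self) -> Iterator[Any]:
        # Chain the recorded steps as Python iterators over the source.
        items: Iterator[Any] = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: these consume the chain and therefore trigger the work.
    def collect(self) -> List[Any]:
        return list(self._iterate())

    def count(self) -> int:
        return sum(1 for _ in self._iterate())


calls: List[int] = []  # records every input that square() actually processes


def square(x: int) -> int:
    calls.append(x)
    return x * x


# Build a pipeline: nothing is computed at this point.
ds = LazyDataset(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
calls_before_action = list(calls)

# The action runs the recorded steps over the source.
result = ds.collect()
print(calls_before_action, result)
```

In this toy model every action replays the whole chain from the source, so calling `ds.count()` afterwards would invoke `square` on all five inputs again. That is a deliberate simplification made for this sketch.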
Subjects
- Project
- Apache
- Information Systems
- Information Systems management
- IT
- IT Management
- Management
- Computing
- Programming
- Programme Planning
