2hr 3min



About this Course

PySpark helps you perform data analysis and build more scalable analyses and data pipelines. This course starts by introducing PySpark's potential for analyzing large datasets. You'll learn how to interact with Spark from Python and how to connect to Spark on a local Windows machine. By the end of this course, you will be able to perform efficient data analytics and use PySpark to analyze large datasets at scale in your organization.

This course will appeal to data science enthusiasts, data scientists, and anyone who is familiar with machine learning concepts and wants to scale out their work to big data. If you find it difficult to analyze large datasets that keep growing, this course is for you.

Note: A working knowledge of Python is assumed.

What You Will Learn:

👉 Gain a solid knowledge of PySpark and data analytics concepts through practical use cases
👉 Run, process, and analyze large datasets using PySpark
👉 Use Spark SQL to load big data into DataFrames
👉 Use PySpark SQL functions
👉 Extract data from multiple sources
👉 We will be using PyCharm as the IDE to run PySpark and Python

Browse Lesson Plans

We will begin this course by familiarizing ourselves with Spark and Big Data, and get an overview of how to implement distributed data management and machine learning.

  • Big Data and Spark Overview 12min 7sec

In this section, we will focus on the Resilient Distributed Dataset (RDD). What is an RDD? In essence, an RDD is an immutable, distributed collection of data elements, partitioned across the nodes of your cluster, that can be operated on in parallel through a low-level API offering transformations and actions. You will learn RDD operations and how to work with pair RDDs.

  • RDD Introduction 18min 44sec
  • RDD Operations 15min 7sec
  • Pair RDD 11min 52sec

Moving on, we will look at PySpark DataFrames and explore their functions for data analysis. DataFrames are a handy data structure for working with very large datasets, up to petabytes in size. PySpark DataFrames run on parallel architectures and support SQL queries.

  • PySpark Dataframes Overview 25min 28sec
  • PySpark Column Class | Operators & Functions 15min 37sec

We will continue our learning journey and look into SQL functions in PySpark. This section is split into two parts: SQL aggregate functions and SQL window functions.

  • SQL Aggregate Functions 10min 51sec
  • SQL Window Functions 8min 20sec

Lastly, we will look at how to use Matplotlib with PySpark. By the end of this session, you will be able to plot your data to help you analyze your datasets. 👉 Slides/resources for Matplotlib with PySpark are in the Resources section 👈

  • Matplotlib with PySpark 5min 37sec
15+ enrolled on this course

About Expert

Wajahatullah Khan

Data Architect @ Afiniti

Wajahatullah Khan has a BS in Information Systems from Pakistan's #1 University and an MS in Information Technology from Isik University, Istanbul. He holds numerous certifications and has 9+ years of experience as a data professional and as a trainer for data science and programming. Over the course of his career, he has developed a skill set in analyzing data, and he hopes to use his experience in teaching and data science to help other people learn the power of programming and the ability to analyze data and present it in clear and beautiful visualizations. Currently, he works as a Data Architect at Afiniti. Feel free to contact him on LinkedIn for more information on in-person or group training sessions.


Come out of this class as

  • Someone with knowledge of PySpark
One-time fee for the whole master course