Learn the basics of PySpark and build up to analyzing large datasets at scale!
PySpark helps you perform data analysis and build more scalable analyses and data pipelines. This course starts by introducing you to PySpark's potential for performing analysis of large datasets. You'll learn how to interact with Spark from Python and connect to Spark running locally on a Windows machine. By the end of this course, you will not only be able to perform efficient data analytics but will also have learned to use PySpark to easily analyze large datasets at scale in your organization. This course will appeal to data science enthusiasts, data scientists, and anyone familiar with machine learning concepts who wants to scale their work to big data. If you find it difficult to analyze large datasets that keep growing, this course is the perfect guide for you!

Note: A working knowledge of Python is assumed.

What You Will Learn:
👉 Gain a solid knowledge of PySpark and data analytics concepts via practical use cases
👉 Run, process, and analyze large datasets using PySpark
👉 Use Spark SQL to easily load big data into DataFrames
👉 Use PySpark SQL functions
👉 Extract data from multiple sources
👉 We will use PyCharm as the IDE to run PySpark and Python.
We will begin this course by familiarizing ourselves with Spark and Big Data, and get an overview of distributed data management and machine learning.
In this section, we will be focusing on the Resilient Distributed Dataset (RDD). What is an RDD? In its very essence, an RDD is an immutable, distributed collection of your data's elements, partitioned across the nodes of your cluster, that can be operated on in parallel through a low-level API offering transformations and actions. You will learn RDD operations and how to work with pair RDDs.
Moving on, we will look at PySpark DataFrames and explore their functions for data analysis. DataFrames are a handy data structure for storing petabytes of data; PySpark DataFrames run on parallel architectures and even support SQL queries.
We will continue our learning journey and look into SQL functions in PySpark. This section will be broken down into two parts: SQL aggregate functions and SQL window functions.
Lastly, we will look into how we can use Matplotlib with PySpark. By the end of this session, you will be able to plot your data to support your analysis. 👉 Slides/resources for Matplotlib with PySpark are in the Resources section 👈
Wajahatullah Khan has a BS in Information Systems from Pakistan's #1 university and an MS in Information Technology from Isik University, Istanbul. He holds several certifications and has 9+ years of experience as a data professional and as a trainer for data science and programming. Over the course of his career, he has developed a skill set in analyzing data, and he hopes to use his experience in teaching and data science to help other people learn the power of programming and the ability to analyze data, as well as present it in clear and beautiful visualizations. Currently, he works as a Data Architect at Afiniti. Feel free to contact him on LinkedIn for more information on in-person or group training sessions.