PySpark Archives - Tori Tompkins

Identifying Data Outliers in Apache Spark 3.0

The secret to getting machine learning to work effectively is ensuring that the data we use for training is as clean as possible and has any bias removed from it. When working with machine learning, we should be building in a generalised mode, and to do this we need to understand what is…

Will Koalas replace PySpark?

One of the first of many big announcements at the 2020 Spark and AI Summit was the official release of Koalas 1.0, the pandas API on top of Apache Spark. This blog will explore how Koalas differs from PySpark. Pandas and Spark: To understand what makes Koalas so important, you need to understand the importance…