Enhance Apache Spark join with an in-memory index

Chandika, Keerthisinghe Alankarage Janith

URI: http://dlib.iit.ac.lk/xmlui/handle/123456789/775

Date: 2021

Abstract:

Even though Apache Spark fulfils state-of-the-art big data processing needs, some performance issues still exist in the Spark JOIN operator, which is one of the most heavily used operators in many applications. Furthermore, Spark itself has no way to add indexes on dataframes, so most RDDs are shuffled between distributed nodes while processing data rather than consulting indexed metadata. This study aims to improve Spark JOIN performance with index-friendly dataframes, which make data scans faster and reduce the amount of data shuffled between Spark nodes by keeping index metadata in a shared volume. To test the correlation between data volume, amount of data shuffled, and execution time, an experiment was conducted, and the results showed a linear relationship between the volume of data and both the amount of data shuffled and the execution time. The execution time of the native Spark dataframe was then compared against that of the novel indexed dataframe. The experimental results show that, once indexed dataframes are initialized, they can reduce Spark JOIN execution time by up to 97% compared with native Spark dataframes. In addition to the shorter execution time, no data shuffling at all takes place while processing data. Hence, the index-friendly dataframe can be suggested as a best-fit solution to Spark query slowness.
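The general idea of joining through a prebuilt in-memory index, rather than rescanning or redistributing rows for every join, can be illustrated with a minimal single-process sketch in plain Python. This is not the thesis's implementation and does not use any Spark API; the function names `build_index` and `indexed_join` and the sample rows are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def build_index(rows, key):
    """Map each key value to the positions of the rows that hold it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[key]].append(pos)
    return index


def indexed_join(left_rows, right_rows, right_index, key):
    """Inner join that probes a prebuilt index instead of scanning right_rows."""
    joined = []
    for left in left_rows:
        # .get() looks up matching row positions without touching other rows
        for pos in right_index.get(left[key], []):
            joined.append({**right_rows[pos], **left})
    return joined


# Hypothetical sample data
orders = [
    {"cust": 1, "item": "pen"},
    {"cust": 2, "item": "ink"},
    {"cust": 1, "item": "pad"},
]
customers = [
    {"cust": 1, "name": "Ann"},
    {"cust": 2, "name": "Bo"},
    {"cust": 3, "name": "Cy"},
]

# The index is built once and can be reused by later joins on the same key
customer_index = build_index(customers, "cust")
result = indexed_join(orders, customers, customer_index, "cust")
```

In this sketch the one-time cost of `build_index` corresponds loosely to the initialization step the abstract mentions; each later join only performs key lookups against the index.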
