Enhance apache spark join with an in memory index

Chandika, Keerthisinghe Alankarage Janith

dc.contributor.author	Chandika, Keerthisinghe Alankarage Janith
dc.date.accessioned	2022-02-25T09:22:13Z
dc.date.available	2022-02-25T09:22:13Z
dc.date.issued	2021
dc.identifier.citation	Chandika, Keerthisinghe Alankarage Janith (2021) Enhance apache spark join with an in memory index. MSc. Dissertation Informatics Institute of Technology	en_US
dc.identifier.issn	2018535
dc.identifier.uri	http://dlib.iit.ac.lk/xmlui/handle/123456789/775
dc.description.abstract	Although Apache Spark meets state-of-the-art big data processing needs, performance issues remain in the Spark JOIN operator, one of the most heavily used operators in many applications. Furthermore, Spark itself provides no way to add indexes to dataframes, so most RDDs are shuffled between distributed nodes during processing instead of being resolved through indexed metadata. This study aims to improve Spark JOIN performance with index-friendly dataframes, which make data scans faster and reduce the amount of data shuffled between Spark nodes by keeping index metadata in a shared volume. An experiment was conducted to test the correlation between data volume, amount of data shuffled, and execution time; the results showed that both the amount of data shuffled and the execution time grow linearly with data volume. The execution time of the native Spark dataframe was then compared against that of the novel indexed dataframe. The experimental results show that, once indexed dataframes are initialized, they can make Spark JOIN execution up to 97% faster than native Spark dataframes. In addition, no data is shuffled at all during processing. Hence, an index-friendly dataframe can be suggested as a best-fit solution to Spark query slowness.	en_US
dc.language.iso	en	en_US
dc.title	Enhance apache spark join with an in memory index	en_US
dc.type	Thesis	en_US
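
The idea summarized in the abstract, building an index on the join key once and then answering joins by probing it rather than moving rows around, can be illustrated with a minimal single-process sketch in plain Python. This is not the dissertation's implementation and does not use Spark; the function names `build_index` and `indexed_join` and the sample `orders`/`customers` rows are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_index(rows, key):
    """Build an in-memory index: join-key value -> list of rows carrying it."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def indexed_join(probe_rows, index, key):
    """Inner join that probes a prebuilt index instead of rescanning the other side."""
    joined = []
    for row in probe_rows:
        # dict.get does not create an entry for a missing key
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample data
orders = [
    {"cust_id": 1, "item": "disk"},
    {"cust_id": 2, "item": "ram"},
    {"cust_id": 1, "item": "cpu"},
]
customers = [
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 3, "name": "Bob"},
]

# The index is built once and can be reused by every later join on cust_id.
customer_index = build_index(customers, "cust_id")
result = indexed_join(orders, customer_index, "cust_id")
print(result)
```

The one-time cost of `build_index` corresponds loosely to the initialization step the abstract mentions; after that, each join only performs key lookups. How the index is shared between Spark nodes is specific to the dissertation and is not modeled here.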