Abstract:
Even though Apache Spark meets state-of-the-art big data processing needs, some performance
issues still exist in the Spark JOIN operator, which is one of the most heavily used operators in many
applications. Furthermore, Spark itself provides no way to add indexes to dataframes, so
most RDDs are shuffled between distributed nodes while processing data rather than being looked up through
indexed metadata. This study aims to improve Spark JOIN performance with index-friendly dataframes, which make data scans faster and reduce the amount of data shuffled
between Spark nodes by keeping index metadata in a shared volume.
To test the correlation between data volume, amount of data shuffled, and execution time, an
experiment was conducted. The results showed a linear relationship between
the volume of data and both the amount of data shuffled and the execution time. The execution time
of the native Spark dataframe was then compared against that of the novel indexed dataframe.
The experimental results show that, once indexed dataframes are initialized, they can make
Spark JOIN execution up to 97% faster than native Spark
dataframes. In addition to the shorter execution time, no data shuffling occurs at all while processing
data. Hence, the index-friendly dataframe can be suggested as a best-fit solution to Spark
query slowness.