Abstract:
"
The challenge of efficiently managing and processing large volumes of data has driven demand for scalable, reliable, flexible, and high-performance ETL (Extract, Transform, Load) solutions. Ensuring data integrity is critical in any ETL pipeline, making automatic rollback an indispensable feature. Moreover, rather than relying solely on infrastructure auto-scaling, performance should also be improved programmatically through the right architecture at each stage of the ETL process.
To address these challenges, this thesis explores the implementation of a Function-as-a-Service (FaaS)-based ETL pipeline that uses AWS Step Functions (Standard workflows) as the orchestration tool. The research presents an architecture for a FaaS-based ETL data pipeline that provides automatic rollback for error handling, and it experiments with both asynchronous and synchronous (sequential) programming approaches to determine their impact on performance.
Different parameters, including data sizes and memory capacities (512 MB and 1024 MB), were tested to evaluate the average performance of each combination. Upgrading from 512 MB to 1024 MB of memory improved average execution time by 0.90% for the sequential paradigm and by 2.94% for the asynchronous paradigm, a modest gain from the additional memory that was more pronounced for asynchronous execution.
"