dc.description.abstract |
"In the realm of data management and analysis, Data Warehousing and Apache Hadoop stand as
two powerful and complementary technologies. Their roles in handling and processing vast
quantities of data have become indispensable for organizations seeking to extract meaningful
insights and inform critical decisions.
Data Warehousing involves the systematic collection, organization, and consolidation of structured
data within a centralized repository known as a data warehouse. These structured data sources
encompass a variety of inputs, ranging from transactional databases to CRM systems and
external data feeds. The central aim of a data warehouse is to provide a unified and integrated
representation of an organization's data, facilitating efficient querying, analysis, and reporting.
Key attributes of Data Warehousing include a reliance on structured data, the application of
schema designs such as the star or snowflake schema to optimize data integration, the
amalgamation of data from diverse sources into a cohesive repository, and the storage of
historical data to support trend analysis and strategic decision-making.
On the other hand, Apache Hadoop, an open-source framework, offers a distributed computing
and storage platform for processing and analyzing big data. Its scalability and fault-tolerant
capabilities arise from the distribution of data and processing across clusters of commodity
hardware. Key components of Apache Hadoop encompass the Hadoop Distributed File System
(HDFS), MapReduce for parallel processing, Yet Another Resource Negotiator (YARN) for
cluster resource management, and a rich ecosystem of complementary tools like Apache Hive,
Apache Pig, Apache Spark, and Apache HBase.
The synergy between Data Warehousing and Apache Hadoop stems from their complementary
strengths and their roles in addressing distinct facets of data management and analysis. Apache
Hadoop excels in data ingestion and scalable storage, while its distributed processing capabilities,
coupled with ecosystem tools, enhance data processing and transformation. These strengths
are instrumental in optimizing the performance and efficiency of analytics processes.
As the volume of data generated by modern technologies continues to grow exponentially, the
imperative to harness this data using methodologies such as those employed by data warehouses
and Apache Hadoop becomes increasingly pronounced. Many organizations have successfully
adopted these technologies to analyze their data effectively. Apache Hadoop's
distributed storage system ensures robustness, fault tolerance, and adaptability in the face of
hardware failures. Its batch processing model enables large data sets to be handled efficiently,
and its ability to stream data into its distributed components underpins its speed and
scalability. MapReduce, a cornerstone of Hadoop, offers scalability, flexibility, speed, and
simplicity, making it a preferred choice for big data processing.
In this research, we delve into a critical aspect of data warehousing: schema design. The
conventional approach to schema development entails exhaustive manual analysis of the data, a
process that consumes substantial time and resources. However, the correctness of the data
warehouse schema is paramount, as errors at this stage can render the entire endeavor futile and
costly. " |
en_US |