
What Strategies and Technologies Can Be Developed to Achieve Full Automation in the Identification and Creation of Dimension and Fact Tables Within a Data Warehouse, Leveraging Data Sourced From Hadoop?


dc.contributor.author Kathriarachchi, Visal
dc.date.accessioned 2024-02-14T07:48:02Z
dc.date.available 2024-02-14T07:48:02Z
dc.date.issued 2023
dc.identifier.citation Kathriarachchi, Visal (2023) What Strategies and Technologies Can Be Developed to Achieve Full Automation in the Identification and Creation of Dimension and Fact Tables Within a Data Warehouse, Leveraging Data Sourced From Hadoop?. MSc. Dissertation, Informatics Institute of Technology en_US
dc.identifier.issn 20210082
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/1668
dc.description.abstract In the realm of data management and analysis, Data Warehousing and Apache Hadoop stand as two powerful and complementary technologies. Their roles in handling and processing vast quantities of data have become indispensable for organizations seeking to extract meaningful insights and inform critical decisions. Data Warehousing involves the systematic collection, organization, and structuring of structured data within a centralized repository known as a data warehouse. These structured data sources encompass a variety of inputs, ranging from transactional databases to CRM systems and external data feeds. The central aim of a data warehouse is to provide a unified and integrated representation of an organization's data, facilitating efficient querying, analysis, and reporting. Key attributes of Data Warehousing include a reliance on structured data, the application of schema designs such as star or snowflake schemas to optimize data integration, the amalgamation of data from diverse sources into a cohesive repository, and the storage of historical data to support trend analysis and strategic decision-making.

Apache Hadoop, by contrast, is an open-source framework offering a distributed computing and storage platform for processing and analyzing big data. Its scalability and fault tolerance arise from the distribution of data and processing across clusters of commodity hardware. Key components of Apache Hadoop include the Hadoop Distributed File System (HDFS), MapReduce for parallel processing, Yet Another Resource Negotiator (YARN) for cluster resource management, and a rich ecosystem of complementary tools such as Apache Hive, Apache Pig, Apache Spark, and Apache HBase.

The synergy between Data Warehousing and Apache Hadoop stems from their complementary strengths in addressing distinct facets of data management and analysis. Apache Hadoop excels in data ingestion and scalable storage, while its distributed processing capabilities, coupled with ecosystem tools, enhance data processing and transformation. These capabilities are instrumental in optimizing the performance and efficiency of analytics processes. As the volume of data generated by modern technologies continues to grow exponentially, the imperative to harness this data using methodologies such as those employed by data warehouses and Apache Hadoop becomes increasingly pronounced. Many organizations have successfully bridged this divide, using these technologies to analyze data effectively. Apache Hadoop's distributed storage system ensures robustness, fault tolerance, and adaptability in the face of hardware failures. Its batch processing model handles large data sets efficiently, and its ability to stream data into distributed components is a testament to its speed and scalability. MapReduce, a cornerstone of Hadoop, offers scalability, flexibility, speed, and simplicity, making it a preferred choice for big data processing.

In this research, we delve into a critical aspect of data warehousing: schema design. The conventional approach to schema development entails exhaustive manual analysis of the data, a process that consumes substantial time and resources. Yet the correctness of the data warehouse schema is paramount, as errors at this stage can render the entire endeavor futile and costly. en_US
dc.language.iso en en_US
dc.publisher IIT en_US
dc.subject Data Warehouse en_US
dc.subject Hadoop en_US
dc.subject Machine Learning en_US
dc.title What Strategies and Technologies Can Be Developed to Achieve Full Automation in the Identification and Creation of Dimension and Fact Tables Within a Data Warehouse, Leveraging Data Sourced From Hadoop? en_US
dc.type Thesis en_US
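The abstract states the goal (automated identification and creation of dimension and fact tables from Hadoop-resident data) but not the mechanism, so what follows is a minimal, hypothetical Python (PySpark) sketch of one heuristic consistent with that goal, not the dissertation's actual method. The HDFS paths, the 0.5 cardinality threshold, and the rule that numeric, high-cardinality columns are treated as fact measures while the rest are treated as dimension attributes are all illustrative assumptions.

    # Hypothetical sketch: profile a table stored in HDFS and split its
    # columns into candidate fact measures and dimension attributes.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import NumericType

    spark = SparkSession.builder.appName("schema-candidates").getOrCreate()

    # Illustrative source path; any Hadoop-readable format would do.
    df = spark.read.parquet("hdfs:///warehouse/staging/sales")
    total_rows = df.count()

    measures, dimension_attrs = [], []
    for field in df.schema.fields:
        distinct = df.select(F.countDistinct(field.name)).first()[0]
        ratio = distinct / max(total_rows, 1)
        # Assumed heuristic: numeric columns with many distinct values
        # behave like measures; everything else like dimension attributes.
        if isinstance(field.dataType, NumericType) and ratio > 0.5:
            measures.append(field.name)
        else:
            dimension_attrs.append(field.name)

    # Materialize one dimension table (deduplicated attributes plus a
    # surrogate key) and one fact table that references it.
    dim = (df.select(dimension_attrs).dropDuplicates()
             .withColumn("dim_key", F.monotonically_increasing_id()))
    dim.write.mode("overwrite").parquet("hdfs:///warehouse/dims/sales_dim")

    fact = (df.join(dim, on=dimension_attrs, how="left")
              .select(["dim_key"] + measures))
    fact.write.mode("overwrite").parquet("hdfs:///warehouse/facts/sales_fact")

A full system of the kind the abstract envisions would also need to handle multiple source tables, foreign-key discovery, and incremental loads; this sketch covers only the single-table case.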

