Digital Repository

Transforming Data Extraction with AI-Powered Intelligent Web Scraping

dc.contributor.author Lokuhetti, Nadil
dc.date.accessioned 2026-04-07T05:28:47Z
dc.date.available 2026-04-07T05:28:47Z
dc.date.issued 2025
dc.identifier.citation Lokuhetti, Nadil (2025) Transforming Data Extraction with AI-Powered Intelligent Web Scraping. BSc. Dissertation, Informatics Institute of Technology en_US
dc.identifier.issn 20210038
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/3120
dc.description.abstract The effectiveness of web scraping techniques is often undermined by the complexity of modern websites, including dynamically generated pages, anti-scraping mechanisms, and the need for contextual awareness. Traditional web scraping techniques rely on hard-coded rules and parsing approaches, which limit their adaptability and accuracy. To overcome these problems, this research proposes an AI-driven web scraping system based on Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) that scrapes, processes, and summarizes web data with minimal human intervention. The system integrates Selenium and BeautifulSoup for data retrieval, FAISS indexing for similarity search, and transformer-based architectures for intelligent summarization. Using context-aware retrieval and automated text processing, the system generates high-quality, semantically dense, structured outputs. The evaluation involved experiments on various datasets, accuracy assessment using ROUGE metrics and cosine similarity, and comparison with traditional scraping mechanisms. The experiments indicate that the proposed system greatly improves data extraction speed and contextual relevance. For example, news articles obtained a cosine similarity score of 90% and a ROUGE-1 score of 85%, confirming the accuracy and brevity of summaries produced by the system. The results demonstrate that combining LLMs and RAG with web scraping enables more scalable, dynamic, and intelligent data extraction technologies. en_US
dc.language.iso en en_US
dc.subject Intelligent Web Scraping en_US
dc.subject Retrieval-Augmented Generation en_US
dc.subject Large Language Models en_US
dc.title Transforming Data Extraction with AI-Powered Intelligent Web Scraping en_US
dc.type Thesis en_US
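
Illustrative note on the evaluation metrics named in the abstract: the sketch below shows one way ROUGE-1 and cosine similarity between a reference text and a generated summary could be computed. It is a minimal illustration and not the dissertation's implementation. The function names (tokenize, rouge1, cosine_similarity), the whitespace tokenizer, and the use of bag-of-words count vectors for cosine similarity are all assumptions; the record does not state how the thesis tokenized text or which vector representation it used, and an embedding-based cosine similarity is equally plausible.

```python
from collections import Counter
import math


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, with no stemming
    # or punctuation handling.
    return text.lower().split()


def rouge1(reference, candidate):
    """Return (precision, recall, f1) from clipped unigram overlap."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count per shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors (an assumption)."""
    a = Counter(tokenize(text_a))
    b = Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    summary = "the cat sat"
    print(rouge1(reference, summary))
    print(cosine_similarity(reference, summary))
```

In this sketch, a summary that reuses only words from the reference scores a ROUGE-1 precision of 1.0, while recall drops as the summary gets shorter, which is why a short summary can be precise but incomplete.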

