Abstract
The efficacy of web scraping techniques is often undermined by the complexity of modern
websites, including dynamically generated pages, anti-scraping mechanisms, and the need
for contextual awareness. Traditional web scraping techniques rely on hard-coded rules and
parsing approaches, which limit their adaptability and accuracy. To overcome these
problems, this research proposes an AI-driven web scraping system based on Large Language
Models (LLMs) and Retrieval-Augmented Generation (RAG) that scrapes, processes, and
summarizes web data with minimal human intervention.
The system enhances the web scraping process by integrating Selenium and BeautifulSoup for
data retrieval, FAISS indexing for similarity-based retrieval, and transformer-based
architectures for intelligent summarization. Through context-aware retrieval and automated
text processing, the system generates high-quality, semantically rich, and structured
outputs. The evaluation involved experiments on various datasets, accuracy assessment using
ROUGE metrics and cosine similarity, and comparison against traditional scraping mechanisms.
The experiments demonstrate that the proposed system substantially improves data extraction
speed and contextual relevance. For example, summaries of news articles achieved a cosine
similarity score of 90% and a ROUGE-1 score of 85%, confirming the accuracy and concision
of the summaries produced by the system. The results show that combining LLMs and RAG with
web scraping enables more scalable, dynamic, and intelligent data extraction.
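The two reported metrics are both token-overlap measures and can be sketched in a few lines of pure Python. This is a simplified stand-in: the thesis likely relies on library implementations (e.g. a ROUGE package and embedding-based cosine similarity), and the example strings below are invented for illustration, not drawn from its datasets.

```python
import math
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall
    between a candidate summary and a reference summary."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over term-frequency vectors (the real system
    would compute this over dense sentence embeddings instead)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

reference = "the model summarizes news articles accurately"
candidate = "the model summarizes articles accurately"
print(round(rouge1_f(reference, candidate), 3))   # unigram F1
print(round(cosine_sim(reference, candidate), 3)) # lexical cosine
```

Scores near 1.0 indicate a candidate summary that closely tracks the reference, which is how thresholds like the reported 85% ROUGE-1 and 90% cosine similarity are read.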