Abstract
The efficacy of web scraping techniques is often undermined by the complexity of modern
websites, including dynamically generated pages, anti-scraping mechanisms, and the need
for contextual awareness. Traditional web scraping techniques rely on hard-coded rules and
parsing approaches, which limit their adaptability and accuracy. To overcome these
problems, this research proposes an AI-driven web scraping system based on Large Language
Models (LLMs) and Retrieval-Augmented Generation (RAG) that scrapes, processes, and
summarizes web data with minimal human intervention.
The system enhances the web scraping process by integrating Selenium and BeautifulSoup for
data retrieval, FAISS indexing for similarity-based retrieval, and transformer-based
architectures for intelligent summarization. Through context-aware retrieval and automated
text processing, the system generates high-quality, semantically rich, and structured
outputs. The evaluation involved experiments on various datasets, accuracy assessment using
ROUGE metrics and cosine similarity, and comparison against traditional scraping mechanisms.
The experiments demonstrate that the proposed system substantially improves data extraction
speed and contextual relevance. For example, summaries of news articles achieved a cosine
similarity score of 90% and a ROUGE-1 score of 85%, confirming the accuracy and concision
of the summaries produced by the system. The results show that combining LLMs and RAG with
web scraping enables more scalable, dynamic, and intelligent data extraction.
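The two reported metrics are both token-overlap measures and can be sketched in a few lines of pure Python. This is a simplified stand-in: the thesis likely relies on library implementations (e.g. a ROUGE package and embedding-based cosine similarity), and the example strings below are invented for illustration, not drawn from its datasets.

```python
import math
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall
    between a candidate summary and a reference summary."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over term-frequency vectors (the real system
    would compute this over dense sentence embeddings instead)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

reference = "the model summarizes news articles accurately"
candidate = "the model summarizes articles accurately"
print(round(rouge1_f(reference, candidate), 3))   # unigram F1
print(round(cosine_sim(reference, candidate), 3)) # lexical cosine
```

Scores near 1.0 indicate a candidate summary that closely tracks the reference, which is how thresholds like the reported 85% ROUGE-1 and 90% cosine similarity are read.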