Abstract:
People in today's competitive world depend heavily on information. Forums are a common and
prominent way of seeking it: people use them to express opinions and to gather
answers to questions. Off-topic posts are common in forums. They
reduce readability and degrade the user experience. Therefore, it is important to detect
off-topic posts so that they can be managed. Identifying them is tedious, as typical forums contain a
significant amount of content. With the increase in internet users in recent years,
capturing off-topic posts has become an even more challenging task.
This research demonstrates an automated solution for identifying off-topic posts in a forum. The
core of the solution was built using techniques from Natural Language Processing
and deep neural networks. The similarity analysis method represents documents
as WordNet path vectors and compares them by calculating their cosine similarity. This method is
computationally expensive. Therefore, a word count reduction model was introduced to
decrease the number of words passed to the analysis. It was created using a Long Short-Term Memory (LSTM)
encoder-decoder mechanism. The reduction model was trained on a food domain dataset, and
the entire prototype was tested exhaustively on a forum dataset from a similar domain. An accuracy
of 71.36% was recorded with the vanilla similarity analysis mechanism, while an accuracy of 67.95%
was recorded with the reduction-enabled model. A significant performance increase was
observed when the dataset was sent through reduction before similarity analysis.
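
The cosine comparison at the heart of the similarity analysis can be illustrated with a minimal sketch. It assumes that a post and the forum topic have already been converted to equal-length numeric vectors; how the WordNet path vectors are built is not shown here. The function names and the 0.5 threshold are hypothetical and chosen only for illustration, not taken from the prototype.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # An all-zero vector carries no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


def is_off_topic(post_vec, topic_vec, threshold=0.5):
    """Flag a post as off-topic when its similarity to the topic is low.

    The threshold is an assumed, illustrative value.
    """
    return cosine_similarity(post_vec, topic_vec) < threshold
```

Because the cost of building and comparing such vectors grows with the number of words in a post, reducing the word count before this step lowers the amount of computation.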
Subject Descriptors
I.2: Artificial Intelligence
I.2.6: Learning
I.2.7: Natural Language Processing
H.3: Information Storage and Retrieval
H.3.3: Information Search and Retrieval