Abstract:
"Software vulnerability detection has been researched extensively because of its importance in ensuring that software systems are not open to exploits that can cause disruption. Existing systems suffer from high false-negative and false-positive rates and score poorly on other evaluation metrics, which indicates that they are ineffective and unreliable for securing software applications.
This research takes a novel approach to deep-learning-based code vulnerability detection in which C/C++ vulnerabilities are classified by their Common Weakness Enumeration (CWE) label. It uses a large dataset and two models: a CodeBERT model, fine-tuned on test data, that processes slice sequences, and a Graph Attention Network (GAT) that processes the graph structure of the code. The vulnerability features extracted by these models are used to train a Random Forest classifier. Transfer learning in the CodeBERT model, the combined use of graph structure and slice sequences, and the efficient processing of code syntax elements as vectors, embeddings, and tokens together capture both the semantic and the syntactic characteristics of code, resulting in higher efficiency and better classification performance in the proposed system.
The results show that the setup is effective: it outperforms model setups such as GCN, BiLSTM, and standalone CodeBERT models, with weighted scores of 0.70 across all metrics using 2 CWEs, 1000 dataset samples, a batch size of 32, and 12 epochs. In future work, further testing on high-performance hardware and improvements in context mechanisms, such as using syntax trees of the code, graph parsing, and processing more syntax characteristics, could lead to even better performance."
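
The abstract reports weighted scores but does not state how they are computed. The sketch below assumes the common convention of support-weighted averaging, where each class's precision, recall, and F1 is weighted by that class's share of the true labels. The function name `weighted_scores` and the two CWE labels in the sample data are hypothetical and serve only as illustration; they are not taken from the study.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 (assumed averaging convention)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    precision = recall = f1 = 0.0

    for label, n_true in support.items():
        # True positives and total predictions for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)

        p_c = tp / n_pred if n_pred else 0.0
        r_c = tp / n_true
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0

        # Weight each per-class score by the class's share of true labels.
        weight = n_true / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for a two-class CWE setup (not data from the study).
y_true = ["CWE-119", "CWE-119", "CWE-119", "CWE-20", "CWE-20"]
y_pred = ["CWE-119", "CWE-119", "CWE-20", "CWE-20", "CWE-119"]
scores = weighted_scores(y_true, y_pred)
```

Under this convention, weighted recall equals plain accuracy, which is one reason several weighted metrics can land on similar values for the same predictions.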