Abstract:
Problem: Large Language Models (LLMs) have revolutionized software engineering by
automating tasks such as code quality improvement, refactoring, and code review suggestions.
Despite these advancements, current tools face significant limitations in generalizability,
safety, and evaluation. They are often tailored to specific programming languages (e.g., Java,
Python) and lack robust mechanisms to ensure the safety and reliability of refactoring outputs
(Pomian et al., 2024). Moreover, the absence of standardized evaluation metrics complicates
benchmarking and limits these tools' scalability and effectiveness in diverse, multi-language
environments (Wadhwa et al., 2023; Liu et al., 2024).
Methodology: This research introduces a generalized LLM-based framework designed to
overcome these challenges by enabling automated code quality improvement, refactoring, and
code review suggestions across multiple programming languages. The framework integrates
fine-tuned LLMs (e.g., GPT-4, StarCoder2) trained on diverse datasets such as Ref-Dataset
and Java refactoring commits (Yu et al., 2024; Cordeiro et al., 2024). Safety mechanisms, such
as RefactoringMirror, are incorporated to validate outputs and prevent unsafe changes (Liu et
al., 2024). Standardized evaluation metrics for accuracy, scalability, and generalizability are
utilized to benchmark performance (Finkman et al., 2024). The framework development
adheres to Object-Oriented Analysis and Design (OOAD) principles and employs Agile
methodologies for iterative refinement and adaptability.