| dc.description.abstract |
The growing demand for precise 3D reconstruction in computer vision and augmented reality has driven significant advances in model architectures capable of recovering voxel-based 3D representations from single or multiple views. Despite this progress, current approaches have yet to achieve a decisive breakthrough, owing to their inability to capture both local and global dependencies while preserving the spatial relationships between views, which results in incomplete or hollowed-out reconstructions. To address these limitations, this research proposes a hybrid model, SwinVox, that combines the strengths of Convolutional Neural Networks (CNNs) and Swin Transformers (SwinT). The primary objective of this research is to overcome the above limitations in voxel-based 3D reconstruction by integrating CNN layers for localised feature extraction and SwinT for capturing multi-scale global features, including cross-view spatial relationships.
To achieve this objective, a novel CNN-SwinT hybrid architecture was designed in which CNN layers extract initial spatial features while SwinT captures long-range global dependencies and provides multi-scale feature representations. In addition, SwinT performs cross-view attention, capturing spatial relationships across different viewpoints to recover a more accurate 3D representation. The SwinVox architecture is optimised to capture complex topological structures and spatial relationships across the input views, which are then translated into high-fidelity 3D voxel representations.
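The cross-view fusion step described above can be illustrated with a minimal NumPy sketch. This is a hypothetical, simplified stand-in, not the actual SwinVox implementation: per-view feature maps (which the real model would obtain from CNN and SwinT layers) are fused by attending across the view axis for each spatial token, then decoded by a linear projection into a toy voxel grid. All shapes, the identity query/key/value projections, and the linear "decoder" are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(feats):
    """Fuse per-view features by attending across views at each token.

    feats: (V, N, C) array -- V views, N spatial tokens, C channels.
    Returns a fused (N, C) representation. Identity projections are
    used for Q/K/V purely to keep the sketch short.
    """
    V, N, C = feats.shape
    # attention scores over the view axis, per spatial token: (N, V, V)
    scores = np.einsum('vnc,wnc->nvw', feats, feats) / np.sqrt(C)
    attn = softmax(scores, axis=-1)
    # weighted mix of the other views' features: (V, N, C)
    fused = np.einsum('nvw,wnc->vnc', attn, feats)
    return fused.mean(axis=0)  # merge views -> (N, C)

# toy pipeline: 3 views, a 4x4 feature map flattened to 16 tokens, 8 channels
rng = np.random.default_rng(0)
view_feats = rng.standard_normal((3, 16, 8))
fused = cross_view_attention(view_feats)            # (16, 8)
voxels = fused @ rng.standard_normal((8, 4))        # toy linear "decoder"
voxel_grid = voxels.reshape(4, 4, 4)                # 16 * 4 = 64 = 4^3 voxels
print(voxel_grid.shape)
```

In the full model, the attention would use learned projections and operate on multi-scale SwinT features rather than random toy tensors; the sketch only conveys how cross-view attention lets each spatial location aggregate evidence from every viewpoint before voxel decoding.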
Extensive experiments were conducted on benchmark datasets, comparing the model's performance with standard 3D voxel reconstruction models. SwinVox achieved a promising IoU of ≈0.68 and an F-Score of ≈0.79, outperforming many standard models by a clear margin. |
en_US |