| dc.description.abstract |
The growing demand for precise 3D reconstruction in computer vision and augmented reality has driven significant advances in model architectures capable of recovering voxel-based 3D representations from single or multiple views. Despite this progress, current approaches have yet to achieve a decisive breakthrough, owing to their inability to capture both local and global dependencies while preserving the spatial relationships between views, which results in incomplete or hollowed-out reconstructions. To address these limitations, this research proposes a hybrid model, SwinVox, that combines the strengths of Convolutional Neural Networks (CNNs) and Swin Transformers (SwinT). The primary objective of this research is to overcome the above limitations in voxel-based 3D reconstruction by integrating CNN layers for localised feature extraction and SwinT for capturing multi-scale global features, including cross-view spatial relationships.
To achieve this objective, a novel CNN-SwinT hybrid architecture was designed in which CNN layers extract initial spatial features while SwinT captures long-range global dependencies and provides multi-scale feature representations. In addition, SwinT performs cross-view attention, capturing spatial relationships across different viewpoints to recover a more accurate 3D representation. The SwinVox architecture is optimised to capture complex topological structures and spatial relationships across the input views, which are then translated into high-fidelity 3D voxel representations.
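The cross-view fusion step described above can be illustrated with a minimal NumPy sketch. This is a hypothetical, simplified stand-in, not the actual SwinVox implementation: per-view feature maps (which the real model would obtain from CNN and SwinT layers) are fused by attending across the view axis for each spatial token, then decoded by a linear projection into a toy voxel grid. All shapes, the identity query/key/value projections, and the linear "decoder" are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(feats):
    """Fuse per-view features by attending across views at each token.

    feats: (V, N, C) array -- V views, N spatial tokens, C channels.
    Returns a fused (N, C) representation. Identity projections are
    used for Q/K/V purely to keep the sketch short.
    """
    V, N, C = feats.shape
    # attention scores over the view axis, per spatial token: (N, V, V)
    scores = np.einsum('vnc,wnc->nvw', feats, feats) / np.sqrt(C)
    attn = softmax(scores, axis=-1)
    # weighted mix of the other views' features: (V, N, C)
    fused = np.einsum('nvw,wnc->vnc', attn, feats)
    return fused.mean(axis=0)  # merge views -> (N, C)

# toy pipeline: 3 views, a 4x4 feature map flattened to 16 tokens, 8 channels
rng = np.random.default_rng(0)
view_feats = rng.standard_normal((3, 16, 8))
fused = cross_view_attention(view_feats)            # (16, 8)
voxels = fused @ rng.standard_normal((8, 4))        # toy linear "decoder"
voxel_grid = voxels.reshape(4, 4, 4)                # 16 * 4 = 64 = 4^3 voxels
print(voxel_grid.shape)
```

In the full model, the attention would use learned projections and operate on multi-scale SwinT features rather than random toy tensors; the sketch only conveys how cross-view attention lets each spatial location aggregate evidence from every viewpoint before voxel decoding.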
Extensive experiments were conducted on benchmark datasets, comparing the model's performance with standard 3D voxel reconstruction models. SwinVox achieved a promising IoU of ≈0.68 and an F-Score of ≈0.79, outperforming many standard models by a clear margin. |
en_US |