



Single-view 3D reconstruction aims to recover a target object's three-dimensional structure and appearance from a single image, typically by means of computer vision and deep learning techniques. The challenge of this task lies in inferring the object's 3D shape and surface details from an image taken at a single viewpoint, which requires effectively exploiting the texture, colour, and projection cues in the image and reasoning from prior knowledge of the object. Single-view 3D reconstruction has a wide range of applications, including computer-aided design, virtual reality, augmented reality, and medical image processing.
Existing single-view reconstruction methods often fall short in reconstructing the details of edges and corners. Although some approaches attempt to mitigate this deficiency with synthetic data, that strategy tends to run into domain-adaptation issues when applied to real data. To overcome these problems, we propose a single-view reconstruction method called DmifNet [1], which addresses the limitations of current methods in reconstructing edge and corner details. Unlike previous methods, we employ a dynamic multi-branch information fusion strategy to recover 3D shapes of arbitrary topology with high fidelity from a single 2D image.

Existing single-view reconstruction methods also struggle to reconstruct objects with complex topological structures, which manifests as overly blurred boundaries between object components. Moreover, global and local information play distinct roles in a single view, yet current methods fail to balance the learning of global geometric structure against local detail, so the reconstruction typically sacrifices one for the other. To address these issues, we propose a Multi-Scale Edge Guided Learning Network (MEGLN) [2], which fully leverages global edge and local detail information: on one hand, it tackles the edge reconstruction of complex-shaped objects; on the other, it balances the use of global and local information.

Most available methods focus on reconstructing the overall shape of an object while ignoring fine-grained details, and they struggle to reconstruct complex topological structures exactly. We propose a Multi-Granularity Relationship Reasoning Network (MGRRNet) [3], which recovers 3D shapes with high fidelity and rich details via relationship reasoning between information at different granularities. Our model captures discriminative and detailed features at each granularity to extract attentional regions, and performs relationship reasoning between granularities to reinforce multi-granularity consistency and inter-granularity correlation. In this way, the network achieves robust feature representation and fine reconstruction. During training, we jointly optimize the feature representations at the different granularities via a sequence of inter-granularity cycle loss iterations.
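To make the last point concrete, the snippet below is a minimal sketch of a cycle-style consistency term between granularity levels, written in PyTorch. It assumes each granularity has already been pooled and projected into a shared embedding space; the exact loss formulation in [3] may differ.

```python
import torch
import torch.nn.functional as F

def inter_granularity_cycle_loss(feats: list[torch.Tensor]) -> torch.Tensor:
    """Hypothetical cycle-consistency term between granularity levels.

    feats holds embeddings [coarse, medium, fine], each of shape (B, D),
    obtained by pooling and projecting each granularity's features into a
    shared space (an assumption of this sketch). Each level is pushed to
    agree with the next around the cycle coarse -> medium -> fine -> coarse.
    """
    loss = feats[0].new_zeros(())
    n = len(feats)
    for i in range(n):
        a = F.normalize(feats[i], dim=-1)
        b = F.normalize(feats[(i + 1) % n], dim=-1)
        # cosine disagreement between adjacent granularities in the cycle
        loss = loss + (1.0 - (a * b).sum(dim=-1)).mean()
    return loss / n
```

Iterating such a term while updating each granularity's representation yields the sequence of cycle-loss iterations described above.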
Multi-view stereo 3D reconstruction aims to recover the complete three-dimensional geometric structure of a target object or scene from image sequences captured at multiple viewpoints, together with the corresponding camera parameters, using dense matching algorithms and deep learning techniques. The main challenge lies in learning and fusing information from multiple viewpoints to accurately restore the target's 3D shape, texture, and surface details, thereby achieving high-quality reconstruction of real-world objects. Multi-view 3D reconstruction has important applications in computer vision, robotics, and cartography, providing effective solutions for scene modeling, virtual reality, and cultural heritage preservation.
Existing multi-view stereo reconstruction methods often suffer from insufficient learning of detail information in depth-map prediction, low surface accuracy, and incomplete 3D point clouds. These issues arise primarily because current networks cannot extract detailed and rich feature information from the reference images. We therefore propose a self-attention-guided multi-view stereo network, AG-MVSNet [4], to address these issues. We use MVSNet [5] as the backbone network and design two modules to improve its feature extraction. The first module employs a smaller model with lower computational cost to predict a coarse initial depth map, which then guides feature extraction and depth prediction. Considering that reference images captured in natural environments contain the detail features required during reconstruction, the second module extracts target detail features from the reference images, concatenates the edge-contour details of the 2D image with the 3D geometric information of the coarse depth map, and uses an attention network to refine the predicted depth map, making full use of the rich feature information in the reference images (a minimal sketch of this refinement step follows below).

In challenging reconstruction scenarios such as low resolution and extreme lighting, and especially in depth inference for edges and occluded areas, the completeness of the reconstruction drops significantly. To address reconstruction in these challenging scenes, we propose DEFMVSNet [6], a multi-view stereo matching network based on depth edge flow. Guided by the reference image, our network dynamically infers edge coordinates to improve the completeness of the reconstruction; it learns features while generating 3D geometric information and seeks semantic features of the reference image that carry the low-level details of the depth map.
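Returning to the second AG-MVSNet module, the sketch below shows one way an attention network can gate a residual correction on the coarse depth map after fusing it with reference-image detail features. The channel sizes, layer counts, and the assumption that both inputs share one spatial resolution are illustrative, not the exact design of [4].

```python
import torch
import torch.nn as nn

class AttentionDepthRefiner(nn.Module):
    """Hypothetical attention-gated depth refinement (a sketch, not the
    exact AG-MVSNet module): 2D detail features of the reference image are
    concatenated with the coarse depth map, and a spatial-attention branch
    gates a residual depth correction."""

    def __init__(self, feat_ch: int = 32):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(feat_ch + 1, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.attn = nn.Sequential(nn.Conv2d(feat_ch, 1, 1), nn.Sigmoid())
        self.delta = nn.Conv2d(feat_ch, 1, 3, padding=1)

    def forward(self, ref_feat: torch.Tensor, coarse_depth: torch.Tensor):
        # ref_feat: (B, feat_ch, H, W) edge/detail features of the reference image
        # coarse_depth: (B, 1, H, W) initial depth from the small guide model
        x = self.fuse(torch.cat([ref_feat, coarse_depth], dim=1))
        # attention decides where the coarse depth needs correcting
        return coarse_depth + self.attn(x) * self.delta(x)
```

The residual form keeps the coarse prediction as a stable starting point, so the refinement only has to learn corrections around edges and fine structures.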
Existing reconstruction methods mostly rely on convolutional neural networks, which limits the network's ability to capture the global context of images and leaves the final depth map without a complete representation. We propose a Global Context Complementary Network (GCCN) [7], which enhances the complete representation of depth maps through a global-context complementary learning strategy. For the feature maps, we first exploit the respective advantages of convolutional neural networks and self-attention to extract 2D local features and long-range dependency information, thereby maximizing the preservation of complementary information. To obtain richer 3D depth information, we design a Contextual-feature Complementary Learning Module, which uses global feature interaction in the cost volume to achieve complementary learning across cost volumes at different scales.
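The complementary extraction step can be pictured as a two-branch layer like the sketch below: a convolutional branch for local detail and a self-attention branch for long-range dependencies, fused by concatenation. The 1x1-convolution fusion and the channel sizes are assumptions of this sketch; in practice the attention branch would run on downsampled features, since self-attention over full-resolution pixel tokens is quadratic in H*W.

```python
import torch
import torch.nn as nn

class DualBranchFeatures(nn.Module):
    """Hypothetical GCCN-style complementary feature extraction: a CNN
    branch captures 2D local features, a self-attention branch captures
    long-range dependencies, and the two are fused (a sketch; the exact
    architecture of [7] may differ)."""

    def __init__(self, in_ch: int = 3, ch: int = 32, heads: int = 4):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.embed = nn.Conv2d(in_ch, ch, 1)
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        b, _, h, w = img.shape
        local = self.local(img)                              # (B, ch, H, W)
        tokens = self.embed(img).flatten(2).transpose(1, 2)  # (B, H*W, ch)
        glob, _ = self.attn(tokens, tokens, tokens)          # long-range context
        glob = glob.transpose(1, 2).reshape(b, -1, h, w)
        return self.fuse(torch.cat([local, glob], dim=1))
```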
Self-supervised methods have made progress on Multi-View Stereo (MVS). However, existing methods ignore the edge structure information of the reconstructed target, which comprises both the outer silhouette and the edges of the internal structure. To address this issue, we propose a self-supervised edge structure learning network (UESM) [8]: we build an extractor that produces edge structure maps and design an edge structure loss that constrains the network to pay more attention to the edge structure features of the reference view. Borrowing the idea of cost-volume construction in multi-view stereo, we warp the edge structure map of the source view to the reference view to provide reliable self-supervision. We also design a masking mechanism that combines local and global properties to ensure robustness, and adopt an effective parallel acceleration approach that improves training speed and reconstruction efficiency while preserving accuracy.
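The warping step can be sketched as the standard differentiable reprojection used in self-supervised MVS: reference pixels are back-projected with the predicted reference depth, transformed into the source camera, and used to resample the source-view edge map. The function below is a minimal PyTorch sketch under that assumption; the interface and variable names are illustrative, not the UESM implementation.

```python
import torch
import torch.nn.functional as F

def warp_src_edges_to_ref(src_edge, depth_ref, K_ref, K_src, R, t):
    """Resample the source-view edge structure map into the reference view.

    src_edge:  (B, 1, H, W) edge structure map of the source view
    depth_ref: (B, 1, H, W) predicted depth of the reference view
    K_ref, K_src: (B, 3, 3) intrinsics; R: (B, 3, 3), t: (B, 3, 1) map
    reference-camera coordinates into the source camera (assumed convention).
    """
    b, _, h, w = depth_ref.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32, device=depth_ref.device),
        torch.arange(w, dtype=torch.float32, device=depth_ref.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(1, 3, -1).expand(b, 3, -1)
    # back-project reference pixels to 3D, then project into the source view
    cam = torch.linalg.inv(K_ref) @ pix * depth_ref.reshape(b, 1, -1)
    src = K_src @ (R @ cam + t)
    xy = src[:, :2] / src[:, 2:3].clamp(min=1e-6)
    # normalize pixel coordinates to [-1, 1] for grid_sample
    gx = 2.0 * xy[:, 0] / (w - 1) - 1.0
    gy = 2.0 * xy[:, 1] / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(b, h, w, 2)
    return F.grid_sample(src_edge, grid, align_corners=True)
```

An edge structure loss can then compare this warped map against the reference-view edge map, restricted by the local/global validity mask described above.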