



3D human pose estimation is a fundamental problem in computer vision with many potential real-world applications, such as human-computer interaction, autonomous driving, virtual reality, sports analysis, and healthcare. Deep learning has driven great progress in 3D human pose estimation.
Monocular 3D human pose estimation aims to recover the 3D human pose from a given monocular image or video. It remains challenging because 2D representations such as monocular images or videos suffer from self-occlusion and depth ambiguity. We propose a Multiple Hybrid Extraction Network (MHENet) [23], which obtains three different representations of pose-hypothesis features through multiple hybrid extractors with different structures and uses pose interaction and fusion to produce an accurate 3D pose. We design a hybrid extraction module that yields three hypothesis features: base features corresponding to structural information, diverse features corresponding to detail information, and condensed features corresponding to action information. We also design a hypothesis interaction fusion module that builds relationships across the hypothesis features to generate more accurate 3D poses.
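As a concrete illustration, the following PyTorch sketch shows one way the three hypothesis branches and their interaction could be organized; the extractor depths, the attention-based fusion, and all module names are illustrative assumptions rather than the exact layers of MHENet [23].

```python
# Minimal sketch of the hybrid-extraction idea, assuming a 2D-to-3D lifting
# setup with J joints. Branches with different depths stand in for the
# "different structures" that yield base / diverse / condensed features.
import torch
import torch.nn as nn

class MLPExtractor(nn.Module):
    """One hypothesis branch: a small MLP over the flattened 2D joints."""
    def __init__(self, num_joints=17, dim=256, depth=2):
        super().__init__()
        layers, in_dim = [], num_joints * 2
        for _ in range(depth):
            layers += [nn.Linear(in_dim, dim), nn.ReLU()]
            in_dim = dim
        self.net = nn.Sequential(*layers)

    def forward(self, pose2d):                  # pose2d: (B, J, 2)
        return self.net(pose2d.flatten(1))      # (B, dim)

class HypothesisFusion(nn.Module):
    """Cross-hypothesis interaction via self-attention over the three features."""
    def __init__(self, dim=256, num_joints=17):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(3 * dim, num_joints * 3)

    def forward(self, feats):                   # list of three (B, dim) tensors
        h = torch.stack(feats, dim=1)           # (B, 3, dim): one token per hypothesis
        h, _ = self.attn(h, h, h)               # let hypotheses exchange information
        return self.head(h.flatten(1))          # (B, J*3) regressed 3D pose

class MHENetSketch(nn.Module):
    def __init__(self, num_joints=17, dim=256):
        super().__init__()
        self.base = MLPExtractor(num_joints, dim, depth=1)       # structural info
        self.diverse = MLPExtractor(num_joints, dim, depth=3)    # detail info
        self.condensed = MLPExtractor(num_joints, dim, depth=2)  # action info
        self.fusion = HypothesisFusion(dim, num_joints)

    def forward(self, pose2d):
        feats = [self.base(pose2d), self.diverse(pose2d), self.condensed(pose2d)]
        return self.fusion(feats).view(-1, pose2d.shape[1], 3)   # (B, J, 3)

# pose3d = MHENetSketch()(torch.randn(8, 17, 2))   # -> (8, 17, 3)
```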
Existing methods generally lift 2D poses to 3D space through a single mapping function, so large-pose samples that lie far from the majority of the distribution may not be handled well. Unlike previous methods, which process all 2D poses equally, we separate large-pose samples from normal ones in an unsupervised manner and reason about them through a separate branch network. Specifically, we design a multi-branch network based on the human center of gravity [24] to enhance the robustness of the model to large-pose samples. Exploiting the correspondence between the human center of gravity and the human pose, we cluster the centers of gravity to separate large-pose samples from normal ones without supervision and lift them with a dedicated branch network. In addition, we introduce a global loss function to regularize the integrity of the 3D joints.
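The sketch below illustrates the unsupervised split and the branch routing, assuming the center of gravity is approximated by the mean of the 2D joints and that KMeans isolates the majority cluster; the branch networks are simple placeholders, not the architecture of [24].

```python
# Minimal sketch: cluster centers of gravity offline to flag large-pose
# samples, then route each sample to a dedicated lifting branch.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def split_by_center_of_gravity(poses2d, n_clusters=2):
    """poses2d: (N, J, 2) array of 2D poses; returns a boolean large-pose mask."""
    cog = poses2d.mean(axis=1)                          # (N, 2) center of gravity
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(cog)
    majority = np.bincount(labels).argmax()             # the "normal" cluster
    return labels != majority                           # True = large-pose sample

class MultiBranchLifter(nn.Module):
    def __init__(self, num_joints=17, dim=256):
        super().__init__()
        def branch():
            return nn.Sequential(nn.Linear(num_joints * 2, dim), nn.ReLU(),
                                 nn.Linear(dim, num_joints * 3))
        self.normal_branch = branch()    # handles the majority distribution
        self.large_branch = branch()     # dedicated to large-pose samples

    def forward(self, pose2d, is_large):                # pose2d: (B, J, 2), is_large: (B,) bool
        x = pose2d.flatten(1)
        # Both branches are evaluated here for simplicity; only one is kept per sample.
        out = torch.where(is_large[:, None],
                          self.large_branch(x), self.normal_branch(x))
        return out.view(-1, pose2d.shape[1], 3)

# mask = split_by_center_of_gravity(poses2d_np)                       # offline, unsupervised
# pose3d = MultiBranchLifter()(pose2d_batch, torch.from_numpy(mask))  # routed lifting
```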
Recovering 3D human pose and shape from videos has received sustained attention, owing to many potential real-world applications such as human-computer interaction, virtual reality, and motion analysis. Although conventional 3D human pose and shape estimation methods have achieved success on single images, recovering accurate and smooth 3D human motion from a video remains challenging: unlike the single-image case, it requires fitting a precise parameterized 3D human body mesh to every frame while ensuring temporal consistency between consecutive poses. To address this issue, we propose a frame-level feature tokenization approach [17], which defines the single-frame inputs of the video sequence as coarse-grained tokens and the cropped features of the single-frame sequence as fine-grained tokens. It effectively exploits information from tokens of different granularities to enhance temporal consistency and prediction accuracy. Additionally, we observe that residual connections between static and temporal features can hinder the temporal encoder from learning better features. We therefore use a spatial attention mechanism to transform the spatial information in the static feature vectors into another space, preserving key information favorable to temporal features while eliminating strongly dependent information.
Although frame-level feature tokenization provides highly temporally consistent 3D motion and more accurate per-frame 3D poses, the paradigm of feeding frame-level features extracted by a spatial encoder into a temporal encoder typically focuses on short-range spatiotemporal receptive fields and information propagation in videos; it fails to adaptively perceive effective long-range spatiotemporal dependencies and lacks the ability to perceive small-range local motions. To address the long-range modeling problem, we propose a time-frequency awareness network [18] for human mesh recovery. The network employs a time-frequency aware attention module and a time-frequency aware recurrent module to perceive long-term spatiotemporal dependencies and capture temporally accumulated spatiotemporal information. To ensure the network can perceive small-scale human motions, we also introduce a local perception loss that computes relative offsets between different body joints to capture pose variations across different scales, which helps produce more accurate predictions.
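As an illustration of the local perception idea, the sketch below compares relative offsets between pairs of joints so that small limb motions still contribute to the loss even when the global pose error is small; the pairwise formulation and the mean absolute error are our assumptions, not the exact loss of [18].

```python
# Minimal sketch of a local-perception-style loss on relative joint offsets.
import torch

def local_perception_loss(pred, target):
    """pred, target: (B, T, J, 3) predicted / ground-truth joint sequences."""
    # Relative offsets between every pair of joints: (B, T, J, J, 3).
    pred_rel = pred.unsqueeze(3) - pred.unsqueeze(2)
    target_rel = target.unsqueeze(3) - target.unsqueeze(2)
    # Penalize mismatched offsets, which captures fine, limb-scale motion errors.
    return (pred_rel - target_rel).abs().mean()

# loss = local_perception_loss(torch.randn(2, 16, 17, 3), torch.randn(2, 16, 17, 3))
```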
Capturing motion in unconstrained environments poses another challenge for recovering 3D human body meshes from videos. Current methods struggle to estimate model parameters accurately under unconstrained conditions (e.g., low natural lighting, motion blur) and thus fail to reconstruct plausible human bodies. While some methods attempt to improve performance by incorporating external data resources, they do not fully exploit the latent information in the underlying data. To learn more accurate motion sequences in unconstrained environments and fully exploit the spatial features of existing video data, we propose a spatio-temporal trend reasoning network [19]. We introduce a temporal trend inference module and a spatial trend enhancement module to infer trends in the temporal sequences and facilitate the propagation of human-relevant information, while learning spatio-temporal domain-sensitive features that stimulate the representation of human motion information. We also introduce an integration strategy to integrate and refine the spatio-temporal feature representations.
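A rough sketch of this idea follows, under the assumption that a temporal trend can be approximated by frame-to-frame feature differences refined by a small recurrent unit, and that the spatial enhancement acts as channel gating; both modules are stand-ins for, and may differ from, those in [19].

```python
# Minimal sketch: infer a temporal trend from per-frame features and
# re-weight channels to emphasize human-relevant information.
import torch
import torch.nn as nn

class TemporalTrendInference(nn.Module):
    def __init__(self, dim=2048):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, feats):                        # feats: (B, T, C) per-frame features
        trend = feats[:, 1:] - feats[:, :-1]         # first-order temporal differences
        trend = torch.cat([trend[:, :1], trend], 1)  # pad back to length T
        refined, _ = self.gru(trend)                 # reason over the trend sequence
        return feats + refined                       # propagate the trend into the features

class SpatialTrendEnhancement(nn.Module):
    def __init__(self, dim=2048, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim // reduction), nn.ReLU(),
                                  nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, feats):                        # (B, T, C)
        return feats * self.gate(feats)              # emphasize human-relevant channels

# x = torch.randn(2, 16, 2048)
# x = SpatialTrendEnhancement()(TemporalTrendInference()(x))
```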
Although the spatio-temporal trend reasoning network can effectively recover human body meshes in unconstrained environments, it overlooks fine-grained spatial features and human motion recognition features, making it difficult to reconstruct plausible human sequences under cluttered and extreme lighting conditions. To address this issue, we propose a two-stage co-segmentation network based on discriminative representation [20]. In the first stage, the network segments the video in the spatial domain to highlight fine-grained spatial information; frame-level discriminative representations are then learned and enhanced through a dual-excitation mechanism and a frequency-domain enhancement module, while irrelevant information such as the background is suppressed. In the second stage, the network focuses on the temporal context by segmenting the video in the temporal domain and models inter-frame discriminative representations through a dynamic integration strategy. To generate plausible discriminative human actions, we design an anchored region loss to constrain the variation of human motion regions.
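The sketch below shows one plausible form of such a constraint, assuming the motion region of a frame is the bounding box of its joints and the anchor is the corresponding ground-truth box; the definition used in [20] may differ.

```python
# Minimal sketch of an anchored-region-style loss: keep the predicted motion
# region of each frame close to an anchor (here, the ground-truth) region.
import torch

def anchored_region_loss(pred, target):
    """pred, target: (B, T, J, 2) projected / ground-truth 2D joint sequences."""
    def region(joints):                                  # per-frame bounding box corners
        return joints.amin(dim=2), joints.amax(dim=2)    # each (B, T, 2)
    p_min, p_max = region(pred)
    t_min, t_max = region(target)
    # Penalize drift of the predicted motion region away from the anchor region.
    return (p_min - t_min).abs().mean() + (p_max - t_max).abs().mean()

# loss = anchored_region_loss(torch.randn(2, 16, 17, 2), torch.randn(2, 16, 17, 2))
```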