Edgar Simo-Serra・Monocular Single Image 3D Human Pose Estimation

Monocular Single Image 3D Human Pose Estimation

This line of research focuses on estimating the 3D pose of humans from single monocular images. This is an extremely difficult problem due to the large number of ambiguities that arise from the projection of 3D objects onto the image plane. We consider image evidence derived from different detectors for the different parts of the body, which results in noisy 2D estimations whose uncertainty must be compensated for. In order to deal with these issues, we propose different approaches using discriminative and generative models to enforce learnt anthropomorphic constraints. We show that by exploiting prior knowledge of human kinematics it is possible to overcome these ambiguities and obtain good pose estimation performance.
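The projection ambiguity mentioned above can be illustrated with a minimal sketch. It assumes an idealized pinhole camera with unit focal length; the function name `project` and the joint coordinates are made up for illustration and are not taken from the papers.

```python
def project(point3d, focal_length=1.0):
    """Pinhole projection of a 3D point (x, y, z), z > 0, onto the image plane."""
    x, y, z = point3d
    return (focal_length * x / z, focal_length * y / z)

# Two different 3D positions for the same joint, lying on one viewing ray.
# The coordinates are hypothetical and expressed in camera space.
wrist_near = (0.2, -0.1, 2.0)
wrist_far = (0.4, -0.2, 4.0)  # same direction from the camera, twice as far

# Both 3D points land on the same 2D location, so the image alone
# cannot tell them apart.
print(project(wrist_near))
print(project(wrist_far))
```

Every joint of the body has such a one-parameter family of candidate depths, which is why prior knowledge of human kinematics is needed to select a plausible pose.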

Overview

Below we present some additional results and insights on the approaches we propose. The additional results correspond to the CVPR 2013 approach, in which we present a joint model for both 2D and 3D pose estimation. In all cases we work with a single monocular image as input.

The CVPR 2013 approach combines a strong 3D generative model with a bank of 2D part detectors. This can be derived from the probabilistic formulation used by pictorial structures, from which the following equation can be obtained:

Probabilistic Formulation

where X is the 3D pose, L is the 2D pose, H is a set of latent variables and D is the image evidence. We can see that two clear components can be distinguished, corresponding to the generative model and discriminative detectors respectively. A visual overview of the approach can be seen below.
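The equation itself is not reproduced in this text. As a sketch only, one factorization consistent with the description above can be written under the assumption of a chain of dependencies H → X → L → D (the latent variables generate the 3D pose, the 3D pose determines the 2D pose, and the image evidence depends only on the 2D pose). The exact form and grouping of terms used in the paper may differ.

```latex
p(X, L, H \mid D) \;\propto\;
  \underbrace{p(H)\, p(X \mid H)\, p(L \mid X)}_{\text{generative model}}
  \;\cdot\;
  \underbrace{p(D \mid L)}_{\text{discriminative detectors}}
```

Under this assumption the first group of terms scores how plausible a pose is as a human configuration, while the second scores how well the 2D part locations agree with the detector responses.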

Overview

Annotation Divergence

We evaluate the differences between the 2D annotations and the 3D annotations. The 2D annotations on the PARSE dataset were made manually by different annotators. The HumanEva dataset, on the other hand, obtains its 3D annotations from an automatic marker-based system. The placement of the markers therefore becomes crucial, and divergences may occur between the two datasets. A simple example is the head: while the 2D annotations mark the center of the head, the 3D annotations use a marker located on the forehead. As explained in the paper, we compensate for this kind of divergence by learning the relative weighting of the parts. Below we illustrate these differences in the annotations and show examples from both datasets. The differences are especially noticeable in lateral views.
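To make the idea of a relative part weighting concrete, here is a minimal sketch. It is not the learning procedure from the paper: it assumes a simple inverse-mean-squared-divergence weighting, and the function name `part_weights`, the part names and the distances are all hypothetical.

```python
def part_weights(divergences):
    """Inverse-mean-squared-divergence weights, normalized to sum to 1.

    divergences maps a part name to a list of pixel distances between the
    2D annotation and the projected 3D annotation (hypothetical data).
    """
    inverse = {}
    for part, distances in divergences.items():
        mean_sq = sum(d * d for d in distances) / len(distances)
        inverse[part] = 1.0 / (mean_sq + 1e-6)  # small constant avoids division by zero
    total = sum(inverse.values())
    return {part: value / total for part, value in inverse.items()}

# Made-up numbers: the head marker sits on the forehead, so it diverges more
# from the 2D annotation than a knee marker does.
example = {"head": [9.0, 11.0, 10.0], "left_knee": [2.0, 3.0, 2.5]}
weights = part_weights(example)
print(weights)
```

With such a weighting, parts whose annotations systematically disagree between the two datasets contribute less to the overall score than parts whose annotations agree.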

PARSE Dataset

HumanEva Dataset

We highlight some of the differences on the left. These differences are primarily caused by the markers used to obtain the 3D ground truth. These markers are external to the human body, while the 2D annotations on the PARSE dataset are done on the actual body parts. The ground truth is marked in green, the few parts with large differences from the ground truth are marked in red and the differences are marked in cyan.

TUD Stadtmitte

We evaluate qualitatively on the TUD Stadtmitte sequence, as there is no ground truth available. This is a short yet extremely challenging sequence for several reasons. First, it is a complex outdoor scene in which pedestrians frequently occlude one another. Additionally, the pedestrians generally wear uniformly colored jackets and keep their hands in their pockets, which makes it very difficult for the upper body part detectors to provide meaningful responses. Finally, for algorithms that have been trained on the 3D walking sequence of the HumanEva dataset, it poses yet another challenge, as real-world pedestrians generally have very different walking kinematics from those captured in laboratory conditions on a small circular track.

As with the HumanEva dataset, we use the 2D annotations to crop the image to each individual and use the crop as the input to our algorithm. However, unlike with the HumanEva dataset, we have no 3D information at all and must perform the coarse initialization described in our paper. In addition, the annotations on the sequence do not take occlusions into account, either by image borders or by other pedestrians. We perform no additional filtering on these annotations and try to evaluate our method on all the frames.
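The cropping step can be sketched as follows. This is an illustration under assumptions, not the code used for the experiments: it assumes the 2D annotations are available as pixel coordinates of body points, and the function name `crop_box`, the `margin` parameter and the example coordinates are hypothetical.

```python
def crop_box(points, image_size, margin=0.1):
    """Return (left, top, right, bottom) enclosing the points, padded and clamped.

    points is a list of (x, y) pixel coordinates for one person and
    image_size is (width, height). The box is half-open: right and bottom
    are exclusive, as in array slicing.
    """
    width, height = image_size
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    pad_x = margin * (max(xs) - min(xs))
    pad_y = margin * (max(ys) - min(ys))
    # Clamp to the image so that people near the border still yield a valid crop.
    left = max(0, int(min(xs) - pad_x))
    top = max(0, int(min(ys) - pad_y))
    right = min(width, int(max(xs) + pad_x) + 1)
    bottom = min(height, int(max(ys) + pad_y) + 1)
    return left, top, right, bottom

# A person standing close to the right and bottom borders of a 640x480 frame.
points = [(600, 120), (590, 200), (630, 330), (610, 470)]
print(crop_box(points, image_size=(640, 480)))
```

Clamping keeps the crop valid near the image border, but it cannot recover the part of the person that lies outside the frame, which is the situation in several of the frames discussed below.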

For a qualitative comparison with a tracking method (recall that we do not consider temporal consistency and estimate both the 3D and 2D pose from a single frame), please see this project site. It is interesting to note that many of the issues we face, such as poor results from the upper body detectors, are shared with the original results on the sequence. However, as they use tracking, they are able to correct them with additional temporal information.

Below we show 5 selected frames from the sequence with comments on the results.

Frame 7024: The man on the right is so close to the border that the detectors fail to properly estimate half of his body.

Frame 7047: The man in cyan shares a leg detection with a pedestrian he is occluding. The person in red is still too close to the border. Additionally, the man in purple is partially occluded by a small tree and is improperly located.

Frame 7074: All but the man in green are partially occluded. In the case of the man in green, the upper body detectors fail to give meaningful output and he is erroneously detected as facing forwards. As we can see, the upper body detectors struggle to localize parts precisely when pedestrians have their hands in their pockets, which tends to produce 90 degree errors in the estimated viewpoint.

Frame 7111: Only one person is annotated due to the heavy occlusion of two pedestrians, which we improperly detect as the green person.

Frame 7129: The person in green is too close to the border and is improperly detected at a lower scale (further away). The person in light blue occludes the person in purple.

Publications

2026

ProjFlow: Projection Sampling with Flow Matching for Zero‑Shot Exact Spatial Motion Control

Akihisa Watanabe, Qing Yu, Edgar Simo-Serra, Kent Fujiwara

Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on representative applications, motion inpainting, and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.

@InProceedings{WatanabeCVPR2026,
   author    = {Akihisa Watanabe and Qing Yu and Edgar Simo-Serra and Kent Fujiwara},
   title     = {{ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)",
   year      = 2026,
}

2013

A Joint Model for 2D and 3D Pose Estimation from a Single Image

Edgar Simo-Serra, Ariadna Quattoni, Carme Torras, Francesc Moreno-Noguer

Conference on Computer Vision and Pattern Recognition (CVPR), 2013

We introduce a novel approach to automatically recover 3D human pose from a single image. Most previous work follows a pipelined approach: initially, a set of 2D features such as edges, joints or silhouettes are detected in the image, and then these observations are used to infer the 3D pose. Solving these two problems separately may lead to erroneous 3D poses when the feature detector has performed poorly. In this paper, we address this issue by jointly solving both the 2D detection and the 3D inference problems. For this purpose, we propose a Bayesian framework that integrates a generative model based on latent variables and discriminative 2D part detectors based on HOGs, and perform inference using evolutionary algorithms. Real experimentation demonstrates competitive results, and the ability of our methodology to provide accurate 2D and 3D pose estimations even when the 2D detectors are inaccurate.

@InProceedings{SimoSerraCVPR2013,
   author    = {Edgar Simo-Serra and Ariadna Quattoni and Carme Torras and Francesc Moreno-Noguer},
   title     = {{A Joint Model for 2D and 3D Pose Estimation from a Single Image}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)",
   year      = 2013,
}

2012

Single Image 3D Human Pose Estimation from Noisy Observations

Edgar Simo-Serra, Arnau Ramisa, Guillem Alenyà, Carme Torras, Francesc Moreno-Noguer

Conference on Computer Vision and Pattern Recognition (CVPR), 2012

Markerless 3D human pose detection from a single image is a severely underconstrained problem in which different 3D poses can have very similar image projections. In order to handle this ambiguity, current approaches rely on prior shape models whose parameters are estimated by minimizing image-based objective functions that require 2D features to be accurately detected in the input images. Unfortunately, although current 2D part detectors algorithms have shown promising results, their accuracy is not yet sufficiently high to subsequently infer the 3D human pose in a robust and unambiguous manner. We introduce a novel approach for estimating 3D human pose even when observations are noisy. We propose a stochastic sampling strategy to propagate the noise from the image plane to the shape space. This provides a set of ambiguous 3D shapes, which are virtually undistinguishable using image-based information alone. Disambiguation is then achieved by imposing kinematic constraints that guarantee the resulting pose resembles a 3D human shape. We validate our approach on a variety of situations in which state-of-the-art 2D detectors yield either inaccurate estimations or partly miss some of the body parts.

@InProceedings{SimoSerraCVPR2012,
   author    = {Edgar Simo-Serra and Arnau Ramisa and Guillem Aleny\`a and Carme Torras and Francesc Moreno-Noguer},
   title     = {{Single Image 3D Human Pose Estimation from Noisy Observations}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)",
   year      = 2012,
}