Image Processing Research

Image processing consists of extracting useful information from images. The research here focuses on a wide variety of tasks, from estimating the 3D pose of humans from a single image to the automatic colorization of black-and-white images.

  • Globally and Locally Consistent Image Completion

    We present a novel approach for image completion that results in images that are both locally and globally consistent. With a fully-convolutional neural network, we can complete images of arbitrary resolutions by filling in missing regions of any shape. To train this image completion network to be consistent, we use global and local context discriminators that are trained to distinguish real images from completed ones. The global discriminator looks at the entire image to assess if it is coherent as a whole, while the local discriminator looks only at a small area centered at the completed region to ensure the local consistency of the generated patches. The image completion network is then trained to fool both context discriminator networks, which requires it to generate images that are indistinguishable from real ones with regard to overall consistency as well as fine details. We show that our approach can be used to complete a wide variety of scenes. Furthermore, in contrast with patch-based approaches such as PatchMatch, our approach can generate fragments that do not appear elsewhere in the image, which allows us to naturally complete images of objects with familiar and highly specific structures, such as faces.

  • Colorization of Black and White Images

    We present a novel technique to automatically colorize grayscale images that combines both global priors and local image features. Based on Convolutional Neural Networks, our deep network features a fusion layer that allows us to elegantly merge local information dependent on small image patches with global priors computed using the entire image. The entire framework, including the global and local priors as well as the colorization model, is trained in an end-to-end fashion. Furthermore, our architecture can process images of any resolution, unlike most existing approaches based on CNNs. We leverage an existing large-scale scene classification database to train our model, exploiting the class labels of the dataset to more efficiently and discriminatively learn the global priors. We validate our approach with a user study and compare against the state of the art, where we show significant improvements. Furthermore, we demonstrate our method extensively on many different types of images, including black-and-white photography from over a hundred years ago, and show realistic colorizations.

  • Monocular Single Image 3D Human Pose Estimation

    This line of research focuses on estimating the 3D pose of humans from single monocular images. This is an extremely difficult problem due to the large number of ambiguities that arise from the projection of 3D objects onto the image plane. We consider image evidence derived from different detectors for the different parts of the body, which results in noisy 2D estimations whose uncertainty must be compensated for. In order to deal with these issues, we propose different approaches using discriminative and generative models to enforce learnt anthropomorphic constraints. We show that by exploiting prior knowledge of human kinematics it is possible to overcome these ambiguities and obtain good pose estimation performance.

Publications

Restoring Degraded Old Films with Recursive Recurrent Transformer Networks
Shan Lin, Edgar Simo-Serra
Winter Conference on Applications of Computer Vision (WACV), 2024
There exists a large number of old films that have not only artistic value but also historical significance. However, due to the degradation of the analogue medium over time, old films often suffer from various deteriorations that make it difficult to restore them with existing approaches. In this work, we propose a novel framework called the Recursive Recurrent Transformer Network (RRTN), which is specifically designed for restoring degraded old films. Our approach introduces several key advancements, including a more accurate film noise mask estimation method, the utilization of second-order grid propagation and flow-guided deformable alignment, and the incorporation of a recursive structure to further improve the removal of challenging film noise. Through qualitative and quantitative evaluations, our approach demonstrates superior performance compared to existing approaches, effectively improving the restoration of difficult film noise that cannot be perfectly handled by existing approaches.
@InProceedings{LinWACV2024,
   author    = {Shan Lin and Edgar Simo-Serra},
   title     = {{Restoring Degraded Old Films with Recursive Recurrent Transformer Networks}},
   booktitle = "Proceedings of the Winter Conference on Applications of Computer Vision (WACV)",
   year      = 2024,
}
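As a rough illustration of the recursive structure mentioned in the abstract above, the sketch below simply feeds the restored clip back through the same restoration network several times. RestorationNet and its placeholder backbone are hypothetical stand-ins for illustration, not the actual RRTN architecture.

import torch
import torch.nn as nn

class RestorationNet(nn.Module):
    """Hypothetical placeholder for a video restoration backbone."""
    def __init__(self):
        super().__init__()
        self.body = nn.Conv3d(3, 3, kernel_size=3, padding=1)  # stand-in backbone

    def forward(self, clip):  # clip: (batch, channels, frames, height, width)
        return clip + self.body(clip)  # residual restoration

def restore_recursively(net, clip, num_passes=3):
    out = clip
    for _ in range(num_passes):
        out = net(out)  # each pass can remove noise the previous pass missed
    return out

restored = restore_recursively(RestorationNet(), torch.randn(1, 3, 5, 64, 64))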
Diffusion-based Holistic Texture Rectification and Synthesis
Guoqing Hao, Satoshi Iizuka, Kensho Hara, Edgar Simo-Serra, Hirokatsu Kataoka, Kazuhiro Fukui
ACM SIGGRAPH Asia Conference Papers, 2023
We present a novel framework for rectifying occlusions and distortions in degraded texture samples from natural images. Traditional texture synthesis approaches focus on generating textures from pristine samples, which necessitate meticulous preparation by humans and are often unattainable in most natural images. These challenges stem from the frequent occlusions and distortions of texture samples in natural images due to obstructions and variations in object surface geometry. To address these issues, we propose a framework that synthesizes holistic textures from degraded samples in natural images, extending the applicability of exemplar-based texture synthesis techniques. Our framework utilizes a conditional Latent Diffusion Model (LDM) with a novel occlusion-aware latent transformer. This latent transformer not only effectively encodes texture features from partially-observed samples necessary for the generation process of the LDM, but also explicitly captures long-range dependencies in samples with large occlusions. To train our model, we introduce a method for generating synthetic data by applying geometric transformations and free-form mask generation to clean textures. Experimental results demonstrate that our framework significantly outperforms existing methods both quantitatively and qualitatively. Furthermore, we conduct comprehensive ablation studies to validate the different components of our proposed framework. Results are corroborated by a perceptual user study which highlights the efficiency of our proposed approach.
@InProceedings{HaoSIGGRAPHASIA2023,
   author    = {Guoqing Hao and Satoshi Iizuka and Kensho Hara and Edgar Simo-Serra and Hirokatsu Kataoka and Kazuhiro Fukui},
   title     = {{Diffusion-based Holistic Texture Rectification and Synthesis}},
   booktitle = "ACM SIGGRAPH Asia 2023 Conference Papers",
   year      = 2023,
}
Diffusart: Enhancing Line Art Colorization with Conditional Diffusion Models
Hernan Carrillo, Michaël Clément, Aurélie Bugeau, Edgar Simo-Serra
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023
Colorization of line art drawings is an important task in illustration and animation workflows. However, this highly laborious process is mainly done manually, limiting creative productivity. This paper presents a novel interactive approach for line art colorization using conditional Diffusion Probabilistic Models (DPMs). In our proposed approach, the user provides initial color strokes for colorizing the line art. The strokes are then integrated into the conditional DPM-based colorization process by means of a coupled implicit and explicit conditioning strategy to generate diverse and high-quality colorized images. We evaluate our proposal and show it outperforms existing state-of-the-art approaches using the FID, LPIPS and SSIM metrics.
@InProceedings{CarrilloCVPRW2023,
   author    = {Hernan Carrillo and Micha\"el Cl\'ement and Aur\'elie Bugeau and Edgar Simo-Serra},
   title     = {{Diffusart: Enhancing Line Art Colorization with Conditional Diffusion Models}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)",
   year      = 2023,
}
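As a hedged sketch of stroke-conditioned diffusion (a simplification, not the paper's coupled implicit and explicit conditioning strategy), the snippet below concatenates the line art and user strokes as extra input channels to a toy denoiser and trains it with the standard noise-prediction objective; all names, layers and channel counts are made up for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 (noisy RGB) + 1 (line art) + 4 (RGBA strokes) input channels
        self.net = nn.Sequential(
            nn.Conv2d(8, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, noisy, line_art, strokes, t):
        # timestep embedding omitted for brevity
        return self.net(torch.cat([noisy, line_art, strokes], dim=1))

def training_step(model, x0, line_art, strokes, alphas_cumprod):
    t = torch.randint(0, len(alphas_cumprod), (x0.size(0),))
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward diffusion step
    return F.mse_loss(model(noisy, line_art, strokes, t), noise)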
Image Synthesis-based Late Stage Cancer Augmentation and Semi-Supervised Segmentation for MRI Rectal Cancer Staging
Saeko Sasuga, Akira Kudo, Yoshiro Kitamura, Satoshi Iizuka, Edgar Simo-Serra, Atsushi Hamabe, Masayuki Ishii, Ichiro Takemasa
International Conference on Medical Image Computing and Computer Assisted Intervention Workshops (MICCAIW), 2022
Rectal cancer is one of the most common diseases and a major cause of mortality. For deciding rectal cancer treatment plans, T-staging is important. However, evaluating the index from preoperative MRI images requires a high level of radiologists' skill and experience. Therefore, the aim of this study is to segment the mesorectum, rectum, and rectal cancer region so that the system can predict T-stage from segmentation results. Generally, a shortage of large and diverse datasets and of high-quality annotations is known to be the bottleneck in computer-aided diagnostics development. Regarding rectal cancer, advanced cancer images are very rare, and per-pixel annotation requires considerable radiologists' skill and time. Therefore, it is not feasible to collect comprehensive disease patterns in a training dataset. To tackle this, we propose two kinds of approaches: image synthesis-based late stage cancer augmentation and semi-supervised learning designed for T-stage prediction. In the image synthesis data augmentation approach, we generated advanced cancer images from labels. The real cancer labels were deformed to resemble advanced cancer labels by artificial cancer progress simulation. Next, we introduce a T-staging loss which enables us to train segmentation models from per-image T-stage labels. The loss works to keep the inclusion/invasion relationships between the rectum and cancer region consistent with the ground truth T-stage. The verification tests show that the proposed method obtains the best sensitivity (0.76) and specificity (0.80) in distinguishing between over T3 stage and under T2. In the ablation studies, our semi-supervised learning approach with the T-staging loss improved specificity by 0.13. Adding the image synthesis-based data augmentation improved the DICE score of the invasion cancer area by 0.08 from baseline. We expect that this rectal cancer staging AI can help doctors diagnose cancer staging accurately.
@InProceedings{SasugaMICCAIW2022,
   author    = {Saeko Sasuga and Akira Kudo and Yoshiro Kitamura and Satoshi Iizuka and Edgar Simo-Serra and Atsushi Hamabe and Masayuki Ishii and Ichiro Takemasa},
   title     = {{Image Synthesis-based Late Stage Cancer Augmentation and Semi-Supervised Segmentation for MRI Rectal Cancer Staging}},
   booktitle = "Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention Workshops (MICCAIW)",
   year      = 2022,
}
Line Art Colorization with Concatenated Spatial Attention
Mingcheng Yuan, Edgar Simo-Serra
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021
Line art plays a fundamental role in illustration and design, and allows for iteratively polishing designs. However, as line art lacks color, it can have issues in conveying the final design. In this work, we propose an interactive colorization approach based on a conditional generative adversarial network that takes both the line art and color hints as inputs to produce a high-quality colorized image. Our approach is based on a U-net architecture with a multi-discriminator framework. We propose a Concatenation and Spatial Attention module that is able to generate more consistent and higher-quality line art colorization from user-given hints. We evaluate on a large-scale illustration dataset, and comparison with existing approaches corroborates the effectiveness of our approach.
@InProceedings{YuanCVPRW2021,
   author    = {Mingcheng Yuan and Edgar Simo-Serra},
   title     = {{Line Art Colorization with Concatenated Spatial Attention}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)",
   year      = 2021,
}
LoL-V2T: Large-Scale Esports Video Description Dataset
Tsunehiko Tanaka, Edgar Simo-Serra
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021
Esports is a fast-growing new field with a largely online presence, and is creating a demand for automatic domain-specific captioning tools. However, at the current time, there are few approaches that tackle the esports video description problem. In this work, we propose a large-scale dataset for esports video description, focusing on the popular game "League of Legends". The dataset, which we call LoL-V2T, is the largest video description dataset in the video game domain, and includes 9,723 clips with 62,677 captions. This new dataset presents multiple new video captioning challenges such as large amounts of domain-specific vocabulary, subtle motions with large importance, and a temporal gap between most captions and the events that occurred. In order to tackle the issue of vocabulary, we propose masking the domain-specific words and provide additional annotations for this masking. In our results, we show that the dataset poses a challenge to existing video captioning approaches, and that the masking can significantly improve performance.
@InProceedings{TanakaCVPRW2021,
   author    = {Tsunehiko Tanaka and Edgar Simo-Serra},
   title     = {{LoL-V2T: Large-Scale Esports Video Description Dataset}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)",
   year      = 2021,
}
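The vocabulary masking idea lends itself to a very small sketch: replace domain-specific tokens in a caption with a placeholder token. The term list, mask token and example caption below are invented for illustration and are not taken from the dataset.

DOMAIN_TERMS = {"baron", "dragon", "nexus", "gank", "ult"}   # hypothetical term list

def mask_caption(caption, mask_token="<term>"):
    return " ".join(mask_token if w.lower() in DOMAIN_TERMS else w
                    for w in caption.split())

print(mask_caption("The jungler secures Baron and starts a gank"))
# -> The jungler secures <term> and starts a <term>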
Differentiable Rendering-based Pose-Conditioned Human Image Generation
Yusuke Horiuchi, Edgar Simo-Serra, Satoshi Iizuka, Hiroshi Ishikawa
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021
Conditional human image generation, or the generation of human images with a specified pose based on one or more reference images, is an inherently ill-defined problem, as there can be multiple plausible appearances for parts that are occluded in the reference. Using multiple images can mitigate this problem while boosting performance. In this work, we introduce a differentiable vertex and edge renderer for incorporating pose information to realize human image generation conditioned on multiple reference images. The differentiable renderer has parameters that can be jointly optimized with other parts of the system to obtain better results by learning a more meaningful shape representation of human pose. We evaluate our method on the Market-1501 and DeepFashion datasets, and comparison with existing approaches validates the effectiveness of our approach.
@InProceedings{HoriuchiCVPRW2021,
   author    = {Yusuke Horiuchi and Edgar Simo-Serra and Satoshi Iizuka and Hiroshi Ishikawa},
   title     = {{Differentiable Rendering-based Pose-Conditioned Human Image Generation}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)",
   year      = 2021,
}
Automatic Segmentation, Localization and Identification of Vertebrae in 3D CT Images Using Cascaded Convolutional Neural Networks
Naoto Masuzawa, Yoshiro Kitamura, Keigo Nakamura, Satoshi Iizuka, Edgar Simo-Serra
International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2020
This paper presents a method for automatic segmentation, localization, and identification of vertebrae in arbitrary 3D CT images. Many previous works do not perform the three tasks simultaneously, even while requiring a priori knowledge of which part of the anatomy is visible in the 3D CT images. Our method tackles all these tasks in a single multi-stage framework without any such assumptions. In the first stage, we train a 3D Fully Convolutional Network to find the bounding box of the cervical, thoracic, and lumbar vertebrae. In the second stage, we train an iterative 3D Fully Convolutional Network to segment individual vertebrae in the bounding box. The input to the second network has an auxiliary channel in addition to the 3D CT images. Given the segmented vertebra regions in the auxiliary channel, the network outputs the next vertebra. The proposed method is evaluated in terms of segmentation, localization and identification accuracy with two public datasets: 15 3D CT images from the MICCAI CSI 2014 workshop challenge and 302 3D CT images with various pathologies. Our method achieved a mean Dice score of 96%, a mean localization error of 8.3 mm, and a mean identification rate of 84%. In summary, our method achieved better performance than all existing works in all three metrics.
@InProceedings{MasuzawaMICCAI2020,
   author    = {Naoto Masuzawa and Yoshiro Kitamura and Keigo Nakamura and Satoshi Iizuka and Edgar Simo-Serra},
   title     = {{Automatic Segmentation, Localization and Identification of Vertebrae in 3D CT Images Using Cascaded Convolutional Neural Networks}},
   booktitle = "Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI)",
   year      = 2020,
}
TopNet: Topology Preserving Metric Learning for Vessel Tree Reconstruction and Labelling
Deepak Keshwani, Yoshiro Kitamura, Satoshi Ihara, Satoshi Iizuka, Edgar Simo-Serra
International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2020
Reconstructing portal vein and hepatic vein trees from contrast-enhanced abdominal CT scans is a prerequisite for preoperative liver surgery simulation. Existing deep learning based methods treat vascular tree reconstruction as a semantic segmentation problem. However, vessels such as the hepatic and portal veins look very similar locally and need to be traced to their source for robust label assignment. Therefore, semantic segmentation based on local 3D patches results in noisy misclassifications. To tackle this, we propose a novel multi-task deep learning architecture for vessel tree reconstruction. The network architecture simultaneously solves the task of detecting voxels on vascular centerlines (i.e. nodes) and estimates connectivity between center-voxels (edges) in the tree structure to be reconstructed. Further, we propose a novel connectivity metric which considers both inter-class distance and intra-class topological distance between center-voxel pairs. Vascular trees are reconstructed starting from the vessel source, using the learned connectivity metric with a shortest path tree algorithm. A thorough evaluation on the public IRCAD dataset shows that the proposed method considerably outperforms existing semantic segmentation based methods. To the best of our knowledge, this is the first deep learning based approach which learns multi-label tree structure connectivity from images.
@InProceedings{KeshwaniMICCAI2020,
   author    = {Deepak Keshwani and Yoshiro Kitamura and Satoshi Ihara and Satoshi Iizuka and Edgar Simo-Serra},
   title     = {{TopNet: Topology Preserving Metric Learning for Vessel Tree Reconstruction and Labelling}},
   booktitle = "Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI)",
   year      = 2020,
}
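To make the reconstruction step concrete, here is a minimal sketch, assuming the network has already produced centerline nodes, candidate edges, and a pairwise connectivity cost, of growing a shortest-path tree from the vessel source with Dijkstra's algorithm; the data structures are illustrative, not the paper's implementation.

import heapq

def shortest_path_tree(nodes, edges, cost, source):
    """nodes: iterable of node ids; edges: dict node -> neighbor list;
    cost(u, v): learned connectivity cost; returns a parent map (the tree)."""
    dist = {n: float("inf") for n in nodes}
    parent = {source: None}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                       # stale heap entry
        for v in edges.get(u, []):
            nd = d + cost(u, v)
            if nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent                          # tree edges are (parent[v], v)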
Two-stage Discriminative Re-ranking for Large-scale Landmark Retrieval
Shuhei Yokoo, Kohei Ozaki, Edgar Simo-Serra, Satoshi Iizuka
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020
We propose an efficient pipeline for large-scale landmark image retrieval that addresses the diversity of the dataset through two-stage discriminative re-ranking. Our approach is based on embedding the images in a feature space using a convolutional neural network trained with a cosine softmax loss. Due to the variance of the images, which include extreme viewpoint changes such as having to retrieve images of the exterior of a landmark from images of the interior, this is very challenging for approaches based exclusively on visual similarity. Our proposed re-ranking approach improves the results in two steps: in the sort-step, we use k-nearest neighbor search with soft-voting to sort the retrieved results based on their label similarity to the query images, and in the insert-step, we add additional samples from the dataset that were not retrieved by image similarity. This approach allows overcoming the low visual diversity in retrieved images. In-depth experimental results show that the proposed approach significantly outperforms existing approaches on the challenging Google Landmarks Datasets. Using our methods, we achieved 1st place in the Google Landmark Retrieval 2019 challenge on Kaggle.
@InProceedings{YokooCVPRW2020,
   author    = {Shuhei Yokoo and Kohei Ozaki and Edgar Simo-Serra and Satoshi Iizuka},
   title     = {{Two-stage Discriminative Re-ranking for Large-scale Landmark Retrieval}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)",
   year      = 2020,
}
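A small sketch of the sort-step described above: the query's label is soft-voted from its k nearest labelled training images, and retrieved results whose labels received the most votes are moved to the front. Array shapes and the assumption of L2-normalized features are illustrative, not taken from the actual pipeline.

import numpy as np

def sort_step(query_feat, retrieved_ids, retrieved_labels,
              train_feats, train_labels, k=5):
    sims = train_feats @ query_feat                    # cosine similarities (features L2-normalized)
    topk = np.argsort(-sims)[:k]
    votes = {}                                         # soft-voted label scores for the query
    for i in topk:
        votes[train_labels[i]] = votes.get(train_labels[i], 0.0) + float(sims[i])
    scores = [votes.get(lbl, 0.0) for lbl in retrieved_labels]
    order = np.argsort(-np.asarray(scores), kind="stable")
    return [retrieved_ids[i] for i in order]           # results with high-vote labels come first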
DeepRemaster: Temporal Source-Reference Attention Networks for Comprehensive Video Enhancement
Satoshi Iizuka, Edgar Simo-Serra
ACM Transactions on Graphics (SIGGRAPH Asia), 2019
The remastering of vintage film comprises a diversity of sub-tasks including super-resolution, noise removal, and contrast enhancement which aim to restore the deteriorated film medium to its original state. Additionally, due to the technical limitations of the time, most vintage film is either recorded in black and white, or has low quality colors, for which colorization becomes necessary. In this work, we propose a single framework to tackle the entire remastering task semi-interactively. Our work is based on temporal convolutional neural networks with attention mechanisms trained on videos with data-driven deterioration simulation. Our proposed source-reference attention allows the model to handle an arbitrary number of reference color images to colorize long videos without the need for segmentation while maintaining temporal consistency. Quantitative analysis shows that our framework outperforms existing approaches, and that, in contrast to existing approaches, the performance of our framework increases with longer videos and more reference color images.
@Article{IizukaSIGGRAPHASIA2019,
   author    = {Satoshi Iizuka and Edgar Simo-Serra},
   title     = {{DeepRemaster: Temporal Source-Reference Attention Networks for Comprehensive Video Enhancement}},
   journal   = "ACM Transactions on Graphics (SIGGRAPH Asia)",
   year      = 2019,
   volume    = 38,
   number    = 6,
}
Understanding the Effects of Pre-training for Object Detectors via Eigenspectrum
Yosuke Shinya, Edgar Simo-Serra, Taiji Suzuki
International Conference on Computer Vision Workshops (ICCVW), 2019
ImageNet pre-training has been regarded as essential for training accurate object detectors for a long time. Recently, it has been shown that object detectors trained from randomly initialized weights can be on par with those fine-tuned from ImageNet pre-trained models. However, the effect of pre-training and the differences caused by pre-training are still not fully understood. In this paper, we analyze the eigenspectrum dynamics of the covariance matrix of each feature map in object detectors. Based on our analysis on ResNet-50, Faster R-CNN with FPN, and Mask R-CNN, we show that object detectors trained from ImageNet pre-trained models and those trained from scratch behave differently from each other even if both object detectors have similar accuracy. Furthermore, we propose a method for automatically determining the widths (the numbers of channels) of object detectors based on the eigenspectrum. We train Faster R-CNN with FPN from randomly initialized weights, and show that our method can reduce ~27% of the parameters of ResNet-50 without increasing Multiply-Accumulate operations (MACs) and losing accuracy. Our results indicate that we should develop more appropriate methods for transferring knowledge from image classification to object detection (or other tasks).
@InProceedings{ShinyaICCVW2019,
   author    = {Yosuke Shinya and Edgar Simo-Serra and Taiji Suzuki},
   title     = {{Understanding the Effects of Pre-training for Object Detectors via Eigenspectrum}},
   booktitle = "Proceedings of the International Conference on Computer Vision Workshops (ICCVW)",
   year      = 2019,
}
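The quantity being analysed above can be illustrated in a few lines: treat every spatial position of a layer's feature map as a sample of a C-dimensional vector and look at the eigenvalues of the resulting covariance matrix. Shapes below are arbitrary and the snippet is only a sketch of the measurement, not the paper's analysis pipeline.

import numpy as np

def feature_eigenspectrum(feature_map):
    """feature_map: activations of one layer, shape (N, C, H, W)."""
    n, c, h, w = feature_map.shape
    x = feature_map.transpose(0, 2, 3, 1).reshape(-1, c)   # (N*H*W, C) samples
    x = x - x.mean(axis=0, keepdims=True)
    cov = x.T @ x / (x.shape[0] - 1)                        # (C, C) covariance matrix
    return np.linalg.eigvalsh(cov)[::-1]                    # eigenvalues, largest first

spectrum = feature_eigenspectrum(np.random.randn(8, 64, 14, 14))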
Virtual Thin Slice: 3D Conditional GAN-based Super-resolution for CT Slice Interval
Akira Kudo, Yoshiro Kitamura, Yuanzhong Li, Satoshi Iizuka, Edgar Simo-Serra
International Conference on Medical Image Computing and Computer Assisted Intervention Workshops (MICCAIW), 2019
Many CT slice images are stored with large slice intervals to reduce storage size in clinical practice. This leads to low resolution perpendicular to the slice images (i.e., the z-axis), which is insufficient for 3D visualization or image analysis. In this paper, we present a novel architecture based on conditional Generative Adversarial Networks (cGANs) with the goal of generating high resolution images of main body parts including the head, chest, abdomen and legs. However, GANs are known to have difficulty generating a diversity of patterns due to a phenomenon known as mode collapse. To overcome the lack of generated pattern variety, we propose to condition the discriminator on the different body parts. Furthermore, our generator networks are extended to be three dimensional fully convolutional neural networks, allowing for the generation of high resolution images from arbitrary fields of view. In our verification tests, we show that the proposed method obtains the best scores by PSNR/SSIM metrics and a Visual Turing Test, allowing for accurate reproduction of the principal anatomy in high resolution. We expect that the proposed method will contribute to the effective utilization of the existing vast amounts of thick CT images stored in hospitals.
@InProceedings{KudoMICCAIW2019,
   author    = {Akira Kudo and Yoshiro Kitamura and Yuanzhong Li and Satoshi Iizuka and Edgar Simo-Serra},
   title     = {{Virtual Thin Slice: 3D Conditional GAN-based Super-resolution for CT Slice Interval}},
   booktitle = "Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention Workshops (MICCAIW)",
   year      = 2019,
}
Optimization-Based Data Generation for Photo Enhancement
Mayu Omiya*, Yusuke Horiuchi*, Edgar Simo-Serra, Satoshi Iizuka, Hiroshi Ishikawa (*equal contribution)
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019
The preparation of large amounts of high-quality training data has always been the bottleneck for the performance of supervised learning methods. It is especially time-consuming for complicated tasks such as photo enhancement. A recent approach to ease data annotation creates realistic training data automatically with optimization. In this paper, we improve upon this approach by learning image similarity, which, in combination with a Covariance Matrix Adaptation optimization method, allows us to create higher-quality training data for enhancing photos. We evaluate our approach on challenging real-world photo enhancement images by conducting a perceptual user study, which shows that its performance compares favorably with existing approaches.
@InProceedings{OmiyaCVPRW2019,
   author    = {Mayu Omiya and Yusuke Horiuchi and Edgar Simo-Serra and Satoshi Iizuka and Hiroshi Ishikawa},
   title     = {{Optimization-Based Data Generation for Photo Enhancement}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)",
   year      = 2019,
}
Learning Photo Enhancement by Black-Box Model Optimization Data Generation
Mayu Omiya, Edgar Simo-Serra, Satoshi Iizuka, Hiroshi Ishikawa
SIGGRAPH Asia Technical Brief, 2018
We address the problem of automatic photo enhancement, in which the challenge is to determine the optimal enhancement for a given photo according to its content. For this purpose, we train a convolutional neural network to predict the best enhancement for a given picture. While such machine learning techniques have shown great promise in photo enhancement, there are some limitations. One is the problem of interpretability, i.e., that it is not easy for the user to discern what has been done by a machine. In this work, we leverage existing manual photo enhancement tools as a black-box model, and predict the enhancement parameters of that model. Because the tools are designed for human use, the resulting parameters can be interpreted by their users. Another problem is the difficulty of obtaining training data. We propose generating supervised training data from high-quality professional images by randomly sampling realistic de-enhancement parameters. We show that this approach allows automatic enhancement of photographs without the need for large manually labelled supervised training datasets.
@InProceedings{OmiyaSIGGRAPASIABRIEF2018,
   author    = {Mayu Omiya and Edgar Simo-Serra and Satoshi Iizuka and Hiroshi Ishikawa},
   title     = {{Learning Photo Enhancement by Black-Box Model Optimization Data Generation}},
   booktitle = "SIGGRAPH Asia 2018 Technical Briefs",
   year      = 2018,
}
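The de-enhancement idea can be sketched with a toy parameter set: sample random exposure, contrast and saturation degradations, apply them to a high-quality photo, and use the degraded photo together with the sampled parameters as a supervised training pair. The parameters and ranges here are invented for illustration and are not the ones used in the paper.

import random
import numpy as np

def sample_de_enhancement():
    return {
        "exposure":   random.uniform(-1.0, 1.0),   # EV shift
        "contrast":   random.uniform(0.6, 1.4),
        "saturation": random.uniform(0.5, 1.2),
    }

def apply_de_enhancement(img, p):
    """img: float RGB array in [0, 1]."""
    out = img * (2.0 ** p["exposure"])
    out = (out - 0.5) * p["contrast"] + 0.5
    gray = out.mean(axis=2, keepdims=True)
    out = gray + (out - gray) * p["saturation"]
    return np.clip(out, 0.0, 1.0)

# training pair: input = degraded photo, target = the parameters that would undo it
photo = np.random.rand(256, 256, 3)
params = sample_de_enhancement()
degraded = apply_de_enhancement(photo, params)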
Adaptive Energy Selection For Content-Aware Image Resizing
Kazuma Sasaki, Yuya Nagahama, Zheng Ze, Satoshi Iizuka, Edgar Simo-Serra, Yoshihiko Mochizuki, Hiroshi Ishikawa
Asian Conference on Pattern Recognition (ACPR), 2017
Content-aware image resizing aims to reduce the size of an image without touching important objects and regions. In seam carving, this is done by assessing the importance of each pixel by an energy function and repeatedly removing a string of pixels avoiding pixels with high energy. However, there is no single energy function that is best for all images: the optimal energy function is itself a function of the image. In this paper, we present a method for predicting the quality of the results of resizing an image with different energy functions, so as to select the energy best suited for that particular image. We formulate the selection as a classification problem; i.e., we 'classify' the input into the class of images for which one of the energies works best. The standard approach would be to use a CNN for the classification. However, the existence of a fully connected layer forces us to resize the input to a fixed size, which obliterates useful information, especially lower-level features that more closely relate to the energies used for seam carving. Instead, we extract a feature from internal convolutional layers, which results in a fixed-length vector regardless of the input size, making it amenable to classification with a Support Vector Machine. This formulation of the algorithm selection as a classification problem can be used whenever there are multiple approaches for a specific image processing task. We validate our approach with a user study, where our method outperforms recent seam carving approaches.
@InProceedings{SasakiACPR2017,
   author = {Kazuma Sasaki and Yuya Nagahama and Zheng Ze and Satoshi Iizuka and Edgar Simo-Serra and Yoshihiko Mochizuki and Hiroshi Ishikawa},
   title = {{Adaptive Energy Selection For Content-Aware Image Resizing}},
   booktitle = "Proceedings of the Asian Conference on Pattern Recognition (ACPR)",
   year = 2017,
}
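The classification formulation can be sketched briefly: pool an internal convolutional feature map into a fixed-length descriptor regardless of image size, then train an SVM that picks the best energy. The feature extraction and labels below are stubs for illustration only.

import numpy as np
from sklearn.svm import SVC

def pooled_descriptor(conv_features):
    """conv_features: (C, H, W) activations from an internal layer; returns a length-C vector."""
    return conv_features.reshape(conv_features.shape[0], -1).mean(axis=1)

# one descriptor per training image, labelled with the index of the best-performing energy
descriptors = np.stack([pooled_descriptor(np.random.rand(256, h, w))
                        for h, w in [(20, 30), (17, 25), (32, 32)]])
labels = np.array([0, 1, 0])
clf = SVC(kernel="rbf").fit(descriptors, labels)
best_energy = clf.predict(pooled_descriptor(np.random.rand(256, 24, 24))[None])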
Globally and Locally Consistent Image Completion
Satoshi Iizuka, Edgar Simo-Serra, Hiroshi Ishikawa
ACM Transactions on Graphics (SIGGRAPH), 2017
We present a novel approach for image completion that results in images that are both locally and globally consistent. With a fully-convolutional neural network, we can complete images of arbitrary resolutions by filling in missing regions of any shape. To train this image completion network to be consistent, we use global and local context discriminators that are trained to distinguish real images from completed ones. The global discriminator looks at the entire image to assess if it is coherent as a whole, while the local discriminator looks only at a small area centered at the completed region to ensure the local consistency of the generated patches. The image completion network is then trained to fool both context discriminator networks, which requires it to generate images that are indistinguishable from real ones with regard to overall consistency as well as fine details. We show that our approach can be used to complete a wide variety of scenes. Furthermore, in contrast with patch-based approaches such as PatchMatch, our approach can generate fragments that do not appear elsewhere in the image, which allows us to naturally complete images of objects with familiar and highly specific structures, such as faces.
@Article{IizukaSIGGRAPH2017,
   author    = {Satoshi Iizuka and Edgar Simo-Serra and Hiroshi Ishikawa},
   title     = {{Globally and Locally Consistent Image Completion}},
   journal   = "ACM Transactions on Graphics (SIGGRAPH)",
   year      = 2017,
   volume    = 36,
   number    = 4,
}
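As a hedged sketch of the two-discriminator objective (the actual networks fuse the global and local discriminator outputs rather than simply summing logits, and the loss weighting here is illustrative), the snippet below shows how a global and a local context discriminator can jointly drive both the discriminator update and the completion network update.

import torch
import torch.nn.functional as F

def discriminator_loss(global_d, local_d, real_img, real_patch,
                       completed_img, completed_patch):
    real = global_d(real_img) + local_d(real_patch)                    # summed logits (simplification)
    fake = global_d(completed_img.detach()) + local_d(completed_patch.detach())
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
            F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

def completion_loss(global_d, local_d, completed_img, completed_patch,
                    target_img, mask, adv_weight=1e-3):
    rec = F.mse_loss(completed_img * mask, target_img * mask)          # reconstruction over the hole
    fake = global_d(completed_img) + local_d(completed_patch)
    adv = F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
    return rec + adv_weight * adv                                      # weighting is illustrative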
BASS: Boundary-Aware Superpixel Segmentation
Antonio Rubio, Longlong Yu, Edgar Simo-Serra, Francesc Moreno-Noguer
International Conference on Pattern Recognition (ICPR), 2016
We propose a new superpixel algorithm based on exploiting the boundary information of an image, as objects in images can generally be described by their boundaries. Our proposed approach initially estimates the boundaries and uses them to place superpixel seeds in the areas in which they are more dense. Afterwards, we minimize an energy function in order to expand the seeds into full superpixels. In addition to standard terms such as color consistency and compactness, we propose using the geodesic distance which concentrates small superpixels in regions of the image with more information, while letting larger superpixels cover more homogeneous regions. By both improving the initialization using the boundaries and coherency of the superpixels with geodesic distances, we are able to maintain the coherency of the image structure with fewer superpixels than other approaches. We show the resulting algorithm to yield smaller Variation of Information metrics in seven different datasets while maintaining Undersegmentation Error values similar to the state-of-the-art methods.
@InProceedings{RubioICPR2016,
   author = {Antonio Rubio and Longlong Yu and Edgar Simo-Serra and Francesc Moreno-Noguer},
   title = {{BASS: Boundary-Aware Superpixel Segmentation}},
   booktitle = "Proceedings of the International Conference on Pattern Recognition (ICPR)",
   year = 2016,
}
Detection by Classification of Buildings in Multispectral Satellite Imagery
Tomohiro Ishii, Edgar Simo-Serra, Satoshi Iizuka, Yoshihiko Mochizuki, Akihiro Sugimoto, Hiroshi Ishikawa, Ryosuke Nakamura
International Conference on Pattern Recognition (ICPR), 2016
We present an approach for the detection of buildings in multispectral satellite images. Unlike 3-channel RGB images, satellite imagery contains additional channels corresponding to different wavelengths. Approaches that do not use all channels are unable to fully exploit these images for optimal performance. Furthermore, care must be taken due to the large bias in classes, e.g., most of the Earth is covered in water and thus it will be dominant in the images. Our approach consists of training a Convolutional Neural Network (CNN) from scratch to classify multispectral image patches taken by satellites as whether or not they belong to a class of buildings. We then adapt the classification network to detection by converting the fully-connected layers of the network to convolutional layers, which allows the network to process images of any resolution. The dataset bias is compensated by subsampling negatives and tuning the detection threshold for optimal performance. We have constructed a new dataset using images from the Landsat 8 satellite for detecting solar power plants and show our approach is able to significantly outperform the state-of-the-art. Furthermore, we provide an in-depth evaluation of the seven different spectral bands provided by the satellite images and show it is critical to combine them to obtain good results.
@InProceedings{IshiiICPR2016,
   author = {Tomohiro Ishii and Edgar Simo-Serra and Satoshi Iizuka and Yoshihiko Mochizuki and Akihiro Sugimoto and Hiroshi Ishikawa and Ryosuke Nakamura},
   title = {{Detection by Classification of Buildings in Multispectral Satellite Imagery}},
   booktitle = "Proceedings of the International Conference on Pattern Recognition (ICPR)",
   year = 2016,
}
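The conversion from classification to detection mentioned above is a standard trick that can be shown in a few lines: copy the weights of a fully-connected layer into an equivalent convolution so the network can slide over inputs of any resolution. Layer sizes here are illustrative, not those of the actual network.

import torch
import torch.nn as nn

fc = nn.Linear(512 * 7 * 7, 2)             # trained FC classifier over 7x7x512 patches
conv = nn.Conv2d(512, 2, kernel_size=7)    # equivalent convolutional layer
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(2, 512, 7, 7))
    conv.bias.copy_(fc.bias)

# on a 7x7 input both give the same scores; on larger inputs the conv layer
# produces a dense map of patch-wise classifications
x = torch.randn(1, 512, 7, 7)
assert torch.allclose(conv(x).view(1, 2), fc(x.view(1, -1)), atol=1e-5)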
Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification
Satoshi Iizuka*, Edgar Simo-Serra*, Hiroshi Ishikawa (* equal contribution)
ACM Transactions on Graphics (SIGGRAPH), 2016
We present a novel technique to automatically colorize grayscale images that combines both global priors and local image features. Based on Convolutional Neural Networks, our deep network features a fusion layer that allows us to elegantly merge local information dependent on small image patches with global priors computed using the entire image. The entire framework, including the global and local priors as well as the colorization model, is trained in an end-to-end fashion. Furthermore, our architecture can process images of any resolution, unlike most existing approaches based on CNNs. We leverage an existing large-scale scene classification database to train our model, exploiting the class labels of the dataset to more efficiently and discriminatively learn the global priors. We validate our approach with a user study and compare against the state of the art, where we show significant improvements. Furthermore, we demonstrate our method extensively on many different types of images, including black-and-white photography from over a hundred years ago, and show realistic colorizations.
@Article{IizukaSIGGRAPH2016,
   author    = {Satoshi Iizuka and Edgar Simo-Serra and Hiroshi Ishikawa},
   title     = {{Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification}},
   journal   = "ACM Transactions on Graphics (SIGGRAPH)",
   year      = 2016,
   volume    = 35,
   number    = 4,
}
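The fusion layer described above can be sketched compactly: the global feature vector is replicated at every spatial position and concatenated with the local feature map before a 1x1 convolution merges them. Channel sizes below are illustrative, not necessarily the exact ones in the paper.

import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, local_ch=256, global_ch=256, out_ch=256):
        super().__init__()
        self.merge = nn.Conv2d(local_ch + global_ch, out_ch, kernel_size=1)

    def forward(self, local_feats, global_feats):
        # local_feats: (B, C_local, H, W); global_feats: (B, C_global)
        b, _, h, w = local_feats.shape
        g = global_feats.view(b, -1, 1, 1).expand(-1, -1, h, w)
        return torch.relu(self.merge(torch.cat([local_feats, g], dim=1)))

fused = FusionLayer()(torch.randn(2, 256, 28, 28), torch.randn(2, 256))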
A Joint Model for 2D and 3D Pose Estimation from a Single Image
Edgar Simo-Serra, Ariadna Quattoni, Carme Torras, Francesc Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2013
We introduce a novel approach to automatically recover 3D human pose from a single image. Most previous work follows a pipelined approach: initially, a set of 2D features such as edges, joints or silhouettes are detected in the image, and then these observations are used to infer the 3D pose. Solving these two problems separately may lead to erroneous 3D poses when the feature detector has performed poorly. In this paper, we address this issue by jointly solving both the 2D detection and the 3D inference problems. For this purpose, we propose a Bayesian framework that integrates a generative model based on latent variables and discriminative 2D part detectors based on HOGs, and perform inference using evolutionary algorithms. Real experimentation demonstrates competitive results, and the ability of our methodology to provide accurate 2D and 3D pose estimations even when the 2D detectors are inaccurate.
@InProceedings{SimoSerraCVPR2013,
   author    = {Edgar Simo-Serra and Ariadna Quattoni and Carme Torras and Francesc Moreno-Noguer},
   title     = {{A Joint Model for 2D and 3D Pose Estimation from a Single Image}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)",
   year      = 2013,
}
Single Image 3D Human Pose Estimation from Noisy Observations
Edgar Simo-Serra, Arnau Ramisa, Guillem Alenyà, Carme Torras, Francesc Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2012
Markerless 3D human pose detection from a single image is a severely underconstrained problem in which different 3D poses can have very similar image projections. In order to handle this ambiguity, current approaches rely on prior shape models whose parameters are estimated by minimizing image-based objective functions that require 2D features to be accurately detected in the input images. Unfortunately, although current 2D part detection algorithms have shown promising results, their accuracy is not yet sufficiently high to subsequently infer the 3D human pose in a robust and unambiguous manner. We introduce a novel approach for estimating 3D human pose even when observations are noisy. We propose a stochastic sampling strategy to propagate the noise from the image plane to the shape space. This provides a set of ambiguous 3D shapes, which are virtually indistinguishable using image-based information alone. Disambiguation is then achieved by imposing kinematic constraints that guarantee the resulting pose resembles a 3D human shape. We validate our approach on a variety of situations in which state-of-the-art 2D detectors yield either inaccurate estimations or partly miss some of the body parts.
@InProceedings{SimoSerraCVPR2012,
   author    = {Edgar Simo-Serra and Arnau Ramisa and Guillem Aleny\`a and Carme Torras and Francesc Moreno-Noguer},
   title     = {{Single Image 3D Human Pose Estimation from Noisy Observations}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)",
   year      = 2012,
}

Source Code

Inpainting Network
Inpainting Network, 1.0 (Feb, 2018)
Globally and locally consistent image completion network
This code is the implementation of the "Globally and Locally Consistent Image Completion" paper. It contains the pre-trained model and example usage code.
Colorization Network
Colorization Network, 1.0 (Apr, 2016)
Let there be Color! Colorization Network
This code is the implementation of the "Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification" paper. It contains the pre-trained model and example usage code.
bttc
bttc, 1.0 (Mar, 2012)
Small library to handle B-Tree Triangular Coding (BTTC)
This library calculates the faces obtained by B-Tree Triangular Coding (BTTC). This is usually for subdividing an image into a triangular mesh. The core library is written in C but an octave/matlab interface is provided. The main focus of this library is simplicity. The code is simple enough to directly integrate it into another program as it is a single C code file with no external dependencies.
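For readers unfamiliar with BTTC, the sketch below shows the core idea in Python (the released library itself is in C): starting from two right triangles covering the image, a triangle is split at the midpoint of its hypotenuse whenever the image value there is poorly predicted by the hypotenuse endpoints. The single-point error test and the threshold are simplifications for illustration, not the library's exact criterion.

import numpy as np

def bttc_faces(img, threshold=8.0):
    """img: 2D grayscale array (ideally of size 2^n + 1); returns triangles as vertex tuples."""
    h, w = img.shape
    faces = []

    def split(apex, a, b):
        # (a, b) is the hypotenuse, apex the right-angle vertex; vertices are (row, col)
        mid = ((a[0] + b[0]) // 2, (a[1] + b[1]) // 2)
        if mid in (a, b, apex):                        # cannot subdivide further
            faces.append((apex, a, b))
            return
        predicted = (float(img[a]) + float(img[b])) / 2.0
        if abs(float(img[mid]) - predicted) <= threshold:
            faces.append((apex, a, b))                 # approximation is good enough
        else:
            split(mid, apex, a)                        # split into two smaller right triangles
            split(mid, apex, b)

    split((0, w - 1), (0, 0), (h - 1, w - 1))          # upper triangle of the image
    split((h - 1, 0), (0, 0), (h - 1, w - 1))          # lower triangle of the image
    return faces

faces = bttc_faces(np.random.randint(0, 256, (65, 65)).astype(float))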