Fashion Research

Fashion plays a fundamental role in our everyday lives and conveys a large amount of non-verbal information. However, dealing with fashion computationally is complicated by its inherent subjectivity, which makes it near-impossible to annotate large datasets consistently. The research here tackles these issues in order to make sense of the vast amounts of fashion data on the web.

  • Fashion Style in 128 Floats

    In this work we present an approach for learning features from large amounts of weakly-labelled data. Our approach consists of jointly training a convolutional neural network with both a ranking loss and a classification loss, exploiting user-provided metadata of images on the web. We define a rough concept of similarity between images using this metadata, which allows us to define a ranking loss based on this similarity. Combining this ranking loss with a standard classification loss, we are able to learn a compact 128-float representation of fashion style, using only noisy user-provided tags, that outperforms standard features. Furthermore, qualitative analysis shows that our model is able to automatically learn nuances in style.

  • Neuroaesthetics in Fashion

    Being able to understand and model fashion can have a great impact on everyday life. From choosing your outfit in the morning to picking the best picture for your social network profile, we make fashion decisions on a daily basis that can have an impact on our lives. As not everyone has access to a fashion expert for advice on current trends and which picture looks best, we have been working on developing systems that are able to automatically learn about fashion and provide useful recommendations to users. In this work we focus on building models that are able to discover and understand fashion. For this purpose we have created the Fashion144k dataset, consisting of 144,169 user posts with images and their associated metadata. We exploit the votes given to each post by different users to obtain a measure of fashionability, that is, how fashionable the user and their outfit are in the image. We propose the challenging task of identifying the fashionability of the posts and present an approach that, by combining many different sources of information, is able not only to predict fashionability but also to give fashion advice to the users.

  • Clothes Segmentation

    In this research we focus on the semantic segmentation of clothing in still images. This is a very complex task due to the large number of classes, where intra-class variability can be larger than inter-class variability. We propose a Conditional Random Field (CRF) model that is able to leverage many different image features to obtain state-of-the-art performance on the challenging Fashionista dataset.

Publications

Regularized Adversarial Training for Single-shot Virtual Try-On
Kotaro Kikuchi, Kota Yamaguchi, Edgar Simo-Serra, Tetsunori Kobayashi
International Conference on Computer Vision Workshops (ICCVW), 2019
Spatially placing an object onto a background is an essential operation in graphic design and facilitates many different applications such as virtual try-on. The placing operation is formulated as a geometric inference problem for given foreground and background images, and has been approached with spatial transformer architectures. In this paper, we propose a simple yet effective regularization technique to guide the geometric parameters based on user-defined trust regions. Our approach stabilizes the training process of spatial transformer networks and achieves high-quality predictions with single-shot inference. Our proposed method is independent of the initial parameters, and can easily incorporate various priors to prevent different types of trivial solutions. Empirical evaluation with the Abstract Scenes and CelebA datasets shows that our approach achieves favorable results compared to baselines.
@InProceedings{KikuchiICCVW2019,
   author    = {Kotaro Kikuchi and Kota Yamaguchi and Edgar Simo-Serra and Tetsunori Kobayashi},
   title     = {{Regularized Adversarial Training for Single-shot Virtual Try-On}},
   booktitle = "Proceedings of the International Conference on Computer Vision Workshops (ICCVW)",
   year      = 2019,
}
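The trust-region idea can be illustrated with a minimal sketch: geometric parameters that stay inside user-defined bounds incur no cost, while parameters that leave them are pulled back quadratically. The quadratic form and the `lower`/`upper` bounds below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def trust_region_penalty(theta, lower, upper):
    """Quadratic penalty on geometric parameters that leave a
    user-defined trust region [lower, upper]; zero inside it.
    Hypothetical form, for illustration of the idea only."""
    theta = np.asarray(theta, dtype=float)
    below = np.clip(np.asarray(lower, float) - theta, 0.0, None)
    above = np.clip(theta - np.asarray(upper, float), 0.0, None)
    return float(np.sum(below ** 2 + above ** 2))
```

In training, such a penalty would be added to the adversarial objective so that the spatial transformer cannot drift toward degenerate placements.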
Multi-Modal Embedding for Main Product Detection in Fashion
Antonio Rubio, Longlong Yu, Edgar Simo-Serra, Francesc Moreno-Noguer
International Conference on Computer Vision Workshops (ICCVW) [best paper], 2017
We present an approach to detect the main product in fashion images by exploiting the textual metadata associated with each image. Our approach is based on a Convolutional Neural Network and learns a joint embedding of object proposals and textual metadata to predict the main product in the image. We additionally use several complementary classification and overlap losses in order to improve training stability and performance. Our tests on a large-scale dataset taken from eight e-commerce sites show that our approach outperforms strong baselines and is able to accurately detect the main product in a wide diversity of challenging fashion images.
@InProceedings{RubioICCVW2017,
   author    = {Antonio Rubio and Longlong Yu and Edgar Simo-Serra and Francesc Moreno-Noguer},
   title     = {{Multi-Modal Embedding for Main Product Detection in Fashion}},
   booktitle = "Proceedings of the International Conference on Computer Vision Workshops (ICCVW)",
   year      = 2017,
}
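Once a joint embedding is learned, selecting the main product reduces to scoring each object proposal against the textual metadata. The sketch below shows only that final selection step, assuming both embeddings are already given; cosine similarity is used here for illustration.

```python
import numpy as np

def main_product(proposal_embs, text_emb):
    """Score each object-proposal embedding against the textual
    metadata embedding by cosine similarity and return the index
    of the best-matching proposal."""
    p = np.asarray(proposal_embs, dtype=float)
    t = np.asarray(text_emb, dtype=float)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)  # unit-normalize rows
    t = t / np.linalg.norm(t)
    return int(np.argmax(p @ t))  # highest cosine similarity wins
```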
What Makes a Style: Experimental Analysis of Fashion Prediction
Moeko Takagi, Edgar Simo-Serra, Satoshi Iizuka, Hiroshi Ishikawa
International Conference on Computer Vision Workshops (ICCVW), 2017
In this work, we perform an experimental analysis of the differences in how humans and machines see and distinguish fashion styles. For this purpose, we propose a new expert-curated dataset for fashion style prediction, which consists of 14 different fashion styles, each with roughly 1,000 images of worn outfits. The dataset, with a total of 13,126 images, captures the diversity and complexity of modern fashion styles. We perform an extensive analysis of the dataset by benchmarking a wide variety of modern classification networks, and also perform an in-depth user study with both fashion-savvy and fashion-naive users. Our results indicate that, although classification networks are able to outperform naive users, they are still far from the performance of savvy users, for whom it is important to consider not only texture and color, but also subtle differences in the combination of garments.
@InProceedings{TakagiICCVW2017,
   author    = {Moeko Takagi and Edgar Simo-Serra and Satoshi Iizuka and Hiroshi Ishikawa},
   title     = {{What Makes a Style: Experimental Analysis of Fashion Prediction}},
   booktitle = "Proceedings of the International Conference on Computer Vision Workshops (ICCVW)",
   year      = 2017,
}
Multi-Label Fashion Image Classification with Minimal Human Supervision
Naoto Inoue, Edgar Simo-Serra, Toshihiko Yamasaki, Hiroshi Ishikawa
International Conference on Computer Vision Workshops (ICCVW), 2017
We tackle the problem of multi-label classification of fashion images from noisy data using minimal human supervision. We present a new dataset of full-body poses, each with a set of 66 binary labels corresponding to information about the garments worn in the image, obtained in an automatic manner. As the automatically collected labels contain significant noise, we manually correct the labels for a small subset of the data, and use these corrected labels for further training and for evaluating the model. We build upon a recent approach that cleans the noisy labels while learning to classify, and show simple changes that can significantly improve the performance.
@InProceedings{InoueICCVW2017,
   author    = {Naoto Inoue and Edgar Simo-Serra and Toshihiko Yamasaki and Hiroshi Ishikawa},
   title     = {{Multi-Label Fashion Image Classification with Minimal Human Supervision}},
   booktitle = "Proceedings of the International Conference on Computer Vision Workshops (ICCVW)",
   year      = 2017,
}
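A learned label-cleaning network is beyond a short snippet, but the underlying idea of mixing noisy tags with model predictions can be sketched as follows; the fixed `trust` weight is a hypothetical stand-in for what the cleaning network actually learns per label.

```python
import numpy as np

def clean_labels(noisy, predictions, trust=0.7):
    """Blend noisy binary tags with model prediction scores and
    re-binarize. A simplified stand-in for a learned label-cleaning
    network; `trust` is a hypothetical fixed mixing weight."""
    blended = (trust * np.asarray(noisy, dtype=float)
               + (1 - trust) * np.asarray(predictions, dtype=float))
    return (blended >= 0.5).astype(int)
```

Cleaned labels of this sort would then serve as training targets for the multi-label classifier in place of the raw noisy tags.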
Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction
Edgar Simo-Serra and Hiroshi Ishikawa
Conference on Computer Vision and Pattern Recognition (CVPR), 2016
We propose a novel approach for learning features from weakly-supervised data by joint ranking and classification. In order to exploit data with weak labels, we jointly train a feature extraction network with a ranking loss and a classification network with a cross-entropy loss. We obtain high-quality compact discriminative features with few parameters, learned on relatively small datasets without additional annotations. This enables us to tackle tasks with specialized images not very similar to the more generic ones in existing fully-supervised datasets. We show that the resulting features in combination with a linear classifier surpass the state-of-the-art on the Hipster Wars dataset despite using features only 0.3% of the size. Our proposed features significantly outperform those obtained from networks trained on ImageNet, despite being 32 times smaller (128 single-precision floats), trained on noisy and weakly-labeled data, and using only 1.5% of the number of parameters.
@InProceedings{SimoSerraCVPR2016,
   author    = {Edgar Simo-Serra and Hiroshi Ishikawa},
   title     = {{Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)",
   year      = 2016,
}
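A minimal sketch of the joint objective, assuming a triplet form of the ranking loss on the 128-float features and a softmax cross-entropy classification loss; the loss forms and the `alpha` weighting are illustrative, not the released implementation.

```python
import numpy as np

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss pushing weakly-similar pairs closer together
    than dissimilar ones in feature space."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, margin + d_pos - d_neg)

def cross_entropy_loss(logits, label):
    """Standard softmax cross-entropy on classifier logits."""
    shifted = logits - logits.max()  # numerically stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def joint_loss(anchor, positive, negative, logits, label, alpha=0.5):
    """Weighted combination of the ranking and classification terms."""
    return (alpha * triplet_ranking_loss(anchor, positive, negative)
            + (1 - alpha) * cross_entropy_loss(logits, label))
```

During training, both terms would be backpropagated jointly through the shared feature-extraction network.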
Neuroaesthetics in Fashion: Modeling the Perception of Fashionability
Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun
Conference on Computer Vision and Pattern Recognition (CVPR), 2015
In this paper, we analyze the fashion of clothing on a large social website. Our goal is to learn and predict how fashionable a person looks in a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph's setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.
@InProceedings{SimoSerraCVPR2015,
   author    = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
   title     = {{Neuroaesthetics in Fashion: Modeling the Perception of Fashionability}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)",
   year      = 2015,
}
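The vote-based fashionability measure can be sketched as a simple quantile binning of vote counts into discrete levels; the exact normalization used in the paper differs, so treat this purely as an illustration.

```python
import numpy as np

def fashionability_score(votes, n_levels=10):
    """Map raw vote counts to discrete fashionability levels by
    quantile binning over log-scaled counts, so each level has
    roughly equal support. Illustrative only; the paper's exact
    normalization differs."""
    log_votes = np.log1p(np.asarray(votes, dtype=float))
    edges = np.quantile(log_votes, np.linspace(0, 1, n_levels + 1))
    # Interior edges split the counts into n_levels bins, 1..n_levels
    return np.digitize(log_votes, edges[1:-1]) + 1
```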
A High Performance CRF Model for Clothes Parsing
Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun
Asian Conference on Computer Vision (ACCV), 2014
In this paper we tackle the problem of semantic segmentation of clothing. We frame the problem as one of inference in a pose-aware Markov random field which exploits appearance, figure/ground segmentation, shape and location priors for each garment, as well as similarities between segments and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset and show that we can obtain a significant improvement over the state-of-the-art.
@InProceedings{SimoSerraACCV2014,
   author    = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
   title     = {{A High Performance CRF Model for Clothes Parsing}},
   booktitle = "Proceedings of the Asian Conference on Computer Vision (ACCV)",
   year      = 2014,
}
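The pairwise energy that such a model minimizes can be sketched as follows: each node (e.g., a superpixel) pays a unary cost for its garment label, and each edge pays a pairwise cost for the labels at its two ends. This is a generic pairwise form, not the paper's full set of potentials.

```python
import numpy as np

def crf_energy(labels, unary, pairwise, edges):
    """Energy of a labelling: per-node unary costs plus per-edge
    pairwise costs; inference seeks the minimum-energy labelling."""
    e = sum(unary[i, labels[i]] for i in range(len(labels)))
    e += sum(pairwise[labels[i], labels[j]] for i, j in edges)
    return float(e)
```

Exact or approximate inference (e.g., graph cuts or message passing) would then search over labellings to minimize this energy.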

Source Code

StyleNet, 1.0 (Jun, 2016)
Fashion style in 128 floats
This code is the implementation of the "Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction" paper. It contains the best-performing feature extraction model described in the paper.
Clothes Parsing, 1.0 (Dec, 2014)
Clothes Parsing
This code is the implementation of the "A High Performance CRF Model for Clothes Parsing" paper. It contains all the code needed to learn the model and run inference. The features for the Fashionista dataset must be downloaded separately.

Datasets

Fashion550k
Large-scale weakly labelled fashion dataset for evaluating training with noisy labels.
This extends the previous Fashion144k dataset to have a much larger number of images, and uses the automatic curating approach proposed in our StyleNet paper to improve the quality of the images. To evaluate learning with noisy labels, we provide a selected subset of 66 noisy tags for all the images, and additionally provide a subset of manually curated tags for both training and evaluation.
FashionStyle14
Expert-curated fashion style prediction dataset with a focus on modern Japanese fashion.
We present the FashionStyle14 dataset which focuses on predicting the fashion style of images. The images focus on single individuals with fully visible poses. We provide expert-curated fashion style annotations for a total of 14 unique challenging classes that focus on modern Japanese fashion styles such as Gal, Natural, or Casual.
Fashion144k (Stylenet)
Curated version of the large-scale weakly labelled dataset for learning fashion.
We present an automatically curated version of the Fashion144k dataset. In order to improve the quality of the images, we annotated as positive a small subset of images in which a single individual is roughly centered in the image. We then train a convolutional network to predict whether an image is positive or not, and use this network to automatically curate the rest of the dataset. Although this reduces the number of available images, the resulting images are of much higher quality and do not include product images or heavily distorted ones.
Fashion144k
Large-scale weakly labelled dataset for predicting fashionability of fashion images.
We present the Fashion144k dataset, consisting of 144,169 user posts with images and their associated metadata, for predicting fashionability, that is, how fashionable the user and their outfit are in an image. We exploit the votes given to each post by different users to obtain a measure of fashionability, and provide diverse metadata to perform analysis and predictions.