Fashion Style in 128 Floats

Fashion Style in 128 Floats

In this work we present an approach for learning features from large amounts of weakly-labelled data. Our approach consists training a convolutional neural network with both a ranking and classification loss jointly. We do this by exploiting user-provided metadata of images on the web. We define a rough concept of similarity between images using this metadata, which allows us to define a ranking loss using this similarity. Combining this ranking loss with a standard classification loss, we are able to learn a compact 128 float representation of fashion style using only noisy user provided tags that outperforms standard features. Furthermore, qualitative analysis shows that our model is able to automatically learn nuances in style.

For related work, see our work on parsing clothing in images and predicting fashionability.

Method

Overview of our Approach

We base our approach on combining both a classification network with a feature network that are learnt jointly with a ranking and classification loss. We do this by first defining a similarity metric on the user provided noisy tags. Using this metric we can then roughly determine semantically similar and dissimilar images. Given an anchor or reference image, we then form triplets of images by choosing a very similar and very dissimilar image to the anchor image. This allows us to define a ranking loss in which we encourage the L2 norm of the features from similar images to be small, and the L2 norm of the features from dissimilar images to be large. Although this already gives good performance, by further combining this with a small classification network and a classification loss on the dissimilar image, results can be further improved. In contrast with using features directly from classification networks, our features are optimized as an embedding using L2 norm and thus the Euclidean distance can be used directly to provide t-SNE visualizations and also similarity queries using KD-trees.

Results

t-SNE Visualization

We train our model on the Fashion144k dataset and evaluate our features both qualitatively and quantitatively on the Hipsters Wars dataset1. Above we show a visualization using t-SNE2 on the Pinup class of the Hipster Wars dataset. We can see our approach is able to group different outfits ignoring the background and the wearer. For the full visualization, click here.

For full results and details please consult the full paper.

This research was partially funded by JST CREST.

Publications

2016

Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction
Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction
Edgar Simo-Serra and Hiroshi Ishikawa
Conference in Computer Vision and Pattern Recognition (CVPR), 2016
We propose a novel approach for learning features from weakly-supervised data by joint ranking and classification. In order to exploit data with weak labels, we jointly train a feature extraction network with a ranking loss and a classification network with a cross-entropy loss. We obtain high-quality compact discriminative features with few parameters, learned on relatively small datasets without additional annotations. This enables us to tackle tasks with specialized images not very similar to the more generic ones in existing fully-supervised datasets. We show that the resulting features in combination with a linear classifier surpass the state-of-the-art on the Hipster Wars dataset despite using features only 0.3% of the size. Our proposed features significantly outperform those obtained from networks trained on ImageNet, despite being 32 times smaller (128 single-precision floats), trained on noisy and weakly-labeled data, and using only 1.5% of the number of parameters.
@InProceedings{SimoSerraCVPR2016,
   author    = {Edgar Simo-Serra and Hiroshi Ishikawa},
   title     = {{Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction}},
   booktitle = "Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)",
   year      = 2016,
}

Source Code

StyleNet
StyleNet, 1.0 (Jun, 2016)
Fashion style in 128 floats
This code is the implementation of the "Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction". It contains the best performing feature extraction model explained in the paper.
  1. M. Hadi Kiapour, Kota Yamaguchi, Alexander C. Berg, Tamara L. Berg. Hipster Wars: Discovering Elements of Fashion Styles. In ECCV, 2014. 

  2. L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.