# CVPR 2017 Abstracts Collection
_A collection of CVPR 2017 papers, including titles, links, authors, abstracts and my own comments. Created by Michael Liang, NUDT. All of my work is based on http://www.cvpapers.com/cvpr2017.html
This is a convenient project for reading through CVPR papers quickly. Some information is missing, and I hope we can work together on a better collection._
## Machine Learning 1
Spotlight 1-1A
#### Exclusivity-Consistency Regularized Multi-View Subspace Clustering
Xiaobo Wang, Xiaojie Guo, Zhen Lei, Changqing Zhang, Stan Z. Li
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### *Borrowing Treasures From the Wealthy: Deep Transfer Learning Through Selective Joint Fine-Tuning [PDF](https://arxiv.org/abs/1702.08690)
Weifeng Ge, Yizhou Yu
_**Abstract**_:  _[Deep neural networks require a large amount of labeled training data during supervised learning. However, collecting and labeling so much data might be infeasible in many cases. In this paper, we introduce a source-target selective joint fine-tuning scheme for improving the performance of deep learning tasks with insufficient training data. In this scheme, a target learning task with insufficient training data is carried out simultaneously with another source learning task with abundant training data. However, the source learning task does not use all existing training data. Our core idea is to identify and use a subset of training images from the original source learning task whose low-level characteristics are similar to those from the target learning task, and jointly fine-tune shared convolutional layers for both tasks. Specifically, we compute descriptors from linear or nonlinear filter bank responses on training images from both tasks, and use such descriptors to search for a desired subset of training samples for the source learning task. Experiments demonstrate that our selective joint fine-tuning scheme achieves state-of-the-art performance on multiple visual classification tasks with insufficient training data for deep learning. Such tasks include Caltech 256, MIT Indoor 67, Oxford Flowers 102 and Stanford Dogs 120. In comparison to fine-tuning without a source domain, the proposed method can improve the classification accuracy by 2% - 10% using a single model.]_
_**Comment**_:  _< a source-target selective joint fine-tuning scheme for tasks with insufficient training data; the insufficient and abundant tasks, which share similar low-level features, are trained simultaneously; shared conv layers; helps when labelled data is scarce >_
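
The selection step in the abstract (searching for source images whose descriptors resemble those of the target task) can be pictured as a nearest-neighbour search. The following is only a minimal sketch under stated assumptions: descriptors are plain lists of floats, similarity is Euclidean distance, and every target sample picks a fixed number `k` of neighbours. The helper name `select_source_subset` and the toy numbers are hypothetical and not taken from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_source_subset(target_desc, source_desc, k=2):
    """Return sorted indices of source samples that are among the k nearest
    neighbours of at least one target sample (hypothetical helper)."""
    selected = set()
    for t in target_desc:
        # Rank every source sample by its distance to this target descriptor.
        ranked = sorted(range(len(source_desc)),
                        key=lambda i: euclidean(t, source_desc[i]))
        selected.update(ranked[:k])
    return sorted(selected)

# Toy descriptors (made-up values, for illustration only).
target = [[0.0, 0.0], [1.0, 1.0]]
source = [[0.1, 0.0], [5.0, 5.0], [0.9, 1.1], [-4.0, 3.0], [0.5, 0.5]]
subset = select_source_subset(target, source, k=2)
print(subset)  # indices of the source samples kept for joint fine-tuning
```

Only the selected source samples would then be trained jointly with the target task; the far-away samples (indices 1 and 3 in this toy example) are left out.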
#### **The More You Know: Using Knowledge Graphs for Image Classification [PDF](https://arxiv.org/abs/1612.04844)
Kenneth Marino, Ruslan Salakhutdinov, Abhinav Gupta
_**Abstract**_:  _[One characteristic that sets humans apart from modern learning-based computer vision algorithms is the ability to acquire knowledge about the world and use that knowledge to reason about the visual world. Humans can learn about the characteristics of objects and the relationships that occur between them to learn a large variety of visual concepts, often with few examples. This paper investigates the use of structured prior knowledge in the form of knowledge graphs and shows that using this knowledge improves performance on image classification. We build on recent work on end-to-end learning on graphs, introducing the Graph Search Neural Network as a way of efficiently incorporating large knowledge graphs into a vision classification pipeline. We show in a number of experiments that our method outperforms standard neural network baselines for multi-label classification.]_
_**Comment**_:  _< the real way to AI >_
#### Dynamic Edge-Conditioned Filters in Convolutional Neural Networks on Graphs [PDF](https://arxiv.org/abs/1704.02901)
Martin Simonovsky, Nikos Komodakis
_**Abstract**_:  _[A number of problems can be formulated as prediction on graph-structured data. In this work, we generalize the convolution operator from regular grids to arbitrary graphs while avoiding the spectral domain, which allows us to handle graphs of varying size and connectivity. To move beyond a simple diffusion, filter weights are conditioned on the specific edge labels in the neighborhood of a vertex. Together with the proper choice of graph coarsening, we explore constructing deep neural networks for graph classification. In particular, we demonstrate the generality of our formulation in point cloud classification, where we set the new state of the art, and on a graph classification dataset, where we outperform other deep learning approaches.]_
_**Comment**_:  _< graph-structured data >_
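
To make "filter weights are conditioned on the specific edge labels" concrete, here is a minimal sketch of one such layer. It assumes that each vertex averages the transformed features of its incoming neighbours, and that `weight_fn` stands in for a learned network mapping an edge label to a weight matrix. Both choices, and all names below, are assumptions for illustration rather than the authors' implementation.

```python
def edge_conditioned_conv(features, edges, weight_fn, bias):
    """features: {vertex: [float, ...]}
    edges: list of (src, dst, label) tuples
    weight_fn(label): returns an out_dim x in_dim matrix as a list of rows
    Each vertex output is the mean over incoming edges of W(label) @ x_src,
    plus the bias. Vertices without incoming edges receive only the bias."""
    out_dim = len(bias)
    sums = {v: [0.0] * out_dim for v in features}
    counts = {v: 0 for v in features}
    for src, dst, label in edges:
        w = weight_fn(label)          # weights depend on the edge label
        x = features[src]
        for r in range(out_dim):
            sums[dst][r] += sum(w[r][c] * x[c] for c in range(len(x)))
        counts[dst] += 1
    return {v: [s / max(counts[v], 1) + b for s, b in zip(sums[v], bias)]
            for v in features}

# Toy graph: three vertices, scalar edge labels (e.g. an edge length).
features = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0]}
edges = [(1, 0, 0.5), (2, 0, 2.0), (0, 1, 1.0)]

def weight_fn(label):
    # Hypothetical stand-in for a learned filter-generating network.
    return [[label, 0.0], [0.0, 1.0]]

out = edge_conditioned_conv(features, edges, weight_fn, bias=[0.0, 0.0])
print(out)
```

Because the weights are generated per edge rather than stored per grid offset, the same layer applies to graphs of any size and connectivity.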
#### *Convolutional Neural Network Architecture for Geometric Matching [PDF](https://arxiv.org/abs/1703.05593)
Ignacio Rocco, Relja Arandjelović, Josef Sivic
_**Abstract**_:  _[We address the problem of determining correspondences between two images in agreement with a geometric model such as an affine or thin-plate spline transformation, and estimating its parameters. The contributions of this work are three-fold. First, we propose a convolutional neural network architecture for geometric matching. The architecture is based on three main components that mimic the standard steps of feature extraction, matching and simultaneous inlier detection and model parameter estimation, while being trainable end-to-end. Second, we demonstrate that the network parameters can be trained from synthetically generated imagery without the need for manual annotation and that our matching layer significantly increases generalization capabilities to never seen before images. Finally, we show that the same model can perform both instance-level and category-level matching giving state-of-the-art results on the challenging Proposal Flow dataset.]_
_**Comment**_:  _< geometric matching; synthetically generated images; instance-level & category-level; matching layer >_
#### Deep Affordance-Grounded Sensorimotor Object Recognition [PDF](https://arxiv.org/abs/1704.02787)
Spyridon Thermos, Georgios Th. Papadopoulos, Petros Daras, Gerasimos Potamianos
_**Abstract**_:  _[It is well-established by cognitive neuroscience that human perception of objects constitutes a complex process, where object appearance information is combined with evidence about the so-called object "affordances", namely the types of actions that humans typically perform when interacting with them. This fact has recently motivated the "sensorimotor" approach to the challenging task of automatic object recognition, where both information sources are fused to improve robustness. In this work, the aforementioned paradigm is adopted, surpassing current limitations of sensorimotor object recognition research. Specifically, the deep learning paradigm is introduced to the problem for the first time, developing a number of novel neuro-biologically and neuro-physiologically inspired architectures that utilize state-of-the-art neural networks for fusing the available information sources in multiple ways. The proposed methods are evaluated using a large RGB-D corpus, which is specifically collected for the task of sensorimotor object recognition and is made publicly available. Experimental results demonstrate the utility of affordance information to object recognition, achieving an up to 29% relative error reduction by its inclusion.]_
_**Comment**_:  _< Sensorimotor Object Recognition >_
#### Discovering Causal Signals in Images [PDF](https://arxiv.org/abs/1605.08179)
David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Schölkopf, Léon Bottou
_**Abstract**_:  _[The purpose of this paper is to point out and assay observable causal signals within collections of static images. We achieve this goal in two steps. First, we take a learning approach to observational causal inference, and build a classifier that achieves state-of-the-art performance on finding the causal direction between pairs of random variables, when given samples from their joint distribution. Second, we use our causal direction finder to effectively distinguish between features of objects and features of their contexts in collections of static images. Our experiments demonstrate the existence of (1) a relation between the direction of causality and the difference between objects and their contexts, and (2) observable causal signals in collections of static images.]_
_**Comment**_:  _< causal signals >_
#### *On Compressing Deep Models by Low Rank and Sparse Decomposition
Xiyu Yu, Tongliang Liu, Xinchao Wang, Dacheng Tao
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
Oral 1-1A
#### PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation [PDF](https://arxiv.org/abs/1612.00593)
Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas
_**Abstract**_:  _[Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds and well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.]_
_**Comment**_:  _< point cloud, 3D>_
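
One simple way to obtain the permutation invariance mentioned in the abstract is to apply the same function to every point and then combine the results with a symmetric operation such as a channel-wise maximum. The sketch below assumes exactly that, with a single shared linear layer followed by ReLU and random, untrained weights. It illustrates the idea only and is not the PointNet architecture itself.

```python
import random

def point_feature(point, weights):
    """Shared per-point map: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, point))) for row in weights]

def global_feature(points, weights):
    """Channel-wise max over all per-point features.
    max() does not depend on the order of its inputs, so neither does the result."""
    per_point = [point_feature(p, weights) for p in points]
    return [max(channel) for channel in zip(*per_point)]

rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 3D -> 4 channels
cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(16)]   # 16 random points

shuffled = cloud[:]
rng.shuffle(shuffled)

# The global feature is identical for both orderings of the same point set.
print(global_feature(cloud, weights) == global_feature(shuffled, weights))
```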
#### ***Universal Adversarial Perturbations [PDF](https://arxiv.org/abs/1610.08401)
Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, Pascal Frossard
_**Abstract**_:  _[Given a state-of-the-art deep neural network classifier, we show the existence of a universal (image-agnostic) and very small perturbation vector that causes natural images to be misclassified with high probability. We propose a systematic algorithm for computing universal perturbations, and show that state-of-the-art deep neural networks are highly vulnerable to such perturbations, albeit being quasi-imperceptible to the human eye. We further empirically analyze these universal perturbations and show, in particular, that they generalize very well across neural networks. The surprising existence of universal perturbations reveals important geometric correlations among the high-dimensional decision boundary of classifiers. It further outlines potential security breaches with the existence of single directions in the input space that adversaries can possibly exploit to break a classifier on most natural images.]_
_**Comment**_:  _< using quasi-imperceptible perturbations to break a classifier, destroy DL >_
#### *Unsupervised Pixel-Level Domain Adaptation With Generative Adversarial Networks [PDF](https://arxiv.org/abs/1612.05424)
Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, Dilip Krishnan
_**Abstract**_:  _[Collecting well-annotated image datasets to train modern machine learning algorithms is prohibitively expensive for many tasks. One appealing alternative is rendering synthetic data where ground-truth annotations are generated automatically. Unfortunately, models trained purely on rendered images often fail to generalize to real images. To address this shortcoming, prior work introduced unsupervised domain adaptation algorithms that attempt to map representations between the two domains or learn to extract features that are domain-invariant. In this work, we present a new approach that learns, in an unsupervised manner, a transformation in the pixel space from one domain to the other. Our generative adversarial network (GAN)-based method adapts source-domain images to appear as if drawn from the target domain. Our approach not only produces plausible samples, but also outperforms the state-of-the-art on a number of unsupervised domain adaptation scenarios by large margins. Finally, we demonstrate that the adaptation process generalizes to object classes unseen during training.]_
_**Comment**_:  _< generate domain-invariant annotations; adapts source-domain images to appear as if drawn from the target domain; GAN >_
#### Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network [PDF](https://arxiv.org/pdf/1609.04802.pdf) [code](https://github.com/leehomyc/Photo-Realistic-Super-Resoluton)
Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi
_**Abstract**_:  _[Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.]_
_**Comment**_:  _< Super-Resolution, GAN>_
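
The abstract contrasts pixel-wise mean squared error, and the peak signal-to-noise ratio (PSNR) derived from it, with perceptual quality. As a reminder of what the pixel-wise metric measures, here is a minimal sketch of the standard definition PSNR = 10 · log10(MAX² / MSE). Images are assumed to be flat lists of 8-bit pixel values, and the sample values are made up.

```python
import math

def mse(reference, estimate):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(reference, estimate)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

# Made-up pixel values, for illustration only.
reference = [0, 255, 128, 64]
estimate = [0, 250, 130, 60]
print(mse(reference, estimate))
print(round(psnr(reference, estimate), 2))
```

A reconstruction that averages away fine texture can still score a high PSNR, which is why the paper argues for a perceptual loss instead.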
## 3D Vision 1
Spotlight 1-1B
#### Context-Aware Captions From Context-Agnostic Supervision [PDF](https://arxiv.org/abs/1701.02870)
Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, Gal Chechik
_**Abstract**_:  _[We introduce an inference technique to produce discriminative context-aware image captions (captions that describe differences between images or visual concepts) using only generic context-agnostic training data (captions that describe a concept or an image in isolation). For example, given images and captions of "siamese cat" and "tiger cat", we generate language that describes the "siamese cat" in a way that distinguishes it from "tiger cat". Our key novelty is that we show how to do joint inference over a language model that is context-agnostic and a listener which distinguishes closely-related concepts. We first apply our technique to a justification task, namely to describe why an image contains a particular fine-grained category as opposed to another closely-related category of the CUB-200-2011 dataset. We then study discriminative image captioning to generate language that uniquely refers to one of two semantically-similar images in the COCO dataset. Evaluations with discriminative ground truth for justification and human studies for discriminative image captioning reveal that our approach outperforms baseline generative and speaker-listener approaches for discrimination. ]_
_**Comment**_:  _< discriminative image captions >_
#### Global Hypothesis Generation for 6D Object Pose Estimation [PDF](https://arxiv.org/abs/1612.02287)
Frank Michel, Alexander Kirillov, Eric Brachmann, Alexander Krull, Stefan Gumhold, Bogdan Savchynskyy, Carsten Rother
_**Abstract**_:  _[This paper addresses the task of estimating the 6D pose of a known 3D object from a single RGB-D image. Most modern approaches solve this task in three steps: i) Compute local features; ii) Generate a pool of pose-hypotheses; iii) Select and refine a pose from the pool. This work focuses on the second step. While all existing approaches generate the hypotheses pool via local reasoning, e.g. RANSAC or Hough-voting, we are the first to show that global reasoning is beneficial at this stage. In particular, we formulate a novel fully-connected Conditional Random Field (CRF) that outputs a very small number of pose-hypotheses. Despite the potential functions of the CRF being non-Gaussian, we give a new and efficient two-step optimization procedure, with some guarantees for optimality. We utilize our global hypotheses generation procedure to produce results that exceed state-of-the-art for the challenging "Occluded Object Dataset".]_
_**Comment**_:  _< pose estimation>_
#### A Practical Method for Fully Automatic Intrinsic Camera Calibration Using Directionally Encoded Light
Mahdi Abbaspour Tehrani, Thabo Beeler, Anselm Grundhöfer
_**Abstract**_:  _[]_
_**Comment**_:  _< camera calibration >_
#### CATS: A Color and Thermal Stereo Benchmark
Wayne Treible, Philip Saponaro, Scott Sorensen, Abhishek Kolagunda, Michael O'Neal, Brian Phelan, Kelly Sherbondy, Chandra Kambhamettu
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Elastic Shape-From-Template With Spatially Sparse Deforming Forces
Abed Malti, Cédric Herzet
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Distinguishing the Indistinguishable: Exploring Structural Ambiguities via Geodesic Context
Qingan Yan, Long Yang, Ling Zhang, Chunxia Xiao
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Multi-Scale Continuous CRFs as Sequential Deep Networks for Monocular Depth Estimation [PDF](https://arxiv.org/abs/1704.02157)
Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, Nicu Sebe
_**Abstract**_:  _[This paper addresses the problem of depth estimation from a single still image. Inspired by recent works on multi-scale convolutional neural networks (CNN), we propose a deep model which fuses complementary information derived from multiple CNN side outputs. Different from previous methods, the integration is obtained by means of continuous Conditional Random Fields (CRFs). In particular, we propose two different variations, one based on a cascade of multiple CRFs, the other on a unified graphical model. By designing a novel CNN implementation of mean-field updates for continuous CRFs, we show that both proposed models can be regarded as sequential deep networks and that training can be performed end-to-end. Through extensive experimental evaluation we demonstrate the effectiveness of the proposed approach and establish new state of the art results on publicly available datasets.]_
_**Comment**_:  _< depth estimation from a single still image; using multi-scale CNN's side output, integrated by continuous Conditional Random Fields (CRFs).>_
#### Dynamic Time-Of-Flight
Michael Schober, Amit Adam, Omer Yair, Shai Mazor, Sebastian Nowozin
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
Oral 1-1B
#### Semantic Scene Completion From a Single Depth Image [PDF](https://arxiv.org/abs/1611.08974)
Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, Thomas Funkhouser
_**Abstract**_:  _[This paper focuses on semantic scene completion, a task for producing a complete 3D voxel representation of volumetric occupancy and semantic labels for a scene from a single-view depth map observation. Previous work has considered scene completion and semantic labeling of depth maps separately. However, we observe that these two problems are tightly intertwined. To leverage the coupled nature of these two tasks, we introduce the semantic scene completion network (SSCNet), an end-to-end 3D convolutional network that takes a single depth image as input and simultaneously outputs occupancy and semantic labels for all voxels in the camera view frustum. Our network uses a dilation-based 3D context module to efficiently expand the receptive field and enable 3D context learning. To train our network, we construct SUNCG - a manually created large-scale dataset of synthetic 3D scenes with dense volumetric annotations. Our experiments demonstrate that the joint model outperforms methods addressing each task in isolation and outperforms alternative approaches on the semantic scene completion task. ]_
_**Comment**_:  _< >_
#### 3DMatch: Learning Local Geometric Descriptors From RGB-D Reconstructions [PDF](https://arxiv.org/abs/1603.08182) [project](http://3dmatch.cs.princeton.edu/)
Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, Thomas Funkhouser
_**Abstract**_:  _[Matching local geometric features on real-world depth images is a challenging task due to the noisy, low-resolution, and incomplete nature of 3D scan data. These difficulties limit the performance of current state-of-art methods, which are typically based on histograms over geometric properties. In this paper, we present 3DMatch, a data-driven model that learns a local volumetric patch descriptor for establishing correspondences between partial 3D data. To amass training data for our model, we propose a self-supervised feature learning method that leverages the millions of correspondence labels found in existing RGB-D reconstructions. Experiments show that our descriptor is not only able to match local geometry in new scenes for reconstruction, but also generalize to different tasks and spatial scales (e.g. instance-level object model alignment for the Amazon Picking Challenge, and mesh surface correspondence). Results show that 3DMatch consistently outperforms other state-of-the-art approaches by a significant margin. Code, data, benchmarks, and pre-trained models are available online.]_
_**Comment**_:  _< >_
#### Multi-View Supervision for Single-View Reconstruction via Differentiable Ray Consistency [PDF](https://arxiv.org/pdf/1704.06254.pdf) [project and code](https://shubhtuls.github.io/drc/)
Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, Jitendra Malik
_**Abstract**_:  _[We study the notion of consistency between a 3D shape and a 2D observation and propose a differentiable formulation which allows computing gradients of the 3D shape given an observation from an arbitrary view. We do so by reformulating view consistency using a differentiable ray consistency (DRC) term. We show that this formulation can be incorporated in a learning framework to leverage different types of multi-view observations e.g. foreground masks, depth, color images, semantics etc. as supervision for learning single-view 3D prediction. We present empirical analysis of our technique in a controlled setting. We also show that this approach allows us to improve over existing techniques for single-view reconstruction of objects from the PASCAL VOC dataset. ]_
_**Comment**_:  _< >_
#### On-The-Fly Adaptation of Regression Forests for Online Camera Relocalisation
Tommaso Cavallari, Stuart Golodetz, Nicholas A. Lord, Julien Valentin, Luigi Di Stefano, Philip H. S. Torr
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
## Low- & Mid-Level Vision
Spotlight 1-1C
#### Designing Effective Inter-Pixel Information Flow for Natural Image Matting [PDF](http://people.inf.ethz.ch/aksoyy/papers/CVPR17-ifm.pdf)
Yağiz Aksoy, Tunç Ozan Aydin, Marc Pollefeys
_**Abstract**_:  _[]_
_**Comment**_:  _< Natural Image Matting>_
#### Deep Video Deblurring for Hand-Held Cameras [PDF](https://arxiv.org/abs/1611.08387)
Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, Oliver Wang
_**Abstract**_:  _[Motion blur from camera shake is a major problem in videos captured by hand-held devices. Unlike single-image deblurring, video-based approaches can take advantage of the abundant information that exists across neighboring frames. As a result the best performing methods rely on aligning nearby frames. However, aligning images is a computationally expensive and fragile procedure, and methods that aggregate information must therefore be able to identify which regions have been accurately aligned and which have not, a task which requires high level scene understanding. In this work, we introduce a deep learning solution to video deblurring, where a CNN is trained end-to-end to learn how to accumulate information across frames. To train this network, we collected a dataset of real videos recorded with a high framerate camera, which we use to generate synthetic motion blur for supervision. We show that the features learned from this dataset extend to deblurring motion blur that arises due to camera shake in a wide range of videos, and compare the quality of results to a number of other baselines. ]_
_**Comment**_:  _< Motion blur from camera shake, CNN>_
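
The abstract says that high-framerate footage is used to generate synthetic motion blur for supervision. One plausible way to do this, assumed here rather than confirmed by the abstract, is to average a short window of consecutive sharp frames and keep the middle frame as the ground truth. The helper names below are hypothetical.

```python
def average_frames(frames):
    """Pixel-wise mean of a list of equally sized grayscale frames
    (each frame is a list of rows of numbers)."""
    height, width = len(frames[0]), len(frames[0][0])
    n = len(frames)
    return [[sum(f[r][c] for f in frames) / n for c in range(width)]
            for r in range(height)]

def make_training_pairs(frames, window=3):
    """Split a clip into non-overlapping windows and return
    (blurred, sharp) pairs, where sharp is the window's middle frame."""
    pairs = []
    for start in range(0, len(frames) - window + 1, window):
        chunk = frames[start:start + window]
        pairs.append((average_frames(chunk), chunk[window // 2]))
    return pairs

# A 1x4 toy clip: one bright pixel moving to the right.
clip = [[[1, 0, 0, 0]], [[0, 1, 0, 0]], [[0, 0, 1, 0]]]
pairs = make_training_pairs(clip, window=3)
blurred, sharp = pairs[0]
print(blurred)  # the bright pixel is smeared across three positions
print(sharp)    # the middle frame serves as the sharp target
```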
#### Instance-Level Salient Object Segmentation [PDF](https://arxiv.org/abs/1704.03604)
Guanbin Li, Yuan Xie, Liang Lin, Yizhou Yu
_**Abstract**_:  _[Image saliency detection has recently witnessed rapid progress due to deep convolutional neural networks. However, none of the existing methods is able to identify object instances in the detected salient regions. In this paper, we present a salient instance segmentation method that produces a saliency mask with distinct object instance labels for an input image. Our method consists of three steps, estimating saliency map, detecting salient object contours and identifying salient object instances. For the first two steps, we propose a multiscale saliency refinement network, which generates high-quality salient region masks and salient object contours. Once integrated with multiscale combinatorial grouping and a MAP-based subset optimization framework, our method can generate very promising salient object instance segmentation results. To promote further research and evaluation of salient instance segmentation, we also construct a new database of 1000 images and their pixelwise salient instance annotations. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks for salient region detection as well as on our new dataset for salient instance segmentation. ]_
_**Comment**_:  _< estimating saliency map, detecting salient object contours and identifying salient object instances>_
#### Deep Multi-Scale Convolutional Neural Network for Dynamic Scene Deblurring [PDF](https://arxiv.org/abs/1612.02177)
Seungjun Nah, Tae Hyun Kim, Kyoung Mu Lee
_**Abstract**_:  _[Non-uniform blind deblurring for general dynamic scenes is a challenging computer vision problem since blurs are caused by camera shake, scene depth as well as multiple object motions. To remove these complicated motion blurs, conventional energy optimization based methods rely on simple assumptions such that blur kernel is partially uniform or locally linear. Moreover, recent machine learning based methods also depend on synthetic blur datasets generated under these assumptions. This makes conventional deblurring methods fail to remove blurs where blur kernel is difficult to approximate or parameterize (e.g. object motion boundaries). In this work, we propose a multi-scale convolutional neural network that restores blurred images caused by various sources in an end-to-end manner. Furthermore, we present multi-scale loss function that mimics conventional coarse-to-fine approaches. Moreover, we propose a new large scale dataset that provides pairs of realistic blurry image and the corresponding ground truth sharp image that are obtained by a high-speed camera. With the proposed model trained on this dataset, we demonstrate empirically that our method achieves the state-of-the-art performance in dynamic scene deblurring not only qualitatively, but also quantitatively. ]_
_**Comment**_:  _< deblurring>_
#### *Diversified Texture Synthesis With Feed-Forward Networks [PDF](https://arxiv.org/abs/1703.01664)
Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, Ming-Hsuan Yang
_**Abstract**_:  _[Recent progresses on deep discriminative and generative modeling have shown promising results on texture synthesis. However, existing feed-forward based methods trade off generality for efficiency, which suffer from many issues, such as shortage of generality (i.e., build one network per texture), lack of diversity (i.e., always produce visually identical output) and suboptimality (i.e., generate less satisfying visual effects). In this work, we focus on solving these issues for improved texture synthesis. We propose a deep generative feed-forward network which enables efficient synthesis of multiple textures within one single network and meaningful interpolation between them. Meanwhile, a suite of important techniques are introduced to achieve better convergence and diversity. With extensive experiments, we demonstrate the effectiveness of the proposed model and techniques for synthesizing a large number of textures and show its applications with the stylization. ]_
_**Comment**_:  _< Texture Synthesis using one network>_
#### Radiometric Calibration for Internet Photo Collections [PDF](http://alumni.media.mit.edu/~shiboxin/files/Mo_CVPR17.pdf)
Zhipeng Mo, Boxin Shi, Sai-Kit Yeung, Yasuyuki Matsushita
_**Abstract**_:  _[]_
_**Comment**_:  _< radiometric calibration >_
#### Deeply Aggregated Alternating Minimization for Image Restoration [PDF](https://arxiv.org/abs/1612.06508)
Youngjung Kim, Hyungjoo Jung, Dongbo Min, Kwanghoon Sohn
_**Abstract**_:  _[Regularization-based image restoration has remained an active research topic in computer vision and image processing. It often leverages a guidance signal captured in different fields as an additional cue. In this work, we present a general framework for image restoration, called deeply aggregated alternating minimization (DeepAM). We propose to train deep neural network to advance two of the steps in the conventional AM algorithm: proximal mapping and β-continuation. Both steps are learned from a large dataset in an end-to-end manner. The proposed framework enables the convolutional neural networks (CNNs) to operate as a prior or regularizer in the AM algorithm. We show that our learned regularizer via deep aggregation outperforms the recent data-driven approaches as well as the nonlocal-based methods. The flexibility and effectiveness of our framework are demonstrated in several image restoration tasks, including single image denoising, RGB-NIR restoration, and depth super-resolution.]_
_**Comment**_:  _< image restoration using deep learning>_
#### End-To-End Instance Segmentation With Recurrent Attention [PDF](https://arxiv.org/abs/1605.09410)
Mengye Ren, Richard S. Zemel
_**Abstract**_:  _[While convolutional neural networks have gained impressive success recently in solving structured prediction problems such as semantic segmentation, it remains a challenge to differentiate individual object instances in the scene. Instance segmentation is very important in a variety of applications, such as autonomous driving, image captioning, and visual question answering. Techniques that combine large graphical models with low-level vision have been proposed to address this problem; however, we propose an end-to-end recurrent neural network (RNN) architecture with an attention mechanism to model a human-like counting process, and produce detailed instance segmentations. The network is jointly trained to sequentially produce regions of interest as well as a dominant object segmentation within each region. The proposed model achieves competitive results on the CVPPP, KITTI, and Cityscapes datasets. ]_
_**Comment**_:  _< instance segmentation using RNN with attention mechanism>_
Oral 1-1C
#### *SRN: Side-output Residual Network for Object Symmetry Detection in the Wild [PDF](https://arxiv.org/abs/1703.02243)
Wei Ke, Jie Chen, Jianbin Jiao, Guoying Zhao, Qixiang Ye
_**Abstract**_:  _[In this paper, we establish a baseline for object symmetry detection in complex backgrounds by presenting a new benchmark and an end-to-end deep learning approach, opening up a promising direction for symmetry detection in the wild. The new benchmark, named Sym-PASCAL, spans challenges including object diversity, multi-objects, part-invisibility, and various complex backgrounds that are far beyond those in existing datasets. The proposed symmetry detection approach, named Side-output Residual Network (SRN), leverages output Residual Units (RUs) to fit the errors between the object symmetry groundtruth and the outputs of RUs. By stacking RUs in a deep-to-shallow manner, SRN exploits the 'flow' of errors among multiple scales to ease the problems of fitting complex outputs with limited layers, suppressing the complex backgrounds, and effectively matching object symmetry of different scales. Experimental results validate both the benchmark and its challenging aspects related to real-world images, and the state-of-the-art performance of our symmetry detection approach. The benchmark and the code for SRN are publicly available at this https URL]_
_**Comment**_:  _< object symmetry detection by side-output residual net, new benchmark>_
#### Deep Image Matting [PDF](https://arxiv.org/abs/1703.03872)
Ning Xu, Brian Price, Scott Cohen, Thomas Huang
_**Abstract**_:  _[Image matting is a fundamental computer vision problem and has many applications. Previous algorithms have poor performance when an image has similar foreground and background colors or complicated textures. The main reasons are prior methods 1) only use low-level features and 2) lack high-level context. In this paper, we propose a novel deep learning based algorithm that can tackle both these problems. Our deep model has two parts. The first part is a deep convolutional encoder-decoder network that takes an image and the corresponding trimap as inputs and predicts the alpha matte of the image. The second part is a small convolutional network that refines the alpha matte predictions of the first network to have more accurate alpha values and sharper edges. In addition, we also create a large-scale image matting dataset including 49300 training images and 1000 testing images. We evaluate our algorithm on the image matting benchmark, our testing set, and a wide variety of real images. Experimental results clearly demonstrate the superiority of our algorithm over previous methods. ]_
_**Comment**_:  _< Image Matting>_
#### Wetness and Color From a Single Multispectral Image
Mihoko Shimano, Hiroki Okawa, Yuta Asano, Ryoma Bise, Ko Nishino, Imari Sato
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### FC4: Fully Convolutional Color Constancy With Confidence-Weighted Pooling
Yuanming Hu, Baoyuan Wang, Stephen Lin
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
# Poster 1-1
## 3D Computer Vision
Face Normals “In-The-Wild” Using Fully Convolutional Networks
George Trigeorgis, Patrick Snape, Iasonas Kokkinos, Stefanos Zafeiriou
A Non-Convex Variational Approach to Photometric Stereo Under Inaccurate Lighting
Yvain Quéau, Tao Wu, François Lauze, Jean-Denis Durou, Daniel Cremers
A Linear Extrinsic Calibration of Kaleidoscopic Imaging System From Single 3D Point
Kosuke Takahashi, Akihiro Miyata, Shohei Nobuhara, Takashi Matsuyama
Polarimetric Multi-View Stereo
Zhaopeng Cui, Jinwei Gu, Boxin Shi, Ping Tan, Jan Kautz
An Exact Penalty Method for Locally Convergent Maximum Consensus
Huu Le, Tat-Jun Chin, David Suter
Deep Supervision With Shape Concepts for Occlusion-Aware 3D Object Parsing
Chi Li, M. Zeeshan Zia, Quoc-Huy Tran, Xiang Yu, Gregory D. Hager, Manmohan Chandraker
Amodal Detection of 3D Objects: Inferring 3D Bounding Boxes From 2D Ones in RGB-Depth Images
Zhuo Deng, Longin Jan Latecki
## Analyzing Humans in Images
Transition Forests: Learning Discriminative Temporal Transitions for Action Recognition and Detection
Guillermo Garcia-Hernando, Tae-Kyun Kim
Scene Flow to Action Map: A New Representation for RGB-D Based Action Recognition With Convolutional Neural Networks
Pichao Wang, Wanqing Li, Zhimin Gao, Yuyao Zhang, Chang Tang, Philip Ogunbona
Detecting Masked Faces in the Wild With LLE-CNNs
Shiming Ge, Jia Li, Qiting Ye, Zhao Luo
A Domain Based Approach to Social Relation Recognition
Qianru Sun, Bernt Schiele, Mario Fritz
Spatio-Temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) for Skeleton-Based Action Recognition
Junwu Weng, Chaoqun Weng, Junsong Yuan
Personalizing Gesture Recognition Using Hierarchical Bayesian Neural Networks
Ajjen Joshi, Soumya Ghosh, Margrit Betke, Stan Sclaroff, Hanspeter Pfister
## Applications
Real-Time 3D Model Tracking in Color and Depth on a Single CPU Core
Wadim Kehl, Federico Tombari, Slobodan Ilic, Nassir Navab
Multi-Scale FCN With Cascaded Instance Aware Segmentation for Arbitrary Oriented Word Spotting in the Wild
Dafang He, Xiao Yang, Chen Liang, Zihan Zhou, Alexander G. Ororbia II, Daniel Kifer, C. Lee Giles
Viraliency: Pooling Local Virality
Xavier Alameda-Pineda, Andrea Pilzer, Dan Xu, Nicu Sebe, Elisa Ricci
## Biomedical Image/Video Analysis
A Non-Local Low-Rank Framework for Ultrasound Speckle Reduction
Lei Zhu, Chi-Wing Fu, Michael S. Brown, Pheng-Ann Heng
## Image Motion & Tracking
Video Acceleration Magnification
Yichao Zhang, Silvia L. Pintea, Jan C. van Gemert
Superpixel-Based Tracking-By-Segmentation Using Markov Chains
Donghun Yeo, Jeany Son, Bohyung Han, Joon Hee Han
BranchOut: Regularization for Online Ensemble Tracking With Convolutional Neural Networks
Bohyung Han, Jack Sim, Hartwig Adam
Learning Motion Patterns in Videos
Pavel Tokmakov, Karteek Alahari, Cordelia Schmid
## Low- & Mid-Level Vision
Deep Level Sets for Salient Object Detection
Ping Hu, Bing Shuai, Jun Liu, Gang Wang
Binary Constraint Preserving Graph Matching
Bo Jiang, Jin Tang, Chris Ding, Bin Luo
From Local to Global: Edge Profiles to Camera Motion in Blurred Images
Subeesh Vasu, A. N. Rajagopalan
What Is the Space of Attenuation Coefficients in Underwater Computer Vision?
Derya Akkaynak, Tali Treibitz, Tom Shlesinger, Yossi Loya, Raz Tamir, David Iluz
Robust Energy Minimization for BRDF-Invariant Shape From Light Fields
Zhengqin Li, Zexiang Xu, Ravi Ramamoorthi, Manmohan Chandraker
Boundary-Aware Instance Segmentation
Zeeshan Hayder, Xuming He, Mathieu Salzmann
Spatially-Varying Blur Detection Based on Multiscale Fused and Sorted Transform Coefficients of Gradient Magnitudes
S. Alireza Golestaneh, Lina J. Karam
Model-Based Iterative Restoration for Binary Document Image Compression With Dictionary Learning
Yandong Guo, Cheng Lu, Jan P. Allebach, Charles A. Bouman
FCSS: Fully Convolutional Self-Similarity for Dense Semantic Correspondence
Seungryong Kim, Dongbo Min, Bumsub Ham, Sangryul Jeon, Stephen Lin, Kwanghoon Sohn
## Machine Learning
Learning by Association — A Versatile Semi-Supervised Training Method for Neural Networks
Philip Haeusser, Alexander Mordvintsev, Daniel Cremers
Dilated Residual Networks
Fisher Yu, Vladlen Koltun, Thomas Funkhouser
Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction
Richard Zhang, Phillip Isola, Alexei A. Efros
Nonnegative Matrix Underapproximation for Robust Multiple Model Fitting
Mariano Tepper, Guillermo Sapiro
Truncated Max-Of-Convex Models
Pankaj Pansari, M. Pawan Kumar
Additive Component Analysis
Calvin Murdock, Fernando De la Torre
Subspace Clustering via Variance Regularized Ridge Regression
Chong Peng, Zhao Kang, Qiang Cheng
The Incremental Multiresolution Matrix Factorization Algorithm
Vamsi K. Ithapu, Risi Kondor, Sterling C. Johnson, Vikas Singh
Transformation-Grounded Image Generation Network for Novel 3D View Synthesis
Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, Alexander C. Berg
Learning Dynamic Guidance for Depth Image Enhancement
Shuhang Gu, Wangmeng Zuo, Shi Guo, Yunjin Chen, Chongyu Chen, Lei Zhang
A-Lamp: Adaptive Layout-Aware Multi-Patch Deep Convolutional Neural Network for Photo Aesthetic Assessment
Shuang Ma, Jing Liu, Chang Wen Chen
Teaching Compositionality to CNNs
Austin Stone, Huayan Wang, Michael Stark, Yi Liu, D. Scott Phoenix, Dileep George
Using Ranking-CNN for Age Estimation
Shixing Chen, Caojin Zhang, Ming Dong, Jialiang Le, Mike Rao
Accurate Single Stage Detector Using Recurrent Rolling Convolution
Jimmy Ren, Xiaohao Chen, Jianbo Liu, Wenxiu Sun, Jiahao Pang, Qiong Yan, Yu-Wing Tai, Li Xu
A Compact DNN: Approaching GoogLeNet-Level Accuracy of Classification and Domain Adaptation
Chunpeng Wu, Wei Wen, Tariq Afzal, Yongmei Zhang, Yiran Chen, Hai (Helen) Li
The Impact of Typicality for Informative Representative Selection
Jawadul H. Bappy, Sujoy Paul, Ertem Tuncel, Amit K. Roy-Chowdhury
Infinite Variational Autoencoder for Semi-Supervised Learning
M. Ehsan Abbasnejad, Anthony Dick, Anton van den Hengel
SurfNet: Generating 3D Shape Surfaces Using Deep Residual Networks
Ayan Sinha, Asim Unmesh, Qixing Huang, Karthik Ramani
Intrinsic Grassmann Averages for Online Linear and Robust Subspace Learning
Rudrasis Chakraborty, Søren Hauberg, Baba C. Vemuri
Variational Bayesian Multiple Instance Learning With Gaussian Processes
Manuel Haußmann, Fred A. Hamprecht, Melih Kandemir
Temporal Attention-Gated Model for Robust Sequence Classification
Wenjie Pei, Tadas Baltrušaitis, David M.J. Tax, Louis-Philippe Morency
Non-Uniform Subset Selection for Active Learning in Structured Data
Sujoy Paul, Jawadul H. Bappy, Amit K. Roy-Chowdhury
Colorization as a Proxy Task for Visual Understanding
Gustav Larsson, Michael Maire, Gregory Shakhnarovich
Shading Annotations in the Wild
Balazs Kovacs, Sean Bell, Noah Snavely, Kavita Bala
LCNN: Lookup-Based Convolutional Neural Network
Hessam Bagherinezhad, Mohammad Rastegari, Ali Farhadi
## Object Recognition & Scene Understanding
Physics Inspired Optimization on Semantic Transfer Features: An Alternative Method for Room Layout Estimation
Hao Zhao, Ming Lu, Anbang Yao, Yiwen Guo, Yurong Chen, Li Zhang
Pixelwise Instance Segmentation With a Dynamically Instantiated Network
Anurag Arnab, Philip H. S. Torr
Object Detection in Videos With Tubelet Proposal Networks
Kai Kang, Hongsheng Li, Tong Xiao, Wanli Ouyang, Junjie Yan, Xihui Liu, Xiaogang Wang
AMVH: Asymmetric Multi-Valued Hashing
Cheng Da, Shibiao Xu, Kun Ding, Gaofeng Meng, Shiming Xiang, Chunhong Pan
Spindle Net: Person Re-Identification With Human Body Region Guided Feature Decomposition and Fusion
Haiyu Zhao, Maoqing Tian, Shuyang Sun, Jing Shao, Junjie Yan, Shuai Yi, Xiaogang Wang, Xiaoou Tang
Deep Visual-Semantic Quantization for Efficient Image Retrieval
Yue Cao, Mingsheng Long, Jianmin Wang, Shichen Liu
Efficient Diffusion on Region Manifolds: Recovering Small Objects With Compact CNN Representations
Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Teddy Furon, Ondřej Chum
Feature Pyramid Networks for Object Detection
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie
Mind the Class Weight Bias: Weighted Maximum Mean Discrepancy for Unsupervised Domain Adaptation
Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang, Yong Xu, Wangmeng Zuo
StyleNet: Generating Attractive Visual Captions With Styles
Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, Li Deng
Fine-Grained Recognition of Thousands of Object Categories With Single-Example Training
Leonid Karlinsky, Joseph Shtok, Yochay Tzur, Asaf Tzadok
Improving Interpretability of Deep Neural Networks With Semantic Information
Yinpeng Dong, Hang Su, Jun Zhu, Bo Zhang
Video Captioning With Transferred Semantic Attributes
Yingwei Pan, Ting Yao, Houqiang Li, Tao Mei
Fast Boosting Based Detection Using Scale Invariant Multimodal Multiresolution Filtered Features
Arthur Daniel Costea, Robert Varga, Sergiu Nedevschi
## Video Analytics
Temporal Convolutional Networks for Action Segmentation and Detection
Colin Lea, Michael D. Flynn, René Vidal, Austin Reiter, Gregory D. Hager
Surveillance Video Parsing With Single Frame Supervision
Si Liu, Changhu Wang, Ruihe Qian, Han Yu, Renda Bao, Yao Sun
Weakly Supervised Actor-Action Segmentation via Robust Multi-Task Ranking
Yan Yan, Chenliang Xu, Dawen Cai, Jason J. Corso
Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos
De-An Huang, Joseph J. Lim, Li Fei-Fei, Juan Carlos Niebles
Zero-Shot Action Recognition With Error-Correcting Output Codes
Jie Qin, Li Liu, Ling Shao, Fumin Shen, Bingbing Ni, Jiaxin Chen, Yunhong Wang
Enhancing Video Summarization via Vision-Language Embedding
Bryan A. Plummer, Matthew Brown, Svetlana Lazebnik
Synthesizing Dynamic Patterns by Spatial-Temporal Generative ConvNet
Jianwen Xie, Song-Chun Zhu, Ying Nian Wu
# Object Recognition & Scene Understanding - Computer Vision & Language
## Spotlight 1-2A
#### Discriminative Bimodal Networks for Visual Localization and Detection With Natural Language Queries [PDF](https://arxiv.org/abs/1704.03944)
Yuting Zhang, Luyao Yuan, Yijie Guo, Zhiyuan He, I-An Huang, Honglak Lee
_**Abstract**_:  _[Associating image regions with text queries has been recently explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but achieving somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate the proposed DBNet significantly outperforms previous state-of-the-art methods both for localization on single images and for detection on multiple images. We also establish an evaluation protocol for natural-language visual detection. ]_
_**Comment**_:  _< natural-language-based visual entity localization, bridge visual and linguistic representations>_
#### Automatic Understanding of Image and Video Advertisements [PDF](http://people.cs.pitt.edu/~kovashka/hussain_zhang_kovashka_ads_cvpr2017.pdf)
Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, Adriana Kovashka
_**Abstract**_:  _[There is more to images than their objective physical content: for example, advertisements are created to persuade a viewer to take a certain action. We propose the novel problem of automatic advertisement understanding. To enable research on this problem, we create two datasets: an image dataset of 64,832 image ads, and a video dataset of 3,477 ads. Our data contains rich annotations encompassing the topic and sentiment of the ads, questions and answers describing what actions the viewer is prompted to take and the reasoning that the ad presents to persuade the viewer ("What should I do according to this ad, and why should I do it?"), and symbolic references ads make (e.g. a dove symbolizes peace). We also analyze the most common persuasive strategies ads use, and the capabilities that computer vision systems should have to understand these strategies. We present baseline classification results for several prediction tasks, including automatically answering questions about the messages of the ads.]_
_**Comment**_:  _< problem of automatic advertisement understanding; dataset>_
#### *Deep Sketch Hashing: Fast Free-Hand Sketch-Based Image Retrieval [PDF](https://arxiv.org/abs/1703.05605)
Li Liu, Fumin Shen, Yuming Shen, Xianglong Liu, Ling Shao
_**Abstract**_:  _[Free-hand sketch-based image retrieval (SBIR) is a specific cross-view retrieval task, in which queries are abstract and ambiguous sketches while the retrieval database is formed with natural images. Work in this area mainly focuses on extracting representative and shared features for sketches and natural images. However, these can neither cope well with the geometric distortion between sketches and images nor be feasible for large-scale SBIR due to the heavy continuous-valued distance computation. In this paper, we speed up SBIR by introducing a novel binary coding method, named Deep Sketch Hashing (DSH), where a semi-heterogeneous deep architecture is proposed and incorporated into an end-to-end binary coding framework. Specifically, three convolutional neural networks are utilized to encode free-hand sketches, natural images and, especially, the auxiliary sketch-tokens which are adopted as bridges to mitigate the sketch-image geometric distortion. The learned DSH codes can effectively capture the cross-view similarities as well as the intrinsic semantic correlations between different categories. To the best of our knowledge, DSH is the first hashing work specifically designed for category-level SBIR with an end-to-end deep architecture. The proposed DSH is comprehensively evaluated on two large-scale datasets of TU-Berlin Extension and Sketchy, and the experiments consistently show DSH's superior SBIR accuracies over several state-of-the-art methods, while achieving significantly reduced retrieval time and memory footprint. ]_
_**Comment**_:  _< sketch-image, free-hand sketch-based image retrieval by binary coding using DL>_
#### *Discover and Learn New Objects From Documentaries [PDF](http://personal.ie.cuhk.edu.hk/~ccloy/files/cvpr_2017_discover.pdf)
Kai Chen, Hang Song, Chen Change Loy, Dahua Lin
_**Abstract**_:  _[]_
_**Comment**_:  _< learning object detectors from documentary films in a weakly supervised manner, using a joint probabilistic framework>_
#### Spatial-Semantic Image Search by Visual Feature Synthesis [PDF](http://web.cecs.pdx.edu/~fliu/papers/cvpr2017-search.pdf)
Long Mai, Hailin Jin, Zhe Lin, Chen Fang, Jonathan Brandt, Feng Liu
_**Abstract**_:  _[]_
_**Comment**_:  _< Spatial-Semantic Image Search, search with semantic & spatial info at the same time>_
#### *Fully-Adaptive Feature Sharing in Multi-Task Networks With Applications in Person Attribute Classification [PDF](https://arxiv.org/abs/1611.05377)
Yongxi Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, Tara Javidi, Rogerio Feris
_**Abstract**_:  _[Multi-task learning aims to improve generalization performance of multiple prediction tasks by appropriately sharing relevant information across them. In the context of deep neural networks, this idea is often realized by hand-designed network architectures with layers that are shared across tasks and branches that encode task-specific features. However, the space of possible multi-task deep architectures is combinatorially large and often the final architecture is arrived at by manual exploration of this space subject to designer's bias, which can be both error-prone and tedious. In this work, we propose a principled approach for designing compact multi-task deep learning architectures. Our approach starts with a thin network and dynamically widens it in a greedy manner during training using a novel criterion that promotes grouping of similar tasks together. Our extensive evaluation on person attributes classification tasks involving facial and clothing attributes suggests that the models produced by the proposed method are fast, compact and can closely match or exceed the state-of-the-art accuracy from strong baselines by much more expensive models. ]_
_**Comment**_:  _< Multi-task learning in DL with layers shared across tasks, adaptive widening of net>_
#### Semantic Compositional Networks for Visual Captioning [PDF](https://arxiv.org/abs/1611.08002)
Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, Li Deng
_**Abstract**_:  _[A Semantic Compositional Network (SCN) is developed for image captioning, in which semantic concepts (i.e., tags) are detected from the image, and the probability of each tag is used to compose the parameters in a long short-term memory (LSTM) network. The SCN extends each weight matrix of the LSTM to an ensemble of tag-dependent weight matrices. The degree to which each member of the ensemble is used to generate an image caption is tied to the image-dependent probability of the corresponding tag. In addition to captioning images, we also extend the SCN to generate captions for video clips. We qualitatively analyze semantic composition in SCNs, and quantitatively evaluate the algorithm on three benchmark datasets: COCO, Flickr30k, and Youtube2Text. Experimental results show that the proposed method significantly outperforms prior state-of-the-art approaches, across multiple evaluation metrics. ]_
_**Comment**_:  _< visual captioning, LSTM, ensemble of tag-dependent weight matrices>_
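As a rough illustration of the tag-dependent ensemble described in the abstract above, the sketch below blends a set of per-tag weight matrices using the tag probabilities. The function name `compose_weights` and the nested-list matrices are assumptions made for this example; it only illustrates the idea and is not the authors' implementation.

```python
def compose_weights(tag_probs, tag_weights):
    """Blend per-tag weight matrices into one matrix.

    tag_probs:   list of K tag probabilities for the current image.
    tag_weights: list of K matrices (nested lists) of identical shape,
                 one hypothetical weight matrix per tag.
    Returns the probability-weighted sum of the K matrices.
    """
    if len(tag_probs) != len(tag_weights):
        raise ValueError("need one weight matrix per tag probability")
    rows = len(tag_weights[0])
    cols = len(tag_weights[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for prob, weight in zip(tag_probs, tag_weights):
        for i in range(rows):
            for j in range(cols):
                combined[i][j] += prob * weight[i][j]
    return combined
```

A tag with probability close to zero contributes almost nothing to the combined matrix, so the effective LSTM weights change from image to image.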
#### ***Training Object Class Detectors With Click Supervision [PDF](https://arxiv.org/abs/1704.06189)
Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, Vittorio Ferrari
_**Abstract**_:  _[Training object class detectors typically requires a large set of images with objects annotated by bounding boxes. However, manually drawing bounding boxes is very time consuming. In this paper we greatly reduce annotation time by proposing center-click annotations: we ask annotators to click on the center of an imaginary bounding box which tightly encloses the object instance. We then incorporate these clicks into existing Multiple Instance Learning techniques for weakly supervised object localization, to jointly localize object bounding boxes over all training images. Extensive experiments on PASCAL VOC 2007 and MS COCO show that: (1) our scheme delivers high-quality detectors, performing substantially better than those produced by weakly supervised techniques, with a modest extra annotation effort; (2) these detectors in fact perform in a range close to those trained from manually drawn bounding boxes; (3) as the center-click task is very fast, our scheme reduces total annotation time by 9x to 18x. ]_
_**Comment**_:  _< not bounding boxes but center clicks, then incorporate these clicks into existing Multiple Instance Learning techniques >_
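One simple way to picture how a center click could guide box selection is sketched below: each candidate box gets a score that decays with the distance between its center and the click, and that score is multiplied with an appearance score. The Gaussian form, the `sigma` value and all function names are assumptions for illustration; the abstract does not specify how the clicks enter the Multiple Instance Learning objective.

```python
import math


def box_center(box):
    """Center (x, y) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def click_score(box, click, sigma=10.0):
    """Score in (0, 1] that is highest when the box center is on the click."""
    cx, cy = box_center(box)
    dist_sq = (cx - click[0]) ** 2 + (cy - click[1]) ** 2
    return math.exp(-dist_sq / (2.0 * sigma ** 2))


def rerank_proposals(boxes, appearance_scores, click, sigma=10.0):
    """Return the box with the best combined appearance and click score."""
    combined = [
        score * click_score(box, click, sigma)
        for box, score in zip(boxes, appearance_scores)
    ]
    best = max(range(len(boxes)), key=lambda i: combined[i])
    return boxes[best]
```

In this toy setup a proposal with a strong appearance score still loses if its center is far from the annotator's click.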
## Oral 1-2A
#### *Deep Reinforcement Learning-Based Image Captioning With Embedding Reward [PDF](https://arxiv.org/abs/1704.03899)
Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, Li-Jia Li
_**Abstract**_:  _[Image captioning is a challenging problem owing to the complexity in understanding the image content and diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance of this task. Most state-of-the-art approaches follow an encoder-decoder framework, which generates captions using a sequential recurrent prediction model. However, in this paper, we introduce a novel decision-making framework for image captioning. We utilize a "policy network" and a "value network" to collaboratively generate captions. The policy network serves as a local guidance by providing the confidence of predicting the next word according to the current state. Additionally, the value network serves as a global and lookahead guidance by evaluating all possible extensions of the current state. In essence, it adjusts the goal of predicting the correct words towards the goal of generating captions similar to the ground truth captions. We train both networks using an actor-critic reinforcement learning model, with a novel reward defined by visual-semantic embedding. Extensive experiments and analyses on the Microsoft COCO dataset show that the proposed framework outperforms state-of-the-art approaches across different evaluation metrics. ]_
_**Comment**_:  _< image captioning, utilizes a "policy network" (local guidance) and a "value network" (global guidance) to collaboratively generate captions>_
#### From Red Wine to Red Tomato: Composition With Context [PDF](http://www.cs.cmu.edu/~imisra/data/composing_cvpr17.pdf)
Ishan Misra, Abhinav Gupta, Martial Hebert
_**Abstract**_:  _[]_
_**Comment**_:  _< context; presents a simple method that respects contextuality when composing classifiers of known visual concepts >_
#### *Captioning Images With Diverse Objects [PDF](https://arxiv.org/abs/1606.07770)
Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, Kate Saenko
_**Abstract**_:  _[Recent captioning models are limited in their ability to scale and describe concepts unseen in paired image-text corpora. We propose the Novel Object Captioner (NOC), a deep visual semantic captioning model that can describe a large number of object categories not present in existing image-caption datasets. Our model takes advantage of external sources -- labeled images from object recognition datasets, and semantic knowledge extracted from unannotated text. We propose minimizing a joint objective which can learn from these diverse data sources and leverage distributional semantic embeddings, enabling the model to generalize and describe novel objects outside of image-caption datasets. We demonstrate that our model exploits semantic information to generate captions for hundreds of object categories in the ImageNet object recognition dataset that are not observed in MSCOCO image-caption training data, as well as many categories that are observed very rarely. Both automatic evaluations and human judgements show that our model considerably outperforms prior work in being able to describe many more categories of objects. ]_
_**Comment**_:  _< can describe a large number of object categories not existing, such as in ImageNet; takes advantage of labeled images from object recognition datasets, and semantic knowledge extracted from unannotated text; minimizing a joint objective>_
#### Self-Critical Sequence Training for Image Captioning [PDF](https://arxiv.org/abs/1612.00563)
Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, Vaibhava Goel
_**Abstract**_:  _[Recently it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized. Our systems are built using a new optimization approach that we call self-critical sequence training (SCST). SCST is a form of the popular REINFORCE algorithm that, rather than estimating a "baseline" to normalize the rewards and reduce variance, utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences. Using this approach, estimating the reward signal (as actor-critic methods must do) and estimating normalization (as REINFORCE algorithms typically do) is avoided, while at the same time harmonizing the model with respect to its test-time inference procedure. Empirically we find that directly optimizing the CIDEr metric with SCST and greedy decoding at test-time is highly effective. Our results on the MSCOCO evaluation server establish a new state-of-the-art on the task, improving the best result in terms of CIDEr from 104.9 to 112.3. ]_
_**Comment**_:  _< image captioning, using reinforcement learning, a new optimization approach that we call self-critical sequence training: utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences>_
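The self-critical baseline described above can be sketched in a few lines: the reward of the greedily decoded caption is subtracted from the reward of each sampled caption, and the difference weights the log-probability of the sample. The function names and the list-based inputs are assumptions for this example; the rewards would come from a metric such as CIDEr, which is not computed here.

```python
def self_critical_advantages(sampled_rewards, greedy_rewards):
    """Advantage per caption: reward of the sample minus reward of greedy decoding."""
    if len(sampled_rewards) != len(greedy_rewards):
        raise ValueError("reward lists must have the same length")
    return [rs - rg for rs, rg in zip(sampled_rewards, greedy_rewards)]


def scst_loss(sampled_log_probs, sampled_rewards, greedy_rewards):
    """REINFORCE-style surrogate loss using the greedy reward as baseline.

    sampled_log_probs: summed log-probability of each sampled caption.
    Minimizing this loss raises the probability of samples that beat the
    greedy caption and lowers it for samples that do worse.
    """
    advantages = self_critical_advantages(sampled_rewards, greedy_rewards)
    terms = [-adv * lp for adv, lp in zip(advantages, sampled_log_probs)]
    return sum(terms) / len(terms)
```

Because the baseline is the model's own test-time output, no separate baseline or value estimator has to be trained in this sketch.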
# Analyzing Humans 1
## Spotlight 1-2B
#### Crossing Nets: Combining GANs and VAEs With a Shared Latent Space for Hand Pose Estimation
Chengde Wan, Thomas Probst, Luc Van Gool, Angela Yao
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Predicting Behaviors of Basketball Players From First Person Videos
Shan Su, Jung Pyo Hong, Jianbo Shi, Hyun Soo Park
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### LCR-Net: Localization-Classification-Regression for Human Pose
Grégory Rogez, Philippe Weinzaepfel, Cordelia Schmid
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### *Learning Residual Images for Face Attribute Manipulation [PDF](https://arxiv.org/abs/1612.05363)
Wei Shen, Rujie Liu
_**Abstract**_:  _[Face attributes are interesting due to their detailed description of human faces. Unlike prior research on attribute prediction, we address an inverse and more challenging problem called face attribute manipulation which aims at modifying a face image according to a given attribute value. Instead of manipulating the whole image, we propose to learn the corresponding residual image defined as the difference between images before and after the manipulation. In this way, the manipulation can be operated efficiently with modest pixel modification. The framework of our approach is based on the Generative Adversarial Network. It consists of two image transformation networks and a discriminative network. The transformation networks are responsible for the attribute manipulation and its dual operation and the discriminative network is used to distinguish the generated images from real images. We also apply dual learning to allow transformation networks to learn from each other. Experiments show that residual images can be effectively learned and used for attribute manipulations. The generated images retain most of the details in attribute-irrelevant areas. ]_
_**Comment**_:  _< modifying a face image according to a given attribute value, GAN>_
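Since the residual image is defined as the difference between the image after and before manipulation, the manipulated image is simply the input plus the residual. The sketch below shows that step on a tiny grayscale image stored as nested lists; the function name and the clipping to a [0, 1] pixel range are assumptions for illustration, and the residual itself would be produced by a trained transformation network that is not shown.

```python
def apply_residual(image, residual, lo=0.0, hi=1.0):
    """Add a residual image to an input image, pixel by pixel.

    image, residual: nested lists of floats with the same shape.
    The sum is clipped to [lo, hi] so the result stays a valid image.
    """
    return [
        [min(hi, max(lo, pixel + delta)) for pixel, delta in zip(img_row, res_row)]
        for img_row, res_row in zip(image, residual)
    ]
```

Pixels where the residual is zero pass through unchanged, which matches the abstract's point that attribute-irrelevant areas keep their details.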
#### *Seeing What Is Not There: Learning Context to Determine Where Objects Are Missing [PDF](https://arxiv.org/abs/1702.07971)
Jin Sun, David W. Jacobs
_**Abstract**_:  _[Most of computer vision focuses on what is in an image. We propose to train a standalone object-centric context representation to perform the opposite task: seeing what is not there. Given an image, our context model can predict where objects should exist, even when no object instances are present. Combined with object detection results, we can perform a novel vision task: finding where objects are missing in an image. Our model is based on a convolutional neural network structure. With a specially designed training strategy, the model learns to ignore objects and focus on context only. It is fully convolutional thus highly efficient. Experiments show the effectiveness of the proposed approach in one important accessibility task: finding city street regions where curb ramps are missing, which could help millions of people with mobility disabilities. ]_
_**Comment**_:  _< special training strategy, ignore objects and focus on context only>_
#### Deep Learning on Lie Groups for Skeleton-Based Action Recognition
Zhiwu Huang, Chengde Wan, Thomas Probst, Luc Van Gool
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Harvesting Multiple Views for Marker-Less 3D Human Pose Annotations
Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, Kostas Daniilidis
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Coarse-To-Fine Volumetric Prediction for Single-Image 3D Human Pose
Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, Kostas Daniilidis
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
## Oral 1-2B
#### Weakly Supervised Action Learning With RNN Based Fine-To-Coarse Modeling [PDF](https://arxiv.org/abs/1703.08132)
Alexander Richard, Hilde Kuehne, Juergen Gall
_**Abstract**_:  _[We present an approach for weakly supervised learning of human actions. Given a set of videos and an ordered list of the occurring actions, the goal is to infer start and end frames of the related action classes within the video and to train the respective action classifiers without any need for hand labeled frame boundaries. To address this task, we propose a combination of a discriminative representation of subactions, modeled by a recurrent neural network, and a coarse probabilistic model to allow for a temporal alignment and inference over long sequences. While this system alone already generates good results, we show that the performance can be further improved by approximating the number of subactions to the characteristics of the different action classes. To this end, we adapt the number of subaction classes by iterating realignment and reestimation during training. The proposed system is evaluated on two benchmark datasets, the Breakfast and the Hollywood extended dataset, showing a competitive performance on various weak learning tasks such as temporal action segmentation and action alignment. ]_
_**Comment**_:  _< >_
#### *Disentangled Representation Learning GAN for Pose-Invariant Face Recognition [PDF](http://cvlab.cse.msu.edu/pdfs/Tran_Yin_Liu_CVPR2017.pdf)
Luan Tran, Xi Yin, Xiaoming Liu
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### ArtTrack: Articulated Multi-Person Tracking in the Wild [PDF](https://arxiv.org/abs/1612.01465)
Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern Andres, Bernt Schiele
_**Abstract**_:  _[In this paper we propose an approach for articulated tracking of multiple people in unconstrained videos. Our starting point is a model that resembles existing architectures for single-frame pose estimation but is substantially faster. We achieve this in two ways: (1) by simplifying and sparsifying the body-part relationship graph and leveraging recent methods for faster inference, and (2) by offloading a substantial share of computation onto a feed-forward convolutional architecture that is able to detect and associate body joints of the same person even in clutter. We use this model to generate proposals for body joint locations and formulate articulated tracking as spatio-temporal grouping of such proposals. This allows to jointly solve the association problem for all people in the scene by propagating evidence from strong detections through time and enforcing constraints that each proposal can be assigned to one person only. We report results on a public MPII Human Pose benchmark and on a new MPII Video Pose dataset of image sequences with multiple people. We demonstrate that our model achieves state-of-the-art results while using only a fraction of time and is able to leverage temporal information to improve state-of-the-art for crowded scenes. ]_
_**Comment**_:  _< >_
#### Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields [PDF](https://arxiv.org/pdf/1611.08050.pdf) [Code](https://github.com/ZheC/Realtime_Multi-Person_Pose_Estimation)
Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
# Image Motion & Tracking; Video Analysis
## Spotlight 1-2C
#### *Template Matching With Deformable Diversity Similarity [PDF](https://arxiv.org/abs/1612.02190)
Itamar Talmi, Roey Mechrez, Lihi Zelnik-Manor
_**Abstract**_:  _[We propose a novel measure for template matching named Deformable Diversity Similarity -- based on the diversity of feature matches between a target image window and the template. We rely on both local appearance and geometric information that jointly lead to a powerful approach for matching. Our key contribution is a similarity measure, that is robust to complex deformations, significant background clutter, and occlusions. Empirical evaluation on the most up-to-date benchmark shows that our method outperforms the current state-of-the-art in its detection accuracy while improving computational complexity. ]_
_**Comment**_:  _< >_
#### **Beyond Triplet Loss: A Deep Quadruplet Network for Person Re-Identification [PDF](https://arxiv.org/abs/1704.01719)
Weihua Chen, Xiaotang Chen, Jianguo Zhang, Kaiqi Huang
_**Abstract**_:  _[Person re-identification (ReID) is an important task in wide area video surveillance which focuses on identifying people across different cameras. Recently, deep learning networks with a triplet loss become a common framework for person ReID. However, the triplet loss pays main attentions on obtaining correct orders on the training set. It still suffers from a weaker generalization capability from the training set to the testing set, thus resulting in inferior performance. In this paper, we design a quadruplet loss, which can lead to the model output with a larger inter-class variation and a smaller intra-class variation compared to the triplet loss. As a result, our model has a better generalization ability and can achieve a higher performance on the testing set. In particular, a quadruplet deep network using a margin-based online hard negative mining is proposed based on the quadruplet loss for the person ReID. In extensive experiments, the proposed network outperforms most of the state-of-the-art algorithms on representative datasets which clearly demonstrates the effectiveness of our proposed method. ]_
_**Comment**_:  _< extends the triplet loss to a quadruplet loss>_
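A minimal sketch of how a quadruplet loss adds a second term to a triplet loss, following the idea in the abstract above. The squared Euclidean distance, the fixed margins `margin1` and `margin2`, and all function names are assumptions for illustration; the paper's learned metric and its margin-based online hard negative mining are not reproduced here.

```python
import numpy as np


def squared_distance(x, y):
    """Squared Euclidean distance between two embedding vectors (assumed metric)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.dot(diff, diff))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on: d(anchor, positive) should be smaller than d(anchor, negative)."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def quadruplet_loss(anchor, positive, negative1, negative2,
                    margin1=1.0, margin2=0.5):
    """Triplet term plus a term comparing d(anchor, positive) with the
    distance between two negatives of different identities."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative1)
    d_nn = squared_distance(negative1, negative2)
    relative_term = max(0.0, d_ap - d_an + margin1)   # same as the triplet loss
    absolute_term = max(0.0, d_ap - d_nn + margin2)   # extra quadruplet term
    return relative_term + absolute_term


# Toy 2D embeddings (hypothetical values).
anchor, positive = [0.0, 0.0], [1.0, 0.0]
negative1, negative2 = [3.0, 0.0], [3.0, 1.0]

print(triplet_loss(anchor, positive, negative1))                 # 0.0
print(quadruplet_loss(anchor, positive, negative1, negative2))   # 0.5
```

In this toy case the triplet constraint is already satisfied, yet the quadruplet loss is still positive: the two negatives are as close to each other as the anchor is to its positive, so the extra term keeps pushing for a smaller intra-class distance relative to inter-class distances.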
#### Agent-Centric Risk Assessment: Accident Anticipation and Risky Region Localization [PDF](https://arxiv.org/abs/1705.06560)
Kuo-Hao Zeng, Shih-Han Chou, Fu-Hsiang Chan, Juan Carlos Niebles, Min Sun
_**Abstract**_:  _[For survival, a living agent must have the ability to assess risk (1) by temporally anticipating accidents before they occur, and (2) by spatially localizing risky regions in the environment to move away from threats. In this paper, we take an agent-centric approach to study the accident anticipation and risky region localization tasks. We propose a novel soft-attention Recurrent Neural Network (RNN) which explicitly models both spatial and appearance-wise non-linear interaction between the agent triggering the event and another agent or static-region involved. In order to test our proposed method, we introduce the Epic Fail (EF) dataset consisting of 3000 viral videos capturing various accidents. In the experiments, we evaluate the risk assessment accuracy both in the temporal domain (accident anticipation) and spatial domain (risky region localization) on our EF dataset and the Street Accident (SA) dataset. Our method consistently outperforms other baselines on both datasets. ]_
_**Comment**_:  _< >_
#### Bidirectional Multirate Reconstruction for Temporal Modeling in Videos
Linchao Zhu, Zhongwen Xu, Yi Yang
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Action-Decision Networks for Visual Tracking With Deep Reinforcement Learning [PDF](https://sites.google.com/view/cvpr2017-adnet)
Sangdoo Yun, Jongwon Choi, Youngjoon Yoo, Kimin Yun, Jin Young Choi
_**Abstract**_:  _[This paper proposes a novel tracker which is controlled by sequentially pursuing actions learned by deep reinforcement learning. In contrast to the existing trackers using deep networks, the proposed tracker is designed to achieve a light computation as well as satisfactory tracking accuracy in both location and scale. The deep network to control actions is pre-trained using various training sequences and fine-tuned during tracking for online adaptation to target and background changes. The pre-training is done by utilizing deep reinforcement learning as well as supervised learning. The use of reinforcement learning enables even partially labeled data to be successfully utilized for semi-supervised learning. Through evaluation of the OTB dataset, the proposed tracker is validated to achieve a competitive performance that is three times faster than state-of-the-art, deep network–based trackers. The fast version of the proposed method, which operates in real-time on GPU, outperforms the state-of-the-art real-time trackers.]_
_**Comment**_:  _< >_
#### TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering [PDF](https://arxiv.org/abs/1704.04497)
Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, Gunhee Kim
_**Abstract**_:  _[Vision and language understanding has emerged as a subject undergoing intense study in Artificial Intelligence. Among many tasks in this line of research, visual question answering (VQA) has been one of the most successful ones, where the goal is to learn a model that understands visual content at region-level details and finds their associations with pairs of questions and answers in the natural language form. Despite the rapid progress in the past few years, most existing work in VQA have focused primarily on images. In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways. First, we propose three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly. Next, we introduce a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with our new tasks. Finally, we propose a dual-LSTM based approach with both spatial and temporal attention, and show its effectiveness over conventional VQA techniques through empirical evaluations. ]_
_**Comment**_:  _< >_
#### Making 360° Video Watchable in 2D: Learning Videography for Click Free Viewing [PDF](https://arxiv.org/abs/1703.00495)
Yu-Chuan Su, Kristen Grauman
_**Abstract**_:  _[360 video requires human viewers to actively control "where" to look while watching the video. Although it provides a more immersive experience of the visual content, it also introduces additional burden for viewers; awkward interfaces to navigate the video lead to suboptimal viewing experiences. Virtual cinematography is an appealing direction to remedy these problems, but conventional methods are limited to virtual environments or rely on hand-crafted heuristics. We propose a new algorithm for virtual cinematography that automatically controls a virtual camera within a 360 video. Compared to the state of the art, our algorithm allows more general camera control, avoids redundant outputs, and extracts its output videos substantially more efficiently. Experimental results on over 7 hours of real "in the wild" video show that our generalized camera control is crucial for viewing 360 video, while the proposed efficient algorithm is essential for making the generalized control computationally tractable. ]_
_**Comment**_:  _< >_
#### Unsupervised Adaptive Re-Identification in Open World Dynamic Camera Networks [PDF](https://arxiv.org/abs/1706.03112)
Rameswar Panda, Amran Bhuiyan, Vittorio Murino, Amit K. Roy-Chowdhury
_**Abstract**_:  _[Person re-identification is an open and challenging problem in computer vision. Existing approaches have concentrated on either designing the best feature representation or learning optimal matching metrics in a static setting where the number of cameras are fixed in a network. Most approaches have neglected the dynamic and open world nature of the re-identification problem, where a new camera may be temporarily inserted into an existing system to get additional information. To address such a novel and very practical problem, we propose an unsupervised adaptation scheme for re-identification models in a dynamic camera network. First, we formulate a domain perceptive re-identification method based on geodesic flow kernel that can effectively find the best source camera (already installed) to adapt with a newly introduced target camera, without requiring a very expensive training phase. Second, we introduce a transitive inference algorithm for re-identification that can exploit the information from best source camera to improve the accuracy across other camera pairs in a network of multiple cameras. Extensive experiments on four benchmark datasets demonstrate that the proposed approach significantly outperforms the state-of-the-art unsupervised learning based alternatives whilst being extremely efficient to compute.]_
_**Comment**_:  _< based on geodesic flow kernel; cameras are not fixed>_
## Oral 1-2C
#### Context-Aware Correlation Filter Tracking [PDF](https://goo.gl/5rDpff)
Matthias Mueller, Neil Smith, Bernard Ghanem
_**Abstract**_:  _[Correlation filter (CF) based trackers have recently gained a lot of popularity due to their impressive performance on benchmark datasets, while maintaining high frame rates. A significant amount of recent research focuses on the incorporation of stronger features for a richer representation of the tracking target. However, this only helps to discriminate the target from background within a small neighborhood. In this paper, we present a framework that allows the explicit incorporation of global context within CF trackers. We reformulate the original optimization problem and provide a closed form solution for single and multi-dimensional features in the primal and dual domain. Extensive experiments demonstrate that this framework significantly improves the performance of many CF trackers with only a modest impact on frame rate.]_
_**Comment**_:  _< tracking>_
#### Deep 360 Pilot: Learning a Deep Agent for Piloting Through 360° Sports Videos
Hou-Ning Hu, Yen-Chen Lin, Ming-Yu Liu, Hsien-Tzu Cheng, Yung-Ju Chang, Min Sun
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Slow Flow: Exploiting High-Speed Cameras for Accurate and Diverse Optical Flow Reference Data
Joel Janai, Fatma Güney, Jonas Wulff, Michael J. Black, Andreas Geiger
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### *CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos [PDF](https://arxiv.org/abs/1703.01515)
Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, Shih-Fu Chang
_**Abstract**_:  _[Temporal action localization is an important yet challenging problem. Given a long, untrimmed video consisting of multiple action instances and complex background contents, we need not only to recognize their action categories, but also to localize the start time and end time of each instance. Many state-of-the-art systems use segment-level classifiers to select and rank proposal segments of pre-determined boundaries. However, a desirable model should move beyond segment-level and make dense predictions at a fine granularity in time to determine precise temporal boundaries. To this end, we design a novel Convolutional-De-Convolutional (CDC) network that places CDC filters on top of 3D ConvNets, which have been shown to be effective for abstracting action semantics but reduce the temporal length of the input data. The proposed CDC filter performs the required temporal upsampling and spatial downsampling operations simultaneously to predict actions at the frame-level granularity. It is unique in jointly modeling action semantics in space-time and fine-grained temporal dynamics. We train the CDC network in an end-to-end manner efficiently. Our model not only achieves superior performance in detecting actions in every frame, but also significantly boosts the precision of localizing temporal boundaries. Finally, the CDC network demonstrates a very high efficiency with the ability to process 500 frames per second on a single GPU server. We will update the camera-ready version and publish the source codes online soon. ]_
_**Comment**_:  _< Convolutional-De-Convolutional (CDC) filters on top of 3D ConvNets,>_
# Poster 1-2
## 3D Computer Vision
#### Exploiting 2D Floorplan for Building-Scale Panorama RGBD Alignment
Erik Wijmans, Yasutaka Furukawa
#### A Combinatorial Solution to Non-Rigid 3D Shape-To-Image Matching
Florian Bernard, Frank R. Schmidt, Johan Thunberg, Daniel Cremers
#### NID-SLAM: Robust Monocular SLAM Using Normalised Information Distance
Geoffrey Pascoe, Will Maddern, Michael Tanner, Pedro Piniés, Paul Newman
#### End-To-End Training of Hybrid CNN-CRF Models for Stereo
Patrick Knöbelreiter, Christian Reinbacher, Alexander Shekhovtsov, Thomas Pock
#### Learning Shape Abstractions by Assembling Volumetric Primitives
Shubham Tulsiani, Hao Su, Leonidas J. Guibas, Alexei A. Efros, Jitendra Malik
#### Locality-Sensitive Deconvolution Networks With Gated Fusion for RGB-D Indoor Semantic Segmentation
Yanhua Cheng, Rui Cai, Zhiwei Li, Xin Zhao, Kaiqi Huang
#### Acquiring Axially-Symmetric Transparent Objects Using Single-View Transmission Imaging
Jaewon Kim, Ilya Reshetouski, Abhijeet Ghosh
#### Regressing Robust and Discriminative 3D Morphable Models With a Very Deep Neural Network
Anh Tuấn Trần, Tal Hassner, Iacopo Masi, Gérard Medioni
#### End-To-End 3D Face Reconstruction With Deep Neural Networks
Pengfei Dou, Shishir K. Shah, Ioannis A. Kakadiaris
#### DUST: Dual Union of Spatio-Temporal Subspaces for Monocular Multiple Object 3D Reconstruction
Antonio Agudo, Francesc Moreno-Noguer
## Analyzing Humans in Images
#### Finding Tiny Faces
Peiyun Hu, Deva Ramanan
_**Abstract**_:  _[Though tremendous strides have been made in object recognition, one of the remaining open challenges is detecting small objects. We explore three aspects of the problem in the context of finding small faces: the role of scale invariance, image resolution, and contextual reasoning. While most recognition approaches aim to be scale-invariant, the cues for recognizing a 3px tall face are fundamentally different than those for recognizing a 300px tall face. We take a different approach and train separate detectors for different scales. To maintain efficiency, detectors are trained in a multi-task fashion: they make use of features extracted from multiple layers of single (deep) feature hierarchy. While training detectors for large objects is straightforward, the crucial challenge remains training detectors for small objects. We show that context is crucial, and define templates that make use of massively-large receptive fields (where 99% of the template extends beyond the object of interest). Finally, we explore the role of scale in pre-trained deep networks, providing ways to extrapolate networks tuned for limited scales to rather extreme ranges. We demonstrate state-of-the-art results on massively-benchmarked face datasets (FDDB and WIDER FACE). In particular, when compared to prior art on WIDER FACE, our results reduce error by a factor of 2 (our models produce an AP of 82% while prior art ranges from 29-64%).]_
_**Comment**_:  _< face detection in crowded scenes, image pyramid, feature fusion>_
#### Dynamic Facial Analysis: From Bayesian Filtering to Recurrent Neural Network
Jinwei Gu, Xiaodong Yang, Shalini De Mello, Jan Kautz
#### Deep Temporal Linear Encoding Networks
Ali Diba, Vivek Sharma, Luc Van Gool
#### Joint Registration and Representation Learning for Unconstrained Face Identification
Munawar Hayat, Salman H. Khan, Naoufel Werghi, Roland Goecke
#### 3D Human Pose Estimation From a Single Image via Distance Matrix Regression
Francesc Moreno-Noguer
#### One-Shot Metric Learning for Person Re-Identification
Slawomir Bąk, Peter Carr
#### Generalized Rank Pooling for Activity Recognition
Anoop Cherian, Basura Fernando, Mehrtash Harandi, Stephen Gould
#### Deep Representation Learning for Human Motion Prediction and Classification
Judith Bütepage, Michael J. Black, Danica Kragic, Hedvig Kjellström
#### Interspecies Knowledge Transfer for Facial Keypoint Detection
Maheen Rashid, Xiuye Gu, Yong Jae Lee
#### Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization
Runpeng Cui, Hu Liu, Changshui Zhang
## Applications
#### Modeling Sub-Event Dynamics in First-Person Action Recognition
Hasan F. M. Zaki, Faisal Shafait, Ajmal Mian
## Computational Photography
#### Turning an Urban Scene Video Into a Cinemagraph
Hang Yan, Yebin Liu, Yasutaka Furukawa
#### Light Field Reconstruction Using Deep Convolutional Network on EPI
Gaochang Wu, Mandan Zhao, Liangyong Wang, Qionghai Dai, Tianyou Chai, Yebin Liu
## Image Motion & Tracking
#### FlowNet 2.0: Evolution of Optical Flow Estimation With Deep Networks
Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, Thomas Brox
## Low- & Mid-Level Vision
#### Attention-Aware Face Hallucination via Deep Reinforcement Learning
Qingxing Cao, Liang Lin, Yukai Shi, Xiaodan Liang, Guanbin Li
#### Simple Does It: Weakly Supervised Instance and Semantic Segmentation
Anna Khoreva, Rodrigo Benenson, Jan Hosang, Matthias Hein, Bernt Schiele
#### Anti-Glare: Tightly Constrained Optimization for Eyeglass Reflection Removal
Tushar Sandhan, Jin Young Choi
#### Deep Joint Rain Detection and Removal From a Single Image
Wenhan Yang, Robby T. Tan, Jiashi Feng, Jiaying Liu, Zongming Guo, Shuicheng Yan
#### Radiometric Calibration From Faces in Images
Chen Li, Stephen Lin, Kun Zhou, Katsushi Ikeuchi
#### Webly Supervised Semantic Segmentation
Bin Jin, Maria V. Ortiz Segovia, Sabine Süsstrunk
#### Removing Rain From Single Images via a Deep Detail Network
Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, John Paisley
#### Deep Crisp Boundaries
Yupei Wang, Xin Zhao, Kaiqi Huang
#### Coarse-To-Fine Segmentation With Shape-Tailored Continuum Scale Spaces
Naeemullah Khan, Byung-Woo Hong, Anthony Yezzi, Ganesh Sundaramoorthi
#### Large Kernel Matters -- Improve Semantic Segmentation by Global Convolutional Network
Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, Jian Sun
#### Single Image Reflection Suppression
Nikolaos Arvanitopoulos, Radhakrishna Achanta, Sabine Süsstrunk
#### CASENet: Deep Category-Aware Semantic Edge Detection
Zhiding Yu, Chen Feng, Ming-Yu Liu, Srikumar Ramalingam
#### Reflectance Adaptive Filtering Improves Intrinsic Image Estimation
Thomas Nestmeyer, Peter V. Gehler
## Machine Learning
#### Conditional Similarity Networks
Andreas Veit, Serge Belongie, Theofanis Karaletsos
#### Spatially Adaptive Computation Time for Residual Networks
Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, Ruslan Salakhutdinov
#### Xception: Deep Learning With Depthwise Separable Convolutions
François Chollet
#### Feedback Networks
Amir R. Zamir, Te-Lin Wu, Lin Sun, William B. Shen, Bertram E. Shi, Jitendra Malik, Silvio Savarese
#### Online Summarization via Submodular and Convex Optimization
Ehsan Elhamifar, M. Clara De Paolis Kaluza
#### Deep MANTA: A Coarse-To-Fine Many-Task Network for Joint 2D and 3D Vehicle Analysis From Monocular Image
Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Céline Teulière, Thierry Chateau
#### Improving Pairwise Ranking for Multi-Label Image Classification
Yuncheng Li, Yale Song, Jiebo Luo
#### Active Convolution: Learning the Shape of Convolution for Image Classification
Yunho Jeon, Junmo Kim
#### Linking Image and Text With 2-Way Nets
Aviv Eisenschtat, Lior Wolf
#### Stacked Generative Adversarial Networks
Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, Serge Belongie
#### Image Splicing Detection via Camera Response Function Analysis
Can Chen, Scott McCloskey, Jingyi Yu
#### Building a Regular Decision Boundary With Deep Networks
Edouard Oyallon
#### More Is Less: A More Complicated Network With Less Inference Complexity
Xuanyi Dong, Junshi Huang, Yi Yang, Shuicheng Yan
#### Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications
Evgeny Levinkov, Jonas Uhrig, Siyu Tang, Mohamed Omran, Eldar Insafutdinov, Alexander Kirillov, Carsten Rother, Thomas Brox, Bernt Schiele, Bjoern Andres
#### Scale-Aware Face Detection
Zekun Hao, Yu Liu, Hongwei Qin, Junjie Yan, Xiu Li, Xiaolin Hu
#### Deep Unsupervised Similarity Learning Using Partially Ordered Sets
Miguel A. Bautista, Artsiom Sanakoyeu, Björn Ommer
#### Generative Hierarchical Learning of Sparse FRAME Models
Jianwen Xie, Yifei Xu, Erik Nijkamp, Ying Nian Wu, Song-Chun Zhu
## Object Recognition & Scene Understanding
#### Generating Holistic 3D Scene Abstractions for Text-Based Image Retrieval
Ang Li, Jin Sun, Joe Yue-Hei Ng, Ruichi Yu, Vlad I. Morariu, Larry S. Davis
#### Perceptual Generative Adversarial Networks for Small Object Detection
Jianan Li, Xiaodan Liang, Yunchao Wei, Tingfa Xu, Jiashi Feng, Shuicheng Yan
#### Emotion Recognition in Context
Ronak Kosti, Jose M. Alvarez, Adria Recasens, Agata Lapedriza
#### Deep Learning of Human Visual Sensitivity in Image Quality Assessment Framework
Jongyoo Kim, Sanghoon Lee
#### Dense Captioning With Joint Inference and Visual Context
Linjie Yang, Kevin Tang, Jianchao Yang, Li-Jia Li
#### CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick
#### Cross-View Image Matching for Geo-Localization in Urban Environments
Yicong Tian, Chen Chen, Mubarak Shah
#### Matrix Tri-Factorization With Manifold Regularizations for Zero-Shot Learning
Xing Xu, Fumin Shen, Yang Yang, Dongxiang Zhang, Heng Tao Shen, Jingkuan Song
#### Self-Supervised Learning of Visual Features Through Embedding Images Into Text Topic Spaces
Lluis Gomez, Yash Patel, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar
#### Learning Spatial Regularization With Image-Level Supervisions for Multi-Label Image Classification
Feng Zhu, Hongsheng Li, Wanli Ouyang, Nenghai Yu, Xiaogang Wang
#### Semantically Consistent Regularization for Zero-Shot Recognition
Pedro Morgado, Nuno Vasconcelos
#### Can Walking and Measuring Along Chord Bunches Better Describe Leaf Shapes?
Bin Wang, Yongsheng Gao, Changming Sun, Michael Blumenstein, John La Salle
## Video Analytics
#### Self-Learning Scene-Specific Pedestrian Detectors Using a Progressive Latent Model
Qixiang Ye, Tianliang Zhang, Wei Ke, Qiang Qiu, Jie Chen, Guillermo Sapiro, Baochang Zhang
#### Predictive-Corrective Networks for Action Detection
Achal Dave, Olga Russakovsky, Deva Ramanan
#### Budget-Aware Deep Semantic Video Segmentation
Behrooz Mahasseni, Sinisa Todorovic, Alan Fern
#### Unified Embedding and Metric Learning for Zero-Exemplar Event Detection
Noureldien Hussein, Efstratios Gavves, Arnold W.M. Smeulders
#### Spatiotemporal Pyramid Network for Video Action Recognition
Yunbo Wang, Mingsheng Long, Jianmin Wang, Philip S. Yu
#### ER3: A Unified Framework for Event Retrieval, Recognition and Recounting
Zhanning Gao, Gang Hua, Dongqing Zhang, Nebojsa Jojic, Le Wang, Jianru Xue, Nanning Zheng
#### FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos
Suyog Dutt Jain, Bo Xiong, Kristen Grauman
#### Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach
Aidean Sharghi, Jacob S. Laurel, Boqing Gong
#### Flexible Spatio-Temporal Networks for Video Prediction
Chaochao Lu, Michael Hirsch, Bernhard Schölkopf
#### Temporal Action Co-Segmentation in 3D Motion Capture Data and Videos
Konstantinos Papoutsakis, Costas Panagiotakis, Antonis A. Argyros
# Machine Learning 2
## Spotlight 2-1A
#### Dual Attention Networks for Multimodal Reasoning and Matching [PDF](https://arxiv.org/abs/1611.00471)
Hyeonseob Nam, Jung-Woo Ha, Jeonghee Kim
_**Abstract**_:  _[We propose Dual Attention Networks (DANs) which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language. DANs attend to specific regions in images and words in text through multiple steps and gather essential information from both modalities. Based on this framework, we introduce two types of DANs for multimodal reasoning and matching, respectively. The reasoning model allows visual and textual attentions to steer each other during collaborative inference, which is useful for tasks such as Visual Question Answering (VQA). In addition, the matching model exploits the two attention mechanisms to estimate the similarity between images and sentences by focusing on their shared semantics. Our extensive experiments validate the effectiveness of DANs in combining vision and language, achieving the state-of-the-art performance on public benchmarks for VQA and image-text matching. ]_
_**Comment**_:  _< visual and textual attention>_
#### DESIRE: Distant Future Prediction in Dynamic Scenes With Interacting Agents [PDF](https://arxiv.org/abs/1704.04394)
Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, Manmohan Chandraker
_**Abstract**_:  _[We introduce a Deep Stochastic IOC RNN Encoder-decoder framework, DESIRE, for the task of future predictions of multiple interacting agents in dynamic scenes. DESIRE effectively predicts future locations of objects in multiple scenes by 1) accounting for the multi-modal nature of the future prediction (i.e., given the same context, future may vary), 2) foreseeing the potential future outcomes and make a strategic prediction based on that, and 3) reasoning not only from the past motion history, but also from the scene context as well as the interactions among the agents. DESIRE achieves these in a single end-to-end trainable neural network model, while being computationally efficient. The model first obtains a diverse set of hypothetical future prediction samples employing a conditional variational autoencoder, which are ranked and refined by the following RNN scoring-regression module. Samples are scored by accounting for accumulated future rewards, which enables better long-term strategic decisions similar to IOC frameworks. An RNN scene context fusion module jointly captures past motion histories, the semantic scene context and interactions among multiple agents. A feedback mechanism iterates over the ranking and refinement to further boost the prediction accuracy. We evaluate our model on two publicly available datasets: KITTI and Stanford Drone Dataset. Our experiments show that the proposed model significantly improves the prediction accuracy compared to other baseline methods. ]_
_**Comment**_:  _< >_
#### *Interpretable Structure-Evolving LSTM [PDF](https://arxiv.org/abs/1703.03055)
Xiaodan Liang, Liang Lin, Xiaohui Shen, Jiashi Feng, Shuicheng Yan, Eric P. Xing
_**Abstract**_:  _[This paper develops a general framework for learning interpretable data representation via Long Short-Term Memory (LSTM) recurrent neural networks over hierarchal graph structures. Instead of learning LSTM models over the pre-fixed structures, we propose to further learn the intermediate interpretable multi-level graph structures in a progressive and stochastic way from data during the LSTM network optimization. We thus call this model the structure-evolving LSTM. In particular, starting with an initial element-level graph representation where each node is a small data element, the structure-evolving LSTM gradually evolves the multi-level graph representations by stochastically merging the graph nodes with high compatibilities along the stacked LSTM layers. In each LSTM layer, we estimate the compatibility of two connected nodes from their corresponding LSTM gate outputs, which is used to generate a merging probability. The candidate graph structures are accordingly generated where the nodes are grouped into cliques with their merging probabilities. We then produce the new graph structure with a Metropolis-Hasting algorithm, which alleviates the risk of getting stuck in local optimums by stochastic sampling with an acceptance probability. Once a graph structure is accepted, a higher-level graph is then constructed by taking the partitioned cliques as its nodes. During the evolving process, representation becomes more abstracted in higher-levels where redundant information is filtered out, allowing more efficient propagation of long-range data dependencies. We evaluate the effectiveness of structure-evolving LSTM in the application of semantic object parsing and demonstrate its advantage over state-of-the-art LSTM models on standard benchmarks. ]_
_**Comment**_:  _< structure-evolving LSTM, learning interpretable data representation via LSTM recurrent neural networks over hierarchical graph structures>_
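The abstract above mentions accepting a proposed graph structure with a Metropolis-Hastings acceptance probability. A generic sketch of that accept/reject step follows, assuming log-probabilities are available for the current and proposed states. The function name `mh_accept` and its parameters are hypothetical, and the paper's specific target and proposal distributions over graph structures are not given in the abstract and are not modeled here.

```python
import math
import random


def mh_accept(log_p_current, log_p_proposed,
              log_q_forward=0.0, log_q_reverse=0.0, rng=random):
    """Generic Metropolis-Hastings accept/reject step.

    log_p_*  : log target probability of the current / proposed state
    log_q_*  : log proposal probability of moving forward / back
               (both 0.0 means a symmetric proposal)
    Returns (accepted, acceptance_probability).
    """
    log_ratio = (log_p_proposed + log_q_reverse) - (log_p_current + log_q_forward)
    # min(1, exp(log_ratio)) without overflowing for large positive ratios.
    accept_prob = 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)
    return rng.random() < accept_prob, accept_prob


# A more probable proposal is always accepted.
print(mh_accept(log_p_current=-2.0, log_p_proposed=-1.0))

# A less probable proposal is accepted only some of the time, which is
# what lets the sampler leave a local optimum.
print(mh_accept(log_p_current=-1.0, log_p_proposed=-2.0, rng=random.Random(0)))
```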
#### ShapeOdds: Variational Bayesian Learning of Generative Shape Models [PDF](https://www.researchgate.net/profile/Shireen_Elhabian/publication/314237204_ShapeOdds_Variational_Bayesian_Learning_of_Generative_Shape_Models/links/58beee88a6fdccff7b1f97c6/ShapeOdds-Variational-Bayesian-Learning-of-Generative-Shape-Models.pdf)
Shireen Elhabian, Ross Whitaker
_**Abstract**_:  _[Shape models provide a compact parameterization of a class of shapes, and have been shown to be important to a variety of vision problems, including object detection, tracking, and image segmentation. Learning generative shape models from grid-structured representations, aka silhouettes, is usually hindered by (1) data likelihoods with intractable marginals and posteriors, (2) high-dimensional shape spaces with limited training samples (and the associated risk of overfitting), and (3) estimation of hyperparameters relating to model complexity that often entails computationally expensive grid searches. In this paper, we propose a Bayesian treatment that relies on direct probabilistic formulation for learning generative shape models in the silhouettes space. We propose a variational approach for learning a latent variable model in which we make use of, and extend, recent works on variational bounds of logistic-Gaussian integrals to circumvent intractable marginals and posteriors. Spatial coherency and sparsity priors are also incorporated to lend stability to the optimization problem by regularizing the solution space while avoiding overfitting in this high-dimensional, low-sample-size scenario. We deploy a type-II maximum likelihood estimate of the model hyperparameters to avoid grid searches. We demonstrate that the proposed model generates realistic samples, generalizes to unseen examples, and is able to handle missing regions and/or background clutter, while comparing favorably with recent, neural-network-based approaches.]_
_**Comment**_:  _< generative shape models in the silhouettes space>_
#### **Fast Video Classification via Adaptive Cascading of Deep Models [PDF](https://arxiv.org/abs/1611.06453)
Haichen Shen, Seungyeop Han, Matthai Philipose, Arvind Krishnamurthy
_**Abstract**_:  _[Recent advances have enabled "oracle" classifiers that can classify across many classes and input distributions with high accuracy without retraining. However, these classifiers are relatively heavyweight, so that applying them to classify video is costly. We show that day-to-day video exhibits highly skewed class distributions over the short term, and that these distributions can be classified by much simpler models. We formulate the problem of detecting the short-term skews online and exploiting models based on it as a new sequential decision making problem dubbed the Online Bandit Problem, and present a new algorithm to solve it. When applied to recognizing faces in TV shows and movies, we realize end-to-end classification speedups of 2.5-8.5x/2.8-12.7x (on GPU/CPU) relative to a state-of-the-art convolutional neural network, at competitive accuracy. ]_
_**Comment**_:  _< skewed class distributions, as a new sequential decision making problem, lightweight>_
#### *Deep Metric Learning via Facility Location [PDF](https://arxiv.org/abs/1612.01213)
Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, Kevin Murphy
_**Abstract**_:  _[Learning the representation and the similarity metric in an end-to-end fashion with deep networks have demonstrated outstanding results for clustering and retrieval. However, these recent approaches still suffer from the performance degradation stemming from the local metric training procedure which is unaware of the global structure of the embedding space. We propose a global metric learning scheme for optimizing the deep metric embedding with the learnable clustering function and the clustering metric (NMI) in a novel structured prediction framework. Our experiments on CUB200-2011, Cars196, and Stanford online products datasets show state of the art performance both on the clustering and retrieval tasks measured in the NMI and Recall@K evaluation metrics. ]_
_**Comment**_:  _< global metric learning using DL>_
#### Semi-Supervised Deep Learning for Monocular Depth Map Prediction [PDF](https://arxiv.org/abs/1702.02706)
Yevhen Kuznietsov, Jörg Stückler, Bastian Leibe
_**Abstract**_:  _[Supervised deep learning often suffers from the lack of sufficient training data. Specifically in the context of monocular depth map prediction, it is barely possible to determine dense ground truth depth images in realistic dynamic outdoor environments. When using LiDAR sensors, for instance, noise is present in the distance measurements, the calibration between sensors cannot be perfect, and the measurements are typically much sparser than the camera images. In this paper, we propose a novel approach to depth map prediction from monocular images that learns in a semi-supervised way. While we use sparse ground-truth depth for supervised learning, we also enforce our deep network to produce photoconsistent dense depth maps in a stereo setup using a direct image alignment loss. In experiments we demonstrate superior performance in depth map prediction from single images compared to the state-of-the-art methods. ]_
_**Comment**_:  _< monocular depth map prediction, supervised + semi-supervised>_
#### *Weakly Supervised Semantic Segmentation Using Web-Crawled Videos [PDF](https://arxiv.org/abs/1701.00352)
Seunghoon Hong, Donghun Yeo, Suha Kwak, Honglak Lee, Bohyung Han
_**Abstract**_:  _[We propose a novel algorithm for weakly supervised semantic segmentation based on image-level class labels only. In weakly supervised setting, it is commonly observed that trained model overly focuses on discriminative parts rather than the entire object area. Our goal is to overcome this limitation with no additional human intervention by retrieving videos relevant to target class labels from web repository, and generating segmentation labels from the retrieved videos to simulate strong supervision for semantic segmentation. During this process, we take advantage of image classification with discriminative localization technique to reject false alarms in retrieved videos and identify relevant spatio-temporal volumes within retrieved videos. Although the entire procedure does not require any additional supervision, the segmentation annotations obtained from videos are sufficiently strong to learn a model for semantic segmentation. The proposed algorithm substantially outperforms existing methods based on the same level of supervision and is even as competitive as the approaches relying on extra annotations. ]_
_**Comment**_:  _< based on image-level class labels only, Guo's project>_
## Oral 2-1A
#### *Making Deep Neural Networks Robust to Label Noise: A Loss Correction Approach [PDF](https://arxiv.org/abs/1609.03683)
Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, Lizhen Qu
_**Abstract**_:  _[We present a theoretically grounded approach to train deep neural networks, including recurrent networks, subject to class-dependent label noise. We propose two procedures for loss correction that are agnostic to both application domain and network architecture. They simply amount to at most a matrix inversion and multiplication, provided that we know the probability of each class being corrupted into another. We further show how one can estimate these probabilities, adapting a recent technique for noise estimation to the multi-class setting, and thus providing an end-to-end framework. Extensive experiments on MNIST, IMDB, CIFAR-10, CIFAR-100 and a large scale dataset of clothing images employing a diversity of architectures --- stacking dense, convolutional, pooling, dropout, batch normalization, word embedding, LSTM and residual layers --- demonstrate the noise robustness of our proposals. Incidentally, we also prove that, when ReLU is the only non-linearity, the loss curvature is immune to class-dependent label noise. ]_
_**Comment**_:  _< robust to label noise>_
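The abstract says the corrections amount to "at most a matrix inversion and multiplication" once the class-corruption probabilities are known. A minimal NumPy sketch of how that could look, assuming a known noise matrix `T` with `T[i, j] = P(noisy label = j | clean label = i)` and a plain cross-entropy loss. The function names and the toy setup are my own assumptions, not the authors' code.

```python
import numpy as np

def softmax(logits):
    # Numerically stable row-wise softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward_corrected_loss(logits, noisy_labels, T):
    """Cross-entropy on predictions multiplied by the noise matrix T."""
    p_clean = softmax(logits)      # model's belief about the clean label
    p_noisy = p_clean @ T          # implied distribution over noisy labels
    n = len(noisy_labels)
    return -np.log(p_noisy[np.arange(n), noisy_labels]).mean()

def backward_corrected_loss(logits, noisy_labels, T):
    """Per-class cross-entropy losses multiplied by the inverse of T."""
    per_class = -np.log(softmax(logits))          # loss for every possible label
    corrected = per_class @ np.linalg.inv(T).T    # row i holds T^{-1} applied to losses of sample i
    n = len(noisy_labels)
    return corrected[np.arange(n), noisy_labels].mean()
```

With `T` equal to the identity matrix both functions reduce to ordinary cross-entropy. For a general invertible `T`, averaging the inverse-corrected loss over the noisy-label distribution of a given clean class recovers the clean per-class loss, which is the property the sketch is meant to illustrate.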
#### *Learning From Simulated and Unsupervised Images Through Adversarial Training [PDF](https://arxiv.org/abs/1612.07828)
Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, Russell Webb
_**Abstract**_:  _[With recent progress in graphics, it has become more tractable to train models on synthetic images, potentially avoiding the need for expensive annotations. However, learning from synthetic images may not achieve the desired performance due to a gap between synthetic and real image distributions. To reduce this gap, we propose Simulated+Unsupervised (S+U) learning, where the task is to learn a model to improve the realism of a simulator's output using unlabeled real data, while preserving the annotation information from the simulator. We develop a method for S+U learning that uses an adversarial network similar to Generative Adversarial Networks (GANs), but with synthetic images as inputs instead of random vectors. We make several key modifications to the standard GAN algorithm to preserve annotations, avoid artifacts and stabilize training: (i) a 'self-regularization' term, (ii) a local adversarial loss, and (iii) updating the discriminator using a history of refined images. We show that this enables generation of highly realistic images, which we demonstrate both qualitatively and with a user study. We quantitatively evaluate the generated images by training models for gaze estimation and hand pose estimation. We show a significant improvement over using synthetic images, and achieve state-of-the-art results on the MPIIGaze dataset without any labeled real data. ]_
_**Comment**_:  _< train on synthetic images (there exists a gap), Simulated+Unsupervised (S+U) learning; how about images from a physical model?>_
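Modification (iii) in the abstract, updating the discriminator using a history of refined images, can be pictured as a small replay buffer. The sketch below is only an illustration under my own assumptions: the class name `RefinedImageHistory`, the half-and-half batch split, and the random-overwrite policy are hypothetical choices, and images are stood in for by arbitrary Python objects.

```python
import random

class RefinedImageHistory:
    """Hypothetical buffer of previously refined images for discriminator updates."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.rng = random.Random(seed)

    def mix(self, fresh):
        """Return a discriminator batch, then update the history with fresh images."""
        half = len(fresh) // 2
        if half > 0 and len(self.buffer) >= half:
            # Half of the batch comes from the history, the rest from the current refiner.
            batch = list(fresh[:len(fresh) - half]) + self.rng.sample(self.buffer, half)
        else:
            # Not enough history yet: use the fresh images only.
            batch = list(fresh)
        for image in fresh[:half]:
            if len(self.buffer) < self.capacity:
                self.buffer.append(image)
            else:
                # Overwrite a random slot once the buffer is full.
                self.buffer[self.rng.randrange(self.capacity)] = image
        return batch
```

The point of such a buffer is that the discriminator keeps seeing older refiner outputs, so it does not forget artifacts the refiner produced earlier in training.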
#### *Inverse Compositional Spatial Transformer Networks [PDF](https://arxiv.org/abs/1612.03897)
Chen-Hsuan Lin, Simon Lucey
_**Abstract**_:  _[In this paper, we establish a theoretical connection between the classical Lucas & Kanade (LK) algorithm and the emerging topic of Spatial Transformer Networks (STNs). STNs are of interest to the vision and learning communities due to their natural ability to combine alignment and classification within the same theoretical framework. Inspired by the Inverse Compositional (IC) variant of the LK algorithm, we present Inverse Compositional Spatial Transformer Networks (IC-STNs). We demonstrate that IC-STNs can achieve better performance than conventional STNs with less model capacity; in particular, we show superior performance in pure image alignment tasks as well as joint alignment/classification problems on real-world problems. ]_
_**Comment**_:  _< LK algorithm & STN, image alignment tasks>_
#### ***Densely Connected Convolutional Networks [PDF](https://arxiv.org/abs/1608.06993) [Code](https://github.com/liuzhuang13/DenseNet)
Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger
_**Abstract**_:  _[Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance. Code and models are available at https://github.com/liuzhuang13/DenseNet. ]_
_**Comment**_:  _< shorter connections between 'input' & 'output' layers, connects each layer to every other layer>_
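To make the L(L+1)/2 count and the "all preceding feature-maps as inputs" idea concrete, here is a toy NumPy sketch. It is not the released DenseNet code: each layer is replaced by a random 1x1 convolution plus ReLU, and the names `count_dense_connections`, `dense_block` and `growth_rate` are assumptions made for illustration.

```python
import numpy as np

def count_dense_connections(num_layers):
    # Layer l receives l inputs: the block input plus the l-1 earlier layers.
    return sum(range(1, num_layers + 1))

def dense_block(x, num_layers, growth_rate, rng):
    """Toy dense block on a (channels, height, width) array."""
    features = [x]
    for _ in range(num_layers):
        # Every layer sees the concatenation of all earlier feature maps.
        inp = np.concatenate(features, axis=0)
        # Stand-in for a learned layer: random 1x1 convolution followed by ReLU.
        w = rng.standard_normal((growth_rate, inp.shape[0])) * 0.1
        out = np.maximum(np.einsum('oc,chw->ohw', w, inp), 0.0)
        features.append(out)
    return np.concatenate(features, axis=0)
```

Calling `dense_block(x, num_layers=4, growth_rate=12, rng=np.random.default_rng(0))` on an input with 3 channels returns 3 + 4 * 12 channels, since each layer adds `growth_rate` new feature maps and nothing is discarded.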
# Computational Photography
## Spotlight 2-1B
Visual Dialog
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, Dhruv Batra
Video Frame Interpolation via Adaptive Convolution
Simon Niklaus, Long Mai, Feng Liu
FastMask: Segment Multi-Scale Object Candidates in One Shot
Hexiang Hu, Shiyi Lan, Yuning Jiang, Zhimin Cao, Fei Sha
Reconstructing Transient Images From Single-Photon Sensors
Matthew O'Toole, Felix Heide, David B. Lindell, Kai Zang, Steven Diamond, Gordon Wetzstein
DeshadowNet: A Multi-Context Embedding Deep Network for Shadow Removal
Liangqiong Qu, Jiandong Tian, Shengfeng He, Yandong Tang, Rynson W. H. Lau
Illuminant-Camera Communication to Observe Moving Objects Under Strong External Light by Spread Spectrum Modulation
Ryusuke Sagawa, Yutaka Satoh
Photorealistic Facial Texture Inference Using Deep Neural Networks
Shunsuke Saito, Lingyu Wei, Liwen Hu, Koki Nagano, Hao Li
The Geometry of First-Returning Photons for Non-Line-Of-Sight Imaging
Chia-Yin Tsai, Kiriakos N. Kutulakos, Srinivasa G. Narasimhan, Aswin C. Sankaranarayanan
## Oral 2-1B
Unrolling the Shutter: CNN to Correct Motion Distortions
Vijay Rengarajan, Yogesh Balaji, A. N. Rajagopalan
Light Field Blind Motion Deblurring
Pratul P. Srinivasan, Ren Ng, Ravi Ramamoorthi
Computational Imaging on the Electric Grid
Mark Sheinin, Yoav Y. Schechner, Kiriakos N. Kutulakos
Deep Outdoor Illumination Estimation
Yannick Hold-Geoffroy, Kalyan Sunkavalli, Sunil Hadap, Emiliano Gambaretto, Jean-François Lalonde
# 3D Vision 2
## Spotlight 2-1C
Efficient Solvers for Minimal Problems by Syzygy-Based Reduction
Viktor Larsson, Kalle Åström, Magnus Oskarsson
HSfM: Hybrid Structure-from-Motion
Hainan Cui, Xiang Gao, Shuhan Shen, Zhanyi Hu
Efficient Global Point Cloud Alignment Using Bayesian Nonparametric Mixtures
Julian Straub, Trevor Campbell, Jonathan P. How, John W. Fisher III
A New Rank Constraint on Multi-View Fundamental Matrices, and Its Application to Camera Location Recovery
Soumyadip Sengupta, Tal Amir, Meirav Galun, Tom Goldstein, David W. Jacobs, Amit Singer, Ronen Basri
IM2CAD
Hamid Izadinia, Qi Shan, Steven M. Seitz
ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes
Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, Matthias Nießner
Noise Robust Depth From Focus Using a Ring Difference Filter
Jaeheung Surh, Hae-Gon Jeon, Yunwon Park, Sunghoon Im, Hyowon Ha, In So Kweon
Group-Wise Point-Set Registration Based on Rényi's Second Order Entropy
Luis G. Sanchez Giraldo, Erion Hasanbelliu, Murali Rao, Jose C. Principe
## Oral 2-1C
A Point Set Generation Network for 3D Object Reconstruction From a Single Image
Haoqiang Fan, Hao Su, Leonidas J. Guibas
3D Point Cloud Registration for Localization Using a Deep Neural Network Auto-Encoder
Gil Elbaz, Tamar Avraham, Anath Fischer
Flight Dynamics-Based Recovery of a UAV Trajectory Using Ground Cameras
Artem Rozantsev, Sudipta N. Sinha, Debadeepta Dey, Pascal Fua
DSAC - Differentiable RANSAC for Camera Localization
Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, Carsten Rother
# Poster 2-1
3D Computer Vision
Scalable Surface Reconstruction From Point Clouds With Extreme Scale and Density Diversity
Christian Mostegel, Rudolf Prettenthaler, Friedrich Fraundorfer, Horst Bischof
Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes With Deep Generative Networks
Amir Arsalan Soltani, Haibin Huang, Jiajun Wu, Tejas D. Kulkarni, Joshua B. Tenenbaum
General Models for Rational Cameras and the Case of Two-Slit Projections
Matthew Trager, Bernd Sturmfels, John Canny, Martial Hebert, Jean Ponce
Accurate Depth and Normal Maps From Occlusion-Aware Focal Stack Symmetry
Michael Strecke, Anna Alperovich, Bastian Goldluecke
A Multi-View Stereo Benchmark With High-Resolution Images and Multi-Camera Videos
Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, Andreas Geiger
Non-Contact Full Field Vibration Measurement Based on Phase-Shifting
Hiroyuki Kayaba, Yuji Kokumai
A Minimal Solution for Two-View Focal-Length Estimation Using Two Affine Correspondences
Daniel Barath, Tekla Toth, Levente Hajder
PoseAgent: Budget-Constrained 6D Object Pose Estimation via Reinforcement Learning
Alexander Krull, Eric Brachmann, Sebastian Nowozin, Frank Michel, Jamie Shotton, Carsten Rother
An Efficient Background Term for 3D Reconstruction and Tracking With Smooth Surface Models
Mariano Jaimez, Thomas J. Cashman, Andrew Fitzgibbon, Javier Gonzalez-Jimenez, Daniel Cremers
Analyzing Humans in Images
Reliable Crowdsourcing and Deep Locality-Preserving Learning for Expression Recognition in the Wild
Shan Li, Weihong Deng, JunPing Du
Procedural Generation of Videos to Train Deep Action Recognition Networks
César Roberto de Souza, Adrien Gaidon, Yohann Cabon, Antonio Manuel López
BigHand2.2M Benchmark: Hand Pose Dataset and State of the Art Analysis
Shanxin Yuan, Qi Ye, Björn Stenger, Siddhant Jain, Tae-Kyun Kim
DenseReg: Fully Convolutional Dense Shape Regression In-The-Wild
Rıza Alp Güler, George Trigeorgis, Epameinondas Antonakos, Patrick Snape, Stefanos Zafeiriou, Iasonas Kokkinos
Adaptive Class Preserving Representation for Image Classification
Jian-Xun Mi, Qiankun Fu, Weisheng Li
Applications
Generalized Semantic Preserving Hashing for N-Label Cross-Modal Retrieval
Devraj Mandal, Kunal N. Chaudhury, Soma Biswas
EAST: An Efficient and Accurate Scene Text Detector
Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, Jiajun Liang
VidLoc: A Deep Spatio-Temporal Model for 6-DoF Video-Clip Relocalization
Ronald Clark, Sen Wang, Andrew Markham, Niki Trigoni, Hongkai Wen
Biomedical Image/Video Analysis
Improving RANSAC-Based Segmentation Through CNN Encapsulation
Dustin Morley, Hassan Foroosh
Computational Photography
Position Tracking for Virtual Reality Using Commodity WiFi
Manikanta Kotaru, Sachin Katti
Designing Illuminant Spectral Power Distributions for Surface Classification
Henryk Blasinski, Joyce Farrell, Brian Wandell
One-Shot Hyperspectral Imaging Using Faced Reflectors
Tsuyoshi Takatani, Takahito Aoto, Yasuhiro Mukaigawa
Image Motion & Tracking
Direct Photometric Alignment by Mesh Deformation
Kaimo Lin, Nianjuan Jiang, Shuaicheng Liu, Loong-Fah Cheong, Minh Do, Jiangbo Lu
CNN-Based Patch Matching for Optical Flow With Thresholded Hinge Embedding Loss
Christian Bailer, Kiran Varanasi, Didier Stricker
Optical Flow Estimation Using a Spatial Pyramid Network
Anurag Ranjan, Michael J. Black
Deep Network Flow for Multi-Object Tracking
Manmohan Chandraker, Paul Vernaza, Wongun Choi, Samuel Schulter
Low- & Mid-Level Vision
Material Classification Using Frequency- and Depth-Dependent Time-Of-Flight Distortion
Kenichiro Tanaka, Yasuhiro Mukaigawa, Takuya Funatomi, Hiroyuki Kubo, Yasuyuki Matsushita, Yasushi Yagi
Benchmarking Denoising Algorithms With Real Photographs
Tobias Plötz, Stefan Roth
A Unified Approach of Multi-Scale Deep and Hand-Crafted Features for Defocus Estimation
Jinsun Park, Yu-Wing Tai, Donghyeon Cho, In So Kweon
StyleBank: An Explicit Representation for Neural Image Style Transfer
Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, Gang Hua
Specular Highlight Removal in Facial Images
Chen Li, Stephen Lin, Kun Zhou, Katsushi Ikeuchi
Image Super-Resolution via Deep Recursive Residual Network
Ying Tai, Jian Yang, Xiaoming Liu
Deep Image Harmonization
Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, Ming-Hsuan Yang
Learning Deep CNN Denoiser Prior for Image Restoration
Kai Zhang, Wangmeng Zuo, Shuhang Gu, Lei Zhang
A Novel Tensor-Based Video Rain Streaks Removal Approach via Utilizing Discriminatively Intrinsic Priors
Tai-Xiang Jiang, Ting-Zhu Huang, Xi-Le Zhao, Liang-Jian Deng, Yao Wang
GMS: Grid-based Motion Statistics for Fast, Ultra-Robust Feature Correspondence
JiaWang Bian, Wen-Yan Lin, Yasuyuki Matsushita, Sai-Kit Yeung, Tan-Dat Nguyen, Ming-Ming Cheng
Video Desnowing and Deraining Based on Matrix Decomposition
Weihong Ren, Jiandong Tian, Zhi Han, Antoni Chan, Yandong Tang
Real-Time Video Super-Resolution With Spatio-Temporal Networks and Motion Compensation
Jose Caballero, Christian Ledig, Andrew Aitken, Alejandro Acosta, Johannes Totz, Zehan Wang, Wenzhe Shi
Deep Watershed Transform for Instance Segmentation
Min Bai, Raquel Urtasun
AnchorNet: A Weakly Supervised Network to Learn Geometry-Sensitive Features for Semantic Matching
David Novotny, Diane Larlus, Andrea Vedaldi
Learning Diverse Image Colorization
Aditya Deshpande, Jiajun Lu, Mao-Chuang Yeh, Min Jin Chong, David Forsyth
Awesome Typography: Statistics-Based Text Effects Transfer
Shuai Yang, Jiaying Liu, Zhouhui Lian, Zongming Guo
Machine Learning
Unsupervised Video Summarization With Adversarial LSTM Networks
Behrooz Mahasseni, Michael Lam, Sinisa Todorovic
Deep TEN: Texture Encoding Network
Hang Zhang, Jia Xue, Kristin Dana
Order-Preserving Wasserstein Distance for Sequence Matching
Bing Su, Gang Hua
A Dual Ascent Framework for Lagrangean Decomposition of Combinatorial Problems
Paul Swoboda, Jan Kuske, Bogdan Savchynskyy
Attend in Groups: A Weakly-Supervised Deep Learning Framework for Learning From Web Data
Bohan Zhuang, Lingqiao Liu, Yao Li, Chunhua Shen, Ian Reid
Hierarchical Multimodal Metric Learning for Multimodal Classification
Heng Zhang, Vishal M. Patel, Rama Chellappa
Efficient Linear Programming for Dense CRFs
Thalaiyasingam Ajanthan, Alban Desmaison, Rudy Bunel, Mathieu Salzmann, Philip H. S. Torr, M. Pawan Kumar
Variational Autoencoded Regression: High Dimensional Regression of Visual Data on Complex Manifold
YoungJoon Yoo, Sangdoo Yun, Hyung Jin Chang, Yiannis Demiris, Jin Young Choi
Learning Random-Walk Label Propagation for Weakly-Supervised Semantic Segmentation
Paul Vernaza, Manmohan Chandraker
Low-Rank-Sparse Subspace Representation for Robust Regression
Yongqiang Zhang, Daming Shi, Junbin Gao, Dansong Cheng
Object Recognition & Scene Understanding
Generating the Future With Adversarial Transformers
Carl Vondrick, Antonio Torralba
Semantic Amodal Segmentation
Yan Zhu, Yuandong Tian, Dimitris Metaxas, Piotr Dollár
Learning a Deep Embedding Model for Zero-Shot Learning
Li Zhang, Tao Xiang, Shaogang Gong
BIND: Binary Integrated Net Descriptors for Texture-Less Object Recognition
Jacob Chan, Jimmy Addison Lee, Qian Kemao
Growing a Brain: Fine-Tuning by Increasing Model Capacity
Yu-Xiong Wang, Deva Ramanan, Martial Hebert
A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection
Xiaolong Wang, Abhinav Shrivastava, Abhinav Gupta
Multiple Instance Detection Network With Online Instance Classifier Refinement
Peng Tang, Xinggang Wang, Xiang Bai, Wenyu Liu
Kernel Pooling for Convolutional Neural Networks
Yin Cui, Feng Zhou, Jiang Wang, Xiao Liu, Yuanqing Lin, Serge Belongie
Learning Cross-Modal Embeddings for Cooking Recipes and Food Images
Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, Antonio Torralba
Zero-Shot Learning - the Good, the Bad and the Ugly
Yongqin Xian, Bernt Schiele, Zeynep Akata
DeepNav: Learning to Navigate Large Cities
Samarth Brahmbhatt, James Hays
Scene Graph Generation by Iterative Message Passing
Danfei Xu, Yuke Zhu, Christopher B. Choy, Li Fei-Fei
Visual Translation Embedding Network for Visual Relation Detection
Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, Tat-Seng Chua
Unsupervised Part Learning for Visual Recognition
Ronan Sicre, Yannis Avrithis, Ewa Kijak, Frédéric Jurie
Comprehension-Guided Referring Expressions
Ruotian Luo, Gregory Shakhnarovich
Top-Down Visual Saliency Guided by Captions
Vasili Ramanishka, Abir Das, Jianming Zhang, Kate Saenko
Theory
Grassmannian Manifold Optimization Assisted Sparse Spectral Clustering
Qiong Wang, Junbin Gao, Hong Li
Video Analytics
Video Propagation Networks
Varun Jampani, Raghudeep Gadde, Peter V. Gehler
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell
SCC: Semantic Context Cascade for Efficient Action Detection
Fabian Caba Heilbron, Wayner Barrios, Victor Escorcia, Bernard Ghanem
Hierarchical Boundary-Aware Neural Encoder for Video Captioning
Lorenzo Baraldi, Costantino Grana, Rita Cucchiara
HOPE: Hierarchical Object Prototype Encoding for Efficient Object Instance Search in Videos
Tan Yu, Yuwei Wu, Junsong Yuan
Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos
Ionut Cosmin Duta, Bogdan Ionescu, Kiyoharu Aizawa, Nicu Sebe
Temporal Action Localization by Structured Maximal Sums
Zehuan Yuan, Jonathan C. Stroud, Tong Lu, Jia Deng
Predicting Salient Face in Multiple-Face Videos
Yufan Liu, Songyang Zhang, Mai Xu, Xuming He
# Object Recognition & Scene Understanding 1
## Spotlight 2-2A
#### Graph-Structured Representations for Visual Question Answering [PDF](https://arxiv.org/abs/1609.05600)
Damien Teney, Lingqiao Liu, Anton van den Hengel
_**Abstract**_:  _[This paper proposes to improve visual question answering (VQA) with structured representations of both scene contents and questions. A key challenge in VQA is to require joint reasoning over the visual and text domains. The predominant CNN/LSTM-based approach to VQA is limited by monolithic vector representations that largely ignore structure in the scene and in the form of the question. CNN feature vectors cannot effectively capture situations as simple as multiple object instances, and LSTMs process questions as series of words, which does not reflect the true complexity of language structure. We instead propose to build graphs over the scene objects and over the question words, and we describe a deep neural network that exploits the structure in these representations. This shows significant benefit over the sequential processing of LSTMs. The overall efficacy of our approach is demonstrated by significant improvements over the state-of-the-art, from 71.2% to 74.4% in accuracy on the "abstract scenes" multiple-choice benchmark, and from 34.7% to 39.1% in accuracy over pairs of "balanced" scenes, i.e. images with fine-grained differences and opposite yes/no answers to a same question. ]_
_**Comment**_:  _< structured representations of both scene contents and questions>_
#### Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning [PDF](https://arxiv.org/abs/1612.01887)
Jiasen Lu, Caiming Xiong, Devi Parikh, Richard Socher
_**Abstract**_:  _[Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as "the" and "of". Other words that may seem visual can often be predicted reliably just from the language model e.g., "sign" after "behind a red stop" or "phone" following "talking on a cell". In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin. ]_
_**Comment**_:  _< >_
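The choice between attending to image regions and attending to the visual sentinel can be sketched as one softmax over the region scores plus a single extra sentinel score. This is a rough illustration under my own assumptions: the scores are passed in directly instead of being computed from the decoder state, and the function name `adaptive_attention` is hypothetical.

```python
import numpy as np

def adaptive_attention(region_feats, sentinel, region_scores, sentinel_score):
    """Blend region features and a sentinel vector with one shared softmax.

    region_feats: (k, d) array, one feature vector per image region.
    sentinel: (d,) array standing in for the language-model fallback.
    """
    scores = np.append(region_scores, sentinel_score)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    beta = weights[-1]  # share of attention given to the sentinel
    context = weights[:-1] @ region_feats + beta * sentinel
    return context, beta
```

When the sentinel score dominates, `beta` approaches 1 and the context is essentially the sentinel, which matches the intuition for non-visual words such as "the" and "of". When it is very low, the context falls back to an ordinary attention-weighted average of the regions.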
#### *Learned Contextual Feature Reweighting for Image Geo-Localization [PDF](https://hyojin.web.unc.edu/files/2017/06/CVPR2017_0780.pdf)
Hyo Jin Kim, Enrique Dunn, Jan-Michael Frahm
_**Abstract**_:  _[We address the problem of large scale image geolocalization where the location of an image is estimated by identifying geo-tagged reference images depicting the same place. We propose a novel model for learning image representations that integrates context-aware feature reweighting in order to effectively focus on regions that positively contribute to geo-localization. In particular, we introduce a Contextual Reweighting Network (CRN) that predicts the importance of each region in the feature map based on the image context. Our model is learned end-to-end for the image geo-localization task, and requires no annotation other than image geo-tags for training. In experimental results, the proposed approach significantly outperforms the previous state-of-the-art on the standard geo-localization benchmark datasets. We also demonstrate that our CRN discovers task-relevant contexts without any additional supervision.]_
_**Comment**_:  _< geolocalization by geo-tagged reference images, predicts the importance of each region in the feature map based on the image context>_
#### End-To-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering [PDF](https://128.84.21.199/abs/1610.02947v2)
Youngjae Yu, Hyungjin Ko, Jongwook Choi, Gunhee Kim
_**Abstract**_:  _[We propose a high-level concept word detector that can be integrated with any video-to-language models. It takes a video as input and generates a list of concept words as useful semantic priors for language generation models. The proposed word detector has two important properties. First, it does not require any external knowledge sources for training. Second, the proposed word detector is trainable in an end-to-end manner jointly with any video-to-language models. To maximize the values of detected words, we also develop a semantic attention mechanism that selectively focuses on the detected concept words and fuse them with the word encoding and decoding in the language model. In order to demonstrate that the proposed approach indeed improves the performance of multiple video-to-language tasks, we participate in four tasks of LSMDC 2016. Our approach achieves the best accuracies in three of them, including fill-in-the-blank, multiple-choice test, and movie retrieval. We also attain comparable performance for the other task, movie description.]_
_**Comment**_:  _< concept word detector with any video-to-language tasks>_
#### Deep Cross-Modal Hashing [PDF](https://arxiv.org/abs/1602.02255)
Qing-Yuan Jiang, Wu-Jun Li
_**Abstract**_:  _[Due to its low storage cost and fast query speed, cross-modal hashing (CMH) has been widely used for similarity search in multimedia retrieval applications. However, almost all existing CMH methods are based on hand-crafted features which might not be optimally compatible with the hash-code learning procedure. As a result, existing CMH methods with handcrafted features may not achieve satisfactory performance. In this paper, we propose a novel cross-modal hashing method, called deep crossmodal hashing (DCMH), by integrating feature learning and hash-code learning into the same framework. DCMH is an end-to-end learning framework with deep neural networks, one for each modality, to perform feature learning from scratch. Experiments on two real datasets with text-image modalities show that DCMH can outperform other baselines to achieve the state-of-the-art performance in cross-modal retrieval applications. ]_
_**Comment**_:  _< cross-modal hashing for similarity search, integrating feature learning and hash-code learning into the same framework>_
#### Unambiguous Text Localization and Retrieval for Cluttered Scenes [PDF](http://www-ee.ccny.cuny.edu/wwwn/yltian/Publications/CVPR17-2326.pdf)
Xuejian Rong, Chucai Yi, Yingli Tian
_**Abstract**_:  _[]_
_**Comment**_:  _< Text instance localization>_
#### Bayesian Supervised Hashing
Zihao Hu, Junxuan Chen, Hongtao Lu, Tongzhen Zhang
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### ***Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors [PDF](https://arxiv.org/abs/1611.10012)
Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy
_**Abstract**_:  _[The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end, we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN [Ren et al., 2015], R-FCN [Dai et al., 2016] and SSD [Liu et al., 2015] systems, which we view as "meta-architectures" and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that achieves real time speeds and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.]_
_**Comment**_:  _< a guide for speed/memory/accuracy balance, comparison>_
## Oral 2-2A
#### **Detecting Visual Relationships With Deep Relational Networks [PDF](https://arxiv.org/abs/1704.03114)
Bo Dai, Yuqi Zhang, Dahua Lin
_**Abstract**_:  _[Relationships among objects play a crucial role in image understanding. Despite the great success of deep learning techniques in recognizing individual objects, reasoning about the relationships among objects remains a challenging task. Previous methods often treat this as a classification problem, considering each type of relationship (e.g. "ride") or each distinct visual phrase (e.g. "person-ride-horse") as a category. Such approaches are faced with significant difficulties caused by the high diversity of visual appearance for each kind of relationships or the large number of distinct visual phrases. We propose an integrated framework to tackle this problem. At the heart of this framework is the Deep Relational Network, a novel formulation designed specifically for exploiting the statistical dependencies between objects and their relationships. On two large datasets, the proposed method achieves substantial improvement over state-of-the-art.]_
_**Comment**_:  _< Relationships among objects, Deep Relational Network >_
#### *Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes [PDF](https://arxiv.org/abs/1611.08323)
Tobias Pohlen, Alexander Hermans, Markus Mathias, Bastian Leibe
_**Abstract**_:  _[Semantic image segmentation is an essential component of modern autonomous driving systems, as an accurate understanding of the surrounding scene is crucial to navigation and action planning. Current state-of-the-art approaches in semantic image segmentation rely on pre-trained networks that were initially developed for classifying images as a whole. While these networks exhibit outstanding recognition performance (i.e., what is visible?), they lack localization accuracy (i.e., where precisely is something located?). Therefore, additional processing steps have to be performed in order to obtain pixel-accurate segmentation masks at the full image resolution. To alleviate this problem we propose a novel ResNet-like architecture that exhibits strong localization and recognition performance. We combine multi-scale context with pixel-level accuracy by using two processing streams within our network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition. The two streams are coupled at the full image resolution using residuals. Without additional processing steps and without pre-training, our approach achieves an intersection-over-union score of 71.8% on the Cityscapes dataset. ]_
_**Comment**_:  _< semantic image segmentation, using two processing streams (one at full image resolution, one for robust features) >_
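
The 71.8% figure above is an intersection-over-union (IoU) score. As a rough illustration of how such a score is computed, here is a minimal sketch in plain Python. It assumes predictions and ground truth are given as flat lists of integer class labels; the function name `mean_iou` and its arguments are hypothetical and are not taken from the paper or from the Cityscapes evaluation scripts.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes that occur in pred or target.

    pred and target are equal-length sequences of integer class labels
    (for example, all pixels of a dataset flattened into one list).
    """
    ious = []
    for c in range(num_classes):
        # Pixels labeled c in both prediction and ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labeled c in at least one of the two.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that never occur
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example with 4 "pixels" and 2 classes.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Official benchmark tooling may differ in details such as ignored labels, so this is only meant to make the metric concrete.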
#### ***Network Dissection: Quantifying Interpretability of Deep Visual Representations [PDF](https://arxiv.org/abs/1704.05796)
David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, Antonio Torralba
_**Abstract**_:  _[We propose a general framework called Network Dissection for quantifying the interpretability of latent representations of CNNs by evaluating the alignment between individual hidden units and a set of semantic concepts. Given any CNN model, the proposed method draws on a broad data set of visual concepts to score the semantics of hidden units at each intermediate convolutional layer. The units with semantics are given labels across a range of objects, parts, scenes, textures, materials, and colors. We use the proposed method to test the hypothesis that interpretability of units is equivalent to random linear combinations of units, then we apply our method to compare the latent representations of various networks when trained to solve different supervised and self-supervised training tasks. We further analyze the effect of training iterations, compare networks trained with different initializations, examine the impact of network depth and width, and measure the effect of dropout and batch normalization on the interpretability of deep visual representations. We demonstrate that the proposed method can shed light on characteristics of CNN models and training methods that go beyond measurements of their discriminative power. ]_
_**Comment**_:  _< scores the semantics of hidden units against labeled visual concepts for different CNNs; units with semantics are given labels; tests the hypothesis that the interpretability of units is equivalent to that of random linear combinations of units >_
#### **AGA: Attribute-Guided Augmentation [PDF](https://arxiv.org/abs/1612.02559)
Mandar Dixit, Roland Kwitt, Marc Niethammer, Nuno Vasconcelos
_**Abstract**_:  _[We consider the problem of data augmentation, i.e., generating artificial samples to extend a given corpus of training data. Specifically, we propose attributed-guided augmentation (AGA) which learns a mapping that allows to synthesize data such that an attribute of a synthesized sample is at a desired value or strength. This is particularly interesting in situations where little data with no attribute annotation is available for learning, but we have access to a large external corpus of heavily annotated samples. While prior works primarily augment in the space of images, we propose to perform augmentation in feature space instead. We implement our approach as a deep encoder-decoder architecture that learns the synthesis function in an end-to-end manner. We demonstrate the utility of our approach on the problems of (1) one-shot object recognition in a transfer-learning setting where we have no prior knowledge of the new classes, as well as (2) object-based one-shot scene recognition. As external data, we leverage 3D depth and pose information from the SUN RGB-D dataset. Our experiments show that attribute-guided augmentation of high-level CNN features considerably improves one-shot recognition performance on both problems. ]_
_**Comment**_:  _< attribute-guided data augmentation in feature space, a deep encoder-decoder architecture; transfer learning, one-shot recognition >_
# Analyzing Humans 2
## Spotlight 2-2B
#### A Hierarchical Approach for Generating Descriptive Image Paragraphs [PDF](https://arxiv.org/abs/1611.06607)
Jonathan Krause, Justin Johnson, Ranjay Krishna, Li Fei-Fei
_**Abstract**_:  _[Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail. While one new captioning approach, dense captioning, can potentially describe images in finer levels of detail by captioning many regions within an image, it in turn is unable to produce a coherent story for an image. In this paper we overcome these limitations by generating entire paragraphs for describing images, which can tell detailed, unified stories. We develop a model that decomposes both images and paragraphs into their constituent parts, detecting semantic regions in images and using a hierarchical recurrent neural network to reason about language. Linguistic analysis confirms the complexity of the paragraph generation task, and thorough experiments on a new dataset of image and paragraph pairs demonstrate the effectiveness of our approach. ]_
_**Comment**_:  _< generating entire paragraphs for describing images>_
#### Person Re-Identification in the Wild [PDF](https://arxiv.org/abs/1604.02531)
Liang Zheng, Hengheng Zhang, Shaoyan Sun, Manmohan Chandraker, Yi Yang, Qi Tian
_**Abstract**_:  _[We present a novel large-scale dataset and comprehensive baselines for end-to-end pedestrian detection and person recognition in raw video frames. Our baselines address three issues: the performance of various combinations of detectors and recognizers, mechanisms for pedestrian detection to help improve overall re-identification accuracy and assessing the effectiveness of different detectors for re-identification. We make three distinct contributions. First, a new dataset, PRW, is introduced to evaluate Person Re-identification in the Wild, using videos acquired through six synchronized cameras. It contains 932 identities and 11,816 frames in which pedestrians are annotated with their bounding box positions and identities. Extensive benchmarking results are presented on this dataset. Second, we show that pedestrian detection aids re-identification through two simple yet effective improvements: a discriminatively trained ID-discriminative Embedding (IDE) in the person subspace using convolutional neural network (CNN) features and a Confidence Weighted Similarity (CWS) metric that incorporates detection scores into similarity measurement. Third, we derive insights in evaluating detector performance for the particular scenario of accurate person re-identification. ]_
_**Comment**_:  _< end-to-end pedestrian detection and person recognition in raw video frames >_
#### Scalable Person Re-Identification on Supervised Smoothed Manifold [PDF](https://arxiv.org/abs/1703.08359)
Song Bai, Xiang Bai, Qi Tian
_**Abstract**_:  _[Most existing person re-identification algorithms either extract robust visual features or learn discriminative metrics for person images. However, the underlying manifold which those images reside on is rarely investigated. That raises a problem that the learned metric is not smooth with respect to the local geometry structure of the data manifold. In this paper, we study person re-identification with manifold-based affinity learning, which did not receive enough attention from this area. An unconventional manifold-preserving algorithm is proposed, which can 1) make the best use of supervision from training data, whose label information is given as pairwise constraints; 2) scale up to large repositories with low on-line time complexity; and 3) be plunged into most existing algorithms, serving as a generic postprocessing procedure to further boost the identification accuracies. Extensive experimental results on five popular person re-identification benchmarks consistently demonstrate the effectiveness of our method. Especially, on the largest CUHK03 and Market-1501, our method outperforms the state-of-the-art alternatives by a large margin with high efficiency, which is more appropriate for practical applications. ]_
_**Comment**_:  _< manifold-based affinity learning >_
#### Binge Watching: Scaling Affordance Learning From Sitcoms
Xiaolong Wang, Rohit Girdhar, Abhinav Gupta
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Joint Detection and Identification Feature Learning for Person Search [PDF](https://pdfs.semanticscholar.org/cec9/17ae255439e17b9a345556f1e091b6b9485a.pdf?_ga=2.153350832.985571883.1498350464-2104590067.1498350464)
Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, Xiaogang Wang
_**Abstract**_:  _[Existing person re-identification benchmarks and methods mainly focus on matching cropped pedestrian images between queries and candidates. However, it is different from real-world scenarios where the annotations of pedestrian bounding boxes are unavailable and the target person needs to be searched from a gallery of whole scene images. To close the gap, we propose a new deep learning framework for person search. Instead of breaking it down into two separate tasks, pedestrian detection and person re-identification, we jointly handle both aspects in a single convolutional neural network. An Online Instance Matching (OIM) loss function is proposed to train the network effectively, which is scalable to datasets with numerous identities. To validate our approach, we collect and annotate a large-scale benchmark dataset for person search. It contains 18,184 images, 8,432 identities, and 96,143 pedestrian bounding boxes. Experiments show that our framework outperforms other separate approaches, and the proposed OIM loss function converges much faster and better than the conventional Softmax loss.]_
_**Comment**_:  _< jointly handle pedestrian detection and re-identification in a single CNN, OIM loss function >_
#### Synthesizing Normalized Faces From Facial Identity Features [PDF](https://arxiv.org/abs/1701.04851)
Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, William T. Freeman
_**Abstract**_:  _[We present a method for synthesizing a frontal, neutral-expression image of a person's face given an input face photograph. This is achieved by learning to generate facial landmarks and textures from features extracted from a facial-recognition network. Unlike previous approaches, our encoding feature vector is largely invariant to lighting, pose, and facial expression. Exploiting this invariance, we train our decoder network using only frontal, neutral-expression photographs. Since these photographs are well aligned, we can decompose them into a sparse set of landmark points and aligned texture maps. The decoder then predicts landmarks and textures independently and combines them using a differentiable image warping operation. The resulting images can be used for a number of applications, such as analyzing facial attributes, exposure and white balance adjustment, or creating a 3-D avatar. ]_
_**Comment**_:  _< generate a frontal, neutral-expression image, predicts landmarks and textures independently>_
#### Consistent-Aware Deep Learning for Person Re-Identification in a Camera Network
Ji Lin, Liangliang Ren, Jiwen Lu, Jianjiang Feng, Jie Zhou
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Level Playing Field for Million Scale Face Recognition [PDF](https://arxiv.org/abs/1705.00393)
Aaron Nech, Ira Kemelmacher-Shlizerman
_**Abstract**_:  _[Face recognition has the perception of a solved problem, however when tested at the million-scale exhibits dramatic variation in accuracies across the different algorithms. Are the algorithms very different? Is access to good/big training data their secret weapon? Where should face recognition improve? To address those questions, we created a benchmark, MF2, that requires all algorithms to be trained on same data, and tested at the million scale. MF2 is a public large-scale set with 672K identities and 4.7M photos created with the goal to level playing field for large scale face recognition. We contrast our results with findings from the other two large-scale benchmarks MegaFace Challenge and MS-Celebs-1M where groups were allowed to train on any private/public/big/small set. Some key discoveries: 1) algorithms, trained on MF2, were able to achieve state of the art and comparable results to algorithms trained on massive private sets, 2) some outperformed themselves once trained on MF2, 3) invariance to aging suffers from low accuracies as in MegaFace, identifying the need for larger age variations possibly within identities or adjustment of algorithms in future testings. ]_
_**Comment**_:  _< a new face dataset>_
## Oral 2-2B
#### Re-Sign: Re-Aligned End-To-End Sequence Modelling With Deep Recurrent CNN-HMMs [PDF](https://www.researchgate.net/profile/Oscar_Koller/publication/315892054_Re-Sign_Re-Aligned_End-to-End_Sequence_Modelling_with_Deep_Recurrent_CNN-HMMs/links/58ec87800f7e9b6b274bb137/Re-Sign-Re-Aligned-End-to-End-Sequence-Modelling-with-Deep-Recurrent-CNN-HMMs.pdf)
Oscar Koller, Sepehr Zargaran, Hermann Ney
_**Abstract**_:  _[This work presents an iterative realignment approach applicable to visual sequence labelling tasks such as gesture recognition, activity recognition and continuous sign language recognition. Previous methods dealing with video data usually rely on given frame labels to train their classifiers. Looking at recent data sets, these labels often tend to be noisy which is commonly overseen. We propose an algorithm that treats the provided training labels as weak labels and refines the label-to-image alignment on-the-fly in a weakly supervised fashion. Given a series of frames and sequence-level labels, a deep recurrent CNN-BLSTM network is trained end-to-end. Embedded into an HMM, the resulting deep model corrects the frame labels and continuously improves its performance in several realignments. We evaluate on two challenging publicly available sign recognition benchmark data sets featuring over 1000 classes. We outperform the state-of-the-art by up to 10% absolute and 30% relative. ]_
_**Comment**_:  _< visual sequence labelling tasks such as gesture recognition, activity recognition and continuous sign language recognition; using weak labels>_
#### Social Scene Understanding: End-To-End Multi-Person Action Localization and Collective Activity Recognition [PDF](https://arxiv.org/abs/1611.09078)
Timur Bagautdinov, Alexandre Alahi, François Fleuret, Pascal Fua, Silvio Savarese
_**Abstract**_:  _[We present a unified framework for understanding human social behaviors in raw image sequences. Our model jointly detects multiple individuals, infers their social actions, and estimates the collective actions with a single feed-forward pass through a neural network. We propose a single architecture that does not rely on external detection algorithms but rather is trained end-to-end to generate dense proposal maps that are refined via a novel inference scheme. The temporal consistency is handled via a person-level matching Recurrent Neural Network. The complete model takes as input a sequence of frames and outputs detections along with the estimates of individual actions and collective activities. We demonstrate state-of-the-art performance of our algorithm on multiple publicly available benchmarks. ]_
_**Comment**_:  _< understanding human social behaviors in raw image sequences>_
#### Detangling People: Individuating Multiple Close People and Their Body Parts via Region Assembly [PDF](https://arxiv.org/abs/1604.03880)
Hao Jiang, Kristen Grauman
_**Abstract**_:  _[Today's person detection methods work best when people are in common upright poses and appear reasonably well spaced out in the image. However, in many real images, that's not what people do. People often appear quite close to each other, e.g., with limbs linked or heads touching, and their poses are often not pedestrian-like. We propose an approach to detangle people in multi-person images. We formulate the task as a region assembly problem. Starting from a large set of overlapping regions from body part semantic segmentation and generic object proposals, our optimization approach reassembles those pieces together into multiple person instances. It enforces that the composed body part regions of each person instance obey constraints on relative sizes, mutual spatial relationships, foreground coverage, and exclusive label assignments when overlapping. Since optimal region assembly is a challenging combinatorial problem, we present a Lagrangian relaxation method to accelerate the lower bound estimation, thereby enabling a fast branch and bound solution for the global optimum. As output, our method produces a pixel-level map indicating both 1) the body part labels (arm, leg, torso, and head), and 2) which parts belong to which individual person. Our results on three challenging datasets show our method is robust to clutter, occlusion, and complex poses. It outperforms a variety of competing methods, including existing detector CRF methods and region CNN approaches. In addition, we demonstrate its impact on a proxemics recognition task, which demands a precise representation of "whose body part is where" in crowded images. ]_
_**Comment**_:  _< detangle people in multi-person images >_
#### Lip Reading Sentences in the Wild [PDF](https://arxiv.org/abs/1611.05358)
Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman
_**Abstract**_:  _[The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos.
Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS) network that learns to transcribe videos of mouth motion to characters; (2) a curriculum learning strategy to accelerate training and to reduce overfitting; (3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition, consisting of over 100,000 natural sentences from British television.
The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin. This lip reading performance beats a professional lip reader on videos from BBC television, and we also demonstrate that visual information helps to improve speech recognition performance even when the audio is available. ]_
_**Comment**_:  _< lip reading>_
# Applications
## Spotlight 2-2C
#### *Deep Matching Prior Network: Toward Tighter Multi-Oriented Text Detection [PDF](https://arxiv.org/abs/1703.01425)
Yuliang Liu, Lianwen Jin
_**Abstract**_:  _[Detecting incidental scene text is a challenging task because of multi-orientation, perspective distortion, and variation of text size, color and scale. Retrospective research has only focused on using rectangular bounding box or horizontal sliding window to localize text, which may result in redundant background noise, unnecessary overlap or even information loss. To address these issues, we propose a new Convolutional Neural Networks (CNNs) based method, named Deep Matching Prior Network (DMPNet), to detect text with tighter quadrangle. First, we use quadrilateral sliding windows in several specific intermediate convolutional layers to roughly recall the text with higher overlapping area and then a shared Monte-Carlo method is proposed for fast and accurate computing of the polygonal areas. After that, we designed a sequential protocol for relative regression which can exactly predict text with compact quadrangle. Moreover, a auxiliary smooth Ln loss is also proposed for further regressing the position of text, which has better overall performance than L2 loss and smooth L1 loss in terms of robustness and stability. The effectiveness of our approach is evaluated on a public word-level, multi-oriented scene text database, ICDAR 2015 Robust Reading Competition Challenge 4 "Incidental scene text localization". The performance of our method is evaluated by using F-measure and found to be 70.64%, outperforming the existing state-of-the-art method with F-measure 63.76%. ]_
_**Comment**_:  _< text detection, CNN-based Deep Matching Prior Network, with quadrangle sliding windows >_
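
The abstract above mentions a Monte-Carlo method for computing polygonal overlap areas. The sketch below shows only the generic idea of estimating the overlap of two convex quadrilaterals by random sampling; it is not the paper's "shared" Monte-Carlo variant, and the function names `inside_convex` and `overlap_area`, the sample count, and the seed are all assumptions made for illustration.

```python
import random


def inside_convex(point, polygon):
    """Return True if point lies inside (or on the edge of) a convex polygon.

    polygon is a list of (x, y) vertices in a consistent winding order.
    """
    x, y = point
    sign = 0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Cross product tells on which side of the edge the point lies.
        cross = (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)
        if cross != 0:
            if sign == 0:
                sign = 1 if cross > 0 else -1
            elif (cross > 0) != (sign > 0):
                return False  # point is on different sides of two edges
    return True


def overlap_area(quad_a, quad_b, samples=20000, seed=0):
    """Monte-Carlo estimate of the overlap area of two convex quadrilaterals."""
    rng = random.Random(seed)
    xs = [p[0] for p in quad_a]
    ys = [p[1] for p in quad_a]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    box_area = (x1 - x0) * (y1 - y0)  # bounding box of quad_a
    hits = 0
    for _ in range(samples):
        p = (rng.uniform(x0, x1), rng.uniform(y0, y1))
        if inside_convex(p, quad_a) and inside_convex(p, quad_b):
            hits += 1
    return box_area * hits / samples


if __name__ == "__main__":
    a = [(0, 0), (2, 0), (2, 2), (0, 2)]
    b = [(1, 0), (3, 0), (3, 2), (1, 2)]
    print(overlap_area(a, b))  # estimate of the true overlap area, 2.0
```

Sampling inside the bounding box of the first quadrilateral is enough because the overlap region cannot extend beyond it.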
#### ChestX-ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases [PDF](https://arxiv.org/abs/1705.02315)
Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, Ronald M. Summers
_**Abstract**_:  _[The chest X-ray is one of the most commonly accessible radiological examinations for screening and diagnosis of many lung diseases. A tremendous number of X-ray imaging studies accompanied by radiological reports are accumulated and stored in many modern hospitals' Picture Archiving and Communication Systems (PACS). On the other side, it is still an open question how this type of hospital-size knowledge database containing invaluable imaging informatics (i.e., loosely labeled) can be used to facilitate the data-hungry deep learning paradigms in building truly large-scale high precision computer-aided diagnosis (CAD) systems. In this paper, we present a new chest X-ray database, namely "ChestX-ray8", which comprises 108,948 frontal-view X-ray images of 32,717 unique patients with the text-mined eight disease image labels (where each image can have multi-labels), from the associated radiological reports using natural language processing. Importantly, we demonstrate that these commonly occurring thoracic diseases can be detected and even spatially-located via a unified weakly-supervised multi-label image classification and disease localization framework, which is validated using our proposed dataset. Although the initial quantitative results are promising as reported, deep convolutional neural network based "reading chest X-rays" (i.e., recognizing and locating the common disease patterns trained with only image-level labels) remains a strenuous task for fully-automated high precision CAD systems. ]_
_**Comment**_:  _< new X-ray dataset>_
#### *Attentional Push: A Deep Convolutional Network for Augmenting Image Salience With Shared Attention Modeling in Social Scenes
Siavash Gorji, James J. Clark
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### *Detecting Oriented Text in Natural Images by Linking Segments [PDF](https://arxiv.org/abs/1703.06520)
Baoguang Shi, Xiang Bai, Serge Belongie
_**Abstract**_:  _[Most state-of-the-art text detection methods are specific to horizontal Latin text and are not fast enough for real-time applications. We introduce Segment Linking (SegLink), an oriented text detection method. The main idea is to decompose text into two locally detectable elements, namely segments and links. A segment is an oriented box covering a part of a word or text line; A link connects two adjacent segments, indicating that they belong to the same word or text line. Both elements are detected densely at multiple scales by an end-to-end trained, fully-convolutional neural network. Final detections are produced by combining segments connected by links. Compared with previous methods, SegLink improves along the dimensions of accuracy, speed, and ease of training. It achieves an f-measure of 75.0% on the standard ICDAR 2015 Incidental (Challenge 4) benchmark, outperforming the previous best by a large margin. It runs at over 20 FPS on 512x512 images. Moreover, without modification, SegLink is able to detect long lines of non-Latin text, such as Chinese. ]_
_**Comment**_:  _< oriented text; segments (oriented boxes covering part of a word or text line) and links (connecting two adjacent segments) >_
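
The abstract above says final detections are produced by combining segments connected by links. One plausible way to do the grouping step is a union-find pass over the detected links, sketched below. This is an assumption for illustration, not the authors' implementation: the function name `group_segments` is hypothetical, segments are represented only by integer indices, and the step that fits one oriented box to each group is left out.

```python
def group_segments(num_segments, links):
    """Group segment indices into connected components defined by links.

    links is an iterable of (a, b) index pairs, each meaning that segments
    a and b were predicted to belong to the same word or text line.
    """
    parent = list(range(num_segments))

    def find(i):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two components

    groups = {}
    for i in range(num_segments):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Five segments; 0-1-2 are linked into one word, 3-4 into another.
    print(group_segments(5, [(0, 1), (1, 2), (3, 4)]))
```

Each returned group would then be merged into a single word or text-line box.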
#### Learning Video Object Segmentation From Static Images [PDF](https://arxiv.org/abs/1612.02646)
Federico Perazzi, Anna Khoreva, Rodrigo Benenson, Bernt Schiele, Alexander Sorkine-Hornung
_**Abstract**_:  _[Inspired by recent advances of deep learning in instance segmentation and object tracking, we introduce video object segmentation problem as a concept of guided instance segmentation. Our model proceeds on a per-frame basis, guided by the output of the previous frame towards the object of interest in the next frame. We demonstrate that highly accurate object segmentation in videos can be enabled by using a convnet trained with static images only. The key ingredient of our approach is a combination of offline and online learning strategies, where the former serves to produce a refined mask from the previous frame estimate and the latter allows to capture the appearance of the specific object instance. Our method can handle different types of input annotations: bounding boxes and segments, as well as incorporate multiple annotated frames, making the system suitable for diverse applications. We obtain competitive results on three different datasets, independently from the type of input annotation. ]_
_**Comment**_:  _< >_
#### Seeing Invisible Poses: Estimating 3D Body Pose From Egocentric Video [PDF](https://arxiv.org/abs/1603.07763)
Hao Jiang, Kristen Grauman
_**Abstract**_:  _[Understanding the camera wearer's activity is central to egocentric vision, yet one key facet of that activity is inherently invisible to the camera--the wearer's body pose. Prior work focuses on estimating the pose of hands and arms when they come into view, but this 1) gives an incomplete view of the full body posture, and 2) prevents any pose estimate at all in many frames, since the hands are only visible in a fraction of daily life activities. We propose to infer the "invisible pose" of a person behind the egocentric camera. Given a single video, our efficient learning-based approach returns the full body 3D joint positions for each frame. Our method exploits cues from the dynamic motion signatures of the surrounding scene--which changes predictably as a function of body pose--as well as static scene structures that reveal the viewpoint (e.g., sitting vs. standing). We further introduce a novel energy minimization scheme to infer the pose sequence. It uses soft predictions of the poses per time instant together with a non-parametric model of human pose dynamics over longer windows. Our method outperforms an array of possible alternatives, including deep learning approaches for direct pose regression from images. ]_
_**Comment**_:  _< estimate camera wearer's body pose>_
#### Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space [PDF](https://arxiv.org/abs/1612.00005)
Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, Jason Yosinski
_**Abstract**_:  _[Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. (2016) showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at higher resolutions (227x227) than previous generative models, and does so for all 1000 ImageNet categories. In addition, we provide a unified probabilistic interpretation of related activation maximization methods and call the general class of models "Plug and Play Generative Networks". PPGNs are composed of 1) a generator network G that is capable of drawing a wide range of image types and 2) a replaceable "condition" network C that tells the generator what to draw. We demonstrate the generation of images conditioned on a class (when C is an ImageNet or MIT Places classification network) and also conditioned on a caption (when C is an image captioning network). Our method also improves the state of the art of Multifaceted Feature Visualization, which generates the set of synthetic inputs that activate a neuron in order to better understand how deep neural networks operate. Finally, we show that our model performs reasonably well at the task of image inpainting. While image models are used in this paper, the approach is modality-agnostic and can be applied to many types of data. ]_
_**Comment**_:  _< >_
#### *A Joint Speaker-Listener-Reinforcer Model for Referring Expressions [PDF](https://arxiv.org/abs/1612.09542)
Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg
_**Abstract**_:  _[Referring expressions are natural language constructions used to identify particular objects within a scene. In this paper, we propose a unified framework for the tasks of referring expression comprehension and generation. Our model is composed of three modules: speaker, listener, and reinforcer. The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions. The listener-speaker modules are trained jointly in an end-to-end learning framework, allowing the modules to be aware of one another during learning while also benefiting from the discriminative reinforcer's feedback. We demonstrate that this unified framework and training achieves state-of-the-art results for both comprehension and generation on three referring expression datasets.]_
_**Comment**_:  _< referring expression, Speaker-Listener-Reinforcer network>_
## Oral 2-2C
#### End-To-End Learning of Driving Models From Large-Scale Video Datasets [PDF](https://arxiv.org/abs/1612.01079)
Huazhe Xu, Yang Gao, Fisher Yu, Trevor Darrell
_**Abstract**_:  _[Robust perception-action models should be learned from training data with diverse visual appearances and realistic behaviors, yet current approaches to deep visuomotor policy learning have been generally limited to in-situ models learned from a single vehicle or a simulation environment. We advocate learning a generic vehicle motion model from large scale crowd-sourced video data, and develop an end-to-end trainable architecture for learning to predict a distribution over future vehicle egomotion from instantaneous monocular camera observations and previous vehicle state. Our model incorporates a novel FCN-LSTM architecture, which can be learned from large-scale crowd-sourced vehicle action data, and leverages available scene segmentation side tasks to improve performance under a privileged learning paradigm. ]_
_**Comment**_:  _< driving models>_
#### Deep Future Gaze: Gaze Anticipation on Egocentric Videos Using Adversarial Networks
Mengmi Zhang, Keng Teck Ma, Joo Hwee Lim, Qi Zhao, Jiashi Feng
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### *MDNet: A Semantically and Visually Interpretable Medical Image Diagnosis Network
Zizhao Zhang, Yuanpu Xie, Fuyong Xing, Mason McGough, Lin Yang
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
# Poster 2-2
## 3D Computer Vision
#### Surface Motion Capture Transfer With Gaussian Process Regression
Adnane Boukhayma, Jean-Sébastien Franco, Edmond Boyer
#### Visual-Inertial-Semantic Scene Representation for 3D Object Detection
Jingming Dong, Xiaohan Fei, Stefano Soatto
#### Template-Based Monocular 3D Recovery of Elastic Shapes Using Lagrangian Multipliers
Nazim Haouchine, Stephane Cotin
#### Learning Category-Specific 3D Shape Models From Weakly Labeled 2D Images
Dingwen Zhang, Junwei Han, Yang Yang, Dong Huang
#### Simultaneous Geometric and Radiometric Calibration of a Projector-Camera Pair
Marjan Shahpaski, Luis Ricardo Sapaico, Gaspard Chevassus, Sabine Süsstrunk
#### Learning Barycentric Representations of 3D Shapes for Sketch-Based 3D Shape Retrieval
Jin Xie, Guoxian Dai, Fan Zhu, Yi Fang
#### Geodesic Distance Descriptors
Gil Shamai, Ron Kimmel
## Analyzing Humans in Images
#### Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks
Hongsong Wang, Liang Wang
#### Forecasting Human Dynamics From Static Images
Yu-Wei Chao, Jimei Yang, Brian Price, Scott Cohen, Jia Deng
#### Re-Ranking Person Re-Identification With k-Reciprocal Encoding
Zhun Zhong, Liang Zheng, Donglin Cao, Shaozi Li
#### Deep Sequential Context Networks for Action Prediction
Yu Kong, Zhiqiang Tao, Yun Fu
#### Global Context-Aware Attention LSTM Networks for 3D Action Recognition
Jun Liu, Gang Wang, Ping Hu, Ling-Yu Duan, Alex C. Kot
#### Dynamic Attention-Controlled Cascaded Shape Regression Exploiting Training Data Augmentation and Fuzzy-Set Sample Weighting
Zhen-Hua Feng, Josef Kittler, William Christmas, Patrik Huber, Xiao-Jun Wu
#### A Deep Regression Architecture With Two-Stage Re-Initialization for High Performance Facial Landmark Detection
Jiangjing Lv, Xiaohu Shao, Junliang Xing, Cheng Cheng, Xi Zhou
#### Multiple People Tracking by Lifted Multicut and Person Re-Identification
Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, Bernt Schiele
#### Towards Accurate Multi-Person Pose Estimation in the Wild
George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, Kevin Murphy
## Applications
#### Towards a Quality Metric for Dense Light Fields
Vamsi Kiran Adhikarla, Marek Vinkler, Denis Sumin, Rafał K. Mantiuk, Karol Myszkowski, Hans-Peter Seidel, Piotr Didyk
#### Controlling Perceptual Factors in Neural Style Transfer
Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, Aaron Hertzmann, Eli Shechtman
## Biomedical Image/Video Analysis
#### Joint Sequence Learning and Cross-Modality Convolution for 3D Biomedical Segmentation
Kuan-Lun Tseng, Yen-Liang Lin, Winston Hsu, Chung-Yang Huang
#### LSTM Self-Supervision for Detailed Behavior Analysis
Biagio Brattoli, Uta Büchler, Anna-Sophia Wahl, Martin E. Schwab, Björn Ommer
## Computational Photography
#### A Wide-Field-Of-View Monocentric Light Field Camera
Donald G. Dansereau, Glenn Schuster, Joseph Ford, Gordon Wetzstein
## Image Motion & Tracking
#### S2F: Slow-To-Fast Interpolator Flow
Yanchao Yang, Stefano Soatto
#### CLKN: Cascaded Lucas-Kanade Networks for Image Alignment
Che-Han Chang, Chun-Nan Chou, Edward Y. Chang
#### Multi-Object Tracking With Quadruplet Convolutional Neural Networks
Mooyeol Baek, Jeany Son, Minsu Cho, Bohyung Han
## Low- & Mid-Level Vision
#### Learning to Detect Salient Objects With Image-Level Supervision
Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, Xiang Ruan
#### From Motion Blur to Motion Flow: A Deep Learning Solution for Removing Heterogeneous Motion Blur
Dong Gong, Jie Yang, Lingqiao Liu, Yanning Zhang, Ian Reid, Chunhua Shen, Anton van den Hengel, Qinfeng Shi
#### Co-Occurrence Filter
Roy J. Jevnisek, Shai Avidan
#### Fractal Dimension Invariant Filtering and Its CNN-Based Implementation
Hongteng Xu, Junchi Yan, Nils Persson, Weiyao Lin, Hongyuan Zha
#### Noise-Blind Image Deblurring
Meiguang Jin, Stefan Roth, Paolo Favaro
#### Simultaneous Visual Data Completion and Denoising Based on Tensor Rank and Total Variation Minimization and Its Primal-Dual Splitting Algorithm
Tatsuya Yokota, Hidekata Hontani
#### HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors
Vassileios Balntas, Karel Lenc, Andrea Vedaldi, Krystian Mikolajczyk
#### Hyperspectral Image Super-Resolution via Non-Local Sparse Tensor Factorization
Renwei Dian, Leyuan Fang, Shutao Li
#### Reflection Removal Using Low-Rank Matrix Completion
Byeong-Ju Han, Jae-Young Sim
#### Object Co-Skeletonization With Co-Segmentation
Koteswar Rao Jerripothula, Jianfei Cai, Jiangbo Lu, Junsong Yuan
## Machine Learning
#### Mining Object Parts From CNNs via Active Question-Answering
Quanshi Zhang, Ruiming Cao, Ying Nian Wu, Song-Chun Zhu
#### PolyNet: A Pursuit of Structural Diversity in Very Deep Networks
Xingcheng Zhang, Zhizhong Li, Chen Change Loy, Dahua Lin
#### The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions
Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel
#### Joint Discriminative Bayesian Dictionary and Classifier Learning
Naveed Akhtar, Ajmal Mian, Fatih Porikli
#### A Study of Lagrangean Decompositions and Dual Ascent Solvers for Graph Matching
Paul Swoboda, Carsten Rother, Hassan Abu Alhaija, Dagmar Kainmüller, Bogdan Savchynskyy
#### Quad-Networks: Unsupervised Learning to Rank for Interest Point Detection
Nikolay Savinov, Akihito Seki, Ľubor Ladický, Torsten Sattler, Marc Pollefeys
#### Outlier-Robust Tensor PCA
Pan Zhou, Jiashi Feng
#### Learning Adaptive Receptive Fields for Deep Image Parsing Network
Zhen Wei, Yao Sun, Jinqiao Wang, Hanjiang Lai, Si Liu
#### Learning an Invariant Hilbert Space for Domain Adaptation
Samitha Herath, Mehrtash Harandi, Fatih Porikli
#### Fixed-Point Factorized Networks
Peisong Wang, Jian Cheng
#### Discriminative Optimization: Theory and Applications to Point Cloud Registration
Jayakorn Vongkulbhisal, Fernando De la Torre, João P. Costeira
#### Online Asymmetric Similarity Learning for Cross-Modal Retrieval
Yiling Wu, Shuhui Wang, Qingming Huang
#### Improving Training of Deep Neural Networks via Singular Value Bounding
Kui Jia, Dacheng Tao, Shenghua Gao, Xiangmin Xu
#### S3Pool: Pooling With Stochastic Spatial Sampling
Shuangfei Zhai, Hui Wu, Abhishek Kumar, Yu Cheng, Yongxi Lu, Zhongfei Zhang, Rogerio Feris
#### Sports Field Localization via Deep Structured Models
Namdar Homayounfar, Sanja Fidler, Raquel Urtasun
#### Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing the Early Softmax Saturation
Binghui Chen, Weihong Deng, Junping Du
#### Switching Convolutional Neural Network for Crowd Counting
Deepak Babu Sam, Shiv Surya, R. Venkatesh Babu
#### Network Sketching: Exploiting Binary Structure in Deep CNNs
Yiwen Guo, Anbang Yao, Hao Zhao, Yurong Chen
#### Multi-Task Clustering of Human Actions by Sharing Information
Xiaoqiang Yan, Shizhe Hu, Yangdong Ye
#### Soft-Margin Mixture of Regressions
Dong Huang, Longfei Han, Fernando De la Torre
#### Multigrid Neural Architectures
Tsung-Wei Ke, Michael Maire, Stella X. Yu
#### High-Resolution Image Inpainting Using Multi-Scale Neural Patch Synthesis
Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, Hao Li
#### Deep Quantization: Encoding Convolutional Activations With Deep Generative Model
Zhaofan Qiu, Ting Yao, Tao Mei
#### DOPE: Distributed Optimization for Pairwise Energies
Jose Dolz, Ismail Ben Ayed, Christian Desrosiers
#### Improved Texture Networks: Maximizing Quality and Diversity in Feed-Forward Stylization and Texture Synthesis
Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky
## Object Recognition & Scene Understanding
#### Polyhedral Conic Classifiers for Visual Object Detection and Classification
Hakan Cevikalp, Bill Triggs
#### Incremental Kernel Null Space Discriminant Analysis for Novelty Detection
Juncheng Liu, Zhouhui Lian, Yi Wang, Jianguo Xiao
#### Predicting Ground-Level Scene Layout From Aerial Imagery
Menghua Zhai, Zachary Bessinger, Scott Workman, Nathan Jacobs
#### Deep Feature Flow for Video Recognition
Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, Yichen Wei
#### Object-Aware Dense Semantic Correspondence
Fan Yang, Xin Li, Hong Cheng, Jianping Li, Leiting Chen
#### Semantic Regularisation for Recurrent Image Annotation
Feng Liu, Tao Xiang, Timothy M. Hospedales, Wankou Yang, Changyin Sun
#### Video2Shop: Exact Matching Clothes in Videos to Online Shopping Images
Zhi-Qi Cheng, Xiao Wu, Yang Liu, Xian-Sheng Hua
#### Fast-At: Fast Automatic Thumbnail Generation Using Deep Neural Networks
Seyed A. Esmaeili, Bharat Singh, Larry S. Davis
#### Multi-Level Attention Networks for Visual Question Answering
Dongfei Yu, Jianlong Fu, Tao Mei, Yong Rui
#### Generating Descriptions With Grounded and Co-Referenced People
Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, Bernt Schiele
#### Straight to Shapes: Real-Time Detection of Encoded Shapes
Saumya Jetley, Michael Sapienza, Stuart Golodetz, Philip H. S. Torr
#### Simultaneous Feature Aggregating and Hashing for Large-Scale Image Search
Thanh-Toan Do, Dang-Khoa Le Tan, Trung T. Pham, Ngai-Man Cheung
#### Improving Facial Attribute Prediction Using Semantic Segmentation
Mahdi M. Kalayeh, Boqing Gong, Mubarak Shah
## Video Analytics
#### Learning Cross-Modal Deep Representations for Robust Pedestrian Detection
Dan Xu, Wanli Ouyang, Elisa Ricci, Xiaogang Wang, Nicu Sebe
#### Spatio-Temporal Self-Organizing Map Deep Network for Dynamic Object Detection From Videos
Yang Du, Chunfeng Yuan, Bing Li, Weiming Hu, Stephen Maybank
#### CERN: Confidence-Energy Recurrent Network for Group Activity Recognition
Tianmin Shu, Sinisa Todorovic, Song-Chun Zhu
#### Understanding Traffic Density From Large-Scale Web Camera Data
Shanghang Zhang, Guanhang Wu, João P. Costeira, José M. F. Moura
#### Collaborative Summarization of Topic-Related Videos
Rameswar Panda, Amit K. Roy-Chowdhury
# Machine Learning 3
## Spotlight 3-1A
#### ***Local Binary Convolutional Neural Networks [PDF](https://arxiv.org/abs/1608.06049)
Felix Juefei-Xu, Vishnu Naresh Boddeti, Marios Savvides
_**Abstract**_:  _[We propose local binary convolution (LBC), an efficient alternative to convolutional layers in standard convolutional neural networks (CNN). The design principles of LBC are motivated by local binary patterns (LBP). The LBC layer comprises of a set of fixed sparse pre-defined binary convolutional filters that are not updated during the training process, a non-linear activation function and a set of learnable linear weights. The linear weights combine the activated filter responses to approximate the corresponding activated filter responses of a standard convolutional layer. The LBC layer affords significant parameter savings, 9x to 169x in the number of learnable parameters compared to a standard convolutional layer. Furthermore, due to lower model complexity and sparse and binary nature of the weights also results in up to 9x to 169x savings in model size compared to a standard convolutional layer. We demonstrate both theoretically and experimentally that our local binary convolution layer is a good approximation of a standard convolutional layer. Empirically, CNNs with LBC layers, called local binary convolutional neural networks (LBCNN), reach state-of-the-art performance on a range of visual datasets (MNIST, SVHN, CIFAR-10, and a subset of ImageNet) while enjoying significant computational savings. ]_
_**Comment**_:  _< LBCNN, approximation to standard CNN>_
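
The layer structure in the abstract (fixed sparse binary filters, a non-linearity, then learnable linear weights) can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the single input channel, the ReLU activation, the `sparsity` value, and every function name are assumptions made for the example.

```python
import random

def make_binary_filters(m, k, sparsity=0.5, seed=0):
    """Build m fixed k x k filters with entries in {-1, 0, +1}; never trained."""
    rng = random.Random(seed)
    return [[[rng.choice((-1, 1)) if rng.random() < sparsity else 0
              for _ in range(k)] for _ in range(k)]
            for _ in range(m)]

def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation on nested lists."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)]

def lbc_layer(image, filters, weights):
    """Fixed binary filters -> ReLU -> learnable linear combination."""
    maps = [conv2d_valid(image, f) for f in filters]
    maps = [[[max(0.0, v) for v in row] for row in fmap] for fmap in maps]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(w * fmap[i][j] for w, fmap in zip(weights, maps))
             for j in range(cols)] for i in range(rows)]

def learnable_param_ratio(p, q, k, m):
    """Standard conv (p*q*k*k weights) versus LBC (m*q linear weights)."""
    return (p * q * k * k) / (m * q)
```

Under the additional assumption that the number of fixed filters `m` equals the number of input channels `p`, `learnable_param_ratio` reduces to `k * k`, which gives 9 for 3x3 filters and 169 for 13x13 filters, the same range quoted in the abstract.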
#### *Deep Self-Taught Learning for Weakly Supervised Object Localization [PDF](https://arxiv.org/abs/1704.05188)
Zequn Jie, Yunchao Wei, Xiaojie Jin, Jiashi Feng, Wei Liu
_**Abstract**_:  _[Most existing weakly supervised localization (WSL) approaches learn detectors by finding positive bounding boxes based on features learned with image-level supervision. However, those features do not contain spatial location related information and usually provide poor-quality positive samples for training a detector. To overcome this issue, we propose a deep self-taught learning approach, which makes the detector learn the object-level features reliable for acquiring tight positive samples and afterwards re-train itself based on them. Consequently, the detector progressively improves its detection ability and localizes more informative positive samples. To implement such self-taught learning, we propose a seed sample acquisition method via image-to-object transferring and dense subgraph discovery to find reliable positive samples for initializing the detector. An online supportive sample harvesting scheme is further proposed to dynamically select the most confident tight positive samples and train the detector in a mutual boosting way. To prevent the detector from being trapped in poor optima due to overfitting, we propose a new relative improvement of predicted CNN scores for guiding the self-taught learning process. Extensive experiments on PASCAL 2007 and 2012 show that our approach outperforms the state-of-the-arts, strongly validating its effectiveness. ]_
_**Comment**_:  _< self-taught learning, improves progressively>_
#### Multi-Modal Mean-Fields via Cardinality-Based Clamping [PDF](https://arxiv.org/abs/1611.07941)
Pierre Baqué, François Fleuret, Pascal Fua
_**Abstract**_:  _[Mean Field inference is central to statistical physics. It has attracted much interest in the Computer Vision community to efficiently solve problems expressible in terms of large Conditional Random Fields. However, since it models the posterior probability distribution as a product of marginal probabilities, it may fail to properly account for important dependencies between variables. We therefore replace the fully factorized distribution of Mean Field by a weighted mixture of such distributions, that similarly minimizes the KL-Divergence to the true posterior. By introducing two new ideas, namely, conditioning on groups of variables instead of single ones and using a parameter of the conditional random field potentials, that we identify to the temperature in the sense of statistical physics to select such groups, we can perform this minimization efficiently. Our extension of the clamping method proposed in previous works allows us to both produce a more descriptive approximation of the true posterior and, inspired by the diverse MAP paradigms, fit a mixture of Mean Field approximations. We demonstrate that this positively impacts real-world algorithms that initially relied on mean fields. ]_
_**Comment**_:  _< >_
#### Probabilistic Temporal Subspace Clustering
Behnam Gholami, Vladimir Pavlovic
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### *Provable Self-Representation Based Outlier Detection in a Union of Subspaces [PDF](https://arxiv.org/abs/1704.03925)
Chong You, Daniel P. Robinson, René Vidal
_**Abstract**_:  _[Many computer vision tasks involve processing large amounts of data contaminated by outliers, which need to be detected and rejected. While outlier detection methods based on robust statistics have existed for decades, only recently have methods based on sparse and low-rank representation been developed along with guarantees of correct outlier detection when the inliers lie in one or more low-dimensional subspaces. This paper proposes a new outlier detection method that combines tools from sparse representation with random walks on a graph. By exploiting the property that data points can be expressed as sparse linear combinations of each other, we obtain an asymmetric affinity matrix among data points, which we use to construct a weighted directed graph. By defining a suitable Markov Chain from this graph, we establish a connection between inliers/outliers and essential/inessential states of the Markov chain, which allows us to detect outliers by using random walks. We provide a theoretical analysis that justifies the correctness of our method under geometric and connectivity assumptions. Experimental results on image databases demonstrate its superiority with respect to state-of-the-art sparse and low-rank outlier detection methods. ]_
_**Comment**_:  _< data contaminated by outliers, outlier detection>_
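
The random-walk step described in the abstract can be illustrated on a toy graph. The sketch below is an assumption-laden simplification, not the paper's algorithm: the affinity matrix is hand-written rather than obtained from a sparse self-representation, and the threshold rule (`rel_threshold`) and all function names are hypothetical.

```python
def transition_matrix(affinity):
    """Row-normalise a non-negative affinity matrix (rows must have positive sum)."""
    return [[v / sum(row) for v in row] for row in affinity]

def average_walk_distribution(P, steps):
    """Average the state distribution over `steps` steps from a uniform start."""
    n = len(P)
    pi = [1.0 / n] * n
    avg = [0.0] * n
    for _ in range(steps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        avg = [a + p / steps for a, p in zip(avg, pi)]
    return avg

def detect_outliers(affinity, steps=100, rel_threshold=0.1):
    """Flag states whose averaged mass is far below the uniform level 1/n."""
    n = len(affinity)
    scores = average_walk_distribution(transition_matrix(affinity), steps)
    return [i for i, s in enumerate(scores) if s < rel_threshold / n]

# Hypothetical asymmetric affinity: points 0-3 represent each other,
# point 4 is represented by the others but never used by them.
toy_affinity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
]
```

In this toy graph no edge leads into point 4, so the walk leaves it after one step and never returns. It behaves like an inessential state and is the only index that `detect_outliers(toy_affinity)` flags.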
#### Latent Multi-View Subspace Clustering
Changqing Zhang, Qinghua Hu, Huazhu Fu, Pengfei Zhu, Xiaochun Cao
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### *Learning to Extract Semantic Structure From Documents Using Multimodal Fully Convolutional Neural Networks [PDF](https://arxiv.org/abs/1706.02337)
Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, C. Lee Giles
_**Abstract**_:  _[We present an end-to-end, multimodal, fully convolutional network for extracting semantic structures from document images. We consider document semantic structure extraction as a pixel-wise segmentation task, and propose a unified model that classifies pixels based not only on their visual appearance, as in the traditional page segmentation task, but also on the content of underlying text. Moreover, we propose an efficient synthetic document generation process that we use to generate pretraining data for our network. Once the network is trained on a large set of synthetic documents, we fine-tune the network on unlabeled real documents using a semi-supervised approach. We systematically study the optimum network architecture and show that both our multimodal approach and the synthetic data pretraining significantly boost the performance. ]_
_**Comment**_:  _< extracting semantic structures from document images; synthetic document generation for pretraining data, fine-tuned using a semi-supervised approach>_
#### *Age Progression/Regression by Conditional Adversarial Autoencoder [PDF](https://arxiv.org/abs/1702.08423)
Zhifei Zhang, Yang Song, Hairong Qi
_**Abstract**_:  _["If I provide you a face image of mine (without telling you the actual age when I took the picture) and a large amount of face images that I crawled (containing labeled faces of different ages but not necessarily paired), can you show me what I would look like when I am 80 or what I was like when I was 5?" The answer is probably a "No." Most existing face aging works attempt to learn the transformation between age groups and thus would require the paired samples as well as the labeled query image. In this paper, we look at the problem from a generative modeling perspective such that no paired samples is required. In addition, given an unlabeled image, the generative model can directly produce the image with desired age attribute. We propose a conditional adversarial autoencoder (CAAE) that learns a face manifold, traversing on which smooth age progression and regression can be realized simultaneously. In CAAE, the face is first mapped to a latent vector through a convolutional encoder, and then the vector is projected to the face manifold conditional on age through a deconvolutional generator. The latent vector preserves personalized face features (i.e., personality) and the age condition controls progression vs. regression. Two adversarial networks are imposed on the encoder and generator, respectively, forcing to generate more photo-realistic faces. Experimental results demonstrate the appealing performance and flexibility of the proposed framework by comparing with the state-of-the-art and ground truth. ]_
_**Comment**_:  _< generate face images of different age >_
## Oral 3-1A
#### Compact Matrix Factorization With Dependent Subspaces
Viktor Larsson, Carl Olsson
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### FFTLasso: Large-Scale LASSO in the Fourier Domain [PDF](https://www.researchgate.net/profile/Adel_Bibi/publication/315765699_FFTLasso_Large-Scale_LASSO_in_the_Fourier_Domain/links/58e325cbaca2722505d16b85/FFTLasso-Large-Scale-LASSO-in-the-Fourier-Domain.pdf)
Adel Bibi, Hani Itani, Bernard Ghanem
_**Abstract**_:  _[In this paper, we revisit the LASSO sparse representation problem, which has been studied and used in a variety of different areas, ranging from signal processing and information theory to computer vision and machine learning. In the vision community, it found its way into many important applications, including face recognition, tracking, super resolution, image denoising, to name a few. Despite advances in efficient sparse algorithms, solving large-scale LASSO problems remains a challenge. To circumvent this difficulty, people tend to downsample and subsample the problem (e.g. via dimensionality reduction) to maintain a manageable sized LASSO, which usually comes at the cost of losing solution accuracy. This paper proposes a novel circulant reformulation of the LASSO that lifts the problem to a higher dimension, where ADMM can be efficiently applied to its dual form. Because of this lifting, all optimization variables are updated using only basic element-wise operations, the most computationally expensive of which is a 1D FFT. In this way, there is no need for a linear system solver nor matrix-vector multiplication. Since all operations in our FFTLasso method are element-wise, the sub-problems are completely independent and can be trivially parallelized (e.g. on a GPU). The attractive computational properties of FFTLasso are verified by extensive experiments on synthetic and real data and on the face recognition task. They demonstrate that FFTLasso scales much more effectively than a state-of-the-art solver.]_
_**Comment**_:  _< large-scale LASSO problems, lifted to a higher dimension and solved with ADMM, only basic element-wise operations>_
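
The element-wise updates mentioned in the abstract rest on a standard property of circulant matrices: multiplying by one is a circular convolution, which becomes an element-wise product in the Fourier domain. The sketch below only checks that identity and is not the FFTLasso solver; a naive O(n^2) DFT stands in for an FFT so the example needs nothing beyond the standard library, and all names are illustrative.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out

def circulant(c):
    """Circulant matrix whose first column is c."""
    n = len(c)
    return [[c[(i - j) % n] for j in range(n)] for i in range(n)]

def matvec(A, x):
    """Dense matrix-vector product, used here only as a reference."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def circulant_matvec_fourier(c, x):
    """Compute circulant(c) @ x with element-wise products in the Fourier domain."""
    prod = [a * b for a, b in zip(dft(c), dft(x))]
    return [v.real for v in dft(prod, inverse=True)]
```

For real inputs, `matvec(circulant(c), x)` and `circulant_matvec_fourier(c, x)` agree up to floating-point error, which is why no explicit matrix-vector product is needed once the problem has circulant structure.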
#### On the Global Geometry of Sphere-Constrained Sparse Blind Deconvolution
Yuqian Zhang, Yenson Lau, Han-wen Kuo, Sky Cheung, Abhay Pasupathy, John Wright
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### *Global Optimality in Neural Network Training
Benjamin D. Haeffele, René Vidal
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
# Object Recognition & Scene Understanding 2
## Spotlight 3-1B
#### What Is and What Is Not a Salient Object? Learning Salient Object Detector by Ensembling Linear Exemplar Regressors
Changqun Xia, Jia Li, Xiaowu Chen, Anlin Zheng, Yu Zhang
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection [PDF](https://arxiv.org/abs/1703.03054)
Xiaodan Liang, Lisa Lee, Eric P. Xing
_**Abstract**_:  _[Despite progress in visual perception tasks such as image classification and detection, computers still struggle to understand the interdependency of objects in the scene as a whole, e.g., relations between objects or their attributes. Existing methods often ignore global context cues capturing the interactions among different object instances, and can only recognize a handful of types by exhaustively training individual detectors for all possible relationships. To capture such global interdependency, we propose a deep Variation-structured Reinforcement Learning (VRL) framework to sequentially discover object relationships and attributes in the whole image. First, a directed semantic action graph is built using language priors to provide a rich and compact representation of semantic correlations between object categories, predicates, and attributes. Next, we use a variation-structured traversal over the action graph to construct a small, adaptive action set for each step based on the current state and historical actions. In particular, an ambiguity-aware object mining scheme is used to resolve semantic ambiguity among object categories that the object detector fails to distinguish. We then make sequential predictions using a deep RL framework, incorporating global context cues and semantic embeddings of previously extracted phrases in the state vector. Our experiments on the Visual Relationship Detection (VRD) dataset and the large-scale Visual Genome dataset validate the superiority of VRL, which can achieve significantly better detection results on datasets involving thousands of relationship and attribute types. We also demonstrate that VRL is able to predict unseen types embedded in our action graph by learning correlations on shared graph nodes. ]_
_**Comment**_:  _< global interdependency (object relationships and attributes)>_
#### *Modeling Relationships in Referential Expressions With Compositional Modular Networks [PDF](https://arxiv.org/abs/1611.09978)
Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, Kate Saenko
_**Abstract**_:  _[People often refer to entities in an image in terms of their relationships with other entities. For example, "the black cat sitting under the table" refers to both a "black cat" entity and its relationship with another "table" entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of categories. In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene. We call this approach Compositional Modular Networks (CMNs): a novel architecture that learns linguistic analysis and visual inference end-to-end. Our approach is built around two types of neural modules that inspect local regions and pairwise interactions between regions. We evaluate CMNs on multiple referential expression datasets, outperforming state-of-the-art approaches on all tasks. ]_
_**Comment**_:  _< learns linguistic analysis and visual inference end-to-end (a modular network); referential expressions, e.g. "the black cat sitting under the table">_
#### Counting Everyday Objects in Everyday Scenes [PDF](https://arxiv.org/abs/1604.03505)
Prithvijit Chattopadhyay, Ramakrishna Vedantam, Ramprasaath R. Selvaraju, Dhruv Batra, Devi Parikh
_**Abstract**_:  _[We are interested in counting the number of instances of object classes in natural, everyday images. Previous counting approaches tackle the problem in restricted domains such as counting pedestrians in surveillance videos. Counts can also be estimated from outputs of other vision tasks like object detection. In this work, we build dedicated models for counting designed to tackle the large variance in counts, appearances, and scales of objects found in natural scenes. Our approach is inspired by the phenomenon of subitizing - the ability of humans to make quick assessments of counts given a perceptual signal, for small count values. Given a natural scene, we employ a divide and conquer strategy while incorporating context across the scene to adapt the subitizing idea to counting. Our approach offers consistent improvements over numerous baseline approaches for counting on the PASCAL VOC 2007 and COCO datasets. Subsequently, we study how counting can be used to improve object detection. We then show a proof of concept application of our counting methods to the task of Visual Question Answering, by studying the 'how many?' questions in the VQA and COCO-QA datasets. ]_
_**Comment**_:  _< dedicated models for counting>_
#### *Fully Convolutional Instance-Aware Semantic Segmentation [PDF](https://arxiv.org/abs/1611.07709)
Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, Yichen Wei
_**Abstract**_:  _[We present the first fully convolutional end-to-end solution for instance-aware semantic segmentation task. It inherits all the merits of FCNs for semantic segmentation and instance mask proposal. It performs instance mask prediction and classification jointly. The underlying convolutional representation is fully shared between the two sub-tasks, as well as between all regions of interest. The proposed network is highly integrated and achieves state-of-the-art performance in both accuracy and efficiency. It wins the COCO 2016 segmentation competition by a large margin. Code would be released at ]_
_**Comment**_:  _< instance-aware semantic segmentation task>_
#### ****Semantic Autoencoder for Zero-Shot Learning [PDF](https://arxiv.org/abs/1704.08345)
Elyor Kodirov, Tao Xiang, Shaogang Gong
_**Abstract**_:  _[Existing zero-shot learning (ZSL) models typically learn a projection function from a feature space to a semantic embedding space (e.g. attribute space). However, such a projection function is only concerned with predicting the training seen class semantic representation (e.g. attribute prediction) or classification. When applied to test data, which in the context of ZSL contains different (unseen) classes without training data, a ZSL model typically suffers from the project domain shift problem. In this work, we present a novel solution to ZSL based on learning a Semantic AutoEncoder (SAE). Taking the encoder-decoder paradigm, an encoder aims to project a visual feature vector into the semantic space as in the existing ZSL models. However, the decoder exerts an additional constraint, that is, the projection/code must be able to reconstruct the original visual feature. We show that with this additional reconstruction constraint, the learned projection function from the seen classes is able to generalise better to the new unseen classes. Importantly, the encoder and decoder are linear and symmetric which enable us to develop an extremely efficient learning algorithm. Extensive experiments on six benchmark datasets demonstrate that the proposed SAE outperforms significantly the existing ZSL models with the additional benefit of lower computational cost. Furthermore, when the SAE is applied to supervised clustering problem, it also beats the state-of-the-art. ]_
_**Comment**_:  _< zero-shot learning; existing ZSL models learn a projection, causing the projection domain shift problem; the decoder adds the constraint that the projection must be able to reconstruct the original visual feature, so it generalises better to new unseen classes; classification with rejection>_
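
The reconstruction constraint in the abstract can be written as a tied-weight linear autoencoder. The formulation below is a sketch under assumed notation, since none of these symbols appear in the abstract: $X \in \mathbb{R}^{d \times N}$ holds the visual features, $S \in \mathbb{R}^{k \times N}$ the semantic vectors of the seen classes, $W \in \mathbb{R}^{k \times d}$ is the encoder, $W^{\top}$ the decoder, and $\lambda$ a trade-off weight.

$$
\min_{W} \; \lVert X - W^{\top} S \rVert_F^2 + \lambda \, \lVert W X - S \rVert_F^2
$$

Setting the gradient with respect to $W$ to zero gives

$$
S S^{\top} W + \lambda \, W X X^{\top} = (1 + \lambda) \, S X^{\top},
$$

which is a Sylvester equation in $W$. Under these assumptions the projection has a direct solution without iterative training, which is consistent with the low computational cost claimed in the abstract.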
#### CityPersons: A Diverse Dataset for Pedestrian Detection [PDF](https://arxiv.org/abs/1702.05693)
Shanshan Zhang, Rodrigo Benenson, Bernt Schiele
_**Abstract**_:  _[Convnets have enabled significant progress in pedestrian detection recently, but there are still open questions regarding suitable architectures and training data. We revisit CNN design and point out key adaptations, enabling plain FasterRCNN to obtain state-of-the-art results on the Caltech dataset. To achieve further improvement from more and better data, we introduce CityPersons, a new set of person annotations on top of the Cityscapes dataset. The diversity of CityPersons allows us for the first time to train one single CNN model that generalizes well over multiple benchmarks. Moreover, with additional training with CityPersons, we obtain top results using FasterRCNN on Caltech, improving especially for more difficult cases (heavy occlusion and small scale) and providing higher localization quality. ]_
_**Comment**_:  _< pedestrian detection>_
#### GuessWhat?! Visual Object Discovery Through Multi-Modal Dialogue [PDF](https://arxiv.org/abs/1611.08481)
Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, Aaron Courville
_**Abstract**_:  _[We introduce GuessWhat?!, a two-player guessing game as a testbed for research on the interplay of computer vision and dialogue systems. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. Higher-level image understanding, like spatial reasoning and language grounding, is required to solve the proposed task. Our key contribution is the collection of a large-scale dataset consisting of 150K human-played games with a total of 800K visual question-answer pairs on 66K images. We explain our design decisions in collecting the dataset and introduce the oracle and questioner tasks that are associated with the two players of the game. We prototyped deep learning models to establish initial baselines of the introduced tasks. ]_
_**Comment**_:  _< two-player guessing game >_
## Oral 3-1B
#### *Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition
Jianlong Fu, Heliang Zheng, Tao Mei
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Annotating Object Instances With a Polygon-RNN [PDF](https://arxiv.org/abs/1704.05548)
Lluís Castrejón, Kaustav Kundu, Raquel Urtasun, Sanja Fidler
_**Abstract**_:  _[We propose an approach for semi-automatic annotation of object instances. While most current methods treat object segmentation as a pixel-labeling problem, we here cast it as a polygon prediction task, mimicking how most current datasets have been annotated. In particular, our approach takes as input an image crop and sequentially produces vertices of the polygon outlining the object. This allows a human annotator to interfere at any time and correct a vertex if needed, producing as accurate segmentation as desired by the annotator. We show that our approach speeds up the annotation process by a factor of 4.7 across all classes in Cityscapes, while achieving 78.4% agreement in IoU with original ground-truth, matching the typical agreement between human annotators. For cars, our speed-up factor is 7.3 for an agreement of 82.2%. We further show generalization capabilities of our approach to unseen datasets. ]_
_**Comment**_:  _< semi-automatic annotation of object instances, treat object seg as a polygon prediction task instead of pixel-labeling prob.>_
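The agreement figures quoted above are intersection-over-union (IoU) scores. As a minimal sketch of that metric only, assuming binary masks stored as lists of rows (the `mask_iou` helper and the toy masks are hypothetical and not taken from the paper):

```python
def mask_iou(pred, truth):
    """Intersection-over-union of two equally sized binary masks (lists of rows)."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; this also avoids dividing by zero.
    return 1.0 if union == 0 else inter / union


# Toy masks: the prediction misses the bottom row of the object.
predicted = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
ground_truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
print(mask_iou(predicted, ground_truth))  # 4 shared pixels out of 6 covered
```

A polygon predicted by the model would first have to be rasterized into such a mask; that step is left out here.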
#### *Connecting Look and Feel: Associating the Visual and Tactile Properties of Physical Materials [PDF](https://arxiv.org/abs/1704.03822)
Wenzhen Yuan, Shaoxiong Wang, Siyuan Dong, Edward Adelson
_**Abstract**_:  _[For machines to interact with the physical world, they must understand the physical properties of objects and materials they encounter. We use fabrics as an example of a deformable material with a rich set of mechanical properties. A thin flexible fabric, when draped, tends to look different from a heavy stiff fabric. It also feels different when touched. Using a collection of 118 fabric samples, we captured color and depth images of draped fabrics along with tactile data from a high resolution touch sensor. We then sought to associate the information from vision and touch by jointly training CNNs across the three modalities. Through the CNN, each input, regardless of the modality, generates an embedding vector that records the fabric's physical property. By comparing the embeddings, our system is able to look at a fabric image and predict how it will feel, and vice versa. We also show that a system jointly trained on vision and touch data can outperform a similar system trained only on visual data when tested purely with visual inputs. ]_
_**Comment**_:  _< look and feel, predict mutually, jointly training CNNs(color, depth, tactile data)>_
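The "comparing the embeddings" step can be pictured as a nearest-neighbour lookup across modalities. This is a rough sketch that assumes cosine similarity as the comparison and uses made-up 3-D vectors; the abstract does not say which distance the authors use, and `best_match` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query, candidates):
    """Return the name of the candidate embedding most similar to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


# Hypothetical embeddings; real ones would come from the jointly trained CNNs.
image_embedding = [0.9, 0.1, 0.2]
touch_embeddings = {
    "thin_flexible": [0.8, 0.2, 0.1],
    "heavy_stiff": [0.1, 0.9, 0.3],
}
print(best_match(image_embedding, touch_embeddings))
```

Swapping the roles of the query and the candidates gives the reverse direction (touch to image).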
#### *Deep Learning Human Mind for Automated Visual Classification [PDF](https://arxiv.org/abs/1609.00344)
Concetto Spampinato, Simone Palazzo, Isaak Kavasidis, Daniela Giordano, Nasim Souly, Mubarak Shah
_**Abstract**_:  _[What if we could effectively read the mind and transfer human visual capabilities to computer vision methods? In this paper, we aim at addressing this question by developing the first visual object classifier driven by human brain signals. In particular, we employ EEG data evoked by visual object stimuli combined with Recurrent Neural Networks (RNN) to learn a discriminative brain activity manifold of visual categories. Afterwards, we train a Convolutional Neural Network (CNN)-based regressor to project images onto the learned manifold, thus effectively allowing machines to employ human brain-based features for automated visual classification. We use a 32-channel EEG to record brain activity of seven subjects while looking at images of 40 ImageNet object classes. The proposed RNN based approach for discriminating object classes using brain signals reaches an average accuracy of about 40%, which outperforms existing methods attempting to learn EEG visual object representations. As for automated object categorization, our human brain-driven approach obtains competitive performance, comparable to those achieved by powerful CNN models, both on ImageNet and CalTech 101, thus demonstrating its classification and generalization capabilities. This gives us a real hope that, indeed, human mind can be read and transferred to machines. ]_
_**Comment**_:  _< visual object classifier driven by human brain signals, human brain-based features>_
# Poster 3-1
## 3D Computer Vision
Self-Calibration-Based Approach to Critical Motion Sequences of Rolling-Shutter Structure From Motion
Eisuke Ito, Takayuki Okatani
Semi-Calibrated Near Field Photometric Stereo
Fotios Logothetis, Roberto Mecca, Roberto Cipolla
Semantic Multi-View Stereo: Jointly Estimating Objects and Voxels
Ali Osman Ulusoy, Michael J. Black, Andreas Geiger
Learning to Predict Stereo Reliability Enforcing Local Consistency of Confidence Maps
Matteo Poggi, Stefano Mattoccia
The Misty Three Point Algorithm for Relative Pose
Tobias Palmér, Kalle Åström, Jan-Michael Frahm
The Surfacing of Multiview 3D Drawings via Lofting and Occlusion Reasoning
Anil Usumezbas, Ricardo Fabbri, Benjamin B. Kimia
A New Representation of Skeleton Sequences for 3D Action Recognition
Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, Farid Boussaid
A General Framework for Curve and Surface Comparison and Registration With Oriented Varifolds
Irène Kaltenmark, Benjamin Charlier, Nicolas Charon
Learning to Align Semantic Segmentation and 2.5D Maps for Geolocalization
Anil Armagan, Martin Hirzer, Peter M. Roth, Vincent Lepetit
A Generative Model for Depth-Based Robust 3D Facial Pose Tracking
Lu Sheng, Jianfei Cai, Tat-Jen Cham, Vladimir Pavlovic, King Ngi Ngan
Fast 3D Reconstruction of Faces With Glasses
Fabio Maninchedda, Martin R. Oswald, Marc Pollefeys
An Efficient Algebraic Solution to the Perspective-Three-Point Problem
Tong Ke, Stergios I. Roumeliotis
## Analyzing Humans in Images
Learning From Synthetic Humans
Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, Cordelia Schmid
Forecasting Interactive Dynamics of Pedestrians With Fictitious Play
Wei-Chiu Ma, De-An Huang, Namhoon Lee, Kris M. Kitani
Hand Keypoint Detection in Single Images Using Multiview Bootstrapping
Tomas Simon, Hanbyul Joo, Iain Matthews, Yaser Sheikh
PoseTrack: Joint Multi-Person Pose Estimation and Tracking
Umar Iqbal, Anton Milan, Juergen Gall
Expecting the Unexpected: Training Detectors for Unusual Pedestrians With Adversarial Imposters
Shiyu Huang, Deva Ramanan
On Human Motion Prediction Using Recurrent Neural Networks
Julieta Martinez, Michael J. Black, Javier Romero
Learning and Refining of Privileged Information-Based RNNs for Action Recognition From Depth Sequences
Zhiyuan Shi, Tae-Kyun Kim
Quality Aware Network for Set to Set Recognition
Yu Liu, Junjie Yan, Wanli Ouyang
Unite the People: Closing the Loop Between 3D and 2D Human Representations
Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J. Black, Peter V. Gehler
Deep Multitask Architecture for Integrated 2D and 3D Human Sensing
Alin-Ionut Popa, Mihai Zanfir, Cristian Sminchisescu
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira, Andrew Zisserman
## Applications
Identifying First-Person Camera Wearers in Third-Person Videos
Chenyou Fan, Jangwon Lee, Mingze Xu, Krishna Kumar Singh, Yong Jae Lee, David J. Crandall, Michael S. Ryoo
## Biomedical Image/Video Analysis
Parsing Images of Overlapping Organisms With Deep Singling-Out Networks
Victor Yurchenko, Victor Lempitsky
Fine-Tuning Convolutional Neural Networks for Biomedical Image Analysis: Actively and Incrementally
Zongwei Zhou, Jae Shin, Lei Zhang, Suryakanth Gurudu, Michael Gotway, Jianming Liang
## Computational Photography
Depth From Defocus in the Wild
Huixuan Tang, Scott Cohen, Brian Price, Stephen Schiller, Kiriakos N. Kutulakos
Matting and Depth Recovery of Thin Structures Using a Focal Stack
Chao Liu, Srinivasa G. Narasimhan, Artur W. Dubrawski
## Image Motion & Tracking
Robust Interpolation of Correspondences for Large Displacement Optical Flow
Yinlin Hu, Yunsong Li, Rui Song
Large Margin Object Tracking With Circulant Feature Maps
Mengmeng Wang, Yong Liu, Zeyi Huang
Minimum Delay Moving Object Detection
Dong Lao, Ganesh Sundaramoorthi
Multi-Task Correlation Particle Filter for Robust Object Tracking
Tianzhu Zhang, Changsheng Xu, Ming-Hsuan Yang
Attentional Correlation Filter Network for Adaptive Visual Tracking
Jongwon Choi, Hyung Jin Chang, Sangdoo Yun, Tobias Fischer, Yiannis Demiris, Jin Young Choi
The World of Fast Moving Objects
Denys Rozumnyi, Jan Kotera, Filip Šroubek, Lukáš Novotný, Jiří Matas
Discriminative Correlation Filter With Channel and Spatial Reliability
Alan Lukežič, Tomáš Vojíř, Luka Čehovin Zajc, Jiří Matas, Matej Kristan
## Low- & Mid-Level Vision
Learning Deep Binary Descriptor With Multi-Quantization
Yueqi Duan, Jiwen Lu, Ziwei Wang, Jianjiang Feng, Jie Zhou
One-To-Many Network for Visually Pleasing Compression Artifacts Reduction
Jun Guo, Hongyang Chao
Gated Feedback Refinement Network for Dense Image Labeling
Md Amirul Islam, Mrigank Rochan, Neil D. B. Bruce, Yang Wang
BRISKS: Binary Features for Spherical Images on a Geodesic Grid
Hao Guan, William A. P. Smith
Superpixels and Polygons Using Simple Non-Iterative Clustering
Radhakrishna Achanta, Sabine Süsstrunk
Hardware-Efficient Guided Image Filtering for Multi-Label Problem
Longquan Dai, Mengke Yuan, Zechao Li, Xiaopeng Zhang, Jinhui Tang
Alternating Direction Graph Matching
D. Khuê Lê-Huu, Nikos Paragios
Learning Discriminative and Transformation Covariant Local Feature Detectors
Xu Zhang, Felix X. Yu, Svebor Karaman, Shih-Fu Chang
## Machine Learning
Correlational Gaussian Processes for Cross-Domain Visual Recognition
Chengjiang Long, Gang Hua
DeLiGAN : Generative Adversarial Networks for Diverse and Limited Data
Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu
Oriented Response Networks
Yanzhao Zhou, Qixiang Ye, Qiang Qiu, Jianbin Jiao
Missing Modalities Imputation via Cascaded Residual Autoencoder
Luan Tran, Xiaoming Liu, Jiayu Zhou, Rong Jin
Efficient Optimization for Hierarchically-structured Interacting Segments (HINTS)
Hossam Isack, Olga Veksler, Ipek Oguz, Milan Sonka, Yuri Boykov
A Message Passing Algorithm for the Minimum Cost Multicut Problem
Paul Swoboda, Bjoern Andres
End-To-End Representation Learning for Correlation Filter Based Tracking
Jack Valmadre, Luca Bertinetto, João Henriques, Andrea Vedaldi, Philip H. S. Torr
Filter Flow Made Practical: Massively Parallel and Lock-Free
Sathya N. Ravi, Yunyang Xiong, Lopamudra Mukherjee, Vikas Singh
Online Graph Completion: Multivariate Signal Recovery in Computer Vision
Won Hwa Kim, Mona Jalal, Seongjae Hwang, Sterling C. Johnson, Vikas Singh
Point to Set Similarity Based Deep Feature Learning for Person Re-Identification
Sanping Zhou, Jinjun Wang, Jiayun Wang, Yihong Gong, Nanning Zheng
Exploiting Saliency for Object Segmentation From Image Level Labels
Seong Joon Oh, Rodrigo Benenson, Anna Khoreva, Zeynep Akata, Mario Fritz, Bernt Schiele
Consensus Maximization With Linear Matrix Inequality Constraints
Pablo Speciale, Danda Pani Paudel, Martin R. Oswald, Till Kroeger, Luc Van Gool, Marc Pollefeys
Physically-Based Rendering for Indoor Scene Understanding Using Convolutional Neural Networks
Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, Thomas Funkhouser
Deep Multimodal Representation Learning From Temporal Data
Xitong Yang, Palghat Ramesh, Radha Chitta, Sriganesh Madhvanath, Edgar A. Bernal, Jiebo Luo
All You Need Is Beyond a Good Init: Exploring Better Solution for Training Extremely Deep Convolutional Neural Networks With Orthonormality and Modulation
Di Xie, Jiang Xiong, Shiliang Pu
Hard Mixtures of Experts for Large Scale Weakly Supervised Vision
Sam Gross, Marc'Aurelio Ranzato, Arthur Szlam
A Reinforcement Learning Approach to the View Planning Problem
Mustafa Devrim Kaba, Mustafa Gokhan Uzunbas, Ser Nam Lim
Zero-Shot Classification With Discriminative Semantic Representation Learning
Meng Ye, Yuhong Guo
Adversarial Discriminative Domain Adaptation
Eric Tzeng, Judy Hoffman, Kate Saenko, Trevor Darrell
## None of the above
Learning to Rank Retargeted Images
Yang Chen, Yong-Jin Liu, Yu-Kun Lai
## Object Recognition & Scene Understanding
Automatic Discovery, Association Estimation and Learning of Semantic Attributes for a Thousand Categories
Ziad Al-Halah, Rainer Stiefelhagen
Scene Parsing Through ADE20K Dataset
Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, Antonio Torralba
Weakly Supervised Cascaded Convolutional Networks
Ali Diba, Vivek Sharma, Ali Pazandeh, Hamed Pirsiavash, Luc Van Gool
Discretely Coding Semantic Rank Orders for Supervised Image Hashing
Li Liu, Ling Shao, Fumin Shen, Mengyang Yu
Joint Geometrical and Statistical Alignment for Visual Domain Adaptation
Jing Zhang, Wanqing Li, Philip Ogunbona
Weakly Supervised Dense Video Captioning
Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, Xiangyang Xue
RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation
Guosheng Lin, Anton Milan, Chunhua Shen, Ian Reid
Semantic Segmentation via Structured Patch Prediction, Context CRF and Guidance CRF
Falong Shen, Rui Gan, Shuicheng Yan, Gang Zeng
Person Search With Natural Language Description
Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, Xiaogang Wang
Weakly Supervised Affordance Detection
Johann Sawatzky, Abhilash Srikantha, Juergen Gall
Zero-Shot Recognition Using Dual Visual-Semantic Mapping Paths
Yanan Li, Donghui Wang, Huanhang Hu, Yuetan Lin, Yueting Zhuang
Neural Aggregation Network for Video Face Recognition
Jiaolong Yang, Peiran Ren, Dongqing Zhang, Dong Chen, Fang Wen, Hongdong Li, Gang Hua
Relationship Proposal Networks
Ji Zhang, Mohamed Elhoseiny, Scott Cohen, Walter Chang, Ahmed Elgammal
Learning Object Interactions and Descriptions for Semantic Image Segmentation
Guangrun Wang, Ping Luo, Liang Lin, Xiaogang Wang
RON: Reverse Connection With Objectness Prior Networks for Object Detection
Tao Kong, Fuchun Sun, Anbang Yao, Huaping Liu, Ming Lu, Yurong Chen
Weakly-Supervised Visual Grounding of Phrases With Linguistic Structures
Fanyi Xiao, Leonid Sigal, Yong Jae Lee
Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects
Ting Yao, Yingwei Pan, Yehao Li, Tao Mei
Beyond Instance-Level Image Retrieval: Leveraging Captions to Learn a Global Visual Representation for Semantic Retrieval
Albert Gordo, Diane Larlus
MuCaLe-Net: Multi Categorical-Level Networks to Generate More Discriminating Features
Youssef Tamaazousti, Hervé Le Borgne, Céline Hudelot
Zero Shot Learning via Multi-Scale Manifold Regularization
Shay Deutsch, Soheil Kolouri, Kyungnam Kim, Yuri Owechko, Stefano Soatto
## Theory
Deeply Supervised Salient Object Detection With Short Connections
Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, Philip H. S. Torr
A Matrix Splitting Method for Composite Function Minimization
Ganzhao Yuan, Wei-Shi Zheng, Bernard Ghanem
## Video Analytics
One-Shot Video Object Segmentation
Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, Luc Van Gool
Fast Person Re-Identification via Cross-Camera Semantic Binary Transformation
Jiaxin Chen, Yunhong Wang, Jie Qin, Li Liu, Ling Shao
SPFTN: A Self-Paced Fine-Tuning Network for Segmenting Objects in Weakly Labelled Videos
Dingwen Zhang, Le Yang, Deyu Meng, Dong Xu, Junwei Han
# Machine Learning 4
## Spotlight 4-1A
#### *Hidden Layers in Perceptual Learning
Gad Cohen, Daphna Weinshall
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### *Few-Shot Object Recognition From Machine-Labeled Web Images [PDF](https://arxiv.org/abs/1612.06152)
Zhongwen Xu, Linchao Zhu, Yi Yang
_**Abstract**_:  _[With the tremendous advances of Convolutional Neural Networks (ConvNets) on object recognition, we can now obtain reliable enough machine-labeled annotations easily by predictions from off-the-shelf ConvNets. In this work, we present an abstraction memory based framework for few-shot learning, building upon machine-labeled image annotations. Our method takes some large-scale machine-annotated datasets (e.g., OpenImages) as an external memory bank. In the external memory bank, the information is stored in the memory slots with the form of key-value, where image feature is regarded as key and label embedding serves as value. When queried by the few-shot examples, our model selects visually similar data from the external memory bank, and writes the useful information obtained from related external data into another memory bank, i.e., abstraction memory. Long Short-Term Memory (LSTM) controllers and attention mechanisms are utilized to guarantee the data written to the abstraction memory is correlated to the query example. The abstraction memory concentrates information from the external memory bank, so that it makes the few-shot recognition effective. In the experiments, we first confirm that our model can learn to conduct few-shot object recognition on clean human-labeled data from the ImageNet dataset. Then, we demonstrate that with our model, machine-labeled image annotations are very effective and abundant resources to perform object recognition on novel categories. Experimental results show that our proposed model with machine-labeled annotations achieves great performance, with a gap of only 1% from the one with human-labeled annotations. ]_
_**Comment**_:  _< few-shot learning, machine-labeled>_
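The key-value memory read described in the abstract can be sketched as a soft attention lookup. This is an illustrative toy, assuming dot-product scores and a softmax; the paper's LSTM controllers and its write step into the abstraction memory are not modelled, and all names and numbers below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(query, keys, values):
    """Attention read: weight each value by how well its key matches the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return read, weights


# Keys play the role of image features, values the role of label embeddings.
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [5.0, 0.0]  # feature of a few-shot example

read, weights = read_memory(query, keys, values)
print(weights)  # the slot whose key is most aligned with the query dominates
```

Because the read is a weighted average, slots with visually dissimilar keys contribute little, which mirrors the "selects visually similar data" behaviour in spirit.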
#### Hallucinating Very Low-Resolution Unaligned and Noisy Face Images by Transformative Discriminative Autoencoders
Xin Yu, Fatih Porikli
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension
Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, Hannaneh Hajishirzi
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### *Deep Hashing Network for Unsupervised Domain Adaptation [PDF](https://arxiv.org/abs/1706.07522)
Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, Sethuraman Panchanathan
_**Abstract**_:  _[In recent years, deep neural networks have emerged as a dominant machine learning tool for a wide variety of application domains. However, training a deep neural network requires a large amount of labeled data, which is an expensive process in terms of time, labor and human expertise. Domain adaptation or transfer learning algorithms address this challenge by leveraging labeled data in a different, but related source domain, to develop a model for the target domain. Further, the explosive growth of digital data has posed a fundamental challenge concerning its storage and retrieval. Due to its storage and retrieval efficiency, recent years have witnessed a wide application of hashing in a variety of computer vision applications. In this paper, we first introduce a new dataset, Office-Home, to evaluate domain adaptation algorithms. The dataset contains images of a variety of everyday objects from multiple domains. We then propose a novel deep learning framework that can exploit labeled source data and unlabeled target data to learn informative hash codes, to accurately classify unseen target data. To the best of our knowledge, this is the first research effort to exploit the feature learning capabilities of deep neural networks to learn representative hash codes to address the domain adaptation problem. Our extensive empirical studies on multiple transfer tasks corroborate the usefulness of the framework in learning efficient hash codes which outperform existing competitive baselines for unsupervised domain adaptation. ]_
_**Comment**_:  _< Domain adaptation or transfer learning algorithms that leverage labeled data in a different but related source domain. Office-Home: domain adaptation dataset. Learn representative hash codes >_
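To make "hash codes" concrete, here is a minimal sketch of the generic idea: binarize a feature vector and compare codes by Hamming distance. It assumes simple sign thresholding of made-up features; the network that learns the codes in the paper is not reproduced, and both helper names are hypothetical.

```python
def to_hash_code(features):
    """Binarize a real-valued feature vector by sign (1 if positive, else 0)."""
    return [1 if f > 0 else 0 for f in features]


def hamming_distance(code_a, code_b):
    """Number of bit positions in which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# Hypothetical network outputs for one source image and one target image.
source_code = to_hash_code([0.7, -1.2, 0.3, -0.1])
target_code = to_hash_code([0.5, -0.4, -0.2, -0.9])
print(source_code, target_code, hamming_distance(source_code, target_code))
```

Binary codes are attractive for storage and retrieval because they are compact and the distance reduces to counting differing bits.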
#### **Generalized Deep Image to Image Regression [PDF](https://arxiv.org/abs/1612.03268)
Venkataraman Santhanam, Vlad I. Morariu, Larry S. Davis
_**Abstract**_:  _[We present a Deep Convolutional Neural Network architecture which serves as a generic image-to-image regressor that can be trained end-to-end without any further machinery. Our proposed architecture: the Recursively Branched Deconvolutional Network (RBDN) develops a cheap multi-context image representation very early on using an efficient recursive branching scheme with extensive parameter sharing and learnable upsampling. This multi-context representation is subjected to a highly non-linear locality preserving transformation by the remainder of our network comprising of a series of convolutions/deconvolutions without any spatial downsampling. The RBDN architecture is fully convolutional and can handle variable sized images during inference. We provide qualitative/quantitative results on 3 diverse tasks: relighting, denoising and colorization and show that our proposed RBDN architecture obtains comparable results to the state-of-the-art on each of these tasks when used off-the-shelf without any post processing or task-specific architectural modifications. ]_
_**Comment**_:  _< CNN as generic im-to-im regressor, multi-context representation, recursive branching scheme, no spatial downsampling>_
#### ***Deep Learning With Low Precision by Half-Wave Gaussian Quantization [PDF](https://arxiv.org/abs/1702.00953)
Zhaowei Cai, Xiaodong He, Jian Sun, Nuno Vasconcelos
_**Abstract**_:  _[The problem of quantizing the activations of a deep neural network is considered. An examination of the popular binary quantization approach shows that this consists of approximating a classical non-linearity, the hyperbolic tangent, by two functions: a piecewise constant sign function, which is used in feedforward network computations, and a piecewise linear hard tanh function, used in the backpropagation step during network learning. The problem of approximating the ReLU non-linearity, widely used in the recent deep learning literature, is then considered. A half-wave Gaussian quantizer (HWGQ) is proposed for forward approximation and shown to have efficient implementation, by exploiting the statistics of network activations and batch normalization operations commonly used in the literature. To overcome the problem of gradient mismatch, due to the use of different forward and backward approximations, several piece-wise backward approximators are then investigated. The implementation of the resulting quantized network, denoted as HWGQ-Net, is shown to achieve much closer performance to full precision networks, such as AlexNet, ResNet, GoogLeNet and VGG-Net, than previously available low-precision networks, with 1-bit binary weights and 2-bit quantized activations. ]_
_**Comment**_:  _< quantized network, quantizing the activations of a deep neural network, binary quantization approach (tanh), HWGQ (ReLU)>_
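The split between a quantized forward pass and a smoother backward surrogate can be illustrated with a toy example. This sketch assumes uniformly spaced levels with a hypothetical `step` of 0.5 and uses a clipped ReLU as one plausible piecewise backward approximator; the paper instead fits its quantizer to the Gaussian statistics of the activations, which is not done here.

```python
import math


def hwgq_forward(x, step=0.5, num_levels=3):
    """Quantized half-wave: 0 for x <= 0, else one of num_levels positive levels.

    With num_levels=3 the output takes 4 values, i.e. a 2-bit activation.
    """
    if x <= 0:
        return 0.0
    index = min(num_levels, math.ceil(x / step))  # clip at the top level
    return index * step


def clipped_relu_grad(x, step=0.5, num_levels=3):
    """Backward surrogate: gradient of a ReLU clipped at the largest level."""
    top = step * num_levels
    return 1.0 if 0 < x <= top else 0.0


for x in (-1.0, 0.2, 0.7, 10.0):
    print(x, hwgq_forward(x), clipped_relu_grad(x))
```

The forward function is piecewise constant, so its true gradient is zero almost everywhere; the surrogate supplies a usable gradient inside the quantizer's range and none outside it.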
#### Creativity: Generating Diverse Questions Using Variational Autoencoders [PDF](https://arxiv.org/abs/1704.03493)
Unnat Jain, Ziyu Zhang, Alexander G. Schwing
_**Abstract**_:  _[Generating diverse questions for given images is an important task for computational education, entertainment and AI assistants. Different from many conventional prediction techniques is the need for algorithms to generate a diverse set of plausible questions, which we refer to as "creativity". In this paper we propose a creative algorithm for visual question generation which combines the advantages of variational autoencoders with long short-term memory networks. We demonstrate that our framework is able to generate a large set of varying questions given a single input image. ]_
_**Comment**_:  _< Generating diverse questions for given images, combine variational AE with LSTM>_
## Oral 4-1A
#### *Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs [PDF](https://arxiv.org/abs/1611.08402)
Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, Michael M. Bronstein
_**Abstract**_:  _[Deep learning has achieved a remarkable performance breakthrough in several fields, most notably in speech recognition, natural language processing, and computer vision. In particular, convolutional neural network (CNN) architectures currently produce state-of-the-art performance on a variety of image analysis tasks such as object detection and recognition. Most of deep learning research has so far focused on dealing with 1D, 2D, or 3D Euclidean-structured data such as acoustic signals, images, or videos. Recently, there has been an increasing interest in geometric deep learning, attempting to generalize deep learning methods to non-Euclidean structured data such as graphs and manifolds, with a variety of applications from the domains of network analysis, computational social science, or computer graphics. In this paper, we propose a unified framework allowing to generalize CNN architectures to non-Euclidean domains (graphs and manifolds) and learn local, stationary, and compositional task-specific features. We show that various non-Euclidean CNN methods previously proposed in the literature can be considered as particular instances of our framework. We test the proposed method on standard tasks from the realms of image-, graph- and 3D shape analysis and show that it consistently outperforms previous approaches. ]_
_**Comment**_:  _< using CNN to deal with non-Euclidean structured data such as graphs and manifolds>_
#### Full Resolution Image Compression With Recurrent Neural Networks [PDF](https://arxiv.org/abs/1608.05148)
George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, Michele Covell
_**Abstract**_:  _[This paper presents a set of full-resolution lossy image compression methods based on neural networks. Each of the architectures we describe can provide variable compression rates during deployment without requiring retraining of the network: each network need only be trained once. All of our architectures consist of a recurrent neural network (RNN)-based encoder and decoder, a binarizer, and a neural network for entropy coding. We compare RNN types (LSTM, associative LSTM) and introduce a new hybrid of GRU and ResNet. We also study "one-shot" versus additive reconstruction architectures and introduce a new scaled-additive framework. We compare to previous work, showing improvements of 4.3%-8.8% AUC (area under the rate-distortion curve), depending on the perceptual metric used. As far as we know, this is the first neural network architecture that is able to outperform JPEG at image compression across most bitrates on the rate-distortion curve on the Kodak dataset images, with and without the aid of entropy coding. ]_
_**Comment**_:  _< image compression that outperforms JPEG using RNNs>_
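The additive reconstruction idea (each iteration encodes the residual left by the previous ones, and the outputs are summed) can be shown with a toy stand-in. This sketch replaces the learned RNN encoder and decoder with a fixed sign binarizer and a halving step size, and assumes inputs in [-1, 1]; it only illustrates how more iterations, and hence more bits, lower the error without changing the coder.

```python
def encode_additive(signal, iterations):
    """Return the per-iteration sign bits and the final reconstruction."""
    recon = [0.0] * len(signal)
    step = 0.5
    all_bits = []
    for _ in range(iterations):
        residual = [s - r for s, r in zip(signal, recon)]
        bits = [1 if r >= 0 else -1 for r in residual]       # binarizer
        recon = [r + step * b for r, b in zip(recon, bits)]  # additive update
        all_bits.append(bits)
        step /= 2  # later iterations make finer corrections
    return all_bits, recon


def max_error(signal, recon):
    """Largest absolute reconstruction error over the signal."""
    return max(abs(s - r) for s, r in zip(signal, recon))


signal = [0.8, -0.3, 0.05, -1.0]
for k in (1, 4, 8):
    _, recon = encode_additive(signal, k)
    print(k, max_error(signal, recon))
```

In this toy the error after `k` iterations is bounded by `0.5 ** k`, so the rate can be chosen at deployment time simply by stopping earlier or later.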
#### **Neural Face Editing With Intrinsic Image Disentangling [PDF](https://arxiv.org/abs/1704.04131)
Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, Dimitris Samaras
_**Abstract**_:  _[Traditional face editing methods often require a number of sophisticated and task specific algorithms to be applied one after the other --- a process that is tedious, fragile, and computationally intensive. In this paper, we propose an end-to-end generative adversarial network that infers a face-specific disentangled representation of intrinsic face properties, including shape (i.e. normals), albedo, and lighting, and an alpha matte. We show that this network can be trained on "in-the-wild" images by incorporating an in-network physically-based image formation module and appropriate loss functions. Our disentangling latent representation allows for semantically relevant edits, where one aspect of facial appearance can be manipulated while keeping orthogonal properties fixed, and we demonstrate its use for a number of facial editing applications. ]_
_**Comment**_:  _< GAN, disentangling latent representation allows for semantically relevant edits, divide different properties>_
#### **Ubernet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory [PDF](https://arxiv.org/abs/1609.02132)
Iasonas Kokkinos
_**Abstract**_:  _[In this work we introduce a convolutional neural network (CNN) that jointly handles low-, mid-, and high-level vision tasks in a unified architecture that is trained end-to-end. Such a universal network can act like a 'swiss knife' for vision tasks; we call this architecture an UberNet to indicate its overarching nature. We address two main technical challenges that emerge when broadening up the range of tasks handled by a single CNN: (i) training a deep architecture while relying on diverse training sets and (ii) training many (potentially unlimited) tasks with a limited memory budget. Properly addressing these two problems allows us to train accurate predictors for a host of tasks, without compromising accuracy. Through these advances we train in an end-to-end manner a CNN that simultaneously addresses (a) boundary detection (b) normal estimation (c) saliency estimation (d) semantic segmentation (e) human part segmentation (f) semantic boundary detection, (g) region proposal generation and object detection. We obtain competitive performance while jointly addressing all of these tasks in 0.7 seconds per frame on a single GPU. A demonstration of this system can be found at this http URL]_
_**Comment**_:  _< >_
# Analyzing Humans with 3D Vision
## Spotlight 4-1B
#### 3D Face Morphable Models In-The-Wild
James Booth, Epameinondas Antonakos, Stylianos Ploumpis, George Trigeorgis, Yannis Panagakis, Stefanos Zafeiriou
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### KillingFusion: Non-Rigid 3D Reconstruction Without Correspondences
Miroslava Slavcheva, Maximilian Baust, Daniel Cremers, Slobodan Ilic
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Detailed, Accurate, Human Shape Estimation From Clothed 3D Scan Sequences
Chao Zhang, Sergi Pujades, Michael J. Black, Gerard Pons-Moll
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### POSEidon: Face-From-Depth for Driver Pose Estimation
Guido Borghi, Marco Venturelli, Roberto Vezzani, Rita Cucchiara
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Human Shape From Silhouettes Using Generative HKS Descriptors and Cross-Modal Neural Networks
Endri Dibra, Himanshu Jain, Cengiz Öztireli, Remo Ziegler, Markus Gross
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Parametric T-Spline Face Morphable Model for Detailed Fitting in Shape Subspace
Weilong Peng, Zhiyong Feng, Chao Xu, Yong Su
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### 3D Menagerie: Modeling the 3D Shape and Pose of Animals
Silvia Zuffi, Angjoo Kanazawa, David W. Jacobs, Michael J. Black
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### ***iCaRL: Incremental Classifier and Representation Learning [PDF](https://arxiv.org/abs/1611.07725)
Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, Christoph H. Lampert
_**Abstract**_:  _[A major open problem on the road to artificial intelligence is the development of incrementally learning systems that learn about more and more concepts over time from a stream of data. In this work, we introduce a new training strategy, iCaRL, that allows learning in such a class-incremental way: only the training data for a small number of classes has to be present at the same time and new classes can be added progressively. iCaRL learns strong classifiers and a data representation simultaneously. This distinguishes it from earlier works that were fundamentally limited to fixed data representations and therefore incompatible with deep learning architectures. We show by experiments on CIFAR-100 and ImageNet ILSVRC 2012 data that iCaRL can learn many classes incrementally over a long period of time where other strategies quickly fail. ]_
_**Comment**_:  _< a class-incremental way, new classes can be added progressively.>_
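To picture what "new classes can be added progressively" means at the classifier level, here is a generic nearest-class-mean toy. It is an illustrative stand-in rather than the iCaRL training strategy: the features are fixed hand-written vectors, whereas the abstract stresses that iCaRL also learns the data representation. The class and method names are hypothetical.

```python
class NearestClassMean:
    """Toy classifier whose set of classes can grow over time."""

    def __init__(self):
        self.means = {}  # label -> mean feature vector

    def add_class(self, label, features):
        """Register a new class from its feature vectors only."""
        dim = len(features[0])
        self.means[label] = [
            sum(f[d] for f in features) / len(features) for d in range(dim)
        ]

    def predict(self, x):
        """Return the label whose class mean is closest to x."""
        def sq_dist(mean):
            return sum((a - b) ** 2 for a, b in zip(x, mean))
        return min(self.means, key=lambda label: sq_dist(self.means[label]))


clf = NearestClassMean()
clf.add_class("cat", [[0.0, 0.0], [0.2, 0.0]])
clf.add_class("dog", [[1.0, 1.0], [1.2, 1.0]])
print(clf.predict([0.1, 0.1]))

# A later batch brings a new class; earlier classes are not revisited.
clf.add_class("bird", [[5.0, 5.0], [5.0, 5.2]])
print(clf.predict([4.9, 5.0]), clf.predict([0.1, 0.1]))
```

Adding a class touches only that class's data, which is the property the class-incremental setting asks for.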
## Oral 4-1B
#### Recurrent 3D Pose Sequence Machines
Mude Lin, Liang Lin, Xiaodan Liang, Keze Wang, Hui Cheng
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### **Learning Detailed Face Reconstruction From a Single Image [PDF](https://arxiv.org/abs/1611.05053)
Elad Richardson, Matan Sela, Roy Or-El, Ron Kimmel
_**Abstract**_:  _[Reconstructing the detailed geometric structure of a face from a given image is a key to many computer vision and graphics applications, such as motion capture and reenactment. The reconstruction task is challenging as human faces vary extensively when considering expressions, poses, textures, and intrinsic geometries. While many approaches tackle this complexity by using additional data to reconstruct the face of a single subject, extracting facial surface from a single image remains a difficult problem. As a result, single-image based methods can usually provide only a rough estimate of the facial geometry. In contrast, we propose to leverage the power of convolutional neural networks to produce a highly detailed face reconstruction from a single image. For this purpose, we introduce an end-to-end CNN framework which derives the shape in a coarse-to-fine fashion. The proposed architecture is composed of two main blocks, a network that recovers the coarse facial geometry (CoarseNet), followed by a CNN that refines the facial features of that geometry (FineNet). The proposed networks are connected by a novel layer which renders a depth image given a mesh in 3D. Unlike object recognition and detection problems, there are no suitable datasets for training CNNs to perform face geometry reconstruction. Therefore, our training regime begins with a supervised phase, based on synthetic images, followed by an unsupervised phase that uses only unconstrained facial images. The accuracy and robustness of the proposed model is demonstrated by both qualitative and quantitative evaluation tests. ]_
_**Comment**_:  _< an end-to-end CNN framework which derives the shape in a coarse-to-fine fashion; the supervised training phase is based on synthetic images>_
#### Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos
Jie Song, Limin Wang, Luc Van Gool, Otmar Hilliges
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Dynamic FAUST: Registering Human Bodies in Motion
Federica Bogo, Javier Romero, Gerard Pons-Moll, Michael J. Black
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
# Poster 4-1
3D Computer Vision
Semantically Coherent Co-Segmentation and Reconstruction of Dynamic Scenes
Armin Mustafa, Adrian Hilton
On the Two-View Geometry of Unsynchronized Cameras
Cenek Albl, Zuzana Kukelova, Andrew Fitzgibbon, Jan Heller, Matej Smid, Tomas Pajdla
Using Locally Corresponding CAD Models for Dense 3D Reconstructions From a Single Image
Chen Kong, Chen-Hsuan Lin, Simon Lucey
A Clever Elimination Strategy for Efficient Minimal Solvers
Zuzana Kukelova, Joe Kileel, Bernd Sturmfels, Tomas Pajdla
Convex Global 3D Registration With Lagrangian Duality
Jesus Briales, Javier Gonzalez-Jimenez
DeMoN: Depth and Motion Network for Learning Monocular Stereo
Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, Thomas Brox
3D Bounding Box Estimation Using Deep Learning and Geometry
Arsalan Mousavian, Dragomir Anguelov, John Flynn, Jana Košecká
A Dataset for Benchmarking Image-Based Localization
Xun Sun, Yuanfan Xie, Pei Luo, Liang Wang
Analyzing Humans in Images
Asynchronous Temporal Fields for Action Recognition
Gunnar A. Sigurdsson, Santosh Divvala, Ali Farhadi, Abhinav Gupta
Sequential Person Recognition in Photo Albums With a Recurrent Network
Yao Li, Guosheng Lin, Bohan Zhuang, Lingqiao Liu, Chunhua Shen, Anton van den Hengel
Multi-Context Attention for Human Pose Estimation
Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L. Yuille, Xiaogang Wang
3D Convolutional Neural Networks for Efficient and Robust Hand Pose Estimation From Single Depth Images
Liuhao Ge, Hui Liang, Junsong Yuan, Daniel Thalmann
Lifting From the Deep: Convolutional 3D Pose Estimation From a Single Image
Denis Tome, Chris Russell, Lourdes Agapito
AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos
Amlan Kar, Nishant Rai, Karan Sikka, Gaurav Sharma
Deep Structured Learning for Facial Action Unit Intensity Estimation
Robert Walecki, Ognjen (Oggi) Rudovic, Vladimir Pavlovic, Björn Schuller, Maja Pantic
Simultaneous Facial Landmark Detection, Pose and Deformation Estimation Under Facial Occlusion
Yue Wu, Chao Gou, Qiang Ji
Self-Supervised Video Representation Learning With Odd-One-Out Networks
Basura Fernando, Hakan Bilen, Efstratios Gavves, Stephen Gould
Robust Joint and Individual Variance Explained
Christos Sagonas, Yannis Panagakis, Alina Leidinger, Stefanos Zafeiriou
Discriminative Covariance Oriented Representation Learning for Face Recognition With Image Sets
Wen Wang, Ruiping Wang, Shiguang Shan, Xilin Chen
3D Human Pose Estimation = 2D Pose Estimation + Matching
Ching-Hang Chen, Deva Ramanan
Applications
Joint Gap Detection and Inpainting of Line Drawings
Kazuma Sasaki, Satoshi Iizuka, Edgar Simo-Serra, Hiroshi Ishikawa
Biomedical Image/Video Analysis
Riemannian Nonlinear Mixed Effects Models: Analyzing Longitudinal Deformations in Neuroimaging
Hyunwoo J. Kim, Nagesh Adluru, Heemanshu Suri, Baba C. Vemuri, Sterling C. Johnson, Vikas Singh
Simultaneous Super-Resolution and Cross-Modality Synthesis of 3D Medical Images Using Weakly-Supervised Joint Convolutional Sparse Coding
Yawen Huang, Ling Shao, Alejandro F. Frangi
Computational Photography
Multiple-Scattering Microphysics Tomography
Aviad Levis, Yoav Y. Schechner, Anthony B. Davis
Image Motion & Tracking
Accurate Optical Flow via Direct Cost Volume Processing
Jia Xu, René Ranftl, Vladlen Koltun
Event-Based Visual Inertial Odometry
Alex Zihao Zhu, Nikolay Atanasov, Kostas Daniilidis
Robust Visual Tracking Using Oblique Random Forests
Le Zhang, Jagannadan Varadarajan, Ponnuthurai Nagaratnam Suganthan, Narendra Ahuja, Pierre Moulin
Low- & Mid-Level Vision
Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution
Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, Ming-Hsuan Yang
Learning Non-Lambertian Object Intrinsics Across ShapeNet Categories
Jian Shi, Yue Dong, Hao Su, Stella X. Yu
MCMLSD: A Dynamic Programming Approach to Line Segment Detection
Emilio J. Almazán, Ron Tal, Yiming Qian, James H. Elder
Contour-Constrained Superpixels for Image and Video Processing
Se-Ho Lee, Won-Dong Jang, Chang-Su Kim
Richer Convolutional Features for Edge Detection
Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, Xiang Bai
Non-Local Color Image Denoising With Convolutional Neural Networks
Stamatios Lefkimmiatis
Generative Face Completion
Yijun Li, Sifei Liu, Jimei Yang, Ming-Hsuan Yang
Hyper-Laplacian Regularized Unidirectional Low-Rank Tensor Recovery for Multispectral Image Denoising
Yi Chang, Luxin Yan, Sheng Zhong
Unsupervised Semantic Scene Labeling for Streaming Data
Maggie Wigness, John G. Rogers III
Why You Should Forget Luminance Conversion and Do Something Better
Rang M. H. Nguyen, Michael S. Brown
Deep Semantic Feature Matching
Nikolai Ufer, Björn Ommer
Machine Learning
Revisiting the Variable Projection Method for Separable Nonlinear Least Squares Problems
Je Hyeong Hong, Christopher Zach, Andrew Fitzgibbon
Efficient Multiple Instance Metric Learning Using Weakly Supervised Data
Marc T. Law, Yaoliang Yu, Raquel Urtasun, Richard S. Zemel, Eric P. Xing
WILDCAT: Weakly Supervised Learning of Deep ConvNets for Image Classification, Pointwise Localization and Segmentation
Thibaut Durand, Taylor Mordan, Nicolas Thome, Matthieu Cord
Image-To-Image Translation With Conditional Adversarial Networks
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros
Deep Roots: Improving CNN Efficiency With Hierarchical Filter Groups
Yani Ioannou, Duncan Robertson, Roberto Cipolla, Antonio Criminisi
Aggregated Residual Transformations for Deep Neural Networks
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He
MIML-FCN+: Multi-Instance Multi-Label Learning via Fully Convolutional Networks With Privileged Information
Hao Yang, Joey Tianyi Zhou, Jianfei Cai, Yew Soon Ong
Low-Rank Embedded Ensemble Semantic Dictionary for Zero-Shot Learning
Zhengming Ding, Ming Shao, Yun Fu
Factorized Variational Autoencoders for Modeling Audience Reactions to Movies
Zhiwei Deng, Rajitha Navarathna, Peter Carr, Stephan Mandt, Yisong Yue, Iain Matthews, Greg Mori
Learning Features by Watching Objects Move
Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, Bharath Hariharan
What Can Help Pedestrian Detection?
Jiayuan Mao, Tete Xiao, Yuning Jiang, Zhimin Cao
DeepPermNet: Visual Permutation Learning
Rodrigo Santa Cruz, Basura Fernando, Anoop Cherian, Stephen Gould
Learning the Multilinear Structure of Visual Data
Mengjiao Wang, Yannis Panagakis, Patrick Snape, Stefanos Zafeiriou
Adaptive and Move Making Auxiliary Cuts for Binary Pairwise Energies
Lena Gorelick, Yuri Boykov, Olga Veksler
Designing Energy-Efficient Convolutional Neural Networks Using Energy-Aware Pruning
Tien-Ju Yang, Yu-Hsin Chen, Vivienne Sze
Joint Multi-Person Pose Estimation and Semantic Part Segmentation
Fangting Xia, Peng Wang, Xianjie Chen, Alan L. Yuille
Deep Feature Interpolation for Image Content Changes
Paul Upchurch, Jacob Gardner, Geoff Pleiss, Robert Pless, Noah Snavely, Kavita Bala, Kilian Weinberger
FASON: First and Second Order Information Fusion Network for Texture Recognition
Xiyang Dai, Joe Yue-Hei Ng, Larry S. Davis
Lean Crowdsourcing: Combining Humans and Machines in an Online System
Steve Branson, Grant Van Horn, Pietro Perona
Object Recognition & Scene Understanding
Supervising Neural Attention Models for Video Captioning by Human Gaze Data
Youngjae Yu, Jongwook Choi, Yeonhwa Kim, Kyung Yoo, Sang-Hun Lee, Gunhee Kim
L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space
Yurun Tian, Bin Fan, Fuchao Wu
Convolutional Random Walk Networks for Semantic Image Segmentation
Gedas Bertasius, Lorenzo Torresani, Stella X. Yu, Jianbo Shi
Knowledge Acquisition for Visual Question Answering via Iterative Querying
Yuke Zhu, Joseph J. Lim, Li Fei-Fei
Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search
Bo Zhao, Jiashi Feng, Xiao Wu, Shuicheng Yan
From Zero-Shot Learning to Conventional Supervised Classification: Unseen Visual Data Synthesis
Yang Long, Li Liu, Ling Shao, Fumin Shen, Guiguang Ding, Jungong Han
Are Large-Scale 3D Models Really Necessary for Accurate Visual Localization?
Torsten Sattler, Akihiko Torii, Josef Sivic, Marc Pollefeys, Hajime Taira, Masatoshi Okutomi, Tomas Pajdla
Asymmetric Feature Maps With Application to Sketch Based Retrieval
Giorgos Tolias, Ondřej Chum
Diverse Image Annotation
Baoyuan Wu, Fan Jia, Wei Liu, Bernard Ghanem
AMC: Attention guided Multi-modal Correlation Learning for Image Search
Kan Chen, Trung Bui, Chen Fang, Zhaowen Wang, Ram Nevatia
Multi-Attention Network for One Shot Learning
Peng Wang, Lingqiao Liu, Chunhua Shen, Zi Huang, Anton van den Hengel, Heng Tao Shen
Fried Binary Embedding for High-Dimensional Visual Features
Weixiang Hong, Junsong Yuan, Sreyasee Das Bhattacharjee
Pyramid Scene Parsing Network
Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia
Learning Deep Match Kernels for Image-Set Classification
Haoliang Sun, Xiantong Zhen, Yuanjie Zheng, Gongping Yang, Yilong Yin, Shuo Li
Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description
Xishan Zhang, Ke Gao, Yongdong Zhang, Dongming Zhang, Jintao Li, Qi Tian
Learning Multifunctional Binary Codes for Both Category and Attribute Oriented Retrieval Tasks
Haomiao Liu, Ruiping Wang, Shiguang Shan, Xilin Chen
Indoor Scene Parsing With Instance Segmentation, Semantic Labeling and Support Relationship Inference
Wei Zhuo, Mathieu Salzmann, Xuming He, Miaomiao Liu
Episodic CAMN: Contextual Attention-Based Memory Networks With Iterative Feedback for Scene Labeling
Abrar H. Abdulnabi, Bing Shuai, Stefan Winkler, Gang Wang
Link the Head to the “Beak”: Zero Shot Learning From Noisy Text Description at Part Precision
Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, Ahmed Elgammal
SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning
Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, Tat-Seng Chua
Deep Pyramidal Residual Networks
Dongyoon Han, Jiwhan Kim, Junmo Kim
Product Split Trees
Artem Babenko, Victor Lempitsky
Making the v in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, Devi Parikh
Commonly Uncommon: Semantic Sparsity in Situation Recognition
Mark Yatskar, Vicente Ordonez, Luke Zettlemoyer, Ali Farhadi
Cross-Modality Binary Code Learning via Fusion Similarity Hashing
Hong Liu, Rongrong Ji, Yongjian Wu, Feiyue Huang, Baochang Zhang
Theory
Saliency Revisited: Analysis of Mouse Movements Versus Fixations
Hamed R. Tavakoli, Fawad Ahmed, Ali Borji, Jorma Laaksonen
InterpoNet, a Brain Inspired Neural Network for Optical Flow Dense Interpolation
Shay Zweig, Lior Wolf
Video Analytics
SST: Single-Stream Temporal Action Proposals
Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, Juan Carlos Niebles
Video Segmentation via Multiple Granularity Analysis
Rui Yang, Bingbing Ni, Chao Ma, Yi Xu, Xiaokang Yang
Spatio-Temporal Alignment of Non-Overlapping Sequences From Independently Panning Cameras
Seyed Morteza Safdarnejad, Xiaoming Liu
UntrimmedNets for Weakly Supervised Action Recognition and Detection
Limin Wang, Yuanjun Xiong, Dahua Lin, Luc Van Gool
# Object Recognition & Scene Understanding 3
## Spotlight 4-2A
#### *Gaze Embeddings for Zero-Shot Image Classification [PDF](https://arxiv.org/abs/1611.09309)
Nour Karessli, Zeynep Akata, Bernt Schiele, Andreas Bulling
_**Abstract**_:  _[Zero-shot image classification using auxiliary information, such as attributes describing discriminative object properties, requires time-consuming annotation by domain experts. We instead propose a method that relies on human gaze as auxiliary information, exploiting that even non-expert users have a natural ability to judge class membership. We present a data collection paradigm that involves a discrimination task to increase the information content obtained from gaze data. Our method extracts discriminative descriptors from the data and learns a compatibility function between image and gaze using three novel gaze embeddings: Gaze Histograms (GH), Gaze Features with Grid (GFG) and Gaze Features with Sequence (GFS). We introduce two new gaze-annotated datasets for fine-grained image classification and show that human gaze data is indeed class discriminative, provides a competitive alternative to expert-annotated attributes, and outperforms other baselines for zero-shot image classification. ]_
_**Comment**_:  _< human gaze as auxiliary information; how is it collected?>_
#### What's in a Question: Using Visual Questions as a Form of Supervision [PDF](https://arxiv.org/abs/1704.03895)
Siddha Ganju, Olga Russakovsky, Abhinav Gupta
_**Abstract**_:  _[Collecting fully annotated image datasets is challenging and expensive. Many types of weak supervision have been explored: weak manual annotations, web search results, temporal continuity, ambient sound and others. We focus on one particular unexplored mode: visual questions that are asked about images. The key observation that inspires our work is that the question itself provides useful information about the image (even without the answer being available). For instance, the question "what is the breed of the dog?" informs the AI that the animal in the scene is a dog and that there is only one dog present. We make three contributions: (1) providing an extensive qualitative and quantitative analysis of the information contained in human visual questions, (2) proposing two simple but surprisingly effective modifications to the standard visual question answering models that allow them to make use of weak supervision in the form of unanswered questions associated with images and (3) demonstrating that a simple data augmentation strategy inspired by our insights results in a 7.1% improvement on the standard VQA benchmark. ]_
_**Comment**_:  _< weak supervision: unanswered visual questions>_
#### *Attend to You: Personalized Image Captioning With Context Sequence Memory Networks [PDF](https://arxiv.org/abs/1704.06485)
Cesc Chunseong Park, Byeongchang Kim, Gunhee Kim
_**Abstract**_:  _[We address personalization issues of image captioning, which have not been discussed yet in previous research. For a query image, we aim to generate a descriptive sentence, accounting for prior knowledge such as the user's active vocabularies in previous documents. As applications of personalized image captioning, we tackle two post automation tasks: hashtag prediction and post generation, on our newly collected Instagram dataset, consisting of 1.1M posts from 6.3K users. We propose a novel captioning model named Context Sequence Memory Network (CSMN). Its unique updates over previous memory network models include (i) exploiting memory as a repository for multiple types of context information, (ii) appending previously generated words into memory to capture long-term information without suffering from the vanishing gradient problem, and (iii) adopting CNN memory structure to jointly represent nearby ordered memory slots for better context understanding. With quantitative evaluation and user studies via Amazon Mechanical Turk, we show the effectiveness of the three novel features of CSMN and its performance enhancement for personalized image captioning over state-of-the-art captioning models. ]_
_**Comment**_:  _< Personalized Image Captioning, adopting CNN memory structure?>_
#### Adversarially Tuned Scene Generation [PDF](https://arxiv.org/abs/1701.00405)
VSR Veeravasarapu, Constantin Rothkopf, Ramesh Visvanathan
_**Abstract**_:  _[Generalization performance of trained computer vision systems that use computer graphics (CG) generated data is not yet effective due to the concept of 'domain-shift' between virtual and real data. Although simulated data augmented with a few real world samples has been shown to mitigate domain shift and improve transferability of trained models, guiding or bootstrapping the virtual data generation with the distributions learnt from target real world domain is desired, especially in the fields where annotating even few real images is laborious (such as semantic labeling, and intrinsic images etc.). In order to address this problem in an unsupervised manner, our work combines recent advances in CG (which aims to generate stochastic scene layouts coupled with large collections of 3D object models) and generative adversarial training (which aims train generative models by measuring discrepancy between generated and real data in terms of their separability in the space of a deep discriminatively-trained classifier). Our method uses iterative estimation of the posterior density of prior distributions for a generative graphical model. This is done within a rejection sampling framework. Initially, we assume uniform distributions as priors on the parameters of a scene described by a generative graphical model. As iterations proceed the prior distributions get updated to distributions that are closer to the (unknown) distributions of target data. We demonstrate the utility of adversarially tuned scene generation on two real-world benchmark datasets (CityScapes and CamVid) for traffic scene semantic labeling with a deep convolutional net (DeepLab). We realized performance improvements by 2.28 and 3.14 points (using the IoU metric) between the DeepLab models trained on simulated sets prepared from the scene generation models before and after tuning to CityScapes and CamVid respectively. ]_
_**Comment**_:  _< 'domain-shift' between virtual and real data>_
#### ***Residual Attention Network for Image Classification [PDF](https://arxiv.org/abs/1704.06904)
Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, Xiaoou Tang
_**Abstract**_:  _[In this work, we propose "Residual Attention Network", a convolutional neural network using attention mechanism which can incorporate with state-of-art feed forward network architecture in an end-to-end training fashion. Our Residual Attention Network is built by stacking Attention Modules which generate attention-aware features. The attention-aware features from different modules change adaptively as layers going deeper. Inside each Attention Module, bottom-up top-down feedforward structure is used to unfold the feedforward and feedback attention process into a single feedforward process. Importantly, we propose attention residual learning to train very deep Residual Attention Networks which can be easily scaled up to hundreds of layers. Extensive analyses are conducted on CIFAR-10 and CIFAR-100 datasets to verify the effectiveness of every module mentioned above. Our Residual Attention Network achieves state-of-the-art object recognition performance on three benchmark datasets including CIFAR-10 (3.90% error), CIFAR-100 (20.45% error) and ImageNet (4.8% single model and single crop, top-5 error). Note that, our method achieves 0.6% top-1 accuracy improvement with 46% trunk depth and 69% forward FLOPs comparing to ResNet-200. The experiment also demonstrates that our network is robust against noisy labels. ]_
_**Comment**_:  _< based on CNN, by stacking Attention Modules which generate attention-aware features>_
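
A small sketch of what "attention residual learning" means, as far as I understand the paper: the soft mask M(x) modulates the trunk features T(x) as (1 + M(x)) * T(x) instead of M(x) * T(x), so a mask close to zero leaves the trunk features intact rather than wiping them out. The formula is recalled from the paper, not from the abstract above; the function names and the flat-list features are hypothetical simplifications.

```python
import math


def sigmoid(z):
    """Squash a mask logit into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def naive_attention(trunk, mask_logits):
    """Plain masking, M(x) * T(x): a mask near 0 suppresses the feature."""
    return [sigmoid(m) * t for t, m in zip(trunk, mask_logits)]


def attention_residual(trunk, mask_logits):
    """Attention residual learning, (1 + M(x)) * T(x): a mask near 0 keeps T(x)."""
    return [(1.0 + sigmoid(m)) * t for t, m in zip(trunk, mask_logits)]


trunk = [1.0, 2.0]
mask_logits = [-10.0, 10.0]  # first feature masked out, second emphasized

naive = naive_attention(trunk, mask_logits)
residual = attention_residual(trunk, mask_logits)
```

With the naive form the first feature is driven to almost zero, and stacking many such modules would keep shrinking it; with the residual form it stays close to its trunk value, which is presumably why very deep stacks remain trainable.
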
#### **Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade [PDF](https://arxiv.org/abs/1704.01344)
Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou Tang
_**Abstract**_:  _[We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes deeper stage focuses on a few hard regions. Such an adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and Cityscapes datasets, achieving state-of-the-art performance and fast speed. ]_
_**Comment**_:  _< treats a single deep model as a cascade of several sub-models; LC classifies most of the easy regions in the shallow stage>_
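
The early-decision idea can be sketched as follows, assuming each stage returns a label and a confidence per pixel: confident pixels are finalized in the shallow stage and only the hard ones are passed on. This is a toy on scalar "pixels", not the paper's network (which works on feature-map regions), and `layer_cascade`, `shallow_stage`, `deep_stage` and the 0.9 threshold are all hypothetical.

```python
def layer_cascade(pixels, stages, threshold=0.9):
    """Label pixels with a cascade of stages, stopping early on confident ones.

    `stages` is a list of functions mapping a pixel value to (label, confidence).
    Returns the labels and the number of pixels each stage had to process.
    """
    labels = [None] * len(pixels)
    pending = list(range(len(pixels)))
    processed_per_stage = []
    for depth, stage in enumerate(stages):
        processed_per_stage.append(len(pending))
        is_last = depth == len(stages) - 1
        still_hard = []
        for idx in pending:
            label, confidence = stage(pixels[idx])
            if confidence >= threshold or is_last:
                labels[idx] = label  # early decision (or final stage)
            else:
                still_hard.append(idx)  # forward the hard pixel to the next stage
        pending = still_hard
    return labels, processed_per_stage


def shallow_stage(value):
    """Cheap stage: confident only far away from the decision boundary 0.5."""
    label = "fg" if value > 0.5 else "bg"
    return label, 0.5 + abs(value - 0.5)


def deep_stage(value):
    """Expensive stage: always decides."""
    label = "fg" if value > 0.5 else "bg"
    return label, 1.0


pixels = [0.02, 0.95, 0.55, 0.98, 0.45]
labels, processed = layer_cascade(pixels, [shallow_stage, deep_stage])
```

Here the deep stage only sees the two ambiguous pixels, which is where the speed-up in the abstract comes from.
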
#### Learning Non-Maximum Suppression [PDF](https://arxiv.org/abs/1705.02950)
Jan Hosang, Rodrigo Benenson, Bernt Schiele
_**Abstract**_:  _[Object detectors have hugely profited from moving towards an end-to-end learning paradigm: proposals, features, and the classifier becoming one neural network improved results two-fold on general object detection. One indispensable component is non-maximum suppression (NMS), a post-processing algorithm responsible for merging all detections that belong to the same object. The de facto standard NMS algorithm is still fully hand-crafted, suspiciously simple, and -- being based on greedy clustering with a fixed distance threshold -- forces a trade-off between recall and precision. We propose a new network architecture designed to perform NMS, using only boxes and their score. We report experiments for person detection on PETS and for general object categories on the COCO dataset. Our approach shows promise providing improved localization and occlusion handling. ]_
_**Comment**_:  _< object detection>_
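
For reference, a minimal sketch of the hand-crafted baseline the abstract criticizes: greedy NMS with a fixed IoU threshold. This is the classic algorithm, not the learned network proposed in the paper; the box format `(x1, y1, x2, y2)` and the function names are my own assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

The single `iou_threshold` is the trade-off mentioned in the abstract: a low value also suppresses genuinely distinct but nearby objects (lower recall), while a high value lets duplicate detections through (lower precision).
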
#### The Amazing Mysteries of the Gutter: Drawing Inferences Between Panels in Comic Book Narratives [PDF](https://arxiv.org/abs/1611.05118)
Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daumé III, Larry S. Davis
_**Abstract**_:  _[Visual narrative is often a combination of explicit information and judicious omissions, relying on the viewer to supply missing details. In comics, most movements in time and space are hidden in the "gutters" between panels. To follow the story, readers logically connect panels together by inferring unseen actions through a process called "closure". While computers can now describe what is explicitly depicted in natural images, in this paper we examine whether they can understand the closure-driven narratives conveyed by stylized artwork and dialogue in comic book panels. We construct a dataset, COMICS, that consists of over 1.2 million panels (120 GB) paired with automatic textbox transcriptions. An in-depth analysis of COMICS demonstrates that neither text nor image alone can tell a comic book story, so a computer must understand both modalities to keep up with the plot. We introduce three cloze-style tasks that ask models to predict narrative and character-centric aspects of a panel given n preceding panels as context. Various deep neural architectures underperform human baselines on these tasks, suggesting that COMICS contains fundamental challenges for both vision and language.]_
_**Comment**_:  _< comic book panels, a new dataset COMICS with text and image>_
## Oral 4-2A
#### *Object Region Mining With Adversarial Erasing: A Simple Classification to Semantic Segmentation Approach [PDF](https://arxiv.org/abs/1703.08448)
Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming Cheng, Yao Zhao, Shuicheng Yan
_**Abstract**_:  _[ We investigate a principle way to progressively mine discriminative object regions using classification networks to address the weakly-supervised semantic segmentation problems. Classification networks are only responsive to small and sparse discriminative regions from the object of interest, which deviates from the requirement of the segmentation task that needs to localize dense, interior and integral regions for pixel-wise inference. To mitigate this gap, we propose a new adversarial erasing approach for localizing and expanding object regions progressively. Starting with a single small object region, our proposed approach drives the classification network to sequentially discover new and complement object regions by erasing the current mined regions in an adversarial manner. These localized regions eventually constitute a dense and complete object region for learning semantic segmentation. To further enhance the quality of the discovered regions by adversarial erasing, an online prohibitive segmentation learning approach is developed to collaborate with adversarial erasing by providing auxiliary segmentation supervision modulated by the more reliable classification scores. Despite its apparent simplicity, the proposed approach achieves 55.0% and 55.7% mean Intersection-over-Union (mIoU) scores on PASCAL VOC 2012 val and test sets, which are the new state-of-the-arts. ]_
_**Comment**_:  _< proposes a new adversarial erasing approach using a classification network: starting with a single small object region, it drives the classification network to sequentially discover new and complementary object regions by erasing the currently mined regions in an adversarial manner>_
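
The erase-and-rediscover loop can be sketched on toy data, assuming a fixed list of per-region classifier responses stands in for the classification network. A real system would re-run or retrain the network after every erasing step, so this only illustrates the control flow; `mine_object_regions` and its parameters are hypothetical.

```python
def mine_object_regions(responses, min_response=0.3, max_steps=10):
    """Iteratively pick the most responsive region, erase it, and repeat.

    `responses` is a list of toy classifier responses, one per image region.
    """
    remaining = list(responses)
    mined = []
    for _ in range(max_steps):
        best = max(range(len(remaining)), key=lambda i: remaining[i])
        if remaining[best] < min_response:
            break  # nothing discriminative is left to discover
        mined.append(best)
        remaining[best] = 0.0  # "erase" the region so it cannot fire again
    return sorted(mined)


# Region 2 is the most discriminative part; regions 1 and 3 are weaker parts
# of the same object that only show up once region 2 has been erased.
responses = [0.05, 0.6, 0.9, 0.4, 0.1]
regions = mine_object_regions(responses)
```

The union of the mined regions plays the role of the dense object region that is later used as segmentation supervision.
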
#### *Fine-Grained Recognition as HSnet Search for Informative Image Parts [PDF](http://web.engr.oregonstate.edu/~sinisa/research/publications/cvpr17_lstmsearch.pdf)
Michael Lam, Behrooz Mahasseni, Sinisa Todorovic
_**Abstract**_:  _[]_
_**Comment**_:  _< fine-grained image classification; heuristic function and successor function unified via an LSTM. How is the CNN used?>_
#### *G2DeNet: Global Gaussian Distribution Embedding Network and Its Application to Visual Recognition
Qilong Wang, Peihua Li, Lei Zhang
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### *YOLO9000: Better, Faster, Stronger [PDF](https://arxiv.org/abs/1612.08242)
Joseph Redmon, Ali Farhadi
_**Abstract**_:  _[We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster RCNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don't have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories. And it still runs in real-time. ]_
_**Comment**_:  _< object detection system that can detect over 9000 object categories, a method to jointly train on object detection and classification, how is it jointly trained?>_
# Machine Learning for 3D Vision
## Spotlight 4-2B
#### Multi-View 3D Object Detection Network for Autonomous Driving
Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, Tian Xia
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### UltraStereo: Efficient Learning-Based Matching for Active Stereo Systems
Sean Ryan Fanello, Julien Valentin, Christoph Rhemann, Adarsh Kowdle, Vladimir Tankovich, Philip Davidson, Shahram Izadi
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Shape Completion Using 3D-Encoder-Predictor CNNs and Shape Synthesis [PDF](https://arxiv.org/abs/1612.00101)
Angela Dai, Charles Ruizhongtai Qi, Matthias Nießner
_**Abstract**_:  _[We introduce a data-driven approach to complete partial 3D shapes through a combination of volumetric deep neural networks and 3D shape synthesis. From a partially-scanned input shape, our method first infers a low-resolution -- but complete -- output. To this end, we introduce a 3D-Encoder-Predictor Network (3D-EPN) which is composed of 3D convolutional layers. The network is trained to predict and fill in missing data, and operates on an implicit surface representation that encodes both known and unknown space. This allows us to predict global structure in unknown areas at high accuracy. We then correlate these intermediary results with 3D geometry from a shape database at test time. In a final pass, we propose a patch-based 3D shape synthesis method that imposes the 3D geometry from these retrieved shapes as constraints on the coarsely-completed mesh. This synthesis process enables us to reconstruct fine-scale detail and generate high-resolution output while respecting the global mesh structure obtained by the 3D-EPN. Although our 3D-EPN outperforms state-of-the-art completion method, the main contribution in our work lies in the combination of a data-driven shape predictor and analytic 3D shape synthesis. In our results, we show extensive evaluations on a newly-introduced shape completion benchmark for both real-world and synthetic data. ]_
_**Comment**_:  _< >_
#### Geometric Loss Functions for Camera Pose Regression With Deep Learning
Alex Kendall, Roberto Cipolla
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### CNN-SLAM: Real-Time Dense Monocular SLAM With Learned Depth Prediction
Keisuke Tateno, Federico Tombari, Iro Laina, Nassir Navab
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### *Learning From Noisy Large-Scale Datasets With Minimal Supervision [PDF](https://arxiv.org/abs/1701.01619)
Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, Serge Belongie
_**Abstract**_:  _[We present an approach to effectively use millions of images with noisy annotations in conjunction with a small subset of cleanly-annotated images to learn powerful image representations. One common approach to combine clean and noisy data is to first pre-train a network using the large noisy dataset and then fine-tune with the clean dataset. We show this approach does not fully leverage the information contained in the clean set. Thus, we demonstrate how to use the clean annotations to reduce the noise in the large dataset before fine-tuning the network using both the clean set and the full set with reduced noise. The approach comprises a multi-task network that jointly learns to clean noisy annotations and to accurately classify images. We evaluate our approach on the recently released Open Images dataset, containing ~9 million images, multiple annotations per image and over 6000 unique classes. For the small clean set of annotations we use a quarter of the validation set with ~40k images. Our results demonstrate that the proposed approach clearly outperforms direct fine-tuning across all major categories of classes in the Open Image dataset. Further, our approach is particularly effective for a large number of classes with wide range of noise in annotations (20-80% false positive annotations). ]_
_**Comment**_:  _< in conjunction with a small subset of cleanly-annotated images, >_
#### *SyncSpecCNN: Synchronized Spectral CNN for 3D Shape Segmentation [PDF](https://arxiv.org/abs/1612.00606)
Li Yi, Hao Su, Xingwen Guo, Leonidas J. Guibas
_**Abstract**_:  _[In this paper, we study the problem of semantic annotation on 3D models that are represented as shape graphs. A functional view is taken to represent localized information on graphs, so that annotations such as part segment or keypoint are nothing but 0-1 indicator vertex functions. Compared with images that are 2D grids, shape graphs are irregular and non-isomorphic data structures. To enable the prediction of vertex functions on them by convolutional neural networks, we resort to spectral CNN method that enables weight sharing by parameterizing kernels in the spectral domain spanned by graph laplacian eigenbases. Under this setting, our network, named SyncSpecCNN, strive to overcome two key challenges: how to share coefficients and conduct multi-scale analysis in different parts of the graph for a single shape, and how to share information across related but different shapes that may be represented by very different graphs. Towards these goals, we introduce a spectral parameterization of dilated convolutional kernels and a spectral transformer network. Experimentally we tested our SyncSpecCNN on various tasks, including 3D shape part segmentation and 3D keypoint prediction. State-of-the-art performance has been achieved on all benchmark datasets. ]_
_**Comment**_:  _< use CNN to deal with other data>_
#### *Non-Local Deep Features for Salient Object Detection
Zhiming Luo, Akshaya Mishra, Andrew Achkar, Justin Eichel, Shaozi Li, Pierre-Marc Jodoin
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
## Oral 4-2B
#### Unsupervised Monocular Depth Estimation With Left-Right Consistency
Clément Godard, Oisin Mac Aodha, Gabriel J. Brostow
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### Unsupervised Learning of Depth and Ego-Motion From Video
Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe
_**Abstract**_:  _[]_
_**Comment**_:  _< >_
#### OctNet: Learning Deep 3D Representations at High Resolutions [PDF](https://arxiv.org/abs/1611.05009)
Gernot Riegler, Ali Osman Ulusoy, Andreas Geiger
_**Abstract**_:  _[We present OctNet, a representation for deep learning with sparse 3D data. In contrast to existing models, our representation enables 3D convolutional networks which are both deep and high resolution. Towards this goal, we exploit the sparsity in the input data to hierarchically partition the space using a set of unbalanced octrees where each leaf node stores a pooled feature representation. This allows to focus memory allocation and computation to the relevant dense regions and enables deeper networks without compromising resolution. We demonstrate the utility of our OctNet representation by analyzing the impact of resolution on several 3D tasks including 3D object classification, orientation estimation and point cloud labeling. ]_
_**Comment**_:  _< what is 3D CNN?>_
#### 3D Shape Segmentation With Projective Convolutional Networks [PDF](https://arxiv.org/abs/1612.02808)
Evangelos Kalogerakis, Melinos Averkiou, Subhransu Maji, Siddhartha Chaudhuri
_**Abstract**_:  _[This paper introduces a deep architecture for segmenting 3D objects into their labeled semantic parts. Our architecture combines image-based Fully Convolutional Networks (FCNs) and surface-based Conditional Random Fields (CRFs) to yield coherent segmentations of 3D shapes. The image-based FCNs are used for efficient view-based reasoning about 3D object parts. Through a special projection layer, FCN outputs are effectively aggregated across multiple views and scales, then are projected onto the 3D object surfaces. Finally, a surface-based CRF combines the projected outputs with geometric consistency cues to yield coherent segmentations. The whole architecture (multi-view FCNs and CRF) is trained end-to-end. Our approach significantly outperforms the existing state-of-the-art methods in the currently largest segmentation benchmark (ShapeNet). Finally, we demonstrate promising segmentation results on noisy 3D shapes acquired from consumer-grade depth cameras. ]_
_**Comment**_:  _< >_
# Poster 4-2
## 3D Computer Vision
#### SGM-Nets: Semi-Global Matching With Neural Networks
Akihito Seki, Marc Pollefeys
#### Stereo-Based 3D Reconstruction of Dynamic Fluid Surfaces by Global Optimization
Yiming Qian, Minglun Gong, Yee-Hong Yang
#### Fine-To-Coarse Global Registration of RGB-D Scans
Maciej Halber, Thomas Funkhouser
#### Analyzing Computer Vision Data - The Good, the Bad and the Ugly
Oliver Zendel, Katrin Honauer, Markus Murschitz, Martin Humenberger, Gustavo Fernández Domínguez
#### Product Manifold Filter: Non-Rigid Shape Correspondence via Kernel Density Estimation in the Product Space
Matthias Vestner, Roee Litman, Emanuele Rodolà, Alex Bronstein, Daniel Cremers
#### Unsupervised Vanishing Point Detection and Camera Calibration From a Single Manhattan Image With Radial Distortion
Michel Antunes, João P. Barreto, Djamila Aouada, Björn Ottersten
#### Toroidal Constraints for Two-Point Localization Under High Outlier Ratios
Federico Camposeco, Torsten Sattler, Andrea Cohen, Andreas Geiger, Marc Pollefeys
#### 4D Light Field Superpixel and Segmentation
Hao Zhu, Qi Zhang, Qing Wang
#### Exploiting Symmetry and/or Manhattan Properties for 3D Object Structure Estimation From Single and Multiple Images
Yuan Gao, Alan L. Yuille
## Analyzing Humans in Images
#### Binary Coding for Partial Action Analysis With Limited Observation Ratios
Jie Qin, Li Liu, Ling Shao, Bingbing Ni, Chen Chen, Fumin Shen, Yunhong Wang
#### SphereFace: Deep Hypersphere Embedding for Face Recognition
Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, Le Song
#### IRINA: Iris Recognition (Even) in Inaccurately Segmented Data
Hugo Proença, João C. Neves
#### Look Into Person: Self-Supervised Structure-Sensitive Learning and a New Benchmark for Human Parsing
Ke Gong, Xiaodan Liang, Dongyu Zhang, Xiaohui Shen, Liang Lin
#### Action Unit Detection With Region Adaptation, Multi-Labeling Learning and Optimal Temporal Fusing
Wei Li, Farnaz Abtahi, Zhigang Zhu
#### See the Forest for the Trees: Joint Spatial and Temporal Recurrent Neural Networks for Video-Based Person Re-Identification
Zhen Zhou, Yan Huang, Wei Wang, Liang Wang, Tieniu Tan
#### Joint Intensity and Spatial Metric Learning for Robust Gait Recognition
Yasushi Makihara, Atsuyuki Suzuki, Daigo Muramatsu, Xiang Li, Yasushi Yagi
#### Pose-Aware Person Recognition
Vijay Kumar, Anoop Namboodiri, Manohar Paluri, C. V. Jawahar
#### Not Afraid of the Dark: NIR-VIS Face Recognition via Cross-Spectral Hallucination and Low-Rank Embedding
José Lezama, Qiang Qiu, Guillermo Sapiro
## Applications
#### Jointly Learning Energy Expenditures and Activities Using Egocentric Multimodal Signals
Katsuyuki Nakamura, Serena Yeung, Alexandre Alahi, Li Fei-Fei
#### Binarized Mode Seeking for Scalable Visual Pattern Discovery
Wei Zhang, Xiaochun Cao, Rui Wang, Yuanfang Guo, Zhineng Chen
#### Scribbler: Controlling Deep Image Synthesis With Sketch and Color
Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, James Hays
## Biomedical Image/Video Analysis
#### Multi-Way Multi-Level Kernel Modeling for Neuroimaging Classification
Lifang He, Chun-Ta Lu, Hao Ding, Shen Wang, Linlin Shen, Philip S. Yu, Ann B. Ragin
#### WSISA: Making Survival Prediction From Whole Slide Histopathological Images
Xinliang Zhu, Jiawen Yao, Feiyun Zhu, Junzhou Huang
## Computational Photography
#### On the Effectiveness of Visible Watermarks
Tali Dekel, Michael Rubinstein, Ce Liu, William T. Freeman
#### Snapshot Hyperspectral Light Field Imaging
Zhiwei Xiong, Lizhi Wang, Huiqun Li, Dong Liu, Feng Wu
#### Semantic Image Inpainting With Deep Generative Models
Raymond A. Yeh, Chen Chen, Teck Yian Lim, Alexander G. Schwing, Mark Hasegawa-Johnson, Minh N. Do
## Image Motion & Tracking
#### Fast Multi-Frame Stereo Scene Flow With Motion Segmentation
Tatsunori Taniai, Sudipta N. Sinha, Yoichi Sato
#### Improved Stereo Matching With Constant Highway Networks and Reflective Confidence Learning
Amit Shaked, Lior Wolf
#### Optical Flow in Mostly Rigid Scenes
Jonas Wulff, Laura Sevilla-Lara, Michael J. Black
#### Optical Flow Requires Multiple Strategies (but Only One Network)
Tal Schuster, Lior Wolf, David Gadot
#### ECO: Efficient Convolution Operators for Tracking
Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, Michael Felsberg
## Low- & Mid-Level Vision
#### Differential Angular Imaging for Material Recognition
Jia Xue, Hang Zhang, Kristin Dana, Ko Nishino
#### Fast Fourier Color Constancy
Jonathan T. Barron, Yun-Ta Tsai
#### Comparative Evaluation of Hand-Crafted and Learned Local Features
Johannes L. Schönberger, Hans Hardmeier, Torsten Sattler, Marc Pollefeys
#### Learning Fully Convolutional Networks for Iterative Non-Blind Deconvolution
Jiawei Zhang, Jinshan Pan, Wei-Sheng Lai, Rynson W. H. Lau, Ming-Hsuan Yang
#### Image Deblurring via Extreme Channels Prior
Yanyang Yan, Wenqi Ren, Yuanfang Guo, Rui Wang, Xiaochun Cao
#### Simultaneous Stereo Video Deblurring and Scene Flow Estimation
Liyuan Pan, Yuchao Dai, Miaomiao Liu, Fatih Porikli
#### Deep Photo Style Transfer
Fujun Luan, Sylvain Paris, Eli Shechtman, Kavita Bala
#### Generative Attribute Controller With Conditional Filtered Generative Adversarial Networks
Takuhiro Kaneko, Kaoru Hiramatsu, Kunio Kashino
#### Fast Haze Removal for Nighttime Image Using Maximum Reflectance Prior
Jing Zhang, Yang Cao, Shuai Fang, Yu Kang, Chang Wen Chen
## Machine Learning
#### Low-Rank Bilinear Pooling for Fine-Grained Classification
Shu Kong, Charless Fowlkes
#### Neural Scene De-Rendering
Jiajun Wu, Joshua B. Tenenbaum, Pushmeet Kohli
#### Real-Time Neural Style Transfer for Videos
Haozhi Huang, Hao Wang, Wenhan Luo, Lin Ma, Wenhao Jiang, Xiaolong Zhu, Zhifeng Li, Wei Liu
#### A Graph Regularized Deep Neural Network for Unsupervised Image Representation Learning
Shijie Yang, Liang Li, Shuhui Wang, Weigang Zhang, Qingming Huang
#### Collaborative Deep Reinforcement Learning for Joint Object Search
Xiangyu Kong, Bo Xin, Yizhou Wang, Gang Hua
#### Loss Max-Pooling for Semantic Image Segmentation
Samuel Rota Bulò, Gerhard Neuhold, Peter Kontschieder
#### Deep View Morphing
Dinghuang Ji, Junghyun Kwon, Max McFarland, Silvio Savarese
#### Unsupervised Learning of Long-Term Motion Dynamics for Videos
Zelun Luo, Boya Peng, De-An Huang, Alexandre Alahi, Li Fei-Fei
#### Revisiting Metric Learning for SPD Matrix Based Visual Representation
Luping Zhou, Lei Wang, Jianjia Zhang, Yinghuan Shi, Yang Gao
#### Expert Gate: Lifelong Learning With a Network of Experts
Rahaf Aljundi, Punarjay Chakravarty, Tinne Tuytelaars
#### A Gift From Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning
Junho Yim, Donggyu Joo, Jihoon Bae, Junmo Kim
#### Domain Adaptation by Mixture of Alignments of Second- or Higher-Order Scatter Tensors
Piotr Koniusz, Yusuf Tas, Fatih Porikli
#### Deep Mixture of Linear Inverse Regressions Applied to Head-Pose Estimation
Stéphane Lathuilière, Rémi Juge, Pablo Mesejo, Rafael Muñoz-Salinas, Radu Horaud
#### STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling
Yang He, Wei-Chen Chiu, Margret Keuper, Mario Fritz
#### Harmonic Networks: Deep Translation and Rotation Equivariance
Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, Gabriel J. Brostow
#### Multimodal Transfer: A Hierarchical Deep Convolutional Neural Network for Fast Artistic Style Transfer
Xin Wang, Geoffrey Oxholm, Da Zhang, Yuan-Fang Wang
#### Detect, Replace, Refine: Deep Structured Prediction for Pixel Wise Labeling
Spyros Gidaris, Nikos Komodakis
#### Weighted-Entropy-Based Quantization for Deep Neural Networks
Eunhyeok Park, Junwhan Ahn, Sungjoo Yoo
#### Residual Expansion Algorithm: Fast and Effective Optimization for Nonconvex Least Squares Problems
Daiki Ikami, Toshihiko Yamasaki, Kiyoharu Aizawa
#### Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-In-The-Blank Image Captioning
Qing Sun, Stefan Lee, Dhruv Batra
#### Newton-Type Methods for Inference in Higher-Order Markov Random Fields
Hariprasad Kannan, Nikos Komodakis, Nikos Paragios
#### Adaptive Relaxed ADMM: Convergence Theory and Practical Implementation
Zheng Xu, Mário A. T. Figueiredo, Xiaoming Yuan, Christoph Studer, Tom Goldstein
## Object Recognition & Scene Understanding
#### ViP-CNN: Visual Phrase Guided Convolutional Neural Network
Yikang Li, Wanli Ouyang, Xiaogang Wang, Xiao'ou Tang
#### Instance-Aware Image and Sentence Matching With Selective Multimodal LSTM
Yan Huang, Wei Wang, Liang Wang
#### Kernel Square-Loss Exemplar Machines for Image Retrieval
Rafael S. Rezende, Joaquin Zepeda, Jean Ponce, Francis Bach, Patrick Pérez
#### Cognitive Mapping and Planning for Visual Navigation
Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, Jitendra Malik
#### Combining Bottom-Up, Top-Down, and Smoothness Cues for Weakly Supervised Image Segmentation
Anirban Roy, Sinisa Todorovic
#### Seeing Into Darkness: Scotopic Visual Recognition
Bo Chen, Pietro Perona
#### Deep Co-Occurrence Feature Learning for Visual Object Recognition
Ya-Fang Shih, Yang-Ming Yeh, Yen-Yu Lin, Ming-Fang Weng, Yi-Chang Lu, Yung-Yu Chuang
#### An Empirical Evaluation of Visual Question Answering for Novel Objects
Santhosh K. Ramakrishnan, Ambar Pal, Gaurav Sharma, Anurag Mittal
#### InstanceCut: From Edges to Instances With MultiCut
Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, Carsten Rother
#### Fine-Grained Image Classification via Combining Vision and Language
Xiangteng He, Yuxin Peng
#### Mimicking Very Efficient Network for Object Detection
Quanquan Li, Shengying Jin, Junjie Yan
#### Tracking by Natural Language Specification
Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, Arnold W. M. Smeulders
#### A Dataset and Exploration of Models for Understanding Video Data Through Fill-In-The-Blank Question-Answering
Tegan Maharaj, Nicolas Ballas, Anna Rohrbach, Aaron Courville, Christopher Pal
#### Learning Detection With Diverse Proposals
Samaneh Azadi, Jiashi Feng, Trevor Darrell
#### Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition
Yufei Wang, Zhe Lin, Xiaohui Shen, Scott Cohen, Garrison W. Cottrell
## Theory
#### A Low Power, Fully Event-Based Gesture Recognition System
Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, Jeff Kusnitz, Michael Debole, Steve Esser, Tobi Delbruck, Myron Flickner, Dharmendra Modha
## Video Analytics
#### Learning Deep Context-Aware Features Over Body and Latent Parts for Person Re-Identification
Dangwei Li, Xiaotang Chen, Zhang Zhang, Kaiqi Huang
#### Recurrent Modeling of Interaction Context for Collective Activity Recognition
Minsi Wang, Bingbing Ni, Xiaokang Yang
#### Primary Object Segmentation in Videos Based on Region Augmentation and Reduction
Yeong Jun Koh, Chang-Su Kim
#### ROAM: A Rich Object Appearance Model With Application to Rotoscoping
Ondrej Miksik, Juan-Manuel Pérez-Rúa, Philip H. S. Torr, Patrick Pérez
#### Temporal Residual Networks for Dynamic Scene Recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
#### Spatiotemporal Multiplier Networks for Video Action Recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
#### Learning to Learn From Noisy Web Videos
Serena Yeung, Vignesh Ramanathan, Olga Russakovsky, Liyue Shen, Greg Mori, Li Fei-Fei
#### YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video
Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, Vincent Vanhoucke
#### Online Video Object Segmentation via Convolutional Trident Network
Won-Dong Jang, Chang-Su Kim
Notes about some of the important parts of the system.
**Authors**: Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive
title: Loss Functions