Thesis proposal / Researchers / Research group

Object Recognition

Recognition of objects in images is still one of the most important research topics in computer vision. Given an image or a video, the goal of object recognition is to identify and localize all the objects that appear in it.

Over the last few years, performance in this field has improved significantly with the use of Deep Neural Networks [1] and big datasets such as ImageNet [2]. Despite these research efforts, however, object recognition remains an unsolved problem: real-time methods (such as Deformable Part Models [3]) achieve low detection accuracy, while the methods that offer higher accuracy cannot run in real time. Currently, even the best object recognition algorithms are still a long way from matching human performance. In this line of research we focus on improving current systems, in terms of both accuracy and speed.
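To make the task concrete, the following minimal sketch runs an off-the-shelf pretrained detector over a single image and prints the localized objects. The choice of model (torchvision's Faster R-CNN) and the input file name are illustrative assumptions, not the method pursued in this research line.

    # Minimal object detection sketch (assumes a recent PyTorch/torchvision).
    import torch
    import torchvision
    from torchvision.transforms.functional import to_tensor
    from PIL import Image

    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()  # inference mode

    image = Image.open("example.jpg").convert("RGB")  # hypothetical input
    with torch.no_grad():
        # Per image, the model returns boxes (x1, y1, x2, y2), labels and scores.
        pred = model([to_tensor(image)])[0]

    for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
        if score > 0.5:  # keep confident detections only
            print(f"class {label.item()} at {box.tolist()} (score {score:.2f})")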

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009.

[3] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, Sep. 2010.

Dra. Àgata Lapedriza

Dr. David Masip

SUNAI Research group

Scene Recognition and Understanding

Understanding complex visual scenes is one of the hallmark tasks of computer vision. Given a picture or a video, the goal of scene understanding is to build a representation of its content (i.e., which objects appear in it, how they are related, what actions any depicted people are performing, what place is shown, etc.).

With the appearance of large-scale databases like ImageNet [1] and Places [2], and the recent success of machine learning techniques such as Deep Neural Networks [3], scene understanding has made a great deal of progress, to the point that vision systems can now address some of the above-mentioned tasks [4].
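For instance, the system of [4] localizes the image evidence behind a classification decision using Class Activation Mapping (CAM): the feature maps of the last convolutional layer are weighted by the classifier weights of the predicted class and summed. A minimal sketch of this computation, assuming a recent torchvision and a ResNet-18 backbone as an illustrative stand-in:

    # Class Activation Mapping (CAM) sketch, following the idea of [4].
    import torch
    import torchvision

    model = torchvision.models.resnet18(weights="DEFAULT").eval()

    features = {}
    def hook(module, inputs, output):
        features["conv"] = output  # last conv feature maps, shape (1, 512, 7, 7)
    model.layer4.register_forward_hook(hook)

    x = torch.randn(1, 3, 224, 224)     # stand-in for a preprocessed image
    with torch.no_grad():
        logits = model(x)
    c = logits.argmax(dim=1).item()     # predicted class

    # CAM_c(u, v) = sum_k w_{c,k} * f_k(u, v): weight each feature map by the
    # final-layer weight of class c and sum over channels.
    w = model.fc.weight[c]                                # shape (512,)
    cam = (w[:, None, None] * features["conv"][0]).sum(0)
    cam = torch.relu(cam)
    cam = cam / (cam.max() + 1e-8)      # normalize to [0, 1] for visualization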

This line of research is being undertaken in collaboration with the computer vision group at the Massachusetts Institute of Technology. Our goal is to improve existing algorithms for scene understanding and to define new problems made attainable by recent advances in neural networks and machine learning.

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009.

[2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for Scene Recognition using Places Database. In Advances in Neural Information Processing Systems (NIPS), 2014.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[4] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In Proc. CVPR, 2016.

Dra. Àgata Lapedriza

SUNAI Research group

Recognition of facial expressions

Facial expressions are a very important source of information for the development of new technologies. As humans we use our faces to communicate our emotions, and psychologists have studied emotions in faces since Charles Darwin's early works [1]. One of the most successful emotion models is the Facial Action Coding System (FACS) [2], in which a particular set of action units (facial muscle movements) acts as the building blocks of six basic emotions (happiness, surprise, fear, anger, disgust, sadness).

The automatic understanding of this universal language (very similar across almost all cultures) is one of the most important research areas in computer vision. It has applications in many fields, such as the design of intelligent user interfaces, human-computer interaction, the diagnosis of disorders, and even advertising that reacts to viewers' expressions. In this line of research we propose to design and apply state-of-the-art supervised algorithms to detect and classify emotions and action units.
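To illustrate the FACS idea, the sketch below maps a set of detected action units to the closest basic-emotion prototype. The AU detector itself is left out, and the prototype table follows one common reading of the FACS literature; both are illustrative assumptions rather than the supervised algorithms proposed here.

    # FACS sketch: action units (AUs) as building blocks of basic emotions.
    # The AU-to-emotion prototypes below are illustrative and simplified.
    EMOTION_PROTOTYPES = {
        "happiness": {6, 12},        # cheek raiser + lip corner puller
        "sadness":   {1, 4, 15},     # brow raiser/lowerer + lip corner depressor
        "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
        "anger":     {4, 5, 7, 23},
        "disgust":   {9, 15},
        "fear":      {1, 2, 4, 5, 20, 26},
    }

    def classify_emotion(active_aus):
        """Return the basic emotion whose AU prototype best matches the input.

        active_aus: set of AU numbers produced by some AU detector (not shown).
        """
        def overlap(prototype):
            return len(active_aus & prototype) / len(prototype)
        return max(EMOTION_PROTOTYPES, key=lambda e: overlap(EMOTION_PROTOTYPES[e]))

    print(classify_emotion({6, 12}))  # -> happiness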

Nevertheless, the range of emotions is far greater than this basic set. From facial expressions we can predict, with better-than-chance accuracy, the outcome of a negotiation, users' preferences in binary decisions [3], perceived deception, and more. In this line of research we collaborate with the Social Perception Lab at Princeton University (http://tlab.princeton.edu/) to apply automated algorithms to real data from psychology labs.

[1] Darwin, Charles (1872), The expression of the emotions in man and animals, London: John Murray.

[2] P. Ekman and W. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.

[3] Masip D, North MS, Todorov A, Osherson DN (2014). Automated Prediction of Preferences Using Facial Expressions. PLoS ONE 9(2): e87434. doi:10.1371/journal.pone.0087434

Dr. David Masip

SUNAI Research group

Deep-learning algorithms

In recent years, end-to-end learning algorithms have revolutionized many areas of research, such as computer vision [1], natural language processing [2], game playing [3], and robotics. Deep-learning techniques have achieved the highest levels of success in many of these tasks, thanks to their astonishing capability to learn both the features/filters and the classification rule.

The algorithms developed in this line of research will focus on enhancing deep-learning architectures and improving their learning capabilities, in terms of invariant (rotation, translation, warping, scaling) feature extraction [4], computational efficiency and parallelization [5], faster network training [6, 7], and connecting images to sequences.
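As an example of learning invariant features, the following sketch implements a spatial transformer module in the spirit of [4]: a small localization network regresses an affine transform, which is then applied to the input through a differentiable sampler, so the whole module trains end to end. PyTorch and the 28x28 single-channel input (e.g. MNIST) are assumptions made for illustration.

    # Spatial transformer module sketch (cf. [4]); assumes PyTorch.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class STN(nn.Module):
        def __init__(self):
            super().__init__()
            # Localization network: regresses the 6 parameters of an affine map.
            self.loc = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
                nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            )
            self.fc = nn.Linear(10 * 3 * 3, 6)
            # Initialize to the identity transform so training starts stably.
            self.fc.weight.data.zero_()
            self.fc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

        def forward(self, x):                  # x: (N, 1, 28, 28)
            theta = self.fc(self.loc(x).flatten(1)).view(-1, 2, 3)
            grid = F.affine_grid(theta, x.size(), align_corners=False)
            return F.grid_sample(x, grid, align_corners=False)

    warped = STN()(torch.randn(4, 1, 28, 28))  # same shape, spatially warped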

These algorithms will be applied to real computer vision problems in the field of neuroscience, in collaboration with the Princeton Neuroscience Institute. The problems range from the detection and tracking of rodents in low-resolution videos, image segmentation and limb detection, and motion estimation of whiskers using high-speed cameras, to the segmentation of in vivo calcium imaging of neural network activity in rodents [8].
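For the rodent detection and tracking application, a classical baseline can be sketched with background subtraction and blob analysis; the deep-learning methods developed in this line would replace such a pipeline. OpenCV and the input file name are assumptions.

    # Classical tracking baseline sketch: background subtraction + centroid.
    import cv2

    cap = cv2.VideoCapture("rodent.avi")  # hypothetical input video
    bg = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = bg.apply(frame)                                 # foreground mask
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # remove speckle
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if contours:
            animal = max(contours, key=cv2.contourArea)  # assume a single animal
            m = cv2.moments(animal)
            if m["m00"] > 0:
                cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
                print(f"centroid: ({cx:.1f}, {cy:.1f})")
    cap.release()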

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[2] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).

[3] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.

[4] Jaderberg, M., Simonyan, K., & Zisserman, A. (2015). Spatial transformer networks. In Advances in Neural Information Processing Systems (pp. 2017-2025).

[5] Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., & Kavukcuoglu, K. (2016). Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343.

[6] Ha, D., Dai, A., & Le, Q. V. (2016). HyperNetworks. arXiv preprint arXiv:1609.09106.

[7] Bakhtiary, A. H., Lapedriza, A., & Masip, D. (2015). Speeding Up Neural Networks for Large Scale Classification using WTA Hashing. arXiv preprint arXiv:1504.07488.

[8] Grewe, B. F., Langer, D., Kasper, H., Kampa, B. M., & Helmchen, F. (2010). High-speed in vivo calcium imaging reveals neuronal network activity with near-millisecond precision. Nature methods, 7(5), 399-405. 

Dr. David Masip

SUNAI Research group

Human pose recovery and behavior analysis

Human action/gesture recognition is a challenging area of research. It deals with the problem of recognizing people in images, detecting and describing body parts, inferring their spatial configuration, and performing action/gesture recognition from still images or image sequences, possibly including multi-modal data. Because of the large pose parameter space inherent in human configurations, body pose recovery is a difficult problem that involves dealing with several distortions: illumination changes, partial occlusions, changes in the point of view, rigid and elastic deformations, and high inter- and intra-class variability, to mention just a few. In spite of the difficulty the problem presents, modern computer vision techniques and new trends merit further attention, and promising results are expected in the next few years.
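To ground the body-part detection step, the sketch below runs a pretrained keypoint detector (torchvision's Keypoint R-CNN, an illustrative baseline that returns the 17 COCO body keypoints per detected person); it is not the method this line aims to develop, and the input file name is an assumption.

    # Body keypoint detection sketch (assumes a recent PyTorch/torchvision).
    import torch
    import torchvision
    from torchvision.transforms.functional import to_tensor
    from PIL import Image

    model = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    image = to_tensor(Image.open("person.jpg").convert("RGB"))  # hypothetical input
    with torch.no_grad():
        out = model([image])[0]

    # For each confidently detected person, print its 17 (x, y) keypoints.
    for score, keypoints in zip(out["scores"], out["keypoints"]):
        if score > 0.8:
            for x, y, visible in keypoints:
                if visible:
                    print(f"({x:.0f}, {y:.0f})")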

Moreover, several subareas have recently been defined, such as affective computing, social signal processing, human behavior analysis, and social robotics. The effort this area of research demands will be rewarded by its potential applications: TV production, home entertainment (multimedia content analysis), education, sociological research, surveillance and security, and improved quality of life through monitoring and automatic artificial assistance, among others.

Dr. Xavier Baró

SUNAI Research group

Computer vision and emotional AI

In recent years we have observed an increasing interest, both in academia and in the computer vision industry, in systems for understanding how people feel [1, 2] and how visual information affects our mood and emotions [3]. This line of research focuses on creating image-understanding systems that incorporate aspects of emotional intelligence into the interpretation of visual information. Such systems have many applications; for example, they can be applied to the care and assistance of people, online education, and human-computer interaction.

In this line of research we work with advanced deep learning techniques, combining several computer vision topics (face analysis, pose and gesture analysis, action recognition, scene recognition, object detection, and object/scene attribute recognition) to extract high-level information from images and videos.

[1] https://www.microsoft.com/cognitive-services/en-us/emotion-api

[2] http://www.affectiva.com/

[3] Kuan-Chuan Peng, Tsuhan Chen, Amir Sadovnik, and Andrew Gallagher. A Mixed Bag of Emotions: Model, Predict and Transfer Emotion Distributions. In Proc. CVPR, 2015.

Dra. Àgata Lapedriza
 
SUNAI Research group

Computer vision and cognition

We have observed huge progress in computer vision over the last four years, mainly because of the appearance of big datasets of labelled images, such as ImageNet [1] and Places [2], and the success of deep learning algorithms when trained on such large amounts of data [2, 3]. Since this turning point, performance has increased in many computer vision applications, such as scene recognition, object detection and recognition, image captioning, etc.

However, despite this amazing progress, there are still some tasks that are very hard for a machine to solve, such as image question answering or describing the content of an image in detail. The point is that we can perform these tasks easily not just because of our capacity for detecting and recognizing objects and places, but because of our ability to reason about what we see. To be capable of reasoning about something, one needs cognition. Nowadays computers cannot reason about visual information because computer vision systems do not include artificial cognition. One of the main obstacles to developing cognitive systems for computer vision has been the lack of data to train on. However, the recent Visual Genome work [4] presents the first dataset that enables the modelling of such systems and opens the door to new research goals.

This line of research aims to explore how to add cognition to vision systems, in order to create algorithms that can reason about visual information.
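To make this kind of representation concrete: Visual Genome annotates each image with a scene graph of objects, attributes, and relationships, over which simple questions can be answered. The toy sketch below (the field names and query function are illustrative assumptions) builds such a graph and performs one trivial reasoning step.

    # Toy scene-graph representation in the spirit of Visual Genome [4].
    from dataclasses import dataclass, field

    @dataclass
    class SceneObject:
        name: str
        attributes: list = field(default_factory=list)

    @dataclass
    class Relation:
        subject: SceneObject
        predicate: str
        obj: SceneObject

    # A hand-built graph for one image: "a man riding a brown horse".
    man, horse = SceneObject("man"), SceneObject("horse", ["brown"])
    graph = [Relation(man, "riding", horse)]

    def answer_what_is(subject_name, predicate, graph):
        """Toy question answering: 'What is the <subject> <predicate>?'"""
        for r in graph:
            if r.subject.name == subject_name and r.predicate == predicate:
                return " ".join(r.obj.attributes + [r.obj.name])
        return None

    print(answer_what_is("man", "riding", graph))  # -> brown horse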

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009.

[2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for Scene Recognition using Places Database. In Advances in Neural Information Processing Systems (NIPS), 2014.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[4] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. 2016. https://visualgenome.org/

Dra. Àgata Lapedriza

Dr. Carles Ventura

SUNAI Research group