Computer Vision, Machine Learning and Pattern Recognition

Thesis proposal | Researchers | Research group

Explainable computer vision

Over the last few years, the performance in many areas of computer vision has been significantly improved with the use of Deep Neural Networks [1] and big datasets such as ImageNet [2] or Places [3]. Deep Learning (DL) models are able to accurately recognize objects, scenes, actions or human emotions. Nevertheless, many applications require not only classification or regression outputs from DL models, but also additional information explaining why the model made a certain decision.

Explainability in Computer Vision is the area of research that studies how to build Computer Vision Deep Learning (DL) models that allow humans to understand why the model made a decision. A typical example is a system for assisting cardiovascular diagnosis using medical imaging. When the system concludes that there is a high risk of a cardiac incident and that an urgent operation is needed, it would be very useful to obtain, rather than a black-box output, an explanation of the prediction, which could consist, for example, of the location and degree of coronary stenosis or of the identification of similar clinical cases. With this extra information, in addition to the classification, an expert can more easily verify whether the decision made by the model is trustworthy.

More generally, explainable DL is a recently emerged area of research that aims to better understand how DL models learn and perform, and what type of representations these models learn. The works [4,5] offer good overviews of recent progress in explainable DL. More specifically, [6,7,8,9] are examples of recent explainable models for computer vision. An overview of Explainable AI, as well as a discussion on how to adapt explanations to the user, can be found in [10].
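As an illustration of the kind of explanation such models can produce, the following is a minimal sketch of the class activation mapping idea behind [6]: the map for a class is the weighted sum of the last convolutional feature maps, using that class's classifier weights. The 2x2 feature maps, the weights and the function names are hypothetical toy values, not taken from the cited work.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps: cam[y][x] = sum_k w_k * f_k[y][x]."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def peak_location(cam):
    """Return the (row, column) of the strongest response in the map."""
    best = max((value, y, x)
               for y, row in enumerate(cam)
               for x, value in enumerate(row))
    return best[1], best[2]


# Hypothetical toy input: two 2x2 feature maps and the weights of one class.
toy_feature_maps = [
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [2.0, 0.0]],
]
toy_class_weights = [1.0, 0.25]
toy_cam = class_activation_map(toy_feature_maps, toy_class_weights)
```

The location returned by `peak_location(toy_cam)` is the image region that contributed most to the class score, which is the extra information an expert could inspect alongside the classification.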

The goal of the Explainable Computer Vision research line is to explore explainability in DL, with applications to different problems in computer vision, such as emotion perception and scene understanding. 

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009.

[3] Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Advances in neural information processing systems (pp. 487-495).

[4] Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter and Lalana Kagal, “Explaining Explanations: An Approach to Evaluating Interpretability of Machine Learning”.

[5] Q. Zhang, S.-C. Zhu, “Visual interpretability for deep learning: a survey”, Frontiers of Information Technology & Electronic Engineering, 2018.

[6] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba. Learning Deep Features for Discriminative Localization, CVPR, 2016.

[7] J. Lu, J. Yang, D. Batra, D. Parikh, “Hierarchical question-image co-attention for visual question answering,” in NIPS, 2016.

[8] B. Zhou, Y. Sun, D. Bau, A. Torralba, “Interpretable Basis Decomposition for visual explanations”. European Conference on Computer Vision (ECCV) 2018.

[9] D. Bau, J-Y Zhu, H. Strobelt, A. Lapedriza, B. Zhou, A. Torralba, "Understanding the role of individual units in a deep neural network", Proceedings of the National Academy of Sciences (PNAS), 2020.

[10] Ribera, Mireia, and Agata Lapedriza. "Can we do better explanations? A proposal of user-centered explainable AI." IUI Workshops. 2019.

Dr Àgata Lapedriza

Dr David Masip


Scene Understanding

Understanding complex visual scenes is one of the hallmark tasks of computer vision. Given a picture or a video, the goal of scene understanding is to build a representation of its content (e.g., which objects appear in the picture, how they are related, what actions any people in the picture are performing, what place is depicted, etc.). With the appearance of large-scale databases like ImageNet [1] and Places [2], and the recent success of machine learning techniques such as Deep Neural Networks [3], scene understanding has made considerable progress, making it possible to build vision systems capable of addressing some of these tasks [4].

In this research line, in collaboration with the computer vision group at the Massachusetts Institute of Technology (MIT), our goal is to improve existing algorithms for scene understanding and to define new problems that have now become reachable thanks to recent advances in deep neural networks and machine learning. An online demo of our state-of-the-art system for scene category recognition and scene attribute recognition is available.

The goals of this research line are to understand general scenes [5] and also to understand human-centric situations [6]. A paper related to a recent project in this area can be found in [12].

Furthermore, related to scene understanding, one of the research lines we are working on is instance segmentation. Instance segmentation techniques for images and videos consist of assigning each pixel a label with a semantic category that identifies the object class (e.g. car, person, etc.) and a label with an identifier that differentiates objects belonging to the same category. Although image segmentation has improved greatly in recent years, with some techniques such as Mask R-CNN [7] becoming very popular, few algorithms exploit the temporal domain for video object segmentation.
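The two per-pixel labels described above can be pictured with a small sketch. The 2x3 label maps, the "background" convention and the function name are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical 2x3 image: one semantic category and one instance id per pixel.
semantic_map = [
    ["car", "car", "background"],
    ["person", "background", "person"],
]
instance_map = [
    [1, 1, 0],
    [1, 0, 2],
]


def instances_per_category(semantic, instance, background="background"):
    """Count distinct instance identifiers within each semantic category."""
    found = defaultdict(set)
    for semantic_row, instance_row in zip(semantic, instance):
        for category, instance_id in zip(semantic_row, instance_row):
            if category != background:
                found[category].add(instance_id)
    return {category: len(ids) for category, ids in found.items()}


counts = instances_per_category(semantic_map, instance_map)
```

Here the two "person" pixels share a category but carry different identifiers, so they are counted as two separate objects, while the two "car" pixels belong to a single instance.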

Most video object segmentation techniques are frame-based, which means that an image segmentation technique is applied at every frame independently, and postprocessing techniques are used to connect the objects segmented at each frame along the video. There are a few video object segmentation techniques that leverage additional cues from videos, such as motion, using optical flow. However, these architectures are not end-to-end trainable.
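The frame-based pipeline described above can be sketched as follows, assuming that the per-frame masks are already available as sets of pixel coordinates and that the postprocessing step is a greedy intersection-over-union (IoU) matching. This is one simple linking heuristic chosen for illustration; the function names and the 0.5 threshold are hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col)."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def link_frames(previous_tracks, current_masks, threshold=0.5):
    """Greedily assign the masks of the current frame to existing track ids.

    previous_tracks maps track id -> mask in the previous frame.
    Masks that match no track above the threshold start a new track.
    """
    candidates = sorted(
        ((iou(mask, current), track_id, index)
         for track_id, mask in previous_tracks.items()
         for index, current in enumerate(current_masks)),
        reverse=True,
    )
    assigned, used = {}, set()
    for score, track_id, index in candidates:
        if score < threshold:
            break  # remaining candidates overlap even less
        if track_id in assigned or index in used:
            continue
        assigned[track_id] = current_masks[index]
        used.add(index)
    next_id = max(previous_tracks, default=-1) + 1
    for index, mask in enumerate(current_masks):
        if index not in used:
            assigned[next_id] = mask
            next_id += 1
    return assigned
```

Because the segmentation and the linking are separate steps, nothing learned in one can improve the other, which is the limitation that end-to-end trainable architectures aim to remove.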

This research line would focus on the study of end-to-end trainable architectures that exploit spatio-temporal features. Recently, some video object segmentation benchmarks have been released, e.g. YouTube-VOS [8] and YouTube-VIS [9], with many more annotated objects than the previous video object segmentation benchmark (DAVIS [10]). Two different challenges are defined: (i) semi-supervised video object segmentation (also referred to as one-shot), where the masks of the objects to be segmented are given at the first frame, and (ii) unsupervised video object segmentation (also referred to as zero-shot), where no masks are given. The RVOS model [11] could be a possible starting point for exploration.


[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. “Imagenet: A large-scale hierarchical image database”. In Proc. CVPR, 2009.

[2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. “Learning Deep Features for Scene Recognition using Places Database”. Advances in Neural Information Processing Systems 27 (NIPS), 2014.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. “Imagenet classification with deep convolutional neural networks”. In Advances in Neural Information Processing Systems, 2012.

[4] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. “Learning Deep Features for Discriminative Localization”. Computer Vision and Pattern Recognition (CVPR), 2016.

[5] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh, "Graph R-CNN for Scene Graph Generation". European Conference on Computer Vision (ECCV), 2018.

[6] M. Yatskar, L. Zettlemoyer, A. Farhadi, "Situation recognition: Visual semantic role labeling for image understanding", Computer Vision and Pattern Recognition (CVPR), 2016.

[7] K. He, G. Gkioxari, P. Dollár & R. Girshick. Mask R-CNN. International Conference on Computer Vision (ICCV), 2017.

[8] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang & T. Huang. YouTube-VOS: A large-scale video object segmentation benchmark, 2018, arXiv preprint arXiv:1809.03327.

[9] L. Yang, Y. Fan, N. Xu. Video Instance Segmentation. International Conference on Computer Vision (ICCV), 2019.

[10] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung & L. Van Gool. The 2017 DAVIS challenge on video object segmentation, 2017, arXiv preprint arXiv:1704.00675.

[11] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques & X. Giro-i-Nieto. RVOS: End-to-end recurrent network for video object segmentation. Computer Vision and Pattern Recognition (CVPR), 2019.

[12] E. Weber, N. Marzo, D. P. Papadopoulos, A. Biswas, A. Lapedriza, F. Ofli, M. Imran, A. Torralba, "Detecting natural disasters, damage, and incidents in the wild", European Conference on Computer Vision (ECCV), 2020.

Dr Àgata Lapedriza SUNAI

Recognition of facial expressions

Facial expressions are a very important source of information for the development of new technologies. As humans, we use our faces to communicate our emotions, and psychologists have studied emotions in faces since the publication of Charles Darwin’s early works [1]. One of the most successful emotion models is the Facial Action Coding System (FACS) [2], where a particular set of action units (facial muscle movements) act as the building blocks of six basic emotions (happiness, surprise, fear, anger, disgust, sadness). The automatic understanding of this universal language (very similar in almost all cultures) is one of the most important research areas in computer vision. It has applications in many fields, such as the design of intelligent user interfaces, human-computer interaction, the diagnosis of disorders and even reactive advertising.

Nevertheless, there exists a far greater range of emotions than just this basic set. We can predict with better-than-chance accuracy the results of a negotiation, the preferences of users in binary decisions [3], the perception of deception, etc. In this line of research we collaborate with the Social Perception Lab at Princeton University to apply automated algorithms to real data from psychology labs.

In this line of research we propose to apply Deep Learning methods to allow computers to perceive human emotions from facial images and videos [4]. We propose to use self-supervised [5] and semi-supervised methods to take advantage of the large amount of public unlabeled data available on the Internet. We conjecture that large-scale data acquired in the wild will pave the way for improved emotion classifiers in real applications.

[1] Darwin, Charles (1872), The expression of the emotions in man and animals, London: John Murray.

[2] P. Ekman and W. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.

[3] Masip D, North MS, Todorov A, Osherson DN (2014) Automated Prediction of Preferences Using Facial Expressions. PLoS ONE 9(2): e87434. doi:10.1371/journal.pone.0087434

[4] Pons, G., & Masip, D. (2018). Supervised Committee of Convolutional Neural networks in automated facial expression analysis. IEEE Transactions on Affective Computing, 9(3), 343-350.

[5] Wiles, O., Koepke, A., & Zisserman, A. (2018). Self-supervised learning of a facial attribute embedding from video. arXiv preprint arXiv:1808.06882.

Dr David Masip SUNAI

Deep-learning algorithms

In recent years, end-to-end learning algorithms have revolutionized many areas of research, such as computer vision [1], natural language processing [2], gaming [3], robotics, etc. Deep-learning techniques have achieved the highest levels of success in many of these tasks, given their astonishing capability to model both the features/filters and the classification rule.

The output of a neural network is usually a score giving the probability (in classification) or the regressed value of a user-defined label. Real-life applications will also require explicit values for the uncertainty of this prediction (e.g. if the system diagnoses a cancer, we need the type, the probability and the certainty of the prediction). We propose to develop novel Deep Learning methods that model the uncertainty of their predictions [4], and also exploit this uncertainty to perform active and semi-supervised learning on large-scale unlabeled data sets.
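One existing way to obtain such an uncertainty value is the Monte Carlo dropout approach of [4]: dropout is kept active at prediction time, and the spread of the outputs over several stochastic forward passes is read as model uncertainty. The sketch below only illustrates that idea on a tiny network whose weights are arbitrary, hypothetical values; it does not represent the novel methods proposed in this research line.

```python
import math
import random
import statistics

# Hypothetical fixed weights: 2 inputs -> 3 hidden ReLU units -> 1 sigmoid output.
HIDDEN_WEIGHTS = [[0.8, -0.5], [0.3, 0.9], [-0.6, 0.4]]
OUTPUT_WEIGHTS = [1.2, -0.7, 0.5]


def stochastic_forward(inputs, rng, drop_prob):
    """One forward pass with dropout applied to the hidden layer."""
    hidden = []
    for row in HIDDEN_WEIGHTS:
        activation = max(0.0, sum(w * x for w, x in zip(row, inputs)))
        keep = rng.random() >= drop_prob
        # Inverted dropout: rescale kept units so the expected value is unchanged.
        hidden.append(activation / (1.0 - drop_prob) if keep else 0.0)
    logit = sum(w * h for w, h in zip(OUTPUT_WEIGHTS, hidden))
    return 1.0 / (1.0 + math.exp(-logit))


def mc_dropout_predict(inputs, passes=100, drop_prob=0.5, seed=0):
    """Return the mean prediction and its standard deviation over the passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(inputs, rng, drop_prob) for _ in range(passes)]
    return statistics.mean(samples), statistics.pstdev(samples)


mean_probability, uncertainty = mc_dropout_predict([1.0, 2.0])
```

In an active learning setting, the unlabeled samples with the largest `uncertainty` would be natural candidates to send for annotation first.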

These algorithms will be applied to real computer vision problems in the field of Neuroscience, in collaboration with the Princeton Neuroscience Institute. These include detection and tracking of rodents in low-resolution videos, image segmentation and limb detection, motion estimation of whiskers using high-speed cameras, and in vivo calcium image segmentation of neuronal activity in rodents [5].

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[2] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).

[3] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.

[4] Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: ICML. pp. 1050–1059 (2016)

[5] Grewe, B. F., Langer, D., Kasper, H., Kampa, B. M., & Helmchen, F. (2010). High-speed in vivo calcium imaging reveals neuronal network activity with near-millisecond precision. Nature methods, 7(5), 399-405.

Dr David Masip


Human pose recovery and behavior analysis

Human action/gesture recognition is a challenging area of research. It deals with the problem of recognizing people in images, detecting and describing body parts, inferring their spatial configuration, and performing action/gesture recognition from still images or image sequences, possibly including multi-modal data. Because of the large pose parameter space inherent in human configurations, body pose recovery is a difficult problem that involves dealing with several distortions: illumination changes, partial occlusions, changes in the point of view, rigid and elastic deformations, and high inter- and intra-class variability, to mention just a few. In spite of the difficulty of the problem, modern computer vision techniques and new trends merit further attention, and promising results are expected in the next few years.

Moreover, several subareas have recently been defined, such as affective computing, social signal processing, human behaviour analysis, and social robotics. The effort involved in this area of research will be compensated by its potential applications: TV production, home entertainment (multimedia content analysis), educational purposes, sociology research, surveillance and security, improved quality of life through monitoring and automatic artificial assistance, among others.

Dr Xavier Baró SUNAI

Computer vision and cognition

Computer vision has made huge progress in the last eight years, mainly because of the appearance of big datasets of labeled images, such as ImageNet [1] and Places [2], and the success of Deep Learning algorithms when they are trained with this large amount of data [2,3]. After this turning point, performance has increased in many computer vision applications, such as scene recognition, object detection and recognition, image captioning, or visual question answering, among others.

However, despite this amazing progress, there are still some tasks that are very hard for a machine to solve, such as image question answering or describing, in detail, the content of an image. The human capacity for performing these high-level tasks is associated with our visual recognition capacity and also with our capability of reasoning about the things that we see. This line of research aims at developing new systems for visual reasoning. To be able to reason about something one needs cognition, which is a very difficult skill for a machine. One of the main obstacles to developing cognitive systems for computer vision was the lack of training data. However, recent works [4,5,6] present datasets and benchmarks that enable the modelling of such systems and open the door to new research goals.

The research line of Computer Vision and Cognition includes several sub-areas of research. Some examples are:

  • Recognition and segmentation of objects in a scene taking into account the presence or absence of other objects, which can be treated as context and represented as semantic graphs [6, 7, 8].
  • Segmentation of an object in an image or video given a referring expression. A referring expression is a natural language expression that refers to an object of the scene without any ambiguity. Whereas some systems have been proposed for images [9], little research has been done in the video domain [10, 11]. Furthermore, in the image domain, the main works on the generation and comprehension of referring expressions use bounding boxes [12] instead of regions of arbitrary shape.

More generally, this research line aims to explore how to add cognition in vision systems, to create algorithms that can reason about the visual information.


[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. “Imagenet: A large-scale hierarchical image database”. In Proc. CVPR, 2009.

[2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. "Learning Deep Features for Scene Recognition using Places Database”. Advances in Neural Information Processing Systems 27 (NIPS), 2014.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. “Imagenet classification with deep convolutional neural networks”. In In Advances in Neural Information Processing Systems, 2012.

[4] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David Ayman Shamma, Michael Bernstein, Li Fei-Fei. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”. 2016.

[5] Kenneth Marino, Ruslan Salakhutdinov, Abhinav Gupta, "The More You Know: Using Knowledge Graphs for Image Classification", Computer Vision and Pattern Recognition (CVPR), 2017.

[6] Chung-Wei Lee, Wei Fang, Chih-Kuan Yeh, Yu-Chiang Frank Wang, “Multi-Label Zero-Shot Learning with Structured Knowledge Graphs”, Computer Vision and Pattern Recognition (CVPR), 2018.

[7] Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, Yanwen Guo, “Multi-Label Image Recognition with Graph Convolutional Networks”, Computer Vision and Pattern Recognition (CVPR), 2019.

[8] Peixi Xiong, Huayi Zhan, Xin Wang, Baivab Sinha, Ying Wu, “Relation-Aware Graph Attention Network for Visual Question Answering”, International Conference on Computer Vision (ICCV), 2019.

[9] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, Tamara L. Berg, “MAttNet: Modular Attention Network for Referring Expression Comprehension”, Computer Vision and Pattern Recognition (CVPR), 2018.

[10] Anna Khoreva, Anna Rohrbach, Bernt Schiele, “Video Object Segmentation with Language Referring Expressions”, Asian Conference on Computer Vision (ACCV), 2018.

[11] Alba Maria Herrera-Palacio, Carles Ventura, Carina Silberer, Ionut-Teodor Sorodoc, Gemma Boleda, Xavier Giro-i-Nieto, “Recurrent Instance Segmentation using Sequences of Referring Expressions”, NIPS Workshops, 2019.

[12] Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A. L., & Murphy, K. Generation and comprehension of unambiguous object descriptions. Computer Vision and Pattern Recognition (CVPR), 2016.


Dr Àgata Lapedriza

Dr Carles Ventura


Emotional Intelligence for Social Robots

The World Health Organization predicts that by the year 2030, depression, along with other mood disorders, will be the number one global disease burden in terms of disability and lives lost. Today the diagnosis and tracking of mood disorders still rely on clinical assessments of self-reported depressive symptoms. Unfortunately, these methods are often inaccurate and usually do not allow early detection of this type of mood disorder.

In this context, social robots are a very interesting technology for the early detection and tracking of mood disorders. In the future, we will interact every day with social robots that we will have at home, and we will have long-term relationships with them [1]. If these robots are provided with emotional intelligence, they will have great potential to support our emotional wellbeing by tracking our mood and helping us to regulate our emotions. Prior works [2,3] support the idea that machine learning applied to data captured by wearables, smartphones and self-reports can predict people’s mood the next day in a personalized way. The idea of this project is to do the same using the interactions with the robot as data. These technologies would be particularly useful for people who live alone and have a high risk of isolation.

This research line is focused on the development of perception for the emotional intelligence of social robots and of machines in general. It includes areas of research such as emotion recognition and personality trait recognition. In particular, one of the goals of this research line is to create new multimodal Deep Learning models that process video, audio, and speech text transcript data acquired during conversations with a robot, to recognize and understand the emotional state of the person. More generally, this area of research includes other subareas related to the development of emotional intelligence in machines, like emotion recognition in the wild [4], visual sentiment analysis [5], or even the integration of commercial software for emotion recognition from facial expressions like [6]. In terms of applications, we focus on beneficial purposes such as emotional wellbeing and education.

This area of research is carried out in collaboration with the Affective Computing group and the Personal Robots group at the Massachusetts Institute of Technology (MIT) Media Lab.

[1] C. Kidd and C. Breazeal (2008). “Robots at Home: Understanding Long-Term Human-Robot Interaction”. Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2008). Nice, France.

[2] N Jaques, S Taylor, E Nosakhare, A Sano, R Picard. “Multi-task Learning for Predicting Health, Stress, and Happiness.” NIPS Workshop on Machine Learning for Health, Barcelona, Spain, December 2016.

[3] A. Ghandeharioun, S. Fedor, L. Sangermano, D. Ionescu, J. Alpert, C. Dale, D. Sontag, R. Picard. “Objective assessment of depressive symptoms with machine learning and wearable sensors data”, Affective Computing and Intelligent Interaction 2017, San Antonio, TX, Oct 2017.

[4] Kuan-Chuan Peng, Tsuhan Chen, Amir Sadovnik, Andrew Gallagher. “A Mixed Bag of Emotions: Model, Predict and Transfer Emotion Distributions”. International Conference on Computer Vision and Pattern Recognition (CVPR). 2015.

[5] R. Kosti, J.M. Alvarez, A. Recasens, A. Lapedriza. "Emotion Recognition in Context". Computer Vision and Pattern Recognition (CVPR), 2017.


Dr Àgata Lapedriza SUNAI

Retina characterization through advanced image analysis techniques

Retinal imaging is a very important way to diagnose many diseases that can affect vision. But it can also be a way to assess the health of our brain, since the retina is the only visible part of our nervous system. The health of the cells in our retina opens a window to our brain and it is becoming a key tool to detect early stages of neurological disorders.

In this sense, in vivo retinal screening using non-invasive optical imaging techniques is on its way to becoming a cost-effective alternative to magnetic resonance imaging, and is probably safer than computerized tomography (CT). In particular, adaptive-optics-assisted retinal imaging techniques can generate retinal images in a way that is safe for the patient, and individual cells can be observed at different depths in the tissue. However, it has proven challenging to extract information from these images using a robust method that can provide biomarkers in a consistent way in the ophthalmology clinic.

In this project we aim to develop automatic image analysis techniques based on artificial intelligence to allow registration between different modalities of retinal images of the same subject, build mosaics of the acquired images, and interpret them by detecting individual cells/cones in an effort to propose new biomarkers of neurodegeneration which can be applied not only in ophthalmology, but also in neurology.


Dr David Merino

Dr Ferran Prados




Medical image processing

Medical image processing is a key step in the diagnosis of a large number of diseases. Nowadays, we can acquire images of the inside and outside of our bodies using a large variety of devices (ultrasound, magnetic resonance, optical tomography, computed tomography, ...). Afterwards, the acquired images usually need to be denoised, corrected for inhomogeneities, segmented, registered, and so on, in order to extract relevant information that aids clinical decisions through image-based biomarkers. In this research line, we would like to explore the latest image processing challenges and develop new image-based biomarkers that aid clinicians in their daily work. This work will be done in collaboration with internationally recognised clinical institutions in Barcelona.

Dr Ferran Prados


Dr Jordi Casas