Network and Information Technologies

Computer Vision, Machine Learning and Pattern Recognition
Research proposal Researchers Research Group

Explainable computer vision

Over the last few years, the performance in many areas of computer vision has been significantly improved with the use of Deep Neural Networks [1] and big datasets such as ImageNet [2] or Places [3]. Deep Learning (DL) models are able to accurately recognize objects, scenes, actions or human emotions. Nevertheless, many applications require not only classification or regression outputs from DL models, but also additional information explaining why the model made a certain decision.

Explainability in Computer Vision is the area of research that studies how to build Computer Vision models based on Deep Learning (DL) that allow humans to understand why the model made a decision. An illustrative example is a system that assists cardiovascular diagnosis using medical imaging. When the system predicts a high risk of a cardiac incident and the need for an urgent operation, rather than behaving as a black box it would be very valuable to obtain an explanation of this prediction, which could consist, for example, of the location and degree of coronary stenosis or of the retrieval of similar clinical cases. With this extra information, in addition to the classification, an expert can more easily verify whether the decision made by the model is trustworthy.

More generally, explainable DL is an area of research that has recently emerged to better understand how DL models learn and perform, and what type of representations these models learn. The works [4,5] offer good overviews of recent progress in explainable DL. More particularly, [6,7,8,9] are examples of recent explainable models for computer vision. An overview of Explainable AI, as well as a discussion on how to adapt explanations to the user, can be found in [10].
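As an illustration of what a visual explanation can look like, the following is a minimal sketch of class activation mapping, the technique introduced in [6]. It assumes a network whose last convolutional feature maps are globally average pooled and then fed to a linear classifier; the function name, array shapes and random toy values are hypothetical and only stand in for a real trained network.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of the last convolutional feature maps for one class.

    feature_maps: array of shape (K, H, W), output of the last conv layer.
    class_weights: array of shape (K,), linear classifier weights of the
        class of interest (applied after global average pooling).
    Returns an (H, W) map rescaled to [0, 1] for visualization.
    """
    cam = np.tensordot(class_weights, feature_maps, axes=([0], [0]))  # (H, W)
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy data standing in for a real network (hypothetical values).
rng = np.random.default_rng(0)
features = rng.random((4, 7, 7))
weights = np.array([0.5, -0.2, 1.0, 0.1])
cam = class_activation_map(features, weights)
```

High values in `cam` mark the image regions that contributed most to the score of the chosen class, which is the kind of additional evidence an expert could inspect alongside the classification itself.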

The goal of the Explainable Computer Vision research line is to explore explainability in DL, with applications to different problems in computer vision, such as emotion perception and scene understanding. 

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009.

[3] Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Advances in neural information processing systems (pp. 487-495).

[4] Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter and Lalana Kagal, “Explaining Explanations: An Approach to Evaluating Interpretability of Machine Learning”.

[5] Q. Zhang, S.-C. Zhu, “Visual interpretability for deep learning: a survey”, Frontiers of Information Technology & Electronic Engineering, 2018.

[6] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba. Learning Deep Features for Discriminative Localization, CVPR, 2016.

[7] J. Lu, J. Yang, D. Batra, D. Parikh, “Hierarchical question-image co-attention for visual question answering,” in NIPS, 2016.

[8] B. Zhou, Y. Sun, D. Bau, A. Torralba, “Interpretable Basis Decomposition for visual explanations”. European Conference on Computer Vision (ECCV) 2018.

[9] D. Bau, J-Y Zhu, H. Strobelt, A. Lapedriza, B. Zhou, A. Torralba, "Understanding the role of individual units in a deep neural network", Proceedings of the National Academy of Sciences (PNAS), 2020.

[10] Ribera, Mireia, and Agata Lapedriza. "Can we do better explanations? A proposal of user-centered explainable AI." IUI Workshops. 2019.

Dr Àgata Lapedriza


Emotion perception from facial expressions
Facial expressions are a very important source of information for the development of new technologies. As humans, we use our faces to communicate our emotions, and psychologists have studied emotions in faces since the publication of Charles Darwin’s early works [1]. One of the most successful emotion models is the Facial Action Coding System (FACS) [2], where a particular set of action units (facial muscle movements) act as the building blocks of six basic emotions (happiness, surprise, fear, anger, disgust, sadness). The automatic understanding of this universal language (very similar in almost all cultures) is one of the most important research areas in computer vision. It has applications in many fields, such as design of intelligent user interfaces, human-computer interaction, diagnosis of disorders and even in the field of reactive publicity.
Nevertheless, there exists a far greater range of emotions than just this basic set. We can predict, with better-than-chance accuracy, the outcome of a negotiation, users' preferences in binary decisions [3], perceived deception, etc. In this line of research we collaborate with the Social Perception Lab at Princeton University to apply automated algorithms to real data from psychology labs.
In this line of research we propose to apply Deep Learning methods to allow computers to perceive human emotions from facial images/videos [4]. We propose to use self-supervised [5] and semi-supervised methods to take advantage of the large amount of public unlabeled data available on the Internet. We conjecture that large-scale data acquired in the wild will pave the way for improved emotion classifiers in real applications.
We will put special emphasis on the applications of emotion perception in e-health, where multimodal data extracted from images and text from social media [6] can provide relevant cues for diagnosing mental health disorders.
Perceiving emotions in children from facial expressions
We are interested in exploring innovative applications of advanced technology in mixed research designs with children in natural contexts (e.g. school). Our interest is focused on the automatic recognition of facial emotions. On the one hand, we are working to find new solutions to collect, code, process and visualize audio-visual data in an efficient, ethical and non-intrusive way. On the other hand, we work at the post-processing level of the data generated by automatic recognition in machine learning by applying statistical models and techniques for dynamic data analysis and spatio-temporal pattern analysis, including coding and data compression methods. 
The PhD thesis will be carried out in collaboration with the Child Tech Lab Research Group and the AI for Human Well-being Lab Research Group at the Universitat Oberta de Catalunya.
[1] Darwin, Charles (1872), The expression of the emotions in man and animals, London: John Murray.
[2] P. Ekman and W. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.
[3] Masip D, North MS, Todorov A, Osherson DN (2014) Automated Prediction of Preferences Using Facial Expressions. PLoS ONE 9(2): e87434.doi:10.1371/journal.pone.0087434
[4] Pons, G., & Masip, D. (2018). Supervised Committee of Convolutional Neural networks in automated facial expression analysis. IEEE Transactions on Affective Computing, 9(3), 343-350.


Dr Lucrezia Crescenzi


Child Tech Lab
(this research line temporarily does not accept new PhD candidates)
Deep-learning algorithms
In recent years, end-to-end learning algorithms have revolutionized many areas of research, such as computer vision [1], natural language processing [2], gaming [3], robotics, etc. Deep-learning techniques have achieved the highest levels of success in many of these tasks, given their astonishing capability to model both the features/filters and the classification rule.
The output of a Neural Network is usually a score: the probability of a user-defined label or a regressed value. Real-life applications will also require explicit values for the uncertainty of this prediction (e.g. if the system diagnoses a cancer, we need the type, the probability and the certainty of the prediction). We propose to develop novel Deep Learning methods that model the uncertainty of their predictions [4], and that also exploit this uncertainty to perform active and semi-supervised learning on large-scale unlabeled data sets.
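One way to obtain such uncertainty values is Monte Carlo dropout, the approach described in [4]: dropout is kept active at prediction time and the outputs of several stochastic forward passes are averaged. The sketch below is only an illustration under stated assumptions, not the group's method: the two-layer network, its random untrained weights, the function names and the use of predictive entropy as the uncertainty score are all hypothetical choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, w1, w2, rng, n_samples=50, drop_prob=0.5):
    """Average class probabilities over stochastic forward passes.

    Dropout stays active at prediction time; each pass samples a new mask.
    Returns (mean_probs, entropy), where entropy is the predictive entropy
    of the averaged distribution, used here as the uncertainty score.
    """
    probs = []
    for _ in range(n_samples):
        hidden = np.maximum(x @ w1, 0.0)                # ReLU layer
        mask = rng.random(hidden.shape) >= drop_prob    # keep with prob 1 - p
        hidden = hidden * mask / (1.0 - drop_prob)      # inverted dropout
        probs.append(softmax(hidden @ w2))
    mean_probs = np.mean(probs, axis=0)
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=-1)
    return mean_probs, entropy

rng = np.random.default_rng(0)
w1 = rng.normal(size=(8, 16))          # hypothetical, untrained weights
w2 = rng.normal(size=(16, 3))
unlabeled = rng.normal(size=(10, 8))   # toy pool of unlabeled samples

mean_probs, entropy = mc_dropout_predict(unlabeled, w1, w2, rng)
# Active learning step: query labels for the most uncertain samples first.
query_order = np.argsort(-entropy)
```

In an active learning loop, the samples at the front of `query_order` would be sent to an annotator first, while confidently predicted samples could be used as pseudo-labels in a semi-supervised setting.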
These algorithms will be applied to real computer vision problems in the field of Neuroscience, in collaboration with the Princeton Neuroscience Institute. These range from detection and tracking of rodents in low resolution videos, image segmentation and limb detection, and motion estimation of whiskers using high-speed cameras, to in vivo calcium image segmentation of neuronal activity in rodents [5].
We also develop curriculum learning strategies based on uncertainty that can be applied to animal modelling. Recent collaborations include the CSIC-CEAB (Advanced Studies Center) and the Mosquito Alert citizen science project [6].
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[2] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).
[3] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.
[4] Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: ICML. pp. 1050–1059 (2016).
[5] Grewe, B. F., Langer, D., Kasper, H., Kampa, B. M., & Helmchen, F. (2010). High-speed in vivo calcium imaging reveals neuronal network activity with near-millisecond precision. Nature methods, 7(5), 399-405.
[6] Adhane, G., Dehshibi, M. M., and Masip, D. (2021). A Deep Convolutional Neural Network for Classification of Aedes Albopictus Mosquitoes. IEEE Access, 9, 72681-72690.



Human pose recovery and behavior analysis

Human action/gesture recognition is a challenging area of research. It deals with the problem of recognizing people in images, detecting and describing body parts, inferring their spatial configuration, and performing action/gesture recognition from still images or image sequences also including multi-modal data. Because of the large pose parameter space inherent in human configurations, body pose recovery is a difficult problem that involves dealing with several distortions: illumination changes, partial occlusions, changes in the point of view, rigid and elastic deformations, and high inter- and intra-class variability, to mention just a few. In spite of the difficulty the problem presents, modern computer vision techniques and new trends merit further attention, and promising results are expected in the next few years.

Moreover, several subareas have recently been defined, such as affective computing, social signal processing, human behaviour analysis, and social robotics. The effort involved in this area of research will be compensated by its potential applications: TV production, home entertainment (multimedia content analysis), educational purposes, sociology research, surveillance and security, improved quality of life through monitoring and automatic artificial assistance, among others.

Dr Xavier Baró


Emotional Intelligence for Social Robots
In the future, we will interact every day with social robots that we will have at home, and we will have long-term relationships with them [1]. In fact, we are already seeing research on social robots that help in several areas of healthcare [2], robots that will be able to assist elderly people [3], or robots that help with teaching [4]. For these social robots to communicate fluently with people, an important aspect is the capacity to be empathic: to perceive expressions of emotions, preferences, needs, or intents, and to react to those expressions in a socially and emotionally intelligent manner. This research line focuses on the design and implementation of technologies that give social robots these emotional intelligence skills, which are essential to maintain social interactions with humans. In particular, one of the goals of this research line is to create new multimodal Deep Learning models that process video, audio, and speech transcript data acquired during conversations with a robot, to recognize and understand the emotional state of the person and/or to recognize personality traits.
More generally, this area of research includes other subareas related to the development of emotional intelligence in machines, such as emotion recognition in the wild [5], automatic understanding of dyadic interactions [6], personality trait recognition [6], text sentiment analysis [7], visual sentiment analysis [8], or the integration of commercial software for expression recognition [9] into the decision-making process of the robot. In terms of applications, we focus on human assistance, human companionship, entertainment, and emotional wellbeing.
[1] C. Kidd and C. Breazeal (2008). “Robots at Home: Understanding Long-Term Human-Robot Interaction”. Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2008). Nice, France.
[2]  Breazeal, C. (2011, August). Social robots for health applications. In 2011 Annual international conference of the IEEE engineering in medicine and biology society (pp. 5368-5371). IEEE.
[3] Broekens, J., Heerink, M., & Rosendal, H. (2009). Assistive social robots in elderly care: a review. Gerontechnology, 8(2), 94-103.
[4] Belpaeme, T., Kennedy, J., Ramachandran, A., Scassellati, B., & Tanaka, F. (2018). Social robots for education: A review. Science robotics, 3(21).
[5] R. Kosti, J.M. Álvarez, A. Recasens and A. Lapedriza, "Context based emotion recognition using emotic dataset", IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2019.
[6] Palmero, C., Selva, J., Smeureanu, S., Junior, J. C. J., Clapés, A., Moseguí, A., ... & Escalera, S. (2021, January). Context-Aware Personality Inference in Dyadic Scenarios: Introducing the UDIVA Dataset. In WACV (Workshops) (pp. 1-12).
[7] Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., & Lehmann, S. (2017). Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. arXiv preprint arXiv:1708.00524.
[8] Ortis, A., Farinella, G. M., & Battiato, S. (2020). Survey on visual sentiment analysis. IET Image Processing, 14(8), 1440-1456.

Dr Àgata Lapedriza


Dr Carles Ventura

Dr Lucrezia Crescenzi


Medical diagnosis using retinal imaging
In recent years, the diagnosis of several medical conditions using retinal imaging has gained traction. Some examples are cardiometabolic risk [1,2], anemia [3], kidney disease [4], dementia [5] and uveitis. This research project will focus on analyzing retinal imaging (fundus, OCT or retinal angiography) and developing Deep Learning models capable of early diagnosis to enable preventive treatment.
Advanced retinal imaging techniques can be of great impact in this field of research. The research group has access to images acquired using high resolution retinal imaging techniques, such as AOSLO, and also to spectral information obtained by means of Raman spectroscopy.
There are two main difficulties when dealing with these applications:
  • Small sample sizes: the number of samples N is usually small, which affects the generalization capabilities of the network. We will explore transfer learning methods to address this problem.
  • The resulting models should be explainable and easy to interpret. We will provide both a classification score and an explanation of this score, to make the early diagnosis more reliable and trustworthy.
The resulting methods will be transferred to hospitals from the Barcelona Metropolitan area, and the research efforts will result in a strong social return.
[1] Gerrits, N., Elen, B., Van Craenendonck, T., Triantafyllidou, D., Petropoulos, I. N., Malik, R. A., and De Boever, P. (2020). Age and sex affect deep learning prediction of cardiometabolic risk factors from retinal images. Scientific reports, 10(1), 1-9.
[2] Barriada, R. G., Simó-Servat, O., Planas, A., Hernández, C., Simó, R., & Masip, D. (2022). Deep Learning of Retinal Imaging: A Useful Tool for Coronary Artery Calcium Score Prediction in Diabetic Patients. Applied Sciences, 12(3), 1401. 
[3] Tham, Y. C., Cheng, C. Y., and Wong, T. Y. (2020). Detection of anaemia from retinal images. Nature biomedical engineering, 4(1), 2-3.
[4] Sabanayagam, C., Xu, D., Ting, D. S., Nusinovici, S., Banu, R., Hamzah, H., ... & Wong, T. Y. (2020). A deep learning algorithm to detect chronic kidney disease from retinal photographs in community-based populations. The Lancet Digital Health, 2(6), e295-e302.
[5] McGrory, S., Cameron, J. R., Pellegrini, E., Warren, C., Doubal, F. N., Deary, I. J., ... and MacGillivray, T. J. (2017). The application of retinal fundus camera imaging in dementia: a systematic review. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring, 6, 91-1

Dr David Merino

Computer Vision and Language 
We have seen huge progress in computer vision over the last 8 years, mainly because of the appearance of large datasets of labeled images, such as ImageNet [1] and Places [2], and the success of Deep Learning algorithms when they are trained with such large amounts of data [2,3]. After this turning point, performance has increased in many computer vision applications, such as scene recognition, object detection, or action recognition, among others.
However, despite this remarkable progress, some high-level tasks are still very hard for a machine to solve, such as image question answering or describing, in detail, the content of an image. The human capacity for performing these high-level tasks is associated with our visual recognition capacity and also with our capability of reasoning about the things that we see. This line of research aims at developing new systems for visual reasoning based on language.
The research line of Computer Vision and Language includes several sub-areas of research. Some examples are:
  • Scene graph prediction, which aims at recognizing the objects present in a scene as well as the relationships among these objects [4, 5, 6, 7, 8].
  • Visual Question Answering (VQA): given an image or a video and a question about it, the goal is to retrieve or generate the answer to the question [13].
  • Image/Video Captioning: given an image or a video, the goal of captioning is to generate or retrieve a short description of the image or the video content [14, 15].
  • Referring Image/Video Segmentation: segmentation of an object in an image or video given a referring expression. A referring expression is a natural language expression that refers to an object of the scene without any ambiguity. Whereas some systems have been proposed for images [9], little research has been done in the video domain [10, 11]. Furthermore, in the image domain, most work on the generation and comprehension of referring expressions has used bounding boxes [12] instead of regions of arbitrary shape.
More generally, this research line aims to explore how to add cognition in vision systems, to create algorithms that can reason about the visual information.
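To make the notion of a scene graph more concrete, the following minimal sketch represents one as a set of objects plus (subject, predicate, object) relations, and shows how such a structure could support a simple question about the image. The class, the function and the example scene are hypothetical and chosen only for illustration; real systems predict these relations from pixels.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str

# Hypothetical scene graph for an image of a person riding a horse on a beach.
scene_objects = {"person", "horse", "hat", "beach"}
relations = [
    Relation("person", "riding", "horse"),
    Relation("person", "wearing", "hat"),
    Relation("horse", "on", "beach"),
]

def answer_what(subject, predicate, relations):
    """Return the objects that `subject` is related to through `predicate`."""
    return [r.obj for r in relations
            if r.subject == subject and r.predicate == predicate]

# "What is the person riding?"
answer = answer_what("person", "riding", relations)
```

A question such as "What is the person riding?" then reduces to a lookup over the relations, which illustrates how an explicit structured representation can support reasoning about visual content.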
[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. “Imagenet: A large-scale hierarchical image database”. In Proc. CVPR, 2009.
[2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. "Learning Deep Features for Scene Recognition using Places Database”. Advances in Neural Information Processing Systems 27 (NIPS), 2014.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. “Imagenet classification with deep convolutional neural networks”. In Advances in Neural Information Processing Systems, 2012.
[4] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David Ayman Shamma, Michael Bernstein, Li Fei-Fei. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”. 2016.
[5] Kenneth Marino, Ruslan Salakhutdinov, Abhinav Gupta," The More You Know: Using Knowledge Graphs for Image Classification", Computer Vision and Pattern Recognition (CVPR), 2017.
[6] Chung-Wei Lee, Wei Fang, Chih-Kuan Yeh, Yu-Chiang Frank Wang, “Multi-Label Zero-Shot Learning with Structured Knowledge Graphs”, Computer Vision and Pattern Recognition (CVPR), 2018.
[7] Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, Yanwen Guo, “Multi-Label Image Recognition with Graph Convolutional Networks”, Computer Vision and Pattern Recognition (CVPR), 2019.
[8] Peixi Xiong, Huayi Zhan, Xin Wang, Baivab Sinha, Ying Wu, “Relation-Aware Graph Attention Network for Visual Question Answering”, International Conference on Computer Vision (ICCV), 2019.
[9] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, Tamara L. Berg, “MAttNet: Modular Attention Network for Referring Expression Comprehension”, Computer Vision and Pattern Recognition (CVPR), 2018
[10] Anna Khoreva, Anna Rohrbach, Bernt Schiele, “Video Object Segmentation with Language Referring Expressions”, Asian Conference on Computer Vision (ACCV), 2018
[11] Alba Maria Herrera-Palacio, Carles Ventura, Carina Silberer, Ionut-Teodor Sorodoc, Gemma Boleda, Xavier Giro-i-Nieto, “Recurrent Instance Segmentation using Sequences of Referring Expressions”, NIPS Workshops, 2019
[12] Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A. L., & Murphy, K. Generation and comprehension of unambiguous object descriptions. Computer Vision and Pattern Recognition (CVPR), 2016
[13] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6904-6913).
[14] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2016). Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4), 652-663.
[15] Li, S., Tao, Z., Li, K., & Fu, Y. (2019). Visual to text: Survey of image and video captioning. IEEE Transactions on Emerging Topics in Computational Intelligence, 3(4), 297-312.
Scene-Centric Visual Understanding
Understanding complex visual scenes is one of the hallmark tasks of computer vision. Given a picture or a video, the goal of scene understanding is to build a representation of its content (e.g., what objects are in the picture, how they are related, what actions any people in the picture are performing, what place is depicted, etc.). With the appearance of large-scale databases like ImageNet [1] and Places [2], and the recent success of machine learning techniques such as Deep Neural Networks [3], scene understanding has progressed considerably, making it possible to build vision systems capable of addressing some of the mentioned tasks [4].
In this research line, our goal is to improve existing algorithms for understanding scene-centric images or videos, and to define new problems that become reachable now, thanks to the recent advances in deep neural networks and machine learning. Some examples of these potential new problems are understanding general scenes [5] or understanding human-centric situations [6]. A paper related to one of our recent projects in this area can be found in [12].
An interesting topic in the context of understanding scene-centric images and videos is instance segmentation. Instance segmentation consists of assigning each pixel a label that identifies both the semantic category of the object (e.g. car, person, etc.) and the specific instance of that object (i.e. if there are multiple cars in an image, instance segmentation means segmenting each car separately). Although image segmentation has improved greatly in recent years, with some techniques such as Mask R-CNN [7] becoming very popular, few algorithms exploit the temporal domain for video object segmentation.
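The difference between a semantic label and an instance label can be sketched with two small label maps. The toy image, the class ids and the helper function below are hypothetical and only illustrate the definition; they are not the output of any particular model.

```python
import numpy as np

# Toy 4x6 image with two cars and one person (hypothetical labels).
# Semantic map: one class id per pixel (0 = background, 1 = car, 2 = person).
semantic = np.array([
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [2, 2, 0, 0, 0, 0],
])
# Instance map: one instance id per pixel (0 = background).
instance = np.array([
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
    [3, 3, 0, 0, 0, 0],
])

def instances_by_class(semantic, instance):
    """Return {instance_id: (class_id, binary_mask)} for every object."""
    result = {}
    for inst_id in np.unique(instance):
        if inst_id == 0:
            continue  # skip background
        mask = instance == inst_id
        class_id = int(semantic[mask][0])  # all pixels of an instance share a class
        result[int(inst_id)] = (class_id, mask)
    return result

objects = instances_by_class(semantic, instance)
car_ids = [i for i, (c, _) in objects.items() if c == 1]
```

Both cars share the same semantic class but receive different instance ids, so each one can be segmented (and, in video, tracked) separately.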
Most video object segmentation techniques are frame-based, which means that an image segmentation technique is applied at every frame independently, and postprocessing techniques are used to connect the objects segmented at each frame along the video. There are a few video object segmentation techniques that leverage additional cues from videos, such as motion, using optical flow. However, these architectures are not end-to-end trainable.
This research line includes the study of end-to-end trainable architectures that exploit spatio-temporal features. Recently, some video object segmentation benchmarks have been released, e.g. YouTube-VOS [8] and YouTube-VIS [9], with many more annotated objects than the previous video object segmentation benchmark (DAVIS [10]). Two different challenges are defined: (i) semi-supervised video object segmentation (also referred to as one-shot), where the masks of the objects to be segmented are given at the first frame, and (ii) unsupervised video object segmentation (also referred to as zero-shot), where no masks are given. The proposed RVOS model [11] could be a starting point for exploration.
[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. “Imagenet: A large-scale hierarchical image database”. In Proc. CVPR, 2009.
[2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. “Learning Deep Features for Scene Recognition using Places Database”. Advances in Neural Information Processing Systems 27 (NIPS), 2014.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. “Imagenet classification with deep convolutional neural networks”. In Advances in Neural Information Processing Systems, 2012.
[4] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. “Learning Deep Features for Discriminative Localization”. Computer Vision and Pattern Recognition (CVPR), 2016
[5] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh, "Graph R-CNN for Scene Graph Generation". European Conference on Computer Vision (ECCV), 2018.
[6] M. Yatskar, L. Zettlemoyer, A. Farhadi, "Situation recognition: Visual semantic role labeling for image understanding", Computer Vision and Pattern Recognition (CVPR), 2016.
[7] K. He, G. Gkioxari, P. Dollár & R. Girshick. Mask R-CNN. International Conference on Computer Vision (ICCV), 2017.
[8] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang & T. Huang. Youtube-vos: A large-scale video object segmentation benchmark, 2018, arXiv preprint arXiv:1809.03327.
[9] L. Yang, Y. Fan, N. Xu. Video Instance Segmentation. International Conference on Computer Vision (ICCV), 2019.
[10] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung & L. Van Gool. The 2017 davis challenge on video object segmentation, 2017 arXiv preprint arXiv:1704.00675.
[11] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques & X. Giro-i-Nieto. RVOS: End-to-end recurrent network for video object segmentation. Computer Vision and Pattern Recognition (CVPR), 2019.
[12] E. Weber, N. Marzo, D. P. Papadopoulos, A. Biswas, A. Lapedriza, F. Ofli, M. Imran, A. Torralba, "Detecting natural disasters, damage, and incidents in the wild", European Conference on Computer Vision (ECCV), 2020.