Computer Vision, Machine Learning and Pattern Recognition

Thesis proposals | Researchers | Research group

Explainable computer vision

Over the last few years, the performance in many areas of computer vision has been significantly improved with the use of Deep Neural Networks [1] and big datasets such as ImageNet [2] or Places [3]. Deep Learning (DL) models are able to accurately recognize objects, scenes, actions or human emotions. Nevertheless, many applications require not only classification or regression outputs from DL models, but also additional information explaining why the model made a certain decision.

Explainability in computer vision is the area of research that studies how to build DL models that allow humans to understand why the model made a given decision. A paradigmatic example is a system for assisting cardiovascular diagnosis using medical imaging. When the system predicts a high risk of cardiac incident and the need for an urgent operation, rather than acting as a black box it would be very valuable to obtain an explanation of this prediction, which could consist, for example, of the location and degree of coronary stenosis, or of the retrieval of similar clinical cases. With this extra information, in addition to the classification, an expert can more easily verify whether the decision made by the model is trustworthy.

More generally, explainable DL is an area of research that has recently emerged to better understand how DL models learn and perform, and what type of representations these models learn. The works [4,5] offer good overviews of the recent progress in explainable DL. In particular, [6,7,8] are examples of recent explainable models for computer vision.
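As an illustration of the kind of explanation these models produce, the Class Activation Mapping approach of [6] builds a heatmap showing which image regions support a predicted class. The following is a minimal sketch of that idea; the pretrained torchvision ResNet-18 and the file name are illustrative assumptions, not part of the proposal.

```python
# Minimal Class Activation Mapping (CAM) sketch in the spirit of [6].
# Assumes a torchvision ResNet-18 pretrained on ImageNet as a stand-in model.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

model = models.resnet18(pretrained=True).eval()

features = {}
def hook(module, inputs, output):
    features["conv"] = output  # last convolutional feature maps, shape (1, 512, 7, 7)

model.layer4.register_forward_hook(hook)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    logits = model(img)
    class_idx = logits.argmax(dim=1).item()

# CAM: weight the last conv feature maps by the fc weights of the predicted class.
fc_weights = model.fc.weight[class_idx]                   # (512,)
conv_maps = features["conv"].squeeze(0)                   # (512, 7, 7)
cam = torch.einsum("c,chw->hw", fc_weights, conv_maps)    # (7, 7)
cam = F.relu(cam)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
cam = F.interpolate(cam[None, None], size=(224, 224), mode="bilinear",
                    align_corners=False).squeeze()        # upsample to image size
print("Predicted class:", class_idx, "- CAM shape:", tuple(cam.shape))
```

The resulting heatmap can be overlaid on the input image so an expert can see which regions drove the decision.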

The goal of the Explainable Computer Vision research line is to explore explainability in DL, with applications to different problems in computer vision, such as emotion perception and scene understanding.

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
[3] Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Advances in neural information processing systems (pp. 487-495).
[4] Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter and Lalana Kagal, “Explaining Explanations: An Approach to Evaluating Interpretability of Machine Learning”, https://arxiv.org/abs/1806.00069v2.
[5] Q. Zhang and S.-C. Zhu, “Visual interpretability for deep learning: a survey”, Frontiers of Information Technology & Electronic Engineering, 2018.
[6] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. “Learning Deep Features for Discriminative Localization”. In Proc. CVPR, 2016.
[7] J. Lu, J. Yang, D. Batra, and D. Parikh, “Hierarchical question-image co-attention for visual question answering”. In NIPS, 2016.
[8] B. Zhou, Y. Sun, D. Bau, A. Torralba, “Interpretable Basis Decomposition for visual explanations”. European Conference on Computer Vision (ECCV) 2018.

 

Dr Àgata Lapedriza

Dr David Masip

SUNAI

Scene Understanding

Understanding complex visual scenes is one of the hallmark tasks of computer vision. Given a picture or a video, the goal of scene understanding is to build a representation of the content of the picture (e.g., what objects are in the picture, how they are related, what actions any depicted people are performing, what place is depicted, etc.). With the appearance of large-scale databases like ImageNet [1] and Places [2], and the recent success of machine learning techniques such as Deep Neural Networks [3], scene understanding has made substantial progress, making it possible to build vision systems capable of addressing some of the mentioned tasks [4].

In this research line, in collaboration with the computer vision group at the Massachusetts Institute of Technology (MIT), our goal is to improve existing algorithms for scene understanding and to define new problems that are now within reach thanks to the recent advances in deep neural networks and machine learning. You can try the online demo of our state-of-the-art system for scene category recognition and scene attribute recognition:

http://places.csail.mit.edu/demo.html
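As a minimal sketch of how such a scene classifier is queried, the snippet below runs a pretrained CNN on an image and prints the top-5 categories. The torchvision ResNet-18 (ImageNet weights) is an illustrative stand-in for the Places-trained models behind the demo, and the file name is hypothetical.

```python
# Minimal sketch of querying a scene-category classifier with a pretrained CNN.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet18(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("scene.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    probs = torch.softmax(model(img), dim=1).squeeze(0)

top_probs, top_idx = probs.topk(5)
for p, i in zip(top_probs.tolist(), top_idx.tolist()):
    print(f"category {i}: {p:.3f}")  # map indices to names with a category label file
```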

The goals of this research line are to understand general scenes [5] and also to understand human-centric situations [6].

References:

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. “Imagenet: A large-scale hierarchical image database”. In Proc. CVPR, 2009.

[2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. “Learning Deep Features for Scene Recognition using Places Database”. Advances in Neural Information Processing Systems 27 (NIPS), 2014.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. “Imagenet classification with deep convolutional neural networks”. In Advances in Neural Information Processing Systems, 2012.

[4] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. “Learning Deep Features for Discriminative Localization”. Computer Vision and Pattern Recognition (CVPR), 2016

[5] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh, "Graph R-CNN for Scene Graph Generation". European Conference on Computer Vision (ECCV), 2018.

[6] M. Yatskar, L. Zettlemoyer, A. Farhadi, "Situation recognition: Visual semantic role labeling for image understanding", Computer Vision and Pattern Recognition (CVPR), 2016.

Dr Àgata Lapedriza SUNAI

Recognition of facial expressions

Facial expressions are a very important source of information for the development of new technologies. As humans, we use our faces to communicate our emotions, and psychologists have studied emotions in faces since the publication of Charles Darwin’s early works [1]. One of the most successful emotion models is the Facial Action Coding System (FACS) [2], in which particular sets of action units (facial muscle movements) act as the building blocks of six basic emotions (happiness, surprise, fear, anger, disgust, sadness). The automatic understanding of this universal language (very similar across almost all cultures) is one of the most important research areas in computer vision. It has applications in many fields, such as the design of intelligent user interfaces, human-computer interaction, diagnosis of disorders, and even reactive advertising.
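As a toy illustration of how FACS-style outputs can be used, the sketch below maps a set of detected action units to a basic emotion label. The emotion prototypes are simplified approximations of well-known AU combinations, and the input AU set is assumed to come from some hypothetical facial-expression analysis model; this is not the full FACS coding.

```python
# Illustrative sketch: mapping detected FACS action units (AUs) to a basic
# emotion label. The prototypes are simplified, illustrative approximations.
EMOTION_PROTOTYPES = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "surprise": {1, 2, 5, 26},   # brow raisers + upper lid raiser + jaw drop
    "sadness": {1, 4, 15},       # inner brow raiser + brow lowerer + lip corner depressor
}

def classify_emotion(detected_aus: set) -> str:
    """Return the prototype emotion whose AUs best overlap the detected set."""
    def score(emotion: str) -> float:
        proto = EMOTION_PROTOTYPES[emotion]
        return len(proto & detected_aus) / len(proto)
    best = max(EMOTION_PROTOTYPES, key=score)
    return best if score(best) > 0 else "neutral"

# Example: AUs detected by a hypothetical AU detector.
print(classify_emotion({6, 12}))        # -> happiness
print(classify_emotion({1, 2, 5, 26}))  # -> surprise
```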

Nevertheless, there exists a far greater range of emotions than just this basic set. From facial expressions we can predict, with better-than-chance accuracy, the outcome of a negotiation, the preferences of users in binary decisions [3], perceived deception, etc. In this line of research we collaborate with the Social Perception Lab at Princeton University (http://tlab.princeton.edu/) to apply automated algorithms to real data from psychology labs.

In this line of research we propose to apply Deep Learning methods to allow computers to perceive human emotions from facial images and videos [4]. We propose to use self-supervised [5] and semi-supervised methods to benefit from the large amounts of public unlabeled data available on the Internet. We conjecture that large-scale data acquired in the wild will pave the way for improved emotion classifiers in real applications.
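A minimal sketch of one such semi-supervised strategy, pseudo-labeling: a model trained on labeled faces assigns labels to confident unlabeled examples, which are then added to the training set. The model, the confidence threshold, and the data loaders are illustrative assumptions, not the specific methods of this line.

```python
# Minimal pseudo-labeling sketch for semi-supervised facial expression
# recognition (model, threshold, and loaders are illustrative assumptions).
import torch
import torch.nn as nn

def pseudo_label(model: nn.Module, unlabeled_loader, threshold: float = 0.95):
    """Collect (image, predicted label) pairs the model is confident about."""
    model.eval()
    pseudo_images, pseudo_labels = [], []
    with torch.no_grad():
        for images in unlabeled_loader:          # loader yields image batches
            probs = torch.softmax(model(images), dim=1)
            conf, preds = probs.max(dim=1)
            keep = conf >= threshold             # keep only confident predictions
            pseudo_images.append(images[keep])
            pseudo_labels.append(preds[keep])
    return torch.cat(pseudo_images), torch.cat(pseudo_labels)

# Usage sketch: retrain on labeled data plus the confident pseudo-labeled faces.
# images, labels = pseudo_label(model, unlabeled_loader)
```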

[1] Darwin, Charles (1872), The expression of the emotions in man and animals, London: John Murray.

[2] P. Ekman and W. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.

[3] Masip D, North MS, Todorov A, Osherson DN (2014). Automated Prediction of Preferences Using Facial Expressions. PLoS ONE 9(2): e87434. doi:10.1371/journal.pone.0087434

[4] Pons, G., & Masip, D. (2018). Supervised Committee of Convolutional Neural networks in automated facial expression analysis. IEEE Transactions on Affective Computing, 9(3), 343-350.

[5] Wiles, O., Koepke, A., & Zisserman, A. (2018). Self-supervised learning of a facial attribute embedding from video. arXiv preprint arXiv:1808.06882.

 
Dr David Masip SUNAI

Deep-learning algorithms

In recent years, end-to-end learning algorithms have revolutionized many areas of research, such as computer vision [1], natural language processing [2], gaming [3], robotics, etc. Deep-learning techniques have achieved the highest levels of success in many of these tasks, given their astonishing capability to model both the features/filters and the classification rule.

The output of a neural network is usually a score: the probability of, or a regression value for, a user-defined label. Real-life applications also require explicit estimates of the uncertainty of this prediction (e.g., if the system diagnoses a cancer, we need the type, the probability, and the certainty of the prediction). We propose to develop novel Deep Learning methods that model the uncertainty of their predictions [4], and also exploit this uncertainty to perform active and semi-supervised learning on large-scale unlabeled data sets.
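A minimal sketch of one way to obtain such uncertainty estimates, Monte Carlo dropout [4]: dropout is kept active at test time and the prediction is repeated several times, so that the spread of the outputs approximates the model's predictive uncertainty. The small network, input dimensions, and number of samples are illustrative assumptions.

```python
# Monte Carlo dropout sketch [4]: keep dropout active at test time and use
# the spread of repeated predictions as an uncertainty estimate.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 2),                      # e.g. two diagnostic classes
)

def predict_with_uncertainty(model: nn.Module, x: torch.Tensor, n_samples: int = 50):
    model.train()                          # keeps dropout layers active at inference
    with torch.no_grad():
        samples = torch.stack([torch.softmax(model(x), dim=1)
                               for _ in range(n_samples)])
    mean = samples.mean(dim=0)             # predictive probability
    std = samples.std(dim=0)               # per-class uncertainty
    return mean, std

x = torch.randn(1, 128)                    # placeholder input features
mean, std = predict_with_uncertainty(model, x)
print("prediction:", mean, "uncertainty:", std)
```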

These algorithms will be applied to real computer vision problems in the field of neuroscience, in collaboration with the Princeton Neuroscience Institute. They include the detection and tracking of rodents in low-resolution videos, image segmentation and limb detection, motion estimation of whiskers using high-speed cameras, and in vivo calcium image segmentation of neuronal activity in rodents [5].

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[2] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).

[3] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.

[4] Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: ICML. pp. 1050–1059 (2016)

[5] Grewe, B. F., Langer, D., Kasper, H., Kampa, B. M., & Helmchen, F. (2010). High-speed in vivo calcium imaging reveals neuronal network activity with near-millisecond precision. Nature methods, 7(5), 399-405.

 
Dr David Masip

SUNAI 

Human pose recovery and behavior analysis

Human action/gesture recognition is a challenging area of research. It deals with the problem of recognizing people in images, detecting and describing body parts, inferring their spatial configuration, and performing action/gesture recognition from still images or image sequences, possibly including multi-modal data. Because of the large pose parameter space inherent in human configurations, body pose recovery is a difficult problem that involves dealing with several sources of distortion: illumination changes, partial occlusions, changes in the point of view, rigid and elastic deformations, and high inter- and intra-class variability, to mention just a few. In spite of the difficulty the problem presents, modern computer vision techniques and new trends merit further attention, and promising results are expected in the next few years.
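A minimal sketch of body-pose recovery on a still image using a pretrained keypoint detector from torchvision; the specific model, input file, and score threshold are illustrative assumptions rather than the methods proposed in this line.

```python
# Minimal body-pose recovery sketch using a pretrained Keypoint R-CNN
# (model choice and confidence threshold are illustrative assumptions).
import torch
from torchvision import transforms
from torchvision.models.detection import keypointrcnn_resnet50_fpn
from PIL import Image

model = keypointrcnn_resnet50_fpn(pretrained=True).eval()
img = transforms.ToTensor()(Image.open("person.jpg").convert("RGB"))

with torch.no_grad():
    output = model([img])[0]               # predictions for the single image

for score, keypoints in zip(output["scores"], output["keypoints"]):
    if score < 0.9:                        # keep only confident person detections
        continue
    # keypoints: (17, 3) tensor of (x, y, visibility) for the COCO body joints
    print(f"person detected (score {score:.2f}), nose at "
          f"({keypoints[0, 0]:.0f}, {keypoints[0, 1]:.0f})")
```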

Moreover, several subareas have recently been defined, such as affective computing, social signal processing, human behaviour analysis, and social robotics. The effort invested in this area of research is rewarded by its potential applications: TV production, home entertainment (multimedia content analysis), educational purposes, sociology research, surveillance and security, and improved quality of life through monitoring and automatic artificial assistance, among others.

Dr Xavier Baró SUNAI

Computer vision and cognition

Computer vision has made huge progress in the last six years, mainly because of the appearance of large datasets of labeled images, such as ImageNet [1] and Places [2], and the success of deep learning algorithms when trained on these large amounts of data [2,3]. Since this turning point, performance has increased in many computer vision applications, such as scene recognition, object detection and recognition, image captioning, and visual question answering, among others.

However, despite this impressive progress, there are still tasks that remain very hard for a machine, such as answering questions about an image or describing its content in detail. We can perform these tasks easily not just because of our capacity to detect and recognize objects and places, but because of our ability to reason about what we see. To be able to reason about something, one needs cognition. Today, computers cannot reason about visual information because computer vision systems do not include artificial cognition. One of the main obstacles to developing cognitive systems for computer vision has been the lack of training data. However, recent works [4,5,6] present datasets and benchmarks that enable the modelling of such systems and open the door to new research goals.

This research line aims to explore how to add cognition to vision systems, in order to create algorithms that can reason about visual information.
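As a toy illustration of the kind of structured representation that datasets like Visual Genome [4] provide, the sketch below stores an image's content as (subject, predicate, object) triples and answers a simple query over them. The graph and queries are invented for illustration; in a real system the graph would be predicted from the image itself.

```python
# Toy illustration of reasoning over a scene graph of (subject, predicate,
# object) triples, in the style of the annotations in Visual Genome [4].
scene_graph = [
    ("woman", "holding", "umbrella"),
    ("woman", "wearing", "coat"),
    ("umbrella", "above", "woman"),
    ("dog", "next to", "woman"),
]

def query(graph, subject=None, predicate=None, obj=None):
    """Return the triples matching the non-None fields of the query."""
    return [(s, p, o) for (s, p, o) in graph
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (obj is None or o == obj)]

# "What is the woman holding?"
print(query(scene_graph, subject="woman", predicate="holding"))
# "What is next to the woman?"
print(query(scene_graph, predicate="next to", obj="woman"))
```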

References:

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. “Imagenet: A large-scale hierarchical image database”. In Proc. CVPR, 2009.

[2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. "Learning Deep Features for Scene Recognition using Places Database”. Advances in Neural Information Processing Systems 27 (NIPS), 2014.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. “Imagenet classification with deep convolutional neural networks”. In Advances in Neural Information Processing Systems, 2012.

[4] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David Ayman Shamma, Michael Bernstein, Li Fei-Fei. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”. 2016. https://visualgenome.org/

[5] Kenneth Marino, Ruslan Salakhutdinov, Abhinav Gupta," The More You Know: Using Knowledge Graphs for Image Classification", Computer Vision and Pattern Recognition (CVPR), 2017.

[6] Chung-Wei Lee, Wei Fang, Chih-Kuan Yeh, Yu-Chiang Frank Wang, “Multi-Label Zero-Shot Learning with Structured Knowledge Graphs”, Computer Vision and Pattern Recognition (CVPR), 2018.

 

Dr Àgata Lapedriza

Dr Carles Ventura

SUNAI

Emotional Intelligence for Social Robots

The World Health Organization predicts that by the year 2030, depression, along with other mood disorders, will be the number one global disease burden in terms of disability and lives lost. Today, the diagnosis and tracking of mood disorders still rely on clinical assessments of self-reported depressive symptoms. Unfortunately, these methods are often inaccurate and usually do not allow early detection of this type of mood disorder.

In this context, social robots are a very interesting technology for the early detection and tracking of mood disorders. In the future, we will interact every day with social robots that we will have at home and with which we will have long-term relationships [1]. If these robots are provided with emotional intelligence, they will have great potential to help us with our emotional wellbeing, by tracking our mood and helping us regulate our emotions. Prior works [2,3] support the idea that machine learning applied to data captured by wearables, smartphones and self-reports can predict a person's mood the next day in a personalized way. The idea of this project is to do the same using interactions with the robot as the data source. These technologies would be particularly useful for people who live alone and have a high risk of isolation.

This research line is focused on the development of the perception component of emotional intelligence for social robots and machines in general. It includes areas of research such as emotion recognition and personality trait recognition. In particular, one of the goals of this research line is to create new multimodal Deep Learning models that process video, audio, and speech transcript data acquired during conversations with a robot, in order to recognize and understand the emotional state of the person. More generally, this area of research includes other subareas related to the development of emotional intelligence in machines, like emotion recognition in the wild [4], visual sentiment analysis [5], or even the integration of commercial software for emotion recognition from facial expressions such as [6]. In terms of applications we focus on beneficial purposes such as emotional wellbeing and education.
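A minimal sketch of the kind of multimodal model mentioned above: per-modality encoders for video, audio, and transcript features whose embeddings are concatenated (late fusion) before an emotion classifier. The feature dimensions, encoders, number of emotion classes, and fusion strategy are illustrative assumptions.

```python
# Minimal late-fusion sketch for multimodal emotion recognition from video,
# audio, and transcript features (dimensions and encoders are illustrative).
import torch
import torch.nn as nn

class MultimodalEmotionNet(nn.Module):
    def __init__(self, video_dim=512, audio_dim=128, text_dim=300, n_emotions=7):
        super().__init__()
        self.video_enc = nn.Sequential(nn.Linear(video_dim, 128), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 128), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, 128), nn.ReLU())
        self.classifier = nn.Linear(3 * 128, n_emotions)  # fused embedding -> emotion

    def forward(self, video_feat, audio_feat, text_feat):
        fused = torch.cat([self.video_enc(video_feat),
                           self.audio_enc(audio_feat),
                           self.text_enc(text_feat)], dim=1)
        return self.classifier(fused)

# Usage with precomputed per-modality features for a batch of 4 conversation clips.
model = MultimodalEmotionNet()
logits = model(torch.randn(4, 512), torch.randn(4, 128), torch.randn(4, 300))
print(logits.shape)  # torch.Size([4, 7])
```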

This area of research is done in collaboration with the Affective Computing group and the Personal Robots group at the Massachusetts Institute of Technology (MIT) Medialab.

[1] C. Kidd and C. Breazeal (2008). “Robots at Home: Understanding Long-Term Human-Robot Interaction”. Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2008). Nice, France.

[2] N Jaques, S Taylor, E Nosakhare, A Sano, R Picard. “Multi-task Learning for Predicting Health, Stress, and Happiness.” NIPS Workshop on Machine Learning for Health, Barcelona, Spain, December 2016.

[3] A. Ghandeharioun, S. Fedor, L. Sangermano, D. Ionescu, J. Alpert, C. Dale, D. Sontag, R. Picard. “Objective assessment of depressive symptoms with machine learning and wearable sensors data”, Affective Computing and Intelligent Interaction 2017, San Antonio, TX, Oct 2017.

[4] Kuan-Chuan Peng, Tsuhan Chen, Amir Sadovnik, Andrew Gallagher. “A Mixed Bag of Emotions: Model, Predict and Transfer Emotion Distributions”. International Conference on Computer Vision and Pattern Recognition (CVPR). 2015.

[5] R. Kosti, J.M Alvarez, A. Recasens, A.Lapedriza. "Emotion Recognition in Context". Computer Vision and Pattern Recognition (CVPR), 2017.

[6] http://www.affectiva.com/

 
Dr Àgata Lapedriza SUNAI

Medical image processing - Lesion age

Multiple sclerosis (MS) lesions are one of the clearest pieces of evidence of the presence of this neurological disorder. Thanks to several histopathological and MRI studies, it is well known that, although a subject can have several lesions throughout the central nervous system (brain and spinal cord), not all lesions are at the same stage or share the same morphological and physiological characteristics. Nowadays, multiple sclerosis image processing pipelines have two clear preprocessing steps: segmenting the lesions and then inpainting them in order to minimise their effect on most of the post-processing steps. In clinical trials, lesion load is a common primary outcome thanks to its strong correlation with clinical disability. Moreover, lesion load and lesion count are often reported in papers to characterise the study population. However, little effort has been devoted to studying and classifying lesions individually according to their development stage. In this medical imaging research line, we would like to take advantage of different image biomarkers (e.g., gadolinium enhancement, diffusion imaging, T1, T2, iron deposition and/or demyelination) to individually characterise and classify each multiple sclerosis lesion, in order to improve the diagnosis and prognosis process. To achieve this goal, we will use machine learning techniques to extract these biomarkers and statistical methods to infer the normative trajectories of the lesion stages.
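A minimal sketch of the final classification step, assuming the per-lesion biomarkers have already been extracted into a feature vector. The feature names, the synthetic data, the hypothetical stage labels, and the random-forest classifier are illustrative assumptions, not the methods to be developed in this line.

```python
# Minimal sketch of classifying MS lesion stage from per-lesion image
# biomarkers (features, synthetic data, and classifier are illustrative).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# One row per lesion: [gadolinium enhancement, mean diffusivity, T1 intensity,
# T2 intensity, iron deposition, demyelination index] (synthetic values here).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = rng.integers(0, 3, size=200)           # hypothetical lesion stages 0, 1, 2

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # cross-validated stage accuracy
print(f"mean accuracy: {scores.mean():.2f}")
```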

 

Dr Ferran Prados


Dr Jordi Casas

 
 

Retina characterization through advanced image analysis techniques

Retinal imaging is a very important way to diagnose many diseases that can affect vision. But it can also be used to assess the health of our brain, since the retina is the only directly visible part of our central nervous system. The health of the cells in our retina opens a window to our brain, and it is becoming a key tool for detecting early stages of neurological disorders.

In this sense, retinal in-vivo screening using non-invasive optical imaging techniques is on its way to becoming a cost-effective alternative to magnetic resonance imaging, and is probably safer than computerized tomography (CT). In particular, adaptive-optics-assisted retinal imaging techniques can generate retinal images in a way that is safe for the patient, and individual cells can be observed at different depths in the tissue. However, it has proven challenging to extract information from these images using a robust method that can provide biomarkers in a consistent way in the ophthalmology clinic.

In this project we aim to develop automatic image analysis techniques based on artificial intelligence to allow registration between different modalities of retinal images of the same subject, build mosaics of the acquired images, and interpret them by detecting individual cells/cones in an effort to propose new biomarkers of neurodegeneration which can be applied not only in ophthalmology, but also in neurology.
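As a minimal sketch of one of the steps mentioned above, the snippet below detects individual cones in an adaptive-optics retinal image as small bright blobs. The file name and the blob-detection parameters are illustrative assumptions; they are a starting point, not the method to be developed in this project.

```python
# Minimal cone-detection sketch on an adaptive-optics retinal image using
# Laplacian-of-Gaussian blob detection (parameters are illustrative).
from skimage import io, img_as_float
from skimage.feature import blob_log

image = img_as_float(io.imread("ao_retina.png", as_gray=True))

# Cones appear as small bright blobs; min/max sigma set the expected size range.
blobs = blob_log(image, min_sigma=1, max_sigma=4, threshold=0.05)

print(f"{len(blobs)} candidate cones detected")
for y, x, sigma in blobs[:5]:
    print(f"cone at ({x:.0f}, {y:.0f}), approx radius {sigma * 2 ** 0.5:.1f} px")
```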

 

Dr David Merino


Dr Ferran Prados