Data Science in Bioinformatics

Thesis proposals from researchers

Application of High Performance Computing in Bioinformatics

This research line focuses on the use of HPC techniques to optimize existing bioinformatics tools and algorithms, and to develop new ones that take advantage of advanced computer architectures. It explores the effective use, in bioinformatics, of environments such as supercomputers, HPC clusters, grids, and cloud computing, as well as GPUs and other computing accelerators, to enhance the performance of bioinformatics tools and algorithms.
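As a minimal illustration of HPC-style task parallelism in bioinformatics, the sketch below distributes pairwise sequence comparisons across worker processes using only the Python standard library. The similarity measure, sequences, and function names are illustrative, not part of the proposal; a real project would target cluster or GPU back ends.

```python
# Sketch: parallel all-vs-all sequence comparison with process-level
# parallelism (illustrative; real HPC work would use MPI, GPUs, etc.).
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def hamming_similarity(pair):
    """Fraction of matching positions between two equal-length sequences."""
    a, b = pair
    return sum(x == y for x, y in zip(a, b)) / len(a)

def all_pairs_parallel(seqs, workers=2):
    """Score every sequence pair, distributing the work across processes."""
    pairs = list(combinations(seqs, 2))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(zip(pairs, pool.map(hamming_similarity, pairs)))

if __name__ == "__main__":
    seqs = ["ACGTACGT", "ACGTTCGT", "TTTTACGT"]
    for (a, b), s in all_pairs_parallel(seqs).items():
        print(a, b, round(s, 2))
```

Because each pair is scored independently, the problem is embarrassingly parallel, which is the property HPC environments exploit at much larger scales.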

Dr Josep Jorba Esteve
Application of Metaheuristics & Simulation in Bioinformatics
Metaheuristic algorithms are being applied to a large variety of bioinformatics problems, such as gene sequence analysis, molecular 3D structure prediction, microarray analysis, and multiple sequence alignment. Similarly, modeling and simulation methods are also employed in the biosciences and bioinformatics, including biological systems, healthcare facilities, and the spread of epidemics. This research line aims to study some of the potential applications of metaheuristic algorithms and simulation methods in the area of bioinformatics.
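To make the idea concrete, here is a minimal simulated-annealing sketch on a toy "OneMax" objective (maximize the number of 1-bits). The problem and cooling schedule are illustrative only; in the research line the same scheme would be applied to real bioinformatics objectives such as alignment or structure scoring.

```python
# Simulated annealing on a toy objective: accept all improving moves,
# and some worsening moves with probability exp(delta / temperature).
import math
import random

def one_max(bits):
    """Toy objective: number of 1-bits (higher is better)."""
    return sum(bits)

def simulated_annealing(n=20, steps=2000, t0=2.0, seed=42):
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(n)]
    best = list(state)
    for k in range(steps):
        t = t0 * (1 - k / steps) + 1e-9      # linear cooling schedule
        cand = list(state)
        cand[rng.randrange(n)] ^= 1          # neighbour: flip one random bit
        delta = one_max(cand) - one_max(state)
        if delta >= 0 or rng.random() < math.exp(delta / t):
            state = cand                     # accept improving / some worse moves
        if one_max(state) > one_max(best):
            best = list(state)
    return best

if __name__ == "__main__":
    sol = simulated_annealing()
    print("best fitness:", one_max(sol))
```

The acceptance of occasional worsening moves at high temperature is what lets the method escape local optima, which matters far more in rugged bioinformatics landscapes than in this toy example.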
Dr Angel A. Juan Perez
Application of Deep Learning to Bioinformatics

Currently, a large amount of biomedical data is available. In the age of Big Data, the need for new pattern-discovery methods urges computer scientists to collaborate effectively with biologists and computational biologists. In this partnership, one of the most interesting and most explored fields is machine learning (ML). Moreover, in recent years Deep Learning methods have come to dominate most ML applications (Natural Language Processing, Computer Vision, etc.).

This research proposal aims to cover applications of Deep Learning to all kinds of biological data. In particular, we will develop: novel CNN (Convolutional Neural Network) architectures to model gene expression regulation, image segmentation, and protein structures, among others; recurrent neural networks (RNNs, LSTMs, and GRUs) applied to sequence understanding; and other novel schemes, such as Deep Reinforcement Learning or Generative Adversarial Networks, which can alleviate the need for large-scale labelled data.

Alipanahi, Babak, et al. "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning." Nature Biotechnology 33.8 (2015): 831-838.
Min, Seonwoo, Byunghan Lee, and Sungroh Yoon. "Deep learning in bioinformatics." Briefings in Bioinformatics (2016): bbw068.
Xiong, Hui Y., et al. "The human splicing code reveals new insights into the genetic determinants of disease." Science 347.6218 (2015): 1254806.
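The core CNN operation on genomic data can be sketched in a few lines of numpy: one-hot encode a DNA sequence and slide a filter over it, so the filter acts as a motif detector (in the spirit of Alipanahi et al. 2015). The sequence, motif, and fixed filter weights are illustrative; a real model would learn many filters with a deep learning framework.

```python
# One-hot encoding + 1D convolution as a motif detector (toy example).
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len, 4) one-hot matrix."""
    m = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        m[i, BASES.index(b)] = 1.0
    return m

def conv1d_motif(seq, motif):
    """Slide a motif filter over the sequence; the peak is the best match."""
    x, w = one_hot(seq), one_hot(motif)
    k = len(motif)
    return np.array([(x[i:i + k] * w).sum() for i in range(len(seq) - k + 1)])

scores = conv1d_motif("ACGTTACGATACG", "TACG")
print("best match at position", int(scores.argmax()))  # exact match scores 4.0
```

Stacking many learned filters, nonlinearities, and pooling on top of this operation is what turns it into the CNN architectures the proposal targets.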

Dr David Masip Rodó

MR abnormality detection

Medical image screening is tedious and time-consuming work. Clinicians can spend hours in front of magnetic resonance (MR) images looking for abnormalities. Moreover, with the latest advances in MR imaging we can obtain more image modalities with better quality, but the increase in information also increases screening time. Multimodal screening is therefore more advantageous for detecting abnormalities, yet it is difficult and requires medical training and specialization. Differences in expertise between raters can lead to different diagnostic criteria, which can have an important impact on our healthcare system. This project aims to deploy a tool that, based on the latest advances in deep learning techniques, will be able to decide whether a multimodal MRI scan set is likely to contain abnormalities. Moreover, to assist specialists' assessment, it will output a colour map suggesting where the abnormal areas are. This tool will help clinicians reduce the screening time per subject and will help them make more robust inter-observer decisions.
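A crude baseline for the abnormality map can be sketched with per-voxel z-scores against healthy reference statistics: voxels that deviate strongly are flagged and could be rendered as a colour overlay. All data here are simulated and the z-score rule is only an illustrative stand-in for the deep learning models the project would actually develop.

```python
# Baseline abnormality map: flag voxels far from a reference distribution.
import numpy as np

def abnormality_map(scan, reference_mean, reference_std, z_thresh=3.0):
    """Per-voxel z-score against reference statistics.

    Returns (z, mask): the z-score map and a boolean map of voxels
    exceeding the threshold, i.e. candidate abnormal areas.
    """
    z = (scan - reference_mean) / (reference_std + 1e-8)
    return z, np.abs(z) > z_thresh

# Illustrative 2D "slice": noisy background plus one bright lesion-like blob.
rng = np.random.default_rng(0)
scan = rng.normal(0.0, 1.0, size=(32, 32))
scan[10:14, 10:14] += 8.0                     # simulated abnormality
z, mask = abnormality_map(scan, reference_mean=0.0, reference_std=1.0)
print("flagged voxels:", int(mask.sum()))
```

In the envisioned tool, the hand-set threshold would be replaced by a learned decision, and the mask by a continuous saliency map over the multimodal scan set.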

Dr Jordi Casas Roma


Dr Ferran Prados Carrasco


Large population studies - (clinical and non-clinical data)

The aim of this project is to apply big data analysis techniques to automatically process large databases of clinical data (including genetics, demographics, and medical imaging) in order to extract disease progression models. Disease progression models, such as discrete event-based models (Fonteijn et al., NeuroImage 2012) or continuous trajectory models (Lorenzi et al., NeuroImage 2017), have been designed to construct a long-term picture of disease progression in neurodegenerative conditions such as Alzheimer's, using short-term longitudinal or even fully cross-sectional data. The event-based model estimates disease progression as a sequence of "events" in which biophysically meaningful features (BMFs) become abnormal; continuous trajectory models offer richer information, providing trajectories of BMFs over time, but require longitudinal data. These models offer powerful tools for integrating diverse data sources in order to assess future interventions in any disease. Furthermore, we have seen that it is possible to combine disease progression models with unsupervised machine learning algorithms to interrogate a database for patterns of MRI features that can accurately predict chronological age in healthy people (Cole et al., NeuroImage 2017) and become a signature of brain tissue changes associated with specific quality-of-life deviations. Such features of brain tissue ageing could in future drive efforts to support increasing human longevity and be used to test the real "brain age" of subjects (Cole et al., NeuroImage 2017) associated with their lifestyle or with specific training schedules.
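The central idea of the event-based model can be sketched in a toy form: from cross-sectional data, estimate the order in which biomarkers become abnormal. The crude surrogate below ranks biomarkers by how often they are abnormal across patients; the actual model (Fonteijn et al. 2012) fits the event sequence by maximum likelihood. The biomarker names and data are hypothetical.

```python
# Toy surrogate for the event-based model: biomarkers abnormal in more
# patients are assumed to become abnormal earlier in the disease course.
import numpy as np

def estimate_event_order(abnormal, names):
    """abnormal: (patients, biomarkers) boolean matrix."""
    freq = abnormal.mean(axis=0)
    order = np.argsort(-freq)                 # most frequently abnormal first
    return [names[i] for i in order]

# Illustrative data: 5 patients x 3 biomarkers (names are hypothetical).
names = ["hippocampal volume", "ventricle volume", "cognitive score"]
abnormal = np.array([[1, 1, 0],
                     [1, 0, 0],
                     [1, 1, 1],
                     [1, 0, 0],
                     [0, 0, 0]], dtype=bool)
print(estimate_event_order(abnormal, names))
```

The appeal of this formulation is exactly what the proposal exploits: a long-term ordering of disease events is recovered without following any single patient over the full disease course.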

Dr Ferran Prados Carrasco

Dr Jordi Casas Roma

Synthetic medical data generation for automatically training neural networks

Deep learning refers to neural networks with many layers that extract a hierarchy of features from raw data. Nowadays, deep learning models achieve impressive results and generalizability by training on large amounts of data. Thanks to these big datasets, we can train deep learning algorithms, and machine learning algorithms in general, with enormous numbers of instances, which provides robustness to variations and better generalization properties. However, large datasets may not be available in some domains. This is in fact a relevant problem in several medical areas, where training datasets are small compared with the large-scale image datasets (e.g., ImageNet) needed to achieve generalization across datasets. Moreover, current deep learning architectures are based on supervised learning and require the generation of manual ground-truth labels, which is tedious work at large scale (Akkus et al., J Digit Imaging 2017). In this project we aim to design and develop methods to generate synthetic data from real MRI data. The main objective is to expand existing data, or create new data, that realistically mimic variations in MRI data, alleviating the need for large amounts of data. For instance, autoencoders could be used to generate synthetic data (Bengio et al., NIPS 2013), but it is necessary to consider the type of data and how to modify it in order to produce variations that are as realistic as possible. Furthermore, methods to assess data utility are critical and need to be developed to ensure that synthetic data are realistic enough to train machine learning models.
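The simplest form of synthetic-data generation is transformation-based augmentation: random flips and additive noise applied to a real scan produce plausible variants. The array below is random rather than a real MRI slice, and the proposal ultimately targets richer generative models such as autoencoders; this sketch is only the baseline those models would improve upon.

```python
# Baseline synthetic-data generation: augment a 2D scan with random
# flips and Gaussian noise (toy stand-in for generative models).
import numpy as np

def augment(scan, rng, noise_std=0.05):
    """Return a synthetic variant of a 2D scan."""
    out = scan.copy()
    if rng.random() < 0.5:
        out = np.flip(out, axis=0)            # random up-down flip
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)            # random left-right flip
    return out + rng.normal(0.0, noise_std, size=out.shape)

rng = np.random.default_rng(7)
scan = rng.random((16, 16))                   # stand-in for a real MRI slice
synthetic = [augment(scan, rng) for _ in range(4)]
print(len(synthetic), synthetic[0].shape)
```

The data-utility question raised in the proposal applies even here: each transformation must be checked against the imaging physics, since, for example, a left-right flip is anatomically implausible for some structures.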

Dr Ferran Prados Carrasco

Dr Jordi Casas Roma