Data Science in Bioinformatics

Thesis proposals and researchers

Application of High Performance Computing in Bioinformatics

This research line focuses on the use of HPC techniques to optimize existing bioinformatics tools and algorithms, and to develop new ones, by taking advantage of advanced computer architectures. The aim is to make effective use in bioinformatics of environments such as supercomputers, HPC clusters, grids, and cloud computing, and to explore GPUs and other computing accelerators to enhance the performance of bioinformatics tools and algorithms.

Dr Josep Jorba Esteve
Application of Metaheuristics & Simulation in Bioinformatics
Metaheuristic algorithms are being applied to a wide variety of bioinformatics problems, such as gene sequence analysis, molecular 3D structure prediction, microarray analysis, and multiple sequence alignment. Similarly, modeling and simulation methods are employed in the biosciences and bioinformatics to study, for example, biological systems, healthcare facilities, and the spread of epidemics. This research line aims to study some of the potential applications of metaheuristic algorithms and simulation methods in the area of bioinformatics.
Dr Angel A. Juan Perez
Application of Deep Learning to Bioinformatics

A large amount of biomedical data is currently available. In the age of Big Data, the need for new pattern discovery methods pushes computer scientists to collaborate effectively with biologists and computational biologists. In this partnership, one of the most interesting and most explored fields is machine learning (ML). In recent years, however, Deep Learning methods have come to dominate most ML applications (Natural Language Processing, Computer Vision, etc.).

This research proposal aims to cover the applications of Deep Learning to all kinds of biological data. In particular, we will develop: novel convolutional neural network (CNN) architectures to model gene expression regulation, image segmentation, or protein structures, among others; recurrent neural networks (RNNs, LSTMs and GRUs) applied to sequence understanding; and other novel schemes such as Deep Reinforcement Learning or Generative Adversarial Networks, which can alleviate the need for large-scale labelled data.
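As a rough illustration of how a convolutional layer can read sequence data, the sketch below one-hot encodes a DNA string and slides a single filter over it, followed by a ReLU. It is only a toy under stated assumptions: the channel order `ACGT`, the helper names, and the hand-set `TATA` filter are all hypothetical, and a real CNN would learn many such filters from data rather than have them written by hand.

```python
# Sketch: one-hot encode DNA and apply a single 1D convolutional filter.
BASES = "ACGT"  # channel order is an assumption of this sketch


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence.upper()]


def motif_kernel(motif):
    """Hypothetical hand-set filter: +1 for the motif base, -1 otherwise."""
    return [[1.0 if base == b else -1.0 for b in BASES] for base in motif.upper()]


def conv1d_valid(encoded, kernel):
    """Slide the kernel over the sequence without padding (cross-correlation)."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, value) for value in values]


activations = relu(conv1d_valid(one_hot("GGTATAAC"), motif_kernel("TATA")))
# Index of the window with the strongest filter response.
best_position = max(range(len(activations)), key=activations.__getitem__)
print(activations, best_position)
```

With this filter, a window scores the number of matching bases minus the number of mismatches, so only windows that match the motif in more than half of their positions survive the ReLU.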

Alipanahi, Babak, et al. "Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning." Nature biotechnology 33.8 (2015): 831-838.
Min, Seonwoo, Byunghan Lee, and Sungroh Yoon. "Deep learning in bioinformatics." Briefings in Bioinformatics (2016): bbw068.
Xiong, Hui Y., et al. "The human splicing code reveals new insights into the genetic determinants of disease." Science 347.6218 (2015): 1254806.

Dr David Masip Rodó

Medical image abnormality detection

Medical image screening is tedious and time-consuming work. Clinicians can spend hours in front of magnetic resonance (MR), ultrasound or CT images looking for abnormalities. Moreover, a growing variety of image modalities is now available at better quality, but the extra information increases the time needed to screen them. Multimodal screening is therefore more effective for detecting abnormalities, but it is difficult and requires medical training and specialization. Differences in expertise between raters can lead to different diagnostic criteria, which can have an important impact on our healthcare system. This project aims to deploy a tool that, based on the latest advances in deep learning techniques, will be able to decide whether a multimodal scan set is likely to contain abnormalities. Moreover, in order to assist the specialist's assessment, it will output a colour map suggesting where the abnormal areas are. This tool will help clinicians reduce the screening time per subject and make more robust intra-observer decisions. This work will be done in close collaboration with the Multiple Sclerosis group led by Dr. Sara Llufriu at the IDIBAPS-Hospital Clinic, an internationally recognised clinical institution.

Dr Jordi Casas Roma


Dr Ferran Prados Carrasco


Medical image processing - Multiple Sclerosis lesion age

Multiple Sclerosis lesions are one of the clearest signs of the presence of this neurological disorder. Thanks to several histopathological and MRI studies, it is well known that, although a subject can have several lesions along the central nervous system (brain and spinal cord), not all lesions are at the same stage or share the same morphological and physiological characteristics. Nowadays, multiple sclerosis image processing pipelines have two clear preprocessing steps: lesion segmentation, followed by lesion inpainting to minimise the effect of the lesions on most post-processing steps. In clinical trials, lesion load is a common first outcome because of its strong correlation with clinical disability. Moreover, lesion load and lesion number are often reported in papers to characterise the population. However, little effort has been made to study and classify lesions individually according to their development stage. In this medical imaging research line, we would like to take advantage of different image biomarkers (e.g., gadolinium+, diffusion image, T1, T2, iron deposition and/or demyelination) to characterise and classify each multiple sclerosis lesion individually, in order to improve diagnosis and prognosis. To achieve this goal, we will use machine learning techniques to extract these biomarkers and statistical methods to infer the normative trajectories of the lesion stages. This work will be done in close collaboration with the Multiple Sclerosis group led by Dr. Sara Llufriu at the IDIBAPS-Hospital Clinic, an internationally recognised clinical institution.
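To make the step from a global lesion load to per-lesion analysis concrete, the sketch below splits a binary 3D segmentation mask into individual lesions and reports the lesion number, per-lesion volumes and total load. It is a minimal illustration under assumptions that do not come from the proposal: the mask is a nested list of 0/1 values, lesions are defined by 6-connectivity, and the function names and the `voxel_volume_mm3` parameter are hypothetical.

```python
from collections import deque

# Face-sharing neighbours (6-connectivity); an assumption of this sketch.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def label_lesions(mask):
    """Group foreground voxels of a 3D binary mask into connected lesions.

    Returns a list of lesions, each a list of (z, y, x) voxel coordinates.
    """
    depth, height, width = len(mask), len(mask[0]), len(mask[0][0])
    seen = set()
    lesions = []
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if not mask[z][y][x] or (z, y, x) in seen:
                    continue
                # Breadth-first flood fill starting from an unvisited lesion voxel.
                queue = deque([(z, y, x)])
                seen.add((z, y, x))
                voxels = []
                while queue:
                    cz, cy, cx = queue.popleft()
                    voxels.append((cz, cy, cx))
                    for dz, dy, dx in NEIGHBOURS:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        inside = 0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
                        if inside and mask[nz][ny][nx] and (nz, ny, nx) not in seen:
                            seen.add((nz, ny, nx))
                            queue.append((nz, ny, nx))
                lesions.append(voxels)
    return lesions


def lesion_summary(mask, voxel_volume_mm3=1.0):
    """Return lesion number, per-lesion volumes and total lesion load."""
    volumes = [len(voxels) * voxel_volume_mm3 for voxels in label_lesions(mask)]
    return {"count": len(volumes), "volumes_mm3": volumes, "load_mm3": sum(volumes)}


# Toy mask with two slices of 3 x 4 voxels containing two separate lesions.
toy_mask = [
    [[1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]],
    [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]],
]
print(lesion_summary(toy_mask, voxel_volume_mm3=0.5))
```

Once lesions are separated in this way, biomarker values can be sampled inside each lesion's voxel list, which is the per-lesion view the classification described above would need.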

Dr Ferran Prados Carrasco

Dr Jordi Casas Roma

Synthetic medical data generation for automatically training neural networks

Deep learning refers to neural networks with many layers that extract a hierarchy of features from raw data. Nowadays, deep learning models achieve impressive results and generalisability by training on large amounts of data. Thanks to these big datasets, we are able to train deep learning algorithms, or machine learning algorithms in general, on an enormous number of instances, which provides robustness to variations and better generalisation properties.
However, large datasets may not be available in several domains. This is a significant problem in several medical areas, where training datasets are too small, compared to large-scale image datasets (e.g., ImageNet), to achieve generalisation across datasets. Moreover, current deep learning architectures are based on supervised learning and require the generation of manual ground truth labels, which is tedious work on large-scale data (Akkus et al. J Digit Imaging 2017).
In this project, we aim to design and develop methods to generate synthetic data from real MRI data. The main objective is to expand existing datasets or create new data that realistically mimic variations in MRI data, which could alleviate the need for a large amount of real data. For instance, autoencoders could be used to generate synthetic data (Bengio et al. NIPS 2013), but it is necessary to consider the type of data and how to modify it in order to produce variations that are as realistic as possible. Furthermore, methods to assess data utility are critical, and they need to be developed to ensure that synthetic data is realistic enough to train machine learning models. This work will be done in close collaboration with the Multiple Sclerosis group led by Dr. Sara Llufriu at the IDIBAPS-Hospital Clinic, an internationally recognised clinical institution.
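The idea of producing variations of real images can be illustrated with a much simpler stand-in than the autoencoders mentioned above. The sketch below is not a generative model: it only perturbs a real 2D slice with a random global intensity gain, an optional left-right mirror and additive Gaussian noise. The function names, parameter values and the choice of perturbations are all assumptions made for illustration, and whether such variations are realistic for a given MRI contrast would have to be checked with the kind of utility assessment the project proposes.

```python
import random


def synthesize_slice(image, rng, max_gain=0.1, noise_sd=0.02, flip_prob=0.5):
    """Return a perturbed copy of a 2D image given as a list of rows of floats."""
    gain = 1.0 + rng.uniform(-max_gain, max_gain)  # global intensity scaling
    flip = rng.random() < flip_prob                # left-right mirror
    result = []
    for row in image:
        source = row[::-1] if flip else row
        # Add Gaussian noise and clip at zero so intensities stay non-negative.
        result.append([max(0.0, gain * value + rng.gauss(0.0, noise_sd)) for value in source])
    return result


def synthesize_dataset(images, copies_per_image, seed=0):
    """Generate several perturbed copies of each real image, reproducibly."""
    rng = random.Random(seed)
    return [synthesize_slice(image, rng) for image in images for _ in range(copies_per_image)]


# Toy 2 x 2 "slice" with intensities normalised to the range [0, 1].
real_slice = [[0.0, 0.5], [1.0, 0.2]]
synthetic = synthesize_dataset([real_slice], copies_per_image=3, seed=1)
print(len(synthetic))
```

Fixing the seed makes the generated set reproducible, which helps when comparing models trained with and without the synthetic copies.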

Dr Ferran Prados Carrasco

Dr Jordi Casas Roma