Data Science

Thesis proposal | Researchers | Research group

Privacy-Preserving Data Mining

In recent years, an explosive amount of data has been made publicly available. Embedded within these data is private information about the users who appear in them. Therefore, data owners must respect the privacy of users before releasing datasets to third parties. In this scenario, anonymization processes become an important concern.

There exist several types of privacy breach, each related to one or more data types. For instance, medical datasets are published as database tables, so linking this information with publicly available datasets may disclose the identity of some individuals; social network data is usually published as graphs, and adversaries can infer the identity of users by solving a set of restricted graph isomorphism problems; location privacy concerns data from phone call networks or applications such as Foursquare; and so on.

The simple technique of anonymizing networks by removing identifiers before publishing the actual data does not guarantee privacy [2,3]. Therefore, approaches and methods have been developed to deal with each data type and each privacy disclosure [1]. The aim of this research is to develop privacy-preserving methods and algorithms that guarantee users' privacy while keeping the utility of the anonymized data as close as possible to that of the original data. These methods have to achieve a trade-off between data privacy and data utility. Consequently, several data mining tasks must be considered in order to quantify the information loss produced on the anonymized data.
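To make the linking risk concrete, the sketch below (plain Python, with hypothetical field names and values) computes the k-anonymity level of a toy medical table: even with explicit identifiers removed, a unique combination of quasi-identifiers such as age and postcode leaves a record linkable to an individual.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k-anonymity level of a table: the size of the smallest
    group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return min(groups.values())

# Toy medical table with names already removed. The quasi-identifiers
# (age, zip) can still link the unique third record to an individual.
table = [
    {"age": 34, "zip": "08018", "diagnosis": "flu"},
    {"age": 34, "zip": "08018", "diagnosis": "asthma"},
    {"age": 51, "zip": "08021", "diagnosis": "diabetes"},
]
print(k_anonymity(table, ["age", "zip"]))  # 1: the (51, "08021") record is unique
```

An anonymization method would generalize or suppress quasi-identifier values until this number reaches an acceptable k, at the cost of some data utility.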

Due to its nature, privacy preservation in data mining involves several interesting and active topics, such as security and privacy issues to ensure anonymity, data mining and machine learning to evaluate data utility and information loss, and also aspects related to big data.

[1] Casas-Roma, J., Herrera-Joancomartí, J., & Torra, V. (2017). A survey of graph-modification techniques for privacy-preserving on networks. Artificial Intelligence Review, 47(3), 341–366.

[2] Torra, V. (2009). Privacy in data mining. In Maimon, O., & Rokach, L. (Eds.), Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA.

[3] Torra, V., & Navarro-Arribas, G. (2014). Data privacy. WIREs Data Mining and Knowledge Discovery, 4, 269–280. doi:10.1002/widm.1129

Dr Jordi Casas-Roma


Data Mining and Community Detection in Graphs (Graph Mining)

In many applications, it is natural to represent data with graphs. Usually, the data form one large, connected network. Examples of such networks include the Internet, social networks, citation networks, concept networks, computer networks, chemical interaction networks, regulatory networks, socio-economic networks and encyclopedias. Sample datasets are publicly available from, amongst others, public network-data repositories.

Graph mining is the study of how to perform data mining and machine learning on data represented as graphs. It includes several types of analysis, from pattern recognition [2] to community detection [1] and information flow. Algorithms designed for tabular (structured) data do not work properly on graphs, since structural and topological information is crucial for their analysis. Thus, new extensions or algorithms must be developed to deal with graph-formatted data.

For instance, uncovering the community structure exhibited by real networks is a crucial step towards an understanding of complex systems that goes beyond the local organization of their constituents. Many algorithms have been proposed so far [3], but the problem remains open and new methods and algorithms keep appearing. Additionally, the recent tremendous increase in graph-formatted data, especially in the context of social networks and the Internet of Things (IoT), demands new methods able to deal with very large graphs with thousands or millions of vertices and edges. Therefore, parallelism and other techniques imported from big data can be applied to overcome the complexity of dealing with such data.
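Most community detection algorithms compared in [3] are driven by some quality criterion, the best known being Newman's modularity. As a minimal illustration (toy graph and partition chosen for this sketch; real implementations avoid the quadratic loop below), the following plain-Python function scores a partition of an undirected graph: a good community split scores well above the trivial one-community partition.

```python
def modularity(adjacency, communities):
    """Newman modularity Q of a partition of an undirected graph.
    `adjacency` maps each vertex to a list of neighbours (each edge
    appears in both endpoint lists); `communities` is a list of vertex sets."""
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    two_m = sum(degree.values())  # 2m: every edge is counted twice
    label = {v: c for c, members in enumerate(communities) for v in members}
    q = 0.0
    for v, nbrs in adjacency.items():
        for u in adjacency:  # all ordered pairs (v, u) in the same community
            if label[v] != label[u]:
                continue
            q += nbrs.count(u) - degree[v] * degree[u] / two_m
    return q / two_m

# Two triangles joined by a single bridge edge (2-3): m = 7 edges.
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
good = modularity(graph, [{0, 1, 2}, {3, 4, 5}])  # the natural split
flat = modularity(graph, [{0, 1, 2, 3, 4, 5}])    # everything together: Q = 0
```

Scaling this kind of computation to graphs with millions of vertices is exactly where the parallel and big data techniques mentioned above come into play.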

The challenges in this area are still many and of great complexity; therefore, research in this field is guaranteed for years to come.

[1] Ferrara, E. (2012). Community structure discovery in Facebook. International Journal of Social Network Mining, 1(1), 67–90.

[2] Gibson, D., Kumar, R., & Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In International Conference on Very Large Data Bases (VLDB), pp. 721–732.

[3] Lancichinetti, A., & Fortunato, S. (2009). Community detection algorithms: a comparative analysis. Physical Review E, 80(5), 056117.


Dr Jordi Casas-Roma

Dr Jordi Conesa Caralt


eHealth Center



Medical image processing

The aim of this project is to apply big data analysis techniques to automatically process large databases of clinical data (including genetics, demographics and medical imaging) in order to extract disease progression models. Disease progression models, such as discrete event-based models (Fonteijn et al., NeuroImage, 2012) or continuous trajectory models (Lorenzi et al., NeuroImage, 2017), have been designed to construct a long-term picture of disease progression in neurodegenerative conditions such as Alzheimer's disease, using short-term longitudinal or even fully cross-sectional data. The event-based model estimates disease progression as a sequence of "events" in which biophysically meaningful features (BMFs) become abnormal; continuous trajectory models offer richer information by providing trajectories of BMFs over time, but require longitudinal data. These models offer powerful tools for integrating diverse data sources in order to assess future interventions in any disease.

Furthermore, we have seen that it is possible to combine disease progression models with unsupervised machine learning algorithms to interrogate a database for patterns of MRI features that can accurately predict chronological age in healthy people (Cole et al., NeuroImage, 2017) and become a signature of brain tissue changes associated with specific quality-of-life deviations. Such features of brain tissue ageing could in the future drive efforts to support the increasing longevity of the human race and be used to estimate the actual "brain age" of subjects (Cole et al., NeuroImage, 2017) associated with their lifestyle or with specific training schedules. This work will be done in collaboration with world-renowned clinical institutions in Barcelona.
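The core idea of ordering "events" can be illustrated with a deliberately crude heuristic: rank biomarkers by how often they are abnormal in cross-sectional data, on the assumption that features abnormal in more patients became abnormal earlier in the disease. This is only a sketch of the intuition, not the actual event-based model (which fits a probabilistic generative model over orderings); the biomarker names and z-scores below are hypothetical.

```python
def event_ordering(patients, threshold=1.0):
    """Crude cross-sectional approximation to an event-based ordering:
    rank biomarkers by the fraction of patients in whom they are abnormal
    (z-score above `threshold`); more frequently abnormal = earlier event."""
    markers = patients[0].keys()
    freq = {
        m: sum(p[m] > threshold for p in patients) / len(patients)
        for m in markers
    }
    return sorted(markers, key=lambda m: -freq[m])

# Hypothetical z-scored biomarkers for five patients.
cohort = [
    {"hippocampal_atrophy": 2.1, "ventricle_volume": 1.4, "cognition": 0.2},
    {"hippocampal_atrophy": 1.8, "ventricle_volume": 0.3, "cognition": 0.1},
    {"hippocampal_atrophy": 1.2, "ventricle_volume": 1.1, "cognition": 0.4},
    {"hippocampal_atrophy": 0.4, "ventricle_volume": 0.2, "cognition": 0.1},
    {"hippocampal_atrophy": 1.5, "ventricle_volume": 0.9, "cognition": 1.3},
]
print(event_ordering(cohort))  # atrophy abnormal in 4/5, ventricles 2/5, cognition 1/5
```

The real models additionally quantify uncertainty in the ordering and, for continuous trajectory models, fit the full time course of each BMF from longitudinal data.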


Dr Ferran Prados



Dr Jordi Casas




Urban Resilience Reinforcement to Natural Disasters through Data-Based Instruments

Natural disasters have a significant and increasing impact all over the world. There is growing concern about them, so Disaster Risk Reduction (DRR) is increasingly on the international agenda, with a special focus on cities because of the growing concentration of people and assets in urban zones. This thesis proposal sets up the scientific and technical basis for significantly improved resilience to natural hazards (such as climate-related hazards, earthquakes, etc.) and to their human and socioeconomic impacts in urban zones.

The proposal is based on three principles, inspired by the UN Sendai Framework and related to the UN 2030 Agenda for Sustainable Development: 1) A focus on prevention and resilience building. 2) An inclusive "whole-of-society" approach, involving non-traditional stakeholders not usually present in DRR planning and decision making (such as households, SMEs, NGOs, etc.). 3) A data-driven approach, integrating into DRR planning and decision making diverse types of data (including small data, thick data and big data) from a wide range of sources, including the reuse of data.

This thesis proposal will conduct research on data-based instruments for DRR planning and decision making (such as indexes, models and scorecards) applied to urban environments.
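As a minimal sketch of what such an index-type instrument might look like, the function below min-max normalizes a handful of resilience indicators across cities and combines them into a weighted composite score. The indicator names, values and weights are purely illustrative assumptions, not part of the proposal.

```python
def resilience_index(indicators, weights):
    """Toy composite index: min-max normalise each indicator across
    cities, then take a weighted average per city (higher = more resilient)."""
    names = list(weights)
    lo = {n: min(c[n] for c in indicators.values()) for n in names}
    hi = {n: max(c[n] for c in indicators.values()) for n in names}
    scores = {}
    for city, vals in indicators.items():
        norm = {
            n: (vals[n] - lo[n]) / (hi[n] - lo[n]) if hi[n] > lo[n] else 0.0
            for n in names
        }
        scores[city] = sum(weights[n] * norm[n] for n in names) / sum(weights.values())
    return scores

# Hypothetical indicators for two cities; early warning weighted double.
cities = {
    "A": {"early_warning_coverage": 0.9, "retrofit_rate": 0.4, "insurance_uptake": 0.7},
    "B": {"early_warning_coverage": 0.5, "retrofit_rate": 0.8, "insurance_uptake": 0.3},
}
weights = {"early_warning_coverage": 2.0, "retrofit_rate": 1.0, "insurance_uptake": 1.0}
scores = resilience_index(cities, weights)
```

A real scorecard would of course rest on validated indicators and stakeholder-agreed weights; the point of the sketch is only the shape of the computation.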

Dr Josep Cobarsí




Dr Laura Calvet