Data Science

Thesis proposal Researchers Research group

Privacy-preserving in Data Mining

In recent years, the amount of publicly available data has grown explosively. This data contains private information about the users who appear in it, so data owners must protect those users' privacy before releasing datasets to third parties. In this scenario, anonymization becomes an important concern.

Several kinds of privacy breach exist, each related to one or more data types. For instance, medical datasets are published as database tables, so linking this information with publicly available datasets may disclose the identity of some individuals; social network data is usually published as graphs, and adversaries can infer the identity of users by solving a set of restricted graph isomorphism problems; location privacy concerns data from phone call networks or applications like “Foursquare”; and so on.
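The table-linking breach mentioned above can be illustrated with a minimal sketch. All records, field names and the choice of quasi-identifiers (`zip`, `birth_year`, `sex`) are invented for illustration; the `link` helper is hypothetical and only shows the idea of joining a released table with an auxiliary public one.

```python
# Hypothetical toy data: every name, field and value is invented for illustration.
medical = [  # released table, direct identifiers already removed
    {"zip": "08001", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "08001", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "08002", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
    {"zip": "08002", "birth_year": 1980, "sex": "F", "diagnosis": "migraine"},
]
public = [  # auxiliary public table that still carries names
    {"name": "Alice", "zip": "08001", "birth_year": 1980, "sex": "F"},
    {"name": "Bob", "zip": "08001", "birth_year": 1975, "sex": "M"},
    {"name": "Carol", "zip": "08002", "birth_year": 1980, "sex": "F"},
]

# Assumed quasi-identifiers: attributes present in both tables.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Project a record onto its quasi-identifier values."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(released, auxiliary):
    """Return {name: diagnosis} for people who match exactly one released row."""
    groups = {}
    for row in released:
        groups.setdefault(key(row), []).append(row)
    reidentified = {}
    for person in auxiliary:
        matches = groups.get(key(person), [])
        if len(matches) == 1:  # a unique match discloses the sensitive value
            reidentified[person["name"]] = matches[0]["diagnosis"]
    return reidentified


print(link(medical, public))
```

In this toy data, two people match a single released row each and are re-identified, while the third matches two rows and stays ambiguous. Making every quasi-identifier combination match several rows is the intuition behind grouping-based protections for tabular data.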

Simply removing identifiers before publishing network data does not guarantee privacy [2,3]. Therefore, specific approaches and methods have been developed for each data type and each kind of privacy disclosure [1]. The aim of this research is to develop privacy-preserving methods and algorithms that guarantee users' privacy while keeping the utility of the anonymized data as close as possible to that of the original data. These methods must strike a trade-off between data privacy and data utility. Consequently, several data mining tasks must be considered in order to quantify the information loss in the anonymized data.
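The privacy/utility trade-off can be made concrete with a small sketch. It assumes one simple graph-modification strategy (random edge removal and addition) and one simple information-loss proxy (Jaccard similarity between edge sets). Neither is prescribed by this proposal, and the function names `perturb_edges` and `edge_overlap` are hypothetical.

```python
import random


def perturb_edges(edges, nodes, fraction, seed=0):
    """Remove a fraction of the edges at random and add as many new random edges.

    Illustrative assumption: undirected simple graph, edges given as node pairs.
    """
    rng = random.Random(seed)
    original = {frozenset(e) for e in edges}
    k = int(round(fraction * len(original)))
    # Sort before sampling so the result depends only on the seed.
    removed = set(rng.sample(sorted(original, key=sorted), k))
    kept = original - removed
    # Candidate new edges: node pairs that are not connected in the original graph.
    candidates = [
        frozenset((u, v))
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
        if frozenset((u, v)) not in original
    ]
    added = set(rng.sample(candidates, min(k, len(candidates))))
    return kept | added


def edge_overlap(original_edges, anonymized_edges):
    """Jaccard similarity of the two edge sets (1.0 means no edge was changed)."""
    a = {frozenset(e) for e in original_edges}
    b = {frozenset(e) for e in anonymized_edges}
    return len(a & b) / len(a | b)


nodes = list(range(6))
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]  # toy graph
anonymized = perturb_edges(ring, nodes, fraction=0.5)
print(edge_overlap(ring, anonymized))
```

Raising `fraction` makes re-identification through structure harder but lowers the overlap, which is the trade-off described above. A real evaluation would compare the results of actual data mining tasks on both graphs rather than raw edge overlap.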

By its nature, privacy-preserving data mining involves several active and interesting topics: security and privacy to ensure anonymity, data mining and machine learning to evaluate data utility and information loss, and aspects related to big data.

[1] Casas-Roma, J., Herrera-Joancomartí, J. & Torra, V. Artif Intell Rev (2017) 47: 341.

[2] Torra V. (2009) Privacy in Data Mining. In: Maimon O., Rokach L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA

[3] Torra, V. and Navarro-Arribas, G. (2014), Data privacy. WIREs Data Mining Knowl Discov, 4: 269–280. doi:10.1002/widm.1129

Dr Jordi Casas-Roma


Data Mining and Community Detection in Graphs (Graph Mining)

In many applications, it is natural to represent data with graphs. Usually, data is represented in one large, connected network. Examples of such networks include the Internet, social networks, citation networks, concept networks, computer networks, chemical interaction networks, regulatory networks, socio-economic networks and encyclopedias. Sample datasets are publicly available at amongst others

Graph mining is the study of how to perform data mining and machine learning on data represented as graphs. It includes several types of analysis, from pattern recognition [2] to community detection [1] and information flow. Algorithms from structured data mining do not work properly on graphs, since structural and topological information is crucial for graph analysis. Thus, new algorithms, or extensions of existing ones, must be developed to deal with graph-formatted data.

For instance, uncovering the community structure exhibited by real networks is a crucial step towards an understanding of complex systems that goes beyond the local organization of their constituents. Many algorithms have been proposed so far [3], but the problem is still open and new methods keep appearing. Additionally, the recent, tremendous growth of graph-formatted data, especially in the context of social networks and IoT, calls for new methods that can handle very large graphs with thousands or millions of vertices and edges. Parallelism and other techniques imported from Big Data can therefore be applied to cope with the complexity of such data.
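As a small illustration of what evaluating a community structure can look like, the sketch below scores a candidate partition with modularity, a widely used quality measure for community detection. This is an assumed example, not a method prescribed by this proposal; the toy graph and the `modularity` helper are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)  # L_c
    degree = defaultdict(int)    # d_c
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

print(modularity(edges, two_groups))  # higher: the split follows the triangles
print(modularity(edges, one_group))   # 0.0: a single community carries no structure
```

The function makes a single pass over the edge list, so the scoring step itself scales linearly with the number of edges. The hard part on very large graphs is searching the space of partitions, which is where parallel and distributed techniques come in.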

The challenges in this area are still many and highly complex, so research in it is expected to remain active for years to come.

[1] Ferrara, E. (2012). Community structure discovery in Facebook. International Journal of Social Network Mining, 1(1), 67–90.

[2] Gibson, D., Kumar, R., & Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In International Conference on Very Large Data Bases (VLDB), pp. 721–732.

[3] Lancichinetti, A., & Fortunato, S. (2009). Community detection algorithms: a comparative analysis. Physical Review E, 80(5), 056117.


Dr Jordi Casas-Roma

Dr Jordi Conesa Caralt

eHealth Center