Data Science

Thesis proposal Researchers Research group

Privacy-preserving in Data Mining

In recent years, the amount of publicly available data has grown explosively. This data often embeds private information about the users who appear in it. Therefore, data owners must protect the privacy of users before releasing datasets to third parties. In this scenario, anonymization processes become an important concern.

Privacy breaches take several forms, each related to one or more data types. For instance, medical datasets are published as database tables, so linking this information with publicly available datasets may disclose the identity of some individuals; social network data is usually published as graphs, and adversaries can infer the identity of users by solving a set of restricted graph isomorphism problems; location privacy concerns data from phone call networks or applications like “Foursquare”; and so on.
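The table-linking breach described above can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the names and the choice of `zip`, `birth_year` and `sex` as quasi-identifiers are all hypothetical, and the helper `link` is not taken from any library.

```python
# Hypothetical released medical table: names removed, quasi-identifiers kept.
medical = [
    {"zip": "08018", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "08018", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "08035", "birth_year": 1982, "sex": "M", "diagnosis": "flu"},
    {"zip": "08035", "birth_year": 1982, "sex": "M", "diagnosis": "migraine"},
]

# Hypothetical public register that still carries names.
public = [
    {"name": "Alice", "zip": "08018", "birth_year": 1975, "sex": "F"},
    {"name": "Bob", "zip": "08018", "birth_year": 1982, "sex": "M"},
    {"name": "Carles", "zip": "08035", "birth_year": 1982, "sex": "M"},
]

# Assumed quasi-identifiers shared by both tables.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(released, register, keys=QUASI_IDENTIFIERS):
    """Return {name: diagnosis} for people matching exactly one released row."""
    index = {}
    for row in released:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    disclosed = {}
    for person in register:
        matches = index.get(tuple(person[k] for k in keys), [])
        if len(matches) == 1:  # a unique match re-identifies the record
            disclosed[person["name"]] = matches[0]["diagnosis"]
    return disclosed


reidentified = link(medical, public)
```

In this toy data, Alice and Bob each match a single released row, so their diagnoses are disclosed even though no name was published. Carles matches two rows and stays ambiguous.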

The simple technique of anonymizing networks by removing identifiers before publishing the actual data does not guarantee privacy [2,3]. Therefore, approaches and methods have been developed to deal with each data type and each kind of privacy disclosure [1]. The aim of this research is to develop privacy-preserving methods and algorithms that guarantee users' privacy while keeping the utility of the anonymized data as close as possible to that of the original data. These methods have to achieve a trade-off between data privacy and data utility. Consequently, several data mining tasks must be considered in order to quantify the information loss introduced by anonymization.
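A small sketch may help to see why removing identifiers is not enough, and what the privacy and utility trade-off looks like on a graph. It assumes an adversary who knows only the degree of a target vertex, and it uses the Jaccard similarity of edge sets as a stand-in for information loss. The toy graph, both function names and both measures are illustrative assumptions, not the methods of the proposal.

```python
from collections import Counter

# Hypothetical graph released with identifiers replaced by integers.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (4, 5)]


def candidate_set_sizes(edge_list):
    """For each vertex, count how many vertices share its degree."""
    degree = Counter()
    for u, v in edge_list:
        degree[u] += 1
        degree[v] += 1
    class_size = Counter(degree.values())
    return {v: class_size[d] for v, d in degree.items()}


def edge_jaccard(original, released):
    """Share of edges common to both graphs (1.0 means identical edge sets)."""
    a = {frozenset(e) for e in original}
    b = {frozenset(e) for e in released}
    return len(a & b) / len(a | b)


# Vertices whose degree is unique can be re-identified from the degree alone.
at_risk = sorted(v for v, s in candidate_set_sizes(edges).items() if s == 1)

# One added edge hides those vertices, at the cost of changing the data.
perturbed = edges + [(1, 5)]
smallest_group = min(candidate_set_sizes(perturbed).values())
utility = edge_jaccard(edges, perturbed)
```

In the original graph, vertices 0 and 5 have unique degrees and are exposed despite the missing identifiers. After adding one edge, every degree is shared by at least two vertices, while the edge sets overlap in 6 of 7 edges. Realistic evaluations would measure utility on the data mining tasks of interest rather than on raw edge overlap.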

Due to its nature, privacy preservation in data mining involves several active research topics, such as security and privacy issues to ensure anonymity, data mining and machine learning to evaluate data utility and information loss, and aspects related to big data.

[1] Casas-Roma, J., Herrera-Joancomartí, J. & Torra, V. (2017). A survey of graph-modification techniques for privacy-preserving on networks. Artif Intell Rev, 47: 341. https://doi.org/10.1007/s10462-016-9484-8

[2] Torra V. (2009) Privacy in Data Mining. In: Maimon O., Rokach L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA

[3] Torra, V. and Navarro-Arribas, G. (2014), Data privacy. WIREs Data Mining Knowl Discov, 4: 269–280. doi:10.1002/widm.1129

Dr. Jordi Casas-Roma

KISON Research group

Data Mining and Community Detection in Graphs (Graph Mining)

In many applications, it is natural to represent data with graphs. Usually, data is represented in one large, connected network. Examples of such networks include the Internet, social networks, citation networks, concept networks, computer networks, chemical interaction networks, regulatory networks, socio-economic networks and encyclopedias. Sample datasets are publicly available at, among other sites, http://snap.stanford.edu/data/.

Graph mining is the study of how to perform data mining and machine learning on data represented with graphs. It includes several types of analysis, from pattern recognition [2] to community detection [1] and information flow. Algorithms from structured data mining do not work properly on graphs, since structural and topological information is crucial for graph analysis. Thus, new algorithms, or extensions of existing ones, should be developed to deal with graph-formatted data.

For instance, uncovering the community structure exhibited by real networks is a crucial step towards an understanding of complex systems that goes beyond the local organization of their constituents. Many algorithms have been proposed so far [3], but the problem is still open and new methods and algorithms keep appearing. Additionally, the recent, tremendous growth of graph-formatted data, especially in the context of social networks and IoT, calls for new methods to deal with very large graphs, with thousands or millions of vertices and edges. Therefore, parallelism and other techniques imported from Big Data can be applied in order to overcome the complexity of dealing with such data.
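To make the notion of community structure concrete, the sketch below scores a given partition with Newman-Girvan modularity, a quality function that many community detection algorithms try to maximize. The choice of modularity, the toy graph and the function name are assumptions made for illustration, since the proposal does not commit to any particular measure.

```python
from collections import Counter


def modularity(edge_list, communities):
    """Newman-Girvan modularity of a partition of an undirected, unweighted graph."""
    m = len(edge_list)
    label = {v: i for i, group in enumerate(communities) for v in group}
    internal = Counter()    # edges with both endpoints in community i
    degree_sum = Counter()  # sum of vertex degrees in community i
    for u, v in edge_list:
        degree_sum[label[u]] += 1
        degree_sum[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in range(len(communities))
    )


# Hypothetical graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

q_split = modularity(edges, [{0, 1, 2}, {3, 4, 5}])      # one community per triangle
q_single = modularity(edges, [{0, 1, 2, 3, 4, 5}])       # everything in one community
```

Splitting the graph into its two triangles gives a modularity of 5/14 (about 0.357), while putting all vertices in a single community gives 0, so the score favours the intuitive grouping. On graphs with millions of edges, the single pass over `edge_list` is the kind of step that can be partitioned across workers.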

Many complex challenges remain open in this area, so there is ample room for research in the years to come.

[1] Ferrara, E. (2012). Community structure discovery in Facebook. International Journal of Social Network Mining, 1(1), 67–90. http://doi.org/10.1504/IJSNM.2012.045106

[2] Gibson, D., Kumar, R., & Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In International Conference on Very Large Data Bases (VLDB), pp. 721–732.

[3] Lancichinetti, A., & Fortunato, S. (2009). Community detection algorithms: a comparative analysis. Physical Review E, 80(5), 056117.

Dr. Jordi Casas-Roma

Dr. Jordi Conesa Caralt

eHealth Center

SmartLearn

KISON Research group

Data Mining and Deep Learning in Healthcare

Artificial intelligence and machine learning are transforming the world of medicine [1]. They can help doctors make faster and more accurate diagnoses. Additionally, they can help researchers understand diseases, for instance the correlation between lifestyle and cancer, or how genetic variations lead to disease.

Although AI has been around for decades, recent advances have arisen in deep learning. This technique powers self-driving cars and image recognition, and is driving major advances in medicine and healthcare.

Deep learning helps researchers analyze medical data to treat diseases [2]. It involves many types of data, from structured data about a patient's life (location, lifestyle, etc.) to semi-structured data (mobility or social interaction) to unstructured data (medical images and so on).

Deep learning is advancing personalized medicine, which will change health care in the next few years. The challenges in this area are still many and of great complexity. Even more importantly, society will be able to take full advantage of these advances to improve the lives of all citizens.

[1] Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1), 3.

[2] Miotto, R., Wang, F., Wang, S., Jiang, X., & Dudley, J. T. (2017). Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics.

Dr. Jordi Casas-Roma

Dr. Jordi Conesa Caralt

eHealth Center

SmartLearn

KISON Research group

Health Data Science

Data science has arisen as a new paradigm that offers many possibilities for discovering new information (and relationships) from data and for providing predictive and prospective information [1]. In the context of healthcare, data analytics is a promising field for providing insight from very large data sets and improving outcomes while reducing costs [2].

Although the volume of data managed in the health context is already huge, it is expected to grow dramatically in the years ahead [3]. There is also a trend towards opening access to health data repositories. In addition, new technologies allow individuals to self-track and collect their biological, physical, behavioural and environmental information. The availability of this information [4] may be a lever to achieve better personalized health [5].

This research is proposed in the context of the eHealth Center (http://www.uoc.edu/portal/en/ehealth-center/index.html), whose aim is to develop personalized, predictive, preventive and participative processes for the provision of health and well-being. Therefore, the proposals in this line will focus on the application of data science techniques to health data in order to promote personalized health and well-being.

[1] Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73.

[2] Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1), 3.

[3] Institute for Health Technology Transformation (2013). Transforming Health Care through Big Data: Strategies for leveraging big data in the health care industry.

[4] Swan, M. (2013). The quantified self: Fundamental disruption in big data science and biological discovery. Big Data, 1(2), 85–99.

[5] Sharon, T. (2017). Self-tracking for health and the quantified self: Re-articulating autonomy, solidarity, and authenticity in an age of personalized healthcare. Philosophy & Technology, 30(1), 93–121.

Dr. Jordi Conesa Caralt

Dr. Jordi Casas-Roma

eHealth Center

SmartLearn