Interviews

"Big data has the potential to predict future pandemics, but there are still some challenges to overcome before we get there"

Photo: Julián Salas


15/04/2020
Sílvia Oller
Julián Salas Pin, Mexican researcher at the UOC's Internet Interdisciplinary Institute (IN3) research group K-riptography and Information Security for Open Networks (KISON)

 

Julián Salas Pin, a native of Mexico, completed his bachelor's degree in Mathematics at the Universidad Nacional Autónoma de México (UNAM) before moving overseas 10 years ago and settling in Catalonia to pursue further training. In 2012 he obtained a PhD cum laude in Applied Mathematics and has since worked at a number of institutes, including the Artificial Intelligence Research Institute (IIIA-CSIC). In 2017 he received a UOC grant to carry out a project aimed at safeguarding the privacy of dynamic data such as those generated by website visits, social media activity and purchase records. Salas is currently a researcher at the UOC's Internet Interdisciplinary Institute (IN3) research group K-riptography and Information Security for Open Networks (KISON), which focuses on user privacy and security in open online environments. With his Catalan wife by his side, he is keeping an eye on the other side of the Atlantic and following the spread of the coronavirus in his home country of Mexico.

 

-How are you handling the pandemic in Mexico from afar? Do you think a lockdown like the one put in place here is possible, or do densely populated urban centres like Mexico City, coupled with the poverty faced by a large portion of the population, make it unfeasible?

Mexico is currently at the stage we were at in Spain a month ago, when people were going about their daily lives with relative normalcy. Universities and schools shut down mid-March, but the problem there is that a huge percentage of the population works in unregulated jobs. We've yet to see whether the government will be able to ensure that everyone left without work, such as the great number of people selling food on the street, can get through the next few months. The underlying problem is an economic one: the dilemma lies in figuring out how to stop the pandemic while also minimizing its effect on the economy.

-In a healthcare crisis like today's, big data could be used by the government to obtain significant geographical information about the disease, but this would require handing over loads of private data. Does big data completely guarantee patient privacy?

In fact, privacy is a bit of a hurdle in my area of research. On the one hand, technology allows us to gather an increasing amount of data, and the more data we have and the more accurate they are, the better our algorithms work. Privacy, however, pulls in the opposite direction. If you have extremely accurate data and know everything about someone, where's their right to privacy? The issue lies in exploiting data for an intended purpose without being tempted to know more than necessary.

-Can the scales be balanced?

Of course. To give an example, if you want to monitor citywide mobility and find out how many people are travelling from one place to another, you don't need to know every place each vehicle visits before reaching its final destination. Where they started and where they ended up, that's all you need. In one of the more recent studies we carried out at the UOC, we mapped people's mobility without having access to their individual movements, as those could be used to reidentify people.
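The endpoint-only idea can be sketched in a few lines of Python (toy data and function names of my own, not the actual pipeline from the UOC study): reduce each trace to its first and last stop, then count trips per origin-destination pair, discarding everything in between.

```python
from collections import Counter

def od_matrix(traces):
    """Aggregate travel traces into origin-destination counts,
    keeping only the endpoints of each trace."""
    counts = Counter()
    for trace in traces:
        origin, destination = trace[0], trace[-1]  # drop intermediate stops
        counts[(origin, destination)] += 1
    return counts

# Toy traces: each is a list of visited zones.
trips = [
    ["A", "C", "D", "B"],
    ["A", "B"],
    ["B", "D", "A"],
]
print(od_matrix(trips))  # Counter({('A', 'B'): 2, ('B', 'A'): 1})
```

The intermediate stops, which are the most reidentifying part of a trace, never leave the aggregation step.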

-Could our right to personal data protection stand in the way of battling the coronavirus?

I don't believe so, no, but we will perhaps have to make a greater effort to be able to use the data. I mean, we'll have to run algorithms to anonymize them, thereby removing the possibility of identifying the owner, before using them. Big data has allowed China to roll out a system for curbing the spread of the coronavirus, one which raises questions about whether or not it violates people's privacy. An app available for download on programs such as WeChat, China's most extensively used social medium, enables a central database to collect data on citizens' movements and coronavirus status, labelling them with a green, yellow or red QR code which allows them to move about freely, restricts their movements to nearby areas or extends their quarantine. The app then creates a centralized map of users' contact networks in real time using three data points: proximity between mobile telephones, GPS location and QR scans at the entrance and exit of buildings. The database is analysed by an artificial intelligence-based algorithm, which is responsible for the colour-coding.
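Purely as an illustration of the colour-coding idea (the real system's rules are not public; the inputs and thresholds below are my own guesses based on the description above), the assignment might look like:

```python
def health_code(tested_positive, contact_with_case, visited_hotspot):
    """Toy three-colour health code, loosely mirroring the description
    above; the actual system uses a closed AI-based algorithm."""
    if tested_positive:
        return "red"     # extended quarantine
    if contact_with_case or visited_hotspot:
        return "yellow"  # movement restricted to nearby areas
    return "green"       # free movement

print(health_code(False, True, False))  # yellow
```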

-Would this be feasible in Spain?

In theory, if you give your consent, data like these can be gathered. Another matter is determining whether it's really necessary to know everyone's whereabouts at all times. A good example of privacy protection is Singapore's TraceTogether app, which detects which users have been nearby and stores the data for 21 days on their mobile phones. Users only have to reveal these data if they are infected, meaning only essential information from the pertinent time period is gathered, and only when truly necessary.
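The on-device retention pattern can be sketched as follows (a minimal sketch in the spirit of TraceTogether, not its actual implementation; class and method names are my own):

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=21)

class ContactLog:
    """Minimal on-device contact log: proximity events stay on the
    phone, expire after 21 days, and are shared only on infection."""

    def __init__(self):
        self._events = []  # list of (timestamp, anonymous peer id)

    def record(self, peer_id, when):
        self._events.append((when, peer_id))

    def prune(self, now):
        """Forget anything older than the retention window."""
        self._events = [(t, p) for (t, p) in self._events
                        if now - t <= RETENTION]

    def reveal_if_infected(self, now):
        """Called only when the user reports infection: share the
        still-retained contacts and nothing else."""
        self.prune(now)
        return [p for (_, p) in self._events]

now = datetime(2020, 4, 15)
log = ContactLog()
log.record("peer-1", now - timedelta(days=30))  # too old, will be pruned
log.record("peer-2", now - timedelta(days=5))
print(log.reveal_if_infected(now))  # ['peer-2']
```

The design choice to invert the default, local storage with disclosure as the exception, is what limits collection to the essential.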

-A Canadian start-up detected the COVID-19 outbreak before the World Health Organization put out an official statement. Is artificial intelligence a useful tool for predicting future epidemics and preparing ourselves with effective treatments?

Artificial intelligence allows us to exploit computers' ability to process enormous amounts of data and build highly accurate predictive models. Big data has the potential to serve as a force for the common good, but there are still some challenges to overcome before we get there. These include ensuring that the businesses that own the data act in good faith and that the algorithms are transparent and explainable.

-The Government of Catalonia has rolled out an app to see where the majority of the coronavirus cases are concentrated. As citizens, can we rest assured that the data we provide will not be used with ulterior motives?

I've been asking myself the same thing. For example, when Spain's National Statistics Institute (INE) began to track the mobile phones of millions of people across the country to study our movements, it seemed like they were only going to use our rough locations. However, with the information they've gathered, there's a chance of finding out whom each phone belongs to. You don't need all of someone's data to be able to reidentify them. Many people believe that data are anonymous because they don't have a name or surname attached, but real anonymity means not being reidentifiable.

-The INE study gave rise to a number of doubts and complaints regarding privacy. Were they justified?

Full privacy requires not providing data at all, yet data are most accurate when you have everyone's at any given time. Our job is to find a happy medium between these two extremes. We are attempting to develop algorithms that can provide better privacy and security guarantees in different situations. In terms of the INE study, the telephone operators calculated and provided solely the origin-destination matrices, in which each entry represents the average number of people who moved from point A to point B over a week. The problem lay in the fact that they didn't disclose their data protection methodology to the public at the time. Likewise, newspaper articles failed to explain what data the INE would have access to and how they planned to safeguard them.

-As citizens, can we be sure that the data we provide when we sign up on different websites, for example, are really anonymous?

Everyone should hear this story: a student from MIT managed to identify a governor's medical records with just three parameters – postal code, sex and date of birth – from a list of records that the governor himself had made public thinking that the data were anonymous because people's names and surnames had been redacted. There have also been reidentification attacks based on Netflix film ratings and geolocalized data from New York taxi drivers.
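The linkage behind that attack is easy to reproduce on toy data (all names and records below are invented): join an "anonymous" table to a public one on the three quasi-identifiers alone.

```python
def reidentify(anon_rows, public_rows, keys=("zip", "sex", "dob")):
    """Link two tables on quasi-identifiers alone: the 'anonymous'
    table needs no names to betray identities."""
    index = {tuple(p[k] for k in keys): p["name"] for p in public_rows}
    matches = {}
    for row in anon_rows:
        key = tuple(row[k] for k in keys)
        if key in index:
            matches[index[key]] = row["diagnosis"]
    return matches

medical = [  # "anonymous": names removed, quasi-identifiers kept
    {"zip": "02138", "sex": "M", "dob": "1945-07-31", "diagnosis": "X"},
    {"zip": "02139", "sex": "F", "dob": "1962-01-12", "diagnosis": "Y"},
]
voters = [  # public list with names attached to the same attributes
    {"name": "A. Smith", "zip": "02138", "sex": "M", "dob": "1945-07-31"},
    {"name": "B. Jones", "zip": "02139", "sex": "F", "dob": "1970-03-02"},
]
print(reidentify(medical, voters))  # {'A. Smith': 'X'}
```

Only the first medical record matches a voter on all three attributes, and that match is enough to reattach a name to a diagnosis.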

-Can our names and surnames be reattached to supposedly anonymous data through database analysis? Can we be reidentified in this way?

There are algorithms in place to try and keep this from happening. That being said, we should bear in mind that this comes at a cost: privacy protection equals less accuracy. Either way you tip the scale, the other side loses out. That's why I think we need a change in perspective. Maximum accuracy is the ultimate goal, as long as we respect people's right to privacy.
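One concrete form of this privacy-accuracy dial is differential privacy (my example here, not a technique Salas names): add calibrated Laplace noise to a query result, where a smaller privacy parameter epsilon means more noise, hence stronger privacy and less accuracy.

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) by inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def noisy_count(true_count, epsilon):
    """Epsilon-differentially-private count (sensitivity 1):
    smaller epsilon -> larger noise scale -> more privacy, less accuracy."""
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
print(round(noisy_count(100, epsilon=1.0), 1))   # close to 100
print(round(noisy_count(100, epsilon=0.05), 1))  # much noisier
```

Tipping the scale is literal here: the same query can be released with more privacy or more accuracy, but not maximal amounts of both.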

-In your mind, are citizens fully aware of the fact that sharing our data comes with a cost, that signing up for a website means giving away information that can then be used as the raw material for someone's business?

No, and we should be aware of all the data we generate on our computers, on Facebook, on Twitter, etc. We should also understand how they might be used and who may have access to them. Since most of our online activities take place from the comfort of our own homes, we feel like anything we type is private, but we need to know where all that information ends up and who can get their hands on it. When we navigate a web page or use a social medium, everything we do is stored. I would love to know what businesses have my data and what sorts of models they've built with them. In Europe, thanks to the General Data Protection Regulation, we can ask companies what data of ours they have. But they'll only tell you if you inquire.

-Does disclosing our location come with any risks?

By tapping into your phone's GPS, we can find out where you live, where you spend your day, what time you leave your house, where you work, what hobbies you have, whether you are visiting the hospital more often than normal, and so on and so forth. Certain mobile apps ask for your location; if you allow them access to it, anyone with that information can find out what you're doing at any point during the day.

-We're being watched from every direction, which is a bit unnerving, wouldn't you say?

I'm all for big data, but user privacy has to be there to limit its power. Businesses that gather data should be transparent and disclose what they use them for. We should also be able to audit the algorithms they have in place to safeguard our data and they should notify us when our data are added to databases.

-What is your main technological challenge as experts working to anonymize data?

The world is filling up with sensors, such as those contained in our mobile phones or other objects with access to the Internet of Things, which makes keeping one's privacy an increasingly daunting task. Our challenge is to find a balance between respecting people's right to privacy and making the most out of data. Another hurdle involves raising people's awareness.