4/18/24 · Technology

UOC involved in new standard language for indexing AI training data

Called Croissant, it has been developed jointly by several universities and industry giants, such as Google and Meta, and has already been adopted by leading repositories and data search engines

IN3 researcher Joan Giner contributed to the project by implementing the responsible AI extension, which minimizes the risk of bias and incorrect decisions by machine algorithms

Croissant emerged from a collaborative effort among tech titans such as Google, Meta and Amazon (photo: unsplash.com)

Xavier Aguilar

Data form the backbone of artificial intelligence, and machine learning experts rely on large datasets to train the AI models that are changing the world in many ways. However, finding the data they need, understanding them, working out how they are organized and identifying the usable parts can be very time-consuming, and it can bog down AI development. To overcome this challenge, the MLCommons association, with the participation of the Universitat Oberta de Catalunya (UOC), has just launched Croissant, a new metadata format for indexing datasets prepared for machine learning.

Croissant emerged from a collaborative effort among research teams from tech titans such as Google, Meta and Amazon, alongside universities such as Harvard, King's College London and the UOC, represented by Joan Giner, a researcher at the SOM Research Lab within the Internet Interdisciplinary Institute (IN3). Giner said: "This initiative is comparable to the breakthrough that made it possible to search for anything on the internet using the Google search engine 20 years ago, but adapted to the field of artificial intelligence."

Croissant does not change the format of the data themselves (e.g. image, audio or text files), but provides a standard way of describing and organizing them. The new language builds upon Schema.org, a machine-readable standard for describing structured data, which is already used to describe over 40 million datasets online and enables them to be discovered through search engines such as Google Dataset Search.

Croissant encompasses valuable layers of information regarding data structure, attribute types and download options, streamlining the process of finding datasets and integrating them into AI applications. This eliminates the need to search for data individually across multiple repositories. Giner said: "This is a very significant change, because the difference between an excellent AI and an ordinary AI is that the former is trained on a much larger dataset. Now that we're in the age of big data, and a lot more data are being published every day, it was important to bring order to them so that they are easier to access."
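To make these layers more concrete, the following is a minimal Python sketch of what a Croissant-style description might look like. It is only an illustration: the dataset name, URLs and fields are hypothetical, the property names are loosely modeled on Schema.org and Croissant, and the result is not a validated Croissant file.

```python
import json

# Illustrative only: a simplified, Croissant-style description of a
# hypothetical dataset. Names, URLs and fields are assumptions made
# for this example; this is not a validated Croissant file.
metadata = {
    "@type": "sc:Dataset",
    "name": "example_reviews",
    "description": "Toy dataset of product reviews.",
    "url": "https://example.org/datasets/example_reviews",
    # Download options: where the raw file lives and in what format.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/datasets/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Data structure and attribute types.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {"@type": "cr:Field", "name": "text", "dataType": "sc:Text"},
                {"@type": "cr:Field", "name": "rating", "dataType": "sc:Integer"},
            ],
        }
    ],
}


def summarize(meta):
    """Return the file formats and a {field name: data type} mapping."""
    formats = [file_obj["encodingFormat"] for file_obj in meta["distribution"]]
    fields = {
        field["name"]: field["dataType"]
        for record_set in meta["recordSet"]
        for field in record_set["field"]
    }
    return formats, fields


formats, fields = summarize(metadata)
print(json.dumps(metadata, indent=2))
print(formats, fields)
```

Note that the description sits alongside the data file and leaves it untouched: a tool only has to read the metadata to learn where the file is, what format it is in and which attributes it contains.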

The world's leading AI data repositories, namely HuggingFace, Kaggle and OpenML, are now part of the project, with all their datasets described using Croissant and indexed in Google Dataset Search. Key machine learning tools used to train AI models have also integrated the standard. Reflecting on this success, Giner said: "We can say that we're actually leading the way in establishing the standard for AI data description."

“This initiative is comparable to the breakthrough that made it possible to search for anything using Google 20 years ago, but adapted to the field of AI”

Ethical and socially responsible AI

Giner contributed to the MLCommons project as an expert in responsible AI and dataset documentation, the subject of his doctoral thesis in the Doctoral Programme in Network and Information Technologies at the UOC. "We set out to establish a way of documenting data that would allow us to feel confident about their use while preventing ethical dilemmas," he said. The responsible AI extension he worked on addresses concerns such as privacy issues and social representativeness, crucial challenges faced by AI in its nascent stages. "This will help avoid cases like those observed in medical AI apps, where more diagnoses failed in women, especially black women, than in white men, because women, especially black women, were missing from the training data," explained the IN3 researcher.
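The kind of gap Giner describes can be illustrated with a simple count of how each group is represented in a dataset. The Python sketch below is an assumption-laden toy example: the records, the field names "sex" and "ethnicity" and the 15% threshold are all hypothetical, and the code is not part of Croissant or its responsible AI extension.

```python
from collections import Counter

# Hypothetical toy records; the field names and values are assumptions
# made for this example and do not come from any real dataset.
records = (
    [{"sex": "male", "ethnicity": "white"}] * 6
    + [{"sex": "female", "ethnicity": "white"}] * 2
    + [{"sex": "male", "ethnicity": "black"}] * 1
    + [{"sex": "female", "ethnicity": "black"}] * 1
)


def group_shares(rows, keys):
    """Return the fraction of rows that falls into each combination of keys."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(shares, threshold):
    """Return, in sorted order, the groups whose share is below the threshold."""
    return sorted(group for group, share in shares.items() if share < threshold)


shares = group_shares(records, keys=("sex", "ethnicity"))
# The 0.15 threshold is an arbitrary value chosen for illustration.
flagged = underrepresented(shares, threshold=0.15)
print(shares)
print(flagged)
```

Documenting figures of this kind alongside a dataset is one way for its users to see, before training a model, which groups are thinly covered.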

As a consortium partner, Google has placed great emphasis on this ethical aspect: "Supporting responsible AI (RAI) has been a central goal of Croissant's efforts from the start, and this extension allows us to describe the processes used to generate data, the people involved in those processes, and the potential biases inherent in the data," said sources within the technology company. "For me, the fact that the world's first data standard comes with a responsible data extension is a significant achievement for the ethical AI community, as companies typically don't give enough consideration to this aspect," Giner added.

While the project is confident that industry experts will use Croissant to publish their data, the team that developed the language will focus on specific domains such as healthcare and public data. In healthcare, for example, the focus will be on identifying the most important data types (X-rays, CT scans, doctor-patient conversations, etc.) and describing the essential aspects of social representativeness required for effective use. "At the end of the day, AI seems smart, but it's not. It's just good at reproducing the patterns inherent in data. And if those data don't accurately reflect the reality they're trying to represent, the outcomes won't be very good," concluded the UOC expert.

This research supports UN Sustainable Development Goals (SDGs) 3, Good Health and Well-being; 5, Gender Equality; and 9, Industry, Innovation and Infrastructure.

UOC R&I

The UOC's research and innovation (R&I) is helping overcome pressing challenges faced by global societies in the 21st century by studying interactions between technology and the human and social sciences, with a specific focus on the network society, e-learning and e-health.

Over 500 researchers and more than 50 research groups work in the UOC's seven faculties, its eLearning Research programme and its two research centres: the Internet Interdisciplinary Institute (IN3) and the eHealth Center (eHC).

The university also develops online learning innovations at its eLearning Innovation Center (eLinC), as well as UOC community entrepreneurship and knowledge transfer via the Hubbik platform.

Open knowledge and the goals of the United Nations 2030 Agenda for Sustainable Development serve as strategic pillars for the UOC's teaching, research and innovation. More information: research.uoc.edu.