Date: 2024-04-18

Data form the backbone of artificial intelligence, and machine learning experts rely on large datasets to train the AI models that are changing the world in many ways. However, finding the data they need, understanding them, making sense of how they are organized, and identifying usable parts can be a very time-consuming task, becoming a tar pit for AI development. In order to overcome this challenge, the MLCommons association, with the participation of the Universitat Oberta de Catalunya (UOC), has just launched Croissant, a new metadata format for indexing datasets prepared for machine learning.

Croissant emerged from a collaborative effort among research teams from tech titans such as Google, Meta and Amazon, alongside universities such as Harvard, King's College London and the UOC, represented by Joan Giner, a researcher at the SOM Research Lab within the Internet Interdisciplinary Institute (IN3). Giner said: "This initiative is comparable to the breakthrough that made it possible to search for anything on the internet using the Google search engine 20 years ago, but adapted to the field of artificial intelligence."

Croissant does not change the format of data (e.g. in image, audio or text files), but provides a standard way of describing and organizing them. The new language builds upon Schema.org, a machine-readable standard for describing structured data, already employed in over 40 million datasets online, enabling these datasets to be discovered through search engines such as Google Dataset Search.

Croissant encompasses valuable layers of information regarding data structure, attribute types and download options, streamlining the process of finding datasets and integrating them into AI applications. This eliminates the need to search for data individually across multiple repositories. Giner said: "This is a very significant change, because the difference between an excellent AI and an ordinary AI is that the former is trained on a much larger dataset. Now that we're in the age of big data, and a lot more data are being published every day, it was important to bring order to them so that they are easier to access."
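To make these layers concrete, the following is a minimal sketch of what a machine-readable dataset description might look like. It is an illustration under stated assumptions, not the normative Croissant schema: the dataset name, URL and field names are hypothetical, the top-level keys borrow from the Schema.org Dataset vocabulary, and the "recordSet", "field" and "dataType" keys are simplified stand-ins for the structure and attribute-type layers described above.

```python
import json

# Hypothetical, simplified dataset description in a JSON-LD style.
# "@context", "@type", "name", "description", "distribution", "contentUrl"
# and "encodingFormat" follow Schema.org; "recordSet", "field" and "dataType"
# are illustrative assumptions, not the official Croissant vocabulary.
metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "example-reviews",
    "description": "Hypothetical text dataset used only for illustration.",
    # Download options: where the raw files live and in what format.
    "distribution": [
        {
            "@type": "DataDownload",
            "name": "reviews.csv",
            "contentUrl": "https://example.org/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Data structure and attribute types: records and their typed fields.
    "recordSet": [
        {
            "name": "reviews",
            "field": [
                {"name": "text", "dataType": "Text"},
                {"name": "rating", "dataType": "Integer"},
            ],
        }
    ],
}


def summarize(meta):
    """Return the download URLs and a mapping of field name to data type."""
    urls = [d["contentUrl"] for d in meta.get("distribution", [])]
    fields = {
        f["name"]: f["dataType"]
        for record_set in meta.get("recordSet", [])
        for f in record_set.get("field", [])
    }
    return urls, fields


urls, fields = summarize(metadata)
print(json.dumps(metadata, indent=2))
print(urls)
print(fields)
```

Because the description is separate from the data files themselves, a tool can read it to learn where to download the data and which typed attributes to expect without opening the files first.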

The world's leading AI data repositories, namely HuggingFace, Kaggle and OpenML, are now part of the project, with all their datasets described using Croissant and indexed in Google Dataset Search. Key machine learning tools used to train AI models on data have also integrated the standard. Reflecting on this success, Giner said: "We can say that we're actually leading the way in establishing the standard for AI data description."