6/17/22 · Research

Researchers develop a model for automatically extracting content from websites and apps

Content management systems (CMSs) are behind more than 60% of pages currently available online
The model could turn CMSs into a new source of data for training artificial intelligence systems

The IN3 researchers' technological proposal aims to generate the code that will act as a link between the CMS and the development of new applications (photo: Sigmund / unsplash.com)

Juan F. Samaniego

Author

Content management systems (CMSs) are the most popular tools for creating content on the internet. In recent years, they have evolved to become the backbone of an increasingly complex ecosystem of websites, mobile apps and platforms. To simplify the development work this ecosystem requires, a team of researchers from the Internet Interdisciplinary Institute (IN3) at the Universitat Oberta de Catalunya (UOC) has developed an open-source model to automate the extraction of content from CMSs.

The open-source model is a fully functional scientific prototype that makes it possible to extract the data structure and libraries of each CMS and create a piece of software that acts as an intermediary between the content and the so-called front-end (the final application used by the user). The entire process is automatic, which avoids the errors of manual coding and makes the solution scalable, since it can be repeated many times without increasing its cost.

The importance of CMSs in the online world

Content management systems (CMSs) are behind more than 60% of pages currently available online. Systems such as WordPress, Joomla and Drupal have become popular mainly because they provide a simple user experience, which has allowed all kinds of non-technical users to become part of the online content creation chain.

"Over the last four or five years, these systems have been providing information not only to browsers, but also to mobile apps. CMSs have application programming interfaces (APIs), with which mobile apps communicate to extract content," explained Joan Giner Miguélez, a student on the doctoral programme in Network and Information Technologies with the Systems, Software and Models Research Lab (SOM Research Lab) group and lead author of the study that outlines the new model. "These systems, which are known as headless CMSs, allow content, created in a simple way, to be consumed later on different platforms."

CMSs have therefore become large containers of the content and data used by each application or platform. This has simplified many processes, but it has also added development complexity that is particularly evident for organizations that manage a high volume of content and platforms. Creating a new mobile app increasingly involves complex development work, and the model designed by the IN3 researchers simplifies these tasks.
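As a rough illustration of the headless arrangement described above, the sketch below takes one piece of content, in the form a CMS API might return it, and renders it for two different front-ends. The payload shape and the function names (render_for_web, render_for_app) are assumptions made for this example; they do not come from any particular CMS or from the researchers' model.

```python
import html
import json

# Hypothetical JSON payload, standing in for what a headless CMS API might return.
API_RESPONSE = json.dumps({
    "id": 7,
    "title": "Opening hours",
    "body": "We are open from 9 to 5 & closed on Sundays.",
})


def render_for_web(item: dict) -> str:
    """Render the item as an HTML fragment for a browser front-end."""
    title = html.escape(item["title"])
    body = html.escape(item["body"])
    return f"<article><h1>{title}</h1><p>{body}</p></article>"


def render_for_app(item: dict) -> dict:
    """Render the same item as a plain structure a mobile view could bind to."""
    return {"header": item["title"].upper(), "text": item["body"]}


# The content is written once in the CMS and consumed by both front-ends.
item = json.loads(API_RESPONSE)
web_view = render_for_web(item)
app_view = render_for_app(item)
print(web_view)
print(app_view)
```

The point of the design is that the content is stored without any presentation attached, so each platform decides how to display it.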

"Imagine a large content company that manages over a thousand websites and apps and wants to make a new mobile app that displays products from each of those websites. If they want to develop the connectors between each website and the application, the work would be immense and resource intensive. It is not scalable," added Joan Giner. "If the APIs are already in a standard format, why can't we also make a content extractor that reads and understands the APIs, represents them in a standard way, and generates the connector to send the information to the new mobile app automatically?"

Automating the extraction of content from CMSs

The model developed by Giner – together with his research partners Abel Gómez and Jordi Cabot, ICREA researcher and leader of the SOM Research Lab – greatly simplifies the development process of a new application and, in turn, results in significant savings in terms of time and resources. The process, which has been developed thanks to funding from the European projects AIDOaRT and TRANSACT, aims to extract and represent the CMS model in a clear and automatic way to make it easier to use as a source of information. In addition, the IN3 researchers' technological proposal aims to generate the code that will act as a link between the CMS and the development of new applications.
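To make the idea of a standard representation more concrete, here is a minimal sketch in which content from two differently shaped APIs is mapped onto one common structure. The two payload shapes are simplified assumptions loosely inspired by common CMS API styles, and ContentItem, the adapter functions and the example sites are hypothetical names chosen for illustration, not the researchers' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentItem:
    """Common representation shared by every source (hypothetical)."""
    source: str
    identifier: str
    title: str
    body: str


def from_wordpress_like(site: str, post: dict) -> ContentItem:
    # Assumed shape: title and content nested under a "rendered" key.
    return ContentItem(site, str(post["id"]), post["title"]["rendered"],
                       post["content"]["rendered"])


def from_jsonapi_like(site: str, resource: dict) -> ContentItem:
    # Assumed shape: fields grouped under "attributes", body text under "value".
    attrs = resource["attributes"]
    return ContentItem(site, str(resource["id"]), attrs["title"], attrs["body"]["value"])


# One adapter per API style, not one per website.
ADAPTERS = {"wordpress-like": from_wordpress_like, "jsonapi-like": from_jsonapi_like}


def extract(site: str, style: str, raw_items: list) -> list:
    """Normalise every raw item from one site using the adapter for its style."""
    adapter = ADAPTERS[style]
    return [adapter(site, raw) for raw in raw_items]


shop_a = [{"id": 1, "title": {"rendered": "Red mug"}, "content": {"rendered": "Ceramic."}}]
shop_b = [{"id": "ab12", "attributes": {"title": "Blue mug", "body": {"value": "Glass."}}}]

catalogue = (extract("shop-a.example", "wordpress-like", shop_a)
             + extract("shop-b.example", "jsonapi-like", shop_b))
for item in catalogue:
    print(item.source, item.title)
```

With this arrangement, adding another website that uses an already supported API style requires no new connector code, which is the scalability argument made in the article.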

To achieve this, the first step is to give the tool the address and login information for the CMS. Once logged in, it reads the API, understands it and uses a reverse engineering process to represent the structure and content libraries of the CMS in a standard way. Based on this, it automatically generates the connector code through which the CMS and the new mobile app being developed will communicate.
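The steps above can be sketched, under strong simplifying assumptions, as a small pipeline: read an API description, build a uniform model from it, and generate connector code from that model. Everything named here (API_DESCRIPTION, ContentType, build_model, generate_connector, the endpoints) is hypothetical, and the login and network steps are replaced by an in-memory dictionary; the actual prototype may work quite differently.

```python
from dataclasses import dataclass

# Hypothetical API description, standing in for what the tool would read after
# logging in to the CMS. A real tool would obtain this over HTTP.
API_DESCRIPTION = {
    "product": {"endpoint": "/api/products", "fields": {"name": "string", "price": "number"}},
    "article": {"endpoint": "/api/articles", "fields": {"title": "string", "body": "string"}},
}


@dataclass
class ContentType:
    """One entry of the standard model: a content type, its endpoint and fields."""
    name: str
    endpoint: str
    fields: dict


def build_model(description: dict) -> list:
    """Represent the discovered structure in a uniform way, sorted by type name."""
    return [ContentType(name, spec["endpoint"], dict(spec["fields"]))
            for name, spec in sorted(description.items())]


def generate_connector(model: list) -> str:
    """Emit Python source with one fetch function per content type."""
    lines = ["# Generated connector (illustrative)"]
    for ctype in model:
        wanted = sorted(ctype.fields)
        lines += [
            "",
            f"def fetch_{ctype.name}s(get):",
            "    # 'get' is any callable that maps an endpoint to a list of dicts.",
            f"    rows = get({ctype.endpoint!r})",
            f"    return [{{k: row[k] for k in {wanted!r}}} for row in rows]",
        ]
    return "\n".join(lines) + "\n"


model = build_model(API_DESCRIPTION)
connector_source = generate_connector(model)
print(connector_source)

# Load the generated code and call it with a fake transport instead of a network.
namespace = {}
exec(connector_source, namespace)
fake_get = lambda endpoint: [{"name": "Red mug", "price": 9.5, "internal": "x"}]
products = namespace["fetch_products"](fake_get)
print(products)
```

Because the connector is derived from the model rather than written by hand, regenerating it after the CMS structure changes means running the same steps again.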

"It is a way of standardizing the process between the CMS and the final application," highlighted Joan Giner. "Its biggest advantage is, in fact, standardization itself. We're talking about a process that is frequently repeated in organizations that manage content; a process that, each time it is performed, involves setting up a specific development team that requires expenditure on a series of resources and that, in addition, can generate errors. Through automation, everything is simplified and becomes more scalable."

As such, this model for automating CMS extractions focuses on scalability: once the outline and code of the CMS have been created, they can be reused as many times as necessary and integrated into future development projects at no additional cost.

The researchers also point out that it is an automatic model that creates libraries of error-free content, whereas, if the work is done manually, developers can always make a mistake in a line of code.

"Content management systems are a major source of content on the internet. We are making it possible to standardize access to CMSs, just as access to databases was standardized in the past," concluded Joan Giner. "Moving forward, this model could even be used to turn CMSs into a new source of data for training artificial intelligence systems."

Related article

Giner-Miguelez, J., Gómez, A., Cabot, J. (2022). Enabling Content Management Systems as an Information Source in Model-Driven Projects. In: Guizzardi, R., Ralyté, J., Franch, X. (eds) Research Challenges in Information Science. RCIS 2022. Lecture Notes in Business Information Processing, vol 446. Springer, Cham. https://doi.org/10.1007/978-3-031-05760-1_30

This research by the UOC supports Sustainable Development Goal (SDG) 9, Industry, innovation and infrastructure.

The AIDOaRT project has been funded by the Electronics Components and Systems for European Leadership (ECSEL) Joint Undertaking through grant agreement No. 101007350. The ECSEL Joint Undertaking is supported by the European Union's Horizon 2020 research and innovation programme and by Sweden, Austria, the Czech Republic, Finland, Italy and Spain.

The TRANSACT project has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No. 101007260. The JU receives support from the European Union's Horizon 2020 research and innovation programme and from Austria, Belgium, Denmark, Finland, Germany, the Netherlands, Norway, Poland and Spain.

UOC R&I

The UOC's research and innovation (R&I) is helping overcome pressing challenges faced by global societies in the 21st century, by studying interactions between technology and human & social sciences with a specific focus on the network society, e-learning and e-health.

Over 500 researchers and 51 research groups work among the University's seven faculties and two research centres: the Internet Interdisciplinary Institute (IN3) and the eHealth Center (eHC).

The University also cultivates online learning innovations at its eLearning Innovation Center (eLinC), as well as UOC community entrepreneurship and knowledge transfer via the Hubbik platform.

The United Nations' 2030 Agenda for Sustainable Development and open knowledge serve as strategic pillars for the UOC's teaching, research and innovation. More information: research.uoc.edu