Multimedia Chatbot is a web application that can be integrated into a museum website or used as a mobile website, providing chat-based interaction driven by natural language processing. Users can either type their questions or speak them; speech is then transcribed to text. The application implements a chatbot system that can answer questions about the visual content of artworks or about their context, e.g. the author and history of the artwork. Its design is motivated by the recent surge of interest in chat-based interaction popularized, for example, by ChatGPT.
The backend is implemented in Python, using Flask to provide the REST API to the frontend.
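A minimal sketch of how such a Flask backend might expose a chat endpoint to the frontend. The route name, payload fields, and the `answer_question` stub are illustrative assumptions, not the actual API of the project:

```python
# Sketch of a Flask REST endpoint for the chatbot backend.
# The route name and the JSON payload fields are illustrative assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

def answer_question(question: str, artwork_id: str) -> str:
    # Placeholder for the real pipeline (classifier + QA/VQA networks).
    return f"(stub answer for '{question}' about artwork {artwork_id})"

@app.route("/api/chat", methods=["POST"])
def chat():
    payload = request.get_json(force=True)
    answer = answer_question(payload["question"], payload["artwork_id"])
    return jsonify({"answer": answer})

# app.run(debug=True)  # uncomment to serve locally
```

The frontend would POST a JSON body such as `{"question": "...", "artwork_id": "..."}` and receive the answer as JSON.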
There are two different versions of the backend; one implements a set of three neural networks:
- a question classifier network that determines whether the user's query is about the visual content or the context of the artwork;
- a question answering (QA) network that uses the contextual information of the artwork, stored as JSON data, to answer questions about the context of the artwork;
- a visual question answering (VQA) network that considers the visual data of the image and the visual description of the artwork, stored as JSON data, to answer questions about the content of the artwork.
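The way the three networks cooperate can be sketched as follows: the classifier routes each question either to the QA network (contextual questions) or to the VQA network (content questions). All model calls below are hypothetical stubs standing in for the real neural networks:

```python
# Sketch of the three-network pipeline; every function is a stub
# standing in for a real neural network of the backend.

def classify_question(question: str) -> str:
    # Stub for the Transformer-based question classifier.
    contextual_cues = ("who", "when", "where", "author", "year", "history")
    return "context" if any(c in question.lower() for c in contextual_cues) else "content"

def qa_network(question: str, artwork_json: dict) -> str:
    # Stub for the QA network reading the artwork's contextual JSON data.
    return artwork_json["context"].get("author", "unknown")

def vqa_network(question: str, artwork_json: dict) -> str:
    # Stub for the VQA network using image features + visual description.
    return artwork_json["visual"].get("main_subject", "unknown")

def answer(question: str, artwork_json: dict) -> str:
    # Route the question to the appropriate network.
    if classify_question(question) == "context":
        return qa_network(question, artwork_json)
    return vqa_network(question, artwork_json)

artwork = {
    "context": {"author": "Sandro Botticelli", "year": "c. 1486"},
    "visual": {"main_subject": "Venus standing on a shell"},
}
```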
The idea of this system is to overcome the limitations of existing visual question answering (VQA) approaches, which take as input an image and a question about its content and aim to answer that question correctly (see the following figure). VQA systems are limited in that they:
- can answer questions about the image content (visual questions) with only a few words;
- cannot answer questions about the image that involve external information (contextual questions) not inferable from the image content.
However, in the Cultural Heritage domain contextual questions are very frequent (e.g. "When was the painting painted?", "Who is the author?").
The design of the first type of chatbot implemented in the Multimedia Chatbot application thus follows the schema represented in the next figure.
Multimedia chatbot: example of visual question answering (VQA) for cultural heritage, answering a question related to the context of the artwork
Multimedia chatbot: example of visual question answering (VQA) for cultural heritage, answering a question related to the content of the artwork
The question classifier network processes only the textual information given by the question. It has the structure of a Transformer model followed by a classification head, and has been trained on questions from both the VQA v2 and OK-VQA datasets.
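A classification head of this kind can be sketched in a few lines of NumPy: the Transformer's token embeddings are pooled, passed through a linear layer, and turned into class probabilities with a softmax. The weights here are random stand-ins, not trained parameters:

```python
import numpy as np

# Sketch of a classification head over Transformer token embeddings:
# mean-pool the token vectors, apply a linear layer, softmax over the
# two classes (visual / contextual). Weights are random stand-ins.
rng = np.random.default_rng(0)
hidden_dim, num_classes = 8, 2
W = rng.normal(size=(hidden_dim, num_classes))
b = np.zeros(num_classes)

def classify(token_embeddings: np.ndarray) -> np.ndarray:
    pooled = token_embeddings.mean(axis=0)        # (hidden_dim,)
    logits = pooled @ W + b                       # (num_classes,)
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()

probs = classify(rng.normal(size=(5, hidden_dim)))  # 5 tokens of a question
```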
The VQA network extracts the salient region features of the image using Faster R-CNN (pretrained on the Visual Genome dataset). It uses an attention mechanism to filter the image regions according to the input question and has been trained on examples from the VQA v2 dataset.
The QA network uses an attention mechanism to find the answer to the question in the text. It has the structure of a Transformer model and has been trained on the SQuAD dataset.
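Both the QA and the VQA networks rely on attention to weight their inputs (text tokens or image regions) against the question. A minimal NumPy sketch of scaled dot-product attention, the core operation of these Transformer-style models:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weights the values V by the similarity between queries Q and keys K.
    In the QA network, Q comes from the question and K/V from the context
    text; in the VQA network, K/V come from image-region features."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = rng.normal(size=(1, 4))   # one question vector
K = rng.normal(size=(6, 4))   # six context tokens or image regions
V = rng.normal(size=(6, 4))
out, weights = scaled_dot_product_attention(Q, K, V)
```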
The following figures show some qualitative results obtained by the system, highlighting mistakes in red.
Multimedia chatbot: qualitative results of the question classifier, question answering network and visual question answering network.
In addition to this first type of chatbot system, following the emergence and success of neural networks based on GPT architectures and training, a second chatbot engine has been added to the backend, using a GPT-based neural network. A paper describing the use of this type of neural network has been published in a workshop dedicated to applications of AI and computer vision to cultural heritage, held at one of the foremost conferences on computer vision (European Conference on Computer Vision 2022).
This second system produces longer answers than the first approach, resulting in a more natural interaction. To cope with the fact that GPT-based systems tend to generate text that sounds plausible only from a linguistic point of view but is not grounded in actual knowledge, the system has been designed using prompt-engineering techniques that force the neural network to adhere to the contextual and content-based information provided by the JSON files containing the actual details of the artwork.
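The grounding step can be illustrated as follows: the artwork's JSON record is serialized into the prompt, and the model is instructed to answer only from that material. The JSON field names and the prompt wording below are illustrative assumptions, not the project's actual prompts:

```python
import json

def build_prompt(artwork_json: dict, question: str) -> str:
    """Grounds a GPT-style model in the artwork's JSON data by embedding
    that data in the prompt and instructing the model to use only it.
    Field names and wording are hypothetical examples."""
    facts = json.dumps(artwork_json, indent=2)
    return (
        "You are a museum guide. Answer the visitor's question using ONLY "
        "the facts below. If the answer is not in the facts, say you do not know.\n\n"
        f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"
    )

artwork = {"title": "Primavera", "author": "Sandro Botticelli", "year": "c. 1480"}
prompt = build_prompt(artwork, "Who painted this artwork?")
```

The resulting string would then be sent to the GPT-based model, which is discouraged from inventing facts not present in the JSON data.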
Examples of the interface of the multimedia chatbot for mobile devices are shown in the following figures, together with examples of the web application designed for PC browsers.
VIOLA multimedia chatbot: landing page and answering questions related to the context of an artwork
The source code of the app is available on the GitHub of ReInHerit: https://github.com/ReInHerit/multimedia-chatbot
A demo is available at this link: