Google AI Introduces ‘WIT’, a Wikipedia-Based Image Text Dataset for Multimodal Multilingual Machine Learning
Image and text datasets are widely used in many machine learning applications. To model the relationship between images and text, most multimodal visio-linguistic models today rely on large datasets. Historically, these datasets were created either by manually captioning images or by crawling the web and extracting alt text as a caption. While the former approach produces higher-quality data, the intensive manual annotation limits how much data can be produced. The automated extraction approach can yield larger datasets, but it requires either heuristics and careful filtering to ensure data quality, or scaling up models to achieve robust performance.
To overcome these limitations, Google’s research team created a large, high-quality, multilingual dataset called the Wikipedia-Based Image Text (WIT) dataset. It was built by extracting multiple pieces of text associated with each image from Wikipedia articles and Wikimedia image links.
The researchers aimed to create a large dataset without sacrificing quality or conceptual coverage, so they started with the largest online encyclopedia available today: Wikipedia. They selected images from Wikipedia pages and extracted various image-text associations along with their surrounding contexts. The result is a curated set of 37.5 million entity-rich image-text examples in 108 languages, featuring 11.5 million unique images.
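The announcement does not spell out the extraction tooling, but conceptually the pipeline walks a Wikipedia page and pairs each image with nearby text such as captions, alt text, and the enclosing section. A minimal, hypothetical sketch of that idea (using BeautifulSoup, which is not necessarily what the team used) could look like this:

```python
# Hypothetical sketch: pair images on a Wikipedia page with nearby text.
# This is NOT Google's pipeline; it only illustrates the kind of
# image-text associations described above.
from dataclasses import dataclass
from bs4 import BeautifulSoup


@dataclass
class ImageTextExample:
    image_url: str
    alt_text: str
    caption: str
    section_title: str
    page_title: str


def extract_examples(page_html: str) -> list[ImageTextExample]:
    soup = BeautifulSoup(page_html, "html.parser")
    page_title = soup.title.get_text(strip=True) if soup.title else ""
    examples = []
    for img in soup.find_all("img"):
        # Caption: text of the enclosing <figure>'s <figcaption>, if any.
        figure = img.find_parent("figure")
        caption = ""
        if figure and figure.find("figcaption"):
            caption = figure.find("figcaption").get_text(strip=True)
        # Section context: nearest preceding heading on the page.
        heading = img.find_previous(["h2", "h3"])
        section_title = heading.get_text(strip=True) if heading else ""
        examples.append(
            ImageTextExample(
                image_url=img.get("src", ""),
                alt_text=img.get("alt", ""),
                caption=caption,
                section_title=section_title,
                page_title=page_title,
            )
        )
    return examples
```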
They used a rigorous filtering process to further refine the data and ensure its quality. The process includes the following steps (a rough sketch of such a pipeline follows the list):
- Text-based filtering to ensure the availability, length, and quality of captions
- Image-based filtering to ensure that each image meets a minimum size and carries a permissible license
- Image- and text-entity-based filtering to ensure suitability for research (e.g., excluding examples classified as hate speech)
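The filtering code itself is not published in the announcement; the sketch below only illustrates how such checks might be chained, and the thresholds, license list, and field names are illustrative assumptions rather than the dataset’s actual rules:

```python
# Hypothetical filtering sketch; thresholds, licenses, and field names are
# illustrative assumptions, not the dataset's published criteria.
MIN_CAPTION_CHARS = 10          # text-based filter: caption length/quality
MIN_IMAGE_HEIGHT = 100          # image-based filter: minimum size
MIN_IMAGE_WIDTH = 100
ALLOWED_LICENSES = {"cc-by", "cc-by-sa", "public-domain"}


def passes_text_filter(example: dict) -> bool:
    caption = (example.get("caption") or "").strip()
    return len(caption) >= MIN_CAPTION_CHARS


def passes_image_filter(example: dict) -> bool:
    return (
        example.get("height", 0) >= MIN_IMAGE_HEIGHT
        and example.get("width", 0) >= MIN_IMAGE_WIDTH
        and example.get("license") in ALLOWED_LICENSES
    )


def passes_entity_filter(example: dict, blocked_entities: set[str]) -> bool:
    # Entity-based filter: drop examples whose image or text entities fall
    # into disallowed categories (e.g., hate speech).
    return not (set(example.get("entities", [])) & blocked_entities)


def filter_examples(examples, blocked_entities):
    return [
        ex for ex in examples
        if passes_text_filter(ex)
        and passes_image_filter(ex)
        and passes_entity_filter(ex, blocked_entities)
    ]
```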
Human editors reviewed a random sample of image-caption sets and overwhelmingly agreed, for about 98 percent of the samples, that the images and their captions were well aligned.
The WIT dataset offers multiple benefits, a few of which are listed below:
- Highly multilingual: WIT is the first large-scale multilingual and multimodal dataset, with data in 108 languages.
- First contextual image-text dataset: Most multimodal datasets provide only a single text caption (or several copies of a similar caption) for each image. WIT is the first dataset to include contextual data, which can help researchers model the effect of context on captions and on image selection.
The main textual fields of WIT, the image captions and the contextual information in particular, could be especially valuable for research, and WIT offers broad coverage across each of them.
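For a concrete sense of what these fields contain, a single WIT-style example could look roughly like the following; the key names paraphrase the fields described above and are not necessarily the exact column names of the released files:

```python
# A hypothetical WIT-style example; key names paraphrase the fields
# described in the text and may differ from the released schema.
example = {
    "language": "en",
    "page_title": "Golden Gate Bridge",
    "section_title": "History",
    "image_url": "https://upload.wikimedia.org/...",  # placeholder
    # Caption-like text directly associated with the image:
    "caption_reference_description": "The bridge under construction in 1935",
    "caption_alt_text": "Black-and-white photo of a partially built bridge",
    # Contextual text drawn from the surrounding article:
    "context_page_description": "The Golden Gate Bridge is a suspension bridge...",
    "context_section_description": "Construction began on January 5, 1933...",
}

# The contextual fields are what allow researchers to study how surrounding
# text influences captioning and image selection.
caption = example["caption_reference_description"]
context = example["context_section_description"]
```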
- A high-quality training set and a challenging benchmark: Thanks to Wikipedia’s broad coverage of diverse concepts, the WIT evaluation sets serve as a difficult benchmark even for state-of-the-art methods. The WIT test set shows mean recall scores in the 40s for well-resourced languages and in the 30s for under-resourced languages (a generic recall@k sketch follows this list).
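The recall figures above come from cross-modal retrieval evaluation. As a generic reference (not the official evaluation code for the dataset or the competition), mean recall@k for a retrieval task can be computed along these lines:

```python
import numpy as np


def mean_recall_at_k(similarity: np.ndarray, k: int = 1) -> float:
    """Mean recall@k, assuming query i's correct match is candidate i.

    `similarity` is an (n_queries, n_candidates) score matrix, e.g.
    caption-to-image similarities produced by a retrieval model.
    """
    # Indices of the top-k candidates for each query, highest score first.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    correct = np.arange(similarity.shape[0])[:, None]
    hits = (top_k == correct).any(axis=1)
    return float(hits.mean())


# Toy example: queries 0 and 2 rank their matching candidate first,
# query 1 ranks it second, so recall@1 is 2/3.
scores = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
])
print(mean_recall_at_k(scores, k=1))  # ~0.667
```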
The team believes the proposed data set will help researchers develop stronger and more robust models.
The WIT dataset is available for download and use under a Creative Commons license. Together with Wikimedia Research and external collaborators, Google AI has announced that it will soon run a Kaggle competition using the WIT dataset.
To facilitate research in this area, Wikimedia has made available images at 300-pixel resolution and ResNet-50-based image embeddings for the majority of the training and test sets. In addition to the WIT dataset, Kaggle will host all of this image data and provide Colab notebooks. Participants will also have access to a Kaggle discussion forum to share code and communicate, allowing anyone interested in multimodality to get started and run experiments with ease.
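As a starting point for experiments, one shard of the released data can be read like any large tab-separated file; the sketch below assumes a locally downloaded, gzip-compressed WIT-style shard (the filename is a placeholder, not an official path) and uses pandas:

```python
import pandas as pd

# Placeholder path to one locally downloaded WIT shard; the actual file
# names come from the official release and are not reproduced here.
WIT_SHARD = "wit_train_shard_0.tsv.gz"

# Reading in chunks keeps memory usage manageable for multi-gigabyte shards.
english_rows = []
for chunk in pd.read_csv(WIT_SHARD, sep="\t", compression="gzip", chunksize=100_000):
    # Keep only English examples as a simple starting filter,
    # assuming the shard exposes a per-row language column.
    if "language" in chunk.columns:
        chunk = chunk[chunk["language"] == "en"]
    english_rows.append(chunk)

english = pd.concat(english_rows, ignore_index=True)
print(len(english), "English examples loaded")
```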