There are around 7,000 languages spoken in the world, and often there is no direct one-to-one translation from one language to another. Even when such translations exist, they may not be fully accurate, and associations and connotations can easily be lost on a non-native speaker. Pairing a text with a supporting image can help resolve this issue, but such image–text pair data does not exist for most languages; it is mostly available for high-resource languages like English and Chinese.
To address this, Google AI has released MURAL (Multimodal, Multitask Representations Across Languages), a model for image–text matching. It applies multitask learning to image–text pairs in combination with translation pairs covering over 100 languages.
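To make the multitask setup concrete, the sketch below combines an in-batch contrastive (InfoNCE) loss for image–text matching with the same loss applied to translation pairs. This is a minimal illustration in PyTorch, assuming a single shared embedding space; the random embeddings, the 512-dimensional size, the batch size, and the 0.5 task weight are all hypothetical stand-ins, not MURAL's actual configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric in-batch contrastive (InfoNCE) loss between two
    batches of embeddings, where a[i] and b[i] form a matching pair."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature       # pairwise cosine similarities
    targets = torch.arange(a.size(0))      # diagonal entries are positives
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Stand-ins for encoder outputs; in a real model these would come from
# image and text encoders projecting into one shared space.
image_emb = torch.randn(32, 512)  # images
text_emb  = torch.randn(32, 512)  # captions paired with those images
src_emb   = torch.randn(32, 512)  # source sentences of translation pairs
tgt_emb   = torch.randn(32, 512)  # their translations

# Multitask objective: image-text matching task plus a text-text
# (translation retrieval) task, combined with an illustrative weight.
loss = (contrastive_loss(image_emb, text_emb)
        + 0.5 * contrastive_loss(src_emb, tgt_emb))
```

The key design idea is that both tasks share one embedding space, so caption data from high-resource languages can, in principle, benefit languages that only have translation data.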