DeepMind introduces a visual language model

By Srikanth

One measure of intelligence is how quickly a person can adapt to a new situation given only a few simple instructions. Children, for example, can recognize real animals at the zoo after seeing just a few pictures of them in a book, despite the differences between the two. Typical visual models, on the other hand, do not show this kind of flexibility. They need to be trained on thousands of examples that have been explicitly annotated for the task at hand. Having to train a new model for every new task is the biggest limitation, and the large quantity of annotated data required makes the process inefficient, costly, and resource-intensive.


Google’s DeepMind has introduced a family of machine learning models called Flamingo to address this challenge, aiming for better results with far less task-specific training. Flamingo is a single visual language model (VLM) that sets a new state of the art in few-shot learning on a wide range of open-ended multimodal tasks. It can tackle many different problems using only a few task-specific examples and no extra training.

Flamingo’s simple interface takes a prompt of multimodal data consisting of interleaved images, videos, and text as input and produces text-only output in the form of associated language. In simple terms, Flamingo can perform a task at inference time, returning text that describes or answers a question about the input, given only a few examples in the prompt rather than additional training. It outperforms all previous few-shot learning approaches, according to DeepMind researchers.
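As a rough illustration of that interface, here is a minimal Python sketch of how an interleaved few-shot prompt might be assembled. Flamingo's actual API is not described in this article, so the `Image` class, the `build_few_shot_prompt` function, and the file names below are all hypothetical; the sketch only shows the shape of the input, not how the model processes it.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Image:
    """Hypothetical stand-in for an image (or video) input, identified by a path."""
    path: str


# A prompt is an ordered sequence of visual and text segments.
Segment = Union[Image, str]


def build_few_shot_prompt(
    examples: List[Tuple[Image, str]],
    query: Image,
    instruction: str = "",
) -> List[Segment]:
    """Interleave (image, text) support examples, then append the query image.

    A visual language model would be expected to continue the sequence with
    text for the final image. Nothing here trains or updates a model.
    """
    prompt: List[Segment] = []
    if instruction:
        prompt.append(instruction)
    for image, text in examples:
        prompt.append(image)
        prompt.append(text)
    prompt.append(query)
    return prompt


# Two support examples followed by the image we want described.
support = [
    (Image("chinchilla.jpg"), "This is a chinchilla."),
    (Image("shiba.jpg"), "This is a shiba."),
]
prompt = build_few_shot_prompt(support, Image("flamingo.jpg"))
for segment in prompt:
    print(segment)
```

The point of the sketch is that the "few shots" live entirely in the prompt: swapping in different example pairs changes the task without touching the model.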

In practice, Flamingo combines pre-trained language models with powerful visual representations, linked by new architecture components. DeepMind builds Flamingo on Chinchilla, its recently released 70-billion-parameter language model. After this training, the model can be applied to visual tasks without any extra task-specific fine-tuning. The 43.3-million-item training dataset was gathered entirely from the internet and comprised a mix of complementary, unlabeled multimodal data.

The model’s qualitative capabilities were examined by having it caption photographs related to gender and skin colour, and the captions were run through Google’s Perspective API to assess text toxicity. While the preliminary findings were encouraging, the team believes that more research into evaluating ethical risks in multimodal systems is required before deployment, in order to address AI bias. Given only a few examples per task, Flamingo outperforms all previous few-shot learning approaches. The model also has certain limitations related to few-shot learning.

Flamingo is not only useful for its own tasks; it could also improve the state of machine learning in general, which is struggling with the rising energy and compute costs of training newer models. DeepMind acknowledges that the model is “computationally expensive to train,” although it does not explicitly state the energy costs of training. On the other hand, the team believes that Flamingo could be quickly adapted to low-resource settings and tasks, such as analyzing data for PII, social biases, stereotypes, and other factors. Flamingo is not yet ready for prime time, but models like it have great potential to benefit society in practical ways.
