Multi-Model Learning

Table of contents⚓︎

Image to Text Mapping
Text to Image Mapping
Image Text Transformer Pretraining
Vision LLMs

What is multi-model learning ?

Mapping Image to Image then CNN itself can learn the mapping. Mapping Text to Text then RNN itself can learn the mapping. What is the possible architecture for learning the Image to text mapping ?

Images are suitable for the CNN and while Texts are for RNN Therefore if we consider an encoder-decoder based architecture then we needed to connect the CNN + RNN such that they learning the Image to text mapping.