News

A vision encoder is the component that lets many leading LLMs work with images uploaded by users.
The LLM is typically pre-trained. For instance, LLaVA uses CLIP ViT-L/14 as its image encoder and Vicuna as its LLM decoder. Vicuna is itself LLaMA fine-tuned on conversations from ShareGPT. Both the ViT ...
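
To make the encoder/decoder pairing concrete, here is a minimal sketch of how many tokens an image adds to the decoder's input. The "14" in ViT-L/14 is the patch size. The 224-pixel input resolution, the 32-token text prompt, and the names `VisionEncoderSpec` and `multimodal_sequence_length` are assumptions made for illustration, not details from the text above. The sketch also assumes every patch token is passed to the decoder alongside the text tokens and ignores any extra class token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionEncoderSpec:
    """Shape information for a ViT-style image encoder (illustrative only)."""
    patch_size: int  # side length of each square patch, in pixels
    image_size: int  # side length of the square input image, in pixels (assumed)

    def num_patch_tokens(self) -> int:
        # A ViT cuts the image into a grid of non-overlapping patches;
        # each patch becomes one token.
        per_side = self.image_size // self.patch_size
        return per_side * per_side


def multimodal_sequence_length(spec: VisionEncoderSpec, num_text_tokens: int) -> int:
    """Decoder input length if all patch tokens sit alongside the text tokens."""
    return spec.num_patch_tokens() + num_text_tokens


# Patch size 14 comes from the model name; the 224-pixel input is an assumption.
clip_vit_l14 = VisionEncoderSpec(patch_size=14, image_size=224)
image_tokens = clip_vit_l14.num_patch_tokens()
total_tokens = multimodal_sequence_length(clip_vit_l14, num_text_tokens=32)
print(image_tokens, total_tokens)
```

Under these assumptions a single image costs far more context than a short prompt, which is one reason the choice of input resolution and patch size matters when pairing a vision encoder with an LLM.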