Google releases PaLI-3, a visual language model with only 5B parameters: smaller, faster, and stronger


Source: Heart of the Machine
This article is about 3,000 words and takes roughly 5 minutes to read.
This article introduces the Google visual language model PaLI-3.

In the field of large multi-modal (visual language) models, while many teams race to scale up parameters to win on performance, pursuing smaller parameter counts, faster speed, and stronger performance is another research path.

In the era of large models, the parameter counts of visual language models (VLMs) have grown to tens or even hundreds of billions, allowing performance to keep rising. At the same time, smaller-scale models remain important: they are easier to train and serve, more environmentally friendly, and enable faster research cycles for model design.

In this field, Google Research launched a model called PaLI (Pathways Language and Image) last year. As a large multi-modal model, one of PaLI's key design choices is to reuse large single-modal backbones for language and vision: the 13B-parameter mT5-XXL for language, and the 2B-parameter ViT-G and 4B-parameter ViT-e for vision. At the time, PaLI achieved performance superior to most models, old and new.

Since then, Google has continued to focus on smaller-scale modeling and recently proposed PaLI-3, the third-generation model in the PaLI series. With a pre-trained baseline model of only 5B parameters, the team refined the training method and achieved competitive, and in several cases new SOTA, results on multiple VLM benchmarks.

The method consists of three main parts: contrastive pre-training of the image encoder on web-scale image-text data, an improved data mixture for PaLI's multi-modal training, and training at higher resolution.


The authors are from Google Research, Google DeepMind and Google Cloud.

Paper address: https://arxiv.org/pdf/2310.09199.pdf

The figure below shows an overview of the 5B PaLI-3 model: images are encoded into visual tokens by the contrastively pre-trained 2B SigLIP vision model. Together with a query, these visual tokens are passed to a 3B encoder-decoder UL2 transformer, which generates the expected answer. In this setup, the contrastively pre-trained image encoder provides significantly more useful tokens than the classification-pretrained encoders used in previous PaLI models.

[Figure: Overview of the 5B PaLI-3 model]

How well does it work? PaLI-3 achieves new state-of-the-art performance on tasks requiring visually-situated text understanding and object localization, including eight visually-situated text understanding tasks and the referring expression segmentation task on the RefCOCO dataset. PaLI-3 also performs well on a range of classification-style vision tasks.

In addition, the researchers ran ablation experiments against a classification-pretrained ViT baseline, further confirming the viability of visual encoders pre-trained on noisy web-scale image-text data as a preferred alternative to training on classification data.

In addition to the 5B PaLI-3 model, the researchers also used the recently proposed SigLIP method to build a SOTA multilingual contrastive vision model scaled up to 2B parameters.

Model introduction

Architecture

At a high level, the architecture of PaLI-3 follows Chen et al. (2023b;a): a ViT model encodes images into tokens, which, together with textual input such as questions, prompts, and instructions, are passed to an encoder-decoder transformer that produces text output.

Let’s look at the visual component first. The researchers initialized the visual backbone of PaLI-3 from a ViT-G/14 model (approximately 2B parameters) pre-trained with the SigLIP recipe. In short, an image-embedding ViT-G/14 model and a text-embedding transformer model are trained jointly so that a binary classifier, applying a sigmoid cross-entropy loss to the dot products of the image and text embeddings, can accurately predict whether a given image and text correspond to each other.

This is similar to CLIP and ALIGN, but more efficient, scalable, and robust. Note that this step only serves to pre-train the ViT image-embedding component, so the text-embedding transformer is discarded when the ViT is inserted into PaLI.
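To make the idea concrete, here is a minimal PyTorch sketch of the pairwise sigmoid loss described above. The function name, batch layout, and learnable temperature/bias scalars are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, temperature, bias):
    # img_emb, txt_emb: (N, D) L2-normalized image/text embeddings for N paired examples.
    # temperature and bias are learnable scalars in the SigLIP recipe.
    logits = img_emb @ txt_emb.t() * temperature + bias  # (N, N) pairwise scores
    # +1 on the diagonal (true pairs), -1 everywhere else (mismatched pairs)
    labels = 2.0 * torch.eye(img_emb.size(0), device=logits.device) - 1.0
    # Binary sigmoid cross-entropy: -log sigmoid(label * logit), averaged over examples
    return -F.logsigmoid(labels * logits).sum() / img_emb.size(0)
```

Unlike the softmax loss used in CLIP, each image-text pair is scored independently here, which is what makes the objective easier to scale.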

Let’s look at the complete PaLI model. The per-patch outputs of the ViT image encoder, taken before pooling, form the visual tokens; these are linearly projected and concatenated with the embedded input text tokens. The combined sequence is then passed to the pre-trained 3B UL2 encoder-decoder model, which generates the text output. The text input to the model typically consists of a prompt that describes the task type and encodes any text needed for the task.
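The wiring described above can be summarized with a short sketch. The module and argument names below (proj, embed_tokens, inputs_embeds, the embedding dimensions) are stand-ins chosen for illustration, not the actual PaLI-3 code.

```python
import torch
import torch.nn as nn

class PaLI3Sketch(nn.Module):
    # vit: a SigLIP-pretrained image encoder returning per-patch tokens (no pooling)
    # ul2: a pre-trained encoder-decoder transformer with a token-embedding layer
    def __init__(self, vit, ul2, vit_dim=1536, text_dim=4096):
        super().__init__()
        self.vit = vit
        self.proj = nn.Linear(vit_dim, text_dim)  # map visual tokens into the text embedding space
        self.ul2 = ul2

    def forward(self, image, text_ids, target_ids=None):
        visual_tokens = self.proj(self.vit(image))     # (B, P, text_dim) projected patch tokens
        text_tokens = self.ul2.embed_tokens(text_ids)  # (B, T, text_dim) prompt/question embeddings
        fused = torch.cat([visual_tokens, text_tokens], dim=1)
        # The encoder-decoder consumes the fused sequence and generates the answer text
        return self.ul2(inputs_embeds=fused, labels=target_ids)
```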

Training

The training process consists of multiple stages.

Stage 0: Unimodal pre-training. The image encoder is trained at a resolution of 224×224 following the SigLIP protocol; the text encoder-decoder is a 3B UL2 model trained with the mixture-of-denoisers procedure described by Tay et al.

Stage 1: Multimodal training. The image encoder is combined with the text encoder-decoder, and the resulting PaLI model is trained on a mixture of multi-modal tasks and data. At this stage, the image encoder remains frozen and the resolution stays at 224×224. The main component of the mixture is again derived from the WebLI dataset, with heuristic filtering on text quality and the SplitCap training objective.

Stage 2: Resolution increase. Raising the input resolution is a widely used way to improve performance, both because the model can perceive more detail in the image and because the longer sequence length increases model capacity. PaLI-3's resolution is increased by unfreezing the image encoder, and checkpoints are kept at 812×812 and 1064×1064 resolution.

Task transfer. Finally, for each individual task (benchmark), the PaLI-3 model is fine-tuned on that task's training data with the ViT image encoder frozen. For most tasks the 812×812 checkpoint is fine-tuned, but for two document understanding tasks the resolution is increased to 1064×1064.
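For reference, the staged recipe above can be summarized as plain data. The field names and structure below are purely illustrative, not an official configuration.

```python
# Hypothetical summary of the training stages described in this article.
PALI3_TRAINING_STAGES = [
    {"stage": 0, "name": "unimodal pre-training",
     "image_encoder": "ViT-G/14 trained with SigLIP at 224x224",
     "text_model": "3B UL2, mixture-of-denoisers objective"},
    {"stage": 1, "name": "multimodal training",
     "image_encoder": "frozen", "resolution": 224,
     "data": "WebLI-based mixture (text-quality filtered), SplitCap objective"},
    {"stage": 2, "name": "resolution increase",
     "image_encoder": "unfrozen", "checkpoint_resolutions": [812, 1064]},
    {"stage": "transfer", "name": "per-task fine-tuning",
     "image_encoder": "frozen",
     "resolution": "812 for most tasks, 1064 for two document understanding tasks"},
]
```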

Experiments and results

The experiments first compared different ViT models under the PaLI framework. The researchers considered two ViT models: one pre-trained for classification (Classif) and one pre-trained with SigLIP.

The results, shown in Table 1, indicate that although the SigLIP model lags somewhat behind in few-shot linear classification, within PaLI-3 it delivers modest gains on simpler tasks (such as captioning and question answering) and large gains on the more complex scene-text and spatial understanding tasks.

[Table 1: Comparison of classification-pretrained and SigLIP ViT models under the PaLI framework]

In addition, the researchers evaluated PaLI-3 on the TextCaps, TextVQA, STVQA, OCRVQA, InfographicVQA, DocVQA, ChartQA, Screen2Words, and WidgetCap datasets. The results are shown in Table 2. When paired with an external OCR system, PaLI-3 is only 0.7 points behind the SOTA method; without such an external system, PaLI-3 outperforms all SOTA methods by 4.4 points. On TextCaps, TextVQA, InfographicVQA, and DocVQA, it leads by 8 points or more.

[Table 2: Results on visually-situated text understanding benchmarks]

Referring expression segmentation

The researchers extended PaLI-3 to predict segmentation masks via language-like output. To do this, they used the vector-quantized variational autoencoder (VQ-VAE) of Ning et al. (2023). The VQ-VAE is trained with a vocabulary of 128 mask tokens; its encoder can tokenize a 64×64-pixel segmentation mask into 16 mask tokens, which the decoder can convert back.

PaLI-3 is trained to predict a single segmentation mask by first outputting 4 coordinates as text, representing a bounding box, followed by 16 mask tokens representing the mask within that box.
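As a rough illustration of how such an output sequence could be turned back into a full-image mask, here is a sketch that assumes tokens holds the 4 box coordinates (in pixels) followed by the 16 mask-token ids, and that vqvae_decode turns those 16 ids into a 64×64 boolean mask. All names and the coordinate convention are assumptions made for the example.

```python
import numpy as np

def tokens_to_mask(tokens, vqvae_decode, image_size=1024):
    # tokens[:4]   -> bounding box (y0, x0, y1, x1) in pixels (assumed convention)
    # tokens[4:20] -> 16 VQ-VAE mask-token ids for the mask inside the box
    y0, x0, y1, x1 = tokens[:4]
    local = vqvae_decode(tokens[4:20])  # (64, 64) boolean mask within the box
    h, w = max(y1 - y0, 1), max(x1 - x0, 1)
    # Nearest-neighbour resize of the 64x64 mask to the box size
    ys = np.arange(h) * 64 // h
    xs = np.arange(w) * 64 // w
    resized = local[np.ix_(ys, xs)]
    # Paste the resized mask onto a blank full-resolution canvas
    full = np.zeros((image_size, image_size), dtype=bool)
    full[y0:y0 + h, x0:x0 + w] = resized
    return full
```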

Table 1 shows that contrastive pre-training is more effective than classification pre-training for this type of localization task. Table 3 below shows that the full PaLI-3 model slightly outperforms the state of the art in referring expression segmentation.

[Table 3: Referring expression segmentation results]

Image Understanding

The researchers next evaluated PaLI-3 on general visual language understanding tasks. As in previous work, they did not use an external OCR module, since these benchmarks rarely involve text in images.

The results show that, despite being much smaller than recent SOTA models, PaLI-3 delivers very strong performance on these benchmarks. On COCO captioning, PaLI-3 outperforms all models except BEiT-3 and the 17B and 55B PaLI models. On VQAv2 and TallyQA, PaLI-3 outperforms all previous models except PaLI-X. On OKVQA, PaLI-3 lags behind only PaLM-E (562B) and PaLI-X (55B), yet still outperforms the 32-shot Flamingo (80B) model.

[Table: Image understanding results]

Video captioning and QA

The study fine-tunes and evaluates the PaLI-3 model on 4 video captioning benchmarks: MSR-VTT, VATEX, ActivityNet Captions, and Spoken Moments in Time. It does the same on 3 video question answering benchmarks: NExT-QA, MSR-VTT-QA, and ActivityNet-QA.

Despite not using any video data for pre-training, PaLI-3 achieves excellent video QA results at a smaller model size: it sets a new state of the art on MSR-VTT-QA and ActivityNet-QA and achieves competitive results on NExT-QA. The consistent improvements on both image and video QA highlight the benefits of adopting a contrastively pre-trained ViT.

Additionally, PaLI-3 achieves very good video captioning results, on average only 3 CIDEr points below the SOTA. Considering its model size, PaLI-3 appears to be an excellent choice in terms of both performance and practicality.

Direct Image Encoder Evaluation

The researchers also evaluated the ViT-G image encoder on its own, that is, outside the complete PaLI-3 model; the results are shown in Table 6.

First, the study tests image classification capabilities using the standard ImageNet benchmark and its two most popular variants. The results show that SigLIP lags slightly behind in top-1 and v2 accuracy, but has comparable results in ReaL.

Second, the study reports the results of different models on the Crossmodal-3600 benchmark. The results show that the SigLIP ViT-G model significantly outperforms the larger ViT-e model.

Finally, the study also reports linear probing results, which show that SigLIP is inferior to other models.

[Table 6: Direct image encoder evaluation]

Tables 7 and 8 evaluate the model’s fairness, bias, and other potential issues.

[Tables 7 and 8: Fairness and bias evaluation]

Editor: Wang Jing
