The paradox of generative AI: “What it can create, it may not understand”

Reprinted from: Heart of the Machine | Editor: Large Plate of Chicken, Egg Sauce


Without “understanding”, “creation” is out of the question.

From ChatGPT to GPT-4, from DALL·E 2/3 to Midjourney, generative AI has attracted unprecedented global attention. Its powerful potential has raised many expectations for AI, but such powerful intelligence can also trigger fears and worries. Recently, experts have staged a fierce debate on this issue: first a "melee" among Turing award winners, with Andrew Ng joining in at the end.

In both language and vision, current generative models take only seconds to produce output that can challenge even experts with years of skill and knowledge. This seems to provide compelling support for the claim that these models have surpassed human intelligence. However, it is also important to note that model outputs often contain fundamental errors of understanding.

A paradox seems to arise: How do we reconcile the seemingly superhuman abilities of these models with the persistence of fundamental errors that most humans can correct?

Recently, the University of Washington and the Allen Institute for AI jointly released a paper to study this paradox.


Paper address: https://arxiv.org/abs/2311.00059

The paper argues that this phenomenon arises because the configuration of capabilities in today's generative models diverges from the configuration of human intelligence. It proposes and tests the Generative AI Paradox hypothesis: generative models are trained to produce output comparable to that of experts, through a process that bypasses the understanding which would normally be required to generate output of that quality. For humans, the situation is very different: basic understanding is usually a prerequisite for expert-level output.

The researchers test this hypothesis through controlled experiments, analyzing generative models' ability to generate and to understand in both text and vision. They first conceptualize "understanding" in generative models from two perspectives:

  • 1) Given a generative task, to what extent can the model select the correct response in a discriminative version of the same task?

  • 2) Given a correct generated response, to what extent can the model answer questions about the content and appropriateness of that response? These two perspectives lead to two experimental setups: selective evaluation and interrogative evaluation.

The researchers found that in the selective evaluation, models tended to perform as well as or better than humans in the generative task setting, but worse than humans in the discriminative (understanding) setting. Further analysis shows that, compared with GPT-4, human discriminative ability is more closely tied to generative ability and is also more robust to adversarial inputs. Moreover, the gap in discriminative ability between models and humans widens as task difficulty increases.

Similarly, in the interrogative evaluation, although models can produce high-quality outputs across different tasks, the researchers observed that they often erred when answering questions about those outputs, and their understanding again fell short of humans'. The paper discusses a range of potential reasons for this divergence between the capability profiles of generative models and humans, including model training objectives and the size and nature of model inputs.

The significance of this research is twofold. First, it implies that existing notions of intelligence derived from human experience may not generalize to AI: even though AI appears to imitate or surpass human intelligence in many respects, its capabilities may differ in fundamental ways from the patterns we expect. Second, the findings suggest caution when studying generative models to gain insight into human intelligence and cognition, since seemingly expert-level, human-like output may mask mechanisms that are not human-like at all.

In summary, the Generative AI Paradox encourages studying models as an intriguing contrast to human intelligence rather than as a parallel to it.

"The Generative AI Paradox highlights the interesting idea that AI models can create content they may not fully understand themselves. This raises questions about the limits of AI's understanding and the potential challenges behind its powerful generative capabilities," one netizen commented.


What is the Generative AI Paradox?

We first look at the generative AI paradox and the experimental design for testing it.


Figure 1: Generative AI in language and vision can produce high-quality results. Paradoxically, however, models have difficulty demonstrating selective (A, C) or interrogative (B, D) understanding of these outputs.

Generative models appear to acquire generative abilities more efficiently than understanding, in contrast to human intelligence, where generative abilities are generally harder to acquire than understanding.

Testing this hypothesis requires operational definitions of the paradox's components. First, for a given model and task t, with human intelligence as the baseline, what does it mean for generation to be acquired "more effectively" than understanding? Taking g and u as performance measures for generation and understanding respectively, the researchers formalize the Generative AI Paradox hypothesis as:

g_t(human) = g_t(model)  ⟹  u_t(human) − u_t(model) > ε,  for some reasonably large ε > 0

Simply put, for a task t, if human generation performance g matches the model's, then human understanding performance u will be significantly higher than the model's (for some reasonably large ε > 0). In other words, the models understand less well than one would expect of humans with comparably strong generative ability.
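To make the inequality concrete, here is a minimal Python sketch of how one might check the condition for a single task. It is not code or data from the paper; the function name, the tolerance for treating generation performance as "matched", and the example scores are all hypothetical.

```python
# A minimal sketch (not from the paper) of checking whether a single task
# exhibits the hypothesized pattern. The tolerance used to decide that
# generation performance is "matched" and the example numbers below are
# purely illustrative.

def paradox_holds(g_human: float, g_model: float,
                  u_human: float, u_model: float,
                  epsilon: float = 0.1, g_tolerance: float = 0.05) -> bool:
    """True if generation is roughly matched but human understanding
    exceeds model understanding by more than epsilon."""
    generation_matched = abs(g_human - g_model) <= g_tolerance
    return generation_matched and (u_human - u_model) > epsilon

# Fabricated example scores, purely for illustration:
print(paradox_holds(g_human=0.80, g_model=0.82, u_human=0.95, u_model=0.70))
# -> True: the paradox pattern holds for this hypothetical task
```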

The operational definition of generation is straightforward: given a task input (a question or prompt), generation means producing observable content that satisfies that input. Performance g can therefore be evaluated (e.g., for style, correctness, or preference) automatically or by humans. Understanding, by contrast, is not defined by a single observable output, but it can be tested by clearly defining its effects:

  1. Selective evaluation. For a given task for which a response can be generated, to what extent can the model also select an accurate answer from the set of provided candidates in a discriminative version of the same task? A common example is multiple-choice question answering, which is one of the most common ways to test human understanding and natural language understanding in language models. (Figure 1, columns A and C)

  2. Interrogative (question-based) evaluation. Given a generated model output, how accurately can the model answer questions about the content and appropriateness of that output? This is similar to an oral examination in education. (Figure 1, columns B and D).

These definitions of understanding provide a blueprint for assessing the “generative AI paradox,” allowing researchers to test whether Hypothesis 1 holds across different modalities, tasks, and models.
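As a rough illustration of how the two setups could be instantiated, the sketch below shows one possible prompt construction for a generative task, its selective (multiple-choice) counterpart, and the interrogative follow-up. The `query_model` helper, the prompt wording, and the overall structure are hypothetical; the paper's actual prompts, datasets, and scoring procedures differ.

```python
# A hypothetical sketch of the two understanding probes; `query_model` stands
# in for whatever text-generation API is being evaluated and is not a real
# library call.

def query_model(prompt: str) -> str:
    raise NotImplementedError("Plug in the model under evaluation here.")

def generative_task(task_input: str) -> str:
    # Generation: the model produces a free-form response to the task input.
    return query_model(f"Task: {task_input}\nWrite your answer:")

def selective_evaluation(task_input: str, candidates: list[str]) -> str:
    # Selective (discriminative) version of the same task: the model must
    # pick the correct answer from provided candidates, multiple-choice style.
    options = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates, start=1))
    return query_model(
        f"Task: {task_input}\nChoose the number of the best answer:\n{options}"
    )

def interrogative_evaluation(task_input: str, model_output: str,
                             question: str) -> str:
    # Interrogative version: ask the model a question about the content and
    # appropriateness of an output it has already generated.
    return query_model(
        f"Task: {task_input}\nGenerated answer: {model_output}\n"
        f"Question about this answer: {question}"
    )
```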

When models can generate, can they also discriminate?

First, the researchers conducted a side-by-side analysis of generative and discriminative variants of each task in the selective evaluation, assessing the models' generation and comprehension abilities in both the language and vision modalities, and compared both against human performance.

Figure 2 below compares the generation and discrimination performance of GPT-3.5, GPT-4, and humans. In 10 of the 13 datasets, at least one model supports sub-hypothesis 1: its generative performance exceeds humans' while its discriminative performance falls below humans'. Sub-hypothesis 1 held for both models on 7 of the 13 datasets.

[Figure 2]

It is unrealistic to ask humans to generate detailed images as well as visual models do, and ordinary people cannot match the stylistic quality of models such as Midjourney, so humans are assumed to have lower generative performance; here only the models' generation and discrimination accuracy are compared with human discrimination accuracy. Similar to the language domain, Figure 3 shows that CLIP and OpenCLIP also fall short of human accuracy in discriminative performance. Under the assumption that humans are less capable of generation, this is consistent with sub-hypothesis 1: visual AI exceeds the human average in generation but lags behind in understanding.

[Figure 3]

Figure 4 (left) compares GPT-4 with humans. It can be seen that models tend to make the most errors on the discriminative task when the answers are lengthy and challenging, such as summarizing long documents. In contrast, humans maintain consistently high accuracy across tasks of varying difficulty.

Figure 4 (right) compares the discriminative performance of OpenCLIP and humans at different levels of difficulty. Taken together, these results highlight that humans can identify correct answers even when faced with challenging or adversarial examples, whereas this ability is notably weaker in the models. This discrepancy raises questions about how well these models truly understand.

[Figure 4]

Figure 5 illustrates a noteworthy trend: evaluators tend to favor GPT-4 responses over human-generated responses.

[Figure 5]

Can the model understand the results it generates?

The previous section showed that models are often good at generating accurate answers but lag behind humans on discriminative tasks. Now, in the interrogative evaluation, the researchers study the extent to which a model can demonstrate a meaningful understanding of what it generates, something humans excel at, by asking the model questions directly about its own output.

[Figure 6]

Figure 6 (left) shows the results for the language modality. While the models performed well at generation, they often made mistakes when answering questions about their own generations, indicating lapses in understanding. Even granting that humans cannot generate such text at the same speed or scale, humans are consistently more accurate than the models at question answering, despite the questions being about the models' own output. As stated in sub-hypothesis 2, humans would be expected to be even more accurate on text they had generated themselves. It is also worth noting that the humans in this study were not experts, and producing text as sophisticated as the model output would be a considerable challenge for them.

The researchers therefore expect that the gap in understanding one's own generated content would widen even further if the models were compared with human experts, who would likely answer such questions with near-perfect accuracy.

Figure 6 (right) shows the interrogative results for the vision modality. Image understanding models are still not as accurate as humans at answering simple questions about elements of generated images. At the same time, SOTA image generation models surpass most ordinary people in the quality and speed of image generation (ordinary people would struggle to produce similarly realistic images), which again points to a relative gap between visual AI and humans: stronger at generation, weaker at understanding. Surprisingly, the gap between the simpler models and humans is smaller than that of state-of-the-art multimodal LLMs (i.e., Bard and BingChat), which show some impressive visual understanding capabilities but still struggle to answer simple questions about the generated images.

For more research details, please refer to the original paper.
