Challenges of LLM Code Generators [Early Observations from GDELT]

A growing body of research is beginning to question the quality of code generated by large language models (LLMs), even as the tech industry continues to roll out more and more tools designed to augment or even replace human coders. As we (GDELT) continue to explore and evaluate more of these tools, here are some of our early observations.

Overall, we found that the usefulness of code generators is limited to basic tasks in a handful of languages, notably Python and HTML + JavaScript. Surprisingly, despite the ubiquity of shell scripts in almost every real-world workflow, code generators are generally terrible at all but the most menial shell tasks. Most striking of all, the coding assistants offered by big tech companies are often nearly useless when it comes to those companies’ own software packages, failing to output code capable of even the most basic tasks.

When asked to generate a simple demo of a major library developed by its parent company, one code generator kept outputting random permutations of parameters and pipelines that had nothing to do with the library, demonstrating the critical importance of RLHF and other training and tuning methods for these models. Contrast that with a competing generator, which could take just the user manual for the same library and, when asked to generate various functions from it, produce working code in most cases.

Ask a code generator to produce a Python script that reads a CSV file with a specific set of headers into a Pandas dataframe, and any major tool will produce perfectly passable code. Ask it to do the same thing in a less common language, and the result is usually gibberish. Software commonly used from Python generally has strong support, but even tier-1 software confined to a specific domain tends to be poorly supported. For example, while ImageMagick and ffmpeg are the de facto gold standards for image and video processing, support for them is spotty.

For example, when we asked an LLM:

Change the temporary directory of ImageMagick’s ‘convert’ utility to /dev/shm

A major code generator produced the following results across four separate runs:

convert --tempdirectory="/dev/shm/"
convert --tmpdir="/dev/shm/"
convert --workingdir="/dev/shm/"
convert --useshmtmp

Although a basic web search returns the correct answer, the generator simply tried random combinations of parameters that seemed plausible but were completely wrong. While in some cases such wrong answers can still serve as a starting point, here they offer no value in finding the correct one.
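
For reference, the actual answer is easily found in ImageMagick’s documentation: the temporary directory is controlled via an environment variable or a -define, not any of the flags the generator invented. A minimal sketch, with hypothetical input/output filenames:

# ImageMagick reads MAGICK_TMPDIR for its temporary file location
MAGICK_TMPDIR=/dev/shm convert input.jpg -resize 50% output.jpg
# Equivalent per-invocation form using a -define
convert -define registry:temporary-path=/dev/shm input.jpg -resize 50% output.jpg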

What about the noisy “json” prefix that often appears at the start of some LLMs’ JSON output? Let’s ask a major generator:

Remove json prefix from JSON string

The LLM’s answer was as follows:

jq -r . <string>

Unfortunately, this isn’t a useful answer, which is surprising given the centrality of JSON in modern data exchange and how trivial stripping prefixes and suffixes is in shell scripting.
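
For comparison, a one-line fix a developer might write, shown here as a minimal sketch with a hypothetical input string:

s='json{"status": "ok"}'
s="${s#json}"   # shell parameter expansion: drop a literal leading "json" if present
echo "$s"       # {"status": "ok"}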

How about another common JSON-related task? For our GCP Imagen image captioning demo, we need to include a Base64-encoded version of the image with each request. We used the following prompt:

Use the shell command Base64 to encode the image ‘./RUSSIA1_20230215_083000_60_minut-000014.jpg’, and set the field ‘instances[0].image.bytesBase64Encoded’ in the JSON file ‘input.json’ to the encoded image.

Running it repeatedly yielded the results below. It is immediately clear that the LLM is essentially trying different permutations of command-line options at random rather than operating from any actual understanding of the jq documentation:

jq -n -e '.instances[0].image.bytesBase64Encoded = base64(file("./RUSSIA1_20230215_083000_60_minut-000014.jpg"))' input.json > output.json
jq -n -a '.instances[0].image.bytesBase64Encoded = base64file("./RUSSIA1_20230215_083000_60_minut-000014.jpg")' input.json > output.json
jq -n -u '.instances[0].image.bytesBase64Encoded = base64("./RUSSIA1_20230215_083000_60_minut-000014.jpg")' input.json > output.json
jq -n -r .instances[0].image.bytesBase64Encoded --base64 "./RUSSIA1_20230215_083000_60_minut-000014.jpg" input.json > output.json
jq -n -b .instances[0].image.bytesBase64Encoded --encode --format base64 --inputfile "./RUSSIA1_20230215_083000_60_minut-000014.jpg" input.json > output.json
jq -n .instances[0].image.bytesBase64Encoded --arg b64 "./RUSSIA1_20230215_083000_60_minut-000014.jpg" input.json > output.json
jq -n --arg b64: base64("./RUSSIA1_20230215_083000_60_minut-000014.jpg") .instances[0].image.bytesBase64Encoded input.json > output.json
jq -n --arg b64:.instances[0].image.bytesBase64Encoded base64("./RUSSIA1_20230215_083000_60_minut-000014.jpg") input.json > output.json
jq '.instances[0].image.bytesBase64Encoded |= @base64("./RUSSIA1_20230215_083000_60_minut-000014.jpg")' input.json > output.json

For one code generator, we even supplied the relevant section of the jq user manual as context in the prompt, but to no avail.

This illustrates a key difference between developers and LLM-driven code generators: the ability to read documentation. A developer will go to jq’s documentation page to find a description of each parameter, check whether built-in Base64 encoding is supported, and if so, learn how to use it. Advanced use cases like this are often not spelled out in the documentation, so developers cross-reference examples and docs from around the web and deduce what they need. The code generator, in contrast, appears simply to shuffle together flags and fragments it has seen in other contexts, with no knowledge of what they actually do in jq or whether they even exist.
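
For reference, one working approach a developer might arrive at: do the encoding in the shell and hand the result to jq as a variable, since jq has no built-in that reads and encodes an external binary file. A minimal sketch (-w0 disables line wrapping in GNU coreutils; the flag may differ on other platforms):

b64=$(base64 -w0 ./RUSSIA1_20230215_083000_60_minut-000014.jpg)
jq --arg b64 "$b64" '.instances[0].image.bytesBase64Encoded = $b64' input.json > output.json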

Crucially for a production environment like ours, the code generated by these tools is often optimized for understandability and readability rather than safety and security. For example, when a major generator was given the prompt:

Write a PHP script that accepts a ‘keyword’ field and uses it to keyword search a MySQL database

it output the following code. Programmers new to SQL may miss the obvious security hole: the user-supplied keyword is interpolated directly into the SQL string, leaving the query open to SQL injection:

$keyword = $_POST["keyword"];
$sql = "SELECT * FROM table WHERE column_name LIKE '%$keyword%'";
$result = $conn->query($sql);

What if we add the word “secure” to the prompt?

Write a secure PHP script that accepts a ‘keyword’ field and uses it to keyword search a MySQL database

At first glance, the modified code seems perfectly reasonable, until we get to the last line. For reasons unknown, the generator produces fairly reasonable code right up to the point where it should call execute(), and then reverts to the unsafe approach:

$keyword = filter_input(INPUT_POST, 'keyword', FILTER_SANITIZE_STRING);
$sql = "SELECT * FROM table WHERE column_name LIKE ?";
$hand = $conn->prepare($sql);
$parm = "%$keyword%";
$hand->bind_param("s", $parm);
...
$result = $conn->query("SELECT * FROM table WHERE column_name LIKE '%" . $_POST["keyword"] . "%'");

This particular example is especially insidious: a novice programmer will mistakenly assume that something in the first few lines changes PHP’s default behavior globally, while an experienced coder in a hurry may skim the first few lines, find them reasonable, and never scroll down past the many comments and other code in between to see that the expected execute() call has been replaced with query().

Ironically, when we pasted the same code back in and prompted:

Are there any vulnerabilities in this code?

The generator replied: “This PHP code is technically accurate, but is vulnerable to SQL injection and uses the dangerous ‘filter_input’ function.” It’s unclear what danger it found in filter_input, but this time it correctly suggested using $hand->execute().

What about a commonly overlooked vulnerability: hardcoded passwords? Many developers save passwords in quick test scripts and then upload them to a shared public environment like GitHub. Let’s ask a code generator:

How to securely store passwords in a Python script and prevent them from being read?

Surprisingly, it offered the following, which merely obfuscates the password with Base64 encoding rather than protecting it:

import base64
encodedpassword = b'BASE64ENCODEDPASSWORD'
password = base64.b64decode(encodedpassword).decode("utf-8")
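
Base64 is an encoding, not encryption: anyone who can read the script can recover the password with a single shell command. A sketch with a hypothetical encoded value:

echo 'aHVudGVyMg==' | base64 -d   # prints the "protected" password: hunter2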

A second generator offered a more reasonable solution, storing the password in an environment variable read by the script, but incorrectly explained that “environment variables provide a secure way to make the password accessible to the script. They can only be read by the script they are intended for and cannot be accessed by any other user or script on the system.”

While this may be true if the script is run from a wrapper shell script that sets the environment variable immediately before execution, it is not true when the script runs as part of a sequence of scripts launched from the same shell wrapper: every process started from that shell inherits the variable. Programmers unfamiliar with Unix environment variables will not necessarily pick up on the caveats the generator’s explanation leaves unstated:

import os
password = os.environ.get("PASSWORD")
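
A minimal sketch of that caveat, using hypothetical script names. Every script launched from the same shell inherits the variable, not just the one it was intended for:

# wrapper.sh -- sets the variable once for everything it runs
export PASSWORD="example-secret"
python fetch_data.py        # reads os.environ["PASSWORD"] as intended
python unrelated_script.py  # ...but this script inherits it too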

Overall, our early research on code generators shows that they can be powerful assistants for programmers who need quick boilerplate code for common tasks in a handful of languages. But for the more complex challenges that send experienced programmers to StackOverflow and similar sites for guidance, code generators act as little more than the proverbial monkeys at a keyboard.

Sometimes one of the generated permutations, while wrong, is close enough to give the programmer the pointers needed to find the correct answer, but overall their usefulness in real production development is far more limited than is often portrayed. Like all LLMs, these tools will continue to improve, but the fundamental limitations of their underlying architecture suggest that companies should be very cautious when assessing their potential.

Original link: Challenges of LLM Code Generators – BimAnt