An LLM instantly generates a 3D world from a single sentence! The still-unreleased code has already received 300+ stars! It may trigger a revolution in the 3D modeling industry


Reprinted from: Xinzhiyuan | Editors: Run, So Sleepy

[Introduction] Recently, researchers from the Australian National University, the University of Oxford, and the Beijing Academy of Artificial Intelligence (BAAI) proposed an LLM-driven agent framework that can generate complex 3D scenes from text prompts. Are the seemingly omnipotent large models really about to start creating 3D worlds?

Following the viral popularity of AI text-to-image and text-to-video generation, text-to-3D-scene technology has arrived!


With a prompt of fewer than 30 words, a 3D scene like this can be generated in an instant.


The generated scene matches the text requirements almost exactly: "The lake surface is as calm as glass, reflecting the cloudless sky, with the surrounding mountains and water birds mirrored in the lake."


“The scorching sun shines on the endless desert, and the stubbornly growing plants cast obvious shadows. The strong wind carves the small sand dunes into a piece of golden land.”

The generated scene also supports continuous modification and editing of its individual elements!

After seeing the effect, netizens exclaimed, “I have been waiting for this moment my whole life!”


The research team plans to release the project code on GitHub once the paper is accepted, yet even before release the repository has already received 141 stars!



This project is "3D-GPT", a system developed by researchers from the Australian National University, the University of Oxford, and the Beijing Academy of Artificial Intelligence (BAAI). It can generate a variety of 3D models and scenes directly from user-provided text descriptions.


Project address: https://chuny1.github.io/3DGPT/3dgpt.html

Unlike text-to-image generation, which relies on standalone generative models, 3D-GPT takes advantage of the multi-modal and reasoning capabilities of large language models (LLMs) to decompose the 3D modeling task into multiple sub-tasks, each handled by a different agent: a task scheduling agent, a conceptualization agent, and a modeling agent.


The researchers say 3D-GPT positions LLMs as skilled problem solvers, breaking down procedural 3D modeling tasks into accessible parts and specifying appropriate agents for each task.


Moreover, the entire system is training-free: it goes from text to parameter extraction to 3D modeling without any additional training.

Specifically, the task scheduling agent is responsible for selecting the appropriate procedural generation functions based on the instructions, while the conceptualization agent reasons about the textual description and fills in missing details.


The modeling agent then infers the function parameters, generates Python code, and drives the 3D modeling software Blender through its API to carry out the modeling.
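To make the division of labor concrete, below is a minimal Python sketch of how such a three-agent pipeline could be wired together. The class names, prompt wording, and the handle_instruction helper are illustrative assumptions rather than the authors' released code; "llm" stands in for any text-in, text-out model call.

# Hypothetical sketch of the three-agent flow described above; not the
# authors' implementation. "llm" is any callable mapping a prompt string
# to a completion string.
from dataclasses import dataclass, field

@dataclass
class SceneState:
    instructions: list = field(default_factory=list)       # full editing history
    selected_functions: list = field(default_factory=list)
    blender_script: str = ""

def task_scheduling_agent(llm, instruction, function_docs):
    """Select which procedural generation functions the instruction needs."""
    prompt = (f"Instruction: {instruction}\nFunctions:\n{function_docs}\n"
              "List the names of the functions to call, separated by spaces.")
    return llm(prompt).split()

def conceptualization_agent(llm, instruction):
    """Enrich the short description with the missing appearance details."""
    return llm(f"Add concrete visual details to this scene description:\n{instruction}")

def modeling_agent(llm, enriched_text, functions):
    """Infer function parameters and emit Python code that calls Blender's API."""
    return llm(f"Scene: {enriched_text}\nWrite a Python script for Blender that "
               f"calls {functions} with parameters inferred from the scene.")

def handle_instruction(llm, instruction, function_docs, state=None):
    state = state or SceneState()
    state.instructions.append(instruction)                  # remembered for later edits
    state.selected_functions = task_scheduling_agent(llm, instruction, function_docs)
    enriched = conceptualization_agent(llm, instruction)
    state.blender_script = modeling_agent(llm, enriched, state.selected_functions)
    return state

Each new instruction would be passed to handle_instruction with the same SceneState, which is one plausible way the editing history described later in the article could be retained.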


This system is seamlessly integrated with Blender and supports various operations such as object deformation, material adjustment, mesh editing, and physical simulation.


The 3D-GPT framework can also expand brief user-provided scene descriptions into more detailed, context-rich ones, and it integrates procedural generation so that parameters extracted from the enriched text drive the 3D modeling software.

Because LLMs provide strong semantic understanding and contextual reasoning, 3D-GPT can generate a wide variety of 3D assets and supports continuous, targeted editing and modification.


3D-GPT enables fine-grained object control, capturing shapes, curves, and details for precise modeling, and it can also control the generation of large scenes.


Moreover, 3D-GPT supports a continuous stream of instructions for editing and modifying scenes: the system remembers previous modifications and connects new instructions with the scene context, so users can keep refining the generated scene.


3D-GPT also supports editing a single element or function through natural language. For example, users can modify just the weather effect by changing the input requirements.



3D-GPT

Task Definition

The overall goal is to generate 3D content based on a series of natural language instructions.

Among them, the initial instruction L0 serves as a comprehensive description of the 3D scene, such as “a foggy spring morning, dew-kissed flowers dotted the lush grass surrounded by newly sprouted trees.”

Subsequent instructions modify the existing scene, for example "change the white flowers to yellow flowers" or "convert the scene to a winter environment".

To accomplish this goal, the researchers introduced a framework called 3D-GPT, which enables large language models (LLMs) to act as problem-solving agents.

Model preparation

The researchers point out that having an LLM directly create every element of 3D content poses significant challenges. Lacking specialized pre-training data, LLMs may struggle with proficient 3D modeling, and they may therefore have difficulty judging which elements should be modified for a given instruction and how to modify them.

To address this, the framework builds on Infinigen, a Python/Blender-based procedural generator from prior research that comes with a rich library of generation functions.

To enable LLMs to use Infinigen proficiently, the researchers provide key cues for each function, including its documentation, easy-to-understand code, the information it requires, and usage examples.
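As a rough illustration of how those cues could be organized, the record below bundles documentation, signature, required information, and a usage example for one function; the field names and the add_flowers entry are hypothetical, not actual Infinigen documentation.

# Hypothetical per-function cue record pasted into the LLM's context.
# Field names and the example function are illustrative only.
FUNCTION_DOCS = {
    "add_flowers": {
        "documentation": "Scatter flower instances over the terrain.",
        "signature": "add_flowers(density: float, petal_color: tuple, height_cm: float)",
        "required_info": ["flower species or color", "coverage density"],
        "usage_example": "add_flowers(density=0.4, petal_color=(1.0, 1.0, 0.2), height_cm=25)",
    },
}

def build_context(function_names):
    """Concatenate the cue records of the selected functions into prompt text."""
    blocks = []
    for name in function_names:
        doc = FUNCTION_DOCS[name]
        blocks.append(f"{name}: {doc['documentation']}\n  {doc['signature']}\n"
                      f"  example: {doc['usage_example']}")
    return "\n".join(blocks)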

By providing these resources, the researchers let LLMs exercise their core capabilities in planning, reasoning, and tool use, so they can drive Infinigen for language-instructed 3D generation in a seamless and efficient process.

Multi-agent system for 3D reasoning, planning and tool usage

After tool preparation is complete, 3D-GPT uses a multi-agent system to handle procedural 3D modeling tasks.

The system contains three core agents: task scheduling agent, conceptualization agent and modeling agent, as shown in Figure 1 below.

[Figure 1]

Together, they break down procedural 3D modeling tasks into manageable pieces, with each agent focusing on a different aspect: 3D reasoning, planning, and tool usage.

The task scheduling agent plays a key role in planning: it queries the function documentation with the user instruction and then selects the functions needed for execution.

Once a function is selected, the conceptualization agent enriches the user-provided textual description with reasoning.

On this basis, the modeling agent infers the parameters of each selected function and generates Python scripts that call Blender's API, producing the corresponding 3D content. Blender's rendering capabilities can then be used to generate images.

Task scheduling agent for planning

The task scheduling agent has comprehensive information about all available functions F and can efficiently identify the functions required for each input instruction. For example, given the instruction "convert the scene to a winter environment", it pinpoints functions such as add_snow_layer() and update_trees().
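As a toy illustration of this selection step (only the two function names above come from the article's example; the prompt wording and the JSON reply format are assumptions), the scheduler can return just the documentation of the relevant functions:

import json

def schedule(llm, instruction, all_function_docs):
    """Return only the docs of the functions relevant to this instruction."""
    prompt = (
        f"Available functions: {', '.join(all_function_docs)}\n"
        f"Instruction: {instruction}\n"
        "Reply with a JSON list of the required function names."
    )
    selected = json.loads(llm(prompt))   # e.g. ["add_snow_layer", "update_trees"]
    # Downstream agents only see this reduced subset of the documentation,
    # which keeps their prompts short and avoids unintended edits.
    return {name: all_function_docs[name] for name in selected}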

This key role of the task scheduling agent enables efficient coordination between the conceptualization and modeling agents.

Without it, the conceptualization and modeling agents would have to analyze all of the provided functions F for every instruction, which not only increases their workload but also prolongs processing time and may lead to unexpected modifications.

The communication process between the user, the LLM system, and the task scheduling agent is as follows:

[Figure: example dialogue between the user, the LLM system, and the task scheduling agent]

Conceptualization agent for reasoning

The user's description may not explicitly provide the appearance details required for modeling. For example, consider the description: "On a misty spring morning, dew-kissed flowers dotted a lush meadow surrounded by newly sprouted trees."


Tree modeling functions require parameters such as branch length, tree size, and leaf type, yet none of these specifics are stated directly in the given text.

When the modeling agent is asked to infer such parameters directly, it tends to fall back on simple solutions, such as using default or "reasonable" values from the parameter documentation or copying values from the prompt examples. This reduces the diversity of the generated results and complicates parameter inference. The conceptualization agent therefore first enriches the description with these appearance details, giving the modeling agent concrete information to infer parameters from.
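As a rough sketch of what this enrichment could produce for the misty-spring example, the attribute names and values below are hypothetical; they only illustrate turning a vague description into concrete, modelable details.

# Hypothetical output of the conceptualization step for the description
# above; attribute names and values are illustrative only.
SHORT_DESCRIPTION = ("On a misty spring morning, dew-kissed flowers dotted a "
                     "lush meadow surrounded by newly sprouted trees.")

ENRICHED_TREE_DETAILS = {
    "tree_size_m": 3.5,          # "newly sprouted" -> young, small trees
    "branch_length_m": 0.8,
    "leaf_type": "light-green budding leaves",
    "leaf_density": 0.3,         # sparse foliage in early spring
}

def conceptualize(llm, description, required_fields):
    """Ask the LLM to fill in the appearance details a function needs."""
    return llm(f"Description: {description}\n"
               f"Infer plausible values for: {', '.join(required_fields)}")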

Modeling agent for tool use

After conceptualization, the 3D modeling step converts the detailed natural-language description into a machine-understandable form.


Blender rendering

The modeling agent ultimately emits Python function calls with the inferred parameters, which drive Blender's node setup and rendering to produce the final 3D meshes and RGB images.
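For readers unfamiliar with scripted Blender rendering, below is a minimal sketch of the kind of bpy calls such a generated script would end with; build_scene_from_parameters is a hypothetical stand-in for the procedural generation calls, and the sample count and output path are arbitrary.

# Minimal sketch of the rendering tail of a generated script, meant to run
# inside Blender's Python environment.
import bpy

def render_scene(output_path="/tmp/scene.png", samples=128):
    scene = bpy.context.scene
    scene.render.engine = "CYCLES"            # ray-traced rendering
    scene.cycles.samples = samples
    scene.render.filepath = output_path
    bpy.ops.render.render(write_still=True)   # writes the RGB image to disk

# build_scene_from_parameters(inferred_parameters)   # hypothetical placeholder
# render_scene()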

Generation and editing experiments

The researchers’ experiments begin by demonstrating how efficient 3D-GPT is at consistently producing results that correspond to user instructions, covering a variety of scenarios involving large scenes and single objects.

They then drill down into specific examples to illustrate how the agents understand tool functionality, acquire the necessary knowledge, and apply it for precise control. To deepen this understanding, the researchers also conducted ablation studies that systematically examined the contribution of each agent in the multi-agent system.

3D Modeling

Large scene generation

The researchers investigated the ability of 3D-GPT to control modeling tools based on scene descriptions.

To conduct this experiment, the researchers used ChatGPT to generate 100 scene descriptions with the following prompt: “You are a good writer, please provide me with 10 different descriptions of natural scenes.”

The researchers collected 10 responses to this prompt (10 descriptions each) to form their dataset. Figure 2 below shows multi-view rendering results from 3D-GPT.

[Figure 2]

The results show that the researchers’ approach is able to generate large-scale 3D scenes that are broadly consistent with the provided text descriptions and demonstrate significant diversity.

Notably, all 3D results were rendered directly in Blender, so every result is a genuine 3D mesh; the approach therefore achieves exact 3D consistency and produces realistic ray-traced renderings.

Detailed control of a single category

In addition to generating large scenes from concise descriptions, the researchers evaluated 3D-GPT's ability to model individual objects, examining key factors such as curve modeling, shape control, and a deep understanding of object appearance.

To this end, the researchers present the results of fine-grained object control. This includes subtle aspects derived from input text descriptions, such as object curves, key appearance features, and color.

The researchers used random prompts to guide GPT in generating various real-world flower types. As shown in Figure 3 below, the researchers’ approach expertly models each flower type, faithfully capturing their distinct appearance.

[Figure 3]

This study highlights the potential of 3D-GPT to enable accurate object modeling and fine-grained attribute control.

Subsequent instruction editing

The researchers tested 3D-GPT’s capabilities for efficient human-agent communication and task manipulation.

Figure 4 below shows that the approach is able to understand subsequent instructions and make accurate scene-modification decisions.

[Figure 4]

Notably, unlike existing text-to-3D methods, 3D-GPT retains the memory of all previous modifications, thereby helping to connect new instructions with the context of the scene.

Additionally, the approach eliminates the need for extra networks to achieve controllable editing. This study highlights the efficiency and versatility of 3D-GPT in processing complex sequences of follow-up instructions for 3D modeling.

Single function control

To evaluate the effectiveness of 3D-GPT in tool usage, the researchers present an illustrative example that highlights the method's ability to control a single function and infer its parameters.

Figure 5 below illustrates 3D-GPT’s ability to model the appearance of the sky based on input text descriptions.

[Figure 5]

The function responsible for generating the sky texture does not map color information directly to sky appearance. Instead, it relies on the Nishita sky model, which requires an understanding of real-world sky and weather conditions in order to set its input parameters.

The researchers’ approach expertly extracts key information from text input and understands how each parameter affects the final sky appearance, as shown in Figures 5(c) and (d). These results show that the researchers’ method can effectively use a single function and infer the corresponding parameters.
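For context on what controlling the Nishita parameters looks like in practice, here is a minimal Blender Python sketch that sets up a Nishita sky texture; the specific values are illustrative guesses for a hazy, low-sun sky, not parameters inferred by 3D-GPT.

# Minimal sketch of driving Blender's Nishita sky texture via bpy.
# Parameter values are illustrative, not outputs of 3D-GPT.
import math
import bpy

world = bpy.context.scene.world
world.use_nodes = True
nodes = world.node_tree.nodes
links = world.node_tree.links

sky = nodes.new("ShaderNodeTexSky")
sky.sky_type = "NISHITA"                  # physically based sky model
sky.sun_elevation = math.radians(8.0)     # low sun -> warm tones near the horizon
sky.sun_rotation = math.radians(135.0)
sky.air_density = 1.5                     # more scattering -> hazier sky
sky.dust_density = 2.0
sky.ozone_density = 1.0

# Feed the sky texture into the world background shader.
background = nodes["Background"]
links.new(sky.outputs["Color"], background.inputs["Color"])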

References:

https://chuny1.github.io/3DGPT/3dgpt.html
