Model talk: running a 14B open-source RNN model (ChatRWKV) on 1.5GB of video memory

In this article, let’s talk about how to quickly get started with a 14B-parameter model that is a bit special: the RNN-based ChatRWKV.

This article covers how to get started quickly, including high-speed inference on a 4090 with 24GB of video memory, and how to run the same model with only 1.5GB of video memory.

Before we begin

If you have about 20GB of video memory, even on a consumer gaming graphics card, you can get impressive inference speed. And if your card has less memory, as long as you have 2GB, running this model is still not a big problem.

I came across this model online at the beginning of February and put together a Docker container for it at the time, but other things came up and I set it aside.

Maybe I should have written an article back then?

Recently, a friend in the group mentioned wanting to try it out, and he happened to be waiting for the fine-tuning results of the 65B model mentioned in yesterday’s article, so let’s write a related article.

If you are only curious about how to run the model with 1.5GB of video memory, you can skip straight to the model preparation section and the 1.5GB section.

Preparing to run the model

Preparing this model takes only two steps: get the project code containing the container configuration, and build the container image.

Get the Docker ChatRWKV project code

To make the model easier to run, I forked the official project and added the container configuration and scripts needed to reproduce it quickly. The project address is soulteary/docker-ChatRWKV.

You can get the code in the following ways:

git clone https://github.com/soulteary/docker-ChatRWKV.git

# or

curl -sL -o chatrwkv.zip https://github.com/soulteary/docker-ChatRWKV/archive/refs/heads/main.zip

After getting the necessary code, we need to configure and prepare the container environment.

Prepare the container environment

In the previous article “Docker-based Deep Learning Environment: Getting Started”, we covered how to configure Docker to work with the graphics card, so I won’t repeat that here. A single simple command gives you a “clean and hygienic” container environment.

Enter the project directory and use Nvidia’s official PyTorch Docker base image to build the basic environment. Compared with pulling the pre-built image directly from DockerHub, building it yourself saves a lot of time.

docker build -t soulteary/model:chatrwkv . -f docker/Dockerfile

The build takes a while, because a nearly 30GB model file is downloaded from Hugging Face during the process.

# docker build -t soulteary/model:chatrwkv . -f docker/Dockerfile
[ + ] Building 1129.8s (8/12)
 => [internal] load .dockerignore 0.1s
 => => transferring context: 2B 0.0s
 => [internal] load build definition from Dockerfile 0.1s
 => => transferring dockerfile: 850B 0.0s
 => [internal] load metadata for nvcr.io/nvidia/pytorch:23.02-py3 0.0s
 => CACHED [1/8] FROM nvcr.io/nvidia/pytorch:23.02-py3 0.0s
 => [internal] load build context 0.1s
 => => transferring context: 5.72kB 0.0s
 => [2/8] RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple && pip install huggingface_hub 4.6s
 => [3/8] WORKDIR /app 0.1s
 => [4/8] RUN cat > /get-models.py <<EOF 0.3s
 => [5/8] RUN python /get-models.py && rm -rf /get-models.py && pip install ninja rwkv==0.6.2 gradio 1124.7s
 => => # Downloading (…)ctx8192-test1050.pth: 99%|█████████| 27.9G/28.3G [08:06<00:06, 57.6MB/s]
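For reference, the model download step baked into the image is essentially a single hf_hub_download call. Below is a minimal sketch of what the generated /get-models.py boils down to; the authoritative version is the heredoc inside docker/Dockerfile:

from huggingface_hub import hf_hub_download

# Pull the ~28GB RWKV 14B checkpoint into the Hugging Face cache inside the image.
# (Sketch only; see docker/Dockerfile in the repo for the script that is actually run.)
hf_hub_download(
    repo_id="BlinkDL/rwkv-4-pile-14b",
    filename="RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth",
)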

When the image is built, you can start playing.

Before taking on the “extreme challenge” of 1.5GB, let’s first run the model with relatively ample resources and get a feel for how fast it is.

Quickly run ChatRWKV with Docker

If your graphics card has 20GB or more of video memory, you can directly run the following command to start a ChatRWKV program with a web interface:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -p 7860:7860 soulteary/model:chatrwkv

After you execute the command, you will see a log similar to the following:

==============
==PyTorch==
==============

NVIDIA Release 23.02 (build 53420872)
PyTorch Version 1.14.0a0+44dac51

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

Using /root/.cache/torch_extensions/py38_cu120 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cu120/wkv_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu120/wkv_cuda/build.ninja...
Building extension module wkv_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF wrapper.o.d -DTORCH_EXTENSION_NAME=wkv_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILDc1Axxi=""1 local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/ lib/python3.8/dist-packages/torch/include/THC-isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC-isystem /usr/local/cuda/include-isystem/ usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -c /usr/local/lib/python3.8/dist-packages/rwkv/cuda/wrapper.cpp -o wrapper.o
[2/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=wkv_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabsystem / local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/ lib/python3.8/dist-packages/torch/include/THC-isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC-isystem /usr/local/cuda/include-isystem/ usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch =compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch =compute_90,cod e=compute_90 -gencode=arch=compute_90,code=sm_90 --compiler-options '-fPIC' -t 4 -std=c++17 --use_fast_math -O3 --extra-device-vectorization -c /usr/local /lib/python3.8/dist-packages/rwkv/cuda/operators.cu -o operators.cuda.o
[3/3] c++ wrapper.o operators.cuda.o -shared -L/usr/local/lib/python3.8/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o wkv_cuda.so
Loading extension module wkv_cuda...
RWKV_JIT_ON 1 RWKV_CUDA_ON 1 RESCALE_LAYER 6

Loading /root/.cache/huggingface/hub/models--BlinkDL--rwkv-4-pile-14b/snapshots/5abf33a0a7aca020a5d3fc189a50e9bf17def979/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth...
Strategy: (total 40 + 1=41 layers)
* cuda [float16, uint8], store 20 layers
* cuda [float16, float16], store 21 layers
0-cuda-float16-uint8 1-cuda-float16-uint8 2-cuda-float16-uint8 3-cuda-float16-uint8 4-cuda-float16-uint8 5-cuda-float16-uint8 6-cuda-float16-uint8 7-cuda-float16-uint8 8-cuda-float16-uint8 9-cuda-float16-uint8 10-cuda-float16-uint8 11-cuda-float16-uint8 12-cuda-float16-uint8 13-cuda-float16-uint8 14-cuda-float16-uint8 15-cuda-float16-uint8 16-cuda-float16-uint8 17-cuda-float16-uint8 18-cuda-float16-uint8 19-cuda-float16-uint8 20-cuda-float16-float16 21-cuda-float16-float16 22-cuda-float16-float16 23-cuda-float16-float16 24-cuda-float16-float16 25-cuda-float16-float16 26-cuda-float16-float16 27-cuda-float16-float16 28-cuda-float16-float16 29-cuda-float16-float16 30-cuda-float16-float16 31-cuda-float16-float16 32-cuda-float16-float16 33-cuda-float16-float16 34-cuda-float16-float16 35-cuda-float16-float16 36-cuda-float16-float16 37-cuda-float16-float16 38-cuda-float16-float16 39-cuda-float16-float16 40-cuda-float16-float16
emb.weight f16 cpu 50277 5120
blocks.0.ln1.weight f16 cuda:0 5120
blocks.0.ln1.bias f16 cuda:0 5120
blocks.0.ln2.weight f16 cuda:0 5120
blocks.0.ln2.bias f16 cuda:0 5120
blocks.0.att.time_decay f32 cuda:0 5120
blocks.0.att.time_first f32 cuda:0 5120
blocks.0.att.time_mix_k f16 cuda:0 5120
blocks.0.att.time_mix_v f16 cuda:0 5120
blocks.0.att.time_mix_r f16 cuda:0 5120
blocks.0.att.key.weight i8 cuda:0 5120 5120
blocks.0.att.value.weight i8 cuda:0 5120 5120
blocks.0.att.receptance.weight i8 cuda:0 5120 5120
blocks.0.att.output.weight i8 cuda:0 5120 5120
blocks.0.ffn.time_mix_k f16 cuda:0 5120
blocks.0.ffn.time_mix_r f16 cuda:0 5120
blocks.0.ffn.key.weight i8 cuda:0 5120 20480
blocks.0.ffn.receptance.weight i8 cuda:0 5120 5120
blocks.0.ffn.value.weight i8 cuda:0 20480 5120
...
blocks.39.ln1.weight f16 cuda:0 5120
blocks.39.ln1.bias f16 cuda:0 5120
blocks.39.ln2.weight f16 cuda:0 5120
blocks.39.ln2.bias f16 cuda:0 5120
blocks.39.att.time_decay f32 cuda:0 5120
blocks.39.att.time_first f32 cuda:0 5120
blocks.39.att.time_mix_k f16 cuda:0 5120
blocks.39.att.time_mix_v f16 cuda:0 5120
blocks.39.att.time_mix_r f16 cuda:0 5120
blocks.39.att.key.weight f16 cuda:0 5120 5120
blocks.39.att.value.weight f16 cuda:0 5120 5120
blocks.39.att.receptance.weight f16 cuda:0 5120 5120
blocks.39.att.output.weight f16 cuda:0 5120 5120
blocks.39.ffn.time_mix_k f16 cuda:0 5120
blocks.39.ffn.time_mix_r f16 cuda:0 5120
blocks.39.ffn.key.weight f16 cuda:0 5120 20480
blocks.39.ffn.receptance.weight f16 cuda:0 5120 5120
blocks.39.ffn.value.weight f16 cuda:0 20480 5120
ln_out.weight f16 cuda:0 5120
ln_out.bias f16 cuda:0 5120
head.weight f16 cuda:0 5120 50277
Running on local URL: http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.

When you see Running on local URL: http://0.0.0.0:7860, open your browser and visit the IP of the machine running the container (if it is running locally, visit http://127.0.0.1:7860), and you will see the interface below.

ChatRWKV interface preview

Because it uses gradio’s default template, the interface is very plain (clunky). Enter the content you want to test on the left, or use the preset prompts at the bottom of the page, then click the “Submit” button and wait for the model to churn out its output.
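Under the hood, the webui is little more than a gradio Interface wrapped around the model’s generation function. The sketch below only illustrates the shape of the idea; the hypothetical generate() stands in for the project’s real generation logic, which lives in app.py:

import gradio as gr

def generate(prompt: str) -> str:
    # Placeholder: the real app feeds the prompt to the RWKV pipeline and returns the continuation.
    return "model output for: " + prompt

demo = gr.Interface(fn=generate, inputs="text", outputs="text")
# Bind to 0.0.0.0 so the page is reachable from outside the container on port 7860.
demo.launch(server_name="0.0.0.0", server_port=7860)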

ChatRWKV running effect

As you can see, generation is quite fast. If we could fine-tune it with our own corpus, the results might be even better. That said, it still feels some distance away from the tools we use every day; I hope the project keeps getting better.

At this point, if we check the graphics card with the nvidia-smi tool, we can see that video memory usage sits around 20GB.

nvidia-smi
Sat Mar 25 14:09:48 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
| 31%   34C    P8    22W / 450W |  20775MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1290      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      1506      G   /usr/bin/gnome-shell               10MiB |
|    0   N/A  N/A      4120      C   python                          20750MiB |
+-----------------------------------------------------------------------------+

Pre-converting the model format

Although ChatRWKV runs very fast, it takes a long time to load every time it starts. The most time-consuming part: on startup, the program converts the format of the downloaded open-source model according to a fixed strategy.

If we perform this conversion ahead of time, we can save a lot of startup time, as well as unnecessary compute. The official project provides a snippet that briefly shows how the model format is converted; the detailed conversion logic lives in the source code of the PyPI package the project provides.

We can execute the following command to start the container image containing the model we downloaded earlier and drop into an interactive shell:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -p 7860:7860 soulteary/model:chatrwkv bash

Then simplify the official example a bit:

from huggingface_hub import hf_hub_download
title = "RWKV-4-Pile-14B-20230313-ctx8192-test1050"
model_path = hf_hub_download(repo_id="BlinkDL/rwkv-4-pile-14b", filename=f"{title}.pth")

from rwkv.model import RWKV
RWKV(model=model_path, strategy='cuda fp16i8 *20 -> cuda fp16', convert_and_save_and_exit = f"./models/{title}.pth")

The conversion strategy (strategy) here should match the one used in app.py (the official example) inside the container; we’ll come back to RWKV strategies below.
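To give a feel for the notation, here are a few example strategy strings based on my reading of the ChatRWKV documentation (treat the comments as rough guidance and check the official README for the exact semantics):

# Example RWKV strategy strings, roughly from most to least VRAM-hungry (my summary, not official docs):
EXAMPLE_STRATEGIES = {
    "cuda fp16": "all layers on the GPU in fp16; fastest, needs the most VRAM",
    "cuda fp16i8": "all layers on the GPU with int8-quantized weights; roughly halves VRAM",
    "cuda fp16i8 *20 -> cuda fp16": "first 20 layers int8, the rest fp16 (the strategy used in this article)",
    "cuda fp16i8 *0+ -> cpu fp32 *1": "stream (almost) every layer through the GPU, last layer on the CPU; minimal VRAM",
    "cpu fp32": "no GPU at all; slow, but runs anywhere with enough RAM",
}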

Save the simplified official example as convert.py, run python convert.py, and wait patiently for the model format conversion to finish:

# python convert.py
RWKV_JIT_ON 1 RWKV_CUDA_ON 0 RESCALE_LAYER 6

Loading /root/.cache/huggingface/hub/models--BlinkDL--rwkv-4-pile-14b/snapshots/5abf33a0a7aca020a5d3fc189a50e9bf17def979/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth...
Strategy: (total 40 + 1=41 layers)
* cuda [float16, uint8], store 20 layers
* cuda [float16, float16], store 21 layers
0-cuda-float16-uint8 1-cuda-float16-uint8 2-cuda-float16-uint8 3-cuda-float16-uint8 4-cuda-float16-uint8 5-cuda-float16-uint8 6-cuda-float16-uint8 7-cuda-float16-uint8 8-cuda-float16-uint8 9-cuda-float16-uint8 10-cuda-float16-uint8 11-cuda-float16-uint8 12-cuda-float16-uint8 13-cuda-float16-uint8 14-cuda-float16-uint8 15-cuda-float16-uint8 16-cuda-float16-uint8 17-cuda-float16-uint8 18-cuda-float16-uint8 19-cuda-float16-uint8 20-cuda-float16-float16 21-cuda-float16-float16 22-cuda-float16-float16 23-cuda-float16-float16 24-cuda-float16-float16 25-cuda-float16-float16 26-cuda-float16-float16 27-cuda-float16-float16 28-cuda-float16-float16 29-cuda-float16-float16 30-cuda-float16-float16 31-cuda-float16-float16 32-cuda-float16-float16 33-cuda-float16-float16 34-cuda-float16-float16 35-cuda-float16-float16 36-cuda-float16-float16 37-cuda-float16-float16 38-cuda-float16-float16 39-cuda-float16-float16 40-cuda-float16-float16
emb.weight f16 cpu 50277 5120
blocks.0.ln1.weight f16 cpu 5120
blocks.0.ln1.bias f16 cpu 5120
blocks.0.ln2.weight f16 cpu 5120
blocks.0.ln2.bias f16 cpu 5120
blocks.0.att.time_decay f32 cpu 5120
blocks.0.att.time_first f32 cpu 5120
blocks.0.att.time_mix_k f16 cpu 5120
blocks.0.att.time_mix_v f16 cpu 5120
blocks.0.att.time_mix_r f16 cpu 5120
blocks.0.att.key.weight i8 cpu 5120 5120
blocks.0.att.value.weight i8 cpu 5120 5120
blocks.0.att.receptance.weight i8 cpu 5120 5120
blocks.0.att.output.weight i8 cpu 5120 5120
blocks.0.ffn.time_mix_k f16 cpu 5120
blocks.0.ffn.time_mix_r f16 cpu 5120
blocks.0.ffn.key.weight i8 cpu 5120 20480
blocks.0.ffn.receptance.weight i8 cpu 5120 5120
blocks.0.ffn.value.weight i8 cpu 20480 5120
...
blocks.39.ln1.weight f16 cpu 5120
blocks.39.ln1.bias f16 cpu 5120
blocks.39.ln2.weight f16 cpu 5120
blocks.39.ln2.bias f16 cpu 5120
blocks.39.att.time_decay f32 cpu 5120
blocks.39.att.time_first f32 cpu 5120
blocks.39.att.time_mix_k f16 cpu 5120
blocks.39.att.time_mix_v f16 cpu 5120
blocks.39.att.time_mix_r f16 cpu 5120
blocks.39.att.key.weight f16 cpu 5120 5120
blocks.39.att.value.weight f16 cpu 5120 5120
blocks.39.att.receptance.weight f16 cpu 5120 5120
blocks.39.att.output.weight f16 cpu 5120 5120
blocks.39.ffn.time_mix_k f16 cpu 5120
blocks.39.ffn.time_mix_r f16 cpu 5120
blocks.39.ffn.key.weight f16 cpu 5120 20480
blocks.39.ffn.receptance.weight f16 cpu 5120 5120
blocks.39.ffn.value.weight f16 cpu 20480 5120
ln_out.weight f16 cpu 5120
ln_out.bias f16 cpu 5120
head.weight f16 cpu 5120 50277
Saving to ./models/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth...
Converted and saved. Now this will exit.

It doesn’t take too long. Once we see the “Converted and saved.” message, the pre-converted model is ready. With the official default strategy, the model file shrinks by several gigabytes, from roughly 28GB down to about 21GB:

# du -hs models/*
21G models/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth

Then, manually executing python app.py starts the model application quickly.

The challenge: running the ChatRWKV model on 1.5GB of video memory

If you want to run the model with little video memory, there are a couple of reasonably reliable approaches at this stage:

  1. Quantize the model to 8-bit, 4-bit, or even lower; this shrinks the model file, and part of what would live in video memory can be offloaded to CPU memory.
  2. Load the model in a streaming fashion, which likewise reduces how much has to sit in video memory at any one time.

In the official documentation, we can find a rather “extreme” strategy that streams almost all layers: 'cuda fp16i8 *0+ -> cpu fp32 *1'.
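In code, using that strategy is just a matter of passing it to the RWKV constructor. A minimal sketch, assuming the pre-converted checkpoint from the previous section sits under ./models:

from rwkv.model import RWKV

# Stream almost every layer: weights stay in system RAM and are shuttled to the GPU
# layer by layer, so only a sliver of VRAM is needed at any moment (at the cost of speed).
model = RWKV(
    model="./models/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth",
    strategy="cuda fp16i8 *0+ -> cpu fp32 *1",
)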

However, before getting hands-on, one extra preparation step is needed.

How to limit the graphics card’s video memory and simulate a low-memory device

Because I only have a consumer gaming card (an RTX 4090) on hand, I need a way to cap how much of its video memory gets used.

Anyone who uses GPU servers will know that Nvidia has MIG (NVIDIA Multi-Instance GPU) technology, which can partition a graphics card and cap the resources available to an application. However, this feature is only available on a few “premium” cards: A30, A100, H100.

Fortunately, both TensorFlow and PyTorch support soft-limiting how much video memory a process may use. It is not enforced at the hardware level, but it is good enough for simulating a low-memory device in many scenarios.

The ChatRWKV project uses PyTorch, so we only need to add a resource-limit statement where Torch is imported:

import os, gc, torch

torch.cuda.set_per_process_memory_fraction(0.5)

In the code above, by limiting the process to at most 50% of the video memory, we effectively downgrade the 4090 to roughly the memory capacity of an RTX 3060 (about 12GB).
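If you would rather target a specific budget than a percentage, you can derive the fraction from the card’s total memory. A small sketch (the 12GiB figure is just an example):

import torch

# Cap this process at a target VRAM budget, e.g. 12GiB to roughly mimic an RTX 3060.
target_gib = 12
total_bytes = torch.cuda.get_device_properties(0).total_memory
torch.cuda.set_per_process_memory_fraction(min(1.0, target_gib * 1024**3 / total_bytes))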

At this point, we can do a quick sanity check. Use the following command to enter an interactive shell:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -p 7860:7860 soulteary/model:chatrwkv bash

Add the code above to the appropriate place in app.py, then run python app.py. As expected, you will get an out-of-memory error, which proves the limit is in effect.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 200.00 MiB (GPU 0; 23.65 GiB total capacity; 11.46 GiB already allocated; 11.48 GiB free; 11.82 GiB allowed; 11.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

One-click pre-conversion of the model format

We have already covered the “principle” and the “detailed implementation” above, so rather than repeating them, just run the following command to start the model format conversion:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -p 7860:7860 -v `pwd`/models:/app/ChatRWKV/models soulteary/model:chatrwkv python convert.mini.py

After the command finishes, we get a model file of about 15GB:

# du -hs models/*
15G models/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth

One-click run of the model with only 1.5GB of video memory

To run the program with low video memory usage, just execute the following command:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -p 7860:7860 -v `pwd`/models:/app/ChatRWKV/models soulteary/model:chatrwkv python webui.mini.py

After the command runs, checking resources with nvidia-smi shows that the model in its “standby” state needs only about 500MB of video memory.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
| 31%   33C    P8    22W / 450W |    463MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1290      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      1506      G   /usr/bin/gnome-shell               10MiB |
|    0   N/A  N/A     33684      C   python                            406MiB |
+-----------------------------------------------------------------------------+

In actual use, memory usage changes with how much content is being generated. I tested it many times, and it generally stayed between roughly 800MB and 1.4GB.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
| 31%   46C    P2   132W / 450W |   1411MiB / 24564MiB |     84%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1290      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      1506      G   /usr/bin/gnome-shell               10MiB |
|    0   N/A  N/A     10942      C   python                           1354MiB |
+-----------------------------------------------------------------------------+
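If you prefer to watch memory from inside the Python process instead of (or in addition to) nvidia-smi, a tiny optional helper along these lines works too (not part of the project, just for poking around):

import torch

def log_vram(tag: str = "") -> None:
    # memory_allocated: memory held by live tensors; memory_reserved: what PyTorch's caching allocator keeps around.
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated: {allocated:.0f} MiB, reserved: {reserved:.0f} MiB")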

Finally

That’s about it for this article.

As for how well the model actually performs, it’s a case of the little pony crossing the river: you’ll have to try it for yourself.

–EOF

We have a small “tinkering” group, which gathers some friends who like to tinker.

There are no advertisements; we chat about software and hardware, HomeLab, and programming, and occasionally share technical information in the group.

If you like tinkering too, you are welcome to read the following and scan the QR code to add me as a friend.

Some suggestions and opinions about “making friends”

When adding me as a friend, please include your real name and your company or school, and state where you found me and why; otherwise the request will not be approved.

About joining the tinkering group

This article is licensed under the “Attribution 4.0 International (CC BY 4.0)” agreement. You are welcome to reprint or re-use it, but you must credit the source.

Author: 苏洋

Created: March 25, 2023
Word count: 19,716 characters
Reading time: 40 minutes
Link: https://soulteary.com/2023/03/25/model-talk-open-source-model-of-rnn-14b-that-can-run-on-little-gpu-memory-chatrwkv.html