Homepage | GitHub | Plugins: VS Code, JetBrains | HF Repo | Paper
CodeGeeX2: A More Powerful Multilingual Code Generation Model
1. Introduction to the CodeGeeX2 model
CodeGeeX2 is the second-generation model of the multilingual code generation model CodeGeeX (KDD'23). Compared with the first generation, CodeGeeX2 brings the following improvements:
- More powerful code capabilities: Built on the ChatGLM2-6B base language model, CodeGeeX2-6B has been further pre-trained on 600B tokens of code, comprehensively improving coding capability over the first-generation model. On all six programming languages of the HumanEval-X benchmark it improves significantly (Python +57%, C++ +71%, Java +54%, JavaScript +83%, Go +56%, Rust +321%), reaching a Pass@1 (one-shot pass rate) of 35.9% on Python and surpassing the larger StarCoder-15B.
- Better model features: Inheriting the features of ChatGLM2-6B, CodeGeeX2-6B has better support for Chinese and English input, supports a maximum sequence length of 8192, and infers much faster than the first-generation CodeGeeX-13B. After quantization it runs in as little as 6 GB of GPU memory, supporting lightweight local deployment (see the loading sketch after this list).
- More comprehensive AI programming assistant: The upgraded CodeGeeX plugin backend (VS Code, JetBrains) supports more than 100 programming languages and adds practical features such as contextual completion and cross-file completion. Combined with the interactive Ask CodeGeeX AI programming assistant, it supports Chinese and English dialogue for all kinds of programming problems, including code explanation, code translation, bug fixing, and documentation generation, helping programmers develop more efficiently.
- More open license: The CodeGeeX2-6B weights are fully open to academic research. Fill out the registration form to apply for commercial use.
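As a quick illustration of the lightweight deployment mentioned above, here is a minimal loading sketch. It assumes the ChatGLM2-style `.quantize(4)` helper exposed by the model's remote code and a hypothetical ModelScope model ID; with a pre-quantized checkpoint such as CodeGeeX2-6B-int4 (used in section 3), the `quantize` call is unnecessary.

```python
from modelscope import AutoTokenizer, AutoModel

# Hypothetical ModelScope model ID for the full-precision checkpoint.
model_path = "ZhipuAI/codegeex2-6b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = (
    AutoModel.from_pretrained(model_path, trust_remote_code=True)
    .quantize(4)  # INT4 quantization: roughly 6 GB of GPU memory (assumed API)
    .cuda()
    .eval()
)
```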
Download address:
CodeGeeX2-6B-int4 · Model Library (modelscope.cn)
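To fetch the weights to a local directory ahead of time, ModelScope's `snapshot_download` can be used; a minimal sketch, assuming the model ID `ZhipuAI/codegeex2-6b-int4` from the model-library page above:

```python
from modelscope import snapshot_download

# Download the int4 checkpoint into the local ModelScope cache and print its path;
# the model ID is an assumption based on the model-library page linked above.
model_dir = snapshot_download("ZhipuAI/codegeex2-6b-int4")
print(model_dir)  # pass this directory to from_pretrained() in the code below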
2. Software dependencies
```bash
pip install protobuf cpm_kernels torch>=2.0 gradio mdtex2html sentencepiece accelerate
```
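Note that in most shells the `>=` in `torch>=2.0` must be quoted, otherwise `>` is parsed as an output redirection and the version constraint is silently lost:

```bash
pip install protobuf cpm_kernels "torch>=2.0" gradio mdtex2html sentencepiece accelerate
```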
3. Model deployment and use
3.1 Generating a bubble sort algorithm in Python
```python
from modelscope import AutoTokenizer, AutoModel

# Use a raw string for the Windows path so that "\D" is not treated as an escape sequence.
model_path = r"E:\Data\CodeGeeX2-6B-int4"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device="cuda")
model = model.eval()

# Remember to add a language tag for better performance.
prompt = (
    "# language: Python\n"
    "# Write a bubble sort algorithm in Python and comment it line by line in Chinese\n"
)

# inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
inputs = tokenizer.encode(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)
# outputs = model.generate(inputs, max_length=256, top_k=1)
outputs = model.generate(inputs, max_length=256)
response = tokenizer.decode(outputs[0])
print(response)
```
Generated output:
```python
# language: Python
# Write a bubble sort algorithm in Python and comment it line by line in Chinese
def bubble_sort(list):
    """
    Bubble sort algorithm
    :param list: the list to sort
    :return: sorted list
    """
    for i in range(len(list) - 1):
        for j in range(len(list) - i - 1):
            if list[j] > list[j + 1]:
                list[j], list[j + 1] = list[j + 1], list[j]
    return list


if __name__ == "__main__":
    list = [1, 3, 2, 4, 5, 6, 7, 9, 8]
    print(bubble_sort(list))
```
3.2 BERT + adversarial training + contrastive learning for text classification in Python
```python
from modelscope import AutoTokenizer, AutoModel

# Use a raw string for the Windows path so that "\D" is not treated as an escape sequence.
model_path = r"E:\Data\CodeGeeX2-6B-int4"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device="cuda")
model = model.eval()

# Remember to add a language tag for better performance.
prompt = (
    "# language: Python\n"
    "# Write Python code that uses BERT combined with adversarial training and "
    "contrastive learning to classify the SST-2 dataset, and comment it line by line in Chinese\n"
)

# inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
inputs = tokenizer.encode(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)
# outputs = model.generate(inputs, max_length=256, top_k=1)
# Note: the model's maximum sequence length is 8192 (see section 1), so a
# max_length of 20000 cannot actually be reached.
outputs = model.generate(inputs, max_length=20000)
response = tokenizer.decode(outputs[0])
print(response)
```
Generated output:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from pytorch_pretrained_bert import BertModel, BertTokenizer
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm, trange
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
from pprint import pprint
import logging
import argparse

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s: %(message)s")


class SST2Dataset:
    def __init__(self, data_dir, tokenizer, max_seq_len, train_mode):
        self.data_dir = data_dir
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len
        self.train_mode = train_mode
        self.data_df = self.load_data()
        self.train_df, self.valid_df = self.split_data()
        self.train_inputs, self.train_masks = self.tokenize_data(self.train_df)
        self.valid_inputs, self.valid_masks = self.tokenize_data(self.valid_df)
        self.train_labels = self.train_df["label"].tolist()
        self.valid_labels = self.valid_df["label"].tolist()

    def load_data(self):
        data_df = pd.read_csv(os.path.join(self.data_dir, "train.tsv"), sep="\t")
        return data_df

    def split_data(self):
        data_df = self.data_df
        train_df, valid_df = train_test_split(data_df, test_size=0.2, random_state=42)
        return train_df, valid_df

    def tokenize_data(self, data_df):
        inputs_1 = list(data_df["sentence1"])
        inputs_2 = list(data_df["sentence2"])
        inputs = inputs_1 + inputs_2
        masks = [1] * len(inputs_1) + [0] * len(inputs_2)
        inputs = [self.tokenizer.tokenize(sent)[: self.max_seq_len] for sent in inputs]
        inputs = [self.tokenizer.convert_tokens_to_ids(["[CLS]"] + input) for input in inputs]
        inputs = [input[0 : self.max_seq_len] + [0] * (self.max_seq_len - len(input)) for input in inputs]
        inputs = torch.tensor(inputs)
        masks = torch.tensor(masks)
        return inputs, masks

    def get_data(self, data_type):
        if data_type == "train":
            inputs, masks, labels = self.train_inputs, self.train_masks, self.train_labels
        elif data_type == "valid":
            inputs, masks, labels = self.valid_inputs, self.valid_masks, self.valid_labels
        return inputs, masks, labels


class BertClassifier(nn.Module):
    def __init__(self, bert_model, out_dim):
        super(BertClassifier, self).__init__()
        self.bert_model = bert_model
        self.out = nn.Linear(768, out_dim)

    def forward(self, inputs, masks):
        outputs, _ = self.bert_model(inputs, masks, output_all_encoded_layers=False)
        pooled = outputs[:, 0]
        out = self.out(pooled)
        return out


def train_epoch(train_data, optimizer, scheduler, writer, epoch, args):
    # Train model
    bert_model.train()
    train_loss = 0
    num_train_data = 0
    for batch_idx, train_batch in enumerate(train_data):
        train_batch_inputs, train_batch_masks, train_batch_labels = train_batch
        train_batch_inputs, train_batch_masks, train_batch_labels = (
            train_batch_inputs.to(args.device),
            train_batch_masks.to(args.device),
            train_batch_labels.to(args.device),
        )
        optimizer.zero_grad()
        bert_out = bert_model(train_batch_inputs, train_batch_masks)
        loss = F.cross_entropy(bert_out, train_batch_labels)
        train_loss += loss.item()
        num_train_data += len(train_batch_labels)
        loss.backward()
        optimizer.step()
        if scheduler is not None:
            scheduler.step()
        writer.add_scalar("loss", loss.item(), global_step=num_train_data)
        writer.add_scalar("learning_rate", optimizer.param_groups[0]["lr"], global_step=num_train_data)
        writer.add_scalar("train_loss", train_loss / (batch_idx + 1), global_step=num_train_data)
        writer.add_scalar(
            "train_acc",
            accuracy_score(
                train_batch_labels.cpu().numpy(),
                np.argmax(bert_out.detach().cpu().numpy(), axis=-1),
            ),
            global_step=num_train_data,
        )


def eval_epoch(valid_data, writer, epoch, args):
    # Validate model
    bert_model.eval()
    valid_loss = 0
    num_valid_data = 0
    valid_preds = []
    valid_labels = []
    with torch.no_grad():
        for batch_idx, valid_batch in enumerate(valid_data):
            valid_batch_inputs, valid_batch_masks, valid_batch_labels = valid_batch
            valid_batch_inputs, valid_batch_masks, valid_batch_labels = (
                valid_batch_inputs.to(args.device),
                valid_batch_masks.to(args.device),
                valid_batch_labels.to(args.device),
            )
            bert_out = bert_model(valid_batch_inputs, valid_batch_masks)
            loss = F.cross_entropy(bert_out, valid_batch_labels)
            valid_loss += loss.item()
            num_valid_data += len(valid_batch_labels)
            valid_preds.append(bert_out.detach().cpu().numpy())
            valid_labels.append(valid_batch_labels.detach().cpu().numpy())
    valid_preds = np.concatenate(valid_preds, axis=0)
    valid_labels = np.concatenate(valid_labels, axis=0)
    valid_acc = accuracy_score(valid_labels, np.argmax(valid_preds, axis=-1))
    valid_loss = valid_loss / (batch_idx + 1)
    writer.add_scalar("valid_loss", valid_loss, global_step=epoch + 1)
    writer.add_scalar("valid_acc", valid_acc, global_step=epoch + 1)
    writer.add_scalar("valid_f1", f1_score(valid_labels, np.argmax(valid_preds, axis=-1)), global_step=epoch + 1)


def train(args):
    # Train model
    writer = SummaryWriter(log_dir=os.path.join(args.log_dir, "train"))
    for epoch in trange(args.num_epochs, desc="Epoch"):
        train_epoch(
            train_data=train_data,
            optimizer=optimizer,
            scheduler=scheduler,
            writer=writer,
            epoch=epoch,
            args=args,
        )
        eval_epoch(valid_data=valid_data, writer=writer, epoch=epoch, args=args)
    bert_model.save_pretrained(os.path.join(args.log_dir, "bert_model"))
    writer.close()


def test_epoch(test_data, writer, epoch, args):
    # Test model
    bert_model.eval()
    test_loss = 0
    num_test_data = 0
    test_preds = []
    test_labels = []
    with torch.no_grad():
        for batch_idx, test_batch in enumerate(test_data):
            test_batch_inputs, test_batch_masks, test_batch_labels = test_batch
            test_batch_inputs, test_batch_masks, test_batch_labels = (
                test_batch_inputs.to(args.device),
                test_batch_masks.to(args.device),
                test_batch_labels.to(args.device),
            )
            bert_out = bert_model(test_batch_inputs, test_batch_masks)
            loss = F.cross_entropy(bert_out, test_batch_labels)
            test_loss += loss.item()
            num_test_data += len(test_batch_labels)
            test_preds.append(bert_out.detach().cpu().numpy())
            test_labels.append(test_batch_labels.detach().cpu().numpy())
    test_preds = np.concatenate(test_preds, axis=0)
    test_labels = np.concatenate(test_labels, axis=0)
    test_acc = accuracy_score(test_labels, np.argmax(test_preds, axis=-1))
    test_loss = test_loss / (batch_idx + 1)
    writer.add_scalar("test_loss", test_loss, global_step=epoch + 1)
    writer.add_scalar("test_acc", test_acc, global_step=epoch + 1)
    writer.add_scalar("test_f1", f1_score(test_labels, np.argmax(test_preds, axis=-1)), global_step=epoch + 1)


def test(args):
    writer = SummaryWriter(log_dir=os.path.join(args.log_dir, "test"))
    for epoch in trange(args.num_epochs, desc="Epoch"):
        test_epoch(test_data=test_data, writer=writer, epoch=epoch, args=args)
    writer.close()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", type=str, default="./data")
    parser.add_argument("--log_dir", type=str, default="./logs")
    parser.add_argument("--num_epochs", type=int, default=10)
    parser.add_argument("--train_mode", type=str, default="train")
    parser.add_argument("--max_seq_len", type=int, default=128)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=2e-5)
    parser.add_argument("--num_workers", type=int, default=0)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--device", type=str, default="cuda")
    args = parser.parse_args()
    pprint(vars(args))

    bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert_model = BertModel.from_pretrained("bert-base-uncased")
    bert_model.to(args.device)

    if args.train_mode == "train":
        train_data = SST2Dataset(
            data_dir=args.data_dir,
            tokenizer=bert_tokenizer,
            max_seq_len=args.max_seq_len,
            train_mode=args.train_mode,
        ).get_data(data_type="train")
        train_data = TensorDataset(*train_data)
        train_data = DataLoader(
            train_data,
            batch_size=args.batch_size,
            shuffle=True,
            num_workers=args.num_workers,
        )
        valid_data = SST2Dataset(
            data_dir=args.data_dir,
            tokenizer=bert_tokenizer,
            max_seq_len=args.max_seq_len,
            train_mode=args.train_mode,
        ).get_data(data_type="valid")
        valid_data = TensorDataset(*valid_data)
        valid_data = DataLoader(
            valid_data,
            batch_size=args.batch_size,
            shuffle=False,
            num_workers=args.num_workers,
        )
        test_data = SST2Dataset(
            data_dir=args.data_dir,
            tokenizer=bert_tokenizer,
            max_seq_len=args.max_seq_len,
            train_mode=args.train_mode,
        ).get_data(data_type="test")
        test_data = TensorDataset(*test_data)
        test_data = DataLoader(
            test_data,
            batch_size=args.batch_size,
            shuffle=False,
            num_workers=args.num_workers,
        )
        optimizer = torch.optim.Adam(bert_model.parameters(), lr=args.lr)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer=optimizer, mode="min", factor=0.5, patience=2, verbose=True
        )
        train(args)
        test(args)
    elif args.train_mode == "test":
        test_data = SST2Dataset(
            data_dir=args.data_dir,
            tokenizer=bert_tokenizer,
            max_seq_len=args.max_seq_len,
            train_mode=args.train_mode,
        ).get_data(data_type="test")
        test_data = TensorDataset(*test_data)
        test_data = DataLoader(
            test_data,
            batch_size=args.batch_size,
            shuffle=False,
            num_workers=args.num_workers,
        )
        test(args)
```
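The generated script is a plausible skeleton, but it does not run end-to-end as-is: `get_data` has no `"test"` branch, the label lists would need to be converted to tensors before being passed to `TensorDataset`, the `BertClassifier` class is defined but never used (the raw `BertModel` output is fed to `cross_entropy`), and, despite the prompt, no adversarial training or contrastive learning is actually implemented. If the script were saved as, say, `sst2_bert.py` (a hypothetical name), it would be invoked with the flags defined in its `argparse` block:

```bash
# Hypothetical file name; the flags come from the argparse definitions above.
python sst2_bert.py --data_dir ./data --log_dir ./logs --num_epochs 10 --batch_size 32 --device cuda
```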
4. Result analysis
4.1 Prompt content
```python
prompt = (
    "# language: Python\n"
    "# Help write a bubble sort\n"
)
```
The comment text after the language tag (written in Chinese in the original prompts) describes the problem that the model is asked to implement.
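The leading `# language:` tag tells the model which language to generate. A minimal sketch of prompts for a few languages follows; the exact tag strings are an assumption based on the language-tag convention in the CodeGeeX2 repository, where each tag uses the comment syntax of the target language:

```python
# Prompts for several target languages (tag strings assumed from the CodeGeeX2 repo).
python_prompt = "# language: Python\n# Help write a bubble sort\n"
cpp_prompt = "// language: C++\n// Help write a bubble sort\n"
js_prompt = "// language: JavaScript\n// Help write a bubble sort\n"
```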
4.2 Setting the output length
```python
outputs = model.generate(inputs, max_length=256)
```
`max_length` sets the maximum total length of the generation, counting both the prompt and the newly generated tokens; adjust it to your actual needs.
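Because `max_length` includes the prompt tokens, a long prompt leaves less room for the completion. Since the remote code builds on Hugging Face's `generate`, the `max_new_tokens` argument, which budgets only the newly generated tokens, should also work here (an assumption worth verifying on this checkpoint):

```python
# Budget 256 newly generated tokens regardless of the prompt length.
outputs = model.generate(inputs, max_new_tokens=256)
response = tokenizer.decode(outputs[0])
```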