[NLP] Running the Llama2 model on a Mac

This article introduces how to use llama.cpp to deploy and run inference with a quantized Llama2 model locally on a MacBook Pro, and how to build a simple local document Q&A application on top of it with LangChain. The experimental environment is an Apple M1 chip with 8GB of memory.

Llama2 and llama.cpp

Llama2 is an iterative version of the Llama large language model developed by Meta AI. It is available in 7B, 13B, and 70B parameter sizes. Compared with Llama, Llama2 further improves conversational ability, and in terms of balancing safety and helpfulness it performs better than most other models, including ChatGPT. Importantly, Llama2 is released under a license that permits commercial use, so individuals and organizations can more easily build their own large-model applications.

In order to run Llama2 model inference on a MacBook and take advantage of Apple Silicon's hardware acceleration, this article uses llama.cpp as the inference infrastructure.

llama.cpp is a derivative project of the machine learning library ggml, dedicated to inference for the Llama family of models. Both llama.cpp and ggml are pure C/C++ implementations, optimized and hardware-accelerated for Apple Silicon chips, and they support integer quantization of models: 4-bit, 5-bit, 8-bit, and so on. The community has also developed bindings such as llama-cpp-python, which expose the same functionality through other languages' APIs.

The llama.cpp project, released by developer Georgi Gerganov, is a pure C/C++ port of Meta's LLaMA model (whose reference implementation is fairly simple Python code) and is used for model inference. Inference is the process of giving the model an input, running it, and obtaining its output.

So, what are the advantages of the pure C/C++ version?

  • No additional dependencies are required. Whereas the Python code needs libraries such as PyTorch, the C/C++ version compiles directly to an executable, skipping the complicated environment setup for different hardware;
  • Supports ARM NEON acceleration on Apple Silicon chips, with AVX2 used instead on x86 platforms;
  • Supports mixed F16 and F32 precision;
  • Supports 4-bit quantization;
  • Runs on the CPU alone, no GPU required;

According to the figures given by the author, running the LLaMA-7B model on an M1 MacBook Pro takes about 60 milliseconds per token, i.e. more than ten tokens per second, which is quite a reasonable speed.

After the architecture of a deep neural network is designed, the core purpose of training is to determine the weight of each neuron. Weights are usually stored as floating-point numbers at 16-, 32-, or 64-bit precision, and training is typically GPU-accelerated. Quantization is the process of reducing the precision of these weights in order to lower the hardware requirements.

For example, the LLaMA model uses 16-bit floating-point precision, and its 7B version has 7 billion parameters, so the complete model is about 13 GB. A user needs at least that much memory and disk just to load the model, let alone the 13B version, which at about 24 GB is prohibitive. Through quantization, however, for example by reducing the precision to 4 bits, the 7B and 13B versions shrink to approximately 4 GB and 8 GB respectively. Consumer-grade hardware can meet these requirements, and everyone can experience large models on a personal computer.
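As a rough back-of-envelope check of these numbers, the size is roughly parameters × bits-per-weight ÷ 8 (a minimal sketch; real GGUF files differ a little because quantized formats store block scales and keep some tensors at higher precision, e.g. the Q4_K_M file used later in this article is 3.92 GiB at about 4.86 bits per weight):

# Approximate model size: parameters * bits-per-weight / 8, in GiB.
def approx_size_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1024**3

for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    print(f"{name}: FP16 ~{approx_size_gib(n_params, 16):.1f} GiB, "
          f"4-bit ~{approx_size_gib(n_params, 4):.1f} GiB")
# 7B: FP16 ~13.0 GiB, 4-bit ~3.3 GiB
# 13B: FP16 ~24.2 GiB, 4-bit ~6.1 GiB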

The quantization in llama.cpp is based on another library by the same author, ggml, which implements in C/C++ the tensors used in machine learning models. Tensors are the core data structure of neural network models, familiar from frameworks such as TensorFlow and PyTorch. Moving to C/C++ gives broader platform support and higher efficiency, which laid the foundation for llama.cpp.

Local deployment of the 7B-parameter, 4-bit quantized version of Llama2

Model download

To save time and space, a quantized Llama2 model in GGUF format can be downloaded from TheBloke. Alternatively, you can apply for a license on the Meta AI website, download the original model files, and then use the scripts provided by llama.cpp to convert and quantize them. This article uses the 7B-parameter, 4-bit quantized version for deployment.

It is downloaded from TheBloke's Hugging Face repository (TheBloke/Chinese-Llama-2-7B-GGUF · Hugging Face).

1 Use the llama.cpp project to load

To run the LLM on a local CPU, we need a local model in GGUF format (the successor to the older GGML format). There are several ways to obtain one, but the easiest is to download a pre-quantized model file directly from the Hugging Face model hub. In this case, we will download the Llama2 7B model; these models are openly available and free to download.

What is GGML? Why GGML? How to GGML? llama.cpp

GGML is a tensor library for machine learning: a plain C/C++ library that lets you run LLMs on the CPU or on CPU + GPU. It defines a binary file format for distributing large language models (LLMs) and uses a technique called quantization to enable large language models to run on consumer-grade hardware.
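As a small illustration of the binary format, current GGUF files begin with the 4-byte ASCII magic "GGUF", which can be used to tell them apart from older GGML files (a minimal sketch; the file name is just a placeholder):

# Check whether a model file is in GGUF format by reading its 4-byte magic.
def looks_like_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

print(looks_like_gguf("chinese-llama-2-7b.Q4_K_M.gguf"))  # True for a GGUF file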

You can run the Llama2 large model directly on your local machine. Note that an Apple Silicon chip (M1 or later) is required.

xcode-select --install # make sure Git and the C/C++ toolchain are installed
git clone https://github.com/ggerganov/llama.cpp.git

cd llama.cpp
LLAMA_METAL=1 make
./main -m ../hug-download/models--TheBloke--Chinese-Llama-2-7B-GGUF/snapshots/f81e959ca91492916b8b6f895202b6d478b8930c/chinese-llama-2-7b.Q4_K_M.gguf -n 1024 -ngl 1 -p "Answer in Chinese, 3-day travel guide to Shanghai"

Note: Hugging Face may require authentication, and a direct command-line download can fail with a 403 error. In that case, log in on the web page, download the model from the link above, and place it in the models directory under the llama.cpp repository you just cloned.

The original Llama2 release itself cannot be called directly on Windows or Mac machines; it only runs on Linux systems and requires NVIDIA GPUs.

We can run Llama 2 locally on Mac based on the llama.cpp open source project.

Download the 4-bit optimized weights for Llama2 7B Chat from TheBloke's Hugging Face repository (TheBloke/Chinese-Llama-2-7B-GGUF · Hugging Face), put them into the models directory of llama.cpp, and then build the llama.cpp project with Apple's Metal optimization enabled.

The latest version of llama-cpp-python no longer supports the ggmlv3 model format. If your model is in ggmlv3 format, convert it with python3 convert-llama-ggmlv3-to-gguf.py --input <input model> --output <output model> (the paths must not contain Chinese characters); the conversion script is available in the llama.cpp repository (github.com/ggerganov/llama.cpp).

You can download the Chinese Llama2 model as follows.

Download method:

from huggingface_hub import snapshot_download

# Download only the Q4_K_M GGUF file from TheBloke's repository.
# Replace token="XXX" with your own Hugging Face access token.
snapshot_download(repo_id='TheBloke/Chinese-Llama-2-7B-GGUF',
                  repo_type="model",
                  resume_download=True,
                  max_workers=1,
                  allow_patterns="chinese-llama-2-7b.Q4_K_M.gguf",
                  token="XXX", cache_dir='./')

The 7B weights should run on a machine with 8GB of RAM (16GB is better). Larger models such as the 13B or 70B require more RAM.

Log start
main: build = 0 (unknown)
main: built with Apple clang version 14.0.0 (clang-1400.0.29.202) for arm64-apple-darwin22.1.0
main: seed = 1699179655
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../hug-download/models--TheBloke--Chinese-Llama-2-7B-GGUF/snapshots/f81e959ca91492916b8b6f895202b6d478b8930c/chinese-llama-2-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor 0: token_embd.weight q4_K [4096, 55296, 1, 1]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_K [4096, 4096, 1, 1]

. . . . . .

llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 6.93 B
llm_load_print_meta: model size = 3.92 GiB (4.86 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '
'
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: mem required = 4017.18 MB
................................................................. ....................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 740/740
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/apple/PycharmProjects/NLPProject/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 5461.34 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 122.63 MB
llama_new_context_with_model: max tensor size = 177.19 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 4018.28 MB, ( 4018.78 / 5461.34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 256.02 MB, ( 4274.80 / 5461.34)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 116.02 MB, ( 4390.81 / 5461.34)

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 1024, n_keep = 0




Answer in Chinese, 3-day travel guide to Shanghai. As new arrivals, my classmates and I flew to Shanghai at 5 o'clock on Monday. We took the bus from Hongqiao Airport to a hotel near the Bund at 7:50. This is my first time going abroad, so I want to have a lot of fun. 1. After staying in the hotel on the first night, we went out early the next morning for a day trip, mainly visiting the Nanjing Road Pedestrian Street, the Old Town God’s Temple, People’s Square, and Xintiandi. Went to the Oriental Pearl Tower in the afternoon and then returned to the city for dinner. 2. On the third day, I went to the Bund for a walk in the afternoon, visited the China Art Palace Museum, and had dinner with my classmates in Yuyuan Garden in the evening. 3. On the last day, I took the subway to the Shanghai Botanical Garden to see the cherry blossoms. After returning to the hotel, I took the bus back to the airport. On the first morning, we took the bus to the vicinity of People’s Square, and then visited Nanjing Road Pedestrian Street. In fact, there is nothing to buy, the main thing is to feel the atmosphere. On the way, we saw various shops, gourmet shops and street performances (selling candied haws and tambourines). In the afternoon we went to the Old City God Temple. My friend and I entered along the west gate. There are many food stalls inside that are quite delicious! Later, we went to People’s Square to watch the large float parade that we were going to take in the evening. Then we walked from People's Square to the Bund, but there was still a bit of traffic jam on the road because many people wanted to take this road, and there were many performances and snacks selling things on the roadside, so it was quite lively the next day. I went to the Shanghai Museum early in the morning. My classmates and I planned to visit the China Art Palace Museum and the Oriental Pearl Tower. However, we didn’t have time to see the cultural relics exhibition (which seemed quite rich), so we went directly to the second floor to see the exhibition of Chinese paintings and calligraphy works. Then I saw various Chinese porcelains from different periods on the first floor, as well as Japanese antiques and the like (which seemed to be quite valuable...). At noon, I ate at a restaurant near the Bund. The taste was good. In the afternoon, I started from the top. We took a bus to the Oriental Pearl Tower in front of the museum, but my friend and I didn’t have tickets because we didn’t bring our ID cards. QAQ We took the bus and drove around Lujiazui, and then went to Century Park. We saw various stalls on the road and then arrived at Century Park. We walked a lot in the park, and it felt like there were quite a lot of people. Finally, we came out of Xintiandi, had dinner first, and then went back to the hotel to rest. On the third day, my classmates and I took the subway early in the morning to go to the Botanical Garden to see the cherry blossoms (actually, we were going to take pictures), and It happened to be a nice sunny day! My friends and I took a lot of photos at the door, and then went for a walk around the cherry blossom viewing area. After that, we took the subway back and our trip to Shanghai ended like this. QAQ Haha, this time the itinerary was relatively tight and I felt like I ran out of time... 
But it was still interesting to walk around Shanghai (although I also ate Lots of snacks) Now let me take a look at some of my favorite spots in Shanghai~ First of all, there are some small shops and restaurants in the Bund area! When my friend and I were eating at a restaurant near Lujiazui, we passed by an internet-famous milk tea shop called "Ainong House". At that time, we bought a cup of mango-flavored milk tea to drink. It felt quite delicious (although it wasn't Very sweet) Later, I went to a restaurant called "Genting Dream" next door. They had various flavors of chicken steaks and different types of barbecue platters. But we ordered a set meal...but it tasted pretty good! Then there is a small shop called "Old Shanghai Daimaru Tea Room" near Lujiazui (actually, this tea room sells milk tea). My friend and I went to his house to drink milk tea that afternoon, and we also bought their milk tea. Signature dessert mango pudding ~ After feeling pretty good, we went to a restaurant called "Xiao Long Bao King" on the Bund! There are various flavors of xiaolongbao and special snacks here ~ but the price is a bit expensive... I also found a lot of roadside stalls when I came to Shanghai this time. There are many selling them on a street near Lujiazui. Stalls with various snacks and drinks. Now I would like to recommend the restaurant where my friends and I went. They have a brand called "Mala Tang", which also has something similar to small wontons (it seems to be called "tangyuan"). We ate it well. I thought it was pretty good, although it looked a bit dirty... In addition to these places in the Bund area, I also went to a restaurant called "Yunxiao Tower Restaurant" on Nanjing Road Pedestrian Street! There are various flavors of barbecue set meals and large dishes at different price points. However, my friend and I went to eat their special dish of stuffed abalone. I thought it tasted quite good. Apart from these small shops I found in the Bund area, In addition, I also went to a time-honored restaurant called "Dafuji" on Nanjing Road Pedestrian Street! The taste of this restaurant is more traditional, but my friend ordered his homemade tofu and roasted pork with green onions (I forgot other dishes), which was pretty good~ but the price is a little expensive... In the end, this is The hot pot restaurant near Xintiandi in Shanghai that we went to last time! There are various flavors of pot bases to choose from, and self-service snacks are also provided. Although the environment of this restaurant does not look very high-end...but the taste is still good~ next time I come to Shanghai
llama_print_timings: load time = 8380.94 ms
llama_print_timings: sample time = 2122.12 ms / 1024 runs (2.07 ms per token, 482.54 tokens per second)
llama_print_timings: prompt eval time = 306.62 ms / 10 tokens (30.66 ms per token, 32.61 tokens per second)
llama_print_timings: eval time = 196188.08 ms / 1023 runs (191.78 ms per token, 5.21 tokens per second)
llama_print_timings: total time = 214813.21 ms
ggml_metal_free: deallocating
Log end

2 Use the llama-cpp-python project to load

llama.cpp is a C++ library; applications built on LLMs are often developed in Python and need to call the C++ interface from Python. For that we use llama-cpp-python, a Python binding for llama.cpp, which performs LLaMA model inference in pure C/C++.

First, install llama-cpp-python using pip. One thing to note when installing on a Mac is that you need an arm64 (Apple Silicon) build of Python; if you do not have one, you can create an environment with conda first. If you use an x86_64 build of Python, an "Illegal instruction" error will appear when you run the server later.
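To check which architecture your Python interpreter was built for before installing (a minimal sketch):

# Print the architecture of the running Python interpreter.
# On Apple Silicon this should print 'arm64'; 'x86_64' means an Intel/Rosetta build.
import platform
print(platform.machine())

If it prints x86_64, create a fresh arm64 environment first (for example with conda: conda create -n llama python=3.11) and install llama-cpp-python there.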

This article uses the Python binding of llama.cpp, llama-cpp-python, to deploy the Llama2 model locally. llama-cpp-python provides an API consistent with OpenAI's, so a locally deployed model can easily be used in applications or frameworks that were originally written against the OpenAI API (e.g. LangChain).

  • Install llama-cpp-python (with Metal support)

To enable support for Metal (Apple’s GPU acceleration framework), install llama-cpp-python using the following command:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
  • Install Web server

llama-cpp-python ships with a web server that exposes the same API as OpenAI, making it compatible with existing applications and frameworks. Install the web server and its dependencies using the following commands:

pip install 'llama-cpp-python[server]'
pip3 install uvicorn
pip3 install anyio
pip3 install starlette
pip3 install fastapi
pip3 install pydantic_settings
pip3 install sse_starlette
pip3 install starlette_context
  • Start llama-cpp-python web server (with Metal GPU acceleration)
python -m llama_cpp.server --model $MODEL_PATH --n_gpu_layers 1

Replace $MODEL_PATH with the path to the model you downloaded.

  • API documentation and trying it out

After the Web server is started, you can access the OpenAPI documentation through http://localhost:8000/docs and try to call the API.

You can see that the web server provides an OpenAI-like interface:

/v1/completions: Provide text (String type) and return the predicted completion (String type)

/v1/embeddings: Provides text (String type) and returns the text’s embeddings (vector)

/v1/chat/completions: Provides conversation history (a sequence of Messages) and returns predicted answers (Message type)

/v1/models/: Get language model information

Simply test /v1/chat/completions:
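For example, with the requests library (a minimal sketch; the server must already be running on localhost:8000, and the user prompt is just an illustration):

# Minimal test of the local /v1/chat/completions endpoint.
import requests

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Recommend a 3-day travel guide to Shanghai."},
    ],
    "max_tokens": 256,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])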

Note that in the dialog task, each Message object contains two fields: content and role:

  • content: text content of the message (String)
  • role: the role of the message's sender in the conversation, one of system, user, or assistant. system is a high-level instruction used to guide the model's behavior; in the example above, the model is told "You are a helpful assistant.". user represents a message sent by the user, and assistant represents the model's answer.

llama-cpp-python also provides a simple high-level interface through the Llama class. Please replace ./models/7B/ggml-model.bin with the path to your model (the same applies below).

from llama_cpp import Llama
llm = Llama(model_path="./models/7B/ggml-model.bin")
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)
{
  'id': 'cmpl-456ef388-4cff-494b-b721-23492e06e43a',
  'object': 'text_completion',
  'created': 1699238435,
  'model': './TheBloke--Chinese-Llama-2-7B-GGUF/chinese-llama-2-7b.Q4_K_M.gguf',
  'choices': [{
    'text': 'Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Uranus, Neptune ',
    'index': 0,
    'logprobs': None,
    'finish_reason': 'stop'
  }],
  'usage': {
    'prompt_tokens': 15,
    'completion_tokens': 21,
    'total_tokens': 36
  }
}
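The Llama class also has a chat-style interface that mirrors /v1/chat/completions. A minimal sketch, assuming your llama-cpp-python version exposes Llama.create_chat_completion (model path as above; the prompt is just an illustration):

from llama_cpp import Llama

# n_gpu_layers=1 asks llama.cpp to offload to the Metal GPU, as with the server above.
llm = Llama(model_path="./models/7B/ggml-model.bin", n_gpu_layers=1)
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name the planets in the solar system."},
    ],
    max_tokens=64,
)
print(output["choices"][0]["message"]["content"])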
