
Serving Llama-2 7B using llama.cpp with NVIDIA CUDA on Ubuntu 22.04

This blog post is a step-by-step guide for running the Llama-2 7B model using llama.cpp with NVIDIA CUDA on Ubuntu 22.04. llama.cpp is a C/C++ library for the inference of Llama/Llama-2 models. It has grown insanely popular along with the boom of large language model applications. Throughout this guide, we assume the user home directory (usually /home/username) is the working directory.

Install NVIDIA CUDA

To start, let's install NVIDIA CUDA on Ubuntu 22.04. The steps here follow the CUDA Toolkit download page provided by NVIDIA, except that I deviate a little bit by installing CUDA 11.8 instead of the latest version. At the time of writing, PyTorch 2.0 stable is released for CUDA 11.8, and I find it convenient to keep my deployed CUDA version in sync with that.


$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
$ sudo apt update
$ sudo apt install cuda-11-8

After installing, the system should be restarted to ensure that the NVIDIA driver kernel modules built by dkms are properly loaded. Afterwards, you should be able to see your GPUs using nvidia-smi.

$ sudo shutdown -r now
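
After the reboot, as a quick sanity check (assuming the default install location of /usr/local/cuda), you can confirm that both the driver and the toolkit are visible:

$ nvidia-smi
$ /usr/local/cuda/bin/nvcc --version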

Clone and Compile llama.cpp

After installing NVIDIA CUDA, all of the prerequisites to compile llama.cpp are already satisfied. We simply need to clone llama.cpp and compile.

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ make LLAMA_CUBLAS=1 LLAMA_CUDA_NVCC=/usr/local/cuda/bin/nvcc

Here, defining the LLAMA_CUDA_NVCC variable is necessary because the CUDA Toolkit does not add itself to PATH by default on Ubuntu 22.04. Then, one should find many executables, such as main and perplexity, in the project directory. In this guide we opted for the make build method, but interested users can also check out llama.cpp's repo page for instructions on building with cmake.
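
For reference, a minimal sketch of the cmake route (the LLAMA_CUBLAS flag here reflects the repository's documentation at the time of writing; check the repo for the authoritative instructions) could look like this:

$ mkdir build && cd build
$ cmake .. -DLLAMA_CUBLAS=ON
$ cmake --build . --config Release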

Download and Run Llama-2 7B

Normally, one needs to refer to Meta's LLaMA download page to access the models. To save time, we use the converted and quantized model by the awesome HuggingFace community user TheBloke. The pre-quantized models are available at https://huggingface.co/TheBloke/Llama-2-7B-GGUF. In the model repository name, GGUF refers to a new model file format introduced in August 2023 for llama.cpp.

To download the model files, first we install and initialize git-lfs (Git Large File Storage).

$ sudo apt install git-lfs
$ git lfs install

You should see "Git LFS initialized." printed in the terminal after the last command. Then, we can clone the repository with only pointers to the large files instead of downloading all of them.

$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/TheBloke/Llama-2-7B-GGUF

The one file we actually need is llama-2-7b.Q4_0.gguf, which is the Llama 2 7B model processed using one of the 4-bit quantization methods.

$ cd Llama-2-7B-GGUF
$ git lfs pull --include llama-2-7b.Q4_0.gguf
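
If the pull succeeded, the file should be an actual multi-gigabyte binary rather than a small Git LFS pointer file. A quick way to check:

$ ls -lh llama-2-7b.Q4_0.gguf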

With the pre-quantized model downloaded, we can execute the programs in llama.cpp for many purposes using the Llama 2 7B model. One of them is to generate text in an interactive mode.

$ cd ~/llama.cpp
$ ./main -m ~/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf --color \
    --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 \
    --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 10000

The last argument, "-ngl 10000", puts at most 10000 layers on the GPU. In the case of Llama-2 7B, that is more than the total number of layers in the model, so the entire model is placed on the CUDA GPU device. Feel free to follow the instructions on screen to play with the model.

Serving Llama-2 7B

Many useful programs are built when we execute the make command for llama.cpp. main is the one to use for generating text in the terminal. perplexity can be used to compute the perplexity against a given dataset for benchmarking purposes. In this part, we look at the server program, which provides a simple HTTP API server for models compatible with llama.cpp.

$ ./server -m ~/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf \
    -c 2048 -ngl 10000 --host localhost --port 8080

Then, you will have a language model completion API available at http://localhost:8080/completion. To try an example using curl:

$ curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
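
If you have jq installed, you can pipe the response through it to print just the generated text; the completion endpoint returns it in a field named content (see the server documentation for the full response schema):

$ curl -s --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}' \
    | jq -r '.content'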

More documentation about the API can be found in the llama.cpp repository. The instructions here are not limited to Llama 2 7B; one can follow the same guide for any model that is compatible with llama.cpp.
