
Serving Llama-2 7B using llama.cpp with NVIDIA CUDA on Ubuntu 22.04

This blog post is a step-by-step guide for running the Llama-2 7B model using llama.cpp with NVIDIA CUDA on Ubuntu 22.04. llama.cpp is a C/C++ library for the inference of Llama/Llama-2 models. It has grown insanely popular along with the boom of large language model applications. Throughout this guide, we assume the user home directory (usually /home/username) is the working directory.

Install NVIDIA CUDA

To start, let's install NVIDIA CUDA on Ubuntu 22.04. The steps here follow the CUDA Toolkit download page provided by NVIDIA, except that I deviate a little bit by installing CUDA 11.8 instead of the latest version. At the time of writing, PyTorch 2.0 stable is released for CUDA 11.8, and I find it convenient to keep my deployed CUDA version in sync with that.


$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
$ sudo apt update
$ sudo apt install cuda-11-8

After installing, the system should be restarted to ensure that the NVIDIA driver kernel modules built by dkms are properly loaded. Afterwards, you should be able to see your GPUs using nvidia-smi.

$ sudo shutdown -r now
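
After the reboot, as a quick sanity check (assuming the default install location of /usr/local/cuda), you can confirm that both the driver and the toolkit are visible:

$ nvidia-smi
$ /usr/local/cuda/bin/nvcc --version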

Clone and Compile llama.cpp

After installing NVIDIA CUDA, all of the prerequisites to compile llama.cpp are already satisfied. We simply need to clone llama.cpp and compile.

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ make LLAMA_CUBLAS=1 LLAMA_CUDA_NVCC=/usr/local/cuda/bin/nvcc

Here, defining the LLAMA_CUDA_NVCC variable is necessary because the CUDA Toolkit does not add itself to PATH by default on Ubuntu 22.04. Then, one should find many executables, such as main and perplexity, in the project directory. In this guide we opted for the make build method, but interested users can also check out llama.cpp's repo page for instructions on building with cmake.
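
For reference, a minimal sketch of the cmake route (the LLAMA_CUBLAS flag here reflects the repository's documentation at the time of writing; check the repo for the authoritative instructions) could look like this:

$ mkdir build && cd build
$ cmake .. -DLLAMA_CUBLAS=ON
$ cmake --build . --config Release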

Download and Run Llama-2 7B

Normally, one needs to refer to Meta's LLaMA download page to access the models. To save time, we use the converted and quantized model by the awesome HuggingFace community user TheBloke. The pre-quantized models are available at https://huggingface.co/TheBloke/Llama-2-7B-GGUF. In the model repository name, GGUF refers to a new model file format introduced in August 2023 for llama.cpp.

To download the model files, first we install and initialize git-lfs (Git Large File Storage).

$ sudo apt install git-lfs
$ git lfs install

You should see "Git LFS initialized." printed in the terminal after the last command. Then, we can clone the repository with only pointers to the large files instead of downloading all of them.

$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/TheBloke/Llama-2-7B-GGUF

The one file we actually need is llama-2-7b.Q4_0.gguf, which is the Llama 2 7B model processed using one of the 4-bit quantization methods.

$ cd Llama-2-7B-GGUF
$ git lfs pull --include llama-2-7b.Q4_0.gguf
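
If the pull succeeded, the file should be an actual multi-gigabyte binary rather than a small Git LFS pointer file. A quick way to check:

$ ls -lh llama-2-7b.Q4_0.gguf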

With the pre-quantized model downloaded, we can execute the programs in llama.cpp for many purposes using the Llama 2 7B model. One of them is to generate text in an interactive mode.

$ cd ~/llama.cpp
$ ./main -m ~/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf --color \
    --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 \
    --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 10000

The last argument, "-ngl 10000", puts at most 10000 layers on the GPU. In the case of Llama-2 7B, that is more than the total number of layers in the model, so the entire model is placed on the CUDA GPU device. Feel free to follow the instructions on screen to play with the model.

Serving Llama-2 7B

Many useful programs are built when we execute the make command for llama.cpp. main is the one to use for generating text in the terminal. perplexity can be used to compute the perplexity against a given dataset for benchmarking purposes. In this part, we look at the server program, which provides a simple HTTP API server for models compatible with llama.cpp.

$ ./server -m ~/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf \
    -c 2048 -ngl 10000 --host localhost --port 8080

Then, you will have a language model completion API available at http://localhost:8080/completion. To try an example using curl:

$ curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
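
If you have jq installed, you can pipe the response through it to print just the generated text; the completion endpoint returns it in a field named content (see the server documentation for the full response schema):

$ curl -s --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}' \
    | jq -r '.content'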

More documentation about the API can be found in the llama.cpp repository. The instructions here are not limited to Llama 2 7B; one can follow the same guide for any model that is compatible with llama.cpp.
