
A Perplexity Benchmark of llama.cpp

Without further ado, here are the results (explanations and discussions later):

Table 1: Perplexity on the wikitext-2 test set.

Model            | q4_0   | q4_1   | q5_0   | q5_1   | q8_0   | fp16
llama-7b         | 6.157  | 6.0915 | 5.9846 | 5.948  | 5.9063 | 5.68
llama-13b        | 5.385  | 5.3608 | 5.285  | 5.2702 | 5.2547 | 5.09
llama-30b        | 4.2707 | -      | -      | -      | -      | 4.1
alpaca-30b       | 4.4521 | -      | -      | -      | -      | -
llama-2-7b       | 5.9675 | 6.0398 | 5.8328 | 5.8435 | 5.7897 | -
llama-2-7b-chat  | 7.7641 | 7.7853 | 7.5055 | 7.5392 | 7.5014 | -
llama-2-13b      | 5.2172 | 5.2115 | 5.1343 | 5.1289 | 5.1005 | -
llama-2-13b-chat | 6.6296 | 6.7059 | 6.5336 | 6.5771 | 6.5361 | -

Except for the fp16 column, these perplexity numbers were obtained by running the perplexity program from llama.cpp on the test set of the wikitext-2 dataset. qM_N denotes a quantization format using M bits per weight, where N selects the underlying quantization variant. Throughout the development of llama.cpp there have been several attempts at improving the quantization methods even at the same bit width, hence the N.
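To make the qM_N naming more concrete, here is a simplified sketch of the idea behind q4_0: weights are grouped into blocks of 32, and each block stores one shared floating-point scale plus a signed 4-bit value per weight. This is only an illustration of the scheme; llama.cpp's actual rounding and storage layout differ in the details.

```python
def quantize_q4_0(block):
    # block: list of 32 float weights (llama.cpp's q4_0 block size).
    # One shared scale maps each weight onto a signed 4-bit level in
    # [-8, 7]; the real implementation's rounding rule differs slightly.
    assert len(block) == 32
    amax = max(abs(w) for w in block)
    scale = amax / 7.0 if amax > 0 else 1.0
    quants = [max(-8, min(7, round(w / scale))) for w in block]
    return scale, quants

def dequantize_q4_0(scale, quants):
    # Reconstruct approximate weights; error per weight is at most
    # half a quantization step (scale / 2).
    return [scale * q for q in quants]
```

Variants like q4_1 additionally store a per-block offset, and q5_0/q5_1 carry one extra bit per weight, trading a little memory for lower quantization error.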

All of these results are obtained on an NVIDIA L4 instance from the Google Cloud Platform, thanks to the sponsorship from Philippe Beaudoin, Co-founder and CEO at Waverly. The NVIDIA L4 GPU has 24GB of VRAM, which is sufficient for running llama-30b quantized at q4_0, but not enough for anything beyond that.

Perplexity measures the likelihood of the next token conditioned on the previous tokens in a text sequence. Smaller perplexity means better sequence modeling on the given dataset. The table makes it clear that the quantization methods implemented in llama.cpp are quite good. It also shows that the determining factor for a large language model's performance is still the number of parameters, even under aggressive quantization.
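Concretely, perplexity is the exponential of the average negative log-likelihood that the model assigns to each observed token given its context. A minimal sketch (assuming you already have the per-token log-probabilities, which is what llama.cpp's perplexity program computes internally):

```python
import math

def perplexity(logprobs):
    # logprobs: natural-log probability of each observed token,
    # conditioned on its preceding context.
    # Perplexity = exp(mean negative log-likelihood).
    return math.exp(-sum(logprobs) / len(logprobs))

# A model that assigns every token probability 1/2 has perplexity 2:
# it is, on average, as uncertain as a fair coin flip per token.
```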

To demonstrate the last point better, here is a table detailing the VRAM requirements for parameters.

Table 2: VRAM requirement for model parameters in MB.

Model            | q4_0  | q4_1  | q5_0  | q5_1  | q8_0
llama-7b         | 4090  | 4484  | 4877  | 5271  | 7240
llama-13b        | 7656  | 8422  | 9188  | 9954  | 13784
llama-30b        | 18555 | 20481 | 22407 | 24333 | 33964
alpaca-30b       | 18555 | 20481 | 22407 | 24333 | 33964
llama-2-7b       | 4090  | 4484  | 4877  | 5271  | 7240
llama-2-7b-chat  | 4090  | 4484  | 4877  | 5271  | 7240
llama-2-13b      | 7656  | 8422  | 9188  | 9954  | 13784
llama-2-13b-chat | 7656  | 8422  | 9188  | 9954  | 13784

Note that these figures cover only the memory cost of the parameters. Running the model requires additional memory for the internal activations and the inputs/outputs, which is why we were not able to run q4_1 for llama-30b and alpaca-30b even though their parameters alone do fit on an L4 GPU.
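The per-format sizes follow from the block layouts in ggml: each block covers 32 weights and adds a small fixed overhead (an fp16 scale, plus an fp16 offset and/or extra bits depending on the variant), giving an effective bits-per-weight slightly above the nominal M. A rough estimator, assuming the legacy ggml block layouts (the table's column-to-column differences are consistent with these half-bit steps):

```python
BITS_PER_WEIGHT = {
    # Effective bits per weight for llama.cpp's legacy block formats;
    # each block holds 32 weights plus per-block metadata.
    "q4_0": 4.5,  # 16 B of 4-bit values + 2 B fp16 scale = 18 B / 32 weights
    "q4_1": 5.0,  # as q4_0, plus a 2 B fp16 offset        = 20 B / 32 weights
    "q5_0": 5.5,  # 4-bit values + 4 B of fifth bits + scale = 22 B / 32 weights
    "q5_1": 6.0,  # as q5_0, plus a 2 B fp16 offset          = 24 B / 32 weights
    "q8_0": 8.5,  # 32 B of 8-bit values + 2 B fp16 scale    = 34 B / 32 weights
}

def param_mib(n_weights, quant):
    # Approximate parameter storage in MiB for n_weights in the given format.
    return n_weights * BITS_PER_WEIGHT[quant] / 8 / 2**20
```

This is only a sketch of the parameter storage; it ignores any layers kept at higher precision and, as noted above, the activation memory needed at runtime.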

The memory cost of llama-13b-q4_0 is very close to that of llama-7b-q8_0, yet llama-13b-q4_0 significantly outperforms llama-7b-q8_0 in the perplexity table. The same observation holds for the llama-2 models. In other words, under the same memory constraint, the number of parameters is a more important factor than the quantization level for the performance of large language models.

There are some other conclusions one can draw from these numbers, but I will leave those for the readers to interpret. I'm in the process of benchmarking the inference speed of these models, and will publish the results in an upcoming blog post.

