
A Perplexity Benchmark of llama.cpp

Without further ado, here are the results (explanations and discussions later):

Table 1: Perplexity on wikitext-2 test set.
Model \ Quantization    q4_0     q4_1     q5_0     q5_1     q8_0     fp16
llama-7b                6.157    6.0915   5.9846   5.948    5.9063   5.68
llama-13b               5.385    5.3608   5.285    5.2702   5.2547   5.09
llama-30b               4.2707   -        -        -        -        4.1
alpaca-30b              4.4521   -        -        -        -        -
llama-2-7b              5.9675   6.0398   5.8328   5.8435   5.7897   -
llama-2-7b-chat         7.7641   7.7853   7.5055   7.5392   7.5014   -
llama-2-13b             5.2172   5.2115   5.1343   5.1289   5.1005   -
llama-2-13b-chat        6.6296   6.7059   6.5336   6.5771   6.5361   -

Other than the fp16 results, these are perplexity numbers obtained by running the perplexity program from llama.cpp on the test set of the wikitext-2 dataset. qM_N denotes an M-bit quantization method, where N selects the variant of the underlying quantization algorithm; throughout the development of llama.cpp there have been several attempts to improve quantization even at the same bit width, hence the extra selector N.

All of these results are obtained on an NVIDIA L4 instance from the Google Cloud Platform, thanks to the sponsorship from Philippe Beaudoin, Co-founder and CEO at Waverly. The NVIDIA L4 GPU has 24GB of VRAM, which is sufficient for running llama-30b quantized at q4_0, but not enough for anything beyond that.

Perplexity measures how well a model predicts each next token conditioned on the previous tokens in a text sequence: it is the exponential of the average negative log-likelihood per token, so a smaller perplexity means better sequence modeling on the given dataset. From the table, it is clear that the quantization methods implemented in llama.cpp hold up quite well. Moreover, the determining factor for a large language model's performance is still the number of parameters, even under aggressive quantization.
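
For concreteness, here is a minimal sketch of the calculation (the actual perplexity program in llama.cpp evaluates the model over fixed-size chunks of the text, but the underlying formula is the same):

```python
import math

def perplexity(token_logprobs):
    # token_logprobs: natural-log probabilities the model assigned to each
    # actual next token in the evaluation text.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Toy example with made-up log-probabilities, not real wikitext-2 numbers:
print(perplexity([math.log(0.25), math.log(0.5), math.log(0.1)]))  # ~4.31
```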

To put the last point in context, here is a table detailing the VRAM required to hold the model parameters at each quantization level.

Table 2: VRAM requirement for model parameters in MB.
Model \ Quantization    q4_0    q4_1    q5_0    q5_1    q8_0
llama-7b                 4090    4484    4877    5271    7240
llama-13b                7656    8422    9188    9954   13784
llama-30b               18555   20481   22407   24333   33964
alpaca-30b              18555   20481   22407   24333   33964
llama-2-7b               4090    4484    4877    5271    7240
llama-2-7b-chat          4090    4484    4877    5271    7240
llama-2-13b              7656    8422    9188    9954   13784
llama-2-13b-chat         7656    8422    9188    9954   13784

Note that these figures cover only the parameters. When actually running a model, additional memory is needed for intermediate activations, the KV cache, and inputs/outputs, which is why we were not able to run q4_1 for llama-30b and alpaca-30b even though their parameters alone fit on an L4 GPU.
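
As a rough sanity check on Table 2, here is a back-of-envelope sketch. The ggml formats quantize weights in blocks of 32 with a small per-block header, which works out to roughly 4.5 bits per weight for q4_0, 5 for q4_1, 5.5 for q5_0, 6 for q5_1, and 8.5 for q8_0. The parameter counts below are approximate, and a few tensors (e.g., the normalization weights) stay in higher precision, so the real numbers in Table 2 come out somewhat larger:

```python
# Rough estimate of parameter memory for a quantized model.
BITS_PER_WEIGHT = {
    "q4_0": 4.5,   # per block of 32: 16-bit scale + 32 x 4-bit quants
    "q4_1": 5.0,   # adds a 16-bit minimum per block
    "q5_0": 5.5,
    "q5_1": 6.0,
    "q8_0": 8.5,   # per block of 32: 16-bit scale + 32 x 8-bit quants
    "fp16": 16.0,
}

def param_memory_mib(n_params, quant):
    # Approximate memory in MiB, ignoring the few tensors kept in
    # higher precision.
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**20

# Approximate (hypothetical) parameter counts:
print(round(param_memory_mib(6.7e9, "q4_0")))   # ~3594, vs. 4090 in Table 2
print(round(param_memory_mib(13.0e9, "q4_0")))  # ~6974, vs. 7656 in Table 2
```

Even as a rough estimate, it reproduces the key ratio: a 13B model at ~4.5 bits per weight costs about as much memory as a 7B model at ~8.5 bits per weight, which is exactly the comparison made below.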

The memory cost of llama-13b-q4_0 is very close to that of llama-7b-q8_0, yet llama-13b-q4_0 significantly outperforms llama-7b-q8_0 in the perplexity table. The same holds for the llama-2 models. In other words, under the same memory constraint, the number of parameters is a more important factor than the quantization level for the performance of large language models.

There are other conclusions one can draw from these numbers, but I will leave those for the readers to interpret. I am currently benchmarking the inference speed of these models, and will publish the results in a future blog post.
