How to Quantize Large Language Models - Super Lazy Coder

Super Lazy Coder Tutorial Summary

Table of Contents

  1. Quantization Technique for Large Language Models | 0:00:00-0:14:00
  2. Using GPTQ Technique for Model Quantization | 0:14:00-0:25:00
  3. Quantizing Models with Llama CPP Library | 0:25:00-10:10:10

Quantization Technique for Large Language Models | 0:00:00-0:14:00

The source for this section can be found here: Introduction to Weight Quantization

In this video, we discuss the quantization technique for large language models. Quantization is the process of reducing the size and computational requirements of a model by converting its parameters into lower-precision data types. The number of parameters in a model determines its size and computational requirements, so loading a model with a large number of parameters requires more memory. The precision of each parameter is determined by its data type, such as 32-bit float. The total memory needed to store the model is therefore roughly the number of parameters multiplied by the number of bits each parameter occupies.
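
A quick back-of-the-envelope calculation (a sketch of my own, not code from the video) shows why the bit width per parameter matters so much for a 7-billion-parameter model:

```python
# Approximate memory needed to hold the weights of a 7B-parameter model
# at different precisions (weights only, ignoring activations and overhead).
PARAMS = 7_000_000_000

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    gib = PARAMS * bits / 8 / 1024**3   # bits -> bytes -> GiB
    print(f"{name:>8}: {gib:5.1f} GiB")

# float32:  26.1 GiB
# float16:  13.0 GiB
#    int8:   6.5 GiB
#    int4:   3.3 GiB
```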

Quantization is a technique used to reduce the size of large language models so they can run on smaller machines. There are two major approaches: post-training quantization and quantization-aware training.

Post-training quantization compresses an already-trained model, while quantization-aware training applies the compression during training. GPTQ is a popular post-training technique that converts parameters to lower bit precision and is especially useful for GPUs with limited memory. GPTQ builds on Optimal Brain Quantization to compress the weights while minimizing the difference between the original and quantized weights, and it applies this compression layer by layer. Overall, quantization reduces the size of language models, making them more accessible on smaller machines.
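
The linked article on weight quantization illustrates the simplest post-training scheme, round-to-nearest absmax quantization; the sketch below is my own minimal version of that idea (it assumes PyTorch is installed and is not the video's code):

```python
import torch

def absmax_quantize(w: torch.Tensor):
    """Symmetric int8 quantization: map the largest absolute weight to 127."""
    scale = 127 / w.abs().max()
    q = (scale * w).round().to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Approximate reconstruction of the original float weights."""
    return q.to(torch.float32) / scale

w = torch.randn(4, 4)                  # toy weight matrix
q, scale = absmax_quantize(w)
w_hat = dequantize(q, scale)
print("max reconstruction error:", (w - w_hat).abs().max().item())
```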

The large language model used here, Gemma, is based on the transformer architecture. An input passes through multiple layers, including embedding, attention, normalization, and feed-forward layers, and the output is generated by the same stack of layers at the end. The model has roughly 7 billion parameters across these layers that need to be trained. The Optimal Brain Quantizer is used to compress the weights at each layer. GPTQ adds three techniques to make the compression process faster: the arbitrary order insight, lazy batch updates, and the Cholesky reformulation.
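
Layer-wise compression means each layer is quantized independently so that the layer's output changes as little as possible. In the GPTQ/OBQ formulation (reproduced here for clarity; the video does not show the formula), the per-layer objective is:

```latex
\arg\min_{\widehat{W}} \; \lVert W X - \widehat{W} X \rVert_2^2
```

where W is the layer's original weight matrix, X is a batch of layer inputs, and W-hat is the quantized weight matrix.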

Using GPTQ Technique for Model Quantization | 0:14:00-0:25:00

The source for this section can be found here: 4-Bit Quantization with GPTQ

GPTQ uses lazy batch updates to transform weights in batches, making the process faster, and the Cholesky reformulation to reduce numerical errors. To quantize a model, you need to install the auto-gptq library and import the necessary modules from transformers.
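
A minimal setup sketch (the exact package list is an assumption; recent transformers versions also rely on optimum and accelerate for GPTQ support):

```python
# In a shell: pip install auto-gptq optimum accelerate transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
```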

To use the GPTQ technique, you need AutoTokenizer and GPTQConfig. The tokenizer converts raw text into the token IDs the model consumes. Load the pre-trained tokenizer and create the GPTQConfig with the desired parameters, using a calibration dataset such as C4 for the quantization. Then load AutoModelForCausalLM with the model ID and the GPTQConfig to quantize the model, and push the quantized model to the Hugging Face Hub. GPTQ only works on GPUs.
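
A condensed sketch of that workflow (the model ID and Hub repository name are placeholders, and the video's exact code may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"   # placeholder model; swap in the model from the video

# The tokenizer is needed both for inference and for tokenizing the calibration data.
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ config; dataset="c4" tells transformers to draw calibration samples from C4.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Loading with quantization_config runs the GPTQ calibration (requires a GPU).
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# Optional: upload the quantized weights (requires a Hugging Face token/login).
quantized_model.push_to_hub("your-username/opt-125m-gptq-4bit")
tokenizer.push_to_hub("your-username/opt-125m-gptq-4bit")
```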

Quantizing Models with Llama CPP Library | 0:25:00-10:10:10

The source for this section can be found here: Quantize Llama models with GGUF and llama.cpp

The steps are summarized below.

To run models on CPUs, use the GGUF technique. GGUF builds on the ggml library, which is what the llama.cpp library consumes, and it allows running models on both GPUs and CPUs. Llama models are quantized with llama.cpp using different variations, such as Q2_K or Q8_0, depending on the desired precision. The attention and feed-forward layers are the most important and should not be compromised during weight compression.
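
As an illustration (the paths are placeholders and the binary name is an assumption; depending on the llama.cpp version the tool is called `quantize` or `llama-quantize`), each precision variant is produced by running the quantization tool on a 16-bit GGUF file:

```python
import subprocess

# Each method trades file size against quality; fewer bits means a smaller,
# but less accurate, model.
methods = ["q2_k", "q4_k_m", "q8_0"]

for method in methods:
    subprocess.run(
        [
            "./llama.cpp/quantize",           # assumed path to the binary built from the repo
            "models/model-f16.gguf",          # 16-bit GGUF produced by the convert script
            f"models/model-{method}.gguf",    # output file for this variant
            method,
        ],
        check=True,
    )
```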

Steps: 1) clone the llama.cpp repository, 2) install its prerequisites, 3) convert the model to a 16-bit GGML/GGUF file, 4) run the quantization with the chosen methods to produce models in the GGUF format, and 5) load and use the quantized models with the llama-cpp-python library. The two popular approaches to quantization covered here are GGML/GGUF and GPTQ.
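
A minimal usage sketch with llama-cpp-python (install via `pip install llama-cpp-python`; the model path, prompt, and generation parameters are placeholders):

```python
from llama_cpp import Llama

# Load a quantized GGUF model; n_ctx sets the context window size.
llm = Llama(model_path="models/model-q4_k_m.gguf", n_ctx=2048)

# Run a simple completion on CPU (or GPU, if llama-cpp-python was built with GPU support).
output = llm(
    "Explain quantization in one sentence:",
    max_tokens=64,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```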
