Quantization Guide
Selective/Semi-Automatic Quantization
By default, only the model tail (layers from the output node upstream to the last weighted layer(s) in the model) will be skipped automatically.
If you wish to customize which layers will be skipped, you can use the following methods:

- To avoid the automatic skipping of the model tail, set `global_quantization_settings.automatic_skip_quantization = False` in your `Settings` object (see the sketch right after this list).
- To skip quantization in the interval between two layers, use `set_quantization_settings_for_layer` (see Example #1 below).
- To skip quantization downstream, i.e. from a layer in the graph all the way to the last layer(s), set `cc.LayerQuantizationSettings.skip_quantization_downstream = True` (see Example #2 below).
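A minimal sketch of the first option, assuming the `global_quantization_settings` attribute is reachable on the object returned by `cc.generate_default_settings()` as described above:

```python
import clika_compression as cc

settings = cc.generate_default_settings()
# Disable the automatic skipping of the model tail so it is quantized as well.
settings.global_quantization_settings.automatic_skip_quantization = False
```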
For this kind of customization it is recommended to use the generated graph visualizations in `model_init.png` (the original model graph) and `model_post_preprocessing.png` (the quantized model graph). To make sure they are generated, follow the relevant instructions here. You may also refer to the "parsing layers" section at the end of the CCO setup section on the Output Log Breakdown page.
Example #1
If we wanted to skip all layers between the layer named `adaptive_avg_pool` and the layer named `linear`:
```python
import clika_compression as cc

settings = cc.generate_default_settings()

# Skip quantization for all layers between "adaptive_avg_pool" and "linear".
settings.set_quantization_settings_for_layer(
    "adaptive_avg_pool",
    cc.LayerQuantizationSettings(skip_quantization=True, skip_quantization_until=["linear"]),
)
```
You can also specify more than one destination in `skip_quantization_until`.
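For instance, a minimal sketch continuing the snippet above, with two hypothetical destination layers `linear_1` and `linear_2` (placeholder names used only to illustrate branching paths):

```python
# "linear_1" and "linear_2" are hypothetical layer names; skipping stops
# when either destination is reached.
settings.set_quantization_settings_for_layer(
    "adaptive_avg_pool",
    cc.LayerQuantizationSettings(
        skip_quantization=True,
        skip_quantization_until=["linear_1", "linear_2"],
    ),
)
```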
Example #2
If we wanted to skip all layers from the layer named `adaptive_avg_pool` to the last layer:
```python
import clika_compression as cc

settings = cc.generate_default_settings()

# Skip quantization from "adaptive_avg_pool" all the way down to the last layer(s).
settings.set_quantization_settings_for_layer(
    "adaptive_avg_pool",
    cc.LayerQuantizationSettings(skip_quantization=True, skip_quantization_downstream=True),
)
```
If you want to quantize an Embedding layer, make sure to skip quantization of its input. For example, if `input_0` is the input to an Embedding layer, it must be skipped as follows:
```python
import clika_compression as cc

settings = cc.generate_default_settings()

# Skip quantization for the Embedding layer's input.
settings.set_quantization_settings_for_layer(
    "input_0",
    cc.LayerQuantizationSettings(skip_quantization=True),
)
```
Graph Visualization
If you follow the instructions to install the optional requirements, the following files will be generated in your `outputs` folder:

- `model_init.png` - architecture of the original model before compression
- `model_post_preprocessing.png` - architecture of the quantized model after compression
The color-coding for the layers is as follows:
- Blue - an input or output node
- Green - a quantized layer
- Yellow - a non-quantized layer
Graph Visualization includes the following information about each layer:
- Name and type
- Input and output shapes
- Layer attributes such as kernel size, strides, etc.
- Quantization Sensitivity (QS)
Quantization Sensitivity (QS) is a measurement of the difference between the original and quantized outputs of each layer. It is used to determine which layers should be skipped.
The QS is computed as the L2-norm between the quantized and float layer outputs; the higher the number, the harder it is to quantize.
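A rough sketch of this measurement in NumPy; this only illustrates the L2-norm definition above and is not CCO's internal implementation:

```python
import numpy as np

def quantization_sensitivity(float_out: np.ndarray, quant_out: np.ndarray) -> float:
    """L2-norm of the difference between a layer's float and quantized outputs."""
    return float(np.linalg.norm(float_out.ravel() - quant_out.ravel()))

# Placeholder arrays; in practice these would be the outputs of the same layer
# from the float model and the quantized model on identical inputs.
float_out = np.random.randn(1, 128).astype(np.float32)
quant_out = float_out + 0.01 * np.random.randn(1, 128).astype(np.float32)
print(quantization_sensitivity(float_out, quant_out))  # small value = easy to quantize
```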
It is recommended to skip quantization for layers with a QS value above 10,000, since the higher the QS value, the longer it will take for CCO to compress the model.