Quantization guide
When choosing a learning rate for an ACE session, it is good practice to set it to the last learning rate used by the optimizer at the end of the original (fp32/fp16) model training.
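For example, if the original model was trained with PyTorch, the final learning rate can simply be read back from the optimizer at the end of the training run. This is a minimal sketch with an illustrative optimizer and scheduler, not a CLIKA-specific API:

import torch

model = torch.nn.Linear(10, 10)  # stand-in for the original fp32/fp16 model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# ... original training loop runs here, stepping the optimizer and scheduler ...

# The last learning rate used by the optimizer; reuse this value for the ACE session.
last_lr = optimizer.param_groups[0]["lr"]
print(f"Use lr={last_lr} for the ACE session")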
Selective and semi-automatic quantization
By default, only the model tail (layers from the output node upstream to the last weighted layer(s) in the model) will be automatically skipped for quantization.
The following methods allow the user to customize which layers will be skipped:
- To avoid automatically skipping the model tail, set global_quantization_settings.automatic_skip_quantization=False in the Settings object (see the sketch directly below this list).
- To skip quantization in the interval between two specific layers, use set_quantization_settings_for_layer (see example #1 below).
- To skip quantization downstream from a particular layer in the graph all the way to the last layer(s), set LayerQuantizationSettings.skip_quantization_downstream=True (see example #2 below).
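For instance, to quantize the model tail instead of skipping it automatically, the first option above can be applied as follows. This is a minimal sketch, assuming global_quantization_settings is accessed directly as an attribute of the Settings object, as the attribute path above suggests:

from clika_compression import Settings

settings = Settings()  # Default settings

# Do not automatically skip quantization of the model tail
settings.global_quantization_settings.automatic_skip_quantization = False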
When invoking these types of customizations, it is recommended to examine the generated graph visualizations: model_init.svg (which corresponds to the original model graph) and model_post_preprocessing.svg (the quantized model graph).
To ensure that these files will be generated, follow the relevant instructions here. Additionally, refer to the "parsing layers" part in the last section of the CCO setup on the Output Log Breakdown page.
Example #1
To skip quantization of all layers between the layer adaptive_avg_pool_1 and the layer linear_5:
from clika_compression import LayerQuantizationSettings, Settings

settings = Settings()  # Default settings
settings.set_quantization_settings_for_layer(
    "adaptive_avg_pool_1",
    LayerQuantizationSettings(skip_quantization=True, skip_quantization_until=["linear_5"]),
)
You can specify more than one destination in the skip_quantization_until argument.
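For example, multiple destination layers can be listed together. This is a minimal sketch; linear_8 is a hypothetical layer name used only for illustration:

from clika_compression import LayerQuantizationSettings, Settings

settings = Settings()  # Default settings
settings.set_quantization_settings_for_layer(
    "adaptive_avg_pool_1",
    # "linear_8" is a hypothetical layer name, shown only to illustrate multiple destinations
    LayerQuantizationSettings(skip_quantization=True, skip_quantization_until=["linear_5", "linear_8"]),
)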
Example #2
To skip all layers from the layer named adaptive_avg_pool_1 to the last layer of the model:
from clika_compression import LayerQuantizationSettings, Settings

settings = Settings()  # Default settings
settings.set_quantization_settings_for_layer(
    "adaptive_avg_pool_1",
    LayerQuantizationSettings(skip_quantization=True, skip_quantization_downstream=True),
)
Graph visualization
After following the instructions to install the optional requirements, the following files will be generated in your outputs folder:
- model_init.{svg, dot}: architecture of the original model before compression
- model_post_preprocessing.{svg, dot}: architecture of the quantized model after compression
The color-coding for the layers is as follows:
- Blue: an input or output node
- Green: quantized layers
- Yellow: non-quantized layers
Graph visualizations include the following information about each layer:
- Name and type
- Input and output shapes
- Layer attributes, such as kernel size, strides, etc.
- Quantization Sensitivity (QS)
Quantization sensitivity (QS) is a measure of the difference between the original and quantized outputs of each layer, and is used primarily as a metric to determine which layers should be skipped for quantization. The QS is computed as a relative L2 norm between the quantized and float layer outputs; the higher the number, the more difficult the operation is to quantize.
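As an illustration, a relative L2 norm of this kind can be computed as the norm of the output difference divided by the norm of the float output. This is a minimal sketch of that definition, not necessarily the exact formula used internally:

import torch

def quantization_sensitivity(float_out: torch.Tensor, quant_out: torch.Tensor) -> float:
    # Relative L2 norm: ||quantized - float||_2 / ||float||_2
    # (illustrative definition; the value shown in the graph may be computed or scaled differently)
    return (torch.linalg.norm(quant_out - float_out) / torch.linalg.norm(float_out)).item()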
It is recommended to skip quantization for layers with a QS value above 10,000, since the higher the QS value, the longer it will take ACE to compress the model.
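Once such a layer has been identified in the graph visualization, it can be skipped with the same API shown in the examples above. This is a minimal sketch, assuming that passing skip_quantization=True on its own skips only that layer; my_high_qs_layer is a hypothetical layer name:

from clika_compression import LayerQuantizationSettings, Settings

settings = Settings()  # Default settings
settings.set_quantization_settings_for_layer(
    # hypothetical name of a layer whose QS exceeded 10,000 in model_post_preprocessing.svg
    "my_high_qs_layer",
    LayerQuantizationSettings(skip_quantization=True),
)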