Quantization guide
Post-training quantization and adjusting the sensitivity threshold
Post-training quantization (PTQ) refers to quantizing a pre-trained model without running a training loop.
To configure post-training quantization in ACE, adjust the quantization_sensitivity_threshold parameter.
This parameter controls the trade-off between accuracy and the degree of model compression; we recommend starting with a threshold value of 0.005.
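As an illustration only, a hypothetical configuration sketch follows. The surrounding names (Settings, clika_compress) are placeholders rather than the documented ACE API, so the corresponding calls are left commented out; only the quantization_sensitivity_threshold parameter name and the 0.005 starting value come from this guide.

```python
import torchvision

# Pre-trained model to be compressed (placeholder choice for illustration).
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Hypothetical sketch: `Settings` and `clika_compress` are illustrative
# placeholders, NOT the documented ACE API; consult the ACE API reference
# for the real entry point. Only `quantization_sensitivity_threshold` and
# the 0.005 starting value are taken from this guide.
#
# settings = Settings()
# settings.quantization_sensitivity_threshold = 0.005
# clika_model = clika_compress(model=model, settings=settings)
```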
Upon completing the compression process, examine the generated logs, which include a helpful summary table (see the output logs example).
This table provides guidance for adjusting the threshold based on the desired level of model compression.
Be aware that greater compression often reduces model accuracy.
Nevertheless, the model can subsequently be fine-tuned to recover accuracy, as ClikaModule instances are designed to be compatible with torch.nn.Module.
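Because ClikaModule follows the torch.nn.Module interface, standard module operations apply directly. A minimal sketch, where clika_model stands in for a compressed model produced by ACE and the input shape is an assumption chosen for illustration:

```python
import torch

# `clika_model` stands for the ClikaModule produced by ACE. Because it
# follows the torch.nn.Module interface, the usual module operations
# apply directly.
device = "cuda" if torch.cuda.is_available() else "cpu"
clika_model = clika_model.to(device).eval()

# Hypothetical input shape, assuming an image classification model.
dummy_input = torch.randn(1, 3, 224, 224, device=device)
with torch.no_grad():
    outputs = clika_model(dummy_input)
```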
Quantization-aware training
Quantization-aware training (QAT) integrates the quantization process into the model training loop.
Doing so allows the model parameters to adapt as the model precision is reduced, potentially improving
the accuracy of the quantized model.
While the primary recommended entry point to quantization-based compression is the PTQ feature set of ACE,
quantization-aware training is also a supported SDK feature. As ClikaModule instances are compatible with
the torch.nn.Module interface, a ClikaModule instance can be used in place of the original torch.nn.Module
object undergoing fine-tuning. Note that QAT compression workflows are typically slower and require more
computational resources than their PTQ counterparts.
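A minimal sketch of such a fine-tuning loop, treating the ClikaModule exactly like a torch.nn.Module; the data loader, loss function, and hyperparameters are placeholders chosen for illustration:

```python
import torch

def finetune(clika_model: torch.nn.Module,
             loader: torch.utils.data.DataLoader,
             epochs: int = 1,
             lr: float = 1e-4) -> None:
    """Ordinary PyTorch fine-tuning loop; the compressed ClikaModule is
    used exactly like any other torch.nn.Module."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    clika_model.to(device).train()
    optimizer = torch.optim.AdamW(clika_model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()  # placeholder loss

    for _ in range(epochs):
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(clika_model(inputs), targets)
            loss.backward()
            optimizer.step()
```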
Graph visualization
After following the instructions to install the optional requirements,
a visualization of the resulting graph can be created using clika_model.clika_visualize(...).
Graph visualizations include the following information about each layer:
- Name and type
- Input and output shapes
- Layer attributes, such as kernel size, stride, etc.
Additionally, quantization sensitivities are printed as a table to the log file (see the output logs example).
Quantization sensitivity (QS) measures the difference between the original and quantized outputs of each layer, and is used primarily as a metric to determine which layers should be skipped during quantization. A sensitivity value of 0.0 indicates that the quantized layer is numerically identical to the original layer. Values of 1.0 and above indicate substantially different numerical outputs compared to the original layer; however, the results may still be acceptable depending on the model and architecture in question.
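To make the idea concrete, the following is an illustrative sketch of a sensitivity-style comparison between a layer's original and quantized outputs. It is not the formula ACE itself uses, which this guide does not specify; a relative L2 distance is assumed purely for demonstration.

```python
import torch

def output_sensitivity(original: torch.Tensor, quantized: torch.Tensor) -> float:
    """Illustrative sensitivity-style metric: relative L2 distance between a
    layer's original and quantized outputs. 0.0 means numerically identical;
    values near or above 1.0 mean substantially different outputs. This is
    NOT ACE's actual formula, which this guide does not specify."""
    diff = torch.linalg.vector_norm(original - quantized)
    scale = torch.linalg.vector_norm(original).clamp_min(1e-12)
    return (diff / scale).item()

# Example: simulate a layer output and a coarsely quantized copy of it.
x = torch.randn(1, 64, 28, 28)
x_q = torch.round(x / 0.1) * 0.1   # crude uniform quantization, step 0.1
print(output_sensitivity(x, x_q))  # small value: outputs remain close
```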