CLIKA Compression Settings Documentation

This documentation outlines the settings available for configuring CLIKA Compression, as defined in `cc_settings.py`.
Base Classes

`_BaseHalfFrozen`

Base class that ensures only existing attributes can be set on settings dataclasses. It also provides methods for serialization and deserialization.
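The "half-frozen" behavior can be sketched as follows. This is an illustrative stand-in, not the actual `_BaseHalfFrozen` implementation from `cc_settings.py`: it mimics the documented behavior of rejecting attributes that were never declared, plus a simple serialization helper.

```python
from dataclasses import asdict, dataclass, fields

# Illustrative sketch only; the real _BaseHalfFrozen in cc_settings.py
# may differ. Declared fields can be reassigned freely, but setting an
# undeclared attribute (e.g. a typo) raises immediately.
@dataclass
class HalfFrozenExample:
    weights_num_bits: int = 8

    def __setattr__(self, name, value):
        # Allow setting only attributes that exist as declared fields.
        if name not in {f.name for f in fields(self)}:
            raise AttributeError(f"Unknown setting: {name!r}")
        super().__setattr__(name, value)

    def to_dict(self) -> dict:
        # Serialization: dataclass fields to a plain dict.
        return asdict(self)
```

Guarding `__setattr__` this way turns silent configuration typos (for example, `weights_num_bit` instead of `weights_num_bits`) into immediate errors.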
`BaseSettings`

Inherits from `_BaseHalfFrozen`. Serves as a general base for specific setting types.
`BaseDeploymentSettings`

Base class that the other deployment settings inherit from. Includes logic to initialize specific deployment framework settings based on a `target_framework` key.
Methods

`initialize_from_dict(cls, settings: Optional[dict])`: Generates a DeploymentSettings object from a dictionary. Requires a `target_framework` key ("openvino", "ov", "tensorrt", "trt", "tensorrt-llm", "trtllm", "trt-llm", "trt_llm", "ort", "onnxruntime").

`is_any_TensorRT()`: Checks if the instance is any TensorRT type.

`is_TensorRT_ONNX()`: Checks if the instance is `DeploymentSettings_TensorRT_ONNX`.

`is_TensorRT_LLM_ONNX()`: Checks if the instance is `DeploymentSettings_TensorRT_LLM_ONNX`.

`is_ONNXRuntime_ONNX()`: Checks if the instance is `DeploymentSettings_ONNXRuntime_ONNX`.

`is_OpenVINO_ONNX()`: Checks if the instance is `DeploymentSettings_OpenVINO_ONNX`.
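The alias handling implied by `initialize_from_dict` can be sketched as below. This is a hypothetical helper, not the CLIKA method itself: the real method returns a `DeploymentSettings_*` instance, whereas this sketch only resolves an alias to its canonical framework name.

```python
# Hypothetical sketch of the target_framework alias dispatch; the alias
# list comes from the documentation above. Not part of the CLIKA API.
_ALIASES = {
    "openvino": "ov", "ov": "ov",
    "tensorrt": "trt", "trt": "trt",
    "tensorrt-llm": "trt_llm", "trtllm": "trt_llm",
    "trt-llm": "trt_llm", "trt_llm": "trt_llm",
    "ort": "ort", "onnxruntime": "ort",
}

def resolve_target_framework(settings: dict) -> str:
    if "target_framework" not in settings:
        raise ValueError("'target_framework' key is required")
    key = str(settings["target_framework"]).lower()
    if key not in _ALIASES:
        raise ValueError(f"Unknown target_framework: {key!r}")
    return _ALIASES[key]
```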
Deployment Settings

`DeploymentSettings_TensorRT_ONNX`

CLASS - `DeploymentSettings_TensorRT_ONNX()`

Use this if you wish to deploy to NVIDIA's TensorRT in `Settings.deployment_settings`. Sets `target_framework` to `"trt"`.

`DeploymentSettings_TensorRT_LLM_ONNX`

CLASS - `DeploymentSettings_TensorRT_LLM_ONNX()`

Use this if you wish to deploy to NVIDIA's TensorRT-LLM in `Settings.deployment_settings`. Sets `target_framework` to `"trt_llm"`.

`DeploymentSettings_ONNXRuntime_ONNX`

CLASS - `DeploymentSettings_ONNXRuntime_ONNX()`

Use this if you wish to deploy to Microsoft's ONNX Runtime in `Settings.deployment_settings`. Sets `target_framework` to `"ort"`.

`DeploymentSettings_OpenVINO_ONNX`

CLASS - `DeploymentSettings_OpenVINO_ONNX()`

Use this if you wish to deploy to Intel's OpenVINO in `Settings.deployment_settings`. Sets `target_framework` to `"ov"`.
Quantization Settings

`QuantizationSettings`

CLASS - `QuantizationSettings(...)`

Holds settings related to model quantization.

Attributes

`weights_num_bits`: `Union[int, List[int], Tuple[int]]` (Default: `[8, 4]`) - Number of bits for weights quantization. A list/tuple means potential candidates will be evaluated.

`activations_num_bits`: `Union[int, List[int], Tuple[int]]` (Default: `[8]`) - Number of bits for activations quantization. A list/tuple means potential candidates will be evaluated. Currently limited to 8 bits.

`prefer_weights_only_quantization`: `Optional[bool]` (Default: `None`) - Controls preference for quantization type: `None` (best of all types), `True` (weights only), `False` (weights + activations).

`weights_only_quantization_block_size`: `Optional[Union[int, List[int], Tuple[int]]]` (Default: `[0, 32, 64, 128, 256, 512]`) - Block size for weights-only quantization (relevant for linear layers). `0` means per-channel. Values must be powers of two between 16 and 512, or 0.

`quantization_sensitivity_threshold`: `Optional[Union[int, float]]` (Default: `None`) - Sensitivity threshold. Layers with sensitivity above this value won't be considered for quantization. Higher values are more destructive. Guideline: 0.1-0.2 if fine-tuning, <= 0.05 otherwise.

`weights_utilize_full_int_range`: `Optional[bool]` (Default: `None`) - Whether to utilize the full integer range (e.g., `[-128, 127]` vs `[-63, 63]`). If `None`, defaults are set based on the deployment target (`False` for OpenVINO/ONNXRuntime, `True` for QNN/TensorRT) unless explicitly set. See the ONNXRuntime explanation at https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html#when-and-why-do-i-need-to-try-u8u8

`quantization_cache_file`: `Optional[Union[str, Path]]` (Default: `None`) - (Not Implemented Yet) Path to cache quantization analysis results.

`one_extra_bit_for_symmetric_weights`: `Optional[bool]` (Default: `None`) - Allows a symmetric weight quantization range of `[-N-1, N]` instead of `[-N, N]`. Applied only for symmetric weight quantization. Leave as `None` if unsure.
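The constraint on `weights_only_quantization_block_size` (0 for per-channel, otherwise a power of two between 16 and 512) can be expressed as a small check. This helper is illustrative only and not part of the CLIKA API:

```python
# Hypothetical validator for the documented block-size constraint on
# weights_only_quantization_block_size. Not part of the CLIKA API.
def is_valid_block_size(block_size: int) -> bool:
    if block_size == 0:
        return True  # 0 means per-channel quantization
    # n & (n - 1) == 0 holds exactly when n is a power of two (n > 0).
    return 16 <= block_size <= 512 and block_size & (block_size - 1) == 0
```

Note that the default candidate list `[0, 32, 64, 128, 256, 512]` satisfies this constraint entry by entry.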
Methods

`initialize_from_dict(cls, settings: Optional[dict])`: Initializes a `QuantizationSettings` object from a dictionary.
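Dict-based initialization of a settings dataclass typically validates the incoming keys against the declared fields. The sketch below is hypothetical (the real CLIKA method may behave differently); it shows the general pattern of rejecting unknown keys rather than silently dropping them:

```python
from dataclasses import dataclass, fields
from typing import Optional

# Hypothetical sketch of dict-based initialization for a settings
# dataclass; only two of the documented fields are mirrored here.
@dataclass
class QuantSketch:
    weights_num_bits: object = None
    activations_num_bits: object = None

    @classmethod
    def initialize_from_dict(cls, settings: Optional[dict]) -> "QuantSketch":
        if settings is None:
            return cls()  # all defaults
        known = {f.name for f in fields(cls)}
        unknown = set(settings) - known
        if unknown:
            # Fail loudly on stale or misspelled config entries.
            raise ValueError(f"Unknown settings: {sorted(unknown)}")
        return cls(**settings)
```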
`LayerQuantizationSettings`

CLASS - `LayerQuantizationSettings(...)`

Inherits from `QuantizationSettings` and allows specifying quantization settings for a specific layer. Excludes the `quantization_sensitivity_threshold`, `weights_utilize_full_int_range`, and `quantization_cache_file` attributes.

Attributes (in addition to inherited ones)

`skip_quantization`: `bool` (Default: `False`) - Skip quantization for this specific layer.

`skip_quantization_downstream`: `bool` (Default: `False`) - Skip quantization for this layer and all subsequent layers in the graph.

`skip_quantization_until`: `Optional[Union[str, Tuple[str], List[str]]]` (Default: `None`) - Skip quantization from this layer up to (but not including) the specified layer(s).
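The "up to (but not including)" semantics of `skip_quantization_until` can be sketched on a linear sequence of layer names. This is an illustrative model, not the CLIKA implementation, which operates on the actual model graph:

```python
from typing import List, Union

# Illustrative sketch (not the CLIKA implementation) of the documented
# skip_quantization_until semantics on a linear layer ordering: skip
# from the configured layer up to, but not including, a stop layer.
def layers_to_skip(ordered_layers: List[str],
                   start: str,
                   stop_layers: Union[str, List[str]]) -> List[str]:
    stops = {stop_layers} if isinstance(stop_layers, str) else set(stop_layers)
    skipped, active = [], False
    for name in ordered_layers:
        if name == start:
            active = True   # skipping begins at the configured layer
        if name in stops:
            active = False  # the stop layer itself is NOT skipped
        if active:
            skipped.append(name)
    return skipped
```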