CLIKA Compression Settings Documentation

This documentation outlines the settings available for configuring CLIKA Compression, as defined in `cc_settings.py`.
Base Classes

`_BaseHalfFrozen`

Base class that ensures only existing attributes can be set on settings dataclasses. It also provides methods for serialization and deserialization.
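The "half-frozen" behavior can be sketched as follows. This is an illustrative stand-in, not the actual `_BaseHalfFrozen` implementation from `cc_settings.py`: it mimics the documented behavior of rejecting attributes that were never declared, plus a simple serialization helper.

```python
from dataclasses import asdict, dataclass, fields

# Illustrative sketch only; the real _BaseHalfFrozen in cc_settings.py
# may differ. Declared fields can be reassigned freely, but setting an
# undeclared attribute (e.g. a typo) raises immediately.
@dataclass
class HalfFrozenExample:
    weights_num_bits: int = 8

    def __setattr__(self, name, value):
        # Allow setting only attributes that exist as declared fields.
        if name not in {f.name for f in fields(self)}:
            raise AttributeError(f"Unknown setting: {name!r}")
        super().__setattr__(name, value)

    def to_dict(self) -> dict:
        # Serialization: dataclass fields to a plain dict.
        return asdict(self)
```

Guarding `__setattr__` this way turns silent configuration typos (for example, `weights_num_bit` instead of `weights_num_bits`) into immediate errors.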
`BaseSettings`

Inherits from `_BaseHalfFrozen`. Serves as a general base for specific setting types.
`BaseDeploymentSettings`

Base class that the other deployment settings inherit from. Includes logic to initialize specific deployment framework settings based on a `target_framework` key.
Methods

`initialize_from_dict(cls, settings: Optional[dict])`: Generates a DeploymentSettings object from a dictionary. Requires a `target_framework` key ("openvino", "ov", "tensorrt", "trt", "tensorrt-llm", "trtllm", "trt-llm", "trt_llm", "ort", "onnxruntime").

`is_any_TensorRT()`: Checks if the instance is any TensorRT type.

`is_TensorRT_ONNX()`: Checks if the instance is `DeploymentSettings_TensorRT_ONNX`.

`is_TensorRT_LLM_ONNX()`: Checks if the instance is `DeploymentSettings_TensorRT_LLM_ONNX`.

`is_ONNXRuntime_ONNX()`: Checks if the instance is `DeploymentSettings_ONNXRuntime_ONNX`.

`is_OpenVINO_ONNX()`: Checks if the instance is `DeploymentSettings_OpenVINO_ONNX`.
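The alias handling implied by `initialize_from_dict` can be sketched as below. This is a hypothetical helper, not the CLIKA method itself: the real method returns a `DeploymentSettings_*` instance, whereas this sketch only resolves an alias to its canonical framework name.

```python
# Hypothetical sketch of the target_framework alias dispatch; the alias
# list comes from the documentation above. Not part of the CLIKA API.
_ALIASES = {
    "openvino": "ov", "ov": "ov",
    "tensorrt": "trt", "trt": "trt",
    "tensorrt-llm": "trt_llm", "trtllm": "trt_llm",
    "trt-llm": "trt_llm", "trt_llm": "trt_llm",
    "ort": "ort", "onnxruntime": "ort",
}

def resolve_target_framework(settings: dict) -> str:
    if "target_framework" not in settings:
        raise ValueError("'target_framework' key is required")
    key = str(settings["target_framework"]).lower()
    if key not in _ALIASES:
        raise ValueError(f"Unknown target_framework: {key!r}")
    return _ALIASES[key]
```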
Deployment Settings

`DeploymentSettings_TensorRT_ONNX`

CLASS - `DeploymentSettings_TensorRT_ONNX()`

Use this if you wish to deploy to NVIDIA's TensorRT in `Settings.deployment_settings`. Sets `target_framework` to `"trt"`.

`DeploymentSettings_TensorRT_LLM_ONNX`

CLASS - `DeploymentSettings_TensorRT_LLM_ONNX()`

Use this if you wish to deploy to NVIDIA's TensorRT-LLM in `Settings.deployment_settings`. Sets `target_framework` to `"trt_llm"`.

`DeploymentSettings_ONNXRuntime_ONNX`

CLASS - `DeploymentSettings_ONNXRuntime_ONNX()`

Use this if you wish to deploy to Microsoft's ONNX Runtime in `Settings.deployment_settings`. Sets `target_framework` to `"ort"`.

`DeploymentSettings_OpenVINO_ONNX`

CLASS - `DeploymentSettings_OpenVINO_ONNX()`

Use this if you wish to deploy to Intel's OpenVINO in `Settings.deployment_settings`. Sets `target_framework` to `"ov"`.
Quantization Settings

`QuantizationSettings`

CLASS - `QuantizationSettings(...)`

Holds settings related to model quantization.

Attributes

`weights_num_bits`: `Union[int, List[int], Tuple[int]]` (Default: `[8, 4]`) - Number of bits for weights quantization. A list/tuple means potential candidates will be evaluated.

`activations_num_bits`: `Union[int, List[int], Tuple[int]]` (Default: `[8]`) - Number of bits for activations quantization. A list/tuple means potential candidates will be evaluated. Currently limited to 8 bits.

`prefer_weights_only_quantization`: `Optional[bool]` (Default: `None`) - Controls preference for quantization type: `None` (best of all types), `True` (weights only), `False` (weights + activations).

`weights_only_quantization_block_size`: `Optional[Union[int, List[int], Tuple[int]]]` (Default: `[0, 32, 64, 128, 256, 512]`) - Block size for weights-only quantization (relevant for linear layers). `0` means per-channel. Values must be powers of two between 16 and 512, or 0.

`quantization_sensitivity_threshold`: `Optional[Union[int, float]]` (Default: `None`) - Sensitivity threshold. Layers with sensitivity above this value won't be considered for quantization. Higher values are more destructive. Guideline: 0.1-0.2 if fine-tuning, <= 0.05 otherwise.

`weights_utilize_full_int_range`: `Optional[bool]` (Default: `None`) - Whether to utilize the full integer range (e.g., `[-128, 127]` vs `[-63, 63]`). If `None`, defaults are set based on the deployment target (`False` for OpenVINO/ONNXRuntime, `True` for QNN/TensorRT) unless explicitly set. See the ONNXRuntime explanation at https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html#when-and-why-do-i-need-to-try-u8u8

`quantization_cache_file`: `Optional[Union[str, Path]]` (Default: `None`) - (Not Implemented Yet) Path to cache quantization analysis results.

`one_extra_bit_for_symmetric_weights`: `Optional[bool]` (Default: `None`) - Allows a symmetric weight quantization range of `[-N-1, N]` instead of `[-N, N]`. Applied only for symmetric weight quantization. Leave as `None` if unsure.
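The constraint on `weights_only_quantization_block_size` (0 for per-channel, otherwise a power of two between 16 and 512) can be expressed as a small check. This helper is illustrative only and not part of the CLIKA API:

```python
# Hypothetical validator for the documented block-size constraint on
# weights_only_quantization_block_size. Not part of the CLIKA API.
def is_valid_block_size(block_size: int) -> bool:
    if block_size == 0:
        return True  # 0 means per-channel quantization
    # n & (n - 1) == 0 holds exactly when n is a power of two (n > 0).
    return 16 <= block_size <= 512 and block_size & (block_size - 1) == 0
```

Note that the default candidate list `[0, 32, 64, 128, 256, 512]` satisfies this constraint entry by entry.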
Methods

`initialize_from_dict(cls, settings: Optional[dict])`: Initializes a `QuantizationSettings` object from a dictionary.
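Dict-based initialization of a settings dataclass typically validates the incoming keys against the declared fields. The sketch below is hypothetical (the real CLIKA method may behave differently); it shows the general pattern of rejecting unknown keys rather than silently dropping them:

```python
from dataclasses import dataclass, fields
from typing import Optional

# Hypothetical sketch of dict-based initialization for a settings
# dataclass; only two of the documented fields are mirrored here.
@dataclass
class QuantSketch:
    weights_num_bits: object = None
    activations_num_bits: object = None

    @classmethod
    def initialize_from_dict(cls, settings: Optional[dict]) -> "QuantSketch":
        if settings is None:
            return cls()  # all defaults
        known = {f.name for f in fields(cls)}
        unknown = set(settings) - known
        if unknown:
            # Fail loudly on stale or misspelled config entries.
            raise ValueError(f"Unknown settings: {sorted(unknown)}")
        return cls(**settings)
```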
`LayerQuantizationSettings`

CLASS - `LayerQuantizationSettings(...)`

Inherits from `QuantizationSettings` and allows specifying quantization settings for a specific layer. Excludes the `quantization_sensitivity_threshold`, `weights_utilize_full_int_range`, and `quantization_cache_file` attributes.

Attributes (in addition to inherited ones)

`skip_quantization`: `bool` (Default: `False`) - Skip quantization for this specific layer.

`skip_quantization_downstream`: `bool` (Default: `False`) - Skip quantization for this layer and all subsequent layers in the graph.

`skip_quantization_until`: `Optional[Union[str, Tuple[str], List[str]]]` (Default: `None`) - Skip quantization from this layer up to (but not including) the specified layer(s).
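The "up to (but not including)" semantics of `skip_quantization_until` can be sketched on a linear sequence of layer names. This is an illustrative model, not the CLIKA implementation, which operates on the actual model graph:

```python
from typing import List, Union

# Illustrative sketch (not the CLIKA implementation) of the documented
# skip_quantization_until semantics on a linear layer ordering: skip
# from the configured layer up to, but not including, a stop layer.
def layers_to_skip(ordered_layers: List[str],
                   start: str,
                   stop_layers: Union[str, List[str]]) -> List[str]:
    stops = {stop_layers} if isinstance(stop_layers, str) else set(stop_layers)
    skipped, active = [], False
    for name in ordered_layers:
        if name == start:
            active = True   # skipping begins at the configured layer
        if name in stops:
            active = False  # the stop layer itself is NOT skipped
        if active:
            skipped.append(name)
    return skipped
```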