CLIKA Compression Settings Documentation
This documentation outlines the settings available for configuring CLIKA Compression, as defined in cc_settings.py.
Base Classes
_BaseHalfFrozen
Base class that ensures only existing attributes can be set in settings dataclasses. It includes methods for serialization and deserialization.
BaseSettings
Inherits from _BaseHalfFrozen. Serves as a general base for specific setting types.
BaseDeploymentSettings
Base class that other Deployment Settings inherit from. Includes logic to initialize specific deployment framework settings based on a target_framework key.
Methods
initialize_from_dict(cls, settings: Optional[dict]): Generates a DeploymentSettings object from a dictionary. Requires a target_framework key ("openvino", "ov", "tensorrt", "trt", "tensorrt-llm", "trtllm", "trt-llm", "trt_llm", "ort", "onnxruntime").
is_any_TensorRT(): Checks if the instance is any TensorRT type.
is_TensorRT_ONNX(): Checks if the instance is DeploymentSettings_TensorRT_ONNX.
is_TensorRT_LLM_ONNX(): Checks if the instance is DeploymentSettings_TensorRT_LLM_ONNX.
is_ONNXRuntime_ONNX(): Checks if the instance is DeploymentSettings_ONNXRuntime_ONNX.
is_OpenVINO_ONNX(): Checks if the instance is DeploymentSettings_OpenVINO_ONNX.
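The alias-to-class dispatch performed by initialize_from_dict can be sketched in plain Python. This is an illustrative stand-in, not CLIKA's actual implementation; only the class names and accepted target_framework aliases come from the documentation above.

```python
# Illustrative sketch of BaseDeploymentSettings.initialize_from_dict's
# target_framework dispatch -- NOT CLIKA's actual code.
from typing import Optional

# Stub stand-ins for the real deployment settings classes.
class DeploymentSettings_TensorRT_ONNX: ...
class DeploymentSettings_TensorRT_LLM_ONNX: ...
class DeploymentSettings_ONNXRuntime_ONNX: ...
class DeploymentSettings_OpenVINO_ONNX: ...

# Every accepted alias maps to one of the four deployment settings classes.
_FRAMEWORK_ALIASES = {
    "openvino": DeploymentSettings_OpenVINO_ONNX,
    "ov": DeploymentSettings_OpenVINO_ONNX,
    "tensorrt": DeploymentSettings_TensorRT_ONNX,
    "trt": DeploymentSettings_TensorRT_ONNX,
    "tensorrt-llm": DeploymentSettings_TensorRT_LLM_ONNX,
    "trtllm": DeploymentSettings_TensorRT_LLM_ONNX,
    "trt-llm": DeploymentSettings_TensorRT_LLM_ONNX,
    "trt_llm": DeploymentSettings_TensorRT_LLM_ONNX,
    "ort": DeploymentSettings_ONNXRuntime_ONNX,
    "onnxruntime": DeploymentSettings_ONNXRuntime_ONNX,
}

def initialize_from_dict(settings: Optional[dict]):
    """Build a deployment settings object from a dict with a 'target_framework' key."""
    if not settings or "target_framework" not in settings:
        raise ValueError("settings must contain a 'target_framework' key")
    key = str(settings["target_framework"]).lower()
    if key not in _FRAMEWORK_ALIASES:
        raise ValueError(f"unknown target_framework: {key!r}")
    return _FRAMEWORK_ALIASES[key]()
```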
Deployment Settings
DeploymentSettings_TensorRT_ONNX
CLASS - DeploymentSettings_TensorRT_ONNX()
Use this if you wish to deploy to NVIDIA's TensorRT in Settings.deployment_settings. Sets target_framework to "trt".
DeploymentSettings_TensorRT_LLM_ONNX
CLASS - DeploymentSettings_TensorRT_LLM_ONNX()
Use this if you wish to deploy to NVIDIA's TensorRT LLM in Settings.deployment_settings. Sets target_framework to "trt_llm".
DeploymentSettings_ONNXRuntime_ONNX
CLASS - DeploymentSettings_ONNXRuntime_ONNX()
Use this if you wish to deploy to Microsoft's ONNXRuntime in Settings.deployment_settings. Sets target_framework to "ort".
DeploymentSettings_OpenVINO_ONNX
CLASS - DeploymentSettings_OpenVINO_ONNX()
Use this if you wish to deploy to Intel's OpenVINO in Settings.deployment_settings. Sets target_framework to "ov".
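The type-check helpers listed under BaseDeploymentSettings can be sketched against these four classes. The sketch below is a simplified stand-in for illustration only; the target_framework values match those documented above.

```python
# Simplified sketch of the is_* helper pattern on BaseDeploymentSettings
# -- stand-in classes, NOT CLIKA's actual implementation.
class BaseDeploymentSettings:
    def is_any_TensorRT(self) -> bool:
        # True for both the TensorRT and TensorRT-LLM settings types.
        return isinstance(self, (DeploymentSettings_TensorRT_ONNX,
                                 DeploymentSettings_TensorRT_LLM_ONNX))

    def is_ONNXRuntime_ONNX(self) -> bool:
        return isinstance(self, DeploymentSettings_ONNXRuntime_ONNX)

class DeploymentSettings_TensorRT_ONNX(BaseDeploymentSettings):
    target_framework = "trt"

class DeploymentSettings_TensorRT_LLM_ONNX(BaseDeploymentSettings):
    target_framework = "trt_llm"

class DeploymentSettings_ONNXRuntime_ONNX(BaseDeploymentSettings):
    target_framework = "ort"

class DeploymentSettings_OpenVINO_ONNX(BaseDeploymentSettings):
    target_framework = "ov"
```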
Quantization Settings
QuantizationSettings
CLASS - QuantizationSettings(...)
Holds settings related to model quantization.
Attributes
weights_num_bits: Union[int, List[int], Tuple[int]] (Default: [8, 4]) - Number of bits for weights quantization. A list/tuple means potential candidates will be evaluated.
activations_num_bits: Union[int, List[int], Tuple[int]] (Default: [8]) - Number of bits for activations quantization. A list/tuple means potential candidates will be evaluated. Currently limited to 8 bits.
prefer_weights_only_quantization: Optional[bool] (Default: None) - Controls preference for quantization type: None (best of all types), True (weights only), False (weights + activations).
weights_only_quantization_block_size: Optional[Union[int, List[int], Tuple[int]]] (Default: [0, 32, 64, 128, 256, 512]) - Block size for weights-only quantization (relevant for linear layers). 0 means per-channel. Values must be powers of two between 16 and 512, or 0.
quantization_sensitivity_threshold: Optional[Union[int, float]] (Default: None) - Sensitivity threshold. Layers with a sensitivity above this value will not be considered for quantization. Higher values are more destructive. Guideline: 0.1-0.2 if fine-tuning, <= 0.05 otherwise.
weights_utilize_full_int_range: Optional[bool] (Default: None) - Whether to utilize the full integer range (e.g., [-128, 127] vs [-63, 63]). If None, defaults are set based on the deployment target (False for OpenVINO/ONNXRuntime, True for QNN/TensorRT) unless explicitly set. See the ONNXRuntime explanation at https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html#when-and-why-do-i-need-to-try-u8u8
quantization_cache_file: Optional[Union[str, Path]] (Default: None) - (Not Implemented Yet) Path to cache quantization analysis results.
one_extra_bit_for_symmetric_weights: Optional[bool] (Default: None) - Allows a symmetric weight quantization range of [-N-1, N] instead of [-N, N]. Applied only for symmetric weight quantization. Leave None if unsure.
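The validity rule for weights_only_quantization_block_size (0 for per-channel, otherwise a power of two between 16 and 512) can be checked with a small helper. This is an illustrative sketch, not a CLIKA API:

```python
# Sketch of the weights_only_quantization_block_size validity rule:
# 0 (per-channel) or a power of two in [16, 512].
def is_valid_block_size(n: int) -> bool:
    if n == 0:
        return True  # per-channel quantization
    # (n & (n - 1)) == 0 tests for a power of two
    return 16 <= n <= 512 and (n & (n - 1)) == 0
```

Note that every value in the default candidate list [0, 32, 64, 128, 256, 512] passes this check.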
Methods
initialize_from_dict(cls, settings: Optional[dict]): Initializes QuantizationSettings from a dictionary.
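A settings dictionary for QuantizationSettings.initialize_from_dict might look like the following. The key names mirror the attributes documented above; the values are illustrative, and the exact accepted dictionary shape is an assumption:

```python
# Example input for QuantizationSettings.initialize_from_dict
# (illustrative values; keys mirror the documented attributes).
quant_settings_dict = {
    "weights_num_bits": [8, 4],              # candidates to evaluate
    "activations_num_bits": [8],             # currently limited to 8 bits
    "prefer_weights_only_quantization": None,  # let CLIKA pick the best type
    "weights_only_quantization_block_size": [0, 32, 64, 128, 256, 512],
    "quantization_sensitivity_threshold": 0.05,  # guideline when not fine-tuning
}
```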
LayerQuantizationSettings
CLASS - LayerQuantizationSettings(...)
Inherits from QuantizationSettings and allows specifying quantization settings for a specific layer. Excludes quantization_sensitivity_threshold, weights_utilize_full_int_range, and quantization_cache_file attributes.
Attributes (in addition to inherited ones)
skip_quantization: bool (Default: False) - Skip quantization for this specific layer.
skip_quantization_downstream: bool (Default: False) - Skip quantization for this layer and all subsequent layers in the graph.
skip_quantization_until: Optional[Union[str, Tuple[str], List[str]]] (Default: None) - Skip quantization from this layer up to (but not including) the specified layer(s).
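A per-layer settings dictionary combining inherited and layer-specific keys might look like this. Key names mirror the attributes documented above; the values and the exact dictionary shape are illustrative assumptions:

```python
# Example per-layer settings (illustrative; keys mirror the documented
# LayerQuantizationSettings attributes).
layer_settings = {
    "weights_num_bits": [8],              # inherited from QuantizationSettings
    "skip_quantization": False,           # quantize this layer
    "skip_quantization_downstream": False,
    "skip_quantization_until": None,      # no skip range defined
}
```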