Version: 25.4.0

CLIKA Compression Settings Documentation

This documentation outlines the settings available for configuring CLIKA Compression, as defined in cc_settings.py.

Base Classes

_BaseHalfFrozen

Base class that ensures only existing attributes can be set in settings dataclasses. It includes methods for serialization and deserialization.
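The "only existing attributes" behavior can be sketched roughly as below. The class and field names here are illustrative stand-ins, not the actual cc_settings.py implementation; the real base class also provides serialization helpers, of which to_dict is a minimal analogue.

```python
from dataclasses import dataclass, asdict


class _HalfFrozenSketch:
    """Illustrative stand-in: existing attributes may be updated, new ones may not."""

    def __setattr__(self, name, value):
        # Reject any attribute that was not declared as a dataclass field,
        # which catches typos such as `settings.num_bitz = 4`.
        if name not in getattr(self, "__dataclass_fields__", {}):
            raise AttributeError(f"unknown setting: {name!r}")
        object.__setattr__(self, name, value)

    def to_dict(self):
        # Minimal serialization of the declared fields.
        return asdict(self)


@dataclass
class ExampleSettings(_HalfFrozenSketch):
    num_bits: int = 8


s = ExampleSettings()
s.num_bits = 4  # fine: the field exists
```

Assigning a misspelled attribute on such an object raises AttributeError instead of silently creating a new field.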

BaseSettings

Inherits from _BaseHalfFrozen. Serves as a general base for specific setting types.

BaseDeploymentSettings

Base class that other Deployment Settings inherit from. Includes logic to initialize specific deployment framework settings based on a target_framework key.

Methods

  • initialize_from_dict(cls, settings: Optional[dict]): Generates a DeploymentSettings object from a dictionary. Requires a target_framework key ("openvino", "ov", "tensorrt", "trt", "tensorrt-llm", "trtllm", "trt-llm", "trt_llm", "ort", "onnxruntime").
  • is_any_TensorRT(): Checks if the instance is any TensorRT type.
  • is_TensorRT_ONNX(): Checks if the instance is DeploymentSettings_TensorRT_ONNX.
  • is_TensorRT_LLM_ONNX(): Checks if the instance is DeploymentSettings_TensorRT_LLM_ONNX.
  • is_ONNXRuntime_ONNX(): Checks if the instance is DeploymentSettings_ONNXRuntime_ONNX.
  • is_OpenVINO_ONNX(): Checks if the instance is DeploymentSettings_OpenVINO_ONNX.
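The alias handling in initialize_from_dict can be pictured as a lookup from the target_framework string to a settings class. This is a hedged sketch with stand-in classes; the real dispatch lives in cc_settings.py and returns the DeploymentSettings_* classes documented below.

```python
# Stand-in classes for the four deployment targets.
class TRT: ...
class TRTLLM: ...
class ORT: ...
class OV: ...

# Aliases accepted for the target_framework key, per the documentation above.
_ALIASES = {
    "tensorrt": TRT, "trt": TRT,
    "tensorrt-llm": TRTLLM, "trtllm": TRTLLM, "trt-llm": TRTLLM, "trt_llm": TRTLLM,
    "onnxruntime": ORT, "ort": ORT,
    "openvino": OV, "ov": OV,
}


def initialize_from_dict_sketch(settings: dict):
    # Look up the target_framework key and instantiate the matching class.
    key = settings["target_framework"].lower()
    return _ALIASES[key]()
```

For example, both {"target_framework": "trt_llm"} and {"target_framework": "tensorrt-llm"} resolve to the same TensorRT-LLM settings class.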

Deployment Settings

DeploymentSettings_TensorRT_ONNX

CLASS - DeploymentSettings_TensorRT_ONNX()

Use this if you wish to deploy to NVIDIA's TensorRT in Settings.deployment_settings. Sets target_framework to "trt".

DeploymentSettings_TensorRT_LLM_ONNX

CLASS - DeploymentSettings_TensorRT_LLM_ONNX()

Use this if you wish to deploy to NVIDIA's TensorRT-LLM in Settings.deployment_settings. Sets target_framework to "trt_llm".

DeploymentSettings_ONNXRuntime_ONNX

CLASS - DeploymentSettings_ONNXRuntime_ONNX()

Use this if you wish to deploy to Microsoft's ONNX Runtime in Settings.deployment_settings. Sets target_framework to "ort".

DeploymentSettings_OpenVINO_ONNX

CLASS - DeploymentSettings_OpenVINO_ONNX()

Use this if you wish to deploy to Intel's OpenVINO in Settings.deployment_settings. Sets target_framework to "ov".
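The is_* helpers relate to the classes above roughly as follows: is_any_TensorRT() plausibly covers both the TensorRT and TensorRT-LLM variants, while the other predicates are exact type checks. The sketch below uses stand-in class names to show that relationship; it is not the library's code.

```python
class _BaseSketch:
    # Exact type checks, one per deployment target.
    def is_TensorRT_ONNX(self):
        return isinstance(self, TensorRTSketch)

    def is_TensorRT_LLM_ONNX(self):
        return isinstance(self, TensorRTLLMSketch)

    def is_OpenVINO_ONNX(self):
        return isinstance(self, OpenVINOSketch)

    # Covers both TensorRT variants.
    def is_any_TensorRT(self):
        return self.is_TensorRT_ONNX() or self.is_TensorRT_LLM_ONNX()


class TensorRTSketch(_BaseSketch): ...
class TensorRTLLMSketch(_BaseSketch): ...
class OpenVINOSketch(_BaseSketch): ...
```

This pattern lets calling code branch on the broad family ("any TensorRT") or on the exact target without string comparisons.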

Quantization Settings

QuantizationSettings

CLASS - QuantizationSettings(...)

Holds settings related to model quantization.

Attributes

  • weights_num_bits: Union[int, List[int], Tuple[int]] (Default: [8, 4])
    • Number of bits for weights quantization. A list/tuple provides candidate bit-widths that will be evaluated.
  • activations_num_bits: Union[int, List[int], Tuple[int]] (Default: [8])
    • Number of bits for activations quantization. A list/tuple provides candidate bit-widths that will be evaluated. Currently limited to 8 bits.
  • prefer_weights_only_quantization: Optional[bool] (Default: None)
    • Controls preference for quantization type: None (best of all types), True (weights only), False (weights + activations).
  • weights_only_quantization_block_size: Optional[Union[int, List[int], Tuple[int]]] (Default: [0, 32, 64, 128, 256, 512])
    • Block size for weights-only quantization (relevant for linear layers). 0 means per-channel. Values must be powers of two between 16 and 512, or 0.
  • quantization_sensitivity_threshold: Optional[Union[int, float]] (Default: None)
    • Sensitivity threshold. Layers whose sensitivity exceeds this value won't be considered for quantization. Higher thresholds exclude fewer layers from quantization and are therefore more destructive to accuracy. Guideline: 0.1-0.2 if fine-tuning, <= 0.05 otherwise.
  • weights_utilize_full_int_range: Optional[bool] (Default: None)
  • quantization_cache_file: Optional[Union[str, Path]] (Default: None)
    • (Not Implemented Yet) Path to cache quantization analysis results.
  • one_extra_bit_for_symmetric_weights: Optional[bool] (Default: None)
    • Allows symmetric weight quantization range [-N-1, N] instead of [-N, N]. Applied only for symmetric weight quantization. Leave None if unsure.
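The stated constraint on weights_only_quantization_block_size (0, or a power of two between 16 and 512) can be expressed as a small check. The validator below is illustrative, not part of the library:

```python
def is_valid_block_size(n: int) -> bool:
    # 0 means per-channel weights-only quantization.
    if n == 0:
        return True
    # Otherwise the value must be a power of two in [16, 512];
    # a power of two has exactly one bit set, so n & (n - 1) == 0.
    return 16 <= n <= 512 and (n & (n - 1)) == 0
```

All values in the documented default [0, 32, 64, 128, 256, 512] pass this check, while e.g. 24 (not a power of two) and 1024 (out of range) do not.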

Methods

  • initialize_from_dict(cls, settings: Optional[dict]): Initializes QuantizationSettings from a dictionary.
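A rough sketch of how initialize_from_dict maps dictionary keys onto the attributes above. The class below is a stand-in that mirrors a subset of the documented fields and defaults; the actual class in cc_settings.py has more attributes and may apply additional validation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class QuantizationSettingsSketch:
    # Defaults mirror the documented ones (subset of fields only).
    weights_num_bits: Union[int, List[int]] = field(default_factory=lambda: [8, 4])
    activations_num_bits: Union[int, List[int]] = field(default_factory=lambda: [8])
    prefer_weights_only_quantization: Optional[bool] = None
    quantization_sensitivity_threshold: Optional[Union[int, float]] = None

    @classmethod
    def initialize_from_dict(cls, settings: Optional[dict]):
        # None or an empty dict falls back to the defaults;
        # keys are assumed to mirror the attribute names.
        return cls(**(settings or {}))


qs = QuantizationSettingsSketch.initialize_from_dict(
    {"weights_num_bits": 4, "quantization_sensitivity_threshold": 0.05}
)
```

Fields absent from the dictionary keep their defaults, so the example above still evaluates activations at 8 bits.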

LayerQuantizationSettings

CLASS - LayerQuantizationSettings(...)

Inherits from QuantizationSettings and allows specifying quantization settings for a specific layer. Excludes quantization_sensitivity_threshold, weights_utilize_full_int_range, and quantization_cache_file attributes.

Attributes (in addition to inherited ones)

  • skip_quantization: bool (Default: False)
    • Skip quantization for this specific layer.
  • skip_quantization_downstream: bool (Default: False)
    • Skip quantization for this layer and all subsequent layers in the graph.
  • skip_quantization_until: Optional[Union[str, Tuple[str], List[str]]] (Default: None)
    • Skip quantization from this layer up to (but not including) the specified layer(s).
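The per-layer skip options can be sketched as below. The dataclass is a stand-in for the documented fields, and the layer names in the mapping are hypothetical, chosen only to illustrate how per-layer overrides might be keyed:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class LayerQuantizationSettingsSketch:
    # Illustrative stand-in for the documented per-layer options.
    skip_quantization: bool = False
    skip_quantization_downstream: bool = False
    skip_quantization_until: Optional[Union[str, Tuple[str, ...], List[str]]] = None


# Hypothetical per-layer overrides (layer names are made up).
layer_overrides = {
    # Leave this single layer unquantized.
    "backbone.conv1": LayerQuantizationSettingsSketch(skip_quantization=True),
    # Skip from this layer up to (but not including) "head.output".
    "head.linear": LayerQuantizationSettingsSketch(
        skip_quantization_until="head.output"
    ),
}
```

Since LayerQuantizationSettings also inherits the QuantizationSettings fields, a real override could additionally pin e.g. weights_num_bits for that layer.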