
Release Notes

Version 25.5.1 - May 2025

note

Support is generally provided for both torch.function_name and Tensor.function_name syntaxes. If you find a case where only one is supported, please report it to support@clika.io.
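
For example, both spellings of the same operation below are expected to work; the snippet uses only plain PyTorch and no CLIKA-specific API:

```python
import torch

x = torch.randn(4, 4)
y = torch.randn(4, 4)

# Functional form: torch.function_name
a = torch.add(x, y)

# Method form: Tensor.function_name
b = x.add(y)

# Both spellings produce identical results.
assert torch.equal(a, b)
```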

Overview

This major release introduces significant enhancements to our core model compression and deployment capabilities, streamlining workflows and expanding hardware support.

Bug Fixes

  • Quantization Summary:
    • Fixed an issue where the Quantization Summary failed due to a numerical precision problem.
    • Fixed a user-facing reporting mismatch between 4-bit and 8-bit quantization.
    • Fixed an issue where the Quantization Summary failed when number rounding and infinite values were involved.
  • Fixed an issue with parsing the torch.nn.Upsample module.
  • Fixed a crash in the quantized versions of the ReduceMin and ReduceMax operations caused by an issue with their outputs.
  • The ClikaModule now accepts non-Tensor arguments, which are ignored instead of raising an error.

Version 25.4.0 - April 2025

New Features

  • Added support for torch.nn.Linear with 1D weights.
  • Enhanced quantization algorithm: Our core quantization engine has been substantially upgraded. Fine-tuning is now optional for achieving excellent results, significantly simplifying the optimization process.
    • A new Quantization Sensitivity Threshold option is available in the QuantizationSettings object for more direct control over the quantization process (see the configuration sketch after this list).
  • Initial TensorRT-LLM support: Introduced support for weights-only quantization (WOQ) with TensorRT-LLM, enabling optimized deployment for large language models on compatible NVIDIA hardware. Support for additional TensorRT-LLM features is planned.
  • Automatic CPU offloading: Models that exceed available accelerator memory can now automatically offload parts of the computation to the CPU, ensuring successful execution even with large models.
  • Simplified model export: Easily export optimized models at common deployment precisions, including INT4, BFloat16, and Float16.
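
As a minimal sketch of the new Quantization Sensitivity Threshold option: the import path and the attribute name quantization_sensitivity_threshold below are assumptions for illustration only and may differ from the installed SDK; consult the QuantizationSettings documentation for the exact names.

```python
# Minimal sketch only: the import path and attribute name are assumptions,
# not the confirmed API; check the QuantizationSettings documentation.
from clika_ace import QuantizationSettings  # assumed import path

quantization_settings = QuantizationSettings()

# Hypothetical attribute name for the new Quantization Sensitivity Threshold;
# the illustrative value below is not a recommended default.
quantization_settings.quantization_sensitivity_threshold = 0.01
```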

Upcoming Features

We are actively working on the following enhancements for future releases:

  • Multi-GPU support: Multi-GPU capabilities, temporarily unavailable due to internal architectural upgrades, will be restored.
  • Expanding TensorRT-LLM integration: Further enhancements and broader feature support for TensorRT-LLM deployments.
  • Memory optimization: Continued focus on reducing VRAM and RAM consumption during model optimization and inference.
  • Compile ClikaModule to framework: Run the ClikaModule directly on the target framework using a simple method call.
  • Quantization sensitivity calibration caching: Avoid re-measuring quantization sensitivities and cache the results for quick development iterations and experimentation.
  • Easier access to quantization sensitivities: Introduce an API to retrieve sensitivity data for deeper analysis and visualization purposes.