Release Notes
Version 25.5.1 - May 2025
Note: Support is generally provided for both `torch.function_name` and `Tensor.function_name` syntaxes. If you find a case where only one is supported, please report it to support@clika.io.
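For example, most PyTorch operations can be spelled either way, and both forms should behave identically:

```python
import torch

t = torch.tensor([1.0, -2.0, 3.0])

# torch.function_name syntax
a = torch.abs(t)

# Tensor.function_name syntax
b = t.abs()

# Both spellings produce the same result
assert torch.equal(a, b)
```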
Overview
This major release introduces significant enhancements to our core model compression and deployment capabilities, streamlining workflows and expanding hardware support.
Bug Fixes
- Quantization Summary:
  - Fixed an issue where Quantization Summary failed due to a numerical precision issue.
  - Fixed a user-facing 4-bit/8-bit reporting mismatch.
  - Fixed an issue where Quantization Summary failed due to rounding when infinite values were involved.
- Fixed an issue with the parsing of the `torch.nn.Upsample` module.
- Fixed a crash in the quantized versions of ReduceMin and ReduceMax caused by an output issue.
- The `ClikaModule` can now accept non-Tensor arguments, which are ignored instead of raising an error.
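The new argument-handling behavior can be illustrated with a minimal stand-in wrapper. The `TolerantModule` class below is hypothetical, written only to sketch the contract; it is not the CLIKA implementation:

```python
import torch

class TolerantModule(torch.nn.Module):
    """Hypothetical stand-in: drops non-Tensor positional arguments
    instead of raising, mirroring the ClikaModule behavior described
    in the release notes."""

    def __init__(self, inner: torch.nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, *args):
        # Keep only Tensor arguments; silently ignore everything else.
        tensor_args = [a for a in args if isinstance(a, torch.Tensor)]
        return self.inner(*tensor_args)

m = TolerantModule(torch.nn.Identity())
x = torch.ones(2)
out = m(x, "not a tensor")  # the string is ignored, no error is raised
assert torch.equal(out, x)
```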
Version 25.4.0 - April 2025
New Features
- Supports `torch.nn.Linear` with 1D weights.
- Enhanced quantization algorithm: Our core quantization engine has been substantially upgraded. Fine-tuning is now optional for achieving excellent results, significantly simplifying the optimization process.
- A new Quantization Sensitivity Threshold option is available in the `QuantizationSettings` object for more direct control over the quantization process.
- Initial TensorRT-LLM support: Introduced support for weights-only quantization (WOQ) with TensorRT-LLM, enabling optimized deployment for large language models on compatible NVIDIA hardware. Support for additional TensorRT-LLM features is planned.
- Automatic CPU offloading: Models that exceed available accelerator memory can now automatically offload parts of the computation to the CPU, ensuring successful execution even with large models.
- Simplified model export: Easily export optimized models to common deployment formats, including INT4, BFloat16, and Float16.
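As a rough pseudocode sketch, the new Quantization Sensitivity Threshold option described above might be set as follows. Note that the import path and the attribute name `quantization_sensitivity_threshold` are assumptions for illustration, not confirmed by these notes:

```
# Pseudocode sketch -- import path and attribute name are assumed, not confirmed.
from clika import QuantizationSettings

settings = QuantizationSettings(
    quantization_sensitivity_threshold=0.05,  # hypothetical name and value
)
```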
Upcoming Features
We are actively working on the following enhancements for future releases:
- Multi-GPU support: Multi-GPU capabilities, temporarily unavailable due to internal architectural upgrades, will be restored.
- Expanding TensorRT-LLM integration: Further enhancements and broader feature support for TensorRT-LLM deployments.
- Memory optimization: Continued focus on reducing VRAM and RAM consumption during model optimization and inference.
- Compile ClikaModule to framework: Run the `ClikaModule` directly on the underlying framework using a simple method call.
- Quantization sensitivity calibration caching: Avoid re-measuring quantization sensitivities and cache the results for quick development iterations and experimentation.
- Easier access to quantization sensitivities: Introduce an API to retrieve sensitivity data for deeper analysis and visualization purposes.