
Release Notes

Version 25.5.1 - May 2025

note

Support is generally provided for both torch.function_name and Tensor.function_name syntaxes. If you find a case where only one is supported, please report it to support@clika.io.
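
For example, both spellings of the same operation below are expected to work; the snippet uses only plain PyTorch and no CLIKA-specific API:

```python
import torch

x = torch.randn(4, 4)
y = torch.randn(4, 4)

# Functional form: torch.function_name
a = torch.add(x, y)

# Method form: Tensor.function_name
b = x.add(y)

# Both spellings produce identical results.
assert torch.equal(a, b)
```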

Overview

This major release introduces significant enhancements to our core model compression and deployment capabilities, streamlining workflows and expanding hardware support.

Bug Fixes

  • Quantization Summary:
    • Fixed an issue where the Quantization Summary failed due to a numerical precision problem.
    • Fixed a user-facing reporting mismatch between 4-bit and 8-bit quantization.
    • Fixed an issue where the Quantization Summary failed when number rounding and infinite values were involved.
  • Fixed an issue with parsing the torch.nn.Upsample module.
  • Fixed a crash in the quantized versions of the ReduceMin and ReduceMax operations caused by an issue with their outputs.
  • The ClikaModule now accepts non-Tensor arguments, which are ignored instead of raising an error.

Version 25.4.0 - April 2025

New Features

  • Added support for torch.nn.Linear with 1D weights.
  • Enhanced quantization algorithm: Our core quantization engine has been substantially upgraded. Fine-tuning is now optional for achieving excellent results, significantly simplifying the optimization process.
    • A new Quantization Sensitivity Threshold option is available in the QuantizationSettings object for more direct control over the quantization process (see the configuration sketch after this list).
  • Initial TensorRT-LLM support: Introduced support for weights-only quantization (WOQ) with TensorRT-LLM, enabling optimized deployment for large language models on compatible NVIDIA hardware. Support for additional TensorRT-LLM features is planned.
  • Automatic CPU offloading: Models that exceed available accelerator memory can now automatically offload parts of the computation to the CPU, ensuring successful execution even with large models.
  • Simplified model export: Easily export optimized models at common deployment precisions, including INT4, BFloat16, and Float16.
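
As a minimal sketch of the new Quantization Sensitivity Threshold option: the import path and the attribute name quantization_sensitivity_threshold below are assumptions for illustration only and may differ from the installed SDK; consult the QuantizationSettings documentation for the exact names.

```python
# Minimal sketch only: the import path and attribute name are assumptions,
# not the confirmed API; check the QuantizationSettings documentation.
from clika_ace import QuantizationSettings  # assumed import path

quantization_settings = QuantizationSettings()

# Hypothetical attribute name for the new Quantization Sensitivity Threshold;
# the illustrative value below is not a recommended default.
quantization_settings.quantization_sensitivity_threshold = 0.01
```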

Upcoming Features

We are actively working on the following enhancements for future releases:

  • Multi-GPU support: Multi-GPU capabilities, temporarily unavailable due to internal architectural upgrades, will be restored.
  • Expanding TensorRT-LLM integration: Further enhancements and broader feature support for TensorRT-LLM deployments.
  • Memory optimization: Continued focus on reducing VRAM and RAM consumption during model optimization and inference.
  • Compile ClikaModule to framework: Run the ClikaModule directly on the target framework using a simple method call.
  • Quantization sensitivity calibration caching: Avoid re-measuring quantization sensitivities and cache the results for quick development iterations and experimentation.
  • Easier access to quantization sensitivities: Introduce an API to retrieve sensitivity data for deeper analysis and visualization purposes.