
Release Notes

Version 25.4.0 - April 2025

note

Support is generally provided for both torch.function_name and Tensor.function_name syntaxes. If you find a case where only one is supported, please report it to support@clika.io.
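
For example, in standard PyTorch the namespace call and the equivalent Tensor method are interchangeable, so either spelling may appear in a model passed to the SDK:

```python
import torch

x = torch.randn(4)
y = torch.randn(4)

# torch.function_name syntax
a = torch.add(x, y)

# Tensor.function_name syntax (the equivalent method form)
b = x.add(y)

assert torch.equal(a, b)  # both spellings produce the same result
```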

Overview

This major release introduces significant enhancements to our core model compression and deployment capabilities, streamlining workflows and expanding hardware support.

Key Features

  • Enhanced quantization algorithm: Our core quantization engine has been substantially upgraded. Fine-tuning is now optional: strong accuracy can be achieved without it, which significantly simplifies the optimization process.
    • A new Quantization Sensitivity Threshold option is available in the QuantizationSettings object for more direct control over the quantization process (see the sketch following this list).
  • Initial TensorRT-LLM support: Introduced support for weights-only quantization (WOQ) with TensorRT-LLM, enabling optimized deployment for large language models on compatible NVIDIA hardware. Support for additional TensorRT-LLM features is planned.
  • Automatic CPU offloading: Models that exceed available accelerator memory can now automatically offload parts of the computation to the CPU, ensuring successful execution even with large models.
  • Simplified model export: Easily export optimized models to common deployment formats, including INT4, BFloat16, and Float16.
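
As a rough illustration of the new workflow, the sketch below combines the Quantization Sensitivity Threshold option with the simplified export path. QuantizationSettings, ClikaModule, and the INT4/BFloat16/Float16 targets are named in this release; the import path, the quantization_sensitivity_threshold keyword, the clika_compress entry point, and the export signature are hypothetical placeholders, not the documented API.

```python
import torchvision

# Hypothetical import path -- consult the official SDK docs for the real one.
from clika_compression import QuantizationSettings, clika_compress

model = torchvision.models.resnet18(weights=None)

# New in 25.4.0: a sensitivity threshold gives more direct control over
# quantization. The keyword name and value here are illustrative assumptions.
settings = QuantizationSettings(quantization_sensitivity_threshold=0.05)

# Fine-tuning is now optional, so a single compression call may be enough.
# `clika_compress` is a placeholder entry point, not a documented function.
clika_module = clika_compress(model, settings=settings)

# Export the optimized model to one of the supported deployment precisions
# (INT4, BFloat16, or Float16); the method signature is an assumption.
clika_module.export("resnet18_int4", precision="int4")
```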

Upcoming Features

We are actively working on the following enhancements for future releases:

  • Multi-GPU support: Restore multi-GPU capabilities, which are temporarily unavailable due to internal architectural upgrades.
  • Expanding TensorRT-LLM integration: Further enhancements and broader feature support for TensorRT-LLM deployments.
  • Memory optimization: Continued focus on reducing VRAM and RAM consumption during model optimization and inference.
  • Compile ClikaModule to framework: Run a ClikaModule directly in the original framework via a simple method call.
  • Quantization sensitivity calibration caching: Cache measured quantization sensitivities so they are not re-measured on every run, speeding up development iterations and experimentation.
  • Easier access to quantization sensitivities: Introduce an API for retrieving sensitivity data for deeper analysis and visualization.