Version: 25.5.1

How to use CLIKA ACE

Background

The clika-ace Python package implements CLIKA's unique Automatic Compression Engine (ACE).

CLIKA ACE

ACE is a "hardware-aware" engine that compresses the model specifically for a selected target framework such as Microsoft's ONNX Runtime, NVIDIA's TensorRT, or Intel's OpenVINO.
ACE can be applied to fine-tune a pre-trained model or to train a model from scratch.
While ACE is capable of training models from scratch, we recommend using it to fine-tune pre-trained models or to quantize them.
ACE is exposed through the clika_compile function or as a backend for torch.compile. We recommend calling clika_compile, since it offers auto-completion in an IDE.

The clika-ace package has three main usages:

1. Start an ACE session, initialized from an existing torch.nn.Module.
   The torch.nn.Module is wrapped in a ClikaModule instance, which inherits from torch.nn.Module; as such, a ClikaModule instance behaves like a torch.nn.Module.
2. Resume an ACE session from a saved ClikaModule.
3. Export the compressed model to the chosen framework.
   Deployment of a ClikaModule instance to a particular target framework is done with the clika_model.clika_export(...) method, which has an API similar to torch.onnx.export.

info

When a ClikaModule is initialized from a torch.nn.Module, the model is transformed into the CLIKA IR (intermediate representation). When the model is serialized for later use, it is saved in the CLIKA IR format.

Terminology:

CLIKA ACE: CLIKA's Automatic Compression Engine
CLIKA IR / pompom: When the SDK receives a torch.nn.Module as input, it parses it and converts it into an intermediate representation that is used solely by the CLIKA ACE SDK.
ClikaModule: The object that wraps a given input model. ClikaModule inherits from torch.nn.Module and behaves as such.
Monolithic ClikaModule: A single ClikaModule that represents an entire input model. In the future, if a given input model contains Data-Dependent Control-Flow statements, the SDK may instead return the input model itself, with some of its submodules individually wrapped by ClikaModule instances.

Start

To start ACE, wrap the torch.nn.Module in a ClikaModule using clika_compile (or torch.compile with the CLIKA backend), as shown in the example below.

caution

The model to be compressed by ACE is assumed to contain no Data-Dependent Control-Flow statements, such as if statements or for loops whose behavior depends on the values of input tensors. For additional details on these restrictions, see Data-Dependent Control-Flow.

At present, the CLIKA SDK does not support partial compilation of submodules inside the user-supplied torch.nn.Module. A future release will support partial compilation and will also handle control-flow operations; for now, the compile call (clika_compile or torch.compile) returns a single monolithic ClikaModule.
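As an illustration (not taken from the CLIKA documentation), the first module below branches in Python on a tensor value, which is data-dependent control flow. The second computes the same result without a Python branch by evaluating both paths and selecting one with torch.where, so its graph is the same for every input. Whether a given rewrite is accepted by ACE is an assumption to check against the Data-Dependent Control-Flow page; the class names are hypothetical.

```python
import torch


class BranchingModel(torch.nn.Module):
    """Data-dependent control flow: the branch taken depends on the input values."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.mean() > 0:  # Python `if` evaluated on a tensor value
            return x * 2.0
        return x - 1.0


class BranchFreeModel(torch.nn.Module):
    """Same computation expressed with tensor ops only."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        cond = x.mean() > 0  # 0-dim bool tensor, broadcast by torch.where
        # Both branches are computed; torch.where picks the result elementwise.
        return torch.where(cond, x * 2.0, x - 1.0)
```

The trade-off is that both branches are always computed, which costs extra work but keeps the graph static.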

Example

import os
import tempfile

import numpy as np
import onnxruntime as ort
import torch
from clika_ace import (
    ClikaModule,
    clika_compile,
    DeploymentSettings_TensorRT_ONNX
)

class Model(torch.nn.Module):

    def forward(self, x):
        # Overwrite the first 3 elements of the last axis in place.
        x[..., :3] = 1.0
        return x

xs = torch.rand(32, 3, 224)
clika_model: ClikaModule = clika_compile(
    model=Model(),
    calibration_inputs=xs,
    deployment_settings=DeploymentSettings_TensorRT_ONNX(),
)
# A temporary directory lets the exported file be reopened on any OS.
with tempfile.TemporaryDirectory() as tmp_dir:
    onnx_path = os.path.join(tmp_dir, "model.onnx")
    clika_model.clika_export(
        file=onnx_path,
        input_names=["x"]
    )
    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    outputs = session.run(
        output_names=None,
        input_feed={
            "x": xs.numpy()
        }
    )[0]
# The reference model modifies its input in place, so run it on a copy.
ref_outputs = Model()(xs.clone()).numpy()
assert outputs.shape == ref_outputs.shape and np.all(outputs == ref_outputs)

Compile API

For details, see the Compile API page.

Save / Resume

Once the model has been wrapped in a ClikaModule, it can be saved and loaded:

...
clika_model: ClikaModule = clika_compile(...)

# Serialize the model, splitting it into multiple chunks if needed.
clika_model.clika_save(save_path)

# Read the chunks back into a new ClikaModule instance.
restored_clika_model = ClikaModule.clika_load(save_path)

Saving is beneficial in the following scenarios:

- Interruption of the training process: if an ACE session crashes or is interrupted mid-run, you can resume it from the last saved checkpoint.
- Introduction of new data: you can resume the ACE session with a new dataset, maintaining continuity in the compression process.
- Additional fine-tuning: to fine-tune the model further, you can continue the ACE session from a previous checkpoint and run additional epochs without starting from scratch.

To resume, load the saved model with ClikaModule.clika_load and continue executing your script as you would with a regular torch.nn.Module.
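Resuming after an interruption requires knowing which checkpoint was written last. One way to track this, sketched below under the assumption that each epoch's model is saved with clika_save to its own path, is to record the latest save path and epoch in a small JSON file. The file name latest.json, the helper functions, and train_one_epoch/num_epochs are hypothetical and not part of clika-ace; the ClikaModule calls appear only in comments.

```python
import json
import os
from typing import Optional, Tuple

# Hypothetical bookkeeping file, kept next to the checkpoints.
RECORD_NAME = "latest.json"


def record_checkpoint(root: str, epoch: int, save_path: str) -> None:
    """Remember that `save_path` holds the model saved after `epoch`."""
    os.makedirs(root, exist_ok=True)
    tmp_path = os.path.join(root, RECORD_NAME + ".tmp")
    with open(tmp_path, "w") as f:
        json.dump({"epoch": epoch, "save_path": save_path}, f)
    # Write to a temp file, then rename over the old record. On POSIX the
    # rename is atomic, so a crash mid-write leaves the previous record intact.
    os.replace(tmp_path, os.path.join(root, RECORD_NAME))


def last_checkpoint(root: str) -> Optional[Tuple[int, str]]:
    """Return (epoch, save_path) of the last recorded checkpoint, or None."""
    record_path = os.path.join(root, RECORD_NAME)
    if not os.path.exists(record_path):
        return None
    with open(record_path) as f:
        record = json.load(f)
    return record["epoch"], record["save_path"]


# Sketch of a resumable loop (train_one_epoch and num_epochs are hypothetical):
#
# found = last_checkpoint("checkpoints")
# if found is None:
#     clika_model, start_epoch = clika_compile(...), 0
# else:
#     clika_model = ClikaModule.clika_load(found[1])
#     start_epoch = found[0] + 1
# for epoch in range(start_epoch, num_epochs):
#     train_one_epoch(clika_model)
#     save_path = os.path.join("checkpoints", f"epoch_{epoch}")
#     clika_model.clika_save(save_path)
#     record_checkpoint("checkpoints", epoch, save_path)
```

Keeping the record separate from the checkpoints avoids relying on how clika_save names its chunks on disk.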

Deploy

To deploy a ClikaModule instance, call clika_model.clika_export(...). Its API is similar to that of torch.onnx.export.

There are two types of deployment:

- Dynamic shape deployment: dynamic_axes is provided.
- Static shape deployment: dynamic_axes is not provided.

Dynamic shape deployment produces a compressed model that works with varying input shapes. If the dynamic_axes argument is provided and specifies a symbolic shape (e.g. None or a str) for at least one axis, the entire model is deployed with dynamic shape input.

Static shape deployment produces a compressed model that accepts a single, fixed shape for each input.
That shape is taken from the tensor passed to the args argument.
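The selection rule above can be restated as a small function. This is a hypothetical helper that mirrors the documented behavior for a dynamic_axes mapping of input names to {axis: size}, where None or a string marks a symbolic axis; it is not part of clika-ace, and treating any other value (such as an int) as non-symbolic is an assumption of this sketch.

```python
from typing import Mapping, Optional, Union

# Hypothetical type: input name -> {axis index: size}, where None or a str
# (e.g. "batch_size") marks the axis as symbolic.
AxisSpec = Mapping[int, Union[int, str, None]]


def deployment_mode(dynamic_axes: Optional[Mapping[str, AxisSpec]]) -> str:
    """Return "dynamic" if any axis of any input is symbolic, else "static"."""
    if not dynamic_axes:
        # dynamic_axes not provided: every shape comes from the `args` tensor.
        return "static"
    for axes in dynamic_axes.values():
        for size in axes.values():
            if size is None or isinstance(size, str):
                # A single symbolic axis makes the whole model dynamic-shape.
                return "dynamic"
    return "static"
```

For example, deployment_mode({"x": {0: "batch_size"}}) returns "dynamic", matching the dynamic batch-size export shown below.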

tip

The benefit of static shape deployment is typically faster inference speed, since all shapes are specified; most inference frameworks can provide additional optimization once all shapes are known. Additionally, some target frameworks may not support dynamically-shaped inputs.

caution

Note that dynamic shape deployment may still fail if a model depends on specific shapes. A common example is a Flatten layer followed by a Linear layer at the end of a model: the Linear layer expects a fixed number of input features, so the spatial size of its input cannot vary. In this case, an AdaptiveAvgPool operation with an output size of 1 can be placed before the Flatten so that the Linear layer always receives the same number of features.
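To make the caution concrete, the sketch below builds two small heads in plain PyTorch (the layer sizes and the accepts helper are arbitrary, for illustration only). The Flatten + Linear head only accepts the spatial size it was built for; adding AdaptiveAvgPool2d(1) before Flatten reduces every feature map to 1x1, so the same head accepts any input resolution.

```python
import torch
from torch import nn

# Shape-dependent head: Linear expects exactly 8 * 4 * 4 features,
# so only 4x4 inputs work (padding=1 keeps the conv output at H x W).
fixed_head = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.Flatten(),
    nn.Linear(8 * 4 * 4, 10),
)

# Shape-agnostic head: pooling to 1x1 leaves 8 features for any H x W.
pooled_head = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)


def accepts(head: nn.Module, size: int) -> bool:
    """True if `head` runs on a batch of size x size RGB inputs."""
    try:
        out = head(torch.rand(2, 3, size, size))
    except RuntimeError:
        # A different H x W changes the flattened width and breaks Linear.
        return False
    return out.shape == (2, 10)


for size in (4, 7, 32):
    print(size, accepts(fixed_head, size), accepts(pooled_head, size))
# 4 True True
# 7 False True
# 32 False True
```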

Dynamic shape deployment example

clika_model.clika_export(
    f=f,
    input_names=["x"],
    dynamic_axes={"x": {0: "batch_size"}}  # we want a dynamic batch-size
)

Static shape deployment example

clika_model.clika_export(
    f=f,
    args=torch.rand(1, 3, 224, 224),  # will create a deployed model accepting *this* shape
    input_names=["x"],
    dynamic_axes=None
)

TensorRT deployment example

Deploying the compressed model with clika_export produces a <model_name>.onnx file. Use this file with the trtexec command shown below to build a .engine file, which can be deployed with TensorRT.

To build a .engine file, install TensorRT on your local machine or use an NVIDIA-provided docker container. The example below uses an NVIDIA docker container to build the engine with the following command (the docker image is pulled automatically):

docker run --gpus all --rm -v "$(pwd)":/workspace nvcr.io/nvidia/tensorrt:23.07-py3 bash -c "trtexec --onnx=outputs/MyModel.onnx --saveEngine=MyModel.engine --shapes=input_0:1x3x640x640 --int8 --workspace=1024"

The command sets the following parameters:

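TensorRT container version - 23.07
Path to the CLIKA-deployed .onnx file - e.g. outputs/MyModel.onnx
Path to save the TensorRT compiled engine - MyModel.engine
Input shape - 1x3x640x640 for the input named input_0 (this name must match the input name used at export)
INT8 precision - --int8
Workspace size - 1024 MiB (--workspace)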
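If you script the engine build, the trtexec arguments can be assembled in Python. The helper below is a hypothetical convenience, not part of clika-ace or TensorRT, and it only emits the flags listed above; the input name passed to --shapes must match the name given in input_names when the model was exported.

```python
import shlex
from typing import List, Sequence


def trtexec_args(
    onnx_path: str,
    engine_path: str,
    input_name: str,
    shape: Sequence[int],
    int8: bool = True,
    workspace_mib: int = 1024,
) -> List[str]:
    """Build a trtexec argument list using the flags described above."""
    args = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        # Renders as e.g. input_0:1x3x640x640.
        f"--shapes={input_name}:{'x'.join(str(d) for d in shape)}",
    ]
    if int8:
        args.append("--int8")
    args.append(f"--workspace={workspace_mib}")
    return args


cmd = trtexec_args("outputs/MyModel.onnx", "MyModel.engine", "input_0", (1, 3, 640, 640))
print(shlex.join(cmd))
```

The list form can be passed to subprocess.run inside an environment where trtexec is available; shlex.join renders it as a single shell-safe string.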

For more information, see ACE Examples and the TensorRT documentation.

Multi-GPU distributed compression

info

This feature will be re-enabled in our next major release! We are in the process of fully automating the distribution process, so that users no longer need to specify FSDP/DDP manually.

info

ClikaModule instances currently do not support being wrapped by PyTorch FSDP/DDP. The FSDP/DDP functionality is handled internally by the ClikaModule instance. The user's only responsibility is to save the model on rank 0 only, as is typical in distributed training.

To run multi-GPU distributed compression on a CLIKA model, launch your script with the torchrun command:

torchrun --nproc_per_node ... my_main.py
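The rank-0 save mentioned above can be guarded with the RANK environment variable, which torchrun sets for every worker process. The helpers below are a hypothetical sketch, not a clika-ace API; they treat a process without RANK (for example, a plain python launch) as rank 0. Multi-GPU compression is currently disabled, as noted above, so this only shows the pattern.

```python
import os
from typing import Callable


def is_rank_zero() -> bool:
    """True on the rank-0 process, or when not launched by torchrun."""
    # torchrun exports RANK (the global rank) to each worker; default to 0 otherwise.
    return int(os.environ.get("RANK", "0")) == 0


def save_on_rank_zero(save_fn: Callable[[str], None], path: str) -> bool:
    """Call save_fn(path) only on rank 0; return whether a save happened."""
    if not is_rank_zero():
        return False
    save_fn(path)
    return True


# Usage in my_main.py (save_path is hypothetical):
# save_on_rank_zero(clika_model.clika_save, save_path)
```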
