Version: 24.9.0

How to use CLIKA ACE

Background

The clika-ace Python package implements CLIKA's unique Automatic Compression Engine (ACE).

CLIKA ACE

  • ACE is a "hardware-aware" engine, and compresses the model optimally for a selected target framework such as Microsoft's ONNX Runtime, NVIDIA's TensorRT, Intel's OpenVINO, Qualcomm's QNN or Google's TFLite.
  • ACE can be applied to fine-tune a pre-trained model or to train a model from scratch.
    While ACE is capable of training models from scratch, it's better to use it to fine-tune pre-trained models.

The clika-ace package has three main uses:

  • Start the ACE, initialized from an existing torch.nn.Module.
    The torch.nn.Module will be wrapped using a ClikaModule which inherits from torch.nn.Module.
    The ClikaModule behaves the same as a torch.nn.Module.
  • Resume an ACE session initialized from saved ClikaModule.
  • Deploy the compressed model to the chosen framework.
    Deploying a ClikaModule is straightforward and can be done using torch.onnx.export.

info

When initializing ClikaModule out of torch.nn.Module, the model is transformed into a CLIKA IR (intermediate representation). When you serialize the model for later use it will be saved in the CLIKA IR format.

Terminology:

  • CLIKA ACE: CLIKA's Automatic Compression Engine
  • CLIKA IR / pompom: When the SDK receives a torch.nn.Module as an input, it parses it and converts the model into an intermediate representation that is solely used by CLIKA ACE SDK.
  • ClikaModule: The object that wraps any given input model. ClikaModule inherits from torch.nn.Module and behaves as such.
  • Monolithic ClikaModule: A single ClikaModule that represents an entire given input model. In the future, if any control-flow statements are present in a given input model, the SDK may instead return the original input model with submodules that are each individually wrapped by a ClikaModule.

Start

In order to start the ACE, it is recommended to use torch.compile to wrap the torch.nn.Module with a ClikaModule as detailed below.

caution

The model to be compressed by ACE is assumed to be without any 'control flow' statements, such as if statements, for loops, etc. For additional details on these restrictions, see ACE Model Requirements.

At present, the CLIKA SDK does not support partial compilation of submodules inside the user-supplied torch.nn.Module. A future release will support partial compilation and will also be able to handle control flow operations; for now, however, the torch.compile call returns a single monolithic ClikaModule.

Example

import torch
# Settings is assumed to be importable from clika_ace alongside the other names.
from clika_ace import ClikaModule, DeploymentSettings_TensorRT_ONNX, Settings


class Model(torch.nn.Module):

    def forward(self, x):
        # Zero out every element greater than 0.5.
        mask = x > 0.5
        indices = mask.nonzero(as_tuple=True)
        x[indices] = 0.0
        return x


xs = torch.rand(32, 3, 224)  # example input used to trace the model
settings = Settings()
settings.deployment_settings = DeploymentSettings_TensorRT_ONNX()
clika_model: ClikaModule = torch.compile(
    Model(),
    backend="clika",
    options={
        "settings": settings,
        "example_inputs": xs,
    },
)

Arguments for the torch.compile options parameter

The torch.compile function can be supplied with a wide range of options through the options argument, the parameters for which we detail below.

Example

self._model: ClikaModule = torch.compile(
    self._model,  # original module that will be replaced
    backend="clika",  # CLIKA backend
    options={  # options for the 'clika' backend
        "settings": self._args.clika_config,
        "example_inputs": example_inputs,
        "train_dataloader": train_dataloader,  # yields elements of structure (xs, ys)
        "discard_input_model": True,
        "logs_dir": os.path.join(self._args.output_dir, "logs"),
        "apply_on_data_fn": lambda x: x[0],  # grab the first element of a tuple (xs, ys)
        "feed_data_to_model_fn": lambda model, xs: model(xs),  # feed 'xs' into the model
    },
)

Parameters

settings (REQUIRED)

Union[Settings, str, pathlib.Path] - Either a Settings object or a path to a Settings file.

This is the CLIKA ACE configuration.

example_inputs (REQUIRED unless train_dataloader is provided)

Any: torch.Tensor, Sequence[torch.Tensor], Dict[str, torch.Tensor]

This is used to trace and parse the model. There is no need to pass this argument if train_dataloader is provided.

train_dataloader (REQUIRED for quantization)

Iterable: torch.utils.data.DataLoader, Sequence

This is used to calibrate the model for quantization and to run any other internal algorithms that optimize the model. Note: this parameter is strictly required when performing quantization, because calibration needs data. Otherwise, it can be omitted.

discard_input_model (OPTIONAL)

bool - If True, the given model is cleared and deleted to free VRAM/RAM. Default: True

apply_on_data_fn (OPTIONAL)

Callable[[Any], Any]

This function is applied to each element returned by the train_dataloader and returns the actual data that the model should see. By default, the SDK tries to infer the structure of the returned data.

feed_data_to_model_fn (OPTIONAL)

Callable[[torch.nn.Module, Any], Any]

This is used to feed the model with the data produced by the pipeline: train_dataloader -> apply_on_data_fn -> feed_data_to_model_fn

The first argument of the function is the given torch.nn.Module and the second argument is the data returned by apply_on_data_fn. The function should return the outputs of the torch.nn.Module.
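
The order in which the two callbacks are applied can be sketched in plain Python. This is only an illustration of the data flow: the run_calibration_pass helper, the loop, and the stand-in model and dataloader are hypothetical and are not part of the clika-ace API.

```python
from typing import Any, Callable, Iterable, List


def run_calibration_pass(
    model: Callable[[Any], Any],
    train_dataloader: Iterable[Any],
    apply_on_data_fn: Callable[[Any], Any],
    feed_data_to_model_fn: Callable[[Callable[[Any], Any], Any], Any],
) -> List[Any]:
    """Hypothetical helper mimicking train_dataloader -> apply_on_data_fn -> feed_data_to_model_fn."""
    outputs = []
    for batch in train_dataloader:
        model_inputs = apply_on_data_fn(batch)  # e.g. drop the labels
        outputs.append(feed_data_to_model_fn(model, model_inputs))
    return outputs


# Stand-in "model" and "dataloader" so the sketch runs without torch.
def double(xs):
    return [2 * x for x in xs]


batches = [([1, 2], "label_a"), ([3], "label_b")]  # elements of structure (xs, ys)

results = run_calibration_pass(
    model=double,
    train_dataloader=batches,
    apply_on_data_fn=lambda batch: batch[0],  # keep only xs
    feed_data_to_model_fn=lambda model, xs: model(xs),
)
print(results)
```

With a real torch.utils.data.DataLoader the same two lambdas would typically be passed through the options dictionary, as in the example above.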

logs_dir (OPTIONAL)

Union[str, pathlib.Path] - Path to a directory to create and save CLIKA ACE logs.

Saving / loading the model

Once the Model has been wrapped, it can easily be saved as follows:

...
clika_model: ClikaModule = torch.compile(
    Model(),
    backend="clika",
    options={
        "settings": settings,
        "example_inputs": xs,
    },
)
# Option 1: recommended, since in distributed training you may want to save only on Rank-0.
# In distributed training the `state_dict()` call also implicitly syncs the state from the other ranks,
# so `state_dict()` must be called on all ranks and only then saved on Rank-0.
state_dict = clika_model.state_dict()
torch.save(state_dict, file_path)

loaded_clika_model = ClikaModule()
loaded_clika_model.load_state_dict(torch.load(file_path))  # load it

# Option 2:
torch.save(clika_model, file_path)
loaded_clika_model = torch.load(file_path)  # load it

# Option 3: less recommended, because in the future the SDK might
# return the original torch.nn.Module with `ClikaModule` submodules in case
# control-flow statements are present.
torch.save(clika_model.clika_serialize(), file_path)
loaded_clika_model = ClikaModule.clika_from_serialized(torch.load(file_path))  # load it

caution

Large models result in a larger ClikaModule. Pickle protocol 4 and above add support for very large objects, so it is recommended to pass a higher pickle_protocol to torch.save in order to avoid the size limit of older protocols.

import pickle
import torch
torch.save(..., pickle_protocol=pickle.HIGHEST_PROTOCOL)
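
To check which protocol a file was actually written with, the standard pickle module can be inspected directly. The sketch below uses only the Python standard library and a small stand-in dictionary instead of a ClikaModule; it relies on the fact that pickles written with protocol 2 or higher begin with the PROTO opcode (0x80) followed by the protocol number.

```python
import pickle

payload = {"weights": list(range(10))}  # stand-in for a model state
blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

# Protocol >= 2 pickles start with the PROTO opcode (0x80), then the protocol number.
used_protocol = blob[1]
print("protocol used:", used_protocol)
print("supports very large objects:", used_protocol >= 4)

restored = pickle.loads(blob)
```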

Resume

This feature is beneficial in the following scenarios:

  1. Interruption of the training process:

    • In the event of a mid-run crash or interruption during an ACE session, you can resume the operation from the last checkpoint.
  2. Introduction of new data:

    • When new data is introduced to the training process, it will allow you to resume the ACE session with a new dataset, ensuring continuity in the compression process.
  3. Additional fine-tuning:

    • If you wish to further fine-tune the model by running more epochs, you can continue the ACE session starting from a previous checkpoint, enabling you to run additional epochs without starting from scratch.

To do so, load the previous ClikaModule and continue executing your script as you would with a normal torch.nn.Module.
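
A resume-or-start decision can be sketched as follows. The load_or_start helper and the dictionary standing in for the model are hypothetical; in a real script, load_fn would be one of the loading options from the saving / loading section and start_fn would wrap the torch.compile call.

```python
import os
import pickle
import tempfile
from typing import Any, Callable, Tuple


def load_or_start(
    checkpoint_path: str,
    load_fn: Callable[[str], Any],
    start_fn: Callable[[], Any],
) -> Tuple[Any, bool]:
    """Hypothetical helper: return (model, resumed)."""
    if os.path.isfile(checkpoint_path):
        return load_fn(checkpoint_path), True  # resume from the checkpoint
    return start_fn(), False  # no checkpoint yet: start a new session


def load_pickle(path: str) -> Any:
    with open(path, "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint.pkl")

    # First run: nothing to resume from.
    model, resumed_first = load_or_start(ckpt, load_pickle, lambda: {"epoch": 0})
    model["epoch"] += 1
    with open(ckpt, "wb") as fh:
        pickle.dump(model, fh)

    # Second run: the checkpoint is picked up.
    model, resumed_second = load_or_start(ckpt, load_pickle, lambda: {"epoch": 0})

print(resumed_first, resumed_second, model)
```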

Deploy

To deploy a ClikaModule, you can use either the torch.onnx.export API or, in the case of a monolithic ClikaModule, call clika_module.clika_export to export the model using the CLIKA API.

caution

To deploy a TFLite model, the clika_module.clika_export API must be used.

There are two types of deployment:

  1. Dynamic shape deployment - dynamic_axes provided
  2. Static shape deployment - dynamic_axes not provided.

Dynamic shape deployment ensures that the compressed model works with differing input shapes. If the dynamic_axes argument is provided and specifies a symbolic size for one of the axes (e.g. None or a str), the entire model is deployed with dynamic input shapes in mind.

Choice of static shape deployment will result in a compressed model which will take a single, particular shape for each input.
The input shape for which the compressed model will be deployed is determined by that of the tensor object passed to the args argument.

tip

The benefit of static shape deployment is typically faster inference speed, since all shapes are specified; most inference frameworks can provide additional optimization once all shapes are known. Additionally, some target frameworks may not support dynamically-shaped inputs.

caution

Note that dynamic shape deployment may still fail if the model depends on specific shapes, for example if it includes a Flatten layer followed by a Linear layer, as is common at the end of a model.

Instead of Flatten, you could use an AdaptiveAvgPool operation with an output size of 1.
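
The arithmetic behind this restriction can be sketched without any framework. The channel count and stride below are illustrative values, not taken from any particular model: after Flatten, the number of features reaching the Linear layer depends on the input resolution, whereas after pooling to a 1x1 output it depends only on the channel count.

```python
def flatten_features(channels: int, height: int, width: int) -> int:
    """Features seen by a Linear layer placed after Flatten on a (C, H, W) map."""
    return channels * height * width


def pooled_features(channels: int, height: int, width: int) -> int:
    """Features after average pooling to 1x1: H and W no longer matter."""
    return channels


CHANNELS, STRIDE = 512, 32  # illustrative values only

sizes = {}
for side in (224, 320):
    h = w = side // STRIDE  # spatial size of the final feature map
    sizes[side] = (flatten_features(CHANNELS, h, w), pooled_features(CHANNELS, h, w))
    print(side, sizes[side])
```

A Linear layer has a fixed number of input features, so it can match only one of the Flatten results, while the pooled variant yields the same feature count for both resolutions.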

Dynamic shape deployment example

dummy_inputs: dict = clika_module.clika_generate_dummy_inputs()  # this API is only available when using a monolithic ClikaModule
torch.onnx.export(
    model=clika_module,
    f=f,
    args=dummy_inputs,
    input_names=["x"],
    dynamic_axes={"x": {0: "batch_size"}},  # we want a dynamic batch size
)

Static shape deployment example

torch.onnx.export(
    model=clika_module,
    f=f,
    args=torch.rand(1, 3, 224, 224),  # will create a deployed model accepting *this* shape
    input_names=["x"],
    dynamic_axes=None,
)

TensorRT deployment example

After running ACE and deploying, a <model_name>.onnx file will be generated. This file should be used in conjunction with the trtexec command, as shown below, to create a .engine file, which can be deployed to TensorRT.

To deploy your model to a .engine file, install TensorRT on your local machine or use an NVIDIA-provided docker container. Here, we use a docker container by NVIDIA to deploy our model by running the following commands (the docker image will be automatically pulled):

docker run --gpus all --rm -v .:/workspace/ nvcr.io/nvidia/tensorrt:23.07-py3 -c trtexec --onnx=outputs/MyModel.onnx --saveEngine=MyModel.engine  --shapes=input_0:1x3x640x640  --int8 --workspace=1024 

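The command sets the following parameters:

  • --gpus all, --rm, -v .:/workspace/: docker run options that expose all GPUs to the container, remove the container when it exits, and mount the current directory at /workspace/.
  • --onnx: path to the input ONNX model.
  • --saveEngine: path where the serialized TensorRT engine is written.
  • --shapes: shapes of the named inputs, given as <input_name>:<dims>.
  • --int8: enables INT8 precision.
  • --workspace: workspace size in MiB.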

For more information, see CLIKA examples and the TensorRT documentation.
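
When the input shapes are already known in a Python script, the value of the --shapes flag can be assembled programmatically. The format_shapes_flag helper below is hypothetical; it only formats the input_name:AxBxC... pattern used in the command above, joining several inputs with commas.

```python
from typing import Dict, Sequence


def format_shapes_flag(shapes: Dict[str, Sequence[int]]) -> str:
    """Hypothetical helper: build a trtexec --shapes flag from {input_name: dims}."""
    specs = []
    for name, dims in shapes.items():
        dim_text = "x".join(str(d) for d in dims)
        specs.append(f"{name}:{dim_text}")
    return "--shapes=" + ",".join(specs)


flag = format_shapes_flag({"input_0": (1, 3, 640, 640)})
print(flag)
```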

Multi-GPU distributed compression

To use multi-GPU distributed compression with a ClikaModule, you can launch your script the same way as you would normally launch it for the input PyTorch model. Alternatively, you can use the common torchrun command:

torchrun --nproc_per_node ... my_main.py

In either case, be sure to set multi_gpu=True and optionally use_sharding=True in the CLIKA Settings if you're interested in FSDP/DeepSpeed behavior.

tip

You can always set both multi_gpu and use_sharding to True even if you don't plan to use distributed compression; in this case, the model will run in a distributed manner when launched with the torchrun command, and in a non-distributed manner otherwise.

caution

ClikaModules do not support being wrapped by PyTorch FSDP/DDP. The FSDP/DDP functionality is handled internally by the ClikaModule instance. The only responsibility of the user is making sure to save the model on Rank-0 as is typical in a distributed training setting.
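
The Rank-0 saving pattern can be sketched as follows. The save_on_rank_zero helper is hypothetical and uses only the standard library; it relies on the RANK environment variable that torchrun sets for each process. In a real script, get_state would be clika_model.state_dict and save_fn would call torch.save.

```python
import os
from typing import Any, Callable, Optional


def save_on_rank_zero(
    get_state: Callable[[], Any],
    save_fn: Callable[[Any], None],
    rank: Optional[int] = None,
) -> bool:
    """Hypothetical helper: collect the state on every rank, save only on rank 0."""
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))  # set by torchrun; 0 when not distributed
    state = get_state()  # must run on all ranks so the state can be synced
    if rank == 0:
        save_fn(state)
        return True
    return False


# Simulate two ranks with a stand-in state and an in-memory "file".
saved = []
did_save = [
    save_on_rank_zero(lambda: {"step": 10}, saved.append, rank=r) for r in (0, 1)
]
print(did_save, saved)
```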