How to use CLIKA ACE
Background
The `clika-ace` Python package implements CLIKA's unique Automatic Compression Engine (ACE).
CLIKA ACE
- ACE is a "hardware-aware" engine that compresses the model specifically for a selected target framework such as Microsoft's ONNX Runtime, NVIDIA's TensorRT, or Intel's OpenVINO.
- ACE can be applied to fine-tune a pre-trained model or to train a model from scratch. While ACE is capable of training models from scratch, we recommend using it to fine-tune pre-trained models or to quantize them.
- ACE exposes entrypoints using the `clika_compile` function or using a backend to call `torch.compile`. We recommend using the `clika_compile` function call, as it will offer auto-completion in an IDE (see the sketch below).
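As a rough sketch, the two entrypoints look as follows. The exact `torch.compile` backend string is not given in this document, so the `"clika"` string below is a hypothetical placeholder:

```python
import torch
from clika_ace import ClikaModule, clika_compile, DeploymentSettings_TensorRT_ONNX

model = MyModel()  # any torch.nn.Module without data-dependent control flow
xs = torch.rand(32, 3, 224)  # illustrative calibration inputs

# Recommended: the direct entrypoint, with IDE auto-completion.
clika_model: ClikaModule = clika_compile(
    model=model,
    calibration_inputs=xs,
    deployment_settings=DeploymentSettings_TensorRT_ONNX(),
)

# Alternative: via torch.compile; "clika" is a hypothetical backend name.
compiled_model = torch.compile(model, backend="clika")
```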
The `clika-ace` package has three main usages, tied together in the sketch that follows this list:
- Start the ACE, initialized from an existing `torch.nn.Module`. The `torch.nn.Module` will be wrapped in a `ClikaModule` instance, which inherits from `torch.nn.Module`. As such, a `ClikaModule` instance behaves like a `torch.nn.Module`.
- Resume an ACE session initialized from a saved `ClikaModule`.
- Export the compressed model to the chosen framework. Deployment of a `ClikaModule` instance to a particular output target framework is done using the `clika_model.clika_export(...)` function, which has a similar API to `torch.onnx.export`.
When a `ClikaModule` is initialized from a `torch.nn.Module`, the model is transformed into CLIKA IR (intermediate representation).
When the model is serialized for later use, it is saved in the CLIKA IR format.
Terminology:
- CLIKA ACE: CLIKA's Automatic Compression Engine
- CLIKA IR / `pompom`: When the SDK receives a `torch.nn.Module` as input, it parses it and converts the model into an intermediate representation that is used solely by the CLIKA ACE SDK.
- `ClikaModule`: The object that wraps a given input model. `ClikaModule` inherits from `torch.nn.Module` and behaves as such.
- Monolithic `ClikaModule`: A single `ClikaModule` that represents an entire given input model. Alternatively, in the future, if any Data-Dependent Control-Flow statements are present in a given input model, the SDK may return the same input model with each of its submodules individually wrapped by a `ClikaModule`.
Start
In order to start the ACE, it is recommended to use `clika_compile` to wrap the `torch.nn.Module` with a `ClikaModule`, as detailed below.
The model to be compressed by ACE is assumed to contain no Data-Dependent Control-Flow statements, such as `if` statements or `for` loops.
For additional details on these restrictions, see Data-Dependent Control-Flow.
At present, the CLIKA SDK does not support partial compilation of submodules inside the user-supplied `torch.nn.Module`.
In a future release, the CLIKA SDK will support partial compilation and will also be able to handle control-flow operations;
as of now, however, the compile call returns a single monolithic `ClikaModule` as output.
Example
```python
import tempfile

import numpy as np
import onnxruntime as ort
import torch

from clika_ace import (
    ClikaModule,
    clika_compile,
    DeploymentSettings_TensorRT_ONNX,
)


class Model(torch.nn.Module):
    def forward(self, x):
        x[..., :3] = 1.0
        return x


xs = torch.rand(32, 3, 224)

# Wrap the model with a ClikaModule, compiled for a TensorRT (ONNX) target.
clika_model: ClikaModule = clika_compile(
    model=Model(),
    calibration_inputs=xs,
    deployment_settings=DeploymentSettings_TensorRT_ONNX(),
)

# Export to ONNX and verify the deployed model against the original.
with tempfile.NamedTemporaryFile("wb") as fp:
    clika_model.clika_export(
        file=fp.name,
        input_names=["x"],
    )
    session = ort.InferenceSession(fp.name)
    outputs = session.run(
        output_names=None,
        input_feed={"x": xs.numpy()},
    )[0]
ref_outputs = Model()(xs).numpy()
assert outputs.shape == ref_outputs.shape and np.all(outputs == ref_outputs)
```
Compile API
Please see Compile API
Save / Resume
Once the model has been wrapped, it can easily be saved or loaded:
```python
...
clika_model: ClikaModule = clika_compile(...)

# This will serialize the model into multiple chunks if needed.
clika_model.clika_save(save_path)

# This will read the chunks.
restored_clika_model = ClikaModule.clika_load(save_path)
```
Saving is beneficial in the following scenarios:
Interruption of the training process:
- In the event of a mid-run crash or interruption during an ACE session, you can resume the operation from the last checkpoint.
Introduction of new data:
- When new data is introduced to the training process, it will allow you to resume the ACE session with a new dataset, ensuring continuity in the compression process.
Additional fine-tuning:
- If you wish to further fine-tune the model by running more epochs, you can continue the ACE session starting from a previous checkpoint, enabling you to run additional epochs without starting from scratch.
To do so, load the previously-used `ClikaModule` and continue with script execution as with a normal `torch.nn.Module`.
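Since a `ClikaModule` behaves like a `torch.nn.Module`, resuming looks like an ordinary PyTorch fine-tuning loop. This is a minimal sketch; `save_path`, `dataloader`, and the optimizer/loss choices are illustrative:

```python
import torch
from clika_ace import ClikaModule

# Resume the ACE session from a previous checkpoint.
clika_model = ClikaModule.clika_load(save_path)

optimizer = torch.optim.AdamW(clika_model.parameters(), lr=1e-4)
for xs, ys in dataloader:  # new or existing dataset
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(clika_model(xs), ys)
    loss.backward()
    optimizer.step()
```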
Deploy
To deploy a `ClikaModule` instance, use the function `clika_module.clika_export(...)`.
It has a similar API to `torch.onnx.export`.
There are two types of deployment:
- Dynamic shape deployment: `dynamic_axes` is provided.
- Static shape deployment: `dynamic_axes` is not provided.
Choosing dynamic shape deployment ensures that the compressed model will work for differing input shapes.
If the `dynamic_axes` argument is provided and specifies a symbolic shape for one of the axes (e.g. `None` or `str`), the entire model will be deployed with dynamic shape input.
Choosing static shape deployment results in a compressed model that accepts a single, particular shape for each input.
The input shape for which the compressed model will be deployed is determined by that of the tensor object passed to the `args` argument.
The benefit of static shape deployment is typically faster inference, since all shapes are specified; most inference frameworks can apply additional optimizations once all shapes are known. Additionally, some target frameworks may not support dynamically-shaped inputs.
Note that dynamic shape deployment may still fail if a model depends on specific shapes.
For example, a model that includes a `Flatten` layer followed by a `Linear` layer, as is common at the end of a model, fixes the expected input shape, since the `Linear` layer's input size depends on the flattened spatial dimensions.
In this case, instead of a `Flatten` operation, an `AdaptiveAvgPool` operation with an output size of 1 could be used, as sketched below.
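The following plain-PyTorch sketch (illustrative, not CLIKA-specific) contrasts the shape-dependent pattern with its shape-agnostic replacement:

```python
import torch

# Shape-dependent head: Linear's in_features is tied to a 224x224 input,
# so a dynamic-shape deployment would break at other resolutions.
shape_dependent = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.Flatten(),
    torch.nn.Linear(16 * 222 * 222, 10),
)

# Shape-agnostic head: AdaptiveAvgPool2d(1) reduces any spatial size to 1x1,
# so the Linear input size no longer depends on the input resolution.
shape_agnostic = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
)
```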
Dynamic shape deployment example
```python
clika_model.clika_export(
    f=f,
    input_names=["x"],
    dynamic_axes={"x": {0: "batch_size"}},  # we want a dynamic batch-size
)
```
Static shape deployment example
```python
clika_model.clika_export(
    f=f,
    args=torch.rand(1, 3, 224, 224),  # will create a deployed model accepting *this* shape
    input_names=["x"],
    dynamic_axes=None,
)
```
TensorRT deployment example
After running ACE, deploying the compressed model generates a `<model_name>.onnx` file.
This file should be used in conjunction with the `trtexec` command as shown below to create a `.engine` file, which can be deployed to TensorRT.
To deploy your model to a `.engine` file, install TensorRT on your local machine or use an NVIDIA-provided docker container.
In the example below, we use a docker container by NVIDIA to deploy our model by running the following command (the docker image will be pulled automatically):
```shell
docker run --gpus all --rm -v .:/workspace/ nvcr.io/nvidia/tensorrt:23.07-py3 trtexec --onnx=outputs/MyModel.onnx --saveEngine=MyModel.engine --shapes=input_0:1x3x640x640 --int8 --workspace=1024
```
The command sets the following parameters:
- TensorRT container version: `23.07`
- Path to the CLIKA deployed `.onnx` file: e.g. `outputs/MyModel.onnx`
- Path to save the TensorRT compiled model: `MyModel.engine`
- Input shape: `1x3x640x640`
For more information, see ACE Examples and the TensorRT documentation.
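As a quick sanity check, the generated engine can be loaded and benchmarked with `trtexec` as well. The command below assumes the same container and that the engine file sits in the mounted directory:

```shell
docker run --gpus all --rm -v .:/workspace/ nvcr.io/nvidia/tensorrt:23.07-py3 trtexec --loadEngine=MyModel.engine
```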
Multi-GPU distributed compression
This feature will be re-enabled in our next major release! We are in the process of fully automating the distribution process, to avoid the user having to specify FSDP/DDP manually.
`ClikaModule` instances do not currently support being wrapped by PyTorch FSDP/DDP.
The FSDP/DDP functionality is handled internally by the `ClikaModule` instance.
The only responsibility of the user is to make sure the model is saved on Rank-0, as is typically done in a distributed training setting (see the sketch below).
In general, to use multi-GPU distributed compression on a CLIKA model, simply use the `torchrun` command:
```shell
torchrun --nproc_per_node ... my_main.py
```
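Inside the entrypoint script (the hypothetical `my_main.py` from the command above), a minimal rank-0 save could look like this, assuming the standard `torch.distributed` environment set up by `torchrun`:

```python
import torch.distributed as dist
from clika_ace import ClikaModule, clika_compile

clika_model: ClikaModule = clika_compile(...)  # FSDP/DDP is handled internally

# ... run the ACE / fine-tuning work ...

# Save only on Rank-0, as is typical in distributed training.
if not dist.is_initialized() or dist.get_rank() == 0:
    clika_model.clika_save("checkpoint")  # path is illustrative
```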