How to use CLIKA ACE
Background
The `clika-ace` Python package implements CLIKA's unique Automatic Compression Engine (ACE).
CLIKA ACE
- ACE is a "hardware-aware" engine that compresses the model optimally for a selected target framework such as Microsoft's ONNX Runtime, NVIDIA's TensorRT, Intel's OpenVINO, Qualcomm's QNN, or Google's TFLite.
- ACE can be applied to fine-tune a pre-trained model or to train a model from scratch. While ACE is capable of training models from scratch, it is recommended to use it to fine-tune pre-trained models.
The `clika-ace` package has three main usages:
- Start the ACE, initialized from an existing `torch.nn.Module`. The `torch.nn.Module` will be wrapped in a `ClikaModule`, which inherits from `torch.nn.Module` and behaves the same as one.
- Resume an ACE session, initialized from a saved `ClikaModule`.
- Deploy the compressed model to the chosen framework. Deploying a `ClikaModule` is straightforward and can be done using `torch.onnx.export`.
When initializing a `ClikaModule` from a `torch.nn.Module`, the model is transformed into the CLIKA IR (intermediate representation).
When you serialize the model for later use, it is saved in the CLIKA IR format.
Terminology:
- CLIKA ACE: CLIKA's Automatic Compression Engine.
- CLIKA IR / `pompom`: when the SDK receives a `torch.nn.Module` as input, it parses it and converts the model into an intermediate representation used solely by the CLIKA ACE SDK.
- `ClikaModule`: the object that wraps any given input model. `ClikaModule` inherits from `torch.nn.Module` and behaves as such.
- Monolithic `ClikaModule`: a single `ClikaModule` that represents an entire given input model. In the future, if any control-flow statements are present in a given input model, the SDK may instead return the same input model with its submodules individually wrapped by a `ClikaModule`.
Start
To start the ACE, it is recommended to use `torch.compile` to wrap the `torch.nn.Module` with a `ClikaModule`, as detailed below.
The model to be compressed by ACE is assumed to contain no control-flow statements, such as `if` statements, `for` loops, etc.
For additional details on these restrictions, see ACE Model Requirements.
At present, the CLIKA SDK does not support partial compilation of submodules inside the user-supplied `torch.nn.Module`.
In a future release, the CLIKA SDK will support partial compilation and will also be able to handle control-flow operations; as of now, however, the `torch.compile` call returns a single monolithic `ClikaModule` as output.
Example
import torch
from clika_ace import ClikaModule, DeploymentSettings_TensorRT_ONNX, Settings

class Model(torch.nn.Module):
    def forward(self, x):
        mask = x > 0.5
        indices = mask.nonzero(as_tuple=True)
        x[indices] = 0.0
        return x

xs = torch.rand(32, 3, 224)

settings = Settings()
settings.deployment_settings = DeploymentSettings_TensorRT_ONNX()

clika_model: ClikaModule = torch.compile(
    Model(),
    backend="clika",
    options={
        "settings": settings,
        "example_inputs": xs,
    }
)
Arguments for the `torch.compile` `options` parameter
The `torch.compile` function can be supplied with a wide range of options through the `options` argument; the available parameters are detailed below.
Example
self._model: ClikaModule = torch.compile(
    self._model,  # original module that will be replaced
    backend="clika",  # CLIKA backend
    options={  # options for the 'clika' backend
        "settings": self._args.clika_config,
        "example_inputs": example_inputs,
        "train_dataloader": train_dataloader,  # returns elements of structure (xs, ys)
        "discard_input_model": True,
        "logs_dir": os.path.join(self._args.output_dir, "logs"),
        "apply_on_data_fn": lambda x: x[0],  # grab the first element of a tuple (xs, ys)
        "feed_data_to_model_fn": lambda model, xs: model(xs),  # feed 'xs' into the model
    }
)
Parameters
`settings` (REQUIRED)
`Union[Settings, str, pathlib.Path]` - either a `Settings` object or a path to one.
This is the CLIKA ACE configuration.

`example_inputs` (REQUIRED)
`Any`: `torch.Tensor`, `Sequence[torch.Tensor]`, `Dict[str, torch.Tensor]`
This is used to trace and parse the model.
There is no need to pass this argument if `train_dataloader` is provided.

`train_dataloader` (REQUIRED)
`Iterable`: `torch.utils.data.DataLoader`, `Sequence`
This is used to run the calibration of the model for quantization, as well as any other internally used algorithms that optimize the model. Note: this parameter is strictly required if performing quantization, as data is needed for calibration; otherwise, it can be omitted.

`discard_input_model` (OPTIONAL)
`bool` - if True, the given model is cleared and deleted to free VRAM/RAM. Default: True

`apply_on_data_fn` (OPTIONAL)
`Callable[[Any], Any]`
This is used to return the actual data that the model should see. The function is applied to each item returned by the `train_dataloader`.
By default, the SDK will try to infer the structure of the returned data.

`feed_data_to_model_fn` (OPTIONAL)
`Callable[[torch.nn.Module, Any], Any]`
This is used to feed the model with the data produced by the pipeline `train_dataloader` -> `apply_on_data_fn` -> `feed_data_to_model_fn`.
The first argument of the function is the given `torch.nn.Module` and the second argument is the data.
The function should return the outputs of the `torch.nn.Module`.

`logs_dir` (OPTIONAL)
`Union[str, pathlib.Path]` - path to a directory where CLIKA ACE logs will be created and saved.
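As an illustration of the data-handling callbacks above: if the dataloader yields `(xs, ys)` tuples and only the inputs should reach the model, the two callbacks can be passed as below. This is a minimal sketch; the dataset, tensor shapes, `Model`, and `settings` objects are placeholders.

import torch
from torch.utils.data import DataLoader, TensorDataset
from clika_ace import ClikaModule

# Placeholder calibration data: the dataloader yields (xs, ys) tuples.
dataset = TensorDataset(torch.rand(128, 3, 224), torch.randint(0, 10, (128,)))
train_dataloader = DataLoader(dataset, batch_size=8)

options = {
    "settings": settings,  # a previously created Settings object
    "train_dataloader": train_dataloader,
    "apply_on_data_fn": lambda batch: batch[0],  # keep only 'xs' from (xs, ys)
    "feed_data_to_model_fn": lambda model, xs: model(xs),  # forward 'xs' through the model
}
clika_model: ClikaModule = torch.compile(Model(), backend="clika", options=options)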
Saving / loading the model
Once the model has been wrapped, it can easily be saved as follows:
...
clika_model: ClikaModule = torch.compile(
    Model(),
    backend="clika",
    options={
        "settings": settings,
        "example_inputs": xs,
    }
)

# Option 1, recommended: in distributed training you may want to save only on Rank-0.
# In distributed training settings, `state_dict()` implicitly syncs the state from the other ranks,
# so `state_dict()` must be called on all nodes, and only then should the result be saved on Rank-0.
state_dict = clika_model.state_dict()
torch.save(state_dict, file_path)
loaded_clika_model = ClikaModule()
loaded_clika_model.load_state_dict(torch.load(file_path))  # load it

# Option 2:
torch.save(clika_model, file_path)
loaded_clika_model = torch.load(file_path)  # load it

# Option 3, less recommended because in the future the SDK might return the
# original torch.nn.Module with submodules of `ClikaModule` in case
# control-flow statements are present.
torch.save(clika_model.clika_serialize(), file_path)
loaded_clika_model = ClikaModule.clika_from_serialized(torch.load(file_path))  # load it
Large models result in a bigger `ClikaModule`; it is therefore recommended to use a higher `pickle_protocol` to overcome pickle's size limit when saving.
import pickle
import torch
torch.save(..., pickle_protocol=pickle.HIGHEST_PROTOCOL)
Resume
This feature is beneficial in the following scenarios:
- Interruption of the training process: in the event of a mid-run crash or interruption during an ACE session, you can resume the operation from the last checkpoint.
- Introduction of new data: when new data is introduced to the training process, you can resume the ACE session with the new dataset, ensuring continuity in the compression process.
- Additional fine-tuning: if you wish to further fine-tune the model by running more epochs, you can continue the ACE session from a previous checkpoint, enabling you to run additional epochs without starting from scratch.
To do so, load the previous `ClikaModule` and continue executing your script as you would with a normal `torch.nn.Module`.
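For example, resuming from a checkpoint saved via Option 1 above might look as follows. This is a minimal sketch; the checkpoint path, optimizer choice, `loss_fn`, and dataloader are placeholders.

import torch
from clika_ace import ClikaModule

# Load the previously saved state into a fresh ClikaModule.
clika_model = ClikaModule()
clika_model.load_state_dict(torch.load("checkpoint.pt"))

# Continue fine-tuning exactly as with a regular torch.nn.Module.
optimizer = torch.optim.AdamW(clika_model.parameters(), lr=1e-4)
for xs, ys in train_dataloader:  # possibly a new dataset
    optimizer.zero_grad()
    outputs = clika_model(xs)
    loss = loss_fn(outputs, ys)
    loss.backward()
    optimizer.step()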
Deploy
To deploy a `ClikaModule`, you can either use the `torch.onnx.export` API or, in the case of a monolithic `ClikaModule`, call `clika_module.clika_export` to export the model using the CLIKA API.
To deploy a TFLite model, the `clika_module.clika_export` API must be used.
There are two types of deployment:
- Dynamic shape deployment - `dynamic_axes` provided
- Static shape deployment - `dynamic_axes` not provided
Choosing dynamic shape deployment ensures that the compressed model will work for differing input shapes.
If the `dynamic_axes` argument is provided and specifies a symbolic shape for one of the axes (e.g. `None` or a `str`), the entire model will be deployed with dynamic input shapes in mind.
Choosing static shape deployment results in a compressed model that accepts a single, specific shape for each input.
The input shape for which the compressed model is deployed is determined by the tensor passed to the `args` argument.
The benefit of static shape deployment is typically faster inference, since all shapes are specified; most inference frameworks can apply additional optimizations once all shapes are known. Additionally, some target frameworks may not support dynamically shaped inputs.
Note that dynamic shape deployment may still fail if a model depends on specific shapes, for example when it includes a `Flatten` layer followed by a `Linear` layer, as is common at the end of a model.
Instead of `Flatten`, you could use an `AdaptiveAvgPool` operation with an output size of 1, as sketched below.
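A minimal sketch of that substitution; the channel and class counts are illustrative only.

import torch

# Shape-dependent head: Flatten ties the Linear layer's input size to the spatial dimensions.
shape_dependent_head = torch.nn.Sequential(
    torch.nn.Flatten(),  # 64 x 7 x 7 -> 3136, only valid for 7x7 feature maps
    torch.nn.Linear(64 * 7 * 7, 10),
)

# Shape-agnostic head: pooling to 1x1 makes the Linear input size independent of the spatial dimensions.
shape_agnostic_head = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(1),  # any HxW -> 1x1
    torch.nn.Flatten(),  # 64 x 1 x 1 -> 64
    torch.nn.Linear(64, 10),
)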
Dynamic shape deployment example
dummy_inputs: dict = clika_module.clika_generate_dummy_inputs()  # this API is only available when using a monolithic ClikaModule
torch.onnx.export(
    model=clika_module,
    f=f,
    args=dummy_inputs,
    input_names=["x"],
    dynamic_axes={"x": {0: "batch_size"}},  # we want a dynamic batch-size
)
Static shape deployment example
torch.onnx.export(
    model=clika_module,
    f=f,
    args=torch.rand(1, 3, 224, 224),  # will create a deployed model accepting *this* shape
    input_names=["x"],
    dynamic_axes=None,
)
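After export, the resulting ONNX file can be sanity-checked with ONNX Runtime before further deployment. A minimal sketch; the file name and input shape are assumptions matching the static example above.

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("MyModel.onnx")  # the path that was passed as `f` to torch.onnx.export
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"x": dummy})  # "x" matches input_names above
print([o.shape for o in outputs])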
TensorRT deployment example
After running ACE and deploying, a `<model_name>.onnx` file will be generated.
This file should be used in conjunction with the `trtexec` command, as shown below, to create a `.engine` file that can be deployed to TensorRT.
To deploy your model to a `.engine` file, install TensorRT on your local machine or use an NVIDIA-provided Docker container.
Here, we use an NVIDIA Docker container to deploy our model by running the following command (the Docker image will be pulled automatically):
docker run --gpus all --rm -v .:/workspace/ nvcr.io/nvidia/tensorrt:23.07-py3 bash -c "trtexec --onnx=outputs/MyModel.onnx --saveEngine=MyModel.engine --shapes=input_0:1x3x640x640 --int8 --workspace=1024"
The command sets the following parameters:
- TensorRT container version - `23.07`
- Path to the CLIKA-deployed `.onnx` file - e.g. `outputs/MyModel.onnx`
- Path to save the TensorRT-compiled model - `MyModel.engine`
- Input shape - `1x3x640x640`
For more information, see CLIKA examples and the TensorRT documentation.
Multi-GPU distributed compression
To use multi-GPU distributed compression, you can run the CLIKA-wrapped model the same way as you would normally run the input PyTorch model.
Alternatively, you can use the common `torchrun` command:
torchrun --nproc_per_node ... my_main.py
In either case, be sure to set `multi_gpu=True`, and optionally `use_sharding=True` if you are interested in FSDP/DeepSpeed-like behavior, in the CLIKA Settings.
You can always set both `multi_gpu` and `use_sharding` to `True` even if you don't plan to use distributed compression; in this case, the model will run in a distributed manner when invoked via the `torchrun` command, and in a non-distributed manner otherwise.
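As a minimal sketch, and assuming `multi_gpu` and `use_sharding` are attributes set directly on the `Settings` object (their exact location in your configuration may differ):

from clika_ace import DeploymentSettings_TensorRT_ONNX, Settings

settings = Settings()
settings.deployment_settings = DeploymentSettings_TensorRT_ONNX()
settings.multi_gpu = True  # assumed attribute location; enables distributed compression
settings.use_sharding = True  # optional, for FSDP/DeepSpeed-like sharding behavior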
`ClikaModule`s do not support being wrapped by PyTorch FSDP/DDP; the FSDP/DDP functionality is handled internally by the `ClikaModule` instance.
The user's only responsibility is to make sure the model is saved on Rank-0, as is typical in a distributed training setting.
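A minimal sketch of Rank-0 saving, following Option 1 from the saving section above and assuming the process group has already been initialized by `torchrun`:

import pickle
import torch
import torch.distributed as dist

# state_dict() must be called on ALL ranks so the ClikaModule can sync its state internally.
state_dict = clika_model.state_dict()

# Save only on Rank-0, as is typical in distributed training.
if (not dist.is_initialized()) or dist.get_rank() == 0:
    torch.save(state_dict, "checkpoint.pt", pickle_protocol=pickle.HIGHEST_PROTOCOL)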