Version: 0.3

How to use CLIKA Compression

Background

The clika-compression Python package implements CLIKA's unique compression engine, called the CLIKA Compression Operation (CCO).

CCO is the process in which the model is compressed with CLIKA's compression engine.
CCO is "hardware-aware" and compresses the model optimally for the selected deployment framework.
CCO can be applied to fine-tune a pretrained model or to train a model from scratch.

The clika-compression package has three main uses:

Start CCO that is initialized from an existing torch.nn.Module.
Resume CCO that is initialized from an existing compressed model (.pompom file).
Deploy the compressed model (.pompom file) to a chosen framework.

info

.pompom files are the outputs of the CCO. They contain the compressed model's checkpoint state.

For more information, see model.pompom.

Start a CLIKA Compression Operation

You can start a CCO from any torch.nn.Module by calling clika_compress.

In order to start the CCO you MUST provide the following items:

Model to compress (torch.nn.Module)
Optimizer (torch.optim.Optimizer)
Loss function (Callable -> Union[tuple,dict])
Function that returns the training dataloader (Callable -> torch.utils.data.DataLoader)
Target deployment framework, one of TensorRT, ONNX Runtime, OpenVINO, TFLite (experimental).

There are several optional inputs that you MAY provide as well:

Function that returns an evaluation dataset dataloader (Callable -> torch.utils.data.DataLoader), in case you require evaluation performance to be calculated during the compression process
Training metrics (if you require metrics on the training dataset during CCO)
Evaluation metrics (if you require metrics on the evaluation dataset during CCO)

caution

The provided loss functions, metrics and dataloaders must conform to certain requirements.

See CCO input requirements for more details.

Example

import torch
from clika_compression import Settings, clika_compress, get_path_to_best_clika_state_result
from torch.utils.data import DataLoader


model: torch.nn.Module = ...  # Set your model

# Function that returns the training dataloader
def get_train_loader() -> DataLoader:
    # TODO: Create your dataset and dataloader here
    return train_dataloader

optimizer = torch.optim.AdamW(params=model.parameters(), lr=0.0001)
settings = Settings()  # Default settings

# Start the "CCO" to compress model to generate a `.pompom` file
clika_compress(
    output_path='outputs',  # Path to save the compressed models and other outputs
    settings=settings,  # CCO settings
    model=model,  # Model to compress
    optimizer=optimizer,
    init_training_dataset_fn=get_train_loader, 
    init_evaluation_dataset_fn=None,
    training_losses={'cross_entropy_loss': torch.nn.CrossEntropyLoss()},
)
best_clika_state_file = get_path_to_best_clika_state_result('outputs')

Resume a CLIKA Compression Operation

If a CCO session was previously executed, and you wish to resume from the point where you left off, you can utilize the clika_resume function along with the relevant .pompom file.

This feature is beneficial in the following scenarios:

Interruption of the training process:
- If a CCO crashes or is interrupted mid-run, you can resume the operation from the last checkpoint.
Introduction of new data:
- When new data becomes available, you can resume the CCO with a new dataset, ensuring continuity in the compression process.
Additional fine-tuning:
- If you wish to fine-tune the model further, you can continue the CCO from the previous checkpoint and run additional epochs without starting from scratch.

Example

import torch
from clika_compression import Settings, clika_resume
from torch.utils.data import DataLoader


model: torch.nn.Module = ...  # Set your model (needed here to create the optimizer)
settings = Settings()  # Default settings
optimizer = torch.optim.AdamW(params=model.parameters(), lr=0.0001)


# Function that returns the training dataloader
def get_train_loader() -> DataLoader:
    # TODO: Create your dataset and dataloader here
    return train_dataloader


# Resume the "CCO" compression process
# Returns the path to the latest epoch's `.pompom` file
final: str = clika_resume(
    clika_state_path='path/to/.pompom',
    init_training_dataset_fn=get_train_loader,
    optimizer=optimizer,
    training_losses={'ce_loss': torch.nn.CrossEntropyLoss()},
    new_settings=settings,  # the new CCO settings
)

Deploy a model

caution

Do not deploy a .onnx file that was generated by CCO with a certain deployment setting (for example, ONNX Runtime) to another framework (for example, TensorRT). Each .pompom file is tailored to the deployment framework set in Settings.deployment_settings during the CCO and should only be deployed and used in that framework.

Once the CCO has been executed, you can deploy the resulting .pompom file to the target framework it was compressed for by one of:

Python API - clika_deploy
CLI Tool - clika-deploy

Both methods provide a straightforward way to deploy the compressed model's .pompom file, obtained from the CCO, to a .onnx/.tflite file.

For more information, see DeploymentSettings.

CLIKA deployment examples

CLI Tool

# for a model with one input and a dynamic first dimension (batch)
clika-deploy <path_to_onnx> --shape "?x3x224x224" # use "?" to represent a dynamic dimension

# for a model that takes two inputs with the same shape [1,512] 
clika-deploy <path_to_onnx> --shape "1x512 1x512" # use whitespace (" ") to separate multiple inputs

For more information, run clika-deploy -h.

tip

To inspect the model's input shapes and their order, run clika-deploy without the --shape argument. This produces a .onnx file with the shapes used during training. Examine the input shapes and order in the deployed .onnx file using VEGA by CLIKA or any other tool, then rerun the deployment with the required shapes.

Python API

from clika_compression import DeploymentSettings_TensorRT_ONNX, Settings, clika_compress, clika_deploy

output_dir = "outputs"
settings = Settings()  # Default settings
settings.deployment_settings = DeploymentSettings_TensorRT_ONNX() # select deployment framework

clika_state_path = clika_compress(  # type: str
    output_path=output_dir,
    settings=settings,
    # ... other args
)

# Generates the `.onnx` file and saves it to the 'outputs' folder
deployed_file_path = clika_deploy(  # type: str
    clika_state_path=clika_state_path,
    output_dir_path=output_dir,
    input_shapes=[(None, 1, 28, 28)],
)
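
The CLI --shape string and the input_shapes argument of clika_deploy describe the same information in two notations. As a rough illustration, the sketch below converts the CLI notation into a list of tuples. It assumes that "?" on the command line corresponds to None in the Python API, which the examples on this page suggest but do not state. The helper parse_shape_string is hypothetical and is not part of clika-compression.

```python
from typing import List, Optional, Tuple

Shape = Tuple[Optional[int], ...]


def parse_shape_string(shape_arg: str) -> List[Shape]:
    """Convert a CLI-style shape string into a list of shape tuples.

    Whitespace separates inputs, "x" separates dimensions, and "?" marks
    a dynamic dimension (assumed here to map to None).
    """
    shapes: List[Shape] = []
    for one_input in shape_arg.split():
        dims = tuple(None if dim == "?" else int(dim) for dim in one_input.split("x"))
        shapes.append(dims)
    return shapes


print(parse_shape_string("?x3x224x224"))  # [(None, 3, 224, 224)]
print(parse_shape_string("1x512 1x512"))  # [(1, 512), (1, 512)]
```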

TensorRT deployment

After running CCO and clika_deploy, a <model_name>.onnx file is generated. Use this file with the trtexec command, as shown below, to create a .engine file that can be deployed to TensorRT.

To deploy your model to a .engine file, install TensorRT on your local machine OR use a docker container by NVIDIA.

Here, we use a Docker container by NVIDIA (the Docker image is pulled automatically). For example, the command below sets the following parameters:

TensorRT container version - 23.07
Path to CLIKA deployed .onnx file - outputs/MyModel.onnx
Path to save TensorRT compiled model - MyModel.engine
Input shape - 1x3x640x640

docker run --gpus all --rm -v .:/workspace/ nvcr.io/nvidia/tensorrt:23.07-py3 -c trtexec --onnx=outputs/MyModel.onnx --saveEngine=MyModel.engine  --shapes=input_0:1x3x640x640  --int8 --workspace=1024
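
To make the mapping between the parameters listed above and the trtexec flags explicit, here is a small sketch that assembles only the trtexec part of the command from Python. The flags are the ones used in the command above. The helper build_trtexec_command is hypothetical, and the input name input_0 is taken from the example and may differ for your model.

```python
import shlex
from typing import List


def build_trtexec_command(onnx_path: str, engine_path: str, input_name: str, shape: str) -> List[str]:
    """Assemble the trtexec argument list used in the example above."""
    return [
        "trtexec",
        f"--onnx={onnx_path}",                # CLIKA deployed .onnx file
        f"--saveEngine={engine_path}",        # where to save the compiled engine
        f"--shapes={input_name}:{shape}",     # input name and shape
        "--int8",
        "--workspace=1024",
    ]


command = build_trtexec_command("outputs/MyModel.onnx", "MyModel.engine", "input_0", "1x3x640x640")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run on a machine where TensorRT is installed, or joined into a string and used inside the container.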

For more information, see CLIKA examples and TensorRT documentation.

Multi-GPU distributed compression

To use multi-GPU distributed compression with CCO, create a DistributedTrainingSettings object and assign it to the Settings.distributed_training_settings attribute before passing the Settings object to clika_compress or clika_resume.

In DistributedTrainingSettings:

Set the multi_gpu argument to True to use distributed compression.
If use_sharding is True, the model is sharded across the available GPUs, similar to PyTorch FSDP and DeepSpeed ZeRO-3. This reduces GPU memory usage but may increase latency.
If use_sharding is False, a technique similar to PyTorch's DistributedDataParallel (DDP) paradigm is used: the model is copied to each GPU and the dataset is split between them. This reduces latency but does not reduce GPU memory usage.
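
As a rough illustration of the memory trade-off described above, the sketch below estimates the per-GPU memory taken by the model parameters alone. It is back-of-the-envelope arithmetic, not a description of how CCO allocates memory: it assumes that sharding splits the parameters evenly across GPUs, and it ignores gradients, optimizer state, activations and communication buffers. The function per_gpu_param_bytes is hypothetical.

```python
def per_gpu_param_bytes(num_params: int, bytes_per_param: int, num_gpus: int, use_sharding: bool) -> int:
    """Estimate the bytes of parameter storage held by each GPU."""
    total = num_params * bytes_per_param
    if use_sharding:
        # Each GPU holds roughly 1/num_gpus of the parameters (rounded up).
        return -(-total // num_gpus)
    # Replicated (DDP-like): every GPU holds a full copy.
    return total


one_billion = 1_000_000_000
print(per_gpu_param_bytes(one_billion, 4, num_gpus=4, use_sharding=False))  # 4000000000
print(per_gpu_param_bytes(one_billion, 4, num_gpus=4, use_sharding=True))   # 1000000000
```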

Usage example with multi-GPU and sharding training strategy:

from clika_compression import DistributedTrainingSettings, Settings, clika_compress

settings = Settings()  # Default settings
settings.distributed_training_settings = DistributedTrainingSettings(multi_gpu=True, use_sharding=True)

...

clika_compress(
    output_path="outputs",
    settings=settings,
    # ... other args
)

For more information, see multi-GPU input restrictions.

Configuration file

You may use a YAML file to save and load CCO configuration.

For more information, see Configuration file.

Monitoring integration

If you wish to monitor your training using TensorBoard, you may pass a TensorBoardCallback instance to the clika_compress or clika_resume functions as shown here:

from clika_compression.utils.callbacks import TensorBoardCallback
from clika_compression import clika_compress

clika_compress(
    # ... other args
    callbacks=[TensorBoardCallback(output_path='tensorboard_runs')]
)

CCO outputs

The output directory tree structure will be as follows:

<output-dir>/ 
├── epoch_1/
│ ├── model.pompom
│ └── summary.json
├── epoch_2/
│ ├── model.pompom
│ └── summary.json
...
├── logs/ 
│ └── clika_optimize_<timestamp>.log
...  
└── <model-name>.onnx/tflite # only generated when deploying the model

During CCO, a new folder named epoch_<current-epoch-index>/ is created for each epoch (unless TrainingSettings.save_interval specifies otherwise).

Inside the folder, the following two files will be generated:

model.pompom
summary.json

`model.pompom`

The outputs of CCO are stored in .pompom files. They contain the compressed model and everything that is required to resume CCO or deploy the compressed model. These .pompom files have two main uses:

as input to the clika_resume method to resume CCO.
as input to the clika_deploy method to create a deployable model for a specific framework.
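
Given the directory layout described above, a script that wants to resume or deploy from the most recent epoch needs to locate the newest model.pompom file. The following is a minimal sketch using only the standard library. It assumes the epoch_<index>/model.pompom layout shown in the tree above. The helpers list_epoch_checkpoints and latest_checkpoint are hypothetical and are not part of clika-compression.

```python
import re
from pathlib import Path
from typing import List, Optional

_EPOCH_DIR = re.compile(r"^epoch_(\d+)$")


def list_epoch_checkpoints(output_dir: str) -> List[Path]:
    """Return existing epoch_<n>/model.pompom paths, ordered by epoch number."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    found = []
    for child in root.iterdir():
        match = _EPOCH_DIR.match(child.name)
        pompom = child / "model.pompom"
        if match and pompom.is_file():
            found.append((int(match.group(1)), pompom))
    # Sort numerically so that epoch_10 comes after epoch_2.
    return [path for _, path in sorted(found, key=lambda item: item[0])]


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the model.pompom of the highest-numbered epoch, or None."""
    checkpoints = list_epoch_checkpoints(output_dir)
    return checkpoints[-1] if checkpoints else None


print(latest_checkpoint("outputs"))
```

The returned path could then be passed as clika_state_path to clika_resume or clika_deploy.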

`summary.json`

We will break down the summary.json structure by using an example. The script:

import torch
from clika_compression import Settings, clika_compress, get_path_to_best_clika_state_result
from torchmetrics.classification.accuracy import MulticlassAccuracy

...
optimizer = torch.optim.AdamW(params=model.parameters(), lr=0.0001)
settings = Settings()  # Default settings

# Start the "CCO" to compress model and generate `.pompom` files
clika_compress(  # type: str 
    output_path='outputs',
    settings=settings,  
    model=model,  
    init_training_dataset_fn=get_train_loader,  # type: Callable
    init_evaluation_dataset_fn=get_evaluation_loader,  # type: Callable
    optimizer=optimizer,
    training_losses={"ce_loss": torch.nn.CrossEntropyLoss()},
    training_metrics={'multiclass_accuracy': MulticlassAccuracy(num_classes=10)},
    evaluation_losses={
        "ce_loss(sum)": torch.nn.CrossEntropyLoss(reduction="sum"),
        "ce_loss(mean)": torch.nn.CrossEntropyLoss(reduction="mean")
    },
    evaluation_metrics={'multiclass_accuracy': MulticlassAccuracy(num_classes=10)},
)
best_clika_state_file = get_path_to_best_clika_state_result('outputs')

will generate the following summary.json file:

{
  "epoch": 1,
  "training": {
    // `training_loss` names and corresponding values
    "ce_loss": 9.7,

    // total loss - a summation of all losses 
    // since we have just provided one loss function it is identical to "ce_loss"
    "loss": 9.7, 
    
    // `training_metrics` supplied by users
    "multiclass_accuracy": 0.0, 
  },
  "evaluation": { 
    // `evaluation_losses` names and corresponding values
    "ce_loss(sum)": 87.0,
    "ce_loss(mean)": 8.7,

    // `evaluation_metrics` supplied by users
    "eval_multiclass_accuracy": 0.9,
  },
  "time_elapsed": "0:00:00" 
}
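
To compare epochs, the summary.json files can be read back with the standard library. The sketch below is a hedged illustration: it assumes the files on disk are plain JSON (the comments in the sample above are explanatory annotations) and that they use the "epoch", "training" and "evaluation" keys shown there. The helpers load_summaries and best_epoch are hypothetical. To locate the best .pompom file itself, use get_path_to_best_clika_state_result as in the script above.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def load_summaries(output_dir: str) -> Dict[int, dict]:
    """Load every epoch_*/summary.json under output_dir, keyed by epoch."""
    root = Path(output_dir)
    summaries: Dict[int, dict] = {}
    if not root.is_dir():
        return summaries
    for summary_path in root.glob("epoch_*/summary.json"):
        with summary_path.open() as handle:
            summary = json.load(handle)
        summaries[int(summary["epoch"])] = summary
    return summaries


def best_epoch(summaries: Dict[int, dict], section: str, key: str,
               higher_is_better: bool = True) -> Optional[int]:
    """Return the epoch with the best value of summaries[epoch][section][key]."""
    scored = {
        epoch: summary[section][key]
        for epoch, summary in summaries.items()
        if key in summary.get(section, {})
    }
    if not scored:
        return None
    pick = max if higher_is_better else min
    return pick(scored, key=scored.get)


summaries = load_summaries("outputs")
print(best_epoch(summaries, "evaluation", "eval_multiclass_accuracy"))
```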

Logs

During every run of CCO (start, resume or deploy), logs printed to the terminal are saved to a file in the outputs folder. The logs contain information about the CCO run and can be helpful for monitoring the process and for debugging purposes.

See Output log breakdown for more details.

Next steps

That's it! Both the CCO and the deployment are flexible and highly configurable to best fit your needs.

For further details, please refer to the Python API documentation.
For more examples and use cases, see our Examples GitHub Repository.
