ONNX can be compared to a programming language specialized in mathematical functions. It defines all the necessary operations a machine learning model needs to implement its inference function with this language. A linear regression could be represented in the following way:
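Sketched as python-like pseudocode (onnx.MatMul and onnx.Add stand for the corresponding ONNX operators; coefficients and bias are assumed to be known constants):

```python
def onnx_linear_regressor(X):
    # 'coefficients' and 'bias' are placeholders for the learned constants
    return onnx.Add(onnx.MatMul(X, coefficients), bias)
```

This example is very similar to an expression a developer could write in Python.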
ONNX aims at providing a common language any machine learning framework can use to describe its models.
Building an ONNX graph means implementing a function with the ONNX language or more precisely the ONNX Operators. A linear regression would be written this way. The following lines do not follow python syntax. It is just a kind of pseudocode to illustrate the model.
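For instance (the types and the dimension names M, K, N are illustrative):

```
Input: float[M,K] x, float[K,N] a, float[N] c
Output: float[M,N] y

r = onnx.MatMul(x, a)
y = onnx.Add(r, c)
```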
x, a, c are the inputs, y is the output. r is an intermediate result. MatMul and Add are the nodes. They also have inputs and outputs. A node also has a type, one of the operators in ONNX Operators.
The graph could also have an initializer. When an input never changes such as the coefficients of the linear regression, it is most efficient to turn it into a constant stored in the graph.
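A minimal sketch, assuming the coefficients live in a numpy array and the name 'A' matches an input of the graph (from_array comes from onnx.numpy_helper):

```python
import numpy as np
from onnx.numpy_helper import from_array

# hypothetical fixed coefficients turned into an initializer named 'A'
a = np.array([[0.5], [-0.6]], dtype=np.float32)
init_a = from_array(a, name="A")
# init_a can then be passed to make_graph(..., initializer=[init_a])
```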
Serialization with protobuf
The deployment of a machine-learned model into production usually requires replicating the entire ecosystem used to train the model, most of the time with a docker image. Once a model is converted into ONNX, the production environment only needs a runtime to execute the graph defined with ONNX operators. This runtime can be developed in any language or platform suitable for the production application: C, Java, Python, JavaScript, C#, WebAssembly, ARM, etc.
But to make that happen, the ONNX graph needs to be saved. ONNX uses protobuf to serialize the graph into one single block. It aims at optimizing the model size as much as possible.
Supported Types
ONNX specifications are optimized for numerical computation with tensors. A tensor is a multidimensional array. It is defined by:
- a type: the element type, the same for all elements in the tensor
- a shape: an array with all dimensions; this array can be empty, and a dimension can be null (unknown)
- a contiguous array: it represents all the values
ONNX is strongly typed, and its definition does not support implicit casts. It is impossible to add two tensors or matrices with different types, even if other languages allow it. That is why an explicit cast must be inserted in the graph.
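A minimal sketch, inserting a Cast node so a float tensor can be added to a double tensor (the tensor names X, Y, Z are placeholders):

```python
from onnx import TensorProto
from onnx.helper import make_node

# cast the float tensor 'X' to double before adding it to the double tensor 'Y'
cast = make_node("Cast", ["X"], ["X64"], to=TensorProto.DOUBLE)
add = make_node("Add", ["X64", "Y"], ["Z"])
```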
What is an opset version?
The opset is mapped to the version of the ONNX package. It is incremented every time the minor version increases. Every version brings updated or new operators.
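The opset shipped with an installed package can be checked this way:

```python
import onnx

print(onnx.__version__, "opset =", onnx.defs.onnx_opset_version())
```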
Tools
netron is very useful for visualizing ONNX graphs; it is the only tool mentioned here that requires no programming.
A simple example: a linear regression
Linear regression is the simplest model in machine learning, described by the following expression:

Y = XA + B

We can see it as a function of three variables, Y = f(X, A, B), decomposed into y = Add(MatMul(X, A), B). That is what we need to represent with ONNX operators.
The first thing is to implement a function with ONNX operators.
ONNX is strongly typed. Shape and type must be defined for both input and output of the function.
That said, we need four of the make functions to build the graph:

- make_tensor_value_info: declares a variable (input or output) given its shape and type
- make_node: creates a node defined by an operation (an operator type), its inputs and outputs
- make_graph: creates an ONNX graph from the objects returned by the two previous functions
- make_model: merges the graph and additional metadata
All along the creation, we need to give a name to every input and output of every node of the graph. Inputs and outputs of the graph are defined by ONNX objects; strings are used to refer to intermediate results. This is what it looks like:
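A minimal sketch with these four helpers (the names 'X', 'A', 'B', 'XA', 'Y' and the graph name 'lr' are arbitrary choices):

```python
from onnx import TensorProto
from onnx.helper import (
    make_graph, make_model, make_node, make_tensor_value_info)
from onnx.checker import check_model

# inputs and output: 2D float tensors with unspecified dimensions
X = make_tensor_value_info("X", TensorProto.FLOAT, [None, None])
A = make_tensor_value_info("A", TensorProto.FLOAT, [None, None])
B = make_tensor_value_info("B", TensorProto.FLOAT, [None, None])
Y = make_tensor_value_info("Y", TensorProto.FLOAT, [None, None])

# nodes: 'XA' refers to the intermediate result of MatMul
node1 = make_node("MatMul", ["X", "A"], ["XA"])
node2 = make_node("Add", ["XA", "B"], ["Y"])

# graph: nodes, name, inputs, outputs
graph = make_graph([node1, node2], "lr", [X, A, B], [Y])

# model: the graph plus metadata such as the opset
onnx_model = make_model(graph)

# verify the model is consistent
check_model(onnx_model)
```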
An empty shape (None) means any shape; a shape defined as [None, None] indicates a tensor with two dimensions of unspecified size.
Serialization
The model needs to be saved to be deployed. ONNX is based on protobuf, which minimizes the space needed to save the graph on disk. Every object in ONNX can be serialized with the method SerializeToString.
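A minimal sketch, assuming onnx_model is the model built above and 'linear_regression.onnx' is an arbitrary file name:

```python
from onnx import load

# save the model to disk
with open("linear_regression.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# load it back to check the round trip
with open("linear_regression.onnx", "rb") as f:
    restored = load(f)
```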
ONNX Runtime Execution Providers
ONNX Runtime works with different hardware acceleration libraries through its extensible Execution Providers (EP) framework to optimally execute the ONNX models on the hardware platform.
ONNX Runtime works with the execution provider(s) using the GetCapability() interface to allocate specific nodes or sub-graphs for execution by the EP library on supported hardware. The EP libraries that are pre-installed in the execution environment process and execute the ONNX sub-graph on the hardware.
This architecture abstracts out the details of the hardware-specific libraries that are essential to optimize the execution of deep neural networks across hardware platforms like CPU, GPU, FPGA, or specialized NPUs.
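A minimal sketch with the onnxruntime package, reusing the model file from the previous section; the provider list falls back to the CPU EP when the CUDA EP is not available:

```python
import onnxruntime as ort

sess = ort.InferenceSession(
    "linear_regression.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# shows which execution providers were actually registered
print(sess.get_providers())
```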