Efficient Path Signature Features

In what follows I will try to give an accessible introduction to what path signatures are, and the work I did for my Master’s dissertation last year. I am very grateful to Emilio Ferrucci for supervising me. The full dissertation can be found here, but this post includes my thoughts on the main takeaways and limitations.

What are path signatures?

Recall that paths in $d$-dimensional Euclidean space can be defined as continuous functions $[0, T] \to \mathbb{R}^d$. Consider a path $X: [0, T] \to \mathbb{R}^d$ with the additional property that it has finite length (this is also known as a path of bounded variation). We will use the notation $X_t$ to mean $X(t)$. We can break down $X$ into $d$ components which are each one-dimensional paths $X^1, \ldots, X^d$, with $X_t = (X^1_t, \ldots, X^d_t)$.

Each $X^i$ is a function of bounded variation, so an integral of the form

$$\int_a^b f(s) \, dX^i_s$$

can be given meaning by the Lebesgue-Stieltjes integral. Using this, we can now take repeated integrals of various components against each other, like this:

$$\int\limits_{0 < t_1 < \dots < t_k < T} dX_{t_1}^{i_1} \cdots dX_{t_k}^{i_k}$$

(note that $\int_s^t dX_u$ is simply the increment of the path $X$ over the time interval $[s, t]$). Here $k$ is the choice of how many repeated integrals to do, and the $i_1, \ldots, i_k$ are choices of dimensions from $1$ up to $d$. The same dimension can be chosen more than once.

The signature of a bounded variation path is essentially the collection of all such iterated integrals, at every level $k$. Note that the $t_1, \ldots, t_k$ are dummy variables, so the signature doesn't depend on them. Such iterated integrals were studied by the mathematician Kuo-Tsai Chen in the mid-20th century [1]. The signature is an infinite object, but one can take truncations, i.e. only take iterated integrals up to some level $N$.
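
To make this concrete, here is a small numpy sketch (purely illustrative, and not taken from the dissertation) that computes the level-1 and level-2 iterated integrals of a piecewise-linear path exactly. For straight-line segments each inner integral has a simple closed form, so a loop over segments suffices; the function name and example path are made up for the demonstration.

```python
import numpy as np

def sig_levels_1_and_2(path):
    """Exact level-1 and level-2 iterated integrals of the piecewise-linear
    path through the rows of `path` (an (n_points, d) array)."""
    increments = np.diff(path, axis=0)   # segment increments, shape (n_steps, d)
    d = path.shape[1]
    level1 = np.zeros(d)                 # S^(i): total increment so far
    level2 = np.zeros((d, d))            # S^(i,j): double iterated integrals
    for delta in increments:
        # Pairs of distinct segments contribute (earlier increment) x (this increment);
        # each straight segment also contributes delta (outer) delta / 2 on its own.
        level2 += np.outer(level1, delta) + np.outer(delta, delta) / 2.0
        level1 += delta
    return level1, level2

# A three-point path in R^2.
path = np.array([[0.0, 0.0], [1.0, 0.5], [1.5, 2.0]])
S1, S2 = sig_levels_1_and_2(path)
print(S1)   # equals path[-1] - path[0]
print(S2)   # S2[i, j] is the integral with i_1 = i, i_2 = j
```

The level-1 terms are just the increments of the path, and the antisymmetric part of the level-2 matrix is the signed (Lévy) area swept out by the path.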

Mathematically, the signature lives in the tensor algebra of the original vector space, $\mathbb{R}^d$. It is not essential to understand what this means at first, and the tensor product notation can make things seem more intimidating than they really are, for example the following expression:

$$\int\limits_{0 < t_1 < t_2 < \dots < t_k < T} dX_{t_1} \otimes dX_{t_2} \otimes \ldots \otimes dX_{t_k}$$

which simply denotes all the $k$-th order iterated integrals of $X$, but arranged to form a tensor in $(\mathbb{R}^d)^{\otimes k}$. To be precise,

$$\int\limits_{0 < t_1 < \dots < t_k < T} dX_{t_1} \otimes \ldots \otimes dX_{t_k} = \sum_{i_1, \dots, i_k = 1}^d \left(\;\int\limits_{0 < t_1 < \dots < t_k < T} dX_{t_1}^{i_1} \cdots dX_{t_k}^{i_k} \right) e_{i_1} \otimes \cdots \otimes e_{i_k},$$

where $e_1, \ldots, e_d$ is a basis of $\mathbb{R}^d$.

Signatures have several properties which suggest that they are a natural mathematical object. It would be nice if distinct paths had distinct signatures, but this isn't strictly true: for example, reparametrising a path, or adding a section which exactly doubles back on itself, will not change the signature. However, the notion of "tree-like equivalence" perfectly distinguishes when two paths have the same signature [2], so one could say that the signature does determine the path, up to tree-like equivalence.

It turns out that signatures contain some redundancies: there are algebraic relations between the values of the iterated integrals which must always hold. But it is possible to define a logarithm map which removes these redundancies. If we start with truncated signatures, then taking logarithms results in truncated log signatures, which take their values in a vector space of lower dimension than the space where truncated signatures take values.
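
To get a feel for the dimension reduction, here is a short, self-contained sketch (my own illustration) comparing the number of truncated signature terms, $d + d^2 + \cdots + d^N$, with the dimension of the truncated log signature, which equals the dimension of the free Lie algebra over $\mathbb{R}^d$ up to degree $N$ and can be computed with Witt's formula.

```python
def sig_dim(d, N):
    """Number of signature terms at levels 1..N."""
    return sum(d**k for k in range(1, N + 1))

def moebius(n):
    """Moebius function by trial division (fine for the small n needed here)."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0            # n has a squared prime factor
            result = -result
        p += 1
    return -result if n > 1 else result

def logsig_dim(d, N):
    """Dimension of the truncated log signature (Witt's formula, levels 1..N)."""
    return sum(
        sum(moebius(j) * d ** (k // j) for j in range(1, k + 1) if k % j == 0) // k
        for k in range(1, N + 1)
    )

for d, N in [(2, 4), (3, 4), (6, 3)]:
    print(f"d={d}, N={N}: signature {sig_dim(d, N)}, log signature {logsig_dim(d, N)}")
```

For example, for $d = 2$ and $N = 4$ the truncated signature has 30 terms while the truncated log signature has only 8.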

Path signatures in machine learning

Path signatures can be used in machine learning as a feature transform, when the input data takes the form of a path. A strong theoretical justification for doing so is that the signature is a universal nonlinearity. This means that any continuous real-valued function on paths can be approximated arbitrarily well by a linear function composed with the signature! This justifies applying simple linear models to path signature features [3].
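
As a toy illustration of this recipe, the following sketch (assuming the iisignature package mentioned later in this post, plus numpy and scikit-learn; none of it comes from the dissertation) fits a plain linear model on level-3 signature features. The target is the signed (Lévy) area of each path, which is exactly a linear function of the level-2 signature terms, so the fit should be essentially perfect.

```python
import numpy as np
import iisignature
from sklearn.linear_model import Ridge

# Toy dataset: 200 piecewise-linear paths of 50 steps in R^2.
rng = np.random.default_rng(0)
paths = rng.standard_normal((200, 50, 2)).cumsum(axis=1)

def levy_area(path):
    """Signed area enclosed between the path and the chord joining its endpoints."""
    dx = np.diff(path, axis=0)
    x = path[:-1] - path[0]
    return 0.5 * np.sum(x[:, 0] * dx[:, 1] - x[:, 1] * dx[:, 0])

y = np.array([levy_area(p) for p in paths])

# Level-3 truncated signature of each path, then an ordinary linear model.
X = np.stack([iisignature.sig(p, 3) for p in paths])
model = Ridge(alpha=1e-6).fit(X[:150], y[:150])
print("held-out R^2:", model.score(X[150:], y[150:]))
```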

Signatures have also been used in combination with modern machine learning techniques such as deep learning and gradient boosting, and in a range of applications [4][5][6][7][8][9]. There are also kernel tricks which allow one to avoid explicitly computing truncated signatures [10], and even to implicitly use the untruncated signature [11].

It is worth emphasising that any streamed data consisting of numerical values at a number of time-steps can be considered as a continuous path — e.g. by linear interpolation, or through various other methods. More details can be found in this primer by Chevyrev and Kormilitzin, and in the more recent survey by Lyons and McLeod (both of these are great introductions to path signatures in machine learning).

Truncation level vs. number of subpaths

A common technique when creating path signature features is to extract multiple subpaths, compute truncated signatures, or log signatures, over each of them, and stack the results to form a feature vector. This is also done implicitly by several works that use deep learning architectures built on truncated log signatures over subintervals, like Logsig-RNN, Neural Rough Differential Equations, and Log Neural Controlled Differential Equations.

But why take subpaths in the first place? One reason is that, since in practice signatures must be truncated, and therefore contain only a finite amount of information about the path, taking signatures over $m$ subpaths allows more information to be captured while only increasing the feature size linearly, by a factor of $m$. In contrast, the feature size grows exponentially if we attempt to capture more information by increasing the truncation level. Depending on the dataset, there may be additional reasons to favour features which are localised in time.

The simplest approach to constructing subpaths is to pick a partition $\mathcal{D} = (t_0, \ldots, t_m)$ of the time domain, and take the collection of level-$N$ truncated signatures over each of the $m$ subintervals $[t_i, t_{i+1}]$. Another approach, illustrated below, is to split the time domain in a hierarchical dyadic manner, and compute truncated signatures over every interval (not just those on the lowest level) [12].

[Figure: ordinary vs. hierarchical dyadic partition]
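
In code, the first (plain partition) approach to subpath features might look like the sketch below: split the samples into consecutive chunks that share endpoints, compute a truncated signature over each chunk, and concatenate. This again assumes iisignature and is only illustrative; here the partition is equal in time, i.e. equal numbers of samples per chunk.

```python
import numpy as np
import iisignature

def stacked_subpath_signatures(path, m, level):
    """Level-`level` signatures over m consecutive subpaths, concatenated."""
    # Partition indices t_0 < ... < t_m; adjacent chunks share an endpoint
    # so that no increment of the path is dropped.
    idx = np.linspace(0, len(path) - 1, m + 1).astype(int)
    pieces = [path[idx[i]: idx[i + 1] + 1] for i in range(m)]
    return np.concatenate([iisignature.sig(p, level) for p in pieces])

path = np.random.default_rng(1).standard_normal((128, 3)).cumsum(axis=0)
features = stacked_subpath_signatures(path, m=4, level=2)
print(features.shape)   # 4 subpaths x (3 + 3**2) terms each = (48,)
```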

Let’s consider this setup of using (log) signatures over subpaths as a feature transform, and suppose we want to find a feature set that has minimal dimension while also being sufficiently descriptive. Then, a trade-off between truncation level and path segmentation arises. Increasing either the truncation level of the signatures, or the number of subpaths will make the features more descriptive, but also larger. Having low-dimensional features may be important in “small data” problems, to avoid overfitting, or perhaps we are interested in efficient ways to compress paths just out of mathematical curiosity.

My dissertation explored this trade-off, as well as how best to choose the subpaths. Previous practical works typically take subpaths by splitting the time domain into equal chunks, but in my opinion the theory suggests that splitting into equal length subpaths is more natural. In experiments I observed that this is sometimes beneficial.

In practice, there is no principled way to make the choice of truncation level and the family of subpaths. Focussing on the truncation level, the universal nonlinearity result says that by setting it sufficiently high we can approximate any continuous function on paths arbitrarily well by a linear function composed with the truncated signature. However, this tells us nothing about how high $N$ needs to be in practice to get decent results, just as universal approximation results for neural networks typically don't tell you how wide or deep your network needs to be for any specific task.

My dissertation considers two ways of measuring how descriptive the features are: the first via error bounds for approximating a certain class of functions, and the second by training some simple models and measuring the generalisation error.

Error bounds for solutions to controlled differential equations

On the more theoretical side, I considered a class of functions which map paths to the time-$T$ solution of a linear controlled differential equation driven by the path. For a fixed linear CDE, with the vector field and the driving path specified, there is a natural way to approximate the time-$T$ solution if we are given the signatures up to level $N$ over a number of subintervals, using an Euler-like numerical scheme.
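
The following is a rough numpy sketch of this kind of scheme at truncation level $N = 2$ (a simplified illustration, not the dissertation's implementation, and written directly in terms of signatures rather than log signatures). The CDE is $dY_t = \sum_i A_i Y_t \, dX^i_t$ with matrices $A_i$, and over each subinterval the solution map is approximated by $I + \sum_i S^{(i)} A_i + \sum_{i,j} S^{(i,j)} A_j A_i$, where the $S$'s are the signature terms over that subinterval.

```python
import numpy as np

def sig_levels_1_and_2(path):
    """Exact level-1 and level-2 signature terms of a piecewise-linear path."""
    d = path.shape[1]
    s1, s2 = np.zeros(d), np.zeros((d, d))
    for delta in np.diff(path, axis=0):
        s2 += np.outer(s1, delta) + np.outer(delta, delta) / 2.0
        s1 += delta
    return s1, s2

def euler_step(y, A, s1, s2):
    """Level-2 Euler update for dY = sum_i A[i] Y dX^i over one subinterval:
    Y <- (I + sum_i s1[i] A[i] + sum_{i,j} s2[i, j] A[j] @ A[i]) Y."""
    update = np.eye(len(y))
    for i in range(len(A)):
        update += s1[i] * A[i]
        for j in range(len(A)):
            update += s2[i, j] * A[j] @ A[i]
    return update @ y

# Toy example: a 2-d driving path, a 3-d solution, m = 8 subintervals.
rng = np.random.default_rng(0)
path = 0.1 * rng.standard_normal((64, 2)).cumsum(axis=0)
A = [0.5 * rng.standard_normal((3, 3)) for _ in range(2)]
y = np.array([1.0, 0.0, 0.0])
idx = np.linspace(0, len(path) - 1, 9).astype(int)
for a, b in zip(idx[:-1], idx[1:]):
    y = euler_step(y, A, *sig_levels_1_and_2(path[a:b + 1]))
print(y)
```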

An error bound for this approximation scheme can be computed. For a given feature set, this tells us an upper bound on how well an arbitrary function in the set of functions defined by linear CDEs (with bounds on the norm of the vector field, and also on the length of the input paths) can be approximated by some function of the feature set.

The following figures plot these upper bounds against the size of the feature set (assuming that log signatures are used). $d$ is the number of dimensions of the space in which the paths lie. $rL$ is the product of $r$, the bound on the norm of the vector fields, and $L$, the bound on the length of the paths. For each truncation level $N$, the number of subintervals $m$ increases in steps of $1$ up to a maximum determined by computational constraints.

You can read off from these plots the truncation level of the most efficient feature set that achieves a given value for the error bound. When $d = 2$, higher values of $N$ tend to be optimal. But when $d$ is increased to $3$ or $4$, higher truncation levels are more costly in terms of feature size, and feature sets which use lower-level signatures over a greater number of subpaths seem to be more efficient than ones using higher-order signatures. Increasing $rL$ favours feature sets with lower $N$ and higher $m$.

Empirical tests of generalisation performance

The error bound analysis doesn’t account for the fact that the function needs to be learned from a finite dataset. A theoretical analysis of this seems difficult, but we can run empirical experiments.

My empirical tests were to simply train machine learning models and compare choices of $N$, the truncation level, and $m$, the number of subpaths, to see what is the most efficient way to achieve a given test accuracy. I ran these on both synthetic data and real accelerometer data provided by the Child Mind Institute for a Kaggle competition.

I found setting up these experiments well to be tricky, and did not get amazing results. Initially, we wanted to use log signature features as inputs for polynomial regression, since log signatures are more concise than signatures. The motivation for polynomial regression is that, using a polynomial, one can go from the log signatures back to the ordinary signatures, and then apply a linear function; combined with the universal nonlinearity theorem, this easily gives a universality result for polynomial functions of log signature features. However, actually computing polynomial features beyond low degrees results in a huge number of features, so I decided to use kernel ridge regression with a polynomial kernel. This then introduces new issues, since there are hyperparameters to tune and the computational complexity limits the size of the dataset that can be used.
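
For concreteness, the pipeline was roughly along the following lines (a simplified sketch with a synthetic target, assuming iisignature and scikit-learn; the actual experiments used different data, subpath features, and tuned hyperparameters).

```python
import numpy as np
import iisignature
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
paths = rng.standard_normal((300, 40, 2)).cumsum(axis=1)
# A toy target: the product of the two coordinate increments, which is a
# degree-2 polynomial in the level-1 (log) signature terms.
y = (paths[:, -1, 0] - paths[:, 0, 0]) * (paths[:, -1, 1] - paths[:, 0, 1])

# Level-3 log signatures as features; `prepare` precomputes the basis once.
prep = iisignature.prepare(2, 3)
X = np.stack([iisignature.logsig(p, prep) for p in paths])

# Kernel ridge regression with a polynomial kernel on the log signature features.
model = KernelRidge(kernel="poly", degree=2, alpha=1e-2).fit(X[:200], y[:200])
print("held-out R^2:", model.score(X[200:], y[200:]))
```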

In addition to this, splitting the path into more subpaths and taking signatures, while in theory providing more information, can in practice make it harder to learn the desired function. Since the feature sets become large and I was using only small datasets, there are also issues with overfitting.

For more details, see the full dissertation, but overall I would note that I think there is lots of room for improvement in this approach to empirical tests!

The experiments on real data show much better performance when the paths are split by equal length (1-variation) instead of equal time, with the large caveat that I only tried out one (small) dataset.

I think the main takeaway for a practitioner is that splitting by equal length can result in better performance than splitting by equal time. This change is not difficult to implement and it is theoretically motivated.
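
A sketch of how one might construct such a split in practice (my own illustrative implementation, not code from the dissertation): compute the cumulative 1-variation of the piecewise-linear path and cut it at equally spaced arc-length targets, then use the resulting indices as the partition points for the subpath signatures.

```python
import numpy as np

def equal_length_partition(path, m):
    """Indices t_0 < ... < t_m splitting `path` into m subpaths of roughly
    equal 1-variation (length), rather than equal time."""
    seg_lengths = np.linalg.norm(np.diff(path, axis=0), axis=1)
    cum_length = np.concatenate([[0.0], np.cumsum(seg_lengths)])
    targets = np.linspace(0.0, cum_length[-1], m + 1)
    idx = np.searchsorted(cum_length, targets)   # nearest sample at or after each target
    idx[0], idx[-1] = 0, len(path) - 1
    # Drop duplicate cut points, which can occur if the path has flat stretches.
    return np.unique(idx)

path = np.random.default_rng(2).standard_normal((200, 2)).cumsum(axis=0)
print(equal_length_partition(path, m=5))
```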

Using path signature features in practice

On what sorts of real-world data would path signature features be a particularly useful feature transform? I don’t have an answer for this, but here are some rough intuitions (based on my limited experience):

  • If the dimensionality of the data is high to begin with, then signature features are probably a poor fit, because beyond a very low truncation level they will be very high-dimensional.
  • If the problem feels like predicting the state of a system which is in some sense “driven” by the path, signature features might work well, because such a system might be modelled as a controlled differential equation.
  • If the paths are measured with a lot of noise then signatures might be a bad choice; I don’t know how errors in measurements of the path propagate to errors in signature terms.

Several software libraries are available which compute (log) signatures, including RoughPy (still under development as of September 2024), signax (a JAX library), Signatory (a differentiable implementation), iisignature, and esig.

Other practical considerations include the following:

  • It is important to apply an appropriate form of normalisation. Asymptotically, the magnitude of signature terms decays as the level increases. Morrill et al. compared two dataset-independent approaches to rescaling signature terms. However, the asymptotic results don’t really say much about the magnitudes of signature terms at lower levels, so I think it is a good idea to apply a standard approach to normalisation of the path signature features (using the dataset statistics).
  • There are a number of augmentations that can be applied to paths as preprocessing steps before computing signatures (see Section 2.1 of Chevyrev and Kormilitzin and Section 2.5 of Lyons and McLeod). Several approaches were compared by Morrill et al. A small preprocessing sketch combining both of these points follows after this list.
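
As a small illustration of both points (a hedged sketch assuming iisignature and scikit-learn, not a recommendation of specific settings), the snippet below applies one common augmentation, adding time as an extra channel, and then standardises each signature coordinate using statistics computed on the training split only.

```python
import numpy as np
import iisignature
from sklearn.preprocessing import StandardScaler

def time_augment(stream):
    """Prepend a normalised time channel to a stream of samples."""
    t = np.linspace(0.0, 1.0, len(stream))[:, None]
    return np.hstack([t, stream])

rng = np.random.default_rng(0)
streams = rng.standard_normal((100, 30, 2)).cumsum(axis=1)

# Level-3 signatures of the time-augmented streams.
X = np.stack([iisignature.sig(time_augment(s), 3) for s in streams])

# Normalise each signature feature with dataset statistics, fitted on the
# training split and then applied to the held-out split.
scaler = StandardScaler().fit(X[:80])
X_train, X_test = scaler.transform(X[:80]), scaler.transform(X[80:])
print(X_train.shape, X_test.shape)
```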

Footnotes

  1. E.g. in Kuo-Tsai Chen: “Iterated integrals and exponential homomorphisms”, Proceedings of the London Mathematical Society 3.1 (1954)

  2. Ben Hambly and Terry Lyons: “Uniqueness for the signature of a path of bounded variation and the reduced path group”, Annals of Mathematics (2010)

  3. As proposed in Daniel Levin, Terry Lyons, and Hao Ni: “Learning from the past, predicting the statistics for the future, learning an evolving system”, arXiv:1309.0260 (2013)

  4. Weixin Yang, Lianwen Jin, and Manfei Liu: “Character-level Chinese writer identification using path signature feature, dropstroke and deep CNN”, arXiv:1505.04922 (2015)

  5. Weixin Yang, Lianwen Jin, Hao Ni, and Terry Lyons: “Rotation-free online handwritten character recognition using dyadic path signature features, hanging normalization, and deep neural network”, 2016 23rd International Conference on Pattern Recognition (ICPR), IEEE (2016)

  6. Shujian Liao, Terry Lyons, Weixin Yang, and Hao Ni: “Learning stochastic differential equations using RNN with log signature features”, arXiv:1908.08286 (2019)

  7. James Morrill, Andrey Kormilitzin, Alejo Nevado-Holgado, Sumanth Swaminathan, Sam Howison, and Terry Lyons: “The signature-based model for early detection of sepsis from electronic health records in the intensive care unit”, 2019 Computing in Cardiology (CinC), IEEE (2019)

  8. Shujian Liao, Terry Lyons, Weixin Yang, Kevin Schlegel, and Hao Ni: “Logsig-RNN: A novel network for robust and efficient skeleton-based action recognition”, arXiv:2110.13008 (2021)

  9. Weixin Yang, Terry Lyons, Hao Ni, Cordelia Schmid, and Lianwen Jin: “Developing the path signature methodology and its application to landmark-based human action recognition”, Stochastic Analysis, Filtering, and Stochastic Optimization: A Commemorative Volume to Honor Mark HA Davis’s Contributions, Springer (2022)

  10. Franz J. Király and Harald Oberhauser: “Kernels for sequentially ordered data”, Journal of Machine Learning Research, Vol. 20 (2019)

  11. Cristopher Salvi, Thomas Cass, James Foster, Terry Lyons, and Weixin Yang: “The signature kernel is the solution of a Goursat PDE”, SIAM Journal on Mathematics of Data Science, Vol. 3, No. 3 (2021)

  12. This was done in Weixin Yang, Lianwen Jin, Hao Ni, and Terry Lyons: “Rotation-free online handwritten character recognition using dyadic path signature features, hanging normalization, and deep neural network”, 2016 23rd International Conference on Pattern Recognition (ICPR), IEEE (2016)