
Efficient AD for the simulation of gradients #504

@FelixBenning


Introduction: How would you simulate gradients?

If you want to simulate the gradient of a random function $Z$, it turns out that you simply need to take derivatives of the covariance function, as (w.l.o.g. $Z$ is centered)

$$\text{Cov}(\nabla Z(x), Z(y)) = \mathbb{E}[\nabla Z(x) Z(y)] = \nabla_x \mathbb{E}[Z(x)Z(y)] = \nabla_x C(x,y)$$

And similarly

$$\text{Cov}(\nabla Z(x), \nabla Z(y)) = \nabla_x\nabla_y C(x,y)$$

For stationary covariance functions $C(x,y) = C(x-y)$ this simplifies to

$$\begin{aligned} \nabla_x C(x,y) &= \nabla C(x-y)\\ \nabla_x\nabla_y C(x,y) &= -\nabla^2 C(x-y) \end{aligned}$$
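The signs come from the chain rule: every derivative in $y$ of $C(x-y)$ contributes a factor $-1$, componentwise

$$\partial_{y_j} C(x-y) = -(\partial_j C)(x-y), \qquad \partial_{x_i}\partial_{y_j} C(x-y) = -(\partial_i\partial_j C)(x-y).$$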

So if we consider the multivariate random function $T_1(x) = (Z(x), \nabla Z(x))$, then its covariance kernel is given by

$$\begin{aligned} C_{T_1}(x,y) &= \begin{pmatrix} C(x,y) & \nabla_y C(x,y)^T\\ \nabla_x C(x,y) & \nabla_x \nabla_y C(x,y) \end{pmatrix}\\ &\overset{\mathllap{\text{stationary}}}= \begin{pmatrix} C(x-y) & -\nabla C(x-y)^T\\ \nabla C(x-y) & -\nabla^2C(x-y) \end{pmatrix}\\ &\overset{\mathllap{\text{isotropic}}}= \begin{pmatrix} C(d) & -f'\bigl(\frac{\|d\|^2}{2}\bigr)d^T\\ f'\bigl(\frac{\|d\|^2}{2}\bigr) d & -\Bigl[f''\bigl(\frac{\|d\|^2}2\bigr)dd^T + f'\bigl(\frac{\|d\|^2}2\bigr) \mathbb{I}\Bigr] \end{pmatrix} \end{aligned}$$

where we use $d=x-y$ and $C(d) = f\bigl(\frac{\|d\|^2}{2}\bigr)$ in the last equation.
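As a concrete illustration, here is a minimal sketch of how one could sample $(Z(x_i), \nabla Z(x_i))$ jointly from this block kernel. It does not use KernelFunctions.jl; the squared exponential profile $f(s) = e^{-s}$ is assumed purely as an example, and all names are made up:

```julia
using LinearAlgebra, Random

# assumed example profile: squared exponential, C(d) = f(‖d‖²/2) with f(s) = exp(-s)
f(s)   = exp(-s)
df(s)  = -exp(-s)   # f'
ddf(s) = exp(-s)    # f''

# covariance block between (Z(x), ∇Z(x)) and (Z(y), ∇Z(y)), following the
# block formula above with d = x - y
function grad_kernel_block(x, y)
    d, n = x - y, length(x)
    s = dot(d, d) / 2
    K = Matrix{Float64}(undef, n + 1, n + 1)
    K[1, 1]         = f(s)
    K[1, 2:end]     = -df(s) * d                          # Cov(Z(x), ∇Z(y))
    K[2:end, 1]     =  df(s) * d                          # Cov(∇Z(x), Z(y))
    K[2:end, 2:end] = -(ddf(s) * d * d' + df(s) * I(n))   # Cov(∇Z(x), ∇Z(y))
    return K
end

# assemble the joint covariance over all points and draw one sample
function simulate_with_gradients(points; rng = Xoshiro(0))
    n, blk = length(points), length(first(points)) + 1
    K = Matrix{Float64}(undef, n * blk, n * blk)
    for (i, x) in enumerate(points), (j, y) in enumerate(points)
        K[(i-1)*blk+1:i*blk, (j-1)*blk+1:j*blk] = grad_kernel_block(x, y)
    end
    L = cholesky(Symmetric(K) + 1e-10I).L   # small jitter for numerical stability
    return L * randn(rng, n * blk)          # consecutive blocks are (Z(xᵢ), ∇Z(xᵢ))
end

sample = simulate_with_gradients([randn(2) for _ in 1:10])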

Performance Considerations

In principle you could just directly apply autodiff (AD) to any kernel $C$ to obtain $C_{T_1}$, but since I heard that the cost of AD scales with the number of input arguments, this would be really wasteful in the isotropic case, where we only need to differentiate a one-dimensional function.
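To make the contrast concrete, here is a small sketch (ForwardDiff is assumed here just for illustration; this is not tied to KernelFunctions.jl internals): in the isotropic case only the scalar derivatives $f'(s)$ and $f''(s)$ are needed, which is constant AD work per kernel evaluation, rather than differentiating $C$ with respect to its full $d$-dimensional inputs.

```julia
using ForwardDiff

# assumed example profile: C(x, y) = f(‖x - y‖² / 2) with f(s) = exp(-s)
f(s) = exp(-s)

s = 0.3
fp  = ForwardDiff.derivative(f, s)                                   # f'(s)
fpp = ForwardDiff.derivative(t -> ForwardDiff.derivative(f, t), s)   # f''(s)

# the d-dimensional blocks of C_{T_1} are then assembled from fp, fpp and
# d = x - y via the closed-form expressions above, without any d-dimensional AD pass
```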

Unfortunately, the way that KernelFunctions implements length scales results in general kernel functions, so I am not completely sure how to tell the compiler that "these derivatives are much simpler than they look".

One possibility might be to add the abstract types IsotropicKernel and StationaryKernel and carry these types over when transformations do not violate them: scaling would not, more general affine transformations would violate isotropy but not stationarity, etc. This could probably be done with type parameters.
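A rough sketch of what such a classification could look like (all names here are invented, none of this exists in KernelFunctions.jl):

```julia
# hypothetical trait sketch: classify a kernel by the strongest structure it
# satisfies and let input transformations downgrade that classification
abstract type KernelStructure end
struct IsotropicStructure  <: KernelStructure end
struct StationaryStructure <: KernelStructure end
struct GenericStructure    <: KernelStructure end

structure(::Any) = GenericStructure()                 # conservative fallback

struct SqExp end                                      # C(d) = exp(-‖d‖² / 2)
structure(::SqExp) = IsotropicStructure()

struct ScaledInput{K}                                 # x ↦ x / ℓ
    kernel::K
    ℓ::Float64
end
structure(k::ScaledInput) = structure(k.kernel)       # scaling preserves isotropy

struct LinearInput{K}                                 # x ↦ A * x
    kernel::K
    A::Matrix{Float64}
end
# a general linear map breaks isotropy but preserves stationarity
structure(k::LinearInput) =
    structure(k.kernel) === GenericStructure() ? GenericStructure() : StationaryStructure()
```

Gradient code could then dispatch on `structure(k)` and only fall back to generic AD for `GenericStructure`.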

But even once that is implemented, how do you tell autodiff what to differentiate? I have seen the file chainrules.jl in this repository, so I thought I would ask if someone already knows how to implement this.
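I do not know what the right hook is, but for illustration, a custom rule of roughly this shape would let a reverse-mode pullback reuse $f'(s)$ instead of re-differentiating the full kernel (a sketch only: ChainRulesCore is assumed, and `isotropic_eval` and `scalar_derivative` are invented names, not part of this repository's chainrules.jl):

```julia
using ChainRulesCore

# hypothetical isotropic evaluation, C(x, y) = f(‖x - y‖² / 2)
isotropic_eval(f, x, y) = f(sum(abs2, x - y) / 2)

function ChainRulesCore.rrule(::typeof(isotropic_eval), f, x, y)
    d  = x - y
    s  = sum(abs2, d) / 2
    fp = scalar_derivative(f, s)        # hypothetical: f'(s), closed form or 1-D AD
    function isotropic_eval_pullback(Δ)
        ∂x =  Δ * fp * d                # ∇ₓ C(x, y) =  f'(s) (x - y)
        ∂y = -Δ * fp * d                # ∇_y C(x, y) = -f'(s) (x - y)
        return NoTangent(), NoTangent(), ∂x, ∂y
    end
    return f(s), isotropic_eval_pullback
end
```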

Considerations for kernels with multiple outputs

Since you implemented kernels for multiple outputs as an extension of the input space, reusing the derivative $f'\bigl(\frac{\|d\|^2}{2}\bigr)$ also becomes more complicated.

Maybe this is all premature optimization, as the evaluation of the kernel is, complexity-wise, dominated by the Cholesky decomposition.

Extension: Simulate $n$-th order derivatives

In principle you could similarly simulate $T_n(x) = (Z(x), Z'(x), \dots, Z^{(n)}(x))$. But AD is already underdeveloped for Hessians, so I don't know how to get this to work. It might be possible to write custom logic for certain isotropic random functions, like the squared exponential one.
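For instance, with the squared exponential profile $f(s) = e^{-s}$ all scalar derivatives are available in closed form, so every higher-order block reduces to a polynomial in $d$ times $e^{-s}$ (using the same sign convention as above):

$$f^{(k)}(s) = (-1)^k e^{-s}, \qquad \text{Cov}\bigl(\partial^\alpha Z(x), \partial^\beta Z(y)\bigr) = (-1)^{|\beta|}\, \partial^{\alpha+\beta} C(d), \quad d = x-y.$$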

What do you think? I handcrafted something for first order derivatives in a personal project, but for KernelFunctions.jl a more general approach is probably needed.
