
Efficient AD for the simulation of gradients #504

@FelixBenning


Introduction: How would you simulate gradients?

If you want to simulate the gradient of a random function $Z$, it turns out that you simply need to take derivatives of the covariance function, as (w.l.o.g. $Z$ is centered)

$$\text{Cov}(\nabla Z(x), Z(y)) = \mathbb{E}[\nabla Z(x) Z(y)] = \nabla_x \mathbb{E}[Z(x)Z(y)] = \nabla_x C(x,y)$$

And similarly

$$\text{Cov}(\nabla Z(x), \nabla Z(y)) = \nabla_x\nabla_y C(x,y)$$

For stationary covariance functions $C(x,y) = C(x-y)$ this simplifies to

$$\begin{aligned} \nabla_x C(x,y) &= \nabla C(x-y)\\ \nabla_x\nabla_y C(x,y) &= -\nabla^2 C(x-y) \end{aligned}$$
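The signs come from the chain rule: every derivative in $y$ of $C(x-y)$ contributes a factor $-1$, componentwise

$$\partial_{y_j} C(x-y) = -(\partial_j C)(x-y), \qquad \partial_{x_i}\partial_{y_j} C(x-y) = -(\partial_i\partial_j C)(x-y).$$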

So if we consider the multivariate random function $T_1(x) = (Z(x), \nabla Z(x))$, then its covariance kernel is given by

$$\begin{aligned} C_{T_1}(x,y) &= \begin{pmatrix} C(x,y) & \nabla_y C(x,y)^T\\ \nabla_x C(x,y) & \nabla_x \nabla_y C(x,y) \end{pmatrix}\\ &\overset{\mathllap{\text{stationary}}}= \begin{pmatrix} C(x-y) & -\nabla C(x-y)^T\\ \nabla C(x-y) & -\nabla^2C(x-y) \end{pmatrix}\\ &\overset{\mathllap{\text{isotropic}}}= \begin{pmatrix} C(d) & -f'\bigl(\frac{\|d\|^2}{2}\bigr)d^T\\ f'\bigl(\frac{\|d\|^2}{2}\bigr) d & -\Bigl[f''\bigl(\frac{\|d\|^2}2\bigr)dd^T + f'\bigl(\frac{\|d\|^2}2\bigr) \mathbb{I}\Bigr] \end{pmatrix} \end{aligned}$$

where we use $d=x-y$ and $C(d) = f\bigl(\frac{\|d\|^2}{2}\bigr)$ in the last equation.
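As a concrete illustration, here is a minimal sketch of how one could sample $(Z(x_i), \nabla Z(x_i))$ jointly from this block kernel. It does not use KernelFunctions.jl; the squared exponential profile $f(s) = e^{-s}$ is assumed purely as an example, and all names are made up:

```julia
using LinearAlgebra, Random

# assumed example profile: squared exponential, C(d) = f(‖d‖²/2) with f(s) = exp(-s)
f(s)   = exp(-s)
df(s)  = -exp(-s)   # f'
ddf(s) = exp(-s)    # f''

# covariance block between (Z(x), ∇Z(x)) and (Z(y), ∇Z(y)), following the
# block formula above with d = x - y
function grad_kernel_block(x, y)
    d, n = x - y, length(x)
    s = dot(d, d) / 2
    K = Matrix{Float64}(undef, n + 1, n + 1)
    K[1, 1]         = f(s)
    K[1, 2:end]     = -df(s) * d                          # Cov(Z(x), ∇Z(y))
    K[2:end, 1]     =  df(s) * d                          # Cov(∇Z(x), Z(y))
    K[2:end, 2:end] = -(ddf(s) * d * d' + df(s) * I(n))   # Cov(∇Z(x), ∇Z(y))
    return K
end

# assemble the joint covariance over all points and draw one sample
function simulate_with_gradients(points; rng = Xoshiro(0))
    n, blk = length(points), length(first(points)) + 1
    K = Matrix{Float64}(undef, n * blk, n * blk)
    for (i, x) in enumerate(points), (j, y) in enumerate(points)
        K[(i-1)*blk+1:i*blk, (j-1)*blk+1:j*blk] = grad_kernel_block(x, y)
    end
    L = cholesky(Symmetric(K) + 1e-10I).L   # small jitter for numerical stability
    return L * randn(rng, n * blk)          # consecutive blocks are (Z(xᵢ), ∇Z(xᵢ))
end

sample = simulate_with_gradients([randn(2) for _ in 1:10])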

Performance Considerations

In principle you could just directly apply autodiff (AD) to any kernel $C$ to obtain $C_{T_1}$, but since I heard that the cost of AD scales with the number of input arguments, this would be really wasteful in the isotropic case, where we only need to differentiate a one-dimensional function.
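To make the contrast concrete, here is a small sketch (ForwardDiff is assumed here just for illustration; this is not tied to KernelFunctions.jl internals): in the isotropic case only the scalar derivatives $f'(s)$ and $f''(s)$ are needed, which is constant AD work per kernel evaluation, rather than differentiating $C$ with respect to its full $d$-dimensional inputs.

```julia
using ForwardDiff

# assumed example profile: C(x, y) = f(‖x - y‖² / 2) with f(s) = exp(-s)
f(s) = exp(-s)

s = 0.3
fp  = ForwardDiff.derivative(f, s)                                   # f'(s)
fpp = ForwardDiff.derivative(t -> ForwardDiff.derivative(f, t), s)   # f''(s)

# the d-dimensional blocks of C_{T_1} are then assembled from fp, fpp and
# d = x - y via the closed-form expressions above, without any d-dimensional AD pass
```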

Unfortunately, the way that KernelFunctions implements length scales results in general kernel functions, so I am not completely sure how to tell the compiler that "these derivatives are much simpler than they look".

One possibility might be to add the abstract types IsotropicKernel and StationaryKernel and carry these types over when transformations do not violate them: scaling would not, more general affine transformations would violate isotropy but not stationarity, etc. This could probably be done with type parameters.
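A rough sketch of what such a classification could look like (all names here are invented, none of this exists in KernelFunctions.jl):

```julia
# hypothetical trait sketch: classify a kernel by the strongest structure it
# satisfies and let input transformations downgrade that classification
abstract type KernelStructure end
struct IsotropicStructure  <: KernelStructure end
struct StationaryStructure <: KernelStructure end
struct GenericStructure    <: KernelStructure end

structure(::Any) = GenericStructure()                 # conservative fallback

struct SqExp end                                      # C(d) = exp(-‖d‖² / 2)
structure(::SqExp) = IsotropicStructure()

struct ScaledInput{K}                                 # x ↦ x / ℓ
    kernel::K
    ℓ::Float64
end
structure(k::ScaledInput) = structure(k.kernel)       # scaling preserves isotropy

struct LinearInput{K}                                 # x ↦ A * x
    kernel::K
    A::Matrix{Float64}
end
# a general linear map breaks isotropy but preserves stationarity
structure(k::LinearInput) =
    structure(k.kernel) === GenericStructure() ? GenericStructure() : StationaryStructure()
```

Gradient code could then dispatch on `structure(k)` and only fall back to generic AD for `GenericStructure`.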

But even once that is implemented, how do you tell autodiff what to differentiate? I have seen the file chainrules.jl in this repository, so I thought I would ask if someone already knows how to implement this.
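I do not know what the right hook is, but for illustration, a custom rule of roughly this shape would let a reverse-mode pullback reuse $f'(s)$ instead of re-differentiating the full kernel (a sketch only: ChainRulesCore is assumed, and `isotropic_eval` and `scalar_derivative` are invented names, not part of this repository's chainrules.jl):

```julia
using ChainRulesCore

# hypothetical isotropic evaluation, C(x, y) = f(‖x - y‖² / 2)
isotropic_eval(f, x, y) = f(sum(abs2, x - y) / 2)

function ChainRulesCore.rrule(::typeof(isotropic_eval), f, x, y)
    d  = x - y
    s  = sum(abs2, d) / 2
    fp = scalar_derivative(f, s)        # hypothetical: f'(s), closed form or 1-D AD
    function isotropic_eval_pullback(Δ)
        ∂x =  Δ * fp * d                # ∇ₓ C(x, y) =  f'(s) (x - y)
        ∂y = -Δ * fp * d                # ∇_y C(x, y) = -f'(s) (x - y)
        return NoTangent(), NoTangent(), ∂x, ∂y
    end
    return f(s), isotropic_eval_pullback
end
```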

Considerations for kernels with multiple outputs

Since you implemented kernels for multiple outputs as an extension of the input space, reusing the derivative $f'\bigl(\frac{\|d\|^2}{2}\bigr)$ also becomes more complicated.

Maybe this is all premature optimization, as the evaluation of the kernel is, complexity-wise, dominated by the Cholesky decomposition.

Extension: Simulate $n$-th order derivatives

In principle you could similarly simulate $T_n(x) = (Z(x), Z'(x), \dots, Z^{(n)}(x))$. But AD is already underdeveloped for Hessians, so I don't know how to get this to work. It might be possible to write custom logic for certain isotropic random functions, like the squared exponential one.
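For instance, with the squared exponential profile $f(s) = e^{-s}$ all scalar derivatives are available in closed form, so every higher-order block reduces to a polynomial in $d$ times $e^{-s}$ (using the same sign convention as above):

$$f^{(k)}(s) = (-1)^k e^{-s}, \qquad \text{Cov}\bigl(\partial^\alpha Z(x), \partial^\beta Z(y)\bigr) = (-1)^{|\beta|}\, \partial^{\alpha+\beta} C(d), \quad d = x-y.$$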

What do you think? I handcrafted something for first order derivatives in a personal project, but for KernelFunctions.jl a more general approach is probably needed.
