Closed
Changes from 1 commit
65 changes: 62 additions & 3 deletions docs/src/index.md
@@ -80,9 +80,6 @@ Almost always the _pushforward_/_pullback_ will be declared locally within the `
The **pushforward** of ``f`` takes the _sensitivity_ of the input of ``f`` to a quantity, and gives the _sensitivity_ of the output of ``f`` to that quantity.
The **pullback** of ``f`` takes the _sensitivity_ of a quantity to the output of ``f``, and gives the _sensitivity_ of that quantity to the input of ``f``.

#### Math
This is all a bit simplified by talking in 1D.

##### Lighter Math
For a chain of expressions:
```
@@ -118,6 +115,68 @@ then I can use the pushforward to find ``\dfrac{∂f}{∂x}``

``\dfrac{∂f}{∂x}=\mathrm{pushforward}_{h(b)|b=g(x)}\left(\left.\dfrac{∂g}{∂a}\right|_{a=x}\right)``

##### Geometric interpretation of reverse and forwards mode AD

Let us think of our types geometrically. In other words, elements of a type form a _manifold_.
This document will explain this point of view in some detail.

###### Some terminology/conventions.

Let ``p`` be an element of type ``M``, which is defined by some assignment of numbers ``x_1,...,x_m``,
Member

I don't quite get what you want to say here, do you mean how a type is represented in memory? Wouldn't we also want to require some type of "smoothness", so we can do calculus on it?

Author

You're right in saying that manifold should imply smoothness.

I wish to interpret things geometrically, which is to say I am interested in some geometric structure beyond the "set" of elements. Sometimes the word "space" is used. Smoothness isn't necessarily something we want to think about: Push-forwards and pull-backs can be defined without it.

say ``(x_1,...,x_m) = (a_1,...,a_m)``.

A _function_ ``f:M -> K`` on ``M`` is (for simplicity) a polynomial in ``K[x_1, ..., x_m]``.
Member

Does this really make things simpler? I would probably just require that f is analytic.

Author

I'm not sure what the right set of functions in the AD context is. I don't think "smooth" or even "analytic" are exactly right. I feel these two kind of imply a-priori definition of symbolic derivatives, infinite limits or some sort of finite differences.

I think we are basically combining rational functions with look-ups? (add, subtract, multiply, divide, lookup table)


The tangent space ``T_pM`` of ``M`` at ``p`` is the ``K``-vector space spanned by the derivations ``d/dx_1, ..., d/dx_m``.
Tangent vectors act linearly on the space of functions, in the usual way. Our starting point is
Member

I might add here, that they map a curve through p to the derivative of f in that direction.

Author (@aisopous, Dec 17, 2019)

Curves are a nice interpretation of tangent vectors, but I'm not convinced they are the right one for AD, since I don't think they necessarily add anything.

For reference, we can define an isomorphism
maps from the first order neighborhood of 0 in K (infinitesimal curves) -> T_pM
by taking the derivation in the direction of the curve.

that we know how to write down ``d/dx_i(f) = df/dx_i``.

The collection of tangent spaces ``{T_pM}`` for ``p\in M`` is called the _tangent bundle_ of ``M``.

Let ``df`` denote the first order information of ``f`` at each point. This is called the differential of ``f``.
Member

"first order information" sounds a bit vague to me. Can't we define df as element of the tangent space T_{f(p)}K?

Author (@aisopous, Dec 17, 2019)

The distinction between tangents and functions as duals to one another is crucial here.

For example, Griewank defines (IIRC) df at any p as the linear function, vanishing at p, which approximates f to first order. Notice that then, abusing terminology, we may write f = f(p) + df_p

The technical definition should be as follows:
Let m be the set of functions vanishing at p. Then there is a natural map from functions on M (rational functions with lookups, say; let's denote them by K(M)) to functions vanishing to order 1, i.e. d: K(M) -> m/m^2, defined by f \mapsto f - f(p) modulo m^2. This is precisely enough to algebraically specify rules of differentiation, and gives us parametrisations of tangent and cotangent spaces.

I'm not committed to this "algebraic" view of things, I am simply keen to see if this would be clearer than the "smooth" view.

Example (maps into non-smooth spaces)

    struct PointOnDegenerateConic
        x::Float64
        y::Float64

        # Inner constructor: only points on the degenerate conic x*y == 0 are valid.
        function PointOnDegenerateConic(x, y)
            @assert x * y == 0
            new(x, y)
        end
    end

At any point away from the origin, everything should look just like on a real line. With the non-smooth definitions, we can also make sense of the origin for free. At the origin, we have a two-dimensional tangent space -- the vector space dual to linear functions of x and y. The path interpretation is not really helpful here, unless we are okay with infinitesimal paths that lead nowhere.

    function ProjectToConic(x, y)
        # Ties map to the origin; otherwise keep the coordinate of larger magnitude.
        if abs(x) == abs(y)
            return (0, 0)
        end
        return abs(x) > abs(y) ? (x, 0) : (0, y)
    end

Again we have a non-smooth mapping, but pull-backs of linearised functions on the conic, and push-forwards of vectors in R^2 to vectors on the conic should be pretty interpretable. Crucially, they can also be computed without any fuss!

If the derivatives of ``f`` and ``g`` agree at ``p``, we say that ``df`` and ``dg`` represent the same cotangent at ``p``.
The covectors ``dx_1, ..., dx_m`` form a basis of the cotangent space ``T^*_pM`` at ``p``. Notice that this vector space is
dual to ``T_pM``.
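
To spell out this duality in the coordinates used here (a standard fact, included only as a worked reminder): the coordinate covectors pair with the coordinate derivations as a dual basis, and the differential of any function ``f`` expands in that basis.

```math
dx_i\left(\frac{d}{dx_j}\right) = \frac{dx_i}{dx_j} = \delta_{ij},
\qquad
df = \sum_{i=1}^{m} \frac{df}{dx_i}\, dx_i .
```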

The collection of cotangent spaces ``{T^*_pM}`` for ``p\in M`` is called the _cotangent bundle_ of ``M``.

###### Push-forwards and pullbacks

Let ``N`` be another type, defined by numbers ``y_1,...,y_n``, and let ``g:M -> N`` be a _map_, that is,
an ``n``-dimensional vector ``(g_1, ..., g_n)`` of functions on ``M``.

We define the _push-forward_ ``g_*:TM -> TN`` between tangent bundles by ``g_*(X)(h) = X(g\circ h)`` for any tangent vector ``X`` and function ``f``.
Member

Suggested change
We define the _push-forward_ ``g_*:TM -> TN`` between tangent bundles by ``g_*(X)(h) = X(g\circ h)`` for any tangent vector ``X`` and function ``f``.
We define the _push-forward_ ``g_*:TM -> TN`` between tangent bundles by ``g_*(X)(h) = X(g\circ h)`` for any tangent vector ``X`` and smooth, real-valued function ``h``.

We have ``g_*(d/dx_i)(y_j) = dg_j/dx_i``, so the push-forward is equal to the Jacobian when written in coordinates.
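
As a rough illustration of this coordinate statement, the sketch below hand-codes a hypothetical map `g` from ``K^2`` to ``K^3`` together with its Jacobian, and pushes forward the two basis vectors ``d/dx_1`` and ``d/dx_2``. The names `g`, `jacobian_g` and `pushforward_g` are made up for this example and are not part of any package API.

```julia
using LinearAlgebra  # matrix-vector products

# Hypothetical map g : K^2 -> K^3, written in coordinates.
g(x) = [x[1] * x[2], x[1]^2, 3 * x[2]]

# Hand-written Jacobian of g: entry (j, i) is dg_j/dx_i.
jacobian_g(x) = [x[2] x[1]; 2 * x[1] 0.0; 0.0 3.0]

# Push-forward of a tangent vector v at the point x: the Jacobian applied to v.
pushforward_g(x, v) = jacobian_g(x) * v

p = [2.0, 5.0]
e1 = [1.0, 0.0]  # coordinates of d/dx_1
e2 = [0.0, 1.0]  # coordinates of d/dx_2

# Pushing forward the basis vectors gives the columns of the Jacobian at p.
col1 = pushforward_g(p, e1)
col2 = pushforward_g(p, e2)
```

Each call to `pushforward_g` yields one column of the Jacobian, which is why forward mode needs one pass per input direction.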

Similarly, the pullback of the differential ``df`` is defined by
``g^*(df) = d(g\circ f)``. So for a coordinate differential ``dy_j``, we have
``g^*(dy_j) = d(g_j)``. Notice that this is a covector, and we could have defined the pullback by its action on vectors by
``g^*(dh)(X) = dh(g_*(X)) = X(g\circ h)`` for any function ``h`` on ``N`` and ``X\in TM``. In particular,
``g^*(dy_j)(d/dx_i) = dg_j/dx_i``. If you work out the action in a basis of the cotangent space, you see that it acts
by the adjoint of the Jacobian.
Member

I might add:

Note that for complex functions, how you define the adjoint of the Jacobian depends on the basis you choose as covectors, for example ``dRe(z), dIm(z)`` or ``dz, dz̅``

Author

Right, the tangent space as I've defined it would in fact be only n dimensional, whereas complexifying the tangent space of the underlying real manifold leads to a 2n dimensional tangent space of holomorphic and anti-holomorphic vectors. I've only defined push-forwards (and pull-backs) of holomorphic (co)-tangents.


Notice that the pullback of a differential and the pushforward of a vector have very different meanings, and this should
be reflected in how they are used in code.

The information contained in the push-forward map is exactly _what does my function do to tangent vectors_.
Pullbacks, acting on differentials of functions, act by taking the first order information of a function.
This works in a coordinate invariant way, and works without the notion of a metric.
Recall that _gradients_ are vectors, yet they should contain the same information as the differential ``df``.
Assuming we use the standard Euclidean metric, we can identify ``df`` and ``\nabla f`` as vectors.
But pulling back gradients still should not be a thing.
Member

I'm not really sure, what you mean here.

Author

What I should explain here, is how pull-backs are applied to gradients via the identification of linear functions and vectors by
v <--> w \mapsto <v, w> (inner product with a fixed vector is a linear function on the dual of V), and why thinking about pulling back co-vectors may be a better idea. Off the top of my head, I am not sure why. There is the whole story of not choosing an inner product just to do AD, of course, which might be useful if we want to do things that are not gradient descent.


If the goal is to evaluate the gradient of a function ``f=g\circ h:M -> N -> K``, where ``g`` is a map and ``h`` is a function,
we have two obvious options:
First, we may push forward a basis of ``T_pM`` to ``TK``, which we identify with ``K`` itself.
This results in ``m`` scalars, representing components of the gradient.
Step-by-step in coordinates:
1. Compute the push-forward of the basis of ``T_pM``, i.e. just the columns of the Jacobian ``dg_i/dx_j``.
2. Compute the push-forward by the function ``h`` (consider it as a map, ``K`` is also a manifold!) to get ``h_*(g_*(d/dx_j)) = \sum_i (dh/dy_i)(dg_i/dx_j)`` for each ``j``.

Second, we pull back the differential ``dh``:
1. Compute ``dh = (dh/dy_1, ..., dh/dy_n)`` in coordinates.
2. Pull back by (in coordinates) multiplying with the adjoint of the Jacobian, resulting in ``g^*(dh) = \sum_i (dg_i/dx_j)(dh/dy_i)`` as the ``j``-th component.
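
The two options can be compared on a small example. This is only a sketch under stated assumptions: the map `g`, the function `h`, and their hand-written derivatives `jacobian_g` and `dh` are hypothetical and chosen for illustration, with ``m = 2`` and ``n = 3``.

```julia
using LinearAlgebra  # dot, adjoint

# Hypothetical map g : K^2 -> K^3 and function h : K^3 -> K, in coordinates.
g(x) = [x[1] * x[2], x[1]^2, 3 * x[2]]
h(y) = y[1] + y[2] * y[3]

# Hand-written first-order data: the Jacobian of g (entry (i, j) is dg_i/dx_j)
# and the components dh/dy_i of the differential dh.
jacobian_g(x) = [x[2] x[1]; 2 * x[1] 0.0; 0.0 3.0]
dh(y) = [1.0, y[3], y[2]]

p = [2.0, 5.0]
q = g(p)

# Option 1 (forward): push each basis vector of T_pM through g, then through h.
# This takes one pass per input dimension (m = 2 passes here).
basis = ([1.0, 0.0], [0.0, 1.0])
grad_forward = [dot(dh(q), jacobian_g(p) * e) for e in basis]

# Option 2 (reverse): pull the differential dh back through g in a single pass
# by applying the adjoint (here simply the transpose) of the Jacobian.
grad_reverse = jacobian_g(p)' * dh(q)
```

Both routes produce the same ``m`` numbers; they differ in how many passes are needed, which is the usual reason to prefer the pullback when the output is a scalar and ``m`` is large.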


#### The anatomy of pushforward and pullback
