A Simple fwd_diff<T> for Forward-Mode AD in C++
A practical introduction to forward-mode differentiation in C++, using a small fwd_diff<T> value type to compute derivatives alongside values.
We implement a tiny forward-mode autodiff type, fwd_diff<T>, that carries a value and one derivative and composes through normal C++ code.
This works with real programs, including arbitrary control flow.
Full code and examples: https://godbolt.org/z/3eMecWPYx
(PS: pretty pictures at the end of this blog post!)
There are three standard ways to compute derivatives of functions.
Symbolic differentiation is what you learn in school: take an expression and mechanically produce a new expression that computes its derivative. It is exact, but it operates on formulas, not programs. Once control flow, loops, or abstraction enter the picture, symbolic methods stop being practical.
Finite differences stay at the level of programs. You evaluate \( (f(x + h) - f(x)) / h \) or a variant thereof and get an approximation of the derivative at a point. This is simple and often "good enough", but it is inherently approximate and usually costs an extra function evaluation per derivative direction (and gives you weird behavior when your function jumps or has kinks). With careful inlining and SIMD you can hide some overhead, but the trade-off is still there.
Automatic differentiation sits in between. It runs the original program, not a symbolic proxy, and produces exact derivatives at a point (up to normal floating-point roundoff). Control flow, loops, function calls, and templates all work as written. You get the function value and its derivative together, in one pass.
When I first learned this, it definitely felt like magic. But as we know, any sufficiently advanced technology is indistinguishable from magic! So let's break the mechanism down until the problem is easy, and then write down the obvious implementation.
So:
In this post we build a minimal forward-mode autodiff type, fwd_diff<T>.
It behaves like a scalar T, but carries two pieces of data: the current value and the derivative with respect to one chosen input.
If you already have code written as template <class T> T foo(T x), you can call it unchanged with fwd_diff<double> (or float).
The result contains both foo(x) and \( \mathrm{d}\ \text{foo}/\mathrm{d}x \) at that point.
Forward-Mode AutoDiff
This is an oversimplification, but most automatic differentiation falls into two broad categories: forward mode and backward mode.
Forward mode propagates derivatives from the inputs forward through the computation, alongside the normal evaluation of values. Every operation produces both a value and its derivative at the same time.
Backward mode (what arguably powers most of modern machine learning) works differently. During the forward evaluation, the program records a trace (often called a "tape") of the operations performed. A second pass then propagates gradient information backward from the outputs to the inputs.
A common rule of thumb is this: Backward mode is a good fit when you have one (or a few) outputs and many inputs. Forward mode is a good fit when you have one (or a few) inputs and many outputs.
In this post we deliberately stay in the smallest possible corner: forward mode, a single derivative direction, and first derivatives only. The entire data structure looks like this:
template <class T>
struct fwd_diff
{
    T value;
    T deriv;
};
That really is enough to propagate derivatives.
I distinctly remember how hard this was to believe the first time I saw it, so let's check whether we can implement operator*.
You might remember the product rule from basic calculus: $$ (f \cdot g)' = f' \cdot g + f \cdot g' $$
OK, so at least here we only need the original values and derivatives:
template <class T>
fwd_diff<T> operator*(fwd_diff<T> const& f, fwd_diff<T> const& g)
{
    return {
        .value = f.value * g.value,
        .deriv = f.deriv * g.value + f.value * g.deriv,
    };
}
That's all?
But surely this doesn't work for more complicated functions, right?
$$ (\sqrt{f})' = \frac{f'}{2 \sqrt{f}} $$
template <class T>
fwd_diff<T> sqrt(fwd_diff<T> const& f)
{
    using std::sqrt; // resolve the primal sqrt for built-in T
    auto sqrt_f = sqrt(f.value);
    return {
        .value = sqrt_f,
        .deriv = f.deriv / (2 * sqrt_f),
    };
}
Huh, works.
As a nice side benefit: we only need to evaluate the sqrt once, making this already more efficient than finite differences.
So still only requiring value and derivative of f, nothing more.
But maybe this was luck?
$$ (\sin(f))' = \cos(f) \cdot f' $$
template <class T>
fwd_diff<T> sin(fwd_diff<T> const& f)
{
    using std::sin; // resolve the primal sin/cos for built-in T
    using std::cos;
    return {
        .value = sin(f.value),
        .deriv = cos(f.value) * f.deriv,
    };
}
Since we need sin and cos of the same value, we could also use sincos to compute both in one call.
Well, let’s try one last time.
The general power rule seems like a good stress-test. Both the base and the exponent can depend on the input:
$$ (f^g)' = f^g \left( g' \ln(f) + g \frac{f'}{f} \right) $$
Definitely more complicated than before. But still: only \(f,g,f',g'\) needed from the inputs:
template <class T>
fwd_diff<T> pow(fwd_diff<T> const& f, fwd_diff<T> const& g)
{
    using std::pow; // resolve the primal pow/log for built-in T
    using std::log;
    auto value = pow(f.value, g.value);
    return {
        .value = value,
        .deriv = value * (g.deriv * log(f.value) +
                          g.value * f.deriv / f.value),
    };
}
That's it.
No hidden state or expression trees, no symbolic manipulation. Everything composes by simple operator overloading.
Completing the Type
At this point the pattern is hard to unsee. As long as a function is differentiable, its forward-mode implementation is nothing more than:
- Compute the primal value.
- Apply the chain rule locally.
We still want to expand the type a bit.
For example, how should we implement operator==?
Should it be structural and check same value and same deriv?
Should operator< be lexicographic, treating the type as a pair (value, deriv)?
The way I use this type, the answer is a clear "no".
I want fwd_diff<T> to behave as close to T as possible and simply compute the derivative as a kind of side-effect / sidecar you can extract at any point.
That's why ==, <, etc. should only compare .value.
As a result, you can have code like this:
auto foo(auto v)
{
    if (v < 0)
        return v * v;
    else
        return sqrt(v);
}
and this works correctly with a fwd_diff<double>.
Even wild things like this are fine:
auto foo(auto v)
{
    while (v < 10)
        v = v * v + 1;
    return v;
}
(Sure, with arbitrary branching and looping it's easy to introduce discontinuities and "technically non-differentiable" locations.
fwd_diff<T> doesn't care and simply gives you the derivative of the current operation trace, as if it were extended enough to make it differentiable.)
Finally, for convenience I usually define two named constructors (aka "factories"):
// x' = 1, thus input(x) is {x, 1}
static fwd_diff<T> input(T value)
{
    return {
        .value = value,
        .deriv = T(1),
    };
}

// c' = 0, thus constant(c) is {c, 0}
static fwd_diff<T> constant(T value)
{
    return {
        .value = value,
        .deriv = T(0),
    };
}
There is an implicit conversion from T to fwd_diff<T> (semantically the same as constant), to make the type easier to use in a generic context.
Examples
There are many examples of autodiff in machine learning and optimization in general. Our main domain is geometry and graphics, so we'll elaborate on a few uses there.
It's fine if you skim these sections. The code at https://godbolt.org/z/3eMecWPYx contains the full fwd_diff class and enough example code to get the ideas across.
Example: A smooth camera frame from a spline path

Forward-mode derivatives are surprisingly geometric. If a function maps time to a 3D position, its derivative is the path tangent. And once you have a tangent, you can build a full camera frame.
Below is a tiny spline-based camera path t -> pos(t).
The control points are plain pos3f (not templated), placed at integer t.
For a parameter T t, we pick the segment index i = floor(t) and a local parameter u = t - i.
The spline construction is "real code": integer indexing, clamping, vector math, and a quadratic interpolation.
The nice part is that the tangent just falls out.
Evaluate the path at t = fwd_diff<float>::input(our_t) and the resulting .value is the camera position while .deriv is the unnormalized tangent (velocity & speed) at that time.
From there, a camera frame is just cross products:
- pick a world-up reference like (0, 1, 0)
- left = normalize(cross(world_up, tangent))
- up = cross(tangent, left) (no need to normalize as tangent and left are orthogonal and normalized)
Now you have (pos, left, up, tangent) as an orthonormal camera basis (as long as the path isn't pointing straight up).
// Minimal math types assumed:
// pos3f, vec3f, pos3<T>, vec3<T>
// operators: +, -, *
// functions: clamp, normalize, cross
template <class T>
pos3<T> quadratic_bspline(pos3f const& p0,
                          pos3f const& p1,
                          pos3f const& p2,
                          T u)
{
    // uniform quadratic B-spline segment
    auto P0 = pos3<T>(p0);
    auto P1 = pos3<T>(p1);
    auto P2 = pos3<T>(p2);

    auto one_minus_u = T(1) - u;
    auto u2 = u * u;

    // B-spline basis functions: 0.5*(1-u)^2, 0.5+u-u^2, 0.5*u^2
    return pos3<T>(0.5f * (one_minus_u * one_minus_u * P0 +
                           (T(1) + T(2) * u - T(2) * u2) * P1 +
                           u2 * P2));
}
template <class T>
pos3<T> camera_path(std::span<pos3f const> control_pts, T t)
{
    // control points at integer positions, segment index is floor(t)
    auto const n = int(control_pts.size());
    auto const i = int(std::floor(t.value)); // note: t.value assumes T = fwd_diff<float>
    auto const u = t - T(i);
    auto const pt_at = [&](int k) -> pos3f const& {
        return control_pts[std::clamp(k, 0, n - 1)];
    };
    return quadratic_bspline(pt_at(i - 1), pt_at(i), pt_at(i + 1), u);
}
struct camera_frame
{
    pos3f pos;
    vec3f left;
    vec3f up;
    vec3f forward;
};

camera_frame camera_frame_at(std::span<pos3f const> control_pts, float t)
{
    using fdd = fwd_diff<float>;

    // fwd_diff-powered evaluation
    auto p = camera_path(control_pts, fdd::input(t));

    // unpack pos & tangent
    pos3f pos = pos3f(p.x.value, p.y.value, p.z.value);
    vec3f forward = normalize(vec3f(p.x.deriv, p.y.deriv, p.z.deriv));

    // construct frame
    vec3f world_up = {0, 1, 0};
    vec3f left = normalize(cross(world_up, forward));
    vec3f up = cross(forward, left);

    return {
        .pos = pos,
        .left = left,
        .up = up,
        .forward = forward,
    };
}
Example: Heightfield normals via two forward passes

A heightfield is a surface parameterization:
$$ S(x, y) = (x, y, h(x,y)) $$
The surface tangents are the partial derivatives \(\partial S/\partial x\) and \(\partial S/\partial y\). Forward-mode gives those partials with two passes:
- first treat \(x\) as the input and \(y\) as constant
- then vice versa
Crossing the two tangents yields a high-quality normal, without finite differences and without writing a separate derivative for h.
This is also a nice example where the "heightfield" itself can be arbitrarily complicated.
As long as h(x, y) is callable on fwd_diff inputs, the normal computation stays generic.
// h must be callable as
// h(fwd_diff<float> x, fwd_diff<float> y) -> fwd_diff<float>
template <class HeightFn>
std::pair<float, vec3f> heightfield_with_normal_at(float x, float y, HeightFn&& h)
{
    using fd32 = fwd_diff<float>;

    // pass 1: differentiate wrt. x
    auto const hx = h(fd32::input(x), fd32::constant(y)); // height, dh/dx
    vec3f tx = {1.0f, 0.0f, hx.deriv}; // d/dx of (x, y, h) = (1, 0, dh/dx)

    // pass 2: differentiate wrt. y
    auto const hy = h(fd32::constant(x), fd32::input(y)); // height, dh/dy
    vec3f ty = {0.0f, 1.0f, hy.deriv}; // d/dy of (x, y, h) = (0, 1, dh/dy)

    // normal is the cross of both tangent dirs
    auto const normal = normalize(cross(tx, ty));
    return std::make_pair(hx.value, normal);
}
Example: SDF normals from exact gradients

Signed distance fields are scalar fields \( \phi(x,y,z) \) where you can define a surface at the zero set \( \phi = 0 \). (Any other constant is also fine.)
The surface normal is the normalized gradient:
$$ n = \frac{\nabla \phi}{|\nabla \phi|} $$
With forward-mode differentiation, the gradient is just three passes: differentiate once wrt. x, once wrt. y, once wrt. z.
This is an easy drop-in upgrade for isosurface extraction. Marching cubes typically estimates normals from triangle geometry or from finite differences of the field. Using exact gradients from the SDF gives smoother, more stable normals, especially for procedural SDFs with lots of composition. Dual contouring benefits similarly: you can provide better Hermite data (positions and normals) for the QEF solve.
// sdf must be callable as
// sdf(fwd_diff<float> x, fwd_diff<float> y, fwd_diff<float> z) -> fwd_diff<float>
template <class SdfFn>
vec3f sdf_normal_at(float x, float y, float z, SdfFn&& sdf)
{
    using fd32 = fwd_diff<float>;
    auto fx = sdf(fd32::input(x), fd32::constant(y), fd32::constant(z));
    auto fy = sdf(fd32::constant(x), fd32::input(y), fd32::constant(z));
    auto fz = sdf(fd32::constant(x), fd32::constant(y), fd32::input(z));
    vec3f g = {fx.deriv, fy.deriv, fz.deriv};
    return normalize(g);
}
(If you actually use this for isosurface extraction, remember that this is an accurate normal. So if your SDF has high-frequency parts that you undersample, you'll get noisy normals.)
Going Forward (pun intended)
We only carried a single first derivative here, but the basic trick generalizes: store a little more "side information" next to the primal value, overload the same operations, and let ordinary C++ code propagate it.
From here, there are a few natural upgrades:
- Higher derivatives (2nd, 3rd, ...): simply extend the payload: carry (value, deriv, deriv_deriv) (and so on), and apply the corresponding local rules.
- Many derivatives in one pass (SIMD "packets"): The heightfield and SDF examples used multiple passes because we tracked only one input direction at a time. If deriv becomes a SIMD vector (or a small fixed-size vector type), one evaluation can carry several directions at once: value stays scalar, deriv is a packet of partials. That gives you gradients/Jacobian rows with much less overhead than N separate runs.
- Optimization-friendly APIs: Once your code is templated on a "scalar-like" type, you can write optimizers that accept any autodiff-capable function: line search, gradient-based minimization, Gauss–Newton, Levenberg–Marquardt, custom solvers.
The Wikipedia entry points for these methods are:
- https://en.wikipedia.org/wiki/Dual_number (that's one view of our fwd_diff<T>)
- https://en.wikipedia.org/wiki/Automatic_differentiation (more general)
If you want a down-to-earth explanation based on dual numbers (\(a + b \cdot \varepsilon\)):