Back-Propagation Spelled Out - As Explained by Karpathy


Feb 15, 2025 - 17:44
Hi there! I'm Shrijith Venkatrama, founder of Hexmos. Right now, I’m building LiveAPI, a tool that makes generating API docs from your code ridiculously easy.

Adding Labels To Improve Graph Readability

Add a label parameter to the Value class:

class Value:
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self._prev = set(_children)
    self._op = _op
    self.label = label

  def __repr__(self):
    return f"Value(data={self.data})"

  def __add__(self, other):
    return Value(self.data + other.data, (self, other), '+')

  def __mul__(self, other):
    return Value(self.data * other.data, (self, other), '*')

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
print(d._prev)
print(d._op)
print("---")
print(e._prev)
print(e._op)

Update draw_dot to include the label in the graph

Originally we had the node expression as:

dot.node(name=uid, label="{ data %.4f }" % (n.data,), shape='record')

Replace with:

dot.node(name=uid, label="{ %s | data %.4f }" % (n.label, n.data), shape='record')

Now draw_dot(d) returns:

Re-Render graph with Labels

Graph with Labels

Let's add two more nodes, f and L, to the expression:

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'
L

Generate graph:

draw_dot(L)

More Complex Expression

The graph above is the result of the forward pass: each node's data is computed from its inputs, ending at the output L.

What We Want to Calculate

We want to know how each node affects the output (the loss function L). That covers the leaf inputs a, b, c and f (the weights) as well as the intermediate nodes e and d. So we want to find: dL/dL, dL/df, dL/de, dL/dd, dL/dc, dL/db, dL/da.

Add the grad attribute to accommodate backpropagation

class Value:
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self._prev = set(_children)
    self._op = _op
    self.label = label
    self.grad = 0.0 # 0 means no impact on output to start with

  # __repr__, __add__ and __mul__ stay the same as before

Update the node graphics information

dot.node(name=uid, label="{ %s | data %.4f | grad %.4f }" % (n.label, n.data, n.grad), shape='record')
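If Graphviz is not at hand, the same label, data and grad fields can be checked as plain text. The sketch below is only an illustration: `describe` is a hypothetical helper that is not part of the original code, and the Value class is a trimmed copy of the one above.

```python
class Value:
    """Trimmed copy of the article's Value class."""

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0  # no gradient computed yet

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')


def describe(root):
    """Hypothetical helper: one 'label | data | grad' line per node in the graph."""
    seen, lines = set(), []

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for child in node._prev:
            visit(child)
        lines.append("%s | data %.4f | grad %.4f" % (node.label, node.data, node.grad))

    visit(root)
    return sorted(lines)  # sorted so the output order is stable


a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'

lines = describe(L)
print("\n".join(lines))
```

Every grad shows as 0.0000 at this point, because no backward step has been done yet.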

Graph with grad property

Manually Performing Back-Propagation for The Given Graph

Node L

What is dL/dL? That is, if we change L by a tiny amount, how much does the output L change? By exactly that amount, so the answer is 1.

That is,

L.grad = 1

The Expression

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'
L

Node d

L = d * f

By known rules:

dL/dd = f

By derivation:

dL/dd = limit as h -> 0 of (L(d + h) - L(d))/h

= ((d + h)*f - d*f)/h

= (d*f + h*f - d*f)/h

= h*f/h

= f

That is, dL/dd = f = -2.0

So, we do

d.grad = -2.0

Node f

By symmetry, we get that dL/df = d = 4.0

That is,

f.grad = 4.0

The new updated graph is like this:

Updated Graph

How to do Numerical Verification of the Derivatives

def verify_dL_by_df():
  h = 0.001

  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10, label='c')
  e = a * b; e.label = 'e'
  d = e + c; d.label = 'd'
  f = Value(-2.0, label='f')
  L = d * f; L.label = 'L'
  L1 = L.data

  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10, label='c')
  e = a * b; e.label = 'e'
  d = e + c; d.label = 'd'
  f = Value(-2.0 + h, label='f') # bump f a little bit
  L = d * f; L.label = 'L'
  L2 = L.data

  print((L2 - L1)/h)

verify_dL_by_df() # prints out 3.9999 ~ 4
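The same check can be repeated for every leaf input without copying the expression each time. The following is a minimal sketch on plain floats rather than Value objects. The names `forward` and `numerical_grad` and the step size `h=1e-6` are assumptions made for illustration.

```python
def forward(a, b, c, f):
    """Forward pass of the article's expression, on plain floats."""
    e = a * b
    d = e + c
    return d * f


def numerical_grad(inputs, name, h=1e-6):
    """Estimate dL/d(name) with a forward difference of step h."""
    bumped = dict(inputs)
    bumped[name] += h  # bump one input, leave the others untouched
    return (forward(**bumped) - forward(**inputs)) / h


inputs = {'a': 2.0, 'b': -3.0, 'c': 10.0, 'f': -2.0}
estimates = {name: numerical_grad(inputs, name) for name in inputs}

for name, value in estimates.items():
    print(f"dL/d{name} ~ {value:.4f}")
```

The printed estimates can be compared against the gradients derived by hand in this article. Small differences in the last digits come from floating-point rounding.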

The Challenge - How do we calculate dL/dc?

We know dL/dd = -2.0 - so we know how L is affected by d.

The question is how c impacts L through d.

First, we can calculate the "local derivative", or figure out how c impacts d first.

That is,

dd/dc = ?

We know that:

d = c + e

So once we differentiate with respect to c, we get: dd/dc = 1

Similarly, dd/de = 1.

Now the question is, how to put together dd/dc and dL/dd?

We need something called the Chain Rule: if z depends on y, and y in turn depends on x, then

dz/dx = dz/dy * dy/dx

So, applying chain rule, we get:

dL/dc = dL/dd * dd/dc
dL/dc = -2.0 * 1.0 = -2.0

Similarly, dL/de = -2.0

Let's set the values in Python, and redraw the graph now:

c.grad = -2.0
e.grad = -2.0
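The chain rule step for c can also be checked numerically. This is a rough sketch on plain floats, not the article's Value class. The helper names `d_of` and `L_of` and the step `h` are assumptions. It estimates the local derivative dd/dc and the upstream gradient dL/dd separately, multiplies them, and compares the product with a direct estimate of dL/dc.

```python
h = 1e-6
a, b, c, f = 2.0, -3.0, 10.0, -2.0
e = a * b


def d_of(c_val, e_val):
    """The addition node: d = e + c."""
    return e_val + c_val


def L_of(d_val):
    """The final multiplication node: L = d * f."""
    return d_val * f


d = d_of(c, e)

dd_dc = (d_of(c + h, e) - d) / h       # local derivative of the + node
dL_dd = (L_of(d + h) - L_of(d)) / h    # upstream gradient arriving at d
dL_dc_chain = dL_dd * dd_dc            # chain rule: multiply the two

dL_dc_direct = (L_of(d_of(c + h, e)) - L_of(d)) / h  # bump c, measure L directly

print(dd_dc, dL_dd, dL_dc_chain, dL_dc_direct)
```

Because the local derivative of an addition is 1, the + node passes the upstream gradient through unchanged to both of its inputs.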

Graph with grads for c & e

Figuring out dL/da and dL/db

We know:

dL/de = -2.0

We want to know:

dL/da = dL/de * de/da

We know that:

e = a * b
de/da = b
de/da = b = -3.0

We can also find:

e = a * b
de/db = a
de/db = a = 2.0

So, now to get what we need:

dL/da = dL/de * de/da = -2.0 * -3.0 = 6.0
dL/db = dL/de * de/db = -2.0 * 2.0 = -4.0

We set the values in Python and redraw to get the full graph:

a.grad = 6.0
b.grad = -4.0
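Putting the whole manual pass in one place: the sketch below recomputes each gradient from its parent's gradient and the local derivative, instead of typing in the numbers. It assumes a trimmed copy of the Value class, and the order of the assignments (output first, inputs last) is chosen by hand here.

```python
class Value:
    """Trimmed copy of the article's Value class."""

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')


# Forward pass
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'

# Backward pass by hand: node.grad = parent.grad * local derivative
L.grad = 1.0                # dL/dL
d.grad = L.grad * f.data    # L = d * f  ->  dL/dd = f
f.grad = L.grad * d.data    # L = d * f  ->  dL/df = d
e.grad = d.grad * 1.0       # d = e + c  ->  dd/de = 1
c.grad = d.grad * 1.0       # d = e + c  ->  dd/dc = 1
a.grad = e.grad * b.data    # e = a * b  ->  de/da = b
b.grad = e.grad * a.data    # e = a * b  ->  de/db = a

for node in (L, f, d, e, c, a, b):
    print(node.label, node.grad)
```

The printed gradients match the values set by hand above: 1.0 for L, 4.0 for f, -2.0 for d, e and c, 6.0 for a and -4.0 for b.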

Final graph

Reference