Machine Learning Series: Episode 1.5


Before diving into building TinyTorch, let’s understand the key ML concepts you’ll encounter. These are the building blocks that frameworks handle for you.

Autograd (Automatic Differentiation)

What it is: Automatic computation of gradients (derivatives) for your operations.

A note on the term “gradients”: Gradients measure how much a small change in one value affects another. In neural networks, they tell us how much changing a weight changes the loss. For example, if the loss were w², nudging w changes the loss at a rate of 2w.

What role it plays: When you perform operations on tensors (add, multiply, matmul), autograd tracks the computation graph. Later, when you call loss.backward(), it automatically computes how each weight contributed to the final loss.

Why the framework handles it:

  • Manual gradient calculation is tedious and error-prone
  • The chain rule requires tracking every operation
  • Frameworks compute gradients automatically, so you focus on model design

Simple analogy: Like a debugger that tracks every variable change, but for mathematical derivatives.

In code:

x = Tensor([1, 2, 3], requires_grad=True)
y = x * 2  # Autograd tracks this operation
loss = y.sum()
loss.backward()  # Autograd computes gradients automatically
# Now x.grad contains how x affects the loss
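
To see how that tracking can work, here is a deliberately tiny sketch (not TinyTorch’s actual code, and only handling multiplication by a constant): each operation records a small function that knows how to send gradients back to its inputs.

class Scalar:
    """Minimal value that remembers how it was computed."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0
        self._backward = lambda grad: None  # filled in by the op that creates it

    def __mul__(self, constant):
        out = Scalar(self.value * constant)
        def backward(grad):
            # chain rule: d(out)/d(self) = constant
            self.grad += grad * constant
        out._backward = backward
        return out

    def backward(self, grad=1.0):
        self._backward(grad)

x = Scalar(3.0)
y = x * 2        # the op records how to route gradients back to x
y.backward()     # start from d(y)/d(y) = 1
print(x.grad)    # 2.0, because dy/dx = 2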

Layers

What they are: Pre-built building blocks for neural networks (Linear, ReLU, Conv2D, etc.).

What role they play: Layers are the components that transform data. A Linear layer applies a matrix multiplication and a bias, ReLU applies an activation function, and so on.

Why the framework handles it:

  • Layers encapsulate common operations (weight matrices, activations)
  • They integrate with autograd automatically
  • You compose layers to build complex networks

Simple analogy: Like UI components (Button, Input) that you compose into interfaces, but for data transformations.

In code:

# Instead of manually doing: output = input @ weights + bias
layer = Linear(in_features=784, out_features=256)
output = layer(input)  # Framework handles weights, bias, autograd
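
Under the hood, a Linear layer is not magic. Here is a simplified NumPy sketch of what it encapsulates (illustrative only; a real layer also registers its weights with autograd, and the initialization here is just a placeholder):

import numpy as np

class Linear:
    def __init__(self, in_features, out_features):
        # the layer owns its weights and bias, so you never create them by hand
        self.weights = np.random.randn(in_features, out_features) * 0.01
        self.bias = np.zeros(out_features)

    def __call__(self, x):
        # the same "input @ weights + bias" you would otherwise write yourself
        return x @ self.weights + self.bias

layer = Linear(in_features=784, out_features=256)
output = layer(np.random.randn(32, 784))  # a batch of 32 flattened inputs
print(output.shape)                       # (32, 256)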

Backpropagation

What it is: The algorithm that computes gradients by working backwards through the computation graph.

What role it plays: After computing the loss (forward pass), backpropagation calculates how much each weight should change to reduce the loss. It propagates gradients from the output back to the inputs.

Why the framework handles it:

  • Implementing backprop manually requires careful chain rule application
  • It’s the same pattern every time: forward → loss → backward → update
  • Frameworks automate this entire flow

Simple analogy: Like a reverse debugger - you know the error (loss), and it traces back to find what caused it.

The flow:

Forward:  input → layer1 → layer2 → output → loss
Backward: loss → layer2.grad → layer1.grad → input.grad

In code:

# Forward pass
output = model(input)
loss = loss_function(output, target)

# Backpropagation (framework handles this)
loss.backward()  # Computes gradients for all weights
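
To make the chain rule concrete, here is that backward step worked out by hand for a one-weight model (the numbers are made up purely for illustration):

# Forward:  y = w * x,   loss = (y - target) ** 2
# Backward: dloss/dy = 2 * (y - target),  dy/dw = x,  so dloss/dw = 2 * (y - target) * x
x, w, target = 3.0, 0.5, 6.0

y = w * x                      # forward pass
loss = (y - target) ** 2

dloss_dy = 2 * (y - target)    # chain rule, applied from the output backwards
dloss_dw = dloss_dy * x        # this is what would end up in w.grad

print(loss, dloss_dw)          # 20.25 -27.0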

Optimizers

What they are: Algorithms that update model weights based on computed gradients (SGD, Adam, etc.).

What role they play: After backpropagation computes gradients, an optimizer decides how to update the weights. Different optimizers use different strategies (a simple step, momentum, adaptive learning rates).

Why the framework handles it:

  • Weight update logic is standardized but complex
  • Optimizers manage learning rates, momentum, and other hyperparameters
  • You just call optimizer.step() instead of manually updating weights

Simple analogy: Like a package manager that updates dependencies, but for neural network weights.

In code:

optimizer = SGD(model.parameters(), lr=0.01)

# Training loop
for batch in data:
    loss = compute_loss(model(batch))
    loss.backward()  # Compute gradients
    optimizer.step()  # Update weights using gradients
    optimizer.zero_grad()  # Reset gradients for next iteration
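
So what does step() actually do? For plain SGD it boils down to one line per parameter. Here is a simplified sketch (it assumes each parameter exposes value and grad attributes, which is an illustration rather than a real API, and it skips momentum):

class SGD:
    def __init__(self, parameters, lr=0.01):
        self.parameters = list(parameters)
        self.lr = lr

    def step(self):
        for p in self.parameters:
            p.value -= self.lr * p.grad   # move each weight a small step downhill

    def zero_grad(self):
        for p in self.parameters:
            p.grad = 0.0                  # clear accumulated gradients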

How They Work Together

Here’s the complete flow in a training loop:

1. Forward Pass
   input → layers → output → loss

2. Backpropagation (autograd)
   loss.backward() → computes gradients for all weights

3. Optimization
   optimizer.step() → updates weights based on gradients

4. Repeat

The framework’s job: Handle steps 2 and 3 automatically, so you focus on step 1 (designing your model architecture).
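
To see the whole cycle in one place, here is a self-contained toy example: fitting a single weight to the function y = 2x using nothing but NumPy, so each of the four steps is visible (the data and learning rate are arbitrary illustration values):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
target = 2.0 * x                               # the function we want to learn
w, lr = 0.0, 0.05

for step in range(100):
    output = w * x                             # 1. forward pass
    loss = np.mean((output - target) ** 2)

    grad_w = np.mean(2 * (output - target) * x)  # 2. "backward": chain rule by hand

    w -= lr * grad_w                           # 3. optimizer step (plain SGD)
                                               # 4. repeat

print(w)   # close to 2.0 after training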

Why This Matters

As a software engineer, you’re used to:

  • Writing explicit logic
  • Debugging step-by-step
  • Understanding every line of code

In ML, frameworks abstract away the mathematical complexity (gradients, chain rule, optimization) so you can:

  • Focus on model architecture
  • Experiment with different designs
  • Build systems without deep math knowledge

But by building TinyTorch, you’ll understand what these abstractions are doing under the hood - making you a better ML engineer.