Everything matters

Reflections on the blog post “Attention and Linear Regression”


In the blog post “Attention and Linear Regression”, the author makes a fascinating claim: attention and linear regression share similar mathematical structures, most notably matrix operations built on dot products and normalization, yet behave very differently in practice. This insight led me to a deeper realization:

Everything matters: not just the mathematical forms, but also the underlying context, mechanisms, and behavior.
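To make the structural parallel concrete, here is a side-by-side of the two forms as I read them (the notation is my own shorthand, not copied from the original post):

$$
\underbrace{\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V}_{\text{attention}}
\qquad \text{vs.} \qquad
\underbrace{X_\ast\,(X^\top X)^{-1}X^\top y}_{\text{linear regression prediction}}
$$

In both expressions, dot-product similarities between a query and stored examples are normalized (by a softmax on the attention side, by the inverse Gram matrix on the regression side) and then used to weight stored values or training targets.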

Introduction

Motivated by this observation, I re-implemented and significantly extended the idea using PyTorch, creating a full pipeline that compares attention mechanisms to linear regression, both mathematically and empirically. The goal was not only reproduction, but also exploration—of different data types, model configurations, and operational nuances.

This experiment investigates how changes in architecture (e.g., whitening, shared weights) and loss behavior affect the alignment between attention and regression, especially under different data regimes.
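To give a flavor of what such a pipeline can look like, here is a minimal sketch of a single-head attention regressor in PyTorch. The module name, argument names, and simplifications (no whitening, no weight sharing) are my own, not the exact code behind the experiments.

```python
import torch
import torch.nn as nn

class AttentionRegressor(nn.Module):
    """Single-head attention that predicts targets for query points
    by attending over a set of (key, value) training pairs.
    Illustrative sketch only, not the original pipeline code."""

    def __init__(self, d_in: int, d_k: int):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_k, bias=False)  # query projection
        self.W_k = nn.Linear(d_in, d_k, bias=False)  # key projection
        self.scale = d_k ** 0.5                      # softmax temperature

    def forward(self, x_query, x_train, y_train):
        # x_query: (m, d_in), x_train: (n, d_in), y_train: (n, 1)
        q = self.W_q(x_query)                                  # (m, d_k)
        k = self.W_k(x_train)                                  # (n, d_k)
        attn = torch.softmax(q @ k.T / self.scale, dim=-1)     # (m, n) row-stochastic weights
        return attn @ y_train                                  # weighted combination of training targets
```

Training such a module with an MSE loss against the targets, and optionally tying `W_q` and `W_k` for a shared-weights variant, is the kind of configuration knob explored here.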


Experimental Design

Data Modes

Attention Variants

Architectural Choices

Comparison Baseline

Ridge Regression: with a small L2 regularization term ($\lambda = 1 \times 10^{-3}$) to ensure numerical stability. It serves as the closed-form reference baseline in both synthetic and real settings.
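For completeness, a minimal sketch of the closed-form ridge baseline with the same $\lambda$; the function name and return convention are illustrative assumptions.

```python
import torch

def ridge_fit_predict(x_train, y_train, x_test, lam: float = 1e-3):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    d = x_train.shape[1]
    gram = x_train.T @ x_train + lam * torch.eye(d)
    w = torch.linalg.solve(gram, x_train.T @ y_train)  # solve instead of explicit inverse
    return x_test @ w, w                               # predictions and fitted weights
```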


Core Analyses

1. Prediction Accuracy

For each configuration, I computed MSE (mean squared error) on both training and test sets. Attention models were compared to ridge regression not just by raw accuracy, but also by cosine similarity of outputs.
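A sketch of how such a comparison could be computed; the variable names (`y_attn`, `y_ridge`, `y_true`) are assumptions rather than names from the original code.

```python
import torch
import torch.nn.functional as F

def compare_outputs(y_attn, y_ridge, y_true):
    """Compare attention and ridge predictions by MSE against the targets
    and by the cosine similarity between their prediction vectors."""
    mse_attn = F.mse_loss(y_attn, y_true).item()
    mse_ridge = F.mse_loss(y_ridge, y_true).item()
    cos = F.cosine_similarity(y_attn.flatten(), y_ridge.flatten(), dim=0).item()
    return {"mse_attn": mse_attn, "mse_ridge": mse_ridge, "cosine": cos}
```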

2. Projection Matrix Comparison

The projection behavior was analyzed by comparing:

These matrices were visualized side by side to inspect where the two projections agree and where they diverge.
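As an illustration of what such a side-by-side view could be based on, the sketch below extracts an (m, n) attention weight matrix and the analogous ridge "hat"-style matrix. It reuses the hypothetical `AttentionRegressor` from the earlier sketch, so the attribute names are assumptions.

```python
import torch

def projection_matrices(model, x_train, x_test, lam: float = 1e-3):
    """Return the (m, n) matrices that map training targets to test predictions:
    the attention weight map and the ridge 'hat'-style projection."""
    with torch.no_grad():
        q = model.W_q(x_test)
        k = model.W_k(x_train)
        attn_proj = torch.softmax(q @ k.T / model.scale, dim=-1)      # attention weights
    d = x_train.shape[1]
    ridge_proj = x_test @ torch.linalg.solve(
        x_train.T @ x_train + lam * torch.eye(d), x_train.T)          # ridge projection onto targets
    return attn_proj, ridge_proj
```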

3. Effective Weights

In synthetic settings, where true weights are known, I compared:

This gives insight into what the attention model “learns” implicitly.
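One hedged way to recover such implicit weights is to fit a least-squares linear map from the inputs to the attention model's own predictions; whether this matches the pipeline's exact definition of effective weights is an assumption on my part.

```python
import torch

def effective_weights(x_test, y_attn):
    """Least-squares fit of a linear map from inputs to the attention model's
    predictions; the recovered vector plays the role of 'effective weights'."""
    sol = torch.linalg.lstsq(x_test, y_attn)
    return sol.solution  # shape (d, 1)
```

The recovered vector can then be compared, for example via cosine similarity, to the known true weights and to the ridge weights returned by the baseline.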


Highlights & Observations

Even when attention and regression achieve similar output quality, their mechanisms differ starkly:


Closing Thoughts

This exploration confirms the original blog’s thesis, but in a more empirical and systematic way. The core operations may look similar (dot products, projections), but their semantics and emergent behaviors differ.

They reflect different philosophies: one built on closed-form analytic solutions, the other on contextual flexibility.

Same math ≠ same meaning. Structure matters, but behavior matters more.