Everything matters
Reflections on the blog post “Attention and Linear Regression”

The blog post “Attention and Linear Regression” makes a fascinating claim: attention and linear regression share similar mathematical structures, most notably matrix operations involving dot products and normalization, yet behave very differently in practice. This insight led me to a deeper realization:
Everything matters: not just the mathematical forms, but also the underlying context, mechanisms, and behavior.
Introduction
Motivated by this observation, I re-implemented and significantly extended the idea using PyTorch, creating a full pipeline that compares attention mechanisms to linear regression, both mathematically and empirically. The goal was not only reproduction, but also exploration—of different data types, model configurations, and operational nuances.
This experiment investigates how changes in architecture (e.g., whitening, shared weights) and loss behavior affect the alignment between attention and regression, especially under different data regimes.
Experimental Design
Data Modes
- Synthetic Linear Data: Controlled environment with known ground-truth weights and structured feature correlation (sketched after this list).
- Real Data: Diabetes dataset from scikit-learn, scaled and split.
- Simple Data: A toy example using sine activation to inject nonlinearity.
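As a concrete sketch of the synthetic mode, something like the following generates correlated features with known ground-truth weights (the sample count, feature count, and noise scale here are my own illustrative choices, not the original post's):
```python
import torch

torch.manual_seed(0)
n_samples, n_features = 256, 8

# Mix independent Gaussians through a random matrix to induce feature correlation
mixing = torch.randn(n_features, n_features)
X = torch.randn(n_samples, n_features) @ mixing

# Known ground-truth weights, plus a little observation noise
w_true = torch.randn(n_features, 1)
y = X @ w_true + 0.1 * torch.randn(n_samples, 1)
```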
Attention Variants
- Vanilla Attention: Softmax over scaled dot products.
- Regression-like Attention: Removes softmax, approximating linear projection behavior.
- Whitened Attention: Applies data whitening to decorrelate inputs, simulating assumptions in linear regression.
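The three variants differ only in how raw scores become weights and in whether the inputs are whitened first. A minimal single-head sketch, with function and argument names that are my own illustrative choices:
```python
import torch
import torch.nn.functional as F

def attention(X, W_q, W_k, W_v, mode="vanilla", whiten=None):
    """Toy single-head attention over the rows of X.

    mode="vanilla"    -> softmax over scaled dot products
    mode="regression" -> no softmax; raw scores act as linear weights
    whiten            -> optional whitening matrix applied to X first
    """
    if whiten is not None:  # whitened attention
        X = X @ whiten
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / K.shape[-1] ** 0.5  # scaled dot products
    A = F.softmax(scores, dim=-1) if mode == "vanilla" else scores
    return A @ V, A
```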
Architectural Choices
- With / Without Whitening: Use of a whitening matrix derived from the eigendecomposition of the feature covariance (see the sketch after this list).
- Shared vs. Separate Weight Matrices: Determines whether $W_q$ and $W_k$ are tied.
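For reference, a whitening matrix can be obtained from the eigendecomposition of the feature covariance roughly as follows (the epsilon for numerical stability is my own choice):
```python
import torch

def whitening_matrix(X, eps=1e-5):
    # Center the features and form the sample covariance
    Xc = X - X.mean(dim=0, keepdim=True)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    # Eigendecomposition of the symmetric covariance matrix
    eigvals, eigvecs = torch.linalg.eigh(cov)
    # W = V diag(1/sqrt(eigenvalue)) V^T decorrelates the inputs (ZCA form)
    return eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.T
```
In this setup, tying the weights simply means reusing one tensor for both $W_q$ and $W_k$.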
Comparison Baseline
Ridge Regression: With a small L2 regularization ($\lambda = 1 \times 10^{-3}$) to ensure numerical stability. Serves as the ground truth in both synthetic and real settings.
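The baseline has the familiar closed form $\hat{w} = (X^T X + \lambda I)^{-1} X^T y$, which in PyTorch is essentially a one-liner (a sketch; `ridge_fit` is a hypothetical helper name):
```python
import torch

def ridge_fit(X, y, lam=1e-3):
    # Closed-form ridge solution: (X^T X + lam*I)^{-1} X^T y
    d = X.shape[1]
    return torch.linalg.solve(X.T @ X + lam * torch.eye(d), X.T @ y)
```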
Core Analyses
1. Prediction Accuracy
For each configuration, I computed the mean squared error (MSE) on both training and test sets. Attention models were compared to ridge regression not just by raw accuracy, but also by the cosine similarity of their outputs.
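In code, the two metrics boil down to something like this (a sketch; `y_attn` and `y_ridge` are placeholder names for the two models' predictions):
```python
import torch.nn.functional as F

def compare_predictions(y_attn, y_ridge, y_true):
    mse_attn = F.mse_loss(y_attn, y_true).item()
    mse_ridge = F.mse_loss(y_ridge, y_true).item()
    # Cosine similarity between the two models' flattened outputs
    cos = F.cosine_similarity(y_attn.flatten(), y_ridge.flatten(), dim=0).item()
    return mse_attn, mse_ridge, cos
```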
2. Projection Matrix Comparison
The projection behavior was analyzed by comparing:
- Hat Matrix ($H$) from ridge regression: $X(X^T X + \lambda I)^{-1}X^T$
- Attention Weight Matrix ($A$): learned during training
- Raw Scores ($QK^T$)
- Weight Composition ($W_q W_k^T$)
These matrices were visualized side-by-side to inspect similarities and divergences in projection behavior.
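A sketch of how the four matrices can be computed for such a side-by-side comparison (reusing the toy names from earlier; the scaling choice is mine):
```python
import torch
import torch.nn.functional as F

def projection_matrices(X, W_q, W_k, lam=1e-3):
    n, d = X.shape
    # Hat matrix from ridge regression: X (X^T X + lam*I)^{-1} X^T
    H = X @ torch.linalg.solve(X.T @ X + lam * torch.eye(d), X.T)
    # Raw scores and softmaxed attention weights over the training rows
    Q, K = X @ W_q, X @ W_k
    scores = Q @ K.T / K.shape[-1] ** 0.5
    A = F.softmax(scores, dim=-1)
    # Composition of the learned projections
    W_comp = W_q @ W_k.T
    return H, A, scores, W_comp
```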
3. Effective Weights
In synthetic settings, where true weights are known, I compared:
- True Weights
- Estimated Weights from Ridge Regression
- Effective Weights from Attention: calculated via $W_{\text{combined}} X^T Y / d_{\text{features}}$, where $W_{\text{combined}} = W_q W_k^T$
This gives insight into what the attention model “learns” implicitly.
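Following the formula above, the extraction could look like this (a sketch with the same illustrative names):
```python
import torch

def effective_weights(X, y, W_q, W_k):
    # Implicit linear map: W_combined X^T y / d_features
    d = X.shape[1]
    W_combined = W_q @ W_k.T
    return W_combined @ X.T @ y / d
```
The result has the same shape as the ground-truth weight vector, so it can be plotted directly against the true and ridge-estimated weights.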
Highlights & Observations
Even when attention and regression achieve similar output quality, their mechanisms differ starkly:
- Regression uses a global, closed-form mapping.
- Attention operates locally, relying on similarity or context via softmax or whitening.
Closing Thoughts
This exploration confirms the original blog’s thesis, but in a more empirical and systematic way. The core operations may look similar (dot products, projections), but their semantics and emergent behaviors differ.
- Attention isn’t just “fancy matrix multiplication”: it adapts to local or learned similarity, which can align with regression’s global mapping, but does not have to.
- Linear regression isn’t “just another dot-product model”: it makes assumptions about global structure.
They reflect different philosophies: one built on analytic closure, the other on contextual flexibility.
Same math ≠ same meaning. Structure matters, but behavior matters more.