Where does the normal equation come from?
September 29, 2019 at 06:12 PM, by Mehdi
machine learning, mathematics
The normal equation is a method for analytically finding the parameters of a multivariate linear regression problem. This post aims to help ML students understand how to derive the normal equation from scratch.
Important properties
Let $f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}$ be a differentiable function and let $A \in \mathbb{R}^{m \times n}$, with $m, n \in \mathbb{N}^+$. We call the following matrix the gradient matrix of $f(A)$ with respect to $A$:
$$ \nabla_A f(A) = \left( \begin{matrix} \frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}} & \cdots & \frac{\partial f}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \frac{\partial f}{\partial A_{m2}} & \cdots & \frac{\partial f}{\partial A_{mn}} \\ \end{matrix} \right) $$
Let's consider a square matrix $A \in \mathbb{R}^{n \times n}$, $n \in \mathbb{N}^+$. $\text{tr} A = \sum\limits_{i=1}^{n} A_{ii}$ is called the trace of $A$, and it has some nice properties. If $A$, $B$ and $C$ are matrices of compatible dimensions, then:
- Prop. 1: $\forall a \in \mathbb{R} \quad \text{tr} a = a$
- Prop. 2: $\text{tr} AB = \text{tr} BA$ (matrices $AB$ and $BA$ have the same diagonal)
- Prop. 3: $\text{tr} ABC = \text{tr} CAB = \text{tr} BCA$ (direct application of Prop. 2)
- Prop. 4: $\text{tr} A = \text{tr} A^T$ ($A$ and $A^T$ have the same diagonal)
- Prop. 5: $\nabla_A \text{tr} ABA^TC = CAB + C^TAB^T$
- Prop. 6: $\nabla_A \text{tr} AB = B^T$
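These identities are easy to sanity-check numerically. Below is a minimal NumPy sketch (the shapes and random matrices are arbitrary choices, not part of the derivation) that verifies Props. 2, 3 and 4 directly, and Prop. 6 via finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))

# Prop. 2: tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# Prop. 3: tr(ABC) = tr(CAB) = tr(BCA)
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))

# Prop. 4: tr(A) = tr(A^T) for a square matrix
assert np.isclose(np.trace(C), np.trace(C.T))

# Prop. 6: grad_A tr(AB) = B^T, checked entry-wise by central finite differences
eps = 1e-6
grad_fd = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad_fd[i, j] = (np.trace((A + E) @ B) - np.trace((A - E) @ B)) / (2 * eps)
assert np.allclose(grad_fd, B.T, atol=1e-5)

print("Trace properties verified numerically.")
```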
Problem setup
Let's consider a data set $(X, \mathbf{y})$ with $n \in \mathbb{N}^+$ features and $m \in \mathbb{N}^+$ observations: $$ X = \left( \begin{matrix} 1 & x^{(1)}_1 & x^{(1)}_2 & ... & x^{(1)}_n \\ 1 & x^{(2)}_1 & x^{(2)}_2 & ... & x^{(2)}_n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(m)}_1 & x^{(m)}_2 & ... & x^{(m)}_n \\ \end{matrix} \right) \quad , \quad \mathbf{y} = \left( \begin{matrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \\ \end{matrix} \right) $$
Now, suppose the following multivariate linear regression model: $$ h_{\mathbf{w}}(\mathbf{x^{(i)}}) = \mathbf{x^{(i)}} \mathbf{w} = w_0 + w_1 x^{(i)}_1 + w_2 x^{(i)}_2 + \dots + w_n x^{(i)}_n \quad , \quad \mathbf{w} = \left( \begin{matrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{matrix} \right) $$
Where $\mathbf{w}$ is the column vector of parameters and $\mathbf{x^{(i)}} = (1, x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_n)$ is the $i^{\text{th}}$ row of $X$, i.e. the features of the $i^{\text{th}}$ observation of the data set $(X, \mathbf{y})$ prefixed with a constant $1$ that multiplies the intercept $w_0$.
Our goal is to analytically find the parameters $\mathbf{w}$ that best fit the data set. For this purpose, we minimize the following least squares loss function: $$ L(\mathbf{w}) = \frac{1}{2} \sum\limits_{i = 1}^{m} (h_{\mathbf{w}}(\mathbf{x^{(i)}}) - y^{(i)})^2 $$
which could be written in the following matrix form: $$ L(\mathbf{w}) = \frac{1}{2} (X \mathbf{w} - \mathbf{y})^T (X \mathbf{w} - \mathbf{y}) $$
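To check that this matrix form agrees with the summation form, here is a short NumPy sketch on a small synthetic data set (the data, shapes and weights below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 3  # observations, features

# Design matrix with a leading column of ones for the intercept w0
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n))])
y = rng.standard_normal(m)
w = rng.standard_normal(n + 1)

# Summation form: 1/2 * sum_i (h_w(x_i) - y_i)^2
loss_sum = 0.5 * sum((X[i] @ w - y[i]) ** 2 for i in range(m))

# Matrix form: 1/2 * (Xw - y)^T (Xw - y)
r = X @ w - y
loss_mat = 0.5 * r @ r

assert np.isclose(loss_sum, loss_mat)
print(loss_sum, loss_mat)
```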
Inferring the normal equation
In order to minimize $L(\mathbf{w})$ with respect to $\mathbf{w}$, we take: $$ \nabla_{\mathbf{w}} L(\mathbf{w}) = 0 \qquad \text{(1)} $$
So, let's solve this equation for $\mathbf{w}$ step by step. $$ \begin{align} \nabla_{\mathbf{w}} L(\mathbf{w}) & = 0 & \\ \nabla_{\mathbf{w}} \frac{1}{2} ((X \mathbf{w})^T - \mathbf{y}^T) (X \mathbf{w} - \mathbf{y}) & = 0 & \\ \frac{1}{2} \nabla_{\mathbf{w}} \underbrace{(\mathbf{w}^T X^T X \mathbf{w} - \mathbf{w}^T X^T \mathbf{y} - \mathbf{y}^T X \mathbf{w} + \mathbf{y}^T \mathbf{y})}_{\in \mathbb{R}} & = 0 & \\ \frac{1}{2} \nabla_{\mathbf{w}} \text{tr}(\mathbf{w}^T X^T X \mathbf{w} - \mathbf{w}^T X^T \mathbf{y} - \mathbf{y}^T X \mathbf{w} + \mathbf{y}^T \mathbf{y}) & = 0 & \text{(Prop. 1)} \\ \frac{1}{2} \nabla_{\mathbf{w}} (\text{tr}(\mathbf{w}^T X^T X \mathbf{w}) - \text{tr}(\mathbf{w}^T X^T \mathbf{y}) - \text{tr}(\mathbf{y}^T X \mathbf{w}) + \text{tr}(\mathbf{y}^T \mathbf{y})) & = 0 & \text{(linearity of tr)} \\ \frac{1}{2} \nabla_{\mathbf{w}} (\text{tr}(\mathbf{w} \mathbf{w}^T X^T X) - \text{tr}(\mathbf{y}^T X \mathbf{w}) - \text{tr}(\mathbf{y}^T X \mathbf{w}) + \text{tr}(\mathbf{y}^T \mathbf{y})) & = 0 & \text{(Prop. 3 and Prop. 4)} \\ \frac{1}{2} (\nabla_{\mathbf{w}} \text{tr}(\mathbf{w} \mathbf{w}^T X^T X) - 2 \nabla_{\mathbf{w}} \text{tr}(\mathbf{y}^T X \mathbf{w}) + \nabla_{\mathbf{w}} \text{tr}(\mathbf{y}^T \mathbf{y})) & = 0 & \\ \end{align} $$
Let's expand the first term by applying Prop. 5 with $A = \mathbf{w}$, $B = I$ and $C = X^T X$: $$ \begin{align} \nabla_{\mathbf{w}} \text{tr}(\mathbf{w} \mathbf{w}^T X^T X) & = \nabla_{\mathbf{w}} \text{tr}(\underbrace{\mathbf{w}}_A \underbrace{I}_B \underbrace{\mathbf{w}^T}_{A^T} \underbrace{X^T X}_C) \\ & = \underbrace{X^T X \mathbf{w} I}_{C A B} + \underbrace{X^T X \mathbf{w} I}_{C^T A B^T} \\ & = 2 X^T X \mathbf{w} \end{align} $$
For the second term, we apply Prop. 2 and Prop. 6: $$ \begin{align} \nabla_{\mathbf{w}} \text{tr}(\mathbf{y}^T X \mathbf{w}) & = \nabla_{\mathbf{w}} \text{tr}(\mathbf{w} \mathbf{y}^T X) & \text{(Prop. 2)} \\ & = X^T \mathbf{y} & \text{(Prop. 6 by taking } A = \mathbf{w} \text{ and } B = \mathbf{y}^T X \text{)} \end{align} $$
The last term, $\nabla_{\mathbf{w}} \text{tr}(\mathbf{y}^T \mathbf{y})$, is zero because $\mathbf{y}^T \mathbf{y}$ does not depend on $\mathbf{w}$.
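Before plugging these results back in, we can sanity-check both non-trivial gradients numerically. Since $\text{tr}(\mathbf{w} \mathbf{w}^T X^T X) = \mathbf{w}^T X^T X \mathbf{w}$ and $\text{tr}(\mathbf{y}^T X \mathbf{w}) = \mathbf{y}^T X \mathbf{w}$ are scalars, a finite-difference check on these scalar expressions is enough. This is only a sketch on arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 3
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n))])
y = rng.standard_normal(m)
w = rng.standard_normal(n + 1)

def fd_grad(f, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function of w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

# First term: grad_w [w^T X^T X w] = 2 X^T X w
g1 = fd_grad(lambda v: v @ X.T @ X @ v, w)
assert np.allclose(g1, 2 * X.T @ X @ w, atol=1e-4)

# Second term: grad_w [y^T X w] = X^T y
g2 = fd_grad(lambda v: y @ X @ v, w)
assert np.allclose(g2, X.T @ y, atol=1e-4)

print("Both gradient terms match their closed forms.")
```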
Finally, $$ \begin{align} (1) & \iff & \frac{1}{2} (2 X^T X \mathbf{w} - 2 X^T \mathbf{y} ) & = 0 & \\ & \iff & X^T X \mathbf{w} & = X^T \mathbf{y} & \\ & \implies & \mathbf{w} & = (X^T X)^{-1} X^T \mathbf{y} & \quad \text{(QED)} \end{align} $$ The second line, $X^T X \mathbf{w} = X^T \mathbf{y}$, is the normal equation itself; the last step assumes that $X^T X$ is invertible.
In the rare case where $X^T X$ is not invertible (for instance when some features are linearly dependent), we may use its pseudoinverse instead.
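Putting everything together, here is a minimal NumPy sketch that fits $\mathbf{w}$ with the normal equation on synthetic data and compares the result with NumPy's own least-squares solver. The data, noise level and shapes below are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 4

# Synthetic data set: y = X w_true + noise
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n))])  # leading column of ones
w_true = rng.standard_normal(n + 1)
y = X @ w_true + 0.1 * rng.standard_normal(m)

# Normal equation: solve (X^T X) w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudoinverse variant, which also covers a singular X^T X
w_pinv = np.linalg.pinv(X) @ y

# Reference solution from NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_normal, w_lstsq)
assert np.allclose(w_pinv, w_lstsq)
print(np.round(w_normal, 3))
```

Note that solving the linear system $X^T X \mathbf{w} = X^T \mathbf{y}$ with `np.linalg.solve` (or using the pseudoinverse) is numerically preferable to forming $(X^T X)^{-1}$ explicitly.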