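Matrix derivatives

I end up looking things up every time I run into a matrix derivative, so this time I am writing a summary.

Reference: CS5240

1. Notation for matrix derivatives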

\[
\begin{array}{c|lcr}
\text {Type} & \text{Scalar} & \text{Vector} & \text{Matrix} \\
\hline
\text{Scalar} & \cfrac {\partial y} {\partial x} & \cfrac {\partial \mathbf {y}} {\partial x} & \cfrac {\partial \mathbf{Y}} {\partial x} \\
\text{Vector} & \cfrac {\partial y} {\partial \mathbf {x}} & \cfrac {\partial \mathbf {y}} {\partial \mathbf {x}} \\
\text{Matrix} & \cfrac {\partial y} {\partial \mathbf {X}} \\
\end{array}
\]

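The first thing to note is the difference between numerator layout and denominator layout. The table above lists the six kinds of matrix derivative covered here: each row gives the type of the variable we differentiate with respect to, and each column gives the type of the quantity being differentiated. For \(\mathbf {y} \in \mathbb {R}^m\) and \(\mathbf {x} \in \mathbb {R}^n\), numerator layout writes \(\cfrac {\partial \mathbf {y}} {\partial \mathbf {x}}\) as an \(m \times n\) matrix, so the result has as many rows as the numerator; denominator layout writes it as an \(n \times m\) matrix, so the result has as many rows as the denominator. Both conventions are valid, and the two results are transposes of each other.

Both conventions are widely used in papers, and authors often do not say which one they follow, so the reader has to infer it from context.

The differences between the two are as follows: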

When \(x\) is a scalar: \[
\begin{array}{cc}
\text{numerator layout} & \text{denominator layout} \\
\hline
\cfrac {\partial y} {\partial x} & \cfrac {\partial y} {\partial x} \\
\cfrac {\partial \mathbf {y}} {\partial x} = \begin {bmatrix} \cfrac {\partial y_1} {\partial x} \\ \vdots \\\cfrac {\partial y_m} {\partial x} \end {bmatrix} &
\cfrac {\partial \mathbf {y}} {\partial x} = \begin {bmatrix} \cfrac {\partial y_1} {\partial x} & \cdots & \cfrac {\partial y_m} {\partial x} \end {bmatrix} = \cfrac {\partial \mathbf {y}^T} {\partial x}\\
\cfrac {\partial \mathbf {Y}} {\partial x} = \begin {bmatrix} \cfrac {\partial y_{11}} {\partial x} & \cdots & \cfrac {\partial y_{1n}} {\partial x} \\ \vdots & \ddots & \vdots \\ \cfrac {\partial y_{m1}} {\partial x} & \cdots & \cfrac {\partial y_{mn}} {\partial x} \end {bmatrix}
\end {array}
\]

When \(\mathbf {x}\) is a vector: \[
\begin{array}{cc}
\text{numerator layout} & \text{denominator layout} \\
\hline
\cfrac {\partial y} {\partial \mathbf {x}} = \begin {bmatrix} \cfrac {\partial y} {\partial x_1} & \cdots & \cfrac {\partial y} {\partial x_n} \end {bmatrix} = \cfrac {\partial y} {\partial \mathbf {x}^T} & \cfrac {\partial y} {\partial \mathbf {x}} = \begin {bmatrix} \cfrac {\partial y} {\partial x_1} \\ \vdots \\\cfrac {\partial y} {\partial x_n} \end {bmatrix} \\
\cfrac {\partial \mathbf {y}} {\partial \mathbf {x}} = \begin {bmatrix} \cfrac {\partial y_{1}} {\partial x_1} & \cdots & \cfrac {\partial y_{1}} {\partial x_n} \\ \vdots & \ddots & \vdots \\ \cfrac {\partial y_{m}} {\partial x_1} & \cdots & \cfrac {\partial y_{m}} {\partial x_n} \end {bmatrix} &
\cfrac {\partial \mathbf {y}} {\partial \mathbf {x}} = \begin {bmatrix} \cfrac {\partial y_{1}} {\partial x_1} & \cdots & \cfrac {\partial y_{m}} {\partial x_1} \\ \vdots & \ddots & \vdots \\ \cfrac {\partial y_{1}} {\partial x_n} & \cdots & \cfrac {\partial y_{m}} {\partial x_n} \end {bmatrix} \\
\equiv \cfrac {\partial \mathbf {y}} {\partial \mathbf {x}^T} &
\equiv \cfrac {\partial \mathbf {y}^T} {\partial \mathbf {x}}
\end {array}
\]
When \(\mathbf {X}\) is a matrix: \[
\begin{array}{cc}
\text{numerator layout} & \text{denominator layout} \\
\hline
\cfrac {\partial y} {\partial \mathbf {X}} = \begin {bmatrix} \cfrac {\partial y} {\partial x_{11}} & \cdots & \cfrac {\partial y} {\partial x_{m1}} \\ \vdots & \ddots & \vdots \\ \cfrac {\partial y} {\partial x_{1n}} & \cdots & \cfrac {\partial y} {\partial x_{mn}} \end {bmatrix} &
\cfrac {\partial y} {\partial \mathbf {X}} = \begin {bmatrix} \cfrac {\partial y} {\partial x_{11}} & \cdots & \cfrac {\partial y} {\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \cfrac {\partial y} {\partial x_{m1}} & \cdots & \cfrac {\partial y} {\partial x_{mn}} \end {bmatrix} \\
\equiv \cfrac {\partial y} {\partial \mathbf {X}^T} &
\equiv \cfrac {\partial y} {\partial \mathbf {X}}
\end {array}
\]

Two conventions were introduced above: numerator layout and denominator layout.

From here on we use the former, numerator layout, throughout.
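To make the two layouts concrete, here is a minimal numerical sketch. The helper `numerical_jacobian` and the test function `f` are hypothetical names introduced only for this illustration; the helper approximates the derivative with central differences and stores it in numerator layout, and the denominator-layout result is simply its transpose.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference derivative of f at x, stored in numerator layout (m x n)."""
    x = np.asarray(x, dtype=float)
    y = np.atleast_1d(f(x))
    jac = np.zeros((y.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        # Column j holds the derivative of every output component w.r.t. x_j.
        jac[:, j] = (np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))) / (2 * eps)
    return jac

# Hypothetical example: y has m = 3 components, x has n = 2 components.
def f(x):
    return np.array([x[0] * x[1], x[0] ** 2, np.sin(x[1])])

x0 = np.array([1.0, 2.0])
J_num = numerical_jacobian(f, x0)  # numerator layout: rows follow y
J_den = J_num.T                    # denominator layout: rows follow x
print(J_num.shape, J_den.shape)    # (3, 2) (2, 3)
```

Row \(i\) of `J_num` is the derivative of \(y_i\) with respect to \(\mathbf {x}\) written as a row vector, which matches the numerator-layout column of the tables above.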

2. Matrix derivative formulas

2.1 Scalar derivative rules

  1. \(\cfrac {\partial (u+v)} {\partial x} = \cfrac {\partial u} {\partial x} + \cfrac {\partial v} {\partial x}\)

  2. \(\cfrac {\partial uv} {\partial x} = u\cfrac {\partial v} {\partial x} + v\cfrac {\partial u} {\partial x}\)
    To prove it, use the definition of the derivative and evaluate \[
    \lim_{h \to 0} \cfrac {u(x+h) v(x+h) - u(x) v(x)} {h}
    \]

  3. \(\cfrac {\partial g(u)} {\partial x} = \cfrac {\partial g(u)} {\partial u} \cfrac {\partial u} {\partial x}\) (chain rule)

  4. \(\cfrac {\partial f(g(u))} {\partial x} = \cfrac {\partial f(g)} {\partial g} \cfrac {\partial g(u)} {\partial u} \cfrac {\partial u} {\partial x}\) (chain rule)
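As a sketch of how the limit in rule 2 can be evaluated, add and subtract \(u(x+h) v(x)\) in the numerator. This assumes \(u\) and \(v\) are differentiable at \(x\), so that \(u(x+h) \to u(x)\) as \(h \to 0\): \[
\begin {align}
\lim_{h \to 0} \cfrac {u(x+h) v(x+h) - u(x) v(x)} {h}
&= \lim_{h \to 0} \cfrac {u(x+h) v(x+h) - u(x+h) v(x) + u(x+h) v(x) - u(x) v(x)} {h} \\
&= \lim_{h \to 0} \left[ u(x+h) \cfrac {v(x+h) - v(x)} {h} + v(x) \cfrac {u(x+h) - u(x)} {h} \right] \\
&= u \cfrac {\partial v} {\partial x} + v \cfrac {\partial u} {\partial x}
\end {align}
\]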

2.2 constants (no functional dependence)

scalar \(a\), vector \(\mathbf {a}\) and matrix \(\mathbf {A}\) are not functions of \(x, \mathbf {x}, \mathbf {X}\)

  1. \(\cfrac {d \mathbf {a}} {dx} = \mathbf {0}\) (column vector)
  2. \(\cfrac {da} {d \mathbf {x}} = \mathbf {0}^T\) (row vector)
  3. \(\cfrac {da} {d \mathbf {X}} = \mathbf {0}^T\) (matrix transpose)
  4. \(\cfrac {d \mathbf {a}} {d \mathbf {x}} = \mathbf {0}\)

2.3 derivatives of vector by scalar

  1. \(\cfrac {\partial a \mathbf {u}} {\partial x} = a \cfrac {\partial \mathbf {u}} {\partial x}\), where \(a\) is not a function of \(x\)

  2. \(\cfrac {\partial \mathbf {A} \mathbf {u}} {\partial x} = \mathbf {A} \cfrac {\partial \mathbf {u}} {\partial x}\), where \(\mathbf {A}\) is not a function of \(x\)

  3. \(\cfrac {\partial \mathbf {u}^T} {\partial x} = (\cfrac {\partial \mathbf {u}} {\partial x})^T\)

  4. \(\cfrac {\partial (\mathbf {u} + \mathbf {v})} {\partial x} = \cfrac {\partial \mathbf {u}} {\partial x} + \cfrac {\partial \mathbf {v}} {\partial x}\)

  5. \(\cfrac {\partial \mathbf {g(u)}} {\partial x} = \cfrac {\partial \mathbf {g(u)}} {\partial \mathbf {u}} \cfrac {\partial \mathbf {u}} {\partial x}\) (chain rule)
    with a consistent matrix layout

  6. \(\cfrac {\partial \mathbf {f(g(u))}} {\partial x} = \cfrac {\partial \mathbf {f(g)}} {\partial \mathbf {g}} \cfrac {\partial \mathbf {g(u)}} {\partial \mathbf {u}} \cfrac {\partial \mathbf {u}} {\partial x}\)

2.4 derivative of matrix by scalar

  1. \(\cfrac {\partial a \mathbf {U}} {\partial x} = a \cfrac {\partial \mathbf {U}} {\partial x}\) , where \(a\) is not a function of \(x\)
  2. \(\cfrac {\partial \mathbf {AUB}} {\partial x} = \mathbf {A} \cfrac {\partial \mathbf {U}} {\partial x} \mathbf {B}\) , where \(\mathbf {A}\) and \(\mathbf {B}\) are not functions of \(x\)
  3. \(\cfrac {\partial (\mathbf {U} + \mathbf {V})} {\partial x} = \cfrac {\partial \mathbf {U}} {\partial x} + \cfrac {\partial \mathbf {V}} {\partial x}\)
  4. \(\cfrac {\partial \mathbf {UV}} {\partial x} =\mathbf {U} \cfrac {\partial \mathbf {V}} {\partial x} + \cfrac {\partial \mathbf {U}} {\partial x}\mathbf {V}\) (product rule)
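The order of the factors in the matrix product rule matters, because matrices do not commute in general. The following sketch checks rule 4 numerically; the functions `U`, `V` and their hand-written derivatives `dU`, `dV` are hypothetical examples chosen for this illustration only.

```python
import numpy as np

# Hypothetical matrix-valued functions of a scalar x, with hand-written derivatives.
def U(x):
    return np.array([[x, x ** 2], [1.0, np.sin(x)]])

def dU(x):
    return np.array([[1.0, 2 * x], [0.0, np.cos(x)]])

def V(x):
    return np.array([[np.exp(x), 2.0], [x, x ** 3]])

def dV(x):
    return np.array([[np.exp(x), 0.0], [1.0, 3 * x ** 2]])

x0, eps = 0.7, 1e-6

# Left-hand side: central difference of the product U(x) V(x).
lhs = (U(x0 + eps) @ V(x0 + eps) - U(x0 - eps) @ V(x0 - eps)) / (2 * eps)

# Right-hand side: product rule with the factor order preserved.
rhs = U(x0) @ dV(x0) + dU(x0) @ V(x0)

# Swapping the factors in the second term generally gives a different matrix.
wrong = U(x0) @ dV(x0) + V(x0) @ dU(x0)

print("ordered product rule matches:", np.allclose(lhs, rhs, atol=1e-6))
print("swapped factors match:", np.allclose(lhs, wrong, atol=1e-6))
```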

2.5 derivatives of scalar by vector

  1. \(\cfrac {\partial a u} {\partial \mathbf {x}} = a \cfrac {\partial u} {\partial \mathbf {x}}\), where \(a\) is not a function of \(\mathbf {x}\)

  2. \(\cfrac {\partial (u + v)} {\partial \mathbf {x}} = \cfrac {\partial u} {\partial \mathbf {x}} + \cfrac {\partial v} {\partial \mathbf {x}}\)

  3. \(\cfrac {\partial uv} {\partial \mathbf {x}} = u \cfrac {\partial v} {\partial \mathbf {x}} + v \cfrac {\partial u} {\partial \mathbf {x}}\) (product rule)

  4. \(\cfrac {\partial g(u)} {\partial \mathbf {x}} = \cfrac {\partial g(u)} {\partial u} \cfrac {\partial u} {\partial \mathbf {x}}\) (chain rule)

  5. \(\cfrac {\partial f(g(u))} {\partial \mathbf {x}} = \cfrac {\partial f(g)} {\partial g}\cfrac {\partial g(u)} {\partial u} \cfrac {\partial u} {\partial \mathbf {x}}\) (chain rule)

  6. \(\cfrac {\partial \mathbf {u}^T \mathbf {v}} {\partial \mathbf {x}} =\mathbf {u}^T \cfrac {\partial \mathbf {v}} {\partial \mathbf {x}} + \mathbf {v}^T\cfrac {\partial \mathbf {u}} {\partial \mathbf {x}}\) (product rule)
    where \(\cfrac {\partial \mathbf {u}} {\partial \mathbf {x}}\) and \(\cfrac {\partial \mathbf {v}} {\partial \mathbf {x}}\) are in numerator layout
    This is the single most important formula here; most of the others can be derived from it

  7. \(\cfrac {\partial \mathbf {u}^T \mathbf {Av}} {\partial \mathbf {x}} =\mathbf {u}^T \mathbf {A} \cfrac {\partial \mathbf {v}} {\partial \mathbf {x}} + \mathbf {v}^T \mathbf {A}^T\cfrac {\partial \mathbf {u}} {\partial \mathbf {x}}\) (product rule)
    where \(\cfrac {\partial \mathbf {u}} {\partial \mathbf {x}}\) and \(\cfrac {\partial \mathbf {v}} {\partial \mathbf {x}}\) are in numerator layout
    and \(\mathbf {A}\) is not a function of \(\mathbf {x}\)
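    This can be viewed as \(\cfrac {\partial \mathbf {u}^T \mathbf {(Av)}} {\partial \mathbf {x}} = \mathbf {u}^T \cfrac {\partial \mathbf {Av}} {\partial \mathbf {x}} + (\mathbf {Av})^T\cfrac {\partial \mathbf {u}} {\partial \mathbf {x}}\), together with \(\cfrac {\partial \mathbf {Av}} {\partial \mathbf {x}} = \mathbf {A} \cfrac {\partial \mathbf {v}} {\partial \mathbf {x}}\)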
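Rule 6 can be checked component by component. The sketch below assumes \(\mathbf {u}\) and \(\mathbf {v}\) have the same length and uses the numerator-layout convention \(\left( \cfrac {\partial \mathbf {v}} {\partial \mathbf {x}} \right)_{ij} = \cfrac {\partial v_i} {\partial x_j}\). The \(j\)-th entry of the row vector \(\cfrac {\partial \mathbf {u}^T \mathbf {v}} {\partial \mathbf {x}}\) is \[
\begin {align}
\cfrac {\partial} {\partial x_j} \sum_i u_i v_i
&= \sum_i u_i \cfrac {\partial v_i} {\partial x_j} + \sum_i v_i \cfrac {\partial u_i} {\partial x_j} \\
&= \left( \mathbf {u}^T \cfrac {\partial \mathbf {v}} {\partial \mathbf {x}} \right)_j + \left( \mathbf {v}^T \cfrac {\partial \mathbf {u}} {\partial \mathbf {x}} \right)_j
\end {align}
\]
The first line is the scalar product rule applied to each term of the sum; the second line recognizes each sum as a row vector multiplied by column \(j\) of a Jacobian.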

2.6 derivative of scalar by matrix

  1. \(\cfrac {\partial a u} {\partial \mathbf {X}} = a \cfrac {\partial u} {\partial \mathbf {X}}\), where \(a\) is not a function of \(\mathbf {X}\)
  2. \(\cfrac {\partial (u + v)} {\partial \mathbf {X}} = \cfrac {\partial u} {\partial \mathbf {X}} + \cfrac {\partial v} {\partial \mathbf {X}}\)
  3. \(\cfrac {\partial uv} {\partial \mathbf {X}} = u \cfrac {\partial v} {\partial \mathbf {X}} + v \cfrac {\partial u} {\partial \mathbf {X}}\) (product rule)
  4. \(\cfrac {\partial g(u)} {\partial \mathbf {X}} = \cfrac {\partial g(u)} {\partial u} \cfrac {\partial u} {\partial \mathbf {X}}\) (chain rule)
  5. \(\cfrac {\partial f(g(u))} {\partial \mathbf {X}} = \cfrac {\partial f(g)} {\partial g}\cfrac {\partial g(u)} {\partial u} \cfrac {\partial u} {\partial \mathbf {X}}\) (chain rule)

2.7 derivative of vector by vector

  1. \(\cfrac {\partial a \mathbf {u}} {\partial \mathbf {x}} = a \cfrac {\partial \mathbf {u}} {\partial \mathbf {x}} + \mathbf {u} \cfrac {\partial a } {\partial \mathbf {x}}\) (product rule)
  2. \(\cfrac {\partial (\mathbf {u} + \mathbf {v})} {\partial \mathbf {x}} = \cfrac {\partial \mathbf {u}} {\partial \mathbf {x}} + \cfrac {\partial \mathbf {v}} {\partial \mathbf {x}}\)
  3. \(\cfrac {\partial \mathbf {Au}} {\partial \mathbf {x}} = \mathbf {A} \cfrac {\partial \mathbf {u}} {\partial \mathbf {x}}\) (product rule), where \(\mathbf {A}\) is not a function of \(\mathbf {x}\)
  4. \(\cfrac {\partial \mathbf {g(u)}} {\partial \mathbf {x}} = \cfrac {\partial \mathbf {g(u)}} {\partial \mathbf {u}} \cfrac {\partial \mathbf {u}} {\partial \mathbf {x}}\) (chain rule)
  5. \(\cfrac {\partial \mathbf {f(g(u))}} {\partial \mathbf {x}} = \cfrac {\partial \mathbf {f(g)}} {\partial \mathbf {g}}\cfrac {\partial \mathbf {g(u)}} {\partial \mathbf {u}} \cfrac {\partial \mathbf {u}} {\partial \mathbf {x}}\) (chain rule)

2.8 Some commonly used derivatives

  1. \(\cfrac {d \mathbf {x}} {d \mathbf {x}} = \mathbf {I}\)

  2. \(\cfrac {d \mathbf {a}^T \mathbf {x}} {d \mathbf {x}} = \cfrac {d \mathbf {x}^T \mathbf {a}} {d \mathbf {x}} =\mathbf {a}^T\)
    The proof can use \(\cfrac {\partial \mathbf {u}^T \mathbf {v}} {\partial \mathbf {x}} =\mathbf {u}^T \cfrac {\partial \mathbf {v}} {\partial \mathbf {x}} + \mathbf {v}^T\cfrac {\partial \mathbf {u}} {\partial \mathbf {x}}\)

  3. \(\cfrac {d (\mathbf {x}^T \mathbf {a})^2} {d \mathbf {x}} =2 \mathbf {x}^T \mathbf {a}\mathbf {a}^T\)
    Use the chain rule together with the matrix formulas above

  4. \(\cfrac {d \mathbf {x}^T \mathbf {x}} {d \mathbf {x}} = 2 \mathbf {x}^T\)
    The proof is similar to the previous ones
    A concrete example helps to remember it:
    \(s = \mathbf {x}^T \mathbf {x} = \sum_i x_i^2\), so \(\cfrac {\partial s} {\partial x_i} = 2x_i\), and therefore \(\cfrac {d s} {d \mathbf {x}} = 2 \mathbf {x}^T\)

  5. \(\cfrac {d \mathbf {A}\mathbf {x}} {d \mathbf {x}} = \mathbf {A}\)
    The proof can use \(\cfrac {\partial \mathbf {Au}} {\partial \mathbf {x}} = \mathbf {A} \cfrac {\partial \mathbf {u}} {\partial \mathbf {x}}\)

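  6. \(\cfrac {d \mathbf {x}^T \mathbf {A}\mathbf {x}} {d \mathbf {x}} = \mathbf {x}^T (\mathbf {A}+\mathbf {A}^T)\)
    Again use the product rule \(\cfrac {\partial \mathbf {u}^T \mathbf {Av}} {\partial \mathbf {x}} =\mathbf {u}^T \mathbf {A} \cfrac {\partial \mathbf {v}} {\partial \mathbf {x}} + \mathbf {v}^T \mathbf {A}^T\cfrac {\partial \mathbf {u}} {\partial \mathbf {x}}\) with \(\mathbf {u} = \mathbf {v} = \mathbf {x}\), which gives \(\mathbf {x}^T \mathbf {A} + \mathbf {x}^T \mathbf {A}^T\); for symmetric \(\mathbf {A}\) this reduces to \(2 \mathbf {x}^T \mathbf {A}\)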
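Identities like these are easy to get wrong by a factor or a transpose, so a finite-difference check is a cheap safeguard. The sketch below compares each formula with a central-difference approximation on random inputs. The helper `numerical_jacobian` and the `checks` dictionary are hypothetical names for this illustration, and the check for \(\mathbf {x}^T \mathbf {A} \mathbf {x}\) uses \(\mathbf {x}^T (\mathbf {A} + \mathbf {A}^T)\), which is what the product rule in 2.5 yields with \(\mathbf {u} = \mathbf {v} = \mathbf {x}\).

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference derivative of f at x, stored in numerator layout (m x n)."""
    y = np.atleast_1d(f(x))
    jac = np.zeros((y.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        jac[:, j] = (np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))) / (2 * eps)
    return jac

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n)
a = rng.normal(size=n)
A = rng.normal(size=(n, n))  # deliberately not symmetric

# Each entry: name -> (function of z, closed-form derivative evaluated at x).
# Scalar-valued functions have a 1 x n row vector as their derivative.
checks = {
    "a^T x": (lambda z: a @ z, a[None, :]),
    "(x^T a)^2": (lambda z: (z @ a) ** 2, 2 * (x @ a) * a[None, :]),
    "x^T x": (lambda z: z @ z, 2 * x[None, :]),
    "A x": (lambda z: A @ z, A),
    "x^T A x": (lambda z: z @ A @ z, (x @ (A + A.T))[None, :]),
}

for name, (f, expected) in checks.items():
    ok = np.allclose(numerical_jacobian(f, x), expected, atol=1e-5)
    print(f"{name}: {'matches' if ok else 'MISMATCH'}")
```

Using a non-symmetric `A` is a deliberate choice: with a symmetric matrix, \(\mathbf {x}^T (\mathbf {A} + \mathbf {A}^T)\) and \(2 \mathbf {x}^T \mathbf {A}\) coincide, and the check could not tell them apart.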

3. Solving linear regression

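First, the cost function, where \(h_i(\pmb {\theta})\) denotes the prediction for sample \(i\), i.e. the \(i\)-th entry of \(A \pmb {\theta}\), is: \[
J(\pmb {\theta}) = \sum_{i=1}^{m} (h_i(\pmb {\theta}) - y_i)^2 = ||A \pmb {\theta} - \mathbf {y}||^2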
\]
Expanding the squared norm: \[
\begin {align}
J(\pmb {\theta}) = ||A \pmb {\theta} - \mathbf {y}||^2 &= (A \pmb {\theta} - \mathbf {y})^T (A \pmb {\theta} - \mathbf {y}) \\
&= (\pmb {\theta}^TA^T - \mathbf {y}^T)(A \pmb {\theta} - \mathbf {y}) \\
&= \pmb {\theta}^TA^T A \pmb {\theta} - \pmb {\theta}^TA^T\mathbf {y} - \mathbf {y}^TA \pmb {\theta} + \mathbf {y}^T \mathbf {y}
\end {align}
\]
Now we look for the \(\pmb {\theta}\) that attains \(\min J(\pmb {\theta})\), that is, the one satisfying \(\cfrac {d J(\pmb {\theta})} {d \pmb {\theta}} = \mathbf {0}\)

Using the matrix derivative formulas above: \[
\begin {align}
\cfrac {d J(\pmb {\theta})} {d \pmb {\theta}} &= \pmb {\theta}^T (A^T A + A^TA) - (A^T \mathbf {y})^T - (\mathbf {y}^T A) + \mathbf {0} \\
&= 2 \pmb {\theta}^T A^T A - 2 \mathbf {y}^T A \\
& = \mathbf {0}
\end {align}
\]
Transposing gives \(A^T A \pmb {\theta} = A^T \mathbf {y}\), so, assuming \(A^T A\) is invertible: \[
\pmb {\theta} = (A^T A)^{-1} A^T \mathbf {y}
\]
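The closed-form solution can be sanity-checked numerically. The sketch below builds a small synthetic problem (the sizes, the seed, the noise level and `theta_true` are all made-up values for illustration), solves the normal equations, confirms that the gradient expression derived above vanishes at the solution, and compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
A = rng.normal(size=(m, n))              # design matrix, one row per sample
theta_true = np.array([1.5, -2.0, 0.5])  # hypothetical ground-truth parameters
y = A @ theta_true + 0.01 * rng.normal(size=m)

# Normal equations: solve (A^T A) theta = A^T y instead of forming the inverse explicitly.
theta = np.linalg.solve(A.T @ A, A.T @ y)

# Gradient in numerator layout (a row vector): 2 theta^T A^T A - 2 y^T A.
grad = 2 * theta @ A.T @ A - 2 * y @ A

# Reference solution from NumPy's least-squares routine.
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print("gradient is ~0 at the solution:", np.allclose(grad, 0.0, atol=1e-8))
print("agrees with np.linalg.lstsq:", np.allclose(theta, theta_lstsq))
```

Solving the linear system with `np.linalg.solve` is used here in place of computing \((A^T A)^{-1}\) directly; both express the same formula, but avoiding the explicit inverse is usually the more numerically stable choice.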
