These are some notes I took while going through Stanford's public machine learning course (taught by Professor Andrew Ng), written up so they may be useful later. If you spot any mistakes, please let me know.
Other notes in this series:
Linear Regression
Classification and logistic regression
Generalized Linear Models
Generative Learning algorithms
Classification and logistic regression
1 Logistic regression
$h_{\theta}(x) = g(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}}$, where $g(z) = \frac{1}{1+e^{-z}}$ (the logistic function / sigmoid function).
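For intuition, here is a minimal NumPy sketch of this hypothesis (the names `sigmoid` and `hypothesis` are illustrative, not from the course notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x) for one feature vector x."""
    return sigmoid(theta @ x)
```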
$p(y=1 \mid x;\theta) = h_\theta(x)$

$p(y=0 \mid x;\theta) = 1 - h_\theta(x)$

Since $y \in \{0, 1\}$, these two cases can be written compactly as

$p(y \mid x;\theta) = (h_\theta(x))^{y}(1 - h_\theta(x))^{1-y}$
Assuming the $m$ training examples were generated independently, the likelihood of the parameters is

$$
\begin{aligned}
L(\theta) &= p(\vec{y} \mid X; \theta) \\
&= \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) \\
&= \prod_{i=1}^{m}(h_\theta(x^{(i)}))^{y^{(i)}}(1 - h_\theta(x^{(i)}))^{1-y^{(i)}}
\end{aligned}
$$

and it is easier to work with the log-likelihood:

$$
\begin{aligned}
\ell(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^{m}(h_\theta(x^{(i)}))^{y^{(i)}}(1 - h_\theta(x^{(i)}))^{1-y^{(i)}} \\
&= \sum_{i=1}^{m} \log \left((h_\theta(x^{(i)}))^{y^{(i)}}(1 - h_\theta(x^{(i)}))^{1-y^{(i)}}\right) \\
&= \sum_{i=1}^{m}\left(\log (h_\theta(x^{(i)}))^{y^{(i)}} + \log (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}\right) \\
&= \sum_{i=1}^{m}\left(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))\right)
\end{aligned}
$$
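The final expression translates directly into code. A hedged sketch, assuming `X` is an m×n design matrix and `y` an array of 0/1 labels (both names are illustrative):

```python
import numpy as np

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i*log h(x_i) + (1 - y_i)*log(1 - h(x_i)) ]."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x^(i)) for all i at once
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```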
To maximize $L(\theta)$ (equivalently $\ell(\theta)$), we use gradient ascent: $\theta := \theta + \alpha \nabla_{\theta}\ell(\theta)$. (Note the $+$ rather than the $-$ used earlier for gradient descent: here we are maximizing rather than minimizing. Also, although the update rule derived below looks identical to the LMS rule, it is not the same algorithm, because $h_\theta(x)$ is now a non-linear function of $\theta^{T}x$.)
$$
\begin{aligned}
\frac{\partial}{\partial\theta_j}\ell(\theta) &= \frac{\partial}{\partial\theta_j}\sum_{i=1}^{m}\left(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))\right) \\
&= \sum_{i=1}^{m} \frac{\partial}{\partial\theta_j}\left(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))\right) \\
&= \sum_{i=1}^{m} \left(\frac{y^{(i)}}{h_\theta(x^{(i)})} \frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) + \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \frac{\partial}{\partial\theta_j}(1 - h_\theta(x^{(i)}))\right) \\
&= \sum_{i=1}^{m} \left(\frac{y^{(i)}}{h_\theta(x^{(i)})} \frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \frac{\partial}{\partial\theta_j}h_\theta(x^{(i)})\right) \\
&= \sum_{i=1}^{m} \frac{y^{(i)} - h_\theta(x^{(i)})}{h_\theta(x^{(i)})(1 - h_\theta(x^{(i)}))} \frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) \\
&\quad \left\{\text{note 1: } \frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = h_\theta(x^{(i)})(1 - h_\theta(x^{(i)})) \frac{\partial}{\partial\theta_j} \theta^{T}x^{(i)} = h_\theta(x^{(i)})(1 - h_\theta(x^{(i)})) x_{j}^{(i)}\right\} \\
&= \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)}))x_{j}^{(i)}
\end{aligned}
$$

(Note 1 uses the sigmoid identity $g'(z) = g(z)(1 - g(z))$.)
$\theta_{j} := \theta_{j} + \alpha \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)}))x_{j}^{(i)}$
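This batch update can be sketched as follows (the learning rate `alpha` and iteration count are illustrative choices, not values from the course):

```python
import numpy as np

def gradient_ascent(X, y, alpha=0.01, iterations=1000):
    """Maximize ell(theta): theta_j += alpha * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        h = 1.0 / (1.0 + np.exp(-X @ theta))  # predictions h_theta(x^(i))
        theta += alpha * X.T @ (y - h)        # gradient of the log-likelihood
    return theta
```

Note that `X.T @ (y - h)` computes $\sum_{i}(y^{(i)} - h_\theta(x^{(i)}))x^{(i)}$ for all components $j$ at once.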
2 Digression: The perceptron learning algorithm
Define the function $g(z)$ as a hard threshold:
$$
g(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}
$$
If we then let $h_{\theta}(x) = g(\theta^{T}x)$, we obtain the update rule $\theta_{j} := \theta_{j} + \alpha(y^{(i)} - h_\theta(x^{(i)}))x_{j}^{(i)}$ (the perceptron learning algorithm).
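A sketch of a single perceptron update with the hard-threshold $g$ above (one training example per step; all names are illustrative):

```python
import numpy as np

def perceptron_step(theta, x_i, y_i, alpha=0.1):
    """One update: theta_j += alpha * (y^(i) - h_theta(x^(i))) * x_j^(i)."""
    h = 1.0 if theta @ x_i >= 0 else 0.0  # g(theta^T x) with the hard threshold
    return theta + alpha * (y_i - h) * x_i
```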
3 Newton's method for maximizing $\ell(\theta)$ (Another algorithm for maximizing $\ell(\theta)$)
Given a function $f(\theta)$, to find a $\theta$ such that $f(\theta) = 0$, Newton's method repeatedly performs the update:
$\theta := \theta - \frac{f(\theta)}{f'(\theta)}$
So how do we find a $\theta$ that maximizes $\ell(\theta)$? At a maximum of $\ell$ the first derivative vanishes, so we look for a point where $\ell'(\theta) = 0$ (this condition holds at any extremum, maximum or minimum). Applying Newton's method with $f(\theta) = \ell'(\theta)$ gives:
$\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$
In our logistic regression setting, $\theta$ is a vector, so Newton's method must be generalized accordingly:
$$
\theta := \theta - H^{-1}\nabla_{\theta}\ell(\theta), \qquad H_{ij} = \frac{\partial^{2}\ell(\theta)}{\partial\theta_{i}\partial\theta_{j}}
$$

where $H$ is the Hessian matrix of $\ell(\theta)$.
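For logistic regression the gradient is $\nabla_\theta\ell(\theta) = X^{T}(y - h)$, and the Hessian works out to $H = -X^{T}SX$ with $S = \mathrm{diag}(h^{(i)}(1 - h^{(i)}))$. A minimal sketch of the resulting Newton update (names illustrative; assumes $X^{T}SX$ is invertible and uses no regularization):

```python
import numpy as np

def newton_logistic(X, y, iterations=10):
    """Newton's method: theta := theta - H^{-1} * grad(ell(theta))."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        h = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x^(i))
        grad = X.T @ (y - h)                  # gradient of ell(theta)
        S = np.diag(h * (1 - h))              # S_ii = h^(i) * (1 - h^(i))
        H = -X.T @ S @ X                      # Hessian of ell(theta)
        theta -= np.linalg.solve(H, grad)     # theta := theta - H^{-1} grad
    return theta
```

Newton's method typically converges in far fewer iterations than gradient ascent, at the cost of solving an n×n linear system with the Hessian at each step.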