
Classification and Logistic Regression

I went through the Stanford open course on machine learning (taught by Andrew Ng) and took some notes, which I am writing up here for future reference. If you spot any errors, please let me know.

Other notes in this series:

Linear Regression

Classification and Logistic Regression

Generalized Linear Models

Generative Learning Algorithms


1 Logistic Regression

$h_{\theta}(x) = g(\theta^{T}x) = \dfrac{1}{1+e^{-\theta^{T}x}}$, where $g(z) = \dfrac{1}{1+e^{-z}}$ is the logistic function (sigmoid function).
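A minimal NumPy sketch of $g(z)$ and $h_\theta(x)$; the names `sigmoid`, `h`, `theta`, and `x` are illustrative, not from the original notes:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x), interpreted as P(y = 1 | x; theta)."""
    return sigmoid(np.dot(theta, x))
```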

$p(y=1 \mid x;\theta) = h_\theta(x)$

$p(y=0 \mid x;\theta) = 1 - h_\theta(x)$

$p(y \mid x;\theta) = (h_\theta(x))^{y}\,(1 - h_\theta(x))^{1-y}$
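As a quick check that this compact form agrees with the two cases above: setting $y=1$ gives $(h_\theta(x))^{1}(1-h_\theta(x))^{0} = h_\theta(x)$, while setting $y=0$ gives $(h_\theta(x))^{0}(1-h_\theta(x))^{1} = 1-h_\theta(x)$.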

$$
\begin{aligned}
L(\theta) &= p(\vec{y} \mid X;\theta) \\
&= \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) \\
&= \prod_{i=1}^{m} (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}
\end{aligned}
$$

$$
\begin{aligned}
\ell(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^{m} (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1-y^{(i)}} \\
&= \sum_{i=1}^{m} \log\!\left((h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}\right) \\
&= \sum_{i=1}^{m} \left(\log (h_\theta(x^{(i)}))^{y^{(i)}} + \log (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}\right) \\
&= \sum_{i=1}^{m} \left(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))\right)
\end{aligned}
$$

To maximize $L(\theta)$ (equivalently, $\ell(\theta)$), use gradient ascent: $\theta := \theta + \alpha\,\nabla_{\theta}\ell(\theta)$ (note the $+$ here, in contrast to the $-$ in the gradient descent update studied earlier, because we are now maximizing a function rather than minimizing one).
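Before deriving the gradient, a minimal NumPy sketch of $\ell(\theta)$ itself may help; it assumes a design matrix `X` of shape $(m, n)$ and a label vector `y` of 0/1 values (both names are illustrative):

```python
import numpy as np

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i * log(h_i) + (1 - y_i) * log(1 - h_i) ]."""
    h = 1.0 / (1.0 + np.exp(-X.dot(theta)))  # h_theta(x^(i)) for every example
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```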

$$
\begin{aligned}
\frac{\partial}{\partial\theta_j}\ell(\theta) &= \frac{\partial}{\partial\theta_j}\sum_{i=1}^{m}\left(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))\right) \\
&= \sum_{i=1}^{m} \frac{\partial}{\partial\theta_j}\left(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))\right) \\
&= \sum_{i=1}^{m} \left(\frac{y^{(i)}}{h_\theta(x^{(i)})} \frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) + \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \frac{\partial}{\partial\theta_j}\bigl(1 - h_\theta(x^{(i)})\bigr)\right) \\
&= \sum_{i=1}^{m} \left(\frac{y^{(i)}}{h_\theta(x^{(i)})} \frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \frac{\partial}{\partial\theta_j}h_\theta(x^{(i)})\right) \\
&= \sum_{i=1}^{m} \frac{y^{(i)} - h_\theta(x^{(i)})}{h_\theta(x^{(i)})\bigl(1 - h_\theta(x^{(i)})\bigr)} \frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) \\
&\qquad \left\{\text{note 1: } \frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = h_\theta(x^{(i)})\bigl(1 - h_\theta(x^{(i)})\bigr) \frac{\partial}{\partial\theta_j}\theta^{T}x^{(i)} = h_\theta(x^{(i)})\bigl(1 - h_\theta(x^{(i)})\bigr)\, x_{j}^{(i)}\right\} \\
&= \sum_{i=1}^{m} \bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)x_{j}^{(i)}
\end{aligned}
$$
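Note 1 above relies on the identity $g'(z) = g(z)\bigl(1 - g(z)\bigr)$ for the sigmoid, combined with the chain rule. For completeness, the short derivation:

$$
g'(z) = \frac{d}{dz}\,\frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^{2}} = \frac{1}{1+e^{-z}}\left(1 - \frac{1}{1+e^{-z}}\right) = g(z)\bigl(1-g(z)\bigr),
$$

so $\frac{\partial}{\partial\theta_j} h_\theta(x) = g'(\theta^{T}x)\,x_j = h_\theta(x)\bigl(1-h_\theta(x)\bigr)x_j$.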

$\theta_{j} := \theta_{j} + \alpha \sum_{i=1}^{m} \bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)x_{j}^{(i)}$
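A minimal batch gradient-ascent sketch of this update rule (vectorized over $j$); `alpha` and `num_iters` are illustrative hyperparameters, not values from the course:

```python
import numpy as np

def gradient_ascent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient ascent for logistic regression:
    theta_j := theta_j + alpha * sum_i (y_i - h_i) * x_ij, for all j at once."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-X.dot(theta)))  # predictions h_theta(x^(i))
        theta += alpha * X.T.dot(y - h)          # alpha * gradient of ell(theta)
    return theta
```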

2 Digression: The Perceptron Learning Algorithm

Define the function $g(z)$ as:

$$
g(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}
$$

If we let $h_{\theta}(x) = g(\theta^{T}x)$ with this modified $g$ and use the update rule $\theta_{j} := \theta_{j} + \alpha\bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)x_{j}^{(i)}$, we obtain the perceptron learning algorithm.
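A minimal single-example sketch of the perceptron rule under these definitions; the epoch loop and hyperparameters are illustrative:

```python
import numpy as np

def perceptron_train(X, y, alpha=0.1, num_epochs=10):
    """Perceptron rule: theta := theta + alpha * (y_i - h_theta(x_i)) * x_i,
    with h_theta(x) = 1 if theta^T x >= 0 else 0."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        for i in range(m):
            h = 1.0 if X[i].dot(theta) >= 0 else 0.0  # thresholded g(theta^T x)
            theta += alpha * (y[i] - h) * X[i]
    return theta
```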

3 Maximizing $\ell(\theta)$ with Newton's Method (Another algorithm for maximizing $\ell(\theta)$)

Given a function $f(\theta)$, to find a $\theta$ such that $f(\theta) = 0$, Newton's method repeatedly performs the update:

$$
\theta := \theta - \frac{f(\theta)}{f'(\theta)}.
$$
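A minimal one-dimensional sketch of this iteration; `f` and `f_prime` are illustrative callables supplied by the caller, and the fixed iteration count stands in for a proper convergence test:

```python
def newton_root(f, f_prime, theta0, num_iters=20):
    """Iterate theta := theta - f(theta) / f'(theta) to approach a root of f."""
    theta = theta0
    for _ in range(num_iters):
        theta = theta - f(theta) / f_prime(theta)
    return theta

# Example: solving theta^2 - 2 = 0 from theta = 1 converges to sqrt(2) ~ 1.4142.
print(newton_root(lambda t: t**2 - 2, lambda t: 2 * t, 1.0))
```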

So how do we find a $\theta$ that maximizes $\ell(\theta)$? We need $\ell'(\theta) = 0$ (at any extremum of $\ell(\theta)$, whether a maximum or a minimum, the first derivative vanishes). Applying Newton's method to $f(\theta) = \ell'(\theta)$ gives:

$$
\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}.
$$

In our logistic regression setting, $\theta$ is vector-valued, so Newton's method must be generalized to the vector case accordingly:

$$
\theta := \theta - H^{-1}\nabla_{\theta}\ell(\theta), \qquad H_{ij} = \frac{\partial^{2}\ell(\theta)}{\partial\theta_{i}\,\partial\theta_{j}}.
$$
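Putting the pieces together, here is a minimal Newton's-method sketch for logistic regression. It uses the gradient derived in Section 1 and the standard fact (not derived in these notes) that the Hessian of $\ell(\theta)$ is $H = -X^{\top}\,\mathrm{diag}\bigl(h^{(i)}(1-h^{(i)})\bigr)\,X$; all names are illustrative:

```python
import numpy as np

def newton_logistic(X, y, num_iters=10):
    """Newton's method for maximizing the logistic-regression log-likelihood:
    theta := theta - H^{-1} * grad, with
      grad = X^T (y - h)                  (from Section 1)
      H    = -X^T diag(h * (1 - h)) X     (Hessian of ell(theta))"""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-X.dot(theta)))
        grad = X.T.dot(y - h)                 # nabla_theta ell(theta)
        H = -(X.T * (h * (1 - h))).dot(X)     # Hessian of ell(theta)
        theta -= np.linalg.solve(H, grad)     # theta := theta - H^{-1} grad
    return theta
```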
