
Machine Learning Notes: Course 3 Regression

Examples

Stock market forecast: input data about a company and regress an output such as tomorrow's Dow Jones Industrial Average

Self-driving car: input sensor data and output a steering direction

Example Application

Estimate a Pokémon’s Combat Power (CP) after evolution

Input: a Pokémon, where $x_{cp}$ is its combat power before evolution, $x_s$ its species, $x_{hp}$ its hit points, $x_w$ its weight, and $x_h$ its height.

Output: $y$, the Combat Power after evolution

Step 1: Model

Find a model from a set of candidate functions.

Suppose we choose a linear model:

$$y = b + w \times x_{cp}$$

where $w$ and $b$ are parameters.

More generally, the model can be written as:

$$y = b + \sum_i w_i \times x_i$$

where each $x_i$ is an attribute of the input $x$ (called a feature), $b$ is called the bias, and each $w_i$ is called a weight.
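As a minimal sketch of this weighted-sum form (the feature values, weights, and bias below are invented purely for illustration):

```python
# Minimal sketch of the linear model y = b + sum_i w_i * x_i.
# Feature values, weights, and bias are made-up illustration numbers.
features = {"cp": 100.0, "hp": 35.0, "w": 6.9, "h": 0.7}   # x_i
weights  = {"cp": 0.9,   "hp": 0.1,  "w": 0.0, "h": 0.0}   # w_i
bias = 10.0                                                 # b

y = bias + sum(weights[k] * features[k] for k in features)
print(y)  # predicted CP after evolution
```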

Step 2: Goodness of Function

Use $x^1$ to denote one complete input example and $\hat y^1$ the corresponding output.

Collect many inputs and their corresponding outputs as pairs

$$(x^i, \hat y^i)$$

which can be plotted in a graph.


With all the training data, we can define the goodness of a function using a loss function:

Loss function:

Input: A function

Output: how bad it is, called the estimated error

For the model $y = b + w \times x_{cp}$, over 10 training examples:

$$L(f) = L(w,b) = \sum_{n=1}^{10}\left(\hat y^n - \left(b + w \times x^{n}_{cp}\right)\right)^2$$
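A sketch of this loss in Python, over hypothetical training pairs (all ten numbers below are invented for illustration):

```python
# Squared-error loss L(w, b) = sum_n (y_hat_n - (b + w * x_cp_n))^2
# over hypothetical (invented) training pairs.
x_cp  = [10, 25, 40, 55, 70, 90, 110, 150, 200, 250]    # CP before evolution
y_hat = [30, 60, 95, 130, 160, 210, 250, 340, 450, 560]  # CP after evolution

def loss(w, b):
    return sum((yh - (b + w * x)) ** 2 for x, yh in zip(x_cp, y_hat))

print(loss(2.0, 10.0))  # estimated error for one candidate (w, b)
```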

Step 3: Best Function

Choose the best function from the set via the loss function:

$$f^* = \arg\min_f L(f)$$

$$w^*, b^* = \arg\min_{w,b} L(w,b)$$

i.e., choose the $w$, $b$, and $f$ that minimize $L(f)$ and $L(w,b)$.

This is done with the method of Gradient Descent.

Consider a loss function $L(w)$ with only one parameter $w$:

Randomly choose an initial value $w^0$

Compute

$$\left.\frac{dL}{dw}\right|_{w=w^0}$$

If the derivative is negative, increase $w$; if it is positive, decrease $w$.


Then compute the next value $w^1$:

$$w^1 = w^0 - \eta \left.\frac{dL}{dw}\right|_{w=w^0}$$

where $\eta$ is the learning rate. Repeat the process above until it converges; this reaches a local optimum, not necessarily the global optimum.
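A sketch of this update rule on a toy one-parameter loss $L(w) = (w-3)^2$, chosen only because its minimum ($w = 3$) is known in advance:

```python
# Gradient descent on the toy loss L(w) = (w - 3)^2,
# whose derivative is dL/dw = 2 * (w - 3) and whose minimum is w = 3.
eta = 0.1   # learning rate
w = 0.0     # randomly chosen initial value w^0

for step in range(50):
    grad = 2.0 * (w - 3.0)   # dL/dw at the current w
    w = w - eta * grad       # w^{t+1} = w^t - eta * dL/dw

print(w)  # converges toward 3.0
```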

With two parameters $w$ and $b$, compute both partial derivatives and update them together:

$$w^1 = w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0,\, b=b^0}, \qquad b^1 = b^0 - \eta \left.\frac{\partial L}{\partial b}\right|_{w=w^0,\, b=b^0}$$
PS: For a linear model, the loss is convex, so gradient descent has no local optima to get stuck in.
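A sketch of the two-parameter case on the squared loss, using small invented data (roughly $y = 2x + 1$) so convergence is easy to see; the gradients are the partial derivatives of $L(w,b)$:

```python
# Gradient descent on L(w, b) = sum_n (y_n - (b + w * x_n))^2.
# Toy invented data, roughly following y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]

w, b, eta = 0.0, 0.0, 0.01

for step in range(10000):
    # Partial derivatives of the squared loss w.r.t. w and b
    grad_w = sum(-2 * x * (y - (b + w * x)) for x, y in zip(xs, ys))
    grad_b = sum(-2 *     (y - (b + w * x)) for x, y in zip(xs, ys))
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)  # close to the underlying slope 2 and bias 1
```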

Generalization

Choose another 10 Pokémon as test data and calculate the error on them.

Another Model

$$y = b + w_1 \times x_{cp} + w_2 \times (x_{cp})^2$$

The same gradient descent method is used to find the best parameters.

Other models can also be tried, such as:

$$y = b + w_1 \times x_{cp} + w_2 \times (x_{cp})^2 + w_3 \times (x_{cp})^3$$

A more complex model fits the training data at least as well, but may result in a larger testing error (overfitting).
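A sketch of this effect using numpy's least-squares polynomial fit instead of gradient descent (the train/test numbers are invented for illustration): as the degree grows, the training error can only shrink, while the testing error may grow.

```python
import numpy as np

# Hypothetical training and testing data (invented for illustration).
x_train = np.array([10, 25, 40, 55, 70, 90, 110, 150, 200, 250], dtype=float)
y_train = np.array([30, 60, 95, 130, 160, 210, 250, 340, 450, 560], dtype=float)
x_test  = np.array([15, 35, 60, 85, 120, 170, 220], dtype=float)
y_test  = np.array([40, 85, 140, 200, 270, 380, 490], dtype=float)

for degree in (1, 2, 3):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    train_err = np.sum((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_err  = np.sum((y_test  - np.polyval(coeffs, x_test))  ** 2)
    print(degree, round(train_err, 1), round(test_err, 1))
```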

Some other factors

A Pokémon’s species may also have an influence, so the model could be redesigned based on that:

Choose a different linear function for each species:

E.g., with $x_s$ representing the species:

$$\text{if } x_s = \text{Pidgey}: \quad y = b_1 + w_1 \times x_{cp}$$

$$\text{if } x_s = \text{Weedle}: \quad y = b_2 + w_2 \times x_{cp}$$

All of the above can be combined into a single linear function by using indicator features $\delta(\cdot)$ that equal 1 when the condition holds and 0 otherwise:

$$y = b_1\,\delta(x_s = \text{Pidgey}) + w_1\,\delta(x_s = \text{Pidgey})\,x_{cp} + b_2\,\delta(x_s = \text{Weedle}) + w_2\,\delta(x_s = \text{Weedle})\,x_{cp}$$
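A sketch of this indicator-feature form in code (the parameter values are invented; only the two species from the example are handled):

```python
# Piecewise model written as one linear function using indicator features:
# y = b1*d_P + w1*d_P*x_cp + b2*d_W + w2*d_W*x_cp,
# where d_P = 1 iff the species is Pidgey and d_W = 1 iff it is Weedle.
def predict(species, x_cp, params):
    b1, w1, b2, w2 = params
    d_p = 1.0 if species == "Pidgey" else 0.0
    d_w = 1.0 if species == "Weedle" else 0.0
    return b1 * d_p + w1 * d_p * x_cp + b2 * d_w + w2 * d_w * x_cp

params = (8.0, 2.1, 5.0, 1.6)  # (b1, w1, b2, w2), invented values
print(predict("Pidgey", 100, params))  # only the Pidgey terms are active
print(predict("Weedle", 100, params))  # only the Weedle terms are active
```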

More factors like weight, height, and others could also be taken into consideration. This would probably lead to a lower training error but a higher testing error, since overfitting can happen.

To avoid overfitting, a strategy called Regularization can be added to the model.

E.g.

$$L = \sum_{n}\left(\hat y^n - \left(b + \sum_i w_i \times x_i\right)\right)^2 + \lambda \sum_i (w_i)^2$$

Here the term

$$\lambda \sum_i (w_i)^2$$

penalizes large weights. A function with small weights is smoother, i.e., less sensitive to noise in the input. A function with a smaller value of this term is therefore expected to be better, and $\lambda$ is a hyperparameter.

A larger $\lambda$ weights the smoothness term more heavily relative to the fit between the model’s outputs and the training data, which means caring less about the training error.
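A sketch of the effect of $\lambda$, solving the regularized least-squares problem in closed form rather than by gradient descent (the data is invented; note the bias $b$ is not penalized, matching the formula above, which sums only over the $w_i$):

```python
import numpy as np

# Regularized loss L = sum_n (y_n - (b + w * x_n))^2 + lam * w^2,
# minimized in closed form: (X^T X + reg) p = X^T y. Data is invented.
x = np.array([10, 25, 40, 55, 70, 90, 110, 150, 200, 250], dtype=float)
y = np.array([30, 60, 95, 130, 160, 210, 250, 340, 450, 560], dtype=float)

for lam in (0.0, 1e3, 1e5):
    X = np.column_stack([x, np.ones_like(x)])  # columns: [x, 1] for [w, b]
    reg = np.diag([lam, 0.0])                  # bias b is not regularized
    w, b = np.linalg.solve(X.T @ X + reg, X.T @ y)
    train_err = np.sum((y - (b + w * x)) ** 2)
    print(lam, round(w, 3), round(b, 2), round(train_err, 1))
```

As $\lambda$ grows, the fitted weight $w$ shrinks and the training error rises, illustrating the trade-off described above.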