Course 3: Regression
Example
Stock market forecast: input data about a company, and regress an output such as the value of the Dow Jones Industrial Average tomorrow.
Self-driving car: input sensor data, and output the steering direction.
Example Application
Estimate a Pokémon’s Combat Power value after evolution
Input: a Pokémon, where $x_{cp}$ is its combat power before evolution, $x_s$ its species, $x_{hp}$ its hit points, $x_w$ its weight, and $x_h$ its height.
Output: $y$, the Combat Power after evolution.
Step 1: Model
Find a model from a set of functions.
Suppose we choose a linear model:
$$y = b + w \times x_{cp}$$
where $w$ and $b$ are parameters.
In general, we can write such a function as:
$$y = b + \sum_i w_i \times x_i$$
where $x_i$ is an attribute of the input $x$ (called a feature), $b$ is called the bias, and $w_i$ is called a weight.
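As a minimal sketch of this model (the helper name `predict` and all feature and parameter values are placeholders for illustration, not course data):

```python
# Minimal sketch of the linear model y = b + sum_i w_i * x_i.
def predict(x, w, b):
    """x: feature values x_i, w: weights w_i, b: bias."""
    return b + sum(wi * xi for wi, xi in zip(w, x))

# e.g. a single feature x_cp with hypothetical parameters w = 0.9, b = 10
y = predict([250.0], [0.9], 10.0)
```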
Step 2: Goodness of Function
Use $x^1$ to denote a complete input example and $\hat y^1$ the corresponding observed output.
Collect many such inputs and outputs in pairs
$$(x^n, \hat y^n)$$
which can be plotted in a graph.
With all the training data, we can define the goodness of a function using a loss function:
Loss function:
Input: a function
Output: how bad it is, called the estimated error
For the model $y = b + w \times x_{cp}$ with 10 training examples:
$$L(f) = L(w, b) = \sum_{n=1}^{10} \left(\hat y^n - (b + w \times x_{cp}^n)\right)^2$$
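As a sketch (the function name and arguments are assumptions, not the course's notation):

```python
# Minimal sketch: squared-error loss of a candidate (w, b) on training pairs.
def loss(w, b, x_cp, y_hat):
    # L(w, b) = sum_n (y_hat^n - (b + w * x_cp^n))^2
    return sum((yn - (b + w * xn)) ** 2 for xn, yn in zip(x_cp, y_hat))
```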
Step 3: Best Function
Choose the best function from the set via the loss function:
$$f^* = \arg\min_f L(f)$$
$$w^*, b^* = \arg\min_{w,b} L(w,b)$$
which means choosing the $w$, $b$, and $f$ that minimize $L(f)$ and $L(w,b)$.
Method: Gradient Descent
Consider a loss function $L(w)$ with only one parameter $w$:
Randomly choose an initial value $w^0$.
Compute
$$\left.\frac{dL}{dw}\right|_{w=w^0}$$
If the derivative is negative, increase $w$; if positive, decrease $w$.
The next value $w^1$ is:
$$w^1 = w^0 - \eta \left.\frac{dL}{dw}\right|_{w=w^0}$$
where $\eta$ is the learning rate.
Repeat the update above until the process converges; in general this reaches a local optimum, not necessarily the global optimum.
With two parameters $w$ and $b$, compute both partial derivatives and update the parameters together:
$$w^1 = w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0, b=b^0}, \qquad b^1 = b^0 - \eta \left.\frac{\partial L}{\partial b}\right|_{w=w^0, b=b^0}$$
PS: in the linear case the loss is convex, so gradient descent has no local-optimum problem.
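A minimal sketch of this two-parameter update (the toy data, learning rate, and iteration count are all assumptions for illustration, not course data):

```python
# Gradient descent on L(w, b) = sum_n (y^n - (b + w * x^n))^2.
# Toy data consistent with y = 1 + 2x.
x = [1.0, 2.0, 3.0]
y = [3.0, 5.0, 7.0]

w, b = 0.0, 0.0   # initial values w^0, b^0
eta = 0.01        # learning rate (assumed; needs tuning in practice)

for step in range(10000):
    # partial derivatives of the squared-error loss
    grad_w = sum(2 * (yn - (b + w * xn)) * (-xn) for xn, yn in zip(x, y))
    grad_b = sum(2 * (yn - (b + w * xn)) * (-1.0) for xn, yn in zip(x, y))
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)  # approaches w = 2, b = 1
```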
Generalization
What we really care about is the error on new data: choose another 10 Pokémon as test data and compute the error on them.
Another Model
$$y = b + w_1 \times x_{cp} + w_2 \times (x_{cp})^2$$
The same gradient descent method is used to find the best parameters of this model.
Other, higher-order models can also be tried, such as:
$$y = b + w_1 \times x_{cp} + w_2 \times (x_{cp})^2 + w_3 \times (x_{cp})^3$$
A more complex model achieves a lower (or equal) training error, but may result in a larger testing error.
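One way to see these as the same kind of model is to treat the powers of $x_{cp}$ as extra features; a sketch (the helper names are assumptions):

```python
# Sketch: a degree-d model as a linear model over polynomial features of x_cp.
def poly_features(x_cp, degree):
    # [x_cp, x_cp^2, ..., x_cp^degree]
    return [x_cp ** k for k in range(1, degree + 1)]

def predict_poly(x_cp, ws, b):
    # y = b + w_1*x_cp + w_2*x_cp^2 + ... + w_d*x_cp^d
    return b + sum(wk * fk for wk, fk in zip(ws, poly_features(x_cp, len(ws))))
```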
Some other factors
Considering that a Pokémon's species may have an influence, the model can be redesigned based on that:
Choose a different linear function for each species.
E.g., let $x_s$ denote the species:
$$\text{if } x_s = \text{Pidgey}: \quad y = b_1 + w_1 \times x_{cp}$$
$$\text{if } x_s = \text{Weedle}: \quad y = b_2 + w_2 \times x_{cp}$$
All of the above can be combined into a single linear function using indicator features:
$$y = \sum_i \delta(x_s = \text{species}_i)\,(b_i + w_i \times x_{cp})$$
where $\delta(\cdot)$ equals 1 when the condition holds and 0 otherwise.
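A minimal sketch of this indicator formulation (species names and parameter values are placeholders):

```python
# Sketch: species-conditional model written as one linear function
# via indicator features delta(x_s = species_i).
def delta(x_s, species):
    # indicator: 1 if the species matches, else 0
    return 1.0 if x_s == species else 0.0

def predict_by_species(x_s, x_cp, params):
    # params maps each species to its own (b_i, w_i)
    return sum(delta(x_s, s) * (b_i + w_i * x_cp)
               for s, (b_i, w_i) in params.items())

y = predict_by_species("Pidgey", 100.0,
                       {"Pidgey": (5.0, 2.0), "Weedle": (3.0, 1.5)})
```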
More factors like weight, height, and others could also be taken into consideration. This would probably lead to a lower training error but a higher testing error, since overfitting can happen.
To avoid overfitting, a strategy called regularization can be applied to the model.
E.g.
$$L = \sum_n \left(\hat y^n - \left(b + \sum_i w_i \times x_i\right)\right)^2 + \lambda \sum_i (w_i)^2$$
Here the term
$$\lambda \sum_i (w_i)^2$$
penalizes large weights. Smaller weights make the function smoother, i.e. less sensitive to noise in the input, so a function for which this term is smaller is often better; $\lambda$ is a hyperparameter.
A larger $\lambda$ puts more emphasis on keeping the weights $w_i$ small and less on the difference between the outputs and the training targets, which means considering the training error less.
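A sketch of this regularized loss (the function name and the default value of $\lambda$ are assumptions):

```python
# Sketch: squared-error loss plus L2 regularization lambda * sum_i w_i^2.
def regularized_loss(ws, b, xs, ys, lam=0.1):
    # error term: sum_n (y_hat^n - (b + sum_i w_i * x_i^n))^2
    err = sum((yn - (b + sum(wi * xi for wi, xi in zip(ws, xn)))) ** 2
              for xn, yn in zip(xs, ys))
    # regularization term; note the bias b is not penalized
    return err + lam * sum(wi ** 2 for wi in ws)
```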