
Ridge and Lasso Regression

import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
from sklearn.linear_model import LinearRegression
rcParams['figure.figsize'] = 12, 10

# sample points between 60 and 300 degrees, converted to radians
x = np.array([i*np.pi/180 for i in range(60, 300, 4)])
np.random.seed(0)

# sine curve plus Gaussian noise as the target
y = np.sin(x)+np.random.normal(0,0.15, len(x))

data = pd.DataFrame(np.column_stack([x,y]), columns=['x', 'y'])
#plt.plot(data['x'], data['y'], '.')

# add polynomial features x^2 through x^15 as extra predictors
for i in range(2,16):
    colname='x_%d'%i
    data[colname]=data['x']**i

def linear_regression(data, power, models_to_plot):
    # fit a polynomial model of degree `power` and return [rss, intercept, coefficients]
    predictors=['x']
    
    if power >= 2:
        predictors.extend(['x_%d' % i for i in range(2, power+1)])
    
    # note: `normalize` was removed in scikit-learn 1.2; drop it (or scale the
    # features beforehand) when running this on a newer version
    linreg = LinearRegression(normalize=True)
    linreg.fit(data[predictors], data['y'])
    y_pred = linreg.predict(data[predictors])
    
    if power in models_to_plot:
        plt.subplot(models_to_plot[power])
        plt.tight_layout()
        plt.plot(data['x'], y_pred)
        plt.plot(data['x'], data['y'], '.')
        plt.title('Plot for power: %d'%power)
        
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([linreg.intercept_])
    ret.extend(linreg.coef_)
    return ret
    
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['model_pow_%d'%i for i in range(1,16)]
coef_matrix_simple = pd.DataFrame(index=ind, columns=col)
models_to_plot = {1:231,3:232,6:233,9:234,12:235,15:236}

# fit models of increasing degree; six of them are also plotted
for i in range(1, 16):
    coef_matrix_simple.iloc[i-1, 0:i+2] = linear_regression(data, power=i, models_to_plot=models_to_plot)

pd.options.display.float_format='{:,.2g}'.format

print(coef_matrix_simple)
           
[Figure: simple linear regression fits for polynomial degrees 1, 3, 6, 9, 12 and 15]

It is clearly evident that the size of the coefficients increases exponentially with increasing model complexity. I hope this gives some intuition into why putting a constraint on the magnitude of the coefficients can be a good idea for reducing model complexity.

Let's try to understand this even better.

What does a large coefficient signify? It means that we are putting a lot of emphasis on that feature, i.e. the particular feature is a good predictor for the outcome. When it becomes too large, the algorithm starts modelling intricate relations to estimate the output and ends up overfitting the particular training data.
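This is exactly what ridge and lasso do: they keep the least-squares loss but add a penalty on coefficient size, and only the form of the penalty differs. A minimal sketch of the two objectives (the function names here are illustrative, not part of the code above):

import numpy as np

def ridge_penalized_rss(y_true, y_pred, coefs, alpha):
    # least-squares loss plus an L2 penalty on coefficient magnitudes
    return np.sum((y_true - y_pred)**2) + alpha * np.sum(coefs**2)

def lasso_penalized_rss(y_true, y_pred, coefs, alpha):
    # least-squares loss plus an L1 penalty on coefficient magnitudes
    return np.sum((y_true - y_pred)**2) + alpha * np.sum(np.abs(coefs))

scikit-learn scales these terms slightly differently (Lasso divides the squared-error term by 2*n_samples), but the idea is the same: the larger alpha is, the more heavily large coefficients are punished.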

Ridge Regression:

Key parameter: alpha (the regularization strength).

I hope this gives some sense of how alpha impacts the magnitude of the coefficients. One thing is for sure: any non-zero value of alpha gives coefficient magnitudes smaller than those of simple linear regression.

Keep in mind that normalizing the inputs is generally a good idea in every type of regression, and it should be used in the case of ridge regression as well.
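The code below uses the old normalize=True argument, which was removed in scikit-learn 1.2. On newer versions the recommended approach is to scale the features explicitly, for example in a Pipeline; a minimal sketch (note that StandardScaler standardizes by the standard deviation, so the exact coefficient values will differ from the old normalize behaviour):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# standardize the polynomial features, then fit ridge on the scaled data
model = make_pipeline(StandardScaler(), Ridge(alpha=0.01))
# model.fit(data[predictors], data['y'])
# coefficients (for the scaled features) live on the final step: model[-1].coef_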

The ridge experiment below reuses the data and polynomial features built above; the only additional import needed is Ridge:

from sklearn.linear_model import Ridge

def ridge_regression(data, predictors, alpha, models_to_plot={}):
    # fit a ridge model for the given alpha and return [rss, intercept, coefficients]
    # note: `normalize` was removed in scikit-learn 1.2; with newer versions,
    # scale the features with a StandardScaler in a Pipeline instead
    ridgereg = Ridge(alpha=alpha, normalize=True)
    ridgereg.fit(data[predictors], data['y'])
    y_pred = ridgereg.predict(data[predictors])
    
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'], y_pred)
        plt.plot(data['x'], data['y'], '.')
        plt.title('Plot for alpha: %.3g'%alpha)
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([ridgereg.intercept_])
    ret.extend(ridgereg.coef_)
    return ret
    

predictors=['x']
predictors.extend(['x_%d'%i for i in range(2,16)])
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_ridge[i] for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)
models_to_plot = {1e-15:231, 1e-10:232, 1e-4:233, 1e-3:234, 1e-2:235, 5:236}

# fit a ridge model for each alpha; six of them are also plotted
for i in range(10):
    coef_matrix_ridge.iloc[i, :] = ridge_regression(data, predictors, alpha_ridge[i], models_to_plot)
           
[Figure: ridge regression fits for six of the ten alpha values]

Here we can clearly observe that as the value of alpha increases, the model complexity reduces.

Let's have a look at the values of the coefficients in the above models:
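The table below can be reproduced by printing the coefficient matrix with the same compact float formatting used earlier:

pd.options.display.float_format = '{:,.2g}'.format
print(coef_matrix_ridge)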

[Table: RSS, intercept and coefficients for each ridge alpha]

This straight away gives us the following inferences:

1. The RSS increases with increasing alpha, as model complexity reduces.

2. An alpha as small as 1e-15 gives a significant reduction in the magnitude of the coefficients. How? Compare the coefficients in the first row of this table with the last row of the simple linear regression table.

3. High alpha values can lead to significant underfitting. Note the rapid increase in RSS for values of alpha greater than 1 (a way to choose alpha is sketched after this list).

4. Though the coefficients are very small, they are NOT zero.
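Since a tiny alpha barely regularizes and a large one underfits, in practice alpha is usually chosen by cross-validation rather than read off a table. A minimal sketch with scikit-learn's RidgeCV (the alpha grid is illustrative, and ideally the features would be scaled first):

from sklearn.linear_model import RidgeCV

# try a grid of penalties and keep the one with the best cross-validated score
ridge_cv = RidgeCV(alphas=[1e-4, 1e-3, 1e-2, 1e-1, 1, 5, 10, 20], cv=5)
ridge_cv.fit(data[predictors], data['y'])
print(ridge_cv.alpha_)   # selected regularization strength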

Lasso:

from sklearn.linear_model import Lasso

def lasso_regression(data, predictors, alpha, models_to_plot={}):
    # fit a lasso model for the given alpha and return [rss, intercept, coefficients]
    # note: `normalize` was removed in scikit-learn 1.2; max_iter must be an integer
    lassoreg = Lasso(alpha=alpha, normalize=True, max_iter=100000)
    lassoreg.fit(data[predictors],data['y'])
    y_pred = lassoreg.predict(data[predictors])

    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for alpha: %.3g'%alpha)
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([lassoreg.intercept_])
    ret.extend(lassoreg.coef_)
    return ret
           

Notice the additional parameter defined in the lasso function: 'max_iter'. This is the maximum number of iterations the solver will run for if it doesn't converge earlier. The same parameter exists for Ridge as well, but here it had to be set higher than the default.
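The plots and table that follow come from driving this helper over a range of alpha values, just like the ridge loop above; a sketch of that loop (the alpha values and subplot mapping are illustrative choices, not taken from this post):

alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10]
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_lasso[i] for i in range(0,10)]
coef_matrix_lasso = pd.DataFrame(index=ind, columns=col)

# plot six of the ten fits, including alpha=1 (the flat line discussed below)
models_to_plot = {1e-10:231, 1e-5:232, 1e-4:233, 1e-3:234, 1e-2:235, 1:236}

for i in range(10):
    coef_matrix_lasso.iloc[i, :] = lasso_regression(data, predictors, alpha_lasso[i], models_to_plot)

print(coef_matrix_lasso)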

[Figure: lasso regression fits for increasing values of alpha]

This again tells us that model complexity decreases as alpha increases. But notice the straight line at alpha = 1.

[Table: RSS, intercept and coefficients for each lasso alpha]

Apart from the expected inference of higher RSS for higher values of alpha, we can see the following:

1. For the same values of alpha, the coefficients of lasso regression are much smaller than those of ridge regression (compare row 1 of the two tables).

2. For the same alpha, lasso has a higher RSS (a poorer fit) than ridge regression.

3. Many of the coefficients are exactly zero even for very small values of alpha (a quick check is sketched below).
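To make the sparsity explicit, count the exactly-zero coefficients per model. This assumes the coef_matrix_lasso layout sketched above, with rss and intercept in the first two columns and the 15 coefficients after them:

# number of zero coefficients in each of the ten lasso models
print(coef_matrix_lasso.apply(lambda row: (row.iloc[2:] == 0).sum(), axis=1))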
