Regularization in Neural Network

In the earlier era of machine learning, there used to be a lot of discussion on the bias-variance trade-off and the reason for that was we could increase bias and reduce variance, or reduce bias and increase variance. But back in the pre-deep learning era, we didn’t have many tools that just reduce bias or just reduce variance without hurting the other one. 

But in the modern deep learning, neural network or big data era getting a bigger network almost always just reduces bias without necessarily hurting variance, so long as we regularize appropriately. Also, getting more data always reduces variance and doesn’t hurt bias much. But we can’t always get more training data, or it could be expensive. Adding regularization will often help to prevent overfitting, or to reduce the errors in our network.

So generally, high complexity models overfit the data having very low bias but high variance and low complexity models underfit the data having a high bias, but low variance. We want to balance the trade-off between bias and variance to get to that spot of having good predictive performance. In this tutorial, we will show how we can use regularization to automatically balance between bias and variance. We will show concepts like ridge and lasso regression and we will develop these ideas using logistic regression.

Regularization

We use regularization to prevent overfitting how do we know that a model is overfitting? We have to find a quantitative measure that’s indicative of when a model is overfitting. Consider two models

$$\begin{aligned}\hat{y} &=x_1+x_2 \\ \hat{y}&=10x_1+10x_2 \end{aligned}$$

Our task is to separate two points (1,1) and (-1,-1) with a line using these models. We will use sigmoid activation function \(\). In the first model:

$$\begin{aligned} \hat{y}&= \sigma(w_1x_1 + w_2x_2 + b) \\\hat{y}&=\sigma(1+1) = \sigma(2)=0.880797 \\ \hat{y}&=\sigma(−1–1)= \sigma(−2)=0.1192\end{aligned}$$

Now in the second model, for (1,1) and (-1,-1):

$$\begin{aligned}\hat{y} &=\sigma(10(1)+10(1)) =\sigma(20)=0.99999 \\\hat{y}&=\sigma(-10-10) =\sigma(−20)=0.000000002 \end{aligned}$$

Looking for the first time at the probabilities one might think that since the second model is giving us better probabilities it is better model. But that’s not the case since second model is actually overfitting.

When we use sigmoid to the small values we get the first model which has a nice slope for the gradient descent. But when we multiply the linear function \(x_1 + x_2\) with 10 and take \(\sigma(10x_1 + 10x_2)\) our predictions are much better since we are closer to 0 and 1 but the function becomes much steeper and much harder to do gradient descent here since the derivatives are mostly close to zero and they are very large when we go to the middle of the curve. Therefore in order to do gradient descent we prefer model 1 more than the model 2. In a conceptual way the second model is too certain and it gives little room for applying gradient descent.

So, when models become overfit the estimated coefficients of these models tend to become really large in magnitude. Ridge and lasso regression just quantify overfitting through this measure of the magnitude of the coefficients and we will do this by modifying our cost function.

Lasso Regression or L1 Norm

So, in our desired total cost format we want to balance:

  1. How well function fits data
  2. Magnitude of coefficients

We have seen the magnitude of our estimated coefficients as an indicator of overfitting. So we can write down a total cost that has these two terms:

$$ \textbf{Total Cost} = \textbf{Measure of Fit} + \textbf{Measure of Magnitude of Coefficients}$$

where total cost is our new measure of the quality of the fit, and when we say measure of fit here, it means a small number indicates there’s a good fit to the data. On the other hand, if the measure of the magnitude of the coefficients is small this indicates the size of the coefficients is small and we’re unlikely to be in the setting of a very overfit model. So, what’s our measure of fit? For logistic regression, we try to minimize the cost function which is defined as: $$\text{Cross-Entropy} = – \frac{1}{m}\sum_{i=1}^m y_{i}ln(p_i) + (1- y_i)ln(1-p_i)$$ Now we just need a measure of the magnitude of the coefficients which will tell us that the coeeficients are big. One way is just summing all the coefficients together. $$ \textbf{Magnitude of Coefficients} = \sum_{i = 1}^n w_i$$ Problem is if we take two coefficients \(w_1 = 1,527,301\) and \(w_2 = -1,605,253\) in our model and if we look at \(w_1 + w_2\), this is going to be a small number despite the fact that each of the coefficients themselves was quite large. We can fix this issue by taking the absolute value: $$ \textbf{Magnitude of Coefficients} = \mid w_1\mid + … + \mid w_i\mid = \sum_{i = 1}^n \mid w_i\mid$$ This is one of the well known solutions and defined as the lasso regression or \(L_1\) norm. So let’s consider the resulting objective, where we’re gonna try and search over all possible w vectors. To find the ones that minimize our cost function plus the \(L_1\) norm. So that’s gonna be our \(\hat{w}\), our estimated model parameters. But we’d like to be able to control how much we’re weighing the complexity of the model as measured by this magnitude of our coefficient, relative to the fit of the model. We’d like to balance between these two terms, and so we’re gonna introduce another parameter \(\lambda\). This is called a tuning parameter and this is balancing between this fit and magnitude. So, our cost function becomes $$\text{ERROR FUNCTION} = \frac{1}{m} \left(- \sum_{i=1}^m y_{i}ln(p_i) + (1- y_i)ln(1-p_i) + \lambda ( \mid w_1\mid + … + \mid w_n\mid ) \right) $$

Ridge Regression or L2 Norm

Another choice is to consider the sum of the squares of the coefficients $$ \textbf{Magnitude of Coefficients} = w_1^2 + … + w_i^2 = \sum_{i = 1}^n w_i^2$$ which is called the ridge regression or \(L_2\) norm. $$ \text{ERROR FUNCTION} = \frac{1}{m} \left(- \sum_{i=1}^m y_{i}ln(p_i) + (1- y_i)ln(1-p_i) + \lambda \sum_{i = 1}^n w_i^2 \right)$$

Now, there are many ways to implement ridge regression. We will just add \(L_2\) norm after our cost function. Here we will use \(\frac{\lambda}{2}\) instead of \(\lambda\) since it will be easier to implement. $$ \text{ERROR FUNCTION} = \frac{1}{m} \left(-\sum_{i=1}^m y_{i}ln(p_i) + (1- y_i)ln(1-p_i) + \frac{\lambda}{2} \sum_{i = 1}^n w_i^2\right) $$

Gradient Descent

So, how do we implement Gradient descent in regularization? Previously without reguliraztion our gradient descent was like: $$\begin{aligned}w_{new} &= w – \alpha * dw \\ dw &= \frac{dE}{dw} \end{aligned}$$ For \(L_1\) regularization the problem is the gradient of the norm does not exist at 0, so you need to be careful. As for the regularization term, if \(w_i>0\) then \(\mid w_i\mid = w_i\) and the gradient is \(+1\), similarly when \(w_i<0\)  the gradient is \(−1\), so in summary $$\begin{aligned}w_{new} &= w – \alpha * dw \\ &= w – \alpha * (\frac{dE}{dw} + \frac{\lambda}{m}\frac{d\mid w\mid }{dw}) \\ &=\left\{ \begin{array}{c l} w – \alpha * (\frac{dE}{dw} + \frac{\lambda}{m}) & w>0\\ w – \alpha * (\frac{dE}{dw} – \frac{\lambda}{m}) & w<0 \end{array}\right. \end{aligned} $$

For \(L_2\) regularization $$dw = \frac{dE}{dw} + \frac{\lambda}{m}w $$ And then we just compute this update, same as before $$\begin{aligned}w_{new} &= w – \alpha * dw \\ &= w – \alpha * (\frac{dE}{dw} + \frac{\lambda}{m}w) \\ &= (1 – \frac{\alpha \lambda}{m}) w – \alpha \frac{dE}{dw} \end{aligned} $$

Balance Bias variance Trade-off

In the above equation, we see that if we take \(\lambda = 0\), we get the previous cost function without regularization. Now, let’s take \(\lambda = \infty\) and we have a massively large weight on this magnitude term. So, what happens to any solution where \(\hat{w} \neq 0\)? For solutions where \(\hat{w} \neq 0\), the total cost is \(\infty\). But for sure, if \( \hat{w} = 0\), then the total cost is minimum. So, the minimizing solution here is always \( \hat{w} = 0\) because that’s gonna minimize the total cost over all possible \(w\). So, we will be operating in a regime where \(\lambda\) is in between \(0\) to \(\infty\).

Now, let’s talk about this in the context of the bias-variance trade-off. And what we saw is when we had very large lambda, we had a solution with very high bias, but low variance and one way to see this is that, when we’re cranking lambda all the way up to infinity, in that limit, we get coefficients shrunk to be zero, and clearly that’s a model with high bias but low variance. On the other hand, when we had very small lambda, we have a model that is low bias, but high variance. And to see this think about setting lambda to zero, in which case, we just get our old solution. And there we see that for higher complexity models clearly you’re gonna have low bias but high variance. So what we see is this lambda tuning parameter controls our model complexity and controls this bias-variance trade-off.