To build an accurate machine learning model, we need a proper understanding of its error. A model’s predictions have three sources of error: noise, bias, and variance. Understanding these three sources helps us build accurate models and avoid the mistakes of overfitting and underfitting.

In this tutorial, our case study is predicting house prices. We have a dataset of house prices, each paired with the square footage of the house.

## Noise

The data we use in machine learning are inherently noisy. The way the world works is that there’s some true relationship between $$x$$, the number of square feet, and $$y$$, the house value. We represent that relationship by $$f_{w(true)}$$. But of course, $$f_{w(true)}$$ is not a perfect description of how $$y$$ depends on $$x$$.

There are a lot of other contributing factors: other attributes of the house that square feet alone doesn’t capture, how a person feels when they walk in to buy the house, a personal relationship they might have with the owners, and lots of other things that we can’t ever perfectly capture with just some function between square feet and value. That is the noise inherent in this process, and we represent it by an epsilon term.

So in particular, any observation $$y_i$$ is the sum of this relationship between the square feet and the value plus a noise term $$\epsilon_i$$ specific to the ith house: $$y_i = f_{w(true)}(x_i) + \epsilon_i$$. We assume that this noise has zero mean, because if it didn’t, the nonzero mean could be shoved into the function $$f_{w(true)}$$ instead. This noise is just a property of the data. We don’t have control over it. It has nothing to do with our model or our estimation procedure; it’s just something that we have to deal with. That is why it is called irreducible error: we can’t reduce it by choosing a better model or a better estimation procedure. The things we can control are bias and variance, so we’ll discuss these two terms next.
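To make the noise model concrete, here is a minimal simulation sketch. The function `true_price`, its coefficients, the price units (thousands of dollars), and the noise standard deviation of 40 are all made-up assumptions for illustration; they do not come from a real dataset.

```python
import numpy as np

def true_price(sqft):
    """Hypothetical stand-in for f_w(true): price in $1000s from square feet."""
    return 50.0 + 0.15 * sqft - 0.00002 * sqft ** 2

def sample_houses(n, noise_std=40.0, rng=None):
    """Draw n houses with y_i = f(x_i) + eps_i, where eps_i is zero-mean Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    sqft = rng.uniform(500.0, 3500.0, size=n)
    eps = rng.normal(0.0, noise_std, size=n)
    return sqft, true_price(sqft) + eps

rng = np.random.default_rng(0)
sqft, price = sample_houses(10_000, rng=rng)

# Even if we knew the true function exactly, this residual would remain.
residual = price - true_price(sqft)
print(f"mean of noise: {residual.mean():.2f}")
print(f"variance of noise (irreducible error): {residual.var():.1f}")
```

The mean of the residual should land near zero and its variance near 40 squared, which is the error that remains no matter which model we fit.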

## Bias

Suppose we have a dataset that’s just a random snapshot of N houses that were sold and recorded. Based on that data set, we fit some function, e.g. a constant function.

But what if another set of N houses had been sold? Then we would have had a different data set, and when we fit our model to it, we would have gotten a different line.

So for one data set of size N we get one fit, and for another data set we get a different fit. Of course, there’s a continuum of possible fits we might have gotten. Over all those possible fits, the dashed green line below represents our average fit: the average of all those fits, weighted by how likely each was to have appeared.

Now bias is the difference between this average fit, $$f_{\bar{w}}$$, and the true function, $$f_{w(true)}$$: $$Bias(x) = f_{w(true)}(x) - f_{\bar{w}}(x)$$. The gray shaded region above is that difference between the true function and our average fit. Intuitively, bias asks: is our model flexible enough to capture, on average, the true relationship between square feet and house value? What we see is that this very simple, low-complexity constant model has high bias; it’s not flexible enough to give a good approximation to the true relationship. Conversely, a high-complexity model has low bias.
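The bias of the constant model can be approximated by simulation. The sketch below assumes a made-up true function (`true_price`) and Gaussian noise, neither of which comes from the original data; it fits a constant to many simulated data sets, averages the fits, and subtracts that average from the true function.

```python
import numpy as np

def true_price(sqft):
    """Hypothetical stand-in for f_w(true): price in $1000s from square feet."""
    return 50.0 + 0.15 * sqft - 0.00002 * sqft ** 2

rng = np.random.default_rng(1)
n_datasets, n_houses, noise_std = 2000, 30, 40.0   # assumed simulation settings
grid = np.linspace(500.0, 3500.0, 7)

# The least-squares constant fit is just the mean price of each data set.
constant_fits = np.empty(n_datasets)
for d in range(n_datasets):
    sqft = rng.uniform(500.0, 3500.0, size=n_houses)
    price = true_price(sqft) + rng.normal(0.0, noise_std, size=n_houses)
    constant_fits[d] = price.mean()

average_fit = constant_fits.mean()        # the average fit, identical at every x
bias = true_price(grid) - average_fit     # Bias(x) = f_true(x) - f_avg(x)

for x, b in zip(grid, bias):
    print(f"sqft={x:6.0f}  bias={b:8.1f}")
```

Under these assumptions the bias is strongly negative for small houses and positive for large ones: the constant fit overprices the former and underprices the latter, however many data sets we average over.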

## Variance

Variance is how much the specific fits to a given data set can differ from one another as we look at different possible data sets. In this case, when we look at just a simple constant model, the resulting fits don’t vary too much. When we look over the space of all possible observations, the fits are fairly similar and stable. So the variation in these fits, drawn with grey bars above, is small.

So, for this low complexity model, we see that there’s low variance. To summarize, the variance is how much fits can vary. But if they could vary dramatically from one data set to the other, then we would have very erratic predictions. The predictions would just be sensitive to what data set we got. So, that would be a source of error in our predictions. And to see this, we can start looking at high-complexity models. So in particular, let’s look at this data set again. Now, let’s fit some high-order polynomial to it and when we think about looking over all possible data sets we might get, we might get some crazy set of curves. So, a high complexity model has a high variance.
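This contrast can be sketched with a simulation. The true function, noise level, data set size, and the choice of a degree-9 polynomial as the "high-order" model are all assumptions made for illustration. Each model is fit to many simulated data sets, and we measure how much its predictions spread out at a set of fixed square footages.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_price(sqft):
    """Hypothetical stand-in for f_w(true): price in $1000s from square feet."""
    return 50.0 + 0.15 * sqft - 0.00002 * sqft ** 2

def fits_on_grid(degree, grid, rng, n_datasets=500, n_houses=30, noise_std=40.0):
    """Fit a polynomial of the given degree to many data sets; return predictions on grid."""
    preds = np.empty((n_datasets, grid.size))
    for d in range(n_datasets):
        sqft = rng.uniform(500.0, 3500.0, size=n_houses)
        price = true_price(sqft) + rng.normal(0.0, noise_std, size=n_houses)
        # A fixed domain keeps the internal rescaling identical across data sets.
        model = Polynomial.fit(sqft, price, degree, domain=[500.0, 3500.0])
        preds[d] = model(grid)
    return preds

rng = np.random.default_rng(2)
grid = np.linspace(600.0, 3400.0, 25)

# Variance across data sets at each grid point, then averaged over the grid.
var_constant = fits_on_grid(0, grid, rng).var(axis=0).mean()
var_degree9 = fits_on_grid(9, grid, rng).var(axis=0).mean()

print(f"constant model variance: {var_constant:10.1f}")
print(f"degree-9 model variance: {var_degree9:10.1f}")
```

With these settings the degree-9 fits should swing far more from one data set to the next than the constant fits do.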

Finally, we define the error term as: $$\begin{aligned}\text{Error} &= \text{Reducible Error} + \text{Irreducible Error} \\ &= \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\end{aligned}$$

Here, we’re going to plot bias and variance as a function of model complexity. As we discussed previously, bias decreases as model complexity increases, because we can better and better approximate the true relationship between $$x$$ and $$y$$. On the other hand, variance increases. So a very simple model has very low variance, and high-complexity models have high variance.

What we see is a natural tradeoff between bias and variance, called the bias-variance tradeoff. One way to summarize it is the mean squared error of our fit relative to the true function, which is simply the sum of bias squared and variance. Machine learning is all about this bias-variance tradeoff, and we’re going to see it again and again. The goal is to find the optimal spot: the model complexity where the combined contribution of bias and variance to our prediction error is at its minimum.
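The tradeoff can be traced numerically by sweeping model complexity. In this sketch the true function, the noise level, and the polynomial degrees are illustrative assumptions. To keep it short, the square footages are held fixed and only the noise is redrawn for each data set, which is a simplification of "another set of N houses".

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_price(sqft):
    """Hypothetical stand-in for f_w(true): price in $1000s from square feet."""
    return 50.0 + 0.15 * sqft - 0.00002 * sqft ** 2

rng = np.random.default_rng(3)
sqft = np.linspace(500.0, 3500.0, 30)     # fixed design (simplifying assumption)
f_true = true_price(sqft)
n_datasets, noise_std = 500, 40.0
degrees = [0, 1, 2, 4, 8]

bias_sq, variance, mse = [], [], []
for degree in degrees:
    preds = np.empty((n_datasets, sqft.size))
    for d in range(n_datasets):
        price = f_true + rng.normal(0.0, noise_std, size=sqft.size)
        preds[d] = Polynomial.fit(sqft, price, degree)(sqft)
    # Average each quantity over the square footages.
    bias_sq.append(np.mean((f_true - preds.mean(axis=0)) ** 2))
    variance.append(np.mean(preds.var(axis=0)))
    mse.append(np.mean((preds - f_true) ** 2))   # error relative to the true function

for row in zip(degrees, bias_sq, variance, mse):
    print("degree=%d  bias^2=%8.1f  variance=%7.1f  mse=%8.1f" % row)

best_degree = degrees[int(np.argmin(mse))]
print("lowest mse at degree", best_degree)
```

In every row, bias squared plus variance equals the mean squared error. Because the assumed true function is quadratic, the minimum should sit at degree 2: lower degrees pay in bias, higher degrees pay in variance.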

In the earlier era of machine learning, there used to be a lot of discussion of the bias-variance tradeoff. The reason was that we could increase bias and reduce variance, or reduce bias and increase variance, but in the pre-deep-learning era we didn’t have as many tools that reduce just bias or just variance without hurting the other.

In the modern deep learning and big data era, things are different, provided we can keep training a bigger network and keep getting more data (which isn’t always the case for either). When we can, getting a bigger network almost always reduces bias without necessarily hurting variance, so long as we regularize appropriately. And getting more data pretty much always reduces variance and doesn’t hurt bias much.

## How to calculate Bias-Variance?

We defined bias explicitly relative to the true function, and we don’t know the true function. Both bias and variance also require averaging over all possible data sets of size N that we could have gotten from the world, and we just don’t know what those are. So we can’t compute bias and variance exactly. But there are practical ways to optimize the tradeoff between them. For example, if we underfit the data, this implies we have high bias, and if we overfit the data, this implies we have high variance.

So for the sake of argument, let’s say that we’re recognizing cats in pictures, which is something that people can do nearly perfectly.

Here, we can see that when we have a small training set error and a relatively large dev set error, we have probably overfitted the training data and are not generalizing well. So we have high variance. Similarly, a large training set error means we have underfitted the data, which implies high bias.
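This rule of thumb can be written as a small helper. The function name `diagnose`, the 2% tolerance, and the example error rates are all hypothetical choices for illustration. `base_error` stands for the best achievable error, assumed to be about zero here because people recognize cats nearly perfectly.

```python
def diagnose(train_error, dev_error, base_error=0.0, tolerance=0.02):
    """Rough bias/variance diagnosis from error rates given as fractions in [0, 1].

    The tolerance is an arbitrary illustrative threshold, not a standard value.
    """
    findings = []
    if train_error - base_error > tolerance:   # underfitting the training set
        findings.append("high bias")
    if dev_error - train_error > tolerance:    # not generalizing to the dev set
        findings.append("high variance")
    return " and ".join(findings) if findings else "low bias and low variance"

# Illustrative (train error, dev error) pairs, not measured results.
examples = [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30), (0.005, 0.01)]
for train_error, dev_error in examples:
    print(f"train={train_error:.3f} dev={dev_error:.3f} -> {diagnose(train_error, dev_error)}")
```

If the best achievable error were not close to zero, for example with blurry images, passing a larger `base_error` would keep an unavoidable training error from being read as high bias.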

To fix the high bias problem we can do the following:

• Use a bigger network
• Train longer
• Find a better-suited neural network (NN) architecture

To fix the high variance problem we can do the following:

• Use more data
• Regularization
• Find a better NN architecture
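Two of the variance fixes, more data and regularization, can be illustrated on the house price setting without a neural network. The sketch below is a stand-in under stated assumptions: a made-up true function, a degree-9 polynomial as the high-variance model, and ridge (L2) regularization with an arbitrary strength of 1.0.

```python
import numpy as np

def true_price(sqft):
    """Hypothetical stand-in for f_w(true): price in $1000s from square feet."""
    return 50.0 + 0.15 * sqft - 0.00002 * sqft ** 2

def features(sqft, degree):
    """Polynomial features of square feet rescaled to roughly [-1, 1]."""
    t = (sqft - 2000.0) / 1500.0
    return np.vander(t, degree + 1, increasing=True)

def mean_prediction_variance(n_houses, ridge_lambda, degree=9, n_datasets=300, seed=4):
    """Average variance of predictions across simulated data sets."""
    rng = np.random.default_rng(seed)
    grid_features = features(np.linspace(600.0, 3400.0, 25), degree)
    # Ridge as augmented least squares: extra rows pull the weights toward zero.
    shrink_rows = np.sqrt(ridge_lambda) * np.eye(degree + 1)
    shrink_rows[0, 0] = 0.0                      # leave the intercept unpenalized
    preds = np.empty((n_datasets, grid_features.shape[0]))
    for d in range(n_datasets):
        sqft = rng.uniform(500.0, 3500.0, size=n_houses)
        price = true_price(sqft) + rng.normal(0.0, 40.0, size=n_houses)
        A = np.vstack([features(sqft, degree), shrink_rows])
        b = np.concatenate([price, np.zeros(degree + 1)])
        weights = np.linalg.lstsq(A, b, rcond=None)[0]
        preds[d] = grid_features @ weights
    return preds.var(axis=0).mean()

baseline = mean_prediction_variance(n_houses=30, ridge_lambda=0.0)
more_data = mean_prediction_variance(n_houses=300, ridge_lambda=0.0)
regularized = mean_prediction_variance(n_houses=30, ridge_lambda=1.0)

print(f"baseline (30 houses, no penalty): {baseline:10.1f}")
print(f"more data (300 houses):           {more_data:10.1f}")
print(f"regularized (lambda = 1.0):       {regularized:10.1f}")
```

Both changes should lower the variance relative to the baseline. Regularization does so at the cost of some added bias, which is why its strength has to be tuned rather than maximized.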