Logistic regression is one of the most commonly used algorithms for machine learning and is a building block for neural networks. It is really important to understand the concepts and the derivations of logistic regression. In this post we will explore the fundamentals of logistic regression and also concepts like maximum likelihood function, cross-entropy is covered briefly. In the end, we have provided codes for Logistic Regression in Python from scratch.
Previously, we have covered linear classifier models and we have described the concept of decision boundaries. We have taken some features \(x_i\), chosen some random weights \(w_i\) and calculated weighted sums. Then we used a simple sign function to determine if the value is positive or negative. : $$\begin{aligned} &\text{Model: } \hat{y}_i = \text{sign(Score(}\textbf{x}_i)) \\ &\text{Score(}\textbf{x}_i) = w_1 \textbf{x}_1+ … + w_n \textbf{ x}_n + b \end{aligned}$$ If the value is postive we have given a postive prediction and if the value is negative we have given a negative prediction. But there are problems with these models. The error function we have used in linear classifiers is a simple sign function that gives us output depending on the sign of the values and it is a discrete value.
Using linear classifiers, numbers like .0000001 and 10 are going to have the same predictions. But what if we want a model in which we can predict and at the same time we can tell how confident we are about a prediction. Here’s where probability comes in. Using probability we can tell how confident we are about a prediction. Also in order to use gradient descent, our error function needs to be both continuous and differentiable and we need to move from discrete predictions to continuous predictions.
Class Probabilities
In logistic regression, we don’t just use discrete predictions like \(+1\) or \(-1\), we actually predict a number generally between 0 and 1 which we will consider a probability. These probabilities are extremely useful because they give us an indication of how sure we are about the predictions we make. However, a model can not predict correctly every time. Using probability we can actually predict how confident we are about a prediction.
So, we want to use a continuous error function and we want to use probability. Question is, how can we do that? The way we address both of these problems is by changing our step function from a discrete step function to a continuous step function.
This function is called the sigmoid function and its equation is: $$ \sigma(x) = \frac{1}{1+e^{-x}} $$
It gives output between 0 and 1 and it is a continuous function. Now, we need an algorithm that will help us find the best model. The best model is the model that gives a higher probability of the events that actually happen to us. The method is called the Maximum likelihood in which we pick the model that gives the existing labels the highest probability. Thus by maximizing probability, we can pick the best possible model.
Maximum Likelihood
Maximum likelihood is a quite simple but interesting concept. Suppose, we have some probabilities of events \(p_1, p_2, …, p_n\). In order to obtain the probabilities of the whole arrangement, we have to multiply these probabilities \( p_1 * p_2 * …* p_n \). The bigger the individual probabilities will be the resulting value will get bigger as well. Now, all we need to do is to maximize this probability. The equation of maximum likelihood function: $$l(w) = \prod_{i=1}^NP(y_i | \textbf{x}_i, \textbf{w})$$
The maximum likelihood is the product of numbers and the product has got some problems. If we have a big data-set with thousands of probabilities and if these probabilities \(p_1, p_2, …, p_n\) get really small then the resulting multiplication will become really tiny since the value of the probability is always between 0 and 1. Also, if we have a product of thousands of numbers and we change one of them the product will change drastically. So, we want to avoid products and one alternative is to use sums. One easy way to do this is to use logarithms and the concept is called the cross-entropy.
Cross-Entropy
In cross-entropy, we take products and we take the logarithms of them. We know: $$\begin{aligned}ln (a*b) = ln (a) + ln (b)\end{aligned} $$ So, if we take the logarithms of the probabilities we get: $$ ln(p_1 * p_2 * …* p_n) = ln(p_1) + ln(p_2)+ … +ln(p_n)$$ Which is a good solution. But there’s still a problem. Every \(ln(p_i)\) will be a negative number because logarithms of numbers between 0 and 1 is always a negative number and \(ln(1) = 0\). But if we take the negative of the logarithms of the probabilities we get positive numbers. $$\begin{aligned}\text{CE} &= -ln(p_1) – ln(p_2) … – ln(p_n) \\ &= – \sum ln(p_i)\end{aligned}$$ This value is called cross-entropy which is a very important concept and will tell us how good our model is. What cross-entropy says is if we have some events and some corresponding probabilities how likely are those events going to happen based on the probabilities. If it is very likely then we have a small cross-entropy, if it is unlikely cross-entropy will be large. It is important to notice that, our goal has changed from maximizing the probability to minimizing the cross-entropy. Now, we need a general formula for cross-entropy.
Let’s say we are going to do an assignment and our supervisor has suggested three books to look at to get some help. Suppose, the probability of getting something useful from these books are \(p_1 = 0.9\), \(p_2 = 0.7\) and \(p_3 = 0.2\) respectively and suppose these are independent events. So, the probability of these events to be happening is 0.9 for the first book, 0.7 for the second book and 0.8 for not having useful content in the third book. We know the cross-entropy will be the sum of the negative logarithms of the probabilities $$ -ln(p_1) – ln(p_2) – ln(1-p_3) = – ln(0.9) – ln (0.7) – ln(0.8)$$ Let’s there be another variable called \(y_i\) which will be 1 if there is something useful in the book and 0 if it’s not. Again suppose, for the first book \(y_1 = 1\), for the second book \(y_2 = 1\) and for the third book \(y_3 = 0\). If we put all this information together, we can find a formula: $$\text{Cross-Entropy} = – \sum_{i=1}^m y_{i}ln(p_i) + (1- y_i)ln(1-p_i)$$
If we look closely, we will notice if there is something useful in the books \(y_i\) will be 1 and the second term will be zero and if there is not something useful the \(y_i = 0\) and the first term becomes zero. That is the beauty of this formula and it really gives the sums of the negatives of the logarithms of the probabilities which we have defined as cross-entropy. If we calculate the cross-entropy of the pair: $$\text{CE }[(1,1,0),(0.9,0.7,0.2)] = 0.69$$ which is low since the event is likely. On the other hand, if we calculate $$\text{CE }[(0,0,1),(0.9,0.7,0.2)] = 5.12$$ the cross-entropy is high. By convention we actually use average instead of using the sum: $$\text{Cross-Entropy} = – \frac{1}{m}\sum_{i=1}^m y_{i}ln(p_i) + (1- y_i)ln(1-p_i)$$
In this example, we only had two classes. Having something useful in the book and not having something useful in the book. What if wwe have more than two classes? The concept is called Multi-Class Cross Entropy.
Multi-Class Cross-Entropy
Suppose again, we need to do an assignment and our supervisor has suggested looking at three different books. But, in this case we have three different topics: Science, Technology, and Sports. For these cases, we use a formula which is more general. $$\text{Cross-Entropy} = -\sum_{i=1}^n\sum_{j=1}^m y_{ij} ln(p_{ij})$$
The above formula is actually the same as the formal we have used for two classes.So, now that we have our error function we can use gradient descent to minimize the function.
Gradient Descent
Let’s recall that we have n points labelled \(x_1, x_2, \ldots, x_n\) and we know the error formula and the prediction is given by: $$\begin{aligned}E &= -\frac{1}{n} \sum_{i=1}^n \left( y_i \ln(\hat{y_i}) + (1-y_i) \ln (1-\hat{y_i}) \right) \\ \hat{y_i} &= \sigma(Wx_i + b) \end{aligned}$$ Where the prediction is given by \(\hat{y_i} = \sigma(Wx^{(i)} + b)\) and we can get the derivative of the Sigmoid:
$$\begin{aligned} \sigma(x) &= \frac{1}{1+e^{-x}} \\ \sigma(x)^{’} & = \frac{d}{dx} \frac{1}{1+e^{-x}} \\ &= \frac{e^{-x}}{({1+e^{-x}})^2} = \frac{1}{1+e^{-x}}. \frac{e^{-x}}{({1+e^{-x}})}\\ &= \sigma(x) . (1 – \sigma(x)) \end{aligned} $$
We can use the above formula to derive: $$ \begin{aligned} \frac{\partial}{\partial w_j}\hat{y} &= \frac{\partial}{dw_j} \sigma(Wx+b)\\ &= \sigma(Wx+b) (1 – \sigma(Wx+b)) . \frac{\partial}{\partial w_j}(Wx+b)\\ &= \hat{y}(1-\hat{y}) . \frac{\partial}{\partial w_j}(Wx+b) \\ &= \hat{y}(1-\hat{y}) . \frac{\partial}{\partial w_j} (w_1x_1 + … + w_jx_j+ … + w_nx_n + b ) \\ &= \hat{y}(1-\hat{y}) . x_j \end{aligned} $$
Our goal is to calculate the gradient of E at a point \(x = (x_1, \ldots, x_n)\) given by the partial derivatives $$\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)$$
To simplify our calculations, we’ll actually think of the error that each point produces, and calculate the derivative of this error. The total error, then, is the average of the errors at all the points. Now, we can go ahead and calculate the derivative of the error E at a point x with respect to the weight \(w_j\): $$\begin{aligned} \frac{\partial}{\partial w_j} E &= \frac{\partial}{\partial w_j} \left[- y \ln(\hat{y}) – (1-y) \ln (1-\hat{y}) \right] \\ &= – y \frac{\partial}{\partial w_j} \ln(\hat{y}) – (1-y) \frac{\partial}{\partial w_j} \ln (1-\hat{y}) \\ &= -y . \frac{1}{\hat{y}}. \frac{\partial}{\partial w_j} \hat{y} – (1-y) . \frac{1}{1-\hat{y}} . \frac{\partial}{\partial w_j} (1 – \hat{y}) \\ &= -y(1-\hat{y}).x_j + (1- y) \hat{y}. x_j \\ &= -(y-\hat{y}) x_j\end{aligned}$$ A similar calculation will show us that $$\frac{\partial}{\partial b} E = – (y – \hat{y})$$
This actually tells us something very important. For a point with coordinates \((x_1, \ldots, x_n)\), label y, and prediction \(\hat{y}\), the gradient of the error function at that point is
$$\left(-(y – \hat{y})x_1, \cdots, -(y – \hat{y})x_n, -(y – \hat{y}) \right)$$ In summary, the gradient is $$\nabla E = -(y – \hat{y}) (x_1, \ldots, x_n, 1)$$
Logistic Regression Algorithm
We now have all the tools we need to describe the Logistic Regression algorithm. Here is the pseuodocode for the Logistic Regression algorithm: