Linear Classifiers are one of the most commonly used classifiers and Logistic Regression is one of the most commonly used linear classifiers. The concepts we are going to learn here will actually extend a lot of other classification methods beyond linear classifiers. We’re going to learn the fundamental concepts, but also the underlying algorithms that let us optimize the parameters of this model to fit your training data.

## Motivation

Suppose, we want to make a spam filter which will filter spams from our email list automatically. So, for every email, we’re going to feed it to a classifier and the classifier will say if this is a spam or maybe we want to visit a restaurant for a specific food e.g. sushi but before visiting the restaurant we want to make sure that the sushi there is good. So, what we want to do is to check the review of others visiting this restaurant about sushi and see if it is positive. It is actually very difficult to manually check all the reviews. Instead, what we can do is to feed these reviews to a classifier and it is going to say, is this a positive sentiment or negative sentiment and then watching the number of positive sentiment and the negative sentiment we will decide whether the sushi there is good to go or not.

## Linear Classifiers Intuition

A linear classifier will take some quantity x as input which in our case will be emails or reviews and is going to make a prediction \(\hat{y}\) that says is this a positive statement which means is this a no spam or a positive review in which case \(\hat{y} = +1\) or is this a negative statement which implies it is a spam or a negative review in which case \(\hat{y} = -1\) . We will use restaurant review classifier for the rest fo the tutorial as our use case. A linear classifier does a little bit more associating every word for weight or coefficient which says how positively or negatively influential this word is for a review. For example,

Word | Coefficient |

good | 1.0 |

delicious | 1.5 |

wonderful | 2.7 |

bad | -1.0 |

terrific | -2.1 |

awful | -3.3 |

but, food, the, we, to, have, … | 0.0 |

… | … |

Here good and delicious have a coefficient of \(1.0\) and \(1.5\) respectively. Wonderful is very positive and has a coefficient of \(2.7\). On the negative side, bad and terrific might have a coefficient of \(-1\) and \(-2.1\) respectively. But awful is just awful, so \(-3.3\). There are also some words that are not that relevant to the sentiment of the review, might have \(0\) coefficient.

Now how will we use these coefficient’s to make a prediction of whether a sentence is positive or negative? Let’s take a review, for example, $$\begin{aligned} &\text{Input } \textbf{x}_i: \\ &\text{ Sushi was} \textbf{ good} \text{, the food was } \textbf{delicious}\text{, but the service was } \textbf{terribfic.}\end{aligned} \\ \text{Score}(\textbf{x}_i) = 1.0 + 1.7 – 2.1 = 0. 6$$ here good, delicious and great have coefficients greater than \(0\) and rest of the terms having coefficients greater than \(0\) as they are not relevant. After calculating we get a positive score which implies the sentiment in the sentence is positive and \(\hat{y} = +1\). This is called a linear classifier because the output is the weighted sum of the inputs.

So more generally for a simple linear classifier, we are going to take a review and the coefficient associated with each word and feed that in our classifier and calculate the score for that input. If the score is greater than \(0\) the prediction is positive and \(\hat{y} = +1\) and if the score is less than \(0\) we say the prediction is \(\hat{y} = -1\). Now, what we need to do is to train the coefficients or weights of these linear classifiers from data.

So given some input training data that includes sentences of reviews labeled with either \(+1\) or \(-1\). We’re going to split those into some training set and some validation set. Then we’re going to feed that training set to some learning algorithm which is going to learn the weights associated with each word. And then after we learn this classifier, we’re going back and evaluate its accuracy on that validation set. Now, how do we learn this classifier from data?

## Decision Boundary

The decision boundary is a boundary between positive and negative predictions. Let’s say that we have taken our data and trained our linear classifier and every word has zero weight except for two of them. Awesome has weight 1.0 and awful has weight -1.5. That means that the score of any sentence is 1.0 times the number of the word awesome minus 1.5 times the number of times the word awful shows up. We can plot that into a graph which depends on every sentence the number of awesome and the number of awful. So for example, for a sentence, $$\begin{aligned} \text{ Sushi was} \textbf{ awesome} \text{, the food was } \textbf{awesome}\text{, but the service was } \textbf{awful.}\end{aligned} $$ We’re going to plot that into a space where we’re going to have two awesome and one awful. So it gets plotted in the \((2,1)\) point. And then for every sentence that we might have in our training data set say, three awful and one awesome, three awesome and no awful and so on we will have different points and if we plot the dataset we find

The classifier that we’ve trained with the coefficients 1.0 and -1.5 will have a decision boundary that corresponds to a line plotted above, where 1.0 times awesome minus 1.5 times the number of awful is equal to zero. Everything below that line has a score greater than zero and any points above that line are going to have scored less than zero. So there’s that line, everything below the line is positive, everything above the line’s negative a linear decision boundary. That is what makes it a linear classifier.

## Linear Classifiers Model

in our previous example that we had with just two features with no zero coefficients, awesome and awful and we have calculated the score as $$\text{Score(x)} = w_0 + w_1 \text{ #awesome} + w_2 \text{ #awful}$$ where \(w_0\) was \(0\) in our example. Now suppose that we had a third feature with no zero coefficient. Let’s say that the word great. So the score function then will be $$ \text{Score(x)} = w_0 + w_1 \text{ #awesome} + w_2 \text{ #awful} + w_3 \text{ #great}$$ So, the general form of our linear classifier model will be : $$\begin{aligned} &\text{Model: } \hat{y}_i = \text{sign(Score(}\textbf{x}_i)) \\ &\text{Score(}\textbf{x}_i) = w_0 + w_1 \textbf{x}_i[1]+ … + w_d \textbf{ x}_i[d] \end{aligned}$$ where $$\begin{aligned} feature \; 1 &= 1 \\ feature \; 2 &= \textbf{x}[1] \dots e.g. , \#awesome \\ feature \; 3 &= \textbf{x}[2] \dots e.g., \#awful \\ \vdots \\ feature \; d+1 &= \textbf{x}[d]\ \dots e.g., \#ramen \end{aligned}$$

Theses notations that we’ve used so far are without features associated with it. So we’re going to have these functions \(h_1\) through \(h_D\). They’ve defined some features we might extract from the data and we are going to encode the constant function that is \(h_0\). So the more generically our model will be looked like: $$\begin{aligned} \text{Model: } \hat{y}_i &= \text{sign(Score(}\textbf{x}_i)) \\ \text{Score(}\textbf{x}_i) &= w_0 h_0\textbf{x}_i + w_1 h_1\textbf{x}_i+ … + w_D h_D\textbf{x}_i \\ &= \textbf{w}^{T}h(\textbf{x}_i) \end{aligned}$$ where $$\begin{aligned} feature \; 1 &= h_0(\textbf{x}) \dots e.g., 1 \\ feature \; 2 &= h_1(\textbf{x}) \dots e.g. , \textbf{x}[1] = \#awesome \\ feature \; 3 &= h_2(\textbf{x}) \dots \textbf{x}[2] = \#awful \; or, \log(\textbf{x}[7]) \textbf{x}[2] = \log(\#bad) * \#awful \\ \vdots \\ feature \; D+1 &= h_D( \textbf{x}) \dots \text{some other function of} \; \textbf{x}[1], \dots, \textbf{x}[d]\end{aligned}$$ So this is our generic linear classifier model with multiple features.

Linear classifiers are a pretty abstract concept. Logistic regression is a specific case of that, where we use what’s called the logistic function to squeeze minus infinity to plus infinity into the interval \((0, 1)\) so we can predict probabilities for every class. Next, we will discuss Logistic Regression.

*
Also published on Medium. *