Study Logistic Regression

Posted on Sat 14 May 2016 in Machine Learning

Logistic Regression is a machine learning algorithm for classification.

Logistic Regression Hypothesis

$$H_\theta(x)=\frac{1}{1+e^{-\theta{x}}}$$

where $\theta$ is the parameter vector to be learned and $x_i$ is the feature vector of the $i$-th example.
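
As a quick sketch (the variable names and numbers here are my own illustration, not from any course material), the hypothesis in NumPy:

import numpy as np

def hypothesis(theta, x):
    """Sigmoid hypothesis H_theta(x): maps theta . x to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

# x carries a leading 1 for the intercept term theta_0
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.8])
print(hypothesis(theta, x))  # ~0.86, interpreted as P(y=1|x)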

Logistic Regression Cost Function

For this hypothesis, the probability that $y_i=1$ is $H_\theta(x_i)$, and the probability that $y_i=0$ is $1-H_\theta(x_i)$. Since $y_i\in\{0,1\}$, both cases can be written as the single expression $(1-y_i)(1-H_\theta(x_i))+y_iH_\theta(x_i)$. Maximum likelihood on the training data then gives: $$Likelihood=\prod_{i=1}^{n} [(1 - y_i)(1 - H_\theta(x_i))+y_iH_\theta(x_i)]$$ $$\log(Likelihood)=\sum_{i=1}^n \log[(1 - y_i)(1 - H_\theta(x_i))+y_iH_\theta(x_i)]$$

Cost function: $$J(\theta)=-\frac{1}{n}\log(Likelihood)=-\frac{1}{n}\sum_{i=1}^n \log[(1 - y_i)(1 - H_\theta(x_i))+y_iH_\theta(x_i)]$$

The goal is to find the $\theta$ that minimizes the cost on the training data.
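
A minimal NumPy sketch of this cost, assuming a design matrix X with one row per example and labels y in {0, 1}:

import numpy as np

def cost(theta, X, y):
    """J(theta): negative mean log-likelihood of the training data."""
    h = 1.0 / (1.0 + np.exp(-X.dot(theta)))  # H_theta(x_i) for every example
    # Because y_i is 0 or 1, each bracketed term is the probability of the observed label.
    return -np.mean(np.log((1 - y) * (1 - h) + y * h))

Since $y_i\in\{0,1\}$, this is the same as the usual cross-entropy form $-\frac{1}{n}\sum_{i=1}^n [y_i\log H_\theta(x_i)+(1-y_i)\log(1-H_\theta(x_i))]$.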

Logistic Regression With Regularization

L2 Regularization: $$J(\theta)=-\frac{1}{n}\sum_{i=1}^n \log[(1 - y_i)(1 - H_\theta(x_i))+y_iH_\theta(x_i)]+\frac{\lambda}{2n}\sum_{j=1}^m \theta_j^2$$ where $n$ is the number of training examples, $m$ is the number of features, and $\theta_0$ is not included in the regularization term.

L1 Regularization: $$J(\theta)=-\frac{1}{n}\sum_{i=1}^n \log[(1 - y_i)(1 - H_\theta(x_i))+y_iH_\theta(x_i)]+\frac{\lambda}{2n}\sum_{j=1}^m |\theta_j|$$
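
Either penalty is easy to add to the cost sketch above; for L2 (with the intercept theta[0] left out, matching the formula):

def cost_l2(theta, X, y, lam):
    """J(theta) with an L2 penalty; the intercept theta[0] is not penalized."""
    n = len(y)
    h = 1.0 / (1.0 + np.exp(-X.dot(theta)))
    data_term = -np.mean(np.log((1 - y) * (1 - h) + y * h))
    return data_term + lam / (2.0 * n) * np.sum(theta[1:] ** 2)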

Softmax Regression (or Multinomial Logistic Regression)

$$ H_\theta(x) = \begin{bmatrix} P(y=1|x;\theta) \\ P(y=2|x;\theta) \\ \vdots \\ P(y=K|x;\theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^K e^{\theta^j x}} \begin{bmatrix} e^{\theta^1 x} \\ e^{\theta^2 x} \\ \vdots \\ e^{\theta^K x} \end{bmatrix} $$

$$ J(\theta)=-\frac{1}{n}\left[\sum_{i=1}^n \sum_{k=1}^K 1\{y_i=k\}\log\frac{e^{\theta^k x_i}}{\sum_{j=1}^K e^{\theta^j x_i}}\right] $$

where $\theta^k$ is the parameter vector of class $k$.
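
A sketch of the softmax hypothesis, with the rows of Theta holding one parameter vector per class (shifting by the max is a standard numerical-stability trick, not part of the formula):

import numpy as np

def softmax_hypothesis(Theta, x):
    """Returns [P(y=1|x), ..., P(y=K|x)] as in the formula above."""
    scores = Theta.dot(x)           # theta^k . x for every class k
    scores = scores - scores.max()  # avoids overflow; does not change the result
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()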

Logistic Regression In scikit-learn

In scikit-learn, logistic regression is implemented in LogisticRegression. It supports both L2 and L1 regularization. For multiclass problems it provides both a one-vs-rest scheme and a true multinomial model.
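
For example (parameter names follow the scikit-learn API; defaults and supported solvers may differ across versions):

from sklearn.linear_model import LogisticRegression

# L2 regularization (the default); C is the inverse regularization strength
clf_l2 = LogisticRegression(penalty='l2', C=1.0)

# L1 regularization needs a solver that supports it, e.g. 'liblinear'
clf_l1 = LogisticRegression(penalty='l1', solver='liblinear')

# a true multinomial model instead of one-vs-rest
clf_multi = LogisticRegression(multi_class='multinomial', solver='lbfgs')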

Try Logistic Regression

The data is from Andrew Ng's Machine Learning course.

To use a contour plot, refer to this link.

In [56]:
import warnings
warnings.filterwarnings('ignore')

# Render matplotlib plots inline in the notebook
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
x = pd.read_csv('../data/LogisticRegression/ex5Logx.dat', header=None, names=['x1', 'x2'])
y = pd.read_csv('../data/LogisticRegression/ex5Logy.dat', header=None, names=['y'])
xy = pd.concat([x, y], axis=1)
logistic = linear_model.LogisticRegression(C=100000)  # the larger C is, the weaker the regularization
poly = PolynomialFeatures(6)  # degree-6 polynomial features, so the model can overfit the data
logistic.fit(poly.fit_transform(x), y['y'])  # pass a 1-D target to avoid a shape warning

pos = xy[xy['y']==1]
neg = xy[xy['y']==0]

ax = pos.plot.scatter(x='x1', y='x2', marker='+', label='y=1')
neg.plot.scatter(x='x1', y='x2', color='yellow', marker='o', label='y=0', ax=ax)

x1c = np.linspace(-1.0, 1.2, 200)
x2c = np.linspace(-1.0, 1.2, 200)
z = np.zeros((len(x1c), len(x2c)))
for i in range(len(x1c)):
    for j in range(len(x2c)):
        # transform (not fit_transform): reuse the features fitted on the training data
        z[i, j] = logistic.decision_function(poly.transform([[x1c[i], x2c[j]]]))[0]

# Decision boundary: the level-0 contour of the decision function.
# contour expects z indexed as z[x2, x1], hence the transpose.
plt.contour(x1c, x2c, np.transpose(z), levels=[0], colors='green')
Out[56]:
[Scatter plot of the two classes with the green decision boundary]