Study Logistic Regression
Posted on Sat 14 May 2016 in Machine Learning
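Logistic regression is a machine learning algorithm used for classification.
Logistic Regression Hypothesis
$$H_\theta(x)=\frac{1}{1+e^{-\theta{x}}}$$ where $\theta$ is the parameter vector to be learned and $x_i$ is the feature vector of the $i$-th training example.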
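As a minimal sketch of the hypothesis, assuming `theta` and `x` are NumPy vectors of equal length and that $\theta x$ denotes their inner product (the names `sigmoid` and `hypothesis` and the sample values are illustrative, not from the original post):

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """H_theta(x): the estimated probability that y = 1."""
    return sigmoid(np.dot(x, theta))

theta = np.array([0.0, 1.0, -1.0])  # hypothetical parameters
x = np.array([1.0, 2.0, 2.0])       # leading 1.0 pairs with theta_0
print(hypothesis(theta, x))         # theta . x = 0, so this prints 0.5
```

The output is always strictly between 0 and 1, which is what allows it to be read as a probability.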
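Logistic Regression Cost Function
For this hypothesis, the probability that $y_i=1$ is $H_\theta(x_i)$, and the probability that $y_i=0$ is $1-H_\theta(x_i)$. Because $y_i$ is either 0 or 1, both cases can be written as the single expression $(1 - y_i)(1 - H_\theta(x_i))+y_iH_\theta(x_i)$. To maximize the likelihood of the current training data, we have: $$Likelihood=\prod_{i=1}^{n} [(1 - y_i)(1 - H_\theta(x_i))+y_iH_\theta(x_i)]$$ $$\log(Likelihood)=\sum_{i=1}^n \log[(1 - y_i)(1 - H_\theta(x_i))+y_iH_\theta(x_i)]$$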
Cost function: $$J(\theta)=-\frac{1}{n}\log(Likelihood)=-\frac{1}{n}\sum_{i=1}^n \log[(1 - y_i)(1 - H_\theta(x_i))+y_iH_\theta(x_i)]$$
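A small sketch of this cost function in NumPy may make the formula concrete. The toy data and the names `cost`, `X`, and `y` are assumptions for illustration. `X` carries a leading column of ones so that $\theta_0$ acts as the intercept.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/n) * sum_i log[(1 - y_i)(1 - H_i) + y_i * H_i]."""
    h = sigmoid(X @ theta)
    per_sample_likelihood = (1 - y) * (1 - h) + y * h
    return -np.mean(np.log(per_sample_likelihood))

# Hypothetical toy data: a column of ones plus one feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

print(cost(np.zeros(2), X, y))           # log(2), about 0.693, when theta = 0
print(cost(np.array([0.0, 2.0]), X, y))  # lower, since this theta fits the labels
```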
The goal is to find the $\theta$ that minimizes this cost on the training data.
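The post does not say which optimizer is used to find this $\theta$. One common choice is batch gradient descent, sketched below as an assumption. It relies on the standard gradient of the cost, $\nabla J(\theta)=\frac{1}{n}X^T(H_\theta(X)-y)$. The learning rate, step count, and toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(np.log((1 - y) * (1 - h) + y * h))

def gradient_descent(X, y, learning_rate=0.1, n_steps=500):
    """Minimize J(theta) with plain batch gradient descent."""
    n = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / n  # gradient of J at theta
        theta -= learning_rate * grad
    return theta

# Hypothetical, deliberately non-separable toy data (ones column + one feature).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.5],
              [1.0, 2.0], [1.0, -0.5], [1.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

theta = gradient_descent(X, y)
print(theta, cost(theta, X, y))
```

Library solvers such as those in scikit-learn use more sophisticated optimizers, but the objective being minimized has the same form.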
Logistic Regression With Regularization
L2 Regularization: $$J(\theta)=-\frac{1}{n}\sum_{i=1}^n \log[(1 - y_i)(1 - H_\theta(x_i))+y_iH_\theta(x_i)]+\frac{\lambda}{2n}\sum_{j=1}^m \theta_j^2$$ where $n$ is the number of training examples, $m$ is the number of features, and $\theta_0$ is not included in the regularization term.
L1 Regularization: $$J(\theta)=-\frac{1}{n}\sum_{i=1}^n \log[(1 - y_i)(1 - H_\theta(x_i))+y_iH_\theta(x_i)]+\frac{\lambda}{2n}\sum_{j=1}^m |\theta_j|$$
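The two regularized costs can be sketched with one function that switches the penalty term. The function name, the `penalty` argument, and the sample values are assumptions for illustration. The $\frac{\lambda}{2n}$ scaling follows the formulas above, and $\theta_0$ is left out of the penalty in both cases.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam, penalty="l2"):
    """Data term plus lambda/(2n) times the L2 or L1 penalty on theta_1..theta_m."""
    n = X.shape[0]
    h = sigmoid(X @ theta)
    data_term = -np.mean(np.log((1 - y) * (1 - h) + y * h))
    weights = theta[1:]  # theta_0 is excluded from the regularization term
    if penalty == "l2":
        reg = np.sum(weights ** 2)
    elif penalty == "l1":
        reg = np.sum(np.abs(weights))
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return data_term + lam / (2 * n) * reg

# Hypothetical toy data: ones column plus two features.
X = np.array([[1.0, 0.5, 1.0], [1.0, -1.0, 0.2],
              [1.0, 2.0, -1.0], [1.0, 0.0, 0.3]])
y = np.array([1, 0, 1, 0])
theta = np.array([0.5, 3.0, -4.0])

base = regularized_cost(theta, X, y, lam=0.0)
print(regularized_cost(theta, X, y, lam=8.0, penalty="l2") - base)  # 8/(2*4) * (9 + 16) = 25
print(regularized_cost(theta, X, y, lam=8.0, penalty="l1") - base)  # 8/(2*4) * (3 + 4) = 7
```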
Softmax Regression (Or multinomial logistic regression)
$$ H_\theta(x) = \begin{bmatrix} P(y=1|x;\theta) \\ P(y=2|x;\theta) \\ \vdots \\ P(y=K|x;\theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^K e^{(\theta^j{x})}} \begin{bmatrix} e^{(\theta^1{x})} \\ e^{(\theta^2{x})} \\ \vdots \\ e^{(\theta^K{x})} \\ \end{bmatrix} $$
$$ J(\theta)=-\left[\sum_{i=1}^n \sum_{k=1}^K 1\{y^i=k\}\log\frac{e^{\theta^k x_i}}{\sum_{j=1}^K e^{\theta^j x_i}}\right] $$
Logistic Regression In scikit-learn
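Before moving on to the library, the softmax hypothesis and cost defined above can be sketched directly in NumPy. This is an illustration under assumptions: `Theta` stacks one row $\theta^k$ per class, classes are numbered from 0 rather than 1, and the maximum score is subtracted before exponentiating, which leaves the probabilities unchanged but avoids overflow.

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    """Return the vector of P(y = k | x; Theta) for every class k."""
    scores = Theta @ x
    scores = scores - scores.max()  # numerical stability; ratios are unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def softmax_cost(Theta, X, y):
    """J = -sum_i log P(y_i | x_i); the indicator 1{y_i = k} selects one term per i."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        total -= np.log(softmax_hypothesis(Theta, x_i)[y_i])
    return total

# Hypothetical parameters for K = 3 classes and 2 features.
Theta = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
x = np.array([1.0, 2.0])
probs = softmax_hypothesis(Theta, x)
print(probs, probs.sum())  # scores are [1, 2, -3], so class 1 is most likely
```

With $K=2$ this reduces to the binary hypothesis from the first section, with $\theta=\theta^1-\theta^0$ in the 0-based numbering used here.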
In scikit-learn, logistic regression is implemented in LogisticRegression. It supports both L2 and L1 regularization. For multiclass problems it offers both a one-vs-all implementation and a true multinomial model.
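The options mentioned above might be exercised roughly as follows. This is a sketch under assumptions: it uses the bundled iris dataset rather than the post's data, and it wraps the estimator in `OneVsRestClassifier` to get one-vs-all explicitly. Whether the plain estimator fits a multinomial or a one-vs-rest model by default depends on the scikit-learn version and solver, so check the documentation for the installed release.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# L1 penalty on a binary problem; liblinear is one solver that supports L1.
y_binary = (y == 2).astype(int)
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
l1_model.fit(X, y_binary)

# Multiclass with the default L2 penalty (multinomial on recent versions).
multiclass_model = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

# Explicit one-vs-all: one binary logistic regression per class.
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(l1_model.coef_)
print(multiclass_model.score(X, y), ovr_model.score(X, y))
```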
Try Logistic Regression
The data is from Andrew Ng's Machine Learning course.
For how to draw the contour plot, refer to this link.
import warnings
warnings.filterwarnings('ignore')
#Below is inline matplotlib
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
x = pd.read_csv('../data/LogisticRegression/ex5Logx.dat', header=None, names=['x1', 'x2'])
y = pd.read_csv('../data/LogisticRegression/ex5Logy.dat', header=None, names=['y'])
xy = pd.concat([x, y], axis=1)
logistic = linear_model.LogisticRegression(C=100000) # The larger the C, the less regularization.
poly = PolynomialFeatures(6) # Polynomial feature to overfit the data
logistic.fit(poly.fit_transform(x), y['y']) # Pass y as a 1-D Series rather than a one-column DataFrame
pos = xy[xy['y']==1]
neg = xy[xy['y']==0]
ax = pos.plot.scatter(x='x1', y='x2', marker='+', label='y=1')
neg.plot.scatter(x='x1', y='x2', color='yellow', marker='o', label='y=0', ax=ax)
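x1c = np.linspace(-1.0, 1.2, 200)
x2c = np.linspace(-1.0, 1.2, 200)
# Evaluate the model on the whole grid at once instead of looping point by point
xx1, xx2 = np.meshgrid(x1c, x2c)
grid = np.column_stack([xx1.ravel(), xx2.ravel()])
# transform (not fit_transform) reuses the already fitted polynomial expansion;
# decision_function is 0 exactly on the decision boundary
z = logistic.decision_function(poly.transform(grid)).reshape(xx1.shape)
# contour takes `colors` (plural), so color='green' is replaced; no `label` is passed
plt.contour(x1c, x2c, z, levels=[0], colors='green')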
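To see what the comment about `C` means in practice, the sketch below fits the same degree-6 polynomial model with several values of `C` and compares the size of the learned coefficients. It is an illustration under assumptions: the course data files are not bundled here, so it uses synthetic data from `make_circles`, and the specific `C` values and `max_iter` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the course data (assumption, not the original dataset).
X, y = make_circles(n_samples=200, noise=0.2, factor=0.5, random_state=0)
X_poly = PolynomialFeatures(6).fit_transform(X)

coef_norms = {}
for C in (0.01, 1.0, 100000.0):
    model = LogisticRegression(C=C, max_iter=10000).fit(X_poly, y)
    coef_norms[C] = np.linalg.norm(model.coef_)
    # Print C, the coefficient norm, and the training accuracy.
    print(C, coef_norms[C], model.score(X_poly, y))
```

A small `C` means strong regularization and small coefficients. A very large `C`, as in the example above, leaves the model almost unregularized, so the boundary is free to bend around individual points.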