The basic supervised learning framework
y=f(x)
- y : output
- f : prediction function
- x : input
Training data:
$(x_1, y_1), \ldots, (x_N, y_N)$
Nearest neighbor classifier
$f(x)$ = label of the training example nearest to $x$
K-nearest neighbor classifier: take the majority label among the $K$ training examples nearest to $x$
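A minimal sketch of both rules in numpy ($k=1$ recovers the nearest neighbor classifier); the function name and the choice of Euclidean distance are illustrative assumptions, not fixed by the notes:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Predict the label of x by majority vote among its k nearest
    training examples, using Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # most common label among the k
```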
Linear classifier
$f(x) = \mathrm{sgn}(w \cdot x + b)$
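A correspondingly small sketch of the linear rule; how to obtain $w$ and $b$ is exactly the training question raised below:

```python
import numpy as np

def linear_predict(w, b, x):
    """Binary linear classifier: the sign of the affine score w.x + b."""
    return np.sign(w @ x + b)  # +1 or -1 (0 only on the decision boundary)
```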
NN vs. linear classifiers: Pros and cons
NN pros:
- Simple to implement
- Decision boundaries not necessarily linear
- Works for any number of classes
- Nonparametric method
NN cons:
- Need good distance function
- Slow at test time
Linear pros:
- Low-dimensional parametric representation
- Very fast at test time
Linear cons:
- Only works for two classes (in its basic form)
- How to train the linear function?
- What if data is not linearly separable?
Empirical loss minimization
Define the expected loss
$L(f) = \mathbb{E}_{(x,y) \sim D}[\ell(f, x, y)]$
- 0-1 loss
    - $\ell(f, x, y) = \mathbb{I}[f(x) \neq y]$
    - $L(f) = \Pr[f(x) \neq y]$
- $\ell_2$ loss
    - $\ell(f, x, y) = [f(x) - y]^2$
    - $L(f) = \mathbb{E}\big[[f(x) - y]^2\big]$
Find $f$ that minimizes
$\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f, x_i, y_i)$
- For the 0-1 loss:
    - minimizing $\hat{L}$ is NP-hard
    - use surrogate loss functions instead
- For the $\ell_2$ loss:
    - $\hat{L}(f_w) = \frac{1}{n}\|Xw - Y\|^2$ is a convex function
    - setting the gradient to zero: $0 = \nabla\|Xw - Y\|^2 = 2X^T(Xw - Y)$
    - $w = (X^T X)^{-1} X^T Y$ (checked numerically below)
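A quick numeric check of this closed-form solution on made-up data; np.linalg.solve applies the normal equations without forming the matrix inverse explicitly, which is numerically preferable:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # synthetic inputs, made up for illustration
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=100)  # noisy linear targets

w = np.linalg.solve(X.T @ X, X.T @ Y)        # solves (X^T X) w = X^T Y
print(w)                                     # recovers something close to w_true
```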
Interpretation of the $\ell_2$ loss
Assumption:
$y$ is normally distributed with mean $f_w(x) = w^T x + b$
Maximum likelihood estimation:
$$\begin{aligned}
w_{\mathrm{ML}} &= \arg\min_w \sum_i -\log P_w(y_i \mid x_i) \\
&= \arg\min_w \sum_i -\log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{[y_i - f_w(x_i)]^2}{2\sigma^2}\right)\right) \\
&= \arg\min_w \sum_i \log\sqrt{2\pi\sigma^2} + \frac{[y_i - f_w(x_i)]^2}{2\sigma^2} \\
&= \arg\min_w \sum_i [y_i - f_w(x_i)]^2
\end{aligned}$$
Problem with linear regression
Linear regression is very sensitive to outliers
Logistic regression
Use the sigmoid function
$\sigma(x) = \frac{1}{1 + e^{-x}}$
```mermaid
flowchart LR
    input["x, y"] --> linear_output["linear output"]
    linear_output -->|logistic function| Probability
```
The sigmoid turns the linear score into class probabilities:
$P_w(y=1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
$P_w(y=-1 \mid x) = 1 - P_w(y=1 \mid x) = \sigma(-w^T x)$
so the log-odds are linear:
$\log \frac{P(y=1 \mid x)}{P(y=-1 \mid x)} = w^T x + b$
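A small numeric sanity check of these identities (the weight vector is made up for illustration, with the bias $b$ taken as absorbed into $w$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0])     # hypothetical weights
x = np.array([2.0, 1.0])

p_pos = sigmoid(w @ x)        # P(y = +1 | x)
p_neg = sigmoid(-(w @ x))     # P(y = -1 | x) = 1 - p_pos
print(p_pos + p_neg)          # 1.0
print(np.log(p_pos / p_neg))  # equals w @ x, the log-odds
```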
Logistic loss
Maximum likelihood estimate:
$w_{\mathrm{ML}} = \arg\min_w \sum_i -\log P(y_i \mid x_i) = \arg\min_w \sum_i -\log \sigma(y_i w^T x_i)$
i.e., the logistic loss
$\ell(w, x_i, y_i) := -\log \sigma(y_i w^T x_i)$
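In code, the per-example loss is computed stably with np.logaddexp, since $-\log\sigma(z) = \log(1 + e^{-z})$ (a sketch, not part of the notes):

```python
import numpy as np

def logistic_loss(w, x, y):
    """-log sigma(y * w^T x), computed stably as log(1 + exp(-y * w^T x))."""
    return np.logaddexp(0.0, -y * (w @ x))
```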
Gradient descent
To minimize the empirical loss $\hat{L}(w)$, use gradient descent:
$w \leftarrow w - \eta\, \nabla \hat{L}(w)$
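A generic sketch of this loop (the step size and iteration count are arbitrary placeholder choices):

```python
def gradient_descent(grad_L, w0, eta=0.1, steps=100):
    """Generic gradient descent: repeatedly apply w <- w - eta * grad_L(w)."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad_L(w)
    return w
```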
Stochastic gradient descent (SGD)
$w \leftarrow w - \eta\, \nabla \ell(w, x_i, y_i)$ (one randomly chosen example $i$ per step)
Mini-batch SGD estimates the gradient from a batch of $b$ examples:
$\nabla \hat{L} \approx \frac{1}{b} \sum_{i=1}^{b} \nabla \ell(w, x_i, y_i)$
$$\begin{aligned}
\nabla \ell(w, x_i, y_i) &= -\nabla_w \log \sigma(y_i w^T x_i) \\
&= -\frac{\nabla_w \sigma(y_i w^T x_i)}{\sigma(y_i w^T x_i)} \\
&= -\frac{\sigma(y_i w^T x_i)\,\sigma(-y_i w^T x_i)\, y_i x_i}{\sigma(y_i w^T x_i)} \\
&= -\sigma(-y_i w^T x_i)\, y_i x_i
\end{aligned}$$
SGD:
$w \leftarrow w + \eta\, \sigma(-y_i w^T x_i)\, y_i x_i$
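Putting the update rule into a minimal SGD loop for logistic regression (labels in $\{-1,+1\}$; the learning rate, epoch count, and data layout are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, eta=0.1, epochs=100, seed=0):
    """SGD for logistic regression with y_i in {-1, +1}.
    Implements: w <- w + eta * sigma(-y_i w^T x_i) * y_i * x_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):      # visit examples in random order
            margin = y[i] * (w @ X[i])
            w += eta * sigmoid(-margin) * y[i] * X[i]
    return w
```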
Logistic regression does not converge for linearly separable data: if some $w$ separates all training examples, scaling it by any $c > 1$ strictly decreases the logistic loss, so the optimum is at $\|w\| \to \infty$ and the weights grow without bound.
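One way to see this numerically, reusing the hypothetical sgd_logistic sketch above on a trivially separable dataset: the weight norm keeps growing the longer training runs.

```python
# Reuses numpy and sgd_logistic from the sketch above.
X = np.array([[1.0], [2.0], [-1.0], [-2.0]])  # 1-D, perfectly separable
y = np.array([1.0, 1.0, -1.0, -1.0])

for epochs in (10, 100, 1000):
    w = sgd_logistic(X, y, eta=0.5, epochs=epochs)
    print(epochs, np.linalg.norm(w))          # ||w|| grows without bound
```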