Deep Learning for Computer Vision (2)

The basic supervised learning framework

$$y = f(x)$$

  • $y$: output
  • $f$: prediction function
  • $x$: input

Training set:

$$\{(x_1, y_1), \ldots, (x_N, y_N)\}$$

Nearest neighbor classifier

$f(x) = $ label of the training example nearest to $x$

K-nearest neighbor classifier: predict the label by majority vote among the $k$ training examples nearest to $x$.
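
A minimal sketch of (k-)nearest-neighbor prediction under the Euclidean distance; the function name and the use of NumPy are my own choices, not from the lecture:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Predict the label of x by majority vote among its k nearest training examples."""
    # Euclidean distance from x to every training example
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices of the k closest training examples (k = 1 gives the plain NN classifier)
    nearest = np.argsort(dists)[:k]
    # majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```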

Linear classifier

$$f(x) = \operatorname{sgn}(w \cdot x + b)$$
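
In contrast to NN, the linear classifier only needs a dot product at test time; a minimal sketch, assuming a weight vector `w` and bias `b` that have already been trained:

```python
import numpy as np

def linear_predict(w, b, x):
    """Binary linear classifier: f(x) = sgn(w · x + b)."""
    return np.sign(np.dot(w, x) + b)
```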

NN vs. linear classifiers: Pros and cons

NN pros:

  • Simple to implement
  • Decision boundaries are not necessarily linear
  • Works for any number of classes
  • Nonparametric method

NN cons:

  • Needs a good distance function
  • Slow at test time

Linear pros:

  • Low-dimensional parametric representation
  • Very fast at test time

Linear cons:

  • Works only for two classes (in its basic form)
  • How to train the linear function?
  • What if the data is not linearly separable?

Empirical loss minimization

Define the expected loss:

$$L(f) = \mathbb{E}_{(x, y) \sim D}[l(f, x, y)]$$

  • 0-1 loss
    • $l(f, x, y) = \mathbb{I}[f(x) \neq y]$
    • $L(f) = \Pr[f(x) \neq y]$
  • $l_2$ loss
    • $l(f, x, y) = [f(x) - y]^2$
    • $L(f) = \mathbb{E}\left[[f(x) - y]^2\right]$
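
As a concrete illustration, the empirical versions of both losses can be computed directly from predictions; a minimal sketch (function names are mine):

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """Empirical 0-1 loss: fraction of examples with f(x_i) != y_i."""
    return np.mean(y_pred != y_true)

def l2_loss(y_pred, y_true):
    """Empirical l2 loss: mean of [f(x_i) - y_i]^2."""
    return np.mean((y_pred - y_true) ** 2)
```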

Find $f$ that minimizes

$$\hat{L}(f) = \frac{1}{n} \sum_{i=1}^n l\left(f, x_i, y_i\right)$$

  • For the 0-1 loss:
    • direct minimization is NP-hard
    • use surrogate loss functions instead
  • For the $l_2$ loss:
    • $\hat{L}(f_w) = \frac{1}{n} \| Xw - Y \|_2^2$ is a convex function
    • $0 = \nabla \| Xw - Y \|_2^2 = 2 X^T (Xw - Y)$
    • $w = (X^T X)^{-1} X^T Y$ (checked numerically in the sketch below)
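
A quick numerical check of the closed-form solution on synthetic data; the data shapes are illustrative, and `np.linalg.solve` is used instead of an explicit matrix inverse for stability:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                   # n = 100 examples, d = 5 features
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = X @ w_true + 0.1 * rng.normal(size=100)     # noisy linear targets

# w = (X^T X)^{-1} X^T Y, solved as a linear system rather than inverting X^T X
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(w_hat)                                    # close to w_true
```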

Interpretation of the $l_2$ loss

Assumption:

$y$ is normally distributed with mean $f_w(x) = w^T x + b$ (and fixed variance $\sigma^2$)

Maximum likelihood estimation:

$$
\begin{aligned}
w_{ML} &= \argmin_w - \sum_i \log P_w(y_i \mid x_i) \\
&= \argmin_w \sum_i - \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{[y_i - f_w(x_i)]^2}{2\sigma^2} \right) \right) \\
&= \argmin_w \sum_i \log \sqrt{2\pi\sigma^2} + \frac{[y_i - f_w(x_i)]^2}{2\sigma^2} \\
&= \argmin_w \sum_i [y_i - f_w(x_i)]^2
\end{aligned}
$$

Problem of linear regression


Linear regression is very sensitive to outliers: a single point far from the trend can drag the fit toward it, because the squared error grows quadratically with the residual.

Logistic regression

Use the sigmoid (logistic) function:

ฯƒ(x)=11+eโˆ’x\sigma(x) = \frac{1}{1 + e^{-x}}

flowchart LR
    input["x, y"] --> linear_output["linear output"]
    linear_output -->|logistic function| probability["Probability"]


Model the class probabilities with the sigmoid:

$$P_w(y=1 \mid x) = \sigma\left(w^T x\right) = \frac{1}{1+\exp\left(-w^T x\right)}$$

$$P_w(y=-1 \mid x) = 1 - P_w(y=1 \mid x) = \sigma(-w^T x)$$

so the log-odds are linear in $x$:

$$\log \frac{P(y=1 \mid x)}{P(y=-1 \mid x)} = w^T x + b$$
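
A minimal sketch of these class probabilities in code, assuming the bias $b$ is absorbed into $w$ by appending a constant feature to $x$:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def prob_positive(w, x):
    """P_w(y = 1 | x) = sigma(w^T x); P_w(y = -1 | x) is 1 minus this."""
    return sigmoid(np.dot(w, x))
```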

Logistic loss

Maximum likelihood estimate:

$$
\begin{aligned}
w_{ML} &= \argmin_w \sum_i -\log P_w(y_i \mid x_i) \\
&= \argmin_w \sum_i -\log \sigma(y_i w^T x_i)
\end{aligned}
$$

i.e., minimizing the logistic loss

$$l(w, x_i, y_i) := -\log \sigma(y_i w^T x_i)$$
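
Since $-\log \sigma(m) = \log(1 + e^{-m})$, the logistic loss can be computed in a numerically stable way with `np.logaddexp`; a sketch assuming labels $y_i \in \{-1, +1\}$:

```python
import numpy as np

def logistic_loss(w, x, y):
    """l(w, x, y) = -log sigma(y * w^T x) = log(1 + exp(-y * w^T x))."""
    margin = y * np.dot(w, x)
    # logaddexp(0, -m) computes log(1 + exp(-m)) without overflow for large |m|
    return np.logaddexp(0.0, -margin)
```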

Gradient descent

To minimize the empirical loss $\hat{L}(w) = \frac{1}{n} \sum_i l(w, x_i, y_i)$, use gradient descent:

wโ†wโˆ’ฮทโˆ‡L^(w)w \leftarrow w - \eta \nabla\hat{L} (w)

Stochastic gradient descent (SGD)

wโ†wโˆ’ฮทโˆ‡l(w,xi,yi)w \leftarrow w - \eta \nabla l(w, x_i, y_i)

Mini-batch SGD: average the gradient over a batch of $b$ examples:

โˆ‡L^=1bโˆ‘i=1bโˆ‡l(w,xi,yi)\nabla \hat{L}=\frac{1}{b} \sum_{i=1}^b \nabla l\left(w, x_i, y_i\right)

โˆ‡l(w,xi,yi)=โˆ’โˆ‡wlogโกฯƒ(yiwTxi)=โˆ’โˆ‡wฯƒ(yiwTxi)ฯƒ(yiwTxi)=โˆ’ฯƒ(yiwTxi)ฯƒ(โˆ’yiwTxi)yixiฯƒ(yiwTxi)=โˆ’ฯƒ(โˆ’yiwTxi)yixi\begin{aligned} \nabla l\left(w, x_i, y_i\right)&=-\nabla_w \log \sigma\left(y_i w^T x_i\right) \\ &=-\frac{\nabla_w \sigma\left(y_i w^T x_i\right)}{\sigma\left(y_i w^T x_i\right)} \\ &=-\frac{\sigma\left(y_i w^T x_i\right) \sigma\left(-y_i w^T x_i\right) y_i x_i}{\sigma\left(y_i w^T x_i\right)} \\ &= \boxed{-\sigma(-y_iw^Tx_i)y_ix_i} \end{aligned}

SGD:

wโ†w+ฮทฯƒ(โˆ’yiwTxi)yixiw \leftarrow w+\eta \sigma\left(-y_i w^T x_i\right) y_i x_i

Logistic regression does not converge for linearly separable data:

Scaling $w$ by ever larger constants makes the classifier more confident and keeps increasing the likelihood of the data, so the loss can always be decreased further and $\|w\|$ grows without bound.
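
A toy check of this claim: on two separable 1-D points, scaling the weight by larger constants keeps increasing the log-likelihood toward 0, so there is no finite minimizer.

```python
import numpy as np

# two linearly separable 1-D points: x = -1 labeled -1, x = +1 labeled +1
X = np.array([-1.0, 1.0])
y = np.array([-1.0, 1.0])
w = 1.0                                        # any w > 0 separates them
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for c in [1, 10, 100]:
    log_likelihood = np.sum(np.log(sigmoid(y * (c * w) * X)))
    print(c, log_likelihood)                   # increases toward 0 as c grows
```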

