๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

A PIECE OF DATA/๐Ÿ• ๋”ฅ๋Ÿฌ๋‹

[๋”ฅ๋Ÿฌ๋‹ ๊ธฐ์ดˆ] ๋จธ์‹ ๋Ÿฌ๋‹ ๊ธฐ์ดˆโ‘ฃ(ReLU, Vanishing gradient, dropout, ensemble)

* ๋”ฅ๋Ÿฌ๋‹ session์€ Sung Kim๋‹˜์˜ ๊ฐ•์˜ (์œ ํŠœ๋ธŒ)๋ฅผ ์š”์•ฝ/์ •๋ฆฌํ•œ ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค.

** Sung Kim๋‹˜์˜ ๊ฐ•์˜์™€ ์ž๋ฃŒ๋Š” ์•„๋ž˜์˜ ์ž๋ฃŒ๋ฅผ ์ฐธ๊ณ ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

- Andrew Ng's ML class

- Convolutional Neural Networks for Visual Recognition

- Tensorflow

- Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting"


CONTENTS

  • ReLU (Rectified Linear Unit)
  • Vanishing gradient
  • Dropout
  • Ensemble

10-1. ReLU: Better non-linearity

 

Vanishing gradient(๊ธฐ์šธ๊ธฐ ์†Œ์‹ค) ๋ฌธ์ œ

- neural network์—์„œ layer๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ ์Œ“์œผ๋ฉด ํ•™์Šต์ด ์ž˜ ๋˜์ง€ ์•Š๋Š” ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธด๋‹ค. Sigmoid ํ•จ์ˆ˜๋ฅผ ๋„ฃ์—ˆ์„ ๋•Œ ์†์‹ค ํ•จ์ˆ˜์˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋Š” ์‚ฌ๋ผ์ง€๊ฒŒ ๋œ๋‹ค. ์ด๋Š” ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆซ๊ฐ’์ด 0~1 ์‚ฌ์ด์˜ ๊ฐ’์„ ๊ฐ–๊ธฐ ๋•Œ๋ฌธ์— chain rule๋กœ ๊ณ„์‚ฐํ•  ๋•Œ 0.1 * 0.001 * 0.01 * โ€ฆ ์ด ๋œ๋‹ค๋ฉด ๊ฐ’์ด 0์— ๊ฐ€๊น๊ฒŒ ๋œ๋‹ค. 

 

Vanishing gradient ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•

- ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋กœ Sigmoid(์‹œ๊ทธ๋ชจ์ด๋“œ) ํ•จ์ˆ˜ ๋ง๊ณ  ReLUํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•œ๋‹ค. Neural Network์—์„œ ReLU๋ฅผ ์ ์šฉํ•˜๊ฒŒ ๋˜๋ฉด sigmoid ํ•จ์ˆ˜๋ณด๋‹ค cost function์—์„œ ์ž˜ ํ•™์Šตํ•ด๋‚˜๊ฐ„๋‹ค. ReLU์˜ ๋ฌธ์ œ์ ์„ ๋ณด์™„ํ•œ Leaky ReLU๋„ ์žˆ๊ณ , ๋‹ค์–‘ํ•œ ํ™œ์„ฑ ํ•จ์ˆ˜๊ฐ€ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ํ™œ์„ฑ ํ•จ์ˆ˜๋ฅผ ๊ตฌ์ฒด์ ์œผ๋กœ ์•Œ์•„๋ณด๋Š” ๊ฒƒ๋„ ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค.

 

(Figure: the ReLU function)
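A minimal sketch of the two activations discussed (NumPy; the leak slope 0.01 is a common but arbitrary choice):

```python
import numpy as np

def relu(z):
    # Passes positive values through; zeros out negatives.
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Like ReLU, but keeps a small slope alpha for z < 0, so
    # negative units still receive a gradient (the "dying ReLU" fix).
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.  0.  0.  0.5 2. ]
print(leaky_relu(z))  # [-0.02  -0.005  0.  0.5  2. ]
```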


10-2. Initialize weights

 

Initialize weights (setting the initial weight values)

- To address the vanishing gradient problem, you can either apply ReLU or set the initial weight values correctly.

- ์ดˆ๊ธฐ ๊ฐ’์€ 0์œผ๋กœ ๋‘๋ฉด ์•ˆ๋˜๊ธฐ ๋•Œ๋ฌธ์—(chain rule ์ ์šฉ ์‹œ ๊ฐ’์ด 0์ด ๋‚˜์˜ค๊ธฐ ๋•Œ๋ฌธ์—) ์ดˆ๊ธฐ๊ฐ’์„ wiseํ•˜๊ฒŒ ์„ค์ •ํ•˜๋Š” ๊ฒƒ์ด ์ข‹๋‹ค.

 

์ดˆ๊ธฐ๊ฐ’ ์„ค์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•(Restricted Boatman Machine, RBM)

RBM ๋ฐฉ๋ฒ•?

- forward์™€ backward๋กœ ์ง„ํ–‰๋  ๋•Œ x์™€ x_bar์˜ ๊ฐ’์„ ๋น„๊ตํ•˜์—ฌ ์ฐจ๊ฐ€ ์ตœ์†Œ๊ฐ€ ๋˜๋„๋ก weight์„ ์กฐ์ •ํ•˜๋Š” ๊ฒƒ

 

RBM์œผ๋กœ ์ดˆ๊ธฐ๊ฐ’ ์„ค์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•

(์˜ˆ์‹œ) layer 3๊ฐœ

1. layer1๊ณผ layer2์—์„œ forward, backward ์ง„ํ–‰

2. ์ฒ˜์Œ ๊ฐ’์ด ๋งˆ์ง€๋ง‰์— ๊ณ„์‚ฐํ•œ ๊ฐ’๊ณผ ์œ ์‚ฌํ•˜๊ฒŒ ๋‚˜์˜ค๋Š” weight ํ•™์Šต

3. ๋‹ค์Œ layer 2์™€ layer3์„ ๋˜‘๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ ๋‹ค์Œ weight ํ•™์Šต


10-3. NN dropout and model ensemble

 

Dropout(๋“œ๋กญ์•„์›ƒ)

dropout ๋ชฉ์ 

๋‹ค์ธต ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์—์„œ overfitting(๊ณผ์ ํ•ฉ)์˜ ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธด๋‹ค. ์˜ค๋ฒ„ํ”ผํŒ… ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ training data๋ฅผ ๋Š˜๋ฆฌ๊ฑฐ๋‚˜ ๊ทœ์ œ(L1, L2)๊ฐ€ ์žˆ์ง€๋งŒ neural network์—์„œ๋Š” ๋“œ๋กญ์•„์›ƒ์ด ์žˆ๋‹ค.

 

dropout ๋ฐฉ์‹

(a)์—์„œ๋Š” ๋ชจ๋“  ๋…ธ๋“œ๊ฐ€ ์—ฐ๊ฒฐ๋˜์–ด์žˆ๋Š” fully connected์ง€๋งŒ, dropout์€ (b)์ฒ˜๋Ÿผ ๋ชจ๋“  ๋…ธ๋“œ๋ฅผ ์—ฐ๊ฒฐํ•˜์ง€ ์•Š๊ณ  ๋žœ๋คํ•˜๊ฒŒ ๋ช‡ ๊ฐœ์˜ ๋…ธ๋“œ๋งŒ ์—ฐ๊ฒฐํ•ด์„œ ์ง„ํ–‰ํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค.

* dropout์€ ๋ฐ์ดํ„ฐ ํ•™์Šต ์‹œ์—๋งŒ ์‚ฌ์šฉํ•˜๊ณ  ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ๋ฅผ ์ง„ํ–‰ํ•  ๋•Œ๋Š” ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š”๋‹ค.

 

Ensemble(์•™์ƒ๋ธ”)

ensemble ๋ชฉ์ 

ensemble์€ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ชจ๋ธ๋กœ ์ด๋ฃจ์–ด์ง„ ํ•™์Šต ๋ฐฉ๋ฒ•์ด๊ธฐ ๋•Œ๋ฌธ์— ์ผ๋ฐ˜ํ™”๊ฐ€ ์ž˜ ๋˜๊ณ , ์„ฑ๋Šฅ์„ ๋ถ„์‚ฐ์‹œํ‚ค๊ธฐ ๋•Œ๋ฌธ์— overfitting ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

ensemble ๋ฐฉ์‹

1. ๋ฐ์ดํ„ฐ๋ฅผ ๋ณต์› ์ถ”์ถœ ๋ฐฉ์‹์œผ๋กœ N์˜ ๋ฐ์ดํ„ฐ ์…‹์„ ๋งŒ๋“ ๋‹ค.

2. N์˜ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚จ๋‹ค.

3. ํ•™์Šตํ•œ ๊ฒฐ๊ณผ๋ฅผ ํ•ฉ์นœ๋‹ค.