[Paper Review] ResNet Paper Review

Today I'll review ResNet, a famous paper I read recently.

Paper link: https://arxiv.org/abs/1512.03385

Deep Residual Learning for Image Recognition


Abstract

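Deeper neural networks are more difficult to train.
Residual networks are easier to optimize and gain accuracy from considerably increased depth.
On the ImageNet dataset, the authors evaluate residual nets up to 152 layers deep, 8x deeper than VGG nets, yet still with lower complexity.
An ensemble of these nets achieves 3.57% top-5 error on the ImageNet test set, which won 1st place in the ILSVRC 2015 classification task.
They also won 1st place in ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation at the ILSVRC & COCO 2015 competitions.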

Introduction

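The "levels" of features are enriched by the number of stacked layers (depth).
Recent work shows that network depth is crucial, and the leading ImageNet results all use very deep models, sixteen to thirty layers deep.
This raises a question about depth: "Is learning better networks as easy as stacking more layers?"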

There are two obstacles here:

1. Convergence Problem

2. Degradation Problem

(1) Convergence Problem

Caused by vanishing/exploding gradients.
They hamper convergence from the very start of training.
This problem has largely been addressed by normalized initialization and intermediate normalization layers.
-> With these, networks with tens of layers can start converging under SGD (stochastic gradient descent) with backpropagation.

(2) Degradation Problem

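Once deeper networks are able to converge, the degradation problem appears.
As network depth increases, accuracy saturates and then degrades rapidly.
Unexpectedly, degradation is not caused by overfitting: adding more layers to a suitably deep model leads to higher training error.

Consider a shallower architecture and a deeper counterpart that adds more layers onto it.

A deeper model can be built by construction: the added layers are identity mappings, and the other layers are copied from the learned shallower model.
So the deeper model should have training error no higher than the shallower one, but in practice the solvers did not find solutions that good.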
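The construction argument above can be checked on a toy example. This is only a sketch: `shallow_net` and its weights are made up for illustration and stand in for a trained shallow model. The deeper model appends identity layers, so its output (and therefore its training error) matches the shallow one exactly.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(weights, v):
    # Matrix-vector product with the matrix stored as a list of rows.
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def shallow_net(x):
    # Hypothetical "trained" shallow model; the weights are arbitrary.
    w1 = [[0.5, -1.0], [1.5, 0.25]]
    w2 = [[1.0, 2.0], [-0.5, 0.75]]
    return linear(w2, relu(linear(w1, x)))


def identity_layer(v):
    return list(v)


def deep_net(x, extra_layers=10):
    # Deeper counterpart: copy the shallow model, then add identity layers.
    out = shallow_net(x)
    for _ in range(extra_layers):
        out = identity_layer(out)
    return out


print(shallow_net([1.0, 2.0]) == deep_net([1.0, 2.0]))
```

A solution at least this good exists for the deeper model, which is why the higher training error observed in practice points to an optimization difficulty.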

This paper addresses the degradation problem by introducing a deep residual learning framework.

Denote the desired underlying mapping as H(x); the stacked nonlinear layers are made to fit the residual F(x) = H(x) - x instead.
The original mapping is recast as F(x) + x.
The hypothesis is that the residual mapping is easier to optimize than the original, unreferenced mapping.
F(x) + x can be realized by a feedforward neural network with shortcut connections.
A shortcut connection skips one or more layers.
In this paper the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers.
Identity shortcut connections add neither extra parameters nor computational complexity.
The whole network can still be trained end-to-end by SGD with backpropagation and is easy to implement.
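As a rough sketch of what "a shortcut that skips one or more layers" means, the wrapper below turns any stack of layers F into a block computing F(x) + x. The names `add_shortcut` and `two_layers` are hypothetical and only illustrate the idea; the wrapper itself holds no parameters.

```python
def add_shortcut(stacked_layers):
    """Wrap a stack of layers F so that the result computes F(x) + x."""
    def block(x):
        f = stacked_layers(x)
        if len(f) != len(x):
            raise ValueError("identity shortcut needs matching dimensions")
        # Element-wise addition of the shortcut (x) and the stacked output.
        return [fi + xi for fi, xi in zip(f, x)]
    return block


def two_layers(x):
    # Hypothetical stand-in for two stacked layers: scale, ReLU, scale.
    hidden = [max(0.0, 0.5 * a) for a in x]
    return [2.0 * h for h in hidden]


block = add_shortcut(two_layers)
print(block([1.0, -3.0]))
```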

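The authors run experiments, mainly on ImageNet, to show the degradation problem and evaluate the proposed method.

They show two things:

1) Extremely deep residual nets are easy to optimize, whereas the counterpart plain nets that simply stack layers without residuals show higher training error as depth increases.

2) Deep residual nets gain accuracy from increased depth and outperform previous networks.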

Related Work

Residual Representation

In vector quantization, encoding residual vectors is shown to be more effective than encoding the original vectors.
A good reformulation or preconditioning can simplify optimization.

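Shortcut Connection

"Highway networks" present shortcut connections with gating functions.
The gates in highway networks are data-dependent and have parameters.
- In contrast, the identity shortcuts here are parameter-free.
When a gated shortcut is "closed" (approaching zero), the layers in a highway network represent non-residual functions.
- In contrast, this formulation always learns residual functions; the identity shortcuts are never closed.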
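The contrast above can be sketched with scalar units. The highway form used here, y = T(x)·H(x) + (1 - T(x))·x with a sigmoid gate T, is the commonly cited formulation of highway networks and is an assumption on my part, since the review does not spell it out; `gate_weight` and `gate_bias` are hypothetical parameter names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_unit(x, h, gate_weight, gate_bias):
    # The transform gate t depends on the data (x) and on learned parameters.
    t = sigmoid(gate_weight * x + gate_bias)
    # The shortcut carries x with weight (1 - t); it closes as t approaches 1.
    return t * h(x) + (1.0 - t) * x


def residual_unit(x, f):
    # Identity shortcut: no gate, no parameters, never closed.
    return f(x) + x


def square(v):
    return v * v


closed = highway_unit(2.0, square, gate_weight=0.0, gate_bias=50.0)
opened = highway_unit(2.0, square, gate_weight=0.0, gate_bias=-50.0)
print(closed, opened, residual_unit(2.0, square))
```

With a large positive gate bias the shortcut is effectively closed and the unit returns roughly H(x) alone, which is the non-residual case; the residual unit always keeps the x term.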

Deep Residual Learning

(1) Residual Learning

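Let H(x) be the underlying mapping to be fit by a few stacked layers.

Rather than fitting H(x) directly, the layers fit F(x) = H(x) - x, so the original function becomes H(x) = F(x) + x; this form is expected to be easier to learn.

With the residual reformulation, if identity mappings are optimal, the solver can simply drive the weights of the layers toward zero to approach identity.

-> In real cases identity mappings are unlikely to be optimal, but the reformulation may help to precondition the problem.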
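A small numeric illustration of the preconditioning idea, under the assumption that the target mapping is close to identity. The mapping `near_identity` is invented for this example. When H(x) stays near x, the residual F(x) = H(x) - x that the layers have to fit is much smaller than H(x) itself.

```python
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def near_identity(x):
    # Hypothetical target mapping H, a small perturbation of identity.
    return [1.02 * a + 0.01 for a in x]


x = [1.0, -2.0, 3.0]
h = near_identity(x)
residual = [hi - xi for hi, xi in zip(h, x)]  # F(x) = H(x) - x

print(norm(h), norm(residual))
```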

(2) Identity Mapping by Shortcuts

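Residual learning is applied to every few stacked layers.

A building block is defined as:

Eqn (1) y = F(x, {W_i}) + x

F(x, {W_i}) is the residual mapping to be learned.
For a two-layer block, F = W_2 σ(W_1 x), where σ denotes ReLU (biases omitted).
The operation F + x is the shortcut connection: an element-wise addition.
It introduces no extra parameters and no extra computational complexity (apart from the negligible element-wise addition).
In Eqn (1), the dimensions of x and F must be equal.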
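Eqn (1) with the two-layer F = W_2 σ(W_1 x) can be written out directly. This is a minimal sketch on plain Python lists, with biases omitted as in the notation above; the weight values are arbitrary, and it only covers Eqn (1) itself, not any nonlinearity applied after the addition.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(w, v):
    # Matrix-vector product with the matrix stored as a list of rows.
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def residual_block(x, w1, w2):
    # F(x) = W_2 * relu(W_1 * x)
    f = matvec(w2, relu(matvec(w1, x)))
    if len(f) != len(x):
        raise ValueError("Eqn (1) needs F and x to have the same dimension")
    # y = F(x) + x, an element-wise addition with no parameters of its own
    return [fi + xi for fi, xi in zip(f, x)]


w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[0.5, 0.0], [0.0, 0.5]]
y = residual_block([2.0, -4.0], w1, w2)
print(y)
```

Setting both weight matrices to zero makes the block return x unchanged, which matches the remark that the solver can approach identity by driving the weights toward zero.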

Eqn (2) y = F(x, {W_i}) + W_s x

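Eqn (1) requires x and F to have the same dimensions.
When they differ (e.g., when the number of input/output channels changes), a linear projection W_s can be applied on the shortcut connection to match the dimensions.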
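A sketch of Eqn (2), again on plain lists. `residual_fn` is a hypothetical stand-in for the learned F that happens to produce a 3-dimensional output from a 2-dimensional input, and the entries of `w_s` are arbitrary.

```python
def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def projection_block(x, residual_fn, w_s):
    f = residual_fn(x)
    shortcut = matvec(w_s, x)  # W_s x maps x into the dimension of F
    if len(f) != len(shortcut):
        raise ValueError("W_s must project x to the dimension of F")
    return [fi + si for fi, si in zip(f, shortcut)]


def residual_fn(x):
    # Hypothetical F with a larger output dimension than its input.
    return [0.5, 0.25, -1.0]


w_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 x 2 projection
y = projection_block([1.0, 2.0], residual_fn, w_s)
print(y)
```

Unlike the identity shortcut, W_s adds parameters (here 3 x 2 = 6 weights).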

(3) Network Architectures

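Both plain and residual nets were tested.

Plain network

Inspired by the philosophy of VGG nets.

Residual Network

Shortcut connections are inserted into the plain network above.

When input and output have the same dimensions, Eqn (1) can be used directly (solid lines in the figure).
When the output dimension increases, two options are considered:
- (A): the shortcut still performs identity mapping, with extra zero entries padded for the increased dimensions
- (B): the projection shortcut in Eqn (2) is used to match the dimensions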
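Option (A) from the list above can be sketched as follows. The function names are hypothetical, and vectors stand in for channel dimensions. The shortcut pads x with zeros, so it stays parameter-free, but the padded positions receive F alone.

```python
def zero_pad_shortcut(x, out_dim):
    if out_dim < len(x):
        raise ValueError("zero padding can only increase the dimension")
    # Copy x and append zeros for the newly added dimensions.
    return list(x) + [0.0] * (out_dim - len(x))


def block_option_a(x, residual_fn, out_dim):
    f = residual_fn(x)
    shortcut = zero_pad_shortcut(x, out_dim)
    return [fi + si for fi, si in zip(f, shortcut)]


def residual_fn(x):
    # Hypothetical F that doubles the dimension (values are arbitrary).
    return [0.5, 0.5, 0.5, 0.5]


print(zero_pad_shortcut([1.0, 2.0], 4))
print(block_option_a([1.0, 2.0], residual_fn, 4))
```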

(4) Implementation

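Each image is resized so that its shorter side is randomly sampled in [256, 480] (scale augmentation).
A 224 x 224 crop is randomly sampled from the image or its horizontal flip, with the per-pixel mean subtracted.
Standard color augmentation is used.
Batch Normalization (BN) is applied right after each convolution and before the activation.
SGD is used with a mini-batch size of 256.
The learning rate starts at 0.1 and is divided by 10 when the error plateaus.
Training runs for up to 60 x 10^4 (600,000) iterations.
Weight decay of 0.0001 and momentum of 0.9 are used.
Dropout is not used.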
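The optimizer settings above can be sketched in a few lines. Two things here are assumptions rather than details from the paper: the exact form of the momentum update (one common convention, with weight decay added to the gradient) and the plateau test in `maybe_decay_lr`, whose `patience` and `min_delta` parameters are invented for illustration.

```python
def sgd_momentum_step(weights, grads, velocity, lr,
                      momentum=0.9, weight_decay=1e-4):
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # L2 weight decay folded into the gradient
        v = momentum * v + g       # momentum buffer
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


def maybe_decay_lr(lr, error_history, patience=3, min_delta=1e-3):
    # Hypothetical plateau rule: divide by 10 when the last `patience`
    # errors did not improve on the earlier best by at least `min_delta`.
    if len(error_history) <= patience:
        return lr
    best_before = min(error_history[:-patience])
    recent_best = min(error_history[-patience:])
    if best_before - recent_best < min_delta:
        return lr / 10.0
    return lr


w, v = sgd_momentum_step([1.0], [0.5], [0.0], lr=0.1)
print(w, v)
print(maybe_decay_lr(0.1, [0.5, 0.4, 0.4, 0.4, 0.4]))
```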

For comparison studies, standard 10-crop testing is adopted.
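For readers unfamiliar with 10-crop testing, the usual recipe takes the four corner crops and the center crop, plus the horizontal flip of each, and averages the predictions. The helper below only computes the crop positions; its name and return format are my own choices for illustration.

```python
def ten_crop_boxes(width, height, crop=224):
    """Return 10 (left, top, flipped) crop specs: 4 corners + center, x2 flips."""
    if width < crop or height < crop:
        raise ValueError("image is smaller than the crop size")
    right = width - crop
    bottom = height - crop
    origins = [
        (0, 0),                      # top-left
        (right, 0),                  # top-right
        (0, bottom),                 # bottom-left
        (right, bottom),             # bottom-right
        (right // 2, bottom // 2),   # center
    ]
    return [(left, top, flipped)
            for flipped in (False, True)
            for (left, top) in origins]


boxes = ten_crop_boxes(256, 256)
print(len(boxes), boxes[4])
```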

For the best results, the fully-convolutional form is adopted, and the scores are averaged at multiple scales.

Experiments

(1) ImageNet Classification

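Residual nets perform better than plain nets, and among the residual nets the deeper ones perform better.

Plain Networks

The 34-layer plain net has higher training error than the 18-layer plain net throughout training.

-> The authors argue that this optimization difficulty is unlikely to be caused by vanishing gradients.

The plain nets are trained with BN, which ensures forward-propagated signals have non-zero variances,

and the authors also verify that the backward-propagated gradients exhibit healthy norms with BN.

-> So neither forward nor backward signals vanish.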
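To see why BN keeps the forward signal from vanishing, here is a minimal training-mode sketch of batch normalization over one feature (batch statistics only, no running averages). The `gamma`, `beta`, and `eps` defaults are typical values I am assuming, not settings taken from the paper.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # Normalize to zero mean and (near) unit variance, then scale and shift.
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


small_signal = [0.1, 0.2, 0.3, 0.4]
normalized = batch_norm(small_signal)
print(variance(small_signal), variance(normalized))
```

The input variance is small, but the normalized activations come out with variance close to 1, so the signal passed to the next layer does not shrink toward zero.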

The authors conjecture that deep plain nets may have exponentially low convergence rates.

Residual Networks

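Option A) Identity mapping is used for all shortcuts, with zero-padding for increasing dimensions.

-> So the residual nets have no extra parameters compared to their plain counterparts.

The 34-layer ResNet performs better than the 18-layer ResNet.

Most importantly, the 34-layer ResNet also has lower training error.

-> This indicates that the degradation problem is well addressed!

-> It shows the effectiveness of residual learning on extremely deep systems.

The 18-layer plain and residual nets are comparably accurate, but the 18-layer ResNet converges faster.

When the net is not overly deep (18 layers here), the SGD solver can still find good solutions for the plain net.

-> In this case, ResNet eases optimization by providing faster convergence at the early stage.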

Identity vs Projection Shortcuts

Parameter-free identity shortcuts have already been shown to help training (they mitigate the degradation problem).

Next, projection shortcuts (Eqn (2)) are investigated.

Three options are compared:

(A) zero-padding shortcuts for increasing dimensions; all shortcuts are parameter-free
(B) projection shortcuts for increasing dimensions; identity shortcuts otherwise
(C) projection shortcuts everywhere
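The three options above differ only in which shortcut each block gets. A small helper makes the rule explicit; the function name and the returned labels are hypothetical.

```python
def choose_shortcut(option, in_dim, out_dim):
    """Return the shortcut type a block would use under option A, B, or C."""
    if option not in ("A", "B", "C"):
        raise ValueError("option must be 'A', 'B', or 'C'")
    if option == "C":
        return "projection"        # projections everywhere
    if in_dim == out_dim:
        return "identity"          # A and B agree when dimensions match
    # Dimensions increase: A pads with zeros, B projects.
    return "zero_pad" if option == "A" else "projection"


for option in ("A", "B", "C"):
    print(option,
          choose_shortcut(option, 64, 64),
          choose_shortcut(option, 64, 128))
```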

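B is slightly better than A.

-> The authors argue this is because the zero-padded dimensions in A indeed have no residual learning.

C is marginally better than B.

-> The authors attribute this to the extra parameters introduced by the projection shortcuts.

-> The small differences among A/B/C indicate that projection shortcuts are not essential for addressing the degradation problem.

So option C is not used in the rest of the paper, to reduce memory/time complexity and model size; identity shortcuts are particularly important for not increasing complexity.

Why they matter is explained with the bottleneck architecture in the next section.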

Deeper Bottleneck Architectures

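To keep training time affordable, the building block is modified into a bottleneck design.

For each residual function F, a stack of 3 layers is used instead of 2.

The three layers are 1x1, 3x3, and 1x1 convolutions.

The 1x1 layers reduce and then increase (restore) the dimensions.
The 3x3 layer is thereby a bottleneck with smaller input/output dimensions.

The two designs in the figure have similar time complexity.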
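A quick count makes the "similar complexity" claim concrete. The channel widths are assumptions taken from the paper's figure, which is not reproduced in this post: a basic block with two 3x3 convolutions on 64 channels, and a bottleneck block on 256 channels with a 64-channel middle. Weight count is used as a rough proxy for cost per spatial position; biases and BN parameters are ignored.

```python
def conv_weights(kernel, in_ch, out_ch):
    # Number of weights in a kernel x kernel convolution (no bias).
    return kernel * kernel * in_ch * out_ch


# Basic block: 3x3 (64 -> 64), 3x3 (64 -> 64)
basic_block = conv_weights(3, 64, 64) + conv_weights(3, 64, 64)

# Bottleneck block: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256)
bottleneck_block = (conv_weights(1, 256, 64)
                    + conv_weights(3, 64, 64)
                    + conv_weights(1, 64, 256))

print(basic_block, bottleneck_block)
```

Under these assumptions the bottleneck block works on 4x wider inputs and outputs while using slightly fewer weights than the basic block.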

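Parameter-free identity shortcuts are particularly important for the bottleneck architecture.

-> If the identity shortcut were replaced with a projection, the time complexity and model size would double, because the shortcut connects the two high-dimensional ends.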
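The doubling can be checked with the same kind of rough weight count, again assuming the figure's widths (256 channels at both ends, 64 in the middle) and ignoring biases and BN. A projection shortcut here would be a 1x1 convolution from 256 to 256 channels.

```python
def conv_weights(kernel, in_ch, out_ch):
    # Number of weights in a kernel x kernel convolution (no bias).
    return kernel * kernel * in_ch * out_ch


# Bottleneck residual branch: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256)
bottleneck_branch = (conv_weights(1, 256, 64)
                     + conv_weights(3, 64, 64)
                     + conv_weights(1, 64, 256))

# Hypothetical projection shortcut between the two 256-channel ends.
projection_shortcut = conv_weights(1, 256, 256)

ratio = (bottleneck_branch + projection_shortcut) / bottleneck_branch
print(bottleneck_branch, projection_shortcut, round(ratio, 2))
```

The projection alone costs almost as much as the whole residual branch, so the block roughly doubles in size, while an identity shortcut costs nothing.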

-> In the experiments, the 50/101/152-layer ResNets are all more accurate than the 34-layer one.

No degradation problem is observed, so significant accuracy gains from increased depth are obtained.

The paper then compares against other networks on CIFAR-10 and MS COCO, where ResNet is also reported to perform very well.

The key points have been covered above, so I'll skip the rest!


완둑콩의 연구실
