The reason this mistake breaks deep ResNets is well known. Lots of papers have shown why, but I think ours is the simplest: arxiv.org/abs/2002.10444
In short, deep ResNets are trainable if the activations on branches are much smaller than the activations on the skip connection.
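
A rough way to see the scale argument is a toy NumPy sketch. This is not the setup from the linked paper: the function name `mean_square_growth`, the random He-initialized ReLU branch, and the `branch_scale` values are all assumptions chosen for illustration. It stacks residual blocks of the form x + branch_scale * branch(x) and reports how much the mean squared activation on the skip path grows with depth.

```python
import numpy as np


def mean_square_growth(depth, width, branch_scale, seed=0):
    """Ratio of final to initial mean squared activation on the skip path.

    Hypothetical toy model: each block computes
    x <- x + branch_scale * W @ relu(x), with W drawn fresh per block.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    start = np.mean(x ** 2)
    for _ in range(depth):
        # He-style initialization: weight variance 2 / fan_in.
        w = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        branch = w @ np.maximum(x, 0.0)
        # Skip connection plus the (optionally downscaled) branch.
        x = x + branch_scale * branch
    return np.mean(x ** 2) / start


# Branch as large as the skip path vs. branch much smaller than it.
unscaled = mean_square_growth(depth=50, width=256, branch_scale=1.0)
scaled = mean_square_growth(depth=50, width=256, branch_scale=0.1)
print(f"growth with branch_scale=1.0: {unscaled:.3e}")
print(f"growth with branch_scale=0.1: {scaled:.3e}")
```

Under these assumptions, the unscaled stack should blow up roughly exponentially in depth, while the downscaled one stays within a small factor of its starting scale. That is a statement about signal scale at initialization only; the sketch does not train anything.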
Nov 21, 2024 · 11:19 AM UTC
