Details
Residual networks (ResNets) are highly versatile neural network architectures that have demonstrated excellent performance in deep learning applications. They have also been widely studied in the theoretical literature due to their presumed relation with 'neural ODEs', i.e. the assumption that in the deep-network limit the behavior of a ResNet is well approximated by a system of ordinary differential equations. We study this relation in detail by investigating the asymptotic properties of deep ResNets both empirically, using trained weights from large-scale deep ResNets, and mathematically, by studying scaling limits of ResNets as the number of layers increases.
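To fix ideas, here is a schematic version of the neural ODE viewpoint (illustrative notation only, not the precise setting of the paper: $L$ is the depth, $f$ the residual block, $\theta_k$ the layer weights):
\[
h_{k+1} = h_k + \tfrac{1}{L}\, f(h_k, \theta_k), \qquad k = 0, \dots, L-1,
\]
which, provided the weights $\theta_k$ vary smoothly with the layer index, is the Euler discretization of the ODE $\dot h(t) = f\big(h(t), \theta(t)\big)$ on $[0,1]$.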
We show that the scaling behavior of trained weights differs from what is implicitly assumed in the neural ODE literature, leading in the generic case to a diffusion limit in which the (hidden) state of the neurons is the solution of a stochastic differential equation driven by the weights.
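Schematically, and under assumptions only sketched here: if the effective scaling of the residual updates is $1/\sqrt{L}$ rather than $1/L$, and partial sums of the layer weights behave like a Brownian-type process $W$, then the recursion
\[
h_{k+1} = h_k + \tfrac{1}{\sqrt{L}}\, f(h_k, \theta_k)
\]
no longer converges to an ODE; instead the hidden state converges to the solution of a stochastic differential equation of the form $dH_t = \Sigma(H_t)\, dW_t$, driven by the limit of the weights.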
We show that, in the same scaling regime, the backpropagation dynamics lead to a forward-backward system of stochastic differential equations for the joint dynamics of the hidden state and the gradient of the training objective. This scaling limit has a universal form which does not depend on the details of the activation function, only on its local behavior at zero.
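In the same schematic notation, the backward pass obeys the recursion
\[
g_k = g_{k+1}\Big(I + \tfrac{1}{\sqrt{L}}\, \partial_h f(h_k, \theta_k)\Big),
\]
where $g_k$ denotes the gradient of the training objective with respect to the hidden state $h_k$; in the scaling limit this yields, jointly with the forward equation for $H$, a forward-backward system of stochastic differential equations driven by the same noise $W$. This is a caricature rather than the precise statement; the universality claim is that the coefficients of the limiting system depend on the activation function only through its local behavior at zero.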
Joint work with RenYuan Xu (NYU) and Alain Rossier (Oxford).