Probabilistic synthesis and causality as program editing

You may have read the exchange between Scott Alexander and Gary Marcus, and felt that there are some good arguments on both sides, some bad ones, but few arguments that go beyond analogy and handwaving - arguments that would take what we know about deep learning and intelligence, and look at what that knowledge implies. Will scaling deep learning produce human-level generality, or do we need a new approach? If you haven't read the exchange, here it is: SA, GM, SA, GM.

I will argue for Marcus' position, but dive a little deeper than he does. I believe that symbolic representations - specifically programs - and learning as program synthesis can provide data-efficient and flexible generalization in a way that deep learning can't, no matter how much we scale it. I'll show how probabilistic programs can represent causal models of the world, which deep learning can't do, and why causal models are essential to intelligence. But I'll start by examining the opposing view: that scaling deep learning is sufficient for general intelligence. To that end, I'll quote from Gwern's thorough essay on the scaling hypothesis.

The scaling hypothesis and the laziness of deep learning

"We can simply train ever larger NNs and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data."

Gwern cites a swathe of papers in support, interpreting them in such a way that the following picture emerges: "neural nets are lazy" - sub-models which memorize pieces of the data, or latch onto superficial features, learn quickest and are the easiest to represent internally. Eventually, after enough examples and enough updates, there may be a phase transition (Viering & Loog 2021), and the simplest 'arithmetic' model which accurately predicts the data just is arithmetic. If the model & data & compute are not big or varied enough, the optimization, by the end of the cursory training, will have only led to a sub-model which achieves a low loss but missed important pieces of the desired solution. "(…) if there is enough data & compute to push it past the easy convenient sub-models and into the sub-models which express desirable traits like generalizing, factorizing perception into meaningful latent dimensions, meta-learning tasks based on descriptions, learning causal reasoning & logic, and so on."

Neural nets are indeed "lazy", in that their loss functions are minimized by "shortcuts" - solutions that don't generalize beyond the data distribution. A figure from the paper "Shortcut Learning in Deep Neural Networks" by Geirhos et al. illustrates this well (figure credit: Geirhos et al.). Among the set of all possible rules, only some solve the training data. Among the solutions that solve the training data, only some generalize to an i.i.d. test set. Among those solutions, shortcuts fail to generalize to differently distributed (o.o.d.) data.

So, the scaling hypothesis says that at a large enough scale, the solution that lazy optimization settles on is the desired one, because at that scale the desired solution both can be represented by a practically sized NN and has a lower loss than shortcut solutions.
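To make the shortcut picture concrete, here is a minimal toy sketch (not from the original post, and not the Geirhos et al. setup) in plain numpy. It assumes a hypothetical two-feature task: the intended rule is a noisy "shape" feature, while a spurious "background" feature happens to be perfectly correlated with the label at training time. A deliberately lazy learner that thresholds whichever single feature looks best on the training data grabs the shortcut, still looks fine on i.i.d. data, and collapses out of distribution. The names make_data and lazy_single_feature_rule, and all the specific numbers, are illustrative assumptions.

```python
# Toy illustration (assumed setup, not from the post): a lazy single-feature
# learner prefers a spurious cue that is easier than the intended rule, then
# fails when the spurious correlation is broken (o.o.d.).
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, spurious_corr):
    """Labels follow the sign of a latent 'shape'; feature 0 is a noisy view of it,
    feature 1 is a 'background' cue that agrees with the label with prob. spurious_corr."""
    shape = rng.normal(size=n)
    y = (shape > 0).astype(int)                           # intended rule
    x0 = shape + rng.normal(scale=1.0, size=n)            # noisy view of the shape (~75% separable)
    agree = rng.random(n) < spurious_corr
    x1 = np.where(agree, y, 1 - y) + rng.normal(scale=0.1, size=n)  # spurious background cue
    return np.stack([x0, x1], axis=1), y

def lazy_single_feature_rule(X, y):
    """'Lazy' learner: threshold the single feature with the best training accuracy."""
    best = None  # (train_acc, feature_index, threshold, sign)
    for j in range(X.shape[1]):
        thr = X[:, j].mean()
        acc_gt = ((X[:, j] > thr) == y).mean()    # predict 1 when feature > threshold
        acc_le = ((X[:, j] <= thr) == y).mean()   # predict 1 when feature <= threshold
        acc, sign = (acc_gt, 1) if acc_gt >= acc_le else (acc_le, -1)
        if best is None or acc > best[0]:
            best = (acc, j, thr, sign)
    return best

def accuracy(rule, X, y):
    _, j, thr, sign = rule
    pred = (X[:, j] > thr) if sign == 1 else (X[:, j] <= thr)
    return (pred == y).mean()

X_train, y_train = make_data(5000, spurious_corr=1.0)    # shortcut works perfectly here
X_iid, y_iid     = make_data(5000, spurious_corr=1.0)    # same distribution as training
X_ood, y_ood     = make_data(5000, spurious_corr=0.0)    # spurious cue now anti-correlated

rule = lazy_single_feature_rule(X_train, y_train)
print("picked feature:", rule[1])                        # expect 1, the background shortcut
print("i.i.d. accuracy:", accuracy(rule, X_iid, y_iid))  # high: shortcut still holds
print("o.o.d. accuracy:", accuracy(rule, X_ood, y_ood))  # near zero: shortcut broken
```

The design choice mirrors the argument above: on the training distribution the shortcut has lower error than the intended (noisy) rule, so a learner that only minimizes training loss has no reason to prefer the rule that generalizes; whether more scale and more varied data flips that preference is exactly what the scaling hypothesis asserts.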