> the question "why do neural networks work better than other models?" is getting pretty close to a solid answer.
This would be great, as from the "classical" perspective, the results of over-parametrization (and potentially other parts of NN architecture) make no sense (to me, at least). I do accept that double descent appears to work empirically, but it really, really shouldn't. In fact, as someone who's a big fan of Hastie et al.'s Elements, the bias-variance tradeoff suggests that it shouldn't.
This has been bugging me (sporadically) for years, and any progress towards an answer would be incredibly useful (most probably in a philosophical sense I suppose).
As an aside, I've only read the Introduction, but this appears to be a well-written paper and a research program I can get behind. I really want this stuff to work.
I guess it's similar to bagging and boosting, which were empirically successful well before we had any theoretical understanding of why they work.
Hastie was actually lead author of an excellent paper that discusses the underlying phenomenon in the context of least-squares linear regression: https://arxiv.org/abs/1903.08560
It really isn't so mysterious once you begin to examine how the rule of thumb for the bias-variance tradeoff (remember that it is the relationship with model size that is curious, not the tradeoff itself) came to be. The easiest ways to arrive at this rule are through an information criterion like the AIC or BIC, where the model size appears in the penalty term for the log-likelihood. These criteria rest on a bunch of assumptions, all of which are crucial, and absolutely none of which apply to neural networks. The biggest one is that the only limiting regime is the size of the dataset, so there are vastly more data than model parameters. Neural networks have parameter counts within a constant ratio of the number of datapoints. Another is that the model has a non-singular Hessian in a neighbourhood of the optimum. Neural networks do not satisfy this either. Once you abandon the rule of thumb and actually do the math in the appropriate limiting regimes, there's no contradiction anymore.
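To make that rule of thumb concrete, here's a minimal sketch (NumPy; the data, noise level, and degree range are all made up for illustration) of AIC doing its job in exactly the regime its derivation assumes: n far larger than k, Gaussian noise, a non-degenerate fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# The AIC derivation assumes n >> k: here n = 500 observations
# against at most ~11 parameters.
n = 500
x = rng.uniform(-1.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 0.5, n)  # true model is degree 1

def aic(degree):
    """Gaussian-likelihood AIC for a degree-`degree` polynomial fit:
    n * ln(RSS / n) + 2k, with k = degree + 1 parameters."""
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    return n * np.log(rss / n) + 2 * (degree + 1)

scores = {deg: aic(deg) for deg in range(0, 11)}
best = min(scores, key=scores.get)
# The intercept-only fit (degree 0) misses the real slope, so its
# log-likelihood term is terrible; the extra degrees beyond 1 barely
# reduce the RSS, so the 2k penalty pushes AIC toward a small model.
```

In the regime the commenter describes (parameters within a constant ratio of the datapoints), the `2k` penalty is no longer a small correction to the log-likelihood, and this bookkeeping stops being meaningful.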
I've found that the biggest mystery for people, though, is the fact that performance actually _improves_ after the interpolation threshold. This seems insane if you come at it from the point of view that the model "could have done anything" once there are more parameters than data. But this isn't true at all. The fact that you obtained _a solution_ means that some implicit bias guided which solution you ended up in. For linear regression, that is often the minimum L2 norm solution, which _literally_ minimizes the variance keeping all else fixed. If you add more parameters to play with, obviously it should be able to minimize the variance even further, right? If the bias is zero and the variance is reduced, you get better performance. If you use a different optimizer than gradient descent, you can end up at the minimum L1 norm solution (effectively LASSO), which is well known to perform well regardless of the number of parameters.
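The minimum-norm point is easy to see numerically in the linear case. A small sketch (NumPy; the dimensions are arbitrary): with more parameters than data points there are infinitely many interpolating solutions, the pseudoinverse picks the shortest one, and any other interpolant is strictly longer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-parametrized least squares: d parameters, n < d data points,
# so the system X @ w = y has infinitely many exact solutions.
n, d = 20, 60
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# The Moore-Penrose pseudoinverse selects the minimum-L2-norm interpolant.
w_min = np.linalg.pinv(X) @ y              # X @ w_min == y up to roundoff

# Any other interpolant differs from w_min by a null-space direction of X.
# w_min lies in the row space of X, orthogonal to the null space, so the
# perturbed solution is strictly longer in L2 norm.
v = rng.standard_normal(d)
v_null = v - np.linalg.pinv(X) @ (X @ v)   # project v into the null space
w_other = w_min + v_null                   # still satisfies X @ w_other == y
```

Both `w_min` and `w_other` have zero training error; they differ only in the norm, which is exactly the quantity the implicit bias controls.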
Of course, linear regression is not neural network regression, and the situation in deep learning is far more complicated. But the same idea applies. Every single part of the training procedure is carefully designed to bias the obtained solution toward something with minimal variance. Stochastic optimizers (even dropout) settle in wide minima, which have smaller variances. Some optimizers prioritize stronger correlations in the weights. Bottlenecks in the architecture induce low-rank solutions. Data augmentation induces known invariances that reduce variance along those directions. Convolutional designs induce regularity with respect to the input space. Neural networks are not magic; they are the product of hundreds of intentional design decisions over decades. When you increase the size of the model, all of these features are exacerbated.
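The "implicit bias of the optimizer" point can be checked directly in the linear case. A minimal sketch (NumPy; dimensions and step count are arbitrary): plain gradient descent on the squared loss, started at zero, converges to exactly the minimum-norm interpolant that the pseudoinverse would give, so the optimizer, not the loss, picks which of the infinitely many solutions you get.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 15, 40                          # more parameters than data points
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Plain gradient descent on the squared loss, initialized at zero.
# Every gradient X.T @ (...) lies in the row space of X, so the iterates
# never pick up a null-space component: the run lands on the
# minimum-L2-norm interpolant, with no explicit regularizer anywhere.
w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, 2) ** 2   # step size below 2/L for stability
for _ in range(20000):
    w -= lr * (X.T @ (X @ w - y))

w_pinv = np.linalg.pinv(X) @ y         # the minimum-norm solution directly
```

Swap in a different optimizer or a nonzero initialization and the limit point changes, which is the whole point: "which solution" is a property of the training procedure.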
Quantifying all of this in the theory is difficult because there are a lot of moving parts. But if you study a simplified model and consider each mechanism individually, the picture becomes pretty clear.
Check out Andrew Gordon Wilson's excellent paper "Deep Learning is Not So Mysterious or Different" for a discussion of the ways in which existing learning theory does and doesn't apply to neural nets.
> The most extreme example I've worked in was in Dublin, where there was an explicit "you are given 8 hours of work, and 8 hours to do it in. If you need to stay longer than that then you must be incompetent", and the entire office, everyone, emptied into the pub at 5pm. All the socialising and "cooler chat" happened over pints of Guinness in the pub. The folks with kids would have one or two and then go home, or not drink at all and then go home. The less attached folks stayed on for several. But everyone came to the pub at 5, regardless.
I want to call out that while Irish working hours are generally pretty capped, most people at most companies definitely don't go to the pub at 5pm. I'm Irish and work in Ireland (mostly for multinationals), so 5pm pub time (unfortunately) doesn't work when you need to talk to California.
Additionally, I normally agitate for the whole "8 and only 8 hours of work" thing, as lots of professionals in Ireland are quite driven (or people-pleasing) and tend to work longer hours.
That being said, there are some employers where this definitely is a thing (particularly on Thursday or Friday), but it's 100% not the standard.
I've only worked at this one place in Ireland, so there was definitely a tendency to say "the Irish" when I actually only knew "this one workplace in Ireland". Thanks for the clarification :) From now on I'll preface it with "it's not the norm, but I worked at this one place in Ireland where...".
> I've only worked at this one place in Ireland, so there was definitely a tendency to say "the Irish" when I actually only knew "this one workplace in Ireland".
To be fair, this does happen a lot if there are visitors to the office. I can certainly believe everyone going out in that case. Alternatively, if the team is pretty young and single this would definitely happen. When I worked at FB there was a big drinking culture (not just in Ireland), so again I'd believe it there.
Instagram had around 10mn users at acquisition, so they might not have gotten to where they are without FB. WhatsApp was a successful product that didn't make any money.
When you have Apple level margins then you can definitely consider long term ROI (such as this entire thread, for example). Long term greedy, as they say.