The bitter lesson is about not trying to encode impossible-to-formalize conceptual knowledge; it is not a mandate to abandon data efficiency and scale the model to ever higher parameter counts.
If we followed this logic, we'd be training LLMs on character-level UTF-32 and just letting them figure everything out by themselves, while needing contexts and parameter counts two orders of magnitude larger.
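For a rough sense of scale (the ~4 characters/token BPE rate below is a back-of-envelope assumption, not a measured figure):

```python
# Back-of-envelope cost of character-level vs. BPE-tokenized input.
# Assumption (not from any benchmark): English BPE averages ~4 chars/token.
text = "The bitter lesson is about general methods that leverage computation."
n_chars = len(text)                          # sequence length at character level
n_tokens = max(1, n_chars // 4)              # assumed BPE sequence length
utf32_bytes = len(text.encode("utf-32-le"))  # 4 bytes per code point

# Self-attention cost grows quadratically with sequence length:
attention_blowup = (n_chars / n_tokens) ** 2
print(n_chars, n_tokens, utf32_bytes, round(attention_blowup, 1))
```

So character-level input alone multiplies attention cost by roughly an order of magnitude before you account for the extra capacity needed to relearn tokenization internally.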
Converting from RGB to YUV is absolutely subject to the bitter lesson: it takes a representation we have seen work for some classical methods and hard-codes that knowledge into the AI, which could easily learn (and will anyway) a more useful representation for itself.
> LLMs on character-level UTF-32 and just letting it figure everything out by itself, while needing two orders of magnitude bigger contexts and parameter counts.
This was tried extensively, and honestly it is probably still too early to proclaim the demise of this approach. It's also a completely different case: you're conflating a representation that literally changes the number of forward passes you have to do (i.e. the amount of computation, which is what the bitter lesson is about) with one that would, at most, require stacking on a few extra layers.
A better example for your point (imo) would be audio recognition, where we pre-transform wave amplitudes into a log mel spectrogram for ingestion by the model. I think this will ultimately fall to the bitter lesson as well, though.
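To make the example concrete, that hand-designed front end is roughly the following (a minimal numpy sketch; the frame sizes and the mel formula follow common convention, not any specific toolkit):

```python
import numpy as np

def log_mel_spectrogram(x, sr=16000, n_fft=400, hop=160, n_mels=40):
    """Wave amplitudes -> log-mel features, the classic hand-designed front end."""
    # Frame the signal and apply a Hann window.
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    window = np.hanning(n_fft)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames * window, axis=-1)) ** 2
    # Triangular mel filterbank (mel scale: 2595 * log10(1 + f/700)).
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    return np.log(power @ fb.T + 1e-10)
```

Every constant in there (frame length, hop, mel warping, the log) is baked-in human knowledge about speech, which is exactly the kind of thing an end-to-end model could in principle learn from raw samples.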
Another key difference is that you are proposing taking methods that already work and injecting more classical knowledge into them. It is often the case that you'll have an intermediary fusion of deep and classical methods, but not once you already have working fully-deep methods.
Heck, why even go that far? Given how much text we have in scanned books, just feed it scans of the books and let it dedicate a bunch of layers to learning OCR.
Or, given the number of unscanned books, just give it the controls of a book scanner, the books, and probably some robot arms. Then let it figure out the scanning in some layers first. Shouldn't be that hard.
Right... but I don't see how that means that it doesn't fall to the bitter lesson.
The bitter lesson is not saying that the model will always relearn the same representation that has been useful to humans in the past, merely that it will learn a better representation for the task at hand than the one hand-coded by humans.
If the model can easily learn the representation useful to humans, then hand-coding it will fall to the bitter lesson: at minimum the model can follow our path (it's just an affine transformation to learn), and more probably it will learn very different (and better) representations for itself.
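Concretely, RGB→YUV is a single fixed matrix (the BT.601 coefficients below), so any first linear layer can represent it exactly:

```python
import numpy as np

# RGB -> YUV (BT.601 coefficients) is an affine map: yuv = M @ rgb.
# A single linear layer can represent this exactly, which is why
# hand-coding the conversion buys the network essentially nothing.
M = np.array([
    [ 0.299,     0.587,     0.114   ],   # Y  (luma)
    [-0.14713,  -0.28886,   0.436   ],   # U  (blue-difference chroma)
    [ 0.615,    -0.51499,  -0.10001 ],   # V  (red-difference chroma)
])

def rgb_to_yuv(rgb):
    """rgb: array of shape (..., 3) with components in [0, 1]."""
    return rgb @ M.T

print(rgb_to_yuv(np.array([1.0, 1.0, 1.0])))  # pure white: Y = 1, U and V ~ 0
```

(The digital YCbCr variant adds a constant offset, which is still affine and still trivially learnable.)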
LLMs struggle to reason about spelling (e.g. asking for a sentence that contains no letter "a") and can also struggle with rhyming, etc. The most obvious explanation is that they never 'see' the underlying letters/spelling, only tokens.
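A toy illustration of why (the vocabulary and IDs here are made up, but the shape of the problem matches real BPE tokenizers):

```python
# Hypothetical BPE-style vocabulary; real tokenizers are the same in
# spirit: the model consumes opaque integer IDs, not characters.
vocab = {"straw": 4721, "berry": 1893}

def tokenize(word):
    # Greedy longest-match split against the toy vocabulary.
    ids, rest = [], word
    while rest:
        for piece in sorted(vocab, key=len, reverse=True):
            if rest.startswith(piece):
                ids.append(vocab[piece])
                rest = rest[len(piece):]
                break
        else:
            raise ValueError(f"cannot tokenize {rest!r}")
    return ids

print(tokenize("strawberry"))  # [4721, 1893]: nothing exposes the letters inside
```

To answer "how many r's are in strawberry?", the model has to have memorized the spelling of the pieces behind IDs 4721 and 1893; it can't just read it off the input.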
Been hearing that for half my adult life. People were 100% sure that multicore in 2005 meant manufacturers were officially signalling the end of single-threaded scaling, and that it was time to invest in auto-parallelizable code.
I don't think it's wrong, but looking at it through a child's eyes, we do keep finding ways to do things we couldn't a couple of years ago: an open mind on hardware and more focus on software are keeping the deep innovation cycles going.
Leaving aside that we're still far from hitting the limits to growth outlined in that book, and that we can exceed those limits to growth by expanding outside of Earth, what does a book about physical limitations on agriculture and industry have to do with limitations on computing efficiency? There is of course some fundamental limit to computing efficiency, but for all we know we could be many orders of magnitude away from hitting it.
We've clearly fallen behind the exponential curve on clock speed. But the great thing is we can parallelize transformers, so it's not as big of a deal.