For the uninitiated: Interestingly, it is not advisable to take this to the extreme and set temperature to 0.
That would seem logical, as the results are then completely deterministic, but it turns out that a suboptimal token may result in a better answer in the long run. Also, allowing for a little bit of noise gives the model room to talk itself out of a suboptimal path.
I like to think of this like tempering the output space. With a temperature of zero, there is only one possible output and it may be completely wrong. With even a low temperature, you drastically increase the chances that the output space contains a correct answer, through containing multiple responses rather than only one.
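To make the "tempering" intuition concrete, here is a minimal sketch of temperature scaling over some made-up next-token logits (the specific numbers are just an illustration, not from any real model):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T before softmax. As T -> 0 the distribution
    collapses onto the argmax (greedy/deterministic); higher T flattens
    it, leaving room for 'suboptimal' tokens to be sampled."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens

greedy = softmax_with_temperature(logits, 0.01)  # near-deterministic
warm = softmax_with_temperature(logits, 0.8)     # alternatives stay reachable
```

With T near zero the top token gets essentially all the probability mass; at 0.8 the runner-up still has a real chance of being sampled, which is the "room to talk itself out of a suboptimal path" above.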
I wonder if determinism will be less harmful to diffusion models because they perform multiple iterations over the response rather than having only a single shot at each position that lacks lookahead. I'm looking forward to finding out and have been playing with a diffusion model locally for a few days.
Yup. I think of it as how off the rails do you want to explore?
For creative things or exploratory reasoning, a temperature of 0.8 leads us down all sorts of excursions into the rabbit hole. However, when coding and needing something precise, a temperature of 0.2 is what I use. If I don't like the output, I'll rephrase or add context.
The author introduces the term "Supervision Paradox", but IMHO this is simply one instance of the "Automation Paradox" [1], which has been haunting me since I started working in IT.
Interestingly, most jobs don't incentivize working harder or smarter, because it just leads to more work, and then burn-out.
You seem to be right. The author is pumping out one such article per day. I think I've spent more time forming my comment than they did generating the article. Oh well :)
I didn't exactly understand what was meant here, so I went out and read a little. There is an interesting paper called "Neural Networks are Decision Trees" [1]. Thing is, this does not imply a nice mapping of neural networks onto decision trees. The trees that correspond to the neural networks are huge. And I get the impression that the paper is stretching the concept of decision trees a bit.
Also, I still don't know exactly what you mean, so would you care to elaborate a bit? :)
> We originally planned to make and train a neural network with single bit activations, weights, and gradients, but unfortunately the neural network did not train very well. We were left with a peculiar looking CPU that we tried adapting to mine bitcoin and run Brainfuck.
Straightforward quantization, just to one bit instead of 8 or 16 or 32. Training a one-bit neural network from scratch is apparently an unsolved problem though.
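For anyone unfamiliar, the "quantize to one bit" part is conceptually simple; here's a rough sketch in the BinaryConnect style (sign plus a per-tensor scale), with toy weights I made up for illustration:

```python
def binarize(weights):
    """Naive 1-bit quantization: keep only each weight's sign, plus a
    single per-tensor scale so overall magnitudes are roughly preserved."""
    scale = sum(abs(w) for w in weights) / len(weights)
    return [scale if w >= 0 else -scale for w in weights]

w = [0.7, -0.2, 0.05, -0.9]   # hypothetical full-precision weights
bw = binarize(w)               # every entry is +scale or -scale
```

The hard part isn't this forward mapping; it's that the sign function has zero gradient almost everywhere, which is why training such a network from scratch (without a full-precision shadow copy) is so difficult.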
> The trees that correspond to the neural networks are huge.
Yes, if the task is inherently 'fuzzy'. Many neural networks are effectively large decision trees in disguise and those are the ones which have potential with this kind of approach.
> Training a one bit neural network from scratch is apparently an unsolved problem though.
It was until recently, but there is a new method which trains them directly without any floating point math, using "Boolean variation" instead of Newton/Leibniz differentiation:
Unfortunately the paper seems to have been mostly overlooked. It has only a few citations. I think one practical issue is that existing training hardware is optimized for floating point operations.
>Many neural networks are effectively large decision trees in disguise and those are the ones which have potential with this kind of approach.
I don't see how that is true. Decision trees look at one parameter at a time and potentially split into multiple branches (i.e., more than two branches are possible). Single input -> discrete multi-valued output.
Neural networks do the exact opposite. A neural network neuron takes multiple inputs and calculates a weighted sum, which is then fed into an activation function. That activation function produces a scalar value where low values mean inactive and high values mean active. Multiple inputs -> one continuous output.
Quantization doesn't change anything about this. If you have a 1 bit parameter, that parameter doesn't perform any splitting, it merely decides whether a given parameter is used in the weighted sum or not. The weighted sum would still be performed with 16 bit or 8 bit activations.
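The contrast being described can be sketched in a few lines (toy values chosen just to show the shapes of the two operations):

```python
import math

def neuron(inputs, weights, bias):
    """A neuron mixes ALL inputs into one weighted sum, then squashes it."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid: continuous output in (0, 1)

def decision_node(inputs, feature_index, threshold):
    """A decision-tree node inspects ONE feature and branches on it."""
    return "left" if inputs[feature_index] <= threshold else "right"

x = [0.2, 0.9, 0.4]
a = neuron(x, [1.0, -2.0, 0.5], 0.1)  # many inputs -> one continuous value
branch = decision_node(x, 1, 0.5)     # one feature -> a discrete branch
```

Even with 1-bit weights, the neuron still performs the mixing step; only the weight values are restricted.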
I'm honestly tired of these terrible analogies that don't explain anything.
> I'm honestly tired of these terrible analogies that don't explain anything.
Well, step one should be trying to understand something instead of complaining :)
> Single input -> discrete multi valued output.
A single node in a decision tree is single input. The decision tree as a whole is not. Suppose you have a 28x28 image, each 'pixel' being eight bits wide. Your decision tree can query 28x28x8 possible inputs as a whole.
> A neural network neuron takes multiple inputs and calculates a weighted sum, which is then fed into an activation function.
Do not confuse the 'how' with 'what'.
You can train a neural network that, for example, tells you if the 28x28 image is darker at the top or darker at the bottom or has a dark band in the middle.
Can you think of a way to do this with a decision tree with reasonable accuracy?
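To illustrate why this is easy for a network and awkward for a tree: a single "neuron" with +1 weights over the top half of the image and -1 over the bottom half answers it in one weighted sum, whereas a tree can only approximate that sum with many single-pixel splits. A toy sketch (flat 28x28 brightness values, invented for the example):

```python
def top_is_darker(image):
    """One weighted sum over all pixels: +1 on the top half, -1 on the
    bottom half. Positive means the bottom is brighter, i.e. the top
    half is darker. `image` is a flat list of 28*28 brightness values."""
    n = 28 * 28
    top = sum(image[: n // 2])
    bottom = sum(image[n // 2 :])
    return bottom - top > 0

dark_top = [0.1] * (28 * 14) + [0.9] * (28 * 14)     # dark above, bright below
dark_bottom = [0.9] * (28 * 14) + [0.1] * (28 * 14)  # the reverse
```

A decision tree would have to thread thresholds on individual pixels to approximate this global comparison, which is where the "huge trees" come from.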
> Training a one bit neural network from scratch is apparently an unsolved problem though.
I don't think it's correct to call it unsolved. The established methods are much less efficient than those for "regular" neural nets but they do exist.
Perhaps. It's also possible that the approach simply precludes the use of the best tool for the job. Backprop is quite powerful and it just doesn't work in the face of heavy quantization.
Whereas if you're already using evolution strategies, a genetic algorithm, or something similar, I wouldn't expect changing the bit width (or pretty much anything else) to make any difference to the overall training efficiency, which is presumably already abysmal outside of a few specific domains, such as RL applied to a sufficiently ambiguous continuous control problem.
This is probably an outdated understanding of how LLMs work. Modern LLMs can reason and they are creative, at least if you don't mind stretching the meaning of those words a bit.
The thing they currently lack is the social skills, ambition, and accountability to share a piece of software and get adoption for it.
I suggested that the _understanding_ is outdated, not the principles.
Many people used to say that LLMs were no more than a stochastic parrot, implying that they would be incapable of forming novel ideas. It is quite obvious that that is no longer the case.
This is new. You are citing FPGA prototypes. Those papers do not demonstrate the same class of scaling or hardware integration that Taalas is advocating. For one, the FPGA solutions typically use fixed multipliers (or lookup tables), the ASIC solution has more freedom to optimize routing for 4 bit multiplication.
I understand what Taalas is claiming. I was trying to point out that putting a model on hardware is not something new or unthought of; the natural progression of an FPGA is an ASIC. The Taalas process is more expensive and not really worth it, because once you burn a model onto the silicon, the silicon can only serve that model. The speed improvement alone is not enough for the cost you will incur in the long run. GPUs are still general purpose; FPGAs are at least reusable, though they won't have the same speed. But this alone cannot be a long-term business. Turning a model into hardware in two months is too long. Models already take quite a long time to train, and anyone going down this strategy would leave the field wide open to their competitors. Deployment planning for existing models is already complicated enough.
How was that not the case? As far as I understand it ChatGPT was instrumental to solving a problem. Even if it did not entirely solve it by itself, the combination with other tools such as Lean is still very impressive, no?
My understanding is there have been around ten Erdos problems solved by GPT by now. Most of them have been found to be either in the literature already, or a very similar problem was solved in the literature. But one or two solutions are quite novel.
I am not aware of any unsolved Erdos problem that was solved via an LLM. I am aware of LLMs contributing to variations on known proofs of previously solved Erdos problems. But the issue with having an LLM combine existing solutions or modify existing published solutions is that the previous solutions are in the training data of the LLM, and in general there are many options to make variations on known proofs. Most proofs go through many iterations and simplifications over time, most of which are not sufficiently novel to even warrant publication. The proof you read in a textbook is likely a highly revised and simplified proof of what was first published.
If I'm wrong, please let me know which previously unsolved problem was solved, I would be genuinely curious to see an example of that.
"We tentatively believe Aletheia's solution to Erdős-1051 represents an early example of an AI system autonomously resolving a slightly non-trivial open Erdős problem of somewhat broader (mild) mathematical interest, for which there exists past literature on closely-related problems [KN16], but none fully resolves Erdős-1051. Moreover, it does not appear to us that Aletheia's solution is directly inspired by any previous human argument (unlike in many previously discussed cases), but it does appear to involve a classical idea of moving to the series tail and applying Mahler's criterion. The solution to Erdős-1051 was generalized further, in a collaborative effort by Aletheia together with human mathematicians and Gemini Deep Think, to produce the research paper [BKK+26]."
"The erdosproblems website shows 851 was proved in 1934." I disagree with this characterization of the Erdos problem. The statement proven in 1934 was weaker. As evidence for this, you can see that Erdos posed this problem after 1934.
You recommended I look at the erdosproblems website.
But evidence that it was posed after 1934 is not really evidence it was not solved, because one of the things we learned from LLMs was that many of these problems were already solved in the literature, or are relatively straightforward applications of known, yet obscure, results. Particularly in the world of Erdos problems, the majority of which can be described as "off the beaten path" and are basically musings in papers that Erdos was asking -- many of these are in fact solved in more obscure articles and no one made the connection until LLMs allowed us to do systematic literature searches. This was the primary source of "solutions" of these problems by LLMs in the cited paper.
The Erdos Problem site also does not say it was solved in 1934. If you read the full sentence there, it refers to a different statement proven which is related.
Yeah that was also my take-away when I was following the developments on it. But then again I don't follow it very closely so _maybe_ some novel solutions are discovered. But given how LLMs work, I'm skeptical about that.
I honestly don't see the point of the red data points. By now all the erdos problems have been attempted by AIs--so every unsolved one can be a red data point.