Hacker News | fcharton's comments

Author here. The paper is about the Collatz sequence: how experiments with a transformer can point at interesting facts about a complex mathematical phenomenon, and how, in supervised math transformers, model predictions and errors can be explained (this part is a follow-up to a similar paper about GCD). From an ML research perspective, the interesting (and surprising) takeaway is the particular way the long Collatz function is learned: "one loop at a time".
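For readers unfamiliar with the setup, the standard Collatz map is simple to state. Here is a minimal Python sketch; the paper's "long Collatz" variant may differ (e.g. by compressing the odd step), so treat this as illustrative only:

```python
def collatz_step(n):
    # one application of the standard Collatz map
    return n // 2 if n % 2 == 0 else 3 * n + 1

def collatz_trajectory(n):
    """Iterate the Collatz map from n until reaching 1.

    Illustrative sketch only; the paper's variant of the
    function may compress or reorder these steps.
    """
    traj = [n]
    while n != 1:
        n = collatz_step(n)
        traj.append(n)
    return traj

print(collatz_trajectory(6))  # [6, 3, 10, 5, 16, 8, 4, 2, 1]
```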

To me, the base conversion is a side quest. We just wanted to rule out this explanation for the model's behavior. It may be worth further investigation, but it won't be by us. Another (less important) reason is paper length: if you want to submit to peer-reviewed outlets, you need to keep the page count under a certain limit.


I'm curious about 2 things.

1) Why did you not test the standard Collatz sequence? I would think that including it, as well as testing on Z+, Z+\2Z, and 2Z+, would be a bit more informative (in addition to what you've already done). Even though there's the trivial step, it could indicate how much memorization the network is doing. You do notice the model learns some shortcuts, so I think these tests could help confirm that and diagnose some of the issues.
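For concreteness, the splits suggested above (all positive integers, the odds Z+\2Z, the evens 2Z+) are easy to construct. This toy sketch just computes total stopping times on each split; the ranges and names are my own, not from the paper:

```python
def collatz_steps(n):
    # number of standard Collatz steps to reach 1 (total stopping time)
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

# Hypothetical evaluation splits (range is arbitrary, for illustration):
all_pos = range(1, 101)                     # Z+ (truncated)
odds = [n for n in all_pos if n % 2 == 1]   # Z+ \ 2Z
evens = [n for n in all_pos if n % 2 == 0]  # 2Z+

print(max(collatz_steps(n) for n in odds))
print(max(collatz_steps(n) for n in evens))
```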

2) Is there a specific reason for the cross attention?

Regardless, I think it is an interesting paper (these wouldn't be criteria for rejection were I reviewing your paper btw lol. I'm just curious about your thoughts here and trying to understand better)

FWIW I think the side quest is actually pretty informative here, though I agree it isn't the main point.


It might be a side quest, or it could be an elegant way to frame a category of problems that resist the ways in which transformers learn. In turn, by fixing that structural deficiency so that a model can effectively learn this category of problems, you might enable a new leap in capabilities.

We're a handful of breakthroughs away from models reaching superhuman levels across any and all domains of cognition. It's clear that current architectures aren't going to be the end-all solution, but all we may need is a handful of well-posed categorical deficiencies that allow a smooth transition past the current jagged frontiers.


> We're a handful of breakthroughs away from models reaching superhuman levels across any and all domains of cognition.

That's a pretty bold claim to make.


Since the paper was first presented, it has been reviewed at ICLR and significantly expanded, and many of the questions raised in September have been addressed.

The performance metric is the number of equations/integrals correctly solved over a held-out, randomly generated test set. Verification is done using an external tool (SymPy), which some might consider "cheating" (but you need a way to test, don't you?). Anyway, keep in mind that checking a solution for correctness is much easier than finding one.
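The verification step can be sketched with SymPy: a candidate antiderivative is accepted iff differentiating it recovers the integrand. This is an illustrative stand-in for the idea, not the paper's actual test harness (the function name `check_antiderivative` is mine):

```python
import sympy as sp

x = sp.symbols('x')

def check_antiderivative(f, F):
    """Accept candidate F as an integral of f iff F' - f simplifies to 0.

    Sketch of the 'checking is easier than solving' point; the paper's
    real harness may compare solutions differently.
    """
    return sp.simplify(sp.diff(F, x) - f) == 0

print(check_antiderivative(sp.cos(x), sp.sin(x)))      # True
print(check_antiderivative(sp.cos(x), sp.sin(x) + 1))  # True: constants don't matter
print(check_antiderivative(sp.cos(x), sp.cos(x)))      # False
```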

Another issue was the timeout used for Wolfram. This is now discussed in the appendix (it makes very little difference).

During review, an interesting point was discussed: out of distribution generalization, or how performance depends on the random problem generator we used. This is now measured and discussed at the end of the appendix.


>During review, an interesting point was discussed: out of distribution generalization, or how performance depends on the random problem generator we used. This is now measured and discussed at the end of the appendix.

Your appendix E poses a really interesting question that I hope future research addresses. The distribution of integrals that appear on math tests is likely to closely match FWD, BWD, and IBP, because that's how professors write problems for students. However, the distribution of integrals that arise in physics and engineering is quite different and includes a lot of ones that aren't solvable in closed form. I wonder what the performance of the system would be on the "engineering distribution." I also wonder what you would have to do to find out what the engineering distribution was...


Textbook problems are usually short, with short solutions, each demonstrating one specific rule. They are better handled by classical (rule-based) tools. Deep learning tools would either memorize them or resort to a rule-based sub-module.

For integrals, solvability depends on the function set you work with. Since we use elementary functions on the real domain, a lot of integrals have no solution. We could have gone for a larger set (adding erf, the Fresnels, up to Liouvillian functions). This would mean more solvable cases.
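SymPy makes the point concrete: the Gaussian integrand has no elementary antiderivative, so under an elementary-functions-only set it is "unsolvable", but it becomes solvable once erf is admitted. This is just an illustration; the paper's function set and setup differ:

```python
import sympy as sp

x = sp.symbols('x')

# exp(-x**2) has no antiderivative among the elementary functions,
# so with an elementary-only set this integral has "no solution".
# Enlarging the set (here, allowing erf) makes it solvable:
result = sp.integrate(sp.exp(-x**2), x)
print(result)  # sqrt(pi)*erf(x)/2
```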

As for the engineering distribution, no one knows what it is. The best we can do is to generate a broader training set, knowing that it will generalize better (this is the key takeaway of our appendix). BWD+IBP is a step in this direction, but to progress further, we need a better understanding of the problem space, and issues related to simplification. We are working on this now.


>Since we use elementary functions on the real domain, a lot of integrals have no solution. We could have gone for a larger set (adding erf, the Fresnels, up to Liouvillian functions). This would mean more solvable cases.

Here is something to think about if you want more solvable cases: even in the case of sine and cosine, solving a differential equation really means reducing it to a combination of the solutions to simpler differential equations. The sine function can be defined as the solution to a particular differential equation, as can all of the fancier functions. So in a sense it's kind of like factorization, where you have "prime" equations whose only solution is a transcendental function defined as being their solution, and "composite" equations whose solutions can be written as a combination of solutions to "prime" equations. So really all of the rare functions belong to the same general scheme.
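SymPy's ODE solver illustrates this "prime equation" view: sine and cosine fall out as the general solution of y'' + y = 0. A sketch of the idea, not anything from the paper:

```python
import sympy as sp

x = sp.symbols('x')
y = sp.Function('y')

# sin and cos can be *defined* as solutions of the "prime" equation
# y'' + y = 0; SymPy recovers them symbolically:
sol = sp.dsolve(sp.Eq(y(x).diff(x, 2) + y(x), 0), y(x))
print(sol)  # Eq(y(x), C1*sin(x) + C2*cos(x))
```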


Why not test on a set of problems that come up in practice rather than generated by an artificial distribution?


For training, you need a generator because you want millions of solved examples for deep learning to work.
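As a toy illustration of why a generator works, here is a sketch of backward (BWD) generation: sample a random expression, differentiate it, and you get a supervised (problem, solution) pair for free, since differentiation is easy even when integration is hard. The expression sampler below is my own stand-in, far simpler than the paper's generator:

```python
import random
import sympy as sp

x = sp.symbols('x')

def backward_pair(rng):
    """Backward (BWD) generation: sample an expression F, differentiate,
    and emit the pair (F', F) as (integration problem, solution).

    The atom list and combination rule here are toy assumptions,
    not the paper's actual sampler.
    """
    atoms = [x, sp.sin(x), sp.exp(x), sp.log(x + 1)]
    F = rng.choice(atoms) * rng.choice(atoms) + rng.randint(1, 5) * x
    return sp.diff(F, x), F  # (problem, solution)

rng = random.Random(0)
problem, solution = backward_pair(rng)
# the pair is consistent by construction:
assert sp.simplify(sp.diff(solution, x) - problem) == 0
```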

At test time, you usually want a test set from the same distribution as the training data (or at least related to it in some controllable way), or it becomes very difficult to interpret the results.

Suppose my test set comes from a different and unknown distribution (real problems sampled in some way).

If I get good results, is it because the training worked, or because the test set was "comparatively easy"? If I get bad results, is it because the model did not learn, or because the test set was too far away from the training examples?

