avdelazeri · 2026-05-11T23:56:40 1778543800

Given that their publication says the dataset is freely available on Huggingface that's at least something ig

avdelazeri · 2026-05-07T16:26:42 1778171202

And right after https://news.ycombinator.com/item?id=48019219 huh

baq · 2026-05-07T17:34:41 1778175281

Taken completely by surprise, no one could have predicted this /s

avdelazeri · 2026-05-03T14:57:31 1777820251

Now, I don't know anything about neuroscience or brain development, but hopefully I can explain the statistics in a way useful to you.

Imagine there are two groups, A and B. Group A has slower reactions on average and higher average activity. Group B has faster reactions and lower average activity than Group A. Yet inside both groups the general trend is that if someone is slower than their group's average reaction then they're also below their group's average activity.

If we pool everyone without distinguishing groups, slower reaction is positively correlated with higher activity: kids from Group A have higher activity and slower reactions in general, which pushes the correlation upwards. As long as the trend inside each group isn't too strong, the gap between the groups can easily dominate the overall correlation. But inside each group the trend is actually the opposite.

This applies pretty much every time you're comparing samples. If I understood your quote correctly, they're studying a child's reaction time vs activity level by comparing the same kid at different times. The same logic applies: a person can exhibit the opposite trend to the population average due to the same mechanism as above. This can be even more dramatic, because once you start looking at averages you start losing time dependency information.

More broadly (and more formally), covariance splits into within-group and between-group terms, so if the signs of the two terms differ, the larger one dominates the sum and can flip the overall sign.
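To make the split concrete, here is a minimal sketch with made-up numbers (the reaction times and activity levels are purely hypothetical, not from any study). It computes the pooled covariance and checks that it equals the within-group term plus the between-group term, with the two terms having opposite signs:

```python
from statistics import fmean

def cov(xs, ys):
    """Population covariance (divides by n, not n - 1)."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: (reaction time in ms, activity level).
# Inside each group, slower kids have LOWER activity.
groups = {
    "A": [(500, 12), (520, 11), (540, 10), (560, 9), (580, 8)],  # slower, more active
    "B": [(300, 6), (320, 5), (340, 4), (360, 3), (380, 2)],     # faster, less active
}

pooled = [point for pts in groups.values() for point in pts]
xs, ys = zip(*pooled)
n = len(pooled)

# Covariance when the groups are ignored.
total = cov(xs, ys)

# Within-group term: size-weighted average of the per-group covariances.
within = sum(len(pts) / n * cov(*zip(*pts)) for pts in groups.values())

# Between-group term: size-weighted covariance of the group means.
grand_x, grand_y = fmean(xs), fmean(ys)
between = sum(
    len(pts) / n
    * (fmean([p[0] for p in pts]) - grand_x)
    * (fmean([p[1] for p in pts]) - grand_y)
    for pts in groups.values()
)

print(total, within, between)  # 260.0 -40.0 300.0
```

The within-group term is negative (the trend inside each group), but the between-group term is large and positive, so the pooled covariance comes out positive. Shrink the gap between the two groups' means and the pooled sign flips back.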

kqr · 2026-05-03T20:18:45 1777839525

This is a very good explanation of Simpson's paradox, which is the name for this thing.

It can go arbitrarily deep and the trend can flip sign for each added controlled variable.

avdelazeri · 2026-04-29T08:51:47 1777452707

I don't think this is the slam dunk you think this is. LinkedIn's existence is, in fact, a net negative for the human race.

avdelazeri · 2026-04-25T01:34:34 1777080874

We must know, we will know.

CamperBob2 · 2026-04-25T04:59:59 1777093199

"Yeah, about that" - Gödel

avdelazeri · 2026-04-15T21:00:46 1776286846

OGame and Travian are two names that really take me back. Those and Tribal Wars, I played them a lot back when I was a teenager.

parzivalt · 2026-04-15T21:25:12 1776288312

that matches what i'm seeing - people want both. fast feedback loops in the first hour and stuff that still pulls them back three weeks later. hardest part is making the slow stuff feel like it matters before they've invested the time. still figuring it out with a lot of nice feedback from the community.

avdelazeri · 2026-04-01T22:30:45 1775082645

Has Claude stopped claiming to be deepseek when prompted in Chinese yet? It wasn't long ago that it hit the news and blogs.

avdelazeri · 2026-03-31T09:46:44 1774950404

Afaik turning up the temperature slowly wouldn't work on an actual frog. But it works on people without fail.

account42 · 2026-04-01T11:12:49 1775041969

That's because you can ignore it when people complain, as long as it's not too many at once.

avdelazeri · 2026-03-12T00:05:38 1773273938

Lack of will. That was one of the main results from the survey from Whitaker in 2020. Making your code reusable and easy to understand is significant work that has no direct benefit for a researcher's career. Particularly because research code grows wildly as researchers keep trying things.

Working on the next paper is seen as the better choice.

Moreover, if your code is easy for others to run then you're likely to be hit with people wanting support, or even open yourself to the risk of someone finding errors in your code (the survey's result, not my own beliefs).

There are other issues, of course. Just running the code doesn't mean something is replicable. Science is replicated when studies are repeated independently by many teams.

There are many other failure modes: SOTA-hacking, benchmarking, and lack of rigorous analysis of results, for example. And that's ignoring data leakage or other sillier mistakes (that still happen in published work! Even in work published in very good venues).

Authors don't do much of anything to reassure readers that they didn't simply get really lucky with their pseudorandom number generators during initialization, shuffling, etc. As long as it beats SOTA, who cares if it is actually a meaningful improvement? Of course, doing multiple runs with a decent bootstrap to get some estimate of the average behavior is often really expensive and really slow, and deadlines are always so tight. There is also the matter that the field converged on an experimentation methodology that isn't actually correct. Once you start reusing test sets your experiments stop being approximations of a random sampling process and you quickly find yourself outside of the guarantees provided by statistical theory (this is a similar sort of mistake to the one scientists in other fields make when interpreting p-values). There be dragons out there and statistical demons might come to eat your heart or your network could converge to an implementation of nethack.
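For what "multiple runs with a decent bootstrap" could look like, here is a rough sketch. The accuracies are made-up numbers standing in for one result per random seed, and `bootstrap_ci` is a hypothetical helper, not anything from a library. It only addresses seed noise; it does nothing about the test-set reuse problem.

```python
import random
from statistics import fmean

def bootstrap_ci(new, base, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(new) - mean(base).

    Each list holds one score per independent training run; runs are
    resampled with replacement within each list.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        new_sample = rng.choices(new, k=len(new))
        base_sample = rng.choices(base, k=len(base))
        diffs.append(fmean(new_sample) - fmean(base_sample))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical test accuracies from 5 seeds each (invented for illustration).
baseline = [0.812, 0.807, 0.815, 0.809, 0.811]
proposed = [0.814, 0.806, 0.818, 0.810, 0.813]

lo, hi = bootstrap_ci(proposed, baseline)
print(f"mean gain: {fmean(proposed) - fmean(baseline):+.4f}")
print(f"95% interval for the gain: [{lo:+.4f}, {hi:+.4f}]")
if lo <= 0 <= hi:
    print("Interval includes zero: the gain may just be seed noise.")
```

With only five runs per method the interval is crude, but even this much is enough to show when a headline "beats SOTA" number sits well inside the run-to-run spread.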

Scale also plays into that, of course, and use of private data as the other comment mentioned.

Ultimately, Machine Learning research is just too competitive and moves too fast. There are tens of thousands (hundreds maybe?) of people all working on closely related problems, all rushing to publish their results before someone else publishes something that overlaps too much with their own work. Nobody is going to be as careful as they should, because they can't afford to. It's more profitable to carefully find the minimal publishable amount of work and do that, splitting a result into several small papers you can pump out every few months. The first thing that tends to get sacrificed during that process is reliability.

avdelazeri · 2025-12-03T08:20:38 1764750038

While I never measured it, this aligns with my own experiences.

It's better to have very shallow conversations where you keep regenerating outputs aggressively, only picking the best results. Asking for fixes, restructuring or elaborations on generated content has rapidly diminishing returns. And once it has made a mistake (or hallucinated) it will not stop erring even if you provide evidence that it is wrong. LLMs just commit to certain things very strongly.

ewoodrich · 2025-12-03T18:00:57 1764784857

I largely agree with this advice but in practice using Claude Code / Codex 4+ hours a day, it's not always that simple. I have a .NET/React/Vite webapp that despite the typical stack has a lot of very specific business logic for a real world niche. (Plus some poor early architectural decisions that are being gradually refactored with well documented rules).

I frequently see (both) agents make wrong assumptions, and it inevitably takes multiple turns of letting them fail before they recognize the correct solution.

There can be like a magnetic pull where no matter how you craft the initial instructions, they will both independently have a (wrong) epiphany and ignore half of the requirements during implementation. It takes messing up once or twice for them to accept that their deep intuition from training data is wrong and pivot. In those cases I find it takes less time to let that process play out vs recrafting the perfect one shot prompt over and over. Of course once we've moved to a different problem I would definitely dump that context ASAP.

(However, what is cool about working with LLMs, to counterbalance the petty frustrations that sometimes make it feel like a slog, is that they have extremely high familiarity with the jargon/conventions of that niche. I was expecting to have to explain a lot of the weird, too clever by half abbreviations in the legacy VBA code from 2004 it has to integrate with, but it pretty much picks up on every little detail without explanation. It's always a fun reminder that they were created to be super translators, even within the same language but from jargon -> business logic -> code that kinda works).

HPsquared · 2025-12-03T11:12:12 1764760332

A human would cross out that part of the worksheet, but an LLM keeps re-reading the wrong text.