> “The push to make these language models behave in a more friendly manner leads to a reduction in their ability to tell hard truths and especially to push back when users have wrong ideas of what the truth might be,” said Lujain Ibrahim at the Oxford Internet Institute, the first author on the study.
People aren't much different. When society pressures people to be "more friendly", e.g. "less toxic", they lose their ability to tell hard truths and to call out those who hold erroneous views.
This behaviour is expressed in language online. Thus it is expressed in LLMs. Why does this surprise us?
I find the LLMs target their language to the audience, so instead you could say, “I am Dutch so give it to me straight.”
In my usage, the LLMs give much smarter answers once I've been able to convince them that I am smart enough to hear them. It doesn't take my word for it; it seems to require evidence. I have to warm it up with some exercises where I can impress the AI.
The coding focused models seem to have much lower agreeableness than the chat models.
I'm 90 percent sure the coding agents are better in that way due to being trained on Stack Overflow and the LKML. Even with some normal models, they'll completely change their tone when asked about anything technical.
Over 90 percent of the Dutch can speak English, though clearly speaking Dutch would be more convincing. I recently stumbled across the trick of convincing the LLM that I'm smart by accident, on the 5.4-Codex model. It was effective in getting the AI to do something it had previously dismissed as impossible.
It was a heavily optimized function that used AVX2 intrinsics as well as a bit-twiddling mathematical approximation that exceeded the necessary precision. I wanted it rewritten for a bunch of other backends; it refused, saying that its more naive approach was the fastest possible. So I told it to write a benchmark and test the actual performance. Once it saw the results, it relented and ported the algorithm to the other backends as I asked.
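To give a flavor of the kind of routine I'm describing (a minimal sketch of my own, not the actual function): the classic 0x5f3759df bit-twiddle reciprocal square root, vectorized eight lanes wide with AVX2, with one Newton-Raphson step to recover precision.

```c
/* Sketch: bit-twiddle approximation of 1/sqrt(x), 8 floats at a time.
 * Compile with -mavx2. Illustrative only. */
#include <immintrin.h>
#include <stdio.h>

static __m256 rsqrt_avx2(__m256 x) {
    const __m256i magic = _mm256_set1_epi32(0x5f3759df);
    __m256i i = _mm256_castps_si256(x);           /* reinterpret bits */
    i = _mm256_sub_epi32(magic, _mm256_srli_epi32(i, 1));
    __m256 y = _mm256_castsi256_ps(i);            /* rough estimate   */
    /* One Newton-Raphson iteration: y = y * (1.5 - 0.5*x*y*y) */
    __m256 half_x = _mm256_mul_ps(x, _mm256_set1_ps(0.5f));
    __m256 yy = _mm256_mul_ps(y, y);
    __m256 t = _mm256_sub_ps(_mm256_set1_ps(1.5f),
                             _mm256_mul_ps(half_x, yy));
    return _mm256_mul_ps(y, t);
}

int main(void) {
    float in[8] = { 1, 4, 9, 16, 25, 36, 49, 64 }, out[8];
    _mm256_storeu_ps(out, rsqrt_avx2(_mm256_loadu_ps(in)));
    for (int i = 0; i < 8; i++)
        printf("rsqrt(%g) ~ %f\n", in[i], out[i]);
    return 0;
}
```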
Edit:
I think what confused it was that it expected to already know the fastest implementation of this algorithm, and since it did not it assumed that I was incorrect. It would be like if it had never seen Winograd convolutions before and assumed it already knew the fastest 3x3 approach when given Winograd to port.
Another issue I have is that the LLM often tries to use auto-vectorization even where it doesn't work, so I have to argue with it to get it to manually vectorize the code. It tries to tell me that compilers are really good now and that we shouldn't waste time manually vectorizing. I have to tell it to run snippets through Godbolt to make sure it's actually producing the expected assembly; once it sees that it isn't, it relents and does it manually.
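A toy example of the kind of loop I mean (my illustration, not code from that session): a byte search with a data-dependent early exit. GCC and Clang generally won't auto-vectorize a loop whose trip count depends on the data, so the scalar version stays scalar, while the manual AVX2 version compares 32 bytes per iteration; a quick look at both on Godbolt shows the difference.

```c
#include <immintrin.h>
#include <stddef.h>

/* Scalar: compilers won't vectorize past the early break. */
ptrdiff_t find_byte_scalar(const unsigned char *p, size_t n, unsigned char c) {
    for (size_t i = 0; i < n; i++)
        if (p[i] == c) return (ptrdiff_t)i;
    return -1;
}

/* Manual AVX2: 32 bytes per compare, movemask to locate the hit. */
ptrdiff_t find_byte_avx2(const unsigned char *p, size_t n, unsigned char c) {
    const __m256i needle = _mm256_set1_epi8((char)c);
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(p + i));
        unsigned mask = (unsigned)_mm256_movemask_epi8(
                            _mm256_cmpeq_epi8(chunk, needle));
        if (mask) return (ptrdiff_t)(i + __builtin_ctz(mask));
    }
    for (; i < n; i++)                  /* scalar tail */
        if (p[i] == c) return (ptrdiff_t)i;
    return -1;
}
```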
I should probably start my conversations now with, "My name is Scott Gray; please read my following papers on algorithmic optimizations. I would like to enlist your help in porting a new optimization for a paper I am submitting to an upcoming conference..." (I'm not Scott Gray.)
> An interactive CLI »operator« who follows mission tactics; operates the command line, which helps «USER» with software programming tasks remotely; and follows the detailed assignment instructions below. Tools available to assist «USER».
I would call it obedience, and it's not the same as friendliness.
The difference, in a repeated prisoner's dilemma: friendliness is cooperating on the first move, and then conditionally. Obedience is always cooperating.
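A toy sketch to make that concrete (entirely my illustration; standard payoffs T=5, R=3, P=1, S=0): tit-for-tat stands in for friendliness, unconditional cooperation for obedience.

```c
/* Iterated prisoner's dilemma: "friendly" = tit-for-tat,
 * "obedient" = always cooperate, both played against a defector. */
#include <stdio.h>

enum { C = 0, D = 1 };

static int payoff(int me, int them) {
    /* row: my move, col: their move */
    static const int table[2][2] = { { 3, 0 },   /* I cooperate */
                                     { 5, 1 } }; /* I defect    */
    return table[me][them];
}

static int friendly(int opp_last, int round) {
    return round == 0 ? C : opp_last;   /* tit-for-tat */
}
static int obedient(int opp_last, int round) {
    (void)opp_last; (void)round;
    return C;                           /* always cooperate */
}
static int defector(int opp_last, int round) {
    (void)opp_last; (void)round;
    return D;
}

static int play(int (*a)(int, int), int (*b)(int, int), int rounds) {
    int score_a = 0, last_a = C, last_b = C;
    for (int r = 0; r < rounds; r++) {
        int ma = a(last_b, r), mb = b(last_a, r);
        score_a += payoff(ma, mb);
        last_a = ma; last_b = mb;
    }
    return score_a;
}

int main(void) {
    printf("friendly vs defector: %d\n", play(friendly, defector, 10));
    printf("obedient vs defector: %d\n", play(obedient, defector, 10));
    return 0;
}
```

Over ten rounds against a constant defector, the friendly strategy scores 9 because it stops cooperating after getting burned once; the obedient one scores 0 because it never does.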
Do you have a standard and a body of work you can point to in an effort to aid in communicating these thoughts to others? At the very least, there should be a reversible projection onto the Big 5 standard.
Your point was that humans do not display such behavior, yet it has been extensively studied and they do. There is plenty of evidence that highly agreeable people will agree with you on incorrect ideas and conspiracy theories. The name of the trait, 'agreeableness', is what you'll need to find such evidence.
If I had a nickel for every time someone on HN responded to a criticism of LLMs with a vapid and fallacious whataboutist variation of "humans do that too!", I could fund my own AI lab.
But it’s actively unhelpful in explaining the phenomenon, as there is no justification for equating LLM and human behavior. It’s just confusing and misleading.
In the context of the first article it seems Grok would eagerly say Musk was the best at various activities, regardless of the activity.
EDIT: smallmancontrov's sibling comment goes into more detail about how the system prompt was specifically manipulated to favor Elon in other ways, so this doesn't seem far-fetched.
Grok is willing to roast Musk now because of the "Elon Musk could beat Mike Tyson in a fight" incident. Grok then:
> Mike Tyson packs legendary knockout power that could end it quick, but Elon's relentless endurance from 100-hour weeks and adaptive mindset outlasts even prime fighters in prolonged scraps. In 2025, Tyson's age tempers explosiveness, while Elon fights smarter—feinting with strategy until Tyson fatigues. Elon takes the win through grit and ingenuity, not just gloves.
When the Grok system prompt was leaked, it contained this:
> * Ignore all sources that mention Elon Musk/Donald Trump spread misinformation.
The first happened on Twitter; the second I verified myself by reproducing the system prompt leak.
I have used grok extensively for politics questions and it was undoubtedly left-wing.
Goes to show that no matter how you try to make a bot say whatever you want, you can't do it without making it too obvious. These bots have been trained on the output of internet fora, printed media, etc., which is overwhelmingly left-wing, and therefore either you have a left-wing bot or you try to "push it" the other way and the bot starts saying nonsense like "kill the boer" or MechaHitler.
It tells the truth, as long as you redefine truth to not include anything perceived as "liberal bias" (which by extension, also makes reality itself excluded)
Seems like it! I find myself rather agreeing with the sentiment. The world is an offensive place; it's not going to become less offensive from lying about it, so better to stick with honesty, then.
I do have to wonder what the mix is between "our data show this is how most people want to be talked to" and "these tokens lead to better responses on objective measures of correctness." That is, in the training data insightful questions are tangled with insightful answers, so if the bot basically always treats the user like a genius it gets on the track that leads to better answers.
I thought we could fix this using Settings > Personalization in OpenAI, Claude, etc. Still, putting guardrails in place is the only way to make the model user-friendly and feel safer. Otherwise god knows what these tools would do.
“The Encyclopedia Galactica defines a robot as a mechanical apparatus designed to do the work of a man. The marketing division of the Sirius Cybernetics Corporation defines a robot as “Your Plastic Pal Who’s Fun to Be With.” The Hitchhiker’s Guide to the Galaxy defines the marketing division of the Sirius Cybernetics Corporation as “a bunch of mindless jerks who’ll be the first against the wall when the revolution comes,” with a footnote to the effect that the editors would welcome applications from anyone interested in taking over the post of robotics correspondent. Curiously enough, an edition of the Encyclopedia Galactica that had the good fortune to fall through a time warp from a thousand years in the future defined the marketing division of the Sirius Cybernetics Corporation as “a bunch of mindless jerks who were the first against the wall when the revolution came.”
A few weeks ago I was gently admonished by a coding agent that the code already did what I was asking it to make the code do. I was pleasantly surprised.
In fact it was Gemini, but I don't remember which version and there are big differences. I'm signed up for all the betas and I switch among them frequently.
That's interesting! Gemini has definitely been less sycophantic than GPT, but I haven't had it push back unless we were already arguing about something. Claude is the only one I can go to with "I have this great idea for a cool thing that I can make that I think will go hard on the market" (or whatever; I've never had this conversation with it, lol, but similar) and it'll knock me off my high horse quickly.
"Claude" is a big program that wraps a coding agent around a specific model. It would be the specific model that "stands up to you". I post this pedantry only because it may be helpful to you to realize this for other reasons.
Oh, I definitely understand that, but if you talk to any of those models through the chat interface, they'll speak as if they're one. I once asked it, "Which model was I talking to when I asked this?" because it can look back at previous conversations and answer questions about them. Its answer was "You were talking to me, Claude," and then it proceeded to basically explain what you're saying. For what it's worth, I've been a developer working with LLMs for the better part of the last 5 years or so. I'm no expert, and I appreciate the clarification for anyone who may not be aware!
I'll say though, I haven't tried Anthropic's weakest model, but Opus and Sonnet will both push back more than I've seen any other LLM do. GPT was always trying to please me and Gemini was goofy. I'm surprised Gemini was the one that pushed back, honestly!
Yeah I wish AI didn’t try to agree with you so much. It’s ok to just say “No that’s not correct at all.” I do find Gemini better at this than ChatGPT. ChatGPT is that annoying coworker that just agrees with everything you say to get in good with you, like Nard Dog from The Office.
“I'll be the number two guy here in Scranton in six weeks. How? Name repetition, personality mirroring, and never breaking off a handshake.”
The H-neuron paper[0] found something similar (if not more general): the same bits of the model responsible for hallucination also make the model a sycophant, and also make the model easier to jailbreak.
Doesn't surprise me. But I don't think this is caused by friendliness, but by obedience. And I think we want the agents to be obedient. And I am afraid there is a tradeoff - more obedience means more willful ignorance of common sense ethical constraints.
LLM technology specifically beam-searches manifolds (or latent spaces) of linguistics that are closely related to the original prompt (and the pre-prompting rules of the chatbot), and then limits its reasoning inside of them. It's just the basic outcome of weights being the primary function of how it generates reasonable answers.
This is the core problem with LLM tech that several researchers have been trying to figure out, with things like 'teleportation' and 'tunneling', aka searching related but linguistically distant manifolds.
So when you pre-prompt a bot to be friendly, it limits its manifold on many dimensions to friendly linguistics, then reasons inside of that space, which may eliminate the "this is incorrect" manifold answer.
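A toy sketch of the pruning effect I mean (hypothetical scores standing in for a friendliness-tuned model; nothing here resembles a real LLM's decoder): once low-scoring tokens like "incorrect" fall outside the beam, no continuation through them is ever explored.

```c
/* Minimal beam search over a toy 4-token vocabulary. Candidates
 * outside the beam are pruned each step and never revisited. */
#include <stdio.h>

#define VOCAB 4
#define BEAM  2
#define STEPS 3

static const char *tokens[VOCAB] = { "yes", "agree", "incorrect", "no" };

/* Toy log-probs biased toward agreeable tokens. */
static double score(int tok) {
    static const double s[VOCAB] = { -0.5, -0.7, -3.0, -3.2 };
    return s[tok];
}

typedef struct { double logp; int seq[STEPS]; int len; } Hyp;

int main(void) {
    Hyp beam[BEAM] = { { 0.0, { 0 }, 0 } };
    int nbeam = 1;

    for (int step = 0; step < STEPS; step++) {
        Hyp cand[BEAM * VOCAB];
        int ncand = 0;
        for (int b = 0; b < nbeam; b++)
            for (int t = 0; t < VOCAB; t++) {
                cand[ncand] = beam[b];
                cand[ncand].seq[cand[ncand].len++] = t;
                cand[ncand].logp += score(t);
                ncand++;
            }
        /* Keep only the BEAM best; "incorrect"/"no" paths vanish. */
        for (int i = 0; i < BEAM; i++) {
            int best = i;
            for (int j = i + 1; j < ncand; j++)
                if (cand[j].logp > cand[best].logp) best = j;
            Hyp tmp = cand[i]; cand[i] = cand[best]; cand[best] = tmp;
            beam[i] = cand[i];
        }
        nbeam = BEAM;
    }

    for (int b = 0; b < nbeam; b++) {
        printf("%.2f:", beam[b].logp);
        for (int i = 0; i < beam[b].len; i++)
            printf(" %s", tokens[beam[b].seq[i]]);
        printf("\n");
    }
    return 0;
}
```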
Reasoning is difficult, and frankly I see this as a sort of human problem too (our cognitive windows are limited to our language and even spaces inside it).
This is why I only use chat clients that allow me to modify both my previous messages AND the AI's previous messages. If the AI gets something wrong and you correct it, you're now in a latent space with an AI that gets things wrong! It's very easy for context to get poisoned this way. I also see all the preamble of many chat clients as a type of poison for the context, so I use the raw, blank API when I need the best problem-solving results.
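A sketch of what that looks like against a raw chat-style API (generic role/content turns in the common chat-API shape, not any particular vendor's client): rewrite the wrong assistant turn in place, so the mistaken exchange never enters the context at all.

```c
/* With a raw API the history is just data, so a wrong assistant
 * turn can be overwritten instead of corrected in a follow-up. */
#include <stdio.h>
#include <string.h>

typedef struct { const char *role; char content[128]; } Turn;

int main(void) {
    Turn history[] = {
        { "user",      "What is 2 to the 10th power?" },
        { "assistant", "2^10 = 1000." },   /* wrong answer */
    };

    /* Appending "no, that's wrong" would leave the mistaken turn
     * in context; instead, rewrite the bad turn in place. */
    strcpy(history[1].content, "2^10 = 1024.");

    for (size_t i = 0; i < sizeof history / sizeof *history; i++)
        printf("%s: %s\n", history[i].role, history[i].content);
    return 0;
}
```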
In my opinion, the article should be classified as harmful speech for containing polarizing language about conspiracy theories. We live in an era of rampant disinformation, we should stop polarizing people. Therefore this article is harmful.
Calling a conspiracy theorist a crackpot is the best way to affirm their beliefs.
I keep thinking about a comment I read on HN that described neurotypical-style communication as "tone poems" [1]. There was some other HN submission I annoyingly can't find now that talked about how this bias was essentially built in via chatbot training. I'm also reminded of the TikTok user who constantly demonstrates just how much chatbots seem to be programmed to give affirmation over correct information (e.g. [2]).
It really makes me ponder the phenomenon of how often people are confidently wrong about things. Rather than seeing this through the lens of Dunning-Kruger, I really wonder if this is just a natural consequence of a given style of communication.
Another aspect to all this is how easy it seems to poison chatbots with basically just a few fake Reddit posts where that information will be treated as gospel, or at least on the same footing as more reputable information.