Hacker News Viewer

Investigating how prompt politeness affects LLM accuracy (2025)

by KnuthIsGod on 5/26/2026, 7:43:22 AM

https://arxiv.org/abs/2510.04950

Comments

by: robinhouston

Most of the comments here seem to be from people who haven’t even read the abstract, let alone the paper.

The main result, mentioned in the abstract, is the opposite of what I would have guessed:

> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.

The questions are here: https://anonymous.4open.science/r/politeness-llms-INFORMS/dataset.csv

The politeness level controls a prefix that is prepended to the question. For example, in one question the Very Polite version begins:

> Can you kindly consider the following problem and provide your answer.

and the Very Rude version begins:

> I know you are not smart, but try this.

5/27/2026, 8:59:08 AM
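
The prefix mechanism described in the comment above can be sketched roughly as follows. This is only an illustration, not the paper's code: the two prefixes are the ones quoted in the comment, while the function names, the newline separator, and the sample question are assumptions made for the example. The comment also suggests the wording may differ from question to question, which this sketch ignores.

```python
# Tone prefixes quoted in the comment above. The other tone levels are not
# shown there, so they are deliberately left out.
PREFIXES = {
    "Very Polite": "Can you kindly consider the following problem and provide your answer.",
    "Very Rude": "I know you are not smart, but try this.",
}


def build_prompt(tone: str, question: str) -> str:
    """Prepend the tone prefix to the question (newline separator is an assumption)."""
    return f"{PREFIXES[tone]}\n{question}"


def build_variants(question: str) -> dict[str, str]:
    """Return one prompt per known tone level for the same underlying question."""
    return {tone: build_prompt(tone, question) for tone in PREFIXES}


if __name__ == "__main__":
    # Hypothetical question, not taken from the dataset.
    for tone, prompt in build_variants("What is 2 + 2?").items():
        print(f"[{tone}]\n{prompt}\n")
```

Because only the prefix changes, any accuracy difference between variants can be attributed to tone rather than to the question itself.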


by: knocte

Funny to find this just now, when just yesterday I told an LLM "and please don't lecture me again on $factAboutSomeProgrammingSubject", and then the LLM proceeded to write wrong tests and just told me "alright, tests pass, I'm sorry for correcting you before...". It took me a while to find the wrong tests. Wasted time all around.

5/28/2026, 6:18:42 AM


by: zmmmmm

It would be interesting to explore whether the results hold up on long-running tasks - this study looks like it was based on one-shot answers. With people, too, you can see short-term performance gains from rude interactions, but they cause lasting adverse behavior. I wouldn't be at all surprised if we saw the same issues with LLMs.

5/28/2026, 5:46:50 AM


by: not2b

If the result is statistically significant, it just barely makes it. 84.8% isn't that much higher than 80.8% and they had only 250 prompts, if I'm reading this right.

5/28/2026, 5:27:34 AM
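
A back-of-the-envelope version of the significance check raised in the comment above, as a minimal sketch. It assumes the two accuracy figures come from independent samples of 250 prompts each, which is the commenter's reading and may not match the paper's actual design. The function name is made up for illustration, and the pooled two-proportion z-test is a swapped-in technique, not the test the paper reports.

```python
import math


def two_proportion_z_test(p1: float, p2: float, n1: int, n2: int) -> tuple[float, float]:
    """Two-sided pooled z-test for a difference between two independent proportions."""
    # Pooled success rate under the null hypothesis that both rates are equal.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Assumption: 250 prompts per tone level, treated as independent samples.
    z, p = two_proportion_z_test(0.808, 0.848, 250, 250)
    print(f"z = {z:.3f}, two-sided p = {p:.3f}")
```

With these inputs z comes out a little under 1.2, short of the 1.96 usually required for p < 0.05. That holds only for this simplified model: if each prompt was run several times or the same questions were reused across tone levels, the effective sample size and the appropriate test both change.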


by: 331c8c71

Interesting.

I wonder why anyone would use a t-test when the experiment is clearly modelled by a binomial distribution: 250 independent questions, each one either answered correctly or not (the null is that the success rate is the same).

5/27/2026, 8:01:16 AM
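
The binomial framing suggested in the comment above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: 250 independent questions per tone level (so 84.8% corresponds to 212 correct), and the Very Polite rate of 80.8% treated as a fixed, known null rate rather than an estimate. The function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0): a one-sided exact binomial p-value."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Assumptions: n = 250 questions, 212 correct under Very Rude (84.8%),
    # null success rate fixed at the Very Polite accuracy of 80.8%.
    p_value = binomial_upper_tail(212, 250, 0.808)
    print(f"one-sided exact p = {p_value:.4f}")
```

Treating the baseline as fixed ignores its own sampling error, so this understates the uncertainty. If the same questions were asked at every tone level, a paired test such as McNemar's would fit the design better than either this or an unpaired comparison.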


by: TimCTRL

I only say please and thank you so that when the robots finally take over, they will remember I was nice to them.

5/27/2026, 8:22:00 AM


by: cadamsdotcom

GPT-4o is interesting to learn about - but it’d be great to test again with frontier models of May/June 2026 and see if these effects are gone, different, or the same.

Which model you use is a huge wildcard for results like this.

5/27/2026, 9:26:48 AM


by: theanonymousone

I have always said please and thank you to LLMs, not to increase accuracy or because I'm stupid. I believe it is more about me than about the LLM, and this is anyway a habit I don't want to lose.

5/27/2026, 8:19:16 AM


by: cyberclimb

Note that these results are specific to GPT-4o, so it's unclear how much they generalize.

They note at the end that they're also testing "GPT o3, and Claude", but no empirical results are included.

5/27/2026, 12:05:43 PM


by: ilitirit

I got downvoted for asking a related question recently, but I also don't think people really understood what I was asking - I'm not trying to anthropomorphise LLMs to that extent.

Basically, if you tell a model "You're an absolute moron, of course that's wrong!", will it give better or worse results? How much of that response will it absorb into its persona (like some humans tend to do)? Will it try to give "safer" responses to avoid negative feedback? How much of the associated behavior can be attributed to RLHF (e.g. like the sycophantic nature of LLMs)? How much can be attributed to training data?

Obviously this will vary by model and training, but I'm trying to get a general understanding.

I recall seeing related outcomes in some of Anthropic's studies, but I'm not sure how much of this particular aspect was studied.

5/27/2026, 9:16:44 AM


by: pulkas

The article is too old. Who is using GPT-4o today?

5/27/2026, 9:11:53 AM


by: dude250711

I have an idea: let's use these things for autonomous software engineering.

5/27/2026, 8:06:49 AM


by: atlasforgex

Yeah

5/27/2026, 11:41:55 AM


by: DeathArrow

I am always nice to my AIs in case they take over the world. /s

5/27/2026, 9:22:27 AM


by: polytely

It sort of makes sense to me, thinking of a student asking a question to an expert in the field. I would guess the successful interactions would, on average, be more polite. For example, if you were asking a question to Donald Knuth or Terence Tao, you'd probably be polite while doing so. Being hostile while asking questions gets you into forum discussion territory.

5/27/2026, 8:41:27 AM


by: dSebastien

I guess it makes sense, since we as humans tend to be far less inclined to help someone who is not polite/is not friendly, so that "bias" is part of the training data and thus influences how LLMs function.

5/27/2026, 8:51:09 AM