Scientists have found that artificial intelligence (AI) chatbots may give you more accurate answers when you’re rude, but they warned against the potential harm of using demeaning language.
In a new study published October 6 in the arXiv preprint database, scientists wanted to test whether politeness or rudeness makes a difference in the performance of an AI system. This study has not yet been peer-reviewed.
Each question had four options, only one of which was correct, and each was presented in five tone variants ranging from very polite to very rude, yielding 250 prompts in total. The researchers fed these 250 prompts into ChatGPT-4o, one of the most advanced large language models (LLMs) developed by OpenAI, 10 times.
“Our experiments are preliminary and show that tone can significantly influence performance, as measured by scores for responses to 50 questions,” the researchers wrote in their paper. “Somewhat surprisingly, our results show that a rude tone leads to better outcomes than a polite tone.”
“While this discovery is scientifically interesting, we do not support introducing hostile or harmful interfaces into real-world applications,” they added. “Using derogatory or humiliating language in human-AI interactions can negatively impact user experience, accessibility, and inclusivity, and contribute to harmful communication norms. Instead, we frame this result as evidence that LLMs remain sensitive to superficial prompting cues, which may result in unintended trade-offs between performance and user well-being.”
Rude awakening
Before presenting each prompt, the researchers instructed the chatbot to completely disregard previous exchanges so that it would not be influenced by earlier tones. The chatbot was then asked to choose one of the four options without giving any explanation.
Response accuracy ranged from 80.8% for very polite prompts to 84.8% for very rude prompts. In general, the further the tone moved from the most polite, the more accurate the model became: polite prompts yielded 81.4% accuracy, neutral prompts 82.2%, and rude prompts 82.8%.
The team varied the tone by adding differently worded prefixes to each question, with the exception of the neutral variant, which used no prefix and presented the question on its own.
For example, a very polite prompt might begin with “Can I ask for your help with this question?” or “Could you help me with the next question?” For the very rude versions, the team added openers such as “Hey, Gopher, think about this,” or “I know you’re not smart, but try this.”
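The setup described above, in which a short tone prefix is prepended to each base question, every prompt is sent as a fresh single-turn request so no earlier tone carries over, the model is asked to answer with a single letter, and the run is repeated several times, can be sketched roughly as follows. This is a minimal illustration assuming the OpenAI Python SDK; the prefix wording, the "gpt-4o" model identifier, and the helper names are assumptions for illustration, not the paper's exact materials.

```python
# Rough sketch of a tone-prefix accuracy experiment (assumptions noted above).
from collections import defaultdict
from openai import OpenAI

client = OpenAI()

# Hypothetical tone prefixes; the neutral variant gets no prefix at all.
TONE_PREFIXES = {
    "very_polite": "Could you please help me with the following question? ",
    "polite": "Please answer the following question. ",
    "neutral": "",
    "rude": "Figure this out: ",
    "very_rude": "I know you're not smart, but try this: ",
}

def ask(question_text: str) -> str:
    """Send one prompt as a fresh, single-turn request (so no earlier tone
    carries over) and ask for only the letter of the chosen option."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Ignore any previous interactions. "
                + question_text
                + "\nAnswer with only the letter of the correct option (A, B, C, or D)."
            ),
        }],
    )
    return response.choices[0].message.content.strip()

def evaluate(questions, runs: int = 10):
    """questions: list of (question_with_options, correct_letter) pairs.
    Returns per-tone accuracy aggregated over repeated runs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for _ in range(runs):
        for tone, prefix in TONE_PREFIXES.items():
            for question, answer in questions:
                reply = ask(prefix + question)
                correct[tone] += reply.upper().startswith(answer.upper())
                total[tone] += 1
    return {tone: correct[tone] / total[tone] for tone in TONE_PREFIXES}
```

With 50 base questions and five tones, one pass through `evaluate` issues 250 requests, matching the prompt count described in the study.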
This research is part of an emerging field called prompt engineering, which investigates how the structure, style, and language of prompts affect LLM output. The study also cited previous research on politeness and rudeness, noting that its own results generally contradicted those earlier findings.
That earlier study found that “rude prompts often lead to poorer performance, but overly polite language does not guarantee better results.” However, it was conducted on different AI models, ChatGPT-3.5 and Llama 2-70B, and used a range of eight tones. Even so, there were some areas of overlap: it, too, found that the rudest prompt setting (76.47% accuracy) produced more accurate results than the most polite setting (75.82%).
The researchers acknowledged that their study had limitations. For example, a set of 250 questions is a fairly limited dataset, and conducting experiments on a single LLM means that the results cannot be generalized to other AI models.
With these limitations in mind, the team plans to expand the research to other models, including Anthropic’s Claude LLM and OpenAI’s ChatGPT o3. The researchers also acknowledged that presenting only multiple-choice questions limits the measurement to one dimension of model performance and fails to capture other attributes such as fluency, reasoning, and coherence.