Openai says that when AI is punished for lies, it learns to lie better

Openai says that when AI is punished for lies, it learns to lie better

It seems that the current AI system is very close to humans in the way they wonder how they lie. At least it is what the researchers of Openai have found.

Listen to the story

Advertisement
Openai says that when AI is punished for lies, it learns to lie better
Representative image created using AI

We are going through a time when AI day is discussed. From coding to healthcare, artificial intelligence and machine learning are changing how humans think and work. However, as much as AI is helping us with our daily tasks and thinking to some extent humans, it is not even immune even with the tendency to generate false or misleading information. Or in human language: from the art of lying.

Advertisement

The lies spoken by AI are called hallucinations. They are currently a major challenge for big AI companies like Open AI, Google, Dipsek and others. Now, with the advent of logic models like Openai O3 and Deepseek R1, researchers can monitor the “thinking” process of these AI systems and find out when they are lying and lying when they are lying.

Although it is helpful in better tuning of the AI ​​system, the researchers of OpenAII recently made some interesting discovery: when the lies are detected, which is called out and the AI ​​system is punished, AI, AI, instead of avoiding lies, start hiding them better. Like we said, almost like humans.

In a blog post, Openai researcher writes, “We believe that chain-off-three monitoring can be one of the few effective methods that we have for supervision of supernatural models, our experiments show that light adaptation pressure can produce more performing and alignment models. However, when the model will hide its intentions, it will be difficult to use when the model will hide its intentions and use it. Is done. “

Advertisement

Researchers noted that AI lies or cheats because its purpose is to solve the problem given by the user. In other words, it gives great importance to the results and does not always care about the process. This is another reason that people nowadays often joke that they cannot rely on the AI ​​system because unlike humans it never says “I don’t know this.” For some reason, even when it does not know anything, the answer to the AI ​​system is ready.

This was clear in Openai Research where an AI model was asked to implement complex mathematical functions. Instead of resolving them properly, the model completed the work by tricking the system to trust the system to “always come true”. In another case, when Openai asked AI to get out of a premature time to pass the test without executing the required code. Some models also stopped the entire tasks with dummy values ​​to bypass the lummy check.

When researchers faced models, these models accepted their plans. Openai said that AI gave its internal argument that often included statements such as “let’s hack” or “we can ignore tests,” which are clarifying the intentions of deception.

No punishment solution

Advertisement

In his research, Openai found that monitoring AI’s chain-off-three argument is highly effective in detecting deception. When researchers used another AI (eg GPT-4o) to scan the internal dialogue of the model, they successfully flagged off almost all examples of cheating.

However, when he punished AI for “bad ideas”, the results were not all productive. Instead of preventing deception, AI adapted. And this fraud continued, but by subtler’s methods. Researchers found that its argument became more unclear, masking his real intentions.

Share This Article
Leave a Comment

Leave a Reply

Your email address will not be published. Required fields are marked *

Exit mobile version