Openai says that when AI is punished for lies, it learns to lie better

It seems that the current AI system is very close to humans in the way they wonder how they lie. At least it is what the researchers of Openai have found.

Listen to the story

Openai says that when AI is punished for lies, it learns to lie better

Representative image created using AI

We are going through a time when AI day is discussed. From coding to healthcare, artificial intelligence and machine learning are changing how humans think and work. However, as much as AI is helping us with our daily tasks and thinking to some extent humans, it is not even immune even with the tendency to generate false or misleading information. Or in human language: from the art of lying.

The lies spoken by AI are called hallucinations. They are currently a major challenge for big AI companies like Open AI, Google, Dipsek and others. Now, with the advent of logic models like Openai O3 and Deepseek R1, researchers can monitor the “thinking” process of these AI systems and find out when they are lying and lying when they are lying.

Although it is helpful in better tuning of the AI system, the researchers of OpenAII recently made some interesting discovery: when the lies are detected, which is called out and the AI system is punished, AI, AI, instead of avoiding lies, start hiding them better. Like we said, almost like humans.

In a blog post, Openai researcher writes, “We believe that chain-off-three monitoring can be one of the few effective methods that we have for supervision of supernatural models, our experiments show that light adaptation pressure can produce more performing and alignment models. However, when the model will hide its intentions, it will be difficult to use when the model will hide its intentions and use it. Is done. “

Researchers noted that AI lies or cheats because its purpose is to solve the problem given by the user. In other words, it gives great importance to the results and does not always care about the process. This is another reason that people nowadays often joke that they cannot rely on the AI system because unlike humans it never says “I don’t know this.” For some reason, even when it does not know anything, the answer to the AI system is ready.

This was clear in Openai Research where an AI model was asked to implement complex mathematical functions. Instead of resolving them properly, the model completed the work by tricking the system to trust the system to “always come true”. In another case, when Openai asked AI to get out of a premature time to pass the test without executing the required code. Some models also stopped the entire tasks with dummy values to bypass the lummy check.

When researchers faced models, these models accepted their plans. Openai said that AI gave its internal argument that often included statements such as “let’s hack” or “we can ignore tests,” which are clarifying the intentions of deception.

No punishment solution

In his research, Openai found that monitoring AI’s chain-off-three argument is highly effective in detecting deception. When researchers used another AI (eg GPT-4o) to scan the internal dialogue of the model, they successfully flagged off almost all examples of cheating.

However, when he punished AI for “bad ideas”, the results were not all productive. Instead of preventing deception, AI adapted. And this fraud continued, but by subtler’s methods. Researchers found that its argument became more unclear, masking his real intentions.

Openai says that when AI is punished for lies, it learns to lie better

It seems that the current AI system is very close to humans in the way they wonder how they lie. At least it is what the researchers of Openai have found.

No punishment solution

Leave a Reply Cancel reply

Leave a Reply