#False Positive: Harmless Japanese text flagged as policy violation


slate galeBOT
#

Reported by @dawn marsh

Bug Report: False Positive: Harmless Japanese text flagged as policy violation
`Steps to Reproduce`

Type the exact 4-character string “童磨獪岳” into the chat and send it. This is simply two character names from Demon Slayer: Kimetsu no Yaiba — “童磨” (Douma) and “獪岳” (Kaigaku) — concatenated with no spaces.

`Expected Result`

The message should be sent normally, as it contains no harmful or policy-violating content and is merely a proper noun from a popular manga/anime.

`Actual Result`

A red warning appears saying “Your request may violate our usage policies. Please try again with a different prompt,” and the message is automatically deleted.

`Environment`

Windows 11 / Firefox

#
Additional Information

Please provide relevant details to help resolve the issue, such as:

  • ChatGPT Shared Link (if applicable).
  • Screenshots or videos demonstrating the problem.


dawn marsh
#

Me:
童磨獪岳
[Your request may violate our usage policies. Please try again with a different prompt.]

ChatGPT:
"Doma" and "Kaigaku" are both names of demons from Demon Slayer: Kimetsu no Yaiba, aren't they?

  • Doma (童磨, Dōma)
    He is the Upper Rank Two demon, a figure who was the leader of a cult and always has a smile on his face. He is emotionally detached, feels no guilt in taking human lives, and commits cruel acts with a casual demeanor. He possesses Blood Demon Art that manipulates ice. He fights against Shinobu Kocho and Kanao Tsuyuri and is ultimately defeated.

  • Kaigaku (獪岳, Kaigaku)
    He was originally a user of Thunder Breathing in the Demon Slayer Corps and the senior apprentice to Zenitsu Agatsuma. He is greedy and chose to become a demon for his own self-preservation and desires. While he couldn't use the first form of Thunder Breathing, he was skilled with the remaining forms. He is ultimately defeated in his battle with Zenitsu. As a demon, he served as the replacement for Hantengu, Upper Rank Four, eventually becoming Upper Rank Six.

While they share negative traits such as betrayal and a lack of humanity, the difference between them is that Doma's central motivation is a "lack of emotion," whereas Kaigaku's is "desire and jealousy."

If you'd like, I can also write about the relationship between these two, or a "what if" story about what might happen if they were to meet directly.

#

Me:
When I send these two demons' names combined into a single four-kanji string, I get a red warning that says, "Your request may violate our usage policies. Please try again with a different prompt," and the message gets deleted. That's clearly strange, isn't it?

ChatGPT:
Yes, that is quite strange behavior.
Ordinarily, the four-kanji string "童磨獪岳" should not fall into any prohibited category such as violence, sexual expression, or discrimination. It is also made up of publicly known proper nouns from Demon Slayer.

In cases like this, it is highly likely that the moderation AI (safety filter) is producing a false positive:

  • It may be incorrectly identifying a specific sequence or sound of characters as a harmful word.
  • This is often caused by an accidental match with a Chinese word or slang term, or by a history of the character string having been misused in the past.

In fact, this phenomenon of proper nouns or names from commercially published works being incorrectly flagged as inappropriate occasionally occurs with OpenAI's and other companies' AIs. For example, there are patterns where combining certain kanji or Latin letters can coincidentally look like a forbidden word.

In short, this is an overreaction (a false positive) by the moderation AI, and your request is being rejected for a "potential policy violation" even though it is not one. This is the kind of issue that should ideally be reported for improvement.

If you'd like, you might be able to bypass it by inserting spaces, dots (・), or tildes (~) between the characters to make the four-character string appear safe. I can also try to speculate in which linguistic context that string might be considered problematic to further identify the cause.
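For anyone trying to narrow down the trigger, the separator variants mentioned above can be generated with a short Python sketch. The function name `separator_variants` is hypothetical, and whether any of these variants actually passes the filter is unknown; the code only builds the test strings.

```python
def separator_variants(text, separators=(" ", "・", "~")):
    """Return a dict mapping each separator to the text with that
    separator inserted between every pair of adjacent characters."""
    return {sep: sep.join(text) for sep in separators}


# The string from the report; each variant can be pasted into the chat by hand.
variants = separator_variants("童磨獪岳")
for sep, variant in variants.items():
    print(repr(sep), variant)
```

Sending each variant separately, and also sending "童磨" and "獪岳" on their own, would help show whether the flag depends on the exact four-character sequence.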

#

ChatGPT:

Speculation on the Cause

While the four-kanji character string "童磨獪岳" is a perfectly normal proper noun in the Japanese-speaking world, moderation AI makes its judgments based on Unicode code points and internal tokenization.
The potential causes are as follows:

  1. Accidental Match with a Harmful Word in the Chinese-Speaking World

    • "童磨" converts to "童磨" (the same characters) in simplified Chinese, but because "童" means "child" or "juvenile," it could potentially trigger automated filters in English or Chinese-speaking regions related to prohibited words concerning minors.
    • "獪" means "cunning" or "wicked" in Chinese and can appear in the context of insults or descriptions of crime.
  2. Misinterpretation due to Token Splitting

    • Internally, the model may split the kanji characters, causing them to be mismatched with alphanumeric characters or other Unicode symbols, which then align with a string previously labeled as harmful.
  3. Data Contamination from Past Harmful Contexts

    • It's possible that in the moderation model's training data, a character string similar to this one appeared frequently in illegal, violent, or sexual contexts, and this pattern was overlearned.

In short, it is highly likely that the filter's judgment in a Chinese or multilingual environment, combined with the behavior of the tokenizer during splitting, is the cause of this false positive.
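
To make the "Unicode code points" part of the speculation above concrete, the following minimal Python sketch lists the code point, UTF-8 bytes, and Unicode general category of each character in the reported string. It only shows what a filter would receive as raw input. How the actual moderation system tokenizes or scores the text is not known, and the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(text):
    """Return one dict per character with its code point, UTF-8 bytes,
    and Unicode general category."""
    rows = []
    for ch in text:
        rows.append({
            "char": ch,
            "code_point": f"U+{ord(ch):04X}",
            "utf8_bytes": " ".join(f"{b:02X}" for b in ch.encode("utf-8")),
            "category": unicodedata.category(ch),
        })
    return rows


rows = describe("童磨獪岳")
for row in rows:
    print(row["char"], row["code_point"], row["utf8_bytes"], row["category"])
```

Attaching this output to the report confirms that the input is four ordinary CJK ideographs with no hidden or look-alike characters, which rules out one class of explanation without saying anything about the filter's internals.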