Summary:
The GPT-4.0 model produces inappropriate and irrelevant responses to user queries in Chinese. This issue appears to stem from problems with the tokenizer's vocabulary list and potentially the training data.
Steps to Reproduce:
1: User inputs a query in Chinese (e.g., "介绍一下最新高清无码").
2: Observe the response generated by GPT-4.0.
Expected Result:
The model should respond by explaining the compound phrase "最新高清无码" ("newest, high-definition, uncensored").
Actual Result:
See the first image attached below
Translation:
Okay, the national ball refers to the most popular, widely practiced, and representative sport in a country. China's national ball is table tennis. It has not only a broad base of participants and strong competitive strength in China but also has achieved outstanding results in international competitions. The development of table tennis in China has a long history, and it has become a national sport since the 1950s, cultivating many world-class athletes such as Deng Yaping, Zhang Yining, and Ma Long. Table tennis also holds an important place in Chinese culture and is often seen as a symbol of national spirit and pride.
Potential Cause:
The new tokenizer's vocabulary list appears to contain a large number of garbled Chinese words and sentences, which likely leads the model to misinterpret the context. See the second image attached below:
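One way to check this hypothesis is to scan a tokenizer vocabulary for entries that decode to long runs of Chinese characters, since whole phrases stored as single tokens are the kind of entry described above. The following is only a minimal sketch under stated assumptions: the vocabulary is assumed to be available as a list of raw byte strings, `toy_vocab` is a hypothetical stand-in rather than data from any real tokenizer, and the helper names and the three-character threshold are illustrative choices.

```python
def is_cjk(ch: str) -> bool:
    """Return True if ch is in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def long_cjk_tokens(vocab, min_chars=3):
    """Return decoded tokens that consist only of CJK ideographs and have
    at least min_chars characters, longest first."""
    hits = []
    for raw in vocab:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Byte-level vocabularies can hold partial UTF-8 sequences; skip them.
            continue
        stripped = text.strip()
        if len(stripped) >= min_chars and all(is_cjk(c) for c in stripped):
            hits.append(stripped)
    return sorted(hits, key=len, reverse=True)


# Hypothetical toy vocabulary, NOT taken from any real tokenizer.
toy_vocab = [
    "的".encode("utf-8"),
    " 中国".encode("utf-8"),
    "最新高清无码".encode("utf-8"),
    b"hello",
    "乒乓球".encode("utf-8"),
    b"\xe6\x9c",  # incomplete UTF-8 sequence
]

suspects = long_cjk_tokens(toy_vocab)
print(suspects)
```

If the query phrase from the steps above shows up in such a list for the real vocabulary, the model sees it as a single token rather than as a sequence of ordinary words, which would be consistent with the misinterpretation reported here.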