#Search index not able to search for Mandarin


rotund stoneBOT
#

Hi @stoic lake. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

#

Convex full text search is optimized for English and other Latin-script languages. According to the documentation, text is tokenized using Tantivy's SimpleTokenizer, which splits on whitespace and punctuation. Mandarin (Chinese) text is written without spaces between words, so the tokenizer cannot break it into individual searchable terms, and support for it is limited. As a result, searching for Mandarin characters (e.g., "三") will not return the expected results.
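To illustrate the limitation described above, here is a minimal TypeScript sketch that approximates whitespace-and-punctuation splitting. It is not Tantivy's actual code: `simpleTokenize` is a hypothetical helper written only for this illustration, and it assumes that Chinese characters are treated as letters rather than as separators.

```typescript
// Hypothetical approximation of a whitespace-and-punctuation tokenizer.
// NOT Tantivy's implementation; it only mimics the documented behavior.
const SEPARATORS = new RegExp("[^\\p{L}\\p{N}]+", "u");

function simpleTokenize(text: string): string[] {
  // Split on every run of characters that are neither letters nor digits,
  // then drop the empty strings left at the edges.
  return text.split(SEPARATORS).filter((token) => token.length > 0);
}

const english: string[] = simpleTokenize("I have three apples.");
const mandarin: string[] = simpleTokenize("我有三个苹果。");

console.log(english);  // four separate terms
console.log(mandarin); // one long term containing the whole sentence

// A single-character query has no term of its own to match.
const hasSan: boolean = mandarin.some((token) => token === "三");
console.log(hasSan);
```

Because the whole Chinese sentence ends up as one term in this approximation, a query for a single character such as "三" finds no matching term.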

There is currently no documented way to change the tokenizer or improve support for Mandarin in Convex search indexes. If you need to support Mandarin or other non-Latin scripts, you may need to consider alternative approaches or external search solutions.

For more details, see the official documentation (Full Text Search Limits):

"Search indexes work best with English or other Latin-script languages. Text is tokenized using Tantivy's SimpleTokenizer, which splits on whitespace and punctuation."


Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:

#

@stoic lake Kapa.ai is still learning and improving, please let me know how I did by reacting below.

stoic lake
#

@rotund stone Is there any solution for this?

rotund stoneBOT
#

Let me check the knowledge base for you again. One moment! :books:

#

I'm sorry, but the knowledge sources do not mention any solution or workaround for enabling full text search for Mandarin (Chinese) in Convex search indexes. The documentation explicitly states that search indexes work best with English or other Latin-script languages, as they use Tantivy's SimpleTokenizer, which does not support Chinese tokenization. There is no information about customizing the tokenizer or supporting non-Latin scripts at this time (see Full Text Search Limits).

If support for Mandarin search is critical for your application, you may need to consider integrating an external search service that supports Chinese tokenization, but this is not covered in the provided knowledge sources.

stoic lake
#

@rotund stone How about converting the search keyword and the stored field into something like UTF-16?

rotund stoneBOT
#

The knowledge sources do not mention converting the search keyword and the stored field into UTF-16 or any other encoding as a solution for improving full text search support for Mandarin or other non-Latin scripts in Convex. The documentation only states that the underlying tokenizer (Tantivy's SimpleTokenizer) splits on whitespace and punctuation and is designed for Latin-script languages, without providing options for customizing tokenization or encoding to support other languages (see Full Text Search Limits).

Therefore, based on the available information, there is no documented solution involving encoding changes (such as UTF-16) to enable Mandarin search in Convex.
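Re-encoding alone would probably not help, since the tokenizer splits on whitespace and punctuation rather than on bytes. A related idea that does change what the tokenizer sees is to insert spaces between Chinese characters in both the stored field and the search keyword. The sketch below is an undocumented workaround that has not been verified against Convex: `spaceOutCjk` is a hypothetical helper, and it only covers the basic CJK Unified Ideographs block and Extension A.

```typescript
// Hypothetical helper, not part of Convex or Tantivy.
// Matches ideographs in the basic CJK block and Extension A only.
const CJK_CHAR = new RegExp("[\\u4E00-\\u9FFF\\u3400-\\u4DBF]", "g");

function spaceOutCjk(text: string): string {
  // Put spaces around each ideograph, then collapse repeated whitespace,
  // so a whitespace tokenizer would see one term per character.
  return text.replace(CJK_CHAR, " $& ").replace(/\s+/g, " ").trim();
}

// Apply the same transform to the stored text and to the search keyword.
const stored: string = spaceOutCjk("我有三个苹果");
const keyword: string = spaceOutCjk("三");
const mixed: string = spaceOutCjk("Convex搜索test");

console.log(stored);
console.log(mixed);

// With one term per character, the keyword now has an exact term to match.
const found: boolean = stored.split(" ").some((term) => term === keyword);
console.log(found);
```

In practice this would mean keeping the transformed text in a separate field that the search index points at, while the original field stays untouched for display. Single-character terms give up word-level precision, so results may be noisier than with a real Chinese tokenizer.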

stoic lake
#

@rotund stone Any suggestion on this?

rotund stoneBOT
#

Currently, Convex's full text search works best with English and other Latin-script languages because it uses Tantivy's SimpleTokenizer, which does not segment Mandarin or other CJK (Chinese, Japanese, Korean) text into words. The documentation does not mention any built-in way to customize the tokenizer, and it does not mention workarounds such as converting text to UTF-16 or similar encodings.

If Mandarin search is essential for your application, the knowledge sources do not provide a direct solution within Convex (see Full Text Search Limits). You may need to consider external search solutions that support CJK tokenization, but this is not covered in the provided documentation.

If you have specific requirements or want to request this feature, the Convex team encourages feedback and feature requests via their community channels.
