#Transliteration support for search and autocomplete (only)


lone osprey
#

Currently, Logseq treats a word written in another writing system as different from its transliteration in the Roman alphabet. For instance, the Chinese word for apple, 蘋果, is treated as a different entry from the Hanyu Pinyin for that word, "pingguo." Similarly, the Hindi word for apple, सेब, is treated as a separate entry from the transliteration "seb." This can be a problem if you entered a word in one script and then try to search for it in another. I feel that Logseq should be able to handle this in search and autocomplete without requiring an alias for each entry.

Yes, you can add an alias to every page whose title is in a non-Latin script, but this is a lot of work and, I believe, unnecessary. Nor am I suggesting that an alias be added automatically for these words, as that could cause problems if another page is transliterated the same way. For instance, Chinese has a lot of homonyms, so you would not want multiple aliases that could conflict.

Rather, it would just be useful to have these entries show up in search and autocomplete. If I type [[pingguo]], I would want [[蘋果]] to be suggested as an autocomplete option, and if I search for "seb," I would like to see the page [[सेब]] listed as an option.
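To make the idea concrete, here is a minimal sketch of search-time matching, assuming a small hand-written romanization table. The names `ROMANIZATIONS` and `suggest` are hypothetical and this is not how Logseq's search works; the page 平果 is included only to illustrate two titles that share a romanization. The point is that the romanized form is used as an extra search key, not stored as an alias, so pages with the same romanization simply both appear as suggestions.

```python
# Hypothetical romanization table, written by hand for illustration only.
# A real implementation would get these values from a transliteration
# library or a pronunciation dictionary.
ROMANIZATIONS = {
    "蘋果": ["pingguo"],
    "平果": ["pingguo"],  # a different page with the same romanization
    "सेब": ["seb"],
}


def suggest(query: str, page_titles: list[str]) -> list[str]:
    """Return titles whose own text or romanization starts with the query."""
    q = query.casefold()
    matches = []
    for title in page_titles:
        # Search keys: the title itself plus any known romanizations.
        keys = [title.casefold()] + ROMANIZATIONS.get(title, [])
        if any(key.startswith(q) for key in keys):
            matches.append(title)
    return matches


pages = ["蘋果", "平果", "सेब", "apple"]
print(suggest("pingguo", pages))  # both pages that romanize to "pingguo"
print(suggest("seb", pages))
```

Because nothing is written back to the pages, there is no alias to conflict: the user just picks the page they meant from the suggestions.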

steel urchin
#

Defaulting search to keyword matching makes sense - if you search Google for "蘋果" you don't see results for "pingguo" either.

I think something like this should eventually be solved with an (optional) semantic search that uses embeddings instead of keywords.

lone osprey
#

Incorrect

#

Also, on my iPhone I can call up the 蝦皮 app by searching for "xiapi"

#

These are the same words - just different transcription systems. Not translations of words.

#

I asked Claude AI "Can I determine the transliterated value in the roman alphabet for words in foreign scripts based on their unicode value?" and it answered as follows:

Yes, you can determine the transliterated value of words in foreign scripts based on their Unicode values through various methods and libraries.

For programmatic solutions:

  • Python's unidecode library can convert Unicode text to ASCII approximations
  • ICU (International Components for Unicode) libraries offer transliteration functions
  • JavaScript has libraries like transliteration that handle this task
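
As a rough illustration of what "based on their Unicode value" can and cannot do, here is a sketch that uses only Python's standard `unicodedata` module rather than any of the libraries listed above. The function `rough_devanagari_to_roman` is a hypothetical helper, and its rules (dropping the inherent "a" before a vowel sign, after a virama, and at the end of the word) are a simplification that will be wrong for many words. It works at all only because Devanagari character names spell out the sound. Han characters have names of the form "CJK UNIFIED IDEOGRAPH-<code point>", so for Chinese a pronunciation dictionary is needed instead.

```python
import unicodedata

# Names of independent vowel letters, which carry no inherent "a" to drop.
INDEPENDENT_VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ai", "o", "au"}


def rough_devanagari_to_roman(word: str) -> str:
    """Very rough romanization built from Unicode character names."""
    pieces = []         # romanized chunks, in order
    inherent_a = False  # True while the last chunk is a consonant ending in "a"
    for ch in word:
        name = unicodedata.name(ch, "")
        if name.startswith("DEVANAGARI LETTER "):
            roman = name.split()[-1].lower()  # e.g. "SA" -> "sa"
            pieces.append(roman)
            inherent_a = roman not in INDEPENDENT_VOWELS and roman.endswith("a")
        elif name.startswith("DEVANAGARI VOWEL SIGN "):
            if inherent_a:
                pieces[-1] = pieces[-1][:-1]  # vowel sign replaces inherent "a"
            pieces.append(name.split()[-1].lower())
            inherent_a = False
        elif name == "DEVANAGARI SIGN VIRAMA":
            if inherent_a:
                pieces[-1] = pieces[-1][:-1]  # virama suppresses inherent "a"
            inherent_a = False
        else:
            pieces.append(ch)  # pass anything else through unchanged
            inherent_a = False
    if inherent_a:
        pieces[-1] = pieces[-1][:-1]  # simplification: drop word-final "a"
    return "".join(pieces)


print(rough_devanagari_to_roman("सेब"))
print(unicodedata.name("蘋"))  # the name gives a code point, not a reading
```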
#

And of course Baidu (China's Google) treats pinyin as equivalent to characters.

#

And if I set my Google display language to Hindi, सेब shows up as a search suggestion for "seb" (though not the first choice, which is fine).

steel urchin
#

Hm, might be location-specific; my search results only return direct matches for "pingguo".

To me it still seems to be a niche feature that isn't related to the DB version.

I'm afraid that folks (in general, not necessarily you in particular or this request) are starting to add a lot of requests to alpha-db-feedback that are general feature requests not necessarily related to the DB version, in the hope that the devs pay more attention here and prioritize these requests.

Imho, such requests should go to some other place and db-feedback should only be related to specific DB requests.

If the MD version doesn't have it and it's not related to new DB-only features -> it should not go here as a feature request.

But a thread like this is probably not the right place to discuss these fundamental things.
Maybe @bleak tide could add some guidance to the channels if he agrees.

bleak tide
#

But so far here (Discord), the forum, and GitHub are the places to have discussions about Logseq — also for the more fundamental things outside of DB. Even if the devs don't always engage in the discussions, they do read them. I know Tienson does.

#

From what I've seen so far, the devs are good at ignoring the stuff that's not relevant to them right now 😉 I know this isn't nice for the people discussing the things unrelated to Logseq DB, but that's the state things are in at the moment

lone osprey
#

The team has explicitly requested that DB-related feature requests go in #1275533796476715009. They take a number of factors into account: their own vision of the app, how difficult something is to implement, and how many votes requests get from other users. So there is no need to police the requests of other users simply because they don't match your own - just don't vote for them.

Also note that multilingual search has always been a key feature of Logseq DB and MD, and this is simply a request to improve how it works. It is especially relevant now as they are tweaking the search architecture. I trust the team to know how much of a distraction this improvement would be and to make their own choices accordingly.

steel urchin
#

Not policing anything, just noting the fact that this isn’t a DB-related request - or does it currently work in the MD version and stopped working in DB?