#How to split Devanagari bi-, tri- and tetra-consonantal conjuncts as a whole from a string
oh, yes, it's just so that I understand, nothing else
Maybe try a split by whitespace?
Splitting by whitespace would only give me the text word by word, not the consonant conjuncts
Ah, I see
Could it be that there’s a zero-width joiner of some sort between those graphemes?
Could be
Though I haven't checked whether Devanagari conjuncts use a zero-width joiner. I have seen it in another script when I started working on this
If you iterate with .chars(), you can examine each char for anything "strange" that might identify a special char marking the ligature
Or if it’s just a font thing, and you’d have to know, for each grapheme, if it merges with another
I tested with chars() on my first try, but most of the time it does not give me conjuncts as a whole, like न्दी
Input: हिन्दी क्त्र क्ष्ण्य असम के मुख्यमंत्री हिमंत
['ह', 'ि', 'न', '\u{94d}', 'द', 'ी', ' ', 'क', '\u{94d}', 'त', '\u{94d}', 'र', ' ', 'क', '\u{94d}', 'ष', '\u{94d}', 'ण', '\u{94d}', 'य', ' ', 'अ', 'स', 'म', ' ', 'क', '\u{947}', ' ', 'म', '\u{941}', 'ख', '\u{94d}', 'य', 'म', '\u{902}', 'त', '\u{94d}', 'र', 'ी', ' ', 'ह', 'ि', 'म', '\u{902}', 'त']
(for reference)
There seems to be a special class of characters that get ligatured with the next one
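Going by the output above, that special character looks like `'\u{94d}'` (the Devanagari virama): it sits between the consonants of every conjunct. Here is a minimal sketch of regrouping the chars on that basis. The names `split_aksharas`, `is_devanagari_mark` and `is_consonant`, and the hand-picked code point ranges, are my own assumptions, not a standard API, and the sketch only covers the basic Devanagari block.

```rs
const VIRAMA: char = '\u{094D}';
const ZWNJ: char = '\u{200C}';
const ZWJ: char = '\u{200D}';

/// Dependent signs that attach to the preceding letter
/// (assumed ranges, basic Devanagari block only).
fn is_devanagari_mark(c: char) -> bool {
    matches!(c,
        '\u{0900}'..='\u{0903}'   // candrabindu, anusvara, visarga
        | '\u{093A}'..='\u{093C}' // two vowel signs and the nukta
        | '\u{093E}'..='\u{094F}' // dependent vowel signs and the virama
        | '\u{0951}'..='\u{0957}' // accent marks and extra vowel signs
        | '\u{0962}'..='\u{0963}' // vowel signs vocalic l and ll
    )
}

/// Consonants that can continue a conjunct after a virama (assumed ranges).
fn is_consonant(c: char) -> bool {
    matches!(c, '\u{0915}'..='\u{0939}' | '\u{0958}'..='\u{095F}')
}

/// Groups chars so that consonant + virama + consonant chains and their
/// dependent signs end up in one String. Everything else becomes its own item.
fn split_aksharas(text: &str) -> Vec<String> {
    let mut clusters: Vec<String> = Vec::new();
    // True right after a virama (optionally followed by a ZWJ).
    let mut joins_next = false;

    for c in text.chars() {
        let extends = is_devanagari_mark(c)
            || c == ZWJ
            || c == ZWNJ
            || (joins_next && is_consonant(c));

        if extends && !clusters.is_empty() {
            let last = clusters.len() - 1;
            clusters[last].push(c);
        } else {
            clusters.push(c.to_string());
        }

        joins_next = match c {
            VIRAMA => true,
            ZWJ => joins_next, // a joiner keeps the pending conjunct open
            _ => false,
        };
    }
    clusters
}
```

With this, `split_aksharas("हिन्दी")` should give `["हि", "न्दी"]`, and क्त्र and क्ष्ण्य each come back as a single item. Spaces are returned as their own items, so filter them out if you only want the syllables.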
Yes, that's something I got too. Also, I was trying to get the letters instead of the Unicode escapes that are shown, but could not get that to work
Like 'ि or 'ी
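As far as I know, the `\u{94d}`-style output comes from the `{:?}` (Debug) formatting of `char`, which escapes some combining marks such as the virama, while the Display form `{}` writes the character itself. A small sketch of printing both the letter and its code point; `show_chars` is a hypothetical helper name, not something from the standard library.

```rs
/// Formats every char with Display (the letter itself) next to its code point.
fn show_chars(text: &str) -> Vec<String> {
    text.chars()
        .map(|c| format!("{} (U+{:04X})", c, c as u32))
        .collect()
}
```

Calling `println!("{}", show_chars("न्दी").join(", "))` from `main` prints the marks as characters. How a lone combining mark is drawn (often on a dotted circle) depends on the terminal and font.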
I basically did
```rs
fn main() {
    let hs = "हिन्दी क्त्र क्ष्ण्य असम के मुख्यमंत्री हिमंत";
    let hsi: Vec<_> = hs.chars().collect();
    println!("{hsi:?}");
}
```
Exactly. Also, I just checked: they have a zero-width joiner too, I mean in the Devanagari script
Yes, that was exactly my first try too
Well, that is some tough shit
Yeh, true
I was trying to build it so that I get the output I am looking for, but no luck yet. :( :(
Hmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm 