How to split Devanagari bi-tri and tetra consonantal conjuncts as a whole from a string | Rust Programming Language Community | Page 1

round sonnet Jan 21, 2023, 8:16 PM

#

Are you intentionally printing two spaces in print!("{} ", ...?

daring minnow Jan 21, 2023, 8:16 PM

#

oh, yes, it's just so that I understand, nothing else

raw solstice Jan 21, 2023, 8:17 PM

#

Maybe try a split by whitespace?

daring minnow Jan 21, 2023, 8:18 PM

#

Splitting by whitespace would just give me word by word, not the consonant conjuncts
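
To illustrate the point above, here is a minimal sketch of a whitespace split using only the standard library. The helper name `words` is made up for this example. It shows that each item is a whole word that may still contain several conjuncts.

```rust
/// Splits on Unicode whitespace; each item is a whole word, not a conjunct.
/// (`words` is a hypothetical helper name used only for this sketch.)
fn words(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    let text = "हिन्दी क्त्र क्ष्ण्य";
    for word in words(text) {
        // A single word can still hold several conjuncts and vowel signs.
        println!("{word}: {} chars", word.chars().count());
    }
}
```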

raw solstice Jan 21, 2023, 8:18 PM

#

Ah, I see

#

Could it be that there’s a zero-width joiner of some sort between those graphemes?

daring minnow Jan 21, 2023, 8:19 PM

#

Could be

#

Though I haven't checked whether Devanagari conjuncts use a zero-width joiner. I have seen it in another script when I started working on this

raw solstice Jan 21, 2023, 8:19 PM

#

If you iterate with .chars(), examine each char for anything "strange" in your characters, to see whether a special char denotes the ligature

#

Or if it’s just a font thing, and you’d have to know, for each grapheme, if it merges with another
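
The check suggested above could look roughly like this. It is only a sketch: the helper names `joining_role` and `find_joiners` are invented here, and the list of "interesting" code points (virama, zero-width joiner, zero-width non-joiner) is an assumption about what is worth looking for, not a complete list.

```rust
/// Labels code points that control how neighbouring letters join.
/// The selection of code points is an assumption made for this sketch.
fn joining_role(c: char) -> Option<&'static str> {
    match c {
        '\u{094D}' => Some("DEVANAGARI SIGN VIRAMA"),
        '\u{200C}' => Some("ZERO WIDTH NON-JOINER"),
        '\u{200D}' => Some("ZERO WIDTH JOINER"),
        _ => None,
    }
}

/// Returns (char index, code point, label) for every joining character found.
fn find_joiners(text: &str) -> Vec<(usize, u32, &'static str)> {
    text.chars()
        .enumerate()
        .filter_map(|(i, c)| joining_role(c).map(|role| (i, c as u32, role)))
        .collect()
}

fn main() {
    for (i, cp, role) in find_joiners("हिन्दी क्त्र") {
        println!("char {i}: U+{cp:04X} {role}");
    }
}
```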

daring minnow Jan 21, 2023, 8:21 PM

#

I tested with chars() on my first try, but most of the time it does not give me conjuncts as a whole, like न्दी

raw solstice Jan 21, 2023, 8:22 PM

#

Input: हिन्दी क्त्र क्ष्ण्य असम के मुख्यमंत्री हिमंत

['ह', 'ि', 'न', '\u{94d}', 'द', 'ी', ' ', 'क', '\u{94d}', 'त', '\u{94d}', 'र', ' ', 'क', '\u{94d}', 'ष', '\u{94d}', 'ण', '\u{94d}', 'य', ' ', 'अ', 'स', 'म', ' ', 'क', '\u{947}', ' ', 'म', '\u{941}', 'ख', '\u{94d}', 'य', 'म', '\u{902}', 'त', '\u{94d}', 'र', 'ी', ' ', 'ह', 'ि', 'म', '\u{902}', 'त']
(for reference)

#

There seems to be a special class of characters that get ligatured with the next one
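
In the output above, the character sitting between the consonants of every conjunct is '\u{94d}', the Devanagari virama. Building on that, here is a rough sketch that glues "consonant + virama + consonant" runs together and keeps trailing signs with their letter. Everything in it (`split_clusters`, `conjuncts`, the code point ranges) is an assumption made for this sketch, not an existing API. It ignores zero-width joiners and other edge cases, and it is not the Unicode grapheme cluster algorithm.

```rust
const VIRAMA: char = '\u{094D}';

/// Devanagari consonants: ka..ha plus the nukta forms qa..yya.
fn is_consonant(c: char) -> bool {
    matches!(c, '\u{0915}'..='\u{0939}' | '\u{0958}'..='\u{095F}')
}

/// Signs that attach to the preceding letter: candrabindu, anusvara,
/// visarga, nukta, vowel signs, virama and accent marks.
fn is_dependent_sign(c: char) -> bool {
    matches!(
        c,
        '\u{0900}'..='\u{0903}'
            | '\u{093A}'..='\u{093C}'
            | '\u{093E}'..='\u{094F}'
            | '\u{0951}'..='\u{0957}'
            | '\u{0962}'
            | '\u{0963}'
    )
}

/// Splits text into clusters; a consonant right after a virama stays in
/// the same cluster, and dependent signs stay with the preceding letter.
fn split_clusters(text: &str) -> Vec<String> {
    let mut clusters: Vec<String> = Vec::new();
    let mut prev: Option<char> = None;
    for c in text.chars() {
        let joins = match prev {
            Some(p) => is_dependent_sign(c) || (p == VIRAMA && is_consonant(c)),
            None => false,
        };
        match clusters.last_mut() {
            Some(last) if joins => last.push(c),
            _ => clusters.push(c.to_string()),
        }
        prev = Some(c);
    }
    clusters
}

/// Keeps only clusters with two or more consonants, i.e. the conjuncts.
fn conjuncts(text: &str) -> Vec<String> {
    split_clusters(text)
        .into_iter()
        .filter(|cl| cl.chars().filter(|&c| is_consonant(c)).count() >= 2)
        .collect()
}

fn main() {
    let hs = "हिन्दी क्त्र क्ष्ण्य असम के मुख्यमंत्री हिमंत";
    for cluster in conjuncts(hs) {
        println!("{cluster}");
    }
}
```

With this sketch, हिन्दी splits into the two clusters हि and न्दी, and only न्दी counts as a conjunct.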

daring minnow Jan 21, 2023, 8:23 PM

#

Yes, I noticed that too. Also, I was trying to get the letters instead of the Unicode escapes that are shown, but could not manage it
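
On getting letters instead of escapes: the escapes come from Debug formatting (`{:?}`), which escapes some combining marks. Display formatting (`{}`) writes the character itself. A small sketch, with `show_chars` being a made-up helper name:

```rust
/// Formats every char with Display (the letter itself) next to its code
/// point. (`show_chars` is a hypothetical helper used only for this sketch.)
fn show_chars(text: &str) -> Vec<String> {
    text.chars()
        .map(|c| format!("{c} (U+{:04X})", c as u32))
        .collect()
}

fn main() {
    for line in show_chars("हिन्दी") {
        // A combining mark printed alone may render oddly in a terminal,
        // but it is the real character rather than an escape sequence.
        println!("{line}");
    }
}
```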

raw solstice Jan 21, 2023, 8:23 PM

#

Like 'ि' or 'ी'

#

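I basically did
```rs
fn main() {
    // Sample text with two-, three- and four-consonant conjuncts.
    let hs = "हिन्दी क्त्र क्ष्ण्य असम के मुख्यमंत्री हिमंत";
    // Collect the individual Unicode scalar values (not grapheme clusters).
    let hsi: Vec<_> = hs.chars().collect();
    // Debug formatting escapes some combining marks, e.g. '\u{94d}'.
    println!("{hsi:?}");
}
```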

daring minnow Jan 21, 2023, 8:23 PM

#

raw solstice There seem to have a special class of characters that will be ligatured with the...

Exactly. Also, I just checked: the Devanagari script uses the zero-width joiner too

daring minnow Jan 21, 2023, 8:24 PM

#

raw solstice I basically did ```rs fn main() { let hs = "हिन्दी क्त्र क्ष्ण्य असम के मुख्...

Yes, that was exactly my first try too

raw solstice Jan 21, 2023, 8:24 PM

#

Well, that is some tough shit

daring minnow Jan 21, 2023, 8:24 PM

#

Yeh, true

#

I was trying to build it so that I get the output I am after, but no luck yet. :( :(

raw solstice Jan 21, 2023, 8:35 PM

#

Hmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm ferrisThonk
