Get char from a u32 (non-codepoint) | Rust Programming Language Community

onyx radish Jun 30, 2022, 8:45 PM

#

So I’m writing an iterator that reads from a stream, and I’m trying to read it character by character. I’ve created a u32 representation of the UTF-8 grapheme (by using bitshifts with the 1-4 bytes, it works), however char::from_u32 seems to only want the code point of the character, and not a UTF-8 representation of it. Is there any way I can get a char from my UTF-8 bytes without needing to use a string, and without needing to, well, manipulate the bytes to get the codepoint value? I may just have to parse the bytes to get the codepoint, or suck it up and use a string, but I’d rather not if at all possible.

#

The gist: parsing UTF-8 characters on the fly from the individual bytes (I already validate that each byte is valid), without needing to use a string or further parse the bytes to derive the code point only to then use char::from_u32

#

I’ll send over the code, or at least the main part of it that is problematic, once I’m at my PC

onyx radish Jun 30, 2022, 9:07 PM

#

nevermind

#

I think I figured it out

#

Nope

obsidian lark Jul 1, 2022, 8:18 AM

#

char::from_u32 is a safe transmute, it doesn't do anything clever

#

It just checks if the bytes of the u32 would make a valid char, and transmutes if so.
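To make that check concrete, here is a small sketch of the validity rule as I understand it: a char is any Unicode scalar value, meaning 0 to 0x10FFFF with the surrogate range excluded. The helper name `is_scalar_value` is hypothetical and only mirrors what `char::from_u32` accepts.

```rust
/// Hypothetical helper mirroring the rule char::from_u32 enforces:
/// the value must be at most 0x10FFFF and must not be a surrogate.
fn is_scalar_value(v: u32) -> bool {
    v <= 0x10FFFF && !(0xD800..=0xDFFF).contains(&v)
}

/// Interprets `v` as a code point, exactly as char::from_u32 does.
/// No UTF-8 decoding happens here.
fn as_code_point(v: u32) -> Option<char> {
    if is_scalar_value(v) {
        char::from_u32(v)
    } else {
        None
    }
}
```

For 'é', whose code point is U+00E9 and whose UTF-8 bytes are 0xC3 0xA9, `as_code_point(0xE9)` gives `Some('é')`, while `as_code_point(0xC3A9)` does not fail: it returns whatever character sits at U+C3A9, which is why packed UTF-8 bytes appear to "work" but yield the wrong character.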

#

If you want to produce a char from your encoding (which is fixed-length 4-byte UTF-8, not the UTF-32 that char uses), you'll need to convert it yourself. The easiest way may be to convert your "char" to a str (using to_be_bytes, if I'm getting the endianness right, then converting with str::from_utf8), then taking the first character with chars().next() (a str can't be indexed like string[0]).
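A minimal sketch of that round trip through str, assuming the u32 was built by shifting the bytes in big-endian order (first UTF-8 byte in the highest used position) with zero padding above it. The helper name `char_from_packed_utf8` is hypothetical. Leading zero bytes are skipped because a zero byte is itself valid UTF-8 (U+0000) and would otherwise come out as the first character.

```rust
use std::str;

/// Hypothetical helper: decodes a u32 holding the 1-4 UTF-8 bytes of a
/// single character, packed big-endian with zero padding at the top.
fn char_from_packed_utf8(packed: u32) -> Option<char> {
    let bytes = packed.to_be_bytes();

    // Skip the zero padding, but keep the last byte so that a packed
    // value of 0 still decodes to U+0000.
    let start = bytes.iter().position(|&b| b != 0).unwrap_or(3);

    // Validate the remaining bytes as UTF-8 without allocating.
    let s = str::from_utf8(&bytes[start..]).ok()?;

    let mut chars = s.chars();
    let c = chars.next()?;

    // Reject input that decodes to more than one character.
    if chars.next().is_some() {
        return None;
    }
    Some(c)
}
```

With this packing, `char_from_packed_utf8(0xC3A9)` yields `Some('é')`. No String is allocated; the str is only a borrowed view over the four bytes on the stack.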

#

by using bitshifts with the 1-4 bytes
Wait what

#

That's not even 4-byte utf8, that's an entirely custom encoding

#

Which means you'd have to custom decode it too
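For reference, a hedged sketch of what decoding by hand could look like when the 1-4 bytes are still available as a slice, rather than packed into a u32. The function name `decode_code_point` is hypothetical; it extracts the payload bits of each byte, rebuilds the code point, and only then hands it to `char::from_u32`.

```rust
/// Hypothetical helper: `bytes` must hold exactly the 1-4 UTF-8 bytes
/// of one character. Returns None for malformed input.
fn decode_code_point(bytes: &[u8]) -> Option<char> {
    let (&first, rest) = bytes.split_first()?;

    // The leading byte gives the sequence length and the top payload bits.
    let (len, mut cp) = match first {
        0x00..=0x7F => (1, first as u32),
        0xC2..=0xDF => (2, (first & 0x1F) as u32),
        0xE0..=0xEF => (3, (first & 0x0F) as u32),
        0xF0..=0xF4 => (4, (first & 0x07) as u32),
        _ => return None, // continuation byte or invalid leading byte
    };
    if bytes.len() != len {
        return None;
    }

    // Each continuation byte has the form 10xxxxxx and adds six bits.
    for &b in rest {
        if b & 0xC0 != 0x80 {
            return None;
        }
        cp = (cp << 6) | (b & 0x3F) as u32;
    }

    // Reject overlong encodings: each length has a minimum code point.
    let min = [0, 0, 0x80, 0x800, 0x1_0000][len];
    if cp < min {
        return None;
    }

    // Surrogates and values above 0x10FFFF are rejected here.
    char::from_u32(cp)
}
```

For example, the bytes 0xC3 0xA9 give (0x03 << 6) | 0x29 = 0xE9, which `char::from_u32` turns into 'é'.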
