#[SOLVED] What the heck is a []rune and other string-related confusions.

marble zealot
#

I'm in the early stages of building a text editor in Odin and I'm confused about the different types there are to represent character strings in Odin.

As far as I can tell "string", "[]u8" and "[]rune" are the main ones, with "cstring" and "[^]u8" being for the purpose of communicating with C libraries. "[^]u8" seems to be the equivalent of the C "char *", which I like to use, but I feel like it probably isn't very "idiomatic" to use in Odin. I don't understand why both "cstring" and "[^]u8" exist given that the second one seems to be the more useful version of the first (?).

Anyway for the purposes of regular string manipulation in Odin, what should I generally use?

Here are some of my thoughts, but I may be wrong and I'd like to hear others'. I assume that if possible you would generally want to stick with the string type. You iterate them by rune and that seems sensible. I could imagine situations where []u8 makes sense if you need to iterate by byte instead of by rune.

The one that puzzles me most is []rune. I noticed that the length of a string (len(my_str)) seems to refer to the number of bytes, while the length of a []rune seems to refer to the number of runes. I also noticed that you index into a string by byte index, but into a []rune by rune index (an emoji counts as one instead of 4 or whatever). How does this even work as a simple array if runes are of variable size? I would think this is possible if all runes are stored as 4 bytes regardless of whether the second to fourth bytes are needed, but I don't know how to check if this is the case given that len() returns the number of runes. string and []u8 also seem nice to convert between as the data doesn't need to be reallocated, while going from string to []rune does.

I would appreciate some explanations around this matter, whether I got anything wrong and if anyone has rules of thumb or a good internalized understanding for when to use each, I'd love to hear!

void sand
#

[]rune exists because slices exist and runes exist; it isn't particularly useful as far as I can tell

#

[]u8 is useful for modification, since strings are immutable
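To illustrate the point about mutability, here is a minimal sketch, assuming the usual `core:strings` package. It copies a string first (literals may live in read-only memory), then views the copy as a `[]u8` and edits a byte in place. The variable names are made up for the example.

```odin
package main

import "core:fmt"
import "core:strings"

main :: proc() {
	s := "hello"

	// Clone so we own writable memory; do not write through a literal.
	owned := strings.clone(s)
	defer delete(owned)

	// Same bytes, now viewed as a mutable slice (no second copy).
	buf := transmute([]u8)owned
	buf[0] = 'H'

	fmt.println(string(buf)) // Hello
}
```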

#

cstring is effectively a semantic data type, it's a u8 multipointer, with the assumption that it's null terminated
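A small sketch of what that looks like in practice, assuming `strings.clone_to_cstring` from `core:strings`. It makes a null-terminated copy, reads its length, views it as a `string` again, and reinterprets it as a `[^]u8` for raw indexing.

```odin
package main

import "core:fmt"
import "core:strings"

main :: proc() {
	s := "hi there"

	// Allocates a copy with a trailing 0 byte, as C expects.
	cs := strings.clone_to_cstring(s)
	defer delete(cs)

	fmt.println(len(cs)) // 8 (found by scanning for the terminator)

	// Back to a string view; this aliases the cstring's memory.
	back := string(cs)
	fmt.println(back) // hi there

	// The same pointer as a multi-pointer, which can be indexed.
	p := transmute([^]u8)cs
	fmt.println(p[0]) // 104, the byte for 'h'
}
```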

#

also I'm pretty sure you're wrong about indexing rune slices, since a rune is just a 32-bit integer under the hood (a distinct type, the same size as an i32)

#

it's just a unicode codepoint
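One way to check the "are all runes stored as 4 bytes" question from the original post is `size_of`. This is a rough sketch using `utf8.string_to_runes` from `core:unicode/utf8`; the sample text is arbitrary.

```odin
package main

import "core:fmt"
import "core:unicode/utf8"

main :: proc() {
	// Every rune occupies 4 bytes, whatever codepoint it holds.
	fmt.println(size_of(rune)) // 4

	text := "a🤣"
	runes := utf8.string_to_runes(text) // allocates a new []rune
	defer delete(runes)

	fmt.println(len(text))                  // 5 (1 byte + 4 bytes of UTF-8)
	fmt.println(len(runes))                 // 2 (two codepoints)
	fmt.println(len(runes) * size_of(rune)) // 8 (bytes used by the rune data)
}
```

So a []rune can be a plain fixed-stride array: it trades memory for constant-time indexing by codepoint.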

#

so emojis may still span multiple codepoints

marble zealot
#

Okay, that helps. Thanks.

#

Regarding rune arrays here is a snippet and the return values as comments:

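```odin
package main

import "core:fmt"
import "core:unicode/utf8"

main :: proc() {
	s: string = "Hello🤣World"
	rune_array := utf8.string_to_runes(s) // allocates a new []rune
	defer delete(rune_array)

	fmt.println(len(s))          // 14 (bytes)
	fmt.println(len(rune_array)) // 11 (runes)
	fmt.printfln("%c", s[9])     // W
	fmt.println(rune_array[6])   // W
}
```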
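For comparison, a sketch of walking the same string by rune without allocating a []rune. The `for r, i in s` form yields each decoded rune together with its byte offset; `utf8.rune_size` and `utf8.rune_count_in_string` come from `core:unicode/utf8`.

```odin
package main

import "core:fmt"
import "core:unicode/utf8"

main :: proc() {
	s := "Hello🤣World"

	// byte_offset jumps from 5 to 9 across the emoji, which is 4 bytes long.
	for r, byte_offset in s {
		fmt.printfln("offset %d: %c (%d bytes)", byte_offset, r, utf8.rune_size(r))
	}

	// Counts codepoints by decoding, no allocation needed.
	fmt.println(utf8.rune_count_in_string(s)) // 11
}
```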

void sand
#

seems right to me, 🤣 is probably just one codepoint

marble zealot
#

Okay, I see. The difference in the index needed to reach the same character in a string versus a rune array is what I meant in the original post. In this scenario, to get to the W in World you need index 9 for the string, but only 6 for the []rune.

#

It makes sense for a rune to be a 32-bit integer.

#

One more question in case you know. What is the difference between transmute(string)st and string(st)? In the table in the Odin Overview it says this to go from []u8 to a string:

From []u8 to X:

| To     | Action | Code                              |
|--------|--------|-----------------------------------|
| string | alias  | transmute(string)st               |
| string | alias  | string(st) unless a slice literal |

void sand
#

it might just be the semantics of casts, might not be possible for some reason or another
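A quick sketch of the two forms side by side, under the assumption (taken from the "alias" column of the table) that neither copies the data. Here `st` is a local slice variable, not a slice literal passed directly to the conversion. Both results point at the same bytes, so a later write through `st` shows up in both strings.

```odin
package main

import "core:fmt"

main :: proc() {
	st := []u8{'o', 'd', 'i', 'n'}

	a := transmute(string)st // reinterpret the slice header as a string
	b := string(st)          // type conversion, same memory

	fmt.println(a == b)                      // true
	fmt.println(raw_data(a) == raw_data(st)) // true, no copy was made

	// Because both alias st, mutating st changes what the strings show.
	st[0] = 'O'
	fmt.println(a, b) // Odin Odin
}
```

This aliasing is also why strings made this way are only safe while the underlying []u8 stays alive and unmodified.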

marble zealot
#

Okay, thank you very much!

whole shadow
#

If you index a string, that's by bytes; indexing a []rune, on the other hand, is by UTF8 codepoint. So the difference in your original example is because the emoji is encoded as 4 bytes in the string: 0xF0, 0x9F, 0xA4, 0xA3.

cstring vs [^]u8: strings are immutable, so the multi-pointer is the mutable equivalent of a cstring (assuming it's null terminated). Similar to how []u8 is the mutable equivalent of a string.

fervent radish
#

UTF8 is a way to encode a Unicode codepoint, not the codepoint itself. A rune is a full Unicode codepoint, with no encoding.
A string is a sequence of Unicode codepoints encoded as UTF8, whereas a []rune is a slice of raw Unicode codepoints.
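The distinction can be seen by taking one rune and encoding it by hand. A minimal sketch, assuming `utf8.encode_rune` from `core:unicode/utf8`, which returns a 4-byte buffer plus the number of bytes actually used:

```odin
package main

import "core:fmt"
import "core:unicode/utf8"

main :: proc() {
	r: rune = '🤣'

	// The codepoint itself is just a number.
	fmt.printfln("U+%X", i32(r)) // U+1F923

	// Its UTF-8 encoding is a separate, variable-length byte sequence.
	bytes, n := utf8.encode_rune(r)
	fmt.println(n) // 4
	for b in bytes[:n] {
		fmt.printf("%X ", b) // F0 9F A4 A3
	}
	fmt.println()
}
```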

marble zealot
#

This is very useful information, guys. Thanks a ton.

dusky delta
#

just remember that 1 codepoint (rune) is not 1 grapheme (actual character glyph)
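A small sketch of that caveat, using an "e" followed by a combining acute accent (U+0301). It normally renders as a single character, yet it is two runes and three bytes. The sample string is only an illustration.

```odin
package main

import "core:fmt"
import "core:unicode/utf8"

main :: proc() {
	// 'e' + U+0301 COMBINING ACUTE ACCENT: displays as one glyph.
	s := "e\u0301"

	fmt.println(len(s))                       // 3 (bytes)
	fmt.println(utf8.rune_count_in_string(s)) // 2 (codepoints)
}
```

For a text editor this means cursor movement and deletion probably want to work on grapheme clusters, not on bytes or runes.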