hadfdasdf | Together C & C++ | Page 1

limber hound Jul 27, 2023, 8:39 PM

#

the encoding of characters is not defined by the standard, but almost all systems have char use ASCII for the "basic character set", which is a set of characters that is guaranteed to be able to be stored in 1 byte

there can be more characters, called the "extended character set" which can require more than 1 byte to store

the point of wchar_t is that it can represent all characters of the basic and extended character sets in one wchar_t

#

there is a set of characters that can be represented (almost always ASCII) in a char (1 byte), and other characters may be stored in a string with more than 1 char (multiple bytes)

wchar_t is so that you can represent any character in one value (of type wchar_t)
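A small sketch of the "more than 1 char" case, assuming UTF-8 as the multibyte encoding (the names `cyrillic_ie` and `bytes_in_cyrillic_ie` are made up for illustration). The character Ѐ is U+0400, which UTF-8 encodes as the two bytes 0xD0 0x80, so it occupies two `char` elements in a narrow string:

```c
#include <stddef.h>
#include <string.h>

/* U+0400 written out as its two UTF-8 bytes, so the result does not
   depend on how the source file itself is encoded. */
static const char cyrillic_ie[] = "\xD0\x80";

/* Number of char elements (bytes) the character occupies,
   not counting the terminating '\0'. */
size_t bytes_in_cyrillic_ie(void) {
    return strlen(cyrillic_ie);
}
```

Under that UTF-8 assumption, `bytes_in_cyrillic_ie()` should come out as 2, which is why a single `char` cannot hold this character.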

#

no, it depends upon the implementation what characters can be used

#

whatever compiler and platform you're compiling for, there is some flexibility in what is allowed so everyone doesn't have to do the exact same thing

#

;compile

char c='Ѐ';

rare pantherBOT Jul 27, 2023, 8:52 PM

#

Compiler Output

<source>: In function 'main':
<source>:3:8: warning: multi-character character constant [-Wmultichar]
    3 | char c='Ѐ';
      |        ^~~
<source>:3:8: warning: overflow in conversion from 'int' to 'char' changes value from '53376' to '-128' [-Woverflow]

limber hound Jul 27, 2023, 8:53 PM

#

limber hound ;compile ```c char c='Ѐ'; ```

here is an example where this character cannot be represented in 1 byte, so some warnings are being emitted about that

#

;compile

#include<stddef.h>
wchar_t w=L'Ѐ';

rare pantherBOT Jul 27, 2023, 8:53 PM

#

Compilation successful

No output.

limber hound Jul 27, 2023, 8:54 PM

#

no warnings here because it can be represented in wchar_t

#

yes each character will have a unique value, and the compiler will know what all of the values mean
so doing something like char c='a'; will result in the correct value for a, so that putchar(c); will correctly print out the letter a

#

it's up to the implementation, on windows it's UTF-16 and on most other operating systems it's UTF-32

not sure what you mean by "what is the biggest amount of bytes the wchar_t data type can hold?"

#

the size of a wchar_t object will always be the same

#

depends upon the implementation, on windows it's 2 bytes and on most other operating systems it's 4 bytes
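If you want to check what your own implementation picked, something like this should do it. It is only a sketch, and the function names `wchar_bits` and `print_wchar_size` are made up here:

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* wchar_t, size_t */
#include <stdio.h>

/* Width of wchar_t in bits on the implementation compiling this file. */
size_t wchar_bits(void) {
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Prints the size; typically 2 bytes on Windows and 4 bytes on Linux. */
void print_wchar_size(void) {
    printf("sizeof(wchar_t) = %zu bytes (%zu bits)\n",
           sizeof(wchar_t), wchar_bits());
}
```

Calling `print_wchar_size()` from `main` reports the value for the platform you compiled on; the number is fixed at compile time and never varies between wchar_t objects.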

#

yeah

#

wchar_t can represent any character the implementation defines

sharp oar Jul 27, 2023, 9:13 PM

#

limber hound `wchar_t` can represent any character the implementation defines

Sorry to hijack, info is useful / interesting, i think in modern (c11 +) we can also use char16_t / char32_t over wchar_t, i think this is the preferred choice when you need to deal with issues related to large chars right?

limber hound Jul 27, 2023, 9:14 PM

#

sharp oar Sorry to hijack, info is useful / interesting, i think in modern (c11 +) we can ...

yeah there are types char16_t and char32_t which represent UTF-16 and UTF-32 respectively

limber hound Jul 27, 2023, 9:15 PM

#

limber hound `wchar_t` can represent any character the implementation defines

and one char is only guaranteed to be able to represent a subset of all characters, called the "basic character set"
and it is guaranteed that all of the digits, latin letters, and symbols used by C are present in this character set

warped junco Jul 27, 2023, 9:24 PM

#

a character set is something like UTF-8 or ASCII

#

it tells you what numeric values correspond to what characters

#

char16_t is basically designed for storing UTF-16 characters, not sure if it's strictly mandated to be UTF-16 though

#

and when speaking about "width" for integers generally, that refers to the number of bits

#

nothing is really automatic there

#

it's the responsibility of the developer to ensure that only UTF-16 characters go into char16_t for instance

#

there is nothing in C that would prevent you from doing this

#

or that would prevent you from putting together a char16_t[] that isn't actually a valid UTF-16 string

#

it's all on you
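To make the "it's all on you" part concrete, here is a minimal sketch of the kind of check you would have to write yourself. The function name `utf16_is_well_formed` is hypothetical; it only looks at surrogate pairing, which is the one way a sequence of 16-bit code units can fail to be well-formed UTF-16:

```c
#include <stdbool.h>
#include <stddef.h>
#include <uchar.h>

/* Returns true if the n code units in s form well-formed UTF-16:
   every high surrogate (0xD800-0xDBFF) is immediately followed by a
   low surrogate (0xDC00-0xDFFF), and no low surrogate stands alone. */
bool utf16_is_well_formed(const char16_t *s, size_t n) {
    for (size_t i = 0; i < n; i++) {
        char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            /* A high surrogate needs a low surrogate right after it. */
            if (i + 1 >= n || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return false;
            i++;  /* skip the low surrogate we just checked */
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            return false;  /* low surrogate with no high surrogate before it */
        }
    }
    return true;
}
```

The compiler happily accepts an array like `{ 0x0041, 0xD83D }` as a `char16_t[]`; only a check like this tells you that it ends in an unpaired surrogate.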

limber hound Jul 27, 2023, 9:29 PM

#

warped junco `char16_t` is basically designed for storing UTF-16 characters, not sure if it's...

there are macros __STDC_UTF_16__ and __STDC_UTF_32__ that indicate if UTF-16 and UTF-32 are used, and I'm pretty sure everyone does correctly use UTF-16 and UTF-32
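A quick sketch of those macros and types together, assuming an implementation that provides `<uchar.h>` and defines `__STDC_UTF_16__` and `__STDC_UTF_32__`. The names `grin16`, `grin32` and the helper functions are made up for illustration. U+1F600 is outside the 16-bit range, so UTF-16 needs a surrogate pair for it while UTF-32 needs one unit:

```c
#include <stddef.h>
#include <uchar.h>

/* The same character, U+1F600, as a UTF-16 and a UTF-32 literal. */
static const char16_t grin16[] = u"\U0001F600";
static const char32_t grin32[] = U"\U0001F600";

/* Code units used by each literal, not counting the terminator. */
size_t grin_units_utf16(void) { return sizeof grin16 / sizeof grin16[0] - 1; }
size_t grin_units_utf32(void) { return sizeof grin32 / sizeof grin32[0] - 1; }

/* 1 if the implementation promises UTF-16 for char16_t, else 0. */
int char16_is_utf16(void) {
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}
```

When `char16_is_utf16()` returns 1, the UTF-16 literal should take two code units and the UTF-32 one a single unit.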

warped junco Jul 27, 2023, 9:29 PM

#

it's not UTF-32 on Windows

#

wchar_t is basically a pile of garbage

#

don't ever use it unless someone forces you at gun point

#

like the Windows API does

limber hound Jul 27, 2023, 9:30 PM

#

limber hound there are macros __STDC_UTF_16__ and __STDC_UTF_32__ that indicate if UT...

in the latest standard C2X they must represent UTF-16 and UTF-32

warped junco Jul 27, 2023, 9:30 PM

#

basically wchar_t was originally intended to support "all characters", but that was back when unicode was still fairly small

#

it quickly became clear that 16 bits aren't enough, but Windows got stuck with a 16-bit wchar_t while it's 32 bits wide on Linux

#

you use wprintf for wide characters

#

and I'm not sure if it's restricted to be only UTF-16, or only UTF-32

#

but any sane implementation will use one of those two

#

even if it's not strictly guaranteed, it would be inconceivably rare to see anything else

#

the character set of I/O functions is dictated by the current locale

#

it's not something that's mandated by the language

harsh kiteBOT Jul 27, 2023, 9:39 PM

#

cppreference.com

setlocale

char* setlocale(
    int category,
    const char* locale);

Defined in

<locale.h>

warped junco Jul 27, 2023, 9:40 PM

#

the locale is actually responsible for a huge amount of things, including dictating the character set of char in I/O functions

#

it dictates things like date/time formats, how numbers are printed, etc.

limber hound Jul 27, 2023, 9:41 PM

#

anything outside of the "basic character set" can be changed by the current locale

limber hound Jul 27, 2023, 9:42 PM

#

warped junco it dictates things like date/time formats, how numbers are printed, etc.

so for example, you could set the locale to a European one and get "," as the decimal point

#

if the basic character set is ASCII, then it will be unaffected by the locale

#

char c='a';
setlocale(...);//change the locale to something else
//still guaranteed to refer to a
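Here is a rough sketch of the decimal point example, using `setlocale` and `localeconv` from `<locale.h>`. The helper name `decimal_point_char` is made up, and a locale name like "de_DE.UTF-8" is only an assumption about what is installed on your system:

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Returns the first character of the decimal point used by the named
   locale, or '\0' if that locale is not available. The previous
   LC_NUMERIC setting is restored before returning. */
char decimal_point_char(const char *locale_name) {
    char saved[256];
    const char *current = setlocale(LC_NUMERIC, NULL);  /* query only */
    if (current == NULL || strlen(current) >= sizeof saved)
        return '\0';
    strcpy(saved, current);  /* later setlocale calls may overwrite it */

    char result = '\0';
    if (setlocale(LC_NUMERIC, locale_name) != NULL)
        result = localeconv()->decimal_point[0];

    setlocale(LC_NUMERIC, saved);
    return result;
}
```

`decimal_point_char("C")` gives '.', and with a German locale installed `decimal_point_char("de_DE.UTF-8")` would be expected to give ','. The value of `'a'` is the same either way, since it belongs to the basic character set.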

sharp oar Jul 27, 2023, 10:07 PM

#

It's just implementation defined, for example windows it's two bytes, on other platforms it's normally 4, although i think you'll mostly see wchars on code targeting windows

warped junco Jul 27, 2023, 10:15 PM

#

the implementation is the term that C uses for the compiler, OS, and CPU architecture

#

basically the thing that makes C run, as a whole
although the term mostly refers to the compiler, and the compiler then makes decisions based on the OS and CPU

#

yes, the size of wchar_t will be 2 bytes on Windows, and 4 bytes on Linux

#

this is fixed by the platform's ABI, which is usually described in the operating system's documentation

#

and the compiler just chooses one of two options based on what OS you're compiling for

limber hound Jul 27, 2023, 10:58 PM

#

'a' results in the value correct for doing char c='a';, and L'a' results in the value correct for doing wchar_t w=L'a';

#

L'a' is equivalent to doing btowc('a')
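A small sketch to check that claim on your own system. The helper name `btowc_agrees` is made up, and the result depends on the current locale; it is expected to hold for basic characters like 'a', but treat it as something to test rather than a guarantee:

```c
#include <stdbool.h>
#include <wchar.h>

/* True if converting the single byte c with btowc gives the same value
   as the wide character passed in. */
bool btowc_agrees(unsigned char c, wchar_t wide) {
    wint_t converted = btowc(c);
    return converted != WEOF && converted == (wint_t)wide;
}
```

For example, `btowc_agrees('a', L'a')` compares the runtime conversion against the value the compiler produced for the wide literal.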

warped junco Jul 27, 2023, 11:07 PM

#

what about L"1234"

limber hound Jul 27, 2023, 11:08 PM

#

what are you using to run the code?

warped junco Jul 27, 2023, 11:09 PM

#

that implies that printing wide strings works in principle, but perhaps the terminal that you're using to display this doesn't have a font that can display these characters

#

at least that's one explanation

limber hound Jul 27, 2023, 11:15 PM

#

setlocale(LC_ALL,"C.UTF-8"); seems to fix it for me

#

a lot of platforms won't support unicode in the default "C" locale
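A sketch of why the locale matters here: wide output functions such as wprintf convert each wchar_t to bytes using the current locale's encoding, so a character the locale cannot encode fails to convert. The helper name `encoded_length` is made up for illustration, and "C.UTF-8" is assumed to exist, which is not true on every system:

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Returns how many bytes the wide string ws needs in the named locale's
   multibyte encoding, or -1 if the locale is unavailable or ws cannot be
   encoded in it. The previous LC_CTYPE setting is restored. */
long encoded_length(const char *locale_name, const wchar_t *ws) {
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);  /* query only */
    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);

    long result = -1;
    if (setlocale(LC_CTYPE, locale_name) != NULL) {
        char buf[64];
        size_t n = wcstombs(buf, ws, sizeof buf);  /* bytes written */
        if (n != (size_t)-1)
            result = (long)n;
    }
    setlocale(LC_CTYPE, saved);
    return result;
}
```

With a UTF-8 locale, `encoded_length("C.UTF-8", L"\u0400")` would be expected to return 2. In the plain "C" locale the same call typically returns -1, which matches the wide string not printing until the locale is changed.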

#

yeah

#

because that's the name of the locale that I'm using to enable unicode, UTF-8, UTF-16, and UTF-32 are all just different ways of encoding unicode

#

no

#

here is a good article explaining UTF-8 https://en.wikipedia.org/wiki/UTF-8

#

for windows I think you will need to set the codepage

spice wyvern Jul 27, 2023, 11:33 PM

#

warped junco a character set is something like UTF-8 or ASCII

UTF-8 isn't a character set. it's a multibyte encoding of the Unicode character set

#

each of them can encode all of unicode

#

it will be an integer like short or long

spice wyvern Jul 27, 2023, 11:57 PM

#

yes, a character set is the set of characters that can be encoded

spice wyvern Jul 28, 2023, 12:51 AM

#

L is probably derived from the word "long", and it makes a string literal into a wide character string literal.
there are no other "string prefixes".
L is not defined in the wchar.h library. it is part of the string literal. the meaning of L"hello" is built into the language, as is the meaning of 3 or "hello".
btowc converts one character only
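Since btowc handles a single character, a whole string needs something like `mbstowcs` from `<stdlib.h>`. A minimal sketch, with the helper name `narrow_matches_wide` and the 64-element buffer chosen arbitrarily for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Converts the narrow string to wide characters with mbstowcs and
   reports whether the result equals the expected wide string. */
bool narrow_matches_wide(const char *narrow, const wchar_t *expected) {
    wchar_t buf[64];
    size_t capacity = sizeof buf / sizeof buf[0];
    size_t n = mbstowcs(buf, narrow, capacity);
    if (n == (size_t)-1 || n >= capacity)
        return false;  /* invalid input, or no room left for the terminator */
    return wcscmp(buf, expected) == 0;
}
```

So `narrow_matches_wide("1234", L"1234")` checks at runtime what the `L` prefix gives you at compile time for a string of basic characters.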

limber hound Jul 28, 2023, 12:51 AM

#

spice wyvern * `L` is probably derived from the word "long", and it makes a string literal in...

there are the prefixes u8, u, and U as well

spice wyvern Jul 28, 2023, 12:55 AM

#

'h' is a char (in C++; in C a character constant has type int)
"hello" is an array of char
L'h' is a wchar_t
L"hello" is an array of wchar_t
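One way to see the difference between these literals is to look at the size of one element. This is just a sketch compiled as C, with the macro name `UNIT_SIZE` and the function names made up here; it also covers the `u8`, `u` and `U` prefixes mentioned above:

```c
#include <stddef.h>
#include <stdio.h>
#include <uchar.h>

/* Size in bytes of one element of a string literal. */
#define UNIT_SIZE(literal) (sizeof((literal)[0]))

/* Size of a plain character constant; in C this is sizeof(int). */
size_t char_constant_size(void) {
    return sizeof('h');
}

void print_unit_sizes(void) {
    printf("\"hello\"   : %zu\n", UNIT_SIZE("hello"));    /* char */
    printf("u8\"hello\" : %zu\n", UNIT_SIZE(u8"hello"));  /* 1-byte units */
    printf("u\"hello\"  : %zu\n", UNIT_SIZE(u"hello"));   /* char16_t */
    printf("U\"hello\"  : %zu\n", UNIT_SIZE(U"hello"));   /* char32_t */
    printf("L\"hello\"  : %zu\n", UNIT_SIZE(L"hello"));   /* wchar_t */
}
```

The narrow and `u8` literals always have 1-byte elements; the `L` result is whatever `sizeof(wchar_t)` is on your platform.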

limber hound Jul 28, 2023, 1:10 AM

#

it's part of the string literal or character literal

#

the standard calls them an "encoding prefix"
