When your question is answered use !solved to mark the question as resolved.
Remember to ask specific questions, provide necessary details, and reduce your question to its simplest form. For tips on how to ask a good question run !howto ask.
the encoding of characters is not defined by the standard, but almost all systems have char use ASCII for the "basic character set", which is a set of characters that is guaranteed to be able to be stored in 1 byte
there can be more characters, called the "extended character set" which can require more than 1 byte to store
the point of wchar_t is that it can represent all characters of the basic and extended character sets in one wchar_t
there is a set of characters that can be represented (almost always ASCII) in a char (1 byte), and other characters may be stored in a string with more than 1 char (multiple bytes)
wchar_t is so that you can represent any character in one value (of type wchar_t)
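a rough sketch of the "more than 1 char" part, assuming the string is encoded as UTF-8 (which the standard doesn't promise). the function names utf8_byte_count and utf8_char_count are made up for this example:

```c
#include <stddef.h>
#include <string.h>

/* Number of char elements (bytes) in the string, not counting the '\0'. */
size_t utf8_byte_count(const char *s)
{
    return strlen(s);
}

/* Number of characters (code points), assuming s is valid UTF-8.
   Continuation bytes always look like 10xxxxxx, so they are skipped. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

for "a\xD0\x80" (the letter a followed by the UTF-8 bytes of Ѐ) the first function gives 3 and the second gives 2, so one character is taking up two char slots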
no, it depends upon the implementation what characters can be used
whatever compiler and platform you're compiling for, there is some flexibility in what is allowed so everyone doesn't have to do the exact same thing
;compile
char c='Ѐ';
<source>: In function 'main':
<source>:3:8: warning: multi-character character constant [-Wmultichar]
3 | char c='Ѐ';
| ^~~
<source>:3:8: warning: overflow in conversion from 'int' to 'char' changes value from '53376' to '-128' [-Woverflow]
here is an example where this character cannot be represented in 1 byte, so some warnings are being emitted about that
;compile
#include<stddef.h>
wchar_t w=L'Ѐ';
No output.
no warnings here because it can be represented in wchar_t
yes each character will have a unique value, and the compiler will know what all of the values mean
so doing something like char c='a'; will result in the correct value for a, so that putchar(c); will correctly print out the letter a
it's up to the implementation, on windows it's UTF-16 and on most other operating systems it's UTF-32
not sure what you mean by "what is the biggest amount of bytes the wchar_t data type can hold?"
the size of a wchar_t object will always be the same
depends upon the implementation, on windows it's 2 bytes and on most other operating systems it's 4 bytes
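if you want to see what your own implementation picked, something like this should do it (wchar_width_bits is just a name I chose for the example):

```c
#include <limits.h>
#include <stddef.h>

/* Width of wchar_t in bits on whatever implementation compiles this.
   sizeof gives the size in bytes, CHAR_BIT is the number of bits per byte. */
size_t wchar_width_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}
```

print the result with printf("%zu\n", wchar_width_bits()); you would typically expect 16 on Windows and 32 on Linux, but that is down to the implementation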
yeah
wchar_t can represent any character the implementation defines
Sorry to hijack, the info is useful / interesting. I think in modern C (C11 and later) we can also use char16_t / char32_t instead of wchar_t. I think this is the preferred choice when you need to deal with large characters, right?
yeah there are types char16_t and char32_t which represent UTF-16 and UTF-32 respectively
and one char is only guaranteed to be able to represent a subset of all characters, called the "basic character set"
and it is guaranteed that all of the digits, latin letters, and symbols used by C are present in this character set
a character set is something like UTF-8 or ASCII
it tells you what numeric values correspond to what characters
char16_t is basically designed for storing UTF-16 characters, not sure if it's strictly mandated to be UTF-16 though
and when speaking about "width" for integers generally, that refers to the number of bits
nothing is really automatic there
it's the responsibility of the developer to ensure that only UTF-16 characters go into char16_t for instance
there is nothing in C that would prevent you from doing this
or that would prevent you from putting together a char16_t[] that isn't actually a valid UTF-16 string
it's all on you
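so if you care, you have to check it yourself. here is a minimal sketch of a well-formedness check for UTF-16, assuming you only want to catch broken surrogate pairs. utf16_is_well_formed is a hypothetical helper, not something from the standard library:

```c
#include <stddef.h>
#include <uchar.h>

/* Returns 1 if the n code units in s form well-formed UTF-16, else 0.
   A high surrogate (0xD800-0xDBFF) must be followed by a low surrogate
   (0xDC00-0xDFFF), and a low surrogate may not appear on its own. */
int utf16_is_well_formed(const char16_t *s, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            if (i + 1 >= n || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return 0;
            ++i; /* skip the low surrogate that completes the pair */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return 0;
        }
    }
    return 1;
}
```

the compiler will happily accept a char16_t array like {0xD83D, 0x0041}, but this check rejects it because the high surrogate has no partner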
there are macros __STDC_UTF_16__ and __STDC_UTF_32__ that indicate if UTF-16 and UTF-32 are used, and I'm pretty sure everyone does correctly use UTF-16 and UTF-32
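checking those macros could look roughly like this. the function names are invented for the example, and it assumes your platform ships <uchar.h> (a few older ones don't):

```c
#include <uchar.h>

/* 1 if the implementation promises that char16_t values are UTF-16. */
int char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}

/* 1 if the implementation promises that char32_t values are UTF-32. */
int char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;
#endif
}

/* U+0400 written as a universal character name, so the result does not
   depend on how the source file itself is encoded. */
char32_t sample_char32(void)
{
    return U'\u0400';
}
```

when char32_is_utf32() returns 1, sample_char32() has to be 0x0400, because a UTF-32 value is just the code point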
it's not UTF-32 on Windows
wchar_t is basically a pile of garbage
don't ever use it unless someone forces you at gun point
like the Windows API does
in the latest standard, C2X (published as C23), char16_t and char32_t must represent UTF-16 and UTF-32
basically wchar_t was originally intended to support "all characters", but that was back when unicode was still fairly small
it quickly became clear that 16 bits aren't enough, but Windows got stuck with a 16-bit wchar_t while it's 32 bits wide on Linux
you use wprintf for wide characters
and I'm not sure if it's restricted to be only UTF-16, or only UTF-32
but any sane implementation will use one of those two
even if it's not strictly guaranteed, it would be inconceivably rare to see anything else
the character set of I/O functions is dictated by the current locale
it's not something that's mandated by the language
char* setlocale(int category, const char* locale);
defined in <locale.h>
the locale is actually responsible for a huge amount of things, including dictating the character set of char in I/O functions
it dictates things like date/time formats, how numbers are printed, etc.
anything outside of the "basic character set" can be changed by the current locale
so for example, you could set the locale to a european one and get , as the decimal point instead of .
if the basic character set is ASCII, then it will be unaffected by the locale
char c='a';
setlocale(...);//change the locale to something else
//still guaranteed to refer to a
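to see the decimal point part in action, here is a small sketch. the helper names are made up, and "de_DE.UTF-8" is only an assumed locale name that may not be installed on your system, which is why the switch can fail:

```c
#include <locale.h>
#include <string.h>

/* Tries to switch number formatting to the named locale.
   Returns 1 on success, 0 if the locale is not available. */
int try_numeric_locale(const char *name)
{
    return setlocale(LC_NUMERIC, name) != NULL;
}

/* The decimal-point string of the current LC_NUMERIC locale. */
const char *current_decimal_point(void)
{
    return localeconv()->decimal_point;
}

/* 1 if numbers are currently printed with a comma as the decimal point. */
int decimal_point_is_comma(void)
{
    return strcmp(current_decimal_point(), ",") == 0;
}
```

in the "C" locale current_decimal_point() gives ".", and after a successful try_numeric_locale("de_DE.UTF-8") functions like printf switch to ","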
It's just implementation defined, for example windows it's two bytes, on other platforms it's normally 4, although i think you'll mostly see wchars on code targeting windows
the implementation is the term that C uses for the compiler, OS, and CPU architecture
basically the thing that makes C run, as a whole
although the term mostly refers to the compiler, and the compiler then makes decisions based on the OS and CPU
yes, the size of wchar_t will be 2 bytes on Windows, and 4 bytes on Linux
this is defined somewhere in the operating system's documentation or something
and the compiler just chooses one of two options based on what OS you're compiling for
'a' results in the value correct for doing char c='a';, and L'a' results in the value correct for doing wchar_t w=L'a';
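for a character like a, L'a' gives you the same value as doing btowc('a'), except that the literal is worked out by the compiler and btowc converts at run time using the current locale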
what about L"1234"
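for a whole string the run-time counterpart is mbstowcs rather than btowc. a hedged sketch comparing both against the literals (the two helper names are invented, and the 32-element buffer is an arbitrary limit for the example):

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* 1 if converting the single-byte character c at run time gives the
   same value as the wide character w. */
int matches_wide_char(char c, wchar_t w)
{
    wint_t converted = btowc((unsigned char)c);
    return converted != WEOF && converted == (wint_t)w;
}

/* 1 if converting the narrow string at run time gives the same wide
   string as wide. Strings that do not fit the buffer count as a mismatch. */
int matches_wide_string(const char *narrow, const wchar_t *wide)
{
    wchar_t buf[32];
    size_t n = mbstowcs(buf, narrow, 32);
    if (n == (size_t)-1 || n >= 32)
        return 0;
    return wcscmp(buf, wide) == 0;
}
```

on the usual implementations matches_wide_char('a', L'a') and matches_wide_string("1234", L"1234") both return 1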
what are you using to run the code?
that implies that printing wide strings works in principle, but perhaps the terminal that you're using to display this doesn't have a font that can display these characters
at least that's one explanation
setlocale(LC_ALL,"C.UTF-8"); seems to fix it for me
a lot of platforms won't support unicode in the default "C" locale
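the reason is that wprintf has to turn each wide character into bytes using the encoding of the current locale. here is a sketch of that, assuming "C.UTF-8" exists on your system (it doesn't everywhere). encoded_length and print_wide_in_locale are names I made up:

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Number of bytes the current locale needs to encode ws, or -1 if ws
   contains a character the locale cannot represent. The 64-byte buffer
   is an arbitrary limit for this example. */
long encoded_length(const wchar_t *ws)
{
    char buf[64];
    size_t n = wcstombs(buf, ws, sizeof buf);
    return n == (size_t)-1 ? -1L : (long)n;
}

/* Switches to the named locale and prints ws followed by a newline.
   Returns wprintf's result, or -1 if the locale is not available. */
int print_wide_in_locale(const char *locale_name, const wchar_t *ws)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;
    return wprintf(L"%ls\n", ws);
}
```

if encoded_length returns -1 for your string in the default locale, then wprintf will most likely fail on it too, and calling something like print_wide_in_locale("C.UTF-8", L"\u0400") is the kind of fix I mean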
yeah
because that's the name of the locale that I'm using to enable unicode. UTF-8, UTF-16, and UTF-32 are all just different ways of encoding unicode
no
here is a good article explaining UTF-8 https://en.wikipedia.org/wiki/UTF-8
for windows I think you will need to set the codepage
UTF-8 isn't a character set. it's a multibyte encoding of the Unicode character set
each of them can encode all of unicode
it will be an integer like short or long
yes, a character set is the set of characters that can be encoded
L is probably derived from the word "long", and it makes a string literal into a wide character string literal.
L is not defined in the wchar.h library. it is part of the string literal. the meaning of L"hello" is built into the language, as is the meaning of 3 or "hello".
btowc converts one character only
there are the prefixes u8, u, and U as well
'h' is a char value (strictly speaking its type in C is int, it's only char in C++)
"hello" is an array of char
L'h' is a wchar_t
L"hello" is an array of wchar_t
it's part of the string literal or character literal
the standard calls them an "encoding prefix"
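a quick way to convince yourself that the prefix only changes the element type: the same text has the same number of elements under every prefix. this sketch assumes a C11 compiler with <uchar.h>, and the function names are made up:

```c
#include <stddef.h>
#include <uchar.h>

/* Element counts of the same text under each encoding prefix.
   Each count includes the terminating null element. */
size_t narrow_elems(void) { return sizeof "hello" / sizeof(char); }
size_t wide_elems(void)   { return sizeof L"hello" / sizeof(wchar_t); }
size_t utf16_elems(void)  { return sizeof u"hello" / sizeof(char16_t); }
size_t utf32_elems(void)  { return sizeof U"hello" / sizeof(char32_t); }

/* The element type of a u8 literal differs between C11 and C23, so this
   one divides by the size of its own first element. */
size_t utf8_elems(void)   { return sizeof u8"hello" / sizeof(u8"hello"[0]); }
```

all five return 6 (five letters plus the terminator). only the total size in bytes differs, because the elements have different sizes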
Thank you and let us know if you have any more questions!
Please don't delete forum posts. They can be helpful to refer to later and other members can learn from them. In the future you can use !solved to close a post and mark a post as solved.