#Unicode thread
@ocean hedge
it should remove any non-ASCII unicode characters, collapse any run of spaces in between actual ascii characters down to a single space, and remove spaces before the string and after the string.
that's not what I'm asking
in the input you gave me
what should be matched and what shouldn't
ah it should look like this :
A A B C D E F G \ TEST TEST TEST TEST TEST!!!!!!! A \\\\ AB
printable ASCII should not be matched (if matched means removed). spaces CAN be matched, but the result cannot have more than one space in a row. EVERYTHING else should be matched, including tabs.
so you want to remove every non-character, except spaces between words?
Does that include numbers?
[^\x{0020}-\x{007e}] this should match every character outside the printable ASCII range (space through ~), so letters, digits and the usual signs are kept
please see table here https://unicode-table.com/fr/#0061
space being in those characters
ok, will this keep spaces in between chars but remove multiple spaces?
probably not
Without unicode stuff, I believe it would be as simple as expressing remove anything not a-zA-Z0-9 or a space
Then trim.
@rovsa'
^
thats what im trying to do
you won't have any special chars tho
ok, well how can i keep special characters but not keep stuff like line feed, all forms of tab, etc?
but indeed the best way to do it is probably to remove the chars you don't want and then trim
matching 2 or more char is doable, leaving 1 sounds annoying
@ocean hedge what do you mean? I am trying to remove multiple spaces in between actual characters.
[a-zA-Z0-9\s] // Matches (and so removes) letters, digits and whitespace
[^a-zA-Z0-9\s] // Matches (and so removes) everything else
like that then?
i want to keep the numbers. which are part of ascii. then remove MULTIPLE whitespaces.
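Yep, and the a-zA-Z0-9 can mostly be replaced with \w, though \w also matches _ and, in .NET, non-ASCII letters and digits unless you pass RegexOptions.ECMAScript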
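A small sketch of how \w differs from a-zA-Z0-9 in .NET, since the goal here is to strip non-ASCII characters. The class and method names (WordClassDemo, MatchesWord, MatchesAsciiWord) are made up for illustration; only Regex.IsMatch and RegexOptions.ECMAScript come from the framework.

```csharp
using System.Text.RegularExpressions;

public static class WordClassDemo
{
    // Default .NET behaviour: \w is Unicode-aware, so it accepts
    // letters such as U+00E9 as well as the underscore.
    public static bool MatchesWord(string s)
    {
        return Regex.IsMatch(s, @"^\w$");
    }

    // With RegexOptions.ECMAScript, \w is narrowed to [a-zA-Z_0-9].
    public static bool MatchesAsciiWord(string s)
    {
        return Regex.IsMatch(s, @"^\w$", RegexOptions.ECMAScript);
    }
}
```

So a filter built on [^\w\s] would probably let accented letters through unless the ECMAScript option is set, which is one reason to prefer an explicit range.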
use the snippet I gave you, it should work fine
special chars will be included, numbers too
don't forget to trim after the regex remove
if you need to expand or reduce the matched chars, see the unicode table I sent you earlier
so in regex, i can do something like this?
[\x{0041-007E}]
? i want to remove multiple ranges.
.<
Without the curlies
the regex standard disagrees
but I guess it depends on the engine
I'm not very familiar with C#
string removeUnicode = @"(\b[\s]+&[^\x{0020}-\x{007e}]+)";
ArgumentException: parsing "(\b[\s]+&[^\x{0020}-\x{007e}]+)" - Insufficient hexadecimal digits.
@ocean ingot did I get it wrong?
You're using PHP (PCRE) syntax
so something like this, correct ?
[^\u0041-\u0071]
string removeUnicode = @"(\b[\s]+&[^\u{0020}-\u{007e}]+)";
ArgumentException: parsing "(\b[\s]+&[^\u{0020}-\u{007e}]+)" - Insufficient hexadecimal digits.
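For anyone hitting the same two errors: as far as I know, the .NET engine expects \x followed by exactly two hex digits and \u followed by exactly four, with no curly braces. A minimal sketch to check which spellings the engine accepts (EscapeSyntaxDemo and IsValidPattern are hypothetical helper names, not framework API):

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeSyntaxDemo
{
    // Returns true if the .NET regex engine accepts the pattern,
    // false if construction throws the ArgumentException seen above.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

With this helper, @"[^\x{0020}-\x{007e}]" and @"[^\u{0020}-\u{007e}]" should both come back false, while @"[^\u0020-\u007e]" and @"[^\x20-\x7e]" should come back true.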
string removeUnicode = @"[^\u0041-\u0071]";
that worked but it removed all spaces.
see the unicode table I gave you
ok
^ inverts the condition after it
0041 would be A
0071 would be q
so that regex is matching anything not between those 2
i dont understand why it removed all the spaces inbetween the characters.
what's the unicode for space. (still the table)
\20 or something
0020
I don't remember
yes
Ah it is
I just told you, 0041-0071 is what you're keeping
0020 < 0041, so you're not keeping spaces right now
Space isn't included in the range
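To make the range issue concrete, here is a small sketch comparing the two ranges. RangeDemo and its method names are illustrative only. Note that 0041-0071 also drops the lowercase letters r through z, since those sit above U+0071.

```csharp
using System.Text.RegularExpressions;

public static class RangeDemo
{
    // Keeps only U+0041 ('A') through U+0071 ('q'); everything else,
    // including the space at U+0020, is removed.
    public static string KeepAToQ(string input)
    {
        return Regex.Replace(input, @"[^\u0041-\u0071]", "");
    }

    // Keeps U+0020 (space) through U+007E ('~'), i.e. printable ASCII.
    public static string KeepPrintableAscii(string input)
    {
        return Regex.Replace(input, @"[^\u0020-\u007e]", "");
    }
}
```

For the input "A b z", KeepAToQ gives "Ab" (both spaces and the z are gone), while KeepPrintableAscii leaves the string unchanged.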
is it 7e or 7f?
no clue what the 007f is tbh
Look at the table
I would exclude it
it works (btw)
yes so you probably don't want that
\o/
regexes are always a pain, glad regex101 is here to save the day
so no more need for \b \s+ etc?
Split whatever other thing you want to do into an entirely separate thing instead of trying to combine it into one giant regex query. It'd make this whole thing way easier and more robust to change
yea, that's a good idea. i'll do that.
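One way the "separate steps" idea could look, as a rough sketch rather than the definitive answer. AsciiCleaner and Clean are hypothetical names, and it assumes tabs and other stripped characters should be deleted outright rather than turned into spaces.

```csharp
using System.Text.RegularExpressions;

public static class AsciiCleaner
{
    public static string Clean(string input)
    {
        // Step 1: drop everything outside printable ASCII
        // (tabs, line feeds, and all non-ASCII characters).
        string asciiOnly = Regex.Replace(input, @"[^\u0020-\u007e]+", "");

        // Step 2: collapse runs of two or more spaces into one space.
        string singleSpaced = Regex.Replace(asciiOnly, @" {2,}", " ");

        // Step 3: remove leading and trailing spaces.
        return singleSpaced.Trim();
    }
}
```

For example, Clean("  A\t\tB   C \u00e9 D  ") gives "AB C D". Each step can be changed on its own, e.g. swapping step 1 to replace with a space if deleted tabs should still separate words.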
ok now THIS is an interesting cheat sheet.
ah nice
anyway hf with your string filtering, have a good one Kaori
[a-zA-Z0-9\s] // Matches (and so removes) letters, digits and whitespace
[^a-zA-Z0-9\s] // Matches (and so removes) everything else
This works btw - same thing - a bit more simple approach
Side note that you would also want to add a + quantifier after the character class, i.e. ]+, so it produces fewer matches and is faster
Not sure I'm familiar with that
@ocean ingot so it would be :
string removeUnicode = @"[^\u0020-\u007e]+";
?
Yep
ok
Let's take your example, here instead of 7 matches containing one !, you would get one match containing seven !.
interesting
As the + is "match what's before one or more times"
Thanks. Good to know :)
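A quick sketch of the match-count difference, using the run of ! from the example string. QuantifierDemo and CountMatches are made-up names; the speed benefit is likely but not measured here.

```csharp
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Counts how many separate matches a pattern produces on the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }
}
```

CountMatches("TEST!!!!!!!", "!") returns 7, while CountMatches("TEST!!!!!!!", "!+") returns 1, so a replace with the + version does one substitution per run instead of one per character.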
thank you ALL for the help! you all are amazing! regex is a bitch, but you guys slayed it! amazing!
You can consider visiting a CSharp discord for things only related to C#
well i wasn't sure since i wanted to combine regexes in unity.
I find they are a bit more active replying to these types of questions
will do
hm, I don't know if Regex is much different from regular C#
Could be.
Have a nice evening. Bye o/
you as well, @fathom eagle! 🙂
I'll let you archive the thread when needed 👋