#Unicode thread

tawdry leaf
#

@ocean hedge

#

it should remove any non-ASCII characters and collapse any run of spaces between ASCII characters down to a single space, and also remove the spaces before and after the string.

ocean hedge
#

that's not what I'm asking

#

in the input you gave me

#

what should be matched and what shouldn't

tawdry leaf
#

ah it should look like this :

A A B C D E F G \ TEST TEST TEST TEST TEST!!!!!!! A \\\\ AB

no printable ASCII should be matched (if matched means removed). spaces CAN be matched, but the result cannot have more than one space in a row. ALL non-ASCII characters should be matched, and so should tabs.

ocean hedge
#

ok

#

ok so unicode matching is \x{xxxx}

#

\x{0041} -> A

fathom eagle
tawdry leaf
#

yes. keep numbers.

#

keep apostrophes, etc...

#

but remove non-ascii

ocean hedge
#

[^\x{0020}-\x{007e}] this should match every character except printable ASCII (letters, digits and the usual signs)

#

space being in those characters

tawdry leaf
#

ok, will this keep spaces inbetween chars but remove multiple spaces?

ocean hedge
#

probably not

fathom eagle
#

Without unicode stuff, I believe it would be as simple as expressing remove anything not a-zA-Z0-9 or a space
Then trim.

tawdry leaf
#

@rovsa'

ocean hedge
#

^

tawdry leaf
#

thats what im trying to do

ocean hedge
#

you won't have any special chars tho

tawdry leaf
#

ok, well how can i keep special characters but not keep stuff like line feed, all forms of tab, etc?

ocean hedge
#

but indeed the best way to do it is probably to remove the chars you don't want and then trim

#

matching 2 or more char is doable, leaving 1 sounds annoying

tawdry leaf
#

@ocean hedge what do you mean? I am trying to remove multiple spaces inbetween actual characters.

fathom eagle
#
[a-zA-Z0-9\s] // Matches (so would remove) letters, digits and whitespace
[^a-zA-Z0-9\s] // Matches (so would remove) everything else
#

like that then?

ocean hedge
#

nah forget about the \s

#

my bad

tawdry leaf
#

i want to keep the numbers. which are part of ascii. then remove MULTIPLE whitespaces.

ocean ingot
#

Yep, and the a-zA-Z0-9 can be replaced with \w if you don't mind it also matching _ (and, in .NET, non-ASCII letters and digits)

ocean hedge
#

use the snippet I gave you, it should work fine

#

special chars will be included, numbers too

#

don't forget to trim after the regex remove

#

if you need to expand or reduce the matched chars, see the unicode table I sent you earlier

tawdry leaf
#

so in regex, i can do something like this?

[\x{0041-007E}]

? i want to remove multiple ranges.

ocean hedge
#

^ is important

#

and that's not the right syntax

tawdry leaf
#

.<

ocean hedge
#

[^\x{0041}-\x{007e}]

#

that's the right syntax

ocean ingot
#

Without the curlies

ocean hedge
#

the regex standard disagrees

#

but I guess it depends on the engine

#

I'm not very familiar with C#

tawdry leaf
#
string removeUnicode = @"(\b[\s]+&[^\x{0020}-\x{007e}]+)";
ArgumentException: parsing "(\b[\s]+&[^\x{0020}-\x{007e}]+)" - Insufficient hexadecimal digits.
ocean hedge
#

@ocean ingot did I get it wrong?

ocean ingot
#

Yep

ocean hedge
#

but it's working

ocean ingot
#

You're on the PHP regex flavor

ocean hedge
#

ah

#

..ah

ocean ingot
#

Ah, for unicode code points it's actually \uxxxx (four hex digits, no curlies)

#

There we go

ocean hedge
#

so something like this, correct ?

[^\u0041-\u0071]

ocean ingot
#

Yep

#

LGTM

ocean hedge
#

@tawdry leaf and that's the whole regex

#

you don't need anything else afaik

tawdry leaf
#
string removeUnicode = @"(\b[\s]+&[^\u{0020}-\u{007e}]+)";
ArgumentException: parsing "(\b[\s]+&[^\u{0020}-\u{007e}]+)" - Insufficient hexadecimal digits.
ocean hedge
#
string removeUnicode = @"[^\u0041-\u0071]";
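
For reference, a minimal sketch of which escape forms the .NET regex parser accepts, assuming plain `System.Text.RegularExpressions`. The `EscapeDemo` class and `IsValidPattern` helper are hypothetical names made up for illustration; the curly-brace forms are the ones that threw "Insufficient hexadecimal digits" above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Returns true if the .NET regex parser accepts the pattern,
    // false if constructing the Regex throws an ArgumentException.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

With this helper, `@"[^\x{0020}-\x{007e}]"` and `@"[^\u{0020}-\u{007e}]"` are rejected, while `@"[^\u0020-\u007E]"` (exactly four hex digits) and `@"[^\x20-\x7E]"` (exactly two hex digits) are accepted.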
tawdry leaf
#

that worked but it removed all spaces.

ocean hedge
#

see the unicode table I gave you

tawdry leaf
#

ok

ocean hedge
#

^ inverts the condition after it

#

0041 would be A

#

0071 would be q

#

so that regex is matching anything not between those 2

tawdry leaf
#

i dont understand why it removed all the spaces inbetween the characters.

ocean hedge
#

what's the code point for space? (still the table)

ocean ingot
#

\20 or something

tawdry leaf
#

0020

ocean ingot
#

I don't remember

ocean hedge
#

yes

ocean ingot
#

Ah it is

ocean hedge
#

I just told you, 0041-0071 is what you're keeping

#

0020 < 0041, so you're not keeping spaces right now
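
A small sketch of the difference between the two ranges, assuming C#'s `System.Text.RegularExpressions`. `RangeDemo` and its method names are hypothetical, only there to put the two patterns side by side.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeDemo
{
    // Keeps only U+0041 ('A') through U+0071 ('q'); space (U+0020),
    // digits and most punctuation sit below U+0041, so they are removed.
    public static string KeepAToQ(string input)
    {
        return Regex.Replace(input, @"[^\u0041-\u0071]", "");
    }

    // Keeps U+0020 (space) through U+007E ('~'), i.e. printable ASCII.
    public static string KeepPrintableAscii(string input)
    {
        return Regex.Replace(input, @"[^\u0020-\u007E]", "");
    }
}
```

For the input `"Hi there 42!"`, `KeepAToQ` returns `"Hihee"` (spaces, digits, `!`, `t` and `r` all fall outside the range), while `KeepPrintableAscii` returns the string unchanged.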

tawdry leaf
#

ahhhhh

#

ok

fathom eagle
ocean hedge
#

so it must be like 0020-007e

#

probably

tawdry leaf
#

is it 7e or 7f?

ocean hedge
#

no clue what the 007f is tbh

ocean ingot
#

Look at the table

ocean hedge
#

I would exclude it

tawdry leaf
#

that's the <Delete> control character

#

007f

fathom eagle
ocean hedge
#

yes so you probably don't want that

#

\o/

#

regexes are always a pain, glad regex101 is here to save the day

tawdry leaf
#

so no more need for \b \s+ etc?

lone whale
#

Split whatever other thing you want to do into an entirely separate thing instead of trying to combine it into one giant regex query. It'd make this whole thing way easier and more robust to change
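
One way the split could look, as a rough sketch rather than the final code: one regex strips everything outside printable ASCII, a second collapses runs of spaces, and `Trim` handles the ends. The `AsciiFilter` class and `Clean` method are hypothetical names, and removing tabs outright (rather than turning them into a space) is an assumption based on the requirements stated earlier in the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiFilter
{
    // Step 1: anything outside printable ASCII (U+0020 space .. U+007E '~').
    // This also covers tabs, line feeds and other control characters.
    private static readonly Regex NonPrintableAscii = new Regex(@"[^\u0020-\u007E]+");

    // Step 2: two or more consecutive spaces.
    private static readonly Regex SpaceRuns = new Regex(@" {2,}");

    public static string Clean(string input)
    {
        string asciiOnly = NonPrintableAscii.Replace(input, "");
        string singleSpaced = SpaceRuns.Replace(asciiOnly, " ");
        // Step 3: drop leading and trailing spaces.
        return singleSpaced.Trim();
    }
}
```

Because each step is its own pattern, changing what counts as "keep" only touches the first regex and leaves the space handling alone.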

tawdry leaf
#

yea, that's a good idea. i'll do that.

tawdry leaf
#

ok now THIS is an interesting cheat sheet.

fathom eagle
ocean hedge
#

lol

#

in regex101 x)

fathom eagle
#

ah nice

ocean hedge
#

anyway hf with your string filtering, have a good one Kaori

tawdry leaf
#

thank you all for the help! I really appreciate it!

#

You too @ocean hedge

fathom eagle
#
[a-zA-Z0-9\s] // Matches (so would remove) letters, digits and whitespace
[^a-zA-Z0-9\s] // Matches (so would remove) everything else

This works btw - same thing - a bit more simple approach

ocean ingot
#

Side note that you would also want to add a quantifier after the character class, i.e. ]+, so a whole run of unwanted characters is one match instead of one match per character, which is faster

fathom eagle
#

Not sure I'm familiar with that

tawdry leaf
#

@ocean ingot so it would be :

string removeUnicode = @"[^\u0020-\u007e]+";

?

ocean ingot
#

Yep

tawdry leaf
#

ok

ocean ingot
fathom eagle
#

interesting

ocean ingot
#

As the + is "match what's before one or more times"
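
A quick sketch of what the `+` changes, assuming C#'s `System.Text.RegularExpressions`; `QuantifierDemo` and `CountMatches` are hypothetical names for illustration. The replaced output is the same either way, only the number of matches differs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Counts how many separate matches the pattern produces on the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }
}
```

On `"A\u00e9\u00e9\u00e9B\u00e9\u00e9C"`, the pattern `[^\u0020-\u007E]` gives 5 matches (one per character) and `[^\u0020-\u007E]+` gives 2 (one per run); replacing either with an empty string yields `"ABC"`.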

fathom eagle
tawdry leaf
#

thank you ALL for the help! you all are amazing! regex is a bitch, but you guys slayed it! amazing!

fathom eagle
tawdry leaf
#

well i wasn't sure since i wanted to combine regexes in unity.

fathom eagle
#

I find they are a bit more active replying to these types of questions

tawdry leaf
#

will do

fathom eagle
#

hm, I don't know if Regex in Unity is much different from regular C#
Could be.

#

Have a nice evening. Bye o/

tawdry leaf
#

you as well, @fathom eagle! 🙂

ocean ingot
#

I'll let you archive the thread when needed 👋