Convert "fancy" Unicode ABC to standard ABC - PHP

I run regex checks on certain inputs on my site, but the regex wrongly returns false when users use "fancy" Unicode character sets such as:
Ⓜⓐⓣⓒⓗ
🅜🅐🅣🅒🅗
Match
𝐌𝐚𝐭𝐜𝐡
𝕸𝖆𝖙𝖈𝖍
𝑴𝒂𝒕𝒄𝒉
𝓜𝓪𝓽𝓬𝓱
𝕄𝕒𝕥𝕔𝕙
𝙼𝚊𝚝𝚌𝚑
𝖬𝖺𝗍𝖼𝗁
𝗠𝗮𝘁𝗰𝗵
𝙈𝙖𝙩𝙘𝙝
𝘔𝘢𝘵𝘤𝘩
⒨⒜⒯⒞⒣
🇲🇦🇹🇨🇭
🄼🄰🅃🄲🄷
🅼🅰🆃🅲🅷
These are not different fonts - they are different characters! None of them are matched by /Match/.
How can I convert the user input to standard ABC characters before running through my Regex checks? (I'm using PHP, if that makes a difference)

The NFKD Unicode normalisation should take care of most of those. However, it is only available when the intl extension is enabled, and I don't have it in my environment, so I can't test it. If you also don't have such a PHP, and don't want to install it, this does something a bit similar, at least for some of the characters:
iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text)
Finally, you can make your own mapping, for example using strtr (which you will then know works, since you wrote it yourself).
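The two approaches can be combined into a small helper. A minimal sketch - the function name and the strtr table are my own, and the table only covers the regional-indicator example; extend it as needed:

```php
<?php
// Sketch: fold "fancy" Unicode letters down to plain ASCII.
function toPlainAscii(string $text): string
{
    // NFKD decomposes compatibility characters (Ⓜ, 𝐌, ｍ, …) into their
    // base letters plus combining marks; requires the intl extension.
    if (class_exists('Normalizer')) {
        $text = Normalizer::normalize($text, Normalizer::FORM_KD);
        $text = preg_replace('/\p{Mn}+/u', '', $text); // drop combining marks
    }
    // Hand-made fallback table for characters NFKD does not decompose
    // (e.g. regional indicators); extend as needed.
    static $map = ['🇲' => 'M', '🇦' => 'A', '🇹' => 'T', '🇨' => 'C', '🇭' => 'H'];
    return strtr($text, $map);
}
```

strtr() accepts multi-byte strings as array keys, so it works here even though it operates on bytes.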

Related

PHP remove terminal codes from string

While processing the input/output of a process created with proc_open, I've been hit with special terminal ANSI codes (\033[0J, \033[13G). Aside from not finding a reference to what these particular codes do, they are really messing with my preg_match calls.
Does PHP have a built-in method for cleansing these kinds of strings? Or what would be the correct expression to use with preg_replace? Please note that I am dealing with non-ASCII characters, so "stripping everything except..." will not work.
Usually ANSI codes are introduced by an ESC (\033, aka \x1b), then an open square bracket, then numbers (possibly semicolon-separated, as in \033[32;40m), and terminated by a letter.
You can use something like #\x1b[[][0-9]+(;[0-9]*)*[A-Za-z]# to preg_replace them all to oblivion.
This works (just tested), even if definitely overkill:
$test = preg_replace('#\\x1b[[][^A-Za-z]*[A-Za-z]#', '', $test);
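Wrapped up as a reusable helper (the function name is my own, and the pattern is the simpler byte-class variant):

```php
<?php
// Sketch: remove ANSI CSI sequences (ESC, '[', parameter bytes, final
// letter), e.g. \x1b[0J, \x1b[13G, \x1b[32;40m, before pattern matching.
function stripAnsi(string $text): string
{
    return preg_replace('#\x1b\[[0-9;]*[A-Za-z]#', '', $text);
}
```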

Having en-dash at the end of the string doesn't allow json_encode

I am trying to extract n characters from a string using
substr($originalText,0,250);
The nth character is an en-dash, so I get the last character as â€ when I view it in Notepad. In my editor, Brackets, I can't even open the log file, since it only supports UTF-8 encoding.
I also cannot run json_encode on this string.
However, when I use substr($originalText,0,251), it works just fine. I can open the log file and it shows an en-dash instead of â€. json_encode also works fine.
I can use mb_convert_encoding($mystring, "UTF-8", "Windows-1252") to circumvent the problem, but could anyone tell me why having these characters at the end specifically causes an error?
Moreover, on doing this, my log file shows â€ in Brackets, which is confusing too.
My question is: why is having the en-dash at the end of the string different from having it anywhere else (followed by other characters)?
Hopefully my question is clear, if not I can try to explain further.
Thanks.
Pid's answer gives an explanation for why this is happening, this answer just looks at what you can do about it...
Use mb_substr()
The multibyte string module was designed for exactly this situation, and provides a number of string functions that handle multibyte characters correctly. I suggest having a look through there as there are likely other ones that you will need in other places of your application.
You may need to install or enable this module if you get a function not found error. Instructions for this are platform dependent and out-of-scope for this question.
The function you want for the case in your question is mb_substr(). It is called in the same way as substr(), but takes an additional optional encoding argument.
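A minimal illustration of the difference, assuming ext-mbstring is available (the sample string is my own):

```php
<?php
// substr() counts bytes; the en-dash (U+2013) is 3 bytes in UTF-8,
// so a byte-based cut can end mid-character and produce invalid UTF-8.
$text  = 'abc– def';                       // 'abc' + en-dash + ' def'
$bytes = substr($text, 0, 4);              // 'abc' + first byte of the dash
$chars = mb_substr($text, 0, 4, 'UTF-8');  // 'abc–' — whole characters
var_dump(json_encode($bytes));             // false: malformed UTF-8
var_dump(json_encode($chars));             // '"abc\u2013"'
```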
UTF-8 uses multi-byte sequences to encode the many characters beyond ASCII.
A single UTF-8 character may be coded in one, two, three or four bytes, depending on the character.
You cut the string right in the middle of a multi-byte character:

[<-character->]
[byte-0|byte-1]
       ^
       You cut the string right here, in the middle!

[<-----character---->]
[byte-0|byte-1|byte-2]
       ^      ^
       Or anywhere here, if it's 3 bytes long.
So the decoder has the first byte(s) but can't read the entire character because the string ends prematurely.
This causes all the effects you are witnessing.
The solution to this problem is in Dezza's answer: use mb_substr() instead of substr().

PHP preg_ /u utf-8 switch - Not understanding what it does in practice

I am converting a PHP/MariaDB web application from Latin-1 to UTF-8. I have it working, but I am not using the /u modifier on any of my preg_ statements and it seems to be working fine. I have tried samples of Russian, Traditional and Simplified Chinese, Japanese, Arabic, and Hindi. Part of the application is a wiki which uses preg statements extensively, and it works fine also.
So what is the preg /u modifier supposed to do? ...since it seems to work fine without it?
I have been looking up information on this for 2 weeks and I can't find anything that explains the /u switch in a way that differentiates its use from 'not' using it.
I have determined that I do have the UTF-8 PCRE features in the PCRE that my PHP is using. I'm using PHP v5.6.20, MariaDB 5.5.32. I've got my web pages, MySQL driver and MariaDB all using UTF-8.
The u modifier is used by PCRE when deciding how to handle certain matching cases. For example, with the dot metacharacter, multiple bytes are permitted, assuming they form a valid UTF-8 sequence:
preg_match('/^.$/', '老'); // 0
preg_match('/^.$/u', '老'); // 1
Another example, when considering what is covered by a character class:
preg_match('/^[[:print:]]$/', '老'); // 0
preg_match('/^[[:print:]]$/u', '老'); // 1
When including UTF-8 (or indeed a string encoded in any other encoding) directly in the regex, the u modifier effectively makes no difference, as PCRE is ultimately going to match byte-by-byte.
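The byte-versus-character distinction is easy to see with a quantifier (sample character mine):

```php
<?php
// Without /u the dot matches single bytes: the 3-byte UTF-8 character
// 老 looks like three "characters" to the engine. With /u it is one.
var_dump(preg_match('/^.{3}$/',  '老')); // 1 — three bytes
var_dump(preg_match('/^.{3}$/u', '老')); // 0 — only one character
var_dump(preg_match('/^.$/u',    '老')); // 1
```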

How do I remove Unicode Characters from String?

I need a regex to remove emoji and symbols (basically any Unicode character) except Japanese, Korean, Chinese, Vietnamese, and any other languages that use Unicode characters. The regex is going to be used on a PHP and Python server. I noticed that I'm having problems with iPhone users who use the Emoji keyboard to create some weird names.
So far I've tried a few regex but I couldn't find any proper one.
Below is my own text string which I use for testing. Please note that I have no idea what the non-English text means. If it's a bad word, please change it.
abcdefghij
klmnopqrst
uvwxyz
1234567890
한국 韓國
‎Công Ty Cổ Phần Hùng Đức
南极星
おはようございます
============== Below characters should be detected by regex ========
™£¢£¢§¢∞§¶•§ª§¶
[]{}"';?><_+=-
()*&^%$##!~`,.
😊🐻🏢🐭4️⃣⌘
❤❣☁♫🗽🐯
All symbols match \p{S} regex. You just need to be sure your PHP is in UTF-8 mode (whatever that means, I don't do PHP) – see http://php.net//manual/pl/regexp.reference.unicode.php – and for Python, you need an alternative regex library: https://pypi.python.org/pypi/regex
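Rather than enumerating what to remove, it can be simpler to whitelist what to keep. A sketch - the exact category list is an assumption; adjust it to your needs:

```php
<?php
// Keep letters (\p{L}), combining marks (\p{M}), numbers (\p{N}) and
// spaces (\p{Zs}); everything else — symbols, emoji, punctuation — goes.
$input = 'abc 한국 😊™';
$clean = preg_replace('/[^\p{L}\p{M}\p{N}\p{Zs}]/u', '', $input);
```

This keeps Korean, Chinese, Vietnamese (including its combining diacritics) and so on, while dropping the emoji and symbol lines from the test string above.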
You may find that regular expressions aren't the hammer for all nails. In this case you simply want to exclude characters, so it probably isn't.
In Python 3 the string translate() method would be useful: if you map the characters you want excluded to None, they will indeed be excluded from the result.
In Python 2 this method only applied to byte strings and took a 256-character string as its mapping table; in Python 3 it accepts a dict of code points (str.maketrans() builds one for you), so it works on arbitrary Unicode. Building such a mapping by hand is possible, but it's not going to be as efficient as a regex for whole categories of characters.
PS: There are no "bad words" in your text.

UTF-8 & IsAlpha() in PHP

I'm working on an application which supports several languages and has functionality in place which tries to use the language requested by the browser, and also allows manual override of this function. This part works fine and picks the correct templates, labels, etc.
Users sometimes have to enter text of their own, and that's where I run into issues, because the application has to accept even "complicated" languages like Chinese and Russian. So far I've taken care of the things mentioned in other postings, i.e.:
calling mb_internal_encoding( 'UTF-8' )
setting the right encoding when rendering the webpages with meta http-equiv=Content-Type content=text/html;charset=UTF-8 (format adapted due to stackoverflow limitations)
even the content arrives correctly, because mb_detect_encoding() == UTF-8
tried to set setLocale(LC_CTYPE, "UTF-8"), which doesn't seem to work because it requires the selection of one language, which I can't specify because I have to support several. And it still fails if I force it manually for testing purposes, e.g. with setLocale(LC_CTYPE, "zh_CN.utf8") - ctype_alpha() would still fail for Chinese text
It seems that even explicit language selection doesn't make ctype_alpha() useful.
Hence the question is: how should I check for alphabetic characters in all languages?
The only idea I had at the moment is to check manually with arrays of "valid" characters - but this seems ugly especially for Chinese.
How would you solve this issue?
If you'd like to check only for valid unicode letters regardless of the used language I'd propose to use a regular expression (if your pcre-regex extension is built with unicode support):
// adjust pattern to your needs
// $input needs to be UTF-8 encoded
if (preg_match('/^\p{L}+$/u', $input)) {
    // OK
} else {
    // not OK
}
\p{L} checks for Unicode characters with the L(etter) property, which includes the properties Ll (lower case letter), Lm (modifier letter), Lo (other letter), Lt (title case letter) and Lu (upper case letter) (from: Regular Expression Details).
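For comparison, ctype_alpha() operates byte-by-byte against [A-Za-z], so multi-byte UTF-8 text always fails it, regardless of locale:

```php
<?php
// ctype_alpha() checks each byte against [A-Za-z]; multi-byte UTF-8
// text therefore fails. \p{L} with /u matches letters in any script.
var_dump(ctype_alpha('Привет'));                // false
var_dump(preg_match('/^\p{L}+$/u', 'Привет'));  // 1
var_dump(preg_match('/^\p{L}+$/u', '老虎'));    // 1
```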
I wouldn't use an array of characters. That would get impossible to manage.
What I'd suggest is working out a 'default' language from the IP address and using that as the locale for a request. You could also get it from the browser's user-agent string in some cases. You should provide the user a way to override it, so that if your default isn't correct they aren't stuck with a strange site (e.g. show on the form: 'Language set to English. If this isn't correct, please change it.'). This isn't the nicest thing to provide, but you won't get any working validation otherwise, as you NEED a language/locale set in order to have sensible alpha validation (an 'A' isn't a letter in Chinese).
You can use the languages from
$_SERVER['HTTP_ACCEPT_LANGUAGE']
It contains something like
de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
so you need to parse this string. Then you can use the preferred language in the setLocale function.
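A minimal sketch of parsing that header (the function name is my own; real-world parsing should also handle whitespace and malformed entries more defensively):

```php
<?php
// Pick the language with the highest q-value from an Accept-Language
// header; entries without an explicit q-value default to 1.0.
function preferredLanguage(string $header): ?string
{
    $weights = [];
    foreach (explode(',', $header) as $entry) {
        $parts = explode(';q=', trim($entry), 2);
        $weights[$parts[0]] = isset($parts[1]) ? (float) $parts[1] : 1.0;
    }
    arsort($weights);                  // highest q-value first
    return array_key_first($weights);  // PHP 7.3+
}
```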
This is rather an encoding issue than a language detection issue. Because UTF-8 can encode any Unicode character.
The best approach is to use UTF-8 throughout your project: in your database, in your output and as expected encoding for the input.
Output: Make sure you encode your data with UTF-8 and declare that in the Content-Type HTTP header field, not just in the document itself.
Input: If you're using forms, declare the expected encoding in the accept-charset attribute.
