Generating all Unicode characters not in ASCII scheme in PHP?

This regular expression is supposed to match all non-ASCII characters, i.e. everything outside code points 0-127:
/[^\x00-\x7F]/i
Imagine I want to test (just out of curiosity) this regular expression with all Unicode characters, code points 0-1114111.
Generating this range may be simple with range(0, 1114111). Then I should convert each decimal number to hexadecimal with the dechex() function.
After that, how can I convert the hexadecimal number to the actual character? And how can I exclude the characters already in the ASCII scheme?

It depends on how you are going to do the matching and whether you are going to put the PCRE regex engine into UTF-8 mode with the /u modifier.
If you do use the /u modifier, then you must use UTF-8 encoding for both the regular expression and the subject, and the regex engine will automatically interpret each legal UTF-8 byte sequence as a single character. In this mode the regular expression [^\x00-\x7F] will match all characters outside the Basic Latin (ASCII) block, including those with code points greater than 255. You will also need to generate the UTF-8 representation of each character (given its code point) yourself.
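A minimal sketch of the UTF-8 mode test described above, assuming PHP 7.2+ with the mbstring extension so that mb_chr() can build each character from its code point. The helper name countMisclassified() is made up for this illustration, and the surrogate range is skipped because those code points cannot be encoded as valid UTF-8.

```php
<?php
// Hypothetical helper: counts code points that the pattern classifies wrongly.
function countMisclassified(string $pattern): int
{
    $errors = 0;
    for ($cp = 0; $cp <= 0x10FFFF; $cp++) {
        // Surrogates (U+D800..U+DFFF) have no valid UTF-8 form, so skip them.
        if ($cp >= 0xD800 && $cp <= 0xDFFF) {
            continue;
        }
        $char = mb_chr($cp, 'UTF-8');       // code point -> UTF-8 string
        $expected = $cp > 0x7F ? 1 : 0;     // only non-ASCII should match
        if (preg_match($pattern, $char) !== $expected) {
            $errors++;
        }
    }
    return $errors;
}

// Single quotes keep \x00 and \x7F intact for the regex engine to interpret.
echo countMisclassified('/[^\x00-\x7F]/u'), "\n";
```

Starting the loop at 0x80 instead of 0 would skip the ASCII characters entirely, which is what the question asks for; the sketch keeps them only to check that they do not match.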
If you do not use the /u modifier then the regex engine will be dumb: it will consider each byte as a separate character, which means that you have to work at byte rather than character level. On the other hand, you will now be able to work with any encoding you prefer. However, you will have to ditch the [^\x00-\x7F] regex (because it's only going to be matching random bytes in the string) and work with a regular expression that embodies the rules of your chosen encoding (example for UTF-8). To generate the encoded forms of random characters you will again need to use custom code that depends on the specific encoding.
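To make the byte-versus-character distinction concrete, here is a small sketch (assuming PHP 7.0+ for the \u{...} string escapes). The same pattern is run with and without /u over a subject that contains one ASCII letter, one two-byte character and one three-byte character.

```php
<?php
// "a", U+00E9 (2 bytes in UTF-8) and U+20AC (3 bytes in UTF-8).
$subject = "a\u{00E9}\u{20AC}";

// With /u the engine works on characters: two of them are non-ASCII.
$characterMatches = preg_match_all('/[^\x00-\x7F]/u', $subject);

// Without /u the engine works on bytes: five of them are >= 0x80.
$byteMatches = preg_match_all('/[^\x00-\x7F]/', $subject);

echo $characterMatches, ' ', $byteMatches, "\n"; // 2 5
```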

I think the hex2bin(string) function will convert a hex string into a binary string. To exclude the ASCII code points, just begin from code point 0x80 (skipping 0x00 to 0x7F).
But it does sort of sound like you're trying to unit test the regex library, which seems unnecessary unless you are developing the regex library, or you need to be extremely paranoid.
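One thing worth checking with the hex2bin() suggestion above: it turns hex digits into raw bytes, which is not the same as the UTF-8 encoding of a code point. A short sketch of the difference, assuming the mbstring extension (PHP 7.2+) for mb_chr():

```php
<?php
$cp = 0xE9; // U+00E9, Latin small letter e with acute

$raw  = hex2bin(dechex($cp));    // the single byte 0xE9
$utf8 = mb_chr($cp, 'UTF-8');    // the two bytes 0xC3 0xA9

echo bin2hex($raw), ' ', bin2hex($utf8), "\n"; // e9 c3a9

// The lone byte is not a valid UTF-8 string, so a /u regex would reject it.
var_dump(mb_check_encoding($raw, 'UTF-8')); // bool(false)
```

Also note that dechex() can return an odd number of digits (for example "100" for 256), which hex2bin() rejects, so the hex string would need padding first.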

Related

PHP - regex to allow unicode characters

I was using the following regex with preg_replace to filter inputs:
/[^A-Za-z0-9[:space:][:blank:]_<>=@#£€$!?:;%,.\\'\\\"()&+\\/-]/
However this does not allow accented characters like umlauts so I changed it to:
/[^\w[:space:][:blank:]_<>=@#$£€!?:;%,.\\'\\\"()&+\\/-]/u
This however does not work with the £ or € characters: nothing is returned. I need to accept these characters, and I have tried escaping them but that doesn't work.
Also I want to create a regex that is similar to just A-Za-z but will allow accented characters. How can I do that?
From http://php.net/manual/en/reference.pcre.pattern.modifiers.php
u (PCRE_UTF8) This modifier turns on additional functionality of PCRE
that is incompatible with Perl. Pattern and subject strings are
treated as UTF-8. An invalid subject will cause the preg_* function to
match nothing; an invalid pattern will trigger an error of level
E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid
since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been
regarded as valid UTF-8.
That means that first you have to make sure the input string is proper UTF-8 text.
Secondly, have you heard of Unicode categories? If not, head to http://www.regular-expressions.info/unicode.html and search for Unicode categories. For example, you could use \p{Sc} to match all currency symbols (\p{S} matches symbols of any kind), or \p{L} for all letters. Your regex could (probably) be written as follows: /[^\p{L}\p{P}\p{N}\p{S}\p{M}]/u.
This will, though, match pretty much nothing, as it allows pretty much all characters to be used: ^ at the start of a regex character class (something between [ and ]) means "everything that is not in this class will be matched".
On top of that, your regex will only match one character at a time. If you want to match whole runs of characters, add a + after your closing ] to keep matching until the pattern fails.
So, for that sake, what exactly are you trying to achieve? Maybe we can suggest you some more regex improvements if we know what you're trying to do.
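As a rough sketch of how the category approach could be applied to the filter from the question (the exact list of allowed symbols here is an assumption, not the asker's final requirement): letters, combining marks, numbers and whitespace come from Unicode properties, while £ and € are written as \x{00A3} and \x{20AC} so the pattern does not depend on the encoding of the source file.

```php
<?php
// Remove every run of characters that is NOT in the allowed set.
$pattern = '/[^\p{L}\p{M}\p{N}\s_<>=@#$\x{00A3}\x{20AC}!?:;%,.\'"()&+\/-]+/u';

// "Müller paid £5 & €7 ☃!" written with escapes; the snowman (U+2603) is not allowed.
$dirty = "M\u{00FC}ller paid \u{00A3}5 & \u{20AC}7 \u{2603}!";
$clean = preg_replace($pattern, '', $dirty);

echo $clean, "\n";
```

The accented letter, the pound sign and the euro sign survive, and only the snowman is stripped.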

Regex - Match only unicode alphabet not numbers

I'm using PHP, and trying to write a regular expression that matches any alphabet in any language but not numbers.
I've tried /\p{L}+/ but it matches Unicode alphabets and numbers too. I'm checking against Arabic and English. English numbers don't pass, which is normal, but Arabic numbers pass, which is not normal.
Is there another regular expression that matches only alphabets in any language ?
The regex engine needs to know that the target string is a Unicode string (to avoid interpretation errors). To do that you can use the u modifier, which has two functions:
it extends the classical shorthand character classes like \w and \d to Unicode characters (and not only ASCII characters)
it forces the string to be seen as a Unicode (UTF-8) string
So you can use: /\pL+/u
Note that in your particular case, the first behavior is not needed, and you can switch on only the second behavior with: /(*UTF8)\pL+/ ((*UTF8) must be placed at the very beginning of the pattern)
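A quick sketch of the /u version against Arabic input (assuming PHP 7.0+ for the \u{...} escapes). The ^ and $ anchors are an addition for this illustration, so that the whole string has to consist of letters.

```php
<?php
$letters = "\u{0633}\u{0644}\u{0627}\u{0645}"; // four Arabic letters
$digits  = "\u{0661}\u{0662}\u{0663}";          // Arabic-Indic digits 1, 2, 3

$lettersOk = preg_match('/^\pL+$/u', $letters); // 1: all letters
$digitsOk  = preg_match('/^\pL+$/u', $digits);  // 0: digits are \pN, not \pL
$mixedOk   = preg_match('/^\pL+$/u', $letters . $digits); // 0

echo $lettersOk, ' ', $digitsOk, ' ', $mixedOk, "\n"; // 1 0 0
```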

PHP Regex delimiter

For a long time, any time I've needed to use a regular expression, I've standardized on using the copyright symbol © as the delimiter because it was a symbol that wasn't on the keyboard that I was sure to not use in a regular expression, unlike ! @ # \ or / (which are sometimes all in use within a regex).
Code:
$result=preg_match('©<.*?>©', '<something string>');
However, today I needed to use a regular expression with accented characters which included this:
Code:
[a-zA-ZàáâäãåąćęèéêëìíîïłńòóôöõøùúûüÿýżźñçčšžÀÁÂÄÃÅĄĆĘÈÉÊËÌÍÎÏŁŃÒÓÔÖÕØÙÚÛÜŸÝŻŹÑßÇŒÆČŠŽ∂ð \,\.\'-]+
After including this new regex in the PHP file in my IDE (Eclipse PDT), I was prompted to save the PHP file as UTF-8 instead of the default cp1252.
After saving and running the PHP file, every time I used a regex in a preg_match() or preg_replace() function call, it generated a generic PHP warning (Warning: preg_match in file.php on line x) and the regex was not processed.
So--two questions:
1) Is there another symbol that would be good to use as a delimiter that isn't typically found on a keyboard (`~!@#$%^&*()+=[]{};\':",./<>?|\) that I can standardize on and not worry about having to check each and every regex to see if that symbol is actually used somewhere in the expression?
2) Or, is there a way I can use the copyright symbol as the standard delimiter when the file format is UTF-8?
One thing that needs correcting is that if your regular expression and/or input data is encoded in UTF-8 (which in this case it is, since it comes straight from inside a UTF-8 encoded file) you must use the u modifier for your regular expression.
Another issue is that the copyright character should not be used as a delimiter in UTF-8 because the PCRE functions consider that the first byte of your pattern encodes your delimiter (this could plausibly be called a bug in PHP).
When you attempt to use the copyright sign as a delimiter in UTF-8, what actually gets saved into the file is the byte sequence 0xC2 0xA9. preg_match looks at the first byte 0xC2 and decides that it is an alphanumeric character because in your current locale that byte corresponds to the character Latin capital letter A with circumflex, Â (see extended ASCII table). Therefore a warning is generated and processing is immediately aborted.
Given these facts, the ideal solution would be to choose an unusual delimiter from inside the ASCII character set because that character would encode to the same byte sequence both in single byte encodings and in UTF-8.
I would not consider printable ASCII characters unusual enough for this purpose, so a good choice would be one of the control characters (ASCII codes 1 to 31). For example, STX (\x02) would fit the bill.
Together with the u regex modifier this means you should write the regex like this:
$result = preg_match("\x02<.*?>\x02u", '<something string>');
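The two points above can be checked with a short sketch (assuming PHP 7.0+ for the \u{...} escapes). The name-matching pattern is a simplified stand-in for the long accented character class in the question, using \p{L} instead of listing each letter.

```php
<?php
// The copyright sign is two bytes in UTF-8, so only 0xC2 would be seen as the delimiter.
$copyrightBytes = bin2hex("\u{00A9}");
echo $copyrightBytes, "\n"; // c2a9

// STX is a single byte in ASCII and in UTF-8, so it is safe as a delimiter.
$delim = "\x02";
$pattern = $delim . '^[\p{L} ,.\'-]+$' . $delim . 'u';

// "Zoë O'Neil-García" written with escapes.
$nameOk = preg_match($pattern, "Zo\u{00EB} O'Neil-Garc\u{00ED}a");
echo $nameOk, "\n"; // 1
```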

Difference between these two regular expressions?

What's the difference between these two regular expressions? (using PHP preg_match())
/^[0-9\x{06F0}-\x{06F9}]{1,}$/u
/^[0-9\x{06F0}-\x{06F9}\x]{1,}$/u
What's the meaning of the last \x in the second pattern?
It's interpreted as \x00 (the null character) but it's almost certainly a bug caused by sloppy editing or copy and paste.
http://www.regular-expressions.info/unicode.html
...Since \x by itself is not a valid regex token...
I think the second pattern is not valid.
According to this page http://www.regular-expressions.info/unicode.html, the \x is only useful followed by the unicode number:
Since \x by itself is not a valid regex token, \x{1234} can never be
confused to match \x 1234 times.
This is weird. The PHP notation for a Unicode character is \x{}. In Perl, it is the same thing.
But PHP has the /u modifier for regexes. I assume that means Unicode. There is no such modifier in Perl.
In a Perl regex, \x## is parsed, where ## is required to denote an ASCII character. If it's \x or \x#, there is a warning about an illegal hex digit being ignored (because it requires two digits, no more, no less) and only the valid hex digits in the sequence are used. If you have no digits at all, as in \x, it uses the \0 ASCII character.
However, any \x{} notation is OK, and \x{0} is equivalent to \x{}. And \x{0}-\x{ff} is considered ASCII, while \x{100} and up is considered Unicode.
So \x is a valid hex/Unicode escape sequence, but by itself it's assumed to be hex and is incomplete, and it's probably not something that should be left to the parser's default mechanisms.
As far as I can tell, the second \x is actually an invalid character. Do both expressions work?
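For reference, here is a small sketch exercising only the first pattern, the one without the stray \x (assuming PHP 7.0+ for the \u{...} escapes). The range \x{06F0}-\x{06F9} covers the Extended Arabic-Indic (Persian) digits, so the pattern accepts strings made only of ASCII or Persian digits.

```php
<?php
$pattern = '/^[0-9\x{06F0}-\x{06F9}]{1,}$/u';

$mixedDigits = preg_match($pattern, "12\u{06F3}\u{06F4}"); // ASCII + Persian digits
$withLetter  = preg_match($pattern, "12a");                 // contains a letter
$emptyString = preg_match($pattern, "");                    // {1,} needs at least one

echo $mixedDigits, ' ', $withLetter, ' ', $emptyString, "\n"; // 1 0 0
```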

Remove garbage characters in utf

I am using utf8 format to store all my data in MySQL. Before data is inserted into the database I need to clean unwanted characters out of the strings. The strings are in utf8 format. I know how to use regex and string replace but do not know how to work with Arabic characters.
Sample string that needs to be cleaned : "████ .. الــقــوانين الجديـــدة في قســـم الـعنايـ";
Thanking you
OK. As @Jonathan Leffler already said, if you can specify the Unicode character ranges for the characters that need to be replaced, you can use a regular expression to replace the characters with an empty string.
A Unicode character is specified as \x{FFFF} in an expression (in PHP). In addition, you have to set the u modifier to make PHP treat the pattern as UTF-8.
So in the end, you have something like this:
preg_replace('/[\x{FFFF}-\x{FFFF}]+/u','',$string);
where
/.../u are the delimiters plus the modifier
[...]+ is a character class plus quantifier, which means match any of the characters inside one or more times
\x{FFFF}-\x{FFFF} is a Unicode character range (obviously you have to provide the right code points of the characters).
You can also negate the class with a ^, so that you specify the range you want to keep:
preg_replace('/[^\x{FFFF}-\x{FFFF}]+/u','',$string);
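A hedged sketch of the "keep" variant with concrete ranges filled in. The choice of ranges is an assumption for illustration only: it keeps the Arabic block (U+0600 to U+06FF), whitespace and full stops, and then optionally strips the tatweel (U+0640) used to stretch words. The real data may need other ranges.

```php
<?php
// Two full-block characters (U+2588), " .. ", then Arabic letters with a tatweel.
$dirty = "\u{2588}\u{2588} .. \u{0627}\u{0644}\u{0640}\u{0642}";

// Keep only Arabic-block characters, whitespace and full stops.
$kept = preg_replace('/[^\x{0600}-\x{06FF}\s.]+/u', '', $dirty);

// Optional second pass: remove tatweel characters.
$noTatweel = preg_replace('/\x{0640}+/u', '', $kept);

echo bin2hex($kept), "\n", bin2hex($noTatweel), "\n";
```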
More information:
Regular expressions
Regular expressions in PHP
Unicode Charts
