Do multilingual numeric characters count as letters? - php

I'm trying to search for just letters and spaces (simple words) in other languages, and throw a detection exception if I find numbers or punctuation. When testing the regex I've written against UTF-8 numeric characters I found on Wikipedia, the numeric characters are never flagged, and I'm baffled as to why, unless all of these numbers are being treated as letters.
Here's the characters I've tried:
5 or 伍
http://en.wikipedia.org/wiki/Chinese_numerals
5 or Є
http://en.wikipedia.org/wiki/Cyrillic_script
Here's the code:
$were_bad_characters_found = preg_match('/[^\p{L}\p{Zs}]+/us', $data);
The answer to the question it asks is always, no, there were no bad characters found.
It seemed, based on the docs, that this would work, and it does in fact work when I run simple English numbers through it, but as soon as multilingual characters hit, it just rolls over on me. I have a number of variations on this for detecting different common scenarios, and all the UTF-8 regex code only seems to work well for English characters. Thoughts?

The characters you showed are letters.
U+4F0D 伍 is not a digit, and it has non-numeric interpretations as well.
U+0404 Є is not a digit either, and it is not even close to having any kind of numeric interpretation.
The Unicode properties of the English digits 0-9 make them digits, not letters. In PHP you can use \p{Nd} to match digits. Your regex is working fine.
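A minimal sketch of how to check this yourself, assuming the input is valid UTF-8. The helper name classify_char() is made up for illustration; \p{L}, \p{Nd} and the u modifier are standard PCRE features in PHP.

```php
<?php
// classify_char() is a hypothetical helper, not part of PHP.
// It reports the Unicode general category of a single character,
// reduced to "decimal digit", "letter" or "other".
function classify_char(string $char): string
{
    if (preg_match('/^\p{Nd}$/u', $char) === 1) {
        return 'decimal digit';
    }
    if (preg_match('/^\p{L}$/u', $char) === 1) {
        return 'letter';
    }
    return 'other';
}

foreach (['5', '伍', 'Є', 'a', '!'] as $char) {
    echo $char, ' => ', classify_char($char), PHP_EOL;
}

// The pattern from the question: a result of 1 means that a character
// which is neither a letter nor a space separator was found.
$pattern = '/[^\p{L}\p{Zs}]+/us';
var_dump(preg_match($pattern, '伍 Є'));  // both characters are letters
var_dump(preg_match($pattern, 'abc 5')); // "5" is a decimal digit
```

If the goal is to also reject characters such as 伍 that carry a numeric meaning but are classified as letters, a general-category test alone will not do it; that would need an explicit list of such characters.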

Related

How to match any full unicode character, with modifiers etc, in regex?

I want to match any full Unicode character. I'm probably using the wrong terms, but I don't necessarily mean letters; I want any displayed character with any modifiers included. Edit: I'm keeping my original wording, but upon review of this answer, perhaps grapheme is actually what I'm looking for.
Using the trivial regex ., with the Unicode u modifier, /./u does not fully suffice. A few examples:
❤️ will instead match ❤ without the variation selector U+FE0F.
👧🏻 will only match 👧 without the pale skin tone U+1F3FB.
à (U+0061 (a) followed by U+0300 (grave accent)) will only match the a.
Following this answer, I was able to expand the pattern to this: /.[\x{1f3fb}-\x{1f3ff}\p{M}]?/u. This matches all of my test characters above, as well as the three Han unification characters I pulled from this web page.
Edit: I just realized this still doesn't fully match, because (at least in PHP) it fails to fully match 🙍🏽‍♂ (might not display properly on all devices), because it doesn't capture the male character U+2642.
At this point, it seems like a guessing game to me. I have a feeling there are a lot of edge cases my current regex will not cover, but I don't know enough about foreign alphabets nor am I ready to just start guessing and enumerating random emojis and symbols from the character map to fully test this.
Is there a simpler solution to actually match any character including its modifiers/combining marks/etc?
Edit: Per Rob's comment below, I'm using PHP 7.4 for the regex.
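One possibility worth trying (a suggestion, not something from this thread): PCRE has the \X escape, which matches one extended grapheme cluster, that is, a base character together with the combining marks attached to it. The sketch below assumes valid UTF-8 input, and split_graphemes() is a made-up helper name. How well multi-part emoji sequences are grouped depends on the PCRE version your PHP build ships with, so test that part on your own installation.

```php
<?php
// split_graphemes() is a hypothetical helper name.
// It splits a UTF-8 string into grapheme clusters using PCRE's \X.
function split_graphemes(string $text): array
{
    preg_match_all('/\X/u', $text, $matches);
    return $matches[0];
}

// "a" followed by U+0300 (combining grave accent), then a plain "b".
$clusters = split_graphemes("a\u{0300}b");

// The combining accent stays attached to the "a" instead of being split off.
var_dump(count($clusters));
var_dump($clusters[0] === "a\u{0300}");
```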

PHP - Sanitize inputs for hashtags, allowing arabic, hebrew, japanese, etc., and emoji?

I'm looking at sanitizing inputs for a hashtag search engine.
Effectively I want to allow all alphanumeric characters, cyrillic, arabic, hebrew, etc., as well as emoji characters, but strip any symbols other than underscore.
After spending an hour or so looking online I haven't yet found a conclusive answer. Is there a regex that would enable me to sanitize such an input? Basically remove anything that isn't alphanumeric / letters / emojis.
Thanks!
Mark
I would basically enable Unicode and match
/emoji-regex(*SKIP)(?!)|[^\p{L}\p{Nd}_]+/u
and replace with nothing.
There is a negative class [^ ] (meaning not these) of:
\p{L} All letters
\p{Nd} Number digits
_ Underscore
The emoji-regex was deleted because of its size.
Edit this answer and grab it if needed.
This regex will SKIP the emoji by moving the search position past
them until it finds a block of 1 or more Non-Letters/Digits/Underscore characters.
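Here is a rough sketch of the skip technique. The real emoji regex is not reproduced here, so $emoji below is a deliberately simplified stand-in (an assumption for illustration only) that covers just the range U+1F300 to U+1FAFF, and sanitize_hashtag() is a made-up helper name. Note that this sketch leaves out the + quantifier on purpose: with +, a run that starts at a symbol would keep going and also consume an emoji that follows it, because emoji are not letters, digits or underscores either.

```php
<?php
// ASSUMPTION: a simplified placeholder for the omitted emoji regex.
// It only covers code points U+1F300..U+1FAFF, not every emoji.
$emoji = '[\x{1F300}-\x{1FAFF}]';

// sanitize_hashtag() is a hypothetical helper name.
function sanitize_hashtag(string $input, string $emoji): string
{
    // Emoji are matched first and then skipped via (*SKIP)(?!).
    // Any other character that is not a letter, decimal digit or
    // underscore is removed, one character at a time.
    $pattern = '/' . $emoji . '(*SKIP)(?!)|[^\p{L}\p{Nd}_]/u';
    return preg_replace($pattern, '', $input) ?? '';
}

echo sanitize_hashtag("héllo_wörld! 😀#123", $emoji), PHP_EOL;
```

With a fuller emoji pattern you would also want to decide what happens to variation selectors and joiners that follow an emoji, since the placeholder range does not include them.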

PHP, uppercase a single character

Essentially, what I want to know is, will this:
echo chr(ord('a')-32);
Work for every a-z letter, on every possible PHP installation, every single time?
Read here for a bit of background information
After searching for a while, I realised that most of the questions for changing string case in PHP only apply to entire words or sentences.
I have a case where I only need to upper 1 single character.
I thought using functions like strtoupper, ucfirst and ucwords was overkill for single characters, seeing as they are designed to work with strings.
So after looking around php.net I found the functions chr and ord, which convert chars to their ASCII representation (and back).
After a little playing, I discovered I can convert a lower to an upper by doing
echo chr(ord('a')-32);
This simply offsets the character by 32 places in the ASCII table, which happens to be the character's uppercase version.
The reason I'm posting this on Stack Overflow is that I want to know if there are any edge cases that could break this simple conversion.
Would changing the character set of the PHP script, or something like that, affect the outcome?
Is this $upper = chr(ord($lower)-$offset) the standard way to upper a char in PHP? or is there another?
The ASCII code doesn't change between PHP installations, because it is based on the ASCII table.
Quote from www.asciitable.com:
ASCII stands for American Standard Code for Information Interchange. Computers can only understand numbers, so an ASCII code is the numerical representation of a character such as 'a' or '#' or an action of some sort. ASCII was developed a long time ago and now the non-printing characters are rarely used for their original purpose.
Quote from PHP documentation on chr():
Returns a one-character string containing the character specified by ascii.
In any case, I'd say it's more overkill to do it your way than do it with strtoupper().
strtoupper() is also faster.
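To make the edge cases concrete, here is a small sketch. upper_ascii_char() is a made-up helper that applies the offset only to a single ASCII lowercase letter; the loop checks that the offset trick and strtoupper() agree for a-z, and the last lines show what the unguarded offset does to other input.

```php
<?php
// upper_ascii_char() is a hypothetical helper: it applies the -32 offset
// only when the input is exactly one ASCII lowercase letter.
function upper_ascii_char(string $char): string
{
    if (strlen($char) === 1) {
        $code = ord($char);
        if ($code >= ord('a') && $code <= ord('z')) {
            return chr($code - 32);
        }
    }
    return $char; // leave everything else untouched
}

// The offset trick agrees with strtoupper() for every letter a-z ...
foreach (range('a', 'z') as $letter) {
    if (chr(ord($letter) - 32) !== strtoupper($letter)) {
        echo "mismatch for $letter", PHP_EOL;
    }
}

// ... but without a guard it maps other input to unrelated characters.
var_dump(chr(ord('A') - 32));    // 'A' is 65, so this is chr(33)
var_dump(upper_ascii_char('A')); // the guarded helper leaves 'A' alone
```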

Help with password complexity regex

I'm using the following regex to validate password complexity:
/^.*(?=.{6,12})(?=.*[0-9]{2})(?=.*[A-Z]{2})(?=.*[a-z]{2}).*$/
In a nutshell: 2 lowercase, 2 uppercase, 2 numbers, min length is 6 and max length is 12.
It works perfectly, except for the maximum length, when I'm using a minimum length as well.
For example:
/^.*(?=.{6,})(?=.*[0-9]{2})(?=.*[A-Z]{2})(?=.*[a-z]{2}).*$/
This correctly requires a minimum length of 6!
And this:
/^.*(?=.{,12})(?=.*[0-9]{2})(?=.*[A-Z]{2})(?=.*[a-z]{2}).*$/
Correctly requires a maximum length of 12.
However, when I pair them together as in the first example, it just doesn't work!!
What gives? Thanks!
You want:
/^(?=.{6,12}$)...
What you're doing is saying: find me any sequence of characters that is followed by:
6-12 characters
another sequence of characters that is followed by 2 digits
another sequence of characters that is followed by 2 uppercase letters
another sequence of characters that is followed by 2 lowercase letters
And all that is followed by yet another sequence of characters. That's why the maximum length isn't working: 30 characters followed by 00AAaa and another 30 characters will pass.
Also what you're doing is forcing two numbers together. To be less stringent than that but requiring at least two numbers anywhere in the string:
/^(?=.{6,12}$)(?=(.*?\d){2})(?=(.*?[A-Z]){2})(?=(.*?[a-z]){2})/
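To see the difference concretely, here is a small check of both patterns. The test strings are made up for illustration.

```php
<?php
// The pattern from the question and the corrected pattern from this answer.
$original = '/^.*(?=.{6,12})(?=.*[0-9]{2})(?=.*[A-Z]{2})(?=.*[a-z]{2}).*$/';
$fixed    = '/^(?=.{6,12}$)(?=(.*?\d){2})(?=(.*?[A-Z]){2})(?=(.*?[a-z]){2})/';

// 30 filler characters, then "00AAaa", then 30 more: 66 characters in total.
$tooLong = str_repeat('x', 30) . '00AAaa' . str_repeat('x', 30);

var_dump(preg_match($original, $tooLong)); // the unanchored length check lets this through
var_dump(preg_match($fixed, $tooLong));    // rejected: longer than 12 characters
var_dump(preg_match($fixed, 'a1B2cD'));    // accepted: the digits need not be adjacent
```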
Lastly you'll note that I'm using non-greedy expressions (.*?). That will avoid a lot of backtracking and for this kind of validation is what you should generally use. The difference between:
(.*\d){2}
and
(.*?\d){2}
is that the first will grab all the characters with .* and then look for a digit. It won't find one because it will be at the end of the string, so it will backtrack one character and then look for a digit. If it's not a digit it will keep backtracking until it finds one. After it does, it will match that whole expression a second time, which will trigger even more backtracking.
That's what greedy wildcards mean.
The second version will pass zero characters to .*? and look for a digit. If it's not a digit, .*? will grab another character and then look for a digit, and so on. Particularly on long search strings this can be orders of magnitude faster. On a short password it almost certainly won't make a difference, but it's a good habit to know how the regex matcher works and to write the best regex you can.
That being said, this is probably an example of being too clever for your own good. If a password is rejected as not satisfying those conditions, how do you determine which one failed in order to give feedback to the user about what to fix? A programmatic solution is, in practice, probably preferable.
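A minimal sketch of what such a programmatic check could look like. password_problems() is a made-up helper name, the rules mirror the ones in the question, and the messages are only placeholders. It assumes ASCII passwords, since strlen() counts bytes.

```php
<?php
// password_problems() is a hypothetical helper. It returns a list of
// human-readable problems; an empty list means the password passes.
function password_problems(string $password): array
{
    $problems = [];

    $length = strlen($password); // byte length; fine for ASCII-only passwords
    if ($length < 6 || $length > 12) {
        $problems[] = 'length must be between 6 and 12 characters';
    }
    // preg_match_all() returns the number of matches found.
    if (preg_match_all('/[0-9]/', $password) < 2) {
        $problems[] = 'needs at least 2 digits';
    }
    if (preg_match_all('/[A-Z]/', $password) < 2) {
        $problems[] = 'needs at least 2 uppercase letters';
    }
    if (preg_match_all('/[a-z]/', $password) < 2) {
        $problems[] = 'needs at least 2 lowercase letters';
    }

    return $problems;
}

print_r(password_problems('abcdef'));
```

Each rule is checked on its own, so the user can be told exactly which requirement failed.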

how locale aware is preg_replace in php?

If I do, preg_replace('/[^a-zA-Z0-9\s-_]/','',$val) in a multi-lingual application, will it handle things like accented characters or russian characters? If not, how can I filter user input to only allow the above characters but with locale awareness?
thanks!
codecowboy.
The only useful information I can find is from this page of the manual, which states:
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
Still, I wouldn't bet that it works the way you want. Unicode matching might be the better option, but you'll probably have to try it to be certain.
About Unicode, the manual says this:
Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.
So, it might be a safer solution... curious about it, should I add ^^
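No, it only recognizes the ASCII letters a-z and A-Z (and the digits 0-9), so accented or Russian characters will be stripped. To match any letter or number in any language, you need to use the Unicode properties of the regex engine, together with the u modifier so that the pattern and subject are treated as UTF-8:
preg_replace('/[^\p{L}\p{N}]/u', '', $string);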
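A small sketch along the same lines, assuming UTF-8 input. The pattern in this answer also drops the whitespace, hyphen and underscore that the class in the question allowed, so the variant below puts them back. keep_word_chars() is a made-up helper name.

```php
<?php
// keep_word_chars() is a hypothetical helper name.
// It keeps letters and numbers from any script, plus whitespace,
// hyphen and underscore, and removes everything else.
function keep_word_chars(string $value): string
{
    // The u modifier makes PCRE treat the pattern and subject as UTF-8.
    // The hyphen is placed last in the class so it is taken literally.
    return preg_replace('/[^\p{L}\p{N}\s_-]/u', '', $value) ?? '';
}

echo keep_word_chars("Привет, café_1-2!"), PHP_EOL;
```

Note that this filters by Unicode property rather than by locale, so it behaves the same regardless of the locale the script runs under.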
