How to match any full unicode character, with modifiers etc, in regex? - php

I want to match any full Unicode character. I'm probably using the wrong terms, but I don't necessarily mean letters; I want any displayed character with any modifiers included. Edit: I'm keeping my original wording, but upon review of this answer, perhaps grapheme is actually what I'm looking for.
Using the trivial regex ., with the Unicode u modifier, /./u does not fully suffice. A few examples:
❤️ will instead match ❤ without the variation selector U+FE0F.
👧🏻 will only match 👧 without the pale skin tone U+1F3FB.
à (U+0061 (a) followed by U+0300 (grave accent)) will only match the a.
Following this answer, I was able to expand the pattern to this: /.[\x{1f3fb}-\x{1f3ff}\p{M}]?/u. This matches all of my test characters above, as well as the three han unification characters I pulled from this web page.
Edit: I just realized this still doesn't fully match, because (at least in PHP) it fails to fully match 🙍🏽‍♂ (might not display properly on all devices), because it doesn't capture the male character U+2642.
At this point, it seems like a guessing game to me. I have a feeling there are a lot of edge cases my current regex will not cover, but I don't know enough about foreign alphabets nor am I ready to just start guessing and enumerating random emojis and symbols from the character map to fully test this.
Is there a simpler solution to actually match any character including its modifiers/combining marks/etc?
Edit: Per Rob's comment below, I'm using PHP 7.4 for the regex.
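Since the question itself suggests that "grapheme" may be the right term, one candidate is PCRE's \X escape, which matches one extended grapheme cluster. The following is a minimal sketch, not a definitive answer: graphemes() is a hypothetical helper name, and how well \X handles emoji sequences (skin tones, zero-width-joiner sequences) is assumed to depend on the PCRE version bundled with your PHP build, so only the combining-mark case is shown.

```php
<?php
// Sketch: split a UTF-8 string into grapheme clusters using PCRE's \X.
// graphemes() is a hypothetical helper name, not a PHP built-in.
function graphemes(string $text): array
{
    // \X consumes one base character plus the combining marks that extend it.
    if (preg_match_all('/\X/u', $text, $matches) === false) {
        throw new InvalidArgumentException('Matching failed (is the input valid UTF-8?)');
    }
    return $matches[0];
}

// "a" followed by U+0300 (combining grave accent), then "b".
$clusters = graphemes("a\u{0300}b");
// $clusters has two entries: the accented a as one cluster, then "b".
```

Testing the emoji examples from the question against this helper on the target PHP version is the quickest way to see whether \X is sufficient there.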

Related

Convert regex from gskinner to PHP

I know that I'd likely hear "Don't parse HTML with regex", so let me say that this question is just academic at this point because I actually solved my problem using the DOM, but on my road to a solution, I ran across this pattern that works on the gskinner website, but I can't figure out how to make it work in PHP preg_match().
(?<=href\=")[^]+?(?=")
I think that the [^] is causing the problem, but I'm not certain what to do about it.
What it is intended to do is pull the substring from between the quotes of an href. (One would expect it to be a web-address or at least part of one.)
[^] is a problematic construct. It is an empty negated character class, and what it matches depends on the implementation. Some regex flavors interpret it as the negation of nothing, so it matches every character; that is what gskinner (that is, ActionScript 3) seems to be doing. PCRE, which PHP uses, instead treats a ] that directly follows [^ as a literal character, so the class is never closed and the pattern fails to compile.
I would never use this, because it is ambiguous.
The most readable way is to use ., the metacharacter that matches every character except newlines. If newlines should match too, add the s modifier, which enables dotall mode; that gives exactly what you wanted to achieve with [^].
A workaround that is sometimes used is a character class such as [\s\S] or [\w\W]. These also match every character (including newlines), because each one combines a predefined character class with its negation.
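The dotall variant described above can be sketched in PHP as follows. The sample HTML string and the variable names are made up for illustration; this is one way to port the pattern, not the only one.

```php
<?php
// Hypothetical sample input for illustration.
$html = '<a href="http://example.com/page">link</a>';

// "." with the s modifier stands in for [^]; the lazy +? stops at the first closing quote.
$found = preg_match('/(?<=href=").+?(?=")/s', $html, $match);

// $url holds the text between the quotes, or null when nothing matched.
$url = $found === 1 ? $match[0] : null;
```

Replacing .+? with [^"]+ would avoid the lazy quantifier and the s modifier altogether, at the cost of being a slightly different pattern from the original.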

Regex with negative lookahead to ignore the word "class"

I'm going insane over this; it's so simple, yet I can't figure out the right regex. I need a regex that will match blacklisted words, e.g. "ass".
For example, in this string:
<span class="bob">Blacklisted word was here</span>bass
I tried that regex:
((?!class)ass)
That should match the "ass" in the word "bass" but NOT the one in "class".
Instead, this regex flags "ass" in both occurrences. I checked multiple negative lookaheads on Google and none of them work.
NOTE: This is for a CMS, for moderators to easily find potentially bad words, I know you cannot rely on a computer to do the filtering.
If you have lookbehind available (which, IIRC, JavaScript does not and that seems likely what you're using this for) (just noticed the PHP tag; you probably have lookbehind available), this is very trivial:
(?<!cl)(ass)
Without lookbehind, you probably need to do something like this:
(?:(?!cl)..|^.?)(ass)
That's ass, with any two characters before as long as they are not cl, or ass that's zero or one characters after the beginning of the line.
Note that this is probably not the best way to implement a blacklist, though. You probably want this:
\bass\b
Which will match the word ass but not any word that includes ass in it (like association or bass or whatever else).
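As a rough check of the two patterns above against the sample string from the question, here is a small PHP sketch. The variable names and the extra test sentence are assumptions added for illustration.

```php
<?php
$sample = '<span class="bob">Blacklisted word was here</span>bass';

// Lookbehind version: "ass" is flagged unless "cl" comes directly before it.
$withLookbehind = preg_match_all('/(?<!cl)ass/', $sample, $m1);

// Word-boundary version: only the standalone word is flagged.
$wholeWord = preg_match_all('/\bass\b/', $sample, $m2);

// Hypothetical extra input containing the standalone word.
$standalone = preg_match_all('/\bass\b/', 'what an ass', $m3);
```

On the sample string, the lookbehind pattern flags only the "ass" in "bass", while the word-boundary pattern flags nothing because the word never appears on its own.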
It seems to me that you're actually trying to use two lists here: one for words that should be excluded (even if one is a part of some other word), and another for words that should not be changed at all - even though they have the words from the first list as substrings.
The trick here is to know where to use the lookbehind:
/ass(?<!class)/
In other words, the good word negative lookbehind should follow the bad word pattern, not precede it. Then it would work correctly.
You can even get some of them in a row:
/ass(?<!class)(?<!pass)(?<!bass)/
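This, though, will skip the "ass" in both passhole and pass, because the lookbehind only checks the letters right before the match position. To make it more precise, we can add word-boundary checks so that only the whole whitelisted words are exempt: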
/ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)/
UPDATE: of course, it's more efficient to check for parts of the string, with (?<!cl)(?<!b) etc. But my point was that you can still use the whole words from whitelist in the regex.
Then again, perhaps it'd be wise to prepare the whitelists accordingly (so shorter patterns will have to be checked).
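To see how the word-boundary version behaves, here is a short PHP sketch that runs it over a made-up test string containing three whitelisted words and one word that merely starts with a whitelisted one. The test string is an assumption for illustration.

```php
<?php
// Whitelisted words are exempt only when they appear as whole words.
$pattern = '/ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)/';

// Hypothetical input: three whitelisted words, then "passhole".
$input = 'class pass bass passhole';

$count = preg_match_all($pattern, $input, $matches, PREG_OFFSET_CAPTURE);

// Byte offset of the first (and only) flagged "ass".
$offset = $matches[0][0][1];
```

Only the "ass" inside "passhole" is flagged here, since "pass" is followed by another letter there and so does not count as the whole word.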
Is this what you want? (?<!class)(\w+ass)

How to preg_match_all a set of words in any possible language?

I have a website that people enter lists of words into.
These lists of words could be written in any language in the world.
How can I extract these lists of words from their input data if I do not know what language they are entering?
Is there some kind of match-all international alphabet symbol I am missing, or do I have to manually write up a set of brackets that will match every possible international letter?
Is this what I am looking for and just don't know it yet?
You can use Unicode character properties, for example:
preg_match_all('#[\p{L}\p{Pc}]+#u', $str, $matches);
[\p{L}\p{Pc}]+ gives you letters and connector punctuation. If you only want letters, you can shorten that to \pL+.
Either way, you'd want to define "word" better. It is probably more than a sequence of some letters...
My recommendation is to define your own input convention - force them to input one word at a time, or one word per line in a textbox. Else, you will need a segmentation algorithm for each script (granted, it will be something trivial like "split on characters which have the Unicode word separator property" for the vast majority of scripts, but the remaining special cases are basically still open AI research topics).
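For a concrete picture of what the Unicode-property pattern above extracts, here is a small PHP sketch on a made-up mixed-script input. The input string and variable names are assumptions, and "word" here simply means a run of letters and connector punctuation.

```php
<?php
// Hypothetical user input mixing Latin, Cyrillic, and Han text.
$str = 'one two_three, привет 你好';

// \p{L} matches letters in any script; \p{Pc} covers connectors such as "_".
preg_match_all('#[\p{L}\p{Pc}]+#u', $str, $matches);

$words = $matches[0];
```

Note that for scripts written without spaces, a run such as 你好 comes back as a single match even when it contains more than one word, which is the segmentation problem described above.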

Regex, encoding, and characters that look alike

First, a brief example: let's say I have the regex /[0-9]{2}°/ and the text "24º". The text won't match, obviously... (?) Really, whether the two even look different depends on the font.
Here is my problem: I have no control over which characters the user types, so I need either to cover all possibilities in the regex, as in /[0-9]{2}[°º]/, or, even better, to ensure the text contains only the character I'm expecting, °. But I can't just remove the unknown characters, or the regex won't work; I need to replace each one with the look-alike character I'm expecting. I have done this with a little function that maps each "look-alike" to "what I expect". The problem is that I have not covered all possibilities. For example, today I found a new -, so now we have three of them, just like LaTeX =D (-, --, ---). Cool, but the regex didn't work.
Does anyone know how I might solve this?
There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.
Unfortunately not in PHP. ASP.NET has Unicode character classes that cover things like this, but as you can see here, :So covers too much. Also, as it's not PHP, it doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common character sets and using them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol. Using something that looks similar is not correct. There are also symbols for degree Fahrenheit and celsius. There are tons of minus signs unfortunately.
Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u" Then you can include all Unicode characters that you want to accept in your character class. You will need to convert the subject string to UTF-8 also before using the regex on it.
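The mapping approach mentioned in the question can be combined with the /u modifier. Below is a minimal PHP sketch: normalizeDegree() is a hypothetical helper name, and the look-alike list contains only the masculine ordinal indicator plus three of the similar characters named in an earlier answer, so it is not exhaustive.

```php
<?php
// Sketch: replace known look-alikes with the real degree sign (U+00B0).
// normalizeDegree() is a hypothetical helper; extend the map as new cases turn up.
function normalizeDegree(string $text): string
{
    $lookAlikes = [
        "\u{00BA}" => "\u{00B0}", // masculine ordinal indicator
        "\u{02DA}" => "\u{00B0}", // ring above
        "\u{2070}" => "\u{00B0}", // superscript zero
        "\u{2218}" => "\u{00B0}", // ring operator
    ];
    return strtr($text, $lookAlikes);
}

// "24" followed by the masculine ordinal indicator, as in the question.
$normalized = normalizeDegree("24\u{00BA}");
$isTemperature = preg_match('/[0-9]{2}\x{00B0}/u', $normalized) === 1;
```

Keeping the map separate from the regex means the pattern itself stays simple, and newly discovered look-alikes only require one more array entry.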
I just stumbled into good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html
OK, if you're looking to pull temperatures, you'll probably need to change a few things first.
Temperatures can have 1 to 3 digits, so [0-9]{1,3} may be more accurate for you (and if someone is actually still alive to put in a four-digit temperature, then we are all doomed!).
Now the degree signs are the tricky part as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
[0-9]{1,3}.
You might have to beef up the first part though with a little position handling like beginning of the string or end.
You may also exclude all the regular characters you don't want.
[0-9]{1,3}[^a-zA-Z]
That will pick up all the punctuation marks (only one though).

How to check real names and surnames - PHP

Here's my problem:
I want to check whether a user has entered a real name and surname by checking that they contain only letters (of any alphabet) and ' or - in PHP.
I've found a solution here (but I don't remember the link) on how to check if a string has only letters:
preg_match('/^[\p{L} ]+$/u',$name)
but I'd like to allow ' and - too. (Charset is UTF-8.)
Can anyone help me please?
A little off-topic, but what exactly is the point of validating names?
It's not to prevent fraud; if people are trying to give you a fake name, they can easily type a string of random letters.
It's not to prevent mistakes; typing a punctuation character is only one of the many mistakes you could make, and an unlikely one at that.
It's not to prevent code injection; you should be preventing that by properly encoding your outputs, regardless of what characters they contain.
So why do we all do it?
Looks like you just need to modify the regex: [\p{L}' -]+
(International) names can contain many characters: spaces, 's, dashes, normal letters, umlauts, accents, ...
EDIT: The point is: How to be sure all letters (of all languages), dash, ' and space are enough? Are there no names which contain a dot (What about "Dr. No"?), a colon or some char else?
EDIT2: Thanks to the user 'some', probably from Sweden (who left a comment), we now know that there is a Swedish name 'Andreas J:son Friberg'. Remember the colon!
Depending on the character set you want to permit, you'll just need to make sure that the characters you want to support are inside the '[]' portion of the regex. Since the '-' character has special meaning in this context (it creates a range), it needs to be the last item in the list (or the first, or escaped as \-).
The \p{L} means match any character with the property of being a letter. \w has a similar meaning, but also includes the '_' character, which you probably don't want.
preg_match('/^[A-Za-z \'-]+$/i',$name);
Would match most common names, though if you want to support foreign character sets, you'll need a more exotic regex.
This should also do it
/[\w'-]+/gi
If the charset is UTF-8, then you have a problem: how can you check for Central and Eastern European Latin characters (diacritics), or names in Cyrillic, Chinese, or Japanese? That would be a hell of a regex.
Note that the example you provided does not check to ensure that the user has both a surname and given names, though I would argue that that is how it should be. You shouldn't assume a person has more than one name. I am currently working on a PHP application which deals with people's names in context, and if I have discovered anything it's that you cannot make such assumptions :) Even many non-celebrities have just one name.
Using the Unicode categories as in \p{L} was a good idea, as yes obviously people will have all sorts of characters from other languages in their names. However, as well as \p{L} you will also have to take into account combining marks - ie accents, umlauts etc that people add as extra characters.
So, maybe immediately after \p{L} I'd add \p{Mc}
I'd end up with
preg_match('/^[\pL\p{Mc} \'-]+$/u', $name)
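One caveat worth checking: \p{Mc} covers only spacing combining marks, while common combining accents such as U+0308 (combining diaeresis) are in the nonspacing category \p{Mn}. The sketch below is a variation on the pattern above, not the answerer's own: it swaps in \p{M}, which covers every mark category. isPlausibleName() is a hypothetical helper name.

```php
<?php
// Sketch: letters, any combining marks, spaces, apostrophes, and hyphens.
// isPlausibleName() is a hypothetical helper; \p{M} replaces \p{Mc} as a variation.
function isPlausibleName(string $name): bool
{
    return preg_match('/^[\p{L}\p{M} \'-]+$/u', $name) === 1;
}

// "Zoe" followed by U+0308, i.e. a decomposed "Zoë".
$decomposed = "Zoe\u{0308}";

$acceptsHyphenated = isPlausibleName("Jean-Luc O'Neil");
$acceptsDecomposed = isPlausibleName($decomposed);
$rejectsDigits = !isPlausibleName('R2D2');

// For comparison: the \p{Mc}-only class rejects the decomposed form.
$mcOnly = preg_match('/^[\pL\p{Mc} \'-]+$/u', $decomposed) === 1;
```

Normalizing input to a composed form before validation would be another option, but that relies on the intl extension being available.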
