PHP - Matching city/street names in PHP using unicode regex

I have this expression:
'/^([\p{L}\p{Mn}\p{Pd}\'\x{2019}]+|\d+)(\s+([\p{L}\p{Mn}\p{Pd}\'\x{2019}]+|\d+))*$/u'
Its goal is to match names and numbers like "6 de diciembre" or "Mariana de Jesús" (using numbers and Unicode characters).
The issue is that it also matches typos like "6de diciembre" [1]. Mixing numbers and letters in the same word should not be allowed (no, we do not have expressions like "6th" in these cases).
Question: What character classes should I use? I need digits and these Unicode letters, but not mixed, not concatenated.
Note: I posted a similar question on this topic before, but the issue was slightly different, so I cannot expect the same kind of answer.
[1] I can't believe I MUST clarify this point: typos should not be matched. Unless stated otherwise, a regex is meant to find an expected regular format in a string.

The expression works well. I had a totally different issue in which my validation handler was not called.
After a bit of experimentation I noticed that if I shortened the name of the validation handler function, the Drupal 7 form would use it as a handler instead of silently discarding it. Yes, ladies and gentlemen, my handler was named toyotaec_form_webform_client_form_trabaja_con_nosotros_validate and assigned as:
`$form['#validate'][] = 'toyotaec_form_webform_client_form_trabaja_con_nosotros_validate';`.
Slicing the 'con_nosotros_' part out on both sides made it work, and led me to this conclusion:
.: Drupal has a(n absolutely senseless) limit for those identifiers, while PHP has not.
.: Drupal truncates the input when you assign it.
.: Drupal raises no error for a nonexistent function (the truncated name does not exist as a function).
Ranty (but logical) conclusion: For this and many previous issues I conclude that Drupal sucks.
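
As a quick check of the claim that the expression itself works, here is a minimal sketch that runs the pattern from the question against the three sample strings. The helper name matchesName is made up for illustration; only the pattern comes from the question.

```php
<?php
// Pattern copied verbatim from the question.
$pattern = '/^([\p{L}\p{Mn}\p{Pd}\'\x{2019}]+|\d+)(\s+([\p{L}\p{Mn}\p{Pd}\'\x{2019}]+|\d+))*$/u';

// Hypothetical helper: true when the whole value matches the pattern.
function matchesName(string $pattern, string $value): bool
{
    return preg_match($pattern, $value) === 1;
}

var_dump(matchesName($pattern, '6 de diciembre'));          // words and numbers separated by spaces
var_dump(matchesName($pattern, "Mariana de Jes\u{00FA}s")); // accented letter
var_dump(matchesName($pattern, '6de diciembre'));           // digit glued to letters
```

Each token is either all letters or all digits, and tokens must be separated by whitespace, so a digit glued to letters cannot satisfy the pattern.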

Related

How to match any full unicode character, with modifiers etc, in regex?

I want to match any full Unicode character. I'm probably using the wrong terms, but I don't necessarily mean letters; I want any displayed character with any modifiers included. Edit: I'm keeping my original wording, but upon review of this answer, perhaps grapheme is actually what I'm looking for.
Using the trivial regex ., with the Unicode u modifier, /./u does not fully suffice. A few examples:
❤️ will instead match ❤ without the variation selector U+FE0F.
👧🏻 will only match 👧 without the pale skin tone U+1F3FB.
à (U+0061 (a) followed by U+0300 (grave accent)) will only match the a.
Following this answer, I was able to expand the pattern to this: /.[\x{1f3fb}-\x{1f3ff}\p{M}]?/u. This matches all of my test characters above, as well as the three Han unification characters I pulled from this web page.
Edit: I just realized this still doesn't fully match, because (at least in PHP) it fails to fully match 🙍🏽‍♂ (might not display properly on all devices), because it doesn't capture the male character U+2642.
At this point, it seems like a guessing game to me. I have a feeling there are a lot of edge cases my current regex will not cover, but I don't know enough about foreign alphabets nor am I ready to just start guessing and enumerating random emojis and symbols from the character map to fully test this.
Is there a simpler solution to actually match any character including its modifiers/combining marks/etc?
Edit: Per Rob's comment below, I'm using PHP 7.4 for the regex.
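
One thing that may be worth trying here is PCRE's \X escape, which matches one extended grapheme cluster. The sketch below is only an illustration: the function name graphemes is made up, and how well longer emoji sequences (such as those joined with U+200D) are kept together depends on the Unicode data in the PCRE version that PHP was built with, so that part is an assumption to verify on your own setup.

```php
<?php
// Hypothetical helper: split a UTF-8 string into extended grapheme clusters
// using PCRE's \X escape (requires the /u modifier).
function graphemes(string $text): array
{
    preg_match_all('/\X/u', $text, $matches);
    return $matches[0];
}

// "a" followed by U+0300 (combining grave accent), then "b".
print_r(graphemes("a\u{0300}b"));

// Heart followed by the variation selector U+FE0F.
print_r(graphemes("\u{2764}\u{FE0F}"));
```

Unlike /./u, which stops after a single code point, \X keeps a base character together with the combining marks and variation selectors that follow it.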

Need to switch from ereg() to preg_match() [duplicate]

I need to know what this line of code does. I tried to figure it out because I have to rebuild it with preg_match(), but I didn't understand it completely:
ereg("([0-9]{1,2}).([0-9]{1,2}).([0-9]{4})", $date)
I know it checks a date, but I don't know in which way.
Thanks for any help.

Let's break this down:
([0-9]{1,2})
This looks for the digits zero through nine (- indicates a range when used in brackets []), and there can be one or two of them.
.
This looks for any single character
([0-9]{1,2})
This looks for the digits zero through nine, and there can be one or two of them (again)
.
This looks for any single character (again)
([0-9]{4})
This looks for numbers zero through nine and there must be four of them in a row
So it is looking for a date in any of the following formats:
04 18 1973
04-18-1973
04/18/1973
04.18.1973
Many more strings will fit that pattern, so it isn't a very good regex for what it is supposed to validate. There are lots of sample regex patterns for matching dates in this format, so if you google it you'll have a PCRE pattern in no time.
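
To make the over-matching concrete, here is a small sketch that compares the original pattern (with PCRE delimiters added) against a stricter variant. The stricter pattern is only an assumption about what was intended: it anchors the match, allows the four separators listed above, and requires the same separator in both positions.

```php
<?php
// Pattern from the question, with delimiters added for preg_match().
$loose = '/([0-9]{1,2}).([0-9]{1,2}).([0-9]{4})/';

// Stricter sketch (an assumption, not the original author's intent):
// anchored, separator limited to space, dot, slash or hyphen,
// and \2 forces the second separator to equal the first.
$strict = '/\A([0-9]{1,2})([ .\/-])([0-9]{1,2})\2([0-9]{4})\z/';

foreach (['04.18.1973', '04/18/1973', '04x18y1973', '1234567890', '04.18-1973'] as $candidate) {
    printf(
        "%-12s loose=%d strict=%d\n",
        $candidate,
        preg_match($loose, $candidate),
        preg_match($strict, $candidate)
    );
}
```

The loose pattern accepts even a plain run of ten digits, because the unescaped dot happily matches a digit.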

It's a relatively simple regular expression (regex). If you're going to be working with regex, then I suggest taking a bit of time to learn the syntax. A good starting place to learn is http://regular-expressions.info.
"Regular expressions" or "regex" is a pattern matching language used for searching through strings. There are a number of dialects, which are mostly fairly similar but have some differences. PHP started out with the ereg() family of functions using one particular dialect and then switched to the preg_xx() functions to use a slightly different regex dialect.
There are some differences in syntax between the two, which it is helpful to learn, but they're fairly minor. And in fact the good news for you is that the pattern here is pretty much identical between the two.
Beyond the patterns themselves, the only other major difference you need to know about is that patterns in preg_match() must have a pair of delimiting characters at either end of the pattern string. The most commonly used characters for this are slashes (/).
So in this case, all you need to do is swap ereg for preg_match, and add the slashes to either end of the pattern:
$result = preg_match("/([0-9]{1,2}).([0-9]{1,2}).([0-9]{4})/", $date);
Note the slash immediately after the opening quote and the one immediately before the closing quote.
It would still help to get an understanding of what the pattern is doing, but for a quick win, that's probably all you need to do in this case. Other cases may be more complex, but most will be as simple as that.
Go read the regular-expressions.info site I linked earlier though; it will help you.
One thing I would add, however, is that the pattern given here is actually quite poorly written. It is intending to match a date string, but will match a lot of things that it probably didn't intend to.
You could fix it up by finding a better regular expression for matching dates, but it is quite possible that the code could be written without regex at all; PHP has perfectly good date handling built in. You'd need to consider the code around it and understand what it's doing, but it's quite possible that the whole thing could be replaced with something like this (the format comes first, then the string; lowercase m is the numeric month):
$dateObject = DateTime::createFromFormat('d.m.Y', $date);
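
A slightly fuller sketch of the DateTime approach, assuming the input is meant to be day.month.year. The function name parseDottedDate is made up for illustration. DateTime silently rolls impossible dates over (31 February becomes early March) and only reports a warning, so the sketch checks getLastErrors() as well.

```php
<?php
// Hypothetical helper: returns a date object for valid "day.month.year"
// input and null for anything else.
function parseDottedDate(string $date): ?DateTimeImmutable
{
    // The leading "!" resets fields that are not parsed (the time of day)
    // instead of filling them in from the current time.
    $parsed = DateTimeImmutable::createFromFormat('!d.m.Y', $date);

    // getLastErrors() returns false when nothing went wrong on PHP 8.2+,
    // and an array of counts on older versions.
    $errors = DateTimeImmutable::getLastErrors();
    $clean = $errors === false
        || ($errors['warning_count'] === 0 && $errors['error_count'] === 0);

    return ($parsed !== false && $clean) ? $parsed : null;
}

$ok = parseDottedDate('18.04.1973');
echo $ok === null ? "invalid\n" : $ok->format('Y-m-d') . "\n";

$bad = parseDottedDate('31.02.2021');
echo $bad === null ? "invalid\n" : $bad->format('Y-m-d') . "\n";
```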

It looks like it would be pretty much agnostic in its matching.
You could interpret it either as mm.dd.yyyy or dd.mm.yyyy. I would consider modifying it if you are in fact trying to verify a date, since 00.00.0000 would match but is an invalid date, outside of a possible historical context.
Edit: I forgot that an unescaped '.' matches any character here, not just a literal dot.

This does almost the same thing. I have only replaced [0-9] with \d, and the dot (which matches any character) with \D (a non-digit; you could also use \. or [. -] instead):
preg_match("~\d{1,2}\D\d{1,2}\D\d{4}~", $date)

Regex, encoding, and characters that look alike

First, a brief example: say I have the regex /[0-9]{2}°/ and the text "24º". The text won't match, obviously ... (?) Really, whether the two characters look different depends on the font.
Here is my problem: I have no control over which characters the user types, so I need to either cover all possibilities in the regex, as in /[0-9]{2}[°º]/, or, even better, ensure that the text contains only the character I'm expecting (°). But I can't just remove the unknown characters, otherwise the regex won't work; I need to replace each one with the look-alike I'm expecting. I have done this with a small function that maps each look-alike to the expected character and substitutes it. The problem is that I have not covered all possibilities. For example, today I found a new "-"; now we have three of them, just like in LaTeX =D (- -- ---). Cool, but the regex didn't work.
Does anyone know how I might solve this?

There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.

Unfortunately not in PHP. ASP.NET has Unicode character classes that cover things like this, but as you can see here, :So covers too much. And since that's not PHP, it doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common character sets and using them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol. Using something that looks similar is not correct. There are also symbols for degrees Fahrenheit and Celsius. Unfortunately, there are tons of minus signs.

Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u" Then you can include all Unicode characters that you want to accept in your character class. You will need to convert the subject string to UTF-8 also before using the regex on it.
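
The mapping approach from the question, combined with the /u modifier, could be sketched as follows. The map is a hypothetical starting point, not a complete list of look-alikes, and the function names are made up for illustration. The input is assumed to be UTF-8 already.

```php
<?php
// Hypothetical look-alike map (incomplete by nature): each key is replaced
// by the character the regex expects.
const LOOKALIKES = [
    "\u{00BA}" => "\u{00B0}", // masculine ordinal indicator -> degree sign
    "\u{02DA}" => "\u{00B0}", // ring above -> degree sign
    "\u{2218}" => "\u{00B0}", // ring operator -> degree sign
    "\u{2013}" => "-",        // en dash -> hyphen-minus
    "\u{2014}" => "-",        // em dash -> hyphen-minus
    "\u{2212}" => "-",        // minus sign -> hyphen-minus
];

// Replace every known look-alike with its expected character.
function normalizeLookalikes(string $text): string
{
    return strtr($text, LOOKALIKES);
}

// Normalize first, then match against the single expected character.
function hasTwoDigitDegrees(string $text): bool
{
    return preg_match('/[0-9]{2}\x{00B0}/u', normalizeLookalikes($text)) === 1;
}

var_dump(hasTwoDigitDegrees("24\u{00BA}")); // ordinal indicator typed instead of the degree sign
```

Keeping the map separate from the pattern means a newly discovered variant only needs one new entry, and the regex itself stays simple.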

I just stumbled into good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html

OK, if you're looking to pull temperatures, you'll probably need to start by changing a few things.
Temperatures can have 1 to 3 digits, so [0-9]{1,3} may be more accurate for you (and if someone is actually still alive to enter a four-digit temperature, then we are all doomed!).
Now the degree signs are the tricky part, as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
[0-9]{1,3}.
You might have to beef up the first part though with a little position handling like beginning of the string or end.
You may also exclude all the regular characters you don't want.
[0-9]{1,3}[^a-zA-Z]
That will pick up all the punctuation marks (only one though).

Will [a-z] ever match accented characters in PREG/PCRE?

I'm already aware that \w in PCRE (particularly PHP's implementation) can sometimes match some non-ASCII characters depending on the locale of the system, but what about [a-z]?
I wouldn't think so, but I noticed these lines in one of Drupal's core files (includes/theme.inc, simplified):
// To avoid illegal characters in the class,
// we're removing everything disallowed. We are not using 'a-z' as that might leave
// in certain international characters (e.g. German umlauts).
$body_classes[] = preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9-_]+!s', '', $class);
Is this true, or did someone simply get [a-z] confused with \w?

Long story short: Maybe, depends on the system the app is deployed to, depends how PHP was compiled, welcome to the CF of localization and internationalization.
The underlying PCRE engine takes locale into account when determining what "a-z" means. In a Spanish-based locale, ñ would be caught by a-z. The semantic meaning of a-z is "all the letters between a and z", and ñ is a separate letter in Spanish.
However, the way PHP blindly handles strings as collections of bytes rather than a collection of UTF code points means you have a situation where a-z MIGHT match an accented character. Given the variety of different systems Drupal gets deployed to, it makes sense that they would choose to be explicit about the allowed characters rather than just trust a-z to do the right thing.
I'd also conjecture that the existence of this regular expression is the result of a bug report being filed about German umlauts not being filtered.
Update in 2014: Per JimmiTh's answer below, it looks like (despite some "confusing-to-non-pcre-core-developers" documentation) [a-z] will only match the characters abcdefghijklmnopqrstuvwxyz a proverbial 99% of the time. That said, framework developers tend to get twitchy about vagueness in their code, especially when the code relies on systems (locale-specific strings) that PHP doesn't handle as gracefully as you'd like, and on servers the developers have no control over. So while the anonymous Drupal developer's comment is incorrect, it wasn't a matter of "getting [a-z] confused with \w"; rather, a Drupal developer was unsure of how PCRE handled [a-z] and chose the more specific form abcdefghijklmnopqrstuvwxyz to ensure the behavior they wanted.

The comment in Drupal's code is WRONG.
It's NOT true that "international characters (e.g. German umlauts)" might match [a-z].
If, e.g., you have the German locale available, you can check it like this:
setlocale(LC_ALL, 'de_DE'); // German locale (not needed, but you never know...)
echo preg_match('/^[a-z]+$/', 'abc') ? "yes\n" : "no\n";
echo preg_match('/^[a-z]+$/', "\xE4bc") ? "yes\n" : "no\n"; // äbc in ISO-8859-1
echo preg_match('/^[a-z]+$/', "\xC3\xA4bc") ? "yes\n" : "no\n"; // äbc in UTF-8
echo preg_match('/^[a-z]+$/u', "\xC3\xA4bc") ? "yes\n" : "no\n"; // w/ PCRE_UTF8
Output (will not change if you replace de_DE with de_DE.UTF-8):
yes
no
no
no
The character class [abcdefghijklmnopqrstuvwxyz] is identical to [a-z] in both kinds of encoding that PCRE understands: ASCII-derived single-byte encodings and UTF-8 (which is ASCII-derived too). In both, [a-z] is the same as [\x61-\x7A].
Things may have been different when the question was asked in 2009, but in 2014 there is no "weird configuration" that can make PHP's PCRE regex engine interpret [a-z] as a class of more than 26 characters (as long as [a-z] itself is written as 5 bytes in an ASCII-derived encoding, of course).

Just an addition to both the already excellent, if contradicting, answers.
The documentation for the PCRE library has always stated that "Ranges operate in the collating sequence of character values". Which is somewhat vague, and yet very precise.
It refers to collating by the index of characters in PCRE's internal character tables, which can be set up to match the current locale using pcre_maketables. That function builds the tables in order of char value (tolower(i)/toupper(i))
In other words, it doesn't collate by actual cultural sort order (the locale collation info). As an example, while German treats ö the same as o in dictionary collation, ö has a value that makes it appear outside the a-z range in all the common character encodings used for German (ISO-8859-x, unicode encodings etc.) In this case, PCRE would base its determination of whether ö is in the range [a-z] on that code value, rather than any actual locale-defined sort order.
PHP has mostly copied PCRE's documentation verbatim in their docs. However, they've actually gone to pains changing the above statement to "Ranges operate in ASCII collating sequence". That statement has been in the docs at least since 2004.
In spite of the above, I'm not quite sure it's true, however.
Well, not in all cases, at least.
The one call PHP makes to pcre_maketables... From the PHP source:
#if HAVE_SETLOCALE
if (strcmp(locale, "C"))
tables = pcre_maketables();
#endif
In other words, if the environment for which PHP is compiled has setlocale and the (LC_CTYPE) locale isn't the POSIX/C locale, tables built from the runtime environment's current locale are used. Otherwise, the default PCRE tables are used, which are generated (by pcre_maketables) when PCRE is compiled, based on the compiler's locale:
This function builds a set of character tables for character values less than 256. These can be passed to pcre_compile() to override PCRE's internal, built-in tables (which were made by pcre_maketables() when PCRE was compiled). You might want to do this if you are using a non-standard locale. The function yields a pointer to the tables.
While German wouldn't be different for [a-z] in any common character encoding, if we were dealing with EBCDIC, for example, [a-z] would include ± and ~. Granted, EBCDIC is the one character encoding I can think of that doesn't place a-z and A-Z in uninterrupted sequence.
Unless PCRE does some magic when using EBCDIC (and it might), while it's highly unlikely you'd be including umlauts in anything but the most obscure PHP build or runtime environment (using your very own, very special, custom-made locale definition), you might, in the case of EBCDIC, include other unintended characters. And for other ranges, "collated in ASCII sequence" doesn't seem entirely accurate.
ETA: I could have saved some research by looking for Philip Hazel's own reply to a similar concern:
Another issue is with character classes ranges. You would think that [a-k] and [x-z] are well defined for latin scripts but that's not the case.
They are certainly well defined, being equivalent to [\x61-\x6b] and [\x78-\x7a], that is, related to code order, not cultural sorting order.
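
The code-order reading can be checked on a given installation with a byte sweep. This is a minimal sketch with a made-up function name; it assumes an ASCII-derived environment and says nothing about EBCDIC builds.

```php
<?php
// Hypothetical helper: collect every single byte value that [a-z] accepts.
// No /u modifier is used, so PCRE treats the subject as raw bytes.
function bytesMatchedByAtoZ(): array
{
    $matched = [];
    for ($byte = 0; $byte <= 255; $byte++) {
        if (preg_match('/\A[a-z]\z/', chr($byte)) === 1) {
            $matched[] = $byte;
        }
    }
    return $matched;
}

$bytes = bytesMatchedByAtoZ();
printf("%d bytes matched, from 0x%02X to 0x%02X\n", count($bytes), min($bytes), max($bytes));
```

If the range followed cultural sort order rather than code order, bytes outside 0x61 to 0x7A would show up in the result.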

How do I create a regular expression that disallows symbols?

I have a question regarding regexps in general. I'm currently building a registration form where you can enter your full name (given name and family name); however, I can't use [a-zA-Z] as a validation check because that would exclude everyone with a "foreign" character.
What is the best way to make sure that they don't enter a symbol, in both php and javascript?
Thanks in advance!

The correct solution to this problem (in general) is POSIX character classes. In particular, you should be able to use [:alpha:] (or [:alnum:]) to do this.
Though why do you want to prevent users from entering their name exactly as they type it? Are you sure you're in a position to tell them exactly what characters are allowed to be in their names?

You first need to conceptually distinguish between a "foreign" character and a "symbol." You may need to clarify here.
Accounting for other languages means accounting for other code pages and that is really beyond the scope of a simple regexp. It can be done, but on a higher level, the codepages have to work.

If you strictly wanted your regexp to fail on punctuation and symbols, you could use [^[:punct:]], but I'm not sure how the [:punct:] POSIX class reacts to some of the weird Unicode symbols. This would of course stop someone from entering "John Smythe-Jones" as their name (as '-' is a punctuation character), so I would probably advise against using it.

I don’t think that’s a good idea. See How to check real names and surnames - PHP

I don't know how you would account for what is valid or not, and depending on your global reach, you will probably not be able to remove anything without locking out somebody. But a Google search turned this up which may be helpful.
http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_symbol_characters_web_page

You could loop through the input string and use the String.charCodeAt() function to get the integer character code for each character. Set yourself up with a range of acceptable characters and do your comparison.

As noted POSIX character classes are likely the best bet. But the details of their support (and alternatives) vary very much with the details of the specific regex variant.
PHP apparently does support them, but JavaScript does not.
This means that for JavaScript you will need to use character ranges: /[\u0400-\u04FF]/ matches any one Cyrillic character. Clearly this will take some writing, but note that the XML 1.0 Recommendation (from W3C) includes a listing of a lot of ranges, albeit a few years old now.
One approach might be to have a limited check on the client in JavaScript, and the full check only server side.
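
For the server-side half, a PHP check built on Unicode properties might look like the sketch below. The function name looksLikeName and the exact rule (words of letters joined by a single space, hyphen or apostrophe) are assumptions for illustration; as other answers point out, any such rule will reject some real names.

```php
<?php
// Hypothetical server-side check: one or more words made of letters
// (\p{L}) and combining marks (\p{M}), joined by a single space,
// apostrophe (ASCII or U+2019) or hyphen. The /u modifier makes PCRE
// treat both the pattern and the subject as UTF-8.
function looksLikeName(string $name): bool
{
    return preg_match(
        '/\A[\p{L}\p{M}]+(?:[ \'\x{2019}-][\p{L}\p{M}]+)*\z/u',
        $name
    ) === 1;
}

var_dump(looksLikeName('John Smythe-Jones'));       // hyphenated name
var_dump(looksLikeName("Mariana de Jes\u{00FA}s")); // accented letter
var_dump(looksLikeName('John $mith'));              // contains a symbol
```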
