Accept international name characters in RegEx - php

I've always struggled with RegEx so forgive me if this may seem like an awful approach at tackling my problem.
When users enter first and last names, I started off with a basic check for upper and lower case letters, white space, apostrophes and hyphens:
if (!preg_match("/^[a-zA-Z\s'-]+$/", $name)) { // Error }
Now I realise this isn't the best, since people could have names such as Dr. Martin Luther King, Jr. (with commas and full stops). So I assume changing it to this would make it slightly more effective:
if (!preg_match("/^[a-zA-Z\s,.'-]+$/", $name)) { // Error }
I then saw the name of a girl I know on Facebook, who writes her name as Siân, which got me thinking about names that contain accented characters such as umlauts, as well as, say, Japanese/Chinese/Korean/Russian characters. So I started searching and found ways of writing each of these characters into the class, like so:
if (!preg_match("/^[a-zA-Z\sàáâäãåèéêëìíîïòóôöõøùúûüÿýñçčšžÀÁÂÄÃÅÈÉÊËÌÍÎÏÒÓÔÖÕØÙÚÛÜŸÝÑßÇŒÆČŠŽ∂ð ,.'-]+$/u", $first_name)) { // Error }
As you can imagine, it's extremely long-winded and I'm pretty certain there is a much simpler RegEx which can achieve this. Like I've said, I've searched around but this is the best I can do.
So, what is a good way to check for upper and lower case characters, commas, full stops, apostrophes, hyphens, umlauts, Latin, Japanese/Russian etc.?

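You can use a Unicode character class. \pL covers pretty much all letter symbols.
http://php.net/manual/en/regexp.reference.unicode.php
if (!preg_match("/^[a-zA-Z\s,.'\pL-]+$/u", $name))
Keep the hyphen last in the class: placed directly before \pL it can be rejected as an invalid range by newer PCRE versions.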
See also http://www.regular-expressions.info/unicode.html, but beware that PHP/PCRE only understands the abbreviated class names.

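\pL already includes a-z and A-Z, so the a-zA-Z in the pattern above is redundant and it can be simplified to
"/^[\pL\s,.'-]+$/u"
The u modifier is still needed, though: without it PCRE reads the subject byte by byte, so multi-byte UTF-8 letters such as â are not matched reliably by \pL.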

There could probably be some loosening of the qualifications by allowing other types of punctuation.
One thing that should be a restriction is requiring at least one letter.
if (!preg_match("/^[\s,.'-]*\p{L}[\p{L}\s,.'-]*$/u", $name))
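As a rough sketch of how the pattern above might be wrapped up, the helper below assumes the input is UTF-8 and that a name may contain letters, whitespace, commas, full stops, apostrophes and hyphens, with at least one letter somewhere. The function name isValidName is made up for illustration.

```php
<?php
// Hypothetical helper, not part of the original answers.
// Assumes $name is a UTF-8 string.
function isValidName(string $name): bool
{
    // \p{L} matches a letter in any script; the u modifier makes PCRE
    // treat both the pattern and the subject as UTF-8.
    // The hyphen sits last in each class so it is a literal, not a range.
    return preg_match('/^[\s,.\'-]*\p{L}[\p{L}\s,.\'-]*$/u', $name) === 1;
}

var_dump(isValidName('Siân'));                         // accented letter
var_dump(isValidName('Dr. Martin Luther King, Jr.'));  // commas and full stops
var_dump(isValidName('-- ,.'));                        // punctuation only
var_dump(isValidName('R2-D2'));                        // digits are not allowed
```

Comparing the result with === 1 also treats a preg_match failure (for example on invalid UTF-8) as an invalid name rather than letting it slip through.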

Related

Why is ctype_alnum unhelpful in matching culture-agnostic alphanumerics?

Let's suppose that I have a text in a variable called $text and I want to validate it, so that it can contain spaces, underscores, dots and any letters from any languages and any digits. Since I am a total noob with regular expressions, I thought I can work-around learning it, like this:
if (!ctype_alnum(str_replace(".", "", str_replace(" ", "", str_replace("_", "", $text))))) {
//invalid
}
This correctly considers the following inputs as valid:
foobarloremipsum
foobarloremipsu1m
foobarloremi psu1m
foobar._remi psu1m
So far, so good. But if I enter my name, Lajos Árpád, which contains non-English letters, then it is considered to be invalid.
The manual for ctype_alnum says: "Returns TRUE if every character in text is either a letter or a digit, FALSE otherwise."
I suppose that a setting needs to be changed to allow non-English letters, but how can I use ctype_alnum to return true if and only if $text contains only letters or digits in a culture-agnostic fashion?
Alternatively, I am aware that some spooky regular expression can be used to resolve the issue, including things like \p{L} which is nice, but I am interested to know whether it is possible using ctype_alnum.
You need to use setlocale with category set to LC_CTYPE and the appropriate locale for the ctype_* family of functions to work on non-English characters.
Note that the locale that you're using with setlocale needs to actually be installed on the system, otherwise it won't work. The best way to remedy this situation is to use a portable solution, given in this answer to a similar question.
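For comparison, a locale-independent check might look like the sketch below. It sets ctype_alnum aside and takes the \p{L} route mentioned in the question, assuming $text is UTF-8. The function name is invented for the example.

```php
<?php
// Hypothetical helper: letters and digits from any script, plus
// space, underscore and dot. Assumes UTF-8 input.
function isCultureAgnosticAlnum(string $text): bool
{
    // \p{L} = any letter, \p{N} = any numeric character.
    // The u modifier makes PCRE read the subject as UTF-8.
    return preg_match('/^[\p{L}\p{N} ._]+$/u', $text) === 1;
}

var_dump(isCultureAgnosticAlnum('foobar._remi psu1m'));
var_dump(isCultureAgnosticAlnum('Lajos Árpád'));
var_dump(isCultureAgnosticAlnum('foo#bar'));
```

Because this does not depend on setlocale, it should behave the same regardless of which locales are installed on the server.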

Regular expression to match numbers, but not HTML entities

Is there a regex to find all the digit sequences (\d+) in text, but not the ones forming HTML entities? It looks like I should use both "look ahead" and "look behind" together, but I can’t figure out how.
For example, for the string &#10001; #555 foo 777; I want to match only 555 and 777, but not 10001.
I’ve tried
~(?<!(&#)|\d])\d+(?![\d|;])~
But it seems to be too strict, as it returns no matches for cases like 777;
You can probably use this regex with lookarounds:
(?<!&#)\b\d+\b|(?:^|\b)\d+\b(?!;|$)
Demo: http://www.rubular.com/r/IUGqDf7Nfg
I found the solution the next morning.
(?<![(&#)\d])\d+|\d+(?!\d|;)
It's quite big and poorly readable, but it works.
P.S. I think it’s a lot easier just to decode/hide the entities before processing and then put them back.
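One way to sidestep the lookarounds, in the spirit of the P.S. above, is to let the pattern consume whole entities first and capture only the remaining digit runs. This is a different technique from the regexes above, sketched under the assumption that numeric entities always look like &#digits; and the function name is made up.

```php
<?php
// Hypothetical helper: returns digit runs that are not part of a
// numeric HTML entity such as &#10001;
function plainNumbers(string $text): array
{
    // The first alternative swallows an entire entity (no capture group);
    // the second alternative captures an ordinary digit run.
    preg_match_all('/&#\d+;|(\d+)/', $text, $matches);

    // Entity matches leave an empty string in group 1; drop those.
    $numbers = array_filter($matches[1], function ($s) {
        return $s !== '' && $s !== null;
    });

    // Reindex so the result is a plain list.
    return array_values($numbers);
}

print_r(plainNumbers('&#10001; #555 foo 777;'));
```

Since the entity alternative comes first, the digits inside &#10001; are already consumed by the time \d+ gets a chance, so no lookbehind is needed.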

problem with regex

I have this code, but it is not working as I expect.
If I enter #$% or textwithouterrors, the message shown is "correct" in both cases, so the regex is not working.
I need to reject special characters, spaces and numbers.
function validateCity($form) {
    if(preg_match("/[\~!\@#\$\%\^\&*()_+\=-[]{}\\|\'\"\;\:\/\?.>\,\<`]/", $form['city'])) {
        echo ("error");
        return false;
    }
    else {
        echo("correct");
        return true;
    }
}
validateCity($form);
thanks
There are lots of problems. The hyphen (-) should be moved to the start or end of the class, or escaped; otherwise it is read as indicating a range. The [ and ] have to be escaped.
Try out what you want in some place like http://gskinner.com/RegExr/
Also, you are including a lot of stuff in it. Just use something like \w+ to match a valid value rather than looking for an invalid one.
Try reversing your logic. Look for the characters you want, not the ones you don't want.
There are a couple of issues going on here. The most serious one is that you have syntax errors in your regex: So far, I've noticed [, ], and - all unescaped in your character class. I'm a little surprised the regex engine isn't erroring out from those, since they technically lead to undefined behavior, but PHP tends to be pretty tolerant of such things. Either way, it isn't doing what you think it is.
Before worrying about that, address the second issue: You're blacklisting characters, but you should just use a whitelist instead. That will simplify your pattern considerably, and you won't have to worry about crazy characters like ▲ slipping past your regex.
If you're trying to match cities, I'd go with something like this:
if (preg_match("/[^\p{L}\s-]/u", $form['city'])) {
    echo ("error");
    return false;
}
//etc...
That will allow letters, dashes (think Winston-Salem, NC), and whitespace (think New Haven, CT), while blocking everything else. This might be too restrictive, I don't know; anyone who knows of a town with a number in the name is welcome to comment. However, the \p{L} should match unicode letters, so Āhualoa, HI should work.
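To see the whitelist idea against the inputs from the question, here is a small sketch. It assumes UTF-8 input and uses the u modifier so that \p{L} is applied to whole characters rather than bytes. The name isValidCity is hypothetical, and rejecting the empty string is an extra assumption.

```php
<?php
// Hypothetical helper, assuming UTF-8 input.
function isValidCity(string $city): bool
{
    // Reject empty input, then reject anything that contains a character
    // which is not a letter, whitespace or a hyphen.
    // preg_match returns 0 when no forbidden character is found.
    return $city !== '' && preg_match('/[^\p{L}\s-]/u', $city) === 0;
}

var_dump(isValidCity('textwithouterrors'));
var_dump(isValidCity('#$%'));
var_dump(isValidCity('Winston-Salem'));
var_dump(isValidCity('Āhualoa'));
```

Note the inverted logic: the pattern looks for one forbidden character, so a result of 0 means the value passed.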
It seems like you want to check if your city name contains any non-letter characters. In that case you can simplify it to:
if (preg_match("/[^A-Z]/i", $form['city'])) {
The character set has an unescaped "]", which looks like the end of the set.
You need to escape [ and ] inside of your character class.

filter non-alphanumeric "repeating" characters

What's the best way to filter non-alphanumeric "repeating" characters?
I would rather not build a list of characters to check for. Is there a good regex for this that I can use in PHP?
Examples:
...........
*****************
!!!!!!!!
###########
------------------
~~~~~~~~~~~~~
Special case patterns:
=*=*=*=*=*=
->->->->
Based on @sln's answer:
$str = preg_replace('~([^0-9a-zA-Z])\1+|(?:=[*])+|(?:->)+~', '', $str);
The pattern could be something like this : s/([\W_]|=\*|->)\1+//g
or, if you want to replace by just a single instance: s/([\W_]|=\*|->)\1+/$1/g
edit ... any special sequence should probably come first in the alternation; in case you need to make something like == special, it then won't be grabbed by [\W_] first.
So something like s/(==>|=\*|->|[\W_])\1+/$1/g where special cases are first.
preg_replace('~\W+~', '', $str);
sln's solution is pretty good, but the \W "non-word" class includes whitespace. I don't think you want to be removing sequences of tabs or spaces! Using a negated class (something like '[^A-Za-z0-9\s]') would work better.
This will filter out all symbols (ereg_replace was removed in PHP 7, so preg_replace is used here):
$q = preg_replace("/[^A-Za-z0-9 ]/", "", $q);
replace(/([^A-Za-z0-9\s]+)\1+/, "")
will remove repeated patterns of non-alphanumeric non-whitespace strings.
However, this is a bad practice because you'll also be removing all non-ASCII European and other international language characters in the Unicode base.
The only place where you really won't ever care about internationalization is in processing source code, but then you are not handling text quoted in strings and you may also accidentally de-comment a block.
You may want to be more restrictive in what you try to remove by giving a list of characters to replace instead of the catch-all.
Edit: I have done similar things before when trying to process early-version ShoutCAST radio names. At that time, stations tried to call attention to themselves by having obnoxious names like: <<!!!!--- GREAT MUSIC STATION ---!!!!>>. I used similar coding to get rid of repeated symbols, but then learnt (the hard way) to be careful about what I eventually remove.
This works for me:
preg_replace('/(.)\1{3,}/i', '', $sourceStr);
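It removes any character that repeats four or more times in a row (the captured character plus at least three more copies). Note that . matches letters and digits too, so runs such as aaaa are removed as well.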
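Pulling the suggestions above together, a Unicode-aware variant might look like the following sketch. It assumes UTF-8 input, keeps letters and digits of any script, leaves whitespace alone, and lists the special sequences first in the alternation as advised above. The function name and the choice of special sequences are illustrative only.

```php
<?php
// Hypothetical helper, assuming UTF-8 input.
function stripRepeatedSymbols(string $str): string
{
    // Group 1 is either a special two-character sequence (=* or ->) or a
    // single character that is not a letter, digit or whitespace.
    // \1+ then requires that same text to repeat at least once more.
    // If preg_replace fails (e.g. invalid UTF-8), fall back to the input.
    return preg_replace('~(=\*|->|[^\p{L}\p{N}\s])\1+~u', '', $str) ?? $str;
}

echo stripRepeatedSymbols('<<!!!!--- GREAT MUSIC STATION ---!!!!>>'), "\n";
echo stripRepeatedSymbols('Árpád!!! said a->b'), "\n";
```

A single symbol or a single -> is left untouched because \1+ needs at least one repetition. Replacing with '$1' instead of '' would keep one instance of each run.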

Replacing [[wiki:Title]] with a link to my wiki

I'm looking for a simple replacement of [[wiki:Title]] with a link whose text is Title.
So far, I have:
$text = preg_replace("/\[\[wiki:(\w+)\]\]/","\\1", $text);
The above works for single words, but I'm trying to include spaces and on occasion special characters.
I get the \w+, but \w\s+ and/or \.+ aren't doing anything.
Could someone improve my understanding of basic regex? And I don't mean for anyone to simply point me to a webpage.
\w\s+ means "a word-character, followed by 1 or more spaces". You probably meant (\w|\s)+ ("1 or more of a word character or a space character").
\.+ means "one or more dots". You probably meant .+ (1 or more of any character - except newlines, unless in single-line mode).
The more robust way is to use
\[\[wiki:(.+?)\]\]
This means "1 or more of any character, but stop at the first position where the rest matches", i.e. stop at the first ]] in this case. Without the ? it would look for the longest available match, i.e. run past the first ]] to the last one on the line.
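The difference between the lazy and the greedy quantifier can be seen in a short sketch, using the double-bracket form from the question. The wikiLinks function and its '/wiki/' base URL are assumptions made up for the example, since the question does not show the real link target.

```php
<?php
$text = 'See [[wiki:Main Page]] and [[wiki:FAQ]].';

// Lazy: each capture stops at the first following "]]".
preg_match_all('/\[\[wiki:(.+?)\]\]/', $text, $lazy);
print_r($lazy[1]);

// Greedy: a single capture runs on to the last "]]" on the line.
preg_match_all('/\[\[wiki:(.+)\]\]/', $text, $greedy);
print_r($greedy[1]);

// Hypothetical link builder; '/wiki/' is a made-up base URL.
function wikiLinks(string $text, string $baseUrl = '/wiki/'): string
{
    return preg_replace_callback(
        '/\[\[wiki:(.+?)\]\]/',
        function (array $m) use ($baseUrl): string {
            // Encode the title for the URL and escape it for HTML output.
            $href = $baseUrl . rawurlencode($m[1]);
            return '<a href="' . htmlspecialchars($href) . '">'
                 . htmlspecialchars($m[1]) . '</a>';
        },
        $text
    ) ?? $text; // fall back to the input if the regex engine reports an error
}

echo wikiLinks($text), "\n";
```

Using a callback rather than a plain replacement string leaves room to encode the title separately for the URL and for the visible link text, which matters once titles contain spaces or special characters.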
You need to use \[\[wiki:([\w\s]+)\]\]. Notice square brackets around \w\s.
If you are learning regular expressions, you will find this site useful for testing: http://rexv.org/
You're definitely getting there, but you've got a couple syntax errors.
When you're combining multiple character classes like \w and \s so that either can match at each position, you have to put them in [square brackets], like so: ([\w\s]+). This basically means one or more word characters or whitespace characters.
Putting a backslash in front of the period escapes it, meaning the regex is searching for a period.
As for matching special characters, that's more of a pain. I tried to come up with something quickly, but hopefully someone else can help you with that.
(Great cheat sheet here, I keep a copy on my desk at all times: http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/ )
