Accented character validation in PHP - php

What is a good one-line PHP regex for validating first/last name fields that may contain accented characters (in case someone's name is Pièrre)? I tried something like:
<?php
$strErrorMessage = null;
if(!preg_match('/\p{L}0-9\s-+/u', trim($_POST["firstname"])))
$strErrorMessage = "Your first name can only contain valid characters, ".
"spaces, minus signs, or numbers.";
?>
This tries to use Unicode validation, from this post, but doesn't work correctly. The solution seems pretty hard to google.

Setting aside the general difficulty of validating names, you need to put your characters into a character class. As written, /\p{L}0-9\s-+/u matches only a sequence like "Ä0-9 ------": one letter, the literal text 0-9, one whitespace character, and one or more hyphens. What you want is
/^[\p{L}0-9\s-]+$/u
I also added anchors, which ensure that the regex matches the complete string.
As ex3v mentioned, you should probably add \p{M} to that class so that it also matches combining marks. See Unicode properties.
/^[\p{L}\p{M}0-9\s-]+$/u
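As a minimal sketch of how the final pattern could be wrapped for reuse (the function name isValidName is hypothetical, and the input is assumed to be UTF-8):

```php
<?php
// Hypothetical helper: true when $name contains only letters, combining
// marks, ASCII digits, whitespace, or hyphens.
function isValidName(string $name): bool
{
    // preg_match returns 1 on a match, 0 on no match, and false on error
    // (for example, when $name is not valid UTF-8), so compare against 1.
    return preg_match('/^[\p{L}\p{M}0-9\s-]+$/u', trim($name)) === 1;
}

var_dump(isValidName('Pièrre'));        // bool(true)
var_dump(isValidName('Anne-Marie 2'));  // bool(true)
var_dump(isValidName('Rob<script>'));   // bool(false)
```

Comparing against 1 rather than relying on truthiness keeps a regex error from being silently treated as "no match".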

Related

Why is ctype_alnum unhelpful in matching culture-agnostic alphanumerics?

Let's suppose that I have some text in a variable called $text and I want to validate it, so that it can contain spaces, underscores, dots, and any letters or digits from any language. Since I am a total noob with regular expressions, I thought I could work around learning them, like this:
if (!ctype_alnum(str_replace(".", "", str_replace(" ", "", str_replace("_", "", $text))))) {
//invalid
}
This correctly considers the following inputs as valid:
foobarloremipsum
foobarloremipsu1m
foobarloremi psu1m
foobar._remi psu1m
So far, so good. But if I enter my name, Lajos Árpád, which contains non-English letters, then it is considered to be invalid.
Returns TRUE if every character in text is either a letter or a digit,
FALSE otherwise.
Source.
I suppose that a setting needs to be changed to allow non-English letters, but how can I use ctype_alnum to return true if and only if $text contains only letters or digits in a culture-agnostic fashion?
Alternatively, I am aware that some spooky regular expression can be used to resolve the issue, including things like \p{L} which is nice, but I am interested to know whether it is possible using ctype_alnum.
You need to use setlocale with category set to LC_CTYPE and the appropriate locale for the ctype_* family of functions to work on non-English characters.
Note that the locale that you're using with setlocale needs to actually be installed on the system, otherwise it won't work. The best way to remedy this situation is to use a portable solution, given in this answer to a similar question.
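The linked answer is not reproduced here. One locale-independent approach, which may differ from it, is a Unicode-aware regular expression. The sketch below assumes UTF-8 input; the function name isCultureAgnosticText is hypothetical.

```php
<?php
// Hypothetical helper: true when $text is non-empty and contains only
// Unicode letters, Unicode decimal digits, spaces, underscores, or dots.
function isCultureAgnosticText(string $text): bool
{
    // \p{L} = any letter, \p{Nd} = any decimal digit; the u modifier makes
    // PCRE treat pattern and subject as UTF-8, and D stops $ from matching
    // before a trailing newline.
    return preg_match('/^[\p{L}\p{Nd}_. ]+$/Du', $text) === 1;
}

var_dump(isCultureAgnosticText('foobar._remi psu1m')); // bool(true)
var_dump(isCultureAgnosticText('Lajos Árpád'));        // bool(true)
var_dump(isCultureAgnosticText('foo!bar'));            // bool(false)
```

Unlike the setlocale route, this does not depend on which locales are installed on the server.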

Match Polish characters in PHP with preg_match

I am trying to do some server side validation in PHP. I tried hard but I found still no solution. I am trying to allow only Polish characters in the input.
For this I have used:
preg_match('/^[\x{0104}-\x{017c}]*$/u',$titles)
This doesn't work however.
Anyone has any idea how to write it properly?
To match Polish letters only, you just need a character class:
[a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]
Use as
preg_match('/^[A-PR-UWY-ZĄĆĘŁŃÓŚŹŻ]*$/iu',$titles)
Note that the Polish alphabet has no Q, V, or X, but since these letters occur in some loanwords (taxi), you may want to allow them as well. In that case, use the regex '/^[A-ZĄĆĘŁŃÓŚŹŻ]*$/iu'.
IDEONE demo
if (preg_match('/^[A-PR-UWY-ZĄĆĘŁŃÓŚŹŻ]*$/iu', "spółka")) {
echo "The whole string contains only Polish letters";
}
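A likely reason the pattern in the question fails: the range \x{0104}-\x{017c} runs from Ą to ż, so it leaves out the ASCII letters and ó/Ó (U+00F3/U+00D3), while admitting non-Polish letters such as č (U+010D). The sketch below compares the two patterns on a few sample strings; it assumes the file is saved as UTF-8.

```php
<?php
$original = '/^[\x{0104}-\x{017c}]*$/u';        // pattern from the question
$fixed    = '/^[A-PR-UWY-ZĄĆĘŁŃÓŚŹŻ]*$/iu';     // explicit Polish alphabet

foreach (['spółka', 'żółć', 'č'] as $word) {
    printf(
        "%s: original=%d fixed=%d\n",
        $word,
        preg_match($original, $word),
        preg_match($fixed, $word)
    );
}
// spółka: original=0 fixed=1
// żółć: original=0 fixed=1
// č: original=1 fixed=0
```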

Regex to replace punctuation

I've been trying for a few hours to get this to work to the effect I need but nothing works quite like it should. I'm building a discussion board type thing and have made a way to tag other users by putting #username in the post text.
Currently I have this code to strip anything that wouldn't be part of the username once the tags have already been pulled out of the entire text:
$name= preg_replace("/[^A-Za-z0-9_]/",'',$name);
This works well because it correctly captures names written as, for example, (#username), #username:, or #username, some text (that is, it removes the ",", ":", and ")").
However, this does not work when the user has non-ASCII characters in their username. For example, if it's #üsername, the line above gives sername, which is not useful.
Is there a way to use preg_replace to still strip this additional punctuation but retain any non-ASCII letters?
Any help is much appreciated :)
You enter the area of Unicode Regexps.
$name = preg_replace('/[^\p{L}\p{N}_]/u', '', $name);
or the other way round. The link I provided contains more examples.
To strip punctuation characters directly, you can use the Unicode property \p{P} (with the u modifier, so that the subject is treated as UTF-8):
$name = preg_replace('/[\p{P} ]+/u', '', $name);
RegEx Demo
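A minimal sketch of the first approach as a reusable function (the name cleanTagName is hypothetical; input is assumed to be UTF-8):

```php
<?php
// Hypothetical helper: strips everything except Unicode letters,
// Unicode numbers, and underscores from an extracted tag.
function cleanTagName(string $raw): string
{
    $clean = preg_replace('/[^\p{L}\p{N}_]+/u', '', $raw);

    // preg_replace returns null on error (e.g. invalid UTF-8 input).
    return $clean ?? '';
}

var_dump(cleanTagName('(#üsername),')); // string "üsername"
var_dump(cleanTagName('#user_name:'));  // string "user_name"
```

The allow-list form keeps underscores, which \p{P} would remove because the underscore is classified as punctuation.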

Space validation in full name field

I want to validate a full-name field so that it accepts a space between the first and last name. I am using the strrpos() function for this, but it is not working.
You could use a regex...
if (preg_match("/(.+)( )(.+)/", $full_name))
{
// reached when the name contains a space with at least one character on each side
}
For stricter validation you can use \w, although keep in mind that by default it matches only ASCII word characters (letters, digits, and underscore). See here for more info: Why does \w match only English words in javascript regex?
preg_match("/(\w+)( )(\w+)/", $full_name)
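If names with accented letters must pass, one option is to combine this idea with Unicode properties and anchors. The following is a sketch only: the function name hasFirstAndLastName and the choice to allow hyphens are assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: true when $fullName consists of two or more words
// separated by single spaces, each made of letters, combining marks, or
// hyphens.
function hasFirstAndLastName(string $fullName): bool
{
    $pattern = '/^[\p{L}\p{M}-]+(?: [\p{L}\p{M}-]+)+$/u';

    return preg_match($pattern, trim($fullName)) === 1;
}

var_dump(hasFirstAndLastName('Lajos Árpád'));     // bool(true)
var_dump(hasFirstAndLastName('Jean-Luc Picard')); // bool(true)
var_dump(hasFirstAndLastName('Madonna'));         // bool(false)
```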

Regular expression for e-mail domain (not basic e-mail verification)

I'm currently using
if(preg_match('~@(semo\.edu|uni\.uu\.se|)$~', $email))
as a domain check.
However, I only need to check whether the e-mail ends with one of the domains above. So for instance, all of these need to be accepted:
hello@semo.edu
hello@student.semo.edu
hello@cool.teachers.semo.edu
So I'm guessing I need something after the @ but before the ( which is something like "any random string or empty string". Any regexp-ninjas out there who can help me?
([^@]*\.)? works if you already know you're dealing with a valid email address. Explanation: it's either empty, or anything that ends with a period but does not contain an @ sign. So student.cs.semo.edu matches, as does plain semo.edu, but not me@notreallysemo.edu. So:
~@([^@]*\.)?(semo\.edu|uni\.uu\.se)$~
Note that I've removed the last | from your original regex.
You can use [a-zA-Z0-9\.]* to match zero or more characters (letters, numbers or dot):
~@[a-zA-Z0-9\.]*(semo\.edu|uni\.uu\.se)$~
Well, .* will match anything, but you don't actually want that. There are a number of characters that are invalid in a domain name (e.g. a space). Instead you want something more like this:
[\w.]*
I might not have all of the allowed characters, but that will get you [A-Za-z0-9_.]. The idea is that you list all the allowed characters inside the square brackets and then use * to say zero or more of them.
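A minimal sketch of the optional-subdomain pattern from the first answer, assuming ordinary addresses in which @ separates the local part from the domain. The function name hasAllowedDomain and the added i modifier are assumptions.

```php
<?php
// Hypothetical helper: true when the address ends in one of the allowed
// domains or in any subdomain of them.
function hasAllowedDomain(string $email): bool
{
    // ([^@]*\.)? is either empty or a run without "@" that ends in a dot,
    // so "notreallysemo.edu" cannot pass as "semo.edu".
    // The i modifier is added here because domain names are case-insensitive.
    $pattern = '~@([^@]*\.)?(semo\.edu|uni\.uu\.se)$~i';

    return preg_match($pattern, $email) === 1;
}

var_dump(hasAllowedDomain('hello@semo.edu'));               // bool(true)
var_dump(hasAllowedDomain('hello@cool.teachers.semo.edu')); // bool(true)
var_dump(hasAllowedDomain('me@notreallysemo.edu'));         // bool(false)
```

This only checks the domain suffix; it does not validate the address as a whole.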
