Regex diacritics problem - php

I am trying to validate some user inputs, but my regex fails when it encounters diacritics. I am talking about characters like ăĂ and so on.
What should I add to the regex code so it should also validate diacritics from within inputs?
Thank you!
P.S.: If it matters, I am using PHP with CakePHP framework.
This is the piece of code I am currently using for validating user input: return preg_match('|^[0-9a-zA-Z_-\s]*$|', $value);

Assuming you want to match letters, then allowing Unicode letters should help:
Use /\p{L}+/u for example if you want to match a sequence of letters. Don't forget the /u (Unicode) modifier.
In your case:
return preg_match('|^[0-9\p{L}_\s-]*$|u', $value);
should work.
As an aside, it's probably not a good idea to use | as a regex delimiter. For the current regex / would do just fine; other alternatives are ~ or # because they seldom occur in text and don't have any special meaning in regexes.
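A minimal sketch of the suggestion above, using / as the delimiter. The function name isValidInput is made up for illustration, and the example assumes the source file and the input are both UTF-8 encoded.

```php
<?php
// Hypothetical helper name; the pattern is the one from the answer,
// written with "/" as the delimiter instead of "|".
function isValidInput(string $value): bool
{
    // \p{L} matches any Unicode letter; the u modifier makes PCRE treat
    // both the pattern and the subject as UTF-8.
    return preg_match('/^[0-9\p{L}_\s-]*$/u', $value) === 1;
}

var_dump(isValidInput('ăĂ test_1')); // bool(true)
var_dump(isValidInput('abc!'));      // bool(false)
```

Comparing against 1 also treats the false that preg_match returns for invalid UTF-8 input as a failed validation.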

Related

Matching only Arabic and English Alphanumeric with one space allowed only

I have got a forum site and I am currently working on the final piece, the registration form and I want to validate the username. It should only contain Arabic and English alphanumerics and a maximum of one space between words.
I've got the english alphanumeric part working but not the Arabic nor the double spaces.
I am using the preg_match() function to match the username input with the RegEX.
What I currently have:
!preg_match('/\p{Arabic}/', $username) && !preg_match('/^[A-Za-z0-9]$/')
//this is currently inside an if statement, so if they both don't match then it is false.
You should put the unicode properties inside your regular regex because this can all be done with 1 regex. You also need to quantify that character class otherwise you only allow 1 character. This regex should do it.
^[\p{Arabic}a-zA-Z\p{N}]+\h?[\p{N}\p{Arabic}a-zA-Z]*$
Use the u modifier in PHP so unicode works as expected.
PHP Usage:
preg_match('/^[\p{Arabic}a-zA-Z\p{N}]+\h?[\p{N}\p{Arabic}a-zA-Z]*$/u', $string);
Demo: https://regex101.com/r/fsRchS/2/
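A short sketch of how that pattern behaves on a few inputs. The function name isValidUsername is hypothetical, and the sample usernames are made up; the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper; the pattern is copied from the answer above.
function isValidUsername(string $username): bool
{
    $pattern = '/^[\p{Arabic}a-zA-Z\p{N}]+\h?[\p{N}\p{Arabic}a-zA-Z]*$/u';
    return preg_match($pattern, $username) === 1;
}

var_dump(isValidUsername('محمد Ali'));  // bool(true): a single space
var_dump(isValidUsername('محمد  Ali')); // bool(false): two consecutive spaces
var_dump(isValidUsername('user_1'));    // bool(false): underscore is not in the class
var_dump(isValidUsername('a b c'));     // bool(false): only one space is allowed in total
```

Note that the pattern allows at most one space in the whole name, so names with three or more words are rejected.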

Regex blocking special characters

I'm using PHP Version 5.3.27
I'm trying to get my regex to match whitespace, and special characters such as ♦◘•♠♥☻, the other known special characters which are %$#&*# are already matched, but somehow the ones I mentioned before are not matched..
Current regex
preg_match('/^[a-zA-Z0-9[:space:]]+$/', $login)
My apology for asking two questions on the same subject. I hope this one is clear enough for you.
use this
[\W]+
will match any non-word character.
Your regex doesn't contain any reference to the special characters mentioned. You would need to include them in the character class for them to be matched.
To match those kinds of special characters you can use the unicode values.
Example:
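\x{0000}-\x{FFFF}
\x00-\xFF
The top form addresses Unicode code points and needs the u modifier (PCRE does not accept the \uFFFF notation); the bottom covers only the values 0 to 255.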
Refer to a UTF-8/16 character table online to match up your symbols with their unicode values, then create a range to keep your expression short.
You can use the \p{S} character class (or \p{So}) that matches symbol characters (that includes this kind of characters: ╭₠☪♛♣♞♉♆☯♫):
preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login)
To find more possibilities you can check the pcre documentation at: http://www.pcre.org/pcre.txt
If you need to be more precise, the best way is to use character ranges in the character class. You can find code of characters here.
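A small sketch of the \p{S} approach, assuming UTF-8 input and a UTF-8 source file. The variable names are made up. It also shows one caveat: some of the characters from the question, such as •, are classified as punctuation rather than symbols, so \p{P} may be needed as well.

```php
<?php
// Pattern from the answer: letters, digits, horizontal space, symbols.
$symbolsOnly = '/^[a-zA-Z0-9\h\p{S}]+$/u';
// Variant that additionally accepts punctuation characters.
$symbolsAndPunct = '/^[a-zA-Z0-9\h\p{S}\p{P}]+$/u';

var_dump(preg_match($symbolsOnly, 'user ♠♥'));   // int(1): ♠ and ♥ are symbols
var_dump(preg_match($symbolsOnly, 'user•'));     // int(0): • is punctuation, not a symbol
var_dump(preg_match($symbolsAndPunct, 'user•')); // int(1)
```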

JavaScript/PHP Regular Expression

I'm trying to match first names and Lastname with something like this.
$pattern = '/[a-zA-Z\-]{3,30} +[a-zA-Z]+/';
This works great, except when I have a first name like this Mélissa Smith
My match becomes Lissa Smith
How do I match for all special characters like é
In JavaScript, you can add a Unicode char range to A-Za-z (the \u escape needs exactly four hex digits):
"Mélissa Smith".match( /[A-Za-z\u0080-\uFFFF-]{3,30} +[A-Za-z\u0080-\uFFFF]+/ )
equals: ["Mélissa Smith"]
Put the regex into Unicode mode with the /u modifier and use an appropriate Unicode character class instead of hardcoding just latin letters:
$pattern = '/^(\pL|-){3,30}\s+\pL+$/u';
I also anchored the pattern between ^ and $ because otherwise it could end up matching things you didn't intend it to.
You have to keep in mind that when you do this, the input (as well as the pattern itself) must be encoded in UTF-8.
However, it has to be said that naively parsing names like this is not going to give you very good results. People's full names are way too involved for something this simple to work across the board.
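As a quick check of the PHP pattern above, here is a sketch with made-up sample names. It assumes UTF-8 input in which accented letters are precomposed (a single code point for é); a decomposed e followed by a combining accent would not match \pL.

```php
<?php
// Pattern from the answer: 3 to 30 letters or hyphens, whitespace, then letters.
$pattern = '/^(\pL|-){3,30}\s+\pL+$/u';

var_dump(preg_match($pattern, 'Mélissa Smith'));   // int(1)
var_dump(preg_match($pattern, 'Jean-Luc Picard')); // int(1)
var_dump(preg_match($pattern, 'Al Smith'));        // int(0): first name shorter than 3
```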
Try using the POSIX class [[:alpha:]-] instead of [a-zA-Z-] to catch the characters. Depending on the locale (or, in PHP, with the /u modifier), [[:alpha:]] will also catch accented letters.
http://www.regular-expressions.info/posixbrackets.html

What is the regular expression for space and alpha-numeric

I'm using an ajax check function to validate an inserted category name, which should contain only alphanumeric characters and spaces.
I've used the function eregi_replace with the following regular expression: [a-zA-Z0-9_]+
$check = eregi_replace('([a-zA-Z0-9_]+)', "", $catname);
But when I insert a category name such as hello world, it fails because it does not accept the space; if I write it as helloworld, it works. So I understood that the error must be in the regular expression I'm using.
So what is the correct regular expression that filters out any special characters and allows only alphanumerics and spaces?
Thanks a lot
A character class matching letters, numbers, the underscore and space would be
[\w ]
You should not be using any of the POSIX regular expression functions as they are now deprecated. Instead, use their superior counterparts from the PCRE suite.
Change your regular expression to:
([A-Za-z0-9_]+(?: +[A-Za-z0-9_]+)*)
I realize that it is not as straightforward as you might have hoped. Things to note:
The identifier must start with a non-space
If there are spaces, they should be between words and not matched at the end
?: is used to prevent an extra grouping in your expression, but is not required
The + after the space character allows multiple spaces between words. You can enforce a single space by removing it, but in some solutions, it is a better practice to normalize the space internally with a preg_split that matches on " +" (a space with a plus sign) and then use implode(" ", $array). But eh... if you are just validating, this should be fine.
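The normalize-then-validate idea from the last point could look roughly like this. The function name normalizeCategoryName is hypothetical, and returning null for invalid input is just one possible convention.

```php
<?php
// Hypothetical helper: collapse runs of spaces, then validate the result.
function normalizeCategoryName(string $name): ?string
{
    // Split on runs of spaces; PREG_SPLIT_NO_EMPTY drops the empty pieces
    // produced by leading or trailing spaces.
    $words = preg_split('/ +/', $name, -1, PREG_SPLIT_NO_EMPTY);
    $normalized = implode(' ', $words);

    // Same pattern as above, anchored so the whole string must match.
    if (preg_match('/^[A-Za-z0-9_]+(?: +[A-Za-z0-9_]+)*$/', $normalized) !== 1) {
        return null;
    }
    return $normalized;
}

var_dump(normalizeCategoryName('hello   world')); // string(11) "hello world"
var_dump(normalizeCategoryName('hello, world'));  // NULL
```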
you've got it nearly right, just add \s into your square brackets and "hello world" will pass.
([A-Za-z0-9_\s]+)
I've got some help by old friend and i've tested and works perfect - thank you all for answers and comments it was very helpful to me.
this works perfect
$check = eregi_replace('(^[a-zA-Z0-9 ]*$)', "", $catname);
Alphanumeric and white space regular expression
#Phil
yours works perfect but still will pass underscore ~ thanks
#Michael Hays
I do not know why it didn't work for whitespace, but your comments are very helpful ~ thanks
#kjetilh
I will read more about $preg ~ thanks
#Alastair
Works fine once I replaced \s with a literal space! ~ thanks
The ereg/eregi functions are deprecated as of PHP 5.3 (and removed in PHP 7.0). Use the preg functions instead.
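One way the final eregi_replace check might be ported to preg is sketched below. The function name isValidCategoryName is made up; the pattern is the asker's, with delimiters added and the i modifier standing in for the case-insensitive eregi call.

```php
<?php
// Hypothetical helper. preg_* functions need delimiters around the pattern;
// the i modifier makes the match case-insensitive, like eregi did.
function isValidCategoryName(string $catname): bool
{
    // As in the original pattern, * also accepts an empty string;
    // use + instead to require at least one character.
    return preg_match('/^[a-z0-9 ]*$/i', $catname) === 1;
}

var_dump(isValidCategoryName('hello world')); // bool(true)
var_dump(isValidCategoryName('Hello_World')); // bool(false): underscore is rejected
```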

Regex for description form input

a textarea is a part of my form. The user has to write a little text and I want to validate this text. For now I am using the following regex:
/^[0-9a-zA-ZäöüÄÖÜ_\-']+$/
Although I have mentioned the äöüÄÖÜ in the regex it handles all words with äöü.. as invalid. Furthermore it does not accept empty spaces.
Any ideas how to improve the regex?
Use a Unicode-aware regex:
/[\pL\pN_\-]+/u
The PCRE u modifier enables UTF-8 support. You are also missing a space from the regex, and you can condense it a bit:
/^[0-9a-zäöü\- ]+$/ui
Though I'm not sure if 'i' will work with the capitals of the foreign characters.
You may also want to include punctuation.
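On the doubt about the i modifier: a small sketch, assuming UTF-8 input and a UTF-8 source file, suggests that case-insensitive matching does extend to the umlauts when it is combined with u. The variable names and sample strings are made up.

```php
<?php
$withI = '/^[0-9a-zäöü\- ]+$/ui';
$withoutI = '/^[0-9a-zäöü\- ]+$/u';

var_dump(preg_match($withI, 'ÜBER ÖL 42'));    // int(1): Ü and Ö match ü and ö caselessly
var_dump(preg_match($withoutI, 'ÜBER ÖL 42')); // int(0): uppercase is not in the class
var_dump(preg_match($withI, 'Straße'));        // int(0): ß is not listed in the class
```

Letters outside the listed set, such as ß, still fail, which is one argument for \pL over an explicit list.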
First, you might have an encoding issue, that's why äöüÄÖÜ are registered as invalid. I'm not a PHP user, so I can't answer your question directly, but taking a look at this page might help you. Also, using appropriate character classes could work better than explicitly writing all appropriate letters. Alas, this is also probably encoding configuration dependent.
Second, you need a space in your regex, so
/^[0-9a-z A-ZäöüÄÖÜ_\-']+$/ // note space after a-z
should work. Note what I wrote in last paragraph about using character classes. \w might be sufficient instead of a-zA-ZäöüÄÖÜ
You may just use \w to indicate all "word" characters (letters, digits, and the underscore). With the u modifier, \w also covers letters such as äöü, so the regex will be
/^[\w\-' ]+$/u
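A brief sketch of the \w approach, with the u modifier added so that \w covers non-ASCII letters (this relies on PHP enabling Unicode character properties for \w under u, and on UTF-8 input). The sample strings are made up.

```php
<?php
// \w already includes the underscore, so it is not listed separately.
$pattern = '/^[\w\-\' ]+$/u';

var_dump(preg_match($pattern, "Das ist Jürgen's Text")); // int(1)
var_dump(preg_match($pattern, 'Hallo, Welt.'));          // int(0): comma and period are not allowed
```

For free-form description text, punctuation would probably have to be added to the class, for example with \p{P}.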
What text from the user are you considering to be "valid"?
