Regular expressions, allow specific format only. "John-doe" - php

I've researched a little, but I found nothing that relates exactly to what I need, and whenever I tried to create the expression it was always a little off from what I require.
I attempted something along the lines of [AZaz09]{3,8}\-[AZaz09]{3,8}.
I want the valid result to only allow text-text, where either text can be alphabetic or numeric; the only symbol allowed is -, and it must sit between the two texts.
Each text must be at least three characters long ({3,8}?), then separated by the -.
Therefore for it to be valid some examples could be:
Text-Text
Abc-123
123-Abc
A2C-def4gk
Invalid tests could be:
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%

You need to use anchors, and use the - inside the character class so that its characters are read as ranges (A-Z, a-z, 0-9), not as individual characters.
Try:
^[A-Za-z0-9]{3,8}-[A-Za-z0-9]{3,8}$
Demo: https://regex101.com/r/xH3oM8/1
You could also simplify it a bit with the i modifier and the \d meta character.
(?i)^[a-z\d]{3,8}-[a-z\d]{3,8}$
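As a minimal sketch of how the anchored pattern could be used in PHP, the snippet below runs it over the question's sample strings. The helper name isTextDashText is hypothetical, and the D modifier is an addition not present in the answer: in PCRE, a plain $ also matches just before a final newline, and D turns that off.

```php
<?php
// Hypothetical helper; the pattern is the anchored one from the answer above.
function isTextDashText(string $value): bool
{
    // ^ and $ tie the match to the whole string; {3,8} bounds each part.
    // i = case-insensitive, D = $ matches only at the very end of the string.
    return preg_match('/^[a-z\d]{3,8}-[a-z\d]{3,8}$/iD', $value) === 1;
}

$samples = ['Text-Text', 'Abc-123', '123-Abc', 'A2C-def4gk',
            'Ab-3', 'Abc!-ajr4', 'a-bc3-25aj', 'a?c-b%'];

foreach ($samples as $sample) {
    echo $sample, ' => ', isTextDashText($sample) ? 'valid' : 'invalid', PHP_EOL;
}
```

The first four samples should come out as valid and the last four as invalid, matching the lists in the question.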

If accented letters should be allowed, or any other letter that exists in the Unicode range (like Greek or Cyrillic letters), then use the u modifier (for UTF-8 support) and \pL to match Unicode letters (and \d for digits):
$string ="
Mañana-déjà
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='/^[\pL\d]{3,}-[\pL\d]{3,}$/mu';
preg_match_all($regex, $string, $matches);
var_export($matches);
Output:
array (
0 =>
array (
0 => 'Mañana-déjà',
1 => 'Text-Text',
2 => 'Abc-123',
3 => '123-Abc',
4 => 'A2C-def4gk',
),
)
NB: the difference with \w is that [\pL\d] will not match an underscore.
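The underscore difference can be checked with a small sketch. The sample value foo_bar-baz is made up for illustration and is not one of the question's test strings.

```php
<?php
// Illustrative value containing an underscore (not from the question).
$value = 'foo_bar-baz';

// \w accepts letters, digits and the underscore.
$wordResult = preg_match('/^\w{3,}-\w{3,}$/u', $value);

// [\pL\d] accepts letters and digits only, so the underscore breaks the match.
$letterResult = preg_match('/^[\pL\d]{3,}-[\pL\d]{3,}$/u', $value);

var_dump($wordResult);   // int(1)
var_dump($letterResult); // int(0)
```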

You could come up with the following:
<?php
$string ="
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='~
^\w{3,} # at least three word characters at the beginning of the line
- # a dash
\w{3,}$ # at least three word characters at the end of the line
~xm'; # multiline and freespacing mode (for this explanation)
# ~xmu for accented characters
preg_match_all($regex, $string, $matches);
print_r($matches);
?>
As @chris85 pointed out, \w will match an underscore as well. Trincot also made a good point about matching accented characters; to support them, simply add the u modifier.
See a demo on regex101.com and a complete code on ideone.com.

You can use this regex
^\w{3,}-\w{3,}$
^ // start of the string
\w{3,} // match "a" to "z", "A" to "Z", 0 to 9 and the underscore; requires at least 3 characters
- // requires "-"
\w{3,} // same as above
$ // end of the string

And a short one.
^([^\W_]{3,8})-(?1)$
[^\W_] can be used as short for alnum. It subtracts the underscore from \w
(?1) is a subroutine call to the pattern in first group
Demo at regex101
My vote goes to @chris85's answer, which is the most obvious and performant.
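A subroutine call is easy to confuse with a backreference, so here is a small comparison sketch. The sample values are chosen for illustration: (?1) re-applies the pattern of group 1, whereas \1 would require the same text again.

```php
<?php
$subroutine    = '/^([^\W_]{3,8})-(?1)$/'; // second part: same pattern, any text
$backreference = '/^([^\W_]{3,8})-\1$/';   // second part: same text as the first

$results = [];
foreach (['Abc-123', 'Abc-Abc', 'Abc-1_3'] as $value) {
    $results[$value] = [
        'subroutine'    => preg_match($subroutine, $value),
        'backreference' => preg_match($backreference, $value),
    ];
}

print_r($results);
```

Only the subroutine version accepts Abc-123; both reject Abc-1_3 because the underscore is excluded from [^\W_].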

This one
^([\w]{3,8}-[\w]{3,8})$
https://regex101.com/r/uS8nB5/1

Related

regex capture certain characters only

I'm currently dealing with a bit of a problem. This is my string: "all-days".
I need some assistance creating a regex to capture the first character, the dash, and the first character after the dash. I'm a bit of a newbie to regex, so forgive me.
Here is what I've got so far: (^.)
"capture the first character, the dash and also the first character after the dash"
With preg_match function:
$s = "all-days";
preg_match('/^(.)[^-]*(-)(.)/', $s, $m);
unset($m[0]);
print_r($m);
The output:
Array
(
[1] => a
[2] => -
[3] => d
)
It's not regex, but if you just want a solution, it can also be achieved with explode, array_walk and implode:
$string = 'all-days-with-my-style';
$arr = explode("-",$string);
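// array_walk modifies $arr in place (it only returns a bool)
array_walk($arr, function (&$a) {
    $a = $a[0]; // keep only the first character of each part
});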
echo implode("-",$arr);
Live demo : https://eval.in/882846
Output is : a-d-w-m-s
I assume your string only contains word characters and hyphens, and doesn't have consecutive hyphens.
To remove everything that isn't the first character, the hyphens, or the first character after them, remove every run of word characters that doesn't start at a word boundary:
$result = preg_replace('~\B\w+~', '', 'all-days');
If you only want to match these characters, just catch each character after a word boundary:
if ( preg_match_all('~\b.~', 'all-days', $matches) )
print_r($matches[0]);
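Both word-boundary approaches can be tried side by side. This is a minimal sketch; the longer second string is borrowed from the explode example above.

```php
<?php
$input = 'all-days';

// \B\w+ matches word characters that do NOT start at a word boundary,
// i.e. everything after the first letter of each part.
$stripped = preg_replace('~\B\w+~', '', $input);

// \b. matches whatever single character follows a word boundary.
preg_match_all('~\b.~', $input, $matches);

echo $stripped, PHP_EOL;                    // a-d
echo implode(',', $matches[0]), PHP_EOL;    // a,-,d

$longer = preg_replace('~\B\w+~', '', 'all-days-with-my-style');
echo $longer, PHP_EOL;                      // a-d-w-m-s
```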
Code
\b(\w|-\b)
For more precision, the following can be used (note that it uses Unicode groups, so it doesn't work in every language, but it does in PHP). This will only match letters, not numbers and underscores. It uses a negative lookbehind and positive lookahead, but you can understand it if you keep reading this article and break it apart one piece at a time.
(\b\p{L}|(?<=\p{L})-(?=\p{L}))
Explanation
\b Assert position at a word boundary
(\w|-\b) Capture the following into capture group 1
\w Match any word character
| Or
- Match the - character literally
\b Assert position at a word boundary
\b:
Asserts the position in the string matches 1 of the following:
^\w Assert position at the start of the string and match a word character
\w$ Match a word character and assert its position as the last position in the string
\W\w Match any non-word character, followed by a word character
\w\W Match any word character, followed by a non-word character
\w:
Means a word character (usually defined by any character in the set a-zA-Z0-9_, however, some languages also accept Unicode characters that represent any letter, number, or underscore \p{L}\p{N}_).
For more precision (depending on the use-case), you can specify [a-zA-Z] (for ASCII letters), \p{L} for Unicode letters, or [a-z] with the i flag for ASCII characters with the case-insensitive flag enabled in regex.
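To see how the two patterns from this answer differ, the sketch below runs both on an input where a digit follows the hyphen. The input all-9days is an assumption made for illustration and is not taken from the question.

```php
<?php
// Illustrative input: a digit directly follows the hyphen.
$input = 'all-9days';

// Word-character version: a word character at a word boundary, or a hyphen
// that is immediately followed by a word boundary.
preg_match_all('~\b(\w|-\b)~', $input, $wordMatches);

// Letter-only version; the u modifier treats pattern and subject as UTF-8.
preg_match_all('~(\b\p{L}|(?<=\p{L})-(?=\p{L}))~u', $input, $letterMatches);

echo implode(',', $wordMatches[0]), PHP_EOL;   // a,-,9
echo implode(',', $letterMatches[0]), PHP_EOL; // a
```

The letter-only version skips the hyphen because the lookahead requires a letter after it, and it skips d because there is no word boundary between 9 and d.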

Regex Preg_match for licence key 25 alphanumeric and 4 hyphens

I'm still trying to get to grips with regex patterns and just after a little double-checking if someone wouldn't mind obliging!
I have a string which should either contain:
A 10-character (numbers and letters) licence key, for example: 1234567890 OR
A 25-character (numbers and letters) licence key, for example: ABCD1EFGH2IJKL3MNOP4QRST5 OR
A 29-character licence number (25 numbers and letters, separated into 5 groups by hyphens), for example: ABCD1-EFGH2-IJKL3-MNOP4-QRST5
I can match the first two fine, using ctype_alnum and strlen functions. However, for the last one I think I'll need to use regex and preg_match.
I had a go over at regex101.com and came up with the following:
preg_match('^([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})+-([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})', $str);
Which seems to match what I'm looking for.
I want the string to only contain an exact match for a string beginning with the licence number, and contain nothing other than mixed upper/lower case letters and numbers in any order and hyphens between each group of 5 characters (so a total of 29 characters - I don't want any further matches). No white space, no other characters and nothing else before or after the 29 digit key.
Will the above work, without allowing any other combinations? Will it stop checking at 29 characters? I'm not sure if there is a simpler way to express this in regex?
Thanks for your time!
The main point is that you need to use both ^ (start of string) and $ (end of string) anchors. Also, when you use + after (...), you allow 1 or more repetitions of the whole subpattern inside the (...). So, you need to remove the +s and add the $ anchor. Also, you need regex delimiters for your regex to work in PHP preg_match. I prefer ~ so as not to escape /. Maybe it is not the case here, but this is a habit.
So, the regex can look like
'~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}$~'
See the regex demo
The (?:-[A-Za-z0-9]{5}){4} matches 4 occurrences of -[A-Za-z0-9]{5} subpattern. The (?:...) is a non-capturing group whose matched text does not get stored in any buffer (unlike the capturing group).
See the IDEONE demo:
$re = '~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}$~';
$str = "ABCD1-EFGH2-IJKL3-MNOP4-QRST5";
if (preg_match($re, $str, $matches)) {
echo "Matched!";
}
How about:
preg_match('/^([a-z0-9]{5})(?:-(?1)){4}$/i', $str);
Explanation:
/ : regex delimiter
^ : beginning of string
( : begin group 1
[a-z0-9]{5} : exactly 5 alphanum.
) : end of group 1
(?: : begin NON capture group
- : a dash
(?1) : same as definition in group 1 (ie. [a-z0-9]{5})
){4} : this group must be repeated 4 times
$ : end of string
/i : regex delimiter with case insensitive modifier
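Putting the pieces together, one possible way to validate all three key formats from the question is sketched below. The function name isValidLicenceKey is hypothetical, and the D modifier is an addition not used in the answers: it stops $ from also matching before a trailing newline.

```php
<?php
// Hypothetical helper covering the three formats described in the question.
function isValidLicenceKey(string $key): bool
{
    $length = strlen($key);

    // 10- or 25-character keys: letters and digits only.
    if ($length === 10 || $length === 25) {
        return ctype_alnum($key);
    }

    // 29-character keys: five groups of five alphanumerics joined by hyphens.
    return preg_match('~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}$~D', $key) === 1;
}

var_dump(isValidLicenceKey('1234567890'));                     // bool(true)
var_dump(isValidLicenceKey('ABCD1-EFGH2-IJKL3-MNOP4-QRST5'));  // bool(true)
var_dump(isValidLicenceKey('ABCD1--EFGH2-IJKL3-MNOP4-QRST5')); // bool(false)
```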

PHP regex for matching ALL special characters, included accented characters

I am looking for a way to match all the possible special characters in a string.
I have a list of cities in the world and many of the names of those cities contain special characters and accented characters. So I am looking for a regular expression that will return TRUE for any kind of special characters.
All the ones I found only match some, but I need one for every possible special character out there, spaces at the beginning of the string included.
Is this possible?
This is the one I found, but does not match all the different and possible characters I may encounter in the name of a city:
preg_match('/[#$%^&*()+=\-\[\]\';,.\/{}|":<>?~\\\\]/', $string);
You're going to need the UTF8 mode "#pattern#u": http://nl3.php.net/manual/en/reference.pcre.pattern.modifiers.php
Then you can use the Unicode escape sequences: http://nl3.php.net/manual/en/regexp.reference.unicode.php
So that preg_match("#\p{L}*#u", "København", $match) will match.
Use unicode properties:
\pL stands for any letter
To match a city name, I'd do the following (assuming hyphens and spaces are valid characters; the hyphen goes last in the class so that it is read literally):
preg_match('/^[\pL\s-]+$/u', $string);
You can just reverse your pattern: to match everything that is not a letter, a digit, "-", "_" or ".", you would use
preg_match('/[^-_a-z0-9.]/iu', $string);
The ^ in the character class reverses it.
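The Unicode property and the negated class can be combined. The sketch below assumes that letters of any script, whitespace and hyphens are the only characters a city name may contain; the helper name and the sample names are hypothetical.

```php
<?php
// Hypothetical helper: true when the name contains anything other than
// Unicode letters, whitespace or hyphens.
function hasUnexpectedCharacters(string $city): bool
{
    // The hyphen is placed last in the class so it is read literally.
    return preg_match('/[^\p{L}\s-]/u', $city) === 1;
}

foreach (['København', ' São Paulo', 'Winston-Salem', 'Paris#1'] as $city) {
    echo $city, ' => ', hasUnexpectedCharacters($city) ? 'rejected' : 'ok', PHP_EOL;
}
```

If apostrophes or periods are also legitimate in your data, they would need to be added to the class.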
I had the same problem where I wanted to split nameparts which also contained special characters:
For example if you want to split a bunch of names containing:
<lastname>,<forename(s)> <initial(s)> <suffix(es)>
forenames and suffixes are separated by (white)space(s)
initials are separated with a . and with maximum of 6 initials
you could use
$nameparts=preg_split("/(\w*),((?:\w+[\s\-]*)*)((?:\w\.){1,6})(?:\s*)(.*)/u",$displayname,-1,PREG_SPLIT_DELIM_CAPTURE);
//first and last part are always empty
array_splice($nameparts, 5, 1);
array_splice($nameparts, 0, 1);
print_r($nameparts);
Input:
Powers,Björn B.A. van der
Output:
Array ( [0] => Powers[1] => Björn [2] => B.A. [3] => van der)
Tip: the regular expression looks like it came from outer space, but regex101.com comes to the rescue!

PHP Regex Not Matching Desired Substrings

I've written the following regular expression
$pattern = "~\d+[.][\s]*[A-Z]{1}[A-Za-z0-9\s-']+~";
in order to match substrings such as 2.bon jovi - it's my life
The problem is that the only part recognized is bon jovi;
neither " - " nor " ' " is matched by this regular expression.
I'd prefer to know what is wrong with the regular expression I wrote rather than get a new one.
Your regular expression states that after the period character (which can be written as \.), you will have zero or more whitespace characters, which should then be followed by 1 upper case letter. In your string, you do not have any upper case letters.
Secondly, the - should be placed last when you want to match it. So, changing your regex to this: ~\d+[.][\s]*[A-Z]{1}[A-Za-z0-9\s'-]+~ will match something like so: 2.Bon jovi - it's my life.
On the other hand, you can change it to this: ~\d+[.][\s]*[A-Za-z0-9\s'-]+~ to match something like so: 2.bon jovi - it's my life.
EDIT: Amended as per the comments of Marko D and aleation.
A better regular expression to handle that would be...
$pattern = "~\d+\.\s*[\pL\pP\s]+~";
This will match a number, followed by a ., followed by optional whitespace, followed by one or more Unicode letters, whitespace or punctuation marks.
$pattern = "~\d+\..*~";
$string = "2.bon jovi - it's my life";
preg_match($pattern, $string, $match);
print_r($match);
output: Array ( [0] => 2.bon jovi - it's my life )
So the way I understand this regular expression is:
\d+ // Match any digit, 1 or more times
[.] // Match a dot
[\s]* // Match 0 or more whitespace characters
[A-Z]{1} // Match characters between an UPPERCASE A-Z Range 1 time
[A-Za-z0-9\s-']+ // Match characters between A-Z, a-z, 0-9, whitespace, dash and apostrophe
So straight away, your 'bon jovi' might not get matched as it's lower case and you're only looking for uppercase characters. 'bon jovi' also contains a space so perhaps changing that part of the regular expression to allow for lowercase characters and whitespace might help so you'd end up with:
$pattern = "~\d+[.][\s]*[A-Za-z\s]{1}[A-Za-z0-9\s-']+~";
Note: I quickly tested this on RegExr ( http://gskinner.com/RegExr/ ) and it appeared to match the string fine.
Your regex is as follows.
~ // delimiter
\d+ // 1 or more numbers
[.] // a period
[\s]* // 0 or more whitespace characters
[A-Z]{1} // 1 upper case letter
[A-Za-z0-9\s-\']+ // 1 or more characters, from the character class
~ //delimiter
Comparing that to the string "2.bon jovi" You have:
~ //
\d+ // "2"
[.] // "."
[\s]* // ""
[A-Z]{1} // <- NO MATCH
[A-Za-z0-9\s-\']+ //
~ //
"bon" does not start with a capital letter; it therefore does not match [A-Z]{1}
Cleaner regex
There are a few simple things you can do to clean up your regex
don't use character-classes for one character
don't specify {1}; it's the same as not being present
Applying the above to your existing regex you get:
$pattern = "~\d+\.\s*[A-Z][A-Za-z0-9\s-']+~";
Which is slightly easier to read.
Your [A-Z]{1} sub-pattern requires one capital letter, so "2.bon jovi - it's my life" will not match.
And you need to escape the - in the [A-Za-z0-9\s-'] character class, or put it at the start or end, otherwise it is specifying a range.
"~\d+\.[A-Za-z0-9\s'-]+~"
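As pointed out in the comments, it is actually not necessary to escape the - in the character class in your regex. That is only because you happened to precede it with a metacharacter \s that cannot be part of a range. Normally, if you want to match a literal - and you have it in a character class, you must escape it or position it as described above. Note that this leniency depends on the PCRE version: PHP 7.3 and later, which bundle PCRE2, reject a class such as [\s-'] with an "invalid range in character class" compilation error, so escaping the - or placing it first or last is the safe choice.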
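The effect of hyphen placement can be checked with a short sketch. The tiny [a-c] versus [ac-] comparison is added for illustration; the other pattern is the one from the answers, with the hyphen placed last in the class.

```php
<?php
// In the middle of a class the hyphen defines a range; placed last it is literal.
$rangeMatchesB   = preg_match('/^[a-c]+$/', 'b');   // 1: b lies in the range a..c
$literalMatchesB = preg_match('/^[ac-]+$/', 'b');   // 0: only a, c and - allowed
$literalMatchesHyphen = preg_match('/^[ac-]+$/', 'a-c');

// Pattern from the answers, hyphen last so it cannot be taken as a range.
$pattern = "~\d+\.\s*[A-Z][A-Za-z0-9\s'-]+~";

preg_match($pattern, "2.Bon jovi - it's my life", $m);
$lowercaseResult = preg_match($pattern, "2.bon jovi - it's my life");

echo $m[0], PHP_EOL;
var_dump($lowercaseResult); // int(0): [A-Z] still requires a capital letter
```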

preg_match_all alpha+accented, but not numeric

I would like to use PHP's preg_match_all to capture substrings which comprise:
A-Z, a-z, all accented chars;
space;
hyphen.
It must not capture strings with anything else in them, including numeric chars.
This example is close but also catches strings containing numeric chars:
preg_match_all("/([\w -]+)/u", $abigstring, $matches);
That's a job for Unicode properties:
preg_match_all("/([\p{L} -]+)/u", $abigstring, $matches);
\p{L} matches any character with the Unicode property "Letter".
This is also an option :
preg_replace("/[^A-Za-zÀ-ÿ -]+/u", "", "juana 123456 sfdf 423 999 _ -a- dsa & ç%& à à$¨à+", -1);
Here is a working example -> https://xrg.es/#17hyhfm
For those who want it, here is a corrected version of the code above, which doesn't work as is:
preg_match("/^([\p{L} -]+)$/u", $string)
The anchors (^ and $) were missing.
EDIT: Much better, if hyphens and spaces are only allowed in the middle:
/^([\p{L}](?:[\p{L} -]*[\p{L}])?)$/u
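As a sketch of the anchored, letters-at-both-ends idea, the helper below validates a whole string. The function name and sample values are hypothetical, and the inner part uses * so that two-letter values also pass.

```php
<?php
// Hypothetical helper: letters of any script, with spaces and hyphens
// allowed only between the first and last letter.
function isLettersSpacesHyphens(string $value): bool
{
    return preg_match('/^\p{L}(?:[\p{L} -]*\p{L})?$/u', $value) === 1;
}

foreach (['Jean-Luc', 'José María', 'ab', 'abc123', '-abc', 'abc '] as $value) {
    echo '[', $value, '] => ', isLettersSpacesHyphens($value) ? 'ok' : 'rejected', PHP_EOL;
}
```

The first three values are accepted; the ones containing digits or a leading or trailing separator are rejected.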
