Matching string regular expression - php

I would like to match data from strings like the following:
24.Legacy.S01E08.720p.HDTV.x264-AVS[rarbg]
Colony.S02E09.720p.HDTV.x264-FLEET[rarbg]
24.Legacy (everything before S01E08)
S => 01
E => 08
720p.HDTV.x264 (everything between S01E08 and -)
AVS (everything between - and [)
rarbg (everything between [])
The following test almost works but needs some tweaks:
preg_match_all(
'/(.*?).S([0-9]+)E([0-9]+).(.*?)(.*?)[(.*?)]/s',
$download,
$posts,
PREG_SET_ORDER
);

You're so close, you just need to add the tests for the second half of the requirements:
(.*?).S([0-9]+)E([0-9]+).(.*?)-(.*?)\[(.*?)\]
https://regex101.com/r/PfgMfq/1

You should not need the /s modifier; it only makes . match line breaks as well.
I would recommend using the /i modifier to also allow lower case, e.g. 's01e14'.
Don't forget to escape regex metacharacters such as . and [ (write \. and \[).
// NAME SEASON EPISODE MEDIUM OPTIONS
$regex = '/(.+)\.S([0-9]+)E([0-9]+)\.(.+)\[(.+)\]/i';
preg_match_all(
$regex,
$download,
$posts,
PREG_SET_ORDER
);
Test with '24.Legacy.S01E08.720p.HDTV.x264-AVS[rarbg]'
Array
(
[0] => 24.Legacy.S01E08.720p.HDTV.x264-AVS[rarbg]
[1] => 24.Legacy
[2] => 01
[3] => 08
[4] => 720p.HDTV.x264-AVS
[5] => rarbg
)

Just write it down then :)
^
(?P<title>.+?) # title
S(?P<season>\d+) # season
E(?P<episode>\d+)\. # episode
(?P<quality>[^-]+)- # quality
(?P<type>[^[]+) # type
\[
(?P<torrent>[^]]+) # rest
\]
$
Demo on regex101.com.
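As a rough sketch of how the named-group pattern above could be called from PHP: the function name parseRelease is hypothetical, and the literal dot after the title and the i modifier are assumptions added here so that the title comes back without a trailing dot and lower-case s01e08 is accepted.

```php
<?php
// Hypothetical helper: splits a release name into its named parts.
function parseRelease(string $name): ?array
{
    $regex = '/^
        (?P<title>.+?)\.        # title, up to the dot before the season tag
        S(?P<season>\d+)        # season number
        E(?P<episode>\d+)\.     # episode number
        (?P<quality>[^-]+)-     # quality, up to the hyphen
        (?P<type>[^[]+)         # release group, up to the opening bracket
        \[(?P<torrent>[^]]+)\]  # text inside the brackets
    $/xi';

    if (!preg_match($regex, $name, $m)) {
        return null;
    }
    // Keep only the named keys and drop the numeric duplicates.
    return array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}

print_r(parseRelease('24.Legacy.S01E08.720p.HDTV.x264-AVS[rarbg]'));
```

Returning null for non-matching input keeps the caller from reading array keys that were never set.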

If a part is optional just add some ( ) around it and a ? behind it, like this
// NAME SEASON EPISODE MEDIUM OPTIONS
$regex = '/(.+)\.S([0-9]+)E([0-9]+)\.([^\[]+)(\[(.+)\])?/i';
but watch out for changing $match indexes
Array
(
[0] => 24.Legacy.S01E08.720p.HDTV.x264-AVS[rarbg]
[1] => 24.Legacy
[2] => 01
[3] => 08
[4] => 720p.HDTV.x264-AVS
[5] => [rarbg]
[6] => rarbg
)
if you don't need the rarbg value you can skip the inner ()
// NAME SEASON EPISODE MEDIUM OPTIONS
$regex = '/(.+)\.S([0-9]+)E([0-9]+)\.([^\[]+)(\[.+\])?/i';
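A minimal sketch of the optional-part idea, assuming a hypothetical helper named parseOptionalTag. A greedy (.+) placed directly before an optional group can swallow the bracketed text itself, so this version restricts that group to non-bracket characters, and it uses a non-capturing (?: ) wrapper so the tag keeps a stable index.

```php
<?php
// Hypothetical helper: the bracketed tag at the end is optional.
function parseOptionalTag(string $name): ?array
{
    // [^\[]+ stops group 4 at the first "[" so the optional group can match.
    // (?: ... )? wraps the brackets without adding another capture index.
    $regex = '/^(.+)\.S([0-9]+)E([0-9]+)\.([^\[]+)(?:\[([^\]]+)\])?$/i';

    if (!preg_match($regex, $name, $m)) {
        return null;
    }
    return [
        'name'    => $m[1],
        'season'  => $m[2],
        'episode' => $m[3],
        'medium'  => $m[4],
        'tag'     => $m[5] ?? null, // index 5 is absent when there is no [tag]
    ];
}

print_r(parseOptionalTag('Colony.S02E09.720p.HDTV.x264-FLEET'));
```

The ?? fallback matters because preg_match leaves trailing unmatched groups out of the result array.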

Related

Decomposing a string into words separated by spaces, ignoring spaces within quoted strings, and considering ( and ) as words

How can I explode the following string:
+test +word any -sample (+toto +titi "generic test") -column:"test this" (+data id:1234)
into
Array('+test', '+word', 'any', '-sample', '(', '+toto', '+titi', '"generic test"', ')', '-column:"test this"', '(', '+data', 'id:1234', ')')
I would like to extend the boolean fulltext search SQL query, adding the feature to specify specific columns using the notation column:value or column:"valueA value B".
How can I do this using preg_match_all($regexp, $query, $result), i.e., what is the correct regular expression to use?
Or more generally, what would be the most appropriate regular expression to decompose a string into words not containing spaces, where spaces within quoted text are not treated as word separators, and ( and ) are considered words regardless of whether they are surrounded by spaces? For example, xxx"yyy zzz" should be considered a single word, and (aaa) should be three words: (, aaa and ).
I have tried something like /"(?:\\\\.|[^\\\\"])*"|\S+/, but with limited/no success.
Can anybody help?
I think PCRE verbs can be used to achieve your goal:
preg_split('/".*?"(*SKIP)(*FAIL)|(\(|\))| /', '+test +word any -sampe (+toto +titi "generic test") -column:"test this" (+data id:1234)',-1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY)
https://3v4l.org/QnpB9
https://regex101.com/r/pw1mEd/1
https://3v4l.org/dNMkf (with test data)
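For reference, a self-contained sketch of the same preg_split call, run against the query string from the question (the variable names are chosen here for illustration):

```php
<?php
$query = '+test +word any -sample (+toto +titi "generic test") '
       . '-column:"test this" (+data id:1234)';

// ".*?"(*SKIP)(*FAIL) consumes a quoted run and then rejects it as a
// delimiter, so spaces inside quotes are never split on. Parentheses are
// captured so PREG_SPLIT_DELIM_CAPTURE returns them as tokens of their own,
// and PREG_SPLIT_NO_EMPTY drops the empty pieces between adjacent delimiters.
$tokens = preg_split(
    '/".*?"(*SKIP)(*FAIL)|(\(|\))| /',
    $query,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

print_r($tokens);
```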
If you want to match the various parts using alternations:
(?:[^\s()":]*:)?"[^"]+"|[^\s()]+|[()]
Explanation
(?: Non capture group to match as a whole part
[^\s()":]*: Match optional non whitespace chars other than ( ) " : and then match :
)? Close the non capture group and make it optional
"[^"]+" Match from an opening double quote till closing double quote
| Or
[^\s()]+ Match 1+ non whitespace chars other than ( or )
| Or
[()] Match either ( or )
Regex demo | PHP demo
Example code
$re = '/(?:[^\s()":]*:)?"[^"]+"|[^\s()]+|[()]/';
$str = '+test +word any -sampe (+toto +titi "generic test") -column:"test this" (+data id:1234)';
preg_match_all($re, $str, $matches);
print_r($matches[0]);
Output
Array
(
[0] => +test
[1] => +word
[2] => any
[3] => -sampe
[4] => (
[5] => +toto
[6] => +titi
[7] => "generic test"
[8] => )
[9] => -column:"test this"
[10] => (
[11] => +data
[12] => id:1234
[13] => )
)

preg_split : splitting a string according to a very specific pattern

Regex/PHP n00b here. I'm trying to use the PHP "preg_split" function...
I have strings that follow a very specific pattern according to which I want to split them.
Example of a string:
CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION
Desired result:
[0]CADAVRES
[1]FILM
[2]Canada : Québec
[3]Érik Canuel
[4]2009
[5]long métrage
[6]FICTION
Delimiters (in order of occurrence):
" ["
"] ("
", "
", "
", "
") "
How do I go about writing the regex correctly?
Here's what I've tried:
<?php
$pattern = "/\s\[/\]\s\(/,\s/,\s/,\s/\)\s/";
$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
$keywords = preg_split($pattern, $string);
print_r($keywords);
It's not working, and I don't understand what I'm doing wrong. Then again, I've just begun trying to deal with regex and PHP, so yeah... There are so many escape characters, I can't see right...
Thank you very much!
I managed to work out a solution using preg_match_all:
$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
preg_match_all("|[^-\\[\\](),/\\s]+(?:(?: :)? [^-\\[\\](),/]+)?|", $input, $matches);
print_r($matches[0]);
Array
(
[0] => CADAVRES
[1] => FILM
[2] => Canada : Québec
[3] => Érik Canuel
[4] => 2009
[5] => long métrage
[6] => FICTION
)
The above regex considers a term as any character which is not something like bracket, comma, parenthesis, etc. It also allows for two word terms, possibly with a colon separator in the middle.
You can use this regex to split on:
([^\w:]\s[^\w:]?|\s[^\w:])
It looks for a non-(word or :) character, followed by a space, followed by an optional non-(word or :) character; or a space followed by a non-(word or :) character. This will match all your desired split patterns. In PHP (note you need the u modifier to deal with unicode characters):
$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
$keywords = preg_split('/([^\w:]\s[^\w:]?|\s[^\w:])/u', $input);
print_r($keywords);
Output:
Array
(
[0] => CADAVRES
[1] => FILM
[2] => Canada : Québec
[3] => Érik Canuel
[4] => 2009
[5] => long métrage
[6] => FICTION
)
Demo on 3v4l.org
Here's an attempt with preg_match:
$pattern = "/^([^\[]+)\[([^\]]+)\]\s+\(([^,]+),\s+([^,]+),\s+([^,]+),\s+([^,]+)\)\s+(.+)$/i";
$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
preg_match($pattern, $string, $keywords);
array_shift($keywords);
print_r($keywords);
Output:
Array
(
[0] => CADAVRES
[1] => FILM
[2] => Canada : Québec
[3] => Érik Canuel
[4] => 2009
[5] => long métrage
[6] => FICTION
)
Try it!
Regex breakdown:
^ anchor to start of string
( begin capture group 1
[^\[]+ one or more non-left bracket characters
) end capture group 1
\[ literal left bracket
( begin capture group 2
[^\]]+ one or more non-right bracket characters
) end capture group 2
\] literal right bracket
\s+ one or more spaces
\( literal open parenthesis
( open capture group 3
[^,]+ one or more non-comma characters
) end capture group 3
,\s+ literal comma followed by one or more spaces
([^,]+),\s+([^,]+),\s+([^,]+) repeats of the above
\) literal closing parenthesis
\s+ one or more spaces
( begin capture group 7
.+ everything else
) end capture group 7
$ anchor to end of string
This assumes your structure to be static and is not particularly pretty, but on the other hand, should be robust to delimiters creeping into fields where they're not supposed to be. For example, the title having a : or , in it seems plausible and would break a "split on these delimiters anywhere"-type solution. For example,
"Matrix:, Trilogy() [FILM, reviewed: good] (Canada() : Québec , \t Érik Canuel , ): 2009 , long ():():[][]métrage) FICTIO , [(:N";
correctly parses as:
Array
(
[0] => Matrix:, Trilogy()
[1] => FILM, reviewed: good
[2] => Canada() : Québec
[3] => Érik Canuel
[4] => ): 2009
[5] => long ():():[][]métrage
[6] => FICTIO , [(:N
)
Try it!
Additionally, if your parenthesized comma region is variable length, you might want to extract that first and parse it, then handle the rest of the string.
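A minimal sketch of that two-step idea, assuming a hypothetical function name parseEntry: the outer structure is matched first, and the parenthesized region is then split on commas, so it may contain any number of fields.

```php
<?php
// Hypothetical helper: title [type] (field, field, ...) category
function parseEntry(string $line): ?array
{
    // Step 1: isolate the title, the bracketed type, the whole parenthesized
    // region, and whatever follows the closing parenthesis.
    if (!preg_match('/^([^\[]+)\[([^\]]+)\]\s+\((.*)\)\s+(.+)$/u', $line, $m)) {
        return null;
    }
    // Step 2: split the parenthesized region on commas, however many there are.
    $details = preg_split('/\s*,\s*/u', trim($m[3]));

    return array_merge([trim($m[1]), $m[2]], $details, [$m[4]]);
}

print_r(parseEntry(
    'CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION'
));
```

Unlike the fixed seven-group pattern, this version splits on every comma inside the parentheses, so a comma inside a field would produce an extra element.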

split string by spaces and colon but not if inside quotes

having a string like this:
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
the desired result is:
[0] => Array (
[0] => dateto:'2015-10-07 15:05'
[1] => xxxx
[2] => datefrom:'2015-10-09 15:05'
[3] => yyyy
[4] => asdf
)
what I get with:
preg_match_all("/\'(?:[^()]|(?R))+\'|'[^']*'|[^(),\s]+/", $str, $m);
is:
[0] => Array (
[0] => dateto:'2015-10-07
[1] => 15:05'
[2] => xxxx
[3] => datefrom:'2015-10-09
[4] => 15:05'
[5] => yyyy
[6] => asdf
)
Also tried with preg_split("/[\s]+/", $str) but no clue how to escape if value is between quotes. Can anyone show me how and also please explain the regex. Thank you!
I would use PCRE verb (*SKIP)(*F),
preg_split("~'[^']*'(*SKIP)(*F)|\s+~", $str);
DEMO
Often, when you are looking to split a string, preg_split isn't the best approach (that seems a little counterintuitive, but it's true most of the time). A more efficient way is to find all items (with preg_match_all) using a pattern that describes everything that is not the delimiter (whitespace here):
$pattern = <<<'EOD'
~(?=\S)[^'"\s]*(?:'[^']*'[^'"\s]*|"[^"]*"[^'"\s]*)*~
EOD;
if (preg_match_all($pattern, $str, $m))
$result = $m[0];
pattern details:
~ # pattern delimiter
(?=\S) # the lookahead assertion only succeeds if there is a non-
# white-space character at the current position.
# (This lookahead is useful for two reasons:
# - it allows the regex engine to quickly find the start of
# the next item without having to test each branch of the
# following alternation at each position in the strings
# until one succeeds.
# - it ensures that there's at least one non-white-space.
# Without it, the pattern may match an empty string.
# )
[^'"\s]* #"'# all that is not a quote or a white-space
(?: # eventual quoted parts
'[^']*' [^'"\s]* #"# single quotes
|
"[^"]*" [^'"\s]* # double quotes
)*
~
demo
Note that with this slightly longer pattern, the five items of your example string are found in only 60 steps. You can also use this shorter, simpler pattern:
~(?:[^'"\s]+|'[^']*'|"[^"]*")+~
but it's a little less efficient.
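A small runnable sketch of the shorter pattern, wrapped in a hypothetical function named splitOutsideQuotes. The nowdoc keeps both quote characters in the pattern free of PHP escaping.

```php
<?php
// Hypothetical wrapper around the shorter pattern shown above.
function splitOutsideQuotes(string $str): array
{
    // An item is a run of unquoted non-space characters and/or quoted parts.
    $pattern = <<<'EOD'
~(?:[^'"\s]+|'[^']*'|"[^"]*")+~
EOD;
    preg_match_all($pattern, $str, $m);
    return $m[0];
}

$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
print_r(splitOutsideQuotes($str));
```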
For your example, you can use preg_split with negative lookbehind (?<!\d), i.e.:
<?php
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
$matches = preg_split('/(?<!\d)(\s)/', $str);
print_r($matches);
Output:
Array
(
[0] => dateto:'2015-10-07 15:05'
[1] => xxxx
[2] => datefrom:'2015-10-09 15:05'
[3] => yyyy
[4] => asdf
)
Demo:
http://ideone.com/EP06Nt
Regex Explanation:
(?<!\d)(\s)
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\d)»
Match a single character that is a “digit” «\d»
Match the regex below and capture its match into backreference number 1 «(\s)»
Match a single character that is a “whitespace character” «\s»
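Note that the lookbehind version works on the sample only because the single space inside each quoted value happens to follow a digit. The sketch below uses a made-up input string to compare it with the quote-aware (*SKIP)(*F) split:

```php
<?php
// Made-up input: one token ends in a digit, one quoted value contains a space
// that follows a letter.
$str = "id:1234 name:'John Smith' 42 end";

// Lookbehind approach: a space that follows a digit is never a split point,
// and a space inside quotes is split on unless a digit precedes it.
$byLookbehind = preg_split('/(?<!\d)(\s)/', $str);

// Quote-aware approach: quoted runs are skipped, everything else splits on
// whitespace.
$byQuotes = preg_split("~'[^']*'(*SKIP)(*F)|\s+~", $str);

print_r($byLookbehind);
print_r($byQuotes);
```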

Regexp tip request

I have a string like
"first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth"
I want to explode it to array
Array (
0 => "first",
1 => "second[,b]",
2 => "third[a,b[1,2,3]]",
3 => "fourth[a[1,2]]",
4 => "sixth"
)
I tried to remove brackets:
preg_replace("/[ ( (?>[^[]]+) | (?R) )* ]/xis",
"",
"first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth"
);
But I got stuck on the next step.
PHP's regex flavor supports recursive patterns, so something like this would work:
$text = "first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth";
preg_match_all('/[^,\[\]]+(\[([^\[\]]|(?1))*])?/', $text, $matches);
print_r($matches[0]);
which will print:
Array
(
[0] => first
[1] => second[,b]
[2] => third[a,b[1,2,3]]
[3] => fourth[a[1,2]]
[4] => sixth
)
The key here is not to split, but match.
Whether you want to add such a cryptic regex to your code base, is up to you :)
EDIT
I just realized that my suggestion above will not match entries starting with [. To do that, do it like this:
$text = "first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth,[s,[,e,[,v,],e,],n]";
preg_match_all("/
( # start match group 1
[^,\[\]] # any char other than a comma or square bracket
| # OR
\[ # an opening square bracket
( # start match group 2
[^\[\]] # any char other than a square bracket
| # OR
(?R) # recursively match the entire pattern
)* # end match group 2, and repeat it zero or more times
] # a closing square bracket
)+ # end match group 1, and repeat it once or more times
/x",
$text,
$matches
);
print_r($matches[0]);
which prints:
Array
(
[0] => first
[1] => second[,b]
[2] => third[a,b[1,2,3]]
[3] => fourth[a[1,2]]
[4] => sixth
[5] => [s,[,e,[,v,],e,],n]
)
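If the recursive pattern feels too cryptic, the same result can be obtained without a regex. This is a different technique from the answer above: a plain character scan that tracks bracket depth and splits only on commas at depth zero. The function name splitTopLevel is hypothetical, and the sketch assumes the brackets are balanced.

```php
<?php
// Hypothetical helper: split on commas that are not inside [ ... ].
function splitTopLevel(string $text): array
{
    $parts = [];
    $current = '';
    $depth = 0;
    $length = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $ch = $text[$i];
        if ($ch === '[') {
            $depth++;
        } elseif ($ch === ']') {
            $depth = max(0, $depth - 1); // tolerate a stray closing bracket
        }
        if ($ch === ',' && $depth === 0) {
            // A top-level comma ends the current entry.
            $parts[] = $current;
            $current = '';
            continue;
        }
        $current .= $ch;
    }
    $parts[] = $current; // the last entry has no trailing comma

    return $parts;
}

print_r(splitTopLevel('first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth'));
```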

Validating US phone number with php/regex

EDIT: I've mixed and modified two of the answers given below to form the full function which now does what I had wanted and then some... So I figured I'd post it here in case anyone else comes looking for this same thing.
/*
* Function to analyze string against many popular formatting styles of phone numbers
* Also breaks phone number into its respective components
* 3-digit area code, 3-digit exchange code, 4-digit subscriber number
* After which it validates the 10 digit US number against NANPA guidelines
*/
function validPhone($phone) {
$format_pattern = '/^(?:(?:\((?=\d{3}\)))?(\d{3})(?:(?<=\(\d{3})\))?[\s.\/-]?)?(\d{3})[\s\.\/-]?(\d{4})\s?(?:(?:(?:(?:e|x|ex|ext)\.?\:?|extension\:?)\s?)(?=\d+)(\d+))?$/';
$nanpa_pattern = '/^(?:1)?(?(?!(37|96))[2-9][0-8][0-9](?<!(11)))?[2-9][0-9]{2}(?<!(11))[0-9]{4}(?<!(555(01([0-9][0-9])|1212)))$/';
//Set array of variables to false initially
$valid = array(
'format' => false,
'nanpa' => false,
'ext' => false,
'all' => false
);
//Check data against the format analyzer
if(preg_match($format_pattern, $phone, $matchset)) {
$valid['format'] = true;
}
//If formatted properly, continue
if($valid['format']) {
//Set array of new components
$components = array(
'ac' => $matchset[1], //area code
'xc' => $matchset[2], //exchange code
'sn' => $matchset[3], //subscriber number
'xn' => isset($matchset[4]) ? $matchset[4] : '', //extension number (group 4 is unset when no extension matched)
);
//Set array of number variants
$numbers = array(
'original' => $matchset[0],
'stripped' => substr(preg_replace('/\D/', '', $matchset[0]), 0, 10)
);
//Now let's check the first ten digits against NANPA standards
if(preg_match($nanpa_pattern, $numbers['stripped'])) {
$valid['nanpa'] = true;
}
//If the NANPA guidelines have been met, continue
if($valid['nanpa']) {
if(!empty($components['xn'])) {
if(preg_match('/^[\d]{1,6}$/', $components['xn'])) {
$valid['ext'] = true;
}
}
else {
$valid['ext'] = true;
}
}
//If the extension number is valid or non-existent, continue
if($valid['ext']) {
$valid['all'] = true;
}
}
return $valid['all'];
}
You can resolve this using a lookahead assertion. Basically what we're saying is: I want a series of specific letters (e, ex, ext, x, extension) followed by one or more digits. But we also want to cover the case where there's no extension at all.
Side note: you don't need brackets around single characters like [\s] or the [x] that follows. Also, you can group characters that are meant to be in the same spot, so instead of \s?\.?/?, you can use [\s\./]?, which means "one of any of those characters".
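A quick sketch of the difference, using two made-up patterns for a bare seven-digit number: a chain of optional single characters accepts several delimiters in a row, while the character class accepts at most one.

```php
<?php
// Chain of optional single characters: each one may appear, in this order.
$separate = '/^\d{3}\s?\.?\/?-?\d{4}$/';

// Character class: at most one delimiter taken from the set.
$grouped = '/^\d{3}[\s.\/-]?\d{4}$/';

foreach (['456-7890', '4567890', '456 .-7890'] as $number) {
    printf(
        "%-12s separate=%d grouped=%d\n",
        $number,
        preg_match($separate, $number),
        preg_match($grouped, $number)
    );
}
```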
Here's an update with regex that resolves your comment here as well. I've added the explanation in the actual code.
<?php
$sPattern = "/^
(?: # Area Code
(?:
\( # Open Parentheses
(?=\d{3}\)) # Lookahead. Only if we have 3 digits and a closing parentheses
)?
(\d{3}) # 3 Digit area code
(?:
(?<=\(\d{3}) # Closing Parentheses. Lookbehind.
\) # Only if we have an open parentheses and 3 digits
)?
[\s.\/-]? # Optional Space Delimiter
)?
(\d{3}) # 3 Digits
[\s\.\/-]? # Optional Space Delimiter
(\d{4})\s? # 4 Digits and an Optional following Space
(?: # Extension
(?: # Lets look for some variation of 'extension'
(?:
(?:e|x|ex|ext)\.? # First, abbreviations, with an optional following period
|
extension # Now just the whole word
)
\s? # Optional Following Space
)
(?=\d+) # This is the Lookahead. Only accept that previous section IF it's followed by some digits.
(\d+) # Now grab the actual digits (the lookahead doesn't grab them)
)? # The Extension is Optional
$/x"; // /x modifier allows the expanded and commented regex
$aNumbers = array(
'123-456-7890x123',
'123.456.7890x123',
'123 456 7890 x123',
'(123) 456-7890 x123',
'123.456.7890x.123',
'123.456.7890 ext. 123',
'123.456.7890 extension 123456',
'123 456 7890',
'123-456-7890ex123',
'123.456.7890 ex123',
'123 456 7890 ext123',
'456-7890',
'456 7890',
'456 7890 x123',
'1234567890',
'() 456 7890'
);
foreach($aNumbers as $sNumber) {
if (preg_match($sPattern, $sNumber, $aMatches)) {
echo 'Matched ' . $sNumber . "\n";
print_r($aMatches);
} else {
echo 'Failed ' . $sNumber . "\n";
}
}
?>
And the output:
Matched 123-456-7890x123
Array
(
[0] => 123-456-7890x123
[1] => 123
[2] => 456
[3] => 7890
[4] => 123
)
Matched 123.456.7890x123
Array
(
[0] => 123.456.7890x123
[1] => 123
[2] => 456
[3] => 7890
[4] => 123
)
Matched 123 456 7890 x123
Array
(
[0] => 123 456 7890 x123
[1] => 123
[2] => 456
[3] => 7890
[4] => 123
)
Matched (123) 456-7890 x123
Array
(
[0] => (123) 456-7890 x123
[1] => 123
[2] => 456
[3] => 7890
[4] => 123
)
Matched 123.456.7890x.123
Array
(
[0] => 123.456.7890x.123
[1] => 123
[2] => 456
[3] => 7890
[4] => 123
)
Matched 123.456.7890 ext. 123
Array
(
[0] => 123.456.7890 ext. 123
[1] => 123
[2] => 456
[3] => 7890
[4] => 123
)
Matched 123.456.7890 extension 123456
Array
(
[0] => 123.456.7890 extension 123456
[1] => 123
[2] => 456
[3] => 7890
[4] => 123456
)
Matched 123 456 7890
Array
(
[0] => 123 456 7890
[1] => 123
[2] => 456
[3] => 7890
)
Matched 123-456-7890ex123
Array
(
[0] => 123-456-7890ex123
[1] => 123
[2] => 456
[3] => 7890
[4] => 123
)
Matched 123.456.7890 ex123
Array
(
[0] => 123.456.7890 ex123
[1] => 123
[2] => 456
[3] => 7890
[4] => 123
)
Matched 123 456 7890 ext123
Array
(
[0] => 123 456 7890 ext123
[1] => 123
[2] => 456
[3] => 7890
[4] => 123
)
Matched 456-7890
Array
(
[0] => 456-7890
[1] =>
[2] => 456
[3] => 7890
)
Matched 456 7890
Array
(
[0] => 456 7890
[1] =>
[2] => 456
[3] => 7890
)
Matched 456 7890 x123
Array
(
[0] => 456 7890 x123
[1] =>
[2] => 456
[3] => 7890
[4] => 123
)
Matched 1234567890
Array
(
[0] => 1234567890
[1] => 123
[2] => 456
[3] => 7890
)
Failed () 456 7890
The current REGEX
/^[\(]?(\d{0,3})[\)]?[\.]?[\/]?[\s]?[\-]?(\d{3})[\s]?[\.]?[\/]?[\-]?(\d{4})[\s]?[x]?(\d*)$/
has a lot of issues, resulting in it matching all of the following, among others:
(0./ -000 ./-0000 x00000000000000000000000)
()./1234567890123456789012345678901234567890
\)\-555/1212 x
I think this REGEX is closer to what you're looking for:
/^(?:(?:(?:1[.\/\s-]?)(?!\())?(?:\((?=\d{3}\)))?((?(?!(37|96))[2-9][0-8][0-9](?<!(11)))?[2-9])(?:\((?<=\(\d{3}))?)?[.\/\s-]?([0-9]{2}(?<!(11)))[.\/\s-]?([0-9]{4}(?<!(555(01([0-9][0-9])|1212))))(?:[\s]*(?:(?:x|ext|extn|ex)[.:]*|extension[:]?)?[\s]*(\d+))?$/
or, exploded:
<?
$pattern =
'/^ # Matches from beginning of string
(?: # Country / Area Code Wrapper [not captured]
(?: # Country Code Wrapper [not captured]
(?: # Country Code Inner Wrapper [not captured]
1 # 1 - CC for United States and Canada
[.\/\s-]? # Character Class ('.', '/', '-' or whitespace) for allowed (optional, single) delimiter between Country Code and Area Code
) # End of Country Code
(?!\() # Lookahead, only allowed if not followed by an open parenthesis
)? # Country Code Optional
(?: # Opening Parenthesis Wrapper [not captured]
\( # Opening parenthesis
(?=\d{3}\)) # Lookahead, only allowed if followed by 3 digits and closing parenthesis [lookahead never captured]
)? # Parentheses Optional
((?(?!(37|96))[2-9][0-8][0-9](?<!(11)))?[2-9]) # 3-digit NANPA-valid Area Code [captured]
(?: # Closing Parenthesis Wrapper [not captured]
\( # Closing parenthesis
(?<=\(\d{3}) # Lookbehind, only allowed if preceded by 3 digits and opening parenthesis [lookbehind never captured]
)? # Parentheses Optional
)? # Country / Area Code Optional
[.\/\s-]? # Character Class ('.', '/', '-' or whitespace) for allowed (optional, single) delimiter between Area Code and Central-office Code
([0-9]{2}(?<!(11))) # 3-digit NANPA-valid Central-office Code [captured]
[.\/\s-]? # Character Class ('.', '/', '-' or whitespace) for allowed (optional, single) delimiter between Central-office Code and Subscriber number
([0-9]{4}(?<!(555(01([0-9][0-9])|1212)))) # 4-digit NANPA-valid Subscriber Number [captured]
(?: # Extension Wrapper [not captured]
[\s]* # Character Class for allowed delimiters (optional, multiple) between phone number and extension
(?: # Wrapper for extension description text [not captured]
(?:x|ext|extn|ex)[.:]* # Abbreviated extensions with character class for terminator (optional, multiple) [not captured]
| # OR
extension[:]? # The entire word extension with character class for optional terminator
)? # Marker for Extension optional
[\s]* # Character Class for allowed delimiters (optional, multiple) between extension description text and actual extension
(\d+) # Extension [captured if present], required for extension wrapper to match
)? # Entire extension optional
$ # Matches to end of string
/x'; // /x modifier allows the expanded and commented regex
?>
This modification provides several improvements.
It creates a configurable group of items that can match as the extension. You can add additional delimiters for the extension. This was the original request. The extension also allows for a colon after the extension delimiter.
It converts the sequence of 4 optional delimiters (dot, whitespace, slash or hyphen) into a character class that matches only a single one.
It groups items appropriately. In the given example, you can have the opening parentheses without an area code between them, and you can have the extension mark (space-x) without an extension. This alternate regular expression requires either a complete area code or none and either a complete extension or none.
The 4 components of the number (area code, central office code, phone number and extension) are the back-referenced elements that feed into $matches in preg_match().
Uses lookahead/lookbehind to require matched parentheses in the area code.
Allows for a 1- to be used before the number. (This assumes that all numbers are US or Canada numbers, which seems reasonable since the match is ultimately made against NANPA restrictions.) It also disallows mixing a country code prefix with an area code wrapped in parentheses.
It merges in the NANPA rules to eliminate non-assignable telephone numbers.
It eliminates area codes in the form 0xx, 1xx, 37x, 96x, x9x and x11, which are invalid NANPA area codes.
It eliminates central office codes in the form 0xx and 1xx (invalid NANPA central office codes).
It eliminates numbers with the form 555-01xx (non-assignable from NANPA).
It has a few minor limitations. They're probably unimportant, but are being noted here.
There is nothing in place to require that the same delimiter is used repeatedly, allowing for numbers like 800-555.1212, 800/555 1212, 800 555.1212 etc.
There is nothing in place to restrict the delimiter after an area code with parentheses, allowing for numbers like (800)-555-1212 or (800)/5551212.
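If the first limitation matters, one possible approach is a backreference, sketched below with a hypothetical helper named hasConsistentDelimiters. It only checks the shape of a 10-digit number and performs no NANPA validation.

```php
<?php
// Hypothetical helper: group 2 captures the first delimiter (possibly empty)
// and \2 requires the second delimiter to be exactly the same text.
function hasConsistentDelimiters(string $phone): bool
{
    return preg_match('/^(\d{3})([.\/\s-]?)(\d{3})\2(\d{4})$/', $phone) === 1;
}

foreach (['800-555-1212', '8005551212', '800-555.1212', '800/555 1212'] as $p) {
    echo $p, ' => ', hasConsistentDelimiters($p) ? 'consistent' : 'mixed', "\n";
}
```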
The NANPA rules are adapted from the following REGEX, found here: http://blogchuck.com/2010/01/php-regex-for-validating-phone-numbers/
/^(?:1)?(?(?!(37|96))[2-9][0-8][0-9](?<!(11)))?[2-9][0-9]{2}(?<!(11))[0-9]{4}(?<!(555(01([0-9][0-9])|1212)))$/
Why not convert any series of letters to "x"? That way, all the possibilities would be reduced to "x".
OR
Check for 3 digits, 3 digits, 4 digits, then 1 or more digits, and disregard any other characters in between
Regex:
([0-9]{3}).*?([0-9]{3}).*?([0-9]{4}).+?([0-9]{1,})
Alternatively, you could use some pretty simple and straightforward JavaScript to force the user to enter in a much more specified format. The Masked Input Plugin ( http://digitalbush.com/projects/masked-input-plugin/ ) for jQuery allows you to mask an HTML input as a telephone number, only allowing the person to enter a number in the format xxx-xxx-xxxx. It doesn't solve your extension issues, but it does provide for a much cleaner user experience.
Well, you could modify the regex, but it won't be very nice -- should you allow "extn"? How about "extentn"? How about "and then you have to dial"?
I think the "right" way to do this is to add a separate, numerical, extension form box.
But if you really want the regex, I think I've fixed it up. Hint: you don't need [x] for a single character, x will do.
/^\(?(\d{0,3})\)?(\.|\/|\s|\-)?(\d{3})(\.|\/|\s|\-)?(\d{4})\s?(x|ext)?(\d*)$/
You allowed a dot, a slash, a dash, and a whitespace character in sequence. You should allow only one of these options. You'll need to update the references to $matches; the useful groups are now 1, 3, 5, and 7.
P.S. This is untested, since I don't have a reference implementation of PHP running. Apologies for mistakes, please let me know if you find any and I'll try to fix them.
Edit
This is summed up much better than I can here.
