PHP Regex Not Matching Desired Substrings - php

I've written the next regular expression
$pattern = "~\d+[.][\s]*[A-Z]{1}[A-Za-z0-9\s-']+~";
in order to match substrings as 2.bon jovi - it's my life
the problem is the only part that is recognized is - bon jovi
none " - " or " ' " are recognized by this regular expression.
I'd prefer to know what is wrong with the regular expression that I've wrote rather than getting a new one.

Your regular expressions states that after the period character (can be changed to \.), you will have zero or more white space characters which should then be followed by 1 upper case letter. In your string, you do not have any upper case letters.
Secondly, the - should be placed last when you want to match it. So, changing your regex to this: ~\d+[.][\s]*[A-Z]{1}[A-Za-z0-9\s'-]+~ will match something like so: 2.Bon jovi - it's my life.
On the other hand, you can change it to this: ~\d+[.][\s]*[A-Za-z0-9\s'-]+~ to match something like so: 2.bon jovi - it's my life.
EDIT: Ammended as per the comments of Marko D and aleation.

A better regular expression to handle that would be...
$pattern = "~\d+\.\s*[\pL\pP\s]+~";
CodePad.
This will match a number, followed by a ., followed by optional whitespace, followed by one or more Unicode letters, whitespace or punctuation marks.

$pattern = "~\d+\..*~";
$string = "2.bon jovi - it's my life";
preg_match($pattern, $string, $match);
print_r($match);
output: Array ( [0] => 2.bon jovi - it's my life )

So the way I understand this regular expression is:
\d+ // Match any digit, 1 or more times
[.] // Match a dot
[\s]* // Match 0 or more whitespace characters
[A-Z]{1} // Match characters between an UPPERCASE A-Z Range 1 time
[A-Za-z0-9\s-']+ // Match characters between A-Z, a-z, 0-9, whitespace, dashe and apostrophe
So straight away, your 'bon jovi' might not get matched as it's lower case and you're only looking for uppercase characters. 'bon jovi' also contains a space so perhaps changing that part of the regular expression to allow for lowercase characters and whitespace might help so you'd end up with:
$pattern = "~\d+[.][\s]*[A-Za-z\s]{1}[A-Za-z0-9\s-']+~";
Note: I quickly tested this on RegExr ( http://gskinner.com/RegExr/ ) and it appeared to match the string fine.

Your regrex is as follows.
~ // delimiter
\d+ // 1 or more numbers
[.] // a period
[\s]* // 0 or more whitespace characters
[A-Z]{1} // 1 upper case letter
[A-Za-z0-9\s-\']+ // 1 or more characters, from the character class
~ //delimiter
Comparing that to the string "2.bon jovi" You have:
~ //
\d+ // "2"
[.] // "."
[\s]* // ""
[A-Z]{1} // <- NO MATCH
[A-Za-z0-9\s-\']+ //
~ //
"bon" does not start with a captial letter, it therefore does not match [A-Z]{1}
Cleaner regex
There are a few simple things you can do to clean up your regex
don't use character-classes for one character
don't specify {1} it's the same as not being present
Applying the above to your existing regex you get:
$pattern = "~\d+\.\s*[A-Z][A-Za-z0-9\s-']+~";
Which is slightly easier to read.

Your [A-Z]{1} sub-pattern requires one capital letter, so "2.bon jovi - it's my life" will not match.
And you need to escape the - in the [A-Za-z0-9\s-'] character class, or put it at the start or end, otherwise it is specifying a range.
"~\d+\.[A-Za-z0-9\s'-]+~"
As pointed out in the comments, it is actually not necessary to escape the - in the character class in your regex. That is only because you happened to precede it with a metacharacter \s that cannot be part of a range. Normally, if you want to match a literal - and you have it in a character class, you must escape it or position it as described above.

Related

Validate string to contain only qualifying characters and a specific optional substring in the middle

I'm trying to make a regular expression in PHP. I can get it working in other languages but not working with PHP.
I want to validate item names in an array
They can contain upper and lower case letters, numbers, underscores, and hyphens.
They can contain => as an exact string, not separate characters.
They cannot start with =>.
They cannot finish with =>.
My current code:
$regex = '/^[a-zA-Z0-9-_]+$/'; // contains A-Z a-z 0-9 - _
//$regex = '([^=>]$)'; // doesn't end with =>
//$regex = '~.=>~'; // doesn't start with =>
if (preg_match($regex, 'Field_name_true2')) {
echo 'true';
} else {
echo 'false';
};
// Field=>Value-True
// =>False_name
//Bad_name_2=>
Use negative lookarounds. Negative lookahead (?!=>) at the beginning to prohibit beginning with =>, and negative lookbehind (?<!=>) at the end to prohibit ending with =>.
^(?!=>)(?:[a-zA-Z0-9-_]+(=>)?)+(?<!=>)$
DEMO
There is absolutely no requirement for lookarounds here.
Anchors and an optional group will suffice.
Demo
/^[\w-]+(?:=>[\w-]+)?$/
^^^^^^^^^^^^^-- this whole non-capturing group is optional
This allows full strings consisting exclusively of [0-9a-zA-Z-] or split ONCE by =>.
The non-capturing group may occur zero or one time.
In other words, => may occur after one or more [\w-] characters, but if it does occur, it MUST be immediately followed by one or more [\w-] characters until the end of the string.
To cover some of the ambiguity in the question requirements:
If foo=>bar=>bam is valid, then use /^[\w-]+(?:=>[\w-]+)*$/ which replaces ? (zero or one) with * (zero or more).
If foo=>=>bar is valid then use /^[\w-]+(?:(?:=>)+[\w-]+)*$/ which replaces => (must occur once) with (?:=>)+ (substring must occur one or more times).
Well, your character ranges equal to \w, so you could use
^(?!=>)(?:(?!=>$)(?:[-\w]|=>))+$
This construct uses a "tempered greedy token", see a demo on regex101.com.
More shiny, complicated and surely over the top, you could use subroutines as in:
(?(DEFINE)
(?<chars>[-\w]) # equals to A-Z, a-z, 0-9, _, -
(?<af>=>) # "arrow function"
(?<item>
(?!(?&af)) # no af at the beginning
(?:(?&af)?(?&chars)++)+
(?!(?&af)) # no af at the end
)
)
^(?&item)$
See another demo on regex101.com
For the example data, you can use
^[a-zA-Z0-9_-]+=>[a-zA-Z0-9_-]+$
The pattern matches:
^ Start of string
[a-zA-Z0-9_-]+ Match 1+ times any of the listed ranges or characters (can not start with =>)
=> Match literally
[a-zA-Z0-9_-]+ Match again 1+ times any of the listed ranges or characters
$ End of string
Regex demo
If you want to allow for optional spaces:
^\h*[a-zA-Z0-9_-]+\h*=>\h*[a-zA-Z0-9_-]+\h*$
Regex demo
Note that [a-zA-Z0-9_-] can be written as [\w-]

Why regex with lookaheads doesn't match?

I need (in PHP) to split a sententse by the word that cannot be the first or the last one in the sentence. Say the word is "pression" and here is my regex
/^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$/i
Live here: https://regex101.com/r/CHAhKj/1/
First, it doesn't match.
Next, I think - it is at all possible to split that way? I tryed simplified example
print_r(preg_split('/^.+pizza.+$/', 'my pizza is cool'));
live here http://sandbox.onlinephpfunctions.com/code/10b674900fc1ef44ec79bfaf80e83fe1f4248d02
and it prints an array of 2 empty strings, when I expect
['my ', ' is cool']
I need (in PHP) to split a sentence by the word that cannot be the first or the last one in the sentence
You may use this regex:
(?<=[^\s.?]\h)pression(?=\h[^\s.?])
RegEx Demo
RegEx Details:
(?<=[^\s.?]\h): Lookbehind to assert that ahead of current position we have a space and a character that not a whitespace, not a dot and not a ?.
pression: Match word pression
(?=\h[^\s.?]): Lookahead to assert that before current position we have a space and a character that not a whitespace, not a dot and not a ?
First, ^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$ can't match any string at all because the (?=[\s\.\,\:\;])p part requires p to be also either a whitespace char, or a ., ,, : or ;, which invalidates the whole match at once.
Second, ^.+pizza.+$ pattern does not ensure the pizza matched is not the first or last word in a sentence as . matches whitespace, too. It does not return anything meaningful, because preg_split uses the match to break string into chunks, and the two empty values are 1) start of string and 2) empty string positions.
That said, all you need is:
preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)
See the regex demo. Details:
^ - start of string
(.*?\w\W+) - Capturing group 1: any zero or more chars, as few as possible, then a word char and then one or more non-word chars
pression - a word
(\W+\w.*) - Capturing group 2: one or more non-word chars, a word char, and then any zero or more chars as many as possible
$ - end of string.
s makes the . match across lines and i flag makes the pattern match in a case insensitive way.
See the PHP demo:
$text = "You can use any regular expression pression inside the lookahead ";
if (preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)) {
echo $m[1] . " << | >> " . $m[2];
}
// => You can use any regular expression << | >> inside the lookahead

How do I check if a string is composed only of letters and numbers? (PHP) [duplicate]

This question already has answers here:
How to check, if a php string contains only english letters and digits?
(10 answers)
Closed 12 months ago.
Title says it all: I am checking to see if a user's username contains anything that isn't a number or letter, such as €{¥]^}+<€, punctuation, spaces or even things like âæłęč. Is this possible in php?
You can use the ctype_alnum() function in PHP.
From the manual..
Check for alphanumeric character(s)
Returns TRUE if every character in text is either a letter or a digit, FALSE otherwise.
var_dump(ctype_alnum("æøsads")); // false
var_dump(ctype_alnum("123asd")); // true
Live demo at https://3v4l.org/5etr7
PHP does REGEX
What you want to do is fairly trivial, PHP has a number of regex functions
Testing a String For a Character
If all you want is to know IF a string contains non-alphanumeric characters, then just use preg_match():
preg_match( '/[^A-Za-z0-9]*/', $userName );
This will return 1 if the username contains anything other than alphanumeric (A-Z or a-z or 0to9), it returns 0 if it doesn't contain a non-alphanumeric.
Regex Pattern Elements
Regex PCRE patterns open and close with a delimiter such as a slash/, and that needs to be treated like a string (quoted):'/myPattern/' Some other key features are:
[ brackets contain match sets ]
[a-z] // means match any lowercase letter
This pattern means check the current character in the $String relative to the pattern in these brackets, in this case match any lowercase letter a to z.
^ Caret (Meta-Character)
[^a-z] // means no lowercase letters If the caret ^ (aka hat) is the first character inside brackets, it NEGATES the pattern inside brackets so [^A7] means match anything EXCEPT uppercase A and the numeral 7. (Note: when outside brackets, the caret ^ means the start of the string.)
\w\W\d\D\s\S. Meta-Characters (WildCards)
\w // match all alphanumeric An escaped (i.e. preceded by a backslash \ ) lowercase w means match any "word" character, i.e. alphanumeric and the underscore _, this is shorthand for [A-Za-z0-9_]. The uppercase \W is the NOT word character, equivalent to [^A-Za-z0-9_] or [^\w]
. // (dot) match ANY single character except return/newline
\w // match any word character [A-Za-z0-9_]
\W // NOT any word character [^A-Za-z0-9_]
\d // match any digit [0-9]
\D // NOT any digit [^0-9].
\s // match any whitespace (tab, space, newline)
\S // NOT any whitespace
.*+?| Meta-Characters (Quantifiers))
These modify the behavior outside of a set []
* // match previous character or [set] zero or more times,
// so .* means match everything (including nothing) until reaching a return/newline.
+ // match previous at least one or more times.
? // match previous only zero or one time (i.e. optional).
| // means logical OR eg.: com|net means match either literal "com" or "net"
Not shown: capture groups, backreferences, substitution (the real power of regex). See https://www.phpliveregex.com/#tab-preg-match for more including a live pattern-match playground that is based on the PHP functions, and delivers results as arrays.
Back To Your StringCleaning
So for your pattern, to match all non-letters and numbers (including underscores) you need either: '/[^A-Za-z0-9]*/' or '/[\W_]*/'
Strip Search
If instead you want to STRIP all the non-alpha characters from a string then use preg_replace( $Regex, $Replacement, $StringToClean )
<?php
$username = 'Svéñ déGööfinøff';
echo preg_replace('/[\W_]*/', '', $username);
?>
The output is: SvdGfinff If you'd prefer to replace certain accented letters with standard latin ones to keep the names reasonably readable, then I believe you'd need a lookup table (array). There is one ready to use at the PHP site

regex capture certain characters only

currently dealing with a bit of a problem. this is my string "all-days"
im in need of some assistance to creating a regex to capture the first character, the dash and also the first character after the dash. Im a bit of a newbie to Regex so forgive me.
Here is what ive got so far. (^.)
capture the first character, the dash and also the first
character after the dash
With preg_match function:
$s = "all-days";
preg_match('/^(.)[^-]*(-)(.)/', $s, $m);
unset($m[0]);
print_r($m);
The output:
Array
(
[1] => a
[2] => -
[3] => d
)
Its not regex but If you want just a solution as you want by other way it can be achieve by explode, array_walk and implode
$string = 'all-days-with-my-style';
$arr = explode("-",$string);
$new = array_walk($arr,function(&$a){
$a = $a[0];
});
echo implode("-",$arr);
Live demo : https://eval.in/882846
Output is : a-d-w-m-s
I assume your string only contains word characters and hyphens, and doesn't have consecutive hyphens:
To remove all that isn't the first character the hyphens and the first character after them, remove all that isn't after a word boundary:
$result = preg_replace('~\B\w+~', '', 'all-days');
If you only want to match these characters, just catch each character after a word boundary:
if ( preg_match_all('~\b.~', 'all-days', $matches) )
print_r($matches[0]);
Code
See code in use here
\b(\w|-\b)
For more precision, the following can be used (note that it uses Unicode groups, so it doesn't work in every language, but it does in PHP). This will only match letters, not numbers and underscores. It uses a negative lookbehind and positive lookahead, but you can understand it if you keep reading this article and break it apart one piece at a time.
(\b\p{L}|(?<=\p{L})-(?=\p{L}))
Explanation
\b Assert position at a word boundary
(\w|-\b) Capture the following into capture group 1
\w Match any word character
| Or
- Match the - character literally
\b Assert position at a word boundary
\b:
Asserts the position in the string matches 1 of the following:
^\w Assert position at the start of the string and match a word character
\w$ Match a word character and assert its position as the last position in the string
\W\w Match any non-word character, followed by a word character
\w\W Match any word character, followed by a non-word character
\w:
Means a word character (usually defined by any character in the set a-zA-Z0-9_, however, some languages also accept Unicode characters that represent any letter, number, or underscore \p{L}\p{N}_).
For more precision (depending on the use-case), you can specify [a-zA-Z] (for ASCII letters), \p{L} for Unicode letters, or [a-z] with the i flag for ASCII characters with the case-insensitive flag enabled in regex.

Can you explain/simplify this regular expression (PCRE) in PHP?

preg_match('/.*MyString[ (\/]*([a-z0-9\.\-]*)/i', $contents, $matches);
I need to debug this one. I have a good idea of what it's doing but since I was never an expert at regular expressions I need your help.
Can you tell me what it does block by block (so I can learn)?
Does the syntax can be simplified (I think there is no need to escape the dot with a slash)?
The regexp...
'/.*MyString[ (\/]*([a-z0-9\.\-]*)/i'
.* matches any character zero or more times
MyString matches that string. But you are using case insensitive matching so the matched string will spell "mystring" by but with any capitalization
EDIT: (Thanks to Alan Moore) [ (\/]*. This matches any of the chars space ( or / repeated zero of more times. As Alan points out the final escape of / is to stop the / being treated as a regexp delimeter.
EDIT: The ( does not need escaping and neither does the . (thanks AlexV) because:
All non-alphanumeric characters other than \, -, ^ (at the start) and
the terminating ] are non-special in character classes, but it does no
harm if they are escaped.
-- http://www.php.net/manual/en/regexp.reference.character-classes.php
The hyphen, generally does need to be escaped, otherwise it will try to define a range. For example:
[A-Z] // matches all upper case letters of the aphabet
[A\-Z] // matches 'A', '-', and 'Z'
However, where the hyphen is at the end of the list you can get away with not escaping it (but always best to be in the habit of escaping it... I got caught out by this].
([a-z0-9\.\-]*) matches any string containing the characters a through z (note again this is effected by the case insensitive match), 0 through 9, a dot, a hyphen, repeated zero of more times. The surrounding () capture this string. This means that $matches[1] will contain the string matches by [a-z0-9\.\-]*. The brackets () tell preg_match to "capture" this string.
e.g.
<?php
$input = "aslghklfjMyString(james321-james.org)blahblahblah";
preg_match('/.*MyString[ (\/]*([a-z0-9.\-]*)/i', $input, $matches);
print_r($matches);
?>
outputs
Array
(
[0] => aslghklfjMyString(james321-james.org
[1] => james321-james.org
)
Note that because you use a case insensitive match...
$input = "aslghklfjmYsTrInG(james321898-james.org)blahblahblah";
Will also match and give the same answer in $matches[1]
Hope this helps....
Let's break this down step-by step, removing the explained parts from the expression.
"/.*MyString[ (\/]*([a-z0-9\.\-]*)/i"
Let's first strip the regex delimiters (/i at the end means it's case-insensitive):
".*MyString[ (\/]*([a-z0-9\.\-]*)"
Then we've got a wildcard lookahead (search for any symbol any number of times until we match the next statement.
"MyString[ (\/]*([a-z0-9\.\-]*)"
Then match 'MyString' literally, followed by any number (note the '*') of any of the following: ' ', '(', '/'. This is probably the error zone, you need to escape that '('. Try [ (/].
"([a-z0-9\.\-]*)"
Then we get a capture group for any number of any of the following: a-z literals, 0-9 digits, '.', or '-'.
That's pretty much all of it.

Categories