Can you explain/simplify this regular expression (PCRE) in PHP? - php

preg_match('/.*MyString[ (\/]*([a-z0-9\.\-]*)/i', $contents, $matches);
I need to debug this one. I have a good idea of what it's doing but since I was never an expert at regular expressions I need your help.
Can you tell me what it does block by block (so I can learn)?
Can the syntax be simplified (I think there is no need to escape the dot with a backslash)?

The regexp...
'/.*MyString[ (\/]*([a-z0-9\.\-]*)/i'
.* matches any character zero or more times
MyString matches that string. But because you are using case-insensitive matching, the matched string will spell "mystring" with any capitalization.
EDIT: (Thanks to Alan Moore) [ (\/]*. This matches any of the characters space, ( or /, repeated zero or more times. As Alan points out, the escape before the / stops it from being treated as the regexp delimiter.
EDIT: The ( does not need escaping and neither does the . (thanks AlexV) because:
All non-alphanumeric characters other than \, -, ^ (at the start) and
the terminating ] are non-special in character classes, but it does no
harm if they are escaped.
-- http://www.php.net/manual/en/regexp.reference.character-classes.php
The hyphen generally does need to be escaped, otherwise it will try to define a range. For example:
[A-Z] // matches all upper case letters of the alphabet
[A\-Z] // matches 'A', '-', and 'Z'
However, where the hyphen is at the end of the list you can get away with not escaping it (but it is always best to be in the habit of escaping it... I got caught out by this).
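The difference between a range and an escaped hyphen is easy to see by running both classes against the same input. This is only a minimal sketch; the sample string is made up for illustration and is not from the question.

```php
<?php
// Hypothetical sample input, chosen to contain letters and hyphens.
$sample = 'A-M-Z';

// [A-Z] is a range: every upper-case letter matches, the hyphens do not.
preg_match_all('/[A-Z]/', $sample, $range);

// [A\-Z] is a list of three literals: 'A', '-' and 'Z' ('M' is not in it).
preg_match_all('/[A\-Z]/', $sample, $literal);

echo implode('', $range[0]), "\n";   // AMZ
echo implode('', $literal[0]), "\n"; // A--Z
```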
([a-z0-9\.\-]*) matches any string containing the characters a through z (note again that this is affected by the case-insensitive match), 0 through 9, a dot, or a hyphen, repeated zero or more times. The surrounding () tell preg_match to "capture" this string, which means that $matches[1] will contain the string matched by [a-z0-9\.\-]*.
e.g.
<?php
$input = "aslghklfjMyString(james321-james.org)blahblahblah";
preg_match('/.*MyString[ (\/]*([a-z0-9.\-]*)/i', $input, $matches);
print_r($matches);
?>
outputs
Array
(
[0] => aslghklfjMyString(james321-james.org
[1] => james321-james.org
)
Note that because you use a case insensitive match...
$input = "aslghklfjmYsTrInG(james321898-james.org)blahblahblah";
Will also match and give the same answer in $matches[1]
Hope this helps....
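On the simplification question, one possible cleanup is sketched here. It assumes you are free to change the delimiter: with ~ instead of /, the slash inside the class no longer needs escaping, and the dot and a trailing hyphen are literal inside a character class anyway. The variable names $original and $simplified are only for this illustration.

```php
<?php
$input = "aslghklfjMyString(james321-james.org)blahblahblah";

// Original pattern, with '/' as the delimiter.
preg_match('/.*MyString[ (\/]*([a-z0-9\.\-]*)/i', $input, $original);

// Same pattern with '~' as the delimiter: the '/' needs no escape,
// the dot is literal inside a class, and a trailing hyphen is literal too.
preg_match('~.*MyString[ (/]*([a-z0-9.-]*)~i', $input, $simplified);

var_dump($original[1] === $simplified[1]); // bool(true)
echo $simplified[1], "\n";                 // james321-james.org
```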

Let's break this down step by step, removing the explained parts from the expression.
"/.*MyString[ (\/]*([a-z0-9\.\-]*)/i"
Let's first strip the regex delimiters (/i at the end means it's case-insensitive):
".*MyString[ (\/]*([a-z0-9\.\-]*)"
Then we've got a greedy wildcard, not a lookahead: match any character any number of times, then backtrack until the next part of the pattern matches.
"MyString[ (\/]*([a-z0-9\.\-]*)"
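Then match 'MyString' literally, followed by any number (note the '*') of any of the following: ' ', '(', '/'. Inside a character class the '(' is literal and does not need escaping; only the '/' must be escaped, because it is also the delimiter. If you prefer to escape both anyway, [ \(\/] is equivalent.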
"([a-z0-9\.\-]*)"
Then we get a capture group for any number of any of the following: a-z literals, 0-9 digits, '.', or '-'.
That's pretty much all of it.

Related

match the following pattern

I have a lot of (p. #) like (p. 13) (p. 234) in a string and I want to remove them. I used the following pattern to match but it doesn't work
preg_replace('/\(p\.*\)/','',$string);
\( to escape (
p is p
\. to escape .
I need some help here. Thank you.
This is the regular expression you're looking for:
\(p\.\s+\d+\)
Or, in your code:
preg_replace('/\(p\.\s+\d+\)/', '', $string);
You have nothing in your regexp to match the page number. You're just matching something like (p.......).
preg_replace('/\(p\.\s*\d+\s*\)/', '', $string);
The * means to match 0 or more of the preceding item, so it would be working on the . character in your example. Perhaps you might need something like:
preg_replace('/\(p\. ?[0-9]+\)/','',$string);
This matches:
(p.
Then a space, which the ? makes optional
Then one or more digits 0-9, due to the +
Then )
Hope this helps
First of all, preg_replace is a function that returns the new string rather than modifying its argument. If you want to see any changes, you need to write:
$str = preg_replace($pattern, $replacement, $str);
In your pattern you wrote \.*, which means a literal . zero or more times; I assume that is not what you want. I assume you wanted to write \..*, a literal . followed by zero or more characters. But this doesn't work either, for two reasons: it doesn't check that the characters are digits, and since the * quantifier is greedy, .* will match all the characters until the end of the line and then backtrack to the last closing parenthesis.
The best approach is probably Barmar's pattern, which checks that there is at least one digit (using the + quantifier), or the same pattern with a stricter constraint (exactly one space instead of a variable number of spaces):
/\(p\. \d+\)/
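Putting the points from these answers together, a minimal sketch might look like the following. The sample sentence is invented for illustration, and the leading \s* is an addition of this sketch (not from the answers) so that the space in front of each reference is removed too.

```php
<?php
// Made-up sample text containing page references.
$string = 'See the intro (p. 13) and the appendix (p. 234) for details.';

// \s*    optional whitespace before the reference (assumption of this sketch)
// \(p\.  literal "(p."
// \s*    optional whitespace
// \d+    one or more digits (the page number)
// \)     literal ")"
// The result must be assigned: preg_replace returns the new string.
$cleaned = preg_replace('/\s*\(p\.\s*\d+\)/', '', $string);

echo $cleaned, "\n"; // See the intro and the appendix for details.
```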

Find a pattern in string

I have a string like this:
{param1}{param2}{param3}....{myparam paramvalue}{paramn}
How can I get the paramvalue of myparam?
Simple regex:
/\{([^ ]+?) ([^}]+?)\}/
([^ ]+?) - matches one or more characters other than a space (lazily) and puts them in a subpattern
([^}]+?) - matches one or more characters other than } (lazily) and puts them in a subpattern
use it with preg_match() function
OR
The other simple regex:
preg_match('/([a-z0-9]+?) ([a-z0-9]+?)\}/', $str, $matches);
([a-z0-9]+?) - a-z or 0-9, at least once, not greedy (the parameter name)
([a-z0-9]+?) - a-z or 0-9, at least once, not greedy (the parameter value)
Output:
Array ( [0] => myparam paramvalue} [1] => myparam [2] => paramvalue )
To specifically get that parameter value, you first have to match the left part:
/\{myparam/
Followed by at least one space:
/\{myparam\s+/
Capture characters until a closing curly brace is found:
/\{myparam\s+([^}]+)\}/
The expression [^}]+ is a negative character set, indicated by the ^ just after the opening bracket; it means "match all characters except these".
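As a quick check, the finished pattern can be run against the string from the question. This is only a sketch; the variable names are chosen for illustration.

```php
<?php
$input = '{param1}{param2}{param3}....{myparam paramvalue}{paramn}';

// \{myparam  the literal left part
// \s+        at least one whitespace character
// ([^}]+)    capture everything up to, but not including, the closing brace
// \}         the closing brace itself
if (preg_match('/\{myparam\s+([^}]+)\}/', $input, $matches)) {
    echo $matches[1], "\n"; // paramvalue
} else {
    echo "myparam not found\n";
}
```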
Try with this regex:
/\{\w+\s+(\w+)\}/
if (preg_match('/\{' . preg_quote('myparam', '/') . ' ([^\}]+)\}/', $input, $matches)) {
    echo "myparam=" . $matches[1];
} else {
    echo "myparam not found";
}
In preg_match, '{' and '}' are special characters (they delimit quantifiers), so it is safest to escape them.
The preg_quote may not be necessary, as long as "myparam" never contains any special regex characters.
The (cryptic) part ([^\}]+)\} matches one or more characters that are not '}', followed by a '}'.
The parentheses make that match available in the third argument to preg_match, $matches in this case.
You can try this one as well:
.+?\s+([^}]+)
EDIT
Explanation:
.+? means match any character one or more times, but lazily: it will prefer to match as little as it can.
\s+ means match whitespace one or more times.
([^}]+) means match anything except `}` (closing brace) one or more times and capture it in a group.

PHP Regex Not Matching Desired Substrings

I've written the following regular expression
$pattern = "~\d+[.][\s]*[A-Z]{1}[A-Za-z0-9\s-']+~";
in order to match substrings such as 2.bon jovi - it's my life
The problem is that the only part that is recognized is - bon jovi
Neither " - " nor " ' " is recognized by this regular expression.
I'd prefer to know what is wrong with the regular expression that I've written rather than getting a new one.
Your regular expression states that after the period character ([.] can be changed to \.), you will have zero or more whitespace characters, which should then be followed by 1 upper case letter. In your string, you do not have any upper case letters.
Secondly, the - should be placed last when you want to match it. So, changing your regex to this: ~\d+[.][\s]*[A-Z]{1}[A-Za-z0-9\s'-]+~ will match something like so: 2.Bon jovi - it's my life.
On the other hand, you can change it to this: ~\d+[.][\s]*[A-Za-z0-9\s'-]+~ to match something like so: 2.bon jovi - it's my life.
EDIT: Amended as per the comments of Marko D and aleation.
A better regular expression to handle that would be...
$pattern = "~\d+\.\s*[\pL\pP\s]+~";
This will match a number, followed by a ., followed by optional whitespace, followed by one or more Unicode letters, whitespace or punctuation marks.
$pattern = "~\d+\..*~";
$string = "2.bon jovi - it's my life";
preg_match($pattern, $string, $match);
print_r($match);
output: Array ( [0] => 2.bon jovi - it's my life )
So the way I understand this regular expression is:
\d+ // Match any digit, 1 or more times
[.] // Match a dot
[\s]* // Match 0 or more whitespace characters
[A-Z]{1} // Match characters between an UPPERCASE A-Z Range 1 time
[A-Za-z0-9\s-']+ // Match 1 or more characters from A-Z, a-z, 0-9, whitespace, dash and apostrophe
So straight away, your 'bon jovi' might not get matched as it's lower case and you're only looking for uppercase characters. 'bon jovi' also contains a space so perhaps changing that part of the regular expression to allow for lowercase characters and whitespace might help so you'd end up with:
$pattern = "~\d+[.][\s]*[A-Za-z\s]{1}[A-Za-z0-9\s-']+~";
Note: I quickly tested this on RegExr ( http://gskinner.com/RegExr/ ) and it appeared to match the string fine.
Your regex is as follows.
~ // delimiter
\d+ // 1 or more numbers
[.] // a period
[\s]* // 0 or more whitespace characters
[A-Z]{1} // 1 upper case letter
[A-Za-z0-9\s-\']+ // 1 or more characters, from the character class
~ //delimiter
Comparing that to the string "2.bon jovi" You have:
~ //
\d+ // "2"
[.] // "."
[\s]* // ""
[A-Z]{1} // <- NO MATCH
[A-Za-z0-9\s-\']+ //
~ //
"bon" does not start with a capital letter, so it does not match [A-Z]{1}.
Cleaner regex
There are a few simple things you can do to clean up your regex
don't use character-classes for one character
don't specify {1} it's the same as not being present
Applying the above to your existing regex you get:
$pattern = "~\d+\.\s*[A-Z][A-Za-z0-9\s-']+~";
Which is slightly easier to read.
Your [A-Z]{1} sub-pattern requires one capital letter, so "2.bon jovi - it's my life" will not match.
And you need to escape the - in the [A-Za-z0-9\s-'] character class, or put it at the start or end, otherwise it is specifying a range.
"~\d+\.[A-Za-z0-9\s'-]+~"
As pointed out in the comments, it is actually not necessary to escape the - in the character class in your regex. That is only because you happened to precede it with a metacharacter \s that cannot be part of a range. Normally, if you want to match a literal - and you have it in a character class, you must escape it or position it as described above.
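To see both diagnoses side by side, the sketch below runs a strict and a relaxed variant against the string from the question. Both variants place the hyphen last in the class, as the answers recommend; the variable names $strict and $relaxed are only for this illustration.

```php
<?php
$string = "2.bon jovi - it's my life";

// Strict: still requires one upper-case letter right after the dot.
$strict = "~\d+\.\s*[A-Z][A-Za-z0-9\s'-]+~";
// Relaxed: no upper-case requirement; hyphen last, so it is a literal.
$relaxed = "~\d+\.\s*[A-Za-z0-9\s'-]+~";

var_dump(preg_match($strict, $string)); // int(0): "b" is not in [A-Z]

preg_match($relaxed, $string, $m);
echo $m[0], "\n"; // 2.bon jovi - it's my life
```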

Regex to remove ALL single characters from a string

I need a Regular Expression to remove ALL single characters from a string, not just single letters or numbers
The string is:
"A Future Ft Casino Karate Chop ( Prod By Metro )"
it should come out as:
"Future Ft Casino Karate Chop Prod By Metro"
The expression I am using at the moment (in PHP), correctly removes the single 'A' but leaves the single '(' and ')'
This is the code I am using:
$string = preg_replace('/\b\w\b\s?/', '', $string);
Try this:
(^| ).( |$)
Breakdown:
1. (^| ) -> Beginning of line or space
2. . -> Any character
3. ( |$) -> Space or End of line
Actual code:
$string = preg_replace('/(^| ).( |$)/', '$1', $string);
Note: I'm not familiar with the workings of PHP regex, so the code might need a slight tweak depending on how the actual regex needs to be declared.
As m.buettner pointed out, there will be a trailing white space here with this code. A trim would be needed to clear it out.
Edit: Arnis Juraga pointed out that this would not clear out multiple consecutive single characters: a b c would filter out to b. If this is an issue, use this regex:
(^| ).(( ).)*( |$)
The (( ).)* added to the middle will look for any space followed by any character, 0 or more times. The downside is this will end up with double spaces where a series of single characters were located.
Meaning this:
The a b c dog
Will become this:
The dog
After performing the replacement to get single individual characters, you would need to use the following regex to locate the double spaces, then replace with a single space
( ){2}
A slightly more efficient version that does not require capturing would be using lookarounds. It's a bit less intuitive due to the multiple negative logic:
$string = preg_replace('/(?<!\S).(?!\S)\s*/', '', $input);
This will remove any character that is neither preceded nor followed by a non-whitespace character (so only those that are between whitespace or at the string boundaries). It will also include all trailing whitespace in the match, so as to leave only the preceding whitespace if there is any. The caveat is that, just like in Nick's answer, the ) at the end of the string will leave a trailing whitespace (because the whitespace is in front of the character). This can easily be solved by trimming the string.
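A short sketch of the lookaround version combined with the suggested trim, run on the string from the question:

```php
<?php
$input = 'A Future Ft Casino Karate Chop ( Prod By Metro )';

// Remove any single character that has no non-whitespace neighbour,
// together with the whitespace that follows it.
$string = preg_replace('/(?<!\S).(?!\S)\s*/', '', $input);

// The ")" at the end leaves the space in front of it behind, so trim.
$string = trim($string);

echo $string, "\n"; // Future Ft Casino Karate Chop Prod By Metro
```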

how to extract a certain digit from a String using regular expression in php?

I have a String (filename): s_113_2.3gp
How can I extract the number that appears after the second underscore? In this case it's '2', but in some cases it can be a number with several digits.
Also the number of digits that appears after the first underscore can vary so the length of this String is not constant.
You can use a capturing group:
preg_match('/_(\d+)\.\w+$/', $str, $matches);
$number = $matches[1];
\d+ represents 1 or more digits. The parentheses around that capture it, so you can later retrieve it with $matches[1]. The . needs to be escaped, because otherwise it would match any character but line breaks. \w+ matches 1 or more word characters (digits, letters, underscores). And finally the $ represents the end of the string and "anchors" the regular expression (otherwise you would get problems with strings containing multiple .).
This also allows for arbitrary file extensions.
As Ωmega pointed out below there is another possibility, that does not use a capturing group. With the concept of lookarounds, you can avoid matching _ at the start and the \.\w+$ at the end:
preg_match('/(?<=_)\d+(?=\.\w+$)/', $str, $matches);
$number = $matches[0];
However, I would recommend profiling, before applying this rather small optimization. But it is something to keep in mind (or rather, to read up on!).
Using regex lookaround it is very short code:
$n = preg_match('/(?<=_)\d+(?=\.)/', $str, $m) ? $m[0] : "";
...which reads: find one or more digits \d+ that are between underscore (?<=_) and period (?=\.)
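The lookaround approach can be wrapped in a small function for reuse. This is a sketch only: the function name extractTrailingNumber and the second and third filenames are hypothetical, and the lookahead keeps the end-of-string anchor from the earlier answer so that names with several dots behave predictably.

```php
<?php
// Hypothetical helper for illustration; not part of the original answers.
function extractTrailingNumber(string $filename): ?string
{
    // Digits preceded by "_" and followed by ".extension" at the end of the string.
    return preg_match('/(?<=_)\d+(?=\.\w+$)/', $filename, $m) ? $m[0] : null;
}

echo extractTrailingNumber('s_113_2.3gp'), "\n";  // 2
echo extractTrailingNumber('s_7_4215.mp4'), "\n"; // 4215
var_dump(extractTrailingNumber('no-number.txt')); // NULL
```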
