regex back-reference not working in PHP PCRE - php

I want to match matching tags like <tag>...</tag>. I tried the regex
~<([^>]+)>.*?</\1>~
but this fails. The expression worked when I used the exact text inside the angle brackets, i.e,
~<(tag)>.*?</tag>~
works, but even
~<(tag)>.*?</\1>~
fails.
I'm assuming that the back reference is not working here.
Can someone help me out please. Thanks
PS: I'm not using this to parse HTML. I know I shouldn't.

You didn't show your PHP code, but I surmise you have your regex in double quotes. If so then the backreference \1 actually is converted into an ASCII character ☺ before it reaches PCRE. (All \123 sequences are interpreted as C-string octal escapes there.)

It worked for me...
$str = '<a></a>';
var_dump(preg_match('~<([^>]+)>.*?</\1>~', $str)); // int(1)
CodePad.
Also, have you considered an XML parser? Otherwise it won't like a piece of HTML like this...
<a title="Is 4 > 6?"></a>
CodePad.

Apart from the fact that it's not always a good idea to try and match markup languages using regex, your regex looks OK. Maybe you're using it wrong?
if (preg_match('~<([^>]+)>.*?</\1>~', $subject, $regs)) {
$result = $regs[0];
} else {
$result = "";
}
should work.

Use single quotes in the pattern
preg_match_all('/(sens|respons)e and \1ibility/', "sense and sensibility", $matches);
print_r($matches);

Related

PHP regex with quotes

I want to match all href values in my page content. I wrote regex for that and tested it on regex101
href[ ]*=[ ]*("|')(.+?)\1
This finds all my href values properly. If I use
href[ ]*=[ ]*(?:"|')(.+?)(?:"|')
its even better since I do not have to use certain group later.
With " and ' in regex string I cannot run the regex properly with
$matches = array();
$pattern = "/href[ ]*=[ ]*("|')(.+?)\1/"; // syntax error
$numOfMatches = preg_match_all($pattern, $pattern, $matches);
print_r($matches);
If I "escape" double quote and thus repair the syntax error I get no matches.
So - what is the correct way to apply the given regex in PHP?
Thanks for any help
Notes:
addslashes or preg_quote won't help since I need to pass legit string first
escaping all the special chars \ + * ? [ ^ ] $ ( ) { } = ! < > | : - didn't help either
EDIT: Ok, I see I really shouldn't be doing this with regex. Could you please provide some helpful DOM parsers or any other tool I 'should' use with PHP for instance ?
For your case, the following should work:
/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
Given the nature of the WWW there are always going to be cases where the regular expression breaks down. Small changes to the patterns can fix these.
spaces around the = after href:
/<a\s[^>]*href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
matching only links starting with http:
/<a\s[^>]*href=(\"??)(http[^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
single quotes around the link address:
/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
Source
I had to use this regex to make it work. Next time I will definitely try with DOM parser :)
$regexForHREF = "/href[ ]*=[ ]*(?:\"|')(.+?)(?:\"|')/";

Need to match multiple backslahes in a PHP regex

I'm trying to match 4 backslahes using preg_match.
preg_match('/\\\\\\\\/',$subject) works fine, but preg_match('/\\{4}/',$subject) doesn't.
Perhaps I'm using {} incorrectly. Could anyone advise?
Ok I got it: Two backslashes mean you want one backslash in your string: So for the regex it looks like this: /\{4}/ Which means you want to escape the {
What you need here is:
preg_match('/\\\\{4}/', $subject);
That looks for the regex like this: /\\{4}/ and works properly.
Regex is the wrong tool when you're looking for a literal string.
strpos($subject, str_repeat("\\",4)) !== false
Use this:
preg_match('/(?:\\\\){2}/',$subject, $m);
It matches 4 backslashes.

preg_match dose not work correctly in php

all,
I am using preg_match to filter some data, and it is strange that, it dose not work correctly. I am new to regex, and I used a php live regex website to check my regex, which works correctly. So I have no idea what is wrong here.
I would like to have preg_match to find something like "a\_b" in the $string:
$string="aaa\_bbb:ccc"
if(preg_match("/[a-zA-Z]\\_[a-zA-Z]/", $string)){
$snew = str_replace('\_', "_", $string);
}
But it is strange that even I have a $string like in this example above, the result of preg_match is 0. But when I change it to
preg_match("/\\_[a-zA-Z]/", $string)
It works fine and return 1. But of course that is not what I want. Any idea?
Thanks very much~
You don't really need the preg_match at all, from what I can see.
However the problem you're having with it is to do with escaping.
You have this: "/[a-zA-Z]\\_[a-zA-Z]/"
You've correctly identified that the backslash needs to be escaped, however, you've missed a subtle issue:
Regular expressions in PHP are strings. This means that you need to escape it as a string as well as a regular expression. In effect, this means that to correctly escape a backslash so it is matched as an actual backslash character in your pattern, you actually need to have four backslashes.
"/[a-zA-Z]\\\\_[a-zA-Z]/"
It's not pretty, but that's how it is.
Hope that helps.
use:
if(preg_match("/[a-zA-Z]\\\\_[a-zA-Z]/", $string))
instead
You don't need the preg_match altogether, instead just do a replace using this regex:
/([a-zA-Z])\\\\_([a-zA-Z])/
and then replace with $1_$2, like this:
$result = preg_replace("/([a-zA-Z])\\\\_([a-zA-Z])/", "$1_$2$, $string);

preg_replace returns unexpected results to $1

<?php
$data='123
[test=abc]cba[/test]
321';
$test = preg_replace("(\[test=(.+?)\](.+?)\[\/test\])is","$1",$data);
echo $test;
?>
I expect the above code to return
abc
but instead of returning abc it returns
123 abc 321
Please tell me what I am doing wrong.
You're only replacing the matched part (the BBcode section). You're leaving the rest of the string untouched.
If you also want to remove the leading/trailing text, include those in the expression:
$test = preg_replace("(.*\[test=(.+?)\](.+?)\[\/test\].*)is","$1",$data);
I don't know if you're aware of this, but the outermost set of parentheses in your regex does not form a group (capturing or otherwise). PHP is interpreting them as regex delimiters. If you are aware of that, and you're using them as delimiters on purpose, please don't. It's usually best to use a non-bracketing character that never has any special meaning in regexes (~, %, #, etc.).
I agree with Casimir that preg_match() is the tool you should be using, not preg_replace(). But his solution is trickier than it needs to be. Your original regex works fine; all you have to do is grab the contents of the first capturing group, like so:
if (preg_match('%\[test=(.+?)\](.+?)\[/test\]%si', $data, $match)) {
$test = $match[1];
}
You don't need to use a replace here, all that you need is to take something in the string. To do that preg_match is more useful:
$data='123
[test=abc]cba[/test]
321';
$test = preg_match('~\[test=\K[^\]]++~', $data, $match);
echo $match[0];

issues with preg_match PHP

I have a string:
$str="(94896)content is here(/94896)(94897)content is here(/94897)(94898)content is here(/94898)(94899)content is here(/94899)";
the (number) and (/number) act as tags to take certain content out of the string.
and I have a preg_match to take the content out:
if(preg_match('/(94896)\"(.*)\"(\/94896)/',$str,$c)) {echo "I found the content, its:".$co[1];}
Now for some reason, it doesn't find a match in the string ($str), though its clearly there....
Any ideas on what im doing wrong here?
You need to take the double-quotes out of your regex string, since they don't appear in $str, but are expected by the regex.
'/(94896)\"(.*)\"(\/94896)/'
// ^^ ^^
// These aren't in the string.
EDIT: I think you'll also need to escape your brackets, since they will be getting read as grouping operators, not actual brackets.
Your expression should be:
'/\(94896\)(.*)\(\/94896\)/'
Parentheses are used in a regex to denote subpatterns. If you want to search these characters in a string, you must escape them:
preg_match('/\(94896\)(.*)\(\/94896\)/',$str,$c)
If the pattern is found:
echo "I found the content, its:".$c[0];
Oh, and as Karl Nicoll says, why are the quotations in your pattern?
To match all content:
$str="(94896)content is here(/94896)(94897)content is here(/94897)(94898)content is here(/94898)(94899)content is here(/94899)";
$re = '/\((\d+)\)(.*)\(\/\1\)/';
preg_match_all($re, $str, $matches,PREG_SET_ORDER);
var_dump($matches);
Number will be in $matches[*][1], content in $matches[*][2].

Categories