Regular expressions - Reference the first match in a search - php

I don't quite know how to describe my problem in a short title, so I am sorry if the title for this question is a bit misleading.
But I really don't know what the thing I am looking for is called or if it is even possible.
I am trying to use a regular expression to find everything between a set of matching tags in HTML.
This was easy for me when I was testing with static tags because I could just search for everything in between two pieces of text such as \{myTag\}(someExpression)\{\/myTag\}
My problem comes with the fact that 'myTag' could be anything.
I just don't know how (or if it is even possible) to match the starting tag with the ending tag when that text is variable.
I thought I had seen some kind of referencing system in regular expressions before where you can use the dollar sign and a number, but I don't know if you can use this within the search itself.
I originally thought that perhaps I could write something like: \{(.*?)\}(someExpression)\{\/${1}\} but I have no idea if this would actually work or if it is possible (let alone if it is correct).
I hope this question makes sense as I'm not really sure how to ask it.
Mainly because, like I said, I don't know if this has a name or if it is possible, and I am also a total beginner at regular expressions.
And if it makes any difference, the language I am doing this in is PHP, with the preg_replace_callback function.
Any help would be greatly appreciated.

Try this:
\{([^}]*)\}(someExpression)\{\/\1\}
but be aware that you need to make sure someExpression doesn't match ending tags as well (like for example .* would). And of course, if tags are nested, then all bets are off, and you'll need a different regex (or a parser).
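
To make the backreference idea concrete, here is a minimal sketch using preg_replace_callback. The sample input, the lazy (.*?) standing in for someExpression, and the uppercase transformation are assumptions made purely for illustration; nested tags are not handled.

```php
<?php
// Hypothetical sample input; the tag names and text are made up for illustration.
$input = 'Hello {b}bold{/b} and {i}italic{/i}, but {b}mismatched{/i} stays.';

// ([^}]*) captures the tag name; \1 requires the same text in the closing tag.
// (.*?) is lazy, so it stops at the first closing tag that matches.
// "~" is used as the delimiter so the "/" in the closing tag needs no escaping.
$pattern = '~\{([^}]*)\}(.*?)\{/\1\}~s';

$output = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // $m[1] is the tag name, $m[2] is the text between the tags.
        return '<' . $m[1] . '>' . strtoupper($m[2]) . '</' . $m[1] . '>';
    },
    $input
);

echo $output, PHP_EOL;
// Hello <b>BOLD</b> and <i>ITALIC</i>, but {b}mismatched{/i} stays.
```

Inside the pattern the backreference is written \1; the $1 form belongs in replacement strings, not in the search itself.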

It kind of depends on your case. If you know it's just an HTML snippet and there is a specific pattern you can search the HTML for, then you can use a regex to find and replace the pattern, but it seems to me you are trying to parse the HTML. So the issue would be if you had a nested tag. You should check out http://php.net/manual/en/function.preg-replace.php because that seems like a much easier function to use than the one with the callback.
As a note about regular expression backreferences: you can use $i or \i depending on the language you are using. I don't know if PHP regex supports backreferences to capturing groups.


Convert JavaScript regular expression to PHP [duplicate]

I know this question has been asked about a dozen times, but this one is not technically a dupe (check the others if you like) ;)
Basically, I have a Javascript regex that checks email addresses which I use for front-end validation, and I use CodeIgniter to double check on the back end, in case the validation on the front end fails to run properly (browser issues, for instance.) It's QUITE a long regular expression, and I have no idea where to begin converting it by hand.
I'm pretty much looking for a tool that converts JS regexes to PHP regexes - I haven't found one in any of the answers to similar questions (of course, it's possible that such a tool doesn't exist.) Okay, I lied - one of them suggested a tool that costs $39.95, but I really don't want to spend that much to convert a single expression (and no, there isn't a free trial as suggested by the answer to the aforementioned question.)
Here's the Javascript expression, graciously provided by aSeptik:
/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))#((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/i
And the one used by CodeIgniter, which I don't want to use because it doesn't follow the same rules (disallows some valid addresses):
/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix
I want to use the same rules set by the Javascript regex in PHP.
Having this sort of inconsistency where my front-end code is saying that the email address is okay, and then Codeigniter says it isn't, is of course the behavior I'm trying to fix in my application.
Thanks for any and all tips! :D
There are some differences between the regex engines in JavaScript and PHP. Please check the Comparison of regular-expression engines article for theoretical information and the Difference between PHP regex and JavaScript regex answer for practical information.
Most of the time, you can use Javascript regex patterns in PHP with small modifications. As a fundamental difference, PHP regex is defined as a string (or in a string) like this:
preg_match('/^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/',$telephone);
A JavaScript regex is not; it has its own literal syntax:
var ptr = new RegExp(/^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/);
// or
var ptr = /^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/;
You can give it a try by running the regex in PHP. As a recommendation, do not replace it in the CodeIgniter files; you can simply extend or replace the native library. See Creating Libraries for more information.
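
As a rough sketch of what "small modifications" can mean in practice: the telephone pattern carries over unchanged once it is wrapped in a PHP string, while JavaScript's \uXXXX escapes have to be rewritten as \x{XXXX} with the u modifier. The sample values below are assumptions for illustration only.

```php
<?php
// The JavaScript literal /^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/ is written in PHP
// as a single-quoted string that still includes its delimiters.
$pattern = '/^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/';
$telephone = '(555) 123-4567'; // hypothetical sample value

if (preg_match($pattern, $telephone, $m) === 1) {
    echo $m[1], '-', $m[2], '-', $m[3], PHP_EOL; // 555-123-4567
}

// JavaScript's \uXXXX escapes become \x{XXXX} in PCRE and need the "u" modifier.
// JS original: /^[\u00A0-\uD7FF]+$/
$unicodePattern = '/^[\x{00A0}-\x{D7FF}]+$/u';
$matched = preg_match($unicodePattern, "\u{00E9}") === 1; // U+00E9, e with acute accent
var_dump($matched); // bool(true)
```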
I was able to solve this in a better-than-expected manner. I was unable to convert the Javascript regex that I wanted to use (even after purchasing RegexBuddy - it'll come in handy, but it was not able to produce a proper conversion), so I decided to go looking on the Regex Validate Email Address site to see if they had any recommendations anywhere for good regexes. That's when I found this:
"The expression with the best score is currently the one used by PHP's filter_var()":
/^(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){255,})(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){65,}#)(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22))(?:\.(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22)))*#(?:(?:(?!.*[^.]{64,})(?:(?:(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.){1,126}){1,}(?:(?:[a-z][a-z0-9]*)|(?:(?:xn--)[a-z0-9]+))(?:-[a-z0-9]+)*)|(?:\[(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){7})|(?:(?!(?:.*[a-f0-9][:\]]){7,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?)))|(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){5}:)|(?:(?!(?:.*[a-f0-9]:){5,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3}:)?)))?(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))(?:\.(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))){3}))\]))$/iD
It matches with only 4/86 errors, while the Javascript one I was using matches with 8/86 errors, so the PHP one is a little more accurate. So, I extended the CodeIgniter Form_validation library to instead use return filter_var($str, FILTER_VALIDATE_EMAIL);.
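
For reference, a minimal standalone sketch of the filter_var() check. The helper name is_valid_email is hypothetical, and the CodeIgniter Form_validation extension that would wrap it is not shown.

```php
<?php
// Hypothetical helper; in CodeIgniter this logic would sit inside an
// extended Form_validation rule rather than a free function.
function is_valid_email(string $str): bool
{
    // filter_var() returns the filtered string on success and false on failure.
    return filter_var($str, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(is_valid_email('user@example.com')); // bool(true)
var_dump(is_valid_email('not-an-email'));     // bool(false)
```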
...But does it work in Javascript?
var pattern = new RegExp(/^(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){255,})(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){65,}#)(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22))(?:\.(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22)))*#(?:(?:(?!.*[^.]{64,})(?:(?:(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.){1,126}){1,}(?:(?:[a-z][a-z0-9]*)|(?:(?:xn--)[a-z0-9]+))(?:-[a-z0-9]+)*)|(?:\[(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){7})|(?:(?!(?:.*[a-f0-9][:\]]){7,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?)))|(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){5}:)|(?:(?!(?:.*[a-f0-9]:){5,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3}:)?)))?(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))(?:\.(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))){3}))\]))$/i);
Zing! Works like a charm! Not only did I get the consistency I was looking for between front and back end validation, but I also got a more accurate regex in the process. Double win!
Thank you to all those who provided suggestions!
Today there is the site https://regex101.com/, where you can convert a JS regex to PHP or some other languages.

preg/regex in PHP

I'm having trouble with regex. I've never fully understood it. The real question is: does anybody know a good site that explains what each expression does, instead of just posting stuff like
$regexp = "/^[^0-9][A-z0-9_]+([.][A-z0-9_]+)*[@][A-z0-9_]+([.][A-z0-9_]+)*[.][A-z]{2,4}$/";
and then prattling off what that line as a whole will do, rather than what each expression will do? I've tried googling many different versions of preg_replace and regex tutorial, but they all seem to assume that we already know what stuff like \b[^>]* will do.
Secondary. The reason I am trying to do this:
I want to turn
<span style="color: #000000">*ANY NUMBER*</span>
into
<span style="color: #0000ff">*ANY NUMBER*</span>
A few variations that I have already tried; some just didn't work, some made the script crap out.
$data = preg_replace("/<span style=\"color: #000000\">([0-9])</span>/", "<span style=\"color: #FFCC00\">$1</span>", $data);//just tried to match atleast 0-9
$data = preg_replace("/<span style=\"color: #000000\"\b[^>]*>(.*?)</span>/", "<span style=\"color: #FFCC00\">$1</span>", $data);
$data = preg_replace("/<span style=\"color: #000000\"\b[^>]*>([0-9])</span>/", "<span style=\"color: #FFCC00\">$1</span>", $data);
The answer to this specific problem is not nearly as important to me as a site, so the check goes to that. I've tried a lot of different sites and I am pretty sure it's not above my comprehension; I just cannot find a good one among all the bad tutorial/example farms. The normal fallbacks of w3 and phpdotnet don't have what I need this time.
EDIT1 For those of you who end up in here looking for a similar answer:
$data = preg_replace("/<span style=\"color: #000000\">([0-9]{1,})<\/span>/", "<span style=\"color: #FFCC00\">$1</span>", $data);
Did what it needed to. Sadly, it was one of the first things I tried, but because I didn't put <\/span> instead of </span> it was not working. I do not know if "[0-9]{1,}" is the MOST appropriate way of matching any number (it tells the engine to match any digit 0-9 with [0-9], at least once and as many times as it can with {1,}), but it still fit the purpose.
ROY Finley posted:
http://www.macronimous.com/resources/writing_regular_expression_with_php.asp
It's a good site with a list of expression definitions and a good example workup below.
Also:
regular-expressions.info/tutorial.html was posted a few times. It's a slower, more in-depth walkthrough, but if you are stuck like I am, it's good.
Will pop in about regex101 and the parsers after I have a chance to play with them.
EDIT2
DWright posted a book link below, "Mastering Regular Expressions". If you look at regex and cannot make heads or tails of the convolution of characters, it is DEFINITELY worth picking up. It took about an hour and a half to read about half, but that is no time compared to the hours spent on Google and the messy workarounds used to avoid it.
Also, the HTML parser linked below would be right for this particular problem.
To have a regex explained, you can have a look at Regex101. To actually learn regular expressions (which I recommend), this is a pretty good, in-depth tutorial. After you have read that, the PCRE documentation on PHP.net shouldn't seem too arcane any more, and reading it will help you get your head around some specific differences for PHP.
However, for the problem at hand, you shouldn't actually be using regex at all. A DOM parser is the way to go. Here is a very convenient third-party one, and this is what PHP brings along itself. As mentioned by hakre, here is a more extensive list of libraries available for this purpose.
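
As a rough illustration of the DOM route, the sketch below uses PHP's bundled DOMDocument to recolor only those spans whose text is a number. The sample markup and the exact style string being compared are assumptions; real pages may format the style attribute differently.

```php
<?php
// Hypothetical sample markup with a single root element.
$html = '<p><span style="color: #000000">42</span> and <span style="color: #000000">text</span></p>';

$doc = new DOMDocument();
// These flags keep DOMDocument from wrapping the fragment in <html>/<body> and a doctype.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($doc->getElementsByTagName('span') as $span) {
    $isBlack  = $span->getAttribute('style') === 'color: #000000';
    $isNumber = preg_match('/^\d+$/', trim($span->textContent)) === 1;
    if ($isBlack && $isNumber) {
        $span->setAttribute('style', 'color: #0000ff');
    }
}

$result = $doc->saveHTML();
echo $result;
```

Only the first span changes here, because the second one does not contain a number.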
Another general recommendation for regexes in PHP: use single quotes '/pattern/', because double quotes cause a lot of trouble with escape sequences (you need to double some backslashes otherwise).
Finally, the reason you get errors is that your regex delimiter (you use /) shows up in your pattern (in the closing span tag) without it being escaped. That means the engine thinks that the pattern ends at the first / and that span>/ are 6 different modifiers (most of which don't actually exist). You could either escape the delimiter like <\/span> or even better, change the delimiter (you can use pretty much anything) like '~yourPattern/Here~'.
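
Putting the last two recommendations together, a minimal sketch with single quotes and ~ as the delimiter might look like this. The sample $data value is an assumption, and [0-9]+ is used as the shorter equivalent of [0-9]{1,}.

```php
<?php
$data = '<span style="color: #000000">123</span>'; // hypothetical sample input

// With "~" as the delimiter, the "/" in </span> needs no escaping.
// Single quotes mean neither the double quotes nor any backslashes need doubling.
$data = preg_replace(
    '~<span style="color: #000000">([0-9]+)</span>~',
    '<span style="color: #FFCC00">$1</span>',
    $data
);

echo $data, PHP_EOL; // <span style="color: #FFCC00">123</span>
```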
Edit: Since I posted this answer, two new websites have been released which try to explain regular expressions by visualising them. Right now they only support the (quite limited) JavaScript flavor, but it's a good point to start:
Regexper
Debuggex
http://www.macronimous.com/resources/writing_regular_expression_with_php.asp
Look at this one. It seems to cover the process pretty well.
Try this website, perhaps. Personally, I'd say if you are really interested in regexes, it'd be worth getting a book like this one.

Regular Expressions - Where Angels Fear to Tread

I've just started studying regular expressions in PHP, but I'm having a terrible time following some of the tutorials on the WWW and cannot seem to find anything addressing my current needs. Perhaps I'm trying to learn too much too fast. This aspect of PHP is entirely new to me.
What I'm trying to create is a regular expression to replace all HTML code in between the nth occurrence of <TAG> and </TAG> with any code I choose.
My ultimate goal is to make an Internet filter in PHP through which I can view a web page stripped of certain content (or replaced with sanitized content) between any specified set of tags <TAG>...</TAG> within the page, where <TAG>...</TAG> represents any valid paired HTML tags, such as <B>...</B> or <SPAN>...</SPAN> or <DIV>...</DIV>, etc, etc.
For example, if the page has a porn ad contained in the 5th <DIV>...</DIV> block within the page, what regular expression could be invoked to target and replace that code with something else, like xxxxxxx, but only the 5th <DIV> block within the page and nothing else?
The entire web page is contained within a single text string and the filtered result should also be a single string of text.
I'm not sure, but I think the code to do this could have a format similar to:
$FilteredPage = preg_replace("REG EXPRESSION", "xxxxxxxx", $OriginalPage);
The "REG EXPRESSION" to invoke is what I need to know and the "xxxxxxxx" represents the text to replace the code between the tags targeted by "REG EXPRESSION".
Regular expressions are obviously the work of Satan!
Any general suggestions or perhaps a couple of working examples which I could study and experiment with would be greatly appreciated.
Thanks, Jay
Firstly, are you using the right tool for the job? Regex is a text matching engine, not a fully blown parser - perhaps a dedicated HTML parser will give better results.
Secondly, when approaching any programming problem, try to simplify your problem and build it brick by brick rather than just jumping straight to a final solution. For example, you could:
Start with a simple block of normal English text, and try to match and replace (for example) every occurrence of the word "and".
When that works, wrap it in a PHP loop that can count up to 5 and only replace the 5th occurrence. Why use regex to count when PHP is so much better at that task?
Then modify your regex to match your 5th HTML tag (which is a bit harder; < and > are not regex metacharacters, but the / in a closing tag clashes with the usual / delimiter and needs escaping).
By approaching the problem in steps, you will be able to get each part working in turn and build a solid solution that you understand.
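
The steps above can be sketched roughly as follows, letting PHP do the counting inside a preg_replace_callback. The helper name replace_nth and the sample text are hypothetical, and the approach assumes the matched tags are not nested.

```php
<?php
// Hypothetical helper: replaces only the $n-th match of $pattern in $subject.
function replace_nth(string $pattern, string $replacement, string $subject, int $n): string
{
    $count = 0;
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use (&$count, $replacement, $n): string {
            $count++;
            // Leave every match untouched except the $n-th one.
            return $count === $n ? $replacement : $m[0];
        },
        $subject
    );
    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $subject;
}

$text = 'salt and pepper and oil and vinegar';
echo replace_nth('/\band\b/', 'AND', $text, 2), PHP_EOL;
// salt and pepper AND oil and vinegar
```

The same helper could then be pointed at a pattern such as '~<div\b[^>]*>.*?</div>~is' with $n set to 5, bearing in mind that a regex like this breaks as soon as div blocks nest.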
This has been done to death, but please, don't use a regex to parse HTML. Just stop, give up... It is not worth the kittens god will kill for you doing it. Use a real HTML or XML parser.
On a more constructive note, look at XPath as a technology better suited to describing the HTML nodes you might want to replace... or phpQuery and QueryPath.
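
A minimal sketch of the XPath idea, using PHP's bundled DOMDocument and DOMXPath rather than phpQuery or QueryPath. The generated sample page is an assumption for illustration; note that (//div)[5] counts nested div elements too, in document order.

```php
<?php
// Hypothetical sample page with six <div> blocks.
$page = '<html><body>';
for ($i = 1; $i <= 6; $i++) {
    $page .= "<div>block $i</div>";
}
$page .= '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($page);
$xpath = new DOMXPath($doc);

// (//div)[5] selects the fifth <div> in document order.
$nodes = $xpath->query('(//div)[5]');
if ($nodes !== false && $nodes->length > 0) {
    $target = $nodes->item(0);
    // Drop everything inside the block, then insert the placeholder text.
    while ($target->firstChild !== null) {
        $target->removeChild($target->firstChild);
    }
    $target->appendChild($doc->createTextNode('xxxxxxx'));
}

$filtered = $doc->saveHTML();
echo $filtered;
```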
The reason god kills kittens when you parse HTML with a regex:
HTML is not a regular language, thus a regex can only ever parse very limited HTML. HTML is a context-free language, and as such can only be properly parsed with a context-free parser.
Edit: thank you @Andrew Grimm, this is said much better than I could, as evidenced by the first answer with well over four thousand upvotes!
RegEx match open tags except XHTML self-contained tags
OK, a few ground rules.
Don't post a question like that; pre-formatting the whole question will only keep people away.
Regular expressions are awesome!
If you want to consider options, look at how to read HTML as an XML document and parse it using XPath.
@tobyodavies is pretty much correct; I'll include the answer in case you want to do it anyway.
Now, to your problem. With this one:
$regex = "#<div>(.+?)</div>#si";
You should be OK using that expression and counting the occurrences, much like this:
preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER );
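Suppose you only need the 5th one. $matches[$i][0] is the whole string of the match at index $i, and indexes start at 0, so the 5th match sits at index 4:
if (count($matches) >= 5)
{
    // Index 4 holds the 5th match
    $myMatch = $matches[4][0];      // the whole <div>...</div> block
    $matchedText = $matches[4][1];  // the text captured between the tags
}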
Good luck in your efforts...

negative lookbehind and greedy quantifiers in php

I'm using a regex to find any URLs and link them accordingly. However, I do not want to linkify any URLs that are already linked so I'm using lookbehind to see if the URL has an href before it.
This fails though, because variable-length quantifiers aren't allowed in lookbehind in PHP (lookahead does not have this restriction).
Here's the regex for the match:
/\b(?<!href\s*=\s*[\'\"])((?:http:\/\/|www\.)\S*?)(?=\s|$)/i
What's the best way around this problem?
EDIT:
I have yet to test it, but I think the trick to doing it in a single regex is using conditional expressions within the regex, which is supported by PCRE. It would look something like this:
/(href\s*=\s*[\'\"])?(?(1)^|)((?:http:\/\/|www\.)\w[\w\d\.\/]*)(?=\s|$)/i
The key point is that if the href is captured, the match is immediately thrown out due to the conditional (?(1)^|), which is guaranteed to not match.
There's probably something wrong with it. I'll test it out tomorrow.
I tried doing the same thing the other way round: ensure that the URL doesn't end in ">:
/((?:http:\/\/|www\.)(?:[^"\s]|"[^>]|(*FAIL))*?)(?=\s|$)/i
But for me that looks pretty hacky, I'm sure you can do better.
My second approach is more similar to yours (and thus is more precise):
/href\s*=\s*"[^"]*"(*SKIP)(*FAIL)|((?:http:\/\/|www\.)\S*?)(?=\s|$)/i
If I find an href= I (*SKIP)(*FAIL). This means that the match fails and the regex engine resumes searching from the position it was at when it encountered the (*SKIP).
But that's no less hacky and I'm sure there is a better alternative.
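
To show the (*SKIP)(*FAIL) variant in action, here is a small sketch that linkifies only the bare URL. The sample text and the replacement markup are assumptions; it also assumes the visible text of an existing link is not itself a URL, since that case would still be picked up by the second alternative.

```php
<?php
// Hypothetical sample text: one URL already inside an href, one bare URL.
$text = 'Visit <a href="http://example.com/a">here</a> or http://example.org/b today';

// First alternative: consume a whole href="..." attribute, then discard it.
// Second alternative: capture a bare URL up to the next whitespace or end of input.
$pattern = '~href\s*=\s*"[^"]*"(*SKIP)(*FAIL)|((?:http://|www\.)\S*?)(?=\s|$)~i';

$linked = preg_replace($pattern, '<a href="$1">$1</a>', $text);

echo $linked, PHP_EOL;
// Visit <a href="http://example.com/a">here</a> or <a href="http://example.org/b">http://example.org/b</a> today
```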
Finding "every URL that isn't part of a link" is quite difficult negative logic. It may be easier to find every URL, then every URL that's a link, and remove each of the latter from the former list.
As far as finding which URLs are a part of a link, try:
/<a([\s]+[\w="]+)*[\s]+href[\s]*=[\s]*"([\w\s:/.?+&=]+)"([\s]+[\w="]+)*>/i
I tested it with http://regexpal.com/ to be sure. It looks for the <a first, then it allows for any number of parameters, followed by href, followed by any other number of parameters. If it doesn't have the href, it's not a link. If it isn't an <a> tag, it's not a link. Since this is just the list of what we want to remove from the other list (of URLs), I simplified the definition of a URL to [\w\s:/.?+&=]+. As far as generating a list of URLs, you'll want something smarter.
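
The list-subtraction idea can be sketched as below. Both patterns are deliberately simplified stand-ins (not the regex from this answer), and the sample text is an assumption. One known gap: a URL that appears both linked and bare is dropped from the result entirely.

```php
<?php
// Hypothetical sample text with one linked URL and two bare ones.
$html = 'See <a href="http://example.com/a">this</a> and http://example.org/b plus http://example.net/c';

// Pass 1: every URL-looking token (stops at whitespace, quotes, or angle brackets).
preg_match_all('~https?://[^\s"<>]+~i', $html, $all);

// Pass 2: URLs that already sit inside an href attribute.
preg_match_all('~href\s*=\s*"([^"]*)"~i', $html, $linked);

// URLs that still need linking = all URLs minus the linked ones.
$unlinked = array_values(array_diff($all[0], $linked[1]));

print_r($unlinked);
```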
I don't have a better regex, but if you do not find one, I would suggest using two passes for the task. First, find and remove all links, and then search for URLs. This would be easier and possibly faster.
(To find and replace in one go, you can use something like http://www.satya-weblog.com/2010/08/php-regex-find-and-replace-any-word-string-or-text-at-one-go.html).
