Preg_replace regex, newlines, connection resets - php

I have mixed HTML, custom code, and regular text that I need to examine and change frequently on several long wiki pages. I'm working with a proprietary wiki-like application and have no control over how the application functions or validates user input. The layout of pages that users add must follow a very specific standard layout and always include very specific text in only certain places - a standard which frequently changes. If users add pages that stray too far from the standard, those pages will be deleted.
I do not have the resources to manually proof-read and correct all these pages, so automation is the only solution. The fact that all this is obviously a complete waste of time when alternative platforms to do exactly what's needed here exist is already understood.
I've built a PHP-based API to automate this post-validation and frequent restandardization process for me. I've been able to set up regex patterns to handle all this mixed text, and they all work fine for handling single lines. The problem I have is this: poorly formed regex against long text with line breaks can lead to unexpected results, such as connection resets. I have no access to server-side logs to troubleshoot. How do I overcome this?
This is just one example of what I currently have: {column} and {section} tags I'm searching for below can have any number of attributes, and wrap any text. {section} may or may not exist and may or may not be one or more lines under {column}, but it has to be wrapped inside {column}. {column} itself may or may not exist, and if it doesn't, I don't care as I then have some default text inserted later on down the script. I want to grab the inner section contents and wrap it in an html div tag instead. I can't recall the exact pattern I'm using offhand at the moment, but it's close enough...
$pattern = "/\{column:id=summary([|]?([a-zA-Z0-9-_ ]+[:][a-zA-Z0-9-_ ]+[ ]?))\}(.*)({section([|]([a-zA-Z0-9-_ ]+[:][a-zA-Z0-9-_ ]+[ ]?))\}(.*)\{section\}(.*))?{column\}/s";
$replacement = "{html}<div id='summary'>$7</div>{html}";
$text = preg_replace($pattern, $replacement, $subject);
Handling the {column} and {section} attributes and passing only valid HTML parameters to the new html div or a subtext of it is itself a challenge, but my main focus above right now is getting that (.*) value within {section} above without causing a connection reset. Any pointers?

This probably isn't what you're looking for, but: don't use a regex! You're trying to parse some very structured, very complex text, and to do so, you should really use a parser. I don't know what's available for PHP (you can Google just as well as I can, and I'm in no position to make any particular recommendation) but I'm sure something exists.
As for what's causing a connection reset, my only guess is that, since you mention problems with "long text", you're having a memory allocation issue. I don't think your regex will perform unexpectedly badly when it matches, though it might in the non-matching case. But your best option, if you can, is probably to scrap the regex technique and switch to a real parser.

I found the likely source of the crashing issue: catastrophic backtracking (http://www.regular-expressions.info/catastrophic.html). So if refining patterns to handle that doesn't work (and if anyone has any patterns to suggest, please do share), switching to some other text parser solution would be best.

The only real problem I can see is all those (.*)s. In /s mode, each (.*) initially slurps up the whole page, only to have to backtrack most of the way. Change them all to (.*?) (i.e., switch to reluctant quantifiers) and it should work much faster.
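To make the greedy versus reluctant difference concrete, here is a minimal sketch. The sample text and the simplified {section} pattern are assumptions for illustration, not the asker's real pattern. It also checks for a null return, which is how preg_replace reports a PCRE failure such as hitting the backtrack limit.

```php
<?php
// Hypothetical sample text; the tag layout is assumed from the question.
$subject = "{column:id=summary}\nIntro\n{section:border=true}\nSummary body\n{section}\nOutro\n{column}\n"
         . "{column:id=other}\n{section}\nOther body\n{section}\n{column}\n";

// Greedy: (.*) runs to the end of the text, then backtracks to the LAST {section}.
$greedyPattern = '/\{section[^}]*\}(.*)\{section\}/s';
// Reluctant: (.*?) grows one character at a time and stops at the FIRST {section}.
$lazyPattern = '/\{section[^}]*\}(.*?)\{section\}/s';

preg_match($greedyPattern, $subject, $greedy);
preg_match($lazyPattern, $subject, $lazy);

// Replace only the first section, wrapping its contents in a div.
$text = preg_replace($lazyPattern, '{html}<div id="summary">$1</div>{html}', $subject, 1);

// preg_replace returns null on a PCRE error instead of throwing.
if ($text === null) {
    $reason = preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR
        ? 'backtrack limit hit'
        : 'PCRE error code ' . preg_last_error();
    error_log("preg_replace failed: $reason");
    $text = $subject; // fall back to the unmodified text
}
```

With the greedy pattern the capture swallows everything up to the closing tag of the second section. The reluctant one captures only the first section body.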

Related

preg_replace vs DOMDocument replaceChild

I was wondering which of the methods mentioned in the title is more efficient for replacing content in an HTML page.
I have this custom tag in my page: <includes module='footer'/> which will be replaced with some content.
Now there are some downsides to using DOMDocument->getElementsByTagName('includes')->item(0)->parentNode->replaceChild. For instance, when I forget to add the slash in the tag, like so <includes module='footer'>, the whole site crashes.
Regex allows exceptions like these, as long as the text matches the rule. It would even allow me to replace any string, like {includes:footer}.
Now back to my actual question. Are there any downsides to using regex for this purpose, like performance issues...?
More here: Append child/element in head using XML Manipulation
cheers
I wouldn't be too worried about performance here; I would consider them "comparable". Benchmarks would need to be run to truly determine this, as it would depend on the size of the document and how the regular expression is written.
Instead, I would be concerned about accuracy. In general DOMDocument will be much better at parsing XML since it was built to read and understand the language. However, it does fail on <includes module='footer'> because it is an un-closed tag (expecting: </includes>).
Most common HTML/XML formatting issues can be fixed with PHP's Tidy class. I would check this out, since you should receive much more "expected results" compared to if you used regex to parse. If you used a regular expression, there could technically be attributes before/after the module, elements within the includes element, unexpected characters like <includes module='foo>bar'>, etc.
In the end, if your XML is in a "controlled" environment (i.e. you know what can and can't happen, you know what possible characters module will contain, you know that it will always be a self-closing element containing no children, etc.) then by all means use a regular expression. Just know it is looking for a very specific set of rules. However, if you expect this to work with "anything you throw at it", please use a DOM parser (after Tidy'ing to avoid the exceptions), regardless of performance (although I bet it will be very comparable in many instances).
Also, a final note: if you plan to find/replace/manipulate many nodes in a document, you will see a large performance increase by going with a DOM parser. A DOM parser will take a document and parse it once. Then you just traverse the data it already has loaded into its class. This is compared to using regular expressions, where each individual one will be run across the whole document looking for a set of matches.
If you want me to get more specific in any area (i.e. give a Tidy example, or work on a benchmark), let me know.
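As a rough illustration of the parse-once-then-traverse approach, here is a minimal sketch. It assumes PHP's DOM extension is enabled, and the sample page and the generated replacement content are made up for the example.

```php
<?php
// Hypothetical page; note the custom tag is properly self-closed.
$page = "<html><body><p>Hello</p><includes module='footer'/></body></html>";

$doc = new DOMDocument();
// Strict XML parsing: an unclosed <includes module='footer'> would fail here.
$doc->loadXML($page);

// Copy the live node list to an array first, because replacing nodes mutates it.
$nodes = iterator_to_array($doc->getElementsByTagName('includes'));

foreach ($nodes as $node) {
    $module = $node->getAttribute('module');
    // Placeholder content; a real script would load the module's markup instead.
    $replacement = $doc->createElement('div', "content for $module");
    $replacement->setAttribute('id', $module);
    $node->parentNode->replaceChild($replacement, $node);
}

// Serialize only the root element, which leaves out the XML declaration.
$out = $doc->saveXML($doc->documentElement);
```

The document is parsed a single time, and every includes element is then handled in one loop.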
So I did some naive performance testing using microtime(true), and it turns out preg_replace is the faster option. While DOM replaceChild needed between 2.0 and 3.5 ms, preg_replace needed between 0.5 and 1.2 ms! But I guess that's only in my case.
This is what my HTML looks like:
<!DOCTYPE html>
<html>
<head>
{includes:title}
{includes:style}
</head>
<body>
{includes:body}
{includes:footer}
...
allot more here
...
</body>
</html>
This is the regex I used: /{([ ]*)includes:([ ]*)$key([^}]*)}/i
As I said, I'm not fully proficient in using regex, but this did the job. I guess if you optimize it, it would run even faster.
For the replaceChild method I used a custom tag like this: <includes module='body'/>
Again, this was tested on my local server, so I still need to run some tests to see how it behaves on my online server...
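One way to avoid running a separate preg_replace per $key is to handle every placeholder in a single pass with preg_replace_callback. This is only a sketch: the $modules array and its contents are hypothetical, and the timing uses the same naive microtime(true) approach as above.

```php
<?php
// Hypothetical module contents keyed by placeholder name.
$modules = [
    'title'  => '<title>Demo</title>',
    'style'  => '<style>body { margin: 0; }</style>',
    'body'   => '<p>Hello</p>',
    'footer' => '<footer>Bye</footer>',
];

$html = "<!DOCTYPE html>\n<html>\n<head>\n{includes:title}\n{ includes: style }\n</head>\n"
      . "<body>\n{includes:body}\n{includes:footer}\n</body>\n</html>";

$start = microtime(true);

// One pass over the document; the callback looks up each placeholder by name.
$out = preg_replace_callback(
    '/\{\s*includes:\s*([a-z0-9_-]+)[^}]*\}/i',
    function (array $m) use ($modules) {
        $key = strtolower($m[1]);
        // Leave unknown placeholders untouched.
        return $modules[$key] ?? $m[0];
    },
    $html
);

$elapsedMs = (microtime(true) - $start) * 1000;
```

A single run like this is too short to time reliably, so for a meaningful comparison the replacement would need to be repeated many times and averaged.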

preg/regex in PHP

I'm having trouble with regex; I've never fully understood it. The real question is: does anybody know a good site that explains what each individual expression does, instead of just posting stuff like
$regexp = "/^[^0-9][A-z0-9_]+([.][A-z0-9_]+)*[@][A-z0-9_]+([.][A-z0-9_]+)*[.][A-z]{2,4}$/";
and then prattling off what that line as a whole will do, rather than what each expression will do? I've tried googling many different versions of "preg_replace" and "regex tutorial", but they all seem to assume that we already know what stuff like \b[^>]* will do.
Secondly, the reason I am trying to do this:
I want to turn
<span style="color: #000000">*ANY NUMBER*</span>
into
<span style="color: #0000ff">*ANY NUMBER*</span>
Here are a few variations that I have already tried. Some just didn't work; some made the script crap out.
$data = preg_replace("/<span style=\"color: #000000\">([0-9])</span>/", "<span style=\"color: #FFCC00\">$1</span>", $data);//just tried to match atleast 0-9
$data = preg_replace("/<span style=\"color: #000000\"\b[^>]*>(.*?)</span>/", "<span style=\"color: #FFCC00\">$1</span>", $data);
$data = preg_replace("/<span style=\"color: #000000\"\b[^>]*>([0-9])</span>/", "<span style=\"color: #FFCC00\">$1</span>", $data);
The answer to this specific problem is not nearly as important to me as a good site, so the check goes to that. I've tried a lot of different sites, and I'm pretty sure it's not above my comprehension; I just cannot find a good one among all the bad tutorial/example farms. My normal fallbacks of w3 and phpdotnet don't have what I need this time.
EDIT1 For those of you who end up in here looking for a similar answer:
$data = preg_replace("/<span style=\"color: #000000\">([0-9]{1,})<\/span>/", "<span style=\"color: #FFCC00\">$1</span>", $data);
Did what it needed to. Sadly it was one of the first things I tried, but because I put </span> instead of <\/span> it was not working. I do not know if "[0-9]{1,}" is the MOST appropriate way of matching any number (it tells the engine to match any digit 0-9 with [0-9], at least once and as many times as it can with {1,}), but it still fit the purpose.
ROY Finley Posted:
http://www.macronimous.com/resources/writing_regular_expression_with_php.asp
It's a good site with a list of expression definitions and a good example workup below.
Also:
regular-expressions.info/tutorial.html was posted a few times. It's a slower, more in-depth walkthrough, but if you are stuck like I am, it's good.
I will pop in about regex101 and the parsers after I have a chance to play with them.
EDIT2
DWright posted a book link below, "Mastering Regular Expressions". If you look at regex and cannot make heads or tails of the convolution of characters, it is DEFINITELY worth picking up. It took about an hour and a half to read about half of it, but that is no time compared to the hours spent on Google and the messy workarounds used to avoid it.
Also, the HTML parser linked below would be right for this particular problem.
To have a regex explained, you can have a look at Regex101. To actually learn regular expressions (which I recommend), this is a pretty good, in-depth tutorial. After you have read that, the PCRE documentation on PHP.net shouldn't seem too arcane any more, and reading it will help you get your head around some specific differences for PHP.
However, for the problem at hand, you shouldn't actually be using regex at all. A DOM parser is the way to go. Here is a very convenient to use 3rd party one, and this is what PHP brings along itself. As mentioned by hakre, here is a more extensive list of libraries available for this purpose.
Another general recommendation for regexes in PHP: use single quotes '/pattern/', because double quotes cause a lot of trouble with escape sequences (you need to double some backslashes otherwise).
Finally, the reason you get errors is that your regex delimiter (you use /) shows up in your pattern (in the closing span tag) without it being escaped. That means the engine thinks that the pattern ends at the first / and that span>/ are 6 different modifiers (most of which don't actually exist). You could either escape the delimiter like <\/span> or even better, change the delimiter (you can use pretty much anything) like '~yourPattern/Here~'.
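A minimal sketch of both recommendations, single quotes and a different delimiter, applied to the span problem from the question. The sample $data string is made up for illustration.

```php
<?php
// Hypothetical input: one span holding a number, one holding text.
$data = 'a <span style="color: #000000">42</span> b <span style="color: #000000">text</span>';

// ~ as the delimiter means the / in </span> needs no escaping,
// and single quotes mean PHP does not interpret \d or the double quotes.
$pattern = '~<span style="color: #000000">(\d+)</span>~';

// $1 is the captured number; spans that do not contain only digits are left alone.
$data = preg_replace($pattern, '<span style="color: #0000ff">$1</span>', $data);
```

Only the span containing 42 is recolored; the span containing text does not match (\d+) and stays as it was.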
Edit: Since I posted this answer, two new websites have been released which try to explain regular expressions by visualising them. Right now they only support the (quite limited) JavaScript flavor, but it's a good point to start:
Regexper
Debuggex
http://www.macronimous.com/resources/writing_regular_expression_with_php.asp
Look at this one. It seems to cover the process pretty well.
Try this website, perhaps. Personally, I'd say if you are really interested in regexes, it'd be worth getting a book like this one.

Regular Expressions - Where Angels Fear to Tread

I've just started studying regular expressions in PHP, but I'm having a terrible time following some of the tutorials on the WWW and cannot seem to find anything addressing my current needs. Perhaps I'm trying to learn too much too fast. This aspect of PHP is entirely new to me.
What I'm trying to create is a regular expression to replace all HTML code in between the nth occurrence of <TAG> and </TAG> with any code I choose.
My ultimate goal is to make an Internet filter in PHP through which I can view a web page stripped of certain content (or replaced with sanitized content) between any specified set of tags <TAG>...</TAG> within the page, where <TAG>...</TAG> represents any valid paired HTML tags, such as <B>...</B> or <SPAN>...</SPAN> or <DIV>...</DIV>, etc, etc.
For example, if the page has a porn ad contained in the 5th <DIV>...</DIV> block within the page, what regular expression could be invoked to target and replace that code with something else, like xxxxxxx, but only the 5th <DIV> block within the page and nothing else?
The entire web page is contained within a single text string and the filtered result should also be a single string of text.
I'm not sure, but I think the code to do this could have a format similar to:
$FilteredPage = preg_replace("REG EXPRESSION", "xxxxxxxx", $OriginalPage);
The "REG EXPRESSION" to invoke is what I need to know and the "xxxxxxxx" represents the text to replace the code between the tags targeted by "REG EXPRESSION".
Regular expressions are obviously the work of Satan!
Any general suggestions or perhaps a couple of working examples which I could study and experiment with would be greatly appreciated.
Thanks, Jay
Firstly, are you using the right tool for the job? Regex is a text matching engine, not a full-blown parser - perhaps a dedicated HTML parser will give better results.
Secondly, when approaching any programming problem, try to simplify your problem and build it brick by brick rather than just jumping straight to a final solution. For example, you could:
Start with a simple block of normal english text, and try to match and replace (for example) every occurrence of the word "and".
When that works, wrap it in a loop of PHP that can count up to 5 and only replace the 5th occurrence. Why use regex to count when PHP is so much better at that task?
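Then modify your regex to match your 5th HTML tag (which is a bit harder because tags can carry attributes and can nest, and a / in a closing tag needs escaping if / is also your regex delimiter; < and > themselves need no escaping)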
By approaching the problem in steps, you will be able to get each part working in turn and build a solid solution that you understand.
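The steps above can be sketched roughly as follows. The helper name replaceNth is hypothetical, and the counting is done in PHP rather than inside the regex, as suggested.

```php
<?php
/**
 * Replace only the $n-th match (1-based) of $pattern in $subject.
 * Hypothetical helper for illustration; not a built-in function.
 */
function replaceNth(string $pattern, string $replacement, string $subject, int $n): string
{
    $count = 0;
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use (&$count, $n, $replacement) {
            $count++;
            // Swap in the replacement for the n-th match, keep every other match as is.
            return $count === $n ? $replacement : $m[0];
        },
        $subject
    );
    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $subject;
}

// Steps 1 and 2: plain English text, replace only the 5th "and".
$text = 'salt and pepper and bread and butter and jam and tea and milk';
$step2 = replaceNth('/\band\b/', 'xxxxxxx', $text, 5);

// Step 3: the same helper with a tag pattern (this simple pattern does not handle nested divs).
$html = '<div>a</div><div>b</div><div>c</div>';
$step3 = replaceNth('#<div>.*?</div>#si', '<div>xxxxxxx</div>', $html, 2);
```

Keeping the counter in PHP means the regex itself stays as simple as in the first step.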
This has been done to death, but please, don't use a regex to parse HTML. Just stop, give up... It is not worth the kittens god will kill for you doing it. Use a real HTML or XML parser.
On a more constructive note, look at xpath as a technology better suited to describing html nodes you might want to replace... or phpQuery and QueryPath
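For comparison, a minimal XPath sketch for targeting the 5th div. It assumes PHP's DOM extension is enabled, and the sample markup is made up. The expression (//div)[5] selects the 5th div in document order, nested or not.

```php
<?php
// Hypothetical page with six div blocks; the 5th one is the unwanted content.
$html = '<html><body>'
      . '<div>one</div><div>two</div><div>three</div><div>four</div><div>ad</div><div>six</div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html); // tolerant HTML parser, unlike loadXML

$xpath = new DOMXPath($doc);
$target = $xpath->query('(//div)[5]')->item(0);

if ($target !== null) {
    // Remove everything inside the div, then insert the sanitized text.
    while ($target->firstChild) {
        $target->removeChild($target->firstChild);
    }
    $target->appendChild($doc->createTextNode('xxxxxxx'));
}

// The filtered page as a single string again.
$filteredPage = $doc->saveHTML();
```

Real-world pages are often malformed enough that loadHTML emits warnings, so libxml_use_internal_errors(true) may be worth calling before parsing.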
The reason god kills kittens when you parse HTML with a regex:
HTML is not a regular language, thus a regex can only ever parse very limited HTML. HTML is a context-free language, and as such can only be properly parsed with a context-free parser.
Edit: thank you @Andrew Grimm, this is said much better than I could, as evidenced by the first answer with well over four thousand upvotes!
RegEx match open tags except XHTML self-contained tags
OK, a few ground rules.
Don't post a question like that; pre-ing the whole question will only keep people away.
Regular expressions are awesome!
If you want to consider options, look into how to read HTML as an XML document and parse it using XPath.
@tobyodavies is pretty much correct; I'll include the answer in case you want to do it anyway.
Now, to your problem. With this one:
$regex = "#<div>(.+?)</div>#si";
You should be OK using that expression and counting the occurrences, much like this:
preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER );
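Suppose you only need the 5th one. $matches[$i][0] is the whole string of the match at zero-based index $i, so the 5th match is at index 4:
if (count($matches) >= 5)
{
    $myMatch = $matches[4][0];     // the whole 5th <div>...</div> block
    $matchedText = $matches[4][1]; // only the text between the tags
}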
Good luck in your efforts...

Why does preg_match_all poop out after so many characters?

I'm having a problem with my preg_match_all statement. It's been working perfectly as I've been typing out an article, but all of a sudden, after the article passed a certain length, it stopped working altogether. Is this a known issue with the function, that it just doesn't do anything after so many characters?
$number = preg_match_all("/(<!-- ([\w]+):start -->)\n?(.*?)\n?(<!-- \\2:stop -->)/s", $data, $matches, PREG_SET_ORDER);
It's been working fine all this time and works fine for other pages, but once that article passed a certain length, poof, it stopped working for that article. Is there another solution I can use to make it work for longer blocks of text? The article that is being processed is about 33,000 characters in length (including spaces).
I asked a question like this before but got only one answer which I never actually tested. The previous time I had just found another way to get around it for that particular scenario, but this time there is no way to get around it because it's all one article. I tried changing the pcre.backtrack_limit and pcre.recursion_limit up to even 500,000 with absolutely no effect. Are there any other ideas on why this is occurring and what I can do to get it to continue working even for these massive blocks of text? A 30,000 character limit seems to be a bit low, that's only 5,000-6,000 words (this one is about 5,700). Breaking it apart isn't really an option here because it won't find the start and stop if they are in two separate blocks of text.
I bumped into this one once, and the only way I could solve it back then, was by splitting the string. You could explode() or preg_split().
Quoting literally from my source code:
// regexps have failed miserably on very large tables...
$parts = explode("<table",$html);
But this was two years ago.
It looks like you're working with HTML. You might want to consider working with one of the various parsers. For example, DOM has a specific class for comments, so we know it can work with them. Unfortunately the DOM is a bit awkward to work with.
Another option might be to use XMLReader, which reads XML as a stream and processes it as tokens along the way. It seems to understand what comments are. I've never used it myself, so I can't tell you how well it works. (You can use DOM's loadHTML and saveXML methods to convert your HTML into XML, assuming it's not too horribly formed.)
Finally, you might consider writing a tokenizer or parser for your custom comments. It shouldn't be too difficult, and may well be faster for you to hack together than learning either of the XML solutions I've pointed out.
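A hand-rolled tokenizer for these comment markers could look roughly like this. The function name extractBlocks is hypothetical. A regex is still used for the short start marker, but the potentially huge body is located with strpos, so no backtracking happens across the article text.

```php
<?php
/**
 * Extract <!-- name:start --> ... <!-- name:stop --> blocks.
 * Hypothetical helper for illustration; returns a list of ['name' => ..., 'body' => ...].
 */
function extractBlocks(string $data): array
{
    $blocks = [];
    $offset = 0;

    // Only the short start marker is matched by regex.
    while (preg_match('/<!-- (\w+):start -->\n?/', $data, $m, PREG_OFFSET_CAPTURE, $offset)) {
        $name = $m[1][0];
        $bodyStart = $m[0][1] + strlen($m[0][0]);
        $stopMarker = "<!-- $name:stop -->";

        // Plain substring search for the matching stop marker.
        $stopPos = strpos($data, $stopMarker, $bodyStart);
        if ($stopPos === false) {
            $offset = $bodyStart; // unterminated block: skip it and keep scanning
            continue;
        }

        $body = substr($data, $bodyStart, $stopPos - $bodyStart);
        // Drop a single trailing newline, like the \n? in the original pattern.
        if (substr($body, -1) === "\n") {
            $body = substr($body, 0, -1);
        }

        $blocks[] = ['name' => $name, 'body' => $body];
        $offset = $stopPos + strlen($stopMarker);
    }

    return $blocks;
}

// Usage with a long block (60,000 characters) followed by a short one.
$data = "intro\n<!-- news:start -->\n" . str_repeat('lorem ipsum ', 5000) . "\n<!-- news:stop -->\n"
      . "mid\n<!-- side:start -->\nshort\n<!-- side:stop -->\n";
$blocks = extractBlocks($data);
```

Because the start and stop markers are searched for in the whole string, nothing has to be split into separate chunks first.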

Scraping Google Search Results in PHP

I would like to get the links from the search results. Can someone please help with with the regular expression to do this? I've got this, and it doesn't work:
preg_match_all("/<h3(.*)><a href=\"(.*)\"(.*)<\/h3>/", $result, $matches);
Your pattern is most likely failing because its quantifiers are greedy rather than lazy. Changing it to the following should solve that issue...
preg_match_all('#<h3.*?><a href="(.*?)".*?</h3>#', $result, $matches);
print_r($matches[1]);
There are possibly a few rare URLs that could mess the pattern up, but chances are you won't run into one. I will point out that stillstanding has a good point, though: using the API would be a better option.
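To see the difference on something small, here is a sketch that runs both patterns against made-up markup. The $result string below is an assumption and not real Google output.

```php
<?php
// Hypothetical markup with two results on one line.
$result = '<h3 class="r"><a href="http://example.com/a" class="l">A</a></h3>'
        . '<h3 class="r"><a href="http://example.org/b" class="l">B</a></h3>';

// Greedy version from the question: each (.*) grabs as much as it can,
// so both results collapse into a single match.
preg_match_all('/<h3(.*)><a href="(.*)"(.*)<\/h3>/', $result, $greedy);

// Lazy version from the answer: each .*? stops at the first place the rest can match.
preg_match_all('#<h3.*?><a href="(.*?)".*?</h3>#', $result, $lazy);

print_r($lazy[1]); // the two href values, one per result
```

The greedy pattern finds one match spanning both h3 blocks, while the lazy pattern finds one match per block.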
As for people that blanket answer with "You can't parse HTML with Regex, use a DOM"... Whilst you cannot create a generic HTML parser (and should be using DOM for that task), you can match patterns in a set of text you know follows a certain structure, the fact that structure is HTML is irrelevant. Yes, if Google change their layout it will probably break, but this is also probably true of a DOM Parser. (P.S. I'm well aware this will probably get down-voted by the sheeple).
