I'm having a problem with my preg_match_all statement. It had been working perfectly as I typed out an article, but all of a sudden, after the article passed a certain length, it stopped working altogether. Is this a known issue with the function, that it just doesn't do anything after so many characters?
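// grab everything between matching <!-- name:start --> / <!-- name:stop --> comment pairs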
$number = preg_match_all("/(<!-- ([\w]+):start -->)\n?(.*?)\n?(<!-- \\2:stop -->)/s", $data, $matches, PREG_SET_ORDER);
It's been working fine all this time and works fine for other pages, but once that article passed a certain length, poof, it stopped working for that article. Is there another solution I can use to make it work for longer blocks of text? The article that is being processed is about 33,000 characters in length (including spaces).
I asked a similar question before, but it got only one answer, which I never actually tested; that time I found another way around the problem for that particular scenario. This time there's no way around it, because it's all one article. I tried raising pcre.backtrack_limit and pcre.recursion_limit to as much as 500,000 with absolutely no effect. Are there any other ideas on why this is occurring and what I can do to keep it working even for these massive blocks of text? A 30,000 character limit seems a bit low; that's only 5,000-6,000 words (this article is about 5,700). Breaking the text apart isn't really an option here, because the pattern won't find the start and stop markers if they end up in two separate blocks.
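For reference, this is roughly how I raised the limits (the exact values varied between attempts):

// tried values from the defaults all the way up to 500,000
ini_set('pcre.backtrack_limit', '500000');
ini_set('pcre.recursion_limit', '500000');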
I bumped into this once, and the only way I could solve it back then was by splitting the string. You could use explode() or preg_split().
Quoting literally from my source code:
// regexps have failed miserably on very large tables...
$parts = explode("<table",$html);
But this was two years ago.
It looks like you're working with HTML. You might want to consider working with one of the various parsers. For example, DOM has a specific class for comments (DOMComment), so we know it can work with them. Unfortunately, DOM is a bit awkward to work with.
Another option might be to use XMLReader, which reads XML as a stream and processes it as tokens along the way. It seems to understand what comments are. I've never used it myself, so I can't tell you how well it works. (You can use DOM's loadHTML and saveXML methods to convert your HTML into XML, assuming it's not too horribly formed.)
Finally, you might consider writing a tokenizer or parser for your custom comments. It shouldn't be too difficult, and may well be faster for you to hack together than learning either of the XML solutions I've pointed out.
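To illustrate the DOM route, here's a rough sketch (untested, and it assumes your start/stop comments are siblings in the tree; $data is your HTML string):

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect HTML
$doc->loadHTML($data);
libxml_clear_errors();

$blocks = array();
$xpath  = new DOMXPath($doc);
foreach ($xpath->query('//comment()') as $comment) {
    // $comment is a DOMComment; its text lives in $comment->data
    if (!preg_match('/^\s*(\w+):start\s*$/', $comment->data, $m)) {
        continue;
    }
    $name = $m[1];
    $html = '';
    // collect siblings until we hit the matching stop comment
    for ($n = $comment->nextSibling; $n !== null; $n = $n->nextSibling) {
        if ($n instanceof DOMComment && trim($n->data) === "$name:stop") {
            break;
        }
        $html .= $doc->saveHTML($n);
    }
    $blocks[$name] = trim($html);
}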
Related
I was wondering which of the methods mentioned in the title is more efficient for replacing content in an HTML page.
I have this custom tag in my page: <includes module='footer'/> which will be replaced with some content.
Now there are some downsides to using DOMDocument->getElementsByTagName('includes')->item(0)->parentNode->replaceChild. For instance, when I forget to add the slash in the tag, like so: <includes module='footer'>, the whole site crashes.
Regex allows exceptions like these, as long as the string matches the pattern. It would even allow me to replace any string, like {includes:footer}.
Now back to my actual question. Are there any downsides using regex for this purpose, like performance issues...?
More here: Append child/element in head using XML Manipulation
cheers
I wouldn't be too worried about performance here; I would consider them "comparable". Benchmarks would need to be run to truly determine this, as it would depend on the size of the document and how the regular expression is written.
Instead, I would be concerned about accuracy. In general, DOMDocument will be much better at parsing XML since it was built to read and understand the language. However, it does fail on <includes module='footer'> because it is an unclosed tag (expecting: </includes>).
Most common HTML/XML formatting issues can be fixed with PHP's Tidy class. I would check this out, since you should get much more "expected" results compared to parsing with regex. If you used a regular expression, there could technically be attributes before/after the module, elements within the includes element, unexpected characters like <includes module='foo>bar'>, etc.
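For illustration, a minimal Tidy repair might look something like this (a sketch; the options are standard Tidy configuration keys, and custom tags have to be declared or Tidy will complain about them):

$tidy = new tidy();
$tidy->parseString($html, array(
    'output-xhtml'   => true,       // emit well-formed markup
    'new-empty-tags' => 'includes', // teach Tidy the custom self-closing tag
), 'utf8');
$tidy->cleanRepair();
$clean = (string) $tidy;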
In the end, if your XML is in a "controlled" environment (i.e. you know what can and can't happen, you know what possible characters module will contain, you know that it will always be a self-closing element containing no children, etc.) then by all means use a regular expression. Just know it is looking for a very specific set of rules. However, if you expect this to work with "anything you throw at it", please use a DOM parser (after Tidy'ing to avoid the exceptions), regardless of performance (although I bet it will be very comparable in many instances).
Also, a final note: if you plan to find/replace/manipulate many nodes in a document, you will see a large performance increase by going with a DOM parser. A DOM parser takes a document and parses it once; then you just traverse the data it has already loaded into its class. Compare this to regular expressions, where each individual one is run across the whole document looking for a set of matches.
If you want me to get more specific in any area (i.e. give a Tidy example, or work on a benchmark), let me know.
So I did some naive performance testing using microtime(true), and it turns out preg_replace is the faster option. While DOM replaceChild needed between 2.0 and 3.5 ms, preg_replace needed between 0.5 and 1.2 ms! But I guess that's only in my case.
This is what my HTML looks like:
<!DOCTYPE html>
<html>
<head>
{includes:title}
{includes:style}
</head>
<body>
{includes:body}
{includes:footer}
...
a lot more here
...
</body>
</html>
This is the regex I used: /{([ ]*)includes:([ ]*)$key([^}]*)}/i
As I said, I'm not fully proficient with regex, but it did the job. I guess if you optimized it, it would run even faster.
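Put together, the replace step looked roughly like this (simplified; $content stands in for the module's markup):

$key     = 'footer';
$content = '<p>the footer markup</p>';
// $key is interpolated straight into the pattern
$pattern = "/{([ ]*)includes:([ ]*)$key([^}]*)}/i";
$html    = preg_replace($pattern, $content, $html);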
For the replaceChild method I used a custom tag like this: <includes module='body'/>
Again, this was tested on my local server, so I still need to run some tests to see how it behaves on my live server...
I'm looking for a plugin or a simple piece of code that fetches images from a link FASTER. I have been using http://simplehtmldom.sourceforge.net/ to extract the first 3 images from a given link.
simplehtmldom is quite slow and many users on my site are reporting it as an issue.
Correct me if I'm wrong, but I believe this library takes a lot of time to fetch the complete HTML code from the URL I pass, and only then searches for img tags.
Can someone please suggest a technique to improve the speed of fetching the HTML code, or an alternate library I can try?
What I'm thinking is something like fetching the HTML code only until the first three img tags are found, and then killing the fetching process, so that things get faster.
I'm not sure if it's possible with PHP, though; I'm trying hard to design it using jQuery.
Thanks for your help!
The browser's same-origin policy will prevent you from doing something like this in jQuery/JS (unless you control all the domains you'll be grabbing content from). What you're doing is not going to be super fast in any case, but try writing your own using file_get_contents() paired with DOMDocument... the DOMDocument getElementsByTagName method may be faster than simplehtmldom's find() method.
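Something along these lines, for instance (a rough sketch using only standard PHP):

$html = file_get_contents($url);

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world pages are rarely valid
$doc->loadHTML($html);
libxml_clear_errors();

// grab the src of the first three <img> tags
$srcs = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $srcs[] = $img->getAttribute('src');
    if (count($srcs) === 3) {
        break;
    }
}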
You could also try a regex approach. It won't be as fool-proof as a true DOM parser, but it will probably be faster... Something like:
$html = file_get_contents($url);
preg_match_all("/<img[^']*?src=\"([^']*?)\"[^']*?>/", $html, $arr, PREG_PATTERN_ORDER);
If you want to avoid reading whole large files, you can also skip the file_get_contents() call and sub in a fopen(); while(feof()) loop and just check for images after each line is read from the remote server. If you take this approach, however, make sure you're regexing the WHOLE buffered string, not just the most recent line, as you could easily have the code for an image broke across several lines.
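A rough sketch of that streaming idea (reusing the regex from above; error handling omitted):

$buffer = '';
$images = array();
$fp = fopen($url, 'r');
while (!feof($fp)) {
    $buffer .= fgets($fp, 8192);
    // regex the WHOLE buffer each pass, since a tag can span lines
    if (preg_match_all('/<img[^>]*?src=["\']([^"\']+)["\']/i', $buffer, $arr) >= 3) {
        $images = array_slice($arr[1], 0, 3);
        break; // got three images, stop downloading
    }
}
fclose($fp);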
Keep in mind that real-life variability in HTML will make regex an imperfect solution at best, but if speed is a major concern it might be your best option.
I'm building an application for a company that, unfortunately, has a very poorly designed website. Most of the HTML tags are wrongly and sometimes randomly placed, there is excessive use of no-break-spaces, p tags are randomly assigned, they don't follow any rule and so on...
I'm retrieving data from their website using a crawler and then feeding the resulting strings to my application through my own web service. The problem is that once it's displayed in the Android TextView, the text is formatted all wrong: spread out, uneven, very disorderly.
Also worth mentioning: for various reasons, I cannot suggest that the company modify their website...
I've tried
String text = Html.fromHtml(myString).toString();
and other variations; I've even tried formatting it manually, but it's been a pain.
My question is:
Is there an easy, elegant way to re-format all this text, either with PHP on my web-service or with Java, directly in my Android application?
Thanks to anyone who will take the time to answer...
You can use Tidy with PHP to clean up the code if you're keeping it in place. Otherwise stripping the HTML would probably make working with it a lot easier.
I would say: no, there is no easy, elegant way. HTML combines data and visual representation; they are inherently linked. To understand the data you must look at the tags. Tags like <h1> and <a> carry meaning.
If the HTML is structured enough, you could break it down into meaningful blocks (header, body, and unrelated/unimportant stuff) and apply restyling principles to those. A simpler solution is to just strip all the tags, get only the text nodes, and stitch them together. If the HTML is exceptionally poorly formatted you might get sentences that are out of order, but if it isn't too contrived I expect this approach to work.
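A rough sketch of that strip-everything approach in PHP (assuming you do the cleanup in your web service; DOM happily swallows broken markup):

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings about bad HTML
$doc->loadHTML($badHtml);
libxml_clear_errors();

// textContent concatenates every text node in document order
$plain = trim(preg_replace('/\s+/u', ' ', $doc->textContent));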
To give you an indication of the complexity involved: you could have <span>s that have styling applied to them, for instance display: block. This changes the way the span is displayed, from inline to block, so it behaves more like a <div> would. This means that each <span> will likely be on its own line; it will seem to force a line break. Detecting these situations isn't impossible, but it is quite complex. Who knows what happens when you've got list elements, tables or even floating elements; they might be completely out of order.
Probably not the most elegant solution, but I managed to get the best results by stripping some tags according to what I needed with PHP (that was really easy to do) and then displaying the retrieved strings in formatted WebViews.
As I said, probably not the most elegant solution but in this case it worked best for me.
I need to input "face" and get "facial, faces, faced, facing, facer, faceable" etc.
I've come across some ineffective programs which do the opposite, such as Snowball and a couple of Porter stemming PHP scripts which don't seem to work.
I'm beginning to think I may have to write this script - But, I thought I'd check to see if somebody has already been there/done that.
It will be very hard to find an algorithm that can simply generate all the different forms of a word like that.
You could instead use a dictionary web service that already has all the word forms available.
I have mixed HTML, custom code, and regular text that I need to examine and change frequently on several long wiki pages. I'm working with a proprietary wiki-like application and have no control over how the application functions or validates user input. The layout of pages that users add must follow a very specific standard layout and always include very specific text in only certain places - a standard which frequently changes. If users add pages that stray too far from the standard, the pages will be deleted.
I do not have the resources to manually proof-read and correct all these pages, so automation is the only solution. The fact that all this is obviously a complete waste of time when alternative platforms to do exactly what's needed here exist is already understood.
I've built a PHP-based API to automate this post-validation and frequent re-standardization process. I've been able to set up regex patterns to handle all this mixed text, and they all work fine on single lines. The problem I have is this: a poorly formed regex run against long text with line breaks can lead to unexpected results, such as connection resets, and I have no access to server-side logs to troubleshoot. How do I overcome this?
This is just one example of what I currently have: {column} and {section} tags I'm searching for below can have any number of attributes, and wrap any text. {section} may or may not exist and may or may not be one or more lines under {column}, but it has to be wrapped inside {column}. {column} itself may or may not exist, and if it doesn't, I don't care as I then have some default text inserted later on down the script. I want to grab the inner section contents and wrap it in an html div tag instead. I can't recall the exact pattern I'm using offhand at the moment, but it's close enough...
$pattern = "/\{column:id=summary([|]?([a-zA-Z0-9-_ ]+[:][a-zA-Z0-9-_ ]+[ ]?))\}(.*)({section([|]([a-zA-Z0-9-_ ]+[:][a-zA-Z0-9-_ ]+[ ]?))\}(.*)\{section\}(.*))?{column\}/s";
$replacement = "{html}<div id='summary'>$7</div>{html}";
$text = preg_replace($pattern, $replacement, $subject);
Handling the {column} and {section} attributes and passing only valid HTML parameters to the new html div (or a subtext of it) is itself a challenge, but my main focus right now is getting that (.*) value within {section} above without causing a connection reset. Any pointers?
This probably isn't what you're looking for, but: don't use a regex! You're trying to parse some very structured, very complex text, and to do so, you should really use a parser. I don't know what's available for PHP (you can Google just as well as I can, and I'm in no position to make any particular recommendation) but I'm sure something exists.
As for what's causing the connection reset, my only guess is that, since you mention problems with "long text", you're having a memory allocation issue. I don't think your regex will have unexpectedly bad performance, though it might in the non-matching case. But your best option, if you can, is probably to scrap the regex technique and switch to a real parser.
I found the likely source of the crashing issue: catastrophic backtracking (http://www.regular-expressions.info/catastrophic.html). So if refining patterns to handle that doesn't work (and if anyone has any patterns to suggest, please do share), switching to some other text parser solution would be best.
The only real problem I can see is all those (.*)s. In /s mode, each (.*) initially slurps up the whole page, only to have to backtrack most of the way. Change them all to (.*?) (i.e., switch to reluctant quantifiers) and it should work much faster.
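Applied to the pattern above, that would be roughly (only the (.*) groups changed):

$pattern = "/\{column:id=summary([|]?([a-zA-Z0-9-_ ]+[:][a-zA-Z0-9-_ ]+[ ]?))\}(.*?)({section([|]([a-zA-Z0-9-_ ]+[:][a-zA-Z0-9-_ ]+[ ]?))\}(.*?)\{section\}(.*?))?{column\}/s";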