I wrote a PHP script that loads the XML/HTML of a given URL, parses it, and writes it to a database. Since a few hours ago I have been getting the strange error mentioned above, not every time, but definitely too often.
Do you have any suggestions as to what is going wrong?
Here are the lines of code that appear to cause the logged error:
libxml_use_internal_errors( true );
$data = file_get_contents($item->get_link());
$dom = new DOMDocument();
$dom->loadHTML($data);
Well, there's not much to work with, so here are some possible issues:
1) $item->get_link() is not returning a valid URL
2) You assume file_get_contents() will always get the data. What happens when there's a network issue, or the server is down? You need to make sure $data is valid before doing anything with it (see the sketch after this list).
$data = file_get_contents($item->get_link());
3) $data is not parseable by the DOM parser, possibly for one of the previous reasons.
$dom = new DOMDocument();
$dom->loadHTML($data);
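As a rough illustration of points 2) and 3), here is a minimal sketch; the error logging and the filter_var() check are assumptions for the example, not part of your original code:
libxml_use_internal_errors(true);

$url = $item->get_link();
if (!filter_var($url, FILTER_VALIDATE_URL)) {
    // point 1): bail out early if the link is not a valid URL
    error_log("Invalid URL: " . var_export($url, true));
    return;
}

$data = file_get_contents($url);
if ($data === false) {
    // point 2): network problem, DNS failure, server down, ...
    error_log("Could not fetch " . $url);
    return;
}

$dom = new DOMDocument();
if (!$dom->loadHTML($data)) {
    // point 3): the fetched markup could not be parsed at all
    foreach (libxml_get_errors() as $error) {
        error_log(trim($error->message));
    }
    libxml_clear_errors();
    return;
}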
So, I've created a 34MB XML file.
When I try to get the output from DOMDocument->saveXML(), it takes 94 seconds to return.
I assume the code that generates this XML is irrelevant here, as the problem is on the saveXML() line:
$this->exportDOM = new DOMDocument('1.0');
$this->exportDOM->formatOutput = TRUE;
$this->exportDOM->preserveWhiteSpace = FALSE;
$this->exportDOM->loadXML('<export><produtos></produtos><fornecedores></fornecedores><transportadoras></transportadoras><clientes></clientes></export>');
[...]
$this->benchmark->mark('a');
$this->exportDOM->saveXML();
$this->benchmark->mark('b');
echo $this->benchmark->elapsed_time('a','b');
die;
This gives me 94.4581.
What am I doing wrong? Do you guys know any performance-related issues with DOMDocument when generating the file?
If you need additional info, let me know. Thanks.
I tried removing formatOutput. It improves the performance by 33%.
Still taking too long. Any other tips?
One thing that helped, although it isn't a perfect solution, was setting $this->exportDOM->formatOutput = FALSE;
It improved the performance by ~33%.
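For what it's worth, here is a minimal sketch of how to compare the two settings with plain microtime() instead of the CodeIgniter benchmark class; it assumes $exportDOM already holds the fully built export tree:
$exportDOM->formatOutput = TRUE;
$start = microtime(true);
$exportDOM->saveXML();
printf("formatOutput = TRUE:  %.4f s\n", microtime(true) - $start);

$exportDOM->formatOutput = FALSE;
$start = microtime(true);
$exportDOM->saveXML();
printf("formatOutput = FALSE: %.4f s\n", microtime(true) - $start);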
Possible Duplicate:
PHP HTML DomDocument getElementById problems
I'm trying to extract info from Google searches in PHP and find that I can read the search URLs without problem, but getting anything out of them is a whole different issue. After reading numerous posts and the applicable PHP docs, I came up with the following:
// get large panoramas of montana
$url = 'http://www.google.com/search?q=montana+panorama&tbm=isch&biw=1408&bih=409';
$html = file_get_contents($url);
// was getting tons of "entity parse" errors, so added
$html = htmlentities($html, ENT_COMPAT, 'UTF-8', true); // tried false as well
$doc = new DOMDocument();
//$doc->strictErrorChecking = false; // tried both true and false here, same result
$result = $doc->loadHTML($html);
//echo $doc->saveHTML(); this shows that the tags I'm looking for are in fact in $doc
if ($result === true)
{
var_dump($result); // prints 'true'
$tags = $doc->getElementById('center_col');
$tags = $doc->getElementsByTagName('td');
var_dump($tags); // previous 2 lines both print NULL
}
I've verified that the ids and tags I'm looking for are in the HTML by error_log($html) and in the parsed doc with $doc->saveHTML(). Anyone see what I'm doing wrong?
Edit:
Thanks all for the help, but I've hit a wall with DOMDocument. Nothing in any of the docs, or other threads, works with Google image queries. Here's what I tried:
I looked at the link @Jon posted and tried all the suggestions there, looked at the getElementById docs, and read all the comments there as well. Still getting empty result sets. Better than NULL, but not much.
I tried the xpath trick:
$xpath = new DOMXPath($doc);
$ccol = $xpath->query("//*[@id='center_col']");
Same result, an empty set.
I did an error_log($html) directly after the file read and the document does have a doctype, so it's not that.
I also see there that user "carl2088" says "From my experience, getElementById seem to work fine without any setups if you have loaded a HTML document". Not in the case of Google image queries, it would appear.
In desperation, I tried
echo count(explode('center_col', $html))
to see if for some strange reason it disappears after the initial error_log($html). It's definitely there, the string is split into 4 chunks.
I checked my version of PHP (5.3.15), compiled Aug. 25, 2012, so it's not a version too old to support getElementById.
Before yesterday, I had been using an extremely ugly series of "explodes" to get the info, and while it's horrid code, it took 45 minutes to write and it works.
I'd really like to ditch my "explode" hack, but 5 hours to achieve nothing vs. 45 minutes to get something that works makes it really difficult to do things the right way.
If anyone else with experience using DOMDocument has some additional tricks I could try, it would be much appreciated.
Are you using the JavaScript getElementById and getElementsByTagName? If yes, then this is the problem:
$tags = $doc->getElementById('center_col');
$tags = $doc->getElementsByTagName('td');
You will need to validate your document with DOMDocument->validate() or DOMDocument->validateOnParse before using $doc->getElementById('center_col'):
$doc->validateOnParse = true;
$doc->loadHTML($html);
stackoverflow: getelementbyid-problem
http://php.net/manual/de/domdocument.getelementbyid.php
It's in the question @Jon posted in his comment!
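Putting the pieces together, a minimal sketch (assuming $html holds the fetched page) would be:
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->validateOnParse = true;   // without this, getElementById() won't know about the id attributes
$doc->loadHTML($html);
libxml_clear_errors();

$centerCol = $doc->getElementById('center_col');
var_dump($centerCol !== null);  // true if the element was found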
I have a very simple code like this:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($data);
libxml_clear_errors();
$dom->preserveWhiteSpace = false;
foreach($dom->getElementsByTagName('img') as $img) {
// do something here
}
The variable $data contains HTML from an external URL. OK, if I test this code using my local webserver with PHP 5.3.6, it works and returns ALL img tags on that page, but the same code running on another server with PHP 5.3.3 DOESN'T WORK! It doesn't return all img tags from the SAME $data value... it only returns the first 13 images.
I suspect that this has something to do with encoding; maybe some characters in $data are badly encoded or something, but I don't know how to solve it. Is there a known bug in PHP 5.3.3 related to this?
I'd suggest you check out the comments on the PHP docs page; it looks like there's some useful advice on DOMDocument usage there:
http://de.php.net/manual/en/domdocument.getelementsbytagname.php
And before you ask on Stack Overflow about possibly known bugs, you might want to search for them on https://bugs.php.net/
Edit:
I think I've found the bug related to that behavior:
https://bugs.php.net/bug.php?id=60762
Even though it's flagged for 5.4.0 RC5, I couldn't replicate the behavior mentioned there.
Probably an issue with the HTML data (as you mentioned).
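If it really is an encoding problem in $data, one thing worth trying (just a sketch of a common workaround, not a confirmed fix for that bug) is to normalise the input before parsing it:
// assumption: $data holds the raw HTML fetched from the external URL
$data = mb_convert_encoding($data, 'HTML-ENTITIES', 'UTF-8');

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($data);
libxml_clear_errors();

echo $dom->getElementsByTagName('img')->length . " img tags found\n";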
I am reading in HTML from a few different sources, which I have to manipulate. As part of this I have a number of preg_replace() calls where I have to replace some of the information within the HTML received.
On 90% of the sites I have to do this on, everything works fine; the remaining 10% return NULL from each of the preg_replace() calls.
I've tried increasing the pcre.backtrack_limit and pcre.recursion_limit based on other articles I've found which appear to have the same problem, but this has been to no avail.
I have output preg_last_error(), which returns 4, and the PHP documentation isn't proving very helpful at all. If anyone can shed any light on this it might start to point me in the right direction, but I'm stumped.
One of the offending examples is:
$html = preg_replace('#<script[^>]*?.*?</script>#siu', '', $html);
but as I said, this works 90% of the time.
Don't parse HTML with regex. Use a real DOM parser:
$dom = new DOMDocument;
$dom->loadHTML($html);
$scripts = $dom->getElementsByTagName('script');
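// the node list returned by getElementsByTagName() is live, so removing item(0) repeatedly drains it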
while ($el = $scripts->item(0)) {
$el->parentNode->removeChild($el);
}
$html = $dom->saveHTML();
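Note that loadHTML() followed by saveHTML() normalises the markup a little: if the input was only a fragment, the output will gain a doctype and html/body wrappers, so the returned string is not guaranteed to be byte-for-byte identical apart from the removed scripts.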
You have bad UTF-8.
/**
* Returned by preg_last_error if the last error was
* caused by malformed UTF-8 data (only when running a regex in UTF-8 mode). Available
* since PHP 5.2.0.
 * @link http://php.net/manual/en/pcre.constants.php
*/
define ('PREG_BAD_UTF8_ERROR', 4);
However, you really should not use regex to parse HTML. Use DOMDocument.
EDIT: Also, I don't think this answer would be complete without including "You can't parse [X]HTML with regex."
Your error #4 is a PREG_BAD_UTF8_ERROR; you should check the charset used on the sites which caused this error.
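To confirm that and work around it, a minimal sketch (the //IGNORE conversion, which silently drops the invalid bytes, is an assumption about what you want to happen to them):
if (preg_last_error() === PREG_BAD_UTF8_ERROR) {
    // strip the invalid byte sequences and retry the replacement
    $html = iconv('UTF-8', 'UTF-8//IGNORE', $html);
    $html = preg_replace('#<script[^>]*?.*?</script>#siu', '', $html);
}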
It is possible that you exceeded backtrack and/or internal recursion limits. See http://php.net/manual/en/pcre.configuration.php
Try this before preg_replace:
ini_set('pcre.backtrack_limit', '10000000');
ini_set('pcre.recursion_limit', '10000000');
I tried several methods to find out what part of an HTML string is invalid:
$dom->loadHTML($badHtml);
$tidy->cleanRepair();
simplexml_load_string($badHtml);
None of them is clear about what part of the HTML is invalid. Maybe an extra config option for one of them can fix that. Any ideas?
I need this to manually fix HTML input from users. I don't want to rely on automated processes.
I'd try loading the offending HTML into a DOMDocument (as you are already doing) and then using SimpleXML to fix things. You should be able to run a quick diff to see where the errors are.
error_reporting(0);
$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';
$doc = new DOMDocument();
$doc->encoding = 'UTF-8';
$doc->loadHTML($badHTML);
$goodHTML = simplexml_import_dom($doc)->asXML();
You can compare the cleaned and bad versions with the PHP Inline-Diff found in the answer to that Stack Overflow question.
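If you also want libxml itself to point at the offending spots (this is only a sketch on top of the answer above), its error objects carry line and column information:
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($badHTML);

foreach (libxml_get_errors() as $error) {
    // each parse error reports where the parser stumbled
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}
libxml_clear_errors();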