So, I've created a 34MB XML file.
When I try to get the output from DOMDocument->saveXML(), it takes 94 seconds to return.
I assume the code that generates this XML is irrelevant here, as the problem is on the saveXML() line:
$this->exportDOM = new DOMDocument('1.0');
$this->exportDOM->formatOutput = TRUE;
$this->exportDOM->preserveWhiteSpace = FALSE;
$this->exportDOM->loadXML('<export><produtos></produtos><fornecedores></fornecedores><transportadoras></transportadoras><clientes></clientes></export>');
[...]
$this->benchmark->mark('a');
$this->exportDOM->saveXML();
$this->benchmark->mark('b');
echo $this->benchmark->elapsed_time('a','b');
die;
This gives me 94.4581.
What am I doing wrong? Do you know of any performance-related issues with DOMDocument when generating the file?
If you need additional info, let me know. Thanks.
I tried removing formatOutput. It improves the performance by 33%.
Still taking too long. Any other tips?
One thing that helped - although it isn't the perfect solution - was setting $this->exportDOM->formatOutput = FALSE;.
It improved the performance by ~33%.
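For reference, here is a minimal self-contained sketch of that comparison. It uses microtime() instead of the benchmark class above (which is assumed to come from the surrounding framework) and builds a synthetic document rather than the real 34MB export:

<?php
// Build a reasonably large document so the timing difference is visible.
$doc  = new DOMDocument('1.0');
$root = $doc->appendChild($doc->createElement('export'));
for ($i = 0; $i < 100000; $i++) {
    $root->appendChild($doc->createElement('produto', 'item ' . $i));
}

foreach (array(true, false) as $format) {
    $doc->formatOutput = $format;
    $start = microtime(true);
    $xml   = $doc->saveXML();
    printf("formatOutput=%s: %.4f s, %d bytes\n",
        $format ? 'TRUE' : 'FALSE',
        microtime(true) - $start,
        strlen($xml));
}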
I apologize in advance if this is deemed not constructive...
Which is better to do?
echo out html code
echo("<div>marry me natalie imbruglia... please</div>");
OR
$dom = new DOMDocument("1.0","utf-8");
$element = $dom->createElement('test', 'This is the root element!');
$dom->appendChild($element);
echo $dom->saveXML();
I must also mention that the platform I'm using at work does not support this... please do not ask why, it is the way it is... I have to work within those bounds...
Personally, I would choose neither. Since I am used to Java Server Pages, I would rather use straight HTML/PHP interleaving like so ...
<?php
$x = "This is the root element!";
?>
<test><?= $x ?></test>
If I had to choose from one of the two options you provided, I would definitely go with the second one because it's much easier to modify the final XML structure in the future.
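To illustrate that point, here is a minimal sketch of how the DOM version stays easy to restructure later (the element and attribute names are illustrative, not from the question):

$dom  = new DOMDocument('1.0', 'utf-8');
$root = $dom->appendChild($dom->createElement('test', 'This is the root element!'));

// Later changes are DOM calls rather than string surgery:
$root->setAttribute('lang', 'en');
$root->appendChild($dom->createElement('child', 'added afterwards'));

echo $dom->saveXML();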
I'll just emphasize Darryl's comment in an earlier post, since I don't have enough reputation right now to add a further comment underneath his. Maintainability, which is closely tied to code readability, is important. If you are asking this question because you are trying to squeeze out "performance", then you're looking in the wrong place.
Possible Duplicate:
PHP HTML DomDocument getElementById problems
I'm trying to extract info from Google searches in PHP and find that I can read the search URLs without problem, but getting anything out of them is a whole different issue. After reading numerous posts and the applicable PHP docs, I came up with the following:
// get large panoramas of montana
$url = 'http://www.google.com/search?q=montana+panorama&tbm=isch&biw=1408&bih=409';
$html = file_get_contents($url);
// was getting tons of "entity parse" errors, so added
$html = htmlentities($html, ENT_COMPAT, 'UTF-8', true); // tried false as well
$doc = new DOMDocument();
//$doc->strictErrorChecking = false; // tried both true and false here, same result
$result = $doc->loadHTML($html);
//echo $doc->saveHTML(); this shows that the tags I'm looking for are in fact in $doc
if ($result === true)
{
var_dump($result); // prints 'true'
$tags = $doc->getElementById('center_col');
$tags = $doc->getElementsByTagName('td');
var_dump($tags); // previous 2 lines both print NULL
}
I've verified that the ids and tags I'm looking for are in the html by error_log($html) and in the parsed doc with $doc->saveHTML(). Anyone see what I'm doing wrong?
Edit:
Thanks all for the help, but I've hit a wall with DOMDocument. Nothing in any of the docs, or other threads, works with Google image queries. Here's what I tried:
I looked at the link @Jon posted and tried all the suggestions there, looked at the getElementById docs, and read all the comments there as well. Still getting empty result sets. Better than NULL, but not much.
I tried the xpath trick:
$xpath = new DOMXPath($doc);
$ccol = $xpath->query("//*[@id='center_col']");
Same result, an empty set.
I did an error_log($html) directly after the file read, and the document has a doctype "", so it's not that.
I also see there that user "carl2088" says "From my experience, getElementById seem to work fine without any setups if you have loaded a HTML document". Not in the case of Google image queries, it would appear.
In desperation, I tried
echo count(explode('center_col', $html))
to see if for some strange reason it disappears after the initial error_log($html). It's definitely there, the string is split into 4 chunks.
I checked my version of PHP (5.3.15), compiled Aug. 25, 2012, so it's not a version too old to support getElementById.
Before yesterday, I had been using an extremely ugly series of "explodes" to get the info, and while it's horrid code, it took 45 minutes to write and it works.
I'd really like to ditch my "explode" hack, but 5 hours to achieve nothing vs. 45 minutes to get something that works makes it really difficult to do things the right way.
If anyone else with experience using DOMDocument has some additional tricks I could try, it would be much appreciated.
Are you using the JavaScript-style getElementById and getElementsByTagName? If yes, then this is the problem:
$tags = $doc->getElementById('center_col');
$tags = $doc->getElementsByTagName('td');
You will need to validate your document with DOMDocument->validate() or DOMDocument->validateOnParse before using $doc->getElementById('center_col'):
$doc->validateOnParse = true;
$doc->loadHTML($html);
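Putting that together with the fetch from the question, a minimal sketch could look like the following. The htmlentities() call from the question is dropped here, since encoding the whole page also encodes the tags the parser needs, and libxml_use_internal_errors() is used instead to keep the "entity parse" warnings quiet:

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them

$url  = 'http://www.google.com/search?q=montana+panorama&tbm=isch&biw=1408&bih=409';
$html = file_get_contents($url);

$doc = new DOMDocument();
$doc->validateOnParse = true;       // as suggested above, so getElementById() can find IDs
$doc->loadHTML($html);

$center = $doc->getElementById('center_col');
$cells  = $doc->getElementsByTagName('td');
var_dump($center ? $center->nodeName : null, $cells->length);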
stackoverflow: getelementbyid-problem
http://php.net/manual/de/domdocument.getelementbyid.php
It's the link @Jon posted in his comment on the question!
I have a deadline for this project (Monday). It worked great on the localhost, but when I uploaded it to our web server, I discovered that we do not have all of the DOM package enabled and I cannot use the function dom_import_simplexml(). My server admin is ignoring my requests, probably because of the short notice, and I cannot possibly rewrite the XML system into a database system that quickly.
This appears to be the only error I'm encountering. Please, if you have any alternative ideas, I'd love to hear them. I'm at a loss, because I can't find any other solution. All I have to do is create some XML elements and populate them with CDATA values, and I can't believe that this is not supported by SimpleXML!
Please, are there any alternatives you can think of? I'm open, because I don't know what I can do.
// Add a <pages /> element
$xml->addChild('pages');
// Populate it with data
foreach($pages as $page){
$new = $xml->pages->addChild('page');
$new->addAttribute('pid', $page['pid']);
$new->addAttribute('title', $page['title']);
if(isset($page['ancestor']))
$new->addAttribute('ancestor', $page['ancestor']);
$node = dom_import_simplexml($new);
$no = $node->ownerDocument;
$node->appendChild($no->createCDATASection($page['content']));
}
Those last three lines obviously won't work, and I don't know what else I can do with them!
Thank you so much for any help you can provide.
Them not supporting this is baffling. You must hack the stream.
$xml->node = '<![CDATA[character data]]>';
echo preg_replace('/\]\]&gt;/', ']]>', preg_replace('/&lt;!\[CDATA/', '<![CDATA', $xml->asXML()));
This is problematic; those strings might show up in your CDATA somewhere. A risk you'll have to take.
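For what it's worth, a small standalone demonstration of the hack (the element names here are illustrative, not taken from the question's loop):

$xml = new SimpleXMLElement('<export/>');
// Property assignment stores the markup escaped: &lt;![CDATA[character data]]&gt;
$xml->node = '<![CDATA[character data]]>';

// Undo the escaping of the CDATA delimiters only.
echo preg_replace(
    array('/&lt;!\[CDATA\[/', '/\]\]&gt;/'),
    array('<![CDATA[', ']]>'),
    $xml->asXML()
);
// Output (after the XML declaration): <export><node><![CDATA[character data]]></node></export>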
Sorry I couldn't help you with your deadline, but at least anyone annoyed with using DOM or unable to use it will have a solution of sorts.
I tried several methods to find out which part of an HTML string is invalid:
$dom->loadHTML($badHtml);
$tidy->cleanRepair();
simplexml_load_string($badHtml);
None of them is clear about which part of the HTML is invalid. Maybe an extra config option for one of them can fix that. Any ideas?
I need this to manually fix HTML input from users. I don't want to rely on automated processes.
I'd try loading the offending HTML into a DOM Document (as you are already doing) and then using simplexml to fix things. You should be able to run a quick diff to see where the errors are.
error_reporting(0);
$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';
$doc = new DOMDocument();
$doc->encoding = 'UTF-8';
$doc->loadHTML($badHTML);
$goodHTML = simplexml_import_dom($doc)->asXML();
You can compare the cleaned and bad versions with the PHP Inline-Diff found in the answer to that Stack Overflow question.
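A related sketch, in case it helps: instead of silencing the parser with error_reporting(0), libxml can collect the warnings itself, and each one carries the line and column of the offending markup (this is standard libxml error handling, not part of the diff approach above):

libxml_use_internal_errors(true);   // collect warnings instead of printing them

$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';
$doc = new DOMDocument();
$doc->loadHTML($badHTML);

foreach (libxml_get_errors() as $error) {
    // line and column refer to the string passed to loadHTML()
    printf("line %d, col %d: %s\n", $error->line, $error->column, trim($error->message));
}
libxml_clear_errors();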
I'm running into memory issues with PHP Simple HTML DOM Parser. I'm parsing a fair sized doc and need to run down the DOM tree...
1) I'm starting with the whole file:
$html = file_get_html($file);
2) Then parsing out my table:
$table = $html->find('table.big');
3) Then parsing out my rows:
$rows = $table[0]->find('tr');
What I'm ending up with are three GIANT objects... does anyone know how to dump an object after I've parsed it for the data I need? $html is useless by step 3, yet it's the largest of all the objects.
Any ideas?
Is there a way to drill down to my table rows out of the original $html object?
Thanks in advance.
EDIT:
I've managed to skip step two with:
$rows = $this->html->find('table.big tr');
But am still running into memory issues...
I may be a little late to answer, since I joined late, but the answers given above are not correct. unset() only unsets $html, not its properties. So the way to clean up the memory and get rid of the memory issue is:
use $html->clear();.
I don't think you read the class code before using it. The clear() function destroys/releases the memory eaten up by the $html object. It is an internal function of simple_html_dom and it takes effect immediately, so you don't have to wait all day or for program termination.
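A minimal sketch of how that could look in the flow from the question (pulling the row text into a plain array first is an assumption about what you do with the rows):

$html = file_get_html($file);

// Pull what is needed into plain PHP values before dropping the parser objects.
$data = array();
foreach ($html->find('table.big tr') as $row) {
    $data[] = $row->plaintext;
}

$html->clear();   // release simple_html_dom's internal references
unset($html);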
You can increase the memory limit.
ini_set('memory_limit', '64M');
or clear the memory with this code
$html->__destruct();
unset($html);
$html = null;
If memory is really a big concern, you may want to look into SAX instead of using DOM. You may want to try unset() on $html after obtaining $table, but that just marks it for garbage collection, and the memory won't be freed up immediately.
At the end of the day, it is really up to how memory-efficient Simple HTML DOM is written, or which implementation you have chosen.
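If you do go that route, here is a minimal sketch of PHP's expat-based SAX API (note that it expects well-formed XML, so real-world HTML would have to be tidied first; the tag names filtered for are just illustrative):

$parser = xml_parser_create();

// The first callback fires for every opening tag; only <tr> and <td> are of interest here.
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) {
        if ($name === 'TR' || $name === 'TD') {
            echo "open: $name\n";
        }
    },
    function ($parser, $name) {
        // closing tags are ignored in this sketch
    }
);

$xml = '<table><tr><td>cell 1</td><td>cell 2</td></tr></table>';
xml_parse($parser, $xml, true);
xml_parser_free($parser);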
...how to dump an object after I've parsed it for the data I need? Like $html...
unset($html) ?
or $html = null; might work better - more of an immediate effect?