I'm running into memory issues with PHP Simple HTML DOM Parser. I'm parsing a fairly large document and need to walk down the DOM tree...
1) I'm starting with the whole file:
$html = file_get_html($file);
2) then parsing out my table:
$table = $html->find('table.big');
3) then parsing out my rows:
$rows = $table[0]->find('tr');
What I'm ending up with is three GIANT objects... does anyone know how to dump an object once I've parsed the data I need out of it? $html is useless by step 3, yet it's the largest of all the objects.
Any ideas?
Is there a way to drill down to my table rows out of the original $html object?
Thanks in advance.
EDIT:
I've managed to skip step two with:
$rows = $this->html->find('table.big tr');
But I'm still running into memory issues...
I may be a little late to answer since I joined late, but the answers given above are not correct: unset only unsets $html, not its properties. To clean up the memory and put an end to the memory issue, use $html->clear();.
I think you didn't read the class code before using it. The clear() function destroys and releases the memory eaten up by the $html object. It is an internal function of simple_html_dom, and it takes effect immediately, so you don't have to wait all day, or for program termination, for the memory to come back.
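For example, a minimal sketch of the tear-down order, assuming the usual simple_html_dom API (copy plain strings out first, then release the parser):
$html = file_get_html($file);
$rows = $html->find('table.big tr');
$data = array();
foreach ($rows as $row) {
    $data[] = $row->plaintext; // copy plain strings out of the DOM objects
}
$html->clear(); // break simple_html_dom's internal circular references
unset($html);   // then drop the variable itself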
You can increase the memory limit.
ini_set('memory_limit', '64M');
or clear the memory with this code
$html->__destruct();
unset($html);
$html = null;
If memory is really a big concern, you may want to look into SAX instead of DOM. You may want to try unset() on $html after obtaining $table, but that simply marks it for garbage collection, so memory won't be freed up immediately.
At the end of the day, it really comes down to how memory-efficiently Simple HTML DOM is written and which implementation you have chosen.
...how to dump an object after I've
parsed it for the data I need? Like
$html...
unset($html) ?
or $html = null; might work better - more of an immediate effect?
I've been working on a system for a few hours now and this little thing is too much for me to think logically about at the moment.
Normally I would wait a few hours, but this is a last-minute job and I need to finish it.
Here's my problem:
I have an XML file that gets posted to my PHP file, the PHP file inserts certain data into a DB, but some XML nodes have the same name:
<accessoires>
<accessoire>value1</accessoire>
<accessoire>value2</accessoire>
<accessoire>value3</accessoire>
</accessoires>
Now I want to get a var $acclist which contains all the values separated by commas:
value1,value2,value3,
I bet the solution is very easy, but I'm at that familiar point where even the easiest piece of code becomes a hassle. And Googling only turns up examples where the nodes have their own identifiers in some way.
Could someone help me out please?
You can try simplexml_load_string to parse the XML, then call implode on the node after casting it to an array.
NOTE: This code was tested in PHP 5.4.6 and behaves as expected.
<?php
$xml = '<accessoires>
<accessoire>value1</accessoire>
<accessoire>value2</accessoire>
<accessoire>value3</accessoire>
</accessoires>';
$dat = simplexml_load_string($xml);
echo implode(",",(array)$dat->accessoire);
For PHP 5.3.x I had to change it to:
$xml = '<accessoires>
<accessoire>value1</accessoire>
<accessoire>value2</accessoire>
<accessoire>value3</accessoire>
</accessoires>';
$dat = simplexml_load_string($xml);
$dat = (array)$dat;
echo implode(",",$dat["accessoire"]);
You do this with a library that is able to parse and process XML, for example SimpleXML:
implode(',', iterator_to_array($accessoires->accessoire, FALSE));
The key part here is using iterator_to_array(), as SimpleXML offers same-named child elements as an iterator. Otherwise $accessoires->accessoire auto-magically gives you only the first element (if any).
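Putting that together with the XML from the question, a complete runnable example might look like this:
$xml = '<accessoires>
<accessoire>value1</accessoire>
<accessoire>value2</accessoire>
<accessoire>value3</accessoire>
</accessoires>';
$accessoires = simplexml_load_string($xml);
echo implode(',', iterator_to_array($accessoires->accessoire, FALSE));
// prints: value1,value2,value3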
So, I've created a 34MB XML file.
When I try to get the output from DOMDocument->saveXML(), it takes 94 seconds to return.
I assume the code that generates this XML is irrelevant here, as the problem is on the saveXML() line:
$this->exportDOM = new DOMDocument('1.0');
$this->exportDOM->formatOutput = TRUE;
$this->exportDOM->preserveWhiteSpace = FALSE;
$this->exportDOM->loadXML('<export><produtos></produtos><fornecedores></fornecedores><transportadoras></transportadoras><clientes></clientes></export>');
[...]
$this->benchmark->mark('a');
$this->exportDOM->saveXML();
$this->benchmark->mark('b');
echo $this->benchmark->elapsed_time('a','b');
die;
This gives me 94.4581.
What am I doing wrong? Do you guys know any performance-related issues with DOMDocument when generating the file?
If you need additional info, let me know. Thanks.
I tried removing formatOutput. It improves the performance by 33%.
Still taking too long. Any other tips?
One thing that helped - although it isn't the perfect solution - was setting $this->exportDOM->formatOutput = FALSE;.
It improved the performance by ~33%.
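For what it's worth, a quick way to confirm the effect is to time saveXML() with formatting on and off; this sketch assumes $dom already holds the large generated document:
// time the pretty-printing pass against raw serialization
$dom->formatOutput = TRUE;
$start = microtime(TRUE);
$dom->saveXML();
printf("formatted: %.4f s\n", microtime(TRUE) - $start);
$dom->formatOutput = FALSE;
$start = microtime(TRUE);
$dom->saveXML();
printf("unformatted: %.4f s\n", microtime(TRUE) - $start);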
I want to extract the content of a specific div in an external webpage; the div looks like this:
<dt>Win rate</dt><dd><div>50%</div></dd>
My target is the "50%". I'm currently using this PHP code to extract the content:
function getvalue($parameter, $content) {
    preg_match($parameter, $content, $match);
    return $match[1];
}
$parameter = '#<dt>Score</dt><dd><div>(.*)</div></dd>#';
$content = file_get_contents('https://somewebpage.com');
Everything works fine; the problem is that this method is taking too much time, especially if I have to use it several times with different $content.
I would like to know if there's a better (faster, simpler, etc.) way to accomplish the same thing. Thanks!
You may use DOMDocument::loadHTML and navigate your way to the given node.
$content = file_get_contents('https://somewebpage.com');
$doc = new DOMDocument();
$doc->loadHTML($content);
Now to get to the desired node, you may use the method DOMDocument::getElementsByTagName, e.g.
$dds = $doc->getElementsByTagName('dd');
foreach($dds as $dd) {
// process each <dd> element here, extract inner div and its inner html...
}
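One hedged way to fill in that loop body, given the markup from the question, is to look for the <dt> labelled "Win rate" and read the text of the <dd> that follows it (this assumes no whitespace nodes sit between the two elements):
foreach ($doc->getElementsByTagName('dt') as $dt) {
    if (trim($dt->textContent) === 'Win rate') {
        $dd = $dt->nextSibling;          // the adjacent <dd> element
        if ($dd !== null) {
            echo trim($dd->textContent); // "50%"
        }
        break;
    }
}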
Edit: I see the point #pebbl has made about DOMDocument being slower. Indeed it is; however, parsing HTML with preg_match is asking for trouble. In that case, I'd also recommend looking at an event-driven SAX XML parser. It is much more lightweight, faster, and less memory-intensive, as it does not build a tree. You may take a look at XML_HTMLSax for such a parser.
There are basically three main things you can do to improve the speed of your code:
Offload the external page load to another time (i.e. use cron)
On a Linux-based server I would know what to suggest, but seeing as you use Windows I'm not sure what the equivalent would be; Cron for Linux allows you to fire off scripts at scheduled time offsets - in the background - without a browser. Basically I would recommend that you create a script whose sole purpose is to go and fetch the website pages at a particular time offset (depending on how frequently you need to update your data) and then write those webpages to files on your local system.
$listOfSites = array(
'http://www.something.com/page.htm',
'http://www.something-else.co.uk/index.php',
);
$dirToContainSites = getcwd() . '/sites';
foreach ( $listOfSites as $site ) {
$content = file_get_contents( $site );
/// I've just simply converted the URL into a filename here; there are
/// better ways of handling this, but this at least keeps things simple.
/// The following just converts any non-letter or non-number into an
/// underscore... so, http___www_something_com_page_htm
$file_name = preg_replace('/[^a-z0-9]/i','_', $site);
file_put_contents( $dirToContainSites . '/' . $file_name, $content );
}
Once you've created this script, you then need to set the server up to execute it as regularly as you need. Then you can modify your front-end script that displays the stats to read from the local files; this will give a significant speed increase.
You can find out how to read files from a directory here:
http://uk.php.net/manual/en/function.dir.php
Or the simpler method (though prone to possible problems) is just to step through your array of sites again, convert the URLs to file names using the preg_replace above, and then check for each file's existence in the folder, as sketched below.
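A rough sketch of that simpler method, reusing $listOfSites and $dirToContainSites from above:
foreach ( $listOfSites as $site ) {
    $file_name = preg_replace('/[^a-z0-9]/i', '_', $site);
    $path = $dirToContainSites . '/' . $file_name;
    if ( file_exists($path) ) {
        $content = file_get_contents($path);
        /// run your stats extraction over $content here
    }
}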
Cache the result of calculating your statistics
It's quite likely that, this being a stats page, you'll want to visit it quite frequently (not as frequently as a public page, but still). If the page is visited more often than the cron-based script is executed, there is no reason to do all the calculation again. So basically all you have to do to cache your output is something similar to the following:
$cachedVersion = getcwd() . '/cached/stats.html';
/// check to see if there is a cached version of this page
if ( file_exists($cachedVersion) ) {
/// if so, load it and echo it to the browser
echo file_get_contents($cachedVersion);
}
else {
/// start output buffering so we can catch what we send to the browser
ob_start();
/// DO YOUR STATS CALCULATION HERE AND ECHO IT TO THE BROWSER LIKE NORMAL
/// end output buffering and grab the contents so we now have a string
/// of the page we've just generated
$content = ob_get_contents(); ob_end_clean();
/// write the content to the cached file for next time
file_put_contents($cachedVersion, $content);
echo $content;
}
Once you start caching things you need to be aware of when you should delete or clear your cache - otherwise your stats output will never change. With regards to this situation, the best time to clear your cache is at the point you go and fetch the external web pages again. So you should add these lines to the bottom of your "cron" script.
$cachedVersion = getcwd() . '/cached/stats.html';
unlink( $cachedVersion ); /// will delete the file
There are other speed improvements you could make to the caching system (you could even record the modified times of the external webpages and load only when they have been updated) but I've tried to keep things easy to explain.
Don't use a HTML Parser for this situation
Scanning an HTML file for one particular unique value does not require the use of a full-blown or even lightweight HTML parser. Using RegExp incorrectly seems to be one of those things lots of start-up programmers fall into, and is a question that is always asked. This has led to lots of automatic knee-jerk reactions from more experienced coders who automatically adhere to the following logic:
if ( $askedAboutUsingRegExpForHTML ) {
$automatically->orderTheSillyPersonToUse( $HTMLParser );
} else {
$soundAdvice = $think->about( $theSituation );
print $soundAdvice;
}
HTML parsers should be used when the target within the markup is not so unique, or when your pattern to match relies on such flimsy rules that it'll break the second an extra tag or character occurs. They should be used to make your code more reliable, not to speed things up. Even parsers that do not build a tree of all the elements will still be using some form of string searching or regular expression notation, so unless the library code you are using has been compiled in an extremely optimised manner, it will not beat well-coded strpos/preg_match logic.
Considering I have not seen the HTML you are hoping to parse, I could be way off, but from what I've seen of your snippet it should be quite easy to find the value using a combination of strpos and preg_match. Obviously if your HTML is more complex and might have random multiple occurrences of <dt>Win rate</dt><dd><div>50%</div></dd> it will cause problems - but even so, an HTML parser would have the same problem.
$offset = 0;
/// loop through the occurrences of 'Win rate'
while ( ($p = stripos($html, 'win rate', $offset)) !== FALSE ) {
    /// grab out a snippet of the surrounding HTML to speed up the RegExp
    /// (note: substr's third argument is a length, not an end position)
    $snippet = substr($html, $p, 50);
    /// I've extended your RegExp to try and account for 'white space' that could
    /// occur around the elements. The following won't take into account any random
    /// attributes that may appear, so if you find some pages aren't working - echo
    /// out the $snippet var using something like "echo '<xmp>'.$snippet.'</xmp>';"
    /// and that should show you what is appearing that is breaking the RegExp.
    if ( preg_match('#^win\s+rate\s*</dt>\s*<dd>\s*<div>\s*([0-9]+%)\s*<#i', $snippet, $regs) ) {
        /// once you are here your % value will be in $regs[1];
        break; /// exit the while loop as we have found our 'Win rate'
    }
    /// move the offset past this occurrence so the next loop doesn't re-find it
    $offset = $p + 1;
}
Gotchas to be aware of
If you are new to PHP, as you state in a comment above, then the above may seem rather complicated - which it is. What you are trying to do is quite complex, especially if you want to do it optimally and fast. However, if you follow through the code I've given and research any bits you aren't sure of / haven't heard of (php.net is your friend), it should give you a better understanding of a good way to achieve what you are doing.
Guessing ahead however, here are some of the problems you might face with the above:
File permission errors - in order to be able to read and write files to and from the local operating system you will need the correct permissions to do so. If you find you cannot write files to a particular directory, it might be that your host won't allow you to do so. If this is the case you can either contact them to ask about getting write permission to a folder, or if that isn't possible you can easily change the code above to use a database instead.
I can't see my content - when using output buffering, all the echo and print commands do not get sent to the browser; they instead get saved up in memory. PHP should automatically output all the stored content when the script exits, but if you use a command like ob_end_clean() this actually wipes the 'buffer', so all the content is erased. This can lead to confusing situations when you know you are echoing something... but it just isn't appearing.
(Mini Disclaimer :) I've typed all the above manually so you may find there are PHP errors, if so, and they are baffling, just write them back here and StackOverflow can help you out)
Instead of trying to avoid preg_match, why not just trim your document contents down in size? For example, you could dump everything before <body and everything after </body>; then preg_match will be searching less content already.
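A rough sketch of that trimming step (assuming the page has a conventional <body> section):
$start = stripos($content, '<body');
$end   = stripos($content, '</body>');
if ($start !== FALSE && $end !== FALSE) {
    // keep only the body, so preg_match scans far less text
    $content = substr($content, $start, $end - $start);
}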
Also, you could try to run each of these processes as a pseudo-separate thread, so that they aren't happening one at a time.
I have a problem with the processing time of large XML files. By large, I mean 600 MB on average.
Currently, it takes about 50-60 minutes to parse one and insert the data into a database.
I would like to ask for suggestions on how I can improve the processing time - say, down to 20 minutes.
At the current rate it will take me 2.5 months to populate the database with the content from the XML, since I have 3000+ XML files averaging 600 MB each. My PHP script runs from the command line through a cron job.
I have also read other questions like the one below, but I have not found any idea yet.
What is the fastest XML parser in PHP?
I see that some have parsed files up to 2GB. I wonder how long their processing times were.
I hope you guys could lend your help.
It would be much appreciated.
Thanks.
I have this code:
$handler = $this;
$parser = xml_parser_create('UTF-8');
xml_set_object($parser, $handler);
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler($parser, "startElement", "endElement");
xml_set_character_data_handler($parser, "cdata");
$fp = fopen($xmlfile, 'r');
while (!feof($fp)) {
    $data = fread($fp, 71680);
    // feed each chunk to the parser; the final read (at EOF) finishes the parse
    xml_parse($parser, $data, feof($fp));
}
I first put the parsed data into a temporary array.
My MySQL insert commands are inside the endElement function.
A specific closing tag triggers my insert command to the database; a sketch of that layout follows.
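For illustration, a hedged sketch of that handler layout (the 'record' tag name and insertRow method are placeholders, since the real ones aren't shown):
public function startElement($parser, $name, $attrs) {
    $this->currentTag = $name;
}
public function cdata($parser, $data) {
    if ($this->currentTag !== '') {
        // buffer character data under the tag currently open
        $this->row[$this->currentTag] = isset($this->row[$this->currentTag])
            ? $this->row[$this->currentTag] . $data
            : $data;
    }
}
public function endElement($parser, $name) {
    $this->currentTag = '';
    if ($name === 'record') {          // the "specific closing tag"
        $this->insertRow($this->row);  // the MySQL insert lives here
        $this->row = array();
    }
}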
Thanks for the response....
Without seeing any code, the very first thing I have to suggest is NOT to use either DOM or SimpleXMLElement, as these load the whole thing into memory.
You need to use a stream parser like XMLReader.
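For reference, a minimal XMLReader sketch (the file and element names are placeholders):
$reader = new XMLReader();
$reader->open('huge-file.xml');
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'record') {
        // handle one record at a time; memory use stays flat
        $recordXml = $reader->readOuterXML();
    }
}
$reader->close();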
EDIT:
Since you are already using a stream parser, you aren't going to get huge gains from changing parsers (I honestly don't know the speed difference between XML Parser and XMLReader; since the latter uses libxml it may be better, but probably not by enough to matter).
Next thing to look at is whether you're doing anything silly in your code; for that we'd need to see a more substantial overview of how you've implemented this.
You say you are putting data in a temporary array and calling MySQL insert once you reach a closing tag. Are you using prepared statements? Are you using transactions to do multiple inserts in bulk?
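If not, here's a hedged sketch of what that combination looks like with PDO (the table and column names are invented for illustration):
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$stmt = $pdo->prepare('INSERT INTO items (name, value) VALUES (?, ?)');
$pdo->beginTransaction();
foreach ($buffer as $row) {  // $buffer = your temporary array
    $stmt->execute(array($row['name'], $row['value']));
}
$pdo->commit();  // one transaction for the whole batch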
The right way to get at your bottleneck though is to run a profiler over your code. My favourite tool for the job is xhProf with XHGui. This will tell you what functions are running, how many times, for how long and how much memory they consume (and can display it all in a nice call-graph, very useful).
Follow the instructions in that project's GitHub README. Here's a tutorial and another useful tutorial (bear in mind this last one is for the profiler without the XHGui extensions that I linked to).
You only seem to need to parse and read the data, not edit the XML. With this in mind, I would say using a SAX parser is the easier and faster way to do this.
SAX is an approach to parse XML documents, but not to validate them. The good thing is that you can use it with both PHP 4 and PHP 5 with no changes. In PHP 4, the SAX parsing is already available on all platforms, so no separate installation is necessary.
You basically define a function to be run when a start element is found and another to be run when an end element is found (you can also use one for attributes). And then you do whatever you want with the parsed data.
Parsing XML with SAX
<?php
function start_element($parser, $element_name, $element_attrs) {
switch ($element_name) {
case 'KEYWORDS':
echo '<h1>Keywords</h1><ul>';
break;
case 'KEYWORD':
echo '<li>';
break;
}
}
function end_element($parser, $element_name) {
switch ($element_name) {
case 'KEYWORDS':
echo '</ul>';
break;
case 'KEYWORD':
echo '</li>';
break;
}
}
function character_data($parser, $data) {
echo htmlentities($data);
}
$parser = xml_parser_create();
xml_set_element_handler($parser, 'start_element', 'end_element');
xml_set_character_data_handler($parser, 'character_data');
$fp = fopen('keyword-data.xml', 'r')
or die ("Cannot open keyword-data.xml!");
while ($data = fread($fp, 4096)) {
xml_parse($parser, $data, feof($fp)) or
die(sprintf('XML ERROR: %s at line %d',
xml_error_string(xml_get_error_code($parser)),
xml_get_current_line_number($parser)));
}
xml_parser_free($parser);
?>
Source: I worked on parsing and processing large amounts of XML data.
EDIT: Better example
EDIT: Well, apparently you are already using a SAX parser. As long as you are actually processing the file in an event-driven way (without any additional overhead), you should be at top performance in this department. I would say there is nothing you can do to increase the parsing performance. If you are having performance issues, I would suggest looking at what you are doing in your code to find performance bottlenecks (try using a PHP profiler like this one). If you post your code here, we could give it a look! Cheers!
I have spent the last day or so tackling the same problem. I noticed that limiting the number of insert queries reduced the processing time quite significantly. You might have already done this, but try collecting a batch of parsed data into a suitable data structure (I am using a simple array, but maybe a more suitable data structure could further reduce the cost?). Upon collecting X sets, insert the data in one go (INSERT INTO table_name (field_name) VALUES (set_1), (set_2), (set_n)).
Hope this helps anyone who might stumble upon this page. I am still working out other bottlenecks; if I find something new I will post it here.
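A sketch of that batching pattern with mysqli (the table, column, and batch size are illustrative):
$batch = array();
foreach ($parsedValues as $value) {
    $batch[] = "('" . $mysqli->real_escape_string($value) . "')";
    if (count($batch) >= 1000) {
        // one multi-row INSERT instead of 1000 single-row ones
        $mysqli->query('INSERT INTO table_name (field_name) VALUES ' . implode(',', $batch));
        $batch = array();
    }
}
if ($batch) { // flush the remainder
    $mysqli->query('INSERT INTO table_name (field_name) VALUES ' . implode(',', $batch));
}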
I want to remove all children from an XML node using PHP DOM. Is there any difference between:
A)
while ($parentNode->hasChildNodes()){
$parentNode->removeChild($parentNode->childNodes->item(0));
}
AND
B)
$node->nodeValue = "";
I prefer the second one; it seems like I get the same result, but I'm not sure.
Thanks,
Carlos
Slightly tighter:
while ($parentNode->hasChildNodes()) {
$parentNode->removeChild($parentNode->firstChild);
}
removeChild() is the more "proper" way of doing things. While you can set the contents of that node to "" and this will achieve the desired effect, calling removeChild() makes it much more apparent what is going on. However, my assumption would be that, at a minuscule level, setting nodeValue is slightly more efficient.
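As a quick sanity check, here's a self-contained example of the removeChild() loop; the tiny document is just for illustration:
$doc = new DOMDocument();
$doc->loadXML('<root><item>a</item><item>b</item></root>');
$root = $doc->documentElement;
while ($root->hasChildNodes()) {
    $root->removeChild($root->firstChild);
}
echo $doc->saveXML($root); // <root/>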