I'm trying to create some XML, basically by reading RSS feeds and adding some custom tags to them. I've made a function that contains my code, and now I want to call the function several times with different RSS URLs. Each call will produce a different .xml file.
I use DOMDocument to load and parse the RSS, and simple_html_dom to load and parse the link of each RSS item so I can pull some content out of the HTML.
Here is a simplified example of my code:
<?php
include('simple_html_dom.php');
load_custom_rss('http://www.somesite.com/rssfeed/articles', 'articles.xml');
load_custom_rss('http://www.somesite.com/rssfeed/jobs', 'jobs.xml');
load_custom_rss('http://www.somesite.com/rssfeed/press', 'press.xml');
//up to 20 similar function calls here...
function load_custom_rss($link, $filename){
$doc = new DOMDocument();
$doc->load($link);
$newDoc = new DOMDocument('1.0', 'UTF-8');
$rss = $newDoc->createElement('rss');
$channel = $newDoc->createElement('channel');
$newDoc->appendChild($rss);
$rss->appendChild($channel);
foreach ($doc->getElementsByTagName('item') as $node) {
//here is some code to read items from rss xml / write them to new xml document.
//Code missing for simplicity
//Next lines used to get some elements from the html of the item's link
$html = new simple_html_dom();
$html->load_file($node->getElementsByTagName('link')->item(0)->nodeValue);
$ret = $html->find('#imgId');
}
$newDoc->formatOutput = true;
$fh = fopen($filename, 'w') or die("can't open file");
fwrite($fh, $newDoc->saveXML());
fclose($fh);
unset($doc);
//unset ALL variables and objects created in this function...
//........
}//function end
?>
My problem is that each call of the function consumes quite an amount of memory, so after the 3rd or 4th call Apache throws a fatal error because the script exceeds the memory_limit, even though I unset ALL variables and objects created in the function. If I reduce the function calls to 1 or 2, everything works fine.
Is there any way to make this work? I was thinking that each function call should wait for the previous one to finish before starting, but how could this be done?
Hope somebody can help.
Thanks in advance.
What you describe is normal behaviour in PHP: a script is worked through from top to bottom, and each function call has to wait until the previous one has finished. I think your problem is rather the memory_limit in php.ini. Open the file and search for the directive memory_limit (http://www.php.net/manual/en/ini.core.php#ini.memory-limit) and increase it to fit your needs.
You're unsetting $doc but not $newDoc, try adding
unset($newDoc);
at the end of that function.
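For example, a minimal cleanup sketch at the very end of load_custom_rss() could look like this (gc_collect_cycles() needs PHP 5.3+, and whether it helps in your case is only an assumption):
// release both DOM documents and ask PHP to collect circular references
unset($doc);
unset($newDoc);
if (function_exists('gc_collect_cycles')) {
    gc_collect_cycles(); // DOM objects often hold cycles that plain unset() leaves behind
}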
As others have said, the problem is that you're leaking memory or exceeding your memory limit; it has nothing to do with waiting until previous code has finished.
Alternatively you could put each call to load_custom_rss() into separate requests, so the script calls one and then reloads itself, i.e.
$i = isset($_GET['i']) ? (int)$_GET['i'] : 0;
if ($i == 0)
    load_custom_rss('http://www.somesite.com/rssfeed/articles', 'articles.xml');
elseif ($i == 1)
    load_custom_rss('http://www.somesite.com/rssfeed/jobs', 'jobs.xml');
... etc ...
else
    die("I'm done");
header("Location: myself.php?i=".($i+1));
exit;
Your approach to reloading the script would likely be different of course, depending on whether the page needs to render any HTML first.
Related
Really stumped on this one and feel like an idiot! I have a small PHP cron job that does its thing every few minutes. The client has requested that the app email them with a daily overview of issues raised....
To do this, I decided to dump an array to a file for storage purposes. I decided against a SQL DB to keep this standalone and lightweight.
What I want to do is open said file, add to a set of numbers and save again.
I have tried this with SimpleXML and serialize/file_put_contents.
The issue I have is that what is written to the file does not correspond to the array echoed on the line before. Say I'm adding 2 to the total; the physical file has added 4.
The following is ugly and just a snippet:
echo "count = ".count($result);"<br/>";
$arr = loadLog();
dumpArray($arr, "Pre Load");
$arr0['count'] = $arr['count']+(count($result));
echo "test ".$arr0['count'];
dumpArray($arr0, "Pre Save");
saveLog($arr0);
sleep(3);
$arr1 = loadLog();
dumpArray($arr1, "Post Save");
function saveLog($arr){
$content = serialize($arr);
var_dump($content);
file_put_contents(STATUS_SOURCE, $content);
}
function loadLog(){
$content = unserialize(file_get_contents(STATUS_SOURCE));
return $content;
}
function dumpArray($array, $title = false){
echo "<p><h1>".$title."</h1><pre>";
var_dump($array);
echo "</pre></p>";
}
Output file: a:1:{s:5:"count";i:96;}
I'd really appreciate any pointers; I've had someone else look at it and he also scratched his head.
Check that .htaccess isn't sending 404 errors to the same script. In my case Chrome was looking for favicon.ico, which did not exist, and this caused the script to execute a second time.
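As a quick guard while debugging, you could also bail out early for requests that clearly aren't meant for the cron script; this is only a sketch, and the exact condition depends on what your rewrite rules actually forward:
// ignore stray requests such as favicon.ico that .htaccess may route to this script
if (isset($_SERVER['REQUEST_URI']) && substr($_SERVER['REQUEST_URI'], -11) === 'favicon.ico') {
    exit;
}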
I am modifying an old script for my client that was written by another developer. He included the same file many times, such as a config file, and that is causing some variables to be overwritten. I want to count how many times a particular file is included during the complete page execution, for example how many times the config file is loaded, and ideally also get the file names and line numbers of the places where those files are included.
If there is any way to get this done, it would help a lot.
Thanks.
If you can, this would be best done in the included file itself. Add a line, such as track_inclusion(__FILE__); at the start of it. Define the function like so:
function track_inclusion($filename=null) {
static $inclusions = array();
if( !$filename) return $inclusions;
if( !isset($inclusions[$filename])) $inclusions[$filename] = array();
$trace = debug_backtrace();
foreach($trace as $t) {
if( !preg_match("/^(?:include|require)(?:_once)?$/i",$t['function'])) continue;
$inclusions[$filename][] = $t;
break;
}
}
Then, once you're all done, you can call track_inclusion() to retrieve the inclusion data and var_dump it out to have a look - once you see the structure it gives you, you could present it in a more meaningful way.
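For example, at the end of the request you could summarise the collected data like this (just one possible way to present it):
// each stored trace frame records the file and line of the include/require call
$inclusions = track_inclusion(); // calling it with no argument returns the collected data
foreach ($inclusions as $file => $traces) {
    echo $file . ' was included ' . count($traces) . " time(s)\n";
    foreach ($traces as $t) {
        echo '  from ' . $t['file'] . ' on line ' . $t['line'] . "\n";
    }
}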
I have a daily cron job which gets an XML feed from a web service. Sometimes it is large: it contains information on more than 10K products, and the XML can be around 14 MB, for example.
What I need to do is parse the XML into objects and then process them. The processing is quite complicated. It's not a case of putting them straight into the database; I need to do a lot of operations on them and finally store them in many database tables.
It is all in one PHP script. I don't have any experience with handling large amounts of data.
The problem is that it takes a lot of memory and a very long time. On my localhost I raised PHP's memory_limit to 4G and the script ran for 3.5 hours before finishing successfully, but my production host does not allow that much memory.
I've done some research, but I am still confused about the right way to deal with this situation.
Here is a sample of my code:
function my_items_import($xml){
$results = new SimpleXMLElement($xml);
$results->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
//it will loop over 10K
foreach($results->xpath('//i:Item') as $data) {
$data->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
//my processing code here, it will call a other functions to do a lot things
processing($data);
}
unset($results);
}
As a start, don't use SimpleXMLElement on the whole document. SimpleXMLElement loads everything into memory and is not efficient for large data. Here is a snippet from real code; you'll need to adapt it to your case, but I hope you'll get the general idea.
$reader = new XMLReader();
$reader->xml($xml);
// Get cursor to first article
while($reader->read() && $reader->name !== 'article');
// Iterate articles
while($reader->name === 'article')
{
$doc = new DOMDocument('1.0', 'UTF-8');
$article = simplexml_import_dom($doc->importNode($reader->expand(), true));
processing($article);
$reader->next('article');
}
$reader->close();
$article is a SimpleXMLElement that can be processed further.
This way you save a lot of memory, because only single article nodes are loaded into memory at a time.
Additionally, if each processing() call takes a long time, you can turn it into a background process that runs separately from the main script, so that several processing() jobs can be started in parallel.
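A minimal sketch of handing one article off to a background worker might look like the following; process_item.php is a hypothetical worker script you would have to write, and the shell redirection assumes a Linux host:
// write the article to a temporary file and let a detached worker pick it up
$tmpFile = tempnam(sys_get_temp_dir(), 'item_');
file_put_contents($tmpFile, $article->asXML());
$cmd = 'php ' . escapeshellarg(__DIR__ . '/process_item.php')
     . ' ' . escapeshellarg($tmpFile) . ' > /dev/null 2>&1 &';
exec($cmd); // returns immediately, so the main loop keeps reading the XML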
Key hints:
Dispose of data as you process it. By "dispose" I mean overwrite it with blank data; by the way, unset() is slower than overwriting with null (see the sketch below).
Use functions or static methods, and avoid creating OOP instances as much as possible.
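Applied to the loop from the question, disposing of each chunk could look like this (just a sketch of the "overwrite with null" idea):
foreach ($results->xpath('//i:Item') as $data) {
    processing($data);
    $data = null; // overwrite with null instead of unset() once the item is processed
}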
One extra question: how long does it take to loop over your XML without doing [lots of things]?
function my_items_import($xml){
$results = new SimpleXMLElement($xml);
$results->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
//it will loop over 10K
foreach($results->xpath('//i:Item') as $data) {
$data->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
//my processing code here, it will call a other functions to do a lot things
//processing($data);
}
//unset($result);// no need
}
I'm mining data from a site, but it has a paginator and I need to get all the pages.
The link to the next page is written in a link tag with rel=next. If there are no more pages, the link tag is missing. I created a function called getAll which should call itself again and again as long as the link tag is there.
function getAll($url, &$links) {
$dom = file_get_html ($url); // create dom object from $url
$tmp = $dom->find('link[rel=next]', 0); // find link rel=next
if(is_object($tmp)){ // is there the link tag?
$link = $tmp->getAttribute('href'); // get url of next page - href attribute
$links[] = $link; // insert url into array
getAll($link, $links); // call self
}else{
return $links; // there are no more urls, return the array
}
}
// usage
$links = array();
getAll('http://www.zbozi.cz/vyrobek/apple-iphone-5/', $links);
print_r($links); // dump the links
But I have a problem: when I run the script, the message "No data received" appears in Chrome, and I get no error or anything else to go on. The function itself should work, because when I don't let it call itself again it returns one link, the one to the second page.
I think the problem is bad syntax or bad reference usage.
Could you please help me?
I don't know what file_get_html or find should do, but this should work:
<?php
function getAll($url, &$links) {
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents($url));
$linkElements = $dom->getElementsByTagName('link');
foreach ($linkElements as $link => $content) {
if ($content->hasAttribute('rel') && $content->getAttribute('rel') === 'next') {
$nextURL = $content->getAttribute('href');
$links[] = $nextURL;
getAll($nextURL, $links);
}
}
}
$links = array();
getAll('http://www.zbozi.cz/vyrobek/apple-iphone-5/', $links);
print_r($links);
Firstly, this would be easier to diagnose with an error message; without one, the cause could be anything from a DNS error to a corrupted space character inside your file. So if you haven't already, try adding this to the top of your script:
error_reporting(E_ALL);
ini_set("display_errors", "1");
It should reveal any error that might have taken place. But if that doesn't work I have two ideas:
You can't have a syntax error, because then the script wouldn't run at all. You said that removing the recursion yielded a result, so the script itself must work.
One possibility is that it's timing out. This depends on the server configuration. Try adding
echo $url, "<br>";
flush();
to the top of getAll. If you receive any of the links this is your problem.
This can be fixed by calling a function like set_time_limit(0).
Another possibility is a connection error. This could be caused by coincidence or a server configuration limit. I can't be certain but I know some hosting providers limit file_get_contents and curl requests. There is a possibility your scripts are limited to one external request per execution.
Besides that, there is nothing I can think of that could really go wrong with your script. You could remove the recursion and run the function in a while loop, but unless you expect a lot of pages there is no need for such a modification.
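If you do decide to drop the recursion anyway, an iterative version could look roughly like this (a sketch using the same simple_html_dom calls as your original function):
function getAllIterative($url, &$links) {
    while ($url) {
        $dom = file_get_html($url);              // load the current page
        $next = $dom->find('link[rel=next]', 0); // look for the rel=next tag
        if (!$next) {
            break;                               // no more pages
        }
        $url = $next->getAttribute('href');
        $links[] = $url;                         // remember the next page's URL
    }
}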
And finally, the library you are using for DOM parsing will either return a DOM element object or null. So you can change if(is_object($tmp)){ to if($tmp){. And since you are passing the result by reference, returning a value is pointless. You can safely remove the else statement.
I wish you good luck.
Using the PHP library simple_html_dom, I'm looping through a list of URLs, loading each one as a DOM. For each of them I try to find a string; if I find it I save the URL in an array, otherwise I move on to the next iteration, and I return the array of URLs at the end.
The script takes on the order of a few seconds per URL.
After a number of iterations the script gets stuck on the $dom->load($url) line inside file_get_html, throwing a segmentation fault; the number of iterations varies with different URL lists.
I tried to isolate the call to load($url) in a test script working only on the URL where the looping script gets stuck, but the test script finishes with no errors (I can't check the print_r of the DOM, though, because Firefox crashes if I try to view the page source).
I'm working on a LAMP server. Here is the code:
error_reporting(E_ALL);
ini_set("max_execution_time", "300");
ini_set("memory_limit", "512M");
ini_set('output_buffering', 0);
ini_set('implicit_flush', 1);
ob_end_flush();
ob_start();
set_time_limit(100);
$urlArray = array(); // the list of URLs to check goes here
$link = array();
foreach($urlArray as $url){
$found = false;
$dom = file_get_html($url);
foreach(( $dom->find('target')) as $caught){
array_push($link, $caught);
$found = true;
}
if($found){
return $link;
}else{
echo "not found";
}
}
Thanks for any help.
Well, it's a common problem; there is a bug report for it: http://sourceforge.net/p/simplehtmldom/bugs/103/
Add these lines before your if statement:
$dom->clear();
unset($dom);
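Applied to the loop from the question, the cleanup would sit at the end of each iteration, roughly like this (a sketch that keeps the question's variable names):
$link = array();
foreach ($urlArray as $url) {
    $found = false;
    $dom = file_get_html($url);
    foreach ($dom->find('target') as $caught) {
        $link[] = $caught;
        $found = true;
    }
    $dom->clear(); // drop simple_html_dom's internal circular references
    unset($dom);
    if ($found) {
        return $link;
    }
}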
In most cases you will not see any segfaults after that. But if you parse several thousand URLs (like me :)) you might hit it again. In that case my solution is to open the simple_html_dom.php file and comment out the lines between 146 and 149:
function clear()
{
/*
$this->dom = null;
$this->nodes = null;
$this->parent = null;
$this->children = null;
*/
}
Update: note that if you comment out these lines, your memory consumption will increase with each parsing iteration.