Using the PHP library simple_html_dom, I'm looping through a list of URLs, loading each as a DOM, and searching it for a string. If I find it, I save the URL in an array; otherwise I move on to the next iteration, returning the array of URLs at the end.
The script takes on the order of a few seconds per URL.
After some number of iterations the script gets stuck on the $dom->load($url) line inside file_get_html() and throws a segmentation fault; the number of iterations before it happens varies with the URL list.
I tried to isolate the load($url) call in a test script that works only on the URL where the looping script got stuck, but the test script finishes with no errors (I can't check the print_r of the DOM, though, because Firefox crashes when I try to view the page source).
I'm working on a LAMP server. Here is the code:
error_reporting(E_ALL);
ini_set("max_execution_time", "300");
ini_set("memory_limit", "512M");
ini_set('output_buffering', 0);
ini_set('implicit_flush', 1);
ob_end_flush();
ob_start();
set_time_limit(100);

$urlArray = array(); // the list of URLs to scan
$link = array();     // URLs where the target was found

foreach ($urlArray as $url) {
    $found = false;
    $dom = file_get_html($url);
    foreach ($dom->find('target') as $caught) { // 'target' stands in for the real selector
        array_push($link, $caught);
        $found = true;
    }
    if ($found) {
        return $link;
    } else {
        echo "not found";
    }
}
Thanks for any help.
Well, it's a common problem; here is the bug report: http://sourceforge.net/p/simplehtmldom/bugs/103/.
Add these lines before your if statement:
$dom->clear();
unset($dom);
Mostly you will not see any segfaults after that. But if you parse several thousand URLs (like me :)) you might hit it again. So my solution is: open the simple_html_dom.php file and comment out all lines between 146 and 149.
function clear()
{
/*
$this->dom = null;
$this->nodes = null;
$this->parent = null;
$this->children = null;
*/
}
UPDATE: also note that if you comment out these lines, your memory consumption will increase with each parsing iteration.
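For context, this is a minimal sketch of how the two cleanup lines fit into your loop (same placeholder selector and file_get_html() as in your code):
foreach ($urlArray as $url) {
    $found = false;
    $dom = file_get_html($url);
    foreach ($dom->find('target') as $caught) { // 'target' stands in for the real selector
        $link[] = $caught;
        $found = true;
    }
    $dom->clear(); // release the parser's internal node references
    unset($dom);   // then drop the object itself
    if ($found) {
        return $link;
    } else {
        echo "not found";
    }
}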
I'm using simple_html_dom to scrape website pages. The problem is that when I want to scrape many pages, say 500 URLs, it takes a long time (5-30 minutes) to complete, and that makes my server return error 500.
Some of the things I've tried are:
using set_time_limit()
setting ini_set('max_execution_time')
adding delay() timing
I've read many times on Stack Overflow that a cron job can be used to split long-running PHP scripts. My question is: how do I split a long-running PHP script? Can you give me the best way to split it, step by step, because I am a beginner?
About my program, I have two files:
file 1: an array of more than 500 URLs
file 2: the function that does the scraping
For example, this is file 1:
set_time_limit(0);
ini_set('max_execution_time', 3000); //3000 seconds = 30 minutes
$start = microtime(true); // start check render time page
error_reporting(E_ALL);
ini_set('display_errors', 1);
include ("simple_html_dom.php");
include ("scrape.php");
$link=array('url1','url2','url3'...);
array_chunk($link, 25); // tried to split into chunks of 25, but this does nothing: array_chunk() returns a new array and the result is never assigned or used
$hasilScrape = array();
for ($i = 1; $i <= count($link); $i++) {
    // this is the process: I call the get_data function to scrape each URL
    $hasilScrape[$i - 1] = json_decode(get_data($link[$i - 1]), true);
}
$filename = 'File_Hasil_Scrape';
$fp = fopen($filename . ".csv", 'w');
foreach ($hasilScrape as $fields) {
    fputcsv($fp, $fields);
}
fclose($fp);
I've been thinking: can I split the link array into chunks of 25 and then pause, or temporarily stop the process and run it again (NOT a delay, because I have tried that and it was useless)? Can you tell me how, please? Thank you so much.
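To make the idea concrete, here is a rough sketch of what I have in mind (untested, just my assumption of how it could look): process one chunk of 25 URLs per request, keep the chunk index in a GET parameter, and redirect to the next chunk. It assumes get_data() from scrape.php works as above and appends to the CSV so earlier runs are kept.
set_time_limit(0);
include ("simple_html_dom.php");
include ("scrape.php");

$link = array('url1', 'url2', 'url3' /* ... */);
$chunks = array_chunk($link, 25);                       // 25 URLs per run
$step = isset($_GET['step']) ? (int)$_GET['step'] : 0;  // which chunk this run handles

if (isset($chunks[$step])) {
    $fp = fopen('File_Hasil_Scrape.csv', 'a');          // append so earlier runs are kept
    foreach ($chunks[$step] as $url) {
        $row = json_decode(get_data($url), true);       // get_data() comes from scrape.php
        fputcsv($fp, $row);
    }
    fclose($fp);
    header('Location: ' . $_SERVER['PHP_SELF'] . '?step=' . ($step + 1)); // start the next chunk
    exit;
}
echo "All chunks done.";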
Really stumped on this one and I feel like an idiot! I have a small PHP cron job that does its thing every few minutes. The client has requested that the app email them a daily overview of issues raised.
To do this, I decided to dump an array to a file for storage purposes. I decided against a SQL DB to keep this standalone and lightweight.
What I want to do is open said file, add to a set of numbers and save again.
I have tried this with SimpleXML and serialize/file_put_contents.
The issue is that what is written to the file does not correspond to the array echoed on the line before. Say I'm adding 2 to the total; the physical file has added 4.
The following is ugly and just a snippet:
echo "count = ".count($result)."<br/>";
$arr = loadLog();
dumpArray($arr, "Pre Load");
$arr0['count'] = $arr['count']+(count($result));
echo "test ".$arr0['count'];
dumpArray($arr0, "Pre Save");
saveLog($arr0);
sleep(3);
$arr1 = loadLog();
dumpArray($arr1, "Post Save");
function saveLog($arr){
    $content = serialize($arr);
    var_dump($content);
    file_put_contents(STATUS_SOURCE, $content);
}
function loadLog(){
    $content = unserialize(file_get_contents(STATUS_SOURCE));
    return $content;
}
function dumpArray($array, $title = false){
    echo "<p><h1>".$title."</h1><pre>";
    var_dump($array);
    echo "</pre></p>";
}
Output View here
Output File: a:1:{s:5:"count";i:96;}
I'd really appreciate any heads-up. I've had someone else look at it who also scratched his head.
Check that .htaccess isn't sending 404 errors to the same script. Chrome was looking for favicon.ico, which did not exist; this caused the script to execute a second time.
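If changing the .htaccess rules isn't convenient, a small guard at the top of the script achieves the same effect. This is only a sketch of the idea:
// Bail out early when the request is really for favicon.ico (or any rewritten 404),
// so the cron logic cannot run a second time for that request.
$path = isset($_SERVER['REQUEST_URI']) ? parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH) : '';
if (basename($path) === 'favicon.ico') {
    header('HTTP/1.0 404 Not Found'); // answer the favicon probe without touching the log file
    exit;
}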
I'm mining data from a site, but it has a paginator and I need to get all the pages.
The link to the next page is written in a link tag with rel=next. If there are no more pages, the link tag is missing. I created a function called getAll which should call itself again and again as long as the link tag is there.
function getAll($url, &$links) {
    $dom = file_get_html($url); // create dom object from $url
    $tmp = $dom->find('link[rel=next]', 0); // find link rel=next
    if (is_object($tmp)) { // is there the link tag?
        $link = $tmp->getAttribute('href'); // get url of next page - href attribute
        $links[] = $link; // insert url into array
        getAll($link, $links); // call self
    } else {
        return $links; // there are no more urls, return the array
    }
}
// usage
$links = array();
getAll('http://www.zbozi.cz/vyrobek/apple-iphone-5/', $links);
print_r($links); // dump the links
But I have a problem: when I run the script, the message "No data received" appears in Chrome. I don't get any error or anything. The function should work, because when it doesn't call itself again it returns one link - to the second page.
I think the problem is bad syntax or incorrect use of the reference.
Could you please help me?
I don't know what file_get_html or find should do, but this should work:
<?php
function getAll($url, &$links) {
    $dom = new DOMDocument();
    $dom->loadHTML(file_get_contents($url));
    $linkElements = $dom->getElementsByTagName('link');
    foreach ($linkElements as $linkElement) {
        if ($linkElement->hasAttribute('rel') && $linkElement->getAttribute('rel') === 'next') {
            $nextURL = $linkElement->getAttribute('href');
            $links[] = $nextURL;
            getAll($nextURL, $links);
        }
    }
}
$links = array();
getAll('http://www.zbozi.cz/vyrobek/apple-iphone-5/', $links);
print_r($links);
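One caveat I should add: real-world pages are rarely valid markup, so loadHTML() tends to emit parser warnings. They can be collected instead of printed (this is a general DOMDocument idiom, not something specific to your page):
libxml_use_internal_errors(true);        // collect parser warnings instead of printing them
$dom->loadHTML(file_get_contents($url));
libxml_clear_errors();                   // discard the collected warnings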
Firstly, this would be easier to diagnose with an error message; without one, the cause could be anything from a DNS error to a corrupted space character inside your file. So if you haven't already, try adding this to the top of your script:
error_reporting(E_ALL);
ini_set("display_errors", "1");
It should reveal any error that might have taken place. But if that doesn't work I have two ideas:
You can't have a syntax error, because then the script wouldn't run at all. You said that removing the recursion yielded a result, so the script must work.
One possibility is that it's timing out. This depends on the server configuration. Try adding
echo $url, "<br>";
flush();
to the top of getAll. If some of the links get printed before the output stops, a timeout is your problem.
This can be fixed by calling a function like set_time_limit(0).
Another possibility is a connection error. This could be caused by coincidence or a server configuration limit. I can't be certain but I know some hosting providers limit file_get_contents and curl requests. There is a possibility your scripts are limited to one external request per execution.
Besides that, there is nothing I can think of that could really go wrong with your script. You could remove the recursion and run the function in a while loop, but unless you expect a lot of pages there is no need for such a modification.
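If you ever want to try that, an iterative version might look roughly like this (a sketch of mine, reusing the question's file_get_html() and find() calls):
function getAllIterative($url) {
    $links = array();
    while (true) {
        $dom = file_get_html($url);              // load the current page
        $next = $dom->find('link[rel=next]', 0); // look for <link rel="next">
        if (!$next) {
            $dom->clear();
            break;                               // no more pages
        }
        $url = $next->getAttribute('href');      // move on to the next page
        $links[] = $url;
        $dom->clear();                           // free simple_html_dom memory
        unset($dom);
    }
    return $links;
}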
And finally, the library you are using for DOM parsing will either return a DOM element object or null. So you can change if(is_object($tmp)){ to if($tmp){. And since you are passing the result by reference, returning a value is pointless. You can safely remove the else statement.
I wish you good luck.
The following function receives a string parameter representing a URL and then loads the URL into a simple_html_dom object. If the loading fails, it attempts to load the URL again.
public function getSimpleHtmlDomLoaded($url)
{
    $ret = false;
    $count = 1;
    $max_attemps = 10;
    while ($ret === false) {
        $html = new simple_html_dom();
        $ret = $html->load_file($url);
        if ($ret === false) {
            echo "Error loading url: $url\n";
            sleep(5);
            $count++;
            $html->clear();
            unset($html);
            if ($count > $max_attemps)
                return false;
        }
    }
    return $html;
}
However, if loading a URL fails once, it keeps failing for that URL, and after the maximum attempts are used up it also keeps failing in the subsequent calls to the function with the rest of the URLs it has to process.
It would make sense for it to keep failing if the URLs were temporarily offline, but they are not (I checked while the script was running).
Any ideas why this is not working properly?
I would also like to point out that when it starts failing to load the URLs, it only gives one warning (instead of multiple ones), with the following message:
PHP Warning: file_get_contents(http://www.foo.com/resource): failed
to open stream: HTTP request failed! in simple_html_dom.php on line
1081
Which is prompted by this line of code:
$ret = $html->load_file($url);
I have tested your code and it works perfectly for me; every time I call that function it returns a valid result on the first attempt.
So even if you load pages from the same domain, there can be some protection on the page or the server.
For example, the page can look for certain cookies, or the server can look at your user agent, and if it sees you as a bot it will not serve the correct content.
I had similar problems while parsing some websites.
The answer for me was to figure out what the page/server expects and make my code simulate that: everything from faking the user agent to generating cookies and such.
By the way, have you tried creating a simple PHP script just to test that the 'simple html dom' parser runs on your server with no errors? That is the first thing I would check.
In the end I must add that in one case I failed in numerous attempts to parse a page and could not win the masking game. So I made a script that loads the page in the Linux command-line text browser lynx, saved the whole page locally, and then parsed that local file, which worked perfectly.
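For what it's worth, here is a minimal sketch of the user-agent idea (my own illustration, not the exact code I used back then; it assumes simple_html_dom's str_get_html() helper is available):
// Fetch the page with a browser-like User-Agent via a stream context,
// then hand the raw HTML to simple_html_dom for parsing.
$context = stream_context_create(array(
    'http' => array(
        'header'  => "User-Agent: Mozilla/5.0 (X11; Linux x86_64)\r\n",
        'timeout' => 30,
    ),
));
$raw = @file_get_contents($url, false, $context);
if ($raw !== false) {
    $html = str_get_html($raw); // str_get_html() builds a simple_html_dom object from a string
    // ... run the usual find() calls here ...
    $html->clear();
    unset($html);
}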
Maybe it is a problem with the load_file() function itself.
The problem was that error_get_last() returns all previous errors too; I don't know, maybe this depends on the PHP version.
I solved the problem by changing the function so it checks whether the error changed, not whether it is null
(alternatively, use the non-object function file_get_html()):
function load_file()
{
    $preerror = error_get_last();
    $args = func_get_args();
    $this->load(call_user_func_array('file_get_contents', $args), true);
    // Throw an error if we can't properly load the dom.
    if (($error = error_get_last()) !== $preerror) {
        $this->clear();
        return false;
    }
}
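As a rough sketch of the file_get_html() alternative mentioned above (just an illustration, not tested against your URLs): the helper returns false when the request fails, so the retry loop can check that directly:
$html = @file_get_html($url); // false on failure, a simple_html_dom object otherwise
if ($html === false) {
    // retry, sleep, or give up, exactly as in the question's loop
} else {
    // ... use $html->find(...) here ...
    $html->clear();
    unset($html);
}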
I'm trying to create some XML, basically by reading RSS feeds and adding some custom tags to them. I've made a function that contains my code, and now I want to call the function several times with different RSS URLs. Each call will produce a different .xml file.
I use DOMDocument to load and parse the RSS, and simple_html_dom to load and parse the link of each RSS item to get some content from the HTML.
Here is a simplified example of my code:
<?php
include('simple_html_dom.php');
load_custom_rss('http://www.somesite.com/rssfeed/articles', 'articles.xml');
load_custom_rss('http://www.somesite.com/rssfeed/jobs', 'jobs.xml');
load_custom_rss('http://www.somesite.com/rssfeed/press', 'press.xml');
//up to 20 similar function calls here...
function load_custom_rss($link, $filename){
    $doc = new DOMDocument();
    $doc->load($link);
    $newDoc = new DOMDocument('1.0', 'UTF-8');
    $rss = $newDoc->createElement('rss');
    $channel = $newDoc->createElement('channel');
    $newDoc->appendChild($rss);
    $rss->appendChild($channel);
    foreach ($doc->getElementsByTagName('item') as $node) {
        // here is some code to read items from the rss xml / write them to the new xml document.
        // Code missing for simplicity
        // Next lines are used to get some elements from the html of the item's link
        $html = new simple_html_dom();
        $html->load_file($node->getElementsByTagName('link')->item(0)->nodeValue);
        $ret = $html->find('#imgId');
    }
    $newDoc->formatOutput = true;
    $fh = fopen($filename, 'w') or die("can't open file");
    fwrite($fh, $newDoc->saveXML());
    fclose($fh);
    unset($doc);
    // unset ALL variables and objects created in this function...
    // ........
} // function end
?>
My problem is that each call of the function consumes quite an amount of memory, so after the 3rd or 4th call Apache throws a fatal error because the script consumes more memory than the memory_limit, even though I unset ALL variables and objects created in the function. If I reduce the function calls to 1 or 2, everything works fine.
Is there any way this could work? I was thinking of making each function call wait for the previous one to finish before starting, but how could this be done?
Hope somebody can help.
Thanks in advance.
The thing you want is normal behaviour in PHP: a script is worked through from top to bottom, so each function call already has to wait until the previous one has finished. I think your problem is rather the memory_limit in php.ini. Open the file and search for the directive memory_limit (http://www.php.net/manual/en/ini.core.php#ini.memory-limit) and increase it to fit your needs.
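For example (the value is only a placeholder), you can either edit that directive in php.ini or raise the limit from the script itself:
// Raise the limit at runtime; choose a value that suits your server.
ini_set('memory_limit', '256M');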
You're unsetting $doc but not $newDoc; try adding
unset($newDoc);
At the end of that function.
As others have said, the problem is that you're leaking memory or exceeding your memory limit; it has nothing to do with waiting until previous code has finished.
Alternatively, you could put each call to load_custom_rss() into a separate request, so the script processes one and then reloads itself, i.e.
$i = isset($_GET['i']) ? (int)$_GET['i'] : 0; // default to 0 on the first request to avoid an undefined-index notice
if ($i==0)
load_custom_rss('http://www.somesite.com/rssfeed/articles', 'articles.xml');
elseif ($i==1)
load_custom_rss('http://www.somesite.com/rssfeed/jobs', 'jobs.xml');
... etc ...
else
die("I'm done");
header("Location: myself.php?i=".($i+1));
Your approach to reloading the script would likely be different of course, depending on whether the page needs to render any HTML first.