I'm mining data from a site that uses a paginator, but I need to get all pages.
The link to the next page is written in a link tag with rel=next. If there are no more pages, the link tag is missing. I created a function called getAll which should call itself again and again as long as the link tag is present.
function getAll($url, &$links) {
    $dom = file_get_html($url); // create dom object from $url
    $tmp = $dom->find('link[rel=next]', 0); // find link rel=next
    if (is_object($tmp)) { // is there the link tag?
        $link = $tmp->getAttribute('href'); // get url of next page - href attribute
        $links[] = $link; // insert url into array
        getAll($link, $links); // call self
    } else {
        return $links; // there are no more urls, return the array
    }
}
// usage
$links = array();
getAll('http://www.zbozi.cz/vyrobek/apple-iphone-5/', $links);
print_r($links); // dump the links
But I have a problem: when I run the script, the message "No data received" appears in Chrome. I don't get any error message or other hint. The function itself should work, because when I don't let it call itself it returns one link - to the second page.
I think the problem is bad syntax or bad use of the reference parameter.
Could you please help me?
I don't know what file_get_html or find should do, but this should work:
<?php
function getAll($url, &$links) {
    $dom = new DOMDocument();
    $dom->loadHTML(file_get_contents($url));
    $linkElements = $dom->getElementsByTagName('link');
    foreach ($linkElements as $content) {
        if ($content->hasAttribute('rel') && $content->getAttribute('rel') === 'next') {
            $nextURL = $content->getAttribute('href'); // url of the next page
            $links[] = $nextURL;                       // remember it
            getAll($nextURL, $links);                  // recurse into the next page
        }
    }
}
$links = array();
getAll('http://www.zbozi.cz/vyrobek/apple-iphone-5/', $links);
print_r($links);
Firstly, let's make this easier to debug. Without an error message this could be anything from a DNS error to a corrupted space character inside your file. So if you haven't already, try adding this to the top of your script:
error_reporting(E_ALL);
ini_set("display_errors", "1");
It should reveal any error that might have taken place. But if that doesn't work I have two ideas:
You can't have a syntax error because then the script wouldn't even run. You said that removing the recursion yielded a result so the script must work.
One possibility is that it's timing out. This depends on the server configuration. Try adding
echo $url, "<br>";
flush();
to the top of getAll. If you receive any of the links this is your problem.
This can be fixed by calling a function like set_time_limit(0).
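For example, a minimal sketch of lifting the limit at the top of the script:
set_time_limit(0); // 0 = no execution time limit for this run
// or raise it to a fixed value instead, e.g.:
// ini_set('max_execution_time', '300');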
Another possibility is a connection error. This could be caused by coincidence or a server configuration limit. I can't be certain but I know some hosting providers limit file_get_contents and curl requests. There is a possibility your scripts are limited to one external request per execution.
Besides that there is nothing I can think of that could really go wrong with your script. You could remove the recursion and run the function in a while loop, as sketched below. But unless you expect a lot of pages there is no need for such a modification.
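If you do want to try that, a rough iterative version could look like this (untested, assuming the same file_get_html() helper from simple_html_dom):
function getAll($url, &$links) {
    while ($url) {
        $dom = file_get_html($url);              // load the current page
        $next = $dom->find('link[rel=next]', 0); // look for the next-page link
        $url = $next ? $next->getAttribute('href') : null;
        if ($url) {
            $links[] = $url;                     // remember it and move on to that page
        }
    }
}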
And finally, the library you are using for DOM parsing will either return a DOM element object or null. So you can change if(is_object($tmp)){ to if($tmp){. And since you are passing the result by reference, returning a value is pointless. You can safely remove the else statement.
I wish you good luck.
Unfortunately I can't check it right now, because the XML (which will be on another server) is offline. The URL to the XML file will look like this: http://url.com:123/category?foo=bar. As you can see, it comes with no .xml file extension. I forgot to insert a file check to avoid error messages that print out the URL of the XML file.
simplexml_load_file works fine with that URL, but I'm not sure about file_exists!
Would this work?:
if (file_exists('http://url.com:123/category?foo=bar')) {
    $xml = simplexml_load_file('http://url.com:123/category?foo=bar');
    //stuff happens here
} else {
    echo 'Error message';
}
I'm not sure since file_exists doesn't work with URLs.
Thank you!
As you suspect, file_exists() doesn't work with URLs, but fopen() and fclose() do:
if (fclose(fopen("http://url.com:123/category?foo=bar", "r"))) {
    $xml = simplexml_load_file('http://url.com:123/category?foo=bar');
    //stuff happens here
} else {
    echo 'Error message';
}
That is not really useful if you are only fetching the data in order to parse it, especially if the URL you call is a program/script itself: it just means the remote script is executed twice.
I suggest you fetch the data with file_get_contents(), handle/catch the errors and parse the fetched data.
Just blocking the errors:
if ($xml = @file_get_contents($url)) {
    $element = new SimpleXMLElement($xml);
    ...
}
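If you prefer to actually handle the failure instead of just silencing it, a rough sketch along those lines (the error texts are placeholders):
libxml_use_internal_errors(true);  // keep libxml parse warnings out of the output
$xml = @file_get_contents($url);   // fetch the raw XML, suppressing the warning on failure
if ($xml === false) {
    echo 'Error message';          // request failed (DNS, HTTP error, timeout, ...)
} else {
    try {
        $element = new SimpleXMLElement($xml);
        // stuff happens here
    } catch (Exception $e) {
        echo 'Invalid XML: ' . $e->getMessage();
    }
}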
Using the PHP library simple_html_dom, I'm looping through a list of URLs, loading each one as a DOM, and for each of these I try to find a string. If I find it I save the URL in an array, otherwise I go on to the next iteration, returning the array of URLs at the end.
The script takes on the order of a few seconds per URL.
After some number of loops the script gets stuck on the $dom->load($url) line inside file_get_html, throwing a segmentation fault; the number of loops varies with different URL lists.
I tried to isolate the call to load($url) in a test script working only on the URL where the looping script gets stuck, but the test script ends with no errors (though I can't check the print_r of the DOM, because my Firefox crashes if I try to view the page source).
I'm working on a LAMP server. Here is the code:
error_reporting(E_ALL);
ini_set("max_execution_time", "300");
ini_set("memory_limit", "512M");
ini_set('output_buffering', 0);
ini_set('implicit_flush', 1);
ob_end_flush();
ob_start();
set_time_limit(100);

$urlArray = array(); // filled with the urls to check
$link = array();

foreach ($urlArray as $url) {
    $found = false;
    $dom = file_get_html($url);
    foreach ($dom->find('target') as $caught) { // 'target' = the selector I'm searching for
        array_push($link, $caught);
        $found = true;
    }
    if ($found) {
        return $link;
    } else {
        echo "not found";
    }
}
Thanks for any help.
Well, it's a common problem; here is the bug report: http://sourceforge.net/p/simplehtmldom/bugs/103/.
Add these lines before your if statement:
$dom->clear();
unset($dom);
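In the context of your loop that would look roughly like this (a sketch, reusing your variable names):
foreach ($urlArray as $url) {
    $found = false;
    $dom = file_get_html($url);
    foreach ($dom->find('target') as $caught) {
        array_push($link, $caught);
        $found = true;
    }
    // free the parser before the next iteration
    $dom->clear();
    unset($dom);
    if ($found) {
        return $link;
    } else {
        echo "not found";
    }
}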
Mostly you will not see any segfaults after that. But if you parse several thousand URLs (like me :)) then you might meet it again. So my solution is: open the simple_html_dom.php file and comment out all lines between 146 and 149.
function clear()
{
    /*
    $this->dom = null;
    $this->nodes = null;
    $this->parent = null;
    $this->children = null;
    */
}
UPDATE: note that if you comment out these lines, your memory consumption will grow with each parsing iteration.
The following function receives a string parameter representing a URL and then loads the URL into a simple_html_dom object. If the loading fails, it attempts to load the URL again.
public function getSimpleHtmlDomLoaded($url)
{
    $ret = false;
    $count = 1;
    $max_attemps = 10;

    while ($ret === false) {
        $html = new simple_html_dom();
        $ret = $html->load_file($url);

        if ($ret === false) {
            echo "Error loading url: $url\n";
            sleep(5);
            $count++;
            $html->clear();
            unset($html);

            if ($count > $max_attemps)
                return false;
        }
    }

    return $html;
}
However, if the URL loading fails once, it keeps failing for the current URL, and after the max attempts are exhausted it also keeps failing in the following calls to the function with the rest of the URLs it has to process.
It would make sense for it to keep failing if the URLs were temporarily offline, but they are not (I've checked while the script was running).
Any ideas why this is not working properly?
I would also like to point out that when it starts failing to load the URLs, it only gives a single warning (instead of multiple ones), with the following message:
PHP Warning: file_get_contents(http://www.foo.com/resource): failed to open stream: HTTP request failed! in simple_html_dom.php on line 1081
Which is prompted by this line of code:
$ret = $html->load_file($url);
I have tested your code and it works perfectly for me; every time I call that function it returns a valid result on the first attempt.
So even if you load the pages from the same domain, there can be some protection on the page or on the server.
For example, the page can look for some cookies, or the server can look at your user agent, and if it sees you as a bot it will not serve the correct content.
I had similar problems while parsing some websites.
The answer for me was to see what the page/server expects and make my code simulate that: everything from faking the user agent to generating cookies and such.
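As a rough illustration of the user-agent part, something like this (the header value is only an example and $url is a placeholder):
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/40.0\r\n",
    ),
));
$raw = file_get_contents($url, false, $context); // fetch with the faked user agent
if ($raw !== false) {
    $html = str_get_html($raw);                  // hand the string to simple_html_dom
}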
By the way, have you tried creating a simple PHP script just to test that the 'simple html dom' parser can run on your server with no errors? That is the first thing I would check.
In the end I must add that in one case, after numerous failed attempts at parsing one page, I could not win the masking game. In the end I made a script that loads that page in the Linux command-line text browser lynx, saves the whole page locally, and then I parsed that local file, which worked perfectly.
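Roughly, that last trick can look like this (a sketch; it assumes lynx is installed, and the URL and local path are placeholders):
$localFile = '/tmp/page.html';
// let lynx fetch the page and write its raw HTML source to a local file
shell_exec('lynx -source ' . escapeshellarg($url) . ' > ' . escapeshellarg($localFile));
// then parse the local copy as usual
$html = file_get_html($localFile);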
Maybe it is a problem in the load_file() function itself.
The problem was that the function error_get_last() also returns all previous errors; I don't know, maybe this depends on the PHP version?
I solved the problem by changing it to the following (check whether the error changed, not whether it is null)
(or use the non-object function file_get_html()):
function load_file()
{
    $preerror = error_get_last();
    $args = func_get_args();
    $this->load(call_user_func_array('file_get_contents', $args), true);
    // Throw an error if we can't properly load the dom.
    if (($error = error_get_last()) !== $preerror) {
        $this->clear();
        return false;
    }
}
I am trying to get the next page of the topic but it gives an error. Is there any way to avoid that error so I can scrape the next page within that age topic? (The next page goes by 20, after that 40, and so forth.) The error is given below. I'm sure someone is going to ask me to post my code, but I'm not sure how much or which code I should post.
http://blah.com/quotes/topic/age
20 1
1http://blah.com/quotes/topic/age/20
Fatal error: Call to a member function find() on a non-object in /Users/blah/Sites/simple_html_dom.php on line 879
UPDATE***
These are the lines between 870-885:
function save($filepath='') {
    $ret = $this->root->innertext();
    if ($filepath!=='') file_put_contents($filepath, $ret, LOCK_EX);
    return $ret;
}

// find dom node by css selector
// Paperg - allow us to specify that we want case insensitive testing of the value of the selector.
function find($selector, $idx=null, $lowercase=false) {
    return $this->root->find($selector, $idx, $lowercase);
}

// clean up memory due to php5 circular references memory leak...
function clear() {
    foreach ($this->nodes as $n) {$n->clear(); $n = null;}
The first thing that you should check is the file where $html->find() is called.
Check if you included simple_html_dom.php (with an include) at the beginning of the file:
- make sure it is there
- make sure the path is correct
Check if you have this line: $html = file_get_html('http://www.google.com/');
- of course your line will have the web address you are trying to get
I think the problem is that you might not have included simple_html_dom, or that you are missing the file_get_html() call.
Check those. The problem is not in simple_html_dom.php, so just look at the file you created.
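A minimal test script along those lines could look like this (a sketch; the URL is taken from your question and the selector is just a placeholder):
include('simple_html_dom.php');

$html = file_get_html('http://blah.com/quotes/topic/age/20');
if (!$html) {
    // file_get_html() returns false when the page cannot be fetched;
    // calling find() on that gives exactly the "non-object" fatal error you are seeing
    die('Could not load the page');
}
foreach ($html->find('a') as $a) {
    echo $a->href, "\n";
}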
Good luck!
UPDATE
While you're at it, please provide the source in your file, or at least the line where you call find().
I'm trying to create some XML, basically by reading RSS feeds and adding some custom tags to them. I've made a function that contains my code, and now I want to call the function several times with different RSS URLs. Each call will produce a different .xml file.
I use DOMDocument to load and parse the RSS, and simple_html_dom to load and parse the link of each RSS item to get some content from the HTML.
Here is a simplified example of my code:
<?php
include('simple_html_dom.php');

load_custom_rss('http://www.somesite.com/rssfeed/articles', 'articles.xml');
load_custom_rss('http://www.somesite.com/rssfeed/jobs', 'jobs.xml');
load_custom_rss('http://www.somesite.com/rssfeed/press', 'press.xml');
//up to 20 similar function calls here...

function load_custom_rss($link, $filename){
    $doc = new DOMDocument();
    $doc->load($link);

    $newDoc = new DOMDocument('1.0', 'UTF-8');
    $rss = $newDoc->createElement('rss');
    $channel = $newDoc->createElement('channel');
    $newDoc->appendChild($rss);
    $rss->appendChild($channel);

    foreach ($doc->getElementsByTagName('item') as $node) {
        //here is some code to read items from rss xml / write them to new xml document.
        //Code missing for simplicity

        //Next lines used to get some elements from the html of the item's link
        $html = new simple_html_dom();
        $html->load_file($node->getElementsByTagName('link')->item(0)->nodeValue);
        $ret = $html->find('#imgId');
    }

    $newDoc->formatOutput = true;
    $fh = fopen($filename, 'w') or die("can't open file");
    fwrite($fh, $newDoc->saveXML());
    fclose($fh);

    unset($doc);
    //unset ALL variables and objects created in this function...
    //........
}//function end
?>
My problem is that each call of the function consumes quite an amount of memory, so after the 3rd or 4th call Apache throws a fatal error because the script exceeds the memory_limit, even though I unset ALL variables and objects created in the function. If I reduce the function calls to 1 or 2, everything works fine.
Is there any way this could work? I was thinking about making each function call wait for the previous one to finish before starting, but how could this be done?
Hope somebody can help.
Thanks in advance.
The thing you want is normal behaviour in PHP: a script is worked through from top to bottom, so each function call already has to wait until the previous one has finished. I think your problem is rather the memory limit in php.ini. Open the file and search for the directive memory_limit (http://www.php.net/manual/en/ini.core.php#ini.memory-limit) and increase it to fit your needs.
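For example (the value is only an illustration, pick whatever fits your server), in php.ini:
memory_limit = 512M
or, if you prefer to raise it at runtime, at the top of the script:
ini_set('memory_limit', '512M');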
You're unsetting $doc but not $newDoc, try adding
unset($newDoc);
at the end of that function.
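It may also help (an assumption, based on the clear()/unset() advice earlier on this page) to release the simple_html_dom objects created inside the foreach loop, since they are never freed; roughly:
foreach ($doc->getElementsByTagName('item') as $node) {
    $html = new simple_html_dom();
    $html->load_file($node->getElementsByTagName('link')->item(0)->nodeValue);
    $ret = $html->find('#imgId');
    // ... use $ret ...
    $html->clear(); // break simple_html_dom's circular references
    unset($html);
}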
As others have said, the problem is that you're leaking memory or exceeding your memory limit; this has nothing to do with waiting until previous code has finished.
Alternatively you could put each call to load_custom_rss() into separate requests, so the script calls one and then reloads itself, i.e.
$i = $_GET['i'];

if ($i == 0)
    load_custom_rss('http://www.somesite.com/rssfeed/articles', 'articles.xml');
elseif ($i == 1)
    load_custom_rss('http://www.somesite.com/rssfeed/jobs', 'jobs.xml');
... etc ...
else
    die("I'm done");

header("Location: myself.php?i=".($i+1));
Your approach to reloading the script would likely be different of course, depending on whether the page needs to render any HTML first.