Attempting to load a URL again when it fails - PHP

The following function receives a string parameter representing a URL and loads it into a simple_html_dom object. If loading fails, it attempts to load the URL again.
public function getSimpleHtmlDomLoaded($url)
{
    $ret = false;
    $count = 1;
    $max_attemps = 10;

    while ($ret === false) {
        $html = new simple_html_dom();
        $ret = $html->load_file($url);

        if ($ret === false) {
            echo "Error loading url: $url\n";
            sleep(5);
            $count++;
            $html->clear();
            unset($html);
            if ($count > $max_attemps) {
                return false;
            }
        }
    }
    return $html;
}
However, if loading fails once for a URL, it keeps failing for that URL, and after the maximum attempts are exhausted it also keeps failing on every subsequent call to the function with the remaining URLs it has to process.
It would make sense for it to keep failing if the URLs were temporarily offline, but they are not (I've checked while the script was running).
Any ideas why this is not working properly?
I would also like to point out that when it starts failing to load the URLs, it only gives a single warning (instead of multiple ones), with the following message:
PHP Warning: file_get_contents(http://www.foo.com/resource): failed
to open stream: HTTP request failed! in simple_html_dom.php on line
1081
Which is prompted by this line of code:
$ret = $html->load_file($url);

I have tested your code and it works perfectly for me; every time I call that function it returns a valid result on the first attempt.
So even if you load pages from the same domain, there can be some protection on the page or server.
For example, the page can look for certain cookies, or the server can inspect your user agent, and if it identifies you as a bot it will not serve the correct content.
I had similar problems while parsing some websites.
The answer for me was to see what the page/server was expecting and make my code simulate that: everything from faking the user agent to generating cookies and such.
By the way, have you tried creating a simple PHP script just to test that the 'simple html dom' parser runs on your server with no errors? That is the first thing I would check.
In the end I must add that in one case, after numerous failed attempts at parsing one page, I could not win the masking game. In the end I wrote a script that loads the page in the Linux command-line text browser lynx, saves the whole page locally, and then parses that local file, which worked perfectly.
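For instance, here is a minimal sketch of sending a browser-like User-Agent (and a cookie, if needed) through a stream context and then handing the raw HTML to simple_html_dom via str_get_html(); the header values are placeholders you would adapt to whatever the site expects:
// Build a stream context that carries the headers the server wants to see.
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0 (X11; Linux x86_64)\r\n" .
                    "Cookie: name=value\r\n",   // placeholder cookie
    ),
));

// Fetch the page with those headers, then build the simple_html_dom object.
$raw = @file_get_contents($url, false, $context);
$html = ($raw !== false) ? str_get_html($raw) : false;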

Maybe it is a problem with the load_file() function itself.
The problem was that error_get_last() also returns errors left over from previous calls; I don't know, maybe it depends on the PHP version.
I solved the problem by changing load_file() to check whether the error changed, not whether it is null
(alternatively, use the non-object function file_get_html()):
function load_file()
{
    $preerror = error_get_last();
    $args = func_get_args();
    $this->load(call_user_func_array('file_get_contents', $args), true);

    // Throw an error if we can't properly load the dom.
    if (($error = error_get_last()) !== $preerror) {
        $this->clear();
        return false;
    }
}
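Alternatively, the retry loop from the question can avoid error_get_last() entirely by fetching the raw HTML with file_get_contents(), which unambiguously returns false on failure, and only then building the DOM. A rough sketch, assuming simple_html_dom's str_get_html() helper is available:
public function getSimpleHtmlDomLoaded($url)
{
    $max_attempts = 10;
    for ($attempt = 1; $attempt <= $max_attempts; $attempt++) {
        // file_get_contents() returns false on failure, so this check is reliable.
        $raw = @file_get_contents($url);
        if ($raw !== false) {
            return str_get_html($raw);   // build the simple_html_dom object from the string
        }
        echo "Error loading url: $url (attempt $attempt)\n";
        sleep(5);
    }
    return false;   // give up after $max_attempts failures
}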

Related

How do I create an error handler in php that redirects the user if a specific error occurs?

I am running some generic php/mysql code that is consistently working fine, then I run the html dom parser (http://simplehtmldom.sourceforge.net/), then I WANT TO redirect to an error page IF AND ONLY IF a specific error occurs with the dom parser, but if not, continue with some additional php/mysql script that is also currently working fine. Here is what my code looks like:
// First do some php/mysql operations here - these are all working fine
$html = file_get_html($website);
foreach ($html->find('a[href!=#]') as $element) {
    // Do several things here
}
// Then finish up with some additional php/mysql operations here - these are all working fine
Most of the time it works great, but about 10% of the time, depending on the website that is assigned to the $website variable, I get a warning and then a fatal error. For example, when I put “https://www.liftedlandscape.com/” into the $website variable:
Warning: file_get_contents(https://www.liftedlandscape.com/): failed
to open stream: HTTP request failed! HTTP/1.0 400 Bad Request in
/home/test/test.test.com/simple_html_dom.php on line 75
Fatal error: Call to a member function find() on boolean in
/home/test/test.com/login/create_application.php on line 148
I am OK with the error happening every once in a while; I just want to create an error handler that responds to the error cases appropriately. I want to make sure that the php/mysql code before the dom parser stuff always runs. Then run the dom parser, then run the rest of the script if the dom parser functions are working right, but if there are the errors above with the dom parser, redirect users to an error page.
I have tried numerous iterations with no success. Here is my latest attempt:
function errorHandler($errno, $errstr) {
    echo "Error: [$errno] $errstr";
    header('Location: https://test.com/login/display/error_message.php');
}
// First do some other php/mysql operations here
$html = file_get_html($website);
foreach ($html->find('a[href!=#]') as $element) {
    // Do several things here
}
// Then finish up with some additional php/mysql operations here
I swear this actually worked once, but after that it failed to work. It is only returning the same errors listed above without redirecting the user. Can someone please help?
Don't send any output via "echo" or similar because you can't redirect AFTER you've already started sending out the page.
file_get_contents will return false when it can't complete the request, so make sure you check for that before attempting to use the returned variable, rather than assuming you actually have some HTML to work with.
Additionally you have to exit after your redirect to prevent the rest of the code being processed.
$html = file_get_html($website);
if ($html === false) {
    header('Location: https://test.com/login/display/error_message.php');
    exit;
}
// You can now work with the $html object from here onwards.
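Note also that if display_errors is enabled, the warning printed by file_get_contents counts as output and would itself block the header() redirect; suppressing it avoids that. A sketch using the @ operator:
$html = @file_get_html($website);   // @ keeps the warning from being sent to the browser
if ($html === false) {
    header('Location: https://test.com/login/display/error_message.php');
    exit;
}
// Safe to use $html->find(...) from here onwards.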

CakePHP never-ending request

I'm quite new to CakePHP and trying to debug code from someone else.
The problem is that I get a never-ending request, despite the fact that both the view and the controller seem to run properly. I even tried adding an exit; in both of them, or introducing a syntax error in the controller; the request never ends and the browser keeps trying to load the page endlessly.
Here is the code of the controller:
public function categories()
{
    file_put_contents("/tmp/logfile.log", time()." categories bla\n", FILE_APPEND);
    $catData = $this->SpecificKeywordCategorie->find('all');
    $modelnameLessValues = array();
    foreach ($catData as $singleCat) {
        $modelnameLessValues[] = $singleCat['SpecificKeywordCategorie'];
    }
    $this->set('categories', $modelnameLessValues);
    file_put_contents("/tmp/logfile.log", time()." categories end blu\n", FILE_APPEND);
}
and the view code, categories.ctp:
<?php
file_put_contents("/tmp/logfile.log","view json ".json_encode($categories),FILE_APPEND);
print(json_encode($categories));
file_put_contents("/tmp/logfile.log","view json before exit",FILE_APPEND);
exit;
?>
All the file_put_contents entries are written to the log file, but the exit seems to be ignored, and if I make a request in a browser it never ends...
The same thing happens if I add a syntax error in the controller or the view (of course, in that case the log entries are not written).
I know nothing about CakePHP internals, but PHP scripts running outside it work fine on the same Apache instance.
Any idea where to look to find where this infinite request comes from?
We are running CakePHP 2.2.3.

PHP pointers - no data received

I'm mining data from a site, but it has a paginator and I need to get all pages.
The link to the next page is written in a link tag with rel=next. If there are no more pages, the link tag is missing. I created a function called getAll which should call itself again and again as long as the link tag is present.
function getAll($url, &$links) {
    $dom = file_get_html($url);              // create dom object from $url
    $tmp = $dom->find('link[rel=next]', 0);  // find link rel=next
    if (is_object($tmp)) {                   // is there a link tag?
        $link = $tmp->getAttribute('href');  // get url of next page - href attribute
        $links[] = $link;                    // insert url into array
        getAll($link, $links);               // call self
    } else {
        return $links;                       // there are no more urls, return the array
    }
}
// usage
$links = array();
getAll('http://www.zbozi.cz/vyrobek/apple-iphone-5/', $links);
print_r($links); // dump the links
But I have a problem: when I run the script, the message "No data received" appears in Chrome. I have no idea what the error is. The function should work, because when it doesn't call itself again it returns one link - to the second page.
I think the problem is bad syntax or bad pointer usage.
Could you please help me?
I don't know what file_get_html or find should do, but this should work:
<?php
function getAll($url, &$links) {
    $dom = new DOMDocument();
    $dom->loadHTML(file_get_contents($url));
    $linkElements = $dom->getElementsByTagName('link');
    foreach ($linkElements as $linkElement) {
        if ($linkElement->hasAttribute('rel') && $linkElement->getAttribute('rel') === 'next') {
            $nextURL = $linkElement->getAttribute('href');
            $links[] = $nextURL;
            getAll($nextURL, $links);
        }
    }
}
$links = array();
getAll('http://www.zbozi.cz/vyrobek/apple-iphone-5/', $links);
print_r($links);
Firstly, this could be easier to diagnose: without an error message this could be anything from a DNS error to a corrupted space character inside your file. So if you haven't already, try adding this to the top of your script:
error_reporting(E_ALL);
ini_set("display_errors", "1");
It should reveal any error that might have taken place. But if that doesn't work I have two ideas:
You can't have a syntax error because then the script wouldn't even run. You said that removing the recursion yielded a result so the script must work.
One possibility is that it's timing out. This depends on the server configuration. Try adding
echo $url, "<br>";
flush();
to the top of getAll. If you receive any of the links this is your problem.
This can be fixed by calling a function like set_time_limit(0).
Another possibility is a connection error. This could be caused by coincidence or a server configuration limit. I can't be certain but I know some hosting providers limit file_get_contents and curl requests. There is a possibility your scripts are limited to one external request per execution.
Besides that, there is nothing I can think of that could really go wrong with your script. You could remove the recursion and run the function in a while loop (a rough sketch is included below). But unless you expect a lot of pages there is no need for such a modification.
And finally, the library you are using for DOM parsing will either return a DOM element object or null. So you can change if(is_object($tmp)){ to if($tmp){. And since you are passing the result by reference, returning a value is pointless. You can safely remove the else statement.
I wish you good luck.
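If you do want to try the loop approach, here is a rough sketch, assuming the simple_html_dom helpers file_get_html(), find() and clear() behave as in the question (getAllIterative is just an illustrative name):
function getAllIterative($startUrl) {
    $links = array();
    $url = $startUrl;
    while ($url !== null) {
        $dom = file_get_html($url);                 // false on a failed request
        if ($dom === false) {
            break;                                  // stop instead of looping forever
        }
        $next = $dom->find('link[rel=next]', 0);    // the rel=next link, or null
        $url = $next ? $next->getAttribute('href') : null;
        if ($url !== null) {
            $links[] = $url;
        }
        $dom->clear();                              // free simple_html_dom memory
    }
    return $links;
}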

php getting error from simplehtmldom when trying to get next page of url web scrape

I am trying to get the next page of the topic but it gives an error. Is there any way to avoid that error so I can scrape the next page within that age topic? (The next page goes by 20, then 40, and so forth.) The error is given below, and I'm sure someone is going to ask me to post the code, but I'm not sure how much or which code I should post.
http://blah.com/quotes/topic/age
20 1
1http://blah.com/quotes/topic/age/20
Fatal error: Call to a member function find() on a non-object in /Users/blah/Sites/simple_html_dom.php on line 879
UPDATE***
these are lines 870-885:
function save($filepath='') {
    $ret = $this->root->innertext();
    if ($filepath !== '') file_put_contents($filepath, $ret, LOCK_EX);
    return $ret;
}

// find dom node by css selector
// Paperg - allow us to specify that we want case insensitive testing of the value of the selector.
function find($selector, $idx=null, $lowercase=false) {
    return $this->root->find($selector, $idx, $lowercase);
}

// clean up memory due to php5 circular references memory leak...
function clear() {
    foreach ($this->nodes as $n) {$n->clear(); $n = null;}
The first thing you should check is the file where $html->find() is called.
Check that you included simple_html_dom.php (with an include) at the beginning of the file:
- make sure it is there
- make sure the path is correct
Check that you have this line: $html = file_get_html('http://www.google.com/');
- of course your line will have the web address you are trying to get
I think the problem is that you either did not include simple_html_dom.php or you are missing the file_get_html() call.
Check those. The problem is not in simple_html_dom.php, so just look at the file you created.
Good luck!
UPDATE
While you're at it, please provide the source of your file, or at least the line where you call find().
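More generally, that fatal error ("Call to a member function find() on a non-object") usually means file_get_html() returned false because the page could not be fetched, so find() is being called on a boolean. A guard like the following sketch avoids the crash (the URL is just the example from the question):
include 'simple_html_dom.php';

$url = 'http://blah.com/quotes/topic/age/20';
$html = @file_get_html($url);

if ($html === false) {
    // The request failed; skip this page instead of calling find() on false.
    echo "Could not load $url\n";
} else {
    foreach ($html->find('a') as $link) {   // example selector only
        echo $link->href, "\n";
    }
    $html->clear();
}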

How to check if a webpage exists. jQuery and/or PHP

I want to be able to validate a form to check if a website/webpage exists. If it returns a 404 error then that definitely shouldn't validate. If there is a redirect...I'm open to suggestions, sometimes redirects go to an error page or homepage, sometimes they go to the page you were looking for, so I don't know. Perhaps for a redirect there could be a special notice that suggests the destination address to the user.
The best thing I found so far was like this:
$.ajax({
    url: webpage,
    type: 'HEAD',
    error: function() {
        alert('No go.');
    }
});
That has no problem with 404's and 200's but if you do something like 'http://xyz' for the url it just hangs. Also 302 and the like trigger the error handler too.
This is a generic enough question I would like a complete working code example if somebody can make one. This could be handy for lots of people to use.
It sounds like you don't care about the web page's contents; you just want to see if it exists. Here's how I'd do it in PHP - this way PHP doesn't take up memory holding the page's contents.
/*
 * Returns false if the page could not be retrieved (i.e., no 2xx or 3xx HTTP
 * status code). On success, if $includeContents = false (default), then we
 * return true - if it's true, then we return file_get_contents()'s result (a
 * string of page content).
 */
function getURL($url, $includeContents = false)
{
    if ($includeContents)
        return @file_get_contents($url);
    return (@file_get_contents($url, null, null, 0, 0) !== false);
}
For less verbosity, replace the above function's contents with this:
return ($includeContents) ?
    @file_get_contents($url) :
    (@file_get_contents($url, null, null, 0, 0) !== false);
See http://www.php.net/file_get_contents for details on how to specify HTTP headers using a stream context.
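For instance, a minimal sketch of such a stream context; the HEAD method, timeout and header values below are only illustrative, not part of the documented example:
$context = stream_context_create(array(
    'http' => array(
        'method'  => 'HEAD',                   // ask only for headers, not the body
        'timeout' => 5,                        // seconds before giving up
        'header'  => "Accept: text/html\r\n",  // extra HTTP headers go here
    ),
));

// With a maximum length of 0 we never buffer the body; false still means failure.
$exists = (@file_get_contents($url, false, $context, 0, 0) !== false);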
Cheers.
First you need to check that the page exists via DNS. That's why you say it "just hangs" - it's waiting for the DNS query to time out. It's not actually hung.
After checking DNS, check that you can connect to the server. This is another long timeout if you're not careful.
Finally, perform the HTTP HEAD and check the status code. There are many, many, many special cases you have to consider here: what does a "temporary internal server error" mean for the page existing? What about "permanently moved"? Look into HTTP status codes.
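If you want those timeouts and the status-code check to be explicit, a cURL-based sketch along these lines could work (the function name and cutoff values are just examples):
function webpage_exists($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // send a HEAD request, skip the body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // don't echo anything
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // treat redirects as "exists"
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // DNS + connect timeout (seconds)
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // overall timeout (seconds)
    $ok = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return ($ok !== false) && $status >= 200 && $status < 400;
}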
I've just written a simpler version using PHP:
function url_check($url) {
    $x = @fopen($url, "r");
    if ($x) {
        $reply = 1;
        fclose($x);
    } else {
        $reply = 0;
    }
    return $reply;
}
Obviously $url is the test URL; the function returns 1 (true) or 0 (false) depending on whether the URL exists.
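For example (example.com is just a placeholder):
if (url_check('http://www.example.com/')) {
    echo "URL exists";
} else {
    echo "URL does not exist";
}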
Maybe you could combine a domain checker with jQuery; the domain checker (PHP) can respond with 1 or 0 for non-existent domains.
E.g. http://webarto.com/snajper.php?domena=stackoverflow.com will return 1, and you can use the input's blur event to check it instantly.
