Failsafe for PHP Simple HTML DOM Parser

Using PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net), I recently had a situation where the external webpage I routinely fetch was not responding (their servers were down). Because of this, my own website would not load (instead it showed errors after a lengthy wait).
What would be the best way to add a failsafe to this parser upon an unsuccessful fetch attempt?
I have tried the following without success.
include('./inc/simple_html_dom.php');
$html = file_get_html('http://client0.example.com/dcnum.php?count=1');
$str = $html->find('body', 0);
$num = $str->innertext;
if (!$html)
{
    error('No response.');
}
$html->clear();
unset($html);
EDIT: I haven't had time to try this yet, but perhaps I could place my 'if' statement directly after the file_get_html() call (before the $html->find('body', 0) part).
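For reference, here is a minimal sketch of that reordering; file_get_html() returns false on failure, so the check has to happen before ->find() is called (error() is assumed to be your own helper):
include('./inc/simple_html_dom.php');

// Bail out *before* calling any methods on the result.
$html = file_get_html('http://client0.example.com/dcnum.php?count=1');
if (!$html)
{
    error('No response.');
}
else
{
    $str = $html->find('body', 0);
    $num = $str->innertext;
    $html->clear();
    unset($html);
}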

If I understand correctly, you want to prevent your own site from failing when theirs is offline...
If you are using PHP's cURL bindings, you can check the HTTP status code using curl_getinfo(), like so:
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);

/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if ($httpCode == 404) {
    /* Handle 404 here. */
}

curl_close($handle);

/* Handle $response here. */
You can also check for other error codes, such as 500, 503, etc.
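For instance, a broader check might treat any 4xx/5xx status as a failure (a sketch, not an exhaustive list of codes):
if ($httpCode >= 400) {
    /* Covers 404, 500, 503, and other client/server errors. */
}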

It took me literally hours to figure this out; surprisingly, there are very few clues on how to handle errors with simple_html_dom.
Basically, all you have to do is get rid of file_get_html(), ->load_file(), or whatever simple_html_dom-specific method you used to load the content, fetch it with cURL instead, and pass the result to str_get_html().
I used the code from the other answer; here is how you can use it:
function get_with_curl_or_404($url) {
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
    $response = curl_exec($handle);
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);
    if ($httpCode == 404 || !$response) { // arbitrary choice to return 404 when anything went wrong
        return 404;
    } else {
        return $response;
    }
}
$html = str_get_html(get_with_curl_or_404("http://your-url.com/index.html"));
if ($html == 404) {
    // Do whatever you want
} else {
    // If not 404, you can use it as usual: ->find(), etc.
}
It is much more stable on big websites.
If this is the kind of behavior you were looking for, please try it out, and tell me if it made your day.

Related

How to get the response of an HTTP request

I am trying to get some tag values but it's showing an error.
Below is the code, please suggest a solution.
This is the method I used for the HTTP GET request:
function httpGet($result15)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $result15);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
$result15= httpGet("https://www.googleapis.com/customsearch/v1?key=API_KEY&cx=003255er&q=cancer&num=1&alt=atom");//new cse
echo $result15;
$xml = new DOMDocument();
$xml->loadXML($result15);
foreach ($xml->entry as $entry)
{
    echo "URL=" . (string)$entry->id . PHP_EOL;
    echo "Summary=" . (string)$entry->summary . PHP_EOL;
}
You might find the curl request is failing. You need to do a couple of things...
function httpGet($result15)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $result15);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Add this
    $output = curl_exec($ch);
    // If this fails, output the error.
    if ($output === FALSE) {
        echo curl_error($ch);
        // Not sure what you want to do, but 'exit' will work for now
        exit();
    }
    curl_close($ch);
    return $output;
}
This will display an error if the cURL request fails. You will need to decide how you're going to cope with this. You could return false, and then in your code further down, check this before trying to load it as XML; a sketch follows. The code above just stops on errors.
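For example, a sketch of that return-false variant (names follow the question's code):
// Inside httpGet(): report the failure to the caller instead of exiting.
if ($output === FALSE) {
    echo curl_error($ch);
    curl_close($ch);
    return false;
}

// Further down: check the result before trying to parse it.
if ($result15 !== false) {
    $xml = simplexml_load_string($result15);
    // ... work with $xml here
}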
Your next piece of code seems to mix SimpleXML and DOMDocument; you can use SimpleXML if the document structure is fairly straightforward...
$xml = simplexml_load_string($result15);
foreach ($xml->entry as $entry)
{
    echo "URL=" . (string)$entry->id . PHP_EOL;
    echo "Summary=" . (string)$entry->summary . PHP_EOL;
}

HTTP Response Code 0 - Site is working

I am making a website that will check if another website is working and live. I pass in the URL of the site I would like to check, and the following code checks whether the site is live and returns the HTTP response code as well as true or false.
function urlExists($url = NULL)
{
    if ($url == NULL) return false;
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode == 0) {
        return array(false, $httpcode);
    }
    else if ($httpcode < 400) {
        return array(true, $httpcode);
    } else {
        return array(false, $httpcode);
    }
}
With one of the sites I am testing, though, I am getting an HTTP response code of 0 even though I know that the site is live and working.
The site is very slow, as it's a large site on a not very powerful server, so response times can vary between 7 and 25 seconds.
Any help would be greatly appreciated.
Thanks,
Sam
Based on these two links:
https://curl.haxx.se/libcurl/c/CURLOPT_TIMEOUT.html
and
https://curl.haxx.se/libcurl/c/CURLOPT_CONNECTTIMEOUT.html
The first one sets the maximum time the whole request is allowed to take; the second one is the timeout for the connect phase.
As you said, the site URL you are hitting takes 7-25 seconds to respond; meanwhile your cURL request is terminated and closed because of these two time settings.
Increase these two time settings in your code and it will work for you.
Thanks.
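A minimal sketch with both settings raised; the 30/10 values are illustrative, pick whatever suits a site that can take 25 seconds to answer:
$ch = curl_init($url);
// Allow up to 30 seconds for the whole request...
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
// ...and up to 10 seconds just for establishing the connection.
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);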
I will offer 2 alternatives for you to compare; along with your cURL function, you will have 3 options to see which one is better/faster for you.
Option A (all PHP versions), requires fopen() URL wrappers (allow_url_fopen) to be enabled:
if (!$fp = fopen($url, 'r'))
{
    trigger_error("Unable to open URL ($url)", E_USER_ERROR);
}
$headers = stream_get_meta_data($fp);
fclose($fp);
$http_header_info = $headers['wrapper_data'][0];
$httpCode = (int)substr($http_header_info, 9, 3);
Option B (PHP 5+):
$headers = get_headers($url, 1);
$http_header_info = $headers[0];
$httpCode = substr($http_header_info, 9, 3);
Also, if anyone has benchmarks on these 3 approaches, I am curious to see which is more appropriate (only for retrieving HTTP response headers, of course).
Code 0 is often returned when the URL syntax is invalid or the host cannot be found.
You can also call the curl_error($ch) function (http://php.net/manual/en/function.curl-error.php) to determine the error details.
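For example (a small sketch; curl_error() returns an empty string when there was no error):
$data = curl_exec($ch);
if ($data === false) {
    // e.g. "Couldn't resolve host" or a timeout message
    echo 'cURL error: ' . curl_error($ch);
}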

WP - PHP cURL or get_headers() function leads to a 404 error

1) I am using the WordPress engine.
2) I have a numeric array() with 800+ links in it, like this.
What I'm trying to do is run a foreach() loop and check whether each link still exists (does not return a 404 error).
I tried 2 functions:
1)
<?php
foreach ($links as $link) {
    $file_headers = @get_headers($link);
    if (strpos($file_headers[0], '404') !== false) {
        $toDeleteLinks[] = $link;
    }
}
?>
so according to this first function, the $toDeleteLinks array should contain all the links that return a 404 error. I'm using the get_headers() function here...
2)
<?php
foreach ($links as $link) {
    $handle = curl_init($link);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
    $response = curl_exec($handle);
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    if ($httpCode == 404) {
        $toDeleteLinks[] = $link;
    }
    curl_close($handle);
}
?>
this second one should do the same, just using cURL...
BUT in both cases I get redirected to the WordPress 404.php page. I think that's because of the large number of links.
Can you please help me find a solution for this? Use another function instead, or whatever works...
Thanks.

Why can't I get JSON data from PHP?

I have a webpage that exposes some public interfaces that are accessed like a simple AJAX call from other pages. Example:
http://domain1.com/interface/function.php:
$json['result'] = ... // fill with data
$json['ok'] = true;
echo json_encode($json);
http://domain2.com/application.php:
$call = 'http://domain1.com/interface/function.php';
$curl = curl_init($call);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$call_data = curl_exec($curl);
$error = curl_error($curl);
curl_close($curl);
print_r($error);
print_r($call_data);
The problem is $call_data is empty. I already tried to use file_get_contents() and other cURL parameters without success. Also, if I change the first line in application.php to:
$call = 'http://www.google.com/';
$call_data gets the right file content (Google home page content, of course). Moreover, curl_error() doesn't return any error. What's happening? Why?
I had the same thing last weekend.
If you have an .htaccess file or something similar that rewrites the URL of the service provider, adds a trailing slash, or performs some kind of redirection, this might cause the problem: both curl_exec($curl) and curl_error($curl) will be empty strings.
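If a redirect like that is the culprit, a common workaround is to let cURL follow it; a sketch based on the question's code, where the two extra options are the only change:
$curl = curl_init($call);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
// Follow the redirect that .htaccess adds (trailing slash etc.).
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_MAXREDIRS, 5);
$call_data = curl_exec($curl);
curl_close($curl);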
From the PHP curl_exec manual:
Returns TRUE on success or FALSE on failure. However, if the CURLOPT_RETURNTRANSFER option is set, it will return the result on success, FALSE on failure.
Are you sure that you've set CURLOPT_RETURNTRANSFER to true?

How can one check to see if a remote file exists using PHP?

The best I could find, an if fclose fopen type thing, makes the page load really slowly.
Basically what I'm trying to do is the following: I have a list of websites, and I want to display their favicons next to them. However, if a site doesn't have one, I'd like to replace it with another image rather than display a broken image.
You can instruct curl to use the HTTP HEAD method via CURLOPT_NOBODY.
More or less
$ch = curl_init("http://www.example.com/favicon.ico");
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_exec($ch);
$retcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
// $retcode >= 400 -> not found; $retcode == 200 -> found.
curl_close($ch);
Anyway, you only save the cost of the HTTP transfer, not the TCP connection establishment and closing. And since favicons are small, you might not see much improvement.
Caching the result locally seems a good idea if it turns out to be too slow.
A HEAD request returns the file's modification time in the headers. You can do what browsers do and get the CURLINFO_FILETIME of the icon.
In your cache you can store the URL => [favicon, timestamp]; you can then compare timestamps and reload the favicon if it has changed.
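A rough sketch of that cache idea; the $cache structure is hypothetical, only the cURL calls are real:
$ch = curl_init($faviconUrl);
curl_setopt($ch, CURLOPT_NOBODY, true);
// Ask cURL to extract the document's modification time.
curl_setopt($ch, CURLOPT_FILETIME, true);
curl_exec($ch);
$remoteTime = curl_getinfo($ch, CURLINFO_FILETIME); // -1 if unknown
curl_close($ch);

// $cache[$faviconUrl] = array(favicon data, timestamp) - hypothetical layout.
if ($remoteTime > $cache[$faviconUrl][1]) {
    // The remote icon is newer: re-download it and update the cache.
}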
As Pies says, you can use cURL. You can get cURL to only give you the headers, and not the body, which might make it faster. A bad domain could always take a while because you will be waiting for the request to time out; you could probably change the timeout length using cURL.
Here is an example:
function remoteFileExists($url) {
    $curl = curl_init($url);
    // don't fetch the actual page, you only want to check the connection is ok
    curl_setopt($curl, CURLOPT_NOBODY, true);
    // do request
    $result = curl_exec($curl);
    $ret = false;
    // if request did not fail
    if ($result !== false) {
        // if request was ok, check response code
        $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        if ($statusCode == 200) {
            $ret = true;
        }
    }
    curl_close($curl);
    return $ret;
}

$exists = remoteFileExists('http://stackoverflow.com/favicon.ico');
if ($exists) {
    echo 'file exists';
} else {
    echo 'file does not exist';
}
CoolGoose's solution is good but this is faster for large files (as it only tries to read 1 byte):
if (false === file_get_contents("http://example.com/path/to/image", 0, null, 0, 1)) {
    $image = $default_image;
}
This is not an answer to your original question, but a better way of doing what you're trying to do:
Instead of actually trying to get the site's favicon directly (which is a royal pain given it could be /favicon.png, /favicon.ico, /favicon.gif, or even /path/to/favicon.png), use Google:
<img src="http://www.google.com/s2/favicons?domain=[domain]">
Done.
A complete function based on the most-voted answer:
function remote_file_exists($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); # handles 301/2 redirects
    curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpCode == 200) { return true; }
    return false;
}
You can use it like this:
if (remote_file_exists($url))
{
    // file exists, do something
}
If you are dealing with images, use getimagesize. Unlike file_exists, this built-in function supports remote files. It will return an array that contains the image information (width, height, type, etc.). All you have to do is check the first element in the array (the width). Use print_r to output the content of the array:
$imageArray = @getimagesize("http://www.example.com/image.jpg");
if ($imageArray !== false && $imageArray[0])
{
    echo "it's an image and here is the image's info<br>";
    print_r($imageArray);
}
else
{
    echo "invalid image";
}
if (false === file_get_contents("http://example.com/path/to/image")) {
    $image = $default_image;
}
Should work ;)
This can be done by obtaining the HTTP status code (404 = not found), which is possible with file_get_contents by making use of context options. The following code takes redirects into account and will return the status code of the final destination:
$url = 'http://example.com/';
$code = FALSE;

$options['http'] = array(
    'method' => "HEAD",
    'ignore_errors' => 1
);

$body = file_get_contents($url, NULL, stream_context_create($options));

foreach ($http_response_header as $header)
    sscanf($header, 'HTTP/%*d.%*d %d', $code);

echo "Status code: $code";
If you don't want to follow redirects, you can do it similarly:
$url = 'http://example.com/';
$code = FALSE;

$options['http'] = array(
    'method' => "HEAD",
    'ignore_errors' => 1,
    'max_redirects' => 0
);

$body = file_get_contents($url, NULL, stream_context_create($options));

sscanf($http_response_header[0], 'HTTP/%*d.%*d %d', $code);

echo "Status code: $code";
Some of the functions, options and variables in use are explained in more detail in a blog post I've written: HEAD first with PHP Streams.
PHP's built-in functions may not work for checking a URL if the allow_url_fopen setting is turned off for security reasons. cURL is a better option, as we would not need to change our code at a later stage. Below is the code I used to verify a valid URL:
$url = str_replace(' ', '%20', $url);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($httpcode >= 200 && $httpcode < 300) { return true; } else { return false; }
Note the CURLOPT_SSL_VERIFYPEER option, which is set to false so that URLs starting with HTTPS can also be checked (peer certificate verification is skipped).
To check for the existence of images, exif_imagetype should be preferred over getimagesize, as it is much faster.
To suppress the E_NOTICE, just prepend the error control operator (@).
if (@exif_imagetype($filename)) {
    // Image exists
}
As a bonus, with the returned value (IMAGETYPE_XXX) from exif_imagetype we can also get the MIME type or file extension with image_type_to_mime_type / image_type_to_extension.
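For example (a small sketch; $filename as above):
$type = @exif_imagetype($filename);
if ($type !== false) {
    echo image_type_to_mime_type($type); // e.g. "image/png"
    echo image_type_to_extension($type); // e.g. ".png"
}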
A radical solution would be to display the favicons as background images in a div above your default icon. That way, all overhead would be placed on the client while still not displaying broken images (missing background images are ignored in all browsers AFAIK).
You could use the following:
$file = 'http://mysite.co.za/images/favicon.ico';
$file_exists = (@fopen($file, "r")) ? true : false;
This worked for me when trying to check if an image exists at the URL:
function remote_file_exists($url) {
    return (bool) preg_match('~HTTP/1\.\d\s+200\s+OK~', @current(get_headers($url)));
}
$ff = "http://www.emeditor.com/pub/emed32_11.0.5.exe";
if (remote_file_exists($ff)) {
    echo "file exists!";
}
else {
    echo "file does not exist!!!";
}
This works for me to check if a remote file exists in PHP:
$url = 'https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico';
$header_response = get_headers($url, 1);
if (strpos($header_response[0], "404") !== false) {
    echo 'File does NOT exist';
} else {
    echo 'File exists';
}
You can use:
$url = getimagesize("http://www.flickr.com/photos/27505599@N07/2564389539/");
if (!is_array($url))
{
    $default_image = "…/directoryFolder/junal.jpg";
}
If you're using the Laravel framework or the Guzzle package, there is also a much simpler way using the Guzzle client; it also works when links are redirected:
$client = new \GuzzleHttp\Client(['allow_redirects' => ['track_redirects' => true]]);

try {
    $response = $client->request('GET', 'your/url');
    if ($response->getStatusCode() != 200) {
        // not exists
    }
} catch (\GuzzleHttp\Exception\GuzzleException $e) {
    // not exists
}
More in the documentation: https://docs.guzzlephp.org/en/latest/faq.html#how-can-i-track-redirected-requests
You should issue HEAD requests, not GET ones, because you don't need the URI contents at all. As Pies said above, you should check the status code (it should be in the 200-299 range, and you may optionally follow 3xx redirects).
The answers to this question contain a lot of code examples which may be helpful: PHP / Curl: HEAD Request takes a long time on some sites
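Putting those two points together, a sketch of a HEAD check that follows redirects and accepts any 2xx status:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD-style request, no body
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 3xx redirects
curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
$exists = ($code >= 200 && $code < 300);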
There's an even more sophisticated alternative: you can do the checking all client-side using a jQuery trick.
$('a[href^="http://"]').filter(function() {
    return this.hostname && this.hostname !== location.hostname;
}).each(function() {
    var link = jQuery(this);
    var faviconURL = link.attr('href').replace(/^(http:\/\/[^\/]+).*$/, '$1') + '/favicon.ico';
    var faviconIMG = jQuery('<img src="favicon.png" alt="" />')['appendTo'](link);
    var extImg = new Image();
    extImg.src = faviconURL;
    if (extImg.complete)
        faviconIMG.attr('src', faviconURL);
    else
        extImg.onload = function() { faviconIMG.attr('src', faviconURL); };
});
From http://snipplr.com/view/18782/add-a-favicon-near-external-links-with-jquery/ (the original blog is presently down)
All the answers here that use get_headers() are doing a GET request.
It's much faster/cheaper to just do a HEAD request.
To make sure that get_headers() does a HEAD request instead of a GET, you should add this:
stream_context_set_default(
    array(
        'http' => array(
            'method' => 'HEAD'
        )
    )
);
so to check if a file exists, your code would look something like this:
stream_context_set_default(
    array(
        'http' => array(
            'method' => 'HEAD'
        )
    )
);

$headers = get_headers('http://website.com/dir/file.jpg', 1);
$file_found = stristr($headers[0], '200');
$file_found will return either false or true, obviously.
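Note that stream_context_set_default() changes the default for the entire script. On PHP 7.1+ you can instead pass a one-off context as the third argument of get_headers(), which leaves the default untouched (a sketch):
$context = stream_context_create(array(
    'http' => array('method' => 'HEAD')
));
$headers = get_headers('http://website.com/dir/file.jpg', 1, $context);
$file_found = stristr($headers[0], '200');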
If the file is not hosted externally, you might translate the remote URL to an absolute path on your web server. That way you don't have to call cURL or file_get_contents(), etc.
function remoteFileExists($url) {
    $root = realpath($_SERVER["DOCUMENT_ROOT"]);
    $urlParts = parse_url($url);

    if (!isset($urlParts['path']))
        return false;

    if (is_file($root . $urlParts['path']))
        return true;
    else
        return false;
}

remoteFileExists('https://www.yourdomain.com/path/to/remote/image.png');
Note: Your web server must populate DOCUMENT_ROOT to use this function.
I don't know if is_file() is any faster when the file does not exist remotely, but you could give it a shot:
$favIcon = 'default FavIcon';
if (is_file($remotePath)) {
    $favIcon = file_get_contents($remotePath);
}
If you're using the Symfony framework, there is also a much simpler way using the HttpClientInterface:
private function remoteFileExists(string $url, HttpClientInterface $client): bool {
    $response = $client->request(
        'GET',
        $url // e.g. http://example.com/file.txt
    );

    return $response->getStatusCode() == 200;
}
The docs for the HttpClient are also very good and maybe worth looking into if you need a more specific approach: https://symfony.com/doc/current/http_client.html
You can use the Symfony Filesystem component:
use Symfony\Component\Filesystem\Filesystem;
use Symfony\Component\Filesystem\Exception\IOExceptionInterface;
and check:
$fileSystem = new Filesystem();
if ($fileSystem->exists('path_to_file')) {
    // ...
}
Please check the URL below; I believe it will help you. They provide two ways to overcome this, with a bit of explanation.
Try this one.
// Remote file url
$remoteFile = 'https://www.example.com/files/project.zip';

// Initialize cURL
$ch = curl_init($remoteFile);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_exec($ch);
$responseCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

// Check the response code
if ($responseCode == 200) {
    echo 'File exists';
} else {
    echo 'File not found';
}
or visit the URL
https://www.codexworld.com/how-to/check-if-remote-file-exists-url-php/
