I'm using html_dom to scrape a website.
$url = $_POST["textfield"];
$html = file_get_html($url);
html_dom.php
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
The problem is that if the internet connection is too slow, file_get_html keeps running until I get a warning saying "failed to open stream" and then a fatal error about the 30-second maximum execution time. I tried to solve it by stopping the function when it detects a warning:
function errHandle($errNo, $errStr, $errFile, $errLine) {
    $msg = "Slow Internet Connection";
    if ($errNo == E_NOTICE || $errNo == E_WARNING) {
        throw new ErrorException($msg, $errNo);
    } else {
        echo $msg;
    }
}
set_error_handler('errHandle');
But it still prints the fatal error about the maximum execution time. Any idea how I can solve this?
If it takes too long, you could increase the time limit:
http://php.net/manual/en/function.set-time-limit.php
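For example, a minimal sketch (set_time_limit() restarts the timer each time it is called; the 120 here is an arbitrary value, not something from your code):

// Allow up to 120 seconds for the rest of this script run.
set_time_limit(120);
$html = file_get_html($url);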
You can't catch a fatal error in PHP 5.6 or below. In PHP 7+ you can with:
try {
    doSomething();
} catch (\Throwable $exception) {
    // error handling
    echo $exception->getMessage();
}
Not sure if you can catch the execution time limit though.
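Another option (not from the original answer, just a sketch) is to avoid the fatal error altogether by giving file_get_html a stream context with a short timeout; its third parameter is handed straight to file_get_contents, as the source above shows:

// Give up on the HTTP request after roughly 10 seconds instead of
// hanging until PHP's 30-second execution limit is hit.
$context = stream_context_create(array(
    'http' => array('timeout' => 10),
));
$html = file_get_html($url, false, $context);
if ($html === false) {
    echo "Slow Internet Connection";
}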
For my project, I'm using domcrawler to parse pages and extract images.
Code:
$goutteClient = new Client();
$guzzleClient = new GuzzleClient(array(
    'timeout' => 15,
));
$goutteClient->setClient($guzzleClient);

try {
    $crawler = $goutteClient->request('GET', $url);
    $crawlerError = false;
} catch (RequestException $e) {
    $crawlerError = true;
}

if ($crawlerError == false) {
    // find open graph image
    try {
        $file = $crawler->filterXPath("//meta[@property='og:image']")->attr('content');
    } catch (\InvalidArgumentException $e) {
        $file = null;
    }
    // if that fails, find the biggest image in the DOM
    if (!$file) {
        $images = $crawler
            ->filterXpath('//img')
            ->extract(array('src'));
        $files = [];
        foreach ($images as $image) {
            $attributes = getimagesize($image);
            // stopping here since this is where I'm getting my error
The relevant part is at the bottom. This will work some of the time. However, occasionally I get an error. For example, if $url was https://www.google.com it would spit out the following error:
ErrorException (E_WARNING)
getimagesize(/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png):
failed to open stream: No such file or directory
If I dd($image); in this situation, $image equals "/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png".
However, if I try with a website that doesn't give me an error, like https://www.harvard.edu, dd($image); returns "https://www.harvard.edu/sites/default/files/feature_item_media/Kremer900x600.jpg"
In other words, I'm not getting the full URL. How can I rectify this?
Prepend the relative links with the scheme and host. You can use parse_url on $url to extract the scheme and host, and can use the same function on $image to detect if a scheme/host is set.
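A minimal sketch of that idea (the helper name resolveImageUrl() is hypothetical, not part of the original code, and it only handles absolute, protocol-relative, and root-relative src values):

// Hypothetical helper: make a possibly relative image src absolute
// by borrowing the scheme and host from the page URL.
function resolveImageUrl($image, $pageUrl)
{
    $imageParts = parse_url($image);
    if (isset($imageParts['scheme'])) {
        return $image; // already absolute, e.g. "https://example.com/a.png"
    }
    $pageParts = parse_url($pageUrl);
    if (isset($imageParts['host'])) {
        return $pageParts['scheme'] . ':' . $image; // protocol-relative, e.g. "//cdn.example.com/a.png"
    }
    return $pageParts['scheme'] . '://' . $pageParts['host'] . $image; // root-relative, e.g. "/images/..."
}

// Inside the loop:
// $attributes = getimagesize(resolveImageUrl($image, $url));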
The problem I am experiencing is below:
Warning: file_get_contents(): Unable to find the wrapper "https" - did you forget to enable it when you configured PHP? in C:\xampp\htdocs\test_crawl\simple_html_dom.php on line 75

Warning: file_get_contents(https://www.yahoo.com): failed to open stream: Invalid argument in C:\xampp\htdocs\test_crawl\simple_html_dom.php on line 75
I did some research and found a few posts saying that uncommenting extension=php_openssl.dll in php.ini works, but when I did that and restarted my server it did not help. The script I am using is below:
$url = 'https://yahoo.com'
function CrawlMe($url)
{
    $html = file_get_html($url);
    return json_encode($html);
}
Not sure why it's not working; I would appreciate your help.
Below is the function that's erroring out at $contents = file_get_contents($url, $use_include_path, $context, $offset);
function file_get_html($url, $use_include_path = false, $context=null,
    $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true,
    $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true,
    $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed,
        $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the
    // retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to
    // control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
What's on line 75 of simple_html_dom.php? From what you have posted, all I can say is that
$url = 'https://yahoo.com'
is missing a semicolon; it should be:
$url = 'https://yahoo.com';
--Edit after seeing the code...
You are setting the offset to -1, which means start reading from the end of the file. As per the documentation:
Seeking (offset) is not supported with remote files. Attempting to
seek on non-local files may work with small offsets, but this is
unpredictable because it works on the buffered stream.
Your maxlength is set to minus one as well. As per the documentation:
An E_WARNING level error is generated if filename cannot be found,
maxlength is less than zero, or if seeking to the specified offset in
the stream fails.
You don't need to specify all those parameters; this will work fine:
$file = file_get_contents('https://www.yahoo.com');
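If the first warning (the missing "https" wrapper) still appears after enabling OpenSSL, a quick diagnostic sketch (not part of the original answer) is to check what PHP actually loaded:

// Both of these should print bool(true) once OpenSSL and the https wrapper are available.
var_dump(extension_loaded('openssl'));
var_dump(in_array('https', stream_get_wrappers()));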
I am using the SimpleHTMLDOM parser to fetch data from other sites. It was working pretty well on PHP 7.0. Since I upgraded to PHP 7.1.3, I get the following warnings from file_get_contents:
Warning: file_get_contents(): stream does not support seeking in /..../test/scripts/simple_html_dom.php on line 75

Warning: file_get_contents(): Failed to seek to position -1 in the stream in /..../test/scripts/simple_html_dom.php on line 75
What I did
I downgraded to PHP 7 and it works like before without any problems. Next, I looked at the code of the parser, but I didn't find anything unusual:
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
The parser I use can be found here: http://simplehtmldom.sourceforge.net/
I had the same problem.
The PHP function file_get_contents changed in PHP 7.1 (support for negative offsets was added), so the default $offset of -1 used by Simple HTML DOM Parser is no longer valid on PHP >= 7.1: a negative offset now means seeking from the end of the stream, which remote streams do not support. You would have to set it to zero.
I noticed that the bug has been corrected a few days ago, so this problem should not appear in the latest versions (https://sourceforge.net/p/simplehtmldom/repository/ci/3ab5ee865e460c56859f5a80d74727335f4516de/)
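A minimal sketch of the workaround on PHP 7.1, assuming the stock signature shown above (pass 0 explicitly instead of relying on the -1 default, or change the default in simple_html_dom.php itself):

// An explicit offset of 0 avoids the "Failed to seek to position -1" warning on PHP 7.1.
$html = file_get_html('http://example.com/', false, null, 0);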
Currently, I am trying to fetch the results of an https website and I am getting this error while using simple_html_dom.php. Can anybody give me an idea of how to fix this or what's potentially causing it?
Warning: file_get_contents(https://crimson.gg/jackpot-history): failed to open stream: HTTP request failed! HTTP/1.1 503 Service Temporarily Unavailable in /tickets/simple_html_dom.php on line 75
Line 75 of simple_html_dom.php is in this function:
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
And the code I am currently using is:
$html = file_get_html('https://crimson.gg/jackpot-history');
foreach ($html->find('#contentcolumn > a') as $element)
{
    print '<br><br>';
    echo $url = 'https://crimson.gg/' . $element->href;
    $html2 = file_get_html($url);
    $title = $html2->find('#contentcolumn > a', 0);
    print $title = $title->plaintext;
}
Line 4 of my code is:
foreach($html->find('#contentcolumn > a') as $element)
PHP is connected to a SOAP service through NuSOAP and I can receive data. After getting the data, I get a timeout error because the amount of data received is huge. You can see the Xdebug printout below. I can call the same service using PHP's native SOAP client.
http://s15.postimg.org/c9491lf1n/2015_03_08_02_07_10.jpg
I am using the following settings. I still get the above error even though I increased the limits as much as possible:
ini_set('memory_limit', "512M");
ini_set('max_execution_time', 60000); // x seconds
ini_set("display_errors", 1);
ini_set("max_input_time", 6000);
I am urgently waiting for a solution to this issue. Thanks for your support.
Edit:
My NuSOAP client function:
function nusoapCall($method, $params, $returnType = "array") {
    global $WSDLURL, $debugMode;
    require_once('tools/nusoap/nusoap.php');

    $soapClient = new nusoap_client($WSDLURL, 'wsdl');
    $soapClient->soap_defencoding = 'UTF-8';
    $soapClient->decode_utf8 = false;
    $soapClient->setUseCURL(true);
    //$soapClient->loadWSDL();
    $soapClient->setCurlOption(CURLOPT_CONNECTTIMEOUT, 60);
    if ($debugMode == 1) {
        $soapClient->debug_flag = false;
    }

    $tResult = $soapClient->call($method, $params);
    if ($returnType == "object") {
        $tResult = array_to_object($tResult);
    }

    $soapError = $soapClient->getError();
    echo $soapClient->getDebug();
    if (!empty($soapError)) {
        $errorMessage = 'Nusoap object creation failed: ' . $soapError;
        throw new Exception($errorMessage);
        return false;
    } else {
        return $tResult;
    }
}