I am using the SimpleHTMLDOM parser to fetch data from other sites. This worked pretty well on PHP 7.0. Since I upgraded to PHP 7.1.3, I get the following errors from file_get_contents:
Warning: file_get_contents(): stream does not support seeking in /..../test/scripts/simple_html_dom.php on line 75
Warning: file_get_contents(): Failed to seek to position -1 in the stream in /..../test/scripts/simple_html_dom.php on line 75
What I did
I downgraded to PHP 7 and it works like before without any problems. Next, I looked at the code of the parser, but I didn't find anything unusual:
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
The parser I use can be found here: http://simplehtmldom.sourceforge.net/
I had the same problem.
The PHP function file_get_contents changed in PHP 7.1 (support for negative offsets was added), so the default $offset of -1 used by Simple HTML DOM Parser is invalid for PHP >= 7.1. You have to set it to zero.
I noticed that the bug was corrected a few days ago, so this problem should not appear in the latest versions (https://sourceforge.net/p/simplehtmldom/repository/ci/3ab5ee865e460c56859f5a80d74727335f4516de/).
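For anyone who can't update yet, here is a minimal sketch of what the fix amounts to outside the library (example.com is just a placeholder URL); passing 0, or simply omitting the offset, avoids the seek warning on PHP 7.1+:
<?php
// Minimal sketch: why the old default of -1 breaks on PHP >= 7.1.
$url = 'http://example.com/'; // placeholder URL

// This is what simple_html_dom.php effectively did, and on PHP 7.1+ it now
// triggers "Failed to seek to position -1 in the stream" for remote files:
// $contents = file_get_contents($url, false, null, -1);

// Passing 0 (or omitting the offset entirely) works:
$contents = file_get_contents($url, false, null, 0);
echo strlen($contents), " bytes fetched\n";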
Related
I'm using html_dom to scrape a website.
$url = $_POST["textfield"];
$html = file_get_html($url);
html_dom.php
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
The problem is that if the internet connection is too slow, file_get_html still runs, and then I get a warning saying "failed to open stream" and a fatal error about the 30-second max execution time. I tried to solve it by stopping the function when a warning is detected:
function errHandle($errNo, $errStr, $errFile, $errLine) {
    $msg = "Slow Internet Connection";
    if ($errNo == E_NOTICE || $errNo == E_WARNING) {
        throw new ErrorException($msg, $errNo);
    } else {
        echo $msg;
    }
}
set_error_handler('errHandle');
But it still prints the fatal error about execution time. Any idea how I can solve this?
If it takes too long, you could increase the time limit:
http://php.net/manual/en/function.set-time-limit.php
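For example, a minimal sketch (the 120-second value is arbitrary, and $_POST["textfield"] is taken from your own snippet):
<?php
include 'simple_html_dom.php';

// Sketch: raise the execution time limit before the slow fetch.
// 120 seconds is an arbitrary example; set_time_limit(0) removes the limit.
set_time_limit(120);

$url  = $_POST["textfield"];
$html = file_get_html($url);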
You can't catch a fatal error in PHP 5.6 or below. In PHP 7+ you can, with:
try {
    doSomething();
} catch (\Throwable $exception) {
    // error handling
    echo $exception->getMessage();
}
Not sure if you can catch the execution time limit though.
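If you want to keep your warning-to-exception idea, a rough sketch combining it with the try/catch above could look like this (PHP 7+ only; as noted, it still won't catch the max-execution-time fatal itself, so you may also need set_time_limit):
<?php
include 'simple_html_dom.php';

// Sketch: turn warnings/notices from file_get_contents() into exceptions,
// then catch anything thrown while fetching the page.
set_error_handler(function ($errNo, $errStr, $errFile, $errLine) {
    throw new ErrorException($errStr, $errNo, E_ERROR, $errFile, $errLine);
}, E_WARNING | E_NOTICE);

try {
    $html = file_get_html($_POST["textfield"]);
} catch (\Throwable $exception) {
    echo "Slow Internet Connection: " . $exception->getMessage();
    $html = false;
}

restore_error_handler();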
The problem I am experiencing is below:
Warning: file_get_contents(): Unable to find the wrapper "https" - did you forget to enable it when you configured PHP? in C:\xampp\htdocs\test_crawl\simple_html_dom.php on line 75

Warning: file_get_contents(https://www.yahoo.com): failed to open stream: Invalid argument in C:\xampp\htdocs\test_crawl\simple_html_dom.php on line 75
I did some research and found a few posts that said uncommenting extension=php_openssl.dll in php.ini works, but when I did that and restarted my server, it did not. The script I am using is below:
$url = 'https://yahoo.com'
function CrawlMe($url)
{
    $html = file_get_html($url);
    return json_encode($html);
}
Not sure why it's not working; I would appreciate your help.
Below is the function that's erroring out at $contents = file_get_contents($url, $use_include_path, $context, $offset);
function file_get_html($url, $use_include_path = false, $context=null,
    $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true,
    $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true,
    $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed,
        $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the
    // retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to
    // control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
What's on line 75 of simple_html_dom.php? From what you have posted, all I can say is that
$url = 'https://yahoo.com'
is missing a semicolon; it should be:
$url = 'https://yahoo.com';
--Edit after seeing the code...
You are setting the offset to -1, which means start reading from the end of the file. As per the documentation:
Seeking (offset) is not supported with remote files. Attempting to
seek on non-local files may work with small offsets, but this is
unpredictable because it works on the buffered stream.
Your maxlength is set to -1. As per the documentation:
An E_WARNING level error is generated if filename cannot be found,
maxlength is less than zero, or if seeking to the specified offset in
the stream fails.
You don't need to specify all those parameters; this will work fine:
$file = file_get_contents('https://www.yahoo.com');
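If uncommenting extension=php_openssl.dll really has no effect, it may be worth confirming at runtime that the PHP build serving your script actually has the https wrapper. A quick diagnostic sketch (no simple_html_dom needed):
<?php
// Diagnostic sketch: check the https wrapper in the PHP that runs this script.
// Note the CLI and the Apache/XAMPP module can load different php.ini files.
var_dump(extension_loaded('openssl'));               // expected: bool(true)
var_dump(in_array('https', stream_get_wrappers()));  // expected: bool(true)
echo php_ini_loaded_file(), "\n";                     // which php.ini is actually loaded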
Currently, I am trying to get the results of an HTTPS website and I am getting this error while using simple_html_dom.php. Can anybody give me an idea of how to fix this or what's potentially causing it?
Warning: file_get_contents(https://crimson.gg/jackpot-history): failed to open stream: HTTP request failed! HTTP/1.1 503 Service Temporarily Unavailable in /tickets/simple_html_dom.php on line 75
Line 75 of simple_html_dom.php is inside this function:
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
And the code I am currently using is:
$html = file_get_html('https://crimson.gg/jackpot-history');
foreach($html->find('#contentcolumn > a') as $element)
{
    print '<br><br>';
    echo $url = 'https://crimson.gg/'.$element->href;
    $html2 = file_get_html($url);
    $title = $html2->find('#contentcolumn > a',0);
    print $title = $title->plaintext;
}
Line 4.
foreach($html->find('#contentcolumn > a') as $element)
I am using the Simple HTML DOM Parser for PHP.
A partial source of the document that I want to parse is:
<pre>
LINE 1
LINE 2
LINE 3
</pre>
This markup is stored in $string, and I wrote the following PHP code:
$html = str_get_html($string);
$ret = $html->find('pre',0)->plaintext;
echo $ret;
The result is:
LINE 1 LINE 2 LINE 3
but in a web browser that HTML is displayed as:
LINE 1
LINE 2
LINE 3
The web browser's output is what I want. How can I get the same result in PHP?
You need to set the default parameter $stripRN to false inside your simple_html_dom.php file, e.g.
$stripRN = false;
Change it for both functions, file_get_html and str_get_html.
So this:
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
becomes
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=false, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
and this:
function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
becomes
function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=false, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
ref: http://sourceforge.net/p/simplehtmldom/bugs/122/
ref: Preserve Line Breaks - Simple HTML DOM Parser
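If you would rather not edit the library itself, here is a sketch of the same idea applied per call (the extra arguments are just the library's own defaults, repeated so the fifth parameter, $stripRN, can be reached; nl2br is only there so the browser renders the breaks too):
<?php
include 'simple_html_dom.php';

$string = "<pre>\nLINE 1\nLINE 2\nLINE 3\n</pre>";

// Pass $stripRN = false for this one call instead of changing the defaults.
$html = str_get_html($string, true, true, DEFAULT_TARGET_CHARSET, false);
$ret  = $html->find('pre', 0)->plaintext;

echo nl2br($ret); // keeps the line breaks when viewed in a browser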
You want to keep the line breaks, correct? The following should work:
<?php
// Assuming $ret is an array of lines:
foreach ($ret as $string) {
    echo "$string\n";
}
EDIT: This is the reverse effect of my original code below.
You can use the following to echo text on each line:
Output in browser:
LINE 1
LINE 2
LINE 3
File content:
<pre>
LINE 1
LINE 2
LINE 3
</pre>
PHP code:
<?php
$lines = file('myfile.txt');
foreach ($lines as $line_num => $line) {
    echo ($line);
}
?>
Original answer: I misunderstood the OP's question, but the code may be useful nonetheless.
You can use the following code to echo text in one line.
File content:
<pre>
LINE 1
LINE 2
LINE 3
</pre>
Output in browser: LINE 1 LINE 2 LINE 3
PHP code:
<?php
$lines_of_file = file("myfile.txt");
// Now $lines_of_file has one array item per line
$file_content = file_get_contents("myfile.txt");
$file_content_separated_by_spaces = explode(" ", $file_content);
echo $file_content;
?>
I couldn't figure out how to do it with Simple HTML DOM Parser, but I was successful with Zend's DOM Query component. The following code works for maintaining the formatting of the text inside pre elements:
$dom = new Zend\Dom\Query($html_string);
$results = $dom->execute('pre');
foreach ($results as $result) { // each $result is of type DOMElement
    echo $result->ownerDocument->saveXML($result);
}
I just want to know if it's possible to extract content encoded in UTF-8 from an HTML file without an encoding header.
My specific case is this website:
http://www.metal-archives.com/band/discography/id/203/tab/all
I want to extract all the info, but as you can see, this word, for example, looks bad:
Motörhead
I tried to use file_get_html, htmlentities, utf8_decode, utf8_encode and mixes of them with different options, but I can't find a solution...
Edit:
I just want to see the same website in the correct format with this simple code:
$html_discos = file_get_html("http://www.metal-archives.com/band/discography/id/223/tab/all");
//some transform/decode here
print_r($html_discos);
I want the content in the correct format in a string or DOM object, so I can extract some parts later.
Edit 2:
file_get_html is a function from the "Simple HTML DOM" library:
http://simplehtmldom.sourceforge.net/
It has this code:
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
The Content-Type of the URL
http://www.metal-archives.com/band/discography/id/203/tab/all
is:
Content-Type: text/html
This defaults to ISO-8859-1, but instead you want UTF-8. Change the Content-Type so this is correctly signaled:
Content-Type: text/html; charset=utf-8
See: Setting the HTTP charset parameter
header('Content-Type: text/html; charset=utf-8');
echo file_get_contents('http://www.metal-archives.com/band/discography/id/203/tab/all');
As long as you are emitting as UTF-8, the raw data will work properly.
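The same idea applied to the parser itself, as a sketch (the title selector is only an example; adjust it to whatever part of the page you actually want):
<?php
include 'simple_html_dom.php';

// Sketch: declare UTF-8 before printing anything extracted from the DOM,
// since the page body is already UTF-8 even though the header doesn't say so.
header('Content-Type: text/html; charset=utf-8');

$html_discos = file_get_html('http://www.metal-archives.com/band/discography/id/203/tab/all');
if ($html_discos !== false) {
    // Text pulled out of the DOM (e.g. "Motörhead") should now display correctly.
    echo $html_discos->find('title', 0)->plaintext;
}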
Try using html_entity_decode: http://php.net/manual/en/function.html-entity-decode.php (the source of that page has encoded characters).