PHP output keeps saying 'DOMDocument::loadHTML(): Empty string supplied as input in' - php

I have this code that retrieves every link in $curl_scraped_page:
require_once ('simple_html_dom.php');
$des_array = array();
$url = 'http://citeseerx.ist.psu.edu/search?q=mean&t=doc&sort=rlv';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
$html = new simple_html_dom();
$html->load($curl_scraped_page);
Then I want to get the abstract for each link I scraped (from the page that link points to). I also get other things like the title and description, but the problem lies only with the abstract:
foreach ($html->find('div.result h3 a') as $des) {
$des2 = 'http://citeseerx.ist.psu.edu' . $des->href;
$ch = curl_init($des2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page2 = curl_exec($ch);
libxml_use_internal_errors(true);
$dom = new DomDocument();
$dom->loadHtml($curl_scraped_page2);//line 72
libxml_use_internal_errors(false);
$xpath2 = new DomXPath($dom);
$thing = $xpath2->query('//p[preceding::h3[preceding::div]]')->item(1)->textContent; //line 75
array_push($des_array, $thing);
}
curl_close ($ch);
This is the display code:
for ($i = 0; $i < 10; $i++) {
echo $des_array[$i];
}
When I checked it on my browser, it gave me this, thrice:
Warning: DOMDocument::loadHTML(): Empty string supplied as input in C:\xampp\htdocs\MSP\Citeseerx.php on line 72
Notice: Trying to get property of non-object in C:\xampp\htdocs\MSP\Citeseerx.php on line 75
I realised I pushed an empty string to the $des_array. So I tried this:
if (empty($thing)){
array_push($des_array,'');
}
else{
array_push($des_array, $thing);
}
And this: if ($thing!=''){..}.
It still gave me that error.
What should I do?
Thanks..

curl_exec() may return false. In that case, check curl_error() to see what went wrong. For example, if the href attribute does not begin with /, you will pass an invalid URL to curl_init(). You can also use curl_getinfo() to get more information about the server response.
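A minimal sketch of that check, assuming PHP 7 or later with the curl and dom extensions. The helper names fetch_page() and extract_text() are hypothetical, not part of the original code; the idea is simply to never hand false or an empty string to loadHTML(), and never dereference a node that was not found.

```php
<?php
// Hypothetical helper: returns the response body, or false on failure.
function fetch_page(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() explains why the transfer failed.
        error_log('cURL failed for ' . $url . ': ' . curl_error($ch));
    }
    // The handle is released when $ch goes out of scope.
    return $body;
}

// Hypothetical helper: returns the text of the $index-th node matching
// $query, or '' when the input is unusable or the node does not exist.
function extract_text($html, string $query, int $index = 0): string
{
    if (!is_string($html) || trim($html) === '') {
        return ''; // nothing to parse, so skip loadHTML() entirely
    }
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        return ''; // the XPath expression itself was invalid
    }
    $node = $nodes->item($index);
    return $node === null ? '' : trim($node->textContent);
}
```

Inside the loop from the question this would be used roughly as array_push($des_array, extract_text(fetch_page($des2), '//p[preceding::h3[preceding::div]]', 1)); so a failed request produces an empty entry instead of a warning and a notice.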

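Note that with CURLOPT_RETURNTRANSFER set to true, curl_exec() returns the response body as a string on success (or false on failure), so $curl_scraped_page can be passed to loadHTML() once you have checked that it is a non-empty string. Writing the transfer to an open file handle is only needed if you set CURLOPT_FILE instead.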

Related

Is it possible to extract Dom Elements from htmlentities() function in php?

I appreciate the time you take to try and help me with my question.
I am trying to parse the HTML behind a link. I first use cURL to fetch the page, then pass the result through htmlentities() so it doesn't render on my page, which gives me a string, and then I use a DOM object to extract tags from it. I looked at different parsing methods on Google and learned a little about them before running my script. The problem is that the whole string ends up saved as textContent rather than as a real HTML document. How can I convert an htmlentities() string into a real DOM document and extract elements from it?
here is my script:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://www.usatoday.com/story/news/world/2021/02/17/dubai-princess-sheikha-latifa-says-she-hostage-after-flee-attempt/6778014002/?utm_source=feedblitz&utm_medium=FeedBlitzRss&utm_campaign=usatodaycomworld-topstories');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);
curl_close($curl);
$htmlentities = htmlentities($result);
// I added the code here
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($htmlentities);
$htmlDom->preserveWhiteSpace = false;
$styles = $htmlDom->getElementsByTagName('style');
foreach ($styles as $style) {
$item = $style->getElementsByTagName('td');
//echo the values
echo '1: '.$item->item(0)->nodeValue.'<br />';
echo '2: '.$item->item(1)->nodeValue.'<br />';
echo '3: '.$item->item(2)->nodeValue;
}
EDIT:
what i added next to the code is this:
$htmlentities = htmlentities($result);
$htmlentities = str_replace("&quot;",'"', $htmlentities);
$htmlentities = str_replace("&#039;","'", $htmlentities);
$htmlentities = str_replace("&lt;","<", $htmlentities);
$htmlentities = str_replace("&gt;",">", $htmlentities);
libxml_use_internal_errors(true);
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($htmlentities);
libxml_clear_errors();
var_dump($htmlDom);
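One possible way around this, sketched under the assumption that the escaped string is only needed for display: parse the raw cURL result and keep htmlentities() for output. If only the escaped string is available, html_entity_decode() reverses it. The helper name tag_texts() and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: collect the text of every <$tagName> element.
function tag_texts(string $rawHtml, string $tagName): array
{
    if (trim($rawHtml) === '') {
        return [];
    }
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($rawHtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($dom->getElementsByTagName($tagName) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Stand-in for the cURL result.
$raw = '<table><tr><td>a</td><td>b</td></tr></table>';

// Escaped markup is plain text to the parser, so no elements are found in it.
$escaped = htmlentities($raw);

// html_entity_decode() turns the entities back into markup.
$restored = html_entity_decode($escaped);

$fromRaw      = tag_texts($raw, 'td');
$fromEscaped  = tag_texts($escaped, 'td');
$fromRestored = tag_texts($restored, 'td');
```

Here $fromRaw and $fromRestored contain the two cell texts, while $fromEscaped is empty, which matches the symptom of everything landing in textContent.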

PHP curl Inside Foreach

EDIT: What is really happening is that a new XML file is created each time, but the new $html information is added to the previous, so by the time it gets to the last element in the list being fetched, it is saving parsed information from all previous requests. I can't figure out what is wrong.
I'm having trouble with a cURL request not executing as expected. In the code below, a foreach loop walks through a list ($textarray), passes each element to a cURL call, and also uses the element as the file name of an XML file. The cURL call returns $html, which is then parsed and saved to that XML file. The script runs, the list is passed, and the URL is built and passed to the cURL function. I get an echo showing the correct URL, a response comes back, and each response is parsed and saved to the appropriate file. The problem seems to be that cURL is not actually fetching the new $url: I get exactly the same information saved in every XML file. I know this is not correct, but I'm not sure why it is happening. Any help appreciated.
Function FeedXml($textarray){
$doc=new DOMDocument('1.0', 'UTF-8');
$feed=$doc->createElement("feed");
Foreach ($textarray as $text){
$url="http://xxx/xxx/".$text;
echo "PATH TO CURL".$url."<br>";
$html=curlurl($url);
$xmlsave="http://xxxx/xxx/".$text;
$dom = new DOMDocument(); //NEW dom FOR EACH SHOW
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$dom->formatOutput = true;
$dom->preserveWhiteSpace = true;
//PARSE EACH RETURN INFORMATION
$images= $dom->getElementsByTagName('img');
foreach($images as $img){
$icon= $img ->getAttribute('src');
if( preg_match('/\.(jpg|jpeg|gif)(?:[\?\#].*)?$/i', $icon) ) {
// ITEM TAG
$item= $doc->createElement("item");
$sdAttribute = $doc->createAttribute("sdImage");
$sdAttribute->value = $icon;
$item->appendChild($sdAttribute);
} // IMAGAGE FOR EACH
$feed->appendChild($item);
$doc->appendChild($feed);
$doc->save($xmlsave);
}
}
}
Function curlurl($url){
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch,CURLOPT_FRESH_CONNECT, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_VERBOSE, 1);//0-FALSE 1 TRUE
curl_setopt($ch,CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER ,FALSE);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_TIMEOUT,'10');
$html = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
echo $httpcode;
return $html;
}
Thanks for pointing out my shortcomings on the above. I have figured out the problem. The following needed to be moved into the Foreach.
$doc=new DOMDocument('1.0', 'UTF-8');
$feed=$doc->createElement("feed");
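A small sketch of why moving those two lines matters, assuming the dom extension. build_feed_xml() and the sample image names are hypothetical; the point is that each output document starts from its own DOMDocument, so items from an earlier page cannot leak into a later file.

```php
<?php
// Hypothetical helper: builds one <feed> document from a list of image URLs.
function build_feed_xml(array $imageUrls): string
{
    // Created per call, so nothing is carried over between pages.
    $doc  = new DOMDocument('1.0', 'UTF-8');
    $feed = $doc->createElement('feed');
    $doc->appendChild($feed);

    foreach ($imageUrls as $src) {
        $item = $doc->createElement('item');
        $item->setAttribute('sdImage', $src);
        $feed->appendChild($item);
    }
    return (string) $doc->saveXML();
}

// Stand-in data: image sources already extracted from each fetched page.
$pages = [
    'show-a' => ['a1.jpg', 'a2.jpg'],
    'show-b' => ['b1.jpg'],
];

$xmlByPage = [];
foreach ($pages as $name => $images) {
    $xmlByPage[$name] = build_feed_xml($images);
}
```

With a single $doc shared across iterations, the second file would also contain a1.jpg and a2.jpg, which is the accumulation described in the EDIT.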

simple xpath query not working

This snippet of code is not working:
Notice: Trying to get property of non-object in test.php on line 13
but the XPath query seems obviously correct... and the URL provided obviously has a <p id="descrizione"> tag.
I tried to replace the query even with '//html' but no luck.
I always use xpath and this is a strange behaviour.
<?php
$_url = 'http://www.portaleaste.com/it/Aste/Detail/876989';
$ch2 = curl_init();
curl_setopt($ch2, CURLOPT_URL, $_url);
curl_setopt($ch2, CURLOPT_RETURNTRANSFER, true);
$result2 = curl_exec($ch2);
curl_close($ch2);
$doc2 = new DOMDocument();
@$doc2->load($result2);
$xpath2 = new DOMXpath($doc2);
$txt = $xpath2->query('//p[@id="descrizione"]')->item(0)->nodeValue;
echo $txt;
?>
There is nothing wrong with your xpath query as it is correct syntax and the node does exist. The problematic line is this:
@$doc2->load($result2);
// DOMDocument::load — Load XML from a file
You are not loading the result page that you got from your cURL request properly. To load the response, use this instead:
@$doc2->loadHTML($result2);
// DOMDocument::loadHTML — Load HTML from a string

SimpleXML - I/O warning : failed to load external entity

I'm trying to create a small application that will simply read an RSS feed and then layout the info on the page.
All the instructions I find make this seem simplistic but for some reason it just isn't working. I have the following
include_once(ABSPATH.WPINC.'/rss.php');
$feed = file_get_contents('http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int');
$items = simplexml_load_file($feed);
That's it, it then breaks on the third line with the following error
Error: [2] simplexml_load_file() [function.simplexml-load-file]: I/O warning : failed to load external entity "<?xml version="1.0" encoding="UTF-8"?> <?xm
The rest of the XML file is shown.
I have turned on allow_url_fopen and allow_url_include in my settings but still nothing.
I've tried multiple feeds that all end up with the same result?
I'm going mad here
simplexml_load_file() interprets an XML file (either a file on your disk or a URL) into an object. What you have in $feed is a string.
You have two options:
Use file_get_contents() to get the XML feed as a string, and use simplexml_load_string():
$feed = file_get_contents('...');
$items = simplexml_load_string($feed);
Load the XML feed directly using simplexml_load_file():
$items = simplexml_load_file('...');
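To illustrate the string variant, here is a sketch that parses an inline feed, assuming the simplexml extension. The sample XML and its titles are made up; in practice the string would come from file_get_contents() or cURL.

```php
<?php
// Stand-in for the downloaded feed body (hypothetical content).
$feedXml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel>
<item><title>First</title><link>http://example.com/1</link></item>
<item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>
XML;

// A string goes to simplexml_load_string(); a path or URL goes to
// simplexml_load_file(). Both return false when parsing fails.
$rss = simplexml_load_string($feedXml);

$titles = [];
if ($rss !== false) {
    // foreach visits exactly the items that exist, with no fixed count.
    foreach ($rss->channel->item as $item) {
        $titles[] = (string) $item->title;
    }
}
```

Checking for false before walking the feed avoids follow-up notices when the download or the parse fails.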
You can also load the content with cURL, if file_get_contents() isn't enabled on your server.
Example:
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,"http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int");
curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);
$output = curl_exec($ch);
curl_close($ch);
$items = simplexml_load_string($output);
this also works:
$url = "http://www.some-url";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xmlresponse = curl_exec($ch);
$xml=simplexml_load_string($xmlresponse);
Then I just run a for loop to grab the values from the nodes, like this:
for($i = 0; $i < 20; $i++) {
$title = $xml->channel->item[$i]->title;
$link = $xml->channel->item[$i]->link;
$desc = $xml->channel->item[$i]->description;
$html .="<div><h3>$title</h3>$link<br />$desc</div><hr>";
}
echo $html;
Note that your node names will differ, your HTML might be structured differently, and your loop might need a higher or lower number of results.
$url = 'http://legis.senado.leg.br/dadosabertos/materia/tramitando';
$xml = file_get_contents("xml->{$url}");
$xml = simplexml_load_file($url);

Get div and the correct close tag preg

preg has always been a tool I like, but I can't figure out for the life of me whether what I want to do is possible, let alone how to do it.
I want preg_match to return a div's innerHTML. The problem is that the div I'm trying to read has more divs inside it, and my pattern keeps closing on the first closing tag it finds.
Here is my Actual code
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
preg_match('% <div id="torrent_details">(.*)</div> %six', $data, $match);
print_r($match);
This has been updated for TomcatExodus's help
Live at :: http://megatorrentz.com/beta/details.php?hash=98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6
<?php
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($data);
libxml_use_internal_errors(false);
$div = $domd->getElementById("torrent_details");
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
Using regular expressions often leads to problems when parsing markup documents.
XPath version - independent of the source layout. The only thing you need is a div with that id.
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$xp = new domxpath($dom);
$result = $xp->query("//*[@id = 'torrent_details']");
$div=$result->item(0);
if($result->length){
$out =new DOMDocument();
$out->appendChild($out->importNode($div, true));
echo $out->saveHTML();
}else{
echo "No such id";
}
?>
And this is the fix for Maerlyn's solution. It didn't work because getElementById() wants a DTD with the id attribute specified. I mean, you can always build a document with "apple" as the record id, so you need something that says "id" is really the id for this tag.
<?php
$domd = new DOMDocument();
$domd->validateOnParse = true;
@$domd->loadHTML($data);
//this doesn't work as the DTD is not specified
//or the specified id attribute is not the attributed called "id"
//$div = $domd->getElementById("torrent_details");
/*
* workaround found here: https://fosswiki.liip.ch/display/BLOG/GetElementById+Pitfalls
* set the "id" attribute as the real id
*/
$elements = $domd->getElementsByTagName('div');
if (!is_null($elements)) {
foreach ($elements as $element) {
//try-catch needed because of elements with no id
try{
$element->setIdAttribute('id', true);
}catch(Exception $e){}
}
}
//now it works
$div = $domd->getElementById("torrent_details");
//Print its content or error
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
?>
Both of the solutions work for me.
You can do this:
/<div[^>]*>(.*)<\/div>/i
Which would give you the largest possible innerHTML.
You cannot. I will not link to the famous question, because I dislike the pointless drivel on top. But still regular expressions are unfit to match nested structures.
You can use some trickery, but this is neither reliable, nor necessarily fast:
preg_match_all('#<div id="1">((<div>.*?</div>|.)*?)</div>#ims'
Your regex had a problem due to the /x flag not matching the opening div. And you used a wrong assertion notation.
preg_match_all('% <div \s+ id="torrent_details">(?<innerHtml>.*)</div> %six', $html, $match);
echo $match['innerHtml'];
That one will work, but you should only need preg_match, not preg_match_all: if the pages are written well, there should only be one instance of id="torrent_details" on the given page.
I'm retracting my answer. This will not work properly. Use DOM for navigating the document.
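A sketch of the DOM route for the nested case, assuming the dom extension. inner_html_by_id() and the sample markup are hypothetical. Because the parser tracks nesting, the inner closing tags cannot end the match early the way they do with a regex.

```php
<?php
// Hypothetical helper: returns the inner HTML of the element with the
// given id, or null when there is no such element.
// Assumes $id contains no double quotes.
function inner_html_by_id(string $html, string $id): ?string
{
    if (trim($html) === '') {
        return null;
    }
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Serialize each child in turn; together they form the inner HTML.
    $inner = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return $inner;
}

// Sample markup with nested divs inside the target element.
$sample = '<div id="torrent_details"><div>a<div>b</div></div><p>c</p></div>'
        . '<div>outside</div>';
$inner = inner_html_by_id($sample, 'torrent_details');
```

The serializer may insert line breaks between block elements, so the result is best compared after normalizing whitespace rather than byte for byte.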
Haha, did it with a bit of tampering. Thanks for the DOMDocument idea; I just had to use SimpleXML on top of it:
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$doc = new DOMDocument();
libxml_use_internal_errors(false);
$doc->strictErrorChecking = FALSE;
libxml_use_internal_errors(true);
$doc->loadHTML($data);
$xml = simplexml_import_dom($doc);
print_r($xml->body->table->tr->td->table[2]->tr->td[0]->span[0]->div);
