I'm trying to get the "link" elements from certain webpages, but I can't figure out what I'm doing wrong. I'm getting the following error:
Severity: Warning
Message: DOMDocument::loadHTML() [domdocument.loadhtml]:
htmlParseEntityRef: no name in Entity, line: 536
Filename: controllers/test.php
Line Number: 34
Line 34 is the following in the code:
$dom->loadHTML($html);
My code:
$url = "http://www.amazon.com/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
if($html = curl_exec($ch)){
// parse the html into a DOMDocument
$dom = new DOMDocument();
$dom->recover = true;
$dom->strictErrorChecking = false;
$dom->loadHTML($html);
$hrefs = $dom->getElementsByTagName('a');
echo "<pre>";
print_r($hrefs);
echo "</pre>";
curl_close($ch);
}else{
echo "The website could not be reached.";
}
It means some of the HTML code is invalid.
This is just a warning, not an error; your script will still run. To suppress the warnings, set
libxml_use_internal_errors(true);
Or you could just completely suppress the warning for that call with PHP's @ error-control operator:
@$dom->loadHTML($html);
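For example, a minimal sketch of the libxml approach, buffering the parse warnings so you can inspect or discard them:
libxml_use_internal_errors(true); // buffer libxml warnings instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach (libxml_get_errors() as $error) {
    // inspect or log $error->message here if needed
}
libxml_clear_errors(); // free the buffered errors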
This may be caused by a rogue & symbol that is immediately followed by a proper tag; otherwise you would receive an "expecting ';'" error instead. See: Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity.
The solution is to replace the bare & symbol with the &amp; entity,
or, if you must keep that & as it is, you could enclose the text in a <![CDATA[ ... ]]> section.
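As a sketch, bare ampersands that are not already part of an entity can be escaped before parsing; the regex below is my own illustration, not part of the original answer:
// escape & characters that do not start a named or numeric entity
$html = preg_replace('/&(?![a-zA-Z][a-zA-Z0-9]*;|#[0-9]+;|#x[0-9a-fA-F]+;)/', '&amp;', $html);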
The HTML is poorly formed. If it is malformed badly enough, loading it into the DOMDocument may fail outright, and if loadHTML() is not working then suppressing the errors is pointless. I suggest using a tool like HTML Tidy to clean up the poorly formed HTML if you are unable to load it into the DOM.
HTML Tidy can be found here http://www.htacg.org/tidy-html5/
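If the PHP tidy extension happens to be installed, a minimal sketch of that cleanup step (the config options here are assumptions) could look like:
if (extension_loaded('tidy')) {
    $tidy = new tidy();
    // repairString() returns a cleaned-up copy of the markup
    $html = $tidy->repairString($html, array('output-xhtml' => true), 'utf8');
}
$dom = new DOMDocument();
$dom->loadHTML($html);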
Related
I'm trying to create a small application that will simply read an RSS feed and then lay out the info on the page.
All the instructions I find make this seem simple, but for some reason it just isn't working. I have the following:
include_once(ABSPATH.WPINC.'/rss.php');
$feed = file_get_contents('http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int');
$items = simplexml_load_file($feed);
That's it; it then breaks on the third line with the following error:
Error: [2] simplexml_load_file() [function.simplexml-load-file]: I/O warning : failed to load external entity "<?xml version="1.0" encoding="UTF-8"?> <?xm
The rest of the XML file is shown.
I have turned on allow_url_fopen and allow_url_include in my settings but still nothing.
I've tried multiple feeds, and they all end up with the same result.
I'm going mad here.
simplexml_load_file() interprets an XML file (either a file on your disk or a URL) into an object. What you have in $feed is a string.
You have two options:
Use file_get_contents() to get the XML feed as a string, then parse it with simplexml_load_string():
$feed = file_get_contents('...');
$items = simplexml_load_string($feed);
Load the XML feed directly using simplexml_load_file():
$items = simplexml_load_file('...');
You can also load the content with cURL, if file_get_contents() isn't enabled on your server.
Example:
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,"http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int");
curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);
$output = curl_exec($ch);
curl_close($ch);
$items = simplexml_load_string($output);
This also works:
$url = "http://www.some-url";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xmlresponse = curl_exec($ch);
$xml=simplexml_load_string($xmlresponse);
Then I just run a for loop to grab the stuff from the nodes, like this:
$html = ''; // initialize before concatenating in the loop
for ($i = 0; $i < 20; $i++) {
    $title = $xml->channel->item[$i]->title;
    $link = $xml->channel->item[$i]->link;
    $desc = $xml->channel->item[$i]->description;
    $html .= "<div><h3>$title</h3>$link<br />$desc</div><hr>";
}
echo $html;
Note that your node names will differ, obviously, and your HTML might be structured differently; also, your loop might be set to a higher or lower number of results.
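As a sketch, you could instead iterate over however many items the feed actually contains, assuming the usual channel/item RSS structure:
$html = '';
foreach ($xml->channel->item as $item) {
    $html .= "<div><h3>{$item->title}</h3>{$item->link}<br />{$item->description}</div><hr>";
}
echo $html;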
$url = 'http://legis.senado.leg.br/dadosabertos/materia/tramitando';
$xml = simplexml_load_file($url);
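To see what structure simplexml_load_file() actually returned before writing any loop, a quick dump helps:
echo '<pre>';
print_r($xml);
echo '</pre>';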
I'm attempting to gather text from a webpage using PHP, so that when the text on that website is updated, it's also automatically updated.
Take the site http://www.roblox.com/CW-Ultimate-Amethyst-Addiction-item?id=188004500, for example: inside the element with class robux-text there's a figure saying R$ 20,003, and my aim is to get that text from Roblox onto my site.
I have attempted this using the code below, but to no avail; I'm being presented with the following errors:
Warning: file_get_contents(): php_network_getaddresses: getaddrinfo
failed: Temporary failure in name resolution in
/home/public_html/index.php on line 9
Warning:
file_get_contents(http://www.roblox.com/CW-Ultimate-Amethyst-Addiction-item?id=188004500):
failed to open stream: php_network_getaddresses: getaddrinfo failed:
Temporary failure in name resolution in /home/public_html/index.php on
line 9
Warning: DOMDocument::loadHTML(): Empty string supplied as input in /home/public_html/index.php on line 11
<?php
$html = file_get_contents("http://www.roblox.com/CW-Ultimate-Amethyst-Addiction-item?id=188004500");
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$finder = new DomXPath($DOM);
$classname = 'robux-text';
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $node) {
echo $node->nodeValue;
}
?>
It seems that allow_url_fopen is disabled on your system (php.ini); that's why you're getting the error.
Try it with curl:
<?php
libxml_use_internal_errors(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.roblox.com/CW-Ultimate-Amethyst-Addiction-item?id=188004500");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$finder = new DomXPath($DOM);
$classname = 'robux-text';
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $node) {
echo $node->nodeValue;
}
?>
You can get the HTML content of a URL easily via cURL; you just have to set the CURLOPT_RETURNTRANSFER option to true.
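As a sketch, that pattern can be wrapped in a small helper; the function name fetch_html is my own, not part of the original answer:
function fetch_html($url) {
    $ch = curl_init($url);
    // return the transfer as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$html = fetch_html("http://www.roblox.com/CW-Ultimate-Amethyst-Addiction-item?id=188004500");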
This snippet of code is not working:
Notice: Trying to get property of non-object in test.php on line 13
but the XPath query seems obviously correct, and the page at the URL provided obviously has a <p> tag with that id.
I tried to replace the query even with '//html' but no luck.
I always use xpath and this is a strange behaviour.
<?php
$_url = 'http://www.portaleaste.com/it/Aste/Detail/876989';
$ch2 = curl_init();
curl_setopt($ch2, CURLOPT_URL, $_url);
curl_setopt($ch2, CURLOPT_RETURNTRANSFER, true);
$result2 = curl_exec($ch2);
curl_close($ch2);
$doc2 = new DOMDocument();
@$doc2->load($result2);
$xpath2 = new DOMXpath($doc2);
$txt = $xpath2->query('//p[@id="descrizione"]')->item(0)->nodeValue;
echo $txt;
?>
There is nothing wrong with your XPath query; the syntax is correct and the node does exist. The problematic line is this:
@$doc2->load($result2);
// DOMDocument::load — Load XML from a file
You are not properly loading the result page that you got from your cURL request. To load the response, use this instead:
@$doc2->loadHTML($result2);
// DOMDocument::loadHTML — Load HTML from a string
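Putting it together, the corrected script is the same as in the question, with load() swapped for loadHTML():
<?php
$_url = 'http://www.portaleaste.com/it/Aste/Detail/876989';
$ch2 = curl_init();
curl_setopt($ch2, CURLOPT_URL, $_url);
curl_setopt($ch2, CURLOPT_RETURNTRANSFER, true);
$result2 = curl_exec($ch2);
curl_close($ch2);
$doc2 = new DOMDocument();
@$doc2->loadHTML($result2);
$xpath2 = new DOMXpath($doc2);
$txt = $xpath2->query('//p[@id="descrizione"]')->item(0)->nodeValue;
echo $txt;
?>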
I have a PHP script that returns the links on a webpage. I am getting a 500 internal error, and this is what my server logs say. I let my friend try the same code on his server and it seems to run correctly. Can someone help me debug my problem? The warning says something about the wrapper being disabled. I checked line 1081 but I do not see allow_url_fopen there.
PHP Warning: file_get_contents(): http:// wrapper is disabled in the server configuration by allow_url_fopen=0 in /hermes/bosweb/web066/b669/ipg.streamversetv/simple_html_dom.php on line 1081
PHP Warning: file_get_contents(http://www.dota2lounge.com/): failed to open stream: no suitable wrapper could be found in /hermes/bosweb/web066/b669/ipg.streamversetv/simple_html_dom.php on line 1081
PHP Fatal error: Call to a member function find() on a non-object in /hermes/bosweb/web066/b669/ipg.streamversetv/sim
<?php
include_once('simple_html_dom.php');
$target_url = 'http://www.dota2lounge.com/';
$html = new simple_html_dom();
$html->load_file($target_url);
foreach($html->find(a) as $link){
echo $link->href.'<br />';
}
?>
Download the latest simple_html_dom.php: LINK TO DOWNLOAD
Open simple_html_dom.php in your favourite editor and add this code near the top of the file (it can be added right after <?php):
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
Find the line starting with function file_get_html($url... (for me it is line 71, but you can search for file_get_html in your editor as well).
Edit this line (a few lines below the function file_get_html declaration):
$contents = file_get_contents($url, $use_include_path, $context, $offset);
to this:
$contents = file_get_contents_curl($url);
Instead of load_file(), use file_get_html() and it will work for you without editing php.ini.
You need to set the allow_url_fopen PHP setting to 1 to allow using fopen() with URLs.
Reference: PHP: Runtime Configuration
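Note that allow_url_fopen is a PHP_INI_SYSTEM setting, so it cannot be changed at runtime with ini_set(); it has to be enabled in the server configuration, for example:
; php.ini
allow_url_fopen = On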
Edit:
I also tracked down another thing; have you tried loading it this way?
<?php
include_once('simple_html_dom.php');
$html = file_get_html('http://www.dota2lounge.com/');
foreach($html->find('a') as $link)
{
echo $link->href.'<br />';
}
?>
I have this code that will retrieve every link in $curl_scraped_page:
require_once ('simple_html_dom.php');
$des_array = array();
$url = 'http://citeseerx.ist.psu.edu/search?q=mean&t=doc&sort=rlv';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
$html = new simple_html_dom();
$html->load($curl_scraped_page);
Then I want to get the abstract for each link I scraped (from the page that link points to). I also get other things like the title, description, and so on, but the problem only lies with this abstract:
foreach ($html->find('div.result h3 a') as $des) {
$des2 = 'http://citeseerx.ist.psu.edu' . $des->href;
$ch = curl_init($des2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page2 = curl_exec($ch);
libxml_use_internal_errors(true);
$dom = new DomDocument();
$dom->loadHtml($curl_scraped_page2);//line 72
libxml_use_internal_errors(false);
$xpath2 = new DomXPath($dom);
$thing = $xpath2->query('//p[preceding::h3[preceding::div]]')->item(1)->textContent; //line 75
array_push($des_array, $thing);
}
curl_close ($ch);
This is the display code:
for ($i = 0; $i < 10; $i++) {
echo $des_array[$i];
}
When I checked it in my browser, it gave me this, three times:
Warning: DOMDocument::loadHTML(): Empty string supplied as input in C:\xampp\htdocs\MSP\Citeseerx.php on line 72
Notice: Trying to get property of non-object in C:\xampp\htdocs\MSP\Citeseerx.php on line 75
I realised I had pushed an empty string to $des_array, so I tried this:
if (empty($thing)){
array_push($des_array,'');
}
else{
array_push($des_array, $thing);
}
And this: if ($thing!=''){..}.
It still gave me that error.
What should I do?
Thanks..
curl_exec() may return false. In that case, check what the error is with curl_error(). For example, if the href attribute does not begin with /, you will pass an invalid URL to curl_init(). You can also use curl_getinfo() to get more information about the server response.
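A minimal sketch of that check, placed inside the foreach loop from the question:
$curl_scraped_page2 = curl_exec($ch);
if ($curl_scraped_page2 === false) {
    // report the failure and skip this link instead of parsing an empty string
    echo 'cURL error: ' . curl_error($ch) . '<br />';
    echo 'HTTP status: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . '<br />';
    continue;
}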
Actually, $curl_scraped_page only contains the page as a string because CURLOPT_RETURNTRANSFER is set to true; without that option, curl_exec() writes the transfer straight to the output (or to a file handle set with CURLOPT_FILE) and returns a boolean, so there would be no string to pass to load().