I'm attempting to pull text from a webpage using PHP, so that when the text on that website is updated, my site updates automatically.
Take the site http://www.roblox.com/CW-Ultimate-Amethyst-Addiction-item?id=188004500 for example: inside the element with class robux-text there's a figure saying R$ 20,003. My aim is to get that text from Roblox onto my site.
I have attempted this using the code below, but to no avail; I'm presented with the following errors:
Warning: file_get_contents(): php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution in /home/public_html/index.php on line 9
Warning: file_get_contents(http://www.roblox.com/CW-Ultimate-Amethyst-Addiction-item?id=188004500): failed to open stream: php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution in /home/public_html/index.php on line 9
Warning: DOMDocument::loadHTML(): Empty string supplied as input in /home/public_html/index.php on line 11
<?php
$html = file_get_contents("http://www.roblox.com/CW-Ultimate-Amethyst-Addiction-item?id=188004500");
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$finder = new DOMXPath($DOM);
$classname = 'robux-text';
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $node) {
    echo $node->nodeValue;
}
?>
It seems that allow_url_fopen is disabled on your system (php.ini); that's why you're getting the error. (Note that the getaddrinfo failure can also mean the server simply cannot resolve the hostname, i.e. a DNS problem.)
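You can check the setting from PHP itself; a one-line sketch using ini_get():
var_dump(ini_get('allow_url_fopen')); // "1" when URL wrappers are enabled for file_get_contents()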
Try it with curl:
<?php
libxml_use_internal_errors(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.roblox.com/CW-Ultimate-Amethyst-Addiction-item?id=188004500");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$DOM = new DOMDocument();
$DOM->loadHTML($html);
$finder = new DOMXPath($DOM);
$classname = 'robux-text';
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $node) {
    echo $node->nodeValue;
}
?>
You can get the HTML content of a URL easily via curl; you just have to set the CURLOPT_RETURNTRANSFER option to true so that curl_exec() returns the page as a string instead of printing it.
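If you fetch pages like this in more than one place, it can help to wrap the pattern in a small helper that also checks for transport errors. A minimal sketch using only standard curl calls; the function name fetchHtml is just for illustration:
<?php
// Hypothetical helper: fetch a URL with curl, returning the body or null on failure.
function fetchHtml($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of echoing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    $html = curl_exec($ch);
    if ($html === false) {
        error_log('curl error: ' . curl_error($ch)); // e.g. DNS failure, timeout
        $html = null;
    }
    curl_close($ch);
    return $html;
}
?>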
Related
I want to grab all links from a URL, but I want them shown as XML.
For example, I want to take all links from this URL: http://www.example.com/xxxx/
I want it to print like this:
<a href="http://www.example.com/yyyy/">another</a>xxx
Here is my code, but I got this error:
Fatal error: Uncaught TypeError: Argument 1 passed to DOMDocument::saveXML() must be an instance of DOMNode or null, string given in C:\xampp\htdocs\sh\index.php:18
Stack trace: #0 C:\xampp\htdocs\sh\index.php(18): DOMDocument->saveXML('/') #1 {main} thrown in C:\xampp\htdocs\sh\index.php on line 18
$url = "http://www.example.com/xxxx/";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $link) {
    $short_link = $link->getAttribute('href');
    // line 18: getAttribute() returns a string, but saveXML() expects a DOMNode
    echo $short_link1 = $dom->saveXML($short_link);
    echo "<br />";
}
Use DOMXPath to retrieve all links:
$links = $xpath->query("//a/@href");
Then loop through the links and serialize each one's HTML:
$dom->saveHTML($link)
Full code:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a/@href");
foreach ($links as $link) {
    echo $dom->saveHTML($link);
    echo "<br />";
}
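Note that //a/@href returns attribute nodes, so what saveHTML($link) prints depends on how the attribute node is serialized. If you only want the raw URL string, reading nodeValue is unambiguous; a small sketch:
foreach ($xpath->query("//a/@href") as $link) {
    echo $link->nodeValue; // just the URL, e.g. http://www.example.com/yyyy/
    echo "<br />";
}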
The call to getAttribute() will return the attribute's value as a string. So if you just want the href, then
$short_link = $link->getAttribute('href');
echo $short_link;
with...
<a href="http://www.example.com/yyyy/">another</a>xxx
will give you http://www.example.com/yyyy/
If you want the anchor tag itself...
foreach($dom->getElementsByTagName('a') as $link) {
echo $dom->saveXML($link);
}
Will give
<a href="http://www.example.com/yyyy/">another</a>
Hi people. I usually find my answers by searching the web and Stack Overflow, but this time I couldn't resolve my issue.
I'm using PHP DOM to parse a website and extract some data from it, but for some reason every approach I try keeps returning fewer items than are on the page.
I tried "PHP Simple HTML DOM", "PHP Advanced HTML DOM" and the native PHP DOM, but I still get, in this case, 14 article tags.
http://www.emol.com/movil/nacional/
On this site there are 28 elements tagged "article", but I always get 14 (or fewer).
I tried the classic find() (from Simple and Advanced) with all possible combinations, and, with the native DOM, XPath queries and getElementsByTagName:
$xpath->query('//article');
$xpath->query('//*[@id="listNews"]/article[6]'); // even this doesn't work
$html->find('article:not(.sec_mas_vistas_emol), article'); // returns 14
So my guess was that it's the way I'm loading the URL, so I tried the classic file_get_html(), curl, and some custom functions, and they all behave the same.
Stranger still, if I use an online XPath tester, paste in all the HTML and run query('//article'), it finds all of them.
These are my two last tests:
// Way 1
$html = file_get_html('http://www.emol.com/movil/nacional/');
$lidata = $html->find('article');
// Way 2
$url = 'http://www.emol.com/movil/nacional';
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$e = curl_exec($ch);
$dom = new DOMDocument;
@$dom->loadHTML($e); // tried loadHTMLFile too, and libxml_use_internal_errors
$xpath = new DOMXPath($dom);
$xpath->query('//article');
Any suggestion on what the issue could be and how to fix it? This is actually my first incursion into PHP DOM, so it's possible I'm missing something.
Maybe my comment above and this example can help you proceed. The extra articles are injected by JavaScript after the initial page load, so a plain HTTP fetch only sees the server-rendered ones; a headless browser that executes the scripts sees them all.
With the phpcasperjs wrapper:
<?php
require_once 'vendor/autoload.php';
use Browser\Casper;

$casper = new Casper();
$casper->start('http://www.emol.com/movil/nacional/');
$casper->wait(5000);
$output = $casper->getOutput();
$casper->run();
$html = $casper->getHtml();

$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$cnt = 1;
foreach ($xpath->query('//article') as $article) {
    print $cnt . ' - ' . $article->nodeName . ' - ' . $article->getAttribute('id') . "\n";
    $cnt += 1;
}
With file_get_contents as you tried before:
<?php
$html = file_get_contents('http://www.emol.com/movil/nacional/');
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$cnt = 1;
foreach ($xpath->query('//article') as $article) {
    print $cnt . ' - ' . $article->nodeName . ' - ' . $article->getAttribute('id') . "\n";
    $cnt += 1;
}
Counts 30 (with phpcasperjs) vs 14 (with file_get_contents).
I want to get links from an RSS URL. This is my code:
$doc = new DOMDocument();
$doc->load("http://www.alef.ir/rssdx.gmyefy,ggeltshmci.62ay2x.y.xml");
$arrFeeds = array();
foreach ($doc->getElementsByTagName('item') as $node) {
    $title = $node->getElementsByTagName('title')->item(0)->nodeValue;
    $title = strip_tags($title);
    $link = $node->getElementsByTagName('link')->item(0)->nodeValue;
}
I've used this code for several other URLs and all of them worked, but with this one I get:
Warning: DOMDocument::load(http://www.alef.ir/rssdx.gmyefy,ggeltshmci.62ay2x.y.xml): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in /home/xxxxxxx/domains/xxxxxxx/public_html/data.php on line 14
Warning: DOMDocument::load(): I/O warning: failed to load external entity "http://www.alef.ir/rssdx.gmyefy,ggeltshmci.62ay2x.y.xml" in /home/xxxxxxx/domains/xxxxxxx/public_html/data.php on line 14
Line 14 is:
$doc->load("http://www.alef.ir/rssdx.gmyefy,ggeltshmci.62ay2x.y.xml");
Could you help me? Why does this request give me an error?
Thanks
The code above failed for me too, and it was not due to the comma, as I noted in a comment. I found that, using curl, I was able to retrieve the XML file.
$c = curl_init('http://www.alef.ir/rssdx.gmyefy,ggeltshmci.62ay2x.y.xml');
curl_setopt($c, CURLOPT_USERAGENT, 'nginx-curl-blahblahblah');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$r = curl_exec($c);
curl_close($c);

$doc = new DOMDocument();
$doc->loadXML($r);
$arrFeeds = array();
foreach ($doc->getElementsByTagName('item') as $node) {
    $title = $node->getElementsByTagName('title')->item(0)->nodeValue;
    $title = strip_tags($title);
    $link = $node->getElementsByTagName('link')->item(0)->nodeValue;
}
Add this code before loading your feed; it changes the user agent that libxml's HTTP stream wrapper sends (the server is rejecting the default one with a 403):
$opts = array(
    'http' => array(
        'user_agent' => 'PHP libxml agent',
    )
);
$context = stream_context_create($opts);
libxml_set_streams_context($context);
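Putting it together with your original load, under the assumption that the 403 is triggered only by the default user agent; a minimal sketch:
<?php
$opts = array('http' => array('user_agent' => 'PHP libxml agent'));
libxml_set_streams_context(stream_context_create($opts));

$doc = new DOMDocument();
$doc->load("http://www.alef.ir/rssdx.gmyefy,ggeltshmci.62ay2x.y.xml"); // now sent with the custom agent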
I have this code that will retrieve every link in $curl_scraped_page:
require_once ('simple_html_dom.php');
$des_array = array();
$url = 'http://citeseerx.ist.psu.edu/search?q=mean&t=doc&sort=rlv';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
$html = new simple_html_dom();
$html->load($curl_scraped_page);
Then I want to get the abstract for each link I scraped, from the page that link points to. (I also get other things like the title and description, but the problem only lies with this abstract):
foreach ($html->find('div.result h3 a') as $des) {
    $des2 = 'http://citeseerx.ist.psu.edu' . $des->href;
    $ch = curl_init($des2);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $curl_scraped_page2 = curl_exec($ch);
    libxml_use_internal_errors(true);
    $dom = new DomDocument();
    $dom->loadHtml($curl_scraped_page2); // line 72
    libxml_use_internal_errors(false);
    $xpath2 = new DomXPath($dom);
    $thing = $xpath2->query('//p[preceding::h3[preceding::div]]')->item(1)->textContent; // line 75
    array_push($des_array, $thing);
}
curl_close($ch);
This is the display code:
for ($i = 0; $i < 10; $i++) {
    echo $des_array[$i];
}
When I checked it in my browser, it gave me this, three times:
Warning: DOMDocument::loadHTML(): Empty string supplied as input in C:\xampp\htdocs\MSP\Citeseerx.php on line 72
Notice: Trying to get property of non-object in C:\xampp\htdocs\MSP\Citeseerx.php on line 75
I realised I was pushing an empty string to $des_array. So I tried this:
if (empty($thing)) {
    array_push($des_array, '');
} else {
    array_push($des_array, $thing);
}
And this: if ($thing != '') {..}.
It still gave me that error.
What should I do?
Thanks..
curl_exec() may return false; in that case, call curl_error() to see what the error was. For example, if the href attribute does not begin with /, you will pass an invalid URL to curl_init(). You can also use curl_getinfo() to get more information about the server's response.
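A minimal sketch of that kind of guard, reusing the $des2 URL from the question:
$ch = curl_init($des2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
if ($page === false) {
    echo 'curl error: ' . curl_error($ch);         // e.g. malformed URL, timeout
} else {
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE); // also check the HTTP status
    if ($code === 200 && $page !== '') {
        // only now is it safe to pass $page to DOMDocument::loadHTML()
    }
}
curl_close($ch);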
Actually, $curl_scraped_page should be a handle to an open file, not a plain variable, since you are returning the transfer as binary; it should be read into a file, as you can't pass it into a variable when it is not a string.
I'm trying to get the "link" elements from certain webpages. I can't figure out what I'm doing wrong though. I'm getting the following error:
Severity: Warning
Message: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: no name in Entity, line: 536
Filename: controllers/test.php
Line Number: 34
Line 34 is the following in the code:
$dom->loadHTML($html);
my code:
$url = "http://www.amazon.com/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
if ($html = curl_exec($ch)) {
    // parse the html into a DOMDocument
    $dom = new DOMDocument();
    $dom->recover = true;
    $dom->strictErrorChecking = false;
    $dom->loadHTML($html);
    $hrefs = $dom->getElementsByTagName('a');
    echo "<pre>";
    print_r($hrefs);
    echo "</pre>";
    curl_close($ch);
} else {
    echo "The website could not be reached.";
}
It means some of the HTML code is invalid.
This is just a warning, not an error; your script will still process the document. To suppress the warnings, set
libxml_use_internal_errors(true);
Or you could suppress the warning on that one call by doing
@$dom->loadHTML($html);
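If you would rather inspect what libxml complained about instead of discarding it, you can collect the errors; a small sketch:
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach (libxml_get_errors() as $error) {
    echo trim($error->message) . ' on line ' . $error->line . "\n"; // e.g. the htmlParseEntityRef warning
}
libxml_clear_errors();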
This may be caused by a rogue & symbol that is immediately followed by a proper tag; otherwise you would receive a "missing ;" error. See: Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity.
The solution is to replace the & symbol with &amp;,
or, if you must keep the & as it is, enclose it in a CDATA section: <![CDATA[ - ]]>
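If you prefer to repair the markup before parsing, one common approach is to escape only the bare ampersands, i.e. those not already starting an entity. A hedged sketch (the regex is an approximation, not a full HTML entity grammar):
// Escape & characters that are not already part of an entity like &amp; or &#38;
$html = preg_replace('/&(?!#?[a-zA-Z0-9]+;)/', '&amp;', $html);
$dom = new DOMDocument();
$dom->loadHTML($html);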
The HTML is poorly formed, and if it is poorly formed enough, loading it into the DOMDocument might fail outright; when loadHTML() is not working at all, suppressing the errors is pointless. I suggest using a tool like HTML Tidy to clean up the malformed HTML if you are unable to load it into the DOM.
HTML Tidy can be found here: http://www.htacg.org/tidy-html5/
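If the tidy extension is enabled in your PHP build, you can do the cleanup in-process; a minimal sketch:
// Requires ext/tidy: repair the markup before handing it to DOMDocument.
$tidy = tidy_parse_string($html);
tidy_clean_repair($tidy);
$clean = tidy_get_output($tidy);
$dom = new DOMDocument();
$dom->loadHTML($clean);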