I am trying to scrape this URL https://nrg91.gr/nrg-airplay-chart/ using simple-html-dom, but it does not seem to get the full HTML source code. This code:
include_once('simple_html_dom.php');
$html = file_get_html('https://nrg91.gr/nrg-airplay-chart');
echo $html->plaintext;
displays the content only up to the h1, just before the content I am after. And according to the simple-html-dom manual examples, this should display all links from that URL:
foreach($html->find('a') as $e)
echo $e->href . '<br>';
but it only displays the links up to the main navigation menu, not those from the main body or footer.
I also tried using prerender.com to fully render the URL before passing it to file_get_html, but the result was the same. What am I doing wrong?
That library looks like it hasn't been updated in 7 years. I'd always recommend using PHP's built-in functions:
$url = "https://nrg91.gr/nrg-airplay-chart/";
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
foreach($dom->getElementsByTagName("a") as $e) {
echo $e->getAttribute("href") . "\n";
}
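One thing to check if you still get partial markup: some hosts serve different (or truncated) HTML to PHP's default user agent. A minimal sketch, assuming the block is user-agent based, that fetches with a browser-like UA before parsing:
// Some hosts reject or degrade requests from PHP's default user agent;
// a stream context lets file_get_contents() send a browser-like one.
$context = stream_context_create([
    'http' => ['user_agent' => 'Mozilla/5.0'],
]);
$html = file_get_contents("https://nrg91.gr/nrg-airplay-chart/", false, $context);

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName("a") as $e) {
    echo $e->getAttribute("href") . "\n";
}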
Here's my super dirty approach to fetching the rank/artist/title/youtube data using both DOMDocument and SimpleXML.
The concept is to locate each "row" of data via the XPath //ul[@id="chart_ul"]/li, then use dom_import_simplexml( $set )->getNodePath() to build a new XPath to select the individual elements where the desired data is located.
$temp = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'nrg-airplay-chart.html';
if( file_exists( $temp ) === false or filemtime( $temp ) < time() - 3600 )
{
file_put_contents( $temp, $html = file_get_contents('https://nrg91.gr/nrg-airplay-chart/') );
}
else
{
$html = file_get_contents( $temp );
}
$dom = new DOMDocument();
$dom->loadHTML( $html );
$xml = simplexml_import_dom( $dom );
$array = array();
foreach( $xml->xpath('//ul[@id="chart_ul"]/li') as $index => $set )
{
$basexpath = dom_import_simplexml( $set )->getNodePath();
$array[] = array(
'ranking' => (string) $xml->xpath( $basexpath . '//span[@id="ranking"]' )[0],
'artist' => (string) $xml->xpath( $basexpath . '//p[@id="artist"]/b' )[0],
'title' => (string) $xml->xpath( $basexpath . '//p[@id="title"]' )[0],
'youtube' => (string) $xml->xpath( $basexpath . '//div[@id="media"]/a/@href' )[0],
);
}
print_r( $array );
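One caveat: each $xml->xpath( ... )[0] assumes the query matched something; if the site's markup shifts you'll get notices and empty strings. A minimal defensive sketch using a hypothetical helper (first_text is my name, not part of SimpleXML):
// Hypothetical helper: return the first XPath match as a string, or null if none.
function first_text( SimpleXMLElement $xml, string $query )
{
    $matches = $xml->xpath( $query );
    return isset( $matches[0] ) ? (string) $matches[0] : null;
}
Then each entry becomes e.g. 'artist' => first_text( $xml, $basexpath . '//p[@id="artist"]/b' ).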
Another approach you might want to try:
<?php
function get_content($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$htmlContent = curl_exec($ch); // run the request once and capture the body
curl_close($ch);
return $htmlContent;
}
$link = "https://nrg91.gr/nrg-airplay-chart/";
$xml = get_content($link);
$dom = new DOMDocument();
@$dom->loadHTML($xml); // suppress warnings from imperfect real-world HTML
$xpath = new DOMXPath($dom);
foreach($xpath->query('//li[contains(@id,"wprs_chart-")]') as $items){
$artist = $xpath->query('.//p[@id="artist"]/b',$items)->item(0)->nodeValue;
$title = $xpath->query('.//p[@id="title"]',$items)->item(0)->nodeValue;
echo "{$artist} -- {$title}<br>";
}
?>
You should get output like:
PORTOGAL THE MAN -- Feel It Still
JAX JONEW Feat INA WROLDSEN -- Breathe
CAMILA CABELLO -- Havana
CARBI B, J BALVIN & BAD BUNNY -- I Like It
ZAYN Feat SIA -- Dusk Till Dawn
I have a simple script that until yesterday had worked fine for 2 years. I'm just taking an XML feed from a WP site and formatting it to be displayed on a different website. Here is the code:
<?php
function download_page($path){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$path);
curl_setopt($ch, CURLOPT_FAILONERROR,1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$retValue = curl_exec($ch);
curl_close($ch);
return $retValue;
}
$sXML = download_page('https://example.com/tradeblog/feed/atom/');
$oXML = new SimpleXMLElement($sXML);
$items = $oXML->entry;
$i = 0;
foreach($items as $item) {
$title = $item->title;
$link = $item->link;
echo '<li>';
foreach($link as $links) {
$loc = $links['href'];
$href = str_replace("/feed/atom/", "", $loc);
echo "<a href=\"$href\" target=\"_blank\">";
}
echo $title;
echo "</a>";;
echo "</li>";
if(++$i == 3) break;
}
?>
I can echo out $sXML and it will display the entire XML contents as expected. When I try to echo $oXML I get a 500 error. Any use of $oXML causes the 500. What changed? Is there a different/better way to do this using PHP?
It seems your XML source is not quite valid XML. I tried to validate it using the W3Schools validator and it throws an error. Tried a second validator too, and got the same error.
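If you want to see the exact parse failure from PHP rather than an online validator, libxml can report it; a minimal sketch, reusing $sXML from the question:
libxml_use_internal_errors(true);
$oXML = simplexml_load_string($sXML);
if ($oXML === false) {
    // print every parse error libxml collected while reading the feed
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message) . ' on line ' . $error->line . "\n";
    }
    libxml_clear_errors();
}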
Not sure why, but this worked (note it reads the site's rss2 feed, which parses cleanly, instead of the atom one):
<?php
$rss = new DOMDocument();
$rss->load('https://example.com/tradeblog/feed/rss2/');
$feed = array();
foreach ($rss->getElementsByTagName('item') as $node) {
$item = array (
'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
'link' => $node->getElementsByTagName('link')->item(0)->nodeValue,
);
array_push($feed, $item);
}
$limit = 3;
for($x=0;$x<$limit;$x++) {
$title = str_replace(' & ', ' &amp; ', $feed[$x]['title']); // re-escape bare ampersands for HTML output
$link = $feed[$x]['link'];
echo '<li><a href="'.$link.'">'.$title.'</a></li>';
}
?>
How can I get several values from a website using PHP (the values between div tags: value1, value2, value3 in the example below)?
I have been looking into DOMDocument, but getting confused.
Also, will it be possible to get the values without loading the website 3 times?
Example.
I need to get 3 values (or more) from a website:
<div class="SomeUniqueClassName">value1</div>
<div class="AnotherUniqueClassName">value2</div>
<div class="UniqueClassName">value3</div>
This is what I have now, but it looks stupid and I'm not 100% sure what I'm doing:
$doc = new DOMDocument;
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
$query1 = "//div[@class='SomeUniqueClassName']";
$query2 = "//div[@class='AnotherUniqueClassName']";
$query3 = "//div[@class='UniqueClassName']";
$entry1 = $xpath->query($query1);
$value1 = var_dump($entry1->item(0)->textContent);
$entry2 = $xpath->query($query2);
$value2 = var_dump($entry2->item(0)->textContent);
$entry3 = $xpath->query($query3);
$value3 = var_dump($entry3->item(0)->textContent);
You should use cURL for this:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL,'http://theurlhere.com');
//Optional, if the target URL use SSL
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
$parse = curl_exec($curl);
curl_close($curl);
preg_match_all('/<div class="uniqueClassName([0-9])">(.*)<\/div>/', $parse, $value);
print_r($value);
With the XPath expression you could try using the contains() function and look for the common part of the class name, if it follows your example:
$dom = new DOMDocument;
$dom->loadHTMLFile( $url );
$xp = new DOMXPath( $dom );
$query="//div[ contains( #class,'UniqueClass' ) ]";
$col=$xp->query( $query );
if( $col && $col->length > 0 ){
foreach( $col as $node ){
echo $node->item(0)->nodeValue;
}
}
Or modify the XPath expression to search for multiple conditions, like:
$query="//div[#class='UniqueClass1'] | //div[#class='UniqueClass2'] | //div[#class='UniqueClass3']";
$col=$xp->query( $query );
if( $col && $col->length > 0 ){
foreach( $col as $node ){
echo $node->item(0)->nodeValue;
}
}
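Either way, the page is fetched and parsed only once (the single loadHTMLFile call); every subsequent $xp->query() runs against the in-memory document, so there is no need to load the website three times.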
$newstring = substr_replace("http://ws.spotify.com/search/1/track?q=", $_COOKIE["word"], 39, 0);
/*$curl = curl_init($newstring);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);*/
//echo $result;
$xml = simplexml_load_file($newstring);
//print_r($xml);
$xpath = new DOMXPath($xml);
$value = $xpath->query("//track/@href");
foreach ($value as $e) {
echo $e->nodeValue;
}
This is my code. I am using spotify to supply me with an xml document. I am then trying to get the href link from all of the track tags so I can use it. Right now the print_r($xml) I have commented out works, but if I try to query and print that out it returns nothing. The exact link I am trying to get my xml from is: http://ws.spotify.com/search/1/track?q=incredible
This may not be the answer you need, because I dropped DOMXPath and am using getElementsByTagName() instead.
$url = "http://ws.spotify.com/search/1/track?q=incredible";
$xml = file_get_contents( $url );
$domDocument = new DOMDocument();
$domDocument->loadXML( $xml );
$value = $domDocument->getElementsByTagName( "track" );
foreach ( $value as $e ) {
echo $e->getAttribute( "href" )."<br>";
}
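For completeness, the DOMXPath route from the question can also work, but note two things: DOMXPath must be built from a DOMDocument, not from the SimpleXMLElement that simplexml_load_file() returns, and a bare //track won't match if the feed declares a default namespace. A namespace-agnostic sketch:
$url = "http://ws.spotify.com/search/1/track?q=incredible";
$dom = new DOMDocument();
$dom->loadXML( file_get_contents( $url ) );
$xpath = new DOMXPath( $dom );
// local-name() sidesteps any default namespace the feed may declare
foreach ( $xpath->query( '//*[local-name()="track"]/@href' ) as $attr ) {
    echo $attr->nodeValue . "<br>";
}
local-name() is a blunt instrument; registering the feed's namespace with DOMXPath::registerNamespace() and querying a prefixed //s:track is the cleaner fix.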
I have this strange problem parsing an XML document loaded in PHP via cURL. I cannot get the nodeValue containing a URL address (I'm trying to implement a simple RSS reader into my CMS). The strange thing is that it works for every node except the ones containing URL addresses and dates (<link> and <pubDate>).
Here is the code (I know it is a clumsy solution, but I'm kind of a newbie at working with the DOM and parsing XML documents).
function file_get_contents_curl($url) {
$ch = curl_init(); // initialize curl handle
curl_setopt($ch, CURLOPT_URL, $url); // set url to post to
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return into a variable
curl_setopt($ch, CURLOPT_TIMEOUT, 4); // times out after 4s
$result = curl_exec($ch); // run the whole process
curl_close($ch); // free the handle
return $result;
}
function vypis($adresa) {
$html = file_get_contents_curl($adresa);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$desc = $doc->getElementsByTagName('description');
$ctg = $doc->getElementsByTagName('category');
$pd = $doc->getElementsByTagName('pubDate');
$ab = $doc->getElementsByTagName('link');
$aut = $doc->getElementsByTagName('author');
for ($i = 1; $i < $desc->length; $i++) {
$dsc = $desc->item($i);
$titles = $nodes->item($i);
$categorys = $ctg->item($i);
$pubDates = $pd->item($i);
$links = $ab->item($i);
$autors = $aut->item($i);
$description = $dsc->nodeValue;
$title = $titles->nodeValue;
$category = $categorys->nodeValue;
$pubDate = $pubDates->nodeValue;
$link = $links->nodeValue;
$autor = $autors->nodeValue;
echo 'Title:' . $title . '<br/>';
echo 'Description:' . $description . '<br/>';
echo 'Category:' . $category . '<br/>';
echo 'Datum ' . gmdate("D, d M Y H:i:s",
strtotime($pubDate)) . " GMT" . '<br/>';
echo "Autor: $autor" . '<br/>';
echo 'Link: ' . $link . '<br/><br/>';
}
}
Can you please help me with this?
To read RSS you shouldn't use loadHTML, but loadXML. One reason your links don't show is that the <link> tag in HTML ignores its contents. See also here: http://www.w3.org/TR/html401/struct/links.html#h-12.3
Also, I find it easier to just iterate over the <item> tags and then iterate over their child nodes. Like so:
$d = new DOMDocument;
// don't show xml warnings
libxml_use_internal_errors(true);
$d->loadXML($xml_contents);
// clear xml warnings buffer
libxml_clear_errors();
$items = array();
// iterate all item tags
foreach ($d->getElementsByTagName('item') as $item) {
$item_attributes = array();
// iterate over children
foreach ($item->childNodes as $child) {
$item_attributes[$child->nodeName] = $child->nodeValue;
}
$items[] = $item_attributes;
}
var_dump($items);
I'm using DOMDocument to download an RSS feed into my PHP script, simply by:
$doc = new DOMDocument();
$doc->load($source);
I want to use cURL instead of DOMDocument's load(). How can I change those 2 lines of code to keep the rest of my script working? This is my complete script, by the way:
<?php
//PUBLIC VARS
$arrFeeds = array();
$downItems = 0;
$time_taken = 0;
//*PUBLIC VARS
function getRSS($source) {
$start = microtime(true);
ini_set('default_socket_timeout', 1);
global $arrFeeds, $downItems, $time_taken;
$arrFeeds = array();
$doc = new DOMDocument();
$doc->load($source);
foreach ($doc->getElementsByTagName('item') as $node) {
$itemRSS = array (
'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
'desc' => $node->getElementsByTagName('description')->item(0)->nodeValue,
'link' => $node->getElementsByTagName('link')->item(0)->nodeValue
);
array_push($arrFeeds, $itemRSS);
$downItems+=1;
}
$time_taken = microtime(true) - $start;
}
//getRSS("http://www.atm-mi.it/_layouts/atm/apps/PublishingRSS.aspx?web=388a6572-890f-4e0f-a3c7-a3dd463f7252&c=News%20Infomobilita");
//echo(strip_tags($arrFeeds[0]['title'])."<br><br>".$time_taken);
?>
Thanks for the help!
This ought to do it:
$ch = curl_init($source);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
$doc->loadXML($content);
Your mileage may vary, of course, and you might have to add more cURL options, but that's basic enough functionality to get it all started.
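For instance, a slightly hardened variant with a timeout, redirect handling, and an error check (the option values are illustrative, not required):
$ch = curl_init($source);
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,  // follow redirects, e.g. http -> https
    CURLOPT_TIMEOUT        => 10,    // give up after 10 seconds
));
$content = curl_exec($ch);
if ($content === false) {
    die('cURL error: ' . curl_error($ch));
}
curl_close($ch);

$doc = new DOMDocument();
$doc->loadXML($content);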
Use loadXML.
http://www.php.net/manual/en/domdocument.loadxml.php