I'm using DOMDocument to download an RSS feed into my PHP script, simply with:
$doc = new DOMDocument();
$doc->load($source);
I want to use cURL instead of DOMDocument's load(). How can I change those two lines of code so the rest of my script keeps working? This is my complete script, by the way:
<?php
//PUBLIC VARS
$arrFeeds = array();
$downItems = 0;
$time_taken = 0;
//*PUBLIC VARS

function getRSS($source) {
    $start = microtime(true);
    ini_set('default_socket_timeout', 1);
    global $arrFeeds, $downItems, $time_taken;
    $arrFeeds = array();
    $doc = new DOMDocument();
    $doc->load($source);
    foreach ($doc->getElementsByTagName('item') as $node) {
        $itemRSS = array(
            'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
            'desc'  => $node->getElementsByTagName('description')->item(0)->nodeValue,
            'link'  => $node->getElementsByTagName('link')->item(0)->nodeValue
        );
        array_push($arrFeeds, $itemRSS);
        $downItems += 1;
    }
    $time_taken = microtime(true) - $start;
}

//getRSS("http://www.atm-mi.it/_layouts/atm/apps/PublishingRSS.aspx?web=388a6572-890f-4e0f-a3c7-a3dd463f7252&c=News%20Infomobilita");
//echo(strip_tags($arrFeeds[0]['title'])."<br><br>".$time_taken);
?>
Thanks for the help!
This ought to do it:
$ch = curl_init($source);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
$doc->loadXML($content);
Your mileage may vary, of course, and you might have to add more CURL options, but that's basic enough functionality to get it all started.
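If it helps, here is a minimal sketch of the whole getRSS() function with the fetch swapped to cURL. The extra options (redirect following, timeout) and the error check are common additions I'd suggest, not requirements:
function getRSS($source) {
    global $arrFeeds, $downItems, $time_taken;
    $start = microtime(true);
    $arrFeeds = array();

    // Fetch the feed body with cURL instead of DOMDocument::load()
    $ch = curl_init($source);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);           // give up after 5 seconds
    $content = curl_exec($ch);
    curl_close($ch);

    if ($content === false) {
        return; // fetch failed; leave $arrFeeds empty
    }

    $doc = new DOMDocument();
    $doc->loadXML($content); // parse the already-downloaded XML string

    foreach ($doc->getElementsByTagName('item') as $node) {
        $arrFeeds[] = array(
            'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
            'desc'  => $node->getElementsByTagName('description')->item(0)->nodeValue,
            'link'  => $node->getElementsByTagName('link')->item(0)->nodeValue,
        );
        $downItems += 1;
    }
    $time_taken = microtime(true) - $start;
}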
Use loadXML.
http://www.php.net/manual/en/domdocument.loadxml.php
I'm a newbie trying to write a crawler to gather some stats from a forum.
Here is my code:
<?php
$ch = curl_init();
$timeout = 0; // set to zero for no timeout
curl_setopt($ch, CURLOPT_URL, 'http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($file_contents);
$xpath = new DOMXPath($dom);

$posts = $xpath->query("//div[@class='who-post']/a");
$dates = $xpath->query("//div[@class='date-post']");
$contents = $xpath->query("//div[@class='message text-enrichi-fmobile text-crop-fmobile']/p");

$i = 0;
foreach ($posts as $post) {
    $nodes = $post->childNodes;
    foreach ($nodes as $node) {
        $value = trim($node->nodeValue);
        $tab[$i]['author'] = $value;
        $i++;
    }
}
$i = 0;
foreach ($dates as $date) {
    $nodes = $date->childNodes;
    foreach ($nodes as $node) {
        $value = trim($node->nodeValue);
        $tab[$i]['date'] = $value;
        $i++;
    }
}
$i = 0;
foreach ($contents as $content) {
    $nodes = $content->childNodes;
    foreach ($nodes as $node) {
        $value = $node->nodeValue;
        echo $value;
        $tab[$i]['content'] = trim($value);
        $i++;
    }
}
?>
<h1>Participants</h1>
<pre>
<?php
print_r($tab);
?>
</pre>
As you can see, the code does not retrieve some of the content. For example, I'm trying to retrieve the posts from: http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm
The second post is a picture, and for it my code does not work.
Apart from that, I guess I made some errors; I find my code ugly.
Can you help me, please?
You could simply select the posts first, then grab each piece of data separately using:
DOMXPath::evaluate combined with normalize-space() to retrieve plain text,
DOMXPath::query combined with DOMDocument::saveHTML to retrieve the message paragraphs.
Code:
$xpath = new DOMXPath($dom);
$postsElements = $xpath->query('//*[@class="post"]');
$posts = [];
foreach ($postsElements as $postElement) {
    $author = $xpath->evaluate('normalize-space(.//*[@class="who-post"])', $postElement);
    $date = $xpath->evaluate('normalize-space(.//*[@class="date-post"])', $postElement);
    $message = '';
    foreach ($xpath->query('.//*[contains(@class, "message")]/p', $postElement) as $messageParagraphElement) {
        $message .= $dom->saveHTML($messageParagraphElement);
    }
    $posts[] = (object) compact('author', 'date', 'message');
}
print_r($posts);
Unrelated note: scraping a website's HTML is not illegal in itself, but you should refrain from displaying their data on your own app/website without their consent. Also, this might break just about anytime if they decide to alter their HTML structure/CSS class names.
I am trying to scrape this URL https://nrg91.gr/nrg-airplay-chart/ using simple-html-dom, but it does not seem to get the full HTML source code. This code:
include_once('simple_html_dom.php');
$html = file_get_html('https://nrg91.gr/nrg-airplay-chart');
echo $html->plaintext;
displays the content up to the h1, just before the content I am after. And according to the simple-html-dom manual examples, this should display all links from that URL:
foreach($html->find('a') as $e)
echo $e->href . '<br>';
but it only displays the links up to the main navigation menu, not from the main body or footer.
I also tried using prerender.com to fully render the URL before passing it to file_get_html, but the result was the same. What am I doing wrong?
That library looks like it hasn't been updated in 7 years. I'd always recommend using PHP's built-in functions:
$url = "https://nrg91.gr/nrg-airplay-chart/";
$dom = new DomDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url); // load() parses XML; loadHTMLFile() is the HTML counterpart
foreach ($dom->getElementsByTagName("a") as $e) {
    echo $e->getAttribute("href") . "\n";
}
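If the site serves a truncated page to requests without a browser-like User-Agent (one plausible cause of the missing content; the cURL answer below sets one too), the fetch can be done with cURL first. A minimal sketch, not a guaranteed fix:
$ch = curl_init("https://nrg91.gr/nrg-airplay-chart/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // some sites serve partial pages to unknown agents
$html = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html); // parse the fetched string instead of the URL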
Here's my super dirty approach to fetching the rank/artist/title/youtube data using both DOMDocument and SimpleXML.
The concept is to locate each "row" of data via the XPath //ul[@id="chart_ul"]/li, then use dom_import_simplexml( $outer )->getNodePath() to build a new XPath that selects the individual elements holding the desired data.
$temp = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'nrg-airplay-chart.html';
if( file_exists( $temp ) === false or filemtime( $temp ) < time() - 3600 )
{
    file_put_contents( $temp, $html = file_get_contents('https://nrg91.gr/nrg-airplay-chart/') );
}
else
{
    $html = file_get_contents( $temp );
}

$dom = new DOMDocument();
libxml_use_internal_errors( true ); // suppress warnings from imperfect real-world HTML
$dom->loadHTML( $html );
$xml = simplexml_import_dom( $dom );

$array = array();
foreach( $xml->xpath('//ul[@id="chart_ul"]/li') as $index => $set )
{
    $basexpath = dom_import_simplexml( $set )->getNodePath();
    $array[] = array(
        'ranking' => (string) $xml->xpath( $basexpath . '//span[@id="ranking"]' )[0],
        'artist'  => (string) $xml->xpath( $basexpath . '//p[@id="artist"]/b' )[0],
        'title'   => (string) $xml->xpath( $basexpath . '//p[@id="title"]' )[0],
        'youtube' => (string) $xml->xpath( $basexpath . '//div[@id="media"]/a/@href' )[0],
    );
}
print_r( $array );
Another approach you might want to try:
<?php
function get_content($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch); // execute once and keep the body
    curl_close($ch);
    return $htmlContent;
}

$link = "https://nrg91.gr/nrg-airplay-chart/";
$xml = get_content($link);
$dom = @DOMDocument::loadHTML($xml);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//li[contains(@id,"wprs_chart-")]') as $items){
    $artist = $xpath->query('.//p[@id="artist"]/b', $items)->item(0)->nodeValue;
    $title = $xpath->query('.//p[@id="title"]', $items)->item(0)->nodeValue;
    echo "{$artist} -- {$title}<br>";
}
?>
The output should look something like:
PORTOGAL THE MAN -- Feel It Still
JAX JONEW Feat INA WROLDSEN -- Breathe
CAMILA CABELLO -- Havana
CARBI B, J BALVIN & BAD BUNNY -- I Like It
ZAYN Feat SIA -- Dusk Till Dawn
I made code like this; how can I make it shorter? I mean, I don't want to use a foreach for every regex match. Thank you.
<?php
preg_match_all('#<article [^>]*>(.*?)<\/article>#sim', $content, $article);
foreach($article[1] as $posts) {
    preg_match_all('#<img class="images" [^>]*>#si', $posts, $matches);
    $img[] = $matches[0];
}
$result = array_filter($img);
foreach($result as $res) {
    preg_match_all('#src="(.*?)" data-highres="(.*?)"#si', $res[0], $out);
    $final[] = array(
        'src' => $proxy.base64_encode($out[1][0]),
        'highres' => $proxy.base64_encode($out[2][0])
    );
}
?>
If you want robust code (that always works), avoid parsing HTML with regex, because HTML is more complicated and unpredictable than you think. Instead, use the built-in tools available for this particular task, i.e. the DOMxxx classes.
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_use_internal_errors($state);

$xp = new DOMXPath($dom);
$imgList = $xp->query('//article//img[@src][@data-highres]');

$final = [];
foreach($imgList as $img) {
    $final[] = [
        'src' => $proxy.base64_encode($img->getAttribute('src')),
        'highres' => $proxy.base64_encode($img->getAttribute('data-highres'))
    ];
}
$newstring = substr_replace("http://ws.spotify.com/search/1/track?q=", $_COOKIE["word"], 39, 0);
/*$curl = curl_init($newstring);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);*/
//echo $result;
$xml = simplexml_load_file($newstring);
//print_r($xml);
$xpath = new DOMXPath($xml);
$value = $xpath->query("//track/@href");
foreach ($value as $e) {
    echo $e->nodevalue;
}
This is my code. I am using Spotify to supply me with an XML document. I am then trying to get the href link from all of the track tags so I can use it. Right now the print_r($xml) I have commented out works, but if I try to query and print the results, it returns nothing. The exact link I am trying to get my XML from is: http://ws.spotify.com/search/1/track?q=incredible
This may not be the answer you need, because I dropped the DOMXPath and am using getElementsByTagName() instead. (One likely reason your query returns nothing: DOMXPath's constructor expects a DOMDocument, not the SimpleXMLElement that simplexml_load_file returns.)
$url = "http://ws.spotify.com/search/1/track?q=incredible";
$xml = file_get_contents( $url );
$domDocument = new DOMDocument();
$domDocument->loadXML( $xml );
$value = $domDocument->getElementsByTagName( "track" );
foreach ( $value as $e ) {
echo $e->getAttribute( "href" )."<br>";
}
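If you would rather keep DOMXPath, here is a sketch under two assumptions: the document is loaded into a DOMDocument, and if the feed declares a default XML namespace (the old Spotify search API did), that namespace has to be registered before //track will match anything:
$url = "http://ws.spotify.com/search/1/track?q=incredible";

$dom = new DOMDocument();
$dom->load($url); // DOMXPath needs a DOMDocument, not a SimpleXMLElement

$xpath = new DOMXPath($dom);
// Read the default namespace off the root element (if any) and register it
// under a prefix, since XPath 1.0 cannot address a default namespace directly.
$ns = $dom->documentElement->lookupNamespaceUri(null);
if ($ns !== null) {
    $xpath->registerNamespace('s', $ns);
    $query = '//s:track/@href';
} else {
    $query = '//track/@href';
}

foreach ($xpath->query($query) as $attr) {
    echo $attr->nodeValue . "<br>";
}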
I am using DOMDocument() to include an RSS feed in my code. However, I get this error:
URL file-access is disabled in the server configuration
and that's because my host doesn't allow me to modify the php.ini file or to set allow_url_fopen to On.
Is there a workaround for this? This is my full code:
<?php
$rss = new DOMDocument();
$rss->load('rss.php');
$feed = array();
foreach ($rss->getElementsByTagName('item') as $node) {
    $item = array (
        'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
        'desc' => $node->getElementsByTagName('description')->item(0)->nodeValue,
        'link' => $node->getElementsByTagName('link')->item(0)->nodeValue,
        'date' => $node->getElementsByTagName('pubDate')->item(0)->nodeValue,
    );
    array_push($feed, $item);
}
$limit = 5;
echo '<table>';
for ($x = 0; $x < $limit; $x++) {
    $title = str_replace(' & ', ' &amp; ', $feed[$x]['title']);
    $link = $feed[$x]['link'];
    echo <<<EOF
<tr>
<td><b>$title</b></td>
</tr>
EOF;
}
echo '</table>';
?>
Thank you.
Okay, I solved it myself.
<?php
$k = 'rss.php';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $k);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$rss = curl_exec($ch);
curl_close($ch);
$xml = simplexml_load_string($rss, 'SimpleXMLElement', LIBXML_NOCDATA);
$feed = array();
foreach ($xml->channel->item as $item) {
    $item = array (
        'title' => $item->title,
        'desc' => $item->description,
        'link' => $item->link,
        'date' => $item->pubDate,
    );
    array_push($feed, $item);
}
$limit = 5;
echo '<table>';
for ($x = 0; $x < $limit; $x++) {
    $title = str_replace(' & ', ' &amp; ', $feed[$x]['title']);
    $link = $feed[$x]['link'];
    echo <<<EOF
<tr>
<td><b>$title</b></td>
</tr>
EOF;
}
echo '</table>';
?>
Use cURL. You really should be using it for server-to-server interactions rather than passing URLs to constructors anyway.
Here is the cURL documentation - http://us1.php.net/curl
I also have a simple cURL-based REST client you can feel free to use - https://github.com/mikecbrant/php-rest-client
Basically, all you are looking to do is use cURL to retrieve the remote content instead of trying to open it directly through the fopen wrapper. Once you have retrieved the content, you pass it in to DOMDocument.
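A minimal sketch of that idea, keeping the DOMDocument-based parsing from the question ('rss.php' is the question's placeholder; with cURL you would use the feed's absolute URL):
$ch = curl_init('rss.php'); // placeholder from the question; use the feed's full URL here
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);

$rss = new DOMDocument();
$rss->loadXML($content); // parse the string cURL fetched; no fopen wrapper involved

foreach ($rss->getElementsByTagName('item') as $node) {
    echo $node->getElementsByTagName('title')->item(0)->nodeValue . "\n";
}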