Retrieve XML from a third party page in PHP - php

I need to read in and parse data from a third-party website that sends XML data. All of this needs to be done server-side.
What is the best way to do this using PHP?

You can obtain the remote XML data with, e.g.
$xmldata = file_get_contents("http://www.example.com/xmldata");
or with curl. Then use SimpleXML, DOM, whatever.
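A minimal sketch of that fetch-then-parse flow, using file_get_contents and SimpleXML. The helper names fetch_xml() and parse_xml() and the 10-second timeout are illustrative assumptions, not part of PHP itself:

```php
<?php
// Hypothetical helper: download the raw XML body, or return null on failure.
function fetch_xml(string $url): ?string
{
    // Assumed timeout of 10 seconds so a slow third party cannot hang the page.
    $context = stream_context_create(['http' => ['timeout' => 10]]);
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Hypothetical helper: parse a string with SimpleXML, or return null if it is not well-formed.
function parse_xml(string $raw): ?SimpleXMLElement
{
    $previous = libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
    $xml = simplexml_load_string($raw);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $xml === false ? null : $xml;
}
```

Keeping the download and the parse separate makes it easier to tell a network failure from malformed XML.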

A good way of parsing XML is often to use an XPP (XML Pull Parsing) library. PHP has an implementation of it called XMLReader.
http://php.net/manual/en/class.xmlreader.php
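As a rough sketch of what pull parsing looks like, the following walks a document node by node and collects the text of each title element. The function name read_titles() and the sample XML are made up for illustration:

```php
<?php
// Collect the text of every <title> element while streaming through the document.
function read_titles(string $xmlString): array
{
    $reader = new XMLReader();
    $reader->XML($xmlString); // $reader->open($url) would read from a URL or file instead

    $titles = [];
    while ($reader->read()) {
        // Only react to opening <title> tags; everything else is skipped.
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
            $titles[] = $reader->readString();
        }
    }
    $reader->close();
    return $titles;
}

$sample = '<feed><item><title>First</title></item><item><title>Second</title></item></feed>';
print_r(read_titles($sample));
```

Because the reader only holds the current node, this style suits large feeds that would be expensive to load fully with SimpleXML or DOM.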

I would suggest using DOMDocument (a built-in PHP class).
A simple example of its power could be the following code:
/***********************************************************************************************
Takes the RSS news feeds found at $url and prints them as HTML code.
Each news is rendered in a <div class="rss"> block in the order: date + title + description.
***********************************************************************************************/
function Render($url, $max_feeds = 1000)
{
$doc = new DOMDocument();
if(@$doc->load($url, LIBXML_NOCDATA|LIBXML_NOBLANKS))
{
$feed_count = 0;
$items = $doc->getElementsByTagName("item");
//echo $items->length; //DEBUG
foreach($items as $item)
{
if($feed_count >= $max_feeds)
break;
//Unfortunately the elements inside an <item> node are not always in the same order, therefore we have to call getElementsByTagName several times
//WARNING: using iconv instead of utf8_decode because the latter did not properly convert some characters, like apostrophe 0x19, from techsport.it feeds.
$title = iconv('UTF-8', 'CP1252', $item->getElementsByTagName("title")->item(0)->firstChild->textContent); //can use "CP1252//TRANSLIT"
$description = iconv('UTF-8', 'CP1252', $item->getElementsByTagName("description")->item(0)->firstChild->textContent); //can use "CP1252//TRANSLIT"
$link = iconv('UTF-8', 'CP1252', $item->getElementsByTagName("link")->item(0)->firstChild->textContent); //can use "CP1252//TRANSLIT"
//pubDate tag is not mandatory in RSS [RSS2 spec: http://cyber.law.harvard.edu/rss/rss.html]
$pub_date = $item->getElementsByTagName("pubDate"); $date_html = "";
//play with date here if you want
echo "<div class='rss'>\n<p class='title'><a href='" . $link . "'>" . $title . "</a></p>\n<p class='description'>" . $description . "</p>\n</div>\n\n";
$feed_count++;
}
}
else
echo "<div class='rss'>Service not available.</div>";
}

I have been using simpleXML for a while.

Related

Retrieving Barometric and Other Climate Data Using simple_html_dom.php

I want to periodically (once a day or so) collect the barometric pressure reading for various USA weather stations. Using simple_html_dom.php I can scrape the entire page of this site, for example (https://www.localconditions.com/weather-alliance-nebraska/69301/). However, I don't know how to then parse this down to just the barometric pressure reading: in this case "30.26".
Here's the code that grabs all the html. Obviously the find('Barometer') element isn't working.
<?php
// example of how to use basic selector to retrieve HTML contents
include('simple_html_dom.php');
// get DOM from URL or file
$html = file_get_html('https://www.localconditions.com/weather-alliance-nebraska/69301/');
// find all span tags with class=gb1
foreach($html->find('strong') as $e)
echo $e->outertext . '<HR>';
// get an element representing the second paragraph
$element = $html->find("Barometer");
echo $e->outertext . '<br>';
// extract text from HTML
echo $html->plaintext;
?>
Any advice on how to parse this?
Thanks!
As mentioned by @bato3 in his comment, queries like this are far better handled with xpath. Unfortunately, neither DOMDocument nor simplexml (which I usually use to parse xml/html) could digest the html of this site (at least not when I tried). So we have to do it with simple_html_dom and resort to (somewhat inelegant) CSS selectors and string manipulation:
$dest = $html->find("//div[class='col-sm-6 col-md-6'] > p:has(> strong)");
foreach($dest as $e) {
$target = $e->innertext;
if (strpos($target, "Barometer")!== false){
$pressure = explode(" ", $target);
echo $pressure[2];
}
}
Output:
30.25 inHg.
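If the position of the number within the text is not stable, a regular expression may be less fragile than explode(). This is only a sketch: the helper name extract_pressure() is hypothetical, and it assumes the paragraph text has the word "Barometer" followed by the reading.

```php
<?php
// Hypothetical helper: return the first decimal number that follows the word "Barometer".
function extract_pressure(string $html): ?float
{
    $text = strip_tags($html); // drop markup such as <strong>
    if (preg_match('/Barometer\D*(\d+(?:\.\d+)?)/i', $text, $m)) {
        return (float) $m[1];
    }
    return null; // no reading found
}
```

It could be called as extract_pressure($e->innertext) inside the loop above.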

Laravel: Parsing XML with SimpleXML namespace issue [duplicate]

This question has two parts.
Part 1. Yesterday I had some code which would echo the entire content of the XML from an RSS feed. Then I deleted it from my php document, saved over it, and I am totally kicking myself.
I believe the syntax went something like this:
$xml = simplexml_load_file($url);
echo $xml;
I tried that again and it is not working, so apparently I forgot the correct syntax and could use your help, dear stackoverflow question answerers.
I keep trying to figure out what I was doing and I am unable to find an example on Google or the PHP site. I tried the print_r($url); command, and it gives me what appears to be an atomized version of the feed. I want the whole string, warts and all. I realize that I could just type the RSS link into the window and see it, but it was helpful to have it on my PHP page as I am coding and noding.
Part 2. The main reason I wanted to reconstruct this is that I am trying to parse nodes off a blog RSS feed in order to display them on a webpage hosted on a private domain. I posted a dummy blog and discovered an awkward formatting glitch when I failed to add a title to one of the dummy posts.
So what does one do in this situation? I tried a little:
if(entry->title == "")
{$entryTitle = "untitled";}
That did not work at all.
Here's my entire php script for the handling of the blog:
<?php
/*create variables*/
$subtitle ="";
$entryTitle="";
$html = "";
$pubDate ="";
/*Store RSS feed address in new variable*/
$url = "http://www.blogger.com/feeds/6552111825067891333/posts/default";
/*Retrieve BLOG XML and store it in PHP object*/
$xml = simplexml_load_file($url);
print_r($xml);
/*Parse blog subtitle into HTML and echo it on the page*/
$subtitle .= "<h2 class='blog'>" . $xml->subtitle . "</h2><br />";
echo $subtitle;
/*Go through all the entries and parse them into HTML*/
foreach($xml->entry as $entry){
/*retrieve publication date*/
$xmlDate = $entry->published;
/*Convert XML timestamp into PHP timestamp*/
$phpDate = new DateTime(substr($xmlDate,0,19));
/*Format PHP timestamp to something humans understand*/
$pubDate .= $phpDate->format('l\, F j\, Y h:i A');
if ($entry->title == "")
{
$entryTitle .= "Untitled";
}
echo $entry->title;
/*Pick through each entry and parse each XML tree node into an HTML ready blog post*/
$html .= "<h3 class='blog'>".$entry->title . "<span class='pubDate'> | " .$pubDate . "</span></h3><p class='blog'>" . $entry->content . "</p>";
/*Print the HTML to the web page*/
echo $html;
/*Set the variables back to empty strings so they do not repeat data upon reiteration*/
$html = "";
$pubDate = "";
}
?>
According to the php manual:
$xml = new SimpleXMLElement($string);
.
.
.
then if you want to echo the result:
echo $xml->asXML();
or save the xml to a file:
$xml->asXML('blog.xml');
References
http://php.net/manual/fr/simplexmlelement.asxml.php
http://spotlesswebdesign.com/blog.php?id=14
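One thing to keep in mind when echoing the result into an HTML page: the browser will treat the XML tags as markup and hide them. A small sketch of one way around that, where the function name render_raw_xml() and the sample feed are assumptions for illustration:

```php
<?php
// Return the full XML source, escaped so a browser shows the tags literally.
function render_raw_xml(SimpleXMLElement $xml): string
{
    return '<pre>' . htmlspecialchars($xml->asXML()) . '</pre>';
}

$xml = simplexml_load_string('<feed><title>Demo &amp; test</title></feed>');
echo render_raw_xml($xml);
```

By contrast, a plain echo $xml; prints only the text directly inside the root element, which for a feed is usually just whitespace.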
Part 1
This is still not exactly what I wanted, but rather a very tidy and organized way of echoing the xml data:
$url = "http://www.blogger.com/feeds/6552111825067891333/posts/default";
$xml = simplexml_load_file($url);
echo '<pre>';
print_r($xml);
Part 2
I had to get firephp running so I could see exactly what elements php was encountering when it reached an entry without a blog title. Ultimately it is an empty array. Therefore, the simple:
if(empty($entry->title))
works perfectly. For string comparison, I found that you can simply cast it as a string. For my purposes, that was unnecessary.
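For reference, the string-cast variant might look like the sketch below. The helper name entry_title() and the tiny sample feed are invented for the example, and the sample omits the Atom namespace for brevity:

```php
<?php
// Return the entry's title, or a fallback when the title is missing or blank.
function entry_title(SimpleXMLElement $entry, string $fallback = 'Untitled'): string
{
    $title = trim((string) $entry->title); // a missing <title> casts to ''
    return $title === '' ? $fallback : $title;
}

$feed = simplexml_load_string(
    '<feed><entry><title>Hello</title></entry><entry><title/></entry><entry/></feed>'
);
foreach ($feed->entry as $entry) {
    echo entry_title($entry), "\n";
}
```

Casting first also treats a title made only of whitespace as empty, which empty() alone would not.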
The simplexml_load_file function returns a SimpleXMLElement, so:
print_r($xml);
will show its minor objects and arrays.
After your tweaks you can call $xml->asXML("filename.xml"); as @Tim Withers pointed out.
Part 1: echo $xml->asXML(); - http://www.php.net/manual/en/simplexmlelement.asxml.php
Part 2: php SimpleXML check if a child exists
$html .= "<h3 class='blog'>".($entry->title!=null?$entry->title:'No Title')
. "<span class='pubDate'> | " .$pubDate . "</span></h3><p class='blog'>"
. $entry->content . "</p>";
Note I would probably load the url like this:
$feedUrl = 'http://www.blogger.com/feeds/6552111825067891333/posts/default';
$rawFeed = file_get_contents($feedUrl);
$xml = new SimpleXmlElement($rawFeed);
Based on your comment in regards to part 1, I am not sure if the XML is being loaded completely. If you try loading it this way, it should display all the XML data.

How to extract only certain tags from HTML document using PHP?

I'm using a crawler to retrieve the HTML content of certain pages on the web. I currently have the entire HTML stored in a single PHP variable:
$string = "<PRE>".htmlspecialchars($crawler->results)."</PRE>\n";
What I want to do is select all "p" tags (for example) and store their contents in an array. What is the proper way to do that?
I've tried the following, by using xpath, but it doesn't show anything (most probably because the document itself isn't an XML, I just copy-pasted the example given in its documentation).
$xml = new SimpleXMLElement ($string);
$result=$xml->xpath('/p');
while(list( , $node)=each($result)){
echo '/p: ' , $node, "\n";
}
Hopefully someone with (a lot) more experience in PHP will be able to help me out :D
Try using DOMDocument along with DOMDocument::getElementsByTagName. The workflow should be quite simple. Something like:
$doc = new DOMDocument();
@$doc->loadHTML($crawler->results); // pass the raw HTML; escaping it with htmlspecialchars first would hide the tags
$pNodes = $doc->getElementsByTagName('p');
Which will return a DOMNodeList.
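To get from the DOMNodeList to a plain array of strings, something along these lines should work. The function name extract_paragraphs() is just a placeholder, and the libxml error handling is one possible way to silence warnings from invalid real-world HTML:

```php
<?php
// Return the text content of every <p> element in an HTML string.
function extract_paragraphs(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // crawled HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        $paragraphs[] = trim($p->textContent); // textContent also includes nested tags' text
    }
    return $paragraphs;
}
```

Called as extract_paragraphs($crawler->results), it would return one array entry per paragraph.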
I vote for using a regexp. For the p tag:
preg_match_all('/<p>(.*?)<\/p>/s', '<p>foo</p><p>foo 1</p><p>foo 2</p>', $arr, PREG_PATTERN_ORDER);
// $arr[1] holds the text captured inside each <p>...</p> pair
foreach($arr[1] as $value)
{
echo $value."<br>";
}
Check out Simple HTML Dom. It will grab external pages and process them with fairly accurate detail.
http://simplehtmldom.sourceforge.net/
It can be used like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';

php and simpleXml - how to change node contents

I'm trying to change the contents of a node in an XML file using simpleXML. I know that the variable for the new node-contents contains the right stuff, but for some reason the file isn't changed when it is saved. I'm probably missing something basic, because I'm new to simpleXML. Here is the whole php script:
<?php
$doc=$_REQUEST["book"];
$div1=$_REQUEST["div1"];
$div2=$_REQUEST["div2"];
if ($div1=="") $div1=$_REQUEST["chapter"];
if ($div2=="") $div2=$_REQUEST["verse"];
$div3=$_REQUEST["div3"];
$textresponse=$_REQUEST["xmltext"];
$strippedresponse = "<?xml version='1.0'?>" . stripslashes($textresponse);
echo("Saved changes to " . $doc . " " . $div1 . "." . $div2 ."<br />");
$fileName="/home/ocp/public_html/sites/default/docs/drafts/".$doc.".xml";
$xmlDoc = simplexml_load_file($fileName);
$backupFileName="/home/ocp/public_html/sites/default/docs/backups/".$doc." ".date("Y-m-d H.i.s").".xml";
file_put_contents($backupFileName, $xmlDoc->asXML());
$backupSize = filesize($backupFileName);
echo("Backup {$backupFileName} created:".$backupSize." bytes<br />");
if ($doc) {
if ($div1) {
if ($div2) {
$newVerse = simplexml_load_string($strippedresponse);
$oldVerse = $xmlDoc->xpath("//div[@number='".$div1."']/div[@number='".$div2."']");
$oldVerse = $newVerse;
$newDoc = $xmlDoc->asXml();
file_put_contents($fileName, $newDoc);
$newSize = filesize($fileName);
echo("New file is ".$newSize." bytes <br />");
}
}
}
?>
I'll venture to say that this code certainly doesn't do what you want it to:
$newVerse = simplexml_load_string($strippedresponse);
$oldVerse = $xmlDoc->xpath("//div[@number='".$div1."']/div[@number='".$div2."']");
$oldVerse = $newVerse;
Changing the value of a PHP variable has no side effects. In other words, nothing happens when you do $a = $b; except in some specific cases, and this is not one of them.
I don't know what you really want to achieve with this code. If you want to replace the (X)HTML inside a specific <div/>, you will need to use DOM: create a DOMDocumentFragment, use appendXML() to populate it, then substitute it for your old <div/>. Either that, or create a new DOMDocument, loadXML() it, then importNode() into your old document and replaceChild() your old div.
SimpleXMLElement::xpath returns an array of SimpleXMLElement objects. Copies, not references. So $oldVerse = $newVerse; does not change $xmlDoc in any way.
SimpleXML is sufficient to read XML, for manipulation you might want to choose a more powerful alternative from http://www.php.net/manual/de/refs.xml.php, e.g. DOM.
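A minimal sketch of the DOMDocumentFragment route, assuming the document nests verse divs inside chapter divs as in the question. The function name replace_verse() and its parameters are hypothetical:

```php
<?php
// Replace the verse <div> for a given chapter/verse with new XML and return the updated document.
function replace_verse(string $xml, string $chapter, string $verse, string $newVerseXml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    // Assumes the numbers contain no quote characters; validate request input in real code.
    $nodes = $xpath->query("//div[@number='$chapter']/div[@number='$verse']");
    if ($nodes === false || $nodes->length === 0) {
        return $xml; // nothing matched, leave the document unchanged
    }

    $fragment = $doc->createDocumentFragment();
    if (!$fragment->appendXML($newVerseXml)) {
        return $xml; // the new verse was not well-formed XML
    }

    $old = $nodes->item(0);
    $old->parentNode->replaceChild($fragment, $old);

    return $doc->saveXML();
}
```

The returned string can then be written back with file_put_contents(), as the original script already does.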

PHP: using preg_replace with htmlentities

I'm writing an RSS to JSON parser and as a part of that, I need to use htmlentities() on any tag found inside the description tag. Currently, I'm trying to use preg_replace(), but I'm struggling a little with it. My current (non-working) code looks like:
$pattern[0] = "/\<description\>(.*?)\<\/description\>/is";
$replace[0] = '<description>'.htmlentities("$1").'</description>';
$rawFeed = preg_replace($pattern, $replace, $rawFeed);
If you have a more elegant solution to this as well, please share. Thanks.
Simple. Use preg_replace_callback:
function _handle_match($match)
{
return '<description>' . htmlentities($match[1]) . '</description>';
}
$pattern = "/\<description\>(.*?)\<\/description\>/is";
$rawFeed = preg_replace_callback($pattern, '_handle_match', $rawFeed);
It accepts any callback type, including class methods.
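For example, a class method can be passed as an array callable. The class name DescriptionEscaper and the sample feed string are made up for this sketch:

```php
<?php
class DescriptionEscaper
{
    // Called once per match; $match[1] is the text captured between the description tags.
    public function escape(array $match): string
    {
        return '<description>' . htmlentities($match[1]) . '</description>';
    }
}

$escaper = new DescriptionEscaper();
$feed    = '<item><description><b>Hi</b> & bye</description></item>';
$pattern = '/<description>(.*?)<\/description>/is';

// [$object, 'methodName'] is PHP's array form of a method callback.
$result = preg_replace_callback($pattern, [$escaper, 'escape'], $feed);
echo $result;
```

An anonymous function would work equally well in place of the array callable.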
The more elegant solution would be to employ SimpleXML. Or a third party library such as XML_Feed_Parser or Zend_Feed to parse the feed.
Here is a SimpleXML example:
<?php
$rss = file_get_contents('http://rss.slashdot.org/Slashdot/slashdot');
$xml = simplexml_load_string($rss);
foreach ($xml->item as $item) {
echo "{$item->description}\n\n";
}
?>
Keep in mind that RSS and RDF and Atom look different, which is why it can make sense to employ one of the above libraries I mentioned.
