XML to PHP array to mysql - php

I'm trying to import XML data from a Google XML feed using SimpleXML. An example of the XML is here:
<entry>
<id>
tag:google.com,2013:googlealerts/feed:11187837211342886856
</id>
<title type="html">
<b>London</b> Collections: Topman Design's retro mash-up
</title>
<link href="https://www.google.com/url?q=http://www.telegraph.co.uk/men/fashion-and-style/10901146/London-Collections-Topman-Designs-retro-mash-up.html&ct=ga&cd=CAIyAA&usg=AFQjCNEib0lLtkzUzFtR2Hk37wGefTVAZQ"/>
<published>2014-06-15T14:15:00Z</published>
<updated>2014-06-15T14:15:00Z</updated>
<content type="html">
Today is a very important day for England, and I'm not referring to the World Cup; it's the first day of <b>London</b> Collections: Men, a three day celebration ...
</content>
<author>
<name/>
</author>
</entry>
What would be the best solution for this? I'm confused about how to get each value as a variable to pass to MySQL.
This is exactly where I'm stuck:
$xml = simplexml_load_file("xml.xml");
$feed = simplexml_load_string($xml);
$ns=$feed->getNameSpaces(true);
foreach ($feed->entry as $entry) {
}
thank you all in advance

You can use XPath. It may be simpler than SimpleXML when you have namespaces. You will also have to register the namespace, whose declaration is not present in the feed excerpt you included as an example.
I found an arbitrary feed here: http://www.google.com/alerts/feeds/01662123773360489091/16526224428036307178
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:idx="urn:atom-extension:indexing">
<id>
tag:google.com,2005:reader/user/01662123773360489091/state/com.google/alerts/16526224428036307178
</id>
<title>Google Alert - test</title>
<link href="http://www.google.com/alerts/feeds/01662123773360489091/16526224428036307178" rel="self"/>
<updated>2014-06-15T17:30:04Z</updated>
<entry>
<id>
tag:google.com,2013:googlealerts/feed:5957360885559055905
</id>
<title type="html">
Dad's <b>Test</b> Out Products Made For the Family
</title>
<link href="https://www.google.com/url?q=http://gma.yahoo.com/video/dads-test-products-made-family-141428658.html&ct=ga&cd=CAIyAA&usg=AFQjCNHHBPoS6Poz-Y5A3vFfbsGL3fkrBA"/>
<published>2014-06-15T17:30:04Z</published>
<updated>2014-06-15T17:30:04Z</updated>
<content type="html">
Watch the video Dad's <b>Test</b> Out Products Made For the Family on Yahoo Good Morning America . Becky Worley enlists a group of fathers to see if "As ...
</content>
<author>
<name/>
</author>
</entry>
<entry>
...
I will use it to provide your answer.
In the first line there is a default namespace declaration xmlns. You have to register that in PHP to use the namespace in XPath. You should map it to a prefix (could be any one) even if there is no prefix in the original file. So this is how you would initialize the parser.
These two lines initialize the DOM parser and parse the file, loading it from the Internet:
$document = new DOMDocument();
$document->load( "http://www.google.com/alerts/feeds/01662123773360489091/16526224428036307178" );
These two initialize the XPath environment, registering the default namespace of your file with a prefix (I chose atom):
$xpath = new DOMXpath($document);
$xpath->registerNamespace("atom", "http://www.w3.org/2005/Atom");
Once that is set up, you can select the nodes using the evaluate() method, which takes an XPath expression that can be absolute or relative. To get all entry nodes, you can use an absolute expression:
$entries = $xpath->evaluate("//atom:entry");
The XPath expression is //atom:entry. It returns a set of entry nodes from the "http://www.w3.org/2005/Atom" namespace, which is what you want.
To extract the nodes and the information in the context of each entry, you can use DOM methods and properties such as firstChild, nextSibling, etc. or you can perform additional XPath contextual searches. A contextual search passes the context node as a second parameter to the evaluate() method. Here is a loop that gets the data in each child node of <entry> and places it in an HTML sublist:
$entries = $xpath->evaluate("//atom:entry");
echo '<ul>'."\n";
foreach ($entries as $entry) {
echo '<li><b>Entry ID: '.$xpath->evaluate("atom:id/text()", $entry)->item(0)->nodeValue.'</b></li>'."\n";
echo '<ul>'."\n";
echo '<li>Title: '.$xpath->evaluate("atom:title/text()", $entry)->item(0)->nodeValue.'</li>'."\n";
echo '<li>Link: '.$xpath->evaluate("atom:link/@href", $entry)->item(0)->nodeValue.'</li>'."\n";
echo '<li>Published: '.$xpath->evaluate("atom:published/text()", $entry)->item(0)->nodeValue.'</li>'."\n";
echo '<li>Updated: '.$xpath->evaluate("atom:updated/text()", $entry)->item(0)->nodeValue.'</li>'."\n";
echo '<li>Content: '.$xpath->evaluate("atom:content/text()", $entry)->item(0)->nodeValue.'</li>'."\n";
echo '<li>Author: '.$xpath->evaluate("atom:author/atom:name/text()", $entry)->item(0)->nodeValue.'</li>'."\n";
echo '</ul>'."\n";
}
echo '</ul>'."\n";
Note that the expressions are relative to entry (they don't start with /), the element selectors are also prefixed (they also belong to the atom namespace), and I used item(0) and nodeValue to extract the results. Since nodes may have many children, the evaluate() call as used above returns a node list. If there is only one text child, it's in item(0). nodeValue converts it to a string.
The result of running the program above will be:
<ul>
<li><b>Entry ID: tag:google.com,2013:googlealerts/feed:5957360885559055905</b></li>
<ul>
<li>Title: Dad's <b>Test</b> Out Products Made For the Family</li>
<li>Link: https://www.google.com/url?q=http://gma.yahoo.com/video/dads-test-products-made-family-141428658.html&ct=ga&cd=CAIyAA&usg=AFQjCNHHBPoS6Poz-Y5A3vFfbsGL3fkrBA</li>
<li>Published: 2014-06-15T17:30:04Z</li>
<li>Updated: 2014-06-15T17:30:04Z</li>
<li>Content: Watch the video Dad's <b>Test</b> Out Products Made For the Family on Yahoo Good Morning America . Becky Worley enlists a group of fathers to see if "As ...</li>
<li>Author: </li>
</ul>
<li><b>Entry ID: tag:google.com,2013:googlealerts/feed:11008408359408830921</b></li>
<ul>
<li>Title: Germany faces major <b>test</b> of strength in its World Cup opener against Portugal</li>
<li>Link: https://www.google.com/url?q=http://www.foxnews.com/sports/2014/06/15/germany-faces-major-test-strength-in-its-world-cup-opener-against-portugal/&ct=ga&cd=CAIyAA&usg=AFQjCNHOU94QyciRpCEdJawOwl3diEEO0A</li>
<li>Published: 2014-06-15T16:18:45Z</li>
<li>Updated: 2014-06-15T16:18:45Z</li>
<li>Content: Cristiano Ronaldo stretches during a training session of Portugal in Campinas, Brazil, Saturday, June 14, 2014. Portugal plays in group G of the Brazil ...</li>
<li>Author: </li>
</ul>
<li><b>Entry ID: tag:google.com,2013:googlealerts/feed:8664961950651004785</b></li>
...
Now you can edit the code to adapt it to the data you wish to extract.
You can see a working example of this application in this PHP Fiddle
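Since the goal is to pass each value to MySQL, the same contextual searches can also collect the values into a plain PHP array. The following is only a minimal sketch: the inline feed is a shortened sample adapted from the excerpt above, and the `alerts` table and the `$pdo` connection mentioned in the trailing comments are hypothetical. It relies on the XPath `string()` function, which makes evaluate() return a PHP string directly instead of a node list.

```php
<?php
// Shortened sample feed; the entry values are adapted from the excerpt above.
$atom = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:google.com,2013:googlealerts/feed:5957360885559055905</id>
    <title type="html">Dad's &lt;b&gt;Test&lt;/b&gt; Out Products Made For the Family</title>
    <link href="https://www.google.com/url?q=http://gma.yahoo.com/video/&amp;ct=ga"/>
    <published>2014-06-15T17:30:04Z</published>
  </entry>
</feed>
XML;

$document = new DOMDocument();
$document->loadXML($atom);

$xpath = new DOMXPath($document);
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');

$rows = [];
foreach ($xpath->evaluate('//atom:entry') as $entry) {
    // string(...) makes evaluate() return a plain string ('' when nothing matches).
    $rows[] = [
        'id'        => trim($xpath->evaluate('string(atom:id)', $entry)),
        'title'     => trim($xpath->evaluate('string(atom:title)', $entry)),
        'link'      => $xpath->evaluate('string(atom:link/@href)', $entry),
        'published' => $xpath->evaluate('string(atom:published)', $entry),
    ];
}

print_r($rows);

// Hypothetical storage step, assuming a PDO connection in $pdo and a table named alerts:
// $stmt = $pdo->prepare(
//     'INSERT INTO alerts (entry_id, title, link, published)
//      VALUES (:id, :title, :link, :published)'
// );
// foreach ($rows as $row) { $stmt->execute($row); }
```

Using a prepared statement for the insert keeps the feed text, which contains quotes and HTML, from being interpolated into the SQL.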

Related

PHP simplexml_load_string trims out any HTML in between two of the XML tags...how to keep?

Below is the XML and the parsed XML object.
The code used is
$XML = preg_replace("/(<\/?)(\w+):([^>]*>)/", "$1$2$3", $XML);
echo('\n\n'.$XML);
$xmldoc = simplexml_load_string($XML);
print_r($xmldoc);
$jsondoc = json_encode($xmldoc);
$phpobjectsdoc = json_decode($json, true);
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2023-01-06T02:06:06Z</responseDate>
<request identifier="journals:aajses19230810-01" metadataPrefix="oai_dc" verb="GetRecord"> https://x.x.edu/journals/cgi-bin/bcjournals-oaiserver</request>
<GetRecord>
<record>
<header>
<identifier>bcjournals:aajses19230810-01</identifier>
<datestamp>2020-12-03</datestamp>
<setSpec>bcjournals:aajses-documents</setSpec>
</header>
<metadata>
<oai_dcdc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dctitle>Bulletin of the American Association of Jesuit Scientists, Eastern Section</dctitle>
<dcdate>1923-08-10</dcdate>
<dcdescription>
Bulletin of the American Association of Jesuit Scientists, Eastern Section, 10 August 1923
<a href="https://xxx.xxx.edu/iiif/issue/aajses19230810-01/manifest.json?manifest=https%3a%2f%2fxxx.xxx.edu%2fiiif%2fissue%2faajses19230810-01%2fmanifest.json" target="_blank">
<img style="width: 20px;" alt="IIIF Collection Link" src="/custom/bournals/web/images/iiif-logo.png"/>
</a>
(22 pages, 19 articles)
</dcdescription>
<dclanguage>en</dclanguage>
</oai_dcdc>
</metadata>
</record>
</GetRecord>
</OAI-PMH>
SimpleXMLElement Object
(
[responseDate] => 2023-01-06T02:06:06Z
[request] => https://.ddd.edu/bcjournals/cg-bin/brnals-oaiserver
[GetRecord] => SimpleXMLElement Object
(
[record] => SimpleXMLElement Object
(
[header] => SimpleXMLElement Object
(
[identifier] => bcrnals:aajses19230810-01
[datestamp] => 2020-12-03
[setSpec] => bnals:aajses-documents
)
[metadata] => SimpleXMLElement Object
(
[oai_dcdc] => SimpleXMLElement Object
(
[dctitle] => Bulletin of the American Association of Jesuit Scientists, Eastern Section
[dcdate] => 1923-08-10
[dcdescription] =>
Bulletin of the American Association of Jesuit Scientists, Eastern Section, 10 August 1923
(22 pages, 19 articles)
[dclanguage] => en
)
)
)
)
)
It's not your fault, but you have managed to pick up on other people's terrible habits, and ended up digging yourself into an entirely unnecessary hole.
The first thing to understand is that SimpleXML is an API, not a way to create plain PHP objects - the structure of an XML document is (or can be) much more complex than a PHP object can easily represent, so SimpleXML provides ways to access the data, but doesn't put it all in obvious places.
So the first mistake that everyone makes is expecting this to work:
$xmldoc = simplexml_load_string($XML);
print_r($xmldoc);
The first line creates a SimpleXMLElement object, and the second tries to display it - but unfortunately print_r doesn't know how to show all the data that's available, so some of the object is invisible. This confuses people into doing lots of unnecessary and counter-productive things, because they think the invisible things are missing.
The only real way to show what's inside the object is to turn it back into XML, at which point you'll see that everything is in fact still there:
$xmldoc = simplexml_load_string($XML);
echo $xmldoc->asXML();
Once you realise this, you'll realise that this line was unnecessary:
$XML = preg_replace("/(<\/?)(\w+):([^>]*>)/", "$1$2$3", $XML);
This is a terrible attempt at handling element and attribute names with colons in, which actually represent "XML Namespaces" - a relatively complex but useful way to combine different formats into one without names colliding (my <link> and your <link> can sit next to each other, and we can tell the difference). To handle them properly, see Reference - How do I handle Namespaces (Tags and Attributes with a Colon in their Name) in SimpleXML?
Obviously, most of the time you don't just want to see the XML again, you want to get data out of it. Because print_r isn't telling them how to do that, people use another terrible trick to get a different kind of object:
$jsondoc = json_encode($xmldoc);
$phpobjectsdoc = json_decode($json, true);
Do not do this. Once you've done this, that data that was invisible in print_r really has gone for good.
Instead, read the examples in the manual of how to traverse a basic XML document.
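The data loss described above can be seen in a small sketch. The tiny `<record>` document here is made up for illustration and is not part of the original data; it only assumes the usual Dublin Core namespace URI.

```php
<?php
// Made-up two-element document: one unprefixed child, one in the dc: namespace.
$xml = '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
     . '<id>42</id><dc:title>Bulletin</dc:title></record>';
$sx = simplexml_load_string($xml);

// The JSON round trip only sees the unprefixed child, so <dc:title> is dropped.
$viaJson = json_decode(json_encode($sx), true);
var_dump(array_key_exists('title', $viaJson));

// Asking SimpleXML for the namespace explicitly still finds it.
$title = (string) $sx->children('http://purl.org/dc/elements/1.1/')->title;
echo $title, "\n";
```

The element never left the SimpleXML object; it only vanished from the array produced by the round trip.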
So, back to your example. I've taken a guess at where the colons were before you removed them. Then I pasted it into a programmer's editor (in my case, PhpStorm) to automatically indent it for easier reading:
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2023-01-06T02:06:06Z</responseDate>
<request identifier="journals:aajses19230810-01" metadataPrefix="oai_dc" verb="GetRecord"> https://x.x.edu/journals/cgi-bin/bcjournals-oaiserver</request>
<GetRecord>
<record>
<header>
<identifier>bcjournals:aajses19230810-01</identifier>
<datestamp>2020-12-03</datestamp>
<setSpec>bcjournals:aajses-documents</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Bulletin of the American Association of Jesuit Scientists, Eastern Section</dc:title>
<dc:date>1923-08-10</dc:date>
<dc:description>
Bulletin of the American Association of Jesuit Scientists, Eastern Section, 10 August 1923
<a href="https://xxx.xxx.edu/iiif/issue/aajses19230810-01/manifest.json?manifest=https%3a%2f%2fxxx.xxx.edu%2fiiif%2fissue%2faajses19230810-01%2fmanifest.json" target="_blank">
<img style="width: 20px;" alt="IIIF Collection Link" src="/custom/bournals/web/images/iiif-logo.png"/>
</a>
(22 pages, 19 articles)
</dc:description>
<dc:language>en</dc:language>
</oai_dc:dc>
</metadata>
</record>
</GetRecord>
</OAI-PMH>
Now, to get to the dc:description element, we need to walk through step by step:
// 1. Our initial object represents the root element, `<OAI-PMH>`
$xmldoc = simplexml_load_string($XML);
// 2. Unprefixed children are in the `http://www.openarchives.org/OAI/2.0/` namespace
// because of the xmlns="http://www.openarchives.org/OAI/2.0/"
$children = $xmldoc->children('http://www.openarchives.org/OAI/2.0/');
// 3. We want the element <GetRecord>
$GetRecord = $children->GetRecord;
// 4. Inside that, we want <record>, then <metadata>
$record = $GetRecord->record;
$metadata = $record->metadata;
// Or all in one statement:
$metadata = $xmldoc->children('http://www.openarchives.org/OAI/2.0/')
->GetRecord->record->metadata;
// 5. Now we have <oai_dc:dc> - a new namespace
// xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" so:
$dc = $metadata->children('http://www.openarchives.org/OAI/2.0/oai_dc/')->dc;
// 6. Another new namespace, for dc:description
// xmlns:dc="http://purl.org/dc/elements/1.1/"
$description = $dc->children('http://purl.org/dc/elements/1.1/')->description;
// 7. At this point, see what we've got:
echo $description->asXML();
Giving
<dc:description>
Bulletin of the American Association of Jesuit Scientists, Eastern Section, 10 August 1923
<a href="https://xxx.xxx.edu/iiif/issue/aajses19230810-01/manifest.json?manifest=https%3a%2f%2fxxx.xxx.edu%2fiiif%2fissue%2faajses19230810-01%2fmanifest.json" target="_blank">
<img style="width: 20px;" alt="IIIF Collection Link" src="/custom/bournals/web/images/iiif-logo.png"/>
</a>
(22 pages, 19 articles)
</dc:description>
The description, with all its content in place!
There's one remaining problem: we don't really want the <dc:description> and </dc:description>, just the text in between. If the element only contained text, we could write echo (string)$description;. But if we do that here, the HTML disappears again - that's because whoever produced this data messed up a bit, and included the HTML without escaping it, so SimpleXML thinks the <a> and <img> elements are part of the XML structure, not the content.
Getting the "inner XML" is one thing that SimpleXML doesn't have a neat method for, but there are tricks for doing it:
$content= '';
foreach (dom_import_simplexml($description)->childNodes as $child)
{
$content .= $child->ownerDocument->saveXML( $child );
}
echo $content;
Giving
Bulletin of the American Association of Jesuit Scientists, Eastern Section, 10 August 1923
<a href="https://xxx.xxx.edu/iiif/issue/aajses19230810-01/manifest.json?manifest=https%3a%2f%2fxxx.xxx.edu%2fiiif%2fissue%2faajses19230810-01%2fmanifest.json" target="_blank">
<img style="width: 20px;" alt="IIIF Collection Link" src="/custom/bournals/web/images/iiif-logo.png"/>
</a>
(22 pages, 19 articles)
Perfect!
If that looks long-winded, that's just because I stopped to explain along the way; here's a tidied up version:
$xmldoc = simplexml_load_string($XML);
$description = $xmldoc
->children('http://www.openarchives.org/OAI/2.0/')
->GetRecord->record->metadata
->children('http://www.openarchives.org/OAI/2.0/oai_dc/')
->dc
->children('http://purl.org/dc/elements/1.1/')
->description;
$content= '';
foreach (dom_import_simplexml($description)->childNodes as $child)
{
$content .= $child->ownerDocument->saveXML( $child );
}
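As a possible alternative to walking down with children() at every level, SimpleXML's xpath() method can jump straight to the element once a prefix has been registered for its namespace. This is only a sketch under assumptions: the inline document is a cut-down stand-in for the real record, and the prefix `dc` registered here is chosen by the script, not read from the file.

```php
<?php
// Cut-down stand-in for the OAI-PMH record shown earlier.
$XML = <<<'XML'
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Bulletin</dc:title>
      <dc:description>Ten pages <a href="https://example.org/">link</a> end</dc:description>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>
XML;

$xmldoc = simplexml_load_string($XML);

// Map the namespace URI to a prefix for use in XPath expressions.
$xmldoc->registerXPathNamespace('dc', 'http://purl.org/dc/elements/1.1/');
$matches = $xmldoc->xpath('//dc:description');

$content = '';
if ($matches) {
    // Same "inner XML" trick as above: serialise each child node in turn.
    foreach (dom_import_simplexml($matches[0])->childNodes as $child) {
        $content .= $child->ownerDocument->saveXML($child);
    }
}
echo $content, "\n";
```

The step-by-step children() version makes the structure explicit, while the XPath version is shorter when only one deeply nested element is needed.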

Xpath query for HTML table within XML in PHP DOMDocument

I have an XML file with following tree structure.
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>Videos</title>
<link>https://www.example.com/r/videos/</link>
<description>A long description of the video.</description>
<image>...</image>
<atom:link rel="self" href="http://www.example.com/videos/.xml" type="application/rss+xml"/>
<item>
<title>The most used Jazz lick in history.</title>
<link>
http://www.example.com/
</link>
<guid isPermaLink="true">
http://www.example.com/
</guid>
<pubDate>Mon, 07 Sep 2015 14:43:34 +0000</pubDate>
<description>
<table>
<tr>
<td>
<a href="http://www.example.com/">
<img src="http://www.example.com/.jpg" alt="The most used Jazz lick in history." title="The most used Jazz lick in history." />
</a>
</td>
<td> submitted by
jcepiano
<br/>
[link]
<a href="http://www.example.com/">
[508 comments]
</a>
</td>
</tr>
</table>
</description>
<media:title>The most used Jazz lick in history.</media:title>
<media:thumbnail url="http://example.jpg"/>
</item>
</channel>
</rss>
Here, the html table element is embedded inside XML and that's confusing me.
Now I want to pick the text node values for //channel/item/title and href value for //channel/item/description/table/tr/td[1]/a[1] (with a text node value = "[link]")
Above in 2nd case, I am looking for the value of 2nd a (with a text node value = "[link]"), inside 2nd td inside tr, table, description, item, channel.
I am using PHP DOMDocument();
I have been looking for a perfect solution for this for 2 days now, can you please let me know how would this happen?
Also I need to count the total number of items in the feed, right now I am doing like this:
...
$queryResult = $xpathvar->query('//item/title');
$total = 1;
foreach($queryResult as $result){
$total++;
}
echo $title;
And I also need a reference link for XPath query selectors' rules.
Thanks in advance! :)
You wrote that you wanted the length of the result set of the following query:
$queryResult = $xpathvar->query('//item/title');
I assume that $xpathvar here is of type DOMXPath. If so, query() returns a DOMNodeList, which has a length property as described here. Instead of using foreach, simply use:
$length = $xpathvar->query('//item/title')->length;
Now I want to pick the text node values for //channel/item/title
Which you can get with the expression //channel/item/title/text().
and href value for //channel/item/description/table/tr/td[1]/a[1] (with a text node value = "[link]")
Your expression here selects any tr, the first td under that, then the first a. But the first a does not have a value of "[link]" in your source. If you want that, though, you can use:
//channel/item/description/table/tr/td[1]/a[1]/@href
but it looks like you rather want:
//channel/item/description/table/tr/td/a[. = "[link]"][1]/@href
which finds the first a element in the tree that has the value (text node) that is "[link]".
Above in 2nd case, I am looking for the value of 2nd a (with a text node value = "[link]"), inside 2nd td inside tr, table, description, item, channel.
Not sure if this was a separate question or meant to explain the previous one. Regardless, the answer is the same as in the previous one, unless you explicitly want to search for the 2nd a etc. (i.e., search by position), in which case you can use numeric predicates.
Note: you start most of your expressions with //expr, which essentially means: search the whole tree at any depth for the expression expr. This is potentially expensive and if all you need is a (relative) root node for which you know the starting point or expression, it is better, and far more performant, to use a direct path. In your case, you can replace //channel with /*/channel (because it is the first under the root element).
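Put together, the expressions above could be used roughly as follows. This is a sketch, not a drop-in solution: the inline feed is a simplified stand-in in which the "[link]" text sits inside its own `<a>` element, and the example.com URLs are placeholders. It also assumes the table is real XML markup inside `<description>`; if a feed delivers it as escaped text or CDATA, it would have to be parsed separately.

```php
<?php
// Simplified stand-in for the feed; "[link]" is the text of the second <a> in the second <td>.
$rss = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <title>The most used Jazz lick in history.</title>
      <description>
        <table><tr>
          <td><a href="http://www.example.com/thumb">thumb</a></td>
          <td>submitted by <a href="http://www.example.com/user">jcepiano</a>
            <a href="http://www.example.com/video">[link]</a>
            <a href="http://www.example.com/comments">[508 comments]</a></td>
        </tr></table>
      </description>
    </item>
  </channel>
</rss>
XML;

$dom = new DOMDocument();
$dom->loadXML($rss);
$xpath = new DOMXPath($dom);

// Count the items via the length property instead of a loop.
$total = $xpath->query('/*/channel/item')->length;

// string(...) returns the text of the first matching node as a PHP string.
$title = $xpath->evaluate('string(/*/channel/item[1]/title)');

// Select the href of the <a> whose text is exactly "[link]".
$href = $xpath->evaluate(
    'string(/*/channel/item[1]/description/table/tr/td/a[. = "[link]"][1]/@href)'
);

echo "$total item(s)\n$title\n$href\n";
```

Selecting by content (`a[. = "[link]"]`) keeps working if the feed later adds or removes links in the cell, which a purely positional predicate would not.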
I finally could make it work with the code below
$url = "https://www.example.com/r/videos/.xml";
$feed_dom = new domDocument;
// preserveWhiteSpace only takes effect if it is set before loading
$feed_dom->preserveWhiteSpace = false;
$feed_dom->load($url);
$items = $feed_dom->getElementsByTagName('item');
foreach($items as $item){
$title = $item->getElementsByTagName('title')->item(0)->nodeValue;
$desc_table = $item->getElementsByTagName('description')->item(0)->nodeValue;
echo $title . "<br>";
$table_dom = new domDocument;
$table_dom->loadHTML($desc_table);
$xpath = new DOMXpath($table_dom);
$table_dom->preserveWhiteSpace = false;
$yt_link_node = $xpath->query("//table/tr/td[2]/a[2]");
foreach($yt_link_node as $yt_link){
$yt = $yt_link->getAttribute('href');
echo $yt . "<br>";
echo "<br>";
}
}
Thank you, Abel; your help was very useful in achieving this. :)

Regular Expressions - PHP and XML

I'm in college and new to PHP regular expressions but I have somewhat of an idea what I need to do I think. Basically I need to create a PHP program to read XML source code containing several 'stories' and store their details in a mySQL database. I've managed to create an expression that selects each story but I need to break this expression down further in order to get each element within the story. Here's the XML:
XML
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="test.xsl"?>
<latestIssue>
<issue number="256" />
<date>
<day> 21 </day>
<month> 1 </month>
<year> 2011 </year>
</date>
<story>
<title> Is the earth flat? </title>
<author> A. N. Redneck </author>
<url> http://www.HotStuff.ie/stories/story123456.xml </url>
</story>
<story>
<title> What the actress said to the bishop </title>
<author> Brated Film Critic </author>
<url> http://www.HotStuff.ie/stories/story123457.xml </url>
</story>
<story>
<title> What the year has in store </title>
<author> Stargazer </author>
<url> http://www.HotStuff.ie/stories/story123458.xml </url>
</story>
</latestIssue>
So I need to get the title, author and url from each story and add them as a row in my database. Here's what I have so far:
PHP
<?php
$url = fopen("http://address/to/test.xml", "r");
$contents = fread($url,10000000);
$exp = preg_match_all("/<title>(.+?)<\/url>/s", $contents, $matches);
foreach($matches[1] as $match) {
// NO IDEA WHAT TO DO FROM HERE
// $exp2 = "/<title>(.+?)<\/title><author>(.+?)<\/author><url>(.+?)<\/url>/";
// This is what I had but I'm not sure if it's right or what to do after
}
?>
I'd really appreciate the help guys, I've been stuck on this all day and I can't wrap my head around regular expressions at all. Once I've managed to get each story's details I can easily update the database.
EDIT:
Thanks for replying but are you sure this can't be done with regular expressions? It's just the question says "Use regular expressions to analyse the XML and extract the relevant data that you need. Note that information about each story is spread across several lines of XML". Maybe he made a mistake but I don't see why he'd write it like that if it can't be done this way.
First of all, start using
file_get_contents("UrlHere");
to gather the content from a page.
Now if you want to parse the XML use the XML parser in PHP for example.
You could also use third-party XML parsers
Regular expressions are not the correct tool to use here. You want to use an XML parser. I like PHP's SimpleXML:
$sXML = new SimpleXMLElement('http://address/to/test.xml', 0, TRUE);
$stories = $sXML->story;
foreach($stories as $story){
$title = (string)$story->title;
$author = (string)$story->author;
$url = (string)$story->url;
}
You should never use a regexp to parse an XML document (OK, "never" is a big word; in some rare cases a regexp can be better, but not in your case).
Since you only need to read the document, I suggest you use the SimpleXML class and XPath queries.
For example :
$ cat test.php
#!/usr/bin/php
<?php
function xpathValueToString(SimpleXMLElement $xml, $xpath){
$arrayXpath = $xml->xpath($xpath);
return ($arrayXpath) ? trim((string) $arrayXpath[0]) : null;
}
$xml = new SimpleXMLElement(file_get_contents("test.xml"));
$arrayXpathStories = $xml->xpath("/latestIssue/story");
foreach ($arrayXpathStories as $story){
echo "Title : " . xpathValueToString($story, 'title') . "\n";
echo "Author : " . xpathValueToString($story, 'author') . "\n";
echo "URL : " . xpathValueToString($story, 'url') . "\n\n";
}
?>
$ ./test.php
Title : Is the earth flat?
Author : A. N. Redneck
URL : http://www.HotStuff.ie/stories/story123456.xml
Title : What the actress said to the bishop
Author : Brated Film Critic
URL : http://www.HotStuff.ie/stories/story123457.xml
Title : What the year has in store
Author : Stargazer
URL : http://www.HotStuff.ie/stories/story123458.xml
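Regarding the edit in the question: if the assignment really does require regular expressions, something along the following lines could work for this particular, very regular file. It is only a sketch that assumes every story contains exactly title, author and url in that order with no attributes or nested markup; the parser-based approaches above remain the more robust choice. The inline XML is a shortened copy of the sample.

```php
<?php
// Shortened copy of the sample document from the question.
$contents = <<<'XML'
<latestIssue>
  <story>
    <title> Is the earth flat? </title>
    <author> A. N. Redneck </author>
    <url> http://www.HotStuff.ie/stories/story123456.xml </url>
  </story>
  <story>
    <title> What the year has in store </title>
    <author> Stargazer </author>
    <url> http://www.HotStuff.ie/stories/story123458.xml </url>
  </story>
</latestIssue>
XML;

// \s* absorbs the line breaks between elements; the s modifier lets . match newlines.
$pattern = '~<story>\s*<title>(.*?)</title>\s*<author>(.*?)</author>'
         . '\s*<url>(.*?)</url>\s*</story>~s';

// PREG_SET_ORDER groups the captures per story instead of per capture group.
preg_match_all($pattern, $contents, $matches, PREG_SET_ORDER);

$stories = [];
foreach ($matches as $m) {
    $stories[] = [
        'title'  => trim($m[1]),
        'author' => trim($m[2]),
        'url'    => trim($m[3]),
    ];
}

print_r($stories);
```

Each element of `$stories` then corresponds to one database row.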

Parse XML namespaces with php SimpleXML

I know this has been asked many many times but I haven't been able to get any of the suggestions to work with my situation and I have searched the web and here and tried everything and anything and nothing works. I just need to parse this XML with the namespace cap: and just need four entries from it.
<?xml version="1.0" encoding="UTF-8"?>
<entry>
<id>http://alerts.weather.gov/cap/wwacapget.php?x=TX124EFFB832F0.SpecialWeatherStatement.124EFFB84164TX.LUBSPSLUB.ac20a1425c958f66dc159baea2f9e672</id>
<updated>2013-05-06T20:08:00-05:00</updated>
<published>2013-05-06T20:08:00-05:00</published>
<author>
<name>w-nws.webmaster@noaa.gov</name>
</author>
<title>Special Weather Statement issued May 06 at 8:08PM CDT by NWS</title>
<link href="http://alerts.weather.gov/cap/wwacapget.php?x=TX124EFFB832F0.SpecialWeatherStatement.124EFFB84164TX.LUBSPSLUB.ac20a1425c958f66dc159baea2f9e672"/>
<summary>...SIGNIFICANT WEATHER ADVISORY FOR COCHRAN AND BAILEY COUNTIES... AT 808 PM CDT...NATIONAL WEATHER SERVICE DOPPLER RADAR INDICATED A STRONG THUNDERSTORM 30 MILES NORTHWEST OF MORTON...MOVING SOUTHEAST AT 25 MPH. NICKEL SIZE HAIL...WINDS SPEEDS UP TO 40 MPH...CONTINUOUS CLOUD TO GROUND LIGHTNING...AND BRIEF MODERATE DOWNPOURS ARE POSSIBLE WITH</summary>
<cap:event>Special Weather Statement</cap:event>
<cap:effective>2013-05-06T20:08:00-05:00</cap:effective>
<cap:expires>2013-05-06T20:45:00-05:00</cap:expires>
<cap:status>Actual</cap:status>
<cap:msgType>Alert</cap:msgType>
<cap:category>Met</cap:category>
<cap:urgency>Expected</cap:urgency>
<cap:severity>Minor</cap:severity>
<cap:certainty>Observed</cap:certainty>
<cap:areaDesc>Bailey; Cochran</cap:areaDesc>
<cap:polygon>34.19,-103.04 34.19,-103.03 33.98,-102.61 33.71,-102.61 33.63,-102.75 33.64,-103.05 34.19,-103.04</cap:polygon>
<cap:geocode>
<valueName>FIPS6</valueName>
<value>048017 048079</value>
<valueName>UGC</valueName>
<value>TXZ027 TXZ033</value>
</cap:geocode>
<cap:parameter>
<valueName>VTEC</valueName>
<value>
</value>
</cap:parameter>
</entry>
I am using simpleXML and I have a small simple test script set up and it works great for parsing regular elements. I can't for the dickens of me find or get a way to parse the elements with the namespaces.
Here is a small sample test script with code I am using and works great for parsing simple elements. How do I use this to parse namespaces? Everything I've tried doesn't work. I need it to be able to create variables so I can be able to embed them in HTML for style.
<?php
$html = "";
// Get the XML Feed
$data = "http://alerts.weather.gov/cap/tx.php?x=1";
// load the xml into the object
$xml = simplexml_load_file($data);
for ($i = 0; $i < 10; $i++){
$title = $xml->entry[$i]->title;
$summary = $xml->entry[$i]->summary;
$html .= "<p><strong>$title</strong></p><p>$summary</p><hr/>";
}
echo $html;
?>
This works fine for parsing regular elements but what about the ones with the cap: namespace under the entry parent?
<?php
ini_set('display_errors','1');
$html = "";
$data = "http://alerts.weather.gov/cap/tx.php?x=1";
$entries = simplexml_load_file($data);
if(count($entries)):
//Registering NameSpace
$entries->registerXPathNamespace('prefix', 'http://www.w3.org/2005/Atom');
$result = $entries->xpath("//prefix:entry");
//echo count($asin);
//echo "<pre>";print_r($asin);
foreach ($result as $entry):
$title = $entry->title;
$summary = $entry->summary;
$html .= "<p><strong>$title</strong></p><p>$summary</p>$event<hr/>";
endforeach;
endif;
echo $html;
?>
Any help would be greatly appreciated.
-Thanks
I have given same type of answer here - solution to your question
You just need to register Namespace and then you can work normally with simplexml_load_file and XPath
<?php
$data = "http://alerts.weather.gov/cap/tx.php?x=1";
$entries = file_get_contents($data);
$entries = new SimpleXmlElement($entries);
if(count($entries)):
//echo "<pre>";print_r($entries);die;
//alternate way other than registering NameSpace
//$asin = $asins->xpath("//*[local-name() = 'ASIN']");
$entries->registerXPathNamespace('prefix', 'http://www.w3.org/2005/Atom');
$result = $entries->xpath("//prefix:entry");
//echo count($asin);
//echo "<pre>";print_r($result);die;
foreach ($result as $entry):
//echo "<pre>";print_r($entry);die;
$dc = $entry->children('urn:oasis:names:tc:emergency:cap:1.1');
echo $dc->event."<br/>";
echo $dc->effective."<br/>";
echo "<hr>";
endforeach;
endif;
That's it.
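To get the cap: values into variables for embedding in HTML, as the question asks, the children() call can be combined with the original loop. The sketch below uses a shortened inline feed instead of the live URL; `$event` and `$severity` are just illustrative variable names. It shows both ways of selecting the namespace: by URI, and by prefix (passing true as the second argument).

```php
<?php
// Shortened inline stand-in for the NWS feed.
$feed = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:cap="urn:oasis:names:tc:emergency:cap:1.1">
  <entry>
    <title>Fire Weather Watch issued May 06 at 11:14PM CDT</title>
    <cap:event>Fire Weather Watch</cap:event>
    <cap:severity>Moderate</cap:severity>
  </entry>
</feed>
XML;

$xml = simplexml_load_string($feed);
$html = '';

foreach ($xml->entry as $entry) {
    // Select the cap: children by namespace URI ...
    $cap = $entry->children('urn:oasis:names:tc:emergency:cap:1.1');
    $title = htmlspecialchars((string) $entry->title);
    $event = htmlspecialchars((string) $cap->event);
    // ... or by the prefix used in the document (second argument true).
    $severity = htmlspecialchars((string) $entry->children('cap', true)->severity);

    $html .= "<p><strong>$title</strong></p><p>$event ($severity)</p><hr/>";
}

echo $html;
```

Selecting by URI is generally the safer of the two, because a feed is free to change the prefix it uses while the namespace URI stays the same.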
Here's an alternative solution:
<?php
$xml = <<<XML
<?xml version = '1.0' encoding = 'UTF-8' standalone = 'yes'?>
<?xml-stylesheet href='http://alerts.weather.gov/cap/capatom.xsl' type='text/xsl'?>
<!--
This atom/xml feed is an index to active advisories, watches and warnings
issued by the National Weather Service. This index file is not the complete
Common Alerting Protocol (CAP) alert message. To obtain the complete CAP
alert, please follow the links for each entry in this index. Also note the
CAP message uses a style sheet to convey the information in a human readable
format. Please view the source of the CAP message to see the complete data
set. Not all information in the CAP message is contained in this index of
active alerts.
-->
<feed
xmlns = 'http://www.w3.org/2005/Atom'
xmlns:cap = 'urn:oasis:names:tc:emergency:cap:1.1'
xmlns:ha = 'http://www.alerting.net/namespace/index_1.0'
>
<!-- http-date = Tue, 07 May 2013 04:14:00 GMT -->
<id>http://alerts.weather.gov/cap/tx.atom</id>
<logo>http://alerts.weather.gov/images/xml_logo.gif</logo>
<generator>NWS CAP Server</generator>
<updated>2013-05-06T23:14:00-05:00</updated>
<author>
<name>w-nws.webmaster@noaa.gov</name>
</author>
<title>Current Watches, Warnings and Advisories for Texas Issued by the National Weather Service</title>
<link href='http://alerts.weather.gov/cap/tx.atom'/>
<entry>
<id>http://alerts.weather.gov/cap/wwacapget.php?x=TX124EFFB8AA78.FireWeatherWatch.124EFFD70270TX.EPZRFWEPZ.1716207877d94d15d43d410892b9f175</id>
<updated>2013-05-06T23:14:00-05:00</updated>
<published>2013-05-06T23:14:00-05:00</published>
<author>
<name>w-nws.webmaster@noaa.gov</name>
</author>
<title>Fire Weather Watch issued May 06 at 11:14PM CDT until May 08 at 10:00PM CDT by NWS</title>
<link href="http://alerts.weather.gov/cap/wwacapget.php?x=TX124EFFB8AA78.FireWeatherWatch.124EFFD70270TX.EPZRFWEPZ.1716207877d94d15d43d410892b9f175"/>
<summary>...CRITICAL FIRE CONDITIONS EXPECTED WEDNESDAY ACROSS FAR WEST TEXAS AND THE SOUTHWEST NEW MEXICO LOWLANDS... .WINDS ALOFT WILL STRENGTHEN OVER THE REGION EARLY THIS WEEK...AHEAD OF AN UPPER LEVEL TROUGH FORECAST TO MOVE THROUGH NEW MEXICO AND TEXAS ON WEDNESDAY. SURFACE LOW PRESSURE WILL ALSO DEVELOP TO OUR EAST AS THE TROUGH APPROACHES. THIS COMBINATION WILL RESULT</summary>
<cap:event>Fire Weather Watch</cap:event>
<cap:effective>2013-05-06T23:14:00-05:00</cap:effective>
<cap:expires>2013-05-08T22:00:00-05:00</cap:expires>
<cap:status>Actual</cap:status>
<cap:msgType>Alert</cap:msgType>
<cap:category>Met</cap:category>
<cap:urgency>Future</cap:urgency>
<cap:severity>Moderate</cap:severity>
<cap:certainty>Possible</cap:certainty>
<cap:areaDesc>El Paso; Hudspeth</cap:areaDesc>
<cap:polygon></cap:polygon>
<cap:geocode>
<valueName>FIPS6</valueName>
<value>048141 048229</value>
<valueName>UGC</valueName>
<value>TXZ055 TXZ056</value>
</cap:geocode>
<cap:parameter>
<valueName>VTEC</valueName>
<value>/O.NEW.KEPZ.FW.A.0018.130508T1900Z-130509T0300Z/</value>
</cap:parameter>
</entry>
<entry>
<id>http://alerts.weather.gov/cap/wwacapget.php?x=TX124EFFABB2F0.AirQualityAlert.124EFFC750DCTX.HGXAQAHGX.7f2cf548a67d403f0541492b2804d621</id>
<updated>2013-05-06T14:16:00-05:00</updated>
<published>2013-05-06T14:16:00-05:00</published>
<author>
<name>w-nws.webmaster@noaa.gov</name>
</author>
<title>Air Quality Alert issued May 06 at 2:16PM CDT by NWS</title>
<link href="http://alerts.weather.gov/cap/wwacapget.php?x=TX124EFFABB2F0.AirQualityAlert.124EFFC750DCTX.HGXAQAHGX.7f2cf548a67d403f0541492b2804d621"/>
<summary>...OZONE ACTION DAY FOR TUESDAY... THE TEXAS COMMISSION ON ENVIRONMENTAL QUALITY (TCEQ)...HAS ISSUED AN OZONE ACTION DAY FOR THE HOUSTON...GALVESTON...AND BRAZORIA AREAS FOR TUESDAY...MAY 7 2013. ATMOSPHERIC CONDITIONS ARE EXPECTED TO BE FAVORABLE FOR PRODUCING HIGH LEVELS OF OZONE POLLUTION IN THE HOUSTON...GALVESTON AND</summary>
<cap:event>Air Quality Alert</cap:event>
<cap:effective>2013-05-06T14:16:00-05:00</cap:effective>
<cap:expires>2013-05-07T19:15:00-05:00</cap:expires>
<cap:status>Actual</cap:status>
<cap:msgType>Alert</cap:msgType>
<cap:category>Met</cap:category>
<cap:urgency>Unknown</cap:urgency>
<cap:severity>Unknown</cap:severity>
<cap:certainty>Unknown</cap:certainty>
<cap:areaDesc>Brazoria; Galveston; Harris</cap:areaDesc>
<cap:polygon></cap:polygon>
<cap:geocode>
<valueName>FIPS6</valueName>
<value>048039 048167 048201</value>
<valueName>UGC</valueName>
<value>TXZ213 TXZ237 TXZ238</value>
</cap:geocode>
<cap:parameter>
<valueName>VTEC</valueName>
<value></value>
</cap:parameter>
</entry>
</feed>
XML;
$sxe = new SimpleXMLElement($xml);
$capFields = $sxe->entry->children('cap', true);
echo "Event: " . (string) $capFields->event . "\n";
echo "Effective: " . (string) $capFields->effective . "\n";
echo "Expires: " . (string) $capFields->expires . "\n";
echo "Severity: " . (string) $capFields->severity . "\n";
Output:
Event: Fire Weather Watch
Effective: 2013-05-06T23:14:00-05:00
Expires: 2013-05-08T22:00:00-05:00
Severity: Moderate
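The snippet above reads only the first entry, because $sxe->entry without an index refers to the first matching element. To collect the cap: fields of every entry, loop over the entries and call children('cap', true) on each one. The following is a minimal sketch under that assumption; the XML is a shortened, hypothetical sample that reuses the feed's namespaces, and the $alerts array and its keys are illustrative names only.

```php
<?php
// Shortened, hypothetical sample using the same namespaces as the NWS feed.
$xml = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:cap="urn:oasis:names:tc:emergency:cap:1.1">
  <entry>
    <title>Fire Weather Watch</title>
    <cap:event>Fire Weather Watch</cap:event>
    <cap:severity>Moderate</cap:severity>
  </entry>
  <entry>
    <title>Air Quality Alert</title>
    <cap:event>Air Quality Alert</cap:event>
    <cap:severity>Unknown</cap:severity>
  </entry>
</feed>
XML;

$sxe = new SimpleXMLElement($xml);

// Illustrative container: one associative array per entry.
$alerts = array();
foreach ($sxe->entry as $entry) {
    // children('cap', true) selects the child elements by namespace prefix.
    $cap = $entry->children('cap', true);
    $alerts[] = array(
        'event'    => (string) $cap->event,
        'severity' => (string) $cap->severity,
    );
}

foreach ($alerts as $alert) {
    echo $alert['event'] . ': ' . $alert['severity'] . "\n";
}
```

Casting each field to a string matters here: without the cast, the array would hold SimpleXMLElement objects rather than plain values.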

Identical nested XML elements with namespaces and PHP

Try as I may, I cannot seem to grab the value of the "Id" attribute in the nested apcm:Property element where the "Name" attribute equals "sequenceNumber", on line 12. As you can see, the element of interest is buried in a nest of other elements with an identical name and namespace.
Using PHP, I'm having a difficult time wrapping my head around how to grab that Id value.
<?xml version="1.0" encoding="utf-8" ?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:apcm="http://ap.org/schemas/03/2005/apcm" xmlns:apnm="http://ap.org/schemas/03/2005/apnm" xmlns:georss="http://www.georss.org/georss">
<id>urn:publicid:ap.org:30085</id>
<title type="xhtml">
<apxh:div xmlns:apxh="http://www.w3.org/1999/xhtml">
<apxh:span>AP New York State News - No Weather</apxh:span>
</apxh:div>
</title>
<apcm:Property Name="FeedProperties">
<apcm:Property Name="Entitlement" Id="urn:publicid:ap.org:product:30085" Value="AP New York State News - No Weather" />
<apcm:Property Name="FeedSequencing">
<apcm:Property Name="sequenceNumber" Id="169310964" />
<apcm:Property Name="minDateTime" Value="2012-05-22T18:04:18.913Z" />
</apcm:Property>
</apcm:Property>
<updated>2012-05-22T18:04:18.913Z</updated>
<author>
<name>The Associated Press</name>
<uri>http://www.ap.org</uri>
</author>
<rights>Copyright 2012 The Associated Press. All rights reserved. This material may not be published, broadcast, rewritten or redistributed.</rights>
<link rel="self" href="http://syndication.ap.org/AP.Distro.Feed/GetFeed.aspx?idList=30085&amp;idListType=products&amp;maxItems=20" />
<entry>
...
</entry>
</feed>
You have to register the namespaces and use a [] predicate to pick out the Property element you are interested in. It is safest NOT to use the double slash (//), i.e., to start the lookup from the document element, so that the path matches only at the expected nesting level.
<?php
$xml = <<<EOD
...
EOD;
$sxe = new SimpleXMLElement($xml);
$sxe->registerXPathNamespace('apcm', 'http://ap.org/schemas/03/2005/apcm');
$sxe->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');
// The prefix must match the one registered above (apcm), and attributes are selected with @.
$result = $sxe->xpath('/atom:feed/apcm:Property[@Name=\'FeedProperties\']/apcm:Property[@Name=\'FeedSequencing\']/apcm:Property[@Name=\'sequenceNumber\']/@Id');
foreach ($result as $sequenceNumber) {
echo $sequenceNumber . "\n";
}
?>
Note that there may theoretically be multiple sibling Property elements with the same @Name, so this XPath may return multiple nodes (@Id values).
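For reference, here is a self-contained sketch of the same lookup that can be run as is. It assumes a trimmed copy of the feed from the question, keeping only the elements the path touches. The $path, $ids and $sequenceNumber variables are illustrative names. Because xpath() returns an array of SimpleXMLElement objects, the sketch casts each result to a string and then takes the first one, which covers the multiple-match case mentioned above.

```php
<?php
// Trimmed copy of the feed from the question; only the parts the XPath touches are kept.
$xml = <<<'EOD'
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:apcm="http://ap.org/schemas/03/2005/apcm">
<id>urn:publicid:ap.org:30085</id>
<apcm:Property Name="FeedProperties">
<apcm:Property Name="FeedSequencing">
<apcm:Property Name="sequenceNumber" Id="169310964" />
<apcm:Property Name="minDateTime" Value="2012-05-22T18:04:18.913Z" />
</apcm:Property>
</apcm:Property>
</feed>
EOD;

$sxe = new SimpleXMLElement($xml);
// The default Atom namespace needs a prefix of its own inside XPath expressions.
$sxe->registerXPathNamespace('apcm', 'http://ap.org/schemas/03/2005/apcm');
$sxe->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');

// Absolute path from the document element, one predicate per nesting level.
$path = '/atom:feed'
      . '/apcm:Property[@Name="FeedProperties"]'
      . '/apcm:Property[@Name="FeedSequencing"]'
      . '/apcm:Property[@Name="sequenceNumber"]/@Id';

// Cast every matched attribute node to a plain string.
$ids = array_map('strval', $sxe->xpath($path));

// Take the first match, or null if nothing matched.
$sequenceNumber = count($ids) > 0 ? $ids[0] : null;
echo $sequenceNumber . "\n";
```

Writing the path in single quotes with double-quoted predicate values avoids the backslash escapes, which makes a mistyped prefix easier to spot.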
