Parsing XML data with Namespaces in PHP


I'm trying to work with this XML feed that uses namespaces, and I'm not able to get past the colon in the tags. Here's what the XML feed looks like:
<r25:events pubdate="2010-05-19T13:58:08-04:00">
<r25:event xl:href="event.xml?event_id=328" id="BRJDMzI4" crc="00000022" status="est">
<r25:event_id>328</r25:event_id>
<r25:event_name>Testing 09/2005-08/2006</r25:event_name>
<r25:alien_uid/>
<r25:event_priority>0</r25:event_priority>
<r25:event_type_id xl:href="evtype.xml?type_id=105">105</r25:event_type_id>
<r25:event_type_name>CABINET</r25:event_type_name>
<r25:node_type>C</r25:node_type>
<r25:node_type_name>cabinet</r25:node_type_name>
<r25:state>1</r25:state>
<r25:state_name>Tentative</r25:state_name>
<r25:event_locator>2005-AAAAMQ</r25:event_locator>
<r25:event_title/>
<r25:favorite>F</r25:favorite>
<r25:organization_id/>
<r25:organization_name/>
<r25:parent_id/>
<r25:cabinet_id xl:href="event.xml?event_id=328">328</r25:cabinet_id>
<r25:cabinet_name>cabinet 09/2005-08/2006</r25:cabinet_name>
<r25:start_date>2005-09-01</r25:start_date>
<r25:end_date>2006-08-31</r25:end_date>
<r25:registration_url/>
<r25:last_mod_dt>2008-02-27T14:22:43-05:00</r25:last_mod_dt>
<r25:last_mod_user>abc00296004</r25:last_mod_user>
</r25:event>
</r25:events>
And here is the code I'm using. I'm trying to put these into a bunch of arrays so I can format the output however I want:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://somedomain.com/blah.xml");
curl_setopt ($ch, CURLOPT_HTTPHEADER, Array("Content-Type: text/xml"));
curl_setopt($ch, CURLOPT_USERPWD, "username:password");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
$xml = new SimpleXmlElement($output);
foreach ($xml->events->event as $entry){
$dc = $entry->children('http://www.collegenet.com/r25');
echo $entry->event_name . "<br />";
echo $entry->event_id . "<br /><br />";
}

Figured out the issue was with the XML feed rather than code:
XML feed was missing this line:
<r25:events xmlns:r25="http://www.collegenet.com/r25" xmlns:xl="http://www.w3.org/1999/xlink" pubdate="2010-05-19T13:58:08-04:00">
Thanks for the help though.
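For reference, a minimal sketch of reading the feed once the root element carries the declarations. The XML here is a trimmed copy of the feed above, and the array keys (id, name, href) are illustrative names only, not anything the feed defines:

```php
<?php
// Trimmed sample of the feed, with the xmlns declarations on the root.
$output = <<<XML
<r25:events xmlns:r25="http://www.collegenet.com/r25" xmlns:xl="http://www.w3.org/1999/xlink" pubdate="2010-05-19T13:58:08-04:00">
  <r25:event xl:href="event.xml?event_id=328" id="BRJDMzI4">
    <r25:event_id>328</r25:event_id>
    <r25:event_name>Testing 09/2005-08/2006</r25:event_name>
  </r25:event>
</r25:events>
XML;

$r25   = 'http://www.collegenet.com/r25';
$xlink = 'http://www.w3.org/1999/xlink';

$xml = new SimpleXMLElement($output);

$events = array();
// $xml already is the <r25:events> root, so ask it directly for its
// children in the r25 namespace.
foreach ($xml->children($r25)->event as $event) {
    $fields = $event->children($r25);
    $events[] = array(
        'id'   => (string) $fields->event_id,
        'name' => (string) $fields->event_name,
        // Namespaced attributes are fetched the same way, via attributes().
        'href' => (string) $event->attributes($xlink)->href,
    );
}

print_r($events);
```

Note that the loop starts from $xml itself rather than $xml->events, because the SimpleXMLElement object represents the root element.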

"All kinds of errors" isn't a helpful description; what errors are you actually getting?
You can pass the namespace prefix to the constructor; the fifth argument marks it as a prefix rather than a namespace URI:
$xml = new SimpleXMLElement($output, 0, false, 'r25', true);
See the manual.

Alternatively, since r25 is the only namespace used and therefore is not especially helpful, I just run this on the raw XML string before parsing it:
$output = preg_replace('/r25:/', '', $output);
That strips out the namespace prefix. Then you can navigate much more easily with SimpleXML, just like in your example.

Related

PHP DOMDocument getting elements by tag name ignores commented ones [duplicate]

I'm creating a little web app to help me manage and analyze the content of my websites, and cURL is my favorite new toy. I've figured out how to extract info about all sorts of elements, how to find all elements with a certain class, etc., but I am stuck on two problems (see below). I hope there is some nifty xpath answer, but if I have to resort to regular expressions I guess that's ok. Although I'm not so great with regex so if you think that's the way to go, I'd appreciate examples...
Pretty standard starting point:
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
$info .= "<br />cURL error number:" .curl_errno($ch);
$info .= "<br />cURL error:" . curl_error($ch);
return $info;
}
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
and extraction of info, for example:
// iframes
$iframes = $xpath->evaluate("/html/body//iframe");
$info .= '<h3>iframes ('.$iframes->length.'):</h3>';
for ($i = 0; $i < $iframes->length; $i++) {
// get iframe attributes
$iframe = $iframes->item($i);
$framesrc = $iframe->getAttribute("src");
$framewidth = $iframe->getAttribute("width");
$frameheight = $iframe->getAttribute("height");
$framealt = $iframe->getAttribute("alt");
$frameclass = $iframe->getAttribute("class");
$info .= $framesrc.' ('.$framewidth.'x'.$frameheight.'; class="'.$frameclass.'")'.'<br />';
}
Questions/Problems:
How to extract HTML comments?
I can't figure out how to identify the comments – are they considered nodes, or something else entirely?
How to get the entire content of a div, including child nodes? So if the div contains an image and a couple of hrefs, it would find those and hand it all back to me as a block of HTML.

Comment nodes should be easy to find in XPath with the comment() test, analogous to the text() test:
$comments = $xpath->query('//comment()'); // or another path, as you prefer
They are standard nodes: here is the manual entry for the DOMComment class.
To your other question, it's a bit trickier. The simplest way is to use saveXML() with its optional $node argument:
$html = $dom->saveXML($el); // $el should be the element you want to get the HTML for
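Putting both parts together, here is a small self-contained sketch. The HTML string and the id "box" are made up for illustration, and it uses saveHTML() with a node argument (available since PHP 5.3.6) instead of saveXML(), which is an alternative rather than what the answer above prescribes:

```php
<?php
// Hypothetical sample markup with two comments and a div with children.
$html = '<html><body><!-- first note --><div id="box">'
      . '<img src="a.png" alt=""><a href="/x">x</a><!-- inner --></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// 1. Comments are ordinary nodes (DOMComment); nodeValue holds their text.
$comments = array();
foreach ($xpath->query('//comment()') as $comment) {
    $comments[] = trim($comment->nodeValue);
}

// 2. Serialize one element, child nodes included, as a block of HTML.
$div     = $xpath->query('//div[@id="box"]')->item(0);
$divHtml = $dom->saveHTML($div);

print_r($comments);
echo $divHtml, "\n";
```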

For the HTML comments a fast method is:
function getComments($html) {
    $matches = array();
    // PREG_SET_ORDER groups the results per match, so $m[1] is the body of each comment
    if (preg_match_all('#<!--(.*?)-->#is', $html, $matches, PREG_SET_ORDER)) {
        $comments = array();
        foreach ($matches as $m) {
            $comments[] = $m[1];
        }
        return $comments;
    }
    // No comments matched
    return null;
}

This regex may help:
\s*<!--[\s\S]+?-->

For comments, you're looking for a recursive regex. For instance, to get rid of HTML comments:
$yourHTML = preg_replace('/<!--(?(?=<!--)(?R)|.)*?-->/s', '', $yourHTML);
To find them:
preg_match_all('/(<!--(?(?=<!--)(?R)|.)*?-->)/s', $yourHTML, $comments);

Parsing through XML with namespace

I'm trying to parse an XML file with a difficult namespace setup, but it does not work.
Hope you can help me with this issue.
This is my XML file, which is generated from a URL:
<searchresultresponse xmlns="urn:veloconnect:catalog-1.1" xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-1.0" xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-1.0" xmlns:vct="urn:veloconnect:transaction-1.0" xmlns:vco="urn:veloconnect:order-1.1" xmlns:vcc="urn:veloconnect:catalog-1.1">
<vct:buyersid>417641</vct:buyersid>
<vct:responsecode>200</vct:responsecode>
<vct:transactionid>dmVsb2Nvbm5lY3Rfc2VhcmNoLkhJUl9TUkNfMTIwMg</vct:transactionid>
<vct:statuscode>2</vct:statuscode>
<startindex>0</startindex>
<count>500</count>
<totalcount>51691</totalcount>
<resultformat>ITEM_TYPE</resultformat>
<cac:item>
<cbc:description>BREMSBELAG GALFER ORGAN. FD171-G1054 (KBA)</cbc:description>
<cac:sellersitemidentification>
<cac:id>04303400</cac:id>
</cac:sellersitemidentification>
<cac:standarditemidentification>
<cac:id identificationschemeid="EAN/UCC-13">8400160001718</cac:id>
</cac:standarditemidentification>
<cac:manufacturersitemidentification>
<cac:id>FD171-G1054</cac:id>
<cac:issuerparty>
<cac:partyname>
<cbc:name>Galfer</cbc:name>
</cac:partyname>
</cac:issuerparty>
</cac:manufacturersitemidentification>
<cac:baseprice>
<cbc:priceamount amountcurrencyid="EUR">5.95</cbc:priceamount>
<cbc:basequantity quantityunitcode="EA">1</cbc:basequantity>
</cac:baseprice>
<cac:recommendedretailprice>
<cbc:priceamount amountcurrencyid="EUR">10.9</cbc:priceamount>
<cbc:basequantity quantityunitcode="EA">1</cbc:basequantity>
</cac:recommendedretailprice>
</cac:item>
I grab this from a URL via PHP like this:
<?php
error_reporting(E_ALL);
$url = "http://somedomain.com/feed";
set_time_limit(0);
$ch = curl_init($url);// or any url you can pass which gives you the xml file
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_TIMEOUT, 50);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$xml = curl_exec($ch);
curl_close($ch);
$namespaces = $xml->getNameSpaces(true);
$cac = $xml->children($namespaces['cac']);
foreach ($xml as $entry){
$cac = $xml->children($namespaces['cac']);
echo $cac->item;
}
?>
I always get all the data shown without line breaks etc., but I need to fetch specific objects to save them in an array (later).
This namespace here is really weird.
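One possible approach, sketched under a few assumptions: the XML below is a shortened copy of the feed above with a closing root tag added, it is parsed from a string (so curl_exec() would need CURLOPT_RETURNTRANSFER set to hand the body back), and the array keys are illustrative names only. Each step into a different namespace goes through children() with that namespace's URI:

```php
<?php
// Shortened sample of the feed; the closing root tag is added here.
$raw = <<<XML
<searchresultresponse xmlns="urn:veloconnect:catalog-1.1"
    xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-1.0"
    xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-1.0"
    xmlns:vct="urn:veloconnect:transaction-1.0">
  <vct:responsecode>200</vct:responsecode>
  <totalcount>51691</totalcount>
  <cac:item>
    <cbc:description>BREMSBELAG GALFER ORGAN. FD171-G1054 (KBA)</cbc:description>
    <cac:sellersitemidentification>
      <cac:id>04303400</cac:id>
    </cac:sellersitemidentification>
    <cac:baseprice>
      <cbc:priceamount amountcurrencyid="EUR">5.95</cbc:priceamount>
    </cac:baseprice>
  </cac:item>
</searchresultresponse>
XML;

$xml = new SimpleXMLElement($raw);
$ns  = $xml->getNamespaces(true);   // prefix => URI map of namespaces used in the document

$items = array();
foreach ($xml->children($ns['cac'])->item as $item) {
    $cac   = $item->children($ns['cac']);   // <cac:...> children of the item
    $cbc   = $item->children($ns['cbc']);   // <cbc:...> children of the item
    $price = $cac->baseprice->children($ns['cbc'])->priceamount;

    $items[] = array(
        'description' => (string) $cbc->description,
        'sellers_id'  => (string) $cac->sellersitemidentification->children($ns['cac'])->id,
        'price'       => (float) (string) $price,
        // The attribute has no prefix, so read it through attributes()
        // with no namespace argument.
        'currency'    => (string) $price->attributes()->amountcurrencyid,
    );
}

print_r($items);
```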

Scrape a statistic from YouTube using PHP

After struggling for 3 hours at trying to do this on my own, I have decided that it is either not possible or not possible for me to do on my own. My question is as follows:
How can I scrape the numbers in the attached image using PHP to echo them in a webpage?
Image URL: http://gyazo.com/6ee1784a87dcdfb8cdf37e753d82411c
Please help. I have tried almost everything, from using cURL, to using a regex, to trying an xPath. Nothing has worked the right way.
I only want the numbers by themselves in order for them to be isolated, assigned to a variable, and then echoed elsewhere on the page.
Update:
http://youtube.com/exonianetwork - The URL I am trying to scrape.
/html/body[@class='date-20121213 en_US ltr ytg-old-clearfix guide-feed-v2 site-left-aligned exp-new-site-width exp-watch7-comment-ui webkit webkit-537']/div[@id='body-container']/div[@id='page-container']/div[@id='page']/div[@id='content']/div[@id='branded-page-default-bg']/div[@id='branded-page-body-container']/div[@id='branded-page-body']/div[@class='channel-tab-content channel-layout-two-column selected blogger-template ']/div[@class='tab-content-body']/div[@class='secondary-pane']/div[@class='user-profile channel-module yt-uix-c3-module-container ']/div[@class='module-view profile-view-module']/ul[@class='section'][1]/li[@class='user-profile-item '][1]/span[@class='value']
The xPath I tried, which didn't work for some unknown reason. No exceptions or errors were thrown, and nothing was displayed.

Perhaps a simple XPath would be easier to manipulate and debug.
Here's a Short Self-Contained Correct Example (watch for the space at the end of the class name):
#!/usr/bin/env php
<?php
$url = "http://youtube.com/exonianetwork";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html)
{
print "Failed to fetch page. Error handling goes here";
}
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$profile_items = $xpath->query("//li[@class='user-profile-item ']/span[@class='value']");
if ($profile_items->length === 0) {
print "No values found\n";
} else {
foreach ($profile_items as $profile_item) {
printf("%s\n", $profile_item->textContent);
}
}
?>
Execute:
% ./scrape.php
57
3,593
10,659,716
113,900
United Kingdom
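Because an exact @class comparison depends on details like that trailing space, a more tolerant variant is to match the class as a whitespace-separated token. A minimal sketch on made-up markup (the two list items are hypothetical stand-ins for the real page), which also strips the commas so the values can be used as numbers:

```php
<?php
// Hypothetical fragment: one class value has a trailing space, one does not.
$html = '<ul class="section">'
      . '<li class="user-profile-item "><span class="value">3,593</span></li>'
      . '<li class="user-profile-item"><span class="value">113,900</span></li>'
      . '</ul>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Exact string comparison: only the item whose class ends in a space matches.
$exact = $xpath->query("//li[@class='user-profile-item ']/span[@class='value']");

// Token comparison: pad the normalized class list with spaces, then look
// for the class name surrounded by spaces.
$token = $xpath->query(
    "//li[contains(concat(' ', normalize-space(@class), ' '), ' user-profile-item ')]"
    . "/span[@class='value']"
);

$values = array();
foreach ($token as $span) {
    // Drop the thousands separators before converting to an integer.
    $values[] = (int) str_replace(',', '', $span->textContent);
}

printf("exact: %d, token: %d\n", $exact->length, $token->length);
print_r($values);
```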

If you are willing to try a regex again, this pattern should work:
!Network Videos:</span>\r\n +<span class=\"value\">([\d,]+).+Views:</span>\r\n +<span class=\"value\">([\d,]+).+Subscribers:</span>\r\n +<span class=\"value\">([\d,]+)!s
It captures the numbers with their embedded commas, which would then need to be stripped out. I'm not familiar with PHP, so I cannot give you more complete code.

Parsing XML with PHP?

This has been driving me insane for about the last hour. I'm trying to parse a bit of XML out of Last.fm's API, I've used about 35 different permutations of the code below, all of which have failed. I'm really bad at XML parsing, lol. Can anyone help me parse the first toptags>tag>name 'name' from this XML API in PHP? :(
http://ws.audioscrobbler.com/2.0/?method=track.getinfo&api_key=b25b959554ed76058ac220b7b2e0a026&artist=Owl+city&track=fireflies
Which in that case ^ would be 'electronic'
Right now, all I have is this:
<?php
$xmlstr = file_get_contents("http://ws.audioscrobbler.com/2.0/?method=track.getinfo&api_key=b25b959554ed76058ac220b7b2e0a026&artist=Owl+city&track=fireflies");
$genre = new SimpleXMLElement($xmlstr);
echo $genre->lfm->track->toptags->tag->name;
?>
Which returns blank. No errors either, which is what's incredibly annoying!
Thank You very Much :) :) :)
Any help greatly, and by greatly I mean really, really greatly appreciated! :)

The <tag> tag is an array, so you should loop through them with a foreach or similar construct. In your case, just grabbing the first would look like this:
<?php
$xmlstr = file_get_contents("http://ws.audioscrobbler.com/2.0/?method=track.getinfo&api_key=b25b959554ed76058ac220b7b2e0a026&artist=Owl+city&track=fireflies");
$genre = new SimpleXMLElement($xmlstr);
echo $genre->track->toptags->tag[0]->name;
Also note that the <lfm> tag is not needed.
UPDATE
I find it's much easier to grab exactly what I'm looking for in a SimpleXMLElement by using print_r(). It'll show you what's an array, what's a simple string, what's another SimpleXMLElement, etc.
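To make that concrete without calling the API, here is a sketch on a hand-written sample. The XML below is an assumption about the response shape, inferred from the path in the question, not a real Last.fm response:

```php
<?php
// Hypothetical, heavily trimmed stand-in for the track.getinfo response.
$xmlstr = <<<XML
<lfm status="ok">
  <track>
    <name>Fireflies</name>
    <toptags>
      <tag><name>electronic</name></tag>
      <tag><name>pop</name></tag>
    </toptags>
  </track>
</lfm>
XML;

// The object returned here represents the <lfm> root itself,
// so paths start at its children.
$lfm = new SimpleXMLElement($xmlstr);

// First tag only.
$first = (string) $lfm->track->toptags->tag[0]->name;

// All tags, collected as plain strings.
$all = array();
foreach ($lfm->track->toptags->tag as $tag) {
    $all[] = (string) $tag->name;
}

echo $first, "\n";
print_r($all);
```

Casting to string is worth doing whenever the value is stored or compared, since the property access otherwise hands back another SimpleXMLElement.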

Try using
$url = "http://ws.audioscrobbler.com/2.0/?method=track.getinfo&api_key=b25b959554ed76058ac220b7b2e0a026&artist=Owl+city&track=fireflies";
$xml = simplexml_load_file($url);
echo $xml->track->toptags->tag[0]->name;

Suggestion: insert a statement to echo $xmlstr, and make sure you are getting something back from the API.

You don't need to reference lfm. Actually, $genre already is lfm. Try this:
echo $genre->track->toptags->tag->name;

If you want to read XML data, follow these steps:
$xmlURL = "your xml url / file name goes here";
try {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $xmlURL);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'Content-type: text/xml'
));
$content = curl_exec($ch);
$error = curl_error($ch);
curl_close($ch);
$obj = new SimpleXMLElement($content);
echo "<pre>";
var_dump($obj);
echo "</pre>";
}
catch(Exception $e){
var_dump($e);exit;
}
You will get the whole XML file dumped as a nested object structure.
Thanks.

Get div and the correct close tag preg

Preg has always been a tool I like, but I can't figure out for the life of me whether what I want to do is possible, let alone how to do it.
What I want is for preg_match to return a div's innerHTML. The problem is that the div I'm trying to read has more divs inside it, and my pattern keeps closing on the first closing tag it finds.
Here is my actual code:
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
preg_match('% <div id="torrent_details">(.*)</div> %six', $data, $match);
print_r($match);
This has been updated with TomcatExodus's help.
Live at :: http://megatorrentz.com/beta/details.php?hash=98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6

<?php
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($data);
libxml_use_internal_errors(false);
$div = $domd->getElementById("torrent_details");
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}

Using regular expressions often leads to problems when parsing markup documents.
XPath version - independent of the source layout. The only thing you need is a div with that id.
<?php
// $url is the address of the page to load
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);
$result = $xp->query("//*[@id = 'torrent_details']");
$div=$result->item(0);
if($result->length){
$out =new DOMDocument();
$out->appendChild($out->importNode($div, true));
echo $out->saveHTML();
}else{
echo "No such id";
}
?>
And this is the fix for Maerlyn's solution. It didn't work because getElementById() needs a DTD that declares which attribute is the ID. After all, you could build a document that uses "apple" as the record id, so something has to say that "id" really is the ID attribute for this tag.
<?php
// $data holds the fetched HTML, as in the question's cURL code
$domd = new DOMDocument();
$domd->validateOnParse = true;
@$domd->loadHTML($data);
//this doesn't work as the DTD is not specified
//or the specified id attribute is not the attributed called "id"
//$div = $domd->getElementById("torrent_details");
/*
* workaround found here: https://fosswiki.liip.ch/display/BLOG/GetElementById+Pitfalls
* set the "id" attribute as the real id
*/
$elements = $domd->getElementsByTagName('div');
if (!is_null($elements)) {
foreach ($elements as $element) {
//try-catch needed because of elements with no id
try{
$element->setIdAttribute('id', true);
}catch(Exception $e){}
}
}
//now it works
$div = $domd->getElementById("torrent_details");
//Print its content or error
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
?>
Both of the solutions work for me.
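If only the innerHTML is wanted (the children without the wrapping div), one option is to serialize each child node in turn. A minimal sketch on made-up markup, with a lazy regex alongside for contrast; the HTML string is hypothetical, and saveHTML() with a node argument needs PHP 5.3.6 or later:

```php
<?php
// Hypothetical page fragment with divs nested inside the target div.
$data = '<html><body><div id="torrent_details">'
      . '<div class="a">one</div><div class="b"><span>two</span></div>'
      . '</div><div id="other">x</div></body></html>';

// A lazy regex stops at the first closing tag it meets.
preg_match('%<div id="torrent_details">(.*?)</div>%s', $data, $m);
echo "regex: ", $m[1], "\n";   // only the start of the inner markup

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom->loadHTML($data);
libxml_clear_errors();

$xp  = new DOMXPath($dom);
$div = $xp->query("//*[@id='torrent_details']")->item(0);

// innerHTML: concatenate the serialized children, skipping the div itself.
$inner = '';
if ($div !== null) {
    foreach ($div->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}
echo "dom: ", $inner, "\n";
```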

You can do this:
/<div[^>]*>(.*)<\/div>/i
Which would give you the largest possible innerHTML.

You cannot. I will not link to the famous question, because I dislike the pointless drivel on top. But still regular expressions are unfit to match nested structures.
You can use some trickery, but this is neither reliable, nor necessarily fast:
preg_match_all('#<div id="1">((<div>.*?</div>|.)*?)</div>#ims'
Your regex had a problem due to the /x flag not matching the opening div. And you used a wrong assertion notation.

preg_match_all('% <div \s+ id="torrent_details">(?<innerHtml>.*)</div> %six', $html, $match);
echo $match['innerHtml'];
That one will work, but you should only need preg_match, not preg_match_all. If the pages are written well, there should only be one instance of id="torrent_details" on the given page.
I'm retracting my answer. This will not work properly. Use DOM for navigating the document.

Haha, did it with a bit of tampering. Thanks for the DOMDocument idea; I just used SimpleXML:
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$doc = new DOMDocument();
libxml_use_internal_errors(false);
$doc->strictErrorChecking = FALSE;
libxml_use_internal_errors(true);
$doc->loadHTML($data);
$xml = simplexml_import_dom($doc);
print_r($xml->body->table->tr->td->table[2]->tr->td[0]->span[0]->div);
