How to retrieve comments from within an XML document in PHP

I want to extract all comments below a specific node within an XML document, using PHP. I have tried both the SimpleXML and DOMDocument methods, but I keep getting blank outputs. Is there a way to retrieve comments from within a document without having to resort to Regex?

SimpleXML cannot handle comments, but the DOM extension can. Here's how you can extract all the comments. You just have to adapt the XPath expression to target the node you want.
$doc = new DOMDocument;
$doc->loadXML(
'<doc>
<node><!-- First node --></node>
<node><!-- Second node --></node>
</doc>'
);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//comment()') as $comment) {
    var_dump($comment->textContent);
}
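To restrict the search to comments below one specific node, one option is to locate that node first and then run a relative query against it. This is only a sketch: the id attributes and the nested child element are made up for illustration and are not part of the original document.

```php
<?php
// Hypothetical sample document; the id values are assumptions for this sketch.
$doc = new DOMDocument;
$doc->loadXML(
    '<doc>
        <node id="a"><!-- First node --><child><!-- Nested --></child></node>
        <node id="b"><!-- Second node --></node>
    </doc>'
);

$xpath = new DOMXPath($doc);

// Find the context node first, then query relative to it.
$target = $xpath->query('//node[@id="a"]')->item(0);

$comments = [];
if ($target !== null) {
    // ".//comment()" matches comments at any depth below $target;
    // "./comment()" would match only its direct children.
    foreach ($xpath->query('.//comment()', $target) as $comment) {
        $comments[] = trim($comment->nodeValue);
    }
}

print_r($comments);
```

Passing the context node as the second argument to query() keeps the expression short and avoids repeating the path to the node.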

Do you have access to an XPath API? XPath lets you find comments with an expression such as:
//comment()

Use XMLReader. Comments are easy to detect: they are nodes of type XMLReader::COMMENT.
For details see PHP documentation: The XMLReader class
Code example:
$reader = new XMLReader();
$reader->open('filename.xml');
// Initialize the result array so it exists even when no comment is found
$comments = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::COMMENT) {
        $comments[] = $reader->readOuterXml();
    }
}
$reader->close();
The $comments array then holds every comment found in the XML file.
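If only the text inside each comment is needed, the reader's value property can be read instead. The following is a minimal sketch that parses an inline string (copied from the sample document in the first answer) rather than a file, so it runs without any external input.

```php
<?php
$xml = '<doc><node><!-- First node --></node><node><!-- Second node --></node></doc>';

$reader = new XMLReader();
$reader->XML($xml); // parse from a string instead of a file

$comments = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::COMMENT) {
        // value holds the comment text without the <!-- --> delimiters
        $comments[] = trim($reader->value);
    }
}
$reader->close();

print_r($comments);
```

Because XMLReader streams the document, this approach stays cheap on memory for large files, at the cost of not being able to target a node with XPath.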

If you are using a SAX event-driven parser, the parser should have an event for comments. For example, with Expat you would implement a handler and register it using:
void XMLCALL
XML_SetCommentHandler(XML_Parser p,
XML_CommentHandler cmnt);

Related

Using Xpath with PHP to parse HTML

I'm currently trying to parse some data from a forum. Here is the code:
$xml = simplexml_load_file('https://forums.eveonline.com');
$names = $xml->xpath("html/body/div/div/form/div/div/div/div/div[*]/div/div/table//tr/td[@class='topicViews']");
foreach($names as $name)
{
echo $name . "<br/>";
}
Anyway, the problem is that I'm using a Google Chrome XPath extension to help me get the path, and I'm guessing that Chrome changes the HTML enough that the path does not match when my website runs this search. Is there some way to make the host look at the site the way Google Chrome does, so that it gets the right code? What would you suggest?
Thanks!
My suggestion is to always use DOMDocument as opposed to SimpleXML, since it's a much nicer interface to work with and makes tasks a lot more intuitive.
The following example shows you how to load the HTML into the DOMDocument object and query the DOM using XPath. All you really need to do is find all td elements with a class name of topicViews and this will output each of the nodeValue members found in the DOMNodeList returned by this XPath query.
/* Use internal libxml errors -- turn on in production, off for debugging */
libxml_use_internal_errors(true);
/* Create a new DOMDocument object */
$dom = new DomDocument;
/* Load the HTML */
$dom->loadHTMLFile("https://forums.eveonline.com");
/* Create a new XPath object */
$xpath = new DomXPath($dom);
/* Query all <td> nodes containing specified class name */
$nodes = $xpath->query("//td[@class='topicViews']");
/* Set HTTP response header to plain text for debugging output */
header("Content-type: text/plain");
/* Traverse the DOMNodeList object to output each DomNode's nodeValue */
foreach ($nodes as $i => $node) {
echo "Node($i): ", $node->nodeValue, "\n";
}
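One caveat with an exact @class comparison is that it stops matching as soon as the cell carries more than one class. A common XPath 1.0 idiom is to match the class name as a whole token instead. The markup below is a hypothetical stand-in for the forum page, not its real HTML.

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical markup standing in for the forum page.
$html = '<table><tr>
    <td class="topicViews">120</td>
    <td class="topicViews sticky">45</td>
    <td class="topicViewsExtra">9</td>
</tr></table>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the class list with spaces so "topicViews" matches only as a whole token.
$query = "//td[contains(concat(' ', normalize-space(@class), ' '), ' topicViews ')]";

$views = [];
foreach ($xpath->query($query) as $node) {
    $views[] = trim($node->nodeValue);
}

print_r($views);
```

The third cell is skipped because "topicViewsExtra" only contains the name as a substring, not as a separate class.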
A double slash ('//') makes XPath search all descendants at any depth. So the expression '//table' returns every table in the document.
You can also use it deeper in your path, for example 'html/body/div/div/form//table', to get all tables anywhere below 'html/body/div/div/form'.
This way you can make your code a bit more resilient against changes in the HTML source.
I do suggest learning a little about XPath if you want to use it. Copy and paste only gets you so far.
A simple explanation of the syntax can be found at w3schools.com/xml/xpath_syntax.asp
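To see the difference between '/' and '//' in practice, here is a small sketch that counts matches for three expressions. The HTML is an invented example with one table nested inside a form and one outside it.

```php
<?php
libxml_use_internal_errors(true);

// Invented example: one table nested below a form, one directly in the body.
$html = '<html><body>
    <div><form><div><table id="t1"><tr><td>a</td></tr></table></div></form></div>
    <table id="t2"><tr><td>b</td></tr></table>
</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Every table in the document, wherever it sits.
$all = $xpath->query('//table')->length;

// Tables at any depth below the form.
$underForm = $xpath->query('/html/body/div/form//table')->length;

// Only tables that are direct children of the form.
$direct = $xpath->query('/html/body/div/form/table')->length;

echo "all=$all underForm=$underForm direct=$direct\n";
```

The nested table is found by the '//' variants but not by the single-slash path, because a wrapper div sits between the form and the table.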

Finding and Echoing out a Specific ID from HTML document with PHP

I am grabbing the contents from Google with PHP. How can I search $page for an element with the id "lga" and echo out another of its properties? Say #lga is an image: how would I echo out its source?
No, I'm not going to do this with Google; Google is strictly an example and testing page.
<body><img id="lga" src="snail.png" /></body>
I want to find the element with the id "lga" and echo out its source; for the above code I would want to echo out "snail.png".
This is what I'm using and how I'm storing what I found:
<?php
$url = "https://www.google.com/";
$page = file($url);
foreach($page as $part){
}
?>
You can achieve this using the built-in DOMDocument class. This class allows you to work with HTML in a structured manner rather than parsing plain text yourself, and it's quite versatile:
$dom = new DOMDocument();
$dom->loadHTML($html);
To get the src attribute of the element with the id lga, you could simply use:
$imageSrc = $dom->getElementById('lga')->getAttribute('src');
Note that DOMDocument::loadHTML will generate warnings when it encounters invalid HTML. The method's doc page has a few notes on how to suppress these warnings.
Also, if you have control over the website you are parsing the HTML from, it might be more appropriate to have a dedicated script to serve the information you are after. Unless you need to parse exactly what's on a page as it is served, extracting data from HTML like this could be quite wasteful.
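Putting the pieces together, a minimal sketch might look like the following. It uses the one-line HTML snippet from the question as input instead of a fetched page, and it adds a null check, since getElementById returns null when no element has that id.

```php
<?php
// The snippet from the question, used here in place of a downloaded page.
$html = '<body><img id="lga" src="snail.png" /></body>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$img = $dom->getElementById('lga');

// Guard against a missing element before reading the attribute.
$imageSrc = $img !== null ? $img->getAttribute('src') : null;

echo $imageSrc, "\n";
```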

php: Getting the contents of a feedburner feed

I wrote this function to parse through html source code, but for some reason it does not work for feedburner feeds. Any ideas?
$dom = new DOMDocument();
$dom->loadHTMLFile('http://www.killington.com/winter/mountain/conditions');
$xml = simplexml_import_dom($dom);
$snow = $xml->xpath('//td');
What I really need to do is simply get the data from the page.
Not sure what the problem is, other than the fact that this isn't a feed; it's a web page. That said, since you're using DOMDocument there's no reason to bother with SimpleXML, and that may be where the problem is coming in...
$dom = new DOMDocument();
$dom->loadHTMLFile('http://www.killington.com/winter/mountain/conditions');
$xpath = new DOMXPath($dom);
$snow = $xpath->query('//td');
First of all, you must open the feed page (the xml one, for example) and check which kind of feed it is:
<rss xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
Then, you take a look at something like this good tutorial: http://net.tutsplus.com/articles/news/how-to-read-an-rss-feed-with-php-screencast/ and you're almost done :)
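For an RSS 2.0 feed like the one above, SimpleXML is usually enough. The sketch below parses an inline, made-up feed so that it runs without network access; a real feed would be loaded from its XML URL. The item contents and example.com links are assumptions for illustration.

```php
<?php
// Hypothetical feed content; a real feed would be fetched from its XML URL.
$rss = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <feedburner:origLink>http://example.com/original/1</feedburner:origLink>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($rss);

// Plain RSS elements are reachable as properties.
$titles = [];
foreach ($feed->channel->item as $item) {
    $titles[] = (string) $item->title;
}

// Namespaced elements have to be read through children() with the namespace URI.
$fb = $feed->channel->item[0]->children('http://rssnamespace.org/feedburner/ext/1.0');
$origLink = (string) $fb->origLink;

print_r($titles);
echo $origLink, "\n";
```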

XML validation against given DTD in PHP

In PHP, I am trying to validate an XML document using a DTD specified by my application - not by the externally fetched XML document. The validate method in the DOMDocument class seems to only validate using the DTD specified by the XML document itself, so this will not work.
Can this be done, and how, or do I have to translate my DTD to an XML schema so I can use the schemaValidate method?
(This seems to have been asked in "Validate XML using a custom DTD in PHP", but without a correct answer, since the solution there relies only on the DTD specified by the target XML.)
Note: XML validation could be subject to the Billion Laughs attack, and similar DoS vectors.
This essentially does what rojoca mentioned in his comment:
<?php
$xml = <<<END
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE foo SYSTEM "foo.dtd">
<foo>
<bar>baz</bar>
</foo>
END;
$root = 'foo';
$old = new DOMDocument;
$old->loadXML($xml);
$creator = new DOMImplementation;
// Empty strings instead of null: these parameters are typed as string
$doctype = $creator->createDocumentType($root, '', 'bar.dtd');
$new = $creator->createDocument(null, '', $doctype);
$new->encoding = "utf-8";
$oldNode = $old->getElementsByTagName($root)->item(0);
$newNode = $new->importNode($oldNode, true);
$new->appendChild($newNode);
$new->validate();
?>
This will validate the document against the bar.dtd.
You can't just call $new->loadXML(), because that would just set the DTD to the original, and the doctype property of a DOMDocument object is read-only, so you have to copy the root node (with everything in it) to a new DOM document.
I only just had a go with this myself, so I'm not entirely sure if this covers everything, but it definitely works for the XML in my example.
Of course, the quick-and-dirty solution would be to first get the XML as a string, search for the original DOCTYPE declaration, replace it with your own, and then load the result.
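A rough sketch of that quick-and-dirty variant follows. To keep it self-contained, the application's DTD is inlined as an internal subset rather than referenced as a file; the two element declarations are assumptions written to match the sample XML, not an existing bar.dtd.

```php
<?php
$xml = <<<END
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE foo SYSTEM "foo.dtd">
<foo>
<bar>baz</bar>
</foo>
END;

// Hypothetical application-side DTD, inlined as an internal subset so the
// sketch needs no external file.
$ownDoctype = '<!DOCTYPE foo [<!ELEMENT foo (bar)><!ELEMENT bar (#PCDATA)>]>';

// Swap the first DOCTYPE declaration. This simple pattern assumes the
// original declaration has no internal subset (no "[...]" part).
$patched = preg_replace('/<!DOCTYPE[^>]*>/', $ownDoctype, $xml, 1);

libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadXML($patched);

// validate() now checks against the declarations injected above.
$isValid = $doc->validate();
var_dump($isValid);
```

The string replacement is fragile compared with rebuilding the document through DOMImplementation, so it is probably best kept for input whose shape is known.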
I think that's only possible with XSD, see:
http://php.net/manual/en/domdocument.schemavalidate#62032

Using SimpleXML to create an XML object from scratch

Is it possible to use PHP's SimpleXML functions to create an XML object from scratch? Looking through the function list, there's ways to import an existing XML string into an object that you can then manipulate, but if I just want to generate an XML object programmatically from scratch, what's the best way to do that?
I figured out that you can use simplexml_load_string() and pass in the root string that you want, and then you've got an object you can manipulate by adding children... although this seems like kind of a hack, since I have to actually hardcode some XML into the string before it can be loaded.
I've done it using the DOMDocument functions, although it's a little confusing because I'm not sure what the DOM has to do with creating a pure XML document... so maybe it's just badly named :-)
Sure you can. For example:
<?php
$newsXML = new SimpleXMLElement("<news></news>");
$newsXML->addAttribute('newsPagePrefix', 'value goes here');
$newsIntro = $newsXML->addChild('content');
$newsIntro->addAttribute('type', 'latest');
header('Content-type: text/xml');
echo $newsXML->asXML();
?>
Output
<?xml version="1.0"?>
<news newsPagePrefix="value goes here">
<content type="latest"/>
</news>
Have fun.
In PHP 5, you should use the DOM extension's DOMDocument class instead.
Example:
$domDoc = new DOMDocument;
$rootElt = $domDoc->createElement('root');
$rootNode = $domDoc->appendChild($rootElt);
$subElt = $domDoc->createElement('foo');
$attr = $domDoc->createAttribute('ah');
$attrVal = $domDoc->createTextNode('OK');
$attr->appendChild($attrVal);
$subElt->appendChild($attr);
$subNode = $rootNode->appendChild($subElt);
$textNode = $domDoc->createTextNode('Wow, it works!');
$subNode->appendChild($textNode);
echo htmlentities($domDoc->saveXML());
Please see my answer here. As dreamwerx.myopenid.com points out, it is possible to do this with SimpleXML, but the DOM extension is the better and more flexible way. There is also a third way: XMLWriter. It is much simpler to use than the DOM, so it is my preferred way of writing XML documents from scratch.
$w=new XMLWriter();
$w->openMemory();
$w->startDocument('1.0','UTF-8');
$w->startElement("root");
$w->writeAttribute("ah", "OK");
$w->text('Wow, it works!');
$w->endElement();
echo htmlentities($w->outputMemory(true));
By the way: DOM stands for Document Object Model; it is the standardized API for working with XML documents.
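XMLWriter also handles nesting and escaping. As a sketch, here is the news document from the SimpleXML answer rebuilt with XMLWriter; the text 'Fish & chips' is an invented value, added only to show that special characters are escaped on output.

```php
<?php
$w = new XMLWriter();
$w->openMemory();
$w->setIndent(true); // pretty-print the output
$w->startDocument('1.0', 'UTF-8');

$w->startElement('news');
$w->writeAttribute('newsPagePrefix', 'value goes here');

// Nested element: each startElement() needs a matching endElement().
$w->startElement('content');
$w->writeAttribute('type', 'latest');
$w->text('Fish & chips'); // invented text; text() escapes the ampersand
$w->endElement(); // </content>

$w->endElement(); // </news>
$w->endDocument();

$xml = $w->outputMemory();
echo $xml;
```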
