Parse RDF XML file to get all rdf:about values - php

I am using php's simple xml and xpath to parse an rdf xml file and am struggling to get a list of all the rdf:about values.
Any advice?

There seems to be an issue when using SimpleXML with namespaced attributes prior to PHP 5.3. Basically, any name containing a : is dropped when it is converted to an object property of a SimpleXML element. The following will do, but feels hackish to me:
$rdf = str_replace('rdf:about', 'rdf_about', $rdf);
$rdf = new SimpleXMLElement($rdf);
foreach ($rdf->xpath('//@rdf_about') as $node) {
    echo $node, PHP_EOL;
}
See here:
http://groups.google.com/group/comp.lang.php/browse_thread/thread/d2a9b29ee21f7403/c6b24b6d398ece2c
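On current PHP versions the string replacement should not be needed. The following is a minimal sketch, assuming a small made-up RDF document (the example.org URIs and dc:title values are illustrative only): it registers the rdf prefix on the SimpleXML element and selects the attributes directly with XPath.

```php
<?php
// Hypothetical sample data; only the rdf namespace URI is taken from the answer.
$rdf = <<<'XML'
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/a">
    <dc:title>A</dc:title>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/b">
    <dc:title>B</dc:title>
  </rdf:Description>
</rdf:RDF>
XML;

$xml = new SimpleXMLElement($rdf);
// Make the prefix known to XPath explicitly instead of relying on the document.
$xml->registerXPathNamespace('rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');

$abouts = [];
foreach ($xml->xpath('//@rdf:about') as $attribute) {
    // Casting the attribute node to string yields its value.
    $abouts[] = (string) $attribute;
}

print_r($abouts);
```

The values end up in a plain array, which is usually easier to work with than echoing them one by one.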
You could use DOM instead of SimpleXml:
$dom = new DomDocument;
$dom->loadXml($rdf);
$xph = new DOMXPath($dom);
$xph->registerNamespace('rdf', "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
foreach ($xph->query('//@rdf:about') as $attribute) {
    echo $attribute->value, PHP_EOL;
}
But I suggest using a dedicated RDF library for this rather than SimpleXML or DOM:
http://arc.semsol.org/docs/v2/parsing
http://www.seasr.org/wp-content/plugins/meandre/rdfapi-php/doc/
http://librdf.org/raptor/
http://phpxmlclasses.sourceforge.net/show_doc.php?class=class_rdf_parser.html
And here's a blog post about the parsers:
http://www.wasab.dk/morten/blog/archives/2004/05/31/easy-rdf-parsing-with-php

Related

PHP: Keeping HTML inside XML node without CDATA

I've got an xml like this:
<father>
<son>Text with <b>HTML</b>.</son>
</father>
I'm using simplexml_load_string to parse it into SimpleXmlElement. Then I get my node like this
$xml->father->son->__toString(); //output: "Text with .", but expected "Text with <b>HTML</b>."
I need to handle simple HTML such as:
<b>text</b> or <br/> inside the xml which is sent by many users.
My problem is that I can't just ask them to use CDATA, because they won't be able to handle it properly and they are already used to doing without it.
Also, if possible, I don't want the file to be edited, because the information needs to be exactly what the user sent.
When I read the value as a string, SimpleXML drops the nested HTML element together with everything inside it.
How can I keep the information?
SOLUTION
To handle the problem I used asXml as explained by @ThW:
$tmp = $xml->father->son->asXml(); //<son>Text with <b>HTML</b>.</son>
I just added a preg_match to erase the node.
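The step of stripping the outer element could look roughly like the sketch below. It is only an illustration: it uses preg_replace rather than preg_match, the pattern is an assumption about what the asker did, and in this sample the root element is father, so the node is reached as $xml->son.

```php
<?php
$xml = simplexml_load_string('<father><son>Text with <b>HTML</b>.</son></father>');

// asXML() on a child element returns that element including its own tags.
$tmp = $xml->son->asXML();

// Remove the opening and closing <son> tags, keep everything in between.
// The "s" modifier lets "." match newlines inside the content.
$inner = preg_replace('~^<son\b[^>]*>(.*)</son>$~s', '$1', trim($tmp));

echo $inner, PHP_EOL;
```

A regular expression is fine for a fixed, known element name like this one; the DOM-based approach in the answer is more robust when the structure varies.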
A CDATA section is a character node, just like a text node, but it involves less encoding/decoding. This is mostly a downside, actually. On the upside, content in a CDATA section can be more readable for a human, and it allows for some backwards compatibility in special cases (think HTML script tags).
For an XML API they are nearly the same. Here is a small DOM example (SimpleXML abstracts too much).
$document = new DOMDocument();
$father = $document->appendChild(
$document->createElement('father')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createTextNode('With <b>HTML</b><br>It\'s so nice.')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createCDataSection('With <b>HTML</b><br>It\'s so nice.')
);
$document->formatOutput = TRUE;
echo $document->saveXml();
Output:
<?xml version="1.0"?>
<father>
  <son>With &lt;b&gt;HTML&lt;/b&gt;&lt;br&gt;It's so nice.</son>
  <son><![CDATA[With <b>HTML</b><br>It's so nice.]]></son>
</father>
As you can see, they are serialized very differently, but from the API's point of view they are basically interchangeable. If you're using an XML parser, the value you get back should be the same in both cases.
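To check that claim, here is a small sketch that parses a document with one escaped text node and one CDATA section and compares the values the parser hands back. The input mirrors the serialized output above; the variable names are made up for the example.

```php
<?php
$xml = <<<'XML'
<father>
  <son>With &lt;b&gt;HTML&lt;/b&gt;&lt;br&gt;It's so nice.</son>
  <son><![CDATA[With <b>HTML</b><br>It's so nice.]]></son>
</father>
XML;

$document = new DOMDocument();
$document->loadXML($xml);

// Collect the decoded character data of each <son> element.
$values = [];
foreach ($document->getElementsByTagName('son') as $son) {
    $values[] = $son->textContent;
}

// Both representations decode to the same string.
var_dump($values[0] === $values[1]);
echo $values[0], PHP_EOL;
```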
So the first possibility is just letting the HTML fragment be stored in a character node. It is just a string value for the outer XML document itself.
The other way would be using XHTML. XHTML is XML-compatible HTML. You can mix and match different XML formats, so you could add the XHTML fragment as part of the outer XML.
That seems to be what you're receiving. But SimpleXML has some problems with mixed nodes, so here is an example of how you can read it with DOM.
$xml = <<<'XML'
<father>
<son>With <b>HTML</b><br/>It's so nice.</son>
</father>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$result = '';
foreach ($xpath->evaluate('/father/son[1]/node()') as $child) {
$result .= $document->saveXml($child);
}
echo $result;
Output:
With <b>HTML</b><br/>It's so nice.
Basically you need to save each child of the son element as XML.
SimpleXML is based on the same DOM library internally. That allows you to convert a SimpleXMLElement into a DOM node. From there you can again save each child as XML.
$father = new SimpleXMLElement($xml);
$sonNode = dom_import_simplexml($father->son);
$document = $sonNode->ownerDocument;
$result = '';
foreach ($sonNode->childNodes as $child) {
$result .= $document->saveXml($child);
}
echo $result;

SimpleXML how to get line number of a node?

I'm using this in SimpleXML and PHP:
foreach ($xml->children() as $node) {
echo $node->attributes('namespace')->id;
}
That prints the id attribute of all nodes (using a namespace).
But now I want to know the line number that $node is located in the XML file.
I need the line number, because I'm analyzing the XML file, and returning to the user information of possible issues to resolve them. So I need to say something like: "Here you have an error at line X". I'm sure that the XML file would be in a standard format that will have enough line breaks for this to be useful.
It is possible with DOM. DOMNode provides the function getLineNo().
DOM
$xml = <<<'XML'
<foo>
<bar/>
</foo>
XML;
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
var_dump(
$xpath->evaluate('//bar[1]')->item(0)->getLineNo()
);
Output:
int(2)
SimpleXML
SimpleXML is based on DOM, so you can convert SimpleXMLElement objects to DOMElement objects.
$element = new SimpleXMLElement($xml);
$node = dom_import_simplexml($element->bar);
var_dump($node->getLineNo());
And yes, most of the time if you have a problem with SimpleXML, the answer is to use DOM.
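Applied to the loop from the question, this could be sketched as follows. The sample document, the urn:example namespace and the id values are assumptions made up for the illustration; the line numbers are whatever libxml recorded while parsing.

```php
<?php
$xml = <<<'XML'
<items xmlns:x="urn:example">
  <item x:id="a"/>
  <item x:id="b"/>
</items>
XML;

$root = new SimpleXMLElement($xml);

$lines = [];
foreach ($root->children() as $node) {
    // Read the namespaced id attribute, as in the question.
    $id = (string) $node->attributes('urn:example')->id;
    // Convert to a DOM node to reach getLineNo().
    $lines[$id] = dom_import_simplexml($node)->getLineNo();
}

foreach ($lines as $id => $line) {
    echo "Node with id {$id} is at line {$line}", PHP_EOL;
}
```

From here a message such as "error at line X" is a matter of formatting the collected numbers.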
XMLReader
XMLReader has the line numbers internally, but there is no direct method to access them. Again you will have to convert the current node into a DOMNode. This works because both use libxml2. Note that expanding reads the node and all its descendants into memory, so be careful with it.
$reader = new XMLReader();
$reader->open('data://text/xml;base64,'.base64_encode($xml));
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'bar') {
        var_dump($reader->expand()->getLineNo());
}
}

Getting an element from PHP DOM and changing its value

I'm using PHP/Zend to load html into a DOM, and then I get a specific div id that I want to modify.
$dom = new Zend_Dom_Query($html);
$element = $dom->query('div[id="someid"]');
How do I modify the text/content/html displayed inside that $element div, and then save the changes to the $dom or $html so I can print the modified html? Any idea how to do this?
Zend_Dom_Query is tailored just for querying a dom, so it doesn't provide an interface in and of itself to alter the dom and save it, but it does expose the PHP Native DOM objects that will let you do so. Something like this should work:
$dom = new Zend_Dom_Query($html);
$document = $dom->getDocument();
$elements = $dom->query('div[id="someid"]');
foreach ($elements as $element) {
    // $element is an instance of DOMElement (http://www.php.net/DOMElement)
    // You have to create new nodes off the document
    $node = $document->createElement("div", "contents of div");
    $element->appendChild($node);
}
$newHtml = $document->saveXml();
Take a look at the PHP Doc for DOMElement to get an idea of how you can alter the dom:
http://www.php.net/DOMElement
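If the goal is to replace the existing content of the div rather than append to it, the native DOM calls could be combined as in this sketch. To keep it runnable without Zend Framework it swaps Zend_Dom_Query for DOMXPath, which is an assumption of the example and not part of the answer above; the sample HTML is made up as well.

```php
<?php
$html = '<html><body><div id="someid">old text</div><p>untouched</p></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);

$xpath = new DOMXPath($document);
foreach ($xpath->query('//div[@id="someid"]') as $element) {
    // Drop all current children of the div.
    while ($element->firstChild) {
        $element->removeChild($element->firstChild);
    }
    // New nodes always have to be created from the owning document.
    $element->appendChild($document->createTextNode('new contents'));
}

// saveHTML() serializes with HTML rules instead of XML rules.
$newHtml = $document->saveHTML();
echo $newHtml;
```

With Zend_Dom_Query the loop body stays the same, since the query result yields the same DOMElement objects.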

Get instance of nodes by their name in SimpleXML (PHP)

I'd like to search for nodes with the same node name in a SimpleXML Object no matter how deep they are nested and create an instance of them as an array.
In the HTML DOM I can do that with JavaScript by using getElementsByTagName(). Is there a way to do that in PHP as well?
Yes, use XPath:
$xml->xpath('//div');
Here $xml is your SimpleXML object.
In this example you will get an array of all 'div' elements, no matter how deeply they are nested.
$fname = dirname(__FILE__) . '\\xml\\crRoll.xml';
$dom = new DOMDocument;
$dom->load($fname, LIBXML_DTDLOAD|LIBXML_DTDATTR);
$root = $dom->documentElement;
$xpath = new DOMXpath($dom);
$xpath->registerNamespace('cr', "http://www.w3.org/1999/xhtml");
$candidateNodes = $xpath->query("//cr:break");
foreach ($candidateNodes as $child) {
$max = $child->getAttribute('tstamp');
}
This finds all the BREAK nodes (tstamp attr) using XPath ...
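getElementsByTagName() itself exists only in DOM (DOMDocument and DOMElement), not in SimpleXML.
However, you can import a SimpleXML element into DOM with dom_import_simplexml(), or simply parse the XML with DOMDocument in the first place.
Another answer mentioned XPath. Note that //div matches nested elements as well, so content appears more than once in the result if you have something like:
<div><div>1</div></div>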
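A short sketch of that nested case, assuming the snippet is wrapped in a made-up root element: XPath on the SimpleXML object returns both the outer and the inner div, and the DOM method reached through dom_import_simplexml() reports the same two elements.

```php
<?php
$xml = new SimpleXMLElement('<root><div><div>1</div></div></root>');

// XPath: every div at any depth, outer and inner alike.
$divs = $xml->xpath('//div');

$texts = [];
foreach ($divs as $div) {
    // textContent includes the text of descendants, so "1" shows up for both.
    $texts[] = dom_import_simplexml($div)->textContent;
}

// The DOM equivalent of JavaScript's getElementsByTagName().
$domCount = dom_import_simplexml($xml)->getElementsByTagName('div')->length;

echo count($divs), ' matches via XPath, ', $domCount, ' via DOM', PHP_EOL;
print_r($texts);
```

If only the outermost matches are wanted, the query has to say so, for example with //div[not(ancestor::div)].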

How can I find text nodes in an HTML snippet?

I'm trying to parse an HTML snippet, using the PHP DOM functions. I have stripped out everything apart from paragraph, span and line break tags, and now I want to retrieve all the text, along with its accompanying styles.
So, I'd like to get each piece of text, one by one, and for each one I can then go back up the tree to get the values of particular attributes (I'm only interested in some specific ones, like color etc.).
How can I do this? Or am I thinking about it the wrong way?
Suppose you have a DOMDocument here:
$doc = new DOMDocument();
$doc->loadHTMLFile('http://stackoverflow.com/');
You can find all text nodes using a simple Xpath.
$xpath = new DOMXpath($doc);
$textNodes = $xpath->query('//text()');
Just foreach over it to iterate over all textnodes:
foreach ($textNodes as $textNode) {
echo $textNode->data . "\n";
}
From that, you can go up the DOM tree by using ->parentNode.
Hope that this can give you a good start.
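Going up the tree from each text node might look like the sketch below. The HTML snippet and the choice of the style attribute are assumptions for the illustration; the loop stops at the nearest ancestor element that carries the attribute.

```php
<?php
$snippet = '<p style="color: red">Hello <span style="color: blue">world</span><br>again</p>';

$doc = new DOMDocument();
$doc->loadHTML($snippet);

$xpath = new DOMXPath($doc);

$found = [];
foreach ($xpath->query('//text()') as $textNode) {
    $style = null;
    // Walk up through the ancestor elements until one has a style attribute.
    for ($node = $textNode->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
        if ($node->hasAttribute('style')) {
            $style = $node->getAttribute('style');
            break;
        }
    }
    $found[] = ['text' => $textNode->data, 'style' => $style];
}

print_r($found);
```

Each entry pairs a piece of text with the closest style that applies to it, which matches the "one piece of text at a time" approach described in the question.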
For those who are more comfortable with CSS selectors and are willing to include a single extra PHP class in their project, I would suggest PHP Simple HTML DOM Parser. The solution would look something like the following:
$html = file_get_html('http://www.example.com/');
$ret = $html->find('p, span');
$store = array();
foreach($ret as $element) {
$store[] = array($element->tag => array('text' => $element->innertext,
'color' => $element->color,
'style' => $element->style));
}
print_r($store);
