ignoring nested elements when parsing xml with php - php

Probably a simple question for someone to answer:
xml:
<foobar>
<foo>i am a foo</foo>
<bar>i am a bar</bar>
<foo>i am a <bar>bar</bar></foo>
</foobar>
In the above, I want to display all elements that are <foo>. When the script gets to the line with the nested <bar>, the result is "i am a bar", which isn't the result I had hoped for.
Is it not possible to print out the entire contents of that element as it is, so that I see: "i am a <bar>bar</bar>"?
php:
$xml = file_get_contents('sample');
$dom = new DOMDocument;
@$dom->loadHTML($xml);
$resources= $dom->getElementsByTagName('foo');
foreach ($resources as $resource){
echo $resource->nodeValue . "\n";
}

After some trawling and trying to do what I needed with SimpleXML, I arrived at the following conclusion. My issue with SimpleXML was where the elements are. If the XML is structured and the hierarchy is standard, I have no problem.
If the XML is a web page for example, and the <foo> element is anywhere, SimpleXML doesn't have a good facility like getElementsByTagName to pull out the element wherever it may be....
<?php
$doc = new DOMDocument();
$doc->load('sample');
$element_name = 'foo';
if ($doc->getElementsByTagName($element_name)->length > 0) {
$resources = $doc->getElementsByTagName($element_name);
foreach ($resources as $resource) {
$id = null;
if (!$resource->hasAttribute('id')) {
$resource->setAttribute('id', gen_uuid());
}
$innerHTML = null;
$children = $resource->childNodes;
foreach ($children as $child) {
$tmp_doc = new DOMDocument();
$tmp_doc->appendChild($tmp_doc->importNode($child,true));
$innerHTML .= rtrim($tmp_doc->saveHTML());
}
// Property names are case-sensitive: nodeValue, not nodevalue.
$resource->nodeValue = $innerHTML;
}
}
echo $doc->saveHTML();
?>

Rather than writing all that code, you might try XPath. The expression would be "//foo", which returns a list of all the elements named "foo", wherever they are in the document.
http://php.net/manual/en/simplexmlelement.xpath.php
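A minimal sketch of the XPath approach, assuming the sample document from the question is inlined as a string (the variable names are illustrative). It uses DOMXPath rather than SimpleXML so that each child node can be serialized with saveXML(), which keeps the nested <bar> markup:

```php
<?php
// Sample document from the question, inlined so the sketch is self-contained.
$xml = '<foobar><foo>i am a foo</foo><bar>i am a bar</bar>'
     . '<foo>i am a <bar>bar</bar></foo></foobar>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$results = array();

// "//foo" selects every <foo> element, wherever it sits in the tree.
foreach ($xpath->query('//foo') as $foo) {
    $inner = '';
    // Serialize each child (text or element) so nested markup is kept.
    foreach ($foo->childNodes as $child) {
        $inner .= $doc->saveXML($child);
    }
    $results[] = $inner;
}

echo implode("\n", $results), "\n";
```

The top-level <bar> is skipped because it is not a <foo>, while the <bar> nested inside the second <foo> is kept as markup.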

Related

How to get a list of all html elements in PHP?

According to the documentation for DOMDocument::getElementsByTagName, I can call the function with "*" argument, and get a list of all HTML elements from some HTML code.
However, with the following code:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$nodes = $dom->getElementsByTagName("*");
foreach ($nodes as $node) {
$new_text= new DOMText($node->textContent."MODIFIED");
$node->removeChild($node->firstChild);
$node->appendChild($new_text);
}
$content = $dom->saveHTML();
echo $content;
?>
I get a list of only one element, and the result of execution of the code above is:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>hellobyeMODIFIED</html>
while I would expect something like this:
<html><body><div>helloMODIFIED</div><div>byeMODIFIED</div></body></html>
Shouldn't DOMDocument::getElementsByTagName method return a list of as many HTML elements as available in the HTML code?
Note: I need to create DOMText instances explicitly, because I need this to work in PHP 5.4. DOMNode::textContent is accessible for writing only from PHP 5.6
The DOMDocument::getElementsByTagName method actually returns all the tags, if the first argument is '*'. But your code replaces <body> tag (including all child nodes) with a text node at the first iteration.
Iterate the nodes, and modify only the nodes with nodeType property equal to XML_TEXT_NODE:
$nodes = $dom->getElementsByTagName('*');
foreach ($nodes as $node) {
for ($child = $node->firstChild; $child; $child = $child->nextSibling) {
if (! ($child->nodeType === XML_TEXT_NODE && trim($child->textContent))) {
continue;
}
// The textContent is writable since PHP 5.6.1
if (PHP_VERSION_ID >= 50601) {
$child->textContent .= 'MODIFIED';
continue;
}
// For older versions, create DOMText explicitly
$text = new DOMText($child->textContent . 'MODIFIED');
try {
if ($child->parentNode->replaceChild($text, $child))
$child = $text;
} catch (Exception $e) {
trigger_error("Failed to modify text '$child->textContent': "
. $e->getMessage(), E_USER_WARNING);
}
}
}
echo $dom->saveHTML();
Note, for PHP versions 5.6.1 and newer, you don't need to create DOMText instances explicitly, since the DOMNode::textContent property is accessible for read and write. So you can simply modify the text by assigning a string value to this property. Only make sure that the node has no child nodes other than XML_TEXT_NODE.
The code above checks if trim($child->textContent) is not empty, because the document may contain extra space characters (including newline), e.g.:
<div><!-- newline/spaces -->
<span>text</span><!-- newline/spaces -->
</div><!-- newline/spaces -->
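To see why the trim() check matters, here is a small sketch that counts whitespace-only text nodes. The markup is an illustrative assumption, and it is parsed with loadXML() so that whitespace handling is predictable:

```php
<?php
// Illustrative markup: a <div> formatted across several lines.
$dom = new DOMDocument();
$dom->loadXML("<div>\n  <span>text</span>\n</div>");

$blank = 0;    // text nodes containing only whitespace
$nonBlank = 0; // text nodes carrying visible text

foreach ($dom->getElementsByTagName('*') as $node) {
    for ($child = $node->firstChild; $child; $child = $child->nextSibling) {
        if ($child->nodeType !== XML_TEXT_NODE) {
            continue;
        }
        if (trim($child->textContent) === '') {
            $blank++;
        } else {
            $nonBlank++;
        }
    }
}

echo "whitespace-only text nodes: $blank\n";
echo "text nodes with content: $nonBlank\n";
```

Without the check, each whitespace-only node would also get the "MODIFIED" suffix.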
DOMDocument::getElementsByTagName returns a new DOMNodeList instance containing all the matching elements.
And it works fine:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$nodes = $dom->getElementsByTagName("*");
foreach ($nodes as $node) {
echo $node->tagName."<br />";
}
?>
It outputs all the tags in your document.
You probably need something like:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$nodes = $dom->getElementsByTagName("*");
foreach ($nodes as $node) {
if ($node->tagName=='div'){
$node->nodeValue .= "new content";
}
}
$content = $dom->saveHTML();
echo htmlspecialchars($content);
?>
Try this:
foreach($dom->getElementsByTagName('*') as $element ){
}

How to query a DOMNode using XPath in PHP?

I'm trying to get the bing search results with XPath. Here is my code:
$html = file_get_contents("http://www.bing.com/search?q=bacon&first=11");
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHtml($html);
$x = new DOMXpath($doc);
$output = array();
// just grab the urls for now
foreach ($x->query("//li[@class='b_algo']") as $node)
{
//$output[] = $node->getAttribute("href");
$tmpDom = new DOMDocument();
$tmpDom->loadHTML($node);
$tmpDP = new DOMXPath($tmpDom);
echo $tmpDP->query("//div[@class='b_title']//h2//a//href");
}
return $output;
This foreach iterates over all results. All I want to do is extract the link and text from $node inside the foreach, but because $node itself is an object I can't create a DOMDocument from it. How can I query it?
First of all, your XPath expression tries to match non-existent href subelements; query @href for the attribute instead.
You don't need to create any new DOMDocuments, just pass the $node as the context item:
foreach ($x->query("//li[@class='b_algo']") as $node)
{
var_dump( $x->query("./div[@class='b_title']//h2//a//@href", $node)->item(0) );
}
If you're just interested in the URLs, you could also query them directly:
foreach ($x->query("//li[@class='b_algo']/div[@class='b_title']/h2/a/@href") as $node)
{
var_dump($node);
}
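A self-contained sketch of the context-node technique that collects both the link and its text. The markup below is a hypothetical stand-in for the search results page, not the real response:

```php
<?php
// Hypothetical markup that mimics the structure described in the question.
$html = '<ul>'
      . '<li class="b_algo"><div class="b_title"><h2>'
      . '<a href="http://example.com/a">First</a></h2></div></li>'
      . '<li class="b_algo"><div class="b_title"><h2>'
      . '<a href="http://example.com/b">Second</a></h2></div></li>'
      . '</ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);

$x = new DOMXPath($doc);
$output = array();

foreach ($x->query("//li[@class='b_algo']") as $li) {
    // The second argument makes $li the context node; "." refers to it.
    $a = $x->query("./div[@class='b_title']/h2/a", $li)->item(0);
    if ($a !== null) {
        $output[] = array(
            'url'  => $a->getAttribute('href'),
            'text' => $a->textContent,
        );
    }
}

print_r($output);
```

Selecting the <a> element rather than its @href attribute node gives access to both the attribute and the link text in one query.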

DOMDOCUMENT | PHP: save getElementById output into new HTML file

I'm trying to save the result of getElementById using PHP.
The code I have:
<?php
$doc = new DOMDocument();
$doc->validateOnParse = true;
@$doc->loadHTMLFile('test.htm');
$div = $doc->getElementById('storytext');
echo $doc->saveHTML($div);
?>
This displays the relevant text. I now want to save it to a new file. I have tried save(), saveHTMLFile() and file_put_contents(), but none of them work because they only save strings and I cannot turn $div into a string, so I'm stuck.
If I just save the entire thing:
$doc->saveHTMLfile('name.ext');
It works but it saves everything, not just the part that I need.
I'm a complete DOM noob so I may be missing something very simple but I can't really find much about this through my searches.
function getInnerHtml( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
$html = getInnerHtml($div);
file_put_contents("name.ext", $html);
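As a possible alternative to the helper above: DOMDocument::saveHTML() already returns a string when it is passed a node, so its result can be handed to file_put_contents() directly. A minimal sketch, in which an inline HTML string and a temporary file are assumptions standing in for test.htm and name.ext:

```php
<?php
$doc = new DOMDocument();
libxml_use_internal_errors(true);
// Inline stand-in for the contents of test.htm.
$doc->loadHTML(
    '<html><body>'
    . '<div id="storytext"><p>Once upon a time</p></div>'
    . '<div id="other">skip me</div>'
    . '</body></html>'
);

$div = $doc->getElementById('storytext');

// saveHTML($node) serializes only that node and returns a string.
$fragment = $doc->saveHTML($div);

// Temporary file used here in place of name.ext.
$path = tempnam(sys_get_temp_dir(), 'story');
file_put_contents($path, $fragment);

echo file_get_contents($path), "\n";
```

Unlike getInnerHtml(), this keeps the outer <div id="storytext"> wrapper in the saved output.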

Modify html attribute with php

I have a html string that contains exactly one a-element in it. Example:
test
In php I have to test if rel contains external and if yes, then modify href and save the string.
I have looked for DOM nodes and objects. But they seem to be too much for only one A-element, as I have to iterate to get html nodes and I am not sure how to test if rel exists and contains external.
$html = new DOMDocument();
$html->loadHtml($txt);
$a = $html->getElementsByTagName('a');
$attr = $a->item(0)->attributes;
...
At this point I get a DOMNamedNodeMap, which seems like overhead. Is there any simpler way to do this, or should I do it with DOM?
Is there any simpler way to do this, or should I do it with DOM?
Do it with DOM.
Here's an example:
<?php
$html = 'test';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]");
foreach($nodes as $node) {
$node->setAttribute('href', 'http://example.org');
}
echo $dom->saveHTML();
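A runnable sketch of the same idea, wrapped in a function. The function name rewriteExternalHref and the input strings are hypothetical, since the original example markup is not shown; it assumes the string contains exactly one <a> element, as stated in the question:

```php
<?php
// Hypothetical helper: sets href on the link only if rel contains "external".
function rewriteExternalHref($html, $newHref)
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Pad rel with spaces so "external" matches only as a whole token.
    $query = "//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]";
    foreach ($xpath->query($query) as $a) {
        $a->setAttribute('href', $newHref);
    }

    // Serialize only the <a>, without the wrapper tags loadHTML() adds.
    $link = $dom->getElementsByTagName('a')->item(0);
    return $dom->saveHTML($link);
}

// Assumed sample input.
$changed = rewriteExternalHref(
    '<a href="http://old.example.com" rel="nofollow external">test</a>',
    'http://example.org'
);
echo $changed, "\n";
```

Because of the space padding, a value such as rel="externals" is left untouched, which a plain substring test would not guarantee.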
I kept going to modify with DOM. This is what I get:
$html = new DOMDocument();
$html->loadHtml('<?xml encoding="utf-8" ?>' . $txt);
$nodes = $html->getElementsByTagName('a');
foreach ($nodes as $node) {
foreach ($node->attributes as $att) {
if ($att->name == 'rel') {
if (strpos($att->value, 'external') !== false) {
$node->setAttribute('href','modified_url_goes_here');
}
}
}
}
$txt = $html->saveHTML();
I did not want to load any other library for just this one string.
The best way is to use a HTML parser/DOM, but here's a regex solution:
$html = 'test<br>
<p> Some text</p>
test2<br>
<a rel="external">test3</a> <-- This won\'t work since there is no href in it.
';
$new = preg_replace_callback('/<a.+?rel\s*=\s*"([^"]*)"[^>]*>/i', function($m){
if(strpos($m[1], 'external') !== false){
$m[0] = preg_replace('/href\s*=\s*(("[^"]*")|(\'[^\']*\'))/i', 'href="http://example.com"', $m[0]);
}
return $m[0];
}, $html);
echo $new;
You could use a regular expression:
if the tag matches /\s+rel\s*=\s*".*external.*"/
then do a regex replace like
/(<a.*href\s*=\s*")([^"]*)("[^>]*>)/\1[your new href here]\3/
Though using a library that can do this kind of thing for you is much easier (like jQuery for JavaScript).

how to filter XML

I'm looking to filter this XML (linked in the original post).
Is the best way to filter XML:
- to walk through the whole XML and assign the values to variables,
- and then rewrite the XML from those variables?
Is there any other method?
For this method I've used the DOM like this:
$flux= new DOMDocument();
if ($flux->load('http://xml.weather.com/weather/local/FRXX0076?unit=m&hbhf=6&ut=C'))
{$loc=$flux->getElementsByTagName('t');
foreach($loc as $lo)
echo $lo->firstChild->nodeValue . "<br />";
}
In this code I tried to display <t>, but there are two <t> tags in each <hour>, so I get two <t> values instead of only the first <t> child of <hour>.
The brief answer:
$flux= new DOMDocument();
if ($flux->load('http://xml.weather.com/weather/local/FRXX0076?unit=m&hbhf=6&ut=C'))
{
$xpath = new DomXPath($flux);
$elements = $xpath->query("*/hour/t[1]");
if (!is_null($elements)) {
foreach ($elements as $element)
{
echo "<br/>[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node)
{
echo $node->nodeValue. "\n";
}
}
}
}
I think you should know where to go from this point :)
A more advanced version would run an XPath query for all hour elements ("*/hour") and then, inside the foreach, run another XPath query in the context of each hour element ($xpath->query("t[1]", $hourElement);). This way you also have access to the hour element itself and can, for example, display the hour.
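A sketch of that per-hour variant. The XML below is a hypothetical excerpt (element and attribute names other than hour and t are assumptions), and it assumes <t> is a direct child of <hour>, as the "*/hour/t[1]" expression implies:

```php
<?php
// Hypothetical feed excerpt; the real structure may differ.
$xml = '<weather><hbhf>'
     . '<hour h="10"><t>Sunny</t><wind><t>NE</t></wind></hour>'
     . '<hour h="11"><t>Cloudy</t><wind><t>N</t></wind></hour>'
     . '</hbhf></weather>';

$flux = new DOMDocument();
$flux->loadXML($xml);
$xpath = new DOMXPath($flux);

$byHour = array();
foreach ($xpath->query('//hour') as $hour) {
    // "t[1]" is relative to $hour: its first direct <t> child only,
    // so the <t> nested inside <wind> is not matched.
    $t = $xpath->query('t[1]', $hour)->item(0);
    if ($t !== null) {
        $byHour[$hour->getAttribute('h')] = $t->nodeValue;
    }
}

print_r($byHour);
```

Keeping the hour element in hand is what makes it possible to read its attributes alongside the <t> value.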
UPDATE
Simpler version of foreach:
if (!is_null($elements)) {
foreach ($elements as $element)
{
echo "<br/>".$element->nodeValue;
}
}
