How to extract the text in a SimpleXmlElement object? [duplicate]


Given the php code:
$xml = <<<EOF
<articles>
<article>
This is a link
<link>Title</link>
with some text following it.
</article>
</articles>
EOF;
function traverse($xml) {
$result = "";
foreach($xml->children() as $x) {
if ($x->count()) {
$result .= traverse($x);
}
else {
$result .= $x;
}
}
return $result;
}
$parser = new SimpleXMLElement($xml);
traverse($parser);
I expected the function traverse() to return:
This is a link Title with some text following it.
However, it returns only:
Title
Is there a way to get the expected result using simpleXML (obviously for the purpose of consuming the data rather than just returning it as in this simple example)?

There might be ways to achieve what you want using only SimpleXML, but in this case, the simplest way to do it is to use DOM. The good news is if you're already using SimpleXML, you don't have to change anything as DOM and SimpleXML are basically interchangeable:
// either
$articles = simplexml_load_string($xml);
echo dom_import_simplexml($articles)->textContent;
// or
$dom = new DOMDocument;
$dom->loadXML($xml);
echo $dom->documentElement->textContent;
Assuming your task is to iterate over each <article/> and get its content, your code will look like
$articles = simplexml_load_string($xml);
foreach ($articles->article as $article)
{
$articleText = dom_import_simplexml($article)->textContent;
}
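Note that textContent keeps the line breaks from the source XML, so the raw value will not match the single-line string expected in the question. Below is a minimal sketch of one way to normalise it, assuming it is acceptable to collapse every run of whitespace into a single space. The helper name articleText() is made up for illustration and is not part of SimpleXML or DOM.

```php
<?php
// Hypothetical helper: text of a node with all whitespace runs collapsed.
function articleText(SimpleXMLElement $node): string
{
    // textContent concatenates every descendant text node in document order.
    $raw = dom_import_simplexml($node)->textContent;
    // Trim the ends, then turn newlines and repeated spaces into single spaces.
    return preg_replace('/\s+/', ' ', trim($raw));
}

$xml = <<<EOF
<articles>
<article>
This is a link
<link>Title</link>
with some text following it.
</article>
</articles>
EOF;

$articles = simplexml_load_string($xml);
$texts = [];
foreach ($articles->article as $article) {
    $texts[] = articleText($article);
}
```

For the sample document, $texts should hold one string per <article/>.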

$node->asXML(); // I think this is the simple solution (note: it returns the node's markup, tags included)

So, the simple answer to my question was: SimpleXML can't process this kind of XML (text mixed with child elements). Use DOMDocument instead.
This example shows how to traverse the entire XML. It seems that DOMDocument will work with any XML, whereas SimpleXML requires the XML to be simple.
function attrs($list) {
$result = "";
foreach ($list as $attr) {
$result .= " $attr->name='$attr->value'";
}
return $result;
}
function parseTree($xml) {
$result = "";
foreach ($xml->childNodes AS $item) {
if ($item->nodeType == 1) {
$result .= "<$item->nodeName" . attrs($item->attributes) . ">" . parseTree($item) . "</$item->nodeName>";
}
else {
$result .= $item->nodeValue;
}
}
return $result;
}
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xml);
print parseTree($xmlDoc->documentElement);
You could also load the xml using simpleXML and then convert it to DOM using dom_import_simplexml() as Josh said. This would be useful, if you are using simpleXml to filter nodes for parsing, e.g. using XPath.
However, I don't actually use simpleXML, so for me that would be taking the long way around.
$simpleXml = new SimpleXMLElement($xml);
$xmlDom = dom_import_simplexml($simpleXml);
print parseTree($xmlDom);
Thank you for all the help!

You can get the text node of a DOM element with simplexml just by treating it like a string:
foreach($xml->children() as $x) {
$result .= "$x";
}
However, this prints out:
This is a link
with some text following it.
TitleTitle
..because the text node is treated as one block and there is no way to tell where the child fits in inside the text node. The child node is also added twice because of the other else {}, but you can just take that out.
Sorry if I didn't help much, but I don't think there's any way to find out where the child node fits in the text node unless the xml is consistent (but then, why not use tags). If you know what element you want to strip the text out of, strip_tags() will work great.
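For comparison, DOM does expose where the child sits relative to the surrounding text, because each piece of text is its own node. The following is only a sketch, assuming the sample XML from the question; the $parts array and its ['kind', 'value'] layout are invented here for illustration.

```php
<?php
$xml = <<<EOF
<articles>
<article>
This is a link
<link>Title</link>
with some text following it.
</article>
</articles>
EOF;

$articles = simplexml_load_string($xml);
$article  = dom_import_simplexml($articles->article[0]);

// Walk the direct children in document order, keeping text and elements apart.
$parts = [];
foreach ($article->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        $parts[] = ['text', trim($child->nodeValue)];
    } elseif ($child->nodeType === XML_ELEMENT_NODE) {
        $parts[] = [$child->nodeName, trim($child->textContent)];
    }
}
```

Each entry in $parts keeps its position, so the <link> element ends up between the two text runs rather than after them.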

This has already been answered, but casting to string (i.e. $sString = (string) $oSimpleXMLNode->TagName) always worked for me.
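A short sketch of what the cast gives you, using a made-up one-line document rather than the XML from the question. As far as I know, the cast returns only the text that sits directly inside the element, so it suits leaf elements such as <link> but skips the text of child elements in mixed content.

```php
<?php
// Made-up sample: text mixed with one child element.
$article = new SimpleXMLElement(
    '<article>This is a link <link>Title</link> with some text.</article>'
);

// Casting a leaf element yields its text.
$linkText = (string) $article->link;

// Casting the parent yields only its own direct text, not the child's.
$ownText  = (string) $article;
$hasTitle = strpos($ownText, 'Title') !== false;
```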

Try this:
$parser = new SimpleXMLElement($xml);
echo html_entity_decode(strip_tags($parser->asXML()));
That's pretty much equivalent to:
$parser = simplexml_load_string($xml);
echo dom_import_simplexml($parser)->textContent;

Like @tandu said, it's not possible, but if you can modify your XML, this will work:
$xml = <<<EOF
<articles>
<article>
This is a link
</article>
<link>Title</link>
<article>
with some text following it.
</article>
</articles>
EOF;

Related

php dom document remove some html tags but keep inner tags and text

I need to remove some tags (e.g. <div></div>) in HTML document and keep inner tags and text.
I managed to do that with Simple HTML Dom Parser. But it can't process big files due to huge memory requirements.
I would prefer to use native PHP tools like DOMDocument cause I read that it's more optimized and quicker in processing HTML documents.
But I struggle at the first stage - how to remove some tags while keeping inner text and tags.
Source HTML sample is:
<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>
I try this code:
$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
foreach ($oldnodes as $node) {
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
}
echo $doc->saveHTML();
It produces the output:
<html><body>00000aaaaa<div>bbbbbbccc<a>link</a>cccdddddd</div>eeeee<div>1111</div></body></html>
I need the following:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
Could someone please help me with proper code for the task?

You can use the strip_tags() function in PHP.
$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo strip_tags($htmltext, '<html><body><a>');
This removes all tags except html, body and a.
The output is:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
EDIT:
If the input comes from a user, it is safer to whitelist the tags you allow than to blacklist the ones you want removed.

If your code only contains simple HTML tags without any attributes you can keep it simple like:
$value = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
$pattern = '/<[\/]*(div|h1)>/';
$removedTags = preg_replace($pattern, '', $value);
Since you wrote in your comment that there are more than just div tags you want to remove, I added a h1 tag to the pattern in case you also want to remove h1 tags.
This code snippet is only for simple code, but fits to your HTML input and output example.

Try this.
Just replace the foreach loop with the code below.
foreach ($oldnodes as $node) {
$children = $node->childNodes;
$string = "";
foreach($children as $child) {
$childString = $doc->saveXML($child);
$string = $string."".$childString;
}
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($string);
$node->parentNode->insertBefore($fragment,$node);
$node->parentNode->removeChild($node);
}

I found a way to make it work.
The code in the question fails because getElementsByTagName() returns a live node list: replacing nodes while iterating over it changes the list itself. As a result, foreach visits only 2 of the 4 items in the list, and the other 2 are skipped.
So I process only the first element of the list, then rebuild the list, and repeat until no items are left.
The code is:
$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo "<!--
".$htmltext."
-->
";
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
while ($oldnodes->length>0){
$node=$oldnodes->item(0);
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
$oldnodes = $doc->getElementsByTagName('div');
}
echo $doc->saveHTML();
I hope this helps anyone who runs into the same difficulty.
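An alternative to rebuilding the list on every pass is to copy the live list into a plain array first and loop over the copy. This is a sketch under that assumption, not the code from the answer above; the function name unwrapElements() is made up for illustration.

```php
<?php
// Hypothetical helper: remove every <$tagName> element but keep its children.
function unwrapElements(DOMDocument $doc, string $tagName): void
{
    // Snapshot the live DOMNodeList so edits do not disturb the iteration.
    $nodes = iterator_to_array($doc->getElementsByTagName($tagName));
    foreach ($nodes as $node) {
        // Move each child in front of the element, preserving order.
        while ($node->firstChild !== null) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        // The element is now empty and can be dropped.
        $node->parentNode->removeChild($node);
    }
}

$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
unwrapElements($doc, 'div');
$output = $doc->saveHTML();
```

Nested divs are handled because the inner div is still attached to a parent when its turn comes, even after the outer div has been unwrapped.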

Strip an entire block of html based on class or id with php

I have the following PHP function, which is supposed to remove a block of HTML based on a given class name or id. I got this function at http://www.katcode.com/php-html-parsing-extracting-and-removing-html-tag-of-specific-class-from-string/
It works for flat markup but has problems with nested tags. In the example below I'm trying to remove the entire div block that has class 'two'.
The function does not remove the div block properly because it cannot work out where the block begins and ends. How can I rework it to remove an entire tag regardless of how many nested elements it contains? I'm open to other PHP suggestions. I can easily do this with jQuery, but I'm looking for a server-side PHP solution.
The HTML looks like this:
<div class="test">
<div>testing1</div>
<div class="two">
<div>testing3</div>
<div>testing3</div>
</div>
<div>testing3</div>
<div>testing4</div>
</div>
php
<?php
$x = '<div class="test"><div>testing1</div><div class="two"><div>testing3</div><div>testing3</div></div><div>testing3</div><div>testing4</div></div>';
function removeTag($str,$id,$start_tag,$end_tag){
while(($pos_srch = strpos($str,$id))!==false){
$beg = substr($str,0,$pos_srch);
$pos_start_tag = strrpos($beg,$start_tag);
$beg = substr($beg,0,$pos_start_tag);
$end = substr($str,$pos_srch);
$end_tag_len = strlen($end_tag);
$pos_end_tag = strpos($end,$end_tag);
$end = substr($end,$pos_end_tag+$end_tag_len);
$str = $beg.$end;
}
return $str;
}
echo removeTag($x,'two','<div','/div>');
?>

Not tested but try something like:
$doc = new DOMDocument();
$doc->loadHTML($x);
$xpath = new DOMXPath($doc);
$query = "//div[contains(@class, 'two')]";
$oldnodes = $xpath->query($query);
foreach ($oldnodes as $node) {
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
}
echo $doc->saveHTML();
Hope it helps

HTML should probably never be parsed with PHP string functions that way.
Use PHP's DOMDocument class to load the HTML as an object. You can then use DOMXPath to search the document for the block you are looking for, loop through the results and remove them, and then save the document back to text.
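The approach described above could look roughly like this. It is a sketch only: it handles the class case (not id), the function name removeByClass() is invented for illustration, and $class is assumed to be a trusted, simple class name because it is placed directly into the XPath expression.

```php
<?php
// Hypothetical helper: drop every element carrying the given class,
// together with everything nested inside it.
function removeByClass(string $html, string $class): string
{
    libxml_use_internal_errors(true);   // tolerate imperfect HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word inside the class attribute.
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // loadHTML() wraps fragments in <html><body>; return only the body's content.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$x = '<div class="test"><div>testing1</div><div class="two"><div>testing3</div><div>testing3</div></div><div>testing3</div><div>testing4</div></div>';
$result = removeByClass($x, 'two');
```

Because the parser knows where each element ends, nesting depth does not matter here, which is the part the string-based removeTag() cannot get right.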

Parsing xml-like data

I have a string with xml-like data:
<header>Article header</header>
<description>This article is about you</description>
<text>some <b>html</b> text</text>
I need to parse it into variables/object/array "header", "description", "text".
What is the best way to do this? I tried $vars = simplexml_load_string($content), but it does not work, because it is not 100% pure xml (no <?xml...).
So, should I use preg_match? Is it the only way?

Your XML string looks like (though may or may not be) an XML document fragment. PHP can work with this using the DOMDocumentFragment class.
$doc = new DOMDocument;
$frag = $doc->createDocumentFragment();
$frag->appendXML($content);
$parsed = array();
foreach ($frag->childNodes as $element) {
if ($element->nodeType === XML_ELEMENT_NODE) {
$parsed[$element->nodeName] = $element->textContent;
}
}
echo $parsed['description']; // This article is about you

With a string like that, simplexml_load_string should work, as long as the string has a single root element (wrap it in one if it does not).
Because the third tag contains a nested element, reading it as a plain value will not return the full content.
Try something like this, which might work for you:
$xml = simplexml_load_string($content);
$text = $xml->text->asXML();
You should also take a look at this documentation: http://www.php.net/manual/en/simplexmlelement.asxml.php. It does the same thing with a string. You might want to use this option instead of simplexml_load_string too:
$xml = new SimpleXMLElement($string);
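The string in the question has three top-level elements, and an XML document may only have one. A minimal sketch of the wrapping idea follows; the <root> wrapper name is arbitrary and chosen here only for illustration.

```php
<?php
$content = '<header>Article header</header>
<description>This article is about you</description>
<text>some <b>html</b> text</text>';

// Wrap the fragment so the parser sees a single root element.
$xml = simplexml_load_string('<root>' . $content . '</root>');

// Plain-text elements can simply be cast to string.
$header      = (string) $xml->header;
$description = (string) $xml->description;

// <text> contains markup, so keep it as XML instead of casting.
$textMarkup  = $xml->text->asXML();
```

Note that asXML() returns the <text> element including its own tags, so strip the outer pair if only the inner markup is wanted.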

Create array from the contents of <div> tags in php

I have the contents of a web page assigned to a variable $html
Here's an example of the contents of $html:
<div class="content">something here</div>
<span>something random thrown in <strong>here</strong></span>
<div class="content">more stuff</div>
How, using PHP, can I create an array from that which holds the contents of the <div class="content"></div> regions, so that (for the example above):
echo $array[0] . "\n" . $array[1]; //etc
outputs
something here
more stuff

Assuming this is just a simplified case in the OP and the real situation is more complicated, you'll want to use XPath.
If it's really complex, then you may want to use DOMDocument (with DOMXPath), but here's a simple example using SimpleXML
$xml = new SimpleXMLElement($html);
$result = $xml->xpath('//div[@class="content"]');
// each() was removed in PHP 8, so iterate with foreach instead
foreach ($result as $node) {
echo $node,"\n";
}
Since you explicitly asked about creating an array for this, you could use:
$res_Arr = array();
foreach ($result as $node) {
// cast to string so the array holds the text, not SimpleXMLElement objects
$res_Arr[] = (string) $node;
}
and $res_Arr would be an array with the contents you're looking for.
See http://php.net/manual/en/simplexmlelement.xpath.php for php SimpleXML Xpath info and http://www.w3.org/TR/xpath for the XPath specifications
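The sample $html in the question has several top-level elements and is HTML rather than XML, so SimpleXMLElement may refuse to parse it as-is. Below is a sketch of the same XPath query run through DOMDocument and DOMXPath instead, assuming the sample markup from the question.

```php
<?php
$html = '<div class="content">something here</div>
<span>something random thrown in <strong>here</strong></span>
<div class="content">more stuff</div>';

libxml_use_internal_errors(true);   // tolerate imperfect HTML
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Collect the text of every <div class="content"> in document order.
$xpath = new DOMXPath($doc);
$array = [];
foreach ($xpath->query('//div[@class="content"]') as $div) {
    $array[] = $div->textContent;
}
```

The @class="content" test matches the attribute value exactly, so a div with several classes would need a contains()-style test instead.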

PHP has several means of processing HTML, including DomDocument and SimpleXML. See Parse HTML With PHP And DOM. Here is an example:
$dom = new DomDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
$class = $div->getAttribute('class');
if ($class == 'content') {
echo $div->nodeValue . "\n";
}
}
Technically the class attribute could be multiple classes so you might want to use:
$classes = explode(' ', $class);
if (in_array('content', $classes)) {
...
}
The SimpleXML/XPath approach is more concise but if you don't want to go the XPath route (and learning another technology, at least enough to do these sorts of tasks) then the above is a programmatic alternative.

There is not much you can do short of using string manipulation functions or regular expressions. You can load your HTML using the DOM library and use that to traverse to your div, but that can become cumbersome if you're not careful or if the structure is complex.
http://ca3.php.net/manual/en/book.dom.php

It looks like Kalem13 beat me to it, but I agree. You could use the DOMDocument class. I haven't used it personally, but I think it would work for you. First you instantiate a DOMDocument object, then you load your $html variable using the loadHTML() function. Then you can use the getElementsByTagName() function.

You probably need to use preg_match_all():
$matches = array();
// no U flag here: combined with (.*?) it would make those groups greedy
preg_match_all('`\<div(.*?)class\=\"content\"(.*?)\>(.*?)\<\/div\>`ism',$html,$matches,PREG_SET_ORDER);
foreach($matches as $m){
// $m[3] represents the content in <div class="content">
}
