PHP SimpleXML get innerXML - php

I need to get the HTML contents of answer in this bit of XML:
<question>Who are you?</question>
<answer>Who who, <strong>who who</strong>, <em>me</em></answer>
So I want to get the string "Who who, <strong>who who</strong>, <em>me</em>".
If I have the answer as a SimpleXMLElement, I can call asXML() to get "<answer>Who who, <strong>who who</strong>, <em>me</em></answer>", but how to get the inner XML of an element without the element itself wrapped around it?
I'd prefer ways that don't involve string functions, but if that's the only way, so be it.

function SimpleXMLElement_innerXML($xml) {
    $innerXML = '';
    // serialize each child node (text and elements) of the imported DOM element
    foreach (dom_import_simplexml($xml)->childNodes as $child) {
        $innerXML .= $child->ownerDocument->saveXML($child);
    }
    return $innerXML;
}

This works (although it seems really lame):
echo (string)$qa->answer;

To the best of my knowledge, there is no built-in way to get that. I'd recommend trying SimpleDOM, which is a PHP class extending SimpleXMLElement that offers convenience methods for most of the common problems.
include 'SimpleDOM.php';
$qa = simpledom_load_string(
    '<qa>
       <question>Who are you?</question>
       <answer>Who who, <strong>who who</strong>, <em>me</em></answer>
     </qa>'
);
echo $qa->answer->innerXML();
Otherwise, I see two ways of doing that. The first would be to convert your SimpleXMLElement to a DOMNode, then loop over its childNodes to build the XML. The other would be to call asXML(), then use string functions to remove the root node. Be careful though: asXML() may sometimes return markup that is actually outside of the node it was called from, such as the XML prolog or processing instructions.

The most straightforward solution is to implement a custom innerXML helper with SimpleXML:
function simplexml_innerXML($node) {
    $content = '';
    foreach ($node->children() as $child) {
        $content .= $child->asXML(); // children() yields elements only, not bare text
    }
    return $content;
}
In your code, replace $body_content = $el->asXml(); with $body_content = simplexml_innerXML($el);
However, you could also switch to another API that distinguishes between innerXML (what you are looking for) and outerXML (what you get for now). The Microsoft DOM library offers this distinction, but unfortunately PHP DOM doesn't.
I found that the PHP XMLReader API offers this distinction. See readInnerXML(). This API takes quite a different approach to processing XML, though. Try it.
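A minimal sketch of the XMLReader route, assuming the question's two elements sit under a wrapper element named qa (that wrapper name is an assumption, it is not shown in the question). The reader advances node by node until it reaches the answer element, then asks for its inner markup:

```php
<?php
// Sample input; the <qa> root element is assumed for illustration.
$xml = '<qa><question>Who are you?</question>'
     . '<answer>Who who, <strong>who who</strong>, <em>me</em></answer></qa>';

$reader = new XMLReader();
$reader->XML($xml);

$inner = null;
// Walk the document until the opening <answer> tag is the current node.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'answer') {
        // readInnerXml() returns the markup of the children, without <answer> itself.
        $inner = $reader->readInnerXml();
        break;
    }
}
$reader->close();

echo $inner, "\n";
```

Because XMLReader is a forward-only cursor, this fits best when the document is read once from top to bottom rather than navigated at random.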
Finally, I would stress that XML is not meant to extract data as subtrees but rather as values. That's why you are running into trouble finding the right API. It would be more 'standard' to store the HTML subtree as a value (and escape all tags) rather than as an XML subtree. Also beware that some HTML syntax is not always XML compatible (for example <br> vs <br/>). Anyway, in practice, your approach is definitely more convenient for editing the XML file.

I would extend the SimpleXMLElement class:
class MyXmlElement extends SimpleXMLElement {
    final public function innerXML() {
        $tag = $this->getName();
        $value = $this->__toString();
        if ('' === $value) {
            return null;
        }
        return preg_replace('!<'. $tag .'(?:[^>]*)>(.*)</'. $tag .'>!Ums', '$1', $this->asXML());
    }
}
and then use it like this (after loading the document with simplexml_load_string($xml, 'MyXmlElement')):
echo $qa->answer->innerXML();

function getInnerXml($xml_text) {
    //strip the first element
    //check if the strip tag is empty also
    $xml_text = trim($xml_text);
    $s1 = strpos($xml_text, ">");
    $s2 = trim(substr($xml_text, 0, $s1)); //get the head up to (not including) ">" and trim; strings are indexed from 0
    if ($s2[strlen($s2)-1] == "/") //tag is empty
        return "";
    $s3 = strrpos($xml_text, "<"); //get last closing "<"
    return substr($xml_text, $s1+1, $s3-$s1-1);
}
var_dump(getInnerXml("<xml />"));
var_dump(getInnerXml("<xml / >faf < / xml>"));
var_dump(getInnerXml("<xml >< / xml>"));
var_dump(getInnerXml("<xml>faf < / xml>"));
var_dump(getInnerXml("<xml > faf < / xml>"));
After searching for a while, I found no satisfying solution, so I wrote my own function.
This function returns exactly the innerXml content (including white-space, of course).
To use it, pass it the result of asXML(), like this: getInnerXml($e->asXML()). This function works for elements with many prefixes as well (which was my case, as I could not find any existing method that converts all child nodes with different prefixes). The calls above print:
string '' (length=0)
string '' (length=0)
string '' (length=0)
string 'faf ' (length=4)
string ' faf ' (length=6)

function get_inner_xml(SimpleXMLElement $SimpleXMLElement) {
    $element_name = $SimpleXMLElement->getName();
    $inner_xml = $SimpleXMLElement->asXML();
    // remove the bare opening and closing tags (an opening tag with attributes is not matched)
    $inner_xml = str_replace('<'.$element_name.'>', '', $inner_xml);
    $inner_xml = str_replace('</'.$element_name.'>', '', $inner_xml);
    $inner_xml = trim($inner_xml);
    return $inner_xml;
}

If you don't want to strip CDATA section, comment out lines 6-8.
function innerXML($i){
$text=trim(($sp!==false && $sp<=$ep)?substr($text,$sp+1,$ep-$sp-1):'');
$text=trim(($sp==0 && $ep==strlen($text)-3)?substr($text,$sp+9,-3):$text);

You can just use this function :)
function innerXML( $node ) {
    $name = $node->getName();
    return preg_replace( '/((<'.$name.'[^>]*>)|(<\/'.$name.'>))/UD', "", $node->asXML() );
}

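Here is a very fast solution I created:
function InnerHTML($Text) {
    $PosStart = strpos($Text, '>') + 1; // first character after the opening tag
    $PosEnd = strrpos($Text, '<');      // start of the closing tag (last "<" in the string)
    return substr($Text, $PosStart, $PosEnd - $PosStart);
}
echo InnerHTML($yourXML->qa->answer->asXML());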

Using a regex, you could do this:
preg_match('/<answer(.*)?>(.*)?<\/answer>/', $xml, $match); // $match[2] holds the inner XML


XHP with Regex for link Replacement

I am trying to implement a simple function that given a text input, returns the text modified with xhp_a when a link is detected, within a paragraph xhp_p.
Consider this class
class Urlifier {
protected static $reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
public static function convertParagraphWithLink(?string $input):xhp_p{
if (!$input)
return <p></p>;
if (preg_match(self::$reg_exUrl,$input,$url_match)) //match found
return <p>{preg_replace($reg_exUrl, '<a href="'.$url_match[0].'>'.$url_match[0].'</a>', $input)}<p>;
}else{//no link inside
The problem here is that XHP escapes HTML, so the links are not shown as expected. I suppose this happens because I do not create a DOM hierarchy as expected (with the appendChild method, for example), and thus everything the regex replaces is a string.
So my other approach to this problem was to use preg_replace_callback with a callback function that would create an xhp_a and add it to the hierarchy under xhp_p, but that did not work either.
Am I wrong somewhere? If not, would there be any security risk or bigger overhead in just finding and replacing the HTML on load on the client side instead of on the server?
Thanks for your time !
Since XHP maintains object hierarchy that maps to DOM, simply replacing parts of a string won't create any new objects. To manipulate XHP objects corresponding methods should be used, e.g. appendChild.
Here's an example of how what you need can be achieved with XHP manipulation.
class Urlifier {
  public static function convertParagraphWithLink(
    ?string $input,
  ): xhp_p {
    $url_pattern = re"/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
    if (HH\Lib\Str\is_empty($input)) {
      return <p/>;
    }
    $input = $input as nonnull;
    // Extract links
    $link_matches = HH\Lib\Regex\every_match($input, $url_pattern);
    $links = HH\Lib\Vec\map($link_matches, $m ==> $m[0]);
    $a_elements = HH\Lib\Vec\map($links, $link ==> <a href={$link}>{$link}</a>);
    // Extract all pieces between matches
    $texts = HH\Lib\Regex\split($input, $url_pattern);
    $p_elements = HH\Lib\Vec\map($texts, $text ==> <p>{$text}</p>);
    // Merge texts and links
    $pairs = HH\Lib\Vec\zip($p_elements, $a_elements);
    $elements = HH\Lib\Vec\flatten($pairs);
    // Because there's one more p element than a element, append last p
    $elements[] = HH\Lib\C\last($p_elements);
    $result = <p/>;
    // Attach the pieces as real XHP children (this loop is assumed; it is missing from the snippet as posted)
    foreach ($elements as $element) {
      $result->appendChild($element);
    }
    return $result;
  }
}

How to save regex backreferences to an array during preg_replace or preg_replace_callback

Here's the problem: I have a database full of articles marked up in XHTML. Our application uses Prince XML to generate PDFs. An artifact of that is that footnotes are marked up inline, using the following pattern:
<p>Some paragraph text<span class="fnt">This is the text of a footnote</span>.</p>
Prince replaces every span.fnt with a numeric footnote marker, and renders the enclosed text as a footnote at the bottom of the page.
We want to render the same content in ebook formats, and XHTML is a great starting point, but the inline footnotes are terrible. What I want to do is convert the footnotes to endnotes in my ebook build script.
This is what I'm thinking:
1. Create an empty array called $endnotes to store the endnote text.
2. Set a variable $endnote_no to zero. This variable will hold the current endnote number, to display inline as an endnote marker, and to be used in linking the endnote marker to the particular endnote.
3. Use preg_replace or preg_replace_callback to find every instance of <span class="fnt">(.*?)</span>.
4. Increment $endnote_no for each instance, and replace the inline span with '<sup><a href="#endnote_' . $endnote_no . '">' . $endnote_no . '</a></sup>'.
5. Push the footnote text to the $endnotes array so that I can use it at the end of the document.
6. After replacing all the footnotes with numeric endnote references, iterate through the $endnotes array to spit out the endnotes as an ordered list in XHTML.
This process is a bit beyond my PHP comprehension, and I get lost when I try to translate this into code. Here's what I have so far, which I mainly cobbled together based on code examples I found in the PHP documentation:
$endnotes = array();
$endnote_no = 0;
class Endnoter {
    public function replace($subject) {
        $this->endnote_no = 0;
        return preg_replace_callback('`<span class="fnt">(.*?)</span>`', array($this, '_callback'), $subject);
    }
    public function _callback($matches) {
        array_push($endnotes, $1);
        return '<sup>' . $this->endnote_no . '</sup>';
    }
}
$replacer = new Endnoter();
echo '<pre>';
print_r($endnotes); // Just checking to see if the $endnotes are there.
echo '</pre>';
Any guidance would be helpful, especially if there is a simpler way to get there.
Don't know about a simpler way, but you were halfway there. This seems to work.
I just cleaned it up a bit, moved the variables inside your class and added an output method to get the footnote list.
class Endnoter
{
    private $number_of_notes = 0;
    private $footnote_texts = array();

    public function replace($input) {
        // non-greedy (.*?) so two footnotes in one paragraph are not merged into a single match
        return preg_replace_callback('#<span class="fnt">(.*?)</span>#i', array($this, 'replace_callback'), $input);
    }

    protected function replace_callback($matches) {
        // the text sits in the matches array
        // see the preg_replace_callback documentation
        $this->footnote_texts[] = $matches[1];
        $this->number_of_notes++; // number the note that was just stored
        return '<sup>'.$this->number_of_notes.'</sup>';
    }

    public function getEndnotes() {
        $out = array();
        $out[] = '<ol>';
        foreach($this->footnote_texts as $text) {
            $out[] = '<li>'.$text.'</li>';
        }
        $out[] = '</ol>';
        return implode("\n", $out);
    }
}
First, you're best off not using a regex for HTML manipulation; see here:
How do you parse and process HTML/XML in PHP?
However, if you really want to go that route, there are a few things wrong with your code:
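return '<sup><a href="#endnote_' . $this->endnote_no++ . '">' . $this->endnote_no++ . '</a></sup>';
With a post-increment in both places, the link target and the visible number come from two separate increments, so they will not match.
If those values are both supposed to be the same, you want to increment endnote_no first, and only once:
return '<sup><a href="#endnote_' . ++$this->endnote_no . '">' . $this->endnote_no . '</a></sup>';
Note the ++ in front of the variable instead of after.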
array_push($endnotes, $1);
$1 is not a defined value. You're looking for the array you passed in to the callback, so you want $matches[1]
$endnotes is not defined outside the class, so you either want a getter function to retrieve $endnotes (usually preferable) or make the variable public in the class. With a getter:
class Endnotes {
    private $endnotes = array();
    //replace any references to $endnotes in your class with $this->endnotes and add a function:
    public function getEndnotes() {
        return $this->endnotes;
    }
}
//and then outside
$endnotes = $replacer->getEndnotes();
preg_replace_callback doesn't pass by reference, so you aren't actually modifying the original string. $replacer->replace($body); should be $body = $replacer->replace($body); unless you want to pass body by reference into the replace() function and update its value there.
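Pulling the points from both answers together, here is a minimal sketch that uses a closure instead of a class. The function name convert_footnotes and the id="endnote_N" targets are assumptions made for illustration; the regex still assumes footnotes contain no nested span elements.

```php
<?php
// Hypothetical helper: returns the rewritten body and the endnote list.
function convert_footnotes(string $html): array
{
    $endnotes = [];

    $body = preg_replace_callback(
        '#<span class="fnt">(.*?)</span>#s',
        // The array is captured by reference so the callback can collect the notes.
        function (array $m) use (&$endnotes): string {
            $endnotes[] = $m[1];
            $n = count($endnotes); // 1-based endnote number
            return '<sup><a href="#endnote_' . $n . '">' . $n . '</a></sup>';
        },
        $html
    );

    $list = '';
    if ($endnotes) {
        $items = '';
        foreach ($endnotes as $i => $text) {
            $items .= '<li id="endnote_' . ($i + 1) . '">' . $text . '</li>';
        }
        $list = '<ol>' . $items . '</ol>';
    }

    return [$body, $list];
}

[$body, $list] = convert_footnotes(
    '<p>Some paragraph text<span class="fnt">This is the text of a footnote</span>.</p>'
);
echo $body, "\n", $list, "\n";
```

The return value of preg_replace_callback is assigned rather than discarded, which is the point made in the last paragraph above.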

Reading php files with special tags in php

I have a file which reads as follows
<<row>> 1|test|20110404<</row>>
<<row>> 1|test|20110404<</row>>
<<row>> and <</row>> indicate the start and end of a line. I want to read the text between these tags and also check whether the tags are present.
The first thing you need to do is locate the position of this "tag". The strpos() function does just that.
$tag_pos=strpos('<> 1|test|20110404<> <> 1|test|20110404<>', '<>');
if ($tag_pos===false) {
    //The tag was not found!
} else {
    //$tag_pos equals the numeric position of the first character of your tag
}
If these are truly lines, an efficient way to get them all is just to split on <>.
$lines=explode('<>', '<> 1|test|20110404<> <> 1|test|20110404<>');
$lines=array_filter($lines); //Removes blank strings from array
You could improve this by adding a callback function to the array_filter() call that uses trim() to remove any whitespace and then see if it is blank or not.
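The improvement described above might look like the sketch below. It reuses the same sample string and the same `<>` delimiter as the snippet above; the closure is just one possible callback.

```php
<?php
$lines = explode('<>', '<> 1|test|20110404<> <> 1|test|20110404<>');

// Keep only entries that still contain something after trimming whitespace.
$lines = array_filter($lines, function ($line) {
    return trim($line) !== '';
});

// Re-index from 0 and strip the surrounding whitespace from each kept line.
$lines = array_map('trim', array_values($lines));

print_r($lines);
```

Comparing against the empty string, rather than relying on the default truthiness test of array_filter(), also keeps a line whose whole content is "0".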
Edit: Great, I see that your "tags" were missing from your post. Since your start and end tags do not match, the code above will be of little use to you. Let me try again...
function strbetweenstrs($source, $str1, $str2, $casesensitive=true) {
    $results = array();
    $whatsleft = $source;
    while ($whatsleft != '') {
        if ($casesensitive) {
            $pos1 = strpos($whatsleft, $str1);
            $pos2 = ($pos1 === false) ? false : strpos($whatsleft, $str2, $pos1 + strlen($str1));
        } else {
            $pos1 = stripos($whatsleft, $str1);
            $pos2 = ($pos1 === false) ? false : stripos($whatsleft, $str2, $pos1 + strlen($str1));
        }
        if (($pos1 === false) || ($pos2 === false)) {
            break; // no further start/end pair
        }
        $start = $pos1 + strlen($str1);
        array_push($results, substr($whatsleft, $start, $pos2 - $start));
        $whatsleft = substr($whatsleft, $pos2 + strlen($str2));
    }
    return $results;
}
Note that I haven't tested this... but you get the general idea. There is probably a much more efficient way to go about doing it.
Creating your own format is not so hard, but creating a script to read it can be difficult.
The advantage of using standardized formats is that most programming languages already have support for them. For example:
XML: You can use the simplexml_load_string() function and it can make you navigate easily through your content.
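// sample document; the <rows> root element name is an assumption
$str = '<?xml version="1.0" encoding="utf-8"?>
<rows><row>1|test|20110404</row><row>1|test|20110404</row></rows>';
$xml = simplexml_load_string($str);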
Now you can access your data
echo $xml->row[0];
echo $xml->row[1];
I'm sure you get the idea.
There is also very good support for JSON (JavaScript Object Notation) using the json_decode() function.
Check the PHP manual for more details.
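As a rough sketch of the JSON option, assuming the same rows were stored as a JSON document instead. The key names rows, id, name and date are invented for illustration; nothing in the question defines them.

```php
<?php
// Hypothetical JSON version of the two sample rows.
$json = '{"rows": ['
      . '{"id": 1, "name": "test", "date": "20110404"},'
      . '{"id": 1, "name": "test", "date": "20110404"}'
      . ']}';

// Passing true returns nested associative arrays instead of stdClass objects.
$data = json_decode($json, true);

if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    // Malformed input: report the parser's own message.
    echo 'Invalid JSON: ', json_last_error_msg(), "\n";
} else {
    foreach ($data['rows'] as $row) {
        echo $row['id'], ' ', $row['name'], ' ', $row['date'], "\n";
    }
}
```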
I would suggest using preg_match:
preg_match('#<<row>>(.*?)<</row>>#', $line, $matches);
if (!empty($matches)) {
    // line was found
    print_r($matches[1]); // will contain the content between the start and end row tags
}
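To read every row of the file in one pass, preg_match_all can be used in the same spirit. This is a sketch that assumes the whole file has already been read into $contents (for example with file_get_contents) and that fields are separated by "|", as in the sample.

```php
<?php
// Stand-in for the file contents from the question.
$contents = "<<row>> 1|test|20110404<</row>>\n<<row>> 1|test|20110404<</row>>\n";

$rows = [];
// (.*?) is non-greedy so each match stops at the nearest closing tag.
$found = preg_match_all('#<<row>>(.*?)<</row>>#s', $contents, $matches);

if ($found) {
    foreach ($matches[1] as $raw) {
        // Split each row into its pipe-separated fields.
        $rows[] = explode('|', trim($raw));
    }
} else {
    echo "No <<row>> tags present\n";
}

print_r($rows);
```

A return value of 0 from preg_match_all doubles as the "are the tags present" check.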

PHP DOM - stripping span tags, leaving their contents

I am looking to take markup like:
<span class="test">Some text that is <strong>bolded</strong> and contains a link.</span>
and find the best method in PHP for stripping the span so that what is left is this:
Some text that is <strong>bolded</strong> and contains a link.
I have read many of the other questions regarding parsing HTML using PHP DOM instead of regex, but have been unable to figure out a way to strip the spans with PHP DOM, leaving the HTML contents intact. The ultimate goal is to be able to strip the document of all span tags, leaving their contents. Can this be done with PHP DOM? Is there a method that provides better performance and does not rely on string parsing instead of DOM parsing?
I've used regex to do so, without any issues thus far:
But my interest here is in becoming a better PHP programmer. And since it is always possible to trip up a regex with badly formatted markup, I'm looking for a better way. I have also considered using strip_tags(), doing something like the following:
public function strip_tags( $content, $tags_to_strip = array() )
{
    // All Valid XHTML tags
    $valid_tags = array( /* ... */ );
    // Remove each tag to strip from the valid_tags array
    foreach ( $tags_to_strip as $tag ){
        $ndx = array_search( $tag, $valid_tags );
        if ( $ndx !== false ){
            unset( $valid_tags[ $ndx ] );
        }
    }
    // convert valid_tags array into param for strip_tags
    $valid_tags = implode( '><', $valid_tags );
    $valid_tags = "<$valid_tags>";
    $content = strip_tags( $content, $valid_tags );
    return $content;
}
But this is still parsing the string, and not DOM parsing. So if the text is mal-formed, it is possible to strip too much. Many people are quick to suggest using Simple HTML DOM Parser, but looking at the source code, it seems to be using regex to parse the html as well.
Can this be done with PHP5's DOM, or is there a better way to strip tags leaving their contents intact. Would it be bad practice to use Tidy or HTML Purifier to clean the text and then use regex / HTML Simple HTML DOM parser on it?
Libraries like phpQuery seem to be too heavy weight for what seems like it should be a simple task.
I use the following function to remove a node without removing its children:
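function DOMRemove(DOMNode $from) {
    $sibling = $from->firstChild;
    do {
        $next = $sibling->nextSibling;
        // move the child in front of the node being removed
        $from->parentNode->insertBefore($sibling, $from);
    } while ($sibling = $next);
    // the node is now empty, so drop it
    $from->parentNode->removeChild($from);
}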
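For example:
$dom = new DOMDocument;
$dom->loadHTML($html); // $html is assumed to hold the markup from the question
$nodes = $dom->getElementsByTagName('span');
// iterate over a copy: the node list is live and shrinks as spans are removed
foreach (iterator_to_array($nodes) as $node) {
    DOMRemove($node);
}
echo $dom->saveHTML();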
Would give you:
Some text that is <strong>bolded</strong> and contains a link.
While this:
$nodes = $dom->getElementsByTagName('a');
foreach (iterator_to_array($nodes) as $node) {
    DOMRemove($node);
}
echo $dom->saveHTML();
Would give you:
<span class="test">Some text that is <strong>bolded</strong> and contains a link.</span>
In my experience, every time I have worked with DOM, I lose a little performance compared with simple string operations.
With your function, you tried to filter strictly the valid XHTML tags, but you don't need a loop with manual comparison, since you can hand this whole task to the PHP interpreter through native functions.
Of course, you have combined them well and achieve very good performance (for me, 0.0002 milliseconds), but you could try combining functions in a single line, letting each function do its own natural job.
Take a look and you will understand what I'm talking about:
$text = '<span class="test">Some text that is <strong>bolded</strong> and contains a link.</span>';
$validTags = array( 'a','abbr','acronym','address','area','b','base','bdo','big','blockquote','body','br','button','caption','cite',
$tagsToStrip = array( 'span' );
var_dump( strip_tags( $text, sprintf( '<%s>', implode( '><', array_diff( $validTags, $tagsToStrip ) ) ) ) );
I used your own list, but I combined sprintf(), implode() and array_diff(), each doing one specific task, to achieve the goal together.
Hope it helped.

Need a regex to add css class to first and last list item

Thank you all for your input. Some additional information.
It's really just a small chunk of markup (20 lines) I'm working with, and I had aimed to leverage a regex to do the work.
I also do have the ability to hack up the script (an ecommerce one) to insert the classes as the navigation is built. I wanted to limit the number of hacks I have in place to keep things easier on myself when I go to update to the latest version of the software.
With that said, I'm pretty aware of my situation and the various options available to me. The first part of my regex works as expected. I posted really more or less to see if someone would say, "hey dummy, this is easy just change this....."
After coming close with a few of my efforts, it's more of the principle at this point. To just know (and learn) a solution exists for this problem. I also hate being beaten by a piece of code.
I'm trying to leverage regular expressions to add a CSS class to the first and last list items within an ordered list. I've tried a bunch of different ways but can't produce the results I'm looking for.
I've got a regular expression for the first list item but can't seem to figure a correct one out for the last. Here is what I'm working with:
$patterns = array('/<ul+([^<]*)<li/m', '/<([^<]*)(?<=<li)(.*)<\/ul>/s');
$replace = array('<ul$1<li class="first"','<li class="last"$2$3</ul>');
$navigation = preg_replace($patterns, $replace, $navigation);
Any help would be greatly appreciated.
Jamie Zawinski would have something to say about this...
Do you have a proper HTML parser? I don't know if there's anything like hpricot available for PHP, but that's the right way to deal with it. You could at least employ hpricot to do the first cleanup for you.
If you're actually generating the HTML -- do it there. It looks like you want to generate some navigation and have a .first and .last kind of thing on it. Take a step back and try that.
+1 to generating the right html as the best option.
But a completely different approach, which may or may not be acceptable to you: you could use javascript.
This uses jquery to make it easy ...
function() {
As I say, this may or may not be of any use in this case, but I think it's a valid solution to the problem in some cases.
PS: You say ordered list, then give ul in your example. ol = ordered list, ul = unordered list
You wrote:
$patterns = array('/<ul+([^<]*)<li/m','/<([^<]*)(?<=<li)(.*)<\/ul>/s');
First pattern:
ul+ => you search something like ullll...
The m modifier is useless here, since you don't use ^ nor $.
Second pattern:
Using .* along with s is "dangerous", because you might select the whole document up to the last /ul of the page...
And well, I would just drop s modifier and use: (<li\s)(.*?</li>\s*</ul>) with replace: '$1class="last" $2'
In view of above remarks, I would write the first expression: <ul.*?>\s*<li
Although I am tired of seeing the Jamie Zawinski quote each time there is a regex question, Dustin is right in pointing you to an HTML parser (or just generating the right HTML from the start!): regexes and HTML don't mix well, because HTML syntax is complex, and unless you act on well-known, machine-generated output with a very predictable result, you are prone to get something breaking in some cases.
I don't know if anyone cares any longer, but I have a solution that works in my simple test case (and I believe it should work in the general case).
First, let me point out two things: While PhiLho is right in that the s is "dangerous", since dots may match everything up to the final of the document, this may very well be what you want. It only becomes a problem with not well formed pages. Be careful with any such regex on large, manually written pages.
Second, php has a special meaning of backslashes, even in single quotes. Most regexen will perform well either way, but you should always double-escape them, just in case.
Now, here's my code:
$patterns = array('/<ul.*?>\\s*<li/',
                  '/<li((.(?<!<li))*?<\\/ul>)/s');
$replace = array('$0 class="first"',
'<li class="last"$1');
$navigation = preg_replace($patterns, $replace, $navigation);
echo $navigation;
This will output
<li class="first">Coffee</li>
<li class="last">Water</li>
This assumes no line feeds inside the opening <ul...> tag. If there are any, use the s modifier on the first expression too.
The magic happens in (.(?<!<li))*?. This will match any character (the dot) that is not the beginning of the string <li, repeated any amount of times (the *) in a non-greedy fashion (the ?).
Of course, the whole thing would have to be expanded if there is a chance the list items already have the class attribute set. Also, if there is only one list item, it will match twice, giving it two such attributes. At least for xhtml, this would break validation.
You could load the navigation in a SimpleXML object and work with that. This prevents you from breaking your markup with some crazy regex :)
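A sketch of that SimpleXML approach, assuming the navigation is well-formed XML with a single ul root. The sample items and the add_class helper are invented for illustration.

```php
<?php
// Hypothetical helper: append a class name, keeping any classes already set.
function add_class(SimpleXMLElement $el, string $class): void
{
    $el['class'] = trim((string) $el['class'] . ' ' . $class);
}

$navigation = '<ul><li>Coffee</li><li>Tea</li><li>Water</li></ul>';

$ul = simplexml_load_string($navigation);
$count = count($ul->li);

if ($count > 0) {
    add_class($ul->li[0], 'first');
    add_class($ul->li[$count - 1], 'last');
}

// Serialize through DOM so the output has no XML declaration in front of it.
$node = dom_import_simplexml($ul);
$result = $node->ownerDocument->saveXML($node);

echo $result, "\n";
```

Because the helper appends to the existing attribute, a list with a single item ends up with class="first last" instead of two class attributes.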
As a preface .. this is waaay over-complicating things in most use-cases. Please see other answers for more sanity :)
Here is a little PHP class I wrote to solve a similar problem. It adds 'first', 'last' and any other classes you want. It will handle li's with no "class" attribute as well as those that already have some class(es).
/**
 * Modify list items in pre-rendered html.
 * Usage Example:
 * $replaced_text = ListAlter::addClasses($original_html, array('cool', 'awsome'));
 */
class ListAlter {
    private $classes = array();
    private $classes_found = FALSE;
    private $count = 0;
    private $total = 0;

    // No public instances.
    private function __construct() {}

    /**
     * Adds 'first', 'last', and any extra classes you want.
     */
    static function addClasses($html, $extra_classes = array()) {
        $instance = new self();
        $instance->classes = $extra_classes;
        $total = preg_match_all('~<li([^>]*?)>~', $html, $matches);
        $instance->total = $total ? $total : 0;
        return preg_replace_callback('~<li([^>]*?)>~', array($instance, 'processListItem'), $html);
    }

    private function processListItem($matches) {
        $this->count++; // position of the current <li>, starting at 1
        $this->classes_found = FALSE;
        $processed = preg_replace_callback('~(\w+)="(.*?)"~', array($this, 'appendClasses'), $matches[0]);
        if (!$this->classes_found) {
            $classes = $this->classes;
            if ($this->count == 1) {
                $classes[] = 'first';
            }
            if ($this->count == $this->total) {
                $classes[] = 'last';
            }
            if (!empty($classes)) {
                $processed = rtrim($matches[0], '>') . ' class="' . implode(' ', $classes) . '">';
            }
        }
        return $processed;
    }

    private function appendClasses($matches) {
        // $matches[0] is the whole attribute; the name and value are groups 1 and 2
        list(, $name, $value) = $matches;
        if ($name == 'class') {
            $value = array_filter(explode(' ', $value));
            $value = array_merge($value, $this->classes);
            if ($this->count == 1) {
                $value[] = 'first';
            }
            if ($this->count == $this->total) {
                $value[] = 'last';
            }
            $value = implode(' ', $value);
            $this->classes_found = TRUE;
        }
        return sprintf('%s="%s"', $name, $value);
    }
}
