I'm looking for a way to dynamically surround parts of text with XML nodes based on regular expressions.
Consider the following example:
<speak>The test number is 123456789, and some further block of text.</speak>
Now let's say I have a regular expression targeting the number to selectively surround it with a new tag so it would become:
<speak>The test number is <say-as interpret-as="characters">123456789</say-as>, and some further block of text.</speak>
I thought about using DOMDocument for creating the tags, but I'm not sure about the substitution part. Any advice?
DOM is the correct way. It allows you to find and traverse text nodes. Use RegEx on the content of these nodes and build the new nodes up as a fragment.
function wrapMatches(\DOMNode $node, string $pattern, string $tagName, $tagAttributes = []) {
    $document = $node instanceof DOMDocument ? $node : $node->ownerDocument;
    $xpath = new DOMXpath($document);
    // iterate all descendant text nodes
    foreach ($xpath->evaluate('.//text()', $node) as $textNode) {
        $content = $textNode->textContent;
        $found = preg_match_all($pattern, $content, $matches, PREG_OFFSET_CAPTURE);
        $offset = 0;
        if ($found) {
            // a fragment lets us treat multiple nodes as one
            $fragment = $document->createDocumentFragment();
            foreach ($matches[0] as $match) {
                list($matchContent, $matchStart) = $match;
                // add the text between the previous match and the current one
                $fragment->appendChild(
                    $document->createTextNode(substr($content, $offset, $matchStart - $offset))
                );
                // add wrapper element, ...
                $wrapper = $fragment->appendChild($document->createElement($tagName));
                // ... set its attributes ...
                foreach ($tagAttributes as $attributeName => $attributeValue) {
                    $wrapper->setAttribute($attributeName, $attributeValue);
                }
                // ... and add the text content
                $wrapper->textContent = $matchContent;
                $offset = $matchStart + strlen($matchContent);
            }
            // add text after last match
            $fragment->appendChild($document->createTextNode(substr($content, $offset)));
            // replace the text node with the new fragment
            $textNode->parentNode->replaceChild($fragment, $textNode);
        }
    }
}
$xml = <<<'XML'
<speak>The test number is 123456789, and some further block of text.</speak>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
wrapMatches($document, '(\d+)u', 'say-as', ['interpret-as' => 'characters']);
echo $document->saveXML();
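This is conveniently handled using the xsl:analyze-string instruction in XSLT 2.0. For example, you can define the rule:
<xsl:template match="speak">
  <xsl:analyze-string select="." regex="\d+">
    <xsl:matching-substring>
      <say-as interpret-as="characters"><xsl:value-of select="."/></say-as>
    </xsl:matching-substring>
    <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>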
You can use preg_replace, something like this:
$str = '<speak>The test number is 123456789, and some further block of text.</speak>';
echo preg_replace('/(\d+)/','<say-as interpret-as="characters">$1</say-as>',$str);
and the output would be:
<speak>The test number is <say-as interpret-as="characters">123456789</say-as>, and some further block of text.</speak>
I ended up doing it the simple way, since I don't need to handle nested nodes and other XML-specific details. I just wrote a simple method that creates the tags as strings. It's good enough.
protected function createTag($name, $attributes = [], $content = null)
{
    $openingTag = '<' . $name;
    if ($attributes) {
        foreach ($attributes as $attribute => $value) {
            $openingTag .= sprintf(' %s="%s"', $attribute, $value);
        }
    }
    $openingTag .= '>';
    $closingTag = '</' . $name . '>';
    // default to the first capture group of the regex
    $content = $content ?: '$1';
    return $openingTag . $content . $closingTag;
}

$tag = $this->createTag($tagName, $attributes);
$text = preg_replace($regex, $tag, $text);
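The string-based approach can be sketched as a standalone function. This is only an illustration: the free function form and the htmlspecialchars() escaping of attribute values are assumptions added here, not part of the method above. Note that the tag is used as a preg_replace replacement string, so a `$` or `\` in an attribute value would still be interpreted by preg_replace.

```php
<?php
// Sketch: build a replacement tag as a string, escaping attribute values.
// The standalone form and the escaping step are assumptions for illustration.
function createTag(string $name, array $attributes = [], string $content = '$1'): string
{
    $tag = '<' . $name;
    foreach ($attributes as $attribute => $value) {
        // ENT_XML1 keeps the output valid inside an XML attribute
        $escaped = htmlspecialchars((string) $value, ENT_QUOTES | ENT_XML1, 'UTF-8');
        $tag .= sprintf(' %s="%s"', $attribute, $escaped);
    }
    // $content defaults to '$1', the first capture group of the pattern
    return $tag . '>' . $content . '</' . $name . '>';
}

$text = '<speak>The test number is 123456789, and some further block of text.</speak>';
$tag = createTag('say-as', ['interpret-as' => 'characters']);
$result = preg_replace('/(\d+)/', $tag, $text);
echo $result;
```

This only stays safe while the pattern cannot match inside tag names or attribute values of the existing markup; for anything more complex the DOM approach is the more robust choice.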
I'm trying to replace words that are in my terminology dictionary with an (HTML) anchor so that each one gets a tooltip. The replacement itself works, but I can't get the result back into the DOMDocument object.
I've written a recursive function that walks the DOM: it visits every child node, searches for the words in my dictionary and replaces them with an anchor.
I had been doing this with an ordinary preg_match on the HTML string, but that runs into problems when the HTML gets complex.
The recursive function:
$terms = array(
    'example'=>'explanation about example'
);

function iterate_html($doc, $original_doc = null)
{
    global $terms;
    if(is_null($original_doc)) {
        self::iterate_html($doc, $doc);
    }
    foreach($doc->childNodes as $childnode)
    {
        $children = $childnode->childNodes;
        if($children) {
            self::iterate_html($childnode, $original_doc);
        } else {
            $regexes = '~\b' . implode('\b|\b',array_keys($terms)) . '\b~i';
            $new_nodevalue = preg_replace_callback($regexes, function($matches) {
                $doc = new DOMDocument();
                $anchor = $doc->createElement('a', $matches[0]);
                $anchor->setAttribute('class', 'text-info');
                $anchor->setAttribute('data-toggle', 'tooltip');
                $anchor->setAttribute('data-original-title', $terms[strtolower($matches[0])]);
                return $doc->saveXML($anchor);
            }, $childnode->nodeValue);
            $dom = new DOMDocument();
            $template = $dom->createDocumentFragment();
            $template->appendXML($new_nodevalue);
            $original_doc->importNode($template->childNodes, true);
            $childnode->parentNode->replaceChild($template, $childnode);
        }
    }
}

echo iterate_html('this is just some example text.');
I expect the result to be:
this is just some <a class="text-info" data-toggle="tooltip" data-original-title="explanation about example">example</a> text
I don't think building a recursive function to walk the DOM is useful when you can use an XPath query. Also, I'm not sure that preg_replace_callback is the right function for this case. I prefer to use preg_split. Here is an example:
$html = 'this is just some example text.';

$terms = array(
    'example'=>'explanation about example'
);

// sort by reverse order of key size
// (to be sure that the longest string always wins instead of the first in the pattern)
uksort($terms, function ($a, $b) {
    $diff = mb_strlen($b) - mb_strlen($a);
    return ($diff) ? $diff : strcmp($a, $b);
});

// build the pattern inside a capture group (to have delimiters in the results with the PREG_SPLIT_DELIM_CAPTURE option)
$pattern = '~\b(' . implode('|', array_map(function($i) { return preg_quote($i, '~'); }, array_keys($terms))) . ')\b~i';

// prevent HTML parse errors from being displayed
$libxmlInternalErrors = libxml_use_internal_errors(true);

// determine if the html string has a root html element already; if not, add a fake root.
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$fakeRootElement = false;
if ( $dom->documentElement->nodeName !== 'html' ) {
    $dom->loadHTML("<div>$html</div>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $fakeRootElement = true;
}
libxml_use_internal_errors($libxmlInternalErrors);

// find all text nodes (not already included in a link or between other unwanted tags)
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::a)][not(ancestor::style)][not(ancestor::script)]');

// replacement
foreach ($textNodes as $textNode) {
    $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    $fragment = $dom->createDocumentFragment();
    foreach ($parts as $k=>$part) {
        if ($k&1) {
            // odd indexes hold the captured delimiters, i.e. the matched terms
            $anchor = $dom->createElement('a', $part);
            $anchor->setAttribute('class', 'text-info');
            $anchor->setAttribute('data-toggle', 'tooltip');
            $anchor->setAttribute('data-original-title', $terms[strtolower($part)]);
            $fragment->appendChild($anchor);
        } else {
            $fragment->appendChild($dom->createTextNode($part));
        }
    }
    $textNode->parentNode->replaceChild($fragment, $textNode);
}

// building of the result string
$result = '';
if ( $fakeRootElement ) {
    foreach ($dom->documentElement->childNodes as $childNode) {
        $result .= $dom->saveHTML($childNode);
    }
} else {
    $result = $dom->saveHTML();
}
echo $result;
Feel free to put that into one or more functions/methods, but keep in mind that this kind of editing has a non-negligible cost and should run each time the HTML is edited (and not each time the HTML is displayed).
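As a rough sketch of what such a function could look like: the name linkTerms(), its signature, and the choice to always wrap the input in a temporary div are assumptions made for this illustration, and it expects an HTML fragment rather than a full document.

```php
<?php
// Hypothetical wrapper around the preg_split approach; assumes $html is a fragment.
function linkTerms(string $html, array $terms): string
{
    if ($terms === []) {
        return $html;
    }
    // lower-case the keys so lookups agree with the case-insensitive pattern
    $terms = array_change_key_case($terms, CASE_LOWER);
    // longest terms first, so the longest alternative wins
    uksort($terms, function ($a, $b) {
        $diff = strlen($b) - strlen($a);
        return $diff !== 0 ? $diff : strcmp($a, $b);
    });
    $quoted = array_map(function ($t) { return preg_quote($t, '~'); }, array_keys($terms));
    $pattern = '~\b(' . implode('|', $quoted) . ')\b~i';

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // the XML declaration prefix hints UTF-8; the div is a temporary root
    $dom->loadHTML('<?xml encoding="utf-8" ?><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $dom->getElementsByTagName('div')->item(0);
    $xp = new DOMXPath($dom);
    $query = './/text()[not(ancestor::a)][not(ancestor::style)][not(ancestor::script)]';
    foreach ($xp->query($query, $root) as $textNode) {
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no term in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $k => $part) {
            if ($k & 1) {
                // odd indexes are the captured terms
                $anchor = $dom->createElement('a');
                $anchor->appendChild($dom->createTextNode($part));
                $anchor->setAttribute('class', 'text-info');
                $anchor->setAttribute('data-toggle', 'tooltip');
                $anchor->setAttribute('data-original-title', $terms[strtolower($part)]);
                $fragment->appendChild($anchor);
            } elseif ($part !== '') {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // serialize the children of the temporary root only
    $result = '';
    foreach ($root->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }
    return $result;
}

$out = linkTerms('this is just some example text.', ['example' => 'explanation about example']);
echo $out;
```

Calling it once when the content is saved, and storing the returned string, keeps the DOM work out of the display path.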
How do I change the innerHTML of a PHP DOMElement?
Another solution:
1) create new DOMDocumentFragment from the HTML string to be inserted;
2) remove old content of our element by deleting its child nodes;
3) append DOMDocumentFragment to our element.
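function setInnerHTML($element, $html)
{
    $fragment = $element->ownerDocument->createDocumentFragment();
    $fragment->appendXML($html);
    while ($element->hasChildNodes()) {
        $element->removeChild($element->firstChild);
    }
    $element->appendChild($fragment);
}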
Alternatively, we can replace our element with its clean copy and then append DOMDocumentFragment to this clone.
function setInnerHTML($element, $html)
{
    $fragment = $element->ownerDocument->createDocumentFragment();
    $fragment->appendXML($html);
    $clone = $element->cloneNode(); // Get element copy without children
    $clone->appendChild($fragment);
    $element->parentNode->replaceChild($clone, $element);
}
$doc = new DOMDocument();
$doc->loadXML('<div><span style="color: green">Old HTML</span></div>');
$div = $doc->getElementsByTagName('div')->item(0);
echo $doc->saveHTML();
setInnerHTML($div, '<p style="color: red">New HTML</p>');
echo $doc->saveHTML();
// Output:
// <div><span style="color: green">Old HTML</span></div>
// <div><p style="color: red">New HTML</p></div>
I needed to do this for a project recently and ended up with an extension to DOMElement: http://www.keyvan.net/2010/07/javascript-like-innerhtml-access-in-php/
Here's an example showing how it's used:
require_once 'JSLikeHTMLElement.php';
$doc = new DOMDocument();
$doc->registerNodeClass('DOMElement', 'JSLikeHTMLElement');
$doc->loadHTML('<div><p>Para 1</p><p>Para 2</p></div>');
$elem = $doc->getElementsByTagName('div')->item(0);
// print innerHTML
echo $elem->innerHTML; // prints '<p>Para 1</p><p>Para 2</p>'
// set innerHTML
$elem->innerHTML = 'FF';
// print document (with our changes)
echo $doc->saveXML();
I think the best thing you can do is come up with a function that will take the DOMElement that you want to change the InnerHTML of, copy it, and replace it.
In very rough PHP:
function replaceElement($el, $newInnerHTML) {
    // create the replacement in the document that owns $el
    $newElement = $el->ownerDocument->createElement($el->nodeName, $newInnerHTML);
    $el->parentNode->insertBefore($newElement, $el);
    $el->parentNode->removeChild($el);
    return $newElement;
}
This doesn't take into account attributes and nested structures, but I think this will get you on your way.
I ended up making this function using a few functions from other people on this page. I changed the one from Joanna Goch the way that Peter Brand says mostly, and also added some code from Guest and from other places.
This function does not use an extension, and does not use appendXML (which is very picky and breaks even if it sees a single unclosed BR tag), and it seems to be working well.
function set_inner_html( $element, $content ) {
    $DOM_inner_HTML = new DOMDocument();
    $internal_errors = libxml_use_internal_errors( true );
    $DOM_inner_HTML->loadHTML( mb_convert_encoding( $content, 'HTML-ENTITIES', 'UTF-8' ) );
    libxml_use_internal_errors( $internal_errors );
    $content_node = $DOM_inner_HTML->getElementsByTagName('body')->item(0);
    $content_node = $element->ownerDocument->importNode( $content_node, true );
    while ( $element->hasChildNodes() ) {
        $element->removeChild( $element->firstChild );
    }
    $element->appendChild( $content_node );
}
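To illustrate the difference between the two parsing routes, here is a small sketch. It first shows appendXML rejecting an unclosed br, then a variant of the loadHTML approach that imports the children of the parsed body rather than the body element itself. The helper name setInnerHtmlFromBody() and the UTF-8 declaration prefix are assumptions made for this example, not part of the function above.

```php
<?php
// Hypothetical helper: parse $content leniently as HTML, then move the parsed
// nodes (the children of <body>) into $element.
function setInnerHtmlFromBody(DOMElement $element, string $content): void
{
    $tmp = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // the XML declaration prefix hints UTF-8 to the HTML parser
    $tmp->loadHTML('<?xml encoding="utf-8" ?><body>' . $content . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // drop the old content
    while ($element->hasChildNodes()) {
        $element->removeChild($element->firstChild);
    }
    $body = $tmp->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return;
    }
    // snapshot the child list, then import each node into the target document
    foreach (iterator_to_array($body->childNodes) as $child) {
        $element->appendChild($element->ownerDocument->importNode($child, true));
    }
}

$doc = new DOMDocument();
$doc->loadXML('<div><span>Old</span></div>');
$div = $doc->documentElement;

// appendXML expects well-formed XML, so the unclosed <br> makes it fail
$fragment = $doc->createDocumentFragment();
$ok = @$fragment->appendXML('<p>line one<br>line two</p>');
var_dump($ok);

// the HTML parser accepts the same markup
setInnerHtmlFromBody($div, '<p>line one<br>line two</p>');
echo $doc->saveHTML($div);
```

Importing the children one by one avoids ending up with a body element nested inside the target element.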
It seems that appendXML doesn't always work, for example if you try to append XML with 3 levels. Here is the function I wrote that always works (you want to set $content as innerHTML of $element):
function setInnerHTML($DOM, $element, $content) {
    $DOMInnerHTML = new DOMDocument();
    $DOMInnerHTML->loadHTML($content);
    $contentNode = $DOMInnerHTML->getElementsByTagName('body')->item(0)->firstChild;
    $contentNode = $DOM->importNode($contentNode, true);
    $element->appendChild($contentNode);
    return $element;
}
Have a look at this library PHP Simple HTML DOM Parser http://simplehtmldom.sourceforge.net/
It looks pretty straightforward. You can change the innertext property of your elements. It might help.
Here is a replace by class function I just wrote:
It will replace the innerHtml of a class. You can also specify the node type eg. div/p/a etc.
function replaceInnerHtmlByClass($html, $replace=null, $class=null, $nodeType=null){
    if(!$nodeType){ $nodeType = '*'; }
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query("//{$nodeType}[contains(concat(' ', normalize-space(@class), ' '), '$class')]");
    foreach($nodes as $node) {
        // remove the old content
        while($node->childNodes->length){
            $node->removeChild($node->firstChild);
        }
        $fragment = $dom->createDocumentFragment();
        $fragment->appendXML($replace);
        $node->appendChild($fragment);
    }
    return $dom->saveHTML($dom->documentElement);
}
Here is another function I wrote to remove nodes with a specific class but preserving the inner html.
Setting replace to true will discard the inner html.
Setting replace to any other content will replace the inner html with the provided content.
function stripTagsByClass($html, $class=null, $nodeType=null, $replace=false){
    if(!$nodeType){ $nodeType = '*'; }
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query("//{$nodeType}[contains(concat(' ', normalize-space(@class), ' '), '$class')]");
    foreach($nodes as $node) {
        // collect the inner HTML of the node
        $innerHTML = '';
        $children = $node->childNodes;
        foreach($children as $child) {
            $tmp = new DOMDocument();
            $tmp->appendChild($tmp->importNode($child, true));
            $innerHTML .= $tmp->saveHTML();
        }
        $fragment = $dom->createDocumentFragment();
        if($replace !== null && $replace !== false){
            if($replace === true){ $replace = ''; }
            $innerHTML = $replace;
        }
        $fragment->appendXML($innerHTML);
        $node->parentNode->replaceChild($fragment, $node);
    }
    return $dom->saveHTML($dom->documentElement);
}
These functions can easily be adapted to use other attributes as the selector.
I only needed it to evaluate the class attribute.
Developing on from Joanna Goch's answer, this function will insert either a text node or an HTML fragment:
function nodeFromContent($node, $content) {
    //creates a text node, or dom node if content contains html
    $lt = strpos($content, '<');
    $gt = strrpos($content, '>');
    if (!($lt === false || $gt === false) && $gt > $lt) {
        //< followed by > means potentially contains HTML
        $DOMInnerHTML = new DOMDocument();
        $DOMInnerHTML->loadHTML($content);
        $contentNode = $DOMInnerHTML->getElementsByTagName('body')->item(0);
        $newNode = $node->ownerDocument->importNode($contentNode, true);
    } else {
        $newNode = $node->ownerDocument->createTextNode($content);
    }
    return $newNode;
}

$newNode = nodeFromContent($node, $content);
$node->parentNode->insertBefore($newNode, $node);
//or $node->appendChild($newNode) depending on what you require
Here is how you do it:
$doc = new DOMDocument('');
$label = $doc->createElement('label');
echo $doc->saveHTML();
function setInnerHTML($DOM, $element, $innerHTML) {
    $node = $DOM->createTextNode($innerHTML);
    $element->appendChild($node);
}
I have a html string that contains exactly one a-element in it. Example:
In php I have to test if rel contains external and if yes, then modify href and save the string.
I have looked for DOM nodes and objects. But they seem to be too much for only one A-element, as I have to iterate to get html nodes and I am not sure how to test if rel exists and contains external.
$html = new DOMDocument();
$a = $html->getElementsByTagName('a');
$attr = $a->item(0)->attributes();
At this point I am going to get a NodeMapList, which seems like overkill. Is there any simpler way to do this, or should I do it with DOM?
Is there any simpler way to do this, or should I do it with DOM?
Do it with DOM.
Here's an example:
$html = '<a href="http://example.com" rel="nofollow external">test</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]");
foreach($nodes as $node) {
    $node->setAttribute('href', 'http://example.org');
}
echo $dom->saveHTML();
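The padding with spaces in that XPath expression is what makes it match whole tokens of a space-separated attribute. A small sketch, using made-up sample links, shows that a value such as "externalish" is not matched:

```php
<?php
// Sample markup is invented for this illustration.
$html = '<p><a href="/a" rel="nofollow external">A</a>'
      . '<a href="/b" rel="externalish">B</a>'
      . '<a href="/c">C</a></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// " nofollow external " contains " external ", but " externalish " does not
$query = "//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]";

$matched = [];
foreach ($xpath->query($query) as $a) {
    $matched[] = $a->getAttribute('href');
}
print_r($matched);
```

Only the first link is selected; a plain contains(@rel, 'external') would also have picked up the second one.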
I went ahead and did it with DOM. This is what I ended up with:
$html = new DOMDocument();
$html->loadHtml('<?xml encoding="utf-8" ?>' . $txt);
$nodes = $html->getElementsByTagName('a');
foreach ($nodes as $node) {
    foreach ($node->attributes as $att) {
        if ($att->name == 'rel') {
            // strpos returns 0 for a match at the start, so compare with false
            if (strpos($att->value, 'external') !== false) {
                // modify the href of $node here
            }
        }
    }
}
$txt = $html->saveHTML();
I did not want to load any other library for just this one string.
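One detail worth checking in a loop like this: strpos returns 0 when the match is at the very start of the value, and a plain truthiness test treats that as "not found". The sketch below compares such a test with a token-based check; the helper name relHasToken() and the sample link are assumptions for illustration.

```php
<?php
// Hypothetical helper: true when the rel attribute lists $token as a whole word.
function relHasToken(DOMElement $a, string $token): bool
{
    if (!$a->hasAttribute('rel')) {
        return false;
    }
    // rel is a space-separated list; compare tokens case-insensitively
    $tokens = preg_split('/\s+/', strtolower($a->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
    return in_array(strtolower($token), $tokens, true);
}

$txt = '<a href="http://example.com/" rel="external nofollow">test</a>';
$dom = new DOMDocument();
$dom->loadHTML($txt);
$a = $dom->getElementsByTagName('a')->item(0);

// "external" starts at offset 0, so the truthiness test misses it
$naive = (bool) strpos($a->getAttribute('rel'), 'external');
$byToken = relHasToken($a, 'external');
var_dump($naive, $byToken);

if ($byToken) {
    $a->setAttribute('href', 'http://example.org/');
}
echo $dom->saveHTML($a);
```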
The best way is to use a HTML parser/DOM, but here's a regex solution:
$html = 'test<br>
<p> Some text</p>
<a rel="external">test3</a> <-- This won\'t work since there is no href in it.
';
$new = preg_replace_callback('/<a.+?rel\s*=\s*"([^"]*)"[^>]*>/i', function($m){
    if(strpos($m[1], 'external') !== false){
        $m[0] = preg_replace('/href\s*=\s*(("[^"]*")|(\'[^\']*\'))/i', 'href="http://example.com"', $m[0]);
    }
    return $m[0];
}, $html);
echo $new;
You could use a regular expression. If the tag matches
/\s+rel\s*=\s*".*external.*"/
then do a regex replace like
/(<a.*href\s*=\s*")([^"]*)("[^>]*>)/\1[your new href here]\3/
That said, using a library that can do this kind of thing for you is much easier (like jQuery for JavaScript).
I have some code that is generating a diff between two documents, inserting <ins> and <del> tags haphazardly. For the most part it's doing a great job, but every now and then it inserts tags in script, style and the title tags.
Any ideas on how to remove the <del> tags (including the text between them), remove the <ins> tags (but retaining the text within them as part of the original string), however only within those three tags? (title, script and style).
Don't use regex to do this; it sounds like you have to deal with many, many lines. DOMDocument is great.
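$dom = new DOMDocument;
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//script|//title|//style') as $node) {
    // snapshot the live node lists before modifying the tree
    foreach (iterator_to_array($node->getElementsByTagName('del')) as $delNode) {
        $delNode->parentNode->removeChild($delNode);
    }
    foreach (iterator_to_array($node->getElementsByTagName('ins')) as $insNode) {
        $insNode->parentNode->replaceChild($dom->createTextNode($insNode->nodeValue), $insNode);
    }
}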
Untested, this may or may not work:
$str = preg_replace('/(<script.*?>.*?)<del>.*?<\/del>(.*?<\/script>)/is', '$1$2', $str);
It attempts to look within the <script> ... </script> block of the string, and replace any instances of <del>...</del> with empty string.
The following ended up working quite well for me:
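$tags = array('script', 'title', 'style');
foreach ($tags as $tag) {
    $str = preg_replace_callback(
        '/(<' . ($tag) . '\b[^>]*>)(.*?)(<\/' . ($tag) . '>)/is',
        function($match) {
            // drop <del>...</del> entirely, strip <ins> tags but keep their text
            $replaced = preg_replace(
                array('/<del\b[^>]*>.*?<\/del>/is', '/<\/?ins\b[^>]*>/i'), '', $match[2]);
            return ($match[1]) . ($replaced) . ($match[3]);
        },
        $str
    );
}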
While the following didn't end up being my solution, it did get me far and could be useful to others:
$dom = new DOMDocument;
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//script|//title|//style') as $node) {
    foreach (iterator_to_array($node->getElementsByTagName('del')) as $delNode) {
        $delNode->parentNode->removeChild($delNode);
    }
    foreach (iterator_to_array($node->getElementsByTagName('ins')) as $insNode) {
        $insNode->parentNode->replaceChild($dom->createTextNode($insNode->nodeValue), $insNode);
    }
}
$str = (string) $dom->saveXML($dom, LIBXML_NOEMPTYTAG);
Hope this helps someone else.
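For reference, the DOM variant can be sketched on a small document. This example assumes well-formed XML input loaded with loadXML, so that ins and del inside title are really parsed as elements; an HTML parser may treat the content of script and style as raw text instead, in which case there are no such elements to find. The sample markup is invented for this illustration.

```php
<?php
// Assumption: the input is well-formed XML (for example XHTML).
$xml = '<html><head><title>My <del>old</del><ins>new</ins> page</title></head>'
     . '<body><p>Keep <ins>this</ins> markup</p></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// remove <del> elements, including their text, inside the three tags
$dels = $xpath->query('//title//del | //script//del | //style//del');
foreach (iterator_to_array($dels) as $del) {
    $del->parentNode->removeChild($del);
}

// unwrap <ins> elements inside the three tags, keeping their text
$inses = $xpath->query('//title//ins | //script//ins | //style//ins');
foreach (iterator_to_array($inses) as $ins) {
    $ins->parentNode->replaceChild($dom->createTextNode($ins->textContent), $ins);
}

$title = $dom->getElementsByTagName('title')->item(0)->textContent;
echo $title, "\n";
echo $dom->saveXML($dom->documentElement), "\n";
```

The ins element in the body paragraph is left alone because the XPath expressions only look below title, script and style.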