I'm trying to scan my WordPress content for:
<p><span class="embed-youtube">some iframed video</span></p>
and then change it into:
<p class="img_wrap"><span class="embed-youtube">some iframed video</span></p>
using the following code in my theme's functions.php file:
$classes = 'class="img_wrap"';
$youtube_match = preg_match('/(<p.*?)(.*?><span class="embed-youtube")/', $content, $youtube_array);
if(!empty($youtube_match))
{
$content = preg_replace('/(<p.*?)(.*?><span class=\"embed-youtube\")/', '$1 ' . $classes . '$2', $content);
}
but for some reason I am not getting a match on my regex nor is the replace working. I don't understand why there isn't a match because the span with class embed-youtube exists.
UPDATE - HERE IS THE FULL FUNCTION
function give_attachments_class($content){
$classes = 'class="img_wrap"';
$img_match = preg_match("/(<p.*?)(.*?><img)/", $content, $img_array);
$youtube_match = preg_match('/(<p.*?)(.*?><span class="embed-youtube")/', $content, $youtube_array);
// $doc = new DOMDocument;
// @$doc->loadHTML($content); // load the HTML data
// $xpath = new DOMXPath($doc);
// $nodes = $xpath->query('//p/span[@class="embed-youtube"]');
// foreach ($nodes as $node) {
// $node->parentNode->setAttribute('class', 'img_wrap');
// }
// $content = $doc->saveHTML();
if(!empty($img_match))
{
$content = preg_replace('/(<p.*?)(.*?><img)/', '$1 ' . $classes . '$2', $content);
}
else if(!empty($youtube_match))
{
$content = preg_replace('/(<p.*?)(.*?><span class=\"embed-youtube\")/', '$1 ' . $classes . '$2', $content);
}
$content = preg_replace("/<img(.*?)src=('|\")(.*?).(bmp|gif|jpeg|jpg|png)(|\")(.*?)>/", '<img$1 data-original=$3.$4 $6>' , $content);
return $content;
}
add_filter('the_content','give_attachments_class');
Instead of using regex, make effective use of DOM and XPath to do this for you.
$doc = new DOMDocument;
@$doc->loadHTML($html); // load the HTML data
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//p/span[@class="embed-youtube"]');
foreach ($nodes as $node) {
$node->parentNode->setAttribute('class', 'img_wrap');
}
echo $doc->saveHTML();
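For use inside a the_content filter, a rough sketch of the same idea might look like the following. The function name add_img_wrap_class and the wrapper div with id "wrap" are made up for illustration; the wrapper is only there so the result can be returned as a fragment, without the doctype, html and body tags that saveHTML() adds to a whole document. It assumes the content is UTF-8.

```php
<?php
// Sketch: add class="img_wrap" to every <p> that directly contains
// <span class="embed-youtube">. The function name is hypothetical.
function add_img_wrap_class(string $content): string
{
    if (trim($content) === '') {
        return $content;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    // The XML encoding hint asks libxml to read the fragment as UTF-8.
    $doc->loadHTML('<?xml encoding="utf-8" ?><div id="wrap">' . $content . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "embed-youtube" as a whole class token, even if the span has other classes.
    $query = '//p[span[contains(concat(" ", normalize-space(@class), " "), " embed-youtube ")]]';
    foreach ($xpath->query($query) as $p) {
        $existing = trim($p->getAttribute('class'));
        $p->setAttribute('class', $existing === '' ? 'img_wrap' : $existing . ' img_wrap');
    }

    // Serialize only the children of the wrapper div.
    $wrapper = $xpath->query('//div[@id="wrap"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo add_img_wrap_class('<p><span class="embed-youtube">some iframed video</span></p>');
```

Any class already on the paragraph is kept and img_wrap is appended to it, which the plain setAttribute call would overwrite.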
Here is a quick and dirty REGEX I did for you. It finds the entire string starting with p tag, ending p tag, span also included etc. I also wrote it to include single or double quotes for you since you never know and also to include spaces in various places. Let me know how it works out for you, thanks.
(<p )+(class=)['"]+img_wrap+['"](><span)+[ ]+(class=)+['"]embed-youtube+['"]>[A-Za-z0-9='" ]+(</span></p>)
I have tested it on your code and a few other variations and it works for me.
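If you do want to stay with a regex for the replacement itself, here is a minimal sketch. It assumes the paragraph is a bare <p> with no attributes and that the span's class attribute is exactly embed-youtube; anything more varied than that is better handled with DOM.

```php
<?php
// Sketch: add class="img_wrap" to a bare <p> that is immediately followed by
// <span class="embed-youtube">. Paragraphs that already carry attributes are skipped.
$content = '<p><span class="embed-youtube">some iframed video</span></p><p>plain text</p>';

// Group 2 captures the opening quote; \2 requires the same quote to close the attribute.
$pattern = '~<p>(\s*<span\s+class=(["\'])embed-youtube\2)~i';
$content = preg_replace($pattern, '<p class="img_wrap">$1', $content);

echo $content;
// prints: <p class="img_wrap"><span class="embed-youtube">some iframed video</span></p><p>plain text</p>
```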
How to Change innerHTML of a php DOMElement ?
Another solution:
1) create new DOMDocumentFragment from the HTML string to be inserted;
2) remove old content of our element by deleting its child nodes;
3) append DOMDocumentFragment to our element.
function setInnerHTML($element, $html)
{
$fragment = $element->ownerDocument->createDocumentFragment();
$fragment->appendXML($html);
while ($element->hasChildNodes())
$element->removeChild($element->firstChild);
$element->appendChild($fragment);
}
Alternatively, we can replace our element with its clean copy and then append DOMDocumentFragment to this clone.
function setInnerHTML($element, $html)
{
$fragment = $element->ownerDocument->createDocumentFragment();
$fragment->appendXML($html);
$clone = $element->cloneNode(); // Get element copy without children
$clone->appendChild($fragment);
$element->parentNode->replaceChild($clone, $element);
}
Test:
$doc = new DOMDocument();
$doc->loadXML('<div><span style="color: green">Old HTML</span></div>');
$div = $doc->getElementsByTagName('div')->item(0);
echo $doc->saveHTML();
setInnerHTML($div, '<p style="color: red">New HTML</p>');
echo $doc->saveHTML();
// Output:
// <div><span style="color: green">Old HTML</span></div>
// <div><p style="color: red">New HTML</p></div>
I needed to do this for a project recently and ended up with an extension to DOMElement: http://www.keyvan.net/2010/07/javascript-like-innerhtml-access-in-php/
Here's an example showing how it's used:
<?php
require_once 'JSLikeHTMLElement.php';
$doc = new DOMDocument();
$doc->registerNodeClass('DOMElement', 'JSLikeHTMLElement');
$doc->loadHTML('<div><p>Para 1</p><p>Para 2</p></div>');
$elem = $doc->getElementsByTagName('div')->item(0);
// print innerHTML
echo $elem->innerHTML; // prints '<p>Para 1</p><p>Para 2</p>'
// set innerHTML
$elem->innerHTML = 'FF';
// print document (with our changes)
echo $doc->saveXML();
?>
I think the best thing you can do is come up with a function that will take the DOMElement that you want to change the InnerHTML of, copy it, and replace it.
In very rough PHP:
function replaceElement($el, $newInnerHTML) {
$newElement = $el->ownerDocument->createElement($el->nodeName, $newInnerHTML);
$el->parentNode->insertBefore($newElement, $el);
$el->parentNode->removeChild($el);
return $newElement;
}
This doesn't take into account attributes and nested structures, but I think this will get you on your way.
I ended up making this function from pieces of other people's answers on this page. I mostly changed Joanna Goch's version the way Peter Brand suggests, and also added some code from Guest and from other places.
This function does not use an extension and does not use appendXML (which is very picky and breaks on even a single unclosed BR tag), and it seems to be working well.
function set_inner_html( $element, $content ) {
$DOM_inner_HTML = new DOMDocument();
$internal_errors = libxml_use_internal_errors( true );
$DOM_inner_HTML->loadHTML( mb_convert_encoding( $content, 'HTML-ENTITIES', 'UTF-8' ) );
libxml_use_internal_errors( $internal_errors );
$content_node = $DOM_inner_HTML->getElementsByTagName('body')->item(0);
$content_node = $element->ownerDocument->importNode( $content_node, true );
while ( $element->hasChildNodes() ) {
$element->removeChild( $element->firstChild );
}
$element->appendChild( $content_node );
}
It seems that appendXML doesn't always work, for example if you try to append XML with 3 levels. Here is the function I wrote that always works (you want to set $content as the innerHTML of $element):
function setInnerHTML($DOM, $element, $content) {
$DOMInnerHTML = new DOMDocument();
$DOMInnerHTML->loadHTML($content);
$contentNode = $DOMInnerHTML->getElementsByTagName('body')->item(0)->firstChild;
$contentNode = $DOM->importNode($contentNode, true);
$element->appendChild($contentNode);
return $element;
}
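Note that the function above imports only the first node of the parsed content and leaves the element's old children in place. A possible variant that clears the element and imports every top-level node is sketched below. The name replace_inner_html and the temporary wrapper div are assumptions for illustration, and the input is assumed to be UTF-8.

```php
<?php
// Sketch: replace the children of $element with the parsed nodes of an HTML string.
// The function name is hypothetical.
function replace_inner_html(DOMElement $element, string $html): void
{
    $tmp = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup such as an unclosed <br>
    // A wrapper div keeps all top-level nodes of the fragment under one known parent.
    $tmp->loadHTML('<?xml encoding="utf-8" ?><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove the old content.
    while ($element->hasChildNodes()) {
        $element->removeChild($element->firstChild);
    }

    $wrapper = $tmp->getElementsByTagName('div')->item(0);
    if ($wrapper === null) {
        return;
    }
    // importNode() copies each node into the target document; the temporary tree is untouched.
    foreach ($wrapper->childNodes as $child) {
        $element->appendChild($element->ownerDocument->importNode($child, true));
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div><span>Old HTML</span></div>');
$div = $doc->getElementsByTagName('div')->item(0);
replace_inner_html($div, '<p>New<br>HTML</p><p>Second paragraph</p>');
echo $doc->saveHTML($div);
```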
Have a look at this library PHP Simple HTML DOM Parser http://simplehtmldom.sourceforge.net/
It looks pretty straightforward. You can change the innertext property of your elements. It might help.
Here is a replace by class function I just wrote:
It replaces the inner HTML of every element that has a given class. You can also restrict it to a node type, e.g. div, p or a.
function replaceInnerHtmlByClass($html, $replace=null, $class=null, $nodeType=null){
if(!$nodeType){ $nodeType = '*'; }
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//{$nodeType}[contains(concat(' ', normalize-space(@class), ' '), '$class')]");
foreach($nodes as $node) {
while($node->childNodes->length){
$node->removeChild($node->firstChild);
}
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($replace);
$node->appendChild($fragment);
}
return $dom->saveHTML($dom->documentElement);
}
Here is another function I wrote to remove nodes with a specific class but preserving the inner html.
Setting replace to true will discard the inner html.
Setting replace to any other content will replace the inner html with the provided content.
function stripTagsByClass($html, $class=null, $nodeType=null, $replace=false){
if(!$nodeType){ $nodeType = '*'; }
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//{$nodeType}[contains(concat(' ', normalize-space(@class), ' '), '$class')]");
foreach($nodes as $node) {
$innerHTML = '';
$children = $node->childNodes;
foreach($children as $child) {
$tmp = new DOMDocument();
$tmp->appendChild($tmp->importNode($child,true));
$innerHTML .= $tmp->saveHTML();
}
$fragment = $dom->createDocumentFragment();
if($replace !== null && $replace !== false){
if($replace === true){ $replace = ''; }
$innerHTML = $replace;
}
$fragment->appendXML($innerHTML);
$node->parentNode->replaceChild($fragment, $node);
}
return $dom->saveHTML($dom->documentElement);
}
These functions can easily be adapted to use other attributes as the selector.
I only needed it to evaluate the class attribute.
Developing on from Joanna Goch's answer, this function will insert either a text node or an HTML fragment:
function nodeFromContent($node, $content) {
//creates a text node, or dom node if content contains html
$lt = strpos($content, '<');
$gt = strrpos($content, '>');
if (!($lt === false || $gt === false) && $gt > $lt) {
//< followed by > means potentially contains HTML
$DOMInnerHTML = new DOMDocument();
$DOMInnerHTML->loadHTML($content);
$contentNode = $DOMInnerHTML->getElementsByTagName('body')->item(0);
$newNode = $node->ownerDocument->importNode($contentNode, true);
} else {
$newNode = $node->ownerDocument->createTextNode($content);
}
return $newNode;
}
Usage:
$newNode = nodeFromContent($node, $content);
$node->parentNode->insertBefore($newNode, $node);
//or $node->appendChild($newNode) depending on what you require
Here is how you do it:
$doc = new DOMDocument('');
$label = $doc->createElement('label');
$label->appendChild($doc->createTextNode('test'));
$li->appendChild($label);
echo $doc->saveHTML();
function setInnerHTML($DOM, $element, $innerHTML) {
$node = $DOM->createTextNode($innerHTML);
$element->appendChild($node);
}
What I'm seeking is an elegant way to remove an element with a certain class together with everything inside it, i.e. to remove all the HTML in the sometestclass class using PHP.
The function below works somewhat - not that well - it removes some parts of the page I don't want removed.
Below is a function based on an original post (below):
$html = "<p>Hello World</p>
<div class='sometestclass'>
<img src='foo.png'/>
<div>Bar</div>
</div>";
$clean = removeDiv ($html,'sometestclass');
echo $clean;
function removeDiv ($html,$removeClass){
$dom = new DOMDocument;
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
$removeString = ".//div[@class='$removeClass']";
$pDivs = $xpath->query($removeString);
foreach ( $pDivs as $div ) {
$div->parentNode->removeChild( $div );
}
$output = preg_replace( "/.*<body>(.*)<\/body>.*/s", "$1", $dom->saveHTML() );
return $output;
}
Does anyone have any suggestions to improve the results of this?
the original post is here
You are not quoting the class name:
$removeString = ".//div[@class=$removeClass]";
should be:
$removeString = ".//div[@class='$removeClass']";
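Also keep in mind that @class='...' only matches when the attribute is exactly that string, so a div with class="box sometestclass" would be missed. A sketch of a token-based match follows. The name remove_divs_by_class and the wrapper div are assumptions for illustration, and $class is assumed to contain no quotes or whitespace.

```php
<?php
// Sketch: remove every <div> whose class list contains the given token.
// The function name is hypothetical.
function remove_divs_by_class(string $html, string $class): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so it can be serialized again without html/body tags.
    $dom->loadHTML('<?xml encoding="utf-8" ?><div id="wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Padding both sides with spaces makes "sometestclass" match only as a whole token.
    $query = sprintf('//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $class);
    foreach ($xpath->query($query) as $div) {
        if ($div->parentNode !== null) {
            $div->parentNode->removeChild($div);
        }
    }

    $wrapper = $xpath->query('//div[@id="wrap"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html = "<p>Hello World</p><div class='box sometestclass'><img src='foo.png'/><div>Bar</div></div>";
echo remove_divs_by_class($html, 'sometestclass');
```

Serializing the wrapper's children also avoids the preg_replace on the body tag in the original function.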
I'm trying to remove all <br> before my text.
So I have this:
<p>
<br/><br/>When the battle is on between contestants in a talent show, it gets really competitive when down to the last four. X-FactorUSAcontestant Marcus Canty knows this all too well as this is the stage he was voted off of the show earlier this year. <br/><br/>
</p>
I want to get rid of the first two <br/> but also I'd want to get rid of them if there were more than 2.
I would prefer to use XPath as I'm already using it; at the moment I have this:
foreach($xpath->query('//br[not(preceding::text())]') as $node) {
$node->parentNode->removeChild($node);
}
For some reason on this particular page it doesn't seem to be working.
UPDATE
Originally the question was why there were <br> tags at the start of the document when my XPath should be getting rid of them (see below). I applied some regex to see if that worked, which revealed the doctype you see now. I thought the doctype was somehow causing my original problem, but it just wasn't being shown until now. This content is what I've imported from Blogger and am currently manipulating to fit a new blog.
link to example page
!DOCTYPE html PUBLIC “-//W3C//DTD HTML 4.0 Transitional//EN” “http://www.w3.org/TR/REC-html40/loose.dtd”><br><br>
Here's my code:
global $post;
$postTime = $post->post_date;
$postTime = strtotime($postTime);
$startDate = "2014/01/16";
if ($postTime < strtotime($startDate)) {
$html = mb_convert_encoding($content, 'HTML-ENTITIES', "UTF-8");
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach($xpath->query('//br[not(preceding::text())]') as $node) {
$node->parentNode->removeChild($node);
}
$nodes = $xpath->query('//a[string-length(.) = 0]');
foreach($nodes as $node) {
$node->parentNode->removeChild($node);
}
$nodes = $xpath->query('//*[not(text() or node() or self::br)]');
foreach($nodes as $node) {
$node->parentNode->removeChild($node);
}
remove_filter('the_content', 'wpautop');
$content = $doc->saveHTML();
$content = ltrim($content, '<br>');
$content = strip_tags($content, '<br> <a> <iframe>');
$content = preg_replace(array('/(<br\s*\/?>\s*){1,}/'), array('<br/><br/>'), $content);
$content = str_replace(' ', ' ', $content);
$content = "<p>".implode("</p>\n\n<p>", preg_split('/\n(?:\s*\n)+/', $content))."</p>";
return $content;
Help appreciated.
What about ltrim?
$string = ltrim($string, '<br/>');
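One caveat with this: the second argument of ltrim is a list of characters, not a literal string, so '<br/>' strips any leading run of <, b, r, / and > characters. That is probably how the doctype in the question lost its opening <. The sketch below shows the effect and an anchored regex alternative; strip_leading_br is a made-up helper name.

```php
<?php
// ltrim() treats '<br/>' as the character set {<, b, r, /, >}.
$masked = ltrim('<b>bold</b>', '<br/>');
echo $masked, "\n"; // prints: old</b>

// Anchored alternative: drop only whole <br>, <br/> or <br /> tags (and the
// whitespace around them) at the very start of the string.
function strip_leading_br(string $html): string
{
    return (string) preg_replace('~^(?:\s*<br\s*/?>)+\s*~i', '', $html);
}

echo strip_leading_br('<br/><br>When the battle is on<br/>'), "\n";
// prints: When the battle is on<br/>
```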
You could try using a regex
s/!DOCTYPE html PUBLIC “-\/\/W3C\/\/DTD HTML 4.0 Transitional\/\/EN” “http:\/\/www.w3.org\/TR\/REC-html40\/loose.dtd”>((<br[^>]*/>)+)(.*)/\3/
or in PHP:
$pattern = '/^((<br\b[^>]*>)+)(.*)/i'; // no unescaped "/" inside the delimiters; matches <br>, <br/> and <br />
$replacement = '$3';
$content = preg_replace($pattern, $replacement, $content);
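Staying with XPath, as the question prefers: one possible reason the original query misses some tags is that whitespace-only text nodes (for example a newline right after the opening <p>) count as preceding text. This is a guess, not something confirmed for the page in question. A sketch that ignores blank text nodes follows; the function name and the wrapper div are made up for illustration.

```php
<?php
// Sketch: remove <br> elements that come before any non-blank text in the fragment.
// The function name is hypothetical.
function remove_leading_br(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="utf-8" ?><div id="wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // [normalize-space()] filters out text nodes that contain only whitespace.
    foreach ($xpath->query('//br[not(preceding::text()[normalize-space()])]') as $br) {
        $br->parentNode->removeChild($br);
    }

    // Serialize only the wrapper's children so no doctype or body tags are added.
    $wrapper = $xpath->query('//div[@id="wrap"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo remove_leading_br("<p>\n<br/><br/>When the battle is on<br/><br/>\n</p>");
```

Returning only the wrapper's children means there is no doctype left to clean up with ltrim or a regex afterwards.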
I have a html string that contains exactly one a-element in it. Example:
test
In php I have to test if rel contains external and if yes, then modify href and save the string.
I have looked for DOM nodes and objects. But they seem to be too much for only one A-element, as I have to iterate to get html nodes and I am not sure how to test if rel exists and contains external.
$html = new DOMDocument();
$html->loadHtml($txt);
$a = $html->getElementsByTagName('a');
$attr = $a->item(0)->attributes();
...
At this point I am going to get NodeMapList that seems to be overhead. Is there any simpler way for this or should I do it with DOM?
Is there any simpler way for this or should I do it with DOM?
Do it with DOM.
Here's an example:
<?php
$html = 'test';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(#rel), ' '), ' external ')]");
foreach($nodes as $node) {
$node->setAttribute('href', 'http://example.org');
}
echo $dom->saveHTML();
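Since the string holds exactly one a element, you may want just that element back rather than a whole document with doctype and body tags. A sketch, with a made-up function name and example URLs:

```php
<?php
// Sketch: rewrite href on a single-anchor fragment and return only the <a> markup.
// rewrite_external_href() and the URLs below are hypothetical.
function rewrite_external_href(string $fragment, string $newHref): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $a = $dom->getElementsByTagName('a')->item(0);
    if ($a === null) {
        return $fragment; // no anchor found, leave the input alone
    }

    // Split rel on whitespace so "external" is matched as a whole token.
    $tokens = preg_split('/\s+/', trim($a->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
    if (in_array('external', $tokens, true)) {
        $a->setAttribute('href', $newHref);
    }

    // Passing a node to saveHTML() serializes just that node.
    return $dom->saveHTML($a);
}

echo rewrite_external_href(
    '<a href="http://old.example/" rel="nofollow external">test</a>',
    'http://example.org'
);
```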
I went ahead and did the modification with DOM. This is what I ended up with:
$html = new DOMDocument();
$html->loadHtml('<?xml encoding="utf-8" ?>' . $txt);
$nodes = $html->getElementsByTagName('a');
foreach ($nodes as $node) {
foreach ($node->attributes as $att) {
if ($att->name == 'rel') {
if (strpos($att->value, 'external') !== false) { // strpos returns 0 (falsy) when rel starts with "external"
$node->setAttribute('href','modified_url_goes_here');
}
}
}
}
$txt = $html->saveHTML();
I did not want to load any other library for just this one string.
The best way is to use a HTML parser/DOM, but here's a regex solution:
$html = 'test<br>
<p> Some text</p>
test2<br>
<a rel="external">test3</a> <-- This won\'t work since there is no href in it.
';
$new = preg_replace_callback('/<a.+?rel\s*=\s*"([^"]*)"[^>]*>/i', function($m){
if(strpos($m[1], 'external') !== false){
$m[0] = preg_replace('/href\s*=\s*(("[^"]*")|(\'[^\']*\'))/i', 'href="http://example.com"', $m[0]);
}
return $m[0];
}, $html);
echo $new;
Online demo.
You could use a regular expression like
if it matches /\s+rel\s*=\s*".*external.*"/
then do a regExp replace like
/(<a.*href\s*=\s*")([^"]*)("[^>]*>)/\1[your new href here]\3/
Though using a library that can do this kind of stuff for you is much easier (like jQuery for JavaScript)
I have some code that is generating a diff between two documents, inserting <ins> and <del> tags haphazardly. For the most part it's doing a great job, but every now and then it inserts tags in script, style and the title tags.
Any ideas on how to remove the <del> tags (including the text between them), remove the <ins> tags (but retaining the text within them as part of the original string), however only within those three tags? (title, script and style).
Don't use regex to do this; it sounds like you have to deal with many, many lines. DOMDocument is great.
$dom = new DOMDocument;
$dom->loadHTML($your_html_string);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//script|//title|//style') as $node) {
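// Copy the live node lists first; removing nodes while iterating them skips entries.
foreach (iterator_to_array($node->getElementsByTagName('del')) as $delNode) {
$delNode->parentNode->removeChild($delNode);
}
foreach (iterator_to_array($node->getElementsByTagName('ins')) as $insNode) {
$insNode->parentNode->replaceChild($dom->createTextNode($insNode->nodeValue), $insNode);
}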
}
Untested, this may or may not work:
$str = preg_replace('~(<script.*?>.*?)<del>.*?</del>(.*?</script>)~is', '$1$2', $str);
It attempts to look within the <script> ... </script> block of the string, and replace any instances of <del>...</del> with empty string.
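A single preg_replace like this removes at most one <del> per block and does not touch <ins>. A sketch that handles all three tags and every occurrence with preg_replace_callback is below. The function name clean_diff_markup is made up, and it assumes the <del> and <ins> tags are not nested inside each other.

```php
<?php
// Sketch: inside <script>, <style> and <title> blocks only, delete <del>...</del>
// (including its text) and unwrap <ins>...</ins> (keeping its text).
function clean_diff_markup(string $html): string
{
    return (string) preg_replace_callback(
        // Group 1: opening tag, group 2: tag name, group 3: contents, group 4: closing tag.
        '~(<(script|style|title)\b[^>]*>)(.*?)(</\2\s*>)~is',
        function (array $m): string {
            $inner = preg_replace('~<del\b[^>]*>.*?</del>~is', '', $m[3]);
            $inner = preg_replace('~<ins\b[^>]*>(.*?)</ins>~is', '$1', $inner);
            return $m[1] . $inner . $m[4];
        },
        $html
    );
}

echo clean_diff_markup('<title>My <del>old</del><ins>new</ins> page</title><p><del>gone</del><ins>kept</ins></p>');
// prints: <title>My new page</title><p><del>gone</del><ins>kept</ins></p>
```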
The following ended up working quite well for me:
$tags = array('script', 'title', 'style');
foreach ($tags as $tag) {
$str = preg_replace_callback(
'/(<' . ($tag) . '\b[^>]*>)(.*?)(<\/' . ($tag) . '>)/is',
function($match) {
$replaced = preg_replace(
array(
'/__Delete-Start__.+__Delete-End__/Uis',
'/__Insert-Start__(.+)__Insert-End__/Uis'
),
array(
'',
'$1'
),
$match[2]
);
return ($match[1]) . ($replaced) . ($match[3]);
},
$str
);
}
While the following didn't end up being my solution, it did get me far and could be useful to others:
$dom = new DOMDocument;
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//script|//title|//style') as $node) {
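// Copy the live node lists first; removing nodes while iterating them skips entries.
foreach (iterator_to_array($node->getElementsByTagName('del')) as $delNode) {
$delNode->parentNode->removeChild($delNode);
}
foreach (iterator_to_array($node->getElementsByTagName('ins')) as $insNode) {
$insNode->parentNode->replaceChild($dom->createTextNode($insNode->nodeValue), $insNode);
}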
}
$str = (string) $dom->saveXML($dom, LIBXML_NOEMPTYTAG);//$xpath->query('//p')->item(0));
Hope this helps someone else.