OK, since this is a WordPress problem that sadly goes a little deeper, I need to remove every occurrence of a parent div together with everything inside it:
<div class="sometestclass">
<img ....>
<div>.....</div>
any other html tags
</div><!-- END: .sometestclass -->
The only idea I have is to match everything that starts with:
<div class="sometestclass">
and ends with:
<!-- END: .sometestclass -->
with all that is between (I can tag the end of parent div anyway I want, this is just a sample).
Does anybody have an idea how to do it with:
<?php $content = preg_replace('?????','',$content); ?>
I wouldn't use a regular expression. Instead, I would use the DOMDocument class. Just find all of the div elements with that class, and remove them from their parent(s):
$html = "<p>Hello World</p>
<div class='sometestclass'>
<img src='foo.png'/>
<div>Bar</div>
</div>";
$dom = new DOMDocument;
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
$pDivs = $xpath->query(".//div[@class='sometestclass']");
foreach ( $pDivs as $div ) {
$div->parentNode->removeChild( $div );
}
echo preg_replace( "/.*<body>(.*)<\/body>.*/s", "$1", $dom->saveHTML() );
Which results in:
<p>Hello World</p>
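The `preg_replace()` call that peels off the `<body>` wrapper can be avoided by serializing the children of `<body>` one by one. The following is only a sketch under a few assumptions: the helper name `remove_divs_by_class` is hypothetical, the input is a non-empty fragment, and the class name contains no single quote.

```php
<?php
// Hypothetical helper: removes every <div> whose class attribute equals
// $class exactly and returns the remaining fragment without <html>/<body>.
function remove_divs_by_class(string $html, string $class): string
{
    $dom = new DOMDocument();
    // Collect parser warnings instead of printing them; real markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumption: $class contains no single quote.
    foreach ($xpath->query("//div[@class='$class']") as $div) {
        $div->parentNode->removeChild($div);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> instead of regex-matching the full document.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$clean = remove_divs_by_class(
    "<p>Hello World</p><div class='sometestclass'><img src='foo.png'/><div>Bar</div></div>",
    'sometestclass'
);
echo $clean;
```

Passing a node to `saveHTML()` serializes just that node, so no doctype or wrapper elements have to be stripped afterwards.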
<?php $content = preg_replace('/<div class="sometestclass">.*?<\/div><!-- END: .sometestclass -->/s','',$content); ?>
My RegEx is a bit rusty, but I think this should work. Do note that, as others have said, RegEx is not properly equipped to handle some of the complexities of HTML.
In addition, this pattern won't find embedded div elements with the class sometestclass. You would need recursion for that.
How about just some CSS .sometestclass{display: none;} ?
For the UTF-8 issue, I found a hack in the PHP manual.
So my function looks as follows:
function rem_fi_cat() {
/* This function removes images from _within_ the article.
* If these images are enclosed in a "wp-caption" div-tag.
* If the articles are post formatted as "image".
* Only on home-page, front-page and in category/archive-pages.
*/
if ( (is_home() || is_front_page() || is_category()) && has_post_format( 'image' ) ) {
$document = new DOMDocument();
$content = get_the_content( '', true );
if( '' != $content ) {
/* incl. UTF-8 "hack" as described at
* http://www.php.net/manual/en/domdocument.loadhtml.php#95251
*/
$document->loadHTML( '<?xml encoding="UTF-8">' . $content );
foreach ($document->childNodes as $item) {
if ($item->nodeType == XML_PI_NODE) {
$document->removeChild($item); // remove hack
$document->encoding = 'UTF-8'; // insert proper
}
}
$xpath = new DOMXPath( $document );
$pDivs = $xpath->query(".//div[@class='wp-caption']");
foreach ( $pDivs as $div ) {
$div->parentNode->removeChild( $div );
}
echo preg_replace( "/.*<div class=\"entry-container\">(.*)<\/div>.*/s", "$1", $document->saveHTML() );
}
}
}
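As an alternative to the `<?xml encoding="UTF-8">` prefix, the charset can be declared with a `meta` element in front of the fragment. This is a different technique from the one used in the function above, shown only as a sketch; the helper name `load_utf8_fragment` is hypothetical, and the input is assumed to be a UTF-8 string.

```php
<?php
// Hypothetical helper: parses a UTF-8 HTML fragment into a DOMDocument.
function load_utf8_fragment(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // The meta element tells the parser that the input is UTF-8. It is placed
    // in <head>, so it does not show up among the children of <body>.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// "\u{E1}" is a-acute and "\u{E9}" is e-acute, written as escapes so the
// example does not depend on the encoding of this source file.
$dom  = load_utf8_fragment("<p>B\u{E1}lint caf\u{E9}</p>");
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text;
```

Unlike the processing-instruction hack, nothing has to be removed from the document afterwards as long as only the `<body>` content is serialized.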
I have the following 2 sets of code (Wordpress) using regex, but I was told that it's a bad practice.
I am using it in 2 ways:
To take out the blockquote and images from the post and just display the text.
To essentially do the opposite and display just the images.
I'm looking to rewrite it in a more proper, more acceptable/cross-browser form.
html (display text):
<?php
$content = preg_replace('/<blockquote>(.*?)<\/blockquote>/', '', get_the_content());
$content = preg_replace('/(<img [^>]*>)/', '', $content);
$content = wpautop($content); // Add paragraph-tags
$content = str_replace('<p></p>', '', $content); // remove empty paragraphs
echo $content;
?>
html (display images):
<?php
preg_match_all('/(<img [^>]*>)/', get_the_content(), $images);
for( $i=0; isset($images[1]) && $i < count($images[1]); $i++ ) {
if ($i === count($images[1]) - 1) { // last image; end() needs a variable, not a function result
echo sprintf('<div id="last-img">%s</div>', $images[1][$i]);
continue;
}
echo $images[1][$i];
}
?>
You can use the answer from here: Strip Tags and everything in between
The point is to use a parser, rather than roll-your-own regex that might be buggy.
$content = get_the_content();
$content = wpautop($content);
$doc = new DOMDocument();
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); // parse the wpautop() output, not the raw content
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//blockquote') as $node) {
$node->parentNode->removeChild($node);
}
foreach ($xpath->query('//img') as $node) {
$node->parentNode->removeChild($node);
}
foreach( $xpath->query('//p[not(node())]') as $node ) {
$node->parentNode->removeChild($node);
}
$content = $doc->saveHTML($doc);
You may find that php DOMDocument has wrapped your html fragment in <html> tags, in which case look at How to saveHTML of DOMDocument without HTML wrapper?
The part that removes empty p tags is from Remove empty tags from a XML with PHP
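The answer above covers the text-only case. For the opposite case from the question (display only the images), the same parser approach might look like the sketch below. The helper name `extract_images` and the sample markup are hypothetical; in WordPress the input would come from `get_the_content()`.

```php
<?php
// Hypothetical helper: returns the serialized <img> elements found in $html,
// in document order.
function extract_images(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $images[] = $doc->saveHTML($img); // serialize only this node
    }
    return $images;
}

$images = extract_images(
    '<p>Intro <img src="a.png" alt="A"></p><blockquote>Quote</blockquote><img src="b.png" alt="B">'
);

// Wrap the last image in its own container, as the regex version does.
$last = count($images) - 1;
foreach ($images as $i => $img) {
    echo $i === $last ? sprintf('<div id="last-img">%s</div>', $img) : $img;
}
```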
I want to wrap an iframe object in a div class, but only if it isn't already wrapped in that div class. I'm trying to use a negative match pattern for that div class so preg_replace will not match and return the original $content. However it still matches:
<?php
$content = <<< EOL
<div class="aoa_wrap"><iframe width="500" height="281" src="https://www.youtube.com/embed/uuZE_IRwLNI" frameborder="0" allowfullscreen></iframe></div>
EOL;
$pattern = "~(?!<div(.*?)aoa_wrap(.*?)>)<iframe\b[^>]*>(?:(.*?)?</iframe>)?~";
$replace = '<div class="aoa_wrap">${0}</div>';
$content = preg_replace( $pattern, $replace, $content);
echo $content . "\n";
?>
Output (incorrect):
<div class="aoa_wrap"><div class="aoa_wrap"><iframe width="500" height="281" src="https://www.youtube.com/embed/uuZE_IRwLNI" frameborder="0" allowfullscreen></iframe></div></div>
I'm not sure why the negative pattern at the beginning is not causing preg_replace to return the original $content as expected. Am I missing something obvious?
I ended up trying DOM as suggested in above comments. This is what works for me:
<?php
$content = <<< EOL
<p>something here</p>
<iframe width="500" height="281" src="https://www.youtube.com/embed/uuZE_IRwLNI" frameborder="0" allowfullscreen></iframe>
<p><img src="test.jpg" /></p>
EOL;
$doc = new DOMDocument();
$doc->loadHTML( "<div>" . $content . "</div>" );
// remove <!DOCTYPE and html and body tags that loadHTML adds:
$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);
while ($doc->firstChild) {
$doc->removeChild($doc->firstChild);
}
while ($container->firstChild ) {
$doc->appendChild($container->firstChild);
}
// get all iframes and see if we need to wrap them in our aoa_wrap class:
$nodes = $doc->getElementsByTagName( 'iframe' );
foreach ( $nodes as $node ) {
$parent = $node->parentNode;
// skip if already wrapped in div class 'aoa_wrap'
if ( isset( $parent->tagName ) && 'div' == $parent->tagName && 'aoa_wrap' == $parent->getAttribute( 'class' ) ) {
continue;
}
// create new element for class "aoa_wrap"
$wrap = $doc->createElement( "div" );
$wrap->setAttribute( "class", "aoa_wrap" );
// clone the iframe node as child
$wrap->appendChild( $node->cloneNode( true ) );
// replace original iframe node with new div class wrapper node
$parent->replaceChild( $wrap, $node );
}
echo $doc->saveHTML();
?>
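The "already wrapped" check can also be moved into the XPath expression, so that only unwrapped iframes are selected in the first place. This is a sketch rather than a drop-in replacement: the function name `wrap_iframes` is hypothetical, the class name is assumed to contain no single quote, and like the code above it only recognizes a parent whose class attribute is exactly `aoa_wrap`.

```php
<?php
// Hypothetical helper: wraps every <iframe> that is not already a direct
// child of <div class="$class"> and returns the modified fragment.
function wrap_iframes(string $html, string $class = 'aoa_wrap'): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a container div so its children are easy to find again.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root  = $doc->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($doc);

    // The query result is a snapshot, so the tree can be modified while looping.
    $query = ".//iframe[not(parent::div[@class='$class'])]";
    foreach ($xpath->query($query, $root) as $iframe) {
        $wrap = $doc->createElement('div');
        $wrap->setAttribute('class', $class);
        $iframe->parentNode->replaceChild($wrap, $iframe); // put the wrapper in place
        $wrap->appendChild($iframe);                       // move the iframe into it
    }

    // Serialize the children of the container, leaving out the container itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$once  = wrap_iframes('<p>something here</p><iframe src="https://www.youtube.com/embed/uuZE_IRwLNI"></iframe>');
$twice = wrap_iframes($once); // running it again should not add a second wrapper
echo $twice;
```

Because the iframe node is moved rather than cloned, any references held elsewhere to the original node stay valid.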
I'm looking for a way to transform this:
...[inner content]...
Into this:
...[inner content]...
The context has multiple links a with other showinfo:[integer] values. (I can process those ones)
Thanks for any help,
Bálint
Edit: Thanks to Kaiser's answer, here is the working snippet:
$html = $a;
$dom = new \DOMDocument;
@$dom->loadHTML( $html ); // Cannot guarantee all-valid input, so parser warnings are suppressed
foreach ($dom->getElementsByTagName('a') as $tag) {
// Fixed the strstr() argument order and added a !== false check, because the string starts with the substring
if ($tag->hasAttribute('href') && strstr($tag->getAttribute('href'), 'showinfo:3875') !== false) {
$tag->setAttribute( 'href', "http://somelink.com/{$tag->textContent}");
// Assign the Converted HTML, prevents failing when saving
$html = $tag;
}
}
return $dom->saveHTML( $dom);
}
You can use DOMDocument for a pretty reliable and fast way to handle DOM nodes and their attributes, etc. Hint: Much faster and more reliable than (most) Regex.
// Your original HTML
$html = '[inner content]';
$dom = new \DOMDocument;
$dom->loadHTML( $html );
Now that you have your DOM ready, you can use either the DOMDocument methods or DOMXPath to search through it and obtain your target element.
Example with XPath:
$xpath = new DOMXpath( $dom );
// Alter the query to your needs
$el = $xpath->query( "//a[contains(@href, 'showinfo:')]" );
or, for example, by tag name with the DOMDocument methods:
// Check what we got so we have something to compare
var_dump( 'BEFORE', $html );
foreach ( $dom->getElementsByTagName( 'a' ) as $tag )
{
if (
$tag->hasAttribute( 'href' )
and stristr( $tag->getAttribute( 'href' ), 'showinfo:3875' )
)
{
$tag->setAttribute( 'href', "http://somelink.com/{$tag->textContent}" );
// Assign the Converted HTML, prevents failing when saving
$html = $tag;
}
}
// Now Save Our Converted HTML;
$html = $dom->saveHTML( $html);
// Check if it worked:
var_dump( 'AFTER', $html );
It's as easy as that.
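Put together as a function, the XPath variant might look like the sketch below. The original markup is not shown in the question, so the sample input here is made up, and the function name `rewrite_showinfo_links` and its parameters are hypothetical. The prefix is assumed to contain no single quote.

```php
<?php
// Hypothetical helper: for every <a> whose href starts with $prefix, replaces
// the href with $base followed by the link text.
function rewrite_showinfo_links(string $html, string $prefix, string $base): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // A container div keeps top-level text and inline elements together.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root  = $dom->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($dom);

    // Assumption: $prefix contains no single quote.
    foreach ($xpath->query(".//a[starts-with(@href, '$prefix')]", $root) as $a) {
        $a->setAttribute('href', $base . $a->textContent);
    }

    // Return only the fragment, without the container or the <html>/<body> wrapper.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Made-up sample input; the real markup from the question is not available.
$result = rewrite_showinfo_links(
    '<a href="showinfo:3875">Foo</a> <a href="other.html">Bar</a>',
    'showinfo:3875',
    'http://somelink.com/'
);
echo $result;
```

Links whose href does not start with the prefix are left untouched, which matters when the page contains other `showinfo:` values.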
I'm looking for an elegant way to remove an element with a certain class together with everything inside it, i.e. to remove all the HTML in the sometestclass div using PHP.
The function below works somewhat, but not well: it also removes some parts of the page that I don't want removed.
It is based on an original post (linked below):
$html = "<p>Hello World</p>
<div class='sometestclass'>
<img src='foo.png'/>
<div>Bar</div>
</div>";
$clean = removeDiv ($html,'sometestclass');
echo $clean;
function removeDiv ($html,$removeClass){
$dom = new DOMDocument;
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
$removeString = ".//div[@class='$removeClass']";
$pDivs = $xpath->query($removeString);
foreach ( $pDivs as $div ) {
$div->parentNode->removeChild( $div );
}
$output = preg_replace( "/.*<body>(.*)<\/body>.*/s", "$1", $dom->saveHTML() );
return $output;
}
Does anyone have any suggestions to improve the results of this?
The original post is here
You are not quoting the class name:
$removeString = ".//div[@class=$removeClass]";
should be:
$removeString = ".//div[@class='$removeClass']";
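Another thing that can affect the results: `@class='...'` only matches when the attribute is exactly that string, so a div with `class="box sometestclass"` is skipped. A common XPath 1.0 idiom matches a single class token instead. The sketch below assumes the class name contains no single quote; the function name `remove_divs_with_class_token` and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: removes every <div> that has $class as one of its
// space-separated class tokens.
function remove_divs_with_class_token(string $html, string $class): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Pad the normalized class list with spaces and look for " token ".
    $query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    foreach ($xpath->query($query) as $div) {
        $div->parentNode->removeChild($div);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    $out  = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

$clean = remove_divs_with_class_token(
    '<p>Keep</p><div class="box sometestclass">Drop</div><div class="sometestclass-wide">Stay</div>',
    'sometestclass'
);
echo $clean;
```

The padding with spaces is what keeps `sometestclass-wide` from matching, which a plain `contains(@class, ...)` would not guarantee.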
I have the following html:
<html>
<body>
bla bla bla bla
<div id="myDiv">
more text
<div id="anotherDiv">
And even more text
</div>
</div>
bla bla bla
</body>
</html>
I want to remove everything starting from <div id="anotherDiv"> until its closing </div>. How do I do that?
With native DOM
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$xPath = new DOMXPath($dom);
$nodes = $xPath->query('//*[@id="anotherDiv"]');
if($nodes->item(0)) {
$nodes->item(0)->parentNode->removeChild($nodes->item(0));
}
echo $dom->saveHTML();
You can use preg_replace() like:
$string = preg_replace('/<div id="someid"[^>]+\>/i', "", $string);
Using the native XML Manipulation Library
Assuming that your html content is stored in the variable $html:
$html='<html>
<body>
bla bla bla bla
<div id="myDiv">
more text
<div id="anotherDiv">
And even more text
</div>
</div>
bla bla bla
</body>
</html>';
To delete the tag by ID use the following code:
$dom=new DOMDocument;
$dom->validateOnParse = false;
$dom->loadHTML( $html );
// get the tag
$div = $dom->getElementById('anotherDiv');
// delete the tag
if( $div && $div->nodeType==XML_ELEMENT_NODE ){
$div->parentNode->removeChild( $div );
}
echo $dom->saveHTML();
Note that certain versions of libxml require a doctype to be present in order to use the getElementById method.
In that case you can prepend $html with <!doctype>
$html = '<!doctype>' . $html;
Alternatively, as suggested by Gordon's answer, you can use DOMXPath to find the element using the xpath:
$dom=new DOMDocument;
$dom->validateOnParse = false;
$dom->loadHTML( $html );
$xp=new DOMXPath( $dom );
$col = $xp->query( '//div[ @id="anotherDiv" ]' );
if( !empty( $col ) ){
foreach( $col as $node ){
$node->parentNode->removeChild( $node );
}
}
echo $dom->saveHTML();
The first method works regardless of the tag. If you want to use the second method with the same id but a different tag, let's say form, simply replace //div in //div[ @id="anotherDiv" ] with //form.
strip_tags() function is what you are looking for.
http://us.php.net/manual/en/function.strip-tags.php
I wrote these to strip specific tags and attributes. Since they're regex they're not 100% guaranteed to work in all cases, but it was a fair tradeoff for me:
// Strips only the given tags in the given HTML string.
function strip_tags_blacklist($html, $tags) {
foreach ($tags as $tag) {
$regex = '#<\s*' . preg_quote($tag, '#') . '\b[^>]*>.*?<\s*/\s*' . preg_quote($tag, '#') . '\s*>#msi'; // \b keeps "b" from matching <body>
$html = preg_replace($regex, '', $html);
}
return $html;
}
// Strips the given attributes found in the given HTML string.
function strip_attributes($html, $atts) {
foreach ($atts as $att) {
$regex = '#\b' . $att . '\b(\s*=\s*[\'"][^\'"]*[\'"])?(?=[^<]*>)#msi';
$html = preg_replace($regex, '', $html);
}
return $html;
}
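For comparison, attributes can also be stripped with the DOM instead of a regex, which avoids having to guess where a tag ends. This is a sketch, not a replacement for the function above: the name `strip_attributes_dom` and the sample markup are hypothetical, and the result is the re-serialized `<body>` content, so the formatting can differ slightly from the input.

```php
<?php
// Hypothetical helper: removes the given attributes from every element in $html.
function strip_attributes_dom(string $html, array $atts): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('*') as $element) {
        foreach ($atts as $att) {
            $name = strtolower($att); // the HTML parser lowercases attribute names
            if ($element->hasAttribute($name)) {
                $element->removeAttribute($name);
            }
        }
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    $out  = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

$stripped = strip_attributes_dom(
    '<p><a href="x.html" onclick="evil()" style="color:red">Link</a></p>',
    ['onclick', 'style']
);
echo $stripped;
```

Text that merely looks like an attribute (for example `style=` inside a paragraph) is left alone, because only real attribute nodes are touched.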
how about this?
// Strips only the given tags in the given HTML string.
function strip_tags_blacklist($html, $tags) {
$html = preg_replace('/<'. $tags .'\b[^>]*>(.*?)<\/'. $tags .'>/is', "", $html);
return $html;
}
Following RafaSashi's answer using preg_replace(), here's a version that works for a single tag or an array of tags:
/**
* @param $str string
* @param $tags string | array
* @return string
*/
function strip_specific_tags ($str, $tags) {
if (!is_array($tags)) { $tags = array($tags); }
foreach ($tags as $tag) {
$_str = preg_replace('/<\/' . $tag . '>/i', '', $str);
if ($_str != $str) {
$str = preg_replace('/<' . $tag . '\b[^>]*>/i', '', $_str); // \b keeps "b" from matching <br> or <body>
}
}
return $str;
}