Changing code from regex - php

I have the following 2 sets of code (WordPress) using regex, but I was told that it's bad practice.
I am using it in 2 ways:
To take out the blockquote and images from the post and just display the text.
To essentially do the opposite and display just the images.
I'm looking to rewrite it in a proper, more acceptable/cross-browser form.
html (display text):
<?php
$content = preg_replace('/<blockquote>(.*?)<\/blockquote>/', '', get_the_content());
$content = preg_replace('/(<img [^>]*>)/', '', $content);
$content = wpautop($content); // Add paragraph-tags
$content = str_replace('<p></p>', '', $content); // remove empty paragraphs
echo $content;
?>
html (display images):
<?php
preg_match_all('/(<img [^>]*>)/', get_the_content(), $images);
for( $i=0; isset($images[1]) && $i < count($images[1]); $i++ ) {
if ($i === count($images[1]) - 1) { // last image; end() needs a real variable, not a function result
echo sprintf('<div id="last-img">%s</div>', $images[1][$i]);
continue;
}
echo $images[1][$i];
}
?>

You can use the answer from here: Strip Tags and everything in between
The point is to use a parser, rather than roll-your-own regex that might be buggy.
$content = get_the_content();
$content = wpautop($content);
$doc = new DOMDocument();
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); // parse the wpautop() result, not the raw content
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//blockquote') as $node) {
$node->parentNode->removeChild($node);
}
foreach ($xpath->query('//img') as $node) {
$node->parentNode->removeChild($node);
}
foreach( $xpath->query('//p[not(node())]') as $node ) {
$node->parentNode->removeChild($node);
}
$content = $doc->saveHTML($doc);
You may find that PHP's DOMDocument has wrapped your HTML fragment in <html> tags, in which case look at How to saveHTML of DOMDocument without HTML wrapper?
The part that removes empty p tags is from Remove empty tags from a XML with PHP
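The second half of the question (showing only the images) can be handled with the parser as well. Below is a minimal sketch, not tested inside WordPress: extract_images() and render_images() are hypothetical helper names, and the HTML string would come from get_the_content() in the real template.

```php
<?php
// Hypothetical helper: returns the serialized <img> tags found in an HTML fragment.
function extract_images(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // The XML declaration hints UTF-8 to libxml; @ silences warnings from sloppy markup.
    @$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    $images = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $images[] = $doc->saveHTML($img); // serialize just this node
    }
    return $images;
}

// Hypothetical helper: joins the images and wraps the last one in <div id="last-img">.
function render_images(string $html): string
{
    $images = extract_images($html);
    $last = count($images) - 1;
    $out = '';
    foreach ($images as $i => $tag) {
        $out .= ($i === $last) ? '<div id="last-img">' . $tag . '</div>' : $tag;
    }
    return $out;
}
```

In the template this would be used as echo render_images(get_the_content());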

Related

Removing images from paragraph tags

I have the following code which pulls out the blockquote and puts my WordPress post content in <p> tags.
<?php
$content = preg_replace('/<blockquote>(.*?)<\/blockquote>/', '', get_the_content());
$content = wpautop($content); // Add paragraph-tags
$content = str_replace('<p></p>', '', $content); // remove empty paragraphs
echo $content;
?>
However, it puts the images in <p> tags, which I don't want.
Here is some code that should do it (not tested).
<?php
$content = preg_replace('/<blockquote>(.*?)<\/blockquote>/', '', get_the_content());
$content = wpautop($content); // Add paragraph-tags
$content = str_replace('<p></p>', '', $content); // remove empty paragraphs
$content = preg_replace('/<p>\s*(<a .*>)?\s*(<img .* \/>)\s*(<\/a>)?\s*<\/p>/iU', '\1\2\3', $content); // remove paragraphs around img tags
echo $content;
?>
On the line after the str_replace you could use this domDocument method:
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // set before loading so it can take effect
$dom->loadHTML($content);
$images = $dom->getElementsByTagName('img');
$removeList = array();
foreach ($images as $domElement) {
$removeList[] = $domElement;
}
foreach ($removeList as $toRemove) {
$toRemove->parentNode->removeChild($toRemove);
}
$content = $dom->saveHTML();
(ps: this will also give you a non preg_replace method, not that it really matters)
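If the goal is to keep the images but drop only the <p> that wpautop() puts around them, the paragraph can be unwrapped instead of deleting the image. A rough sketch, assuming the fragment is wrapped in a throwaway <div id="fragment-root"> so it can be read back without the <html>/<body> wrapper; unwrap_image_paragraphs() and that id are hypothetical names.

```php
<?php
// Hypothetical helper: unwraps <p> elements whose only content is an image
// (optionally inside a link), leaving all other paragraphs untouched.
function unwrap_image_paragraphs(string $html): string
{
    $doc = new DOMDocument();
    // Wrap in a known container so the fragment can be serialized back on its own.
    @$doc->loadHTML('<?xml encoding="UTF-8"><div id="fragment-root">' . $html . '</div>');
    $xpath = new DOMXPath($doc);

    // Paragraphs that contain an <img> and no non-whitespace text.
    $query = '//p[.//img and not(normalize-space())]';
    // Copy to an array so changing the tree cannot disturb the iteration.
    foreach (iterator_to_array($xpath->query($query)) as $p) {
        while ($p->firstChild) {
            // Move each child in front of the paragraph, preserving order.
            $p->parentNode->insertBefore($p->firstChild, $p);
        }
        $p->parentNode->removeChild($p);
    }

    // Serialize only the children of the wrapper div.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

It would go after the wpautop() call, for example $content = unwrap_image_paragraphs($content);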

Wordpress - content (text) not showing up

I have the following code, with which I am trying to pull out just the text from my WordPress post and echo only that text content in a div. (I am removing the blockquotes, images, etc. from the post to be used elsewhere.)
<?php
$content = get_the_content();
$content = wpautop($content);
$doc = new DOMDocument();
$doc->loadHTML(get_the_content(), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//blockquote') as $node) {
$node->parentNode->removeChild($node);
}
foreach ($xpath->query('//img') as $node) {
$node->parentNode->removeChild($node);
}
foreach( $xpath->query('//p[not(node())]') as $node ) {
$node->parentNode->removeChild($node);
}
$content = $doc->saveHTML($doc);
?>
<div>
<?php echo $content ?>
</div>
However, the content doesn't appear.
I think you are doing too much work just to retrieve a post. Why not just use the_content(); inside the loop?
get_the_content() does not auto-embed videos or expand shortcodes, among other things, and I can see that you are loading it into an HTML document a second time.
Try this instead:
use the_content();
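If the blockquotes and images really do have to be stripped, two things in the question's snippet may be worth checking: the result of wpautop() is never passed to loadHTML(), and LIBXML_HTML_NOIMPLIED can rearrange a fragment that has more than one top-level element. The sketch below sidesteps both by wrapping the fragment in a single container. It is not a confirmed diagnosis of the blank output; text_only_content() and the fragment-root id are hypothetical names.

```php
<?php
// Hypothetical helper: removes <blockquote>, <img> and empty <p> elements
// from an HTML fragment and returns the remaining markup.
function text_only_content(string $html): string
{
    $doc = new DOMDocument();
    // The wrapper <div> gives the fragment a single root; the XML declaration hints UTF-8.
    @$doc->loadHTML('<?xml encoding="UTF-8"><div id="fragment-root">' . $html . '</div>');
    $xpath = new DOMXPath($doc);

    // Order matters: a <p> may only become empty once its image is gone.
    $queries = ['//blockquote', '//img', '//p[not(normalize-space())]'];
    foreach ($queries as $query) {
        foreach (iterator_to_array($xpath->query($query)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialize only the children of the wrapper div.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Inside the loop this would be called as echo text_only_content(wpautop(get_the_content()));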

Remove all occurrences of <br> before text

I'm trying to remove all <br> before my text.
So I have this:
<p>
<br/><br/>When the battle is on between contestants in a talent show, it gets really competitive when down to the last four. X-FactorUSAcontestant Marcus Canty knows this all too well as this is the stage he was voted off of the show earlier this year. <br/><br/>
</p>
I want to get rid of the first two <br/>, but I'd also want to get rid of them if there were more than 2.
I would prefer to use XPath as I'm already using it. At the moment I have this:
foreach($xpath->query('//br[not(preceding::text())]') as $node) {
$node->parentNode->removeChild($node);
}
For some reason on this particular page it doesn't seem to be working.
UPDATE
Originally the question was why there were <br> tags at the start of the document when my XPath should be getting rid of them (see below). I applied some regex to see if that worked, which revealed the doctype you see now. I thought the doctype was somehow causing my original problem, but it just wasn't being shown until now. This content is what I've imported from Blogger and am currently manipulating to fit a new blog.
link to example page
!DOCTYPE html PUBLIC “-//W3C//DTD HTML 4.0 Transitional//EN” “http://www.w3.org/TR/REC-html40/loose.dtd”><br><br>
Here's my code:
global $post;
$postTime = $post->post_date;
$postTime = strtotime($postTime);
$startDate = "2014/01/16";
if ($postTime < strtotime($startDate)) {
$html = mb_convert_encoding($content, 'HTML-ENTITIES', "UTF-8");
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//br[not(preceding::text())]') as $node) {
$node->parentNode->removeChild($node);
}
$nodes = $xpath->query('//a[string-length(.) = 0]');
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
$nodes = $xpath->query('//*[not(text() or node() or self::br)]');
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
remove_filter('the_content', 'wpautop');
$content = $doc->saveHTML();
$content = ltrim($content, '<br>');
$content = strip_tags($content, '<br> <a> <iframe>');
$content = preg_replace(array('/(<br\s*\/?>\s*){1,}/'), array('<br/><br/>'), $content);
$content = str_replace(' ', ' ', $content);
$content = "<p>".implode("</p>\n\n<p>", preg_split('/\n(?:\s*\n)+/', $content))."</p>";
return $content;
Help appreciated.
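What about ltrim?
$string = ltrim($string, '<br/>');
Be aware that ltrim() treats its second argument as a list of characters rather than a literal prefix, so this also strips a leading b or r from the text that follows.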
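To make the character-list behaviour of ltrim() concrete, here is a small sketch next to an anchored regex that removes only whole <br> tags from the start of a string. strip_leading_br() is a hypothetical helper name, and it assumes the string really begins with the tags (no enclosing <p> or doctype in front of them).

```php
<?php
// Hypothetical helper: strips any run of <br> tags (and the whitespace around them)
// from the very start of a string, leaving later <br> tags alone.
function strip_leading_br(string $html): string
{
    // ^ anchors at the start; <br>, <br/> and <br /> are all accepted, case-insensitively.
    return preg_replace('~^(?:\s*<br\s*/?>)+\s*~i', '', $html) ?? $html;
}

// ltrim() strips any leading characters found in the set {<, b, r, /, >}:
$trimmed = ltrim('<br/>brown fox', '<br/>');      // "own fox" (the "br" of "brown" is lost)

// The anchored regex removes only complete tags:
$clean = strip_leading_br('<br/><br />brown fox<br/>'); // "brown fox<br/>"
```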
You could try using a regex
s/!DOCTYPE html PUBLIC “-\/\/W3C\/\/DTD HTML 4.0 Transitional\/\/EN” “http:\/\/www.w3.org\/TR\/REC-html40\/loose.dtd”>((<br[^>]*/>)+)(.*)/\3/
or in PHP:
$pattern = '/^((<br[^>]*>)+)(.*)/i'; // no unescaped "/" inside the pattern; matches <br>, <br/> and <br />
$replacement = '$3';
$content = preg_replace($pattern, $replacement, $content);
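Staying with XPath, as the question prefers: in the sample markup there is a newline between <p> and the first <br/>, and that whitespace counts as a preceding text node, so not(preceding::text()) is false for those <br/> elements. A minimal sketch that ignores whitespace-only text nodes follows. Whether this is the whole story for the imported Blogger pages is an assumption, and remove_leading_breaks() and the fragment-root id are hypothetical names.

```php
<?php
// Hypothetical helper: removes every <br> that has no non-whitespace text before it.
function remove_leading_breaks(string $html): string
{
    $doc = new DOMDocument();
    // Wrap the fragment so it can be serialized back without <html>/<body>.
    @$doc->loadHTML('<?xml encoding="UTF-8"><div id="fragment-root">' . $html . '</div>');
    $xpath = new DOMXPath($doc);

    // The inner predicate keeps only text nodes that contain something besides whitespace.
    $query = '//br[not(preceding::text()[normalize-space()])]';
    foreach (iterator_to_array($xpath->query($query)) as $br) {
        $br->parentNode->removeChild($br);
    }

    // Serialize only the children of the wrapper div.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

The <br/> tags after the text are left in place, because real text precedes them.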

How to remove entire div with preg_replace

OK, as this is a WordPress problem and it sadly goes a little deeper, I need to remove every occurrence of a parent div together with everything inside it:
<div class="sometestclass">
<img ....>
<div>.....</div>
any other html tags
</div><!-- END: .sometestclass -->
The only idea I have is to match everything that starts with:
<div class="sometestclass">
and ends with:
<!-- END: .sometestclass -->
with all that is between (I can tag the end of parent div anyway I want, this is just a sample).
Anybody have an idea how to do it with:
<?php $content = preg_replace('?????','',$content); ?>
I wouldn't use a regular expression. Instead, I would use the DOMDocument class. Just find all of the div elements with that class, and remove them from their parent(s):
$html = "<p>Hello World</p>
<div class='sometestclass'>
<img src='foo.png'/>
<div>Bar</div>
</div>";
$dom = new DOMDocument;
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
$pDivs = $xpath->query(".//div[@class='sometestclass']");
foreach ( $pDivs as $div ) {
$div->parentNode->removeChild( $div );
}
echo preg_replace( "/.*<body>(.*)<\/body>.*/s", "$1", $dom->saveHTML() );
Which results in:
<p>Hello World</p>
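One caveat with [@class='sometestclass'] is that it only matches when the attribute is exactly that string. WordPress markup often carries several classes on one element, so a token-based match may be more robust. A sketch under that assumption; remove_by_class() and the fragment-root id are hypothetical names.

```php
<?php
// Hypothetical helper: removes every element whose class attribute contains
// $class as a whole, space-separated token.
function remove_by_class(string $html, string $class): string
{
    // Keep the XPath expression safe: allow only simple class names.
    if (!preg_match('/^[\w-]+$/', $class)) {
        throw new InvalidArgumentException('Unsupported class name');
    }

    $doc = new DOMDocument();
    @$doc->loadHTML('<?xml encoding="UTF-8"><div id="fragment-root">' . $html . '</div>');
    $xpath = new DOMXPath($doc);

    // Pad both sides with spaces so "sometestclass" does not match "sometestclass2".
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize only the children of the wrapper div.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because the whole element is removed, nested divs inside it go with it, which is the case the regex approach struggles with.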
<?php $content = preg_replace('/<div class="sometestclass">.*?<\/div><!-- END: .sometestclass -->/s','',$content); ?>
My RegEx is a bit rusty, but I think this should work. Do note that, as others have said, RegEx is not properly equipped to handle some of the complexities of HTML.
In addition, this pattern won't find embedded div elements with the class sometestclass. You would need recursion for that.
How about just some CSS .sometestclass{display: none;} ?
For the UTF-8 issue, I found a hack at the PHP-manual
So my function looks as follows:
function rem_fi_cat() {
/* This function removes images from _within_ the article.
* If these images are enclosed in a "wp-caption" div-tag.
* If the articles are post formatted as "image".
* Only on home-page, front-page and in category/archive-pages.
*/
if ( (is_home() || is_front_page() || is_category()) && has_post_format( 'image' ) ) {
$document = new DOMDocument();
$content = get_the_content( '', true );
if( '' != $content ) {
/* incl. UTF-8 "hack" as described at
* http://www.php.net/manual/en/domdocument.loadhtml.php#95251
*/
$document->loadHTML( '<?xml encoding="UTF-8">' . $content );
foreach ($document->childNodes as $item) {
if ($item->nodeType == XML_PI_NODE) {
$document->removeChild($item); // remove hack
$document->encoding = 'UTF-8'; // insert proper
}
}
$xpath = new DOMXPath( $document );
$pDivs = $xpath->query(".//div[@class='wp-caption']");
foreach ( $pDivs as $div ) {
$div->parentNode->removeChild( $div );
}
echo preg_replace( "/.*<div class=\"entry-container\">(.*)<\/div>.*/s", "$1", $document->saveHTML() );
}
}
}

PHP Manipulating HTML from string

I'm reading in an HTML string from a text editor and need to manipulate some of the elements before saving it to the DB.
What I have is something like this:
<h3>Some Text<img src="somelink.jpg" /></h3>
or
<h3><img src="somelink.jpg" />Some Text</h3>
and I need to put it into the following format
<h3>Some Text</h3><div class="img_wrapper"><img src="somelink.jpg" /></div>
This is the solution that I came up with.
$html = '<html><body>' . $field["data"][0] . '</body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$domNodeList = $dom->getElementsByTagName("img");
// Remove img tags from the h3 and place them after the h3 tag
foreach ($domNodeList as $domNode) {
if ($domNode->parentNode->nodeName == "h3") {
$parentNode = $domNode->parentNode;
$parentParentNode = $parentNode->parentNode;
$parentParentNode->insertBefore($domNode, $parentNode->nextSibling);
}
}
echo $dom->saveHtml();
You may be looking for a preg_replace
// take a search pattern, wrap the image tag matching parts in a tag
// and put the start and ending parts before the wrapped image tag.
// note: this will not match tags that contain > characters within them,
// and will only handle a single image tag
$output = preg_replace(
'|(<h3>[^<]*)(<img [^>]+>)([^<]*</h3>)|',
'$1$3<div class="img_wrapper">$2</div>',
$input
);
I updated the question with the answer, but for good measure, here it is again in the answers section.
$html = '<html><body>' . $field["data"][0] . '</body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$domNodeList = $dom->getElementsByTagName("img");
// Remove img tags from the h3 and place them after the h3 tag
foreach ($domNodeList as $domNode) {
if ($domNode->parentNode->nodeName == "h3") {
$parentNode = $domNode->parentNode;
$parentParentNode = $parentNode->parentNode;
$parentParentNode->insertBefore($domNode, $parentNode->nextSibling);
}
}
echo $dom->saveHtml();
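Note that the loop above moves the image out of the heading but does not create the <div class="img_wrapper"> that the target format asks for. A minimal sketch that adds the wrapper, assuming the editor HTML is passed in as a string; move_heading_images() and the fragment-root id are hypothetical names.

```php
<?php
// Hypothetical helper: moves each <img> that is a direct child of an <h3>
// into a new <div class="img_wrapper"> placed right after that heading.
function move_heading_images(string $html): string
{
    $doc = new DOMDocument();
    @$doc->loadHTML('<?xml encoding="UTF-8"><div id="fragment-root">' . $html . '</div>');
    $xpath = new DOMXPath($doc);

    foreach (iterator_to_array($xpath->query('//h3/img')) as $img) {
        $h3 = $img->parentNode;

        $wrapper = $doc->createElement('div');
        $wrapper->setAttribute('class', 'img_wrapper');
        $wrapper->appendChild($img); // appendChild moves the image out of the heading

        // insertBefore() with the next sibling places the wrapper directly after the <h3>;
        // a null sibling simply appends it to the parent.
        $h3->parentNode->insertBefore($wrapper, $h3->nextSibling);
    }

    // Serialize only the children of the wrapper div.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

It handles both orderings from the question (image before or after the text). If one heading holds several images, each gets its own wrapper, and they end up in reverse order after the heading.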
