Regex match if pattern is not after another pattern - PHP

I want to wrap an iframe object in a div class, but only if it isn't already wrapped in that div class. I'm trying to use a negative match pattern for that div class so preg_replace will not match and return the original $content. However it still matches:
<?php
$content = <<< EOL
<div class="aoa_wrap"><iframe width="500" height="281" src="https://www.youtube.com/embed/uuZE_IRwLNI" frameborder="0" allowfullscreen></iframe></div>
EOL;
$pattern = "~(?!<div(.*?)aoa_wrap(.*?)>)<iframe\b[^>]*>(?:(.*?)?</iframe>)?~";
$replace = '<div class="aoa_wrap">${0}</div>';
$content = preg_replace( $pattern, $replace, $content);
echo $content . "\n";
?>
Output (incorrect):
<div class="aoa_wrap"><div class="aoa_wrap"><iframe width="500" height="281" src="https://www.youtube.com/embed/uuZE_IRwLNI" frameborder="0" allowfullscreen></iframe></div></div>
I'm not sure why the negative pattern at the beginning is not causing preg_replace to return the original $content as expected. Am I missing something obvious?

I ended up trying DOM, as suggested in the comments above. This is what works for me:
<?php
$content = <<< EOL
<p>something here</p>
<iframe width="500" height="281" src="https://www.youtube.com/embed/uuZE_IRwLNI" frameborder="0" allowfullscreen></iframe>
<p><img src="test.jpg" /></p>
EOL;
$doc = new DOMDocument();
$doc->loadHTML( "<div>" . $content . "</div>" );
// remove <!DOCTYPE and html and body tags that loadHTML adds:
$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);
while ($doc->firstChild) {
$doc->removeChild($doc->firstChild);
}
while ($container->firstChild ) {
$doc->appendChild($container->firstChild);
}
// get all iframes and see if we need to wrap them in our aoa_wrap class:
$nodes = $doc->getElementsByTagName( 'iframe' );
foreach ( $nodes as $node ) {
$parent = $node->parentNode;
// skip if already wrapped in div class 'aoa_wrap'
if ( isset( $parent->tagName ) && 'div' == $parent->tagName && 'aoa_wrap' == $parent->getAttribute( 'class' ) ) {
continue;
}
// create new element for class "aoa_wrap"
$wrap = $doc->createElement( "div" );
$wrap->setAttribute( "class", "aoa_wrap" );
// clone the iframe node as child
$wrap->appendChild( $node->cloneNode( true ) );
// replace original iframe node with new div class wrapper node
$parent->replaceChild( $wrap, $node );
}
echo $doc->saveHTML();
?>
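
As to why the original pattern still matched: (?!...) is a negative lookahead. It is tested at the position where <iframe starts and looks forward, and the text at that position can never begin with <div, so the assertion always succeeds. A negative lookbehind, (?<!...), inspects the preceding text instead, but PCRE requires lookbehinds to have a fixed length. The sketch below therefore assumes the wrapper is always written exactly as <div class="aoa_wrap"> and sits directly before the iframe; the sample strings are made up for illustration. The DOM approach above stays the more robust option.

```php
<?php
// Negative lookbehind: match an <iframe> only when it is NOT immediately
// preceded by the literal text <div class="aoa_wrap">.
$pattern = '~(?<!<div class="aoa_wrap">)<iframe\b[^>]*>(?:.*?</iframe>)?~s';
$replace = '<div class="aoa_wrap">${0}</div>';

// Hypothetical inputs: one already wrapped, one not.
$wrapped = '<div class="aoa_wrap"><iframe src="https://www.youtube.com/embed/uuZE_IRwLNI"></iframe></div>';
$bare    = '<p>intro</p><iframe src="https://www.youtube.com/embed/uuZE_IRwLNI"></iframe>';

$wrappedOut = preg_replace($pattern, $replace, $wrapped); // left unchanged
$bareOut    = preg_replace($pattern, $replace, $bare);    // iframe gets wrapped

echo $wrappedOut, "\n", $bareOut, "\n";
```

Any variation in the wrapper markup (extra attributes, whitespace, a different quote style) defeats the fixed-length lookbehind, which is the main reason to prefer DOM here.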

Related

Remove entire div tag contents turn into a function

I'm looking for an elegant way to remove everything inside elements of a given class, i.e. to remove all the HTML in the sometestclass class using PHP.
The function below works to a degree, but not well: it also removes parts of the page that I don't want removed.
It is based on an original post (linked below):
$html = "<p>Hello World</p>
<div class='sometestclass'>
<img src='foo.png'/>
<div>Bar</div>
</div>";
$clean = removeDiv ($html,'sometestclass');
echo $clean;
function removeDiv ($html,$removeClass){
$dom = new DOMDocument;
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
$removeString = ".//div[@class='$removeClass']";
$pDivs = $xpath->query($removeString);
foreach ( $pDivs as $div ) {
$div->parentNode->removeChild( $div );
}
$output = preg_replace( "/.*<body>(.*)<\/body>.*/s", "$1", $dom->saveHTML() );
return $output;
}
Does anyone have any suggestions to improve the results of this?
The original post is here

You are not quoting the class name:
$removeString = ".//div[@class=$removeClass]";
should be:
$removeString = ".//div[@class='$removeClass']";
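
Two other things may explain unwanted removals or misses: [@class='...'] only matches when the attribute is exactly that string (a div with class="sometestclass extra" is skipped), and the greedy <body> regex is fragile. The following is a sketch of one possible refinement, not a drop-in fix. The function name remove_div_by_class and the wrapper id remove-div-root are hypothetical, and it assumes $class contains no quote characters.

```php
<?php
// Sketch: remove every <div> whose class list contains $class.
// Assumes $class contains no quotes (it is interpolated into the XPath).
function remove_div_by_class(string $html, string $class): string {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);            // silence parser warnings
    // Hypothetical wrapper so the fragment can be serialized without <html>/<body>.
    $dom->loadHTML('<div id="remove-div-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Token-based class match instead of an exact attribute comparison.
    $query = sprintf(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    // Copy the node list first: removing nodes while iterating a live list is unreliable.
    foreach (iterator_to_array($xpath->query($query)) as $div) {
        if ($div->parentNode) {
            $div->parentNode->removeChild($div);
        }
    }

    // Serialize only the children of the wrapper.
    $root = $xpath->query('//div[@id="remove-div-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html = "<p>Hello World</p>
<div class='sometestclass extra'>
<img src='foo.png'/>
<div>Bar</div>
</div>";
$clean = remove_div_by_class($html, 'sometestclass');
echo $clean;
```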

How to remove entire div with preg_replace

OK, as this is a WordPress problem and it sadly goes a little deeper, I need to remove every occurrence of a parent div together with its contents:
<div class="sometestclass">
<img ....>
<div>.....</div>
any other html tags
</div><!-- END: .sometestclass -->
The only idea I have is to match everything that starts with:
<div class="sometestclass">
and ends with:
<!-- END: .sometestclass -->
and everything in between (I can tag the end of the parent div any way I want; this is just a sample).
Anybody have an idea how to do it with:
<?php $content = preg_replace('?????','',$content); ?>

I wouldn't use a regular expression. Instead, I would use the DOMDocument class. Just find all of the div elements with that class, and remove them from their parent(s):
$html = "<p>Hello World</p>
<div class='sometestclass'>
<img src='foo.png'/>
<div>Bar</div>
</div>";
$dom = new DOMDocument;
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
$pDivs = $xpath->query(".//div[@class='sometestclass']");
foreach ( $pDivs as $div ) {
$div->parentNode->removeChild( $div );
}
echo preg_replace( "/.*<body>(.*)<\/body>.*/s", "$1", $dom->saveHTML() );
Which results in:
<p>Hello World</p>

<?php $content = preg_replace('/<div class="sometestclass">.*?<\/div><!-- END: .sometestclass -->/s','',$content); ?>
My RegEx is a bit rusty, but I think this should work. Do note that, as others have said, RegEx is not properly equipped to handle some of the complexities of HTML.
In addition, this pattern won't find embedded div elements with the class sometestclass. You would need recursion for that.
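
The recursion mentioned here could be sketched with PCRE's subpattern recursion, (?1), which lets the pattern balance nested <div> tags without needing the end comment. This is only an illustration: it assumes the opening tag is written exactly as <div class="sometestclass">, that every div is properly closed, and the sample markup is made up.

```php
<?php
// Group 1 consumes either a character that does not start a <div>/</div> tag,
// or a complete nested <div>...</div> (matched by recursing into group 1).
$pattern = '~<div class="sometestclass">((?:(?!</?div\b).|<div\b[^>]*>(?1)</div>)*)</div>~s';

// Hypothetical sample with a nested div inside the target div.
$content = '<p>Hello World</p><div class="sometestclass"><img src="foo.png"><div>Bar</div></div><p>after</p>';

$result = preg_replace($pattern, '', $content);
echo $result, "\n";
```

On large or malformed documents a pattern like this can hit PCRE's backtracking limits, so the DOMDocument approach is still the safer choice.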

How about just some CSS .sometestclass{display: none;} ?

For the UTF-8 issue, I found a hack in the PHP manual.
So my function looks as follows:
function rem_fi_cat() {
/* This function removes images from _within_ the article.
* If these images are enclosed in a "wp-caption" div-tag.
* If the articles are post formatted as "image".
* Only on the home page, front page and in category/archive pages.
*/
if ( (is_home() || is_front_page() || is_category()) && has_post_format( 'image' ) ) {
$document = new DOMDocument();
$content = get_the_content( '', true );
if( '' != $content ) {
/* incl. UTF-8 "hack" as described at
* http://www.php.net/manual/en/domdocument.loadhtml.php#95251
*/
$document->loadHTML( '<?xml encoding="UTF-8">' . $content );
foreach ($document->childNodes as $item) {
if ($item->nodeType == XML_PI_NODE) {
$document->removeChild($item); // remove hack
$document->encoding = 'UTF-8'; // insert proper
}
}
$xpath = new DOMXPath( $document );
$pDivs = $xpath->query(".//div[@class='wp-caption']");
foreach ( $pDivs as $div ) {
$div->parentNode->removeChild( $div );
}
echo preg_replace( "/.*<div class=\"entry-container\">(.*)<\/div>.*/s", "$1", $document->saveHTML() );
}
}
}

Regex to replace html src attribute in PHP

I'm trying to use regex to replace source attribute (could be image or any tag) in PHP.
I've a string like this:
$string2 = "<html><body><img src = 'images/test.jpg' /><img src = 'http://test.com/images/test3.jpg'/><video controls="controls" src='../videos/movie.ogg'></video></body></html>";
And I would like to turn it into:
$string2 = "<html><body><img src = 'test.jpg' /><img src = 'test3.jpg'/><video controls="controls" src='movie.ogg'></video></body></html>";
Here's what I tried:
$string2 = preg_replace("/src=["']([/])(.*)?["'] /", "'src=' . convert_url('$1') . ')'" , $string2);
echo htmlentities ($string2);
Basically, it didn't change anything and gave me a warning about an unescaped string.
Doesn't $1 pass the matched content? What is wrong here?
The convert_url function comes from an example I posted here before:
function convert_url($url)
{
if (preg_match('#^https?://#', $url)) {
$url = parse_url($url, PHP_URL_PATH);
}
return basename($url);
}
It's supposed to strip out url paths and just return the filename.

Don't use regular expressions on HTML - use the DOMDocument class.
$html = "<html>
<body>
<img src='images/test.jpg' />
<img src='http://test.com/images/test3.jpg'/>
<video controls='controls' src='../videos/movie.ogg'></video>
</body>
</html>";
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
libxml_clear_errors();
$doc = $dom->getElementsByTagName("html")->item(0);
$src = $xpath->query(".//@src");
foreach ( $src as $s ) {
// basename() keeps only the file name; array_pop( explode() ) would raise a by-reference notice
$s->nodeValue = basename( $s->nodeValue );
}
$output = $dom->saveXML( $doc );
echo $output;
Which outputs the following:
<html>
<body>
<img src="test.jpg">
<img src="test3.jpg">
<video controls="controls" src="movie.ogg"></video>
</body>
</html>

You have to use the e modifier.
$string = "<html><body><img src='images/test.jpg' /><img src='http://test.com/images/test3.jpg'/><video controls=\"controls\" src='../videos/movie.ogg'></video></body></html>";
$string2 = preg_replace("~src=[']([^']+)[']~e", '"src=\'" . convert_url("$1") . "\'"', $string);
Note that when using the e modifier, the replacement script fragment needs to be a string to prevent it from being interpreted before the call to preg_replace.
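
Note that the e modifier was deprecated in PHP 5.5.0 and removed in PHP 7.0.0, so on current PHP versions the call above no longer works. The same idea can be expressed with preg_replace_callback(). The sketch below reuses convert_url() from the question; the pattern (accepting either quote style and optional spaces around =) is an assumption about the input, not something taken from the original answer.

```php
<?php
// convert_url() as given in the question: reduce a URL or path to its file name.
function convert_url($url)
{
    if (preg_match('#^https?://#', $url)) {
        $url = parse_url($url, PHP_URL_PATH);
    }
    return basename($url);
}

$string = "<html><body><img src='images/test.jpg' /><img src='http://test.com/images/test3.jpg'/><video controls=\"controls\" src='../videos/movie.ogg'></video></body></html>";

// $m[1] is the quote character, $m[2] the attribute value.
$string2 = preg_replace_callback(
    '~src\s*=\s*([\'"])(.*?)\1~',
    function (array $m) {
        return 'src=' . $m[1] . convert_url($m[2]) . $m[1];
    },
    $string
);

echo $string2, "\n";
```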

function replace_img_src($img_tag) {
$doc = new DOMDocument();
$doc->loadHTML($img_tag);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
$old_src = $tag->getAttribute('src');
$new_src_url = 'website.com/assets/'.$old_src;
$tag->setAttribute('src', $new_src_url);
}
return $doc->saveHTML();
}

How to strip a tag and all of its inner html using the tag's id?

I have the following html:
<html>
<body>
bla bla bla bla
<div id="myDiv">
more text
<div id="anotherDiv">
And even more text
</div>
</div>
bla bla bla
</body>
</html>
I want to remove everything starting from <div id="anotherDiv"> until its closing </div>. How do I do that?

With native DOM
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$xPath = new DOMXPath($dom);
$nodes = $xPath->query('//*[@id="anotherDiv"]');
if($nodes->item(0)) {
$nodes->item(0)->parentNode->removeChild($nodes->item(0));
}
echo $dom->saveHTML();

You can use preg_replace() like:
$string = preg_replace('/<div id="someid"[^>]+\>/i', "", $string);

Using the native XML Manipulation Library
Assuming that your html content is stored in the variable $html:
$html='<html>
<body>
bla bla bla bla
<div id="myDiv">
more text
<div id="anotherDiv">
And even more text
</div>
</div>
bla bla bla
</body>
</html>';
To delete the tag by ID use the following code:
$dom=new DOMDocument;
$dom->validateOnParse = false;
$dom->loadHTML( $html );
// get the tag
$div = $dom->getElementById('anotherDiv');
// delete the tag
if( $div && $div->nodeType==XML_ELEMENT_NODE ){
$div->parentNode->removeChild( $div );
}
echo $dom->saveHTML();
Note that certain versions of libxml require a doctype to be present in order to use the getElementById method.
In that case you can prepend $html with <!doctype>
$html = '<!doctype>' . $html;
Alternatively, as suggested by Gordon's answer, you can use DOMXPath to find the element using the xpath:
$dom=new DOMDocument;
$dom->validateOnParse = false;
$dom->loadHTML( $html );
$xp=new DOMXPath( $dom );
$col = $xp->query( '//div[ @id="anotherDiv" ]' );
if( !empty( $col ) ){
foreach( $col as $node ){
$node->parentNode->removeChild( $node );
}
}
echo $dom->saveHTML();
The first method works regardless of the tag. If you want to use the second method with the same id but a different tag, say form, simply replace //div in //div[ @id="anotherDiv" ] with //form.

strip_tags() function is what you are looking for.
http://us.php.net/manual/en/function.strip-tags.php

I wrote these to strip specific tags and attributes. Since they're regex they're not 100% guaranteed to work in all cases, but it was a fair tradeoff for me:
// Strips only the given tags in the given HTML string.
function strip_tags_blacklist($html, $tags) {
foreach ($tags as $tag) {
$regex = '#<\s*' . $tag . '[^>]*>.*?<\s*/\s*'. $tag . '>#msi';
$html = preg_replace($regex, '', $html);
}
return $html;
}
// Strips the given attributes found in the given HTML string.
function strip_attributes($html, $atts) {
foreach ($atts as $att) {
$regex = '#\b' . $att . '\b(\s*=\s*[\'"][^\'"]*[\'"])?(?=[^<]*>)#msi';
$html = preg_replace($regex, '', $html);
}
return $html;
}
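
One caveat with strip_tags_blacklist() as written: the tag name is followed directly by [^>]*, so stripping 'b' also matches the start of <blockquote> or <body>. A possible stricter variant is sketched below; the name strip_tags_blacklist_strict and the sample input are hypothetical. It adds a word boundary after the tag name and escapes the name with preg_quote().

```php
<?php
// Sketch: like strip_tags_blacklist(), but the tag name must end at a word boundary,
// so 'b' no longer matches <blockquote> or <body>.
function strip_tags_blacklist_strict(string $html, array $tags): string {
    foreach ($tags as $tag) {
        $name  = preg_quote($tag, '#');   // escape regex metacharacters in the name
        $regex = '#<\s*' . $name . '\b[^>]*>.*?<\s*/\s*' . $name . '\s*>#si';
        $html  = preg_replace($regex, '', $html);
    }
    return $html;
}

// Hypothetical input: only the <b> element should be removed.
$out = strip_tags_blacklist_strict('<blockquote>keep</blockquote><b>drop</b>', ['b']);
echo $out, "\n";
```

Like the original, this does not handle nested elements of the same tag, so it remains a tradeoff rather than a general solution.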

how about this?
// Strips only the given tags in the given HTML string.
function strip_tags_blacklist($html, $tags) {
$html = preg_replace('/<'. $tags .'\b[^>]*>(.*?)<\/'. $tags .'>/is', "", $html);
return $html;
}

Following RafaSashi's answer using preg_replace(), here's a version that works for a single tag or an array of tags:
/**
* @param $str string
* @param $tags string | array
* @return string
*/
function strip_specific_tags ($str, $tags) {
if (!is_array($tags)) { $tags = array($tags); }
foreach ($tags as $tag) {
$_str = preg_replace('/<\/' . $tag . '>/i', '', $str);
if ($_str != $str) {
$str = preg_replace('/<' . $tag . '[^>]*>/i', '', $_str);
}
}
return $str;
}

Convert clickable anchor tags to plain text in html document

I am trying to match <a> tags within my content and replace them with the link text followed by the url in square brackets for a print-version.
The following example works if there is only the "href". If the <a> contains another attribute, it matches too much and doesn't return the desired result.
How can I match the URL and the link text and that's it?
Here is my code:
<?php
$content = '<a href="http://www.website.com">This is a text link </a>';
$result = preg_replace('/<a href="(http:\/\/[A-Za-z0-9\\.:\/]{1,})">([\\s\\S]*?)<\/a>/',
'<strong>\\2</strong> [\\1]', $content);
echo $result;
?>
Desired result:
<strong>This is a text link </strong> [http://www.website.com]

You should be using DOM to parse HTML, not regular expressions...
Edit: Updated code to do simple regex parsing on the href attribute value.
Edit #2: Made the loop iterate in reverse so it can handle multiple replacements.
$content = '
<p>This is a text link</p>
bah
I wont change
';
$dom = new DOMDocument();
$dom->loadHTML($content);
$anchors = $dom->getElementsByTagName('a');
$len = $anchors->length;
if ( $len > 0 ) {
$i = $len-1;
while ( $i > -1 ) {
$anchor = $anchors->item( $i );
if ( $anchor->hasAttribute('href') ) {
$href = $anchor->getAttribute('href');
$regex = '/^http/';
if ( !preg_match ( $regex, $href ) ) {
$i--;
continue;
}
$text = $anchor->nodeValue;
$textNode = $dom->createTextNode( $text );
$strong = $dom->createElement('strong');
$strong->appendChild( $textNode );
$anchor->parentNode->replaceChild( $strong, $anchor );
}
$i--;
}
}
echo $dom->saveHTML();
?>
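
The code above replaces the link with <strong> but does not append the URL in square brackets, which the question also asked for. A sketch of how that could be added with DOM follows; the sample markup (including the extra class and target attributes) is made up for illustration.

```php
<?php
// Hypothetical input: an anchor with attributes besides href.
$content = '<p>Visit <a class="ext" href="http://www.website.com" target="_blank">This is a text link</a> today.</p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // silence parser warnings
$dom->loadHTML($content);
libxml_clear_errors();

// Copy the live node list to an array so replacing nodes does not disturb iteration.
$anchors = iterator_to_array($dom->getElementsByTagName('a'));
foreach ($anchors as $anchor) {
    $href = $anchor->getAttribute('href');
    if (!preg_match('#^https?://#', $href)) {
        continue;                    // leave relative or non-http links alone
    }
    $strong = $dom->createElement('strong');
    $strong->appendChild($dom->createTextNode($anchor->textContent));
    $url = $dom->createTextNode(' [' . $href . ']');

    $parent = $anchor->parentNode;
    $parent->insertBefore($url, $anchor->nextSibling);  // URL goes right after the link
    $parent->replaceChild($strong, $anchor);            // then swap the link for <strong>
}

// Serialize just the paragraph rather than the whole generated document.
$result = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));
echo $result, "\n";
```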

You can make the match ungreedy using ?.
You should also take into account there may be attributes before the href attribute.
$result = preg_replace('/<a [^>]*?href="(http:\/\/[A-Za-z0-9\\.:\/]+?)">([\\s\\S]*?)<\/a>/',
'<strong>\\2</strong> [\\1]', $content);
