Replace HTML entities between divs and only between divs - php

Given the following string:
asd &nbsp; <div> def &nbsp; foo &nbsp; </div> ghi &nbsp; <div> moo &nbsp; </div>
I want to remove all of the &nbsp;s that are within <div>s, resulting in:
asd &nbsp; <div> def foo </div> ghi &nbsp; <div> moo </div>
I can use any standard PHP functions, but I'm not sure how to approach the problem. I couldn't figure out how to keep the contents inside the <div>s while removing the &nbsp;s.
The reason I need this is that WordPress's content filter adds &nbsp; in some strange situations. I can't simply remove every &nbsp;, because some might have been entered deliberately by the user, but I need to remove all of them within the element whose display problems they cause.

The following works in your case:
$str = "asd &nbsp; <div> def &nbsp; </div> ghi &nbsp; <div> moo &nbsp; </div>";
$res = preg_replace("%<div>(.*?)&nbsp;(.*?)</div>%", "<div>$1$2</div>", $str);
But beware of some facts:
It won't work if the divs have attributes;
It won't work as expected if the divs are nested;
It replaces only one &nbsp; per div, so any additional &nbsp;s inside the same div are left untouched.
So the replacement above is not a good general solution. It is much better to first find the div elements with an (XML/HTML) parser and then replace every &nbsp; inside them.
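The parser-based approach recommended above could look roughly like the following. This is a minimal sketch, not the answerer's code: the function name strip_nbsp_in_divs is hypothetical, the fragment is wrapped in a <body> purely so it can be serialized again, and it assumes the input is UTF-8 (or plain ASCII with entities).

```php
<?php
// Hypothetical helper: removes &nbsp; only from text that sits inside a <div>.
function strip_nbsp_in_divs(string $html): string
{
    $dom = new DOMDocument();
    // The XML declaration is a common trick to make libxml read the input as UTF-8;
    // @ silences warnings about sloppy markup.
    @$dom->loadHTML('<?xml encoding="UTF-8"><body>' . $html . '</body>');

    $xpath = new DOMXPath($dom);
    // All text nodes that have a <div> ancestor: attributes and nesting are no problem.
    foreach ($xpath->query('//div//text()') as $text) {
        // The parser has already decoded &nbsp; to the character U+00A0.
        $text->nodeValue = str_replace("\u{00A0}", '', $text->nodeValue);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    // Depending on the libxml version U+00A0 may be emitted raw; normalize it.
    return str_replace("\u{00A0}", '&nbsp;', $out);
}

echo strip_nbsp_in_divs('asd &nbsp; <div class="x"> def &nbsp; </div> ghi &nbsp; <div> moo &nbsp; </div>');
```

Unlike the single regex, this handles divs with attributes, nested divs, and any number of &nbsp;s per div, while the entities outside the divs are kept.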

$text = "asd &nbsp; <div> def &nbsp; </div> ghi &nbsp; <div> moo &nbsp; </div>";
echo preg_replace_callback(
    "#<div(.*?)>(.*?&nbsp;.*?)</div>#i",
    "filter_nbsp",
    $text);
function filter_nbsp($matches)
{
    return "<div".$matches[1].">" . str_replace("&nbsp;","",$matches[2]) . "</div>";
}
That should work for &nbsp; entities between div elements that are closed with </div>.
Output:
asd &nbsp; <div> def  </div> ghi &nbsp; <div> moo  </div>

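Using simple_html_dom:
include('simple_html_dom.php');
$html = str_get_html('asd &nbsp; <div> def &nbsp; </div> ghi &nbsp; <div> moo &nbsp; </div>');
foreach($html->find('div') as $element) {
    // plaintext drops any nested tags; only the text of the div is kept
    $a = $element->plaintext;
    $element->innertext = preg_replace('{&nbsp;}','',$a);
}
echo $html;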

Related

Select p tag after h2 that has a child with id

How can I select the p tags that follow an h2 tag that has a specific child? I'm using Symfony's crawler with CSS selectors:
http://symfony.com/doc/current/components/css_selector.html
Example:
<h2><span id="hello">Hi</span></h2>
<p>I want this p-tag, that is after the h2 above</p>
<p>me too!</p>
<a>Not me!</a>
<h2>lol</h2>
<p>yo, not me</p>
This does not work:
$crawler->filter('h2 span#hello + p')->each(function ($node) {
    var_dump($node->html());
});
It is usually best to traverse HTML using the DOMDocument class (http://php.net/manual/en/class.domdocument.php) but you could do it with a regular expression thus:
// put the example HTML code into a string
$html = <<< EOF
<h2><span id="hello">Hi</span></h2>
<p>I want this p-tag, that is after the h2 above</p>
<p>me too!</p>
<a>Not me!</a>
<h2>lol</h2>
<p>yo, not me</p>
EOF;
// set up a regular expression
$re = "/<h2[^>]*>.*?<span[^>]*id=\"hello\"[^>]*>.*?<\\/h2[^>]*>.*?(<p.*?)<[^\\/p]/sim";
// get the match ... the (<p.*?) group in the above regex
preg_match($re,$html,$matches);
print $matches[1];
Would output:
<p>I want this p-tag, that is after the h2 above</p>
<p>me too!</p>
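The DOMDocument route recommended in this answer can be sketched with an XPath query. This is only an illustration under the assumption that "after the h2" means every p whose nearest preceding h2 sibling contains the span with id="hello"; the variable names are made up.

```php
<?php
$html = <<<EOF
<h2><span id="hello">Hi</span></h2>
<p>I want this p-tag, that is after the h2 above</p>
<p>me too!</p>
<a>Not me!</a>
<h2>lol</h2>
<p>yo, not me</p>
EOF;

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ silences warnings on imperfect markup

$xpath = new DOMXPath($dom);
// p elements whose nearest preceding h2 sibling contains span#hello
$query = '//p[preceding-sibling::h2[1][span[@id="hello"]]]';

$found = [];
foreach ($xpath->query($query) as $p) {
    $found[] = $p->textContent;
}
print_r($found); // the two wanted paragraphs, but not "yo, not me"
```

The CSS selector in the question fails because `+` looks for a p that is a sibling of the span, while the p elements are siblings of the h2. If you stay with the Symfony crawler, the same expression can presumably be passed to its filterXPath() method instead of filter().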

Regular expression missed first occurrence of target string

I am using regular expression to fetch both text1 and text2 in the following html code. Here is what I am using:
/<div\s?class="right-col">[\s\n\S]*<a[\s\n]?[^>]*>#(.*)<\/a>/
but apparently I missed text1 and only got text2 (here is the link to my problem).
<div class="right-col">
<h1>
title1
</h1>
<p>some text here</p>
<div class="some-class">
<div class="left">
<span>some text here </span>
</div>
<div class="postmeta"><a href="url-link-here" >#text1</a> </div>
</div>
<div class="right-col">
<h1>
title2
</h1>
<p>some text here</p>
<div class="some-class">
<div class="left">
<span>some text here </span>
</div>
<div class="postmeta"><a href="url-link-here" >#text2</a> </div>
</div>
Can you guys tell me what went wrong in my regular expression? Is there a better way to capture both title1, title2 and text1, text2?
Using a regular expression here is not the best way to do it. It's bad practice. You should be using a DOM/XML parser to do this.
I like using PHP's DOMDocument class. Using XPath, we can quickly find the elements you want:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
// XPath attribute tests use @, not #
$aTags = $xPath->query('//div[@class="some-class"]//a[starts-with(text(), "#")]');
foreach($aTags as $a){
    echo $a->nodeValue;
}
DEMO: http://codepad.viper-7.com/QHOXzH
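This is a fairly common issue with regular expressions: quantifiers are greedy by default. [\s\S]* (the \n is not needed, because \s already matches newlines) does not stop at the first <a; it consumes as much as it can and only backtracks to the last <a that still lets the rest of the pattern match. Adding a ? makes the quantifier lazy, and with that change the pattern from your link returns both text1 and text2.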
The short answer is to replace [\s\n\S]* with [\s\S]*? but as others have mentioned, this is probably not a good use of regular expressions.
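To make the greedy versus lazy difference concrete, here is a small sketch on a shortened, made-up version of the markup (the two-div string and the variable names are assumptions for illustration only):

```php
<?php
// Shortened stand-in for the markup in the question.
$html = '<div class="right-col"><a href="u">#text1</a></div>'
      . '<div class="right-col"><a href="u">#text2</a></div>';

// Greedy: [\s\S]* runs to the end and backtracks to the LAST <a ...>.
$greedy = '/<div\s?class="right-col">[\s\S]*<a[^>]*>#(.*)<\/a>/';
// Lazy: [\s\S]*? stops at the FIRST <a ...> after each div.
$lazy = '/<div\s?class="right-col">[\s\S]*?<a[^>]*>#(.*?)<\/a>/';

preg_match_all($greedy, $html, $g);
preg_match_all($lazy, $html, $l);

print_r($g[1]); // only text2
print_r($l[1]); // text1 and text2
```

The greedy pattern produces a single match spanning both divs, which is why text1 disappears.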

PHP Regex - Remove text from HTML Tags

How to remove all text between tags.
Input
<div>
<p>testing</p>
<div>my world</div>
</div>
Output
<div>
<p></p>
<div></div>
</div>
You can use either DOMDocument or PHP Simple HTML DOM Parser.
The following example uses the latter, although you may want to use what suits you best.
include("simple_html_dom.php");
$str = '
<div>
<p>testing</p>
<div>my world</div>
</div>
';
$html = str_get_html($str);
foreach($html->find("text") as $ht) {
    $ht->innertext = "";
}
$html->save();
echo $html;
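For comparison, the DOMDocument alternative mentioned above might look like this. It is a sketch under assumptions: the helper name strip_text_keep_tags is hypothetical, the fragment is wrapped in a <body> only so that it can be serialized back, and whitespace-only text nodes are deliberately left alone.

```php
<?php
// Hypothetical helper: empties every non-whitespace text node, keeping the tags.
function strip_text_keep_tags(string $html): string
{
    $dom = new DOMDocument();
    // @ silences warnings on imperfect markup; the XML declaration hints UTF-8.
    @$dom->loadHTML('<?xml encoding="UTF-8"><body>' . $html . '</body>');

    $xpath = new DOMXPath($dom);
    // normalize-space() is empty for whitespace-only nodes, so those are skipped.
    foreach ($xpath->query('//text()[normalize-space()]') as $text) {
        $text->nodeValue = '';
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo strip_text_keep_tags("<div>\n<p>testing</p>\n<div>my world</div>\n</div>");
```

This needs no third-party include, at the cost of a little more boilerplate than the simple_html_dom version.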
You could use two capture groups which would eliminate characters between them while replacing:
(\<.+\>).*(\<\/.+\>)
working example: http://ideone.com/Oq14El

PHP preg_match_all + str_replace

I need to find a way to replace all the <p> within all the <blockquote> before the <hr />.
Here's a sample html:
<p>2012/01/03</p>
<blockquote>
<h4>File name</h4>
<p>Good Game</p>
</blockquote>
<blockquote><p>Laurie Ipsumam</p></blockquote>
<h4>Some title</h4>
<hr />
<p>Lorem Ipsum</p>
<blockquote><p>Laurel Ipsucandescent</p></blockquote>
Here's what I got:
$pieces = explode("<hr", $theHTML, 2);
$blocks = preg_match_all('/<blockquote>(.*?)<\/blockquote>/s', $pieces[0], $blockmatch);
if ($blocks) {
    $t1 = $blockmatch[1];
    for ($j = 0; $j < $blocks; $j++) {
        $paragraphs = preg_match_all('/<p>/', $t1[$j], $paragraphmatch);
        if ($paragraphs) {
            $t2 = $paragraphmatch[0];
            for ($k = 0; $k < $paragraphs; $k++) {
                // no backslashes needed: double quotes are literal inside single quotes
                $t1[$j] = str_replace($t2[$k], '<p class="whatever">', $t1[$j]);
            }
        }
    }
}
I think I'm really close, but I don't know how to put back together the html that I just pieced out and modified.
You could try SimpleXML or, better, DOMDocument (http://www.php.net/manual/en/class.domdocument.php). Make sure the markup is valid HTML first, then use the parser to find the nodes you are looking for and replace them. XPath (http://w3schools.com/xpath/xpath_syntax.asp) is well suited to finding them.
Edit 1:
Take a look at the answer of this question:
RegEx match open tags except XHTML self-contained tags
$string = explode('<hr', $string, 2); // limit 2: split only at the first <hr
$string[0] = preg_replace('/<blockquote>(.*)<p>(.*)<\/p>(.*)<\/blockquote>/sU', '<blockquote>\1<p class="whatever">\2</p>\3</blockquote>', $string[0]);
$string = $string[0] . '<hr' . $string[1];
output:
<p>2012/01/03</p>
<blockquote>
<h4>File name</h4>
<p class="whatever">Good Game</p>
</blockquote>
<blockquote><p class="whatever">Laurie Ipsumam</p></blockquote>
<h4>Some title</h4>
<hr />
<p>Lorem Ipsum</p>
<blockquote><p>Laurel Ipsucandescent</p></blockquote>
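The DOMDocument and XPath suggestion from the earlier answer could be sketched as follows. The helper name class_paragraphs_before_hr is hypothetical, and "before the <hr />" is interpreted here as "no hr element occurs earlier in the document", which matches the explode-based code for this sample.

```php
<?php
// Hypothetical helper: adds a class to every <p> inside a <blockquote>
// that appears before the first <hr>.
function class_paragraphs_before_hr(string $html, string $class): string
{
    $dom = new DOMDocument();
    // @ silences warnings on imperfect markup; the XML declaration hints UTF-8.
    @$dom->loadHTML('<?xml encoding="UTF-8"><body>' . $html . '</body>');

    $xpath = new DOMXPath($dom);
    // Blockquotes with no <hr> anywhere before them in document order.
    foreach ($xpath->query('//blockquote[not(preceding::hr)]//p') as $p) {
        $p->setAttribute('class', $class);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$theHTML = '<p>2012/01/03</p><blockquote><h4>File name</h4><p>Good Game</p></blockquote>'
         . '<blockquote><p>Laurie Ipsumam</p></blockquote><h4>Some title</h4><hr />'
         . '<p>Lorem Ipsum</p><blockquote><p>Laurel Ipsucandescent</p></blockquote>';
echo class_paragraphs_before_hr($theHTML, 'whatever');
```

Because the document is modified in place and serialized once, there is nothing to piece back together afterwards. Note that the serializer may normalize details such as writing <hr /> as <hr>.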

PHP: How to strip div tags and retain a href?

$str= <<<EOT
<div class="tb">
<div class="mg"></div>
Title1 title2 introduce.</div>
</div>
EOT;
How can I strip the div tags in PHP but retain the a href links? A plain strip_tags() call is not enough, because it removes the links as well.
I need something back:
title1
Title2title3 introduce.
This will give you the exact output you are looking for...
$str = strip_tags($str, '<a>');
The strip_tags() function allows you to pass in a list of allowed tags. So you allow the tags you want and voila!
http://us3.php.net/manual/en/function.strip-tags.php
Use strip_tags(), allowing the <a> tag to remain:
$str= <<<EOT
<div class="tb">
<div class="mg"></div>
Title1 title2 introduce.</div>
</div>
EOT;
// Strip out all tags except <a>
echo strip_tags($str, "<a>");
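As a quick check of what the allowed-tags argument does, here is a small sketch. The input string is hypothetical: the links in it are an assumption made for illustration, since the sample above does not show any.

```php
<?php
// Hypothetical input: the links are an assumption made for illustration.
$str = '<div class="tb"><div class="mg"><a href="/one">Title1</a></div>'
     . '<a href="/two">Title2</a> introduce.</div>';

// Every tag except <a> is removed; the text and the links stay.
$kept = strip_tags($str, '<a>');
echo $kept; // <a href="/one">Title1</a><a href="/two">Title2</a> introduce.
```

Keep in mind that strip_tags() leaves the attributes of allowed tags untouched, so it should not be relied on as a sanitizer for untrusted input.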
