preg_replace is replacing matches including content contained within - php

I am using preg_replace to replace HTML comment tags with empty space but it seems to be replacing the whole HTML comment with empty space.
echo preg_replace('/<!--(.*?)-->/','',$r->pageCont);
Where $r->pageCont is a database entry containing HTML, for example:
<div class="col-lg-12">
<p>The year is:</p>
<!-- <?php echo date('Y'); ?> -->
</div>
In the above example, I want only the HTML comment tags stripped away, leaving the PHP code that echoes the year. Instead, the entire HTML comment, including its content, is being stripped away.
Can someone recommend a pattern to use? Would appreciate your input.
EDIT: updated question to reflect the code I am using.

It seems like you're trying to replace each comment with the PHP code inside it. If so, use $1 as the replacement string; it refers to capture group 1, which holds the text between <!-- and -->.
echo preg_replace('/<!--(.*?)-->/', '$1', $r->pageCont);
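To see the difference between an empty replacement and $1, here is a minimal sketch. The $pageCont string is a stand-in for $r->pageCont, and date('Y') is quoted so the embedded snippet is valid PHP; both are assumptions for illustration. Note that the result is still only a string, so echoing it prints the PHP source text rather than running it.

```php
<?php
// Stand-in for $r->pageCont (an assumption for this example).
$pageCont = <<<'HTML'
<div class="col-lg-12">
<p>The year is:</p>
<!-- <?php echo date('Y'); ?> -->
</div>
HTML;

// Empty replacement: the whole comment, content included, is removed.
$stripped = preg_replace('/<!--(.*?)-->/', '', $pageCont);

// '$1' replacement: only the comment delimiters are removed,
// the captured content of group 1 is kept.
$unwrapped = preg_replace('/<!--(.*?)-->/', '$1', $pageCont);

echo $unwrapped;
```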

Related

RegEx replace not working in PHP

I've written a regular expression to get the first two paragraphs from a database clob which stores its content in HTML formatting.
I've checked it with two online RegEx builders/checkers, and both show it doing what I want (I've altered the RegEx slightly since then to handle the newline formatting, which I only found afterwards).
However when I go to use this in my PHP it doesn't seem to want to get just the group I'm after, and instead matches everything.
Here is my preg_replace line:
$description = preg_replace('/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/', "$2", $description);
And here is my testing content in the format of the content I am getting
<p>
Paragraph 1</p>
<p>
Paragraph 2</p>
<p>
Paragraph 3</p>
I've had a look at this SO Post which didn't help.
Any Ideas?
EDIT
As pointed out in one of the comments you cannot Regex HTML in PHP (Don't know why, I'm not really bothered by that).
Now I'm opening the option for getting it in PL/SQL as well.
select
DBMS_LOB.substr(description, 32000, 1) /* How do I make this into a regular expression? */
from
blog_posts
Your input contains newlines, therefore you have to add the s modifier:
/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/s
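Otherwise, . does not match newline characters, so the pattern fails to match at all and preg_replace returns the input unchanged, which looks as if it "matched everything".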
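A minimal sketch of the effect, using the test content from the question as the input string. The variable names $pattern, $withoutS and $withS are chosen here for illustration only.

```php
<?php
// Test content in the same shape as the question: each <p> spans two lines.
$description = "<p>\nParagraph 1</p>\n<p>\nParagraph 2</p>\n<p>\nParagraph 3</p>";

// The pattern from the question, without delimiters or modifiers.
$pattern = '(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)';

// Without the s modifier, "." cannot cross a newline, so nothing matches
// and preg_replace() returns the subject unchanged.
$withoutS = preg_replace('/' . $pattern . '/', '$2', $description);

// With the s modifier, "." also matches newlines and group 2 captures
// the first two paragraphs.
$withS = preg_replace('/' . $pattern . '/s', '$2', $description);

echo $withS;
```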
You could take a look at the PHP Simple DOM Parser. Going by their manual, you could do something like so:
$html = str_get_html('your html string');
// find('p') should return all the paragraph elements in your string.
foreach ($html->find('p') as $element) {
    echo $element->plaintext . '<br>';
}

PHP get html comments in string and wrap in <pre> tag. Regex or DOM?

I would like to find comment tags in a string that are NOT already inside a <pre> tag, and wrap them in a <pre> tag.
It seems like there's no way of 'finding' comments using the PHP DOM.
I'm using regex to do some of the processing already, however I am very unfamiliar with (have yet to grasp or truly understand) look aheads and look behinds in regex.
For instance I may have the following code;
<!-- Comment 1 -->
<pre>
<div class="some_html"></div>
<!-- Comment 2 -->
</pre>
I would like to wrap Comment 1 in <pre> tags, but obviously not Comment 2 as it already resides in a <pre>.
How would this usually be done in RegEx?
Here's kind of what I've understood about negative look arounds, and my attempt at one, I'm clearly doing something very wrong!
(?<!<pre>.*?)<!--.*-->(?!.*?</pre>)
You should really use a DOM parser if you are planning on re-using this code. Every regex approach will fail horribly sooner rather than later when presented with real-world HTML.
Having said that, here's what you could (but should not, see above) do:
First, identify comments, e.g. using
<!-- (?:(?!-->).)*-->
The negative look-ahead block ensures that the .* does not run out of the comment block.
Now, you need to figure out whether this comment is inside a <pre> block. The key observation is that a comment that is NOT inside a <pre> block is followed by an even number of <pre> and </pre> tags in total, because every opening tag after it is paired with a closing tag.
So, run through the rest of your text, always in pairs of <pre>s, and check if you arrive at the end.
This would look like
(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)
So, together this would be
<!-- (?:(?!-->).)*-->(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)
A hurray for write-only code =)
The prominent building block of this expression is (?:(?!</?pre>).) which matches every character that is not the starting bracket of a <pre> or </pre> sequence.
Allowing attributes on the <pre> and proper escaping are left as an exercise for the reader.
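For reference, a sketch of how the combined expression could be used from PHP. The ~ delimiter, the s modifier (so that . crosses newlines) and the helper variable $notPre are assumptions added here; they are not part of the expression as posted. The usual caveat about regex on real-world HTML still applies.

```php
<?php
$text = "<!-- Comment 1 -->\n<pre>\n<div class=\"some_html\"></div>\n<!-- Comment 2 -->\n</pre>";

// Building block: any run of characters that does not start a <pre> or </pre> tag.
$notPre = '(?:(?!</?pre>).)*';

// A comment, followed (lookahead) by an even number of <pre>/</pre> tags
// up to the end of the input.
$pattern = '~<!-- (?:(?!-->).)*-->(?='
    . $notPre . '(?:</?pre>' . $notPre . '</?pre>' . $notPre . ')*$)~s';

// $0 is the whole matched comment; wrap it in <pre> tags.
$wrapped = preg_replace($pattern, '<pre>$0</pre>', $text);

echo $wrapped;
```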
"It seems like there's no way of 'finding' comments using the PHP DOM."
Of course you can... Check this code using PHP Simple HTML DOM Parser:
<?php
// Assumes simple_html_dom.php (PHP Simple HTML DOM Parser) is available at this path.
require 'simple_html_dom.php';

$text = '<!-- Comment 1 -->
<pre>
<div class="some_html"></div>
<!-- Comment 2 -->
</pre>';
echo "<div>Original Text: <xmp>$text</xmp></div>";
$html = str_get_html($text);
$comments = $html->find('comment');
// if find() returned any comment nodes
if ($comments) {
    echo '<br>Find function found ' . count($comments) . ' results: ';
    foreach ($comments as $key => $com) {
        echo '<br>' . $key . ': ' . $com->tag . ' which contains = <xmp>' . $com->innertext . '</xmp>';
    }
} else {
    echo "Find() fails !";
}
?>
$com->innertext will give you the full comments, such as <!-- Comment 1 -->.
All that is left is to clean them up as you wish, for example using <!--\s*(.*)\s*-->.
Edit:
Just a note concerning the lookbehind: it MUST be fixed-width, so you cannot use the repetition quantifiers * and + or the optional quantifier ? inside it.
The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.
Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.
Source: http://www.regular-expressions.info/lookaround.html
XPath is your friend:
// $html is assumed to hold the markup to process.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//comment()[not(ancestor::pre)]') as $comment) {
    $pre = $doc->createElement("pre");
    $comment->parentNode->insertBefore($pre, $comment);
    $pre->appendChild($comment);
}
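A self-contained sketch of the XPath approach, for trying it out. The sample markup, the wrapper <div> with id "wrapper", and the variable names are assumptions for this example; the wrapper keeps the first comment inside the body rather than at document level.

```php
<?php
// Sample markup (an assumption): both comments sit inside a wrapper <div>.
$html = "<div id=\"wrapper\"><!-- Comment 1 --><pre>code\n<!-- Comment 2 --></pre></div>";

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Select every comment node that has no <pre> ancestor and wrap it in a new <pre>.
foreach ($xpath->query('//comment()[not(ancestor::pre)]') as $comment) {
    $pre = $doc->createElement('pre');
    $comment->parentNode->insertBefore($pre, $comment);
    $pre->appendChild($comment); // moves the comment into the new <pre>
}

// Print only the wrapper element, not the html/body added by loadHTML().
$wrapper = $xpath->query('//div[@id="wrapper"]')->item(0);
echo $doc->saveHTML($wrapper);
```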
It's quite easy using a principle called the stack counter: count the number of <pre> tags and the number of </pre> tags that occur before the position of your segment in the HTML code. If there are more <pre> than </pre>, the segment sits inside an open block ("<pre>..--you are here--..</pre>"). In that case, simply return the match unmodified.
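The counting idea could be sketched as follows. The function name wrapCommentsOutsidePre and the tag-counting patterns are hypothetical choices for this example, and passing PREG_OFFSET_CAPTURE to preg_replace_callback assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper: wraps comments in <pre> unless they are already inside one.
function wrapCommentsOutsidePre(string $html): string
{
    return preg_replace_callback(
        '/<!--.*?-->/s',
        function (array $m) use ($html): string {
            // With PREG_OFFSET_CAPTURE each match is [text, byte offset].
            [$comment, $offset] = $m[0];
            $before = substr($html, 0, $offset);

            // Count opening and closing <pre> tags that precede the comment.
            $opened = preg_match_all('/<pre\b[^>]*>/i', $before);
            $closed = preg_match_all('~</pre>~i', $before);

            // More opened than closed: we are inside a <pre>, leave the match as is.
            return $opened > $closed ? $comment : "<pre>$comment</pre>";
        },
        $html,
        -1,
        $count,
        PREG_OFFSET_CAPTURE
    );
}

$text = "<!-- Comment 1 -->\n<pre>\n<div class=\"some_html\"></div>\n<!-- Comment 2 -->\n</pre>";
$result = wrapCommentsOutsidePre($text);
echo $result;
```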

replace string pattern in HTML text with PHP

For my customer I wrote a custom web-based WYSIWYG HTML editor. It allows them to format basic HTML text and insert images. When they insert images I insert them with pattern like ##image1##. The produced HTML can be something like this:
<p>some text and some more text</p>
<p>some text and some <b>bold text</b></p>
<div>##image1##</div>
<p>more text can follow here</p>
<div>##image2##</div>
When outputting this HTML I search through it and replace the image placeholders ##image1##, ##image2## and so on with the HTML markup that actually displays the images. My replace code is here:
// first find all occurrences of image string
preg_match_all('|##(.+)##|', $inputHTML, $matches);
// then, for every match in $inputHTML:
$output = preg_replace('|##(.+)##|', $imageHTML, $inputHTML, 1 );
This works most of the time, but some variations of the input HTML produce a strange result. One piece of HTML that produces a strange result is:
<div>##image1##</div><p class="align-justify"><strong>Peter Dekleva</strong>, <strong>Damir Lisica</strong>, <strong>Anej Kočevar</strong> in <strong>Gregor Jakac</strong> so glasbeniki, ki v svoji glasbi združujejo silovite instrumentalne vložke, markantne melodije in močna besedila.</p><div>##image2##</div><p class="align-justify">Video dvojček skladbe Brez strahu torej prikazuje oblico sproščenih trenutkov iz zaodrja, veličasnih posnetkov s koncertnega dogajanja, priprav na nastope, nepredvidljive zaključke noči.</p>
If I edit that HTML and add a line break before <div>##image2##</div>, it is parsed correctly. Any idea what is happening here and why I have problems?
I am also open to suggestions for a better way of doing this. I can insert something other than ##image1## when inserting an image in my WYSIWYG editor. Thanks
This is because the + quantifier is greedy, so .+ matches everything up to the last instance of ## on the line. Try adding a ? after the + to make it lazy (ungreedy):
|##(.+?)##|
The reason a line break fixes the problem is that, by default, . does not match line breaks, so the greedy match cannot run past the end of the line. If you had used |##(.+)##|s instead, the line break would not have fixed the problem.
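A short sketch comparing the greedy and lazy patterns on a single-line input. The shortened $inputHTML and the /images/NAME.jpg URL scheme are made up for illustration; preg_replace_callback is one possible way to build each image tag from its placeholder name in a single pass.

```php
<?php
// Shortened single-line input with two placeholders (an assumption for this example).
$inputHTML = '<div>##image1##</div><p>some text</p><div>##image2##</div>';

// Greedy: one match that runs from the first ## to the last ##.
preg_match_all('|##(.+)##|', $inputHTML, $greedy);

// Lazy: one match per placeholder.
preg_match_all('|##(.+?)##|', $inputHTML, $lazy);

// Replace every placeholder in one pass; the URL scheme is hypothetical.
$output = preg_replace_callback(
    '|##(.+?)##|',
    function (array $m): string {
        return '<img src="/images/' . htmlspecialchars($m[1]) . '.jpg" alt="">';
    },
    $inputHTML
);

print_r($greedy[1]);
print_r($lazy[1]);
echo $output;
```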
Edit I just noticed that churk's answer to your previous question would have also worked correctly.
You should create the <img/> markup directly. But anyway, if you don't use # in your image names, use [^#] instead of . so the match cannot run across a ## delimiter.
Also, if you are not sure that ## won't be used elsewhere in the HTML, test for the <div> too:
<div>##([^#]+)##</div>

PHP (Regex) code to remove a Div with a specific ID

How can I get rid of these in a scraped page, using a PHP regex to go over the content and replace them with nothing?
<div id="news-id-245245" style="display:inline;">
I also do not have any other parts of the page where I use an ID for a div, so it will also work if a pattern simply removed all DIVs with an ID tag.
The id of the div is always as follows: "news-id-NUMBER"
Regarding your comment "I am not sure if this even makes sense": Since it works, it makes sense. I simplified your solution a bit; we don't need the parentheses around .*?. (We do need the ? unless there is only a single match in $content.)
$content = preg_replace('/<div id="news-id-.*?" style="display:inline;">/s', '', $content);
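Since the ID always has the form news-id-NUMBER, a slightly tighter variant could match digits only. This is a sketch, with a made-up $content string; like the pattern above, it removes only the opening tag and leaves any closing </div> in place.

```php
<?php
// Made-up sample content for illustration.
$content = '<p>before</p><div id="news-id-245245" style="display:inline;"><p>after</p>';

// \d+ only matches the numeric part of the ID, so the match cannot
// accidentally extend across other attributes or tags.
$content = preg_replace('/<div id="news-id-\d+" style="display:inline;">/', '', $content);

echo $content;
```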
I don't think RegEx is an option here; you'll have to target the element specifically, for example by hiding it with CSS:
<?php
echo "<style>#news-id-NUMBER{display:none;}</style>";
?>

Regular expression to match block of HTML

First I'll show you a sample of the code I'm working with:
<div class="entry">
<p>Any HTML content could go here!</p>
</div>
</div><!--/post -->
Normally I'd use a regex rule such as the following to look for a prefix and a suffix and grab everything in between:
(?<=<div class="entry">).*(?=</div><!--/post -->)
However, that doesn't appear to be working, as it seems to be pulling the white space in between the following parts instead of the HTML content itself:
<div class="entry">
<p>
Any help/suggestions would be much appreciated as I've been bashing my head with this one for a good few hours now.
Many thanks in advance.
Don't use Regex to parse HTML. You need an Xml Parser or similar.
Search Stackoverflow for the best one, like so: Robust and Mature HTML Parser for PHP
You can also consider php strip_tags().
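As one possible parser-based approach, here is a sketch using PHP's built-in DOMDocument and DOMXPath. The sample $page string (including the outer div with class "post") and the variable names are assumptions for this example.

```php
<?php
// Sample page fragment (an assumption), modelled on the code in the question.
$page = '<div class="post"><div class="entry">
<p>Any HTML content could go here!</p>
</div>
</div><!--/post -->';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($page);

// Find the first <div class="entry"> element.
$xpath = new DOMXPath($doc);
$entry = $xpath->query('//div[@class="entry"]')->item(0);

// Serialise its child nodes to get the inner HTML.
$inner = '';
if ($entry !== null) {
    foreach ($entry->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

echo trim($inner);
```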
