RegExp to strip HTML comments - php

I'm looking for a regular expression, or a sequence of matches and replacements (preferably in PHP, but any language will do), that turns the input below into the output below. The text at the start and end is arbitrary and must be preserved.
IN:
fkdshfks khh fdsfsk
<!--g1-->
<div class='codetop'>CODE: AutoIt</div>
<div class='geshimain'>
<!--eg1-->
<div class="autoit" style="font-family:monospace;">
<span class="kw3">msgbox</span>
</div>
<!--gc2-->
<!--bXNnYm94-->
<!--egc2-->
<!--g2-->
</div>
<!--eg2-->
fdsfdskh
to this OUT:
fkdshfks khh fdsfsk
<div class='codetop'>CODE: AutoIt</div>
<div class='geshimain'>
<div class="autoit" style="font-family:monospace;">
<span class="kw3">msgbox</span>
</div>
</div>
fdsfdskh
Thanks.

Are you just trying to remove the comments? How about
s/<!--[^>]*-->//g
or the slightly better (suggested by the questioner himself):
<!--(.*?)-->
But remember, HTML is not regular, so using regular expressions to parse it will lead you into a world of hurt when somebody throws bizarre edge cases at it.

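preg_replace('/<!--(.*)-->/Uis', '', $html);
This PHP code removes all HTML comments from the $html string. The U modifier makes .* ungreedy and the s modifier lets . match newlines, so multi-line comments are covered too.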
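As a minimal sketch of the lazy-match approach, the snippet below runs a non-greedy pattern with the s modifier over a shortened version of the input from the question. The optional \n? at the end is an addition of this sketch, not part of the answers above: it swallows the line break after each comment so that comment-only lines disappear completely, as in the expected OUT. The variable names are illustrative.

```php
<?php
// Shortened sample in the spirit of the question's input
$html = "fkdshfks khh fdsfsk\n<!--g1-->\n<div class='geshimain'>\n<!--\nmulti line\n-->\n</div>\nfdsfdskh";

// .*? is lazy, the s modifier lets . match newlines,
// and \n? also drops the line break that follows a comment
$clean = preg_replace('/<!--.*?-->\n?/s', '', $html);

echo $clean;
```

Without the \n? part, the comments are still removed, but each comment-only line leaves an empty line behind.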

A better version would be:
(?=<!--)([\s\S]*?)-->
It matches html comments like these:
<!--
multi line html comment
-->
or
<!-- single line html comment -->
and, most importantly, it matches comments like this one (patterns built on [^>]* stop at the first > and do not cover this situation):
<!-- this is my blog: <mynixworld.inf> -->
Note
Although the string below is syntactically an HTML comment, a browser might treat it specially (it is part of a conditional comment), so it can carry meaning. Stripping such strings might break your page.
<!--[if !(IE 8) ]><!-->

Do not forget to consider conditional comments, as
<!--(.*?)-->
will remove them. Try this instead:
<!--[^\[](.*?)-->
This will still remove downlevel-revealed conditional comments, though.
EDIT:
This version won't remove downlevel-revealed or downlevel-hidden conditional comments:
<!--(?!<!)[^\[>].*?-->

Ah I've done it,
<!--(.*?)-->

With the following pattern:
/( )*<!--((.*)|[^<]*|[^!]*|[^-]*|[^>]*)-->\n*/g
you can remove multi-line comments. It was checked against this test string:
fkdshfks khh fdsfsk
<!--g1-->
<div class='codetop'>CODE: AutoIt</div>
<div class='geshimain'>
<!--eg1-->
<div class="autoit" style="font-family:monospace;">
<span class="kw3">msgbox</span>
</div>
<!--gc2-->
<!--bXNnYm94-->
<!--egc2-->
<!--g2-->
</div>
<!--eg2-->
fdsfdskh
<!-- --
> test
- -->
<!-- --
<- test <
>
- -->
<!--
test !<
- <!--
-->
<script type="text/javascript">//<![CDATA[
var xxx = 'a';
//]]></script>
ok

Try the following if your comments contain line breaks:
/<!--(.|\n)*?-->/g

<!--([\s\S]*?)-->
This works in JavaScript and VBScript as well, because "." does not match line breaks in every language.

Here is my attempt:
<!--(?!<!)[^\[>][\s\S]*?-->
This will also remove multi line comments and won't remove downlevel-revealed or downlevel-hidden comments.
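To illustrate what this pattern keeps and what it removes, here is a small sketch in PHP. The sample markup is made up for the example: it contains one downlevel-hidden conditional comment and one ordinary comment.

```php
<?php
// Pattern from the answer above, wrapped in ~ delimiters for preg_replace
$pattern = '~<!--(?!<!)[^\[>][\s\S]*?-->~';

// Made-up sample: a conditional comment followed by a plain comment
$html = "<!--[if IE]><p>old IE</p><![endif]-->\n<!-- plain comment -->\n<p>kept</p>";

$clean = preg_replace($pattern, '', $html);

echo $clean;
```

The conditional comment survives because the character right after <!-- is [, which the class [^\[>] rejects, while the plain comment is removed.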

I know that this is quite an old post, but I felt that it would be useful to add to this post in case anyone wants an easy to implement PHP function that directly answers the original question.
/**
 * Strip all HTML comments from $text.
 *
 * @param string $text text to modify
 * @param string $new  replacement string
 * @return string|string[]|null
 */
function strip_html_comments($text, $new = '') {
    // [\s\S] matches any character, including newlines, so multi-line comments are covered
    return preg_replace('|<!--[\s\S]*?-->|', $new, $text);
}

This code also removes JavaScript code,
which is too bad :|
Here is an example of JavaScript that will be removed by this code:
<script type="text/javascript"><!--
var xxx = 'a';
//-->
</script>
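One possible way around this, sketched below, is to let the pattern match whole script blocks first and then throw those matches away with the PCRE backtracking verbs (*SKIP)(*FAIL), so that only comments outside of scripts are replaced. This is an assumption-laden sketch, not something proposed in the answers here, and it still treats HTML as text rather than parsing it.

```php
<?php
// Sample: a script wrapped in an old-style comment, followed by a real comment
$html = "<script type=\"text/javascript\"><!--\nvar xxx = 'a';\n//-->\n</script>\n<!-- a real comment -->\n<p>text</p>";

// First alternative: match a whole <script> block, then discard it and skip past it.
// Second alternative: an ordinary HTML comment, which gets removed.
$pattern = '~<script\b[^>]*>.*?</script>(*SKIP)(*FAIL)|<!--.*?-->~is';

$clean = preg_replace($pattern, '', $html);

echo $clean;
```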

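function remove_html_comments($html) {
    // Hand every comment to the callback, which decides whether to keep it
    return preg_replace_callback('/<!--[\s\S]*?-->/', 'rhc', $html);
}
function rhc($match) {
    $comment = $match[0];
    // Keep conditional comments such as <!--[if IE]> ... <![endif]-->
    if (stripos($comment, '[if') !== false || stripos($comment, '[endif') !== false) {
        return $comment;
    }
    // Drop every other comment
    return '';
}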

// Remove multiline comment
$mlcomment = '/\/\*(?!-)[\x00-\xff]*?\*\//';
$code = preg_replace ($mlcomment, "", $code);
// Remove single line comment
$slcomment = '/[^:]\/\/.*/';
$code = preg_replace ($slcomment, "", $code);
// Remove extra spaces
$extra_space = '/\s+/';
$code = preg_replace ($extra_space, " ", $code);
// Remove spaces that can be removed
$removable_space = '/\s?([\{\};\=\(\)\/\+\*-])\s?/';
$code = preg_replace ($removable_space, "\\1", $code);

If you just want the text, or the text with specific tags, you can handle this with PHP's strip_tags. It also deletes HTML comments, and you can keep the HTML tags you need like this (passing the allowed tags as an array requires PHP 7.4 or later):
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text, ['p', 'a']);
the output will be:
<p>Test paragraph.</p> Other text
I hope it helps somebody!

You can achieve this with modern JavaScript.
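function RemoveHtmlComments() {
    // Copy the live NodeList first so that removing nodes does not skip siblings
    const children = Array.from(document.body.childNodes);
    for (const child of children) {
        if (child.nodeType === Node.COMMENT_NODE) child.remove();
    }
}
It should be safer than RegEx. Note that it only removes comments that are direct children of <body>.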
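The same DOM idea can be sketched on the server side in PHP, which is what the question asked about. This is only a sketch: it assumes the dom extension is available, and loadHTML() normalizes the markup, so saveHTML() returns a full document (doctype, html and body wrappers) rather than the original fragment.

```php
<?php
// Made-up sample with a top-level and a nested comment
$html = '<div><!-- note --><p>Hello<!-- inner --> world</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// iterator_to_array() takes a snapshot, so nodes can be removed while looping
foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
    $comment->parentNode->removeChild($comment);
}

$clean = $doc->saveHTML();
echo $clean;
```

Unlike the loop over document.body.childNodes, the XPath query //comment() finds comments at any depth.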

Related

Is it possible to change original html text in php?

I am trying to make a "manner friendly" website. In our language the word ending (declension) depends on gender and other factors. For example:
You did = robili
It did = robilo
She did = robila
Linguistically this is a very simplified (and unlucky) example! I would like to change the HTML text in a PHP file where appropriate. For example:
<?php
something
?>
html text of the page and somewhere is the word "robil"
<div>we tried to robil^i|o|a^</div>
<?php something ?>
Now I would like to find all occurrences of tokens of the form ^characters|characters|characters^ and replace each one with one of its internal values according to "gender".
It is easy in JavaScript on the client side, but then the visitor sees all this weird "tokenizing" before JavaScript replaces it.
I do not know an elegant solution here.
Or do you have a better idea?
Thanks for any advice.
You can add these scripts before and after the HTML:
<?php
// start output buffering
ob_start();
?>
<html>
<body>
html text of the page and somewhere is the word "robil"
<div>we tried to robil^i|o|a^, but also vital^si|sa|ste^, borko^mal|mala|malo^ </div>
</body>
</html>
<?php
$use = 1; // indicate which declination to use (0,1 or 2)
// get buffered html
$html = ob_get_contents();
ob_end_clean();
// match anything between '^' that's not a control chr or '^', min 3 and max 20 chrs.
if (preg_match_all('/\^[^[:cntrl:]\^]{3,20}\^/',$html,$matches))
{
// replace all
foreach (array_unique($matches[0]) as $match)
{
$choices = explode('|',trim($match,'^'));
$html = str_replace($match,$choices[$use],$html);
}
}
echo $html;
This returns:
html text of the page and somewhere is the word "robil" we tried to
robilo, but also vitalsa, borkomala
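As a variation on the loop above, the replacement can also be done in a single pass with preg_replace_callback. The sketch below is an alternative, not the answerer's code: the function name apply_declension and the fallback to the first form are assumptions made for the example.

```php
<?php
// Replace every ^a|b|c^ token with the choice at index $use
function apply_declension(string $html, int $use): string
{
    // A token is ^choice|choice|...^ with at least two choices
    return preg_replace_callback(
        '/\^([^\^|]+(?:\|[^\^|]+)+)\^/',
        function (array $m) use ($use): string {
            $choices = explode('|', $m[1]);
            // Fall back to the first form if the requested index is missing
            return $choices[$use] ?? $choices[0];
        },
        $html
    );
}

echo apply_declension('we tried to robil^i|o|a^, but also vital^si|sa|ste^', 1);
// prints: we tried to robilo, but also vitalsa
```

The buffered page can be passed through it in the same place where the loop runs, for example echo apply_declension($html, $use);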

How to extract HTML element from a source file

I need to replace an HTML section, identified by a tag id, in source code that is a combination of HTML and PHP, and I need to do it using PHP. If it were pure HTML, a DOM parser could be used; if there were no DIV inside the DIV, I can imagine how to use preg_match. This is what I am trying to do. I have code (loaded into a string) like:
<div>
<img >
</div>
<? include(); ?>
<div id="mydiv">
<div>
<div>
<img >
</div>
</div>
</div>
and my task is to replace content of "mydiv" DIV with a new one e.g.
<div id="newdiv">
some text
</div>
so the string will look like this after the change:
<div>
<img >
</div>
<? include(); ?>
<div id="mydiv">
<div id="newdiv">
some text
</div>
</div>
I have already tried:
1) parsing the code using DOMdocument's loadHTML => it produces a lot of errors in case PHP code is included.
2) I played around a bit with regexes like preg_match_all('/<div id="myid"([^<]*)<\/div>/', $src, $matches), which fails in case more child divs are included.
The best approach I have found so far is:
1) find id="mydiv" string
2) search for '<' and '>' chars and count them like '<'=1 and '>'=-1 (not exactly, but it gives the idea)
3) once I get sum == 0 I should be on position of the closing tag, so I know, which portion string I should exchange
This is quite a "heavy" solution, and it can stop working when the code is different (e.g. when the on-page PHP code contains those characters as well, instead of just a simple "include"). So I am looking for a better solution.
You could try something like this:
$file = 'filename.php';
$content = file_get_contents($file);
$array_one = explode( '<div id="mydiv">' , $content );
$my_div_content = explode("</div>" , $array_one[1] )[0];
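Or use preg_match like you said (note that the lazy .*? stops at the first </div>, so this only works when "mydiv" contains no nested divs):
preg_match('/<div id="mydiv"(.*?)<\/div>/s', $content, $matches);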
Yes, there is. First you need to use a function that will get the content of the file. Let's call the file homepage.php:
$homepageString = file_get_contents('homepage.php');
Now you have a string with all the content. The next thing you would do is use the preg_replace() function to take out the part of code that you want to take out:
$newHomepageString = preg_replace('/id="mydiv"/',"", $homepageString);
Now you overwrite the existing homepage.php file with the new source code:
file_put_contents("homepage.php", $newHomepageString);
Let me know if it worked for you! :)
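The depth-counting idea described in the question can be sketched as follows. This is a hedged illustration, not a robust parser: the function name replace_div_content is made up, it assumes the opening tag is written exactly as <div id="..."> and it will miscount if the embedded PHP code or any attribute value itself contains <div or </div>.

```php
<?php
// Replace the inner content of <div id="$id"> by counting nested <div> tags
function replace_div_content(string $src, string $id, string $new): string
{
    $open = '<div id="' . $id . '">';
    $start = strpos($src, $open);
    if ($start === false) {
        return $src;                      // target div not found
    }
    $contentStart = $start + strlen($open);
    $pos = $contentStart;
    $depth = 1;                           // we are inside the target div
    while (preg_match('~<div\b|</div\s*>~i', $src, $m, PREG_OFFSET_CAPTURE, $pos)) {
        [$tag, $at] = $m[0];
        $depth += ($tag[1] === '/') ? -1 : 1;   // closing tag or opening tag
        $pos = $at + strlen($tag);
        if ($depth === 0) {
            // $at is where the matching </div> of the target starts
            return substr($src, 0, $contentStart) . $new . substr($src, $at);
        }
    }
    return $src;                          // unbalanced markup: leave untouched
}

$src = '<div><img ></div><? include(); ?><div id="mydiv"><div><div><img ></div></div></div>';
echo replace_div_content($src, 'mydiv', '<div id="newdiv">some text</div>');
```

Counting whole div tags instead of single < and > characters keeps other tags, such as <img > or the PHP include, from affecting the balance.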

PHP give tag id of its contents

I have a variable which consists of different HTML tags:
$html = '<h1>Title</h1><u>Header</u><h2>Sub Title</h2><p>content</p><u>Footer</u>';
I want to find all the u tags in the $html variable and give them the id of their contents.
It should return:
$html = '<h1>Title</h1><u id="header" >Header</u><h2>Sub Title</h2><p>content</p><u id="footer" >Footer</u>'
You can use preg_replace() if you want the fast way, or learn about DOMDocument if you want to do it the proper way.
$pattern = '~<u>([^<]*)</u>~Ui';
$replace = '<u id="$1">$1</u>';
$html = preg_replace($pattern, $replace, $html);
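You can use preg_replace_callback.
$html = preg_replace_callback('~<u>([^<]+)</u>~', function ($m) {
    return '<u id="' . strtolower($m[1]) . '" >' . $m[1] . '</u>';
}, $html);
The callback lets you run "strtolower" on the captured text while building the replacement. Older code used the e ("evaluate") modifier with preg_replace for this, but that modifier was deprecated in PHP 5.5 and removed in PHP 7.0.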
It would be good to do this with jQuery if that suits your needs; otherwise Forien's answer is good to go.
Here is how to do it in jQuery.
Your HTML:
<div id='specialString'>
<h1>Title</h1><u>Header</u><h2>Sub Title</h2><p>content</p><u>Footer</u>
</div>
Your JS:
<script type="text/javascript">
$('#specialString > u').each(function() {
    $(this).attr('id', $(this).text().toLowerCase());
});
</script>
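For the "proper way" mentioned above, a DOMDocument version might look like the sketch below. It assumes the dom extension and a libxml build that supports the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags; the temporary wrapper div is an addition of this sketch so that the fragment has a single root element.

```php
<?php
$html = '<h1>Title</h1><u>Header</u><h2>Sub Title</h2><p>content</p><u>Footer</u>';

$doc = new DOMDocument();
// Wrap the fragment in one root element; the flags suppress the implied
// <html>/<body> wrappers and the default doctype
$doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Give every <u> an id derived from its lower-cased text content
foreach ($doc->getElementsByTagName('u') as $u) {
    $u->setAttribute('id', strtolower(trim($u->textContent)));
}

// Serialize the children of the wrapper, not the wrapper itself
$result = '';
foreach ($doc->documentElement->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}

echo $result;
```

Text with spaces or special characters would need extra normalization before being used as an id; that step is left out here.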

convert DIV to SPAN using str_replace

I have some data that is provided to me as $data, an example of some of the data is...
<div class="widget_output">
<div id="test1">
Some Content
</div>
<ul>
<li>
<p>
<div>768hh</div>
<div>2308d</div>
<div>237ds</div>
<div>23ljk</div>
</p>
</li>
<div id="temp3">
Some more content
</div>
<li>
<p>
<div>lkgh322</div>
<div>32khhg</div>
<div>987dhgk</div>
<div>23lkjh</div>
</p>
</li>
</div>
I am attempting to change the invalid DIVs inside the paragraphs so I end up with this instead...
<div class="widget_output">
<div id="test1">
Some Content
</div>
<ul>
<li>
<p>
<span>768hh</span>
<span>2308d</span>
<span>237ds</span>
<span>23ljk</span>
</p>
</li>
<div id="temp3">
Some more content
</div>
<li>
<p>
<span>lkgh322</span>
<span>32khhg</span>
<span>987dhgk</span>
<span>23lkjh</span>
</p>
</li>
</div>
I am trying to do this using str_replace with something like...
$data = str_replace('<div>', '<span>', $data);
$data = str_replace('</div>', '</span>', $data);
Is there a way I can combine these two statements and also make it so that they only affect the 'This is a random item' and not the other occurrences?
$data = str_replace(array('<div>', '</div>'), array('<span>', '</span>'), $data);
Since you didn't give any other details and only asked:
Is there a way I can combine these two statements and also make it so that they only affect the 'This is a random item' and not the other occurrences?
Here you go:
$data = str_replace('<div>This is a random item</div>', '<span>This is a random item</span>', $data);
You'll need to use a regular expression to do what you are looking to do, or to actually parse the string as XML and modify it that way. The XML parsing is almost surely the "safest," since as long as the string is valid XML, it will work in a predictable way. Regexes can at times fall prey to strings not being in exactly the expected format, but if your input is predictable enough, they can be OK. To do what you want with regular expressions, you'd do something like
$parsed_string = preg_replace("~<div>(?=This is a random item)(.*?)</div>~", "<span>$1</span>", $input_string);
What's happening here is the regex is looking for a <div> tag which is followed by (using a lookahead assertion) This is a random item. It then captures any text between that tag and the next </div> tag. Finally, it replaces the match with <span>, followed by the captured text from inside the div tags, followed by </span>. This will work fine on the example you posted, but will have problems if, for example, the <div> tag has a class attribute. If you are expecting things like that, either a more complex regular expression would be needed, or full XML parsing might be the best way to go.
I'm a little surprised by the other answers. I thought someone would post a good one, but that hasn't happened. str_replace is not powerful enough in this case, and regular expressions are hit-and-miss; you need to write a parser.
You don't have to write a full HTML-parser, you can cheat a bit.
$in = '<div class="widget_output">
(..)
</div>';
$lines = explode("\n", $in);
$in_paragraph = false;
foreach ($lines as $nr => $line) {
if (strstr($line, "<p>")) {
$in_paragraph = true;
} else if (strstr($line, "</p>")) {
$in_paragraph = false;
} else {
if ($in_paragraph) {
$lines[$nr] = str_replace(array('<div>', '</div>'), array('<span>', '</span>'), $line);
}
}
}
echo implode("\n", $lines);
The critical part here is detecting whether you're in a paragraph or not. And only when you're in a paragraph, do the string replacement.
Note: I'm splitting on newlines (\n) which is not perfect, but works in this case. You might want to improve this part.
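One way to avoid splitting on newlines, sketched here as an assumption rather than as part of the answer above, is to let preg_replace_callback find each paragraph and do the string replacement only inside it. The function name divs_to_spans_in_paragraphs is made up, and the sketch assumes paragraphs are not nested and that the inner divs carry no attributes.

```php
<?php
// Convert <div>/</div> to <span>/</span>, but only inside <p>...</p> blocks
function divs_to_spans_in_paragraphs(string $data): string
{
    return preg_replace_callback(
        '~<p\b[^>]*>.*?</p>~is',          // one paragraph, across line breaks
        function (array $m): string {
            return str_replace(
                ['<div>', '</div>'],
                ['<span>', '</span>'],
                $m[0]
            );
        },
        $data
    );
}

$data = "<div id=\"test1\">Some Content</div>\n<p>\n<div>768hh</div>\n<div>2308d</div>\n</p>";
echo divs_to_spans_in_paragraphs($data);
```

The div with id="test1" is left alone because it sits outside any paragraph.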

filter php variable for specific bbcode string - wrap matches inside of divs?

Hey guys,
my PHP variable $content holds HTML!
I want to filter this $content for
[q=SomeQuestion] and [a=SomeAnswer]
and wrap each match inside a div.question and a div.answer.
So whenever this [q=Some Question][a=Some Answer] structure is found in $content, I want to output this:
<div class="qanda">
<div class="question">
Some Question
</div>
<div class="answer">
Some Answer
</div>
</div>
Is that possible? It is important that the question text or the answer text can contain HTML tags as well, like <p> or <b> etc.
update:
$q_regex = '/\[q=([^"]+?)]/is';
$q_output = '<div class="qanda"><div class="question">$1</div>';
$content = preg_replace($q_regex, $q_output, $content);
$a_regex = '/\[a=([^"]+?)]/is';
$a_output = '<div class="answer">$1</div></div>';
$content = preg_replace($a_regex, $a_output, $content);
http://www.spotlesswebdesign.com/blog.php?id=12
A tutorial on using regex to do BBCode parsing. People would recommend using a BBCode parser module, however. It should be safe to use regex here since you are not using nesting and whatnot.
EDIT
Possible but tricky, and it could be error-prone. Something like this, maybe (the \s* between the two tags allows them to be directly adjacent or separated by whitespace):
$result = preg_replace('/\[q=(.+?)]\s*\[a=(.+?)]/is', '<div class="qanda"><div class="question">$1</div><div class="answer">$2</div></div>', $subject);
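A small runnable sketch of this pairing approach follows. The function name render_qanda is hypothetical, and the pattern assumes that neither the question nor the answer text contains a ] character, since the lazy .+? stops at the first closing bracket.

```php
<?php
// Turn each [q=...][a=...] pair into the nested qanda markup
function render_qanda(string $content): string
{
    return preg_replace(
        '/\[q=(.+?)\]\s*\[a=(.+?)\]/is',
        '<div class="qanda"><div class="question">$1</div><div class="answer">$2</div></div>',
        $content
    );
}

echo render_qanda('<h2>FAQ</h2>[q=<b>Why?</b>][a=<p>Because.</p>]');
// prints: <h2>FAQ</h2><div class="qanda"><div class="question"><b>Why?</b></div><div class="answer"><p>Because.</p></div></div>
```

HTML tags inside the question or answer pass through unchanged because the captured text is inserted as-is.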
