How to remove HTML comments in PHP

I am trying to remove any comments embedded in an HTML file:
$data = file_get_contents($stream);
$data = preg_replace('<!--*-->', '', $data);
echo $data;
I still end up with all the comments (<!-- bla bla bla -->).
What am I doing wrong?

// Remove unwanted HTML comments
function remove_html_comments($content = '') {
return preg_replace('/<!--(.|\s)*?-->/', '', $content);
}
Source: https://davidwalsh.name/remove-html-comments-php

I know plenty of answers have already been posted. I tried many of them, but this is the regular expression that worked for me on multi-line HTML comments (40 lines of comments in my case):
$string = preg_replace("~<!--(.*?)-->~s", "", $string);
Cheers :)

The below regex will remove HTML comments, but will keep conditional comments.
<!--(?!<!)[^\[>].*?-->
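A minimal sketch of that pattern in use, assuming the s flag is added so that . also matches newlines. The function name strip_plain_comments and the sample markup are made up for this example.

```php
<?php
// Hypothetical helper: remove ordinary comments, keep conditional ones.
function strip_plain_comments(string $html): string
{
    // (?!<!) skips downlevel-revealed closers such as <!--<![endif]-->,
    // [^\[>] skips openers such as <!--[if IE]>.
    return preg_replace('/<!--(?!<!)[^\[>].*?-->/s', '', $html);
}

$html = "<p>a</p><!-- note --><!--[if IE]><b>ie</b><![endif]-->";
echo strip_plain_comments($html), PHP_EOL;
```

The ordinary comment is removed and the [if IE] block is left in place. Edge cases such as an empty comment (<!---->) are not handled by this pattern.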

You could do it without using regular expression:
function strip_comments($html)
{
$html = str_replace(array("\r\n<!--", "\n<!--"), "<!--", $html);
while(($pos = strpos($html, "<!--")) !== false)
{
if(($_pos = strpos($html, "-->", $pos)) === false)
$html = substr($html, 0, $pos);
else
$html = substr($html, 0, $pos) . substr($html, $_pos+3);
}
return $html;
}

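Change your regular expression to this one (shown in sed/Perl substitution syntax):
s/<!--[^>]*?-->//g
Note that [^>] means it will not match a comment that itself contains a > character.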

Regular expressions are very difficult to corral into doing what you want here.
To match arbitrary text in a regex, you need .*, not just *. Your expression is looking for <!-, followed by zero or more - characters, followed by -->.
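To make that concrete, here is a small sketch of the corrected call. One extra detail, as far as I understand PHP's PCRE functions: a pattern written as '<...>' uses the angle brackets as bracket-style delimiters, so they are not part of the pattern at all. The sample $data string is invented for the demonstration.

```php
<?php
$data = "<p>keep</p><!-- one -->\n<!-- two\nlines --><p>also keep</p>";

// Explicit ~ delimiters, .*? for "any text, as little as possible",
// and the s flag so that . also matches newlines.
$clean = preg_replace('~<!--.*?-->~s', '', $data);

echo $clean;
```

The lazy quantifier matters: each comment is matched on its own, so the newline between the two comments survives.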

I would not use a regex for such a task. A pattern can fail on input it did not anticipate.
Instead, I would do something that is safe, like this:
$linesExploded = explode('-->', $html);
foreach ($linesExploded as &$line) {
if (($pos = strpos($line, '<!--')) !== false) {
$line = substr($line, 0, $pos);
}
}
unset($line); // break the reference left over from the foreach
$html = implode('', $linesExploded);
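If avoiding regex is the goal, another option is to let a parser find the comment nodes. This is only a sketch: it assumes the DOM extension is available and that the input is a whole HTML document, and the helper name remove_comment_nodes is made up for the example. saveHTML() re-serializes the document, so whitespace and the doctype may differ from the input.

```php
<?php
// Hypothetical helper: strip every comment node from an HTML document.
function remove_comment_nodes(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Copy the result list first so removal cannot disturb the iteration.
    foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
        $comment->parentNode->removeChild($comment);
    }
    return $dom->saveHTML();
}

echo remove_comment_nodes('<html><body><!-- gone --><p>OK</p></body></html>');
```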

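Do it this way, with a lazy quantifier (.*?) so that each comment is matched separately instead of everything from the first <!-- to the last -->:
$str = "<html><!-- this is a comment -->OK</html>";
$str2 = preg_replace('/<!--.*?-->/s', '', $str);
var_dump($str2);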

Related

PHP Preg Replace. Remove strings inside {~ string ~} pattern, but skip <pre>{~ string ~}</pre> [duplicate]

I am using a WordPress plugin named Acronyms (https://wordpress.org/plugins/acronyms/). This plugin replaces acronyms with their description. It uses a PHP PREG_REPLACE function.
The issue is that it replaces the acronyms contained in a <pre> tag, which I use to present a source code.
Could you modify this expression so that it won't replace acronyms contained inside <pre> tags (not only directly, but in any moment)? Is it possible?
The PHP code is:
$text = preg_replace(
"|(?!<[^<>]*?)(?<![?.&])\b$acronym\b(?!:)(?![^<>]*?>)|msU"
, "<acronym title=\"$fulltext\">$acronym</acronym>"
, $text
);
You can use a PCRE SKIP/FAIL regex trick (also works in PHP) to tell the regex engine to only match something if it is not inside some delimiters:
(?s)<pre[^<]*>.*?<\/pre>(*SKIP)(*F)|\b$acronym\b
This means: skip all substrings starting with <pre> and ending with </pre>, and only then match $acronym as a whole word.
See demo on regex101.com
Here is a sample PHP demo:
<?php
$acronym = "ASCII";
$fulltext = "American Standard Code for Information Interchange";
$re = "/(?s)<pre[^<]*>.*?<\\/pre>(*SKIP)(*F)|\\b$acronym\\b/";
$str = "<pre>ASCII\nSometext\nMoretext</pre>More text \nASCII\nMore text<pre>More\nlines\nASCII\nlines</pre>";
$subst = "<acronym title=\"$fulltext\">$acronym</acronym>";
$result = preg_replace($re, $subst, $str);
echo $result;
Output:
<pre>ASCII
Sometext
Moretext</pre>More text 
<acronym title="American Standard Code for Information Interchange">ASCII</acronym>
More text<pre>More
lines
ASCII
lines</pre>
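Since the acronym is interpolated into the pattern, it may be worth escaping it first. The sketch below wraps the same SKIP/FAIL pattern in a function; the name wrap_acronym is hypothetical, and the \b anchors assume the acronym starts and ends with a word character.

```php
<?php
// Hypothetical wrapper around the SKIP/FAIL pattern from the answer.
function wrap_acronym(string $text, string $acronym, string $fulltext): string
{
    // Escape regex metacharacters (and the / delimiter) in the acronym.
    $quoted = preg_quote($acronym, '/');
    $re = '/(?s)<pre[^<]*>.*?<\/pre>(*SKIP)(*F)|\b' . $quoted . '\b/';
    $subst = '<acronym title="' . htmlspecialchars($fulltext) . '">' . $acronym . '</acronym>';

    // A callback returns the replacement literally, so characters such as
    // $ or \ in the description are not treated as backreferences.
    return preg_replace_callback($re, function () use ($subst) {
        return $subst;
    }, $text);
}

echo wrap_acronym(
    '<pre>ASCII</pre> ASCII',
    'ASCII',
    'American Standard Code for Information Interchange'
);
```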
It is also possible to use preg_split and keep the code block as a group, only replace the non-code block part then combine it back as a complete string:
function replace($s) {
return str_replace('"', '&quot;', $s); // do something with `$s`
}
$text = 'Your text goes here...';
$parts = preg_split('#(<\/?[-:\w]+(?:\s[^<>]+?)?>)#', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
$text = "";
$x = 0;
foreach ($parts as $v) {
if (trim($v) === "") {
$text .= $v;
continue;
}
if ($v[0] === '<' && substr($v, -1) === '>') {
if (preg_match('#^<(\/)?(?:code|pre)(?:\s[^<>]+?)?>$#', $v, $m)) {
$x = isset($m[1]) && $m[1] === '/' ? 0 : 1;
}
$text .= $v; // this is a HTML tag…
} else {
$text .= !$x ? replace($v) : $v; // process or skip…
}
}
echo $text;
Taken from here.

How to remove "<>" brackets from string in php?

$cont=htmlspecialchars(file_get_contents("https://myanimelist.net/anime/30276/One_Punch_Man"));
function getBetween($string, $start = "", $end = ""){
if (strpos($string, $start)) { // required if $start not exist in $string
$startCharCount = strpos($string, $start) + strlen($start);
$firstSubStr = substr($string, $startCharCount, strlen($string));
$endCharCount = strpos($firstSubStr, $end);
if ($endCharCount == 0) {
$endCharCount = strlen($firstSubStr);
}
return substr($firstSubStr, 0, $endCharCount);
} else {
return '';
}
}
$name=getBetween($cont,'title',' - MyAnimeList.net');
//$name=preg_replace('/[^a-zA-Z0-9 \p{L}]/m', '', $name);
preg_replace('/(*UTF8)[\>\<]/m', '', $name);
trim($name," ");
//$name=str_replace("gt", "", $name);
echo $name;
I want to find the text between the title tags. How can I do this?
For example, on this page the title contains 'One Punch Man - MyAnimeList.net', and I want to get that.
Just use string replace function:
$string = '<BoomBox>';
$string = str_replace('<', '', $string);
$string = str_replace('>', '', $string);
echo $string; // output: BoomBox
http://php.net/manual/en/function.str-replace.php
You edited your question, and we can now see that you are dealing with XML/HTML. It is almost always better to work with the DOM classes than with a regex. There is a famous Stack Overflow post explaining why you should not parse HTML with regex. Try this solution instead:
<?php
$dom = new DOMDocument();
$dom->loadHTML('<title>BoomBox</title>');
echo $dom->getElementsByTagName('title')->item(0)->textContent;
http://php.net/manual/en/class.domdocument.php
http://php.net/manual/en/class.domnode.php
See it working here https://3v4l.org/EjPQd
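Applied to the page title from the question, the same idea might look like the sketch below. The HTML is an inline stand-in for the downloaded page (an assumption, so the example runs without network access), and the suffix handling is just one way to drop the site name.

```php
<?php
// Stand-in for the fetched page; the real code would use file_get_contents().
$html = '<html><head><title>One Punch Man - MyAnimeList.net</title></head><body></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate invalid real-world markup
$dom->loadHTML($html);
libxml_clear_errors();

$titles = $dom->getElementsByTagName('title');
$title  = $titles->length > 0 ? trim($titles->item(0)->textContent) : '';

// Drop the site suffix only if the title actually ends with it.
$suffix = ' - MyAnimeList.net';
if (substr($title, -strlen($suffix)) === $suffix) {
    $name = substr($title, 0, -strlen($suffix));
} else {
    $name = $title;
}

echo $name;
```

Note that the page should not be passed through htmlspecialchars() before parsing, since that turns the tags into plain text.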
You can use preg_replace();, or strip_tags();.
Example preg_replace();:
$str = '> One Punch Man';
$new = preg_replace('/[^a-zA-Z0-9 \p{L}]/m', '', $str);
echo $new;
Output: One Punch Man
The example above keeps only letters, digits and spaces (a-z, A-Z, 0-9 and, through \p{L}, other letters). You can extend the character class as needed.
Example strip_tags();:
$str = '<title> BoomBox </title>';
$another = strip_tags($str);
echo $another;
Output: BoomBox
Documentation:
http://php.net/manual/en/function.preg-replace.php // preg_replace();
http://php.net/manual/en/function.strip-tags.php // strip_tags();
You can also use a single call to str_replace with the ['<','>'] as the search argument:
$string = '<BoomBox>';
echo str_replace(['<', '>'], '', $string) . PHP_EOL;
// => BoomBox
Or, you may use a regex with preg_replace (especially, if you plan on adding more restrictions for in-context matching to it):
echo preg_replace('~[<>]~', '', $string);
// => BoomBox
See the PHP demo.

Inverse htmlentities / html_entity_decode

Basically I want to turn a string like this:
<code> &lt;div&gt; blabla &lt;/div&gt; </code>
into this:
&lt;code&gt; <div> blabla </div> &lt;/code&gt;
How can I do it?
The use case (because some people were curious):
A page like this with a list of allowed HTML tags and examples. For example, <code> is an allowed tag, and this would be the sample:
<code>&lt;?php echo "Hello World!"; ?&gt;</code>
I wanted a reverse function because there are many such tags with samples, and I store them all in an array that I iterate over in one loop instead of handling each one individually.
My version using regular expressions:
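$string = '<code> &lt;div&gt; blabla &lt;/div&gt; </code>';
// The /e modifier was removed in PHP 7, so use preg_replace_callback instead.
$new_string = preg_replace_callback(
'/(.*?)(<.*?>|$)/s',
function ($m) {
return html_entity_decode($m[1]) . htmlentities($m[2]);
},
$string
);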
It tries to match every tag and textnode and then apply htmlentities and html_entity_decode respectively.
There isn't an existing function, but have a look at this.
So far I've only tested it on your example, but this function should work on all htmlentities
function html_entity_invert($string) {
$matches = $store = array();
preg_match_all('/(&(#?\w){2,6};)/', $string, $matches, PREG_SET_ORDER);
foreach ($matches as $i => $match) {
$key = '__STORED_ENTITY_' . $i . '__';
$store[$key] = html_entity_decode($match[0]);
$string = str_replace($match[0], $key, $string);
}
return str_replace(array_keys($store), $store, htmlentities($string));
}
Update:
Thanks to @Mike for taking the time to test my function with other strings. I've updated my regex from /(\&(.+)\;)/ to /(\&([^\&\;]+)\;)/, which should take care of the issue he raised.
I've also added {2,6} to limit the length of each match to reduce the possibility of false positives.
Changed regex from /(\&([^\&\;]+){2,6}\;)/ to /(&([^&;]+){2,6};)/ to remove unnecessary escaping.
Whooa, brainwave! Changed the regex from /(&([^&;]+){2,6};)/ to /(&(#?\w){2,6};)/ to reduce probability of false positives even further!
A plain replacement will not be good enough here, whether with regular expressions or simple string replacement: if you replace the &lt; and &gt; entities and then the < and > signs (or vice versa), everything ends up in a single form (all &lt; and &gt;, or all < and >).
So if you want to do this, you have to take one set out first (I chose to replace it with placeholders), do a replace, then put the first set back with another replace.
$str = "<code> &lt;div&gt; blabla &lt;/div&gt; </code>";
$search = array("&lt;", "&gt;");
// placeholders for &lt; and &gt;
$replace = array("[", "]");
// first replace: swap &lt; and &gt; for [ and ] respectively
$str = str_replace($search, $replace, $str);
// second replace: encode the original < and >
$search = array("<", ">");
$replace = array("&lt;", "&gt;");
$str = str_replace($search, $replace, $str);
// third replace: turn [ and ] into < and >
$search = array("[", "]");
$replace = array("<", ">");
$str = str_replace($search, $replace, $str);
echo $str;
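The placeholder step can be avoided with strtr(), which, when given an array, does not re-scan text it has already replaced. This is a sketch that only covers the two entities discussed here (&lt; and &gt;); the variable name $swapped is made up for the example.

```php
<?php
$str = '<code> &lt;div&gt; blabla &lt;/div&gt; </code>';

// strtr() with an array works through the input once and never replaces
// text it has already substituted, so both directions can share one map.
$swapped = strtr($str, [
    '<'    => '&lt;',
    '>'    => '&gt;',
    '&lt;' => '<',
    '&gt;' => '>',
]);

echo $swapped;
```

It also sidesteps the risk that the input already contains the [ and ] placeholder characters.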
I think I have a small solution: why not break the HTML tags into an array, then compare and change them if needed?
function invertHTML($str) {
$res = array();
for ($i=0, $j=0; $i < strlen($str); $i++) {
if ($str[$i] == "<") {
if (isset($res[$j]) && strlen($res[$j]) > 0){
$j++;
$res[$j] = '';
} else {
$res[$j] = '';
}
$pos = strpos($str, ">", $i);
$res[$j] .= substr($str, $i, $pos - $i+1);
$i += ($pos - $i);
$j++;
$res[$j] = '';
continue;
}
$res[$j] .= $str[$i];
}
$newString = '';
foreach($res as $html){
$change = html_entity_decode($html);
if($change != $html){
$newString .= $change;
} else {
$newString .= htmlentities($html);
}
}
return $newString;
}
Modified .... with no errors.
So, although other people on here have recommended regular expressions, which may be the absolute right way to go ... I wanted to post this, as it is sufficient for the question you asked.
Assuming that you are always using html'esque code:
$str = '<code> &lt;div&gt; blabla &lt;/div&gt; </code>';
xml_parse_into_struct(xml_parser_create(), $str, $nodes);
$xmlArr = array();
foreach($nodes as $node) {
echo htmlentities('<' . $node['tag'] . '>') . html_entity_decode($node['value']) . htmlentities('</' . $node['tag'] . '>');
}
Gives me the following output:
&lt;CODE&gt; <div> blabla </div> &lt;/CODE&gt;
Fairly certain that this wouldn't support going backwards again .. as other solutions posted, would, in the sense of:
$orig = '<code> &lt;div&gt; blabla &lt;/div&gt; </code>';
$modified = '&lt;CODE&gt; <div> blabla </div> &lt;/CODE&gt;';
$modifiedAgain = '<code> &lt;div&gt; blabla &lt;/div&gt; </code>';
I'd recommend using a regular expression, e.g. preg_replace():
http://www.php.net/manual/en/function.preg-replace.php
http://www.webcheatsheet.com/php/regular_expressions.php
http://davebrooks.wordpress.com/2009/04/22/php-preg_replace-some-useful-regular-expressions/
Edit: It appears that I haven't fully answered your question. There is no built-in PHP function to do what you want, but you can do find and replace with regular expressions or even simple expressions: str_replace, preg_replace

PHP extract text from string - trim?

I have the following XML:
<id>tag:search.twitter.com,2005:22204349686</id>
How can i write everything after the second colon to a variable?
E.g. 22204349686
if(preg_match('#<id>.*?:.*?:(.*?)</id>#',$input,$m)) {
$num = $m[1];
}
When you already have just the tag's content in a variable $str, you could use explode to get everything after the second colon:
list(,,$rest) = explode(':', $str, 3);
$var = preg_replace('/^([^:]+:){2}/', '', 'tag:search.twitter.com,2005:22204349686');
I am assuming you already have the string without the <id> bits.
Otherwise, for SimpleXML:
$var = preg_replace('/^([^:]+:){2}/', '', "{$yourXml->id}");
First, parse the XML with an XML parser. Find the text content of the node in question (tag:search.twitter.com,2005:22204349686). Then, write a relevant regex, e.g.
<?php
$str = 'tag:search.twitter.com,2005:22204349686';
preg_match('#^([^:]+):([^,]+),([0-9]+):([0-9]+)#', $str, $matches);
var_dump($matches);
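The "parse first, then split" approach could be sketched like this, assuming SimpleXML is available. The surrounding <entry> element is an assumption made so that the snippet is a well-formed document. Limiting explode() to three parts keeps any later colons inside the last part.

```php
<?php
// Assumed wrapper element; the question only shows the <id> line.
$xml = '<entry><id>tag:search.twitter.com,2005:22204349686</id></entry>';

$entry  = simplexml_load_string($xml);
$idText = (string) $entry->id;          // "tag:search.twitter.com,2005:22204349686"

// Split into at most 3 parts: everything after the second colon is part 3.
$parts = explode(':', $idText, 3);
$num   = count($parts) === 3 ? $parts[2] : null;

echo $num;
```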
I assume the content of the id tag is in a variable ($str).
// get the last occurrence of a colon
$pos = strrpos($str, ":");
if ($pos !== false) {
// get the substring of $str that starts just after the colon
$result = substr($str, $pos + 1);
} else {
$result = null;
}
Regex seems to me inappropriate for such a simple matching.
If you don't have the ID tags around the string, you can simply do
echo trim(strrchr($xml, ':'), ':');
If they are around, you can use
$xml = '<id>tag:search.twitter.com,2005:22204349686</id>';
echo filter_var(strrchr($xml, ':'), FILTER_SANITIZE_NUMBER_INT);
// 22204349686
The strrchr part returns :22204349686</id> and the filter_var part strips everything that is not a digit or a plus/minus sign.
Use explode and strip_tags:
list(,,$id) = explode( ':', strip_tags( $input ), 3 );
function between($t1,$t2,$page) {
$p1=stripos($page,$t1);
if($p1!==false) {
$p2=stripos($page,$t2,$p1+strlen($t1));
} else {
return false;
}
return substr($page,$p1+strlen($t1),$p2-$p1-strlen($t1));
}
$x='<id>tag:search.twitter.com,2005:22204349686</id>';
$text=between(',','<',$x);
if($text!==false) {
// got some text: here $text is "2005:22204349686", so the "2005:" prefix still has to be stripped
}
