(php) regex to remove comments but ignore occurrences within strings

I am writing a comment stripper and trying to cover every case. The code below removes pretty much all comments, but it goes too far. I spent a lot of time researching and testing the regex patterns, but I don't claim that each one is the best possible.
My problem is that I also have cases where 'PHP comments' (which aren't really comments) appear in ordinary code, or even inside PHP strings, and I don't want those removed.
Example:
<?php $Var = "Blah blah //this must not comment"; // this must comment. ?>
What ends up happening is that everything from the first // onward is stripped, which is fine for real comments, but it leaves certain problems:
<?php $Var = "Blah blah ?>
Also, a // comment that sits on the same line as the closing tag will cause problems, as the pattern removes the rest of the line, including the ending ?>
See the problem? So this is what I need...
Comment characters within '' or "" need to be ignored.
PHP comments that use double slashes and share a line with code should have only the comment itself removed, or else the entire PHP code block should be removed.
Here are the patterns I use at the moment. Feel free to tell me if there are improvements I can make to them. :)
$CompressedData = $OriginalData;
$CompressedData = preg_replace('!/\*.*?\*/!s', '', $CompressedData); // removes /* comments */
$CompressedData = preg_replace('!//.*?\n!', '', $CompressedData); // removes //comments
$CompressedData = preg_replace('!#.*?\n!', '', $CompressedData); // removes # comments
$CompressedData = preg_replace('/<!--(.*?)-->/', '', $CompressedData); // removes HTML comments
Any help that you can give me would be greatly appreciated! :)

If you want to parse PHP, you can use token_get_all to get the tokens of a given PHP code. Then you just need to iterate the tokens, remove the comment tokens and put the rest back together.
But you would need a separate procedure for the HTML comments, preferably a real parser too (like DOMDocument provides with DOMDocument::loadHTML).
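A minimal sketch of the tokenizer approach described above, assuming the tokenizer extension is available (it is in standard PHP builds). The function name stripPhpComments is made up for illustration and is not from the original answer. It only deals with PHP comments; HTML comments are left alone.

```php
<?php
// stripPhpComments() is a hypothetical helper name used only for this sketch.
function stripPhpComments(string $source): string
{
    $output = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            // Single-character tokens such as ';' or '=' come back as plain strings.
            $output .= $token;
            continue;
        }
        [$id, $text] = $token;
        if ($id === T_COMMENT || $id === T_DOC_COMMENT) {
            // Older PHP versions include the trailing newline in a one-line
            // comment token; keep that newline so lines do not get merged.
            if (substr($text, -1) === "\n") {
                $output .= "\n";
            }
            continue;
        }
        $output .= $text;
    }
    return $output;
}

$code = '<?php $Var = "Blah blah //this must not comment"; // this must comment. ?>';
echo stripPhpComments($code), "\n";
```

The string literal arrives as a single token, so the // inside it is never treated as a comment, and the closing tag survives because the tokenizer ends a one-line comment before it.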

You should first think carefully about whether you actually want to do this. Though what you're doing may seem simple, in the worst case it becomes an extremely complex problem (to solve with just a few regular expressions). Let me illustrate just a few of the problems you would face when trying to strip both HTML and PHP comments from a file.
You can't straight out strip HTML comments, because you may have PHP inside the HTML comments, like:

You can't simply deal with the content between the <?php and ?> tags separately either, since the ending tag ?> can appear inside strings or even comments, like:
<?php /* ?> This is still a PHP comment <?php */ ?>
Let's not forget that ?> actually ends the PHP block if it's preceded by a one-line comment. For example:
<?php // ?> This is not a PHP comment <?php ?>
Of course, like you already illustrated, there will be plenty of problems with comment indicators inside strings. Parsing out strings to ignore them isn't that simple either, since you have to remember that quotes can be escaped. Like:
<?php
$foo = ' /* // None of these start a comment ';
$bar = ' \' // Remember escaped quotes ';
$orz = " ' \" \' /* // Still not a comment ";
?>
Parsing order will also cause you headaches. You can't simply choose to parse either the one-line comments first or the multi-line comments first. They both have to be parsed at the same time (i.e. in the order they appear in the document). Otherwise you may end up with broken code. Let me illustrate:
<?php
/* // Multiline comment */
// /* Single Line comment
$omg = 'This is not in a comment */';
?>
If you parse multi-line comments first, the second /* will eat up part of the string, destroying the code. If you parse the single-line comments first, you will end up eating the first */, which will also destroy the code.
As you can see, there are many complex scenarios you'd have to account for if you intend to solve your problem with regular expressions. The only correct solution is to use some sort of PHP parser, like token_get_all(), to tokenize the entire source code, strip the comment tokens and rebuild the file. That, I'm afraid, isn't entirely simple either. It also won't help with HTML comments, since the HTML is left untouched. You can't use XML parsers to get the HTML comments either, because HTML mixed with PHP is rarely well formed.
In short, the idea of what you're doing is simple, but the actual implementation is much harder than it seems. Thus, I would recommend avoiding this unless you have a very good reason to do it.

One way to do this with regex is to use one compound expression and preg_replace_callback.
I was going to post a poor example but the best place to look is at the source code to the PHP port of Dean Edwards' JS packer script - you should see the general idea.
http://joliclic.free.fr/php/javascript-packer/en/
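For what it's worth, here is a rough sketch of the compound-expression idea. It is not taken from the packer script linked above; the function name and the pattern are my own assumptions. One pattern matches quoted strings, block comments and line comments in a single left-to-right pass, and the callback puts strings back untouched while dropping comments. It treats the whole input as PHP code and knows nothing about heredocs or HTML, so it is still only an approximation.

```php
<?php
// stripCommentsOnePass() is a hypothetical name; the pattern is an illustration only.
function stripCommentsOnePass(string $code): string
{
    $pattern = '~
        (?<string> "(?:[^"\\\\]|\\\\.)*" | \'(?:[^\'\\\\]|\\\\.)*\' )
      | (?<block>  /\*.*?\*/ )
      | (?<line>   (?://|\#) [^\r\n]*? (?= \?> | [\r\n] | \z ) )
    ~xs';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // A matched string literal is returned as-is; any comment becomes empty.
        return ($m['string'] ?? '') !== '' ? $m[0] : '';
    }, $code);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $code;
}

$sample = "/* // Multiline comment */\n"
        . "// /* Single Line comment\n"
        . "\$omg = 'This is not in a comment */';";
echo stripCommentsOnePass($sample), "\n";
```

Because strings and both comment styles compete at every position, whichever construct starts first wins, which is exactly the "parse them in document order" requirement from the earlier answer.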

try this
private function removeComments( $content ){
    $content = preg_replace( "!/\*.*?\*/!s" , '', $content );
    $content = preg_replace( "/\n\s*\n/" , "\n", $content );
    $content = preg_replace( '#^\s*//.+$#m' , "", $content );
    $content = preg_replace( '![\s\t]//.*?\n!' , "\n", $content );
    $content = preg_replace( '/<\!--.*-->/' , "\n", $content );
    return $content;
}

Related

Replace // comments by /* comments */ Except in URLs [duplicate]

I need to remove the comment lines from my code.
preg_replace('!//(.*)!', '', $test);
It works fine, but it also mangles website URLs, leaving only http: behind.
To avoid this, I changed it to preg_replace('![^:]//(.*)!', '', $test);
That works, but the problem is when my code has a line like this:
$code = 'something';// comment here
The pattern consumes the semicolon along with the comment, so after the replacement the line above becomes
$code = 'something'
which produces a syntax error.
I just need to delete the single-line comments while leaving URLs unchanged.
Please help. Thanks in advance.

try this
preg_replace('#(?<!http:)//.*#','',$test);
also read more about PCRE assertions http://cz.php.net/manual/en/regexp.reference.assertions.php
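A quick way to see what the negative lookbehind does, and where it stops helping. The sample lines are my own, not from the question. Unlike [^:], the lookbehind does not consume the character before //, so the semicolon survives, but it only protects URLs that literally start with http:.

```php
<?php
$pattern = '#(?<!http:)//.*#';

// Hypothetical sample lines for illustration.
$samples = [
    "\$code = 'something';// comment here",
    "\$url = 'http://example.com';",
    "\$url = 'https://example.com';", // https: is not covered by the lookbehind
];

foreach ($samples as $line) {
    echo preg_replace($pattern, '', $line), "\n";
}
```

The first line keeps its semicolon and the second is left untouched, but the third is cut off right after https: because the five characters before // are not http:. Any other // inside a string is cut the same way.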

If you want to parse a PHP file and manipulate the PHP code it contains, the best solution (even if a bit difficult) is to use the Tokenizer: it exists to allow manipulation of PHP code.
Working with regular expressions for such a thing is a bad idea...
For instance, you thought about http://, but what about strings that contain //?
Like this one, for example:
$str = "this is // a test";

This can get complicated fast. There are more uses for // in strings. If you are parsing PHP code, I highly suggest you take a look at the PHP tokenizer. It's specifically designed to parse PHP code.
Question: Why are you trying to strip comments in the first place?
Edit: I see now you are trying to parse JavaScript, not PHP. So, why not use a JavaScript minifier instead? It will strip comments and whitespace, and do a lot more to make your file as small as possible.

PHP string containing two forward slashes in

This is probably a very simple question, but I just can't figure it out. I want to define a string containing two forward slashes
$htmlcode="text//text";
From what I understand, whatever follows // is a comment.
Question: How do I create a string containing //?

Parsing of the language is a little bit tricky. Within a string literal, comments and other language features are not triggered (except for special characters, which need to be escaped). Also, within block comments, line comments are not recognized.
$example1 = 'hello /* this is not a comment */ '; /* but this is */
$example2 = 'hello // this is not a comment '; //but this is
$example3 = "works the same with double quotes /* not a comment */ //not a comment ";
/* comment example
$thisIsAComment
//this does not escape the closing */
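If you want to convince yourself, a tiny check like the one below (my own sketch, using the variable from the question) shows that the slashes are ordinary characters inside the string:

```php
<?php
$htmlcode = "text//text"; // only this trailing part is a comment

// The two slashes are part of the string value.
echo $htmlcode, "\n";
echo strlen($htmlcode), "\n"; // 4 + 2 + 4 characters, so this prints 10
```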

$htmlcode="text//text"; //this is comment.
Your string is already defined as you want it to be.
Check out docs: http://php.net/manual/en/language.basic-syntax.comments.php
You should use an IDE or a syntax highlighter; it will help you read the code more clearly.
Notepad++ is free and lightweight http://notepad-plus-plus.org/download

Your code is fine.
I would highly suggest reading the Basic Syntax - Comments guide to get a better understanding.

PHP: How are comments skipped?

Well, if I comment something out, it's skipped in all languages, but how is it skipped and what is actually read?
Example:
// This is commented out
Now, does PHP read the whole comment to get to the next line, or does it just read the //?

The script is parsed and split into tokens.
You can actually try this out yourself on any valid PHP source code using token_get_all(), it uses PHP's native tokenizer.
The example from the manual shows how a comment is dealt with:
<?php
$tokens = token_get_all('<?php echo; ?>'); /* => array(
array(T_OPEN_TAG, '<?php'),
array(T_ECHO, 'echo'),
';',
array(T_CLOSE_TAG, '?>') ); */
/* Note in the following example that the string is parsed as T_INLINE_HTML
rather than the otherwise expected T_COMMENT (T_ML_COMMENT in PHP <5).
This is because no open/close tags were used in the "code" provided.
This would be equivalent to putting a comment outside of <?php ?>
tags in a normal file. */
$tokens = token_get_all('/* comment */');
// => array(array(T_INLINE_HTML, '/* comment */'));
?>
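To see a real T_COMMENT token, the comment has to sit inside PHP tags. A small sketch (describeTokens is a made-up helper name, not part of the manual's example) that lists the token names for a snippet:

```php
<?php
// describeTokens() is a hypothetical helper name used only for this sketch.
function describeTokens(string $code): array
{
    $names = [];
    foreach (token_get_all($code) as $token) {
        // Array tokens carry [id, text, line]; single characters are plain strings.
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }
    return $names;
}

print_r(describeTokens("<?php // This is commented out\necho 1;"));
```

The comment shows up as its own token in the list, so the tokenizer does read it to the end; it is simply not turned into any executable code afterwards. The exact whitespace tokens around it can differ between PHP versions.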

There is a tokenization phase while compiling. During this phase, it sees the // and then just ignores everything to the end of the line. Compilers CAN get complicated, but for the most part they are pretty straightforward.
http://compilers.iecc.com/crenshaw/

Your question doesn't make sense. Having read the '//', it then has to keep reading to the newline to find it. There's no choice about this. There is no other way to find the newline.
Conceptually, compiling has several phases that are logically prior to parsing:
(1) Scanning.
(2) Screening.
(3) Tokenization.
(1) basically means reading the file character by character from left to right.
(2) means throwing things away of no interest, e.g. collapsing multiple newline/whitespace sequences to a single space.
(3) means combining what's left into tokens, e.g. identifiers, keywords, literals, punctuation.
Comments are screened out during (2). In modern compilers this is all done at once by a deterministic automaton.
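As a rough illustration of the scanning and screening idea, here is a toy character-by-character scanner. It is my own sketch, not PHP's actual lexer, and the function name is invented. It reads left to right, copies string literals through untouched, and drops comments. It ignores heredocs, the closing PHP tag and everything else a real lexer handles.

```php
<?php
// screenComments() is a hypothetical name; this is a toy, not PHP's real lexer.
function screenComments(string $src): string
{
    $out = '';
    $len = strlen($src);
    $i = 0;
    while ($i < $len) {
        $ch = $src[$i];
        $next = $i + 1 < $len ? $src[$i + 1] : '';

        if ($ch === '"' || $ch === "'") {
            // String literal: copy through the closing quote, honouring backslash escapes.
            $quote = $ch;
            $out .= $ch;
            $i++;
            while ($i < $len) {
                $out .= $src[$i];
                if ($src[$i] === '\\' && $i + 1 < $len) {
                    $out .= $src[$i + 1];
                    $i += 2;
                    continue;
                }
                if ($src[$i++] === $quote) {
                    break;
                }
            }
        } elseif ($ch === '/' && $next === '*') {
            // Block comment: skip to the closing marker, or to the end of input.
            $end = strpos($src, '*/', $i + 2);
            $i = $end === false ? $len : $end + 2;
        } elseif (($ch === '/' && $next === '/') || $ch === '#') {
            // Line comment: skip up to, but not including, the line break.
            $i += strcspn($src, "\r\n", $i);
        } else {
            $out .= $ch;
            $i++;
        }
    }
    return $out;
}

echo screenComments("a = 'x // y'; // c\nb = 1; /* z */ c"), "\n";
```

Note that the scanner still has to walk over every character of a line comment to find the line break, which is the point made above: there is no way to find the newline without reading up to it.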

preg_replace() help in PHP

Consider this string
hello awesome <a href="" rel="external" title="so awesome is cool"> stuff stuff
What regex could I use to match any occurrence of awesome which doesn't appear within the title attribute of the anchor?
So far, this is what I've come up with (sadly, it doesn't work):
/[^."]*(awesome)[^."]*/i
Edit
I took Alan M's advice and used a regex to capture every word and send it to a callback. Thanks Alan M for your advice. Here is my final code.
$plantDetails = end($this->_model->getPlantById($plantId));
$botany = new Botany_Model();
$this->_botanyWords = $botany->getArray();
foreach($plantDetails as $key=>$detail) {
$detail = preg_replace_callback('/\b[a-z]+\b/iU', array($this, '_processBotanyWords'), $detail);
$plantDetails[$key] = $detail;
}
And the _processBotanyWords()...
private function _processBotanyWords($match) {
$botanyWords = $this->_botanyWords;
$word = $match[0];
if (array_key_exists($word, $botanyWords)) {
return '' . $word . '';
} else {
return $word;
}
}
Hope this will help someone else some day! Thanks again for all your answers.

This subject comes up pretty much every day here and basically the issue is this: you shouldn't be using regular expressions to parse or alter HTML (or XML). That's what HTML/XML parsers are for. The above problem is just one of the issues you'll face. You may get something that mostly works but there'll still be corner cases where it doesn't.
Just use an HTML parser.

Assuming this is related to the question you posted and deleted a little while ago (that was you, wasn't it?), it's your fundamental approach that's wrong. You said you were generating these HTML links yourself by replacing words from a list of keywords. The trouble is that keywords farther down the list sometimes appear in the generated title attributes and get replaced by mistake, and now you're trying to fix the mistakes.
The underlying problem is that you're replacing each keyword using a separate call to preg_replace, effectively processing the entire text over and over again. What you should do is process the text once, matching every single word and looking it up in your list of keywords; if it's on the list, replace it. I'm not set up to write/test PHP code, but you probably want to use preg_replace_callback:
$text = preg_replace_callback('/\b[A-Za-z]+\b/', "the_callback", $text);
"the_callback" is the name of a function that looks up the word and, if it's in the list, generates the appropriate link; otherwise it returns the matched word. It may sound inefficient, processing every word like this, but in fact it's a great deal more efficient than your original approach.
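Something along these lines, as an untested-in-your-setup sketch: the keyword list, the URLs and the link format are all invented for illustration, and linkKeywords is a hypothetical name. The point is that the text is scanned once, so a keyword that ends up inside a generated title attribute is never looked at again.

```php
<?php
// Hypothetical keyword list and link format, for illustration only.
$keywords = ['awesome' => '/glossary/awesome', 'cool' => '/glossary/cool'];

function linkKeywords(string $text, array $keywords): string
{
    return preg_replace_callback('/\b[A-Za-z]+\b/', function (array $m) use ($keywords): string {
        $key = strtolower($m[0]);
        if (!isset($keywords[$key])) {
            // Not a keyword: hand the word back unchanged.
            return $m[0];
        }
        // The title contains "cool", which is also a keyword, but the
        // replacement text is not rescanned, so it is left alone.
        return '<a href="' . $keywords[$key] . '" title="' . $m[0] . ' is cool">' . $m[0] . '</a>';
    }, $text);
}

echo linkKeywords('hello awesome stuff', $keywords), "\n";
```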

Sure, using a parsing library is the industrial-strength solution, but we all have times when we just want to write something in 10 seconds and be done. Next time you want to process the meaty text of a page, ignoring tags, try running your input through strip_tags first. This way you will get only the plain, visible text and your regex powers will again reign supreme.
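For instance, with the string from the question (a quick sketch; it suits counting or extracting words, since the tags are thrown away rather than preserved):

```php
<?php
$html = 'hello awesome <a href="" rel="external" title="so awesome is cool"> stuff stuff';

// strip_tags() drops the whole <a ...> tag, title attribute included.
$text = strip_tags($html);

// Only the occurrence in the visible text is left to match.
$count = preg_match_all('/\bawesome\b/i', $text);

echo $text, "\n";
echo $count, "\n";
```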

This is so horrible I hesitate to post it, but if you want a quick hack, reverse the problem--instead of finding the stuff that isn't X, find the stuff that IS, change it, do the thing and change it back.
This is assuming you're trying to change awesome (to "wonderful"). If you're doing something else, adjust accordingly.
$string = 'Awesome is the man who <b>awesome</b> does and awesome is.';
$string = preg_replace('#(title\s*=\s*\"[^"]*?)awesome#is', "$1PIGDOG", $string);
$string = preg_replace('#awesome#is', 'wonderful', $string);
$string = preg_replace('#pigdog#is', 'awesome', $string);
Don't vote me down. I know it's a hack.
