PHP: How are comments skipped?

PHP: How are comments skipped? - php

Well if I comment something it's skipped in all languages, but how are they skipped and what is readed?
Example:
// This is commented out
Now does PHP reads the whole comment to go to next lines or just reads the //?

The script is parsed and split into tokens.
You can actually try this out yourself on any valid PHP source code using token_get_all(), it uses PHP's native tokenizer.
The example from the manual shows how a comment is dealt with:
<?php
$tokens = token_get_all('<?php echo; ?>'); /* => array(
array(T_OPEN_TAG, '<?php'),
array(T_ECHO, 'echo'),
';',
array(T_CLOSE_TAG, '?>') ); */
/* Note in the following example that the string is parsed as T_INLINE_HTML
rather than the otherwise expected T_COMMENT (T_ML_COMMENT in PHP <5).
This is because no open/close tags were used in the "code" provided.
This would be equivalent to putting a comment outside of <?php ?>
tags in a normal file. */
$tokens = token_get_all('/* comment */');
// => array(array(T_INLINE_HTML, '/* comment */'));
?>

There is a tokenization phase while compiling. During this phase, it see the // and then just ignores everything to the end of the line. Compilers CAN get complicated, but for the most part are pretty straight forward.
http://compilers.iecc.com/crenshaw/

Your question doesn't make sense. Having read the '//', it then has to keep reading to the newline to find it. There's no choice about this. There is no other way to find the newline.
Conceptually, compiling has several phases that are logically prior to parsing:
Scanning.
Screening.
Tokenization.
(1) basically means reading the file character by character from left to right.
(2) means throwing things away of no interest, e.g. collapsing multiple newline/whitespace sequences to a single space.
(3) means combining what's left into tokens, e.g. identifiers, keywords, literals, punctuation.
Comments are screened out during (2). In modern compilers this is all done at once by a deterministic automaton.

Related

Replace // comments by /* comments */ Except in URLs [duplicate]

I need to remove the comment lines from my code.
preg_replace('!//(.*)!', '', $test);
It works fine. But it removes the website url also and left the url like http:
So to avoid this I put the same like preg_replace('![^:]//(.*)!', '', $test);
It's work fine. But the problem is if my code has the line like below
$code = 'something';// comment here
It will replace the comment line with the semicolon. that is after replace my above code would be
$code = 'something'
So it generates error.
I just need to delete the single line comments and the url should remain same.
Please help. Thanks in advance

try this
preg_replace('#(?<!http:)//.*#','',$test);
also read more about PCRE assertions http://cz.php.net/manual/en/regexp.reference.assertions.php

If you want to parse a PHP file, and manipulate the PHP code it contains, the best solution (even if a bit difficult) is to use the Tokenizer : it exists to allow manipulation of PHP code.
Working with regular expressions for such a thing is a bad idea...
For instance, you thought about http:// ; but what about strings that contain // ?
Like this one, for example :
$str = "this is // a test";

This can get complicated fast. There are more uses for // in strings. If you are parsing PHP code, I highly suggest you take a look at the PHP tokenizer. It's specifically designed to parse PHP code.
Question: Why are you trying to strip comments in the first place?
Edit: I see now you are trying to parse JavaScript, not PHP. So, why not use a javascript minifier instead? It will strip comments, whitespace and do a lot more to make your file as small as possible.

Confusion with ereg_replace() Beginning PHP and MySQL Example by W Jason Gilmore

Just a note to begin I am aware that ereg_replace() is deprecated, since POSIX is no longer being used. But in "Beginning PHP and MySQL" by W Jason Gilmore, Gilmore emphasizes that although POSIX isn't to be used, an understanding is still necessary as a means of conversion to Perl. So once again I understand it's deprecated but since I'm trying to understand everything in the book I might as well understand this.
So the example is as follows:
<?php
$text = "This is a link to http://www.example.com/.";
echo ereg_replace("http://([a-zA-Z0-9./-]+)$", "\\0",
$text);
?>
//Output
This is a link to http://www.example.com/..
So I understand the majority of code in the above example, my problem lies with the ./- and the output. For the ./- I tried to think according to quantifiers where . = between, so everything between [:alnum:] and / is replaced. I also thought maybe ./- are characters within the range which would also be replaced since [:alnum:] doesn't include punctuation. For verfication I looked at the output but theres no - present. If only the / is replaced than the code would make sense, since /0 outputs http://www.example.com/ but than the problem lies with the missing - which I presume to be pertinent to the brackets rather than as a quantifier.
My other question is in regards to the output, if the function returns the string with the modified string why does the period which was present in the original string appear after the second /0, not the first, if its the original text, why does the tag follow it and not precede it?
Just for some quick background, I have a basic understanding of php,html,css,javascript,C++ and I'm reading this for a more in depth understanding of php and an introduction to MySQL, so unfortunately explanations which are entirely advanced code/concepts go right over my head.

why does the period which was present in the original string appear after the second /0, not the first
This is not the case, because the actual output is:
This is a link to http://www.example.com/.
The period is included in both the attribute as well as the tag contents.
my problem lies with the ./- and the output
When present inside a character set, ./- means to match either a period, forward slash or a dash; it's important to note that the dash must appear at the end of the character set to avoid ambiguity.

Matching all three kinds of PHP comments with a regular expression

I need to match all three types of comments that PHP might have:
# Single line comment
// Single line comment
/* Multi-line comments */
 
/**
* And all of its possible variations
*/
Something I should mention: I am doing this in order to be able to recognize if a PHP closing tag (?>) is inside a comment or not. If it is then ignore it, and if not then make it count as one. This is going to be used inside an XML document in order to improve Sublime Text's recognition of the closing tag (because it's driving me nuts!). I tried to achieve this a couple of hours, but I wasn't able. How can I translate for it to work with XML?
So if you could also include the if-then-else login I would really appreciate it. BTW, I really need it to be in pure regular expression expression, no language features or anything. :)
Like Eicon reminded me, I need all of them to be able to match at the start of the line, or at the end of a piece of code, so I also need the following with all of them:
<?php
echo 'something'; # this is a comment
?>

Parsing a programming language seems too much for regexes to do. You should probably look for a PHP parser.
But these would be the regexes you are looking for. I assume for all of them that you use the DOTALL or SINGLELINE option (although the first two would work without it as well):
~#[^\r\n]*~
~//[^\r\n]*~
~/\*.*?\*/~s
Note that any of these will cause problems, if the comment-delimiting characters appear in a string or somewhere else, where they do not actually open a comment.
You can also combine all of these into one regex:
~(?:#|//)[^\r\n]*|/\*.*?\*/~s
If you use some tool or language that does not require delimiters (like Java or C#), remove those ~. In this case you will also have to apply the DOTALL option differently. But without knowing where you are going to use this, I cannot tell you how.
If you cannot/do not want to set the DOTALL option, this would be equivalent (I also left out the delimiters to give an example):
(?:#|//)[^\r\n]*|/\*[\s\S]*?\*/
See here for a working demo.
Now if you also want to capture the contents of the comments in a group, then you could do this
(?|(?:#|//)([^\r\n]*)|/\*([\s\S]*?)\*/)
Regardless of the type of comment, the comments content (without the syntax delimiters) will be found in capture 1.
Another working demo.

Single-line comments
singleLineComment = /'[^']*'|"[^"]*"|((?:#|\/\/).*$)/gm
With this regex you have to replace (or remove) everything that was captured by ((?:#|\/\/).*$). This regex will ignore contents of strings that would look like comments (e.g. $x = "You are the #1"; or $y = "You can start comments with // or # in PHP, but I'm a code string";)
Multiline comments
multilineComment = /^\s*\/\*\*?[^!][.\s\t\S\n\r]*?\*\//gm

(php) regexto remove comments but ignore occurances within strings

I am writing a comment-stripper and trying to accommodate for all needs here. I have the below stack of code which removes pretty much all comments, but it actually goes too far. A lot of time was spent trying and testing and researching the regex patterns to match, but I don't claim that they are the best at each.
My problem is that I also have situation where I have 'PHP comments' (that aren't really comments' in standard code, or even in PHP strings, that I don't actually want to have removed.
Example:
<?php $Var = "Blah blah //this must not comment"; // this must comment. ?>
What ends up happening is that it strips out religiously, which is fine, but it leaves certain problems:
<?php $Var = "Blah blah ?>
Also:
will also cause problems, as the comment removes the rest of the line, including the ending ?>
See the problem? So this is what I need...
Comment characters within '' or "" need to be ignored
PHP Comments on the same line, that use double-slashes, should remove perhaps only the comment itself, or should remove the entire php codeblock.
Here's the patterns I use at the moment, feel free to tell me if there's improvement I can make in my existing patterns? :)
$CompressedData = $OriginalData;
$CompressedData = preg_replace('!/\*.*?\*/!s', '', $CompressedData); // removes /* comments */
$CompressedData = preg_replace('!//.*?\n!', '', $CompressedData); // removes //comments
$CompressedData = preg_replace('!#.*?\n!', '', $CompressedData); // removes # comments
$CompressedData = preg_replace('/<!--(.*?)-->/', '', $CompressedData); // removes HTML comments
Any help that you can give me would be greatly appreciated! :)

If you want to parse PHP, you can use token_get_all to get the tokens of a given PHP code. Then you just need to iterate the tokens, remove the comment tokens and put the rest back together.
But you would need a separate procedure for the HTML comments, preferably a real parser too (like DOMDocument provides with DOMDocument::loadHTML).

You should first think carefully whether you actually want to do this. Though what you're doing may seem simple, in the worst case scenario, it becomes extremely complex problem (to solve with just few regular expressions). Let me just illustrate just of the few problems you would be facing when trying to strip both HTML and PHP comments from a file.
You can't straight out strip HTML comments, because you may have PHP inside the HTML comments, like:

You can't just simply separately deal with stuff inside the <?php and ?> tags either, since the ending thag ?> can be inside strings or even comments, like:
<?php /* ?> This is still a PHP comment <?php */ ?>
Let's not forget, that ?> actually ends the PHP, if it's preceded by one line comment. For example:
<?php // ?> This is not a PHP comment <?php ?>
Of course, like you already illustrated, there will be plenty of problems with comment indicators inside strings. Parsing out strings to ignore them isn't that simple either, since you have to remember that quotes can be escaped. Like:
<?php
$foo = ' /* // None of these start a comment ';
$bar = ' \' // Remember escaped quotes ';
$orz = " ' \" \' /* // Still not a comment ";
?>
Parsing order will also cause you headache. You can't just simply choose to parse either the one line comments first or the multi line comments first. They both have to be parsed at the same time (i.e. in the order they appear in the document). Otherwise you may end up with broken code. Let me illustrate:
<?php
/* // Multiline comment */
// /* Single Line comment
$omg = 'This is not in a comment */';
?>
If you parse multi line comments first, the second /* will eat up part of the string destroying the code. If you parse the single line comments first, you will end up eating the first */, which will also destroy the code.
As you can see, there are many complex scenarios you'd have to account, if you intend to solve your problem with regular expression. The only correct solution is to use some sort of PHP parser, like token_get_all(), to tokenize the entire source code and strip the comment tokens and rebuild the file. Which, I'm afraid, isn't entirely simple either. It also won't help with HTML comments, since the HTML is left untouched. You can't use XML parsers to get the HTML comments either, because the HTML is rarely well formed with PHP.
To put it short, the idea of what you're doing is simple, but the actual implementation is much harder than it seems. Thus, I would recommend trying to avoid doing this, unless you have a very good reason to do it.

One way to do this in REGEX is to use one compound expression and preg_replace_callback.
I was going to post a poor example but the best place to look is at the source code to the PHP port of Dean Edwards' JS packer script - you should see the general idea.
http://joliclic.free.fr/php/javascript-packer/en/

try this
private function removeComments( $content ){
$content = preg_replace( "!/\*.*?\*/!s" , '', $content );
$content = preg_replace( "/\n\s*\n/" , "\n", $content );
$content = preg_replace( '#^\s*//.+$#m' , "", $content );
$content = preg_replace( '![\s\t]//.*?\n!' , "\n", $content );
$content = preg_replace( '/<\!--.*-->/' , "\n", $content );
return $content;
}

Replace all "\" characters which are not inside "<code>" tags

First things first: Neither this, this, this nor this answered my question. So I'll open a new one.
Please read
Okay okay. I know that regexes are not the way to parse general HTML. Please take note that the created documents are written using a limited, controlled HTML subset. And people writing the docs know what they're doing. They are all IT professionals!
Given the controlled syntax it is possible to parse the documents I have here using regexes.
I am not trying to download arbitrary documents from the web and parse them!
And if the parsing does fail, the document is edited, so it'll parse. The problem I am addressing here is more general than that (i.e. not replace patterns inside two other patterns).
A little bit of background (you can skip this...)
In our office we are supposed to "pretty print" our documentation. Hence why some came up with putting it all into Word documents. So far we're thankfully not quite there yet. And, if I get this done, we might not need to.
The current state (... and this)
The main part of the docs are stored in a TikiWiki database. I've created a daft PHP script which converts the documents from HTML (via LaTeX) to PDF. One of the must have features of the selected Wiki-System was a WYSIWYG editor. Which, as expected leaves us with documents with a less then formal DOM.
Consequently, I am transliterating the document using "simple" regexes. It all works (mostly) fine so far, but I encountered one problem I haven't figured out on my own yet.
The problem
Some special characters need to replaced by LaTeX markup. For exaple, the \ character should be replaced by $\backslash$ (unless someone knows another solution?).
Except while in a verbatim block!
I do replace <code> tags with verbatim sections. But if this code block contains backslashes (as is the case for Windows folder names), the script still replaces these backslashes.
I reckon I could solve this using negative LookBehinds and/or LookAheads. But my attempts did not work.
Granted, I would be better off with a real parser. In fact, it is something on my "in-brain-roadmap", but it is currently out of the scope. The script works well enough for our limited knowledge domain. Creating a parser would require me to start pretty much from scratch.
My attempt
Example Input
The Hello \ World document is located in:
<code>C:\documents\hello_world.txt</code>
Expected output
The Hello $\backslash$ World document is located in:
\begin{verbatim}C:\documents\hello_world.txt\end{verbatim}
This is the best I could come up with so far:
<?php
$patterns = array(
"special_chars2" => array( '/(?<!<code[^>]*>.*)\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);
foreach( $patterns as $name => $p ){
$tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>
Note that this is only an excerpt, and the [^$] is another LaTeX requirement.
Another attempt which seemed to work:
<?php
$patterns = array(
"special_chars2" => array( '/\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);
foreach( $patterns as $name => $p ){
$tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>
... in other words: leaving out the negative lookbehind.
But this looks more error-prone than with both lookbehind and lookahead.
A related question
As you may have noticed, the pattern is ungreedy (/.../U). So will this match only as little possible inside a <code> block? Considering the look-arounds?

If me, I will try to find HTML parser and will do with that.
Another option is will try to chunk the string into <code>.*?</code> and other parts.
and will update other parts, and will recombine it.
$x="The Hello \ World document is located in:\n<br>
<code>C:\documents\hello_world.txt</code>";
$r=preg_split("/(<code>.*?<\/code>)/", $x,-1,PREG_SPLIT_DELIM_CAPTURE);
for($i=0;$i<count($r);$i+=2)
$r[$i]=str_replace("\\","$\\backslash$",$r[$i]);
$x=implode($r);
echo $x;
Here is the results.
The Hello $\backslash$ World document is located in:
C:\documents\hello_world.txt
Sorry, If my approach is not suitable for you.

I reckon I could solve this using negative LookBehinds and/or LookAheads.
You reckon wrong. Regular expressions are not a replacement for a parser.
I would suggest that you pipe the html through htmltidy, then read it with a dom-parser and then transform the dom to your target output format. Is there anything preventing your from taking this route?

Parser FTW, ok. But if you can't use a parser, and you can be certain that <code> tags are never nested, you could try the following:
Find <code>.*?</code> sections of your file (probably need to turn on dot-matches-newlines mode).
Replace all backslashes inside that section with something unique like #?#?#?#
Replace the section found in 1 with that new section
Replace all backslashes with $\backslash$
Replace als <code> with \begin{verbatim} and all </code> with \end{verbatim}
Replace #?#?#?# with \
FYI, regexes in PHP don't support variable-length lookbehind. So that makes this conditional matching between two boundaries difficult.

Pandoc? Pandoc converts between a bunch of formats. you can also concatenate a bunch of flies together then covert them. Maybe a few shell scripts combined with your php scraping scripts?
With your "expected input" and the command pandoc -o text.tex test.html the output is:
The Hello \textbackslash{} World document is located in:
\verb!C:\documents\hello_world.txt!
pandoc can read from stdin, write to stdout or pipe right into a file.

Provided that your <code> blocks are not nested, this regex would find a backslash after ^ start-of-string or </code> with no <code> in between.
((?:^|</code>)(?:(?!<code>).)+?)\\
| | |
| | \-- backslash
| \-- least amount of anything not followed by <code>
\-- start-of-string or </code>
And replace it with:
$1$\backslash$
You'd have to run this regex in "singleline" mode, so . matches newlines. You'd also have to run it multiple times, specifying global replacement is not enough. Each replacement will only replace the first eligible backslash after start-of-string or </code>.

Write a parser based on an HTML or XML parser like DOMDocument. Traverse the parsed DOM and replace the \ on every text node that is not a descendent of a code node with $\backslash$ and every node that is a code node with \begin{verbatim} … \end{verbatim}.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.