Matching all three kinds of PHP comments with a regular expression

Matching all three kinds of PHP comments with a regular expression - php

I need to match all three types of comments that PHP might have:
# Single line comment
// Single line comment
/* Multi-line comments */
 
/**
* And all of its possible variations
*/
Something I should mention: I am doing this in order to be able to recognize if a PHP closing tag (?>) is inside a comment or not. If it is then ignore it, and if not then make it count as one. This is going to be used inside an XML document in order to improve Sublime Text's recognition of the closing tag (because it's driving me nuts!). I tried to achieve this a couple of hours, but I wasn't able. How can I translate for it to work with XML?
So if you could also include the if-then-else login I would really appreciate it. BTW, I really need it to be in pure regular expression expression, no language features or anything. :)
Like Eicon reminded me, I need all of them to be able to match at the start of the line, or at the end of a piece of code, so I also need the following with all of them:
<?php
echo 'something'; # this is a comment
?>

Parsing a programming language seems too much for regexes to do. You should probably look for a PHP parser.
But these would be the regexes you are looking for. I assume for all of them that you use the DOTALL or SINGLELINE option (although the first two would work without it as well):
~#[^\r\n]*~
~//[^\r\n]*~
~/\*.*?\*/~s
Note that any of these will cause problems, if the comment-delimiting characters appear in a string or somewhere else, where they do not actually open a comment.
You can also combine all of these into one regex:
~(?:#|//)[^\r\n]*|/\*.*?\*/~s
If you use some tool or language that does not require delimiters (like Java or C#), remove those ~. In this case you will also have to apply the DOTALL option differently. But without knowing where you are going to use this, I cannot tell you how.
If you cannot/do not want to set the DOTALL option, this would be equivalent (I also left out the delimiters to give an example):
(?:#|//)[^\r\n]*|/\*[\s\S]*?\*/
See here for a working demo.
Now if you also want to capture the contents of the comments in a group, then you could do this
(?|(?:#|//)([^\r\n]*)|/\*([\s\S]*?)\*/)
Regardless of the type of comment, the comments content (without the syntax delimiters) will be found in capture 1.
Another working demo.

Single-line comments
singleLineComment = /'[^']*'|"[^"]*"|((?:#|\/\/).*$)/gm
With this regex you have to replace (or remove) everything that was captured by ((?:#|\/\/).*$). This regex will ignore contents of strings that would look like comments (e.g. $x = "You are the #1"; or $y = "You can start comments with // or # in PHP, but I'm a code string";)
Multiline comments
multilineComment = /^\s*\/\*\*?[^!][.\s\t\S\n\r]*?\*\//gm

Related

Custom php Hide programmer's comments from public view

When we create a html page comments like
<!-- Comment 1 -->
or inside php
// Comment2
are obvious from a right click of the page - Show code
How can i prevent that ?

Hiding the comments inside is the answer.. Thanks everyone.

Html comments will show on the HTML page but as long as you include your PHP comments in the <?php tag they won't show to the user
To leave html comments are normal if you check amazon.com's code you will see all the html comments but none php or whatever server lang they use so don't worry about html comments just don't include stupid stuff like your admin password or some revealing database schema stuff in the html comments.
if you still want to remove all the comments even html(vscode):
Easy way:
Open extensions (ctrl-shift-x) * type in remove comments in the
search box. * Install the top pick and read instructions.
Hard way: * search replace(ctrl-h) * toggle regex on (alt-r). * Learn some regular expressions! https://docs.rs/regex/0.2.5/regex/#syntax
A simple //.* will match all single line comments (and more ;D).
#.* could be used to match python comments. And /\*[\s\S\n]*\*/
matches block comments. And you can combine them as well:
//.*|/\*[\s\S\n]*\*/ (| in regex means "or", . means any
character, * means "0 or more" and indicates how many characters to
match, therefore .* means all characters until the end of the line
(or until the next matching rule))
Of course with caveats, such as urls (https://...) has double
slashes and will match that first rule, and god knows where there are
# in code that will match that python-rule. So some
reading/adjusting has to be done!
Once you start fiddling with your regexes it can take a lifetime to
get them perfect, so be careful and go the easy route if you are short
on time, but knowing some simple regex by heart will do you good,
since regular expressions are usable almost everywhere.
From https://stackoverflow.com/a/50575194/17239314

Make certain part of string bold using preg_replace()

I have a string of a comment that looks like this:
**#First Last** rest of comment info.
What I need to do is replace the ** with <b> and </b> so that the name is bold.
I currently have this:
$result = preg_replace('/\*\*/i', '<b>$0</b>', $comment);
But that results in:
"<b>**</b>#First Last<b>**</b> rest of comment info"
Which is obviously not what I need. Does anyone have any suggestions on how to achieve this?

A better idea would be to use a markdown parser library which does this for you.
Improper use of the regular expressions can lead to security vulnerabilities with unclosed tags etc..
However, a simple approach could be this:
$result = preg_replace('/\*\*(.+)\*\*/sU', '<b>$1</b>', $comment);
The U means ungreedy and takes the first possible match.
The s modifies . to be any character, including newline.
(see modifiers page)
This also works with an example text like this:
**#First Last** rest of comment info **also bold**.
The next line also **has a bold example**. This spans **two
lines**.
I am sure there are counter examples which are broken with this. Again, please use a proper markdown library to do this for you (or copy their implementation if the LICENSE allows it).

Recursive Regex in PHP with variable names

I try to make bbcode-ish engine for me website. But the thing is, it is not clear which codes are available, because the codes are made by the users. And on top of that, the whole thing has to be recursive.
For example:
Hello my name is [name user-id="1"]
I [bold]really[/bold] like cheeseburgers
These are the easy ones and i achieved making it work.
Now the problem is, what happens, when two of those codes are behind each other:
I [bold]really[/bold] like [bold]cheeseburgers[/bold]
Or inside each other
I [bold]really like [italic]cheeseburgers[/italic][/bold]
These codes can also have attributes
I [bold strengh="600"]really like [text font-size="24px"]cheeseburgers[/text][bold]
The following one worked quite well, but lacks in the recursive part (?R)
(?P<code>\[(?P<code_open>\w+)\s?(?P<attributes>[a-zA-Z-0-1-_=" .]*?)](?:(?P<content>.*?)\[\/(?P<code_close>\w+)\])?)
I just dont know where to put the (?R) recursive tag.
Also the system has to know that in this string here
I [bold]really like [italic]cheeseburgers[/italic][/bold] and [bold]football[/bold]
are 2 "code-objects":
1. [bold]really like [italic]cheeseburgers[/italic][/bold]
and
2. [bold]football[/bold]
... and the content of the first one is
really like [italic]cheeseburgers[/italic]
which again has a code in it
[italic]cheeseburgers[/italic]
which content is
cheeseburgers
I searched the web for two days now and i cant figure it out.
I thought of something like this:
Look for something like [**** attr="foo"] where the attributes are optional and store it in a capturing group
Look up wether there is a closing tag somewhere (can be optional too)
If a closing tag exists, everything between the two tags should be stored as a "content"-capturing group - which then has to go through the same procedure again.
I hope there are some regex specialist which are willing to help me. :(
Thank you!
EDIT
As this might be difficult to understand, here is an input and an expected output:
Input:
[heading icon="rocket"]I'm a cool heading[/heading][textrow][text]<p>Hi!</p>[/text][/textrow]
I'd like to have an array like
array[0][name] = heading
array[0][attributes][icon] = rocket
array[0][content] = I'm a cool heading
array[1][name] = textrow
array[1][content] = [text]<p>Hi!</p>[/text]
array[1][0][name] = text
array[1][0][content] = <p>Hi!</p>

Having written multiple BBCode parsing systems, I can suggest NOT using regexes only. Instead, you should actually parse the text.
How you do this is up to you, but as a general idea you would want to use something like strpos to locate the first [ in your string, then check what comes after it to see if it looks like a BBCode tag and process it if so. Then, search for [ again starting from where you ended up.
This has certain advantages, such as being able to examine each code and skip it if it's invalid, as well as enforcing proper tag closing order ([bold][italic]Nesting![/bold][/italic] should be considered invalid) and being able to provide meaningful error messages to the user if something is wrong (invalid parameter, perhaps) because the parser knows exactly what is going on, whereas a regex would output something unexpected and potentially harmful.
It might be more work (or less, depending on your skill with regex), but it's worth it.

Help with Regex in PHP

Let's assume I do preg_replace as follows:
preg_replace ("/<my_tag>(.*)<\/my_tag>/U", "<my_new_tag>$1</my_new_tag>", $sourse);
That works but I do also want to grab the attribute of the my_tag - how would I do it with this:
<my_tag my_attribute_that_know_the_name_of="some_value">tra-la-la</my_tag>

You don't use regex. You use a real parser, because this stuff cannot be parsed with regular expressions. You'll never know if you've got all the corner cases quite right and then your regex has turned into a giant bloated monster and you'll wish you'd just taken fredley's advice and used a real parser.
For a humourous take, see this famous post.

preg_replace('#<my_tag\b([^>]*)>(.*?)</my_tag>#',
'<my_new_tag$1>$2</my_new_tag>', $source)
The ([^>]*) captures anything after the tag name and before the closing >. Of course, > is legal inside HTML attribute values, so watch out for that (but I've never seen it in the wild). The \b prevents matches of tag names that happen to start with my_tag, preventing bogus matches like this:
<my_tag_xyz>ooga-booga</my_tag_xyz><my_tag>tra-la-la</my_tag>
But that will still break on <my_tag> elements wrapped in other <my_tag> elements, yielding results like this:
<my_tag><my_tag>tra-la-la</my_tag>
If you know you'll never need to match tags with other tags inside them, you can replace the (.*?) with ([^<>]++).
I get pretty tired of the glib "don't do that" answers too, but as you can see, there are good reasons behind them--I could come up with this many more without having to consult any references. When you ask "How do I do this?" with no background or qualification, we have no idea how much of this you already know.

Forget regex's, use this instead:
http://simplehtmldom.sourceforge.net/

WordPress: Problem with the shortcode regex

This is the regular expression used for "shortcodes" in WordPress (one for the whole tag, other for the attributes).
return '(.?)\[('.$tagregexp.')\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)';
$pattern = '/(\w+)\s*=\s*"([^"]*)"(?:\s|$)|(\w+)\s*=\s*\'([^\']*)\'(?:\s|$)|(\w+)\s*=\s*([^\s\'"]+)(?:\s|$)|"([^"]*)"(?:\s|$)|(\S+)(?:\s|$)/';
It parses stuff like
[foo bar="baz"]content[/foo]
or
[foo /]
In the WordPress trac they say it's a bit flawed, but my main problem is that it don't support shortcodes inside the attributes, like in
[foo bar="[baz /]"]content[/foo]
because the regex stops the main shortcode at the first appearance of a closing bracket, so in the example it renders
[foo bar="[baz /]
and
"]content[/foo]
shows as it is.
Is there any way to change the regex so it bypass any occurrence of [ with ] and its content when occurs between the opening tag or self-closing tag?

What is your goal? Even if WordPress’ regex were better, the shortcode would not be executed.

return '(.?)\[('.$tagregexp.')\b((?:"[^"]*"|.)*?)(?:/)?\](?:(.+?)\[\/\2\])?(.?)';
is a variation on the first regex where the bit that matches the attributes has been changed to capture strings completely without regard to what's in them:
(?:"[^"]*"|.)*?
instead of
.*?
Note that it doesn't handle strings with escaped quote characters in them (yet - can be done, but is it necessary?). I haven't changed anything else because I don't know the syntax for WordPress shortcodes.
But it looks like it could have been cleaned up a little by removing unnecessary backslashes and parentheses:
return '(.?)\[(foo)\b((?:"[^"]*"|.)*?)/?\](?:(.+?)\[/\2\])?(.?)';
Perhaps further improvements are warranted. I'm a bit worried about the unprecise dot in the above snippet, and I'd rather use (?:"[^"]*"|[^/\]])* instead of (?:"[^"]*"|.)*?, but I don't know whether that would break something else. Also, I don't know what the leading and trailing (.?) are good for. They don't match anything in your example so I don't know their purpose.

Do you want a drop-in replacement for that regex? This one allows attribute values to contain things that look like tags, as in your example:
'(.?)\[(\w+)\b((?:[^"\'\[\]]++|(?:"[^"]*+")|(?:\'[^\']*+\'))*+)\](?:(?<=(\/)\])|([^\[\]]*+)\[\/\2\])(.?)'
Or, in more readable form:
/(.?) # could be [
\[(\w+)\b # tag name
((?:[^"'\[\]]++ # attributes
|(?:"[^"]*+")
|(?:'[^']*+')
)*+
)\]
(?:(?<=(\/)\]) # '/' if self-closing
|([^\[\]]*+) # ...or content
\[\/\2\] # ...and closing tag
)(.?) # could be ]
/
As I understand it, $tagregexp in the original is an alternation of all the tag names that have been defined; I substituted \w+ for readability. Everything the original regex captures, this one does too, and in the same groups. The only difference is that the / in a self-closing tag is captured in group #3 along with the attributes as well as in its own group (#4).
I don't think the other regex needs to be changed unless you want to add full support for tags embedded in attribute values. That would also mean allowing for escaped quotes in this one, and I don't know how you would want to do that. Doubling them would be my guess; that's how Textpattern does it, and WordPress is supposedly based on that.
This question is a good example of why apps like WordPress shouldn't be implemented with regexes. The only way to add or change functionality is by making the regexes bigger and uglier and even harder to maintain.

I found a way to fix it:
First, change the shortcode regex from:
(.?)\[('.$tagregexp.')\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)
To:
(.?)\[('.$tagregexp.')\b((?:[^\[\]]|(?R)|.)*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)
And then change the priority of the do_shortcode function to avoid conflict with wptexturize, the function that stylize the quotes and mess up this fix. It don't have problems with wpautop because that's somewhat fixed with another recent function I think.
Before:
add_filter('the_content', 'do_shortcode', 11); // AFTER wpautop()
After:
add_filter('the_content', 'do_shortcode', 9);
I submitted this to the trac and is on some kind of permanent hiatus. In the meanwhile I figure if I can make a plugin to apply my fix without changing the core files. Override the filter priority is easy, but I have no idea of how to override the regex.

This would be nice to fix! I do not have sufficient rep to comment, so I am leaving the following related wordpress trac link, maybe it is the same as the one you meant:
http://core.trac.wordpress.org/ticket/14481
I would hope that any fix would allow shortcode syntax like
[shortcode att1="val]ue"]content[/shortcode]
since in 3.0.1 the $content is mis-parsed as ue"]content instead of just content
Update: After spending time learning about regices (regexes?) I made it possible to allow ] and Pascal-style escaped quotes (eg arg='that''s [so] great') in these arguments with 2 changes: first change the (.*?) group in the first regex (get_shortcode_regex) to
((?:[^'"\]]|'[^']*'|"[^"]*")*)
(NB: make sure you escape everything properly in your php code) then in shortcode_parse_atts (the function containing the second regex) change the following (again, change ' to \' if you single-quote $pattern like in the original code)
in $pattern change "([^"]*)" to "((?:[^"]|"")*)"
in $pattern change '([^']*)' to '((?:[^']|'')*)'
$atts[strtolower($m[1])] = preg_replace('_""_', '"', stripcslashes($m[2]));
$atts[strtolower($m[3])] = preg_replace("_''_", "'", stripcslashes($m[4]));
NB again: changes to pattern may rely on greedy nature of matching so if that option's ever changed, the changed bits of $pattern might have to be terminated with something like (?!"), etc

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.