RegExp Capture literals - php

I need a way to strip all literals from PHP files. My current regexp solution works fine when there are no nested quotes in the string. I tried updating it to handle escaped quotes as well, which worked in most cases, except when the string contains escaped escape characters.
To be done correctly, it should be able to handle all of these:
"text"
"\"text\""
"\\"
"\"\\\""
So as I see it, it needs to handle cases with an even number of escape characters as well as cases with an odd number. But how do you express this in a regexp?
Update
I want to clean up PHP files to make them easier to search through and index different parts, something for a small project that I am playing with. Since literals can contain mostly anything, they can also contain data similar to some of the searches. So I want to remove anything in the files that is wrapped in " or '.
"/\"[^\"]*\"/"
This will work unless there is a nested quote "\"data\"".
"/\"(\\\\\"|[^\"])*\"/"
This will work unless there is "\\"
This is what I need
$var = "...";
Becomes
$var = ;

You could use this regular expression based substitution:
Find: ((?<!\\)(?:\\.)*)(["'])(?:\\.|(?!\2).)*?\2
Replace: $1
Note that if you are going to use this regular expression in PHP (where you encode it as a string literal) you need to escape the backslashes and quote in that regular expression, so like this:
preg_replace("~((?<!\\\\)(?:\\\\.)*)([\"'])(?:\\\\.|(?!\\2).)*?\\2~s", "$1", $input);
As PHP string literals can span multiple lines, the s modifier is added so that . matches newline characters also.
See it run on eval.in
NB: You'll need to think about heredoc notation also...
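As a minimal sketch of how that substitution could be wrapped up: the function name stripLiterals is hypothetical, and the pattern is written in a single-quoted PHP string here, so only the backslashes and the single quote need escaping. Heredoc and nowdoc strings are not handled.

```php
<?php
// Hypothetical helper; the pattern is the one given above.
function stripLiterals(string $code): string
{
    // In a single-quoted PHP string only \\ and \' are escape sequences,
    // so \\\\ reaches the regex engine as \\ (one literal backslash).
    $pattern = '~((?<!\\\\)(?:\\\\.)*)(["\'])(?:\\\\.|(?!\\2).)*?\\2~s';
    // $1 puts back any backslash-escaped characters that were matched
    // just before the opening quote.
    return preg_replace($pattern, '$1', $code);
}

// Build the inputs with chr(92) (a backslash) to keep them readable.
$bs = chr(92);
$samples = [
    '$a = "text";',
    '$b = "' . $bs . '"text' . $bs . '"";',   // "\"text\""
    '$c = "' . $bs . $bs . '";',              // "\\"
    "\$d = 'it" . $bs . "'s';",               // 'it\'s'
];
foreach ($samples as $line) {
    echo stripLiterals($line), "\n";
}
```

Each sample line should come out with the literal removed, for example $a = ; for the first one.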

Related

How to use dot as in punctuation and not append in PHP

I'm putting a SQL query together in PHP. How do I declare a dot in punctuation?
Example code as requested:
$sql="SELECT COUNT(*) FROM Table1 WHERE LOWER(location2) REGEXP '.* .$location .*'";
Note that .* is part of the regexp and should not be interpreted by PHP as concatenation.
This has nothing to do with PHP syntax. Your example contains a . inside a quoted string, which PHP interprets as a . inside a quoted string. Therefore nothing is wrong there.
What you're probably experiencing is MySQL treating the . as a wild-card operator in a regular expression. In regular expression syntax (whether in MySQL, PHP, Perl, wherever) . is a wild-card that matches any single character. If you want to include a literal . in your regex, you need to escape it, i.e. \..
Because you are using it inside a string inside a string, you also need to escape the \ character so that it makes it through to the regex correctly. Without testing I would say it needs escaping twice (once for PHP and once for MySQL), e.g. "'\\\\.'" in PHP becomes '\\.' in MySQL, becomes \. in the regular expression.
(Obviously, only escape the . characters that are meant to be treated literally - I would assume the .* is meant to match any character - these should not be changed.)
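To see the first layer of escaping, one can print what PHP actually hands to MySQL. This is only a small sketch: the variable name is made up for illustration, and the second layer assumes MySQL's default handling of backslashes in string literals.

```php
<?php
// PHP double-quoted string: each \\ collapses to one backslash,
// so four backslashes in the source become two in the value.
$sqlFragment = "'\\\\.'";

// MySQL receives the SQL string literal printed below; its own string
// parsing then turns the two backslashes into one, and the regular
// expression engine finally sees a backslash followed by a dot.
echo $sqlFragment, "\n";

// Five characters: quote, backslash, backslash, dot, quote.
echo strlen($sqlFragment), "\n";
```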

Regex to match characters that must be escaped in a PHP regex

I've had a look at this question, which shows what characters need to be escaped. However, I'm having a lot of trouble constructing a regex that will match any instance of one of those characters in a string.
For some background on the problem, I'm implementing a simple word-for-word (or term-for-term if you prefer) translation database where users enter language pairs, and can then trigger translations on blocks of text. The problem comes when users enter strings like "Yes/No". So, in PHP, I need to escape the string to be matched, and place it like this:
"/\b".$target."\b/"
So, what do I need to be looking at in terms of a preg_replace?
You want to use preg_quote(). As the documentation clearly states:
preg_quote() takes str and puts a backslash in front of every character that is part of the regular expression syntax. This is useful if you have a run-time string that you need to match in some text and the string may contain special regex characters.
Alternatively, wrap the string in \Q ... \E: whatever is between \Q and \E is treated as normal characters, not regular expression characters.
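For the "Yes/No" case from the question, a sketch with preg_quote() might look like this. The function name translateTerm and the sample strings are hypothetical; passing '/' as the second argument to preg_quote() makes it escape the delimiter as well.

```php
<?php
// Hypothetical helper for the translation use case in the question.
function translateTerm(string $text, string $target, string $replacement): string
{
    // The second argument also escapes the delimiter used around the pattern.
    $pattern = '/\b' . preg_quote($target, '/') . '\b/';
    return preg_replace($pattern, $replacement, $text);
}

echo preg_quote('Yes/No', '/'), "\n";
echo translateTerm('Answer Yes/No please', 'Yes/No', 'Oui/Non'), "\n";
```

Note that preg_replace() reads $ and backslash followed by digits in the replacement as references to captured groups, so a user-supplied replacement may need its own escaping.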

Preg Patterns, to ignore escaped characters

I want to create a RegEx that finds strings that begin and end in single or double quotes.
For example I can match such a case like this:
String: "Hello World"
RegEx: /[\"\'][^\"\']+[\"\']/
However, the problem occurs when quotes appear in the string itself like so:
String: "Hello" World"
We know the above expression will not work.
What I want is to allow the quote to be escaped within the string itself, since that functionality will be required anyway:
String: "Hello\" World"
Now I could come up with a long and complicated expression with various patterns in a group, one of them being:
RegEx: /[\"\'][^\"\']+(\\\"|\\\')+[^\"\']+[\"\']/
However that to me seems excessive, and I think there may be a shorter and more elegant solution.
Intended syntax:
run arg1 "arg1" "arg3 with \"" "\"arg4" "arg\"\"5"
As you can see, the quotes are really only used to make sure that strings with spaces are counted as a single string. Do not worry about arg1, I should be able to match unquoted arguments.
I will make this easier: arguments can only be quoted using double quotes. So I've taken single quotes out of the requirements of this question.
I have modified Rui Jarimba's example:
/(?<=")(\\")*([^"]+((\\(\"))*[^"])+)((\\"")|")/
This now accounts pretty well for most cases, however there is one final case that can defeat this:
run -a "arg3 \" p2" "\"sa\"mple\"\\"
The second argument ends with \\", which is the conventional way here to allow a backslash at the end of a quoted string. Unfortunately, the regex thinks this is an escaped quote, since the sequence \" still appears at the end of the argument.
Firstly, please use ' strings to write your regexes. That saves you a lot of escaping.
Then I see two possibilities. The problem with your attempt is, it allows only consecutive escaped quotes in one place in the string. Also, this allows the use of different quotes at the beginning and the end. You could use a backreference to get around that. So this would be a) slightly more elegant and b) correct:
$pattern = '/(["\'])(\\\\"|\\\\\'|[^"\'])+\1/';
Note that the order of the alternation is important!
The problem with this is, you don't want to escape the quote that you don't use to delimit the string. Therefore, the other possibility is to use lookarounds (since backreferences cannot be used inside character classes):
$pattern = '/(["\'])(?:(?!\1).|(?<=\\\\)\1)+\1/';
Note that four consecutive backslashes are always necessary to match a single literal backslash. That is because in the actual string $pattern they end up as \\ and then the regex engine "uses" the first one to escape the second one.
This matches either an arbitrary character that is not the starting quote, or the starting quote itself if the previous character was a backslash.
This by the way is equivalent to:
$pattern = '/(["\'])(?:\\\\\1|(?!\1).)+\1/';
But here you have to write the alternation in this order again.
One final note. You can avoid the backreference by providing the two possible strings separately (single and double quoted strings):
$pattern = '/"(?:\\\\"|[^"])+"|\'(?:\\\\\'|[^\'])+\'/';
But you said you were looking for something short and elegant ;) (although, this last one might be more efficient... but you'd have to profile that).
Note that all my regexes leave one case unconsidered: escaped quotes outside of quoted strings. For example, Hello \" World "Hello" World will give you " World ". You can avoid this using another negative lookbehind (using as an example the second regex for which I provided a working demo; it would work the same for all others):
$pattern = '/(?<!\\\\)(["\'])(?:\\\\\1|(?!\1).)+\1/';
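The patterns above treat a backslash only as something that may precede a quote, so the last case in the question, an argument ending in an escaped backslash (\\"), is not covered. A commonly used alternative, which is not one of the patterns from this answer, consumes every backslash together with the character that follows it. A sketch for double-quoted arguments only, with a hypothetical function name:

```php
<?php
// Hypothetical helper: returns all double-quoted arguments in a command line.
function quotedArgs(string $line): array
{
    // Regex seen by PCRE: "(?:[^"\\]|\\.)*"
    // Either a character that is neither a quote nor a backslash,
    // or a backslash plus whatever character follows it.
    preg_match_all('/"(?:[^"\\\\]|\\\\.)*"/s', $line, $m);
    return $m[0];
}

$bs = chr(92); // one backslash
// Builds the line: run -a "arg3 \" p2" "\"sa\"mple\"\\"
$line = 'run -a "arg3 ' . $bs . '" p2" "' . $bs . '"sa' . $bs . '"mple'
      . $bs . '"' . $bs . $bs . '"';
print_r(quotedArgs($line));
```

Because a backslash is never matched on its own, an even run of backslashes before a quote leaves that quote free to close the string, while an odd run escapes it.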
Try this regex:
['"]([^'"]+((\\(\"|'))*[^'"])+)['"]
Given the following string:
"Hello" World 'match 2' "wqwqwqwq wwqwqqwqw" no match here oopop "Hello \" World"
It will match
"Hello"
'match 2'
"wqwqwqwq wwqwqqwqw"
"Hello \" World"

RegEx: Look-behind to avoid odd number of consecutive backslashes

I have user input where some tags are allowed inside square brackets. I've already written the regex pattern to find and validate what's inside the brackets.
In the user input field, an opening bracket ([) can be escaped with a backslash, and a backslash can be escaped with another backslash (\\). I need a look-behind sub-pattern that rejects an odd number of consecutive backslashes before the opening bracket.
At the moment I must deal with something like this:
(?<!\\)(?:\\\\)*\[(?<inside_brackets>.*?)]
It works fine, but the problem is that the match still includes any pairs of consecutive backslashes in front of the bracket (even though they are not captured), and the look-behind only checks whether one more backslash precedes those pairs (or the opening bracket directly). I would like to handle all of them inside the look-behind group, if possible.
Example:
my [test] string is ok
my \[test] string is wrong
my \\[test] string is ok
my \\\[test] string is wrong
my \\\\[test] string is ok
my \\\\\[test] string is wrong
...
etc
I work with PHP PCRE
Last time I checked, PHP did not support variable-length lookbehinds. That is why you cannot use the trivial solution (?<![^\\](?:\\\\)*\\).
The simplest workaround would be to simply match the entire thing, not just the brackets part:
(?<!\\)((?:\\\\)*)\[(?<inside_brackets>.*?)]
The difference is that now, if you're using that regex in a preg_replace, you have to remember to prefix the replacement string with $1 to restore the backslashes that were matched.
You could do it without any look-behinds at all (the (\\\\|[^\\]) alternation eats anything but a single back-slash):
^(\\\\|[^\\])*\[(?<brackets>.*?)\]
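The first workaround can be checked against the examples from the question. A minimal sketch, assuming the pattern is written in a single-quoted PHP string; the function name is hypothetical:

```php
<?php
// Hypothetical helper: true when the text contains a bracket tag whose
// opening bracket is preceded by an even number of backslashes.
function hasUnescapedTag(string $input): bool
{
    // Regex seen by PCRE: (?<!\\)((?:\\\\)*)\[(?<inside_brackets>.*?)]
    $pattern = '/(?<!\\\\)((?:\\\\\\\\)*)\\[(?<inside_brackets>.*?)]/';
    return preg_match($pattern, $input) === 1;
}

// Try zero to five backslashes in front of the opening bracket.
for ($n = 0; $n <= 5; $n++) {
    $input = 'my ' . str_repeat(chr(92), $n) . '[test] string';
    echo $n, ' backslash(es): ', hasUnescapedTag($input) ? 'ok' : 'wrong', "\n";
}
```

The even counts report ok and the odd counts report wrong, matching the list in the question.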

Why don't reg expressions from regexlib.com work in PHP?

I found a regex on http://regexlib.com/REDetails.aspx?regexp_id=73
It's for matching a telephone number with international code like so:
^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$
When I use it with PHP's preg_match, the expression fails. Why is that?
You need to surround it with / delimiters:
preg_match('/^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$/', $phoneNumber)
And make sure you don't leave out the backslashes (\).
Because preg_match expects the regex to be delimited, usually with slashes (but, as correctly noted below, other characters are possible as long as they are matched):
preg_match('/^(\(?\+?[0-9]*\)?)?[0-9_ ()-]*$/', $subject)
Apart from that, the original regex was copied wrong - several characters were unescaped. The original on regexlib has a few warts, too (some characters were escaped needlessly).
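Put together, a delimited version of the pattern from the question could be used as follows. This is only a sketch: the function name and the sample inputs are made up for illustration.

```php
<?php
// Hypothetical wrapper around the pattern from the question.
function looksLikePhoneNumber(string $phoneNumber): bool
{
    // The leading and trailing "/" are delimiters, not part of the regex.
    $pattern = '/^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$/';
    return preg_match($pattern, $phoneNumber) === 1;
}

var_dump(looksLikePhoneNumber('+44 (0)20 7946 0958'));
var_dump(looksLikePhoneNumber('not a number'));
```

Any other matching pair of non-alphanumeric delimiters, such as ~ or #, would work the same way.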
