RegEx to convert PHP variables into HTML echo snippets - php

PHP vars can be of the following formats and can contain letters, numbers and underscores:
$var_1
$var_1[key_1]
$var_1['key_1']
$var_1["key_1"]
$var_1[key_1][key_2]
$var_1['key_1']['key_2']
$var_1["key_1"]["key_2"]
$var_1->property_1
$var_1->property_1->property_2
Array and object will never have more than 2 nested elements. Objects won't have methods (i.e. $var_1->method_1() is not needed).
I need a RegEx matching them all, or a minimum amount of several RegExes, that would convert them into HTML echo snippets in the following format:
<?=$1?>
Where $1 is the entire matched string. If possible to add constants to the same RegEx it would be just perfect:
CONST_1 into <?=CONST_1?>

This should do it for the given examples:
\$\w+(?:\[(["']|)\w+\1\]|->\w+){0,2}
Replace it with <?=$0?> (make sure to use 0, because 1 is the first capture and not the entire match). I did not include constants, because I think that is rather tricky (how do you know it's a constant and not a reserved keyword - include all keywords?).
Explanation of the regex:
\$ # literal $
\w+ # letters, digits, underscores
(?: # subpattern to match indexing or a member
\[ # literal [
(["']|) # a ', a " or nothing (capture it in group 1)
\w+ # letters, digits, underscores
\1 # the correct matching closing delimiter
\] # literal ]
| # or
-> # literal ->
\w+ # letters, digits, underscores
){0,2} # end of subpattern, repeat 0 to 2 times
Note that if you use this pattern within PHP, you might have to escape the '.
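As a minimal sketch of how this could be wired into PHP: the pattern and the $0 replacement are the ones from the answer above, while the $template text is a made-up example. The pattern sits in a single-quoted PHP string, which is why the ' inside the character class is escaped.

```php
<?php
// Hypothetical template text, used only for illustration.
$template = 'Hello $user->name, you have $stats["unread"] of $stats[total][all] messages.';

// Single-quoted PHP string: only ' and \ need escaping, so \$ and \w reach PCRE unchanged.
$pattern = '/\$\w+(?:\[(["\']|)\w+\1\]|->\w+){0,2}/';

// $0 is the entire match; $1 would only hold the captured quote character.
$result = preg_replace($pattern, '<?=$0?>', $template);

echo $result, PHP_EOL;
```

Each of the three variables in the sample text should come back wrapped in its own echo tag, with the surrounding words left untouched.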

Related

explanation of preg_replace function in PHP [duplicate]

The preg_replace() function has so many possible values, like:
<?php
$patterns = array('/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/', '/^\s*{(\w+)}\s*=/');
$replace = array('\3/\4/\1\2', '$\1 =');
echo preg_replace($patterns, $replace, '{startDate} = 1999-5-27');
What does:
\3/\4/\1\2
And:
/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/','/^\s*{(\w+)}\s*=/
mean?
Is there any information available to help understand the meanings at one place? Any help or documents would be appreciated! Thanks in Advance.
Take a look at http://www.tutorialspoint.com/php/php_regular_expression.htm
\3 is the captured group 3
\4 is the captured group 4
...and so on...
\w means any word character.
\d means any digit.
\s means any white space.
+ means match the preceding pattern one or more times.
* means match the preceding pattern 0 times or more.
{n,m} means match the preceding pattern at least n times to m times max.
{n} means match the preceding pattern exactly n times.
{n,} means match the preceding pattern at least n times or more.
(...) is a captured group.
So, the first thing to point out, is that we have an array of patterns ($patterns), and an array of replacements ($replace). Let's take each pattern and replacement and break it down:
Pattern:
/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/
Replacement:
\3/\4/\1\2
This takes a date and converts it from a YYYY-M-D format to a M/D/YYYY format. Let's break down its components:
/ ... / # The starting and trailing slash mark the beginning and end of the expression.
(19|20) # Matches either 19 or 20, capturing the result as \1.
# \1 will be 19 or 20.
(\d{2}) # Matches any two digits (must be two digits), capturing the result as \2.
# \2 will be the two digits captured here.
- # Literal "-" character, not captured.
(\d{1,2}) # Either 1 or 2 digits, capturing the result as \3.
# \3 will be the one or two digits captured here.
- # Literal "-" character, not captured.
(\d{1,2}) # Either 1 or 2 digits, capturing the result as \4.
# \4 will be the one or two digits captured here.
This match is replaced by \3/\4/\1\2, which means:
\3 # The one or two digits captured in the 3rd set of `()`s, representing the month.
/ # A literal '/'.
\4 # The one or two digits captured in the 4th set of `()`s, representing the day.
/ # A literal '/'.
\1 # Either '19' or '20'; the first two digits captured (first `()`s).
\2 # The two digits captured in the 2nd set of `()`s, representing the last two digits of the year.
Pattern:
/^\s*{(\w+)}\s*=/
Replacement:
$\1 =
This takes a variable name encoded as {variable} and converts it to $variable = <date>. Let's break it down:
/ ... / # The starting and trailing slash mark the beginning and end of the expression.
^ # Matches the beginning of the string, anchoring the match.
# If the following character isn't matched exactly at the beginning of the string, the expression won't match.
\s* # Any whitespace character. This can include spaces, tabs, etc.
# The '*' means "zero or more occurrences".
# So, the whitespace is optional, but there can be any amount of it at the beginning of the line.
{ # A literal '{' character.
(\w+) # Any 'word' character (a-z, A-Z, 0-9, _). This is captured in \1.
# \1 will be the text contained between the { and }, and is the only thing "captured" in this expression.
} # A literal '}' character.
\s* # Any whitespace character. This can include spaces, tabs, etc.
= # A literal '=' character.
This match is replaced by $\1 =, which means:
$ # A literal '$' character.
\1 # The text captured in the 1st and only set of `()`s, representing the variable name.
# A literal space.
= # A literal '=' character.
Lastly, I wanted to show you a couple of resources. The regex-format you're using is called "PCRE", or Perl-Compatible Regular Expressions. Here is a quick cheat-sheet on PCRE for PHP. Over the last few years, several tools have been popping up to help you visualize, explain, and test regular expressions. One is Regex 101 (just Google "regex tester" or "regex visualizer"). If you look here, this is an explanation of the first RegEx, and here is an explanation of the second. There are others as well, like Debuggex, Regex Tester, etc. But I find the detailed match breakdown on Regex 101 to be pretty useful.
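To see both replacements working together, here is the code from the question run end to end. Nothing is added except the $result variable; the point is that preg_replace() applies the pattern/replacement pairs in order, so the second pattern sees the output of the first.

```php
<?php
// The patterns and replacements from the question, unchanged.
$patterns = array('/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/', '/^\s*{(\w+)}\s*=/');
$replace  = array('\3/\4/\1\2', '$\1 =');

// Step 1 turns "1999-5-27" into "5/27/1999"; step 2 turns "{startDate} =" into "$startDate =".
$result = preg_replace($patterns, $replace, '{startDate} = 1999-5-27');

echo $result, PHP_EOL; // $startDate = 5/27/1999
```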

check two slashes in string

I have the following string. I want to know whether a string has two slashes or not.
$sting = "largeimg/fee0b04800e22590/myimage1.jpg";
I am trying to use the following PHP method:
if(preg_match("#^/([A-Za-z]|[0-9])/([A-Za-z]|[0-9]+)$#", $sting))
But it is not working properly. Please help me.
Here is how to do it in regex (see demo):
^([^/]*/){2}
Your code:
if(preg_match("#^([^/]*/){2}#", $sting)) {
// two slashes!
}
Explain Regex
^ # the beginning of the string
( # group and capture to \1 (2 times):
[^/]* # any character except: '/' (0 or more
# times (matching the most amount
# possible))
/ # '/'
){2} # end of \1 (NOTE: because you are using a
# quantifier on this capture, only the LAST
# repetition of the captured pattern will be
# stored in \1)
You could use substr_count():
$sting = "largeimg/fee0b04800e22590/myimage1.jpg";
if(substr_count($sting, '/') == 2) { echo "has 2 slashes"; }
To check for 2 slashes you can use this regex:
preg_match('#/[^/]*/#', $sting)
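The three suggestions above are not quite interchangeable: both regexes accept two or more slashes, while the substr_count() comparison accepts exactly two. A small sketch comparing them (the sample paths other than the one from the question are made up):

```php
<?php
// Sample paths with one, two and three slashes; the first and last are hypothetical.
$samples = array(
    'a/b.jpg',
    'largeimg/fee0b04800e22590/myimage1.jpg',
    'a/b/c/d.jpg',
);

$report = array();
foreach ($samples as $s) {
    $report[$s] = array(
        'anchored' => preg_match('#^([^/]*/){2}#', $s) === 1, // at least two slashes
        'anywhere' => preg_match('#/[^/]*/#', $s) === 1,      // at least two slashes
        'exactly2' => substr_count($s, '/') === 2,            // exactly two slashes
    );
}

print_r($report);
```

Which one fits depends on whether a path with three or more slashes should count as a match.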
Several other answers provide regular expressions that work but they do not explain why the expression in the question does not work. The expression is:
#^/([A-Za-z]|[0-9])/([A-Za-z]|[0-9]+)$#
The section ([A-Za-z]|[0-9]) is equivalent to ([A-Za-z0-9]). The extra + in the second similar section makes that part quite different. The + is of higher precedence than the |. Hence the section ([A-Za-z]|[0-9]+) is equivalent to ([A-Za-z]|([0-9]+)) (ignoring the difference between capturing and non-capturing brackets). The expression is interpreted as:
^ Start of string
/ The character '/'
([A-Za-z]|[0-9]) One alphanumeric
/ The character '/'
(
[A-Za-z] One alpha character
| or
[0-9]+ One or more digits
)
$ End of the string
This will only match strings where the first three characters are /, one alphanumeric, then /. The remainder of the string must then be either exactly one alpha character or one or more digits. Thus these strings would be matched:
/a/b
/c/123
/4/d
/5/6
/7/890123456789
These strings would not be matched:
/aa/b
c/c/123
/44/d
/5/6a
/5/a6
/7/ee
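The two lists above can be checked mechanically. This sketch feeds them to preg_grep() with the expression from the question; the variable names are my own.

```php
<?php
// The expression from the question, unchanged.
$pattern = '#^/([A-Za-z]|[0-9])/([A-Za-z]|[0-9]+)$#';

$listedAsMatching    = array('/a/b', '/c/123', '/4/d', '/5/6', '/7/890123456789');
$listedAsNotMatching = array('/aa/b', 'c/c/123', '/44/d', '/5/6a', '/5/a6', '/7/ee');
$candidates = array_merge($listedAsMatching, $listedAsNotMatching);

// preg_grep() keeps the entries that match; PREG_GREP_INVERT keeps the ones that do not.
$matched = array_values(preg_grep($pattern, $candidates));
$missed  = array_values(preg_grep($pattern, $candidates, PREG_GREP_INVERT));

// The path from the question fails as well: it does not even start with a slash.
$original = preg_match($pattern, 'largeimg/fee0b04800e22590/myimage1.jpg');

print_r($matched);
print_r($missed);
var_dump($original);
```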

PHP regex performance

I have to extract some data from strings. Unfortunately the data has a really unfriendly format. I had to create about 15 regular expressions, each in a separate preg_replace. It's worth mentioning that they contain many ORs (|). My question is what I should do in the end: combine all the expressions into one, separated by |, or leave them as they are, in individual preg_replace calls?
Is it very bad practice to create extra expressions to keep things clear? I think I could combine some expressions into one, but they become very complicated and hard to understand.
For example I have:
$itemFullName = preg_replace("#^\b(([a-zA-Z]{1,3})?[0-9]{1,2}(\.|\-|X)[0-9]{1,2}(\s|\.|\-)?(X|x)?\s?[0-9]{1,3}\.?(([0-9]{1,3})?(X[0-9]{1,3})|(\s[0-9]\/[0-9]|\/[0-9]{1,3}))?(\s\#[0-9]{1,3}\/[0-9]{1,3})?)\s#", ' ', $itemFullName, -1, $sum);
Untidy:
For starters your original PHP statement:
$itemFullName = preg_replace("#^\b(([a-zA-Z]{1,3})?[0-9]{1,2}(\.|\-|X)[0-9]{1,2}(\s|\.|\-)?(X|x)?\s?[0-9]{1,3}\.?(([0-9]{1,3})?(X[0-9]{1,3})|(\s[0-9]\/[0-9]|\/[0-9]{1,3}))?(\s\#[0-9]{1,3}\/[0-9]{1,3})?)\s#", ' ', $itemFullName, -1, $sum);
would be much more readable (and maintainable) if you write it in free-spacing mode with comments like so:
Tidy:
$itemFullName = preg_replace("/(?#!php re_item_tidy Rev:20180207_0700)
^ # Anchor to start of string.
\b # String must begin with a word char.
( # $1: Unnecessary group.
([a-zA-Z]{1,3})? # $2: Optional 1-3 alphas.
[0-9]{1,2} # 1-2 decimal digits.
(\.|\-|X) # $3: Either a dot, hyphen or X.
[0-9]{1,2} # One or two decimal digits.
(\s|\.|\-)? # $4: Optional whitespace, dot or hyphen.
(X|x)? # $5: Optional X or x.
\s?[0-9]{1,3}\.? # Optional whitespace, 1-3 digits, optional dot.
( # $6: Optional ??? from 2 alternatives.
([0-9]{1,3})? # Either a1of2 $7: Optional 1-3 digits.
(X[0-9]{1,3}) # $8: X and 1-3 digits.
| ( # Or a2of2 $9: one ??? from 2 alternatives.
\s[0-9]\/[0-9] # Either a1of2.
| \/[0-9]{1,3} # Or a2of2.
) # End $9: one ??? from 2 alternatives.
)? # End $6: optional ??? from 2 alternatives.
( # $10: Optional sequence.
\s\#[0-9]{1,3} # whitespace, hash, 1-3 digits.
\/[0-9]{1,3} # Forward slash, 1-3 digits.
)? # End $10: Optional sequence
) # End $1: Unnecessary group.
\s # End with a single whitespace char.
/x", ' ', $itemFullName, -1, $sum);
Critique:
This regex is really not bad performance-wise. It has a start of string anchor at the start which helps it fail quickly for non-matching strings. It also does not have any backtracking problems. However, there are a few minor improvements which can be made:
There are three groups of alternatives where each of the alternatives consists of only one character - each of these can be replaced with a simple character class.
There are 10 capture groups but preg_replace uses none of the captured data. These capture groups can be changed to be non-capturing.
There are several unnecessary groups which can be simply removed.
Group 2: ([a-zA-Z]{1,3})? can be written more simply as: [a-zA-Z]{0,3}. Group 7 has a similar construct.
The \b word boundary at the start is unnecessary.
With PHP, it's best to enclose regex patterns inside single quoted strings. Double quoted strings have many metacharacters that must be escaped. Single quoted strings only have two: the single quote and the backslash.
There are a few unnecessarily escaped forward slashes.
Note also that you are using the $sum variable to count the number of replacements being made by preg_replace(). Since you have a ^ start of string anchor at the beginning of the pattern, you will only ever have one replacement because you have not specified the 'm' multi-line modifier. I am assuming that you actually do want to perform multiple replacements (and count them in $sum), so I've added the 'm' modifier.
Here is an improved version incorporating these changes:
Tidier:
$itemFullName = preg_replace('%(?#!php/m re_item_tidier Rev:20180207_0700)
^ # Anchor to start of line (the m modifier is set).
[a-zA-Z]{0,3} # Optional 0-3 alphas.
[0-9]{1,2} # 1-2 decimal digits.
[.X-] # Either a dot, hyphen or X.
[0-9]{1,2} # One or two decimal digits.
[\s.-]? # Optional whitespace, dot or hyphen.
[Xx]? # Optional X or x.
\s?[0-9]{1,3}\.? # Optional whitespace, 1-3 digits, optional dot.
(?: # Optional ??? from 2 alternatives.
[0-9]{0,3} # Either a1of2: Optional 0-3 digits
X[0-9]{1,3} # followed by X and 1-3 digits.
| (?: # Or a2of2: One ??? from 2 alternatives.
\s[0-9]/[0-9] # Either a1of2.
| /[0-9]{1,3} # Or a2of2.
) # End one ??? from 2 alternatives.
)? # End optional ??? from 2 alternatives.
(?: # Optional sequence.
\s\#[0-9]{1,3} # whitespace, hash, 1-3 digits.
/[0-9]{1,3} # Forward slash, 1-3 digits.
)? # End optional sequence
\s # End with a single whitespace char.
%xm', ' ', $itemFullName, -1, $sum);
Note however, that I don't think you'll see much if any performance improvements - your original regex is pretty good. Your performance issues are probably coming from some other aspect of your program.
Hope this helps.
Edit 2018-02-07: Removed extraneous double quote, added regex shebangs.
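The remark about the 'm' modifier and the replacement count can be seen with a much smaller pattern. This is only a sketch with a made-up input and a simplified pattern, not the item regex above:

```php
<?php
// Hypothetical multi-line input; every line starts with a numeric prefix.
$text = "12 apples\n34 pears\n56 plums";

// Without /m, ^ matches only at the very start of the subject string.
$stripped  = preg_replace('/^[0-9]+\s/', '', $text, -1, $countWithoutM);

// With /m, ^ also matches right after each newline.
$strippedM = preg_replace('/^[0-9]+\s/m', '', $text, -1, $countWithM);

var_dump($countWithoutM); // int(1)
var_dump($countWithM);    // int(3)
```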
My question is what should I do finally: combine all expressions into one and separate them using | or leave them as is - in individual preg_replace?
Keep the regular expressions in separate preg_replace() calls because this gives you more maintainability, readability and efficiency.
Using a lot of OR operators (|) in a regular expression is not performance friendly, especially for large amounts of text, because at every position in the input the regular expression engine may have to try every alternative in the | list.
Please don't worry about "fastest" without having first done some sort of measurement that it matters. Unless your program is operating too slowly, and you've run a profiler like XDebug to determine that the regex matching is the bottleneck, then you're doing premature optimization.
Rather than worrying about fastest, think about which way is clearest.
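If you do want numbers before deciding, a rough timing harness is enough to start with. The sketch below is an assumption-laden illustration: the three patterns, the input and the timeIt() helper are all made up, and hrtime() needs PHP 7.3 or later. A profiler on the real data is still the better guide.

```php
<?php
// Hypothetical patterns and input, chosen only to illustrate the measurement.
$separate = array('/\bfoo\b/', '/\bbar\b/', '/\bbaz\b/');
$combined = '/\b(?:foo|bar|baz)\b/';
$input    = str_repeat('foo qux bar quux baz ', 1000);

// Hypothetical helper: average wall-clock time of $fn in milliseconds over $runs runs.
function timeIt(callable $fn, int $runs = 50): float
{
    $start = hrtime(true); // nanoseconds
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / 1e6 / $runs;
}

// First confirm both variants produce the same output, then time them.
$a = preg_replace($separate, '', $input);
$b = preg_replace($combined, '', $input);
var_dump($a === $b);

$msSeparate = timeIt(function () use ($separate, $input) { preg_replace($separate, '', $input); });
$msCombined = timeIt(function () use ($combined, $input) { preg_replace($combined, '', $input); });

printf("separate: %.3f ms, combined: %.3f ms\n", $msSeparate, $msCombined);
```

The timings will vary by machine and PHP version, so treat them as a comparison on your own data rather than a general result.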

How to evaluate constraints using regular expressions? (php, regex)

So, let's say I want to accept strings as follows
SomeColumn IN||<||>||= [123, 'hello', "wassup"]||123||'hello'||"yay!"
For example: MyValue IN ['value', 123] or MyInt > 123 -> I think you get the idea. Now, what's bothering me is how to phrase this in a regex? I'm using PHP, and this is what I'm doing right now:
$temp = explode(';', $constraints);
$matches = array();
foreach ($temp as $condition) {
preg_match('/(.+)[\t| ]+(IN|<|=|>|!)[\t| ]+([0-9]+|[.+]|.+)/', $condition, $matches[]);
}
foreach ($matches as $match) {
if ($match[2] == 'IN') {
preg_match('/(?:([0-9]+|".+"|\'.+\'))/', substr($match[3], 1, -1), $tempm);
print_r($tempm);
}
}
Really appreciate any help right there, my regex'ing is horrible.
I assume your input looks similar to this:
$string = 'SomeColumn IN [123, \'hello\', "wassup"];SomeColumn < 123;SomeColumn = \'hello\';SomeColumn > 123;SomeColumn = "yay!";SomeColumn = [123, \'hello\', "wassup"]';
If you use preg_match_all there is no need for explode or to build the matches yourself. Note that the resulting two-dimensional array will have its dimensions switched, but that is often desirable. Here is the code:
preg_match_all('/(\w+)[\t ]+(IN|<|>|=|!)[\t ]+((\'[^\']*\'|"[^"]*"|\d+)|\[[\t ]*(?4)(?:[\t ]*,[\t ]*(?4))*[\t ]*\])/', $string, $matches);
$statements = $matches[0];
$columns = $matches[1];
$operators = $matches[2];
$values = $matches[3];
There will also be a $matches[4] but it does not really have a meaning and is only used inside the regular expression. First, a few things you did wrong in your attempt:
(.+) will consume as much as possible, and any character. So if you have something inside a string value that looks like IN 13 then your first repetition might consume everything until there and return it as the column. It also allows whitespace and = inside column names. There are two ways around this. Either making the repetition "ungreedy" by appending ? or, even better, restrict the allowed characters, so you cannot go past the desired delimiter. In my regex I only allow letters, digits and underscores (\w) for column identifiers.
[\t| ] this mixes up two concepts: alternation and character classes. What this does is "match a tab, a pipe or a space". In character classes you simply write all characters without delimiting them. Alternatively you could have written (\t| ) which would be equivalent in this case.
[.+] I don't know what you were trying to accomplish with this, but it matches either a literal . or a literal +. And again it might be useful to restrict the allowed characters, and to check for correct matching of quotes (to avoid 'some string").
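The two character-class points are easy to confirm in isolation. In this sketch the test strings and the ^...$ anchors are my own additions, there only to make each check unambiguous:

```php
<?php
// Inside [...] the pipe is just another character, so "a|b" is accepted here.
$pipeAccepted = preg_match('/^a[\t| ]+b$/', 'a|b');

// With the pipe removed from the class, the same input is rejected.
$pipeRejected = preg_match('/^a[\t ]+b$/', 'a|b');

// [.+] matches one literal dot or one literal plus, nothing else.
$plusMatches   = preg_match('/^[.+]$/', '+');
$letterMatches = preg_match('/^[.+]$/', 'x');

var_dump($pipeAccepted, $pipeRejected, $plusMatches, $letterMatches); // 1, 0, 1, 0
```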
Now for an explanation of my own regex (you can copy this into your code, as well, it will work just fine; plus you have the explanation as comments in your code):
preg_match_all('/
(\w+) # match an identifier and capture in $1
[\t ]+ # one or more tabs or spaces
(IN|<|>|=|!) # the operator (capture in $2)
[\t ]+ # one or more tabs or spaces
( # start of capturing group $3 (the value)
( # start of subpattern for single-valued literals (capturing group $4)
\' # literal quote
[^\']* # arbitrarily many non-quote characters, to avoid going past the end of the string
\' # literal quote
| # OR
"[^"]*" # equivalent for double-quotes
| # OR
\d+ # a number
) # end of subpattern for single-valued literals
| # OR (arrays follow)
\[ # literal [
[\t ]* # zero or more tabs or spaces
(?4) # reuse subpattern no. 4 (any single-valued literal)
(?: # start non-capturing subpattern for further array elements
[\t ]* # zero or more tabs or spaces
, # a literal comma
[\t ]* # zero or more tabs or spaces
(?4) # reuse subpattern no. 4 (any single-valued literal)
)* # end of additional array element; repeat zero or more times
[\t ]* # zero or more tabs or spaces
\] # literal ]
) # end of capturing group $3
/',
$string,
$matches);
This makes use of PCRE's recursion feature where you can reuse a subpattern (or the whole regular expression) with (?n) (where n is just the number you would also use for a backreference).
I can think of three major things that could be improved with this regex:
It does not allow for floating-point numbers
It does not allow for escaped quotes (if your value is 'don\'t do this', I would only capture 'don\'). This can be solved using a negative lookbehind.
It does not allow for empty arrays as values (this could be easily solved by wrapping all parameters in a subpattern and making it optional with ?)
I included none of these, because I was not sure whether they apply to your problem, and I thought the regex was already complex enough to present here.
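For the first two points, here is a sketch of what an extended single-value subpattern might look like. Note that it swaps in a different technique from the lookbehind mentioned above: each quoted literal is written as "any non-quote, non-backslash character, or a backslash followed by any character". The pattern, the $input string and the variable names are illustrative assumptions, and a nowdoc is used so the backslashes need no extra PHP escaping.

```php
<?php
// Nowdoc: the pattern reaches PCRE exactly as written, with no PHP-level unescaping.
$literal = <<<'RE'
/'(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*"|\d+(?:\.\d+)?/
RE;

// Hypothetical input: a single-quoted value with an escaped quote, a float, a double-quoted value.
$input = "'don\\'t do this', 3.14, \"yay!\"";

preg_match_all($literal, $input, $matches);

print_r($matches[0]);
```

The same alternation could be dropped into the position of capturing group 4 in the larger expression, though that combination is not tested here.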
Usually regular expressions are not powerful enough to do proper language parsing anyway. It is generally better to write your own parser.
And since you said your regex'ing is horrible... while regular expressions seem like a lot of black magic due to their uncommon syntax, they are not that hard to understand, if you take the time once to get your head around their basic concepts. I can recommend this tutorial. It really takes you all the way through!

Regular Expression to match ([^>(),]+) but include some \w's in it?

I'm using php's preg_replace function, and I have the following regex:
(?:[^>(),]+)
to match any characters but >(),. The problem is that I want to make sure that there is at least one letter in it (\w) and the match is not empty, how can I do that?
Is there a way to say what i DO WANT to match in the [^>(),]+ part?
You can add a lookahead assertion:
(?:(?=.*\p{L})[^>(),]+)
This makes sure that there will be at least one letter (\p{L}; \w also matches digits and underscores) somewhere in the string.
You don't really need the (?:...) non-capturing parentheses, though:
(?=.*\p{L})[^>(),]+
works just as well. Also, to ensure that we always match the entire string, it might be a good idea to surround the regex with anchors:
^(?=.*\p{L})[^>(),]+$
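A quick check of the anchored version from PHP. The test strings are made up; the u modifier is added so that \p{L} and the subject are treated as UTF-8.

```php
<?php
$pattern = '/^(?=.*\p{L})[^>(),]+$/u';

// Hypothetical inputs mapped to the result each one should give.
$cases = array(
    'abc 123'            => 1, // has a letter, no forbidden characters
    '123 456'            => 0, // no letter anywhere
    'a(b)'               => 0, // contains forbidden parentheses
    '   '                => 0, // whitespace only, no letter
    "\u{e9}t\u{e9} 42"   => 1, // accented letters count as \p{L}
);

$results = array();
foreach ($cases as $subject => $expected) {
    $results[$subject] = preg_match($pattern, $subject);
    printf("%-10s => %d\n", $subject, $results[$subject]);
}
```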
EDIT:
For the added requirement of not including surrounding whitespace in the match, things get a little more complicated. Try
^(?=.*\p{L})(\s*)((?:(?!\s*$)[^>(),])+)(\s*)$
In PHP, for example to replace all those strings we found with REPLACEMENT, leaving leading and trailing whitespace alone, this could look like this:
$result = preg_replace(
'/^ # Start of string
(?=.*\p{L}) # Assert that there is at least one letter
(\s*) # Match and capture optional leading whitespace (--> \1)
( # Match and capture... (--> \2)
(?: # ...at least one character of the following:
(?!\s*$) # (unless it is part of trailing whitespace)
[^>(),] # any character except >(),
)+ # End of repeating group
) # End of capturing group
(\s*) # Match and capture optional trailing whitespace (--> \3)
$ # End of string
/xu',
'\1REPLACEMENT\3', $subject);
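You can just "insert" \w inside the pattern: (?:[^>(),]+\w[^>(),]+). That guarantees at least one word character, so the match cannot be empty; note that with + on both sides it also requires at least one other allowed character before and after it. BTW, \w matches digits and underscores as well as letters. If you want only letters, you can use the Unicode letter class \p{L} instead of \w.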
How about this:
(?:[^>(),]*\w[^>(),]*)
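The difference between the + and * variants shows up on short inputs. In this sketch the ^...$ anchors and the test strings are my additions, so that each pattern has to cover the whole string:

```php
<?php
$plus = '/^(?:[^>(),]+\w[^>(),]+)$/'; // needs a character on both sides of the \w
$star = '/^(?:[^>(),]*\w[^>(),]*)$/'; // the \w may stand alone or at either end

$results = array();
foreach (array('a', 'ab', 'abc', '--', 'x, y') as $subject) {
    $results[$subject] = array(
        'plus' => preg_match($plus, $subject),
        'star' => preg_match($star, $subject),
    );
}

print_r($results);
```

Only the * variant accepts one- and two-character strings; both reject input with no word character or with a forbidden character.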
