So, let's say I want to accept strings of the following form:
SomeColumn IN||<||>||= [123, 'hello', "wassup"]||123||'hello'||"yay!"
For example: MyValue IN ['value', 123] or MyInt > 123. I think you get the idea. Now, what's bothering me is how to phrase this as a regex. I'm using PHP, and this is what I'm doing right now:
$temp = explode(';', $constraints);
$matches = array();
foreach ($temp as $condition) {
preg_match('/(.+)[\t| ]+(IN|<|=|>|!)[\t| ]+([0-9]+|[.+]|.+)/', $condition, $matches[]);
}
foreach ($matches as $match) {
if ($match[2] == 'IN') {
preg_match('/(?:([0-9]+|".+"|\'.+\'))/', substr($match[3], 1, -1), $tempm);
print_r($tempm);
}
}
Really appreciate any help right there, my regex'ing is horrible.
I assume your input looks similar to this:
$string = 'SomeColumn IN [123, \'hello\', "wassup"];SomeColumn < 123;SomeColumn = \'hello\';SomeColumn > 123;SomeColumn = "yay!";SomeColumn = [123, \'hello\', "wassup"]';
If you use preg_match_all there is no need for explode or to build the matches yourself. Note that the resulting two-dimensional array will have its dimensions switched, but that is often desirable. Here is the code:
preg_match_all('/(\w+)[\t ]+(IN|<|>|=|!)[\t ]+((\'[^\']*\'|"[^"]*"|\d+)|\[[\t ]*(?4)(?:[\t ]*,[\t ]*(?4))*[\t ]*\])/', $string, $matches);
$statements = $matches[0];
$columns = $matches[1];
$operators = $matches[2];
$values = $matches[3];
There will also be a $matches[4], but it does not really have a meaning and is only used inside the regular expression.
First, a few things you did wrong in your attempt:
(.+) will consume as much as possible, and it matches any character. So if a string value contains something that looks like IN 13, the first repetition may consume everything up to that point and return it as the column. It also allows whitespace and = inside column names. There are two ways around this: either make the repetition "ungreedy" by appending ?, or, even better, restrict the allowed characters so that you cannot go past the desired delimiter. In my regex I only allow letters, digits and underscores (\w) for column identifiers.
[\t| ] mixes up two concepts: alternation and character classes. What it does is "match a tab, a pipe or a space". In character classes you simply write all characters without delimiting them. Alternatively you could have written (\t| ), which would be equivalent in this case.
[.+] I don't know what you were trying to accomplish with this, but it matches either a literal . or a literal +. Again, it might be useful to restrict the allowed characters and to check for correctly matching quotes (to avoid 'some string").
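To make the first two points concrete, here is a small sketch. The input string is made up for illustration; the first pattern is the one from the question and the second is a simplified variant of the approach described above, without array support.

```php
<?php
// Hypothetical input: the quoted value itself contains " IN 13".
$condition = "Title = 'x IN 13'";

// The original attempt: the greedy (.+) backtracks from the end of the
// string and settles on the LAST place where an operator can follow.
preg_match('/(.+)[\t| ]+(IN|<|=|>|!)[\t| ]+([0-9]+|[.+]|.+)/', $condition, $greedy);

// Restricting the column to \w+ and the value to a quoted string or a number.
preg_match('/(\w+)[\t ]+(IN|<|>|=|!)[\t ]+(\'[^\']*\'|"[^"]*"|\d+)/', $condition, $strict);

echo $greedy[1], ' | ', $greedy[2], ' | ', $greedy[3], "\n";
echo $strict[1], ' | ', $strict[2], ' | ', $strict[3], "\n";

// [\t| ] is a character class, so the pipe is just another allowed character.
$pipeAllowed  = preg_match('/^[\t| ]+$/', '|');
$pipeRejected = preg_match('/^[\t ]+$/', '|');
```

With the greedy version the "column" comes out as Title = 'x and the operator as IN, which is not what was intended; the restricted version reports Title, = and the whole quoted string.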
Now for an explanation of my own regex (you can copy this into your code, as well, it will work just fine; plus you have the explanation as comments in your code):
preg_match_all('/
(\w+) # match an identifier and capture in $1
[\t ]+ # one or more tabs or spaces
(IN|<|>|=|!) # the operator (capture in $2)
[\t ]+ # one or more tabs or spaces
( # start of capturing group $3 (the value)
( # start of subpattern for single-valued literals (capturing group $4)
\' # literal quote
[^\']* # arbitrarily many non-quote characters, to avoid going past the end of the string
\' # literal quote
| # OR
"[^"]*" # equivalent for double-quotes
| # OR
\d+ # a number
) # end of subpattern for single-valued literals
| # OR (arrays follow)
\[ # literal [
[\t ]* # zero or more tabs or spaces
(?4) # reuse subpattern no. 4 (any single-valued literal)
(?: # start non-capturing subpattern for further array elements
[\t ]* # zero or more tabs or spaces
, # a literal comma
[\t ]* # zero or more tabs or spaces
(?4) # reuse subpattern no. 4 (any single-valued literal)
)* # end of additional array element; repeat zero or more times
[\t ]* # zero or more tabs or spaces
\] # literal ]
) # end of capturing group $3
/x',
$string,
$matches);
This makes use of PCRE's recursion feature where you can reuse a subpattern (or the whole regular expression) with (?n) (where n is just the number you would also use for a backreference).
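If the switched dimensions are inconvenient, preg_match_all also accepts the PREG_SET_ORDER flag, which groups the result by match instead of by capture group. A minimal sketch, using the single-line pattern from above and a shortened sample input (the input is an assumption):

```php
<?php
// Sample input, modelled on the question.
$string = <<<'TXT'
SomeColumn IN [123, 'hello', "wassup"];OtherColumn < 123
TXT;

$pattern = '/(\w+)[\t ]+(IN|<|>|=|!)[\t ]+((\'[^\']*\'|"[^"]*"|\d+)|\[[\t ]*(?4)(?:[\t ]*,[\t ]*(?4))*[\t ]*\])/';

// PREG_SET_ORDER: $sets[0] holds everything about the first statement, etc.
preg_match_all($pattern, $string, $sets, PREG_SET_ORDER);

foreach ($sets as $set) {
    // $set[1] = column, $set[2] = operator, $set[3] = value
    printf("%s | %s | %s\n", $set[1], $set[2], $set[3]);
}
```

Which layout is nicer depends on how the conditions are processed afterwards; per-statement sets tend to be easier to loop over.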
I can think of three major things that could be improved with this regex:
It does not allow for floating-point numbers
It does not allow for escaped quotes (if your value is 'don\'t do this', I would only capture 'don\'). This can be solved using a negative lookbehind.
It does not allow for empty arrays as values (this could be easily solved by wrapping all parameters in a subpattern and making it optional with ?)
I included none of these, because I was not sure whether they apply to your problem, and I thought the regex was already complex enough to present here.
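For reference, here is one possible way to cover all three points. This is only a sketch under assumptions: instead of the negative lookbehind mentioned above, it treats a backslash followed by any character as a unit inside quoted strings (a different, commonly used technique), and the $literal variable and sample input are made up for illustration.

```php
<?php
// Single-valued literal: a quoted string that may contain backslash escapes,
// or an integer / decimal number. It contains no capturing groups, so the
// outer group numbering stays the same as in the pattern above.
$literal = '\'(?:[^\'\\\\]|\\\\.)*\''
         . '|"(?:[^"\\\\]|\\\\.)*"'
         . '|\d+(?:\.\d+)?';

// The list of array elements is wrapped in an optional group, so [] is allowed.
$pattern = '/(\w+)[\t ]+(IN|<|>|=|!)[\t ]+'
         . '((' . $literal . ')'
         . '|\[[\t ]*(?:(?4)(?:[\t ]*,[\t ]*(?4))*[\t ]*)?\])/';

$input = <<<'TXT'
Name = 'don\'t';Price > 12.5;Tags IN []
TXT;

preg_match_all($pattern, $input, $matches);
print_r($matches[3]); // the three captured values
```

The quadruple backslashes are needed because the pattern sits in a single-quoted PHP string: each pair becomes one backslash for the regex engine.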
Usually regular expressions are not powerful enough to do proper language parsing anyway. It is generally better to write your own parser.
And since you said your regex'ing is horrible... while regular expressions seem like a lot of black magic due to their uncommon syntax, they are not that hard to understand, if you take the time once to get your head around their basic concepts. I can recommend this tutorial. It really takes you all the way through!
I would like to validate if user input string is in correct form for further processing / database update.
Form:
elephant1:elephant2:elephant3;cat1:cat2:cat3;unicorn1:unicorn2:unicorn3
: as separator between siblings and ; as separator between groups of siblings
Rules:
There are ALWAYS 3 siblings. Since this is meant just for personal bulk import, I just want to avoid mistakes with very long strings. As for the groups, there could be one or more, so the group separator is not obligatory. Sibling names are letters only, with the exception of the underscore (_), which stands for a space when a name has two or more words.
I was thinking of a regex, but I am not very familiar with them. If there are any other, simpler ways to achieve this, please suggest them.
Valid examples
N groups, separated by semicolons, each of which contains exactly three (3) members separated by colons. As mentioned before, names are letters only, with the exception of the underscore as a space for names with multiple words.
VALIDS:
john:mike:dave;jenny:helen:jessica
dog:cat:frog;car:boat:ship;house:flat:shack
meat:vegetable:fruit
UPDATE:
This is what I came up with while trying to understand your answers. It works fine so far:
"/^(([a-z]+:[a-z]+:[a-z]+;?)+)$/"
Upgraded to Roman's answer
"/([a-z_]+:[a-z_]+:[a-z_]+;?)+/i"
which makes the match case-insensitive and allows underscores where items have multiple words.
The solution using preg_match function with specific regex pattern:
$str = 'og:cat:frog;car:boat:ship;house:flat:shack';
if (preg_match("/([a-z_]+:[a-z_]+:[a-z_]+;?)+/i", $str)) {
echo 'valid';
} else {
echo 'invalid';
}
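One caveat with this pattern is that it is not anchored, so preg_match succeeds as soon as one valid triple appears anywhere in the input. The sketch below compares it with an anchored variant; the anchored pattern and the sample inputs are assumptions added for illustration.

```php
<?php
$unanchored = '/([a-z_]+:[a-z_]+:[a-z_]+;?)+/i';
// Assumed stricter variant: anchored, with ';' allowed only between groups.
$anchored = '/^[a-z_]+:[a-z_]+:[a-z_]+(?:;[a-z_]+:[a-z_]+:[a-z_]+)*$/i';

$inputs = [
    'dog:cat:frog;car:boat:ship',   // well-formed
    '123;dog:cat:frog;!!!',         // a valid triple buried in junk
    'dog:cat',                      // only two siblings
];

$results = [];
foreach ($inputs as $input) {
    $results[$input] = [preg_match($unanchored, $input), preg_match($anchored, $input)];
    printf("%-28s unanchored=%d anchored=%d\n", $input, ...$results[$input]);
}
```

The second input passes the unanchored check but fails the anchored one, which is usually the behaviour wanted when validating a whole string.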
^(?:[a-zA-Z_]+:[a-zA-Z_]+:[a-zA-Z_]+(?:;(?!$)|$))+$ (demo, with multiline flag on)
^ # Anchors to beginning of string
(?: # Opens non-capturing group
[a-zA-Z_]+ # Any number of letters/underscore, one or more times
: # Literal :
[a-zA-Z_]+ # Any number of letters/underscore, one or more times
: # Literal :
[a-zA-Z_]+ # Any number of letters/underscore, one or more times
(?: # Opens non-capturing group
; # Literal ;
(?!$) # Negative Lookahead, ensuring that semi-colons are not at the end of line
| # Or
$ # End of string
) # Closes non-capturing group
)+ # Repeats overall non-capturing-group one or more times
$ # Anchors to end of string
You didn't specify whether sibling names can be 0 characters long. If they can, change each [a-zA-Z_]+ to [a-zA-Z_]*
// PHP Code generated by Regex101.
$re = '/^(?:[a-zA-Z_]+:[a-zA-Z_]+:[a-zA-Z_]+(?:;(?!$)|$))+$/m';
$str = 'a_b:bread:stack_overflow;test:this_thing:jane;Get_me:h:down
ab:bread:stack_overflow;test:this_thing:jane;Get_me:h:down
a_b:any other characters break it:stack_overflow;test:this_thing:jane;Get_me:h:down
a_b:bread:format_messed_up-test:this_thing:jane;Get_me:h:down
a_b:bread:stack_overflow;test:this_thing:jane;semi_colon_at_end;';
preg_match_all($re, $str, $matches);
// Print the entire match result
print_r($matches);
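Since the question also asks about simpler, non-regex ways, here is a sketch that validates the same format with explode and strspn. The function name validateSiblingGroups is hypothetical, and the rules are my reading of the question (exactly three non-empty names per group, ASCII letters and underscores only).

```php
<?php
/**
 * Returns true if $input consists of ';'-separated groups, each holding
 * exactly three non-empty names made of ASCII letters and underscores.
 */
function validateSiblingGroups(string $input): bool
{
    $allowed = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_';

    foreach (explode(';', $input) as $group) {
        $names = explode(':', $group);
        if (count($names) !== 3) {
            return false; // wrong number of siblings (also catches a trailing ';')
        }
        foreach ($names as $name) {
            // strspn() counts the leading characters that are in $allowed.
            if ($name === '' || strspn($name, $allowed) !== strlen($name)) {
                return false;
            }
        }
    }
    return true;
}

var_dump(validateSiblingGroups('john:mike:dave;jenny:helen:jessica'));
var_dump(validateSiblingGroups('john:mike;jenny:helen:jessica'));
```

This is longer than the regex, but each rule is a separate, readable check, which makes it easy to report which rule failed.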
I have to extract some data from strings. Unfortunately, the data has a really unfriendly format. I had to create about 15 regular expressions, each placed in a separate preg_replace. It's worth mentioning that they contain many ORs (|). My question is what I should do in the end: combine all the expressions into one and separate them using |, or leave them as they are, in individual preg_replace calls?
Is it very bad practice to keep separate expressions for the sake of clarity? I think I could combine some expressions into one, but they would become very complicated and hard to understand.
For example I have:
$itemFullName = preg_replace("#^\b(([a-zA-Z]{1,3})?[0-9]{1,2}(\.|\-|X)[0-9]{1,2}(\s|\.|\-)?(X|x)?\s?[0-9]{1,3}\.?(([0-9]{1,3})?(X[0-9]{1,3})|(\s[0-9]\/[0-9]|\/[0-9]{1,3}))?(\s\#[0-9]{1,3}\/[0-9]{1,3})?)\s#", ' ', $itemFullName, -1, $sum);
Untidy:
For starters your original PHP statement:
$itemFullName = preg_replace("#^\b(([a-zA-Z]{1,3})?[0-9]{1,2}(\.|\-|X)[0-9]{1,2}(\s|\.|\-)?(X|x)?\s?[0-9]{1,3}\.?(([0-9]{1,3})?(X[0-9]{1,3})|(\s[0-9]\/[0-9]|\/[0-9]{1,3}))?(\s\#[0-9]{1,3}\/[0-9]{1,3})?)\s#", ' ', $itemFullName, -1, $sum);
would be much more readable (and maintainable) if you write it in free-spacing mode with comments like so:
Tidy:
$itemFullName = preg_replace("/(?#!php re_item_tidy Rev:20180207_0700)
^ # Anchor to start of string.
\b # String must begin with a word char.
( # $1: Unnecessary group.
([a-zA-Z]{1,3})? # $2: Optional 1-3 alphas.
[0-9]{1,2} # 1-2 decimal digits.
(\.|\-|X) # $3: Either a dot, hyphen or X.
[0-9]{1,2} # One or two decimal digits.
(\s|\.|\-)? # $4: Optional whitespace, dot or hyphen.
(X|x)? # $5: Optional X or x.
\s?[0-9]{1,3}\.? # Optional whitespace, 1-3 digits, optional dot.
( # $6: Optional ??? from 2 alternatives.
([0-9]{1,3})? # Either a1of2 $7: Optional 1-3 digits.
(X[0-9]{1,3}) # $8: X and 1-3 digits.
| ( # Or a2of2 $9: one ??? from 2 alternatives.
\s[0-9]\/[0-9] # Either a1of2.
| \/[0-9]{1,3} # Or a2of2.
) # End $9: one ??? from 2 alternatives.
)? # End $6: optional ??? from 2 alternatives.
( # $10: Optional sequence.
\s\#[0-9]{1,3} # whitespace, hash, 1-3 digits.
\/[0-9]{1,3} # Forward slash, 1-3 digits.
)? # End $10: Optional sequence
) # End $1: Unnecessary group.
\s # End with a single whitespace char.
/x", ' ', $itemFullName, -1, $sum);
Critique:
This regex is really not bad performance-wise. It has a start of string anchor at the start which helps it fail quickly for non-matching strings. It also does not have any backtracking problems. However, there are a few minor improvements which can be made:
There are three groups of alternatives where each of the alternatives consists of only one character - each of these can be replaced with a simple character class.
There are 10 capture groups but preg_replace uses none of the captured data. These capture groups can be changed to be non-capturing.
There are several unnecessary groups which can be simply removed.
Group 2: ([a-zA-Z]{1,3})? can be written more simply as: [a-zA-Z]{0,3}. Group 7 has a similar construct.
The \b word boundary at the start is unnecessary.
With PHP, it's best to enclose regex patterns inside single-quoted strings. Double-quoted strings have many metacharacters that must be escaped. Single-quoted strings only have two: the single quote and the backslash.
There are a few unnecessarily escaped forward slashes.
Note also that you are using the $sum variable to count the number of replacements made by preg_replace(). Since you have a ^ start-of-string anchor at the beginning of the pattern, you will only ever have one replacement, because you have not specified the 'm' multi-line modifier. I am assuming that you actually do want to perform multiple replacements (and count them in $sum), so I've added the 'm' modifier.
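The effect of the 'm' modifier on the replacement count can be seen with a much shorter pattern. The two-line input and the simplified pattern below are made up for illustration; only the behaviour of ^ with and without 'm' is the point.

```php
<?php
// Two made-up lines, each starting with a size-like prefix.
$text = "12.5 bolt\n7-3 nut\n";
$pattern = '[0-9]{1,2}[.X-][0-9]{1,2}\s';

// Without 'm', ^ matches only at the very start of the subject string.
$withoutM = preg_replace('/^' . $pattern . '/', '', $text, -1, $countWithoutM);

// With 'm', ^ also matches right after every newline.
$withM = preg_replace('/^' . $pattern . '/m', '', $text, -1, $countWithM);

printf("without m: %d replacement(s), with m: %d replacement(s)\n", $countWithoutM, $countWithM);
```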
Here is an improved version incorporating these changes:
Tidier:
$itemFullName = preg_replace('%(?#!php/m re_item_tidier Rev:20180207_0700)
^ # Anchor to start of string.
[a-zA-Z]{0,3} # Optional 1-3 alphas.
[0-9]{1,2} # 1-2 decimal digits.
[.X-] # Either a dot, hyphen or X.
[0-9]{1,2} # One or two decimal digits.
[\s.-]? # Optional whitespace, dot or hyphen.
[Xx]? # Optional X or x.
\s?[0-9]{1,3}\.? # Optional whitespace, 1-3 digits, optional dot.
(?: # Optional ??? from 2 alternatives.
[0-9]{0,3} # Either a1of2: Optional 1-3 digits
X[0-9]{1,3} # followed by X and 1-3 digits.
| (?: # Or a2of2: One ??? from 2 alternatives.
\s[0-9]/[0-9] # Either a1of2.
| /[0-9]{1,3} # Or a2of2.
) # End one ??? from 2 alternatives.
)? # End optional ??? from 2 alternatives.
(?: # Optional sequence.
\s\#[0-9]{1,3} # whitespace, hash, 1-3 digits.
/[0-9]{1,3} # Forward slash, 1-3 digits.
)? # End optional sequence
\s # End with a single whitespace char.
%xm', ' ', $itemFullName, -1, $sum);
Note however, that I don't think you'll see much if any performance improvements - your original regex is pretty good. Your performance issues are probably coming from some other aspect of your program.
Hope this helps.
Edit 2018-02-07: Removed extraneous double quote, added regex shebangs.
My question is what should I do finally: combine all expressions into one and separate them using | or leave them as is - in individual preg_replace?
Keep the regular expressions in separate preg_replace() calls because this gives you more maintainability, readability and efficiency.
Using a lot of OR operators (|) in a regular expression is not performance-friendly, especially for large amounts of text, because at every position in the input the engine may have to try every alternative in the | list.
Please don't worry about "fastest" without having first done some sort of measurement that it matters. Unless your program is operating too slowly, and you've run a profiler like XDebug to determine that the regex matching is the bottleneck, then you're doing premature optimization.
Rather than worrying about fastest, think about which way is clearest.
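If a measurement does turn out to be necessary, a rough timing sketch could look like the following. The input and the two toy patterns are invented for illustration and were chosen so that both approaches produce the same result; a profiler on the real workload will tell far more than a micro-benchmark like this.

```php
<?php
// Made-up input and patterns; not the patterns from the question.
$input = str_repeat('ab12#cd%34 ', 1000);

// Variant 1: two separate preg_replace calls.
$start = microtime(true);
$separate = preg_replace('/\d+/', '', $input);
$separate = preg_replace('/[#%]/', '', $separate);
$separateTime = microtime(true) - $start;

// Variant 2: one call with the alternatives combined by |.
$start = microtime(true);
$combined = preg_replace('/\d+|[#%]/', '', $input);
$combinedTime = microtime(true) - $start;

printf(
    "separate: %.6fs, combined: %.6fs, same result: %s\n",
    $separateTime,
    $combinedTime,
    $separate === $combined ? 'yes' : 'no'
);
```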
preg_match('/.*MyString[ (\/]*([a-z0-9\.\-]*)/i', $contents, $matches);
I need to debug this one. I have a good idea of what it's doing but since I was never an expert at regular expressions I need your help.
Can you tell me what it does block by block (so I can learn)?
Can the syntax be simplified (I think there is no need to escape the dot with a backslash)?
The regexp...
'/.*MyString[ (\/]*([a-z0-9\.\-]*)/i'
.* matches any character zero or more times
MyString matches that string. But you are using case-insensitive matching, so the matched string will spell "mystring" but with any capitalization
EDIT: (Thanks to Alan Moore) [ (\/]* matches any of the characters space, ( or /, repeated zero or more times. As Alan points out, the / is escaped to stop it being treated as the regexp delimiter.
EDIT: The ( does not need escaping and neither does the . (thanks AlexV) because:
All non-alphanumeric characters other than \, -, ^ (at the start) and
the terminating ] are non-special in character classes, but it does no
harm if they are escaped.
-- http://www.php.net/manual/en/regexp.reference.character-classes.php
The hyphen generally does need to be escaped, otherwise it will try to define a range. For example:
[A-Z] // matches all upper case letters of the alphabet
[A\-Z] // matches 'A', '-', and 'Z'
However, where the hyphen is at the end of the list you can get away with not escaping it (but it is always best to be in the habit of escaping it... I got caught out by this).
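A quick way to check the three hyphen cases is to test single characters against each class. This is just an illustrative sketch; the variable names are arbitrary.

```php
<?php
// Unescaped hyphen between two characters: a range.
$range = preg_match('/^[A-Z]$/', 'M');      // M lies inside the range A..Z

// Escaped hyphen: only the three literal characters A, - and Z.
$escapedM = preg_match('/^[A\-Z]$/', 'M');
$escapedHyphen = preg_match('/^[A\-Z]$/', '-');

// Unescaped hyphen at the end of the class: also a literal hyphen.
$trailingHyphen = preg_match('/^[AZ-]$/', '-');
$trailingM = preg_match('/^[AZ-]$/', 'M');

var_dump($range, $escapedM, $escapedHyphen, $trailingHyphen, $trailingM);
```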
([a-z0-9\.\-]*) matches any string containing the characters a through z (note again that this is affected by the case-insensitive match), 0 through 9, a dot or a hyphen, repeated zero or more times. The surrounding parentheses () tell preg_match to capture this string, which means that $matches[1] will contain the string matched by [a-z0-9\.\-]*.
e.g.
<?php
$input = "aslghklfjMyString(james321-james.org)blahblahblah";
preg_match('/.*MyString[ (\/]*([a-z0-9.\-]*)/i', $input, $matches);
print_r($matches);
?>
outputs
Array
(
[0] => aslghklfjMyString(james321-james.org
[1] => james321-james.org
)
Note that because you use a case insensitive match...
$input = "aslghklfjmYsTrInG(james321898-james.org)blahblahblah";
Will also match and give the same answer in $matches[1]
Hope this helps....
Let's break this down step by step, removing the explained parts from the expression.
"/.*MyString[ (\/]*([a-z0-9\.\-]*)/i"
Let's first strip the regex delimiters (/i at the end means it's case-insensitive):
".*MyString[ (\/]*([a-z0-9\.\-]*)"
Then we've got a greedy wildcard: .* matches any character (except a newline) any number of times, and the engine then backtracks until the rest of the pattern can match.
"MyString[ (\/]*([a-z0-9\.\-]*)"
Then match 'MyString' literally, followed by any number (note the '*') of any of the following: ' ', '(', '/'. Inside a character class the '(' does not need to be escaped; the '/' is escaped only because it is also the regex delimiter.
"([a-z0-9\.\-]*)"
Then we get a capture group for any number of any of the following: a-z literals, 0-9 digits, '.', or '-'.
That's pretty much all of it.
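One practical consequence of the leading .* is worth seeing in action: because it is greedy, the pattern ends up anchoring on the last occurrence of MyString in the input, not the first. The input below is made up for illustration.

```php
<?php
// Hypothetical input with two occurrences of the marker.
$input = 'MyString(first) then MyString(second)';

// With the leading .*: the greedy wildcard swallows everything it can,
// so the match is built around the LAST "MyString".
preg_match('/.*MyString[ (\/]*([a-z0-9.\-]*)/i', $input, $withWildcard);

// Without it: the engine stops at the FIRST "MyString" it finds.
preg_match('/MyString[ (\/]*([a-z0-9.\-]*)/i', $input, $withoutWildcard);

echo $withWildcard[1], "\n";
echo $withoutWildcard[1], "\n";
```

If only the first occurrence is wanted, dropping the .* is a simple way to get it, since preg_match searches the whole string anyway.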
This question is an educational demonstration of the usage of lookahead, nested reference, and conditionals in a PCRE pattern to match ALL palindromes, including the ones that can't be matched by the recursive pattern given in the PCRE man page.
Examine this PCRE pattern in PHP snippet:
$palindrome = '/(?x)
^
(?:
(.) (?=
.*
(
\1
(?(2) \2 | )
)
$
)
)*
.?
\2?
$
/';
This pattern seems to detect palindromes, as seen in these test cases (see also on ideone.com):
$tests = array(
# palindromes
'',
'a',
'aa',
'aaa',
'aba',
'aaaa',
'abba',
'aaaaa',
'abcba',
'ababa',
# non-palindromes
'aab',
'abab',
'xyz',
);
foreach ($tests as $test) {
echo sprintf("%s '%s'\n", preg_match($palindrome, $test), $test);
}
So how does this pattern work?
Notes
This pattern uses a nested reference, a technique similar to the one used in How does this Java regex detect palindromes?, but unlike that Java pattern, there's no lookbehind (it does use a conditional, though).
Also, note that the PCRE man page presents a recursive pattern to match some palindromes:
# the recursive pattern to detect some palindromes from PCRE man page
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
The man page warns that this recursive pattern can NOT detect all palindromes (see: Why will this recursive regex only match when a character repeats 2n - 1 times? and also on ideone.com), but the nested reference/positive lookahead pattern presented in this question can.
Let's try to understand the regex by constructing it. Firstly, a palindrome must start and end with the same sequence of characters in opposite directions:
^(.)(.)(.) ... \3\2\1$
We want to rewrite this so that the ... is followed only by a pattern of finite length, which would make it possible to convert the whole thing into a repetition with *. This is possible with a lookahead:
^(.)(?=.*\1$)
(.)(?=.*\2\1$)
(.)(?=.*\3\2\1$) ...
but the lines still differ from one another. What if we could "record" the previously captured groups? If that were possible, we could rewrite it as:
^(.)(?=.*(?<record>\1\k<record>)$) # \1 = \1 + (empty)
(.)(?=.*(?<record>\2\k<record>)$) # \2\1 = \2 + \1
(.)(?=.*(?<record>\3\k<record>)$) # \3\2\1 = \3 + \2\1
...
which could be converted into
^(?:
(.)(?=.*(\1\2)$)
)*
Almost good, except that \2 (the recorded capture) is not empty initially: it is unset, so it will just fail to match anything. We need it to match the empty string if the recorded capture doesn't exist yet. This is where the conditional expression comes in.
(?(2)\2|) # matches \2 if it exist, empty otherwise.
so our expression becomes
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*
Now it matches the first half of the palindrome. How about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it in the end.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*\2$
We want to take care of odd-length palindromes as well. There would be a free character between the 1st and 2nd half.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2$
This works well except in one case: when there is only 1 character. This is again because \2 is unset and therefore fails to match. So
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2?$
# ^ since \2 must be at the end in the look-ahead anyway.
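One way to gain some confidence in the final pattern is to compare it against a plain strrev() check over many short strings. The following is only a sanity-check sketch: it tries every string over the two-letter alphabet {a, b} up to length 8 (an arbitrary choice) and collects any disagreements.

```php
<?php
// The final pattern from the construction above, written on one line.
$palindrome = '/^(?:(.)(?=.*(\1(?(2)\2|))$))*.?\2?$/';

// Breadth-first enumeration of all strings over {a, b} up to length 8.
$mismatches = [];
$queue = [''];
while ($queue) {
    $s = array_shift($queue);

    $expected = ($s === strrev($s));                  // reference check
    $actual = (preg_match($palindrome, $s) === 1);    // regex verdict
    if ($expected !== $actual) {
        $mismatches[] = $s;
    }

    if (strlen($s) < 8) {
        $queue[] = $s . 'a';
        $queue[] = $s . 'b';
    }
}

printf("%d disagreement(s) between the regex and strrev()\n", count($mismatches));
```

A test like this proves nothing about longer strings or other characters, but any disagreement would point directly at a string worth tracing by hand.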
I want to bring my very own solution to the table.
This is a regex that I wrote a while ago to match palindromes using PCRE/PCRE2:
^((\w)(((\w)(?5)\5?)*|(?1)|\w?)\2)$
Example:
https://regex101.com/r/xvZ1H0/1
I have something like:
$string1 = "dog fox [cat]";
I need the contents inside [ ], i.e. cat.
Another question: if you are familiar with regex in one language, will that knowledge carry over to other languages as well?
$matches = array();
$matchcount = preg_match('/\[([^\]]*)\]/', $string1, $matches);
$item_inside_brackets = $matches[1];
If you want to match multiple bracketed terms in the same string, you'll want to look into preg_match_all instead of just preg_match.
And yes, regular expressions are a fairly cross-language standard (there are some variations in what features are available in different languages, and occasional syntax differences, but for the most part it's all the same).
Explanation of the above regex:
/ # beginning of regex delimiter
\[ # literal left bracket (normally [ is a special character)
( # start capture group - isolate the text we actually want to extract
[^\]]* # match any number of non-] characters
) # end capture group
\] # literal right bracket
/ # end of regex delimiter
The contents of the $matches array are set based on both the entirety of the text matched (which would include the brackets) in [0], and then the contents of each capture group from the matching in [1] and up (first capture group's contents in [1], second in [2], etc).
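For the multiple-brackets case mentioned above, a minimal preg_match_all sketch might look like this. The input string is an assumption, extended from the one in the question.

```php
<?php
// Assumed input with more than one bracketed term.
$string1 = 'dog [fox] and [cat]';

// Same pattern as above; preg_match_all collects every match.
$count = preg_match_all('/\[([^\]]*)\]/', $string1, $matches);

// $matches[0] holds the full matches (with brackets),
// $matches[1] holds the contents of the first capture group.
print_r($matches[1]);
```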
Here is the php code:
preg_match('/\[(.*)\]/', $string1, $matches);
echo $matches[1];
And yes, your knowledge will transfer, although there may be subtle differences between each language's version of regular expressions.
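One thing to be aware of with the (.*) version: it is greedy, so it behaves differently once the string contains more than one pair of brackets. The sketch below compares it with a lazy quantifier and with a negated character class on a made-up input.

```php
<?php
// Assumed input with two bracketed terms.
$text = 'dog [fox] and [cat]';

// Greedy: .* runs to the last ] in the string.
preg_match('/\[(.*)\]/', $text, $greedy);

// Lazy: .*? stops at the first ] it can.
preg_match('/\[(.*?)\]/', $text, $lazy);

// Negated class: cannot run past a ] by construction.
preg_match('/\[([^\]]*)\]/', $text, $negated);

echo $greedy[1], "\n";
echo $lazy[1], "\n";
echo $negated[1], "\n";
```

For the single-bracket string in the question all three give the same answer.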