Regular expression works in Javascript but not PHP preg_match - php

Regular expression:
/([^]+):([^\\r\\n]+)/
String:
f1:aaa\r\nf2:bbb\r\nf3:ccc\r\nf4:ddd
According to regexpal.com, this would give my desired sets: f1 & aaa, f2 & bbb, f3 & ccc etc.
But using http://www.functions-online.com/preg_match.html I only see [0] => "f1" and [1] => "f1"
Can anyone show how I should be doing this?

Some implementations of JavaScript allow [] and [^] as "no character" and "any character" respectively. But keep in mind that this is particular to the JavaScript regex flavour. (If you are interested in the subject, you can take a look at this post.)
In other words, [^] is a shortcut for [\s\S], since JavaScript (before the s flag arrived in ES2018) had no dotall or singleline mode in which the dot can match newlines.
Thus, to obtain the same result in PHP you must replace [^] by . (which by default matches any character except newline) with the singleline modifier s after the end delimiter or (?s) before the . to allow newlines too. Examples: /.+/s or /(?s).+/
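A minimal PHP sketch of these options, assuming a made-up two-line subject with a bare line feed (the subject and variable names are only illustrative):

```php
<?php
// Hypothetical two-line subject; "\n" is a real line feed here.
$subject = "f1:aaa\nf2:bbb";

// Default behaviour: the dot stops at the line feed.
preg_match('/.+/', $subject, $plain);

// Three ways to let the match run across the line break.
preg_match('/.+/s', $subject, $withModifier);   // s modifier after the end delimiter
preg_match('/(?s).+/', $subject, $withInline);  // inline (?s) option
preg_match('/[\s\S]+/', $subject, $withClass);  // what [^] stands for in JavaScript

var_dump($plain[0] === 'f1:aaa');         // only the first line
var_dump($withModifier[0] === $subject);  // the whole subject
```

The [\s\S] form is handy when the same pattern has to run in both flavours.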
But for your particular case this pattern seems to be more appropriate:
preg_match_all('~((?>[^rn\\\:]++|(?<!\\\)[rn])+):([^\\\]++)~', $subject, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    echo $match[1].' '.$match[2].'<br/>';
}
pattern explanation:
~ # pattern delimiter
( # open the first capturing group
(?> # open an atomic group
[^rn\\\:]++ # all characters that are not "r", "n", "\" or ":"
| # OR
(?<!\\\)[rn] # "r" or "n" not preceded by "\"
)+ # close the atomic group and repeat one or more times
) # close the first capturing group
:
( # open the second capturing group
[^\\\]++ # all characters except "\" one or more times
) # close the second capturing group
~
Notices:
To match a literal \ (backslash) from a pattern written in a single-quoted PHP string, the backslash has to be escaped twice, once for the string and once for the regex: \\\ (or \\\\)
The principle of this pattern is to use negated character classes and negative assertions; in other words, it looks for what the desired substrings cannot be.
The above pattern uses atomic groups (?>...) and possessive quantifiers ++ in place of non-capturing groups (?:...) and simple quantifiers +. The meaning is the same, except that the regex engine can't go back to try other ways when it fails after an atomic group or a possessive quantifier, since it doesn't record backtrack positions. You can gain performance with this kind of feature.
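To see the "can't go back" part in isolation, here is a small sketch on a made-up subject. It is not the pattern from the answer, only the smallest case I could think of where backtracking changes the result:

```php
<?php
$subject = '12345';

// Greedy: \d+ first takes "12345", then gives back the final "5"
// so that the literal 5 can match.
$greedy = preg_match('/\d+5/', $subject);

// Possessive: \d++ keeps everything it took, so nothing is left for the literal 5.
$possessive = preg_match('/\d++5/', $subject);

// Atomic group: behaves like the possessive quantifier here.
$atomic = preg_match('/(?>\d+)5/', $subject);

var_dump($greedy, $possessive, $atomic);
```

Only the greedy version matches. In the pattern above this difference never shows up, because the characters that follow each possessive part are excluded from its character class anyway, which is why the faster form is safe to use there.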

Try with:
/([a-z0-9]+):([a-z0-9]+)(?:\r\n)?/
or
/(\w+):(\w+)(?:\r\n)?/

I think you need:
/([^:]+):([^\\r\\n]+)/
//__^ note the colon
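A quick way to try this in PHP. This is only a sketch, assuming the subject holds real CR LF bytes rather than literal backslashes. I have also excluded \r and \n from the first class, which is my own addition, so that a key cannot start on the previous line break:

```php
<?php
// Assumption: real CR LF line breaks (double-quoted PHP string).
$subject = "f1:aaa\r\nf2:bbb\r\nf3:ccc\r\nf4:ddd";

// Key: anything except colon and line breaks. Value: anything except line breaks.
preg_match_all('/([^:\r\n]+):([^\r\n]+)/', $subject, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match[1], ' & ', $match[2], PHP_EOL;
}
```

Note that preg_match only returns the first pair; preg_match_all is needed to get all four.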

Related

regex to match of the occurrence for either "this" or "that" at least twice in a sentence

I want to create a regex in PHP that searches a text for the sentences which contain "this" or "that" at least twice (so at least twice "this" or at least twice "that").
We got stuck at:
([^.?!]*(\bthis|that\b){2,}[^.?!]*[.|!|?]+)
Use this pattern: (\b(?:this|that)\b).*?\1
( # Capturing Group (1)
\b # <word boundary>
(?: # Non Capturing Group
this # "this"
| # OR
that # "that"
) # End of Non Capturing Group
\b # <word boundary>
) # End of Capturing Group (1)
. # Any character except line break
*? # (zero or more)(lazy)
\1 # Back reference to group (1)
This is mostly Wiktor's pattern with a deviation to isolate the sentences and omit the leading white-space characters from the fullstring matches.
Pattern: /\b[^.?!]*\b(th(?:is|at))\b[^.?!]*(\b\1\b)[^.?!]*\b[.!?]/i
Here is a sample text that will demonstrate how the other answers will not correctly disqualify unwanted matches for "word boundary" or "case-insensitive" reasons: (Demo - capture group applied to \b\1\b in the demo to show which substrings are qualifying the sentences for matching)
This is nothing.
That is what that will be.
The Indian policeman hit the thief with his lathis before pushing him into the thistles.
This Indian policeman hit the thief with this lathis before pushing him into the thistles. This is that and that.
The Indian policeman hit the thief with this lathis before pushing him into the thistles.
To see the official breakdown of the pattern, refer to the demo link.
In plain terms:
/ #start of pattern
\b #match start of a sentence on a "word character"
[^.?!]* #match zero or more characters not a dot, question mark, or exclamation
\b(th(?:is|at))\b #match whole word "this" or "that" (not thistle)
[^.?!]* #match zero or more characters not a dot, question mark, or exclamation
\b\1\b #match the earlier captured whole word "this" or "that"
[^.?!]* #match zero or more characters not a dot, question mark, or exclamation
\b #match second last character of sentence as "word character"
[.!?] #match the end of a sentence: dot, question mark, exclamation
/ #end of pattern
i #make pattern case-insensitive
The pattern will match three of the six sentences from the above sample text:
That is what that will be.
This Indian policeman hit the thief with this lathis before pushing him into the thistles.
This is that and that.
*note, previously I was using \s*\K at the start of my pattern to omit the white-space characters. I've elected to alter my pattern to use additional word boundary meta-characters for improved efficiency. If this doesn't work with your project text, it may be better to revert to my original pattern.
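For anyone who wants to reproduce this locally, here is a sketch that runs the pattern over the sample text with preg_match_all. The way the text is assembled and the short comparison sentence at the end are my own choices:

```php
<?php
// Sample text from above, joined with line feeds.
$text = implode("\n", [
    'This is nothing.',
    'That is what that will be.',
    'The Indian policeman hit the thief with his lathis before pushing him into the thistles.',
    'This Indian policeman hit the thief with this lathis before pushing him into the thistles. This is that and that.',
    'The Indian policeman hit the thief with this lathis before pushing him into the thistles.',
]);

$pattern = '/\b[^.?!]*\b(th(?:is|at))\b[^.?!]*(\b\1\b)[^.?!]*\b[.!?]/i';
preg_match_all($pattern, $text, $matches);

// $matches[0] holds the qualifying sentences without leading white-space.
print_r($matches[0]);

// For comparison: without \b around the back reference, the "this" inside
// "lathis" counts as the second occurrence.
$loose = preg_match('/(\b(?:this|that)\b).*?\1/', 'He hit the thief with this lathis.');
var_dump($loose);
```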
Use this
.*(this|that).*(this|that).*
http://regexr.com/3ggq5
UPDATE:
This is another way, based on your regex:
.*(this\s?|that\s?){2,}.*[\.\n]*
http://regexr.com/3ggq8

How to simplify this regex to avoid recursion?

Regex:
(?|`(?>[^`\\]|\\.|``)*`|'(?>[^'\\]|\\.|'')*'|"(?>[^"\\]|\\.|"")*"|(\?{1,2})|(:{1,2})([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*))
Example input:
INSERT INTO xyz WHERE
a=?
and b="what?"
and ??="cheese"
and `col?`='OK'
and ::col='another'
and last!=:least
https://regex101.com/r/HnTVXx/6
It should match ?, ??, :xyz and ::xyz but not if they are inside of a backquoted-string, double-quoted string, or single-quoted string.
When I try running this in PHP with a very large input I get PREG_RECURSION_LIMIT_ERROR from preg_last_error().
How can I simplify this regex pattern so that it doesn't do so much recursion?
Here's some test code that shows the error in PHP using Niet's optimized regex: https://3v4l.org/GdtmP Error code 6 is PREG_JIT_STACKLIMIT_ERROR. The other one I've seen is 3=PREG_RECURSION_LIMIT_ERROR
The general idea of "match this thing, but not in this condition" can be achieved with this pattern:
(don't match this(*SKIP)(*FAIL)|match this)
In your case, you'd want something like...
(
(['"`]) # capture this quote character
(?:\\.|(?!\1).)*+ # any escaped character, or
# any character that isn't the captured one
\1 # the captured quote again
(*SKIP)(*FAIL) # ignore this
|
\?\?? # one or two question marks
|
::?\w+ # word characters marked with one or two colons
)x
https://regex101.com/r/HnTVXx/7
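In PHP this could be used roughly as follows. This is a sketch, assuming ~ as the delimiter and nowdoc strings so that nothing needs extra escaping; the input is the example from the question:

```php
<?php
$sql = <<<'SQL'
INSERT INTO xyz WHERE
a=?
and b="what?"
and ??="cheese"
and `col?`='OK'
and ::col='another'
and last!=:least
SQL;

$pattern = <<<'REGEX'
~
(['"`])              # capture the opening quote character
(?:\\.|(?!\1).)*+    # escaped characters, or anything that is not that quote
\1                   # the same quote again
(*SKIP)(*FAIL)       # throw the quoted string away and carry on after it
|
\?\??                # one or two question marks
|
::?\w+               # word characters marked with one or two colons
~x
REGEX;

preg_match_all($pattern, $sql, $matches);
print_r($matches[0]);   // the placeholders found outside of quoted strings
```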
Same idea to skip quoted parts (the (*SKIP)(*F) combo), but also 2 techniques to reduce the regex engine work:
the first character discrimination
the unrolled pattern
These 2 techniques have something in common: limiting the cost of alternations.
The first character discrimination is useful when your pattern starts with an alternation. The problem with an alternation at the beginning is that every branch has to be tested before a position where the pattern fails can be rejected. Since most of the time there are many failing positions in a string, discarding them quickly is a significant improvement.
For instance, something like: "...|'...|`...|:... can also be written like this:
(?=["'`:])(?:"...|'...|`...|:...)
or
["'`:](?:(?<=")...|(?<=')...|(?<=`)...|(?<=:)...)
This way, each position that doesn't start with one of these characters ["'`:] is immediately rejected by the first token, without testing each branch.
The unrolled pattern consists of rewriting something like: " (?:[^"\\]|\\.)* " into:
" [^"\\]* (?: \\. [^"\\]* )* "
Note that this design eliminates the alternation and drastically reduces the number of steps.
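A small sketch to check that both forms find the same strings. The subject is made up, and the step counts themselves are not visible from PHP, so this only compares the results:

```php
<?php
// Hypothetical subject with escaped quotes inside a double-quoted part.
$subject = 'say "a \"quoted\" word" and "b"';

// Alternation inside the repeated group.
$basic = '~"(?:[^"\\\\]|\\\\.)*"~s';

// Unrolled form: normal* (special normal*)*
$unrolled = '~"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"~s';

preg_match_all($basic, $subject, $a);
preg_match_all($unrolled, $subject, $b);

var_dump($a[0] === $b[0]);  // same matches from both patterns
print_r($b[0]);
```

The four backslashes are needed because the patterns sit in single-quoted PHP strings: the regex engine receives \\, which matches one literal backslash.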
Using these 2 techniques, your pattern can be written like this:
~
[`'"?:]
(?:
(?<=`) [^`\\]*+ (?s:\\.[^`\\]*|``[^`\\]*)*+ ` (*SKIP) (*F)
|
(?<=') [^'\\]*+ (?s:\\.[^'\\]*|''[^'\\]*)*+ ' (*SKIP) (*F)
|
(?<=") [^"\\]*+ (?s:\\.[^"\\]*|""[^"\\]*)*+ " (*SKIP) (*F)
|
(?<=\?) \??
|
(?<=:) :? ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)
)
~x
Another way: instead of using an alternation (improved or not) at the beginning, you can build a pattern that matches all the string with contiguous results. The general design is:
\G (all I don't want) (*SKIP) \K (what I am looking for)
\G is an anchor that matches either the position after the previous result or the start of the string. Starting a pattern with it ensures that all the matches are contiguous. In this situation (\G at the very beginning of the pattern and applying to the whole pattern), you can also replace it with the A modifier.
That gives:
~
[^`'"?:]*
(?:
` [^`\\]*+ (?s:\\.[^`\\]*|``[^`\\]*)*+ ` [^`'"?:]*
|
' [^'\\]*+ (?s:\\.[^'\\]*|''[^'\\]*)*+ ' [^`'"?:]*
|
" [^"\\]*+ (?s:\\.[^"\\]*|""[^"\\]*)*+ " [^`'"?:]*
)*
\K # only the part of the match after this position is returned
(*SKIP) # if the next subpattern fails, the contiguity is broken at this position
(?:
\?{1,2}
|
:{1,2} ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)
)
~Ax
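A sketch of how the anchored version can be run from PHP against the example input from the question. The nowdoc strings and variable names are my own; the pattern is the one above with its comments left out:

```php
<?php
$sql = <<<'SQL'
INSERT INTO xyz WHERE
a=?
and b="what?"
and ??="cheese"
and `col?`='OK'
and ::col='another'
and last!=:least
SQL;

$pattern = <<<'REGEX'
~
[^`'"?:]*
(?:
  ` [^`\\]*+ (?s:\\.[^`\\]*|``[^`\\]*)*+ ` [^`'"?:]*
|
  ' [^'\\]*+ (?s:\\.[^'\\]*|''[^'\\]*)*+ ' [^`'"?:]*
|
  " [^"\\]*+ (?s:\\.[^"\\]*|""[^"\\]*)*+ " [^`'"?:]*
)*
\K
(*SKIP)
(?:
  \?{1,2}
|
  :{1,2} ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)
)
~Ax
REGEX;

preg_match_all($pattern, $sql, $matches);
print_r($matches[0]);  // placeholders, thanks to \K without the skipped prefix
print_r($matches[1]);  // names of the colon placeholders (empty for ? and ??)
```

Each match attempt starts exactly where the previous one ended, so the engine never retries positions inside text it has already walked over.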

regex matches numbers, but not letters

I have a string that looks like this:
[if-abc] 12345 [if-def] 67890 [/if][/if]
I have the following regex:
/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s
This matches the inner brackets just like I want it to. However, when I replace the 67890 with text (ie. abcdef), it doesn't match it.
[if-abc] 12345 [if-def] abcdef [/if][/if]
I want to be able to match ANY characters, including line breaks, except for another opening bracket [if-.
This part doesn't work like you think it does:
[^\[if]
This will match a single character that is none of [, i or f, regardless of the combination. You can mimic the desired behavior using a negative lookahead though:
~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s
I've also included closing tags in the lookahead, as this avoids the ungreedy repetition (which is usually worse performance-wise). Plus, I've changed the delimiters, so that you don't have to escape the slash in the pattern.
So this is the interesting part ((?:(?!\[/?if).)*) explained:
( # capture the contents of the tag-pair
(?: # start a non-capturing group (the ?: are just a performance
# optimization). this group represents a single "allowed" character
(?! # negative lookahead - makes sure that the next character does not mark
# the start of either [if or [/if (the negative lookahead will cause
# the entire pattern to fail if its contents match)
\[/?if
# match [if or [/if
) # end of lookahead
. # consume/match any single character
)* # end of group - repeat 0 or more times
) # end of capturing group
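To compare the two behaviours side by side, here is a short sketch using the string from the question (the variable names are mine):

```php
<?php
$subject = '[if-abc] 12345 [if-def] abcdef [/if][/if]';

// Original attempt: [^\[if] rejects every "[", "i" and "f",
// so the "f" in "abcdef" stops the match.
$original = preg_match('/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s', $subject);

// Lookahead version: a character is only consumed if neither "[if" nor "[/if" starts there.
$fixed = preg_match('~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s', $subject, $m);

var_dump($original);  // the original pattern finds nothing here
print_r($m);          // full match, tag name and contents of the innermost pair
```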
Modifying a little results in:
/\[if-([a-z0-9-]+)\](.+?)(?=\[if)/s
Running it on [if-abc] 12345 [if-def] abcdef [/if][/if]
Results in a first match as: [if-abc] 12345
Your groups are: abc and 12345
And modifying even further:
/\[if-([a-z0-9-]+)\](.+?)(?=(?:\[\/?if))/s
matches both groups. Although the delimiter [/if] is not captured by either of these.
NOTE: Instead of matching the delimiters I used a lookahead ((?=)) in the regex to stop when the text ahead matches the lookahead.
Use a period to match any character.

How to evaluate constraints using regular expressions? (php, regex)

So, let's say I want to accept strings as follows
SomeColumn IN||<||>||= [123, 'hello', "wassup"]||123||'hello'||"yay!"
For example: MyValue IN ['value', 123] or MyInt > 123 -> I think you get the idea. Now, what's bothering me is how to phrase this in a regex? I'm using PHP, and this is what I'm doing right now:
$temp = explode(';', $constraints);
$matches = array();
foreach ($temp as $condition) {
    preg_match('/(.+)[\t| ]+(IN|<|=|>|!)[\t| ]+([0-9]+|[.+]|.+)/', $condition, $matches[]);
}
foreach ($matches as $match) {
    if ($match[2] == 'IN') {
        preg_match('/(?:([0-9]+|".+"|\'.+\'))/', substr($match[3], 1, -1), $tempm);
        print_r($tempm);
    }
}
Really appreciate any help right there, my regex'ing is horrible.
I assume your input looks similar to this:
$string = 'SomeColumn IN [123, \'hello\', "wassup"];SomeColumn < 123;SomeColumn = \'hello\';SomeColumn > 123;SomeColumn = "yay!";SomeColumn = [123, \'hello\', "wassup"]';
If you use preg_match_all there is no need for explode or to build the matches yourself. Note that the resulting two-dimensional array will have its dimensions switched, but that is often desirable. Here is the code:
preg_match_all('/(\w+)[\t ]+(IN|<|>|=|!)[\t ]+((\'[^\']*\'|"[^"]*"|\d+)|\[[\t ]*(?4)(?:[\t ]*,[\t ]*(?4))*[\t ]*\])/', $string, $matches);
$statements = $matches[0];
$columns = $matches[1];
$operators = $matches[2];
$values = $matches[3];
There will also be a $matches[4] but it does not really have a meaning and is only used inside the regular expression. First, a few things you did wrong in your attempt:
(.+) will consume as much as possible, and any character. So if you have something inside a string value that looks like IN 13 then your first repetition might consume everything until there and return it as the column. It also allows whitespace and = inside column names. There are two ways around this. Either making the repetition "ungreedy" by appending ? or, even better, restrict the allowed characters, so you cannot go past the desired delimiter. In my regex I only allow letters, digits and underscores (\w) for column identifiers.
[\t| ] this mixes up two concepts: alternation and character classes. What this does is "match a tab, a pipe or a space". In character classes you simply write all characters without delimiting them. Alternatively you could have written (\t| ) which would be equivalent in this case.
[.+] I don't know what you were trying to accomplish with this, but it matches either a literal . or a literal +. And again it might be useful to restrict the allowed characters, and to check for correct matching of quotes (to avoid 'some string")
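If you want to convince yourself of the last two points, a tiny sketch with made-up subjects shows what those character classes really accept:

```php
<?php
// [\t| ] is a character class: tab, pipe and space are all accepted.
$pipeAccepted = preg_match('/^a[\t| ]b$/', 'a|b');

// Without the pipe inside the class, only tab and space remain.
$pipeRejected = preg_match('/^a[\t ]b$/', 'a|b');

// [.+] matches exactly one literal dot or plus sign.
$literalPlus = preg_match('/^[.+]$/', '+');
$anyText     = preg_match('/^[.+]$/', 'ab');

var_dump($pipeAccepted, $pipeRejected, $literalPlus, $anyText);
```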
Now for an explanation of my own regex (you can copy this into your code, as well, it will work just fine; plus you have the explanation as comments in your code):
preg_match_all('/
(\w+) # match an identifier and capture in $1
[\t ]+ # one or more tabs or spaces
(IN|<|>|=|!) # the operator (capture in $2)
[\t ]+ # one or more tabs or spaces
( # start of capturing group $3 (the value)
( # start of subpattern for single-valued literals (capturing group $4)
\' # literal quote
[^\']* # arbitrarily many non-quote characters, to avoid going past the end of the string
\' # literal quote
| # OR
"[^"]*" # equivalent for double-quotes
| # OR
\d+ # a number
) # end of subpattern for single-valued literals
| # OR (arrays follow)
\[ # literal [
[\t ]* # zero or more tabs or spaces
(?4) # reuse subpattern no. 4 (any single-valued literal)
(?: # start non-capturing subpattern for further array elements
[\t ]* # zero or more tabs or spaces
, # a literal comma
[\t ]* # zero or more tabs or spaces
(?4) # reuse subpattern no. 4 (any single-valued literal)
)* # end of additional array element; repeat zero or more times
[\t ]* # zero or more tabs or spaces
\] # literal ]
) # end of capturing group $3
/',
$string,
$matches);
This makes use of PCRE's recursion feature where you can reuse a subpattern (or the whole regular expression) with (?n) (where n is just the number you would also use for a backreference).
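The difference to a backreference is easy to miss, so here is a minimal sketch with made-up subjects: (?1) runs the subpattern of group 1 again, while \1 only matches the text that group 1 actually captured.

```php
<?php
// (?1) reuses the subpattern \d+, so any digits may follow the dash.
$reuse = preg_match('/^(\d+)-(?1)$/', '12-345');

// \1 must repeat the captured text "12" literally.
$backrefDifferent = preg_match('/^(\d+)-\1$/', '12-345');
$backrefSame      = preg_match('/^(\d+)-\1$/', '12-12');

var_dump($reuse, $backrefDifferent, $backrefSame);
```

That is why (?4) in the array part accepts a different literal for every element.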
I can think of three major things that could be improved with this regex:
It does not allow for floating-point numbers
It does not allow for escaped quotes (if your value is 'don\'t do this', I would only capture 'don\'). This can be solved using a negative lookbehind.
It does not allow for empty arrays as values (this could be easily solved by wrapping all parameters in a subpattern and making it optional with ?)
I included none of these, because I was not sure whether they apply to your problem, and I thought the regex was already complex enough to present here.
Usually regular expressions are not powerful enough to do proper language parsing anyway. It is generally better to write your own parser.
And since you said your regex'ing is horrible... while regular expressions seem like a lot of black magic due to their uncommon syntax, they are not that hard to understand, if you take the time once to get your head around their basic concepts. I can recommend this tutorial. It really takes you all the way through!

How does this PCRE pattern detect palindromes?

This question is an educational demonstration of the usage of lookahead, nested reference, and conditionals in a PCRE pattern to match ALL palindromes, including the ones that can't be matched by the recursive pattern given in the PCRE man page.
Examine this PCRE pattern in PHP snippet:
$palindrome = '/(?x)
^
(?:
(.) (?=
.*
(
\1
(?(2) \2 | )
)
$
)
)*
.?
\2?
$
/';
This pattern seems to detect palindromes, as seen in these test cases (see also on ideone.com):
$tests = array(
# palindromes
'',
'a',
'aa',
'aaa',
'aba',
'aaaa',
'abba',
'aaaaa',
'abcba',
'ababa',
# non-palindromes
'aab',
'abab',
'xyz',
);
foreach ($tests as $test) {
echo sprintf("%s '%s'\n", preg_match($palindrome, $test), $test);
}
So how does this pattern work?
Notes
This pattern uses a nested reference, a technique similar to the one used in How does this Java regex detect palindromes?, but unlike that Java pattern, there's no lookbehind (but it does use a conditional).
Also, note that the PCRE man page presents a recursive pattern to match some palindromes:
# the recursive pattern to detect some palindromes from PCRE man page
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
The man page warns that this recursive pattern can NOT detect all palindromes (see: Why will this recursive regex only match when a character repeats 2n - 1 times? and also on ideone.com), but the nested reference/positive lookahead pattern presented in this question can.
Let's try to understand the regex by constructing it. Firstly, a palindrome must start and end with the same sequence of characters in the opposite direction:
^(.)(.)(.) ... \3\2\1$
We want to rewrite this such that the ... is only followed by a finite length of patterns, so that it could be possible for us to convert it into a *. This is possible with a lookahead:
^(.)(?=.*\1$)
(.)(?=.*\2\1$)
(.)(?=.*\3\2\1$) ...
but there are still uncommon parts. What if we can "record" the previously captured groups? If it is possible we could rewrite it as:
^(.)(?=.*(?<record>\1\k<record>)$) # \1 = \1 + (empty)
(.)(?=.*(?<record>\2\k<record>)$) # \2\1 = \2 + \1
(.)(?=.*(?<record>\3\k<record>)$) # \3\2\1 = \3 + \2\1
...
which could be converted into
^(?:
(.)(?=.*(\1\2)$)
)*
Almost good, except that \2 (the recorded capture) is not empty initially. It will just fail to match anything. We need it to match empty if the recorded capture doesn't exist. This is how the conditional expression creeps in.
(?(2)\2|) # matches \2 if it exists, empty otherwise.
so our expression becomes
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*
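The effect of the conditional can be checked directly in PHP. This is a sketch on the sample word abba; the variable names are mine:

```php
<?php
// Without the conditional: \2 is never set, so the lookahead always fails
// and the group repeats zero times (an empty match at the start).
preg_match('/^(?:(.)(?=.*(\1\2)$))*/', 'abba', $without);

// With the conditional: \2 grows by one character per iteration.
preg_match('/^(?:(.)(?=.*(\1(?(2)\2|))$))*/', 'abba', $with);

var_dump($without[0]);  // empty string
var_dump($with[0]);     // the first half that was consumed
var_dump($with[2]);     // the recorded second half
```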
Now it matches the first half of the palindrome. How about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it in the end.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*\2$
We want to take care of odd-length palindrome as well. There would be a free character between the 1st and 2nd half.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2$
This works well except in one case: when there is only 1 character. This is again because \2 is unset and so cannot match. So
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2?$
# ^ since \2 must be at the end in the look-ahead anyway.
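As a sanity check, here is a sketch that runs the final expression, written on one line, against the test words from the question. The expected values are simply "1 for palindromes, 0 otherwise":

```php
<?php
$palindrome = '/^(?:(.)(?=.*(\1(?(2)\2|))$))*.?\2?$/';

// Test words from the question, mapped to the expected preg_match() result.
$expected = [
    '' => 1, 'a' => 1, 'aa' => 1, 'aaa' => 1, 'aba' => 1,
    'aaaa' => 1, 'abba' => 1, 'aaaaa' => 1, 'abcba' => 1, 'ababa' => 1,
    'aab' => 0, 'abab' => 0, 'xyz' => 0,
];

$failures = [];
foreach ($expected as $word => $want) {
    $got = preg_match($palindrome, (string) $word);
    if ($got !== $want) {
        $failures[] = $word;
    }
    printf("%d '%s'\n", $got, $word);
}

var_dump($failures === []);  // true if every word behaves as expected
```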
I want to bring my very own solution to the table.
This is a regex that I wrote a while ago to solve matching palindromes using PCRE/PCRE2
^((\w)(((\w)(?5)\5?)*|(?1)|\w?)\2)$
Example:
https://regex101.com/r/xvZ1H0/1
