I'm trying to match a group that only matches when the first non-spacing character preceding the match is NOT an alphanumeric character.
The regex I've tried consumes the spaces first with \s* and then looks behind to check for \w:
(?<!\w)\s*\({\w+}\)
Success
Input: this will = ({match})
Expected: ({match})
Actual: ({match})
Failure: it still matches when preceded by an alphanumeric character (ignoring spaces)
Input: this should = not ({match})
Expected: -
Actual: ({match})
Using \s+ instead of \s* solved it partially, but now it requires at least one space, which is not desired:
(?<!\w)\s+\({\w+}\)
I've been looking around the internet but cannot solve the problem. Anyone?
Use this solution (a mix of the suggestions from @horcrux and @Wiktor Stribizew):
<?php
$regex = '/(?<![\w\s])\s*(\({\w+}\))/';
$string = 'this will = ({match})
this should = not ({match})';
preg_match_all($regex, $string, $matches);
var_dump($matches[1]);
?>
See regex proof.
Results:
array(1) {
[0]=>
string(9) "({match})"
}
See PHP proof.
EXPLANATION
--------------------------------------------------------------------------------
(?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
  [\w\s]                   any character of: word characters (a-z,
                           A-Z, 0-9, _), whitespace (\n, \r, \t,
                           \f, and " ")
--------------------------------------------------------------------------------
)                        end of look-behind
--------------------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
--------------------------------------------------------------------------------
(                        group and capture to \1:
--------------------------------------------------------------------------------
  \(                       '('
--------------------------------------------------------------------------------
  {                        '{'
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  }                        '}'
--------------------------------------------------------------------------------
  \)                       ')'
--------------------------------------------------------------------------------
)                        end of \1
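To see why adding \s to the look-behind class matters, here is a minimal sketch that runs the question's pattern and the pattern from this answer against the two sample inputs. The helper name findTokens is made up for illustration. With the original pattern the engine can simply start the match at "(" (with \s* matching nothing), where the preceding character is a space and therefore not \w.

```php
<?php
// Hypothetical helper: returns the "({...})" tokens found, or [] if none.
function findTokens(string $pattern, string $input): array
{
    preg_match_all($pattern, $input, $matches);
    // Use group 1 when the pattern defines one, otherwise the whole match.
    return $matches[1] ?? $matches[0];
}

$original = '/(?<!\w)\s*\({\w+}\)/';        // pattern from the question
$fixed    = '/(?<![\w\s])\s*(\({\w+}\))/';  // pattern from this answer

$ok  = 'this will = ({match})';
$bad = 'this should = not ({match})';

var_dump(findTokens($original, $bad)); // still finds "({match})"
var_dump(findTokens($fixed, $ok));     // finds "({match})"
var_dump(findTokens($fixed, $bad));    // finds nothing
```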
If there has to be a first non-spacing character present, you could match it and use \K to clear the match buffer.
[^\w\s]\h*\K\({\w+}\)
The pattern matches
[^\w\s] Match a single char other than a word char or whitespace char
\h*\K Match 0+ horizontal whitespace chars and forget what is matched so far
\({\w+}\) Match 1+ word chars between ({ and })
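As a rough comparison, the sketch below runs the \K pattern next to the look-behind pattern from the other answer. The helper name firstToken is an assumption for illustration. The main difference shows up when the token sits at the very start of the string: the \K version needs a real preceding character, so it does not match there, while the negative look-behind does.

```php
<?php
// Hypothetical helper: returns the first token matched, or null if none.
function firstToken(string $pattern, string $input): ?string
{
    if (preg_match($pattern, $input, $m) !== 1) {
        return null;
    }
    // Group 1 if the pattern has one, otherwise the (post-\K) whole match.
    return $m[1] ?? $m[0];
}

$withK      = '/[^\w\s]\h*\K\({\w+}\)/';
$lookbehind = '/(?<![\w\s])\s*(\({\w+}\))/';

var_dump(firstToken($withK, 'this will = ({match})'));       // "({match})"
var_dump(firstToken($withK, 'this should = not ({match})')); // NULL
var_dump(firstToken($withK, '({match})'));                   // NULL: nothing precedes it
var_dump(firstToken($lookbehind, '({match})'));              // "({match})"
```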
stackers!
I have been trying to figure this out for some time but no luck.
(.*?(?:\.|\?|!))(?: |$)
The above pattern captures and splits all sentences in a paragraph at their ending punctuation.
Example:
1. Today is the greatest. You are the greatest.
The match comes back with three results:
Match {
1.
Today is the greatest.
You are the greatest.
}
However, I am trying to get it not to break after a number followed by a period, and would like to see the following match instead:
Match {
1.Today is the greatest.
You are the greatest.
}
Thanks for your help in advance
Use
.*?[.?!](?=(?<!\d\.)\s+|\s*$)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
.*?                      any character except \n (0 or more times
                         (matching the least amount possible))
--------------------------------------------------------------------------------
[.?!]                    any character of: '.', '?', '!'
--------------------------------------------------------------------------------
(?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                           or more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
|                        OR
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                           or more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of
                           the string
--------------------------------------------------------------------------------
)                        end of look-ahead
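A minimal PHP sketch of how this pattern could be applied, assuming the input is a single line that starts with a numbered item. The function name splitSentences is made up for illustration. Because the whitespace between sentences is only looked at and not consumed, the second match starts with a space, so the results are trimmed.

```php
<?php
// Hypothetical helper: returns each sentence, keeping "1." attached to the
// sentence that follows it.
function splitSentences(string $text): array
{
    preg_match_all('/.*?[.?!](?=(?<!\d\.)\s+|\s*$)/', $text, $m);
    // Later matches begin with the separating whitespace, so trim them.
    return array_map('trim', $m[0]);
}

print_r(splitSentences('1. Today is the greatest. You are the greatest.'));
// Expected: "1. Today is the greatest." and "You are the greatest."
```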
I am developing a "word filter" class in PHP that, among other things, needs to catch purposely misspelled words. These words are entered by the user as a sentence. Here is a simple example of a sentence entered by a user:
I want a coke, sex, drugs and rock'n'roll
The above example is a common phrase written correctly. My class will find the suspect words sex and drugs and everything will be fine.
But I suppose that the user will try to hinder detection by writing things a little differently. In fact, there are many ways to write the same word so that it is still readable to certain people. For example, the word sex may be written as s3x or 5ex or 53x or s e x or s 3 x or s33x or 5533xxx or ss 33 xxx and so on.
I know the basics of regular expressions and tried the pattern below:
/(\b[\w][\w .'-]+[\w]\b)/g
Because of
\b word boundary
[\w] The word can start with one letter or one digit...
[\w .'-] ... followed by any letter, digit, space, dot, quotes or dash...
+ ... one or more times...
[\w] ... ending with one letter or one digit.
\b word boundary
This works partially.
If the sample phrase was written as I want a coke, 5 3 x, druuu95 and r0ck'n'r011 I get 3 matches:
I want a coke
5 3 x
druuu95 and r0ck'n'r011
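The three matches above can be reproduced with a short PHP sketch. It assumes preg_match_all in place of the /g flag, which PHP does not accept as a modifier. Because a space is allowed inside the middle character class, the greedy + runs across word gaps until it hits a comma, which is why whole phrases come back instead of single words.

```php
<?php
// The pattern from the question; the quote is escaped for the PHP string only.
$pattern = '/\b[\w][\w .\'-]+[\w]\b/';
$phrase  = "I want a coke, 5 3 x, druuu95 and r0ck'n'r011";

// preg_match_all finds every match, so no "g" modifier is needed.
preg_match_all($pattern, $phrase, $matches);

print_r($matches[0]);
// Expected: "I want a coke", "5 3 x", "druuu95 and r0ck'n'r011"
```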
What I need is 8 matches
I
want
a
coke
5 3 x
druuu95
and
r0ck'n'r011
In short, I need a regular expression that gives me each word of a sentence, even if the word begins with a digit, contains a variable number of digits, spaces, dots, dashes and quotes, and ends with a letter or digit.
Any help will be appreciated.
Description
Typically good words are 2 or more letters long (with the exception of I and a) and do not contain numbers. This expression isn't flawless, but does help illustrate why doing this type of language matching is absurdly difficult because it's an arms race between creative people trying to express themselves without getting caught, and the development team who is trying to catch flaws.
(?:\s+|\A)[#'"[({]?(?!(?:[a-z]{2}\s+){3})(?:[a-zA-Z'-]{2,}|[ia]|i[nst]|o[fnr])[?!.,;:'")}\]]?(?=(?:\s|\Z))|((?:[a-z]{2}\s+){3}|.*?\b)
This regular expression will do the following:
find all acceptable words
find all the rest and store them in Capture Group 1
Example
Live Demo
https://regex101.com/r/cL2bN1/1
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
(?:                      group, but do not capture:
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                           or more times (matching the most amount
                           possible))
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
  \A                       the beginning of the string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
[#'"[({]?                any character of: '#', ''', '"', '[', '(',
                         '{' (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
(?!                      look ahead to see if there is not:
----------------------------------------------------------------------
  (?:                      group, but do not capture (3 times):
----------------------------------------------------------------------
    [a-z]{2}                 any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
    \s+                      whitespace (\n, \r, \t, \f, and " ")
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  ){3}                     end of grouping
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
(?:                      group, but do not capture:
----------------------------------------------------------------------
  [a-zA-Z'-]{2,}           any character of: 'a' to 'z', 'A' to
                           'Z', ''', '-' (at least 2 times
                           (matching the most amount possible))
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
  [ia]                     any character of: 'i', 'a'
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
  i                        'i'
----------------------------------------------------------------------
  [nst]                    any character of: 'n', 's', 't'
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
  o                        'o'
----------------------------------------------------------------------
  [fnr]                    any character of: 'f', 'n', 'r'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
[?!.,;:'")}\]]?          any character of: '?', '!', '.', ',', ';',
                         ':', ''', '"', ')', '}', '\]' (optional
                         (matching the most amount possible))
----------------------------------------------------------------------
(?=                      look ahead to see if there is:
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  |                        OR
----------------------------------------------------------------------
    \Z                       before an optional \n, and the end of
                             the string
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
(                        group and capture to \1:
----------------------------------------------------------------------
  (?:                      group, but do not capture (3 times):
----------------------------------------------------------------------
    [a-z]{2}                 any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
    \s+                      whitespace (\n, \r, \t, \f, and " ")
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  ){3}                     end of grouping
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w)
                           and something that is not a word char
----------------------------------------------------------------------
)                        end of \1
----------------------------------------------------------------------
I need to match the first occurrence of the following pattern: it starts with \s or (, then NIC, followed by any characters, followed by # or ., followed by 5 or 6 digits.
Regular expression used :
preg_match('/[\\s|(]NIC.*[#|.]\d{5,6}/i', trim($test), $matches1);
Example:
$test = "(NIC.123456"; // works correctly
$test = "(NIC.123456 oldnic#65703 checking"; // produces (NIC.123456 oldnic#65703
But it needs to be only (NIC.123456. What is the problem?
You need to add the ? quantifier for a non-greedy match. Here .* matches as much as possible.
You also don't need to double-escape \\s here; \s is enough. And you can simply list the allowed characters inside the character class instead of separating them with the pipe |, which is a literal character there.
Also note that your expression will match strings such as (NIC_CCC.123456. To avoid this, you can use a word boundary \b, which matches the boundary between a word character and a non-word character.
preg_match('/(?<=^|\s)\(nic\b.*?[#.]\d{5,6}/i', $test, $match);
Regular expression:
(?<=          look behind to see if there is:
  ^           the beginning of the string
 |            OR
  \s          whitespace (\n, \r, \t, \f, and " ")
)             end of look-behind
\(            '('
nic           'nic'
\b            the boundary between a word char (\w) and not a word char
.*?           any character except \n (0 or more times, matching the least amount possible)
[#.]          any character of: '#', '.'
\d{5,6}       digits (0-9) (between 5 and 6 times)
See live demo
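To make the greedy versus non-greedy difference concrete, here is a small sketch using the second test string from the question. The variable names are only for illustration. The first call keeps the greedy .* with the character classes cleaned up, the second adds the ?, and the third is the full expression suggested above.

```php
<?php
$test = '(NIC.123456 oldnic#65703 checking';

// Greedy: .* grabs as much as it can, then backs off to the LAST [#.]\d{5,6}.
preg_match('/[\s(]NIC.*[#.]\d{5,6}/i', $test, $greedy);

// Lazy: .*? stops at the FIRST place where [#.]\d{5,6} can match.
preg_match('/[\s(]NIC.*?[#.]\d{5,6}/i', $test, $lazy);

// Full suggestion with the look-behind and the word boundary.
preg_match('/(?<=^|\s)\(nic\b.*?[#.]\d{5,6}/i', $test, $answer);

var_dump($greedy[0]); // "(NIC.123456 oldnic#65703"
var_dump($lazy[0]);   // "(NIC.123456"
var_dump($answer[0]); // "(NIC.123456"
```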
Have you tried using
$test1 = explode(" ", $test);
and then using $test1[0] to display your result?
I use ((\d)(\d(?!\2))((?<!\3)\d(?!\3)))\1 to match a group of three digits, none of them the same, that is repeated once, such as:
234234, 345345, 359359, but not 211211 or 355355 (removing the lookbehind assertion makes these match).
The pattern raises an error when run with preg_match() in PHP, because a lookbehind must have a fixed length, but it works when tested in another debugger (I use Kodos in this case):
preg_match_all(): Compilation failed: lookbehind assertion is not fixed length at offset 23
Is there an alternative pattern that matches this kind of digit sequence, such as 245245 or any other digits that fit the ABCABC format?
If the 3 digits must all be different, you can use:
((\d)(?!.?\2)(\d)(?!\3)\d)\1
But if 545545 is allowed, you can use:
((\d)(?!\2)(\d)(?!\3)\d)\1
The problem is the lookbehind. This version turns it into a lookahead and seems to work for me on regex101:
((\d)(\d(?!\2))(?!\3)(\d(?!\3)))\1
Just use lookahead instead of lookbehind?
((\d)(?!\2)(\d)(?!\2|\3)\d)\1
Explained by Regex Explainer:
--------------------------------------------------------------------------------
(                        group and capture to \1:
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \2                       what was matched by capture \2
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \2                       what was matched by capture \2
--------------------------------------------------------------------------------
  |                        OR
--------------------------------------------------------------------------------
    \3                       what was matched by capture \3
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  \d                       digits (0-9)
--------------------------------------------------------------------------------
)                        end of \1
--------------------------------------------------------------------------------
\1                       what was matched by capture \1
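A quick way to check these lookahead-only patterns in PHP is sketched below. The function name isAbcAbc and the ^...$ anchors are assumptions added for whole-string testing; the patterns inside are the strict and the lenient variants from the answers above.

```php
<?php
// Hypothetical helper: true if $s has the form ABCABC.
// $allDistinct = true  requires A, B and C to all differ;
// $allDistinct = false only forbids equal neighbours, so 545545 passes.
function isAbcAbc(string $s, bool $allDistinct = true): bool
{
    $pattern = $allDistinct
        ? '/^((\d)(?!\2)(\d)(?!\2|\3)\d)\1$/'
        : '/^((\d)(?!\2)(\d)(?!\3)\d)\1$/';
    return preg_match($pattern, $s) === 1;
}

var_dump(isAbcAbc('234234'));        // true
var_dump(isAbcAbc('211211'));        // false
var_dump(isAbcAbc('355355'));        // false
var_dump(isAbcAbc('545545'));        // false
var_dump(isAbcAbc('545545', false)); // true
```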
I know this regex divides a text into sentences. Can someone help me understand how?
/(?<!\..)([\?\!\.])\s(?!.\.)/
You can use YAPE::Regex::Explain to decipher Perl regular expressions:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(?<!\..)([\?\!\.])\s(?!.\.)/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(?<!\..)([\?\!\.])\s(?!.\.))
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
----------------------------------------------------------------------
    \.                       '.'
----------------------------------------------------------------------
    .                        any character except \n
----------------------------------------------------------------------
  )                        end of look-behind
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [\?\!\.]                 any character of: '\?', '\!', '\.'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
    .                        any character except \n
----------------------------------------------------------------------
    \.                       '.'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
There is the Regular Expression Analyzer which will do quite the same as toolic already suggested - but completely webbased.
(?           # Find a group (don't capture)
<            # before the following regular expression
!            # that does not match
\.           # a literal "."
.            # followed by 1 character
)            # (End look-behind group)
(            # Start a group (capture it to $1)
[\?\!\.]     # Containing any one of the characters in the following set "?!."
)            # End group $1
\s           # followed by a whitespace character " ", \t, etc.
(?           # Followed by a group (don't capture)
             # after the preceding regular expression
!            # that does not have
.            # 1 character
\.           # followed by a literal "."
)            # (End look-ahead group)
The first part (?<!\..) is a negative look-behind. It specifies a pattern which invalidates the match. In this case it's looking for two characters--the first a period and the other one any character.
The second part is a standard capture/group, which could be better expressed: ([?!.]) (you don't need the escapes in the class brackets), that is a sentence ending punctuation character.
The next part is a single (??) white-space character: \s
And the last part is a negative look-ahead: (?!.\.). Again it is guarding against the case of a single character followed by a period.
This should work, relatively well. But I don't think I would recommend it. I don't see what the coder was getting at trying to make sure that just a period wasn't the second most recent character, or that it wasn't the second one to come.
I mean if you are looking to split on terminal punctuation, why don't you want to guard against the same class being two-back or two-ahead? Instead it relies on periods not being there. Thus a more regular expression would be:
/(?<![?!.].)([?!.])\s(?!.[?!.])/
Portions:
([\?\!\.])\s: split on an ending character (., ! or ?) that is followed by a whitespace character (space, tab, newline)
(?<!\..): where the two characters before this ending character aren't a . followed by any character
(?!.\.): where the whitespace character isn't followed by any character that is directly followed by a .
Those look-ahead ((?!) & look-behind ((?<!) assertions mainly seem to prevent splitting on (whitespaced?) abbreviations (q. e. d. etc.).
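As a rough illustration of that behaviour, here is a small PHP sketch using preg_split on a made-up sample sentence. Note that without the PREG_SPLIT_DELIM_CAPTURE flag the matched punctuation and whitespace are consumed as the delimiter, so the punctuation is dropped from every piece except the last.

```php
<?php
// Sample text is an assumption, chosen so that "e.g." must not cause a split.
$text = 'I like e.g. apples. Do you? Yes!';

// The look-behind rejects the "." after "g" because ".g" precedes it.
$sentences = preg_split('/(?<!\..)([?!.])\s(?!.\.)/', $text);

print_r($sentences);
// Expected pieces: "I like e.g. apples", "Do you", "Yes!"
```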