How does this regex divide text into sentences?

How does this regex divide text into sentences? - php

I know this regex divides a text into sentences. Can someone help me understand how?
/(?<!\..)([\?\!\.])\s(?!.\.)/

You can use YAPE::Regex::Explain to decipher Perl regular expressions:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(?<!\..)([\?\!\.])\s(?!.\.)/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(?<!\..)([\?\!\.])\s(?!.\.))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
(?<! look behind to see if there is not:
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
) end of look-behind
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\?\!\.] any character of: '\?', '\!', '\.'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

There is the Regular Expression Analyzer which will do quite the same as toolic already suggested - but completely webbased.

(? # Find a group (don't capture)
< # before the following regular expression
! # that does not match
\. # a literal "."
. # followed by 1 character
) # (End look-behind group)
( # Start a group (capture it to $1)
[\?\!\.] # Containing any one of the characters in the following set "?!."
) # End group $1
\s # followed by a whitespace character " ", \t, etc.
(? # Followed by a group (don't capture)
# after the preceding regular expression
! # that does not have
. # 1 character
\. # followed by a literal "."
) # (End look-ahead group)

The first part (?<!\..) is a negative look-behind. It specifies a pattern which invalidates the match. In this case it's looking for two characters--the first a period and the other one any character.
The second part is a standard capture/group, which could be better expressed: ([?!.]) (you don't need the escapes in the class brackets), that is a sentence ending punctuation character.
The next part is a single (??) white-space character: \s
And the last part is a negative look-ahead: (?!.\.). Again it is guarding against the case of a single character followed by a period.
This should work, relatively well. But I don't think I would recommend it. I don't see what the coder was getting at trying to make sure that just a period wasn't the second most recent character, or that it wasn't the second one to come.
I mean if you are looking to split on terminal punctuation, why don't you want to guard against the same class being two-back or two-ahead? Instead it relies on periods not being there. Thus a more regular expression would be:
/(?<![?!.].)([?!.])\s(?!.[?!.])/

Portions:
([\?\!\.])\s: split by ending character (.,!,or ?) which is followed by a whitespace character (space, tab, newline)
(?<!\..) where the characters before this 'ending character' arent a .+anything
(?!.\.) after the whitespace character any character directly followed by any . isn't allowed.
Those look-ahead ((?!) & look-behind ((?<!) assertions mainly seem to prevent splitting on (whitespaced?) abbreviations (q. e. d. etc.).

Related

PHP preg_match to catch all but some options

What regex would I need to use to achieve the following;
Check if string does not contain one or more options... I tried a lot expressions.
I think this is closest to be the correct one.
/^[^(256K)]$|^[^(2M)]$/
I would like preg_match to tell me if there is anything other than 256K or 2M, and I cant negate preg_match (!preg_match) for reasons that take to long to explain ;)

You can not place whole words or capturing groups inside of Character Classes. A character class matches any one character from a set of characters.
Your regular expression matches the beginning of the string, any character except: (, 2, 5, 6, K, ), followed by the end of the string, OR the beginning of the string, any character except: (, 2, M, ), followed by the end of string.
I believe you are wanting a Negative Lookahead here instead.
/^((?!256K|2M).)*$/i
Regular expression:
^ # the beginning of the string
( # group and capture to \1 (0 or more times)
(?! # look ahead to see if there is not:
256K # '256K'
| # OR
2M # '2M'
) # end of look-ahead
. # any character except \n
)* # end of \1
$ # before an optional \n, and the end of the string

Regex for validating numbers preceding with two dashes (hyphens)

I got stuck with regexp to validate only numbers from 1-10 that could have two dashes(hyphens) before, for example:
--9
or
--10
or
--1
but not
--11 or not --0
I tried like seems to me everything, example:
/(-\-\[1-10])/
What is wrong?
EDIT 1:
Thanks a lot for so many working examples!!
What if I also wanted to validate to numbers before all of this, example:
8--10 but not 0--10 or not 11--11
I tried this but it didn't work:
/--([1-9]|10:[1-9]|10)\b/
EDIT 2:
Oh, this one works, finally:
/^(10|[1-9])--(10|[1-9])$/

Have a try with:
/\b(?:[1-9]|10)--(?:[1-9]|10)\b/
Change according to OP's edit.
Explanation:
The regular expression:
(?-imsx:\b(?:[1-9]|10)--(?:[1-9]|10)\b)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
[1-9] any character of: '1' to '9'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
10 '10'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
-- '--'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
[1-9] any character of: '1' to '9'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
10 '10'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

I guess this will fit
/\-\-([1-9]|10)\b/
if you don't want to capture your number, add ?: :
/\-\-(?:[1-9]|10)\b/

Outside a character class, you don't need to escape hyphens. Also, your character class [1-10] will only match 1 and 0, because [1-10] is equal to [10] and that will only match 1 and 0. Try this regex:
/^--(10|[1-9])$/

The correct regex is
/\b--([1-9]|10)\b/
You're incorrectly escaping the first [ of your character class as \[. The character class used is incorrect as well. It would be treated as a character class with members 1 to 1 and a 0 i.e. [10] which means it matches either 0 or 1.
Also, the hyphens - don't need to be escaped outside a character class []. To validate the numbers that come before the hyphens as well use
/\b([1-9]|10)--([1-9]|10)\b/

When you write [1-10], it mean characters 1 to 1 + the 0 character. It as if you had write [0-1].
In fact, in your case, it would be better to test cases --1 to --9 and case --10 separately with something like : /^(--10)|(--[1-9])$/
You can test your regex on http://myregexp.com/

Regex to match uri after mod_rewrite

I'm looking for a PHP PCRE regex to match uri's that are rewritted with Apache's mod_rewrite module. The uri's are as follow :
/param1/param2/param3/param4
The rules for the uri
must contain at least one /;
the params must only allow letters, numbers, - and _;
there must be zero or more instances of the first two rules;

/\/[a-zA-Z0-9_\-\/]+$/
I am assuming that it must start with an / and something like this should not match /param1/param2/param3/param4*

How about:
if (preg_match("~^(?:/[\w-]+)+/?$~", $string)) {
# do stuff
}
Explanation:
The regular expression:
(?-imsx:^(?:/[\w-]+)+/?$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
/ '/'
----------------------------------------------------------------------
[\w-]+ any character of: word characters (a-z,
A-Z, 0-9, _), '-' (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
)+ end of grouping
----------------------------------------------------------------------
/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

Need help understanding preg_match regular expression

A regular expression in preg_match is given as /server\-([^\-\.\d]+)(\d+)/. Can someone help me understand what this means? I see that the string starts with server- but I dont get ([^\-\.\d]+)(\d+)'

[ ] -> Match anything inside the square brackets for ONE character position once and only once, for example, [12] means match the target to 1 and if that does not match then match the target to 2 while [0123456789] means match to any character in the range 0 to 9.
- -> The - (dash) inside square brackets is the 'range separator' and allows us to define a range, in our example above of [0123456789] we could rewrite it as [0-9].
You can define more than one range inside a list, for example, [0-9A-C] means check for 0 to 9 and A to C (but not a to c).
NOTE: To test for - inside brackets (as a literal) it must come first or last, that is, [-0-9] will test for - and 0 to 9.
^ -> The ^ (circumflex or caret) inside square brackets negates the expression (we will see an alternate use for the circumflex/caret outside square brackets later), for example, [^Ff] means anything except upper or lower case F and [^a-z] means everything except lower case a to z.
You can check more explanations about it in the source I got this information: http://www.zytrax.com/tech/web/regex.htm
And if u want to test, u can try this one: http://gskinner.com/RegExr/

Here's the explanation:
# server\-([^\-\.\d]+)(\d+)
#
# Match the characters “server” literally «server»
# Match the character “-” literally «\-»
# Match the regular expression below and capture its match into backreference number 1 «([^\-\.\d]+)»
# Match a single character NOT present in the list below «[^\-\.\d]+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# A - character «\-»
# A . character «\.»
# A single digit 0..9 «\d»
# Match the regular expression below and capture its match into backreference number 2 «(\d+)»
# Match a single digit 0..9 «\d+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
You can use programs such as RegexBuddy if you intend to work with regexes and are willing to spend some funds.
You can also use this free web based explanation utility.

^ means not one of the following characters inside the brackets
\- \. are the - and . characters
\d is a number
[^\-\.\d]+ means on of more of the characters inside the bracket, so one or more of anything not a -, . or a number.
(\d+) one or more number

Here is the explanation given by the perl module YAPE::Regex::Explain
The regular expression:
(?-imsx:server\-([^\-\.\d]+)(\d+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
server 'server'
----------------------------------------------------------------------
\- '-'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^\-\.\d]+ any character except: '\-', '\.', digits
(0-9) (1 or more times (matching the
most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

Understanding Pattern in preg_match_all() Function Call

I am trying to understand how preg_match_all() works and when looking at the documentation on the php.net site, I see some examples but am baffled by the strings sent as the pattern parameter. Is there a really thorough, clear explanation out there? For example, I don't understand what the pattern in this example means:
preg_match_all("/\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4}/x",
"Call 555-1212 or 1-800-555-1212", $phones);
or this:
$html = "<b>bold text</b><a href=howdy.html>click me</a>";
preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);
I've taken an introductory class on PHP, but never saw anything like this. Some clarification would be appreciated.
Thanks!

Those aren't "PHP patterns", those are Regular Expressions. Instead of trying to explain what has been explained before a thousand times in this answer, I'll point you to http://regular-expressions.info for information and tutorials.

You are looking for this,
PHP PCRE Pattern Syntax
PCRE Standard syntax
Note that first one is a subset of second one.

Also have a look at YAPE, which for example gives this nice textual explanation for your first regex:
(?x-ims:\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4})
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?x-ims: group, but do not capture (disregarding
whitespace and comments) (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n):
----------------------------------------------------------------------
\(? '(' (optional (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1 (optional
(matching the most amount possible)):
----------------------------------------------------------------------
\d{3} digits (0-9) (3 times)
----------------------------------------------------------------------
)? end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
----------------------------------------------------------------------
\)? ')' (optional (matching the most amount
possible))
----------------------------------------------------------------------
(?(1) if back-reference \1 matched, then:
----------------------------------------------------------------------
[\-\s] any character of: '\-', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
| else:
----------------------------------------------------------------------
succeed
----------------------------------------------------------------------
) end of conditional on \1
----------------------------------------------------------------------
\d{3} digits (0-9) (3 times)
----------------------------------------------------------------------
- '-'
----------------------------------------------------------------------
\d{4} digits (0-9) (4 times)
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

The pattern you write about is a mini-language in it's own called Regular Expression. It's specialized on finding patterns in strings, do replacements etc. for everything that follows some sort of pattern.
More specifically it's a Perl Compatible Regular Expression (PCRE).
The handbook for that language is not available on the PHP manual website, you find it here: PCRE Manpage.
A well made step-by-step introduction is on the Regular Expressions Info Website.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

How does this regex divide text into sentences? - php

I know this regex divides a text into sentences. Can someone help me understand how? /(?<!\..)([\?\!\.])\s(?!.\.)/

There is the Regular Expression Analyzer which will do quite the same as toolic already suggested - but completely webbased.

Related

PHP preg_match to catch all but some options

Regex for validating numbers preceding with two dashes (hyphens)

Regex to match uri after mod_rewrite

Need help understanding preg_match regular expression

Understanding Pattern in preg_match_all() Function Call

Categories

Resources