Need help understanding preg_match regular expression

Need help understanding preg_match regular expression - php

A regular expression in preg_match is given as /server\-([^\-\.\d]+)(\d+)/. Can someone help me understand what this means? I see that the string starts with server- but I dont get ([^\-\.\d]+)(\d+)'

[ ] -> Match anything inside the square brackets for ONE character position once and only once, for example, [12] means match the target to 1 and if that does not match then match the target to 2 while [0123456789] means match to any character in the range 0 to 9.
- -> The - (dash) inside square brackets is the 'range separator' and allows us to define a range, in our example above of [0123456789] we could rewrite it as [0-9].
You can define more than one range inside a list, for example, [0-9A-C] means check for 0 to 9 and A to C (but not a to c).
NOTE: To test for - inside brackets (as a literal) it must come first or last, that is, [-0-9] will test for - and 0 to 9.
^ -> The ^ (circumflex or caret) inside square brackets negates the expression (we will see an alternate use for the circumflex/caret outside square brackets later), for example, [^Ff] means anything except upper or lower case F and [^a-z] means everything except lower case a to z.
You can check more explanations about it in the source I got this information: http://www.zytrax.com/tech/web/regex.htm
And if u want to test, u can try this one: http://gskinner.com/RegExr/

Here's the explanation:
# server\-([^\-\.\d]+)(\d+)
#
# Match the characters “server” literally «server»
# Match the character “-” literally «\-»
# Match the regular expression below and capture its match into backreference number 1 «([^\-\.\d]+)»
# Match a single character NOT present in the list below «[^\-\.\d]+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# A - character «\-»
# A . character «\.»
# A single digit 0..9 «\d»
# Match the regular expression below and capture its match into backreference number 2 «(\d+)»
# Match a single digit 0..9 «\d+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
You can use programs such as RegexBuddy if you intend to work with regexes and are willing to spend some funds.
You can also use this free web based explanation utility.

^ means not one of the following characters inside the brackets
\- \. are the - and . characters
\d is a number
[^\-\.\d]+ means on of more of the characters inside the bracket, so one or more of anything not a -, . or a number.
(\d+) one or more number

Here is the explanation given by the perl module YAPE::Regex::Explain
The regular expression:
(?-imsx:server\-([^\-\.\d]+)(\d+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
server 'server'
----------------------------------------------------------------------
\- '-'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^\-\.\d]+ any character except: '\-', '\.', digits
(0-9) (1 or more times (matching the
most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

Related

Match regular expression specific character quantities in any order

I need to match a series of strings that:
Contain at least 3 numbers
0 or more letters
0 or 1 - (not more)
0 or 1 \ (not more)
These characters can be in any position in the string.
The regular expression I have so far is:
([A-Z0-9]*[0-9]{3,}[\/]?[\-]?[0-9]*[A-Z]*)
This matches the following data in the following cases. The only one that does not match is the first one:
02ABU-D9435
013DFC
1123451
03323456782
ADS7124536768
03SDFA9433/0
03SDFA9433/
03SDFA9433/1
A41B03423523
O4AGFC4430
I think perhaps I am being too prescriptive about positioning. How can I update this regex to match all possibilities?
PHP PCRE
The following would not match:
01/01/2018 [multiple / or -]
AA-AA [no numbers]
Thanks

One option could be using lookaheads to assert 3 digits, not 2 backslashes and not 2 times a hyphen.
(?<!\S)(?=(?:[^\d\s]*\d){3})(?!(?:[^\s-]*-){2})(?!(?:[^\s\\]*\\){2})[A-Z0-9/\\-]+(?!\S)
About the pattern
(?<!\S) Assert what is on the left is not a non whitespace char
(?=(?:[^\d\s]*\d){3}) Assert wat is on the right is 3 times a whitespace char or digit
(?!(?:[^\s-]*-){2}) Assert what is on the right is not 2 times a whitespace char a hyphen
(?!(?:[^\s\\]*\\){2}) Assert what is on the right is not 2 times a whitespace char a backslash
[A-Z0-9/\\-]+ Match any of the listed 1+ times
(?!\S) Assert what is on the right is not a non whitespace char
Regex demo

Your patterns can be checked with positive/negative lookaheads anchored at the start of the string:
at least 3 digits -> find (not necessarily consecutive) 3 digits
no more than 1 '-' -> assert absence of (not necessarily consecutive) 2 '-' characters
no more than 1 '/' -> assert absence of (not necessarily consecutive) 2 '/' characters
0 or more letters -> no check needed.
If these conditions are met, any content is permitted.
The regex implementing this:
^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$
Check out this Regex101 demo.
Remark
This solution assumes that each string tested resides on its own line, ie. not just being separated by whitespace.
In case the strings are separated by whitespace, choose the solution of user #TheFourthBird (which essentially is the same as this one but caters for the whitespace separation)

You can test the condition for both the hyphen and the slash into a same lookahead using a capture group and a backreference:
~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~
demo
detailled:
~ # using the tild as pattern delimiter avoids to escape all slashes in the pattern
\A # start of the string
(?! .* ([-/]) .* \1 ) # negative lookahead:
# check that there's no more than one hyphen and one slash
(?: [A-Z/-]* \d ){3,} # at least 3 digits
[A-Z/-]* # eventual other characters until the end of the string
\z # end of the string.
~
To better understand (if you are not familiar with): these three subpatterns start from the same position (in this case the beginning of the string):
\A
(?! .* ([-/]) .* \1 )
(?: [A-Z/-]* \d ){3,}
This is possible only because the two first are zero-width assertions that are simple tests and don't consume any character.

PHP regex: each word must end with dot

Can someone help me how to specific pattern for preg_match function?
Every word in string must end with dot
First character of string must be [a-zA-Z]
After each dot there can be a space
There can't be two spaces next to each other
Last character must be a dot (logicaly after word)
Examples:
"Ing" -> false
"Ing." -> true
".Ing." -> false
"Xx Yy." -> false
"XX. YY." -> true
"XX.YY." -> true
Can you help me please how to test the string? My pattern is
/^(([a-zA-Z]+)(?! ) \.)+\.$/
I know it's wrong, but i can't figure out it. Thanks

Check how this fits your needs.
/^(?:[A-Z]+\. ?)+$/i
^ matches start
(?: opens a non-capture group for repetition
[A-Z]+ with i flag matches one or more alphas (lower & upper)
\. ? matches a literal dot followed by an optional space
)+ all this once or more until $ end
Here's a demo at regex101
If you want to disallow space at the end, add negative lookbehind: /^(?:[A-Z]+\. ?)+$(?<! )/i

Try this:
$string = "Ing
Ing.
.Ing.
Xx Yy.
XX. YY.
XX.YY.";
if (preg_match('/^([A-Za-z]{1,}\.[ ]{0,})*/m', $string)) {
// Successful match
} else {
// Match attempt failed
}
Result:
The Regex in detail:
^ Assert position at the beginning of a line (at beginning of the string or after a line break character)
( Match the regular expression below and capture its match into backreference number 1
[A-Za-z] Match a single character present in the list below
A character in the range between “A” and “Z”
A character in the range between “a” and “z”
{1,} Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\. Match the character “.” literally
[ ] Match the character “ ”
{0,} Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
)* Between zero and unlimited times, as many times as possible, giving back as needed (greedy)

explanation of preg_replace function in PHP [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 7 years ago.
The preg_replace() function has so many possible values, like:
<?php
$patterns = array('/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/', '/^\s*{(\w+)}\s*=/');
$replace = array('\3/\4/\1\2', '$\1 =');
echo preg_replace($patterns, $replace, '{startDate} = 1999-5-27');
What does:
\3/\4/\1\2
And:
/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/','/^\s*{(\w+)}\s*=/
mean?
Is there any information available to help understand the meanings at one place? Any help or documents would be appreciated! Thanks in Advance.

Take a look at http://www.tutorialspoint.com/php/php_regular_expression.htm
\3 is the captured group 3
\4 is the captured group 4
...an so on...
\w means any word character.
\d means any digit.
\s means any white space.
+ means match the preceding pattern at least once or more.
* means match the preceding pattern 0 times or more.
{n,m} means match the preceding pattern at least n times to m times max.
{n} means match the preceding pattern exactly n times.
(n,} means match the preceding pattern at least n times or more.
(...) is a captured group.

So, the first thing to point out, is that we have an array of patterns ($patterns), and an array of replacements ($replace). Let's take each pattern and replacement and break it down:
Pattern:
/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/
Replacement:
\3/\4/\1\2
This takes a date and converts it from a YYYY-M-D format to a M/D/YYYY format. Let's break down it's components:
/ ... / # The starting and trailing slash mark the beginning and end of the expression.
(19|20) # Matches either 19 or 20, capturing the result as \1.
# \1 will be 19 or 20.
(\d{2}) # Matches any two digits (must be two digits), capturing the result as \2.
# \2 will be the two digits captured here.
- # Literal "-" character, not captured.
(\d{2}) # Either 1 or 2 digits, capturing the result as \3.
# \3 will be the one or two digits captured here.
- # Literal "-" character, not captured.
(\d{2}) # Either 1 or 2 digits, capturing the result as \4.
# \4 will be the one or two digits captured here.
This match is replaced by \3/\4/\1\2, which means:
\3 # The two digits captured in the 3rd set of `()`s, representing the month.
/ # A literal '/'.
\4 # The two digits captured in the 4rd set of `()`s, representing the day.
/ # A literal '/'.
\1 # Either '19' or '20'; the first two digits captured (first `()`s).
\2 # The two digits captured in the 2nd set of `()`s, representing the last two digits of the year.
Pattern:
/^\s*{(\w+)}\s*=/
Replacement:
$\1 =
This takes a variable name encoded as {variable} and converts it to $variable = <date>. Let's break it down:
/ ... / # The starting and trailing slash mark the beginning and end of the expression.
^ # Matches the beginning of the string, anchoring the match.
# If the following character isn't matched exactly at the beginning of the string, the expression won't match.
\s* # Any whitespace character. This can include spaces, tabs, etc.
# The '*' means "zero or more occurrences".
# So, the whitespace is optional, but there can be any amount of it at the beginning of the line.
{ # A literal '{' character.
(\w+) # Any 'word' character (a-z, A-Z, 0-9, _). This is captured in \1.
# \1 will be the text contained between the { and }, and is the only thing "captured" in this expression.
} # A literal '}' character.
\s* # Any whitespace character. This can include spaces, tabs, etc.
= # A literal '=' character.
This match is replaced by $\1 =, which means:
$ # A literal '$' character.
\1 # The text captured in the 1st and only set of `()`s, representing the variable name.
# A literal space.
= # A literal '=' character.
Lastly, I wanted to show you a couple of resources. The regex-format you're using is called "PCRE", or Perl-Compatible Regular Expressions. Here is a quick cheat-sheet on PCRE for PHP. Over the last few years, several tools have been popping up to help you visualize, explain, and test regular expressions. One is Regex 101 (just Google "regex tester" or "regex visualizer"). If you look here, this is an explanation of the first RegEx, and here is an explanation of the second. There are others as well, like Debuggex, Regex Tester, etc. But I find the detailed match breakdown on Regex 101 to be pretty useful.

Parse a skype log

I need to parse a skype log, grab all the call durations and add them up and find out the total duration of calls for the entire chat history.
Sample:
[3/12/2012 11:36:44 AM] * Call ended, duration 21:33 *
I think I need to use preg_match with the proper regex expression. If it's possible to store the actual timestamp in array at the same time that would be better.
I think what i'm really stumped on is the actual regex rule that's needed to grab just the call duration.

Try this
(?i)\[(?P<time_stamp>[^[]+)\]\s*[*]\s*[a-z ,]+(?P<duration>(?:\d{2}:?){2,3})\s*[*]
Explanation
"
(?i) # Match the remainder of the regex with the options: case insensitive (i)
\[ # Match the character “[” literally
(?P<time_stamp> # Match the regular expression below and capture its match into backreference with name “time_stamp”
[^[] # Match any character that is NOT a “[”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\] # Match the character “]” literally
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[*] # Match the character “*”
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[a-z ,] # Match a single character present in the list below
# A character in the range between “a” and “z”
# One of the characters “ ,”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?P<duration> # Match the regular expression below and capture its match into backreference with name “duration”
(?: # Match the regular expression below
\d # Match a single digit 0..9
{2} # Exactly 2 times
: # Match the character “:” literally
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
){2,3} # Between 2 and 3 times, as many times as possible, giving back as needed (greedy)
)
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[*] # Match the character “*”
"

You can use this:
\*.+?([0-9]+:){1,2}([0-9]+)
Then it can catch both HH:MM:SS and MM:SS that comes after the first *.

How does this regex divide text into sentences?

I know this regex divides a text into sentences. Can someone help me understand how?
/(?<!\..)([\?\!\.])\s(?!.\.)/

You can use YAPE::Regex::Explain to decipher Perl regular expressions:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(?<!\..)([\?\!\.])\s(?!.\.)/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(?<!\..)([\?\!\.])\s(?!.\.))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
(?<! look behind to see if there is not:
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
) end of look-behind
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\?\!\.] any character of: '\?', '\!', '\.'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

There is the Regular Expression Analyzer which will do quite the same as toolic already suggested - but completely webbased.

(? # Find a group (don't capture)
< # before the following regular expression
! # that does not match
\. # a literal "."
. # followed by 1 character
) # (End look-behind group)
( # Start a group (capture it to $1)
[\?\!\.] # Containing any one of the characters in the following set "?!."
) # End group $1
\s # followed by a whitespace character " ", \t, etc.
(? # Followed by a group (don't capture)
# after the preceding regular expression
! # that does not have
. # 1 character
\. # followed by a literal "."
) # (End look-ahead group)

The first part (?<!\..) is a negative look-behind. It specifies a pattern which invalidates the match. In this case it's looking for two characters--the first a period and the other one any character.
The second part is a standard capture/group, which could be better expressed: ([?!.]) (you don't need the escapes in the class brackets), that is a sentence ending punctuation character.
The next part is a single (??) white-space character: \s
And the last part is a negative look-ahead: (?!.\.). Again it is guarding against the case of a single character followed by a period.
This should work, relatively well. But I don't think I would recommend it. I don't see what the coder was getting at trying to make sure that just a period wasn't the second most recent character, or that it wasn't the second one to come.
I mean if you are looking to split on terminal punctuation, why don't you want to guard against the same class being two-back or two-ahead? Instead it relies on periods not being there. Thus a more regular expression would be:
/(?<![?!.].)([?!.])\s(?!.[?!.])/

Portions:
([\?\!\.])\s: split by ending character (.,!,or ?) which is followed by a whitespace character (space, tab, newline)
(?<!\..) where the characters before this 'ending character' arent a .+anything
(?!.\.) after the whitespace character any character directly followed by any . isn't allowed.
Those look-ahead ((?!) & look-behind ((?<!) assertions mainly seem to prevent splitting on (whitespaced?) abbreviations (q. e. d. etc.).

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Need help understanding preg_match regular expression - php

A regular expression in preg_match is given as /server\-([^\-\.\d]+)(\d+)/. Can someone help me understand what this means? I see that the string starts with server- but I dont get ([^\-\.\d]+)(\d+)'

^ means not one of the following characters inside the brackets \- \. are the - and . characters \d is a number [^\-\.\d]+ means on of more of the characters inside the bracket, so one or more of anything not a -, . or a number. (\d+) one or more number

Related

Match regular expression specific character quantities in any order

PHP regex: each word must end with dot

explanation of preg_replace function in PHP [duplicate]

Parse a skype log

How does this regex divide text into sentences?

Categories

Resources