preg_match string - php

Can someone explain the meaning of this pattern to me?
preg_match('/^(d{1,2}([a-z]+))(?:s*)S (?=200[0-9])/', '21st March 2006', $matches);
So correct me if I'm wrong:
^ = beginning of the line
d{1,2} = digit with minimum 1 and maximum 2 digits
([a-z]+) = one or more letters from a-z
(?:s*)S = no idea...
(?= = no idea...
200[0-9] = a number, starting with 200 and ending with a number (0-9)
Can someone complete this list?

strfriend can draw a nice diagram of the pattern exactly as you wrote it (the image is not reproduced here).
But I think you probably meant ^(\d{1,2}([a-z]+))(?:\s*)\S (?=200[0-9]) with the backslashes.
That is, this regexp matches the beginning of the string, followed by one or two digits, one or more lowercase letters, zero or more whitespace characters, one non-whitespace character and a space. Also, all this has to be followed by a number between 2000 and 2009, although that number is not actually consumed by the regexp; it's only a look-ahead assertion. Finally, the leading digits and letters are captured into $matches[1], and just the letters into $matches[2].
For more information on PHP's PCRE regexp syntax, see http://php.net/manual/en/pcre.pattern.php
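To see the captures and the look-ahead in action, here is a minimal sketch. The input '21st x 2006' is made up for illustration. Note that against the question's own '21st March 2006' the pattern as written finds no match, because \S consumes only one character and the space that must follow it is not there.

```php
<?php
// Pattern from the answer above, with the backslashes restored.
$pattern = '/^(\d{1,2}([a-z]+))(?:\s*)\S (?=200[0-9])/';

// Hypothetical input: a single non-space character between the day and the year.
$matched = preg_match($pattern, '21st x 2006', $matches);

// $matches[0] is the whole match; the year is checked by the look-ahead
// but is not part of it. $matches[1] and $matches[2] are the two groups.
var_dump($matched, $matches);

// The question's test string: after "M" comes "a", not a space, so no match.
$original = preg_match($pattern, '21st March 2006', $unused);
var_dump($original);
```

Here $matches[0] is '21st x ', $matches[1] is '21st' and $matches[2] is 'st'; the year never appears in the result.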

regular-expressions.info is a very helpful resource.
/^(d{1,2}([a-z]+))(?:s*)S (?=200[0-9])/
(?:regex) are non-capturing parentheses. They aren't very useful in your example, but could be used to express things like (?:bar)+, to mean 1 or more bars.
(?=regex) is a positive lookahead: it asserts that regex matches at the current position without consuming any characters. So (?=200[0-9]) in your example makes the regex match only dates from 2000 to 2009, without including the year itself in the match.
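A small sketch of the difference between the two kinds of parentheses; the subject string 'foobarbarbaz' is just an illustrative assumption.

```php
<?php
// Non-capturing: the group can repeat, but only the full match is stored.
preg_match('/(?:bar)+/', 'foobarbarbaz', $nonCapturing);

// Capturing: index 1 additionally holds the last repetition of the group.
preg_match('/(bar)+/', 'foobarbarbaz', $capturing);

var_dump($nonCapturing, $capturing);
```

Both calls match 'barbar', but only the second one produces an entry at index 1.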

Related

Regex to get the first number after a certain string followed by any data until the number

I have a piece of data retrieved from the database that contains information I need. The text is entered free-form, so it's written in many different ways. The only thing I know for sure is that I'm looking for the first number after a given string, but between that string and the number there can be any text.
I tried these (where mytoken is the string I know for sure is there), but none of them work:
/(mytoken|MYTOKEN)(.*)\d{1}/
/(mytoken|MYTOKEN)[a-zA-Z]+\d{1}/
/(mytoken|MYTOKEN)(.*)[0-9]/
/(mytoken|MYTOKEN)[a-zA-Z]+[0-9]/
mytoken itself can be written in capitals, lowercase, or a mix of both. Can the expression be case-insensitive?
You do not need any lazy matching since you want to match any number of non-digit symbols up to the first digit. It is better done with a \D*:
/(mytoken)(\D*)(\d+)/i
See the regex demo
The pattern details:
(mytoken) - Group 1 matching mytoken (case insensitively, as there is a /i modifier)
(\D*) - Group 2 matching zero or more characters other than a digit
(\d+) - Group 3 matching 1 or more digits.
Note that \D also matches newlines, whereas . needs the DOTALL modifier (/s) to match across newlines.
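A minimal usage sketch of this pattern. The sample text is invented; it puts the number on a second line to show that \D* crosses the newline without any extra modifier.

```php
<?php
// Hypothetical free-form text: mixed-case token, number on the next line.
$text = "Order MyToken ref: abc\nline two 42 and 7";

if (preg_match('/(mytoken)(\D*)(\d+)/i', $text, $m)) {
    $token  = $m[1]; // the token exactly as written in the text
    $number = $m[3]; // first run of digits after the token
    echo "$token -> $number\n";
}
```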
You need to use a lazy quantifier. You can do that by putting a question mark after the star quantifier in the regex: .*?. Otherwise the greedy .* will consume every digit except the last one, which is the only one left for \d.
Regex: /(mytoken|MYTOKEN)(.*?)\d/
Regex demo
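To make the greedy versus lazy difference concrete, here is a short sketch. The input is made up, and a capturing group around \d+ has been added (it is not in the regex above) so the number can be read back.

```php
<?php
$text = 'mytoken abc 12 def 34'; // hypothetical input containing two numbers

// Greedy: .* runs to the end and backtracks only as far as the last digit.
preg_match('/(mytoken)(.*)(\d+)/i', $text, $greedy);

// Lazy: .*? grows one character at a time and stops before the first digit.
preg_match('/(mytoken)(.*?)(\d+)/i', $text, $lazy);

echo $greedy[3], "\n"; // the trailing "4"
echo $lazy[3], "\n";   // the first number, "12"
```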
You can use the opposite:
/(mytoken|MYTOKEN)(\D+)(\d)/
This says: mytoken, followed by anything not a number, followed by a number. The (lazy) dot-star-soup is not always your best bet. The desired number will be in $3 in this example.

How can I take apart a string?

I have a pattern like this:
[X number of digits][c][32 characters (md5)][X]
/* Examples:
2 c jg3j2kf290e8ghnaje48grlrpas0942g 65
5 c kdjeuw84398fj02i397hf4343i013g44 94824
1 c pokdk94jf0934nf0932mf3923249f3j3 3
*/
Note: the spaces in these examples do not exist in the real string.
I need to divide such a string into four parts:
// based on first example
$the_number_of_digits = 2
$separator = c // this is constant
$hashed_string = jg3j2kf290e8ghnaje48grlrpas0942g
$number = 65
How can I do that?
Here is what I've tried so far:
/^(\d+)(c)(\w{32})/
Online Demo
My pattern cannot get the last part.
EDIT: I don't want to select the rest of the number as the last part. I need an algorithm based on the number at the beginning of the string.
Because my string might look like this:
2 c 65 jg3j2kf290e8ghnaje48grlrpas0942g
This regex uses named groups to access the results:
(?<numDigits>\d+) (?<separator>c) (?<hashedString>\w{32}) (?<number>\d+)
Edit (from #RocketHazmat's helpful comments): since the OP also wants to validate that "number" has the number of digits given by "numDigits":
Use the regex provided, then validate the length of number in PHP:
if (strlen($matches['number']) == $matches['numDigits'])
The fact that one match drives the length of another match suggests that you will need something a bit more complicated than a single expression. However, it need not be that much more complicated: sscanf was designed for this kind of job:
sscanf($code, '%dc%32s%n', $length, $md5, $width);
$number = substr($code, $width, $length);
Live example.
The trick here is that sscanf gives you the width of the string (%n) at exactly the point you need to start cutting, as well as the length (from the first %d), so you have everything you need to do simple string cuts.
Add (\d+) to the end, like you have in the beginning.
/^(\d+)(c)(\w{32})(\d+)/
/(\d)(c)([[:alnum:]]{32})(\d+)/
preg_match('/(\d)(c)([[:alnum:]]{32})(\d+)/', $string, $matches);
$the_number_of_digits = $matches[1];
$separator = $matches[2];
$hashed_string = $matches[3];
$number = $matches[4];
Then, to check if the string length of $number is equal to $the_number_of_digits, you can use strlen, i.e.:
if(strlen($number) == $the_number_of_digits){
}
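Putting the match and the length check together might look like the sketch below. The function name split_code is hypothetical, the pattern is anchored with ^ and $ and uses \d+ for the leading count (both are assumptions, not part of the answer's regex), and it only covers the layout where the number comes last.

```php
<?php
// Hypothetical helper: returns the four parts, or null if the string is malformed.
function split_code(string $code): ?array
{
    if (!preg_match('/^(\d+)(c)([[:alnum:]]{32})(\d+)$/', $code, $m)) {
        return null;
    }
    // The leading count must equal the length of the trailing number.
    if (strlen($m[4]) !== (int) $m[1]) {
        return null;
    }
    return [
        'the_number_of_digits' => (int) $m[1],
        'separator'            => $m[2],
        'hashed_string'        => $m[3],
        'number'               => $m[4],
    ];
}

// First example from the question, without the spaces.
$ok  = split_code('2cjg3j2kf290e8ghnaje48grlrpas0942g65');
// Same string but claiming three digits: rejected by the length check.
$bad = split_code('3cjg3j2kf290e8ghnaje48grlrpas0942g65');
var_dump($ok, $bad);
```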
The main difference from the other answers is the use of [[:alnum:]]: unlike \w, it won't match _.
[:alnum:]
Alphanumeric characters: ‘[:alpha:]’ and ‘[:digit:]’; in the ‘C’
locale and ASCII character encoding, this is the same as
‘[0-9A-Za-z]’.
http://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html
Regex101 Demo
Ideone Demo
Regex Explanation:
(\d)(c)([[:alnum:]]{32})(\d+)
Match the regex below and capture its match into backreference number 1 «(\d)»
Match a single character that is a “digit” (any decimal number in any Unicode script) «\d»
Match the regex below and capture its match into backreference number 2 «(c)»
Match the character “c” literally (case sensitive, as no /i modifier is used) «c»
Match the regex below and capture its match into backreference number 3 «([[:alnum:]]{32})»
Match a character from the POSIX character class “alnum” (Unicode; any letter or ideograph, digit, other number) «[[:alnum:]]{32}»
Exactly 32 times «{32}»
Match the regex below and capture its match into backreference number 4 «(\d+)»
Match a single character that is a “digit” (any decimal number in any Unicode script) «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»

php regex - find uppercase string with number and spaces in text

I want to write a PHP regular expression that finds an uppercase string in a text. The string may also contain one number and spaces.
For example, from the text "some text to contain EXAM PL E 7STRING uppercase word" I want to get the string EXAM PL E 7STRING.
The found string must start and end with an uppercase letter; in between, besides uppercase letters, it may (but need not) contain one number and spaces. So the regex should match any of these patterns:
1) EXAMPLESTRING - just uppercase string
2) EXAMP4LESTRING - with number
3) EXAMPLES TRING - with space
4) EXAM PL E STRING - with more than one spaces
5) EXAMP LE4STRING - with number and space
6) EXAMP LE 4ST RI NG - with number and spaces
and the string should contain at least 4 letters in total.
I wrote this regex, '/[A-Z]{1,}([A-Z\s]{2,}|\d?)[A-Z]{1,}/', which finds the first 4 patterns, but I can't figure out how to also match the last 2.
Thanks
There is a neat trick called a lookahead. It just checks what is following after the current position. That can be used to check for multiple conditions:
'/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])(?!(?:[A-Z\s]*\d){2})[A-Z][A-Z\s\d]*[A-Z]/'
The first lookaround is actually a lookbehind and checks that there is no previous uppercase letter. This is just a little speedup for strings that would fail the match anyway. The second lookaround (a lookahead) checks that there are at least four letters. The third one checks that there are no two digits. The rest then simply matches a string of the allowed characters, starting and ending with an uppercase letter.
Note that in the case of two digits this will not match at all (instead of matching everything up to the second digit). If you do want to match in such a case, you could incorporate the "1 digit" rule into the actual match instead:
'/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])[A-Z][A-Z\s]*\d?[A-Z\s]*[A-Z]/'
EDIT:
As Ωmega pointed out, this will cause problems if there are fewer than four letters before the second digit but more after it. This is actually quite tough, because the assertion has to be that there are at least four letters before the second digit. Since we do not know where the first digit occurs among those four letters, we have to check all possible positions. For this I would do away with the lookaheads altogether and simply provide the three different alternatives. (I will keep the lookbehind as an optimization for non-matching parts.)
'/(?<![A-Z])[A-Z]\s*(?:\d\s*[A-Z]\s*[A-Z]|[A-Z]\s*\d\s*[A-Z]|[A-Z]\s*[A-Z][A-Z\s]*\d?)[A-Z\s]*[A-Z]/'
Or here with added comments:
'/
(?<! # negative lookbehind
[A-Z] # current position is not preceded by a letter
) # end of lookbehind
[A-Z] # match has to start with uppercase letter
\s* # optional spaces after first letter
(?: # subpattern for possible digit positions
\d\s*[A-Z]\s*[A-Z]
# digit comes after first letter, we need two more letters before last one
| # OR
[A-Z]\s*\d\s*[A-Z]
# digit comes after second letter, we need one more letter before last one
| # OR
[A-Z]\s*[A-Z][A-Z\s]*\d?
# digit comes after third letter, or later, or not at all
) # end of subpattern for possible digit positions
[A-Z\s]* # arbitrary amount of further letters and whitespace
[A-Z] # match has to end with uppercase letter
/x'
That gives the same result on Ωmega's lengthy test input.
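As a quick check, the one-line version of the final pattern can be run against the sentence and the six patterns from the question. The variable names and the too-short sample 'some AB C text' are assumptions added for illustration.

```php
<?php
$pattern = '/(?<![A-Z])[A-Z]\s*(?:\d\s*[A-Z]\s*[A-Z]|[A-Z]\s*\d\s*[A-Z]|[A-Z]\s*[A-Z][A-Z\s]*\d?)[A-Z\s]*[A-Z]/';

// The sentence from the question: the match is trimmed back to the last uppercase letter.
preg_match($pattern, 'some text to contain EXAM PL E 7STRING uppercase word', $m);
echo $m[0], "\n";

// The six patterns from the question should each match in full.
$samples = [
    'EXAMPLESTRING', 'EXAMP4LESTRING', 'EXAMPLES TRING',
    'EXAM PL E STRING', 'EXAMP LE4STRING', 'EXAMP LE 4ST RI NG',
];
$results = [];
foreach ($samples as $s) {
    $results[$s] = preg_match($pattern, $s, $hit) === 1 && $hit[0] === $s;
}
var_dump($results);

// Only three uppercase letters: too short, so nothing is found.
$tooShort = preg_match($pattern, 'some AB C text');
var_dump($tooShort);
```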
I suggest using this regex pattern:
[A-Z][ ]*(\d)?(?(1)(?:[ ]*[A-Z]){3,}|[A-Z][ ]*(\d)?(?(2)(?:[ ]*[A-Z]){2,}|[A-Z][ ]*(\d)?(?(3)(?:[ ]*[A-Z]){2,}|[A-Z][ ]*(?:\d|(?:[ ]*[A-Z])+[ ]*\d?))))(?:[ ]*[A-Z])*
(see this demo).
[A-Z][ ]*(?:\d(?:[ ]*[A-Z]){2}|[A-Z][ ]*\d[ ]*[A-Z]|(?:[A-Z][ ]*){2,}\d?)[A-Z ]*[A-Z]
(see this demo)

Regex: how to match a word that doesn't end with a specific character

I would like to match the whole "word"—one that starts with a number character and that may include special characters but does not end with a '%'.
Match these:
112 (whole numbers)
10-12 (ranges)
11/2 (fractions)
11.2 (decimal numbers)
1,200 (thousand separator)
but not
12% (percentages)
A38 (words starting with an alphabetic character)
I've tried these regular expressions:
(\b\p{N}\S*)
but that returns '12%' in '12%'
(\b\p{N}(?:(?!%)\S)*)
but that returns '12' in '12%'
Can I make an exception to the \S term that disregards %?
Or will I have to do something else?
I'll be using it in PHP, but just write as you would like and I'll convert it to PHP.
This matches your specification:
\b\p{N}\S*+(?<!%)
Explanation:
\b # Start of number
\p{N} # One Digit
\S*+ # Any number of non-space characters, match possessively
(?<!%) # Last character must not be a %
The possessive quantifier \S*+ makes sure that the regex engine will not backtrack into a string of non-space characters it has already matched. Therefore, it will not "give back" a % to match 12 within 12%.
Of course, that will also match 1!abc, so you might want to be more specific than \S which matches anything that's not a whitespace character.
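A minimal sketch running this pattern over the examples from the question, joined into one made-up test string. The ~ delimiter is an arbitrary choice so that the / in 11/2 needs no escaping.

```php
<?php
// "~" as delimiter; the pattern itself is the one given above.
$pattern = '~\b\p{N}\S*+(?<!%)~';

// All examples from the question in one hypothetical line of text.
$text = '112 10-12 11/2 11.2 1,200 12% A38';

preg_match_all($pattern, $text, $m);
print_r($m[0]);
```

The result contains the first five items; 12% is rejected because the possessive \S*+ never gives the % back, and A38 is skipped because there is no word boundary in front of the 3.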
Can i make an exception to the \S term that disregards %
Yes you can:
[^%\s]
See this expression \b\d[^%\s]* here on Regexr
\d+([-/\.,]\d+)?(?!%)
Explanation:
\d+ one or more digits
(
[-/\.,] one "-", "/", "." or ","
\d+ one or more digits
)? the group above zero or one times
(?!%) not followed by a "%" (negative lookahead)
KISS (restrictive):
/[0-9][0-9.,-/]*\s/
Try this one:
preg_match("/^[0-9].*[^%]$/", $string);
Try this PCRE regex:
/^(\d[^%]+)$/
It should give you what you need.
I would suggest just:
(\b[\p{N},.-]++(?!%))
That's not very exact regarding decimal delimiters or ranges, for example. But the ++ possessive quantifier will eat up as many digits and separators as it can, so you really just need to check the following character with a simple assertion. It worked for your examples.

Remove number from large string in specific position [PHP RegEx]

I have a large string (multiple lines) in which I need to find numbers with a regex. The number I need is always preceded and followed by an exact sequence of characters, so I can use non-capturing groups to pinpoint the exact number I need. I put together a regex to get this number, but it refuses to work and I can't figure out why!
Below is a small bit of PHP code that I can't get to work, showing the basic format of what I need:
$sTestData = 'lak sjdhfklsjaf<?kjnsdfh461uihrfkjsn+%5Bmlknsadlfjncas dlk';
$sNumberStripRE = '/.*?(?:sjdhfklsjaf<\\?kjnsdfh)(\\d+)(?:uihrfkjsn\\+%5Bmlknsadlfjncas).*?/gim';
if (preg_match_all($sNumberStripRE, $sTestData, $aMatches))
{
var_dump($aMatches);
}
The number I need is 461, and the characters before/after the spaces on either side of this number are always the same.
Any help getting the above regex working would be great!
This link, RegExr: My Reg Ex (to an online regex generator and my regex), shows that it should work!
g is an invalid modifier, drop it.
Ideone Link
With regard to that link, which regular expression engine is it working from? Built in Flex, so probably the ActionScript RegExp engine. They are not all the same, each one varies.
You have a number of double-backslashes, they should probably be single in those strings.
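A minimal sketch of the question's code with only the g modifier removed; preg_match_all already finds every match, so no global flag is needed. It also compares the doubled and single backslash spellings: inside a single-quoted PHP string, '\\d' and '\d' both produce \d, so either form should give the same pattern here. The variable names $doubled, $single and $samePattern are illustrative.

```php
<?php
$sTestData = 'lak sjdhfklsjaf<?kjnsdfh461uihrfkjsn+%5Bmlknsadlfjncas dlk';

// The pattern from the question, minus the unsupported "g" modifier.
$doubled = '/.*?(?:sjdhfklsjaf<\\?kjnsdfh)(\\d+)(?:uihrfkjsn\\+%5Bmlknsadlfjncas).*?/im';
// The same pattern written with single backslashes.
$single  = '/.*?(?:sjdhfklsjaf<\?kjnsdfh)(\d+)(?:uihrfkjsn\+%5Bmlknsadlfjncas).*?/im';

// In single-quoted strings both spellings yield the identical pattern text.
$samePattern = ($doubled === $single);

if (preg_match_all($doubled, $sTestData, $aMatches)) {
    // Group 1 of the first (and only) match holds the number.
    echo $aMatches[1][0], "\n";
}
```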
$sTestData = 'lak sjdhfklsjaf<?kjnsdfh461uihrfkjsn+%5Bmlknsadlfjncas dlk';
$lDelim = ' sjdhfklsjaf<?kjnsdfh';
$rDelim = 'uihrfkjsn+%5Bmlknsadlfjncas ';
$start = strpos($sTestData, $lDelim) + strlen($lDelim);
$length = strpos($sTestData, $rDelim) - $start;
$number = substr($sTestData, $start, $length);
Using regex you can accomplish your goal with the following code:
$string='lak sjdhfklsjaf<?kjnsdfh461uihrfkjsn+%5Bmlknsadlfjncas dlk';
if (preg_match('/(sjdhfklsjaf<\?kjnsdfh)(\d+)(uihrfkjsn\+%5Bmlknsadlfjncas)/', $string, $num_array)) {
$aMatches = $num_array[2];
} else {
$aMatches = "";
}
echo $aMatches;
Explanation:
I declared a variable entitled $string and made it equal to the variable you initially presented. You indicated that the characters on either side of the numeric value of interest were always the same. I assigned the numerical value of interest to $aMatches by setting $aMatches equal to back reference 2. Using the parentheses in regex you will get 3 matches: backreference 1 which will contain the characters before the number, backreference 2 which will contain the numbers that you want, and backreference 3 which is the stuff after the number. I assigned $num_array as the variable name for those backreferences and the [2] indicates that it is the second backreference. So, $num_array[1] would contain the match in backreference 1 and $num_array[3] would contain the match in backreference 3.
Here is the explanation of my regular expression:
Match the regular expression below and capture its match into backreference number 1 «(sjdhfklsjaf<\?kjnsdfh)»
Match the characters “sjdhfklsjaf<” literally «sjdhfklsjaf<»
Match the character “?” literally «\?»
Match the characters “kjnsdfh” literally «kjnsdfh»
Match the regular expression below and capture its match into backreference number 2 «(\d+)»
Match a single digit 0..9 «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regular expression below and capture its match into backreference number 3 «(uihrfkjsn\+%5Bmlknsadlfjncas)»
Match the characters “uihrfkjsn” literally «uihrfkjsn»
Match the character “+” literally «\+»
Match the characters “%5Bmlknsadlfjncas” literally «%5Bmlknsadlfjncas»
Hope this helps and best of luck to you.
Steve
