I have a large, multi-line string in which I need to find numbers with a regex. The number I need is always preceded and followed by an exact sequence of characters, so I can use non-capturing groups to pinpoint it. I put together a regex to get this number, but it refuses to work and I can't figure out why!
Below is a small bit of PHP code that I can't get to work, showing the basic format of what I need:
$sTestData = 'lak sjdhfklsjaf<?kjnsdfh461uihrfkjsn+%5Bmlknsadlfjncas dlk';
$sNumberStripRE = '/.*?(?:sjdhfklsjaf<\\?kjnsdfh)(\\d+)(?:uihrfkjsn\\+%5Bmlknsadlfjncas).*?/gim';
if (preg_match_all($sNumberStripRE, $sTestData, $aMatches))
{
var_dump($aMatches);
}
The number I need is 461, and the characters before and after it (up to the spaces on either side) are always the same.
Any help getting the above regex working would be great!
This link, RegExr: My Reg Ex (to an online regex generator with my regex), shows that it should work!
g is an invalid modifier in PHP, drop it. preg_match_all already finds every match.
Ideone Link
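As a minimal sketch of that fix, here is the question's own code with only the g flag removed. The pattern is otherwise unchanged, and reading the digits from $aMatches[1] is just one way to inspect the result.

```php
<?php
$sTestData = 'lak sjdhfklsjaf<?kjnsdfh461uihrfkjsn+%5Bmlknsadlfjncas dlk';

// Same pattern as in the question, minus the unsupported "g" flag.
// preg_match_all() is already global, so no flag is needed for that.
$sNumberStripRE = '/.*?(?:sjdhfklsjaf<\\?kjnsdfh)(\\d+)(?:uihrfkjsn\\+%5Bmlknsadlfjncas).*?/im';

if (preg_match_all($sNumberStripRE, $sTestData, $aMatches)) {
    // Index 1 holds everything captured by the only capturing group, (\d+).
    echo implode(',', $aMatches[1]), "\n"; // 461
}
```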
With regard to that link, which regular expression engine is it working from? Built in Flex, so probably the ActionScript RegExp engine. They are not all the same; each one varies.
You have a number of double backslashes. In a single-quoted PHP string \\ collapses to a single backslash, so they are harmless here, but single backslashes are easier to read.
$sTestData = 'lak sjdhfklsjaf<?kjnsdfh461uihrfkjsn+%5Bmlknsadlfjncas dlk';
$lDelim = ' sjdhfklsjaf<?kjnsdfh';
$rDelim = 'uihrfkjsn+%5Bmlknsadlfjncas ';
$start = strpos($sTestData, $lDelim) + strlen($lDelim);
$length = strpos($sTestData, $rDelim) - $start;
$number = substr($sTestData, $start, $length);
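The strpos approach above assumes both delimiters are always present. A slightly more defensive sketch is shown below; the helper name extractBetween is hypothetical (not a PHP built-in), and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the text between two literal delimiters,
// or null when either delimiter is missing.
function extractBetween(string $haystack, string $lDelim, string $rDelim): ?string
{
    $start = strpos($haystack, $lDelim);
    if ($start === false) {
        return null; // left delimiter not found
    }
    $start += strlen($lDelim);

    // Search for the right delimiter only after the left one.
    $end = strpos($haystack, $rDelim, $start);
    if ($end === false) {
        return null; // right delimiter not found
    }
    return substr($haystack, $start, $end - $start);
}

$sTestData = 'lak sjdhfklsjaf<?kjnsdfh461uihrfkjsn+%5Bmlknsadlfjncas dlk';
$number = extractBetween($sTestData, ' sjdhfklsjaf<?kjnsdfh', 'uihrfkjsn+%5Bmlknsadlfjncas ');
echo $number, "\n"; // 461
```

Checking strpos against false with === matters because a match at position 0 is also falsy.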
Using regex you can accomplish your goal with the following code:
$string='lak sjdhfklsjaf<?kjnsdfh461uihrfkjsn+%5Bmlknsadlfjncas dlk';
if (preg_match('/(sjdhfklsjaf<\?kjnsdfh)(\d+)(uihrfkjsn\+%5Bmlknsadlfjncas)/', $string, $num_array)) {
$aMatches = $num_array[2];
} else {
$aMatches = "";
}
echo $aMatches;
Explanation:
I declared a variable named $string and set it to the test data you presented. You indicated that the characters on either side of the numeric value of interest are always the same. Using parentheses in the regex gives you three capture groups: backreference 1 contains the characters before the number, backreference 2 contains the digits that you want, and backreference 3 contains the text after the number. preg_match stores these in $num_array, so $num_array[2] is the second backreference, which I assign to $aMatches. Likewise, $num_array[1] would contain the match in backreference 1 and $num_array[3] would contain the match in backreference 3.
Here is the explanation of my regular expression:
Match the regular expression below and capture its match into backreference number 1 «(sjdhfklsjaf<\?kjnsdfh)»
Match the characters “sjdhfklsjaf<” literally «sjdhfklsjaf<»
Match the character “?” literally «\?»
Match the characters “kjnsdfh” literally «kjnsdfh»
Match the regular expression below and capture its match into backreference number 2 «(\d+)»
Match a single digit 0..9 «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regular expression below and capture its match into backreference number 3 «(uihrfkjsn\+%5Bmlknsadlfjncas)»
Match the characters “uihrfkjsn” literally «uihrfkjsn»
Match the character “+” literally «\+»
Match the characters “%5Bmlknsadlfjncas” literally «%5Bmlknsadlfjncas»
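Because the surrounding text is fixed, the escaping of ? and + can also be left to preg_quote instead of being done by hand. The sketch below is an alternative under that assumption; the variable names $before, $after and $pattern are illustrative, and only group 1 is captured.

```php
<?php
$string = 'lak sjdhfklsjaf<?kjnsdfh461uihrfkjsn+%5Bmlknsadlfjncas dlk';

// Literal text on either side of the number, written without any escaping.
$before = 'sjdhfklsjaf<?kjnsdfh';
$after  = 'uihrfkjsn+%5Bmlknsadlfjncas';

// preg_quote() escapes regex metacharacters (and the "/" delimiter passed as 2nd argument).
$pattern = '/' . preg_quote($before, '/') . '(\d+)' . preg_quote($after, '/') . '/';

$number = preg_match($pattern, $string, $m) ? $m[1] : '';
echo $number, "\n"; // 461
```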
Hope this helps and best of luck to you.
Steve
Related
I have a piece of data, retrieved from the database, that contains information I need. The text is entered in free form, so it's written in many different ways. The only thing I know for sure is that I'm looking for the first number after a given string, but there can be any text between that string and the number.
I tried these (where mytoken is the string I know for sure is there), but none of them work.
/(mytoken|MYTOKEN)(.*)\d{1}/
/(mytoken|MYTOKEN)[a-zA-Z]+\d{1}/
/(mytoken|MYTOKEN)(.*)[0-9]/
/(mytoken|MYTOKEN)[a-zA-Z]+[0-9]/
mytoken itself can be written in capitals, lowercase, or a mix of both. Can the expression be case insensitive?
You do not need any lazy matching since you want to match any number of non-digit symbols up to the first digit. It is better done with a \D*:
/(mytoken)(\D*)(\d+)/i
See the regex demo
The pattern details:
(mytoken) - Group 1 matching mytoken (case insensitively, as there is a /i modifier)
(\D*) - Group 2 matching zero or more characters other than a digit
(\d+) - Group 3 matching 1 or more digits.
Note that \D also matches newlines, whereas . needs the DOTALL modifier (/s) to match across newlines.
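A short sketch of this pattern in PHP follows. The sample text is made up for illustration; the point is that group 3 holds the first run of digits after the token, regardless of the token's case.

```php
<?php
// Made-up free-form text; "MyToken" is written in mixed case on purpose.
$text = 'Order ref MyToken: approx. 42 units, then 7 more';

if (preg_match('/(mytoken)(\D*)(\d+)/i', $text, $m)) {
    // $m[1] = the token as written, $m[2] = the non-digits in between,
    // $m[3] = the first number after the token.
    echo $m[3], "\n"; // 42
}
```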
You need to use a lazy quantifier. You can do that by putting a question mark after the star quantifier in the regex: .*?. Otherwise, the greedy .* will consume everything up to the last digit in the text, and that last digit is the one matched by \d.
Regex: /(mytoken|MYTOKEN)(.*?)\d/
Regex demo
You can use the opposite:
/(mytoken|MYTOKEN)(\D+)(\d)/
This says: mytoken, followed by anything not a number, followed by a number. The (lazy) dot-star-soup is not always your best bet. The desired number will be in $3 in this example.
I have a pattern like this:
[X number of digits][c][32 characters (md5)][X]
/* Examples:
2 c jg3j2kf290e8ghnaje48grlrpas0942g 65
5 c kdjeuw84398fj02i397hf4343i013g44 94824
1 c pokdk94jf0934nf0932mf3923249f3j3 3
*/
Note: The spaces in those examples don't exist in the real string.
I need to divide such a string into four parts:
// based on first example
$the_number_of_digits = 2
$separator = c // this is constant
$hashed_string = jg3j2kf290e8ghnaje48grlrpas0942g
$number = 65
How can I do that?
Here is what I've tried so far:
/^(\d+)(c)(\w{32})/
Online Demo
My pattern cannot get last part.
EDIT: I don't want to just take whatever digits remain as the last part. I need an algorithm based on the number at the beginning of the string.
Because my string might look like this:
2 c 65 jg3j2kf290e8ghnaje48grlrpas0942g
This regex uses named groups to access the results:
(?<numDigits>\d+) (?<separator>c) (?<hashedString>\w{32}) (?<number>\d+)
edit: (from @RocketHazmat's helpful comments) since the OP also wants to validate that "number" has the number of digits given by "numDigits":
Use the regex provided, then validate the length of number in PHP:
if (strlen($matches['number']) == $matches['numDigits'])
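Put together, the named-group match plus the length check might look like the sketch below. It assumes the real string has no spaces (as the question notes), so the spaces are dropped from the pattern, and the ^ and $ anchors are an added assumption that the whole string must fit the format.

```php
<?php
// First example from the question, with the illustrative spaces removed.
$code = '2cjg3j2kf290e8ghnaje48grlrpas0942g65';

$pattern = '/^(?<numDigits>\d+)(?<separator>c)(?<hashedString>\w{32})(?<number>\d+)$/';

$isValid = false;
if (preg_match($pattern, $code, $matches)) {
    // The regex cannot tie the two parts together, so compare the lengths in PHP.
    $isValid = strlen($matches['number']) === (int) $matches['numDigits'];
}
echo $isValid ? 'valid' : 'invalid', "\n"; // valid
```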
The fact that one match drives the length of another match suggests that you will need something a bit more complicated than a single expression. However, it need not be that much more complicated: sscanf was designed for this kind of job:
sscanf($code, '%dc%32s%n', $length, $md5, $width);
$number = substr($code, $width, $length);
Live example.
The trick here is that sscanf gives you the width of the string (%n) at exactly the point you need to start cutting, as well as the length (from the first %d), so you have everything you need to do simple string cuts.
Add (\d+) to the end, like you have in the beginning.
/^(\d+)(c)(\w{32})(\d+)/
/(\d)(c)([[:alnum:]]{32})(\d+)/
preg_match('/(\d)(c)([[:alnum:]]{32})(\d+)/', $string, $matches);
$the_number_of_digits = $matches[1];
$separator = $matches[2];
$hashed_string = $matches[3];
$number = $matches[4];
Then, to check if the string length of $number is equal to $the_number_of_digits, you can use strlen, i.e.:
if(strlen($number) == $the_number_of_digits){
}
The main difference from the other answers is the use of [[:alnum:]]; unlike \w, it won't match _.
[:alnum:]
Alphanumeric characters: ‘[:alpha:]’ and ‘[:digit:]’; in the ‘C’
locale and ASCII character encoding, this is the same as
‘[0-9A-Za-z]’.
http://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html
Regex101 Demo
Ideone Demo
Regex Explanation:
(\d)(c)([[:alnum:]]{32})(\d+)
Match the regex below and capture its match into backreference number 1 «(\d)»
Match a single character that is a digit «\d»
Match the regex below and capture its match into backreference number 2 «(c)»
Match the character “c” literally «c»
Match the regex below and capture its match into backreference number 3 «([[:alnum:]]{32})»
Match a character from the POSIX character class “alnum” (letters and digits) «[[:alnum:]]{32}»
Exactly 32 times «{32}»
Match the regex below and capture its match into backreference number 4 «(\d+)»
Match a single character that is a digit «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
I'm playing around with PHP Regex in order to improve my skills with it.
I'm having a hard time trying to understand the plus sign - so I wrote the following code:
$subject = 'aaa bbb cccc dddd';
echo preg_replace('/(\w)/',"$1*",$subject) . '<br>';
echo preg_replace('/(\w+)/',"$1*",$subject) . '<br>';
echo preg_replace('/(\w)+/',"$1*",$subject) . '<br>';
With results in:
a*a*a* b*b*b* c*c*c*c* d*d*d*d*
aaa* bbb* cccc* dddd*
a* b* c* d*
I don't understand why these results come about. Can someone please explain what's going on in this example?
In regular expressions, + means one or more of the preceding character or group.
The pattern /(\w)/ means: match a single word character (a-zA-Z0-9_) in a single group. So it will match each letter individually. The first match group will be just a. The replace will replace each individual letter with that letter followed by an asterisk.
The pattern /(\w+)/ will match one or more word characters in a group. So it will match each block of letters. The first match group will be aaa. The replace will replace each block of letters with that block followed by an asterisk.
The last pattern, /(\w)+/, is a little trickier. It matches a single word character in a group, but the + means it will match one or more repetitions of that group. Each match therefore covers a whole block of letters, but the group keeps only its last repetition, so the replace swaps the whole block for its last character (followed, of course, by an asterisk). So if you tried the string aaab ccc, your result would end up as b* c*. b is the last matched group in the first sequence, so the replace uses that.
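The aaab ccc case can be checked directly. This small sketch runs the three patterns from the question on that string and also shows what preg_match stores for the repeated group.

```php
<?php
$subject = 'aaab ccc';

// Group around one character: every character is its own match.
$each  = preg_replace('/(\w)/', '$1*', $subject);   // a*a*a*b* c*c*c*
// Quantifier inside the group: the group captures the whole block.
$block = preg_replace('/(\w+)/', '$1*', $subject);  // aaab* ccc*
// Quantifier outside the group: the group keeps only its last repetition.
$last  = preg_replace('/(\w)+/', '$1*', $subject);  // b* c*

echo $each, "\n", $block, "\n", $last, "\n";

// The full match ($m[0]) is still the whole block; only $m[1] is the last character.
preg_match('/(\w)+/', 'aaab', $m);
echo $m[0], ' / ', $m[1], "\n"; // aaab / b
```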
Your mistake isn't the plus sign; it's understanding what the parentheses are for and how they work. Parentheses capture part of your match into a variable, which is why you can use $1; the second set of () gives you $2, and so on.
(\w) means 1 word character
(\w+) means 1 or more word characters
(\w)+ matches 1 or more word characters, but only the last one is put into the variable, because only the \w is inside the parentheses
Can someone explain the meaning of this pattern to me?
preg_match(/'^(d{1,2}([a-z]+))(?:s*)S (?=200[0-9])/','21st March 2006','$matches);
So correct me if I'm wrong:
^ = beginning of the line
d{1,2} = digit with minimum 1 and maximum 2 digits
([a-z]+) = one or more letters from a-z
(?:s*)S = no idea...
(?= = no idea...
200[0-9] = a number, starting with 200 and ending with a number (0-9)
Can someone complete this list?
Here's a nice diagram courtesy of strfriend:
But I think you probably meant ^(\d{1,2}([a-z]+))(?:\s*)\S (?=200[0-9]) with the backslashes, which gives this diagram:
That is, this regexp matches the beginning of the string, followed by one or two digits, one or more lowercase letters, zero or more whitespace characters, one non-whitespace character and a space. Also, all this has to be followed by a number between 2000 and 2009, although that number is not actually matched by the regexp; it's only a look-ahead assertion. Also, the leading digits and letters are captured into $matches[1], and just the letters into $matches[2].
For more information on PHP's PCRE regexp syntax, see http://php.net/manual/en/pcre.pattern.php
regular-expressions.info is a very helpful resource.
/'^(d{1,2}([a-z]+))(?:s*)S (?=200[0-9])/
(?:regex) are non-capturing parentheses. They aren't very useful in your example, but could be used to express things like (?:bar)+, meaning one or more occurrences of bar.
(?=regex) is a positive lookahead: it matches a position, not the contents. So (?=200[0-9]) in your example makes the regex match only when a year from 2000 to 2009 follows, without matching the year itself.
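To see the lookahead and the capture groups in action, here is a small sketch using the pattern with its backslashes restored. The input '21st M 2006' is a made-up string chosen to fit the pattern. Note that the question's own '21st March 2006' does not match, because \S consumes only one non-whitespace character before the required space.

```php
<?php
$pattern = '/^(\d{1,2}([a-z]+))(?:\s*)\S (?=200[0-9])/';

// Made-up input that fits: digits, letters, a space, one non-space character, a space, then the year.
$found = preg_match($pattern, '21st M 2006', $m);
echo $found, "\n";                 // 1
echo '[', $m[0], ']', "\n";        // [21st M ]  (the year is asserted, not consumed)
echo $m[1], ' ', $m[2], "\n";      // 21st st

// The original sample fails: after "M" the pattern needs a space, but "a" follows.
$foundOriginal = preg_match($pattern, '21st March 2006');
echo $foundOriginal, "\n";         // 0
```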
I recently asked a question on formatting a telephone number and got lots of responses. Most of them were great, but there is one I really want to understand, because it worked great. Given the $phone value below, how do the other lines work? What are they doing, so I can learn?
$phone = "(407)888-9999";
$phone = preg_replace("~[^0-9]~", "", $phone);
preg_match('~([0-9]{3})([0-9]{3})([0-9]{4})~', $phone, $matches);
Let's break the code into two lines.
preg_replace("~[^0-9]~", "", $phone);
First, we're going to replace matches to a regex with an empty string (in other words, delete matches from the string). The regex is [^0-9] (the ~ on each end is a delimiter). [...] in a regex defines a character class, which tells the regex engine to match one character within the class. Dashes are generally special characters inside a character class, and are used to specify a range (ie. 0-9 means all characters between 0 and 9, inclusive).
You can think of a character class as shorthand for a big OR condition: ie. [0-9] is shorthand for 0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9. Note that classes don't have to contain ranges, either -- [aeiou] is a character class that matches a or e or i or o or u (or in other words, any vowel).
When the first character in the class is ^, the class is negated, which means that the regex engine should match any character that isn't in the class. So when you put all that together, the first line removes anything that isn't a digit (a character between 0 and 9) from $phone.
preg_match('~([0-9]{3})([0-9]{3})([0-9]{4})~', $phone, $matches);
The second line tries to match $phone against a second expression, and puts the results into an array called $matches, if a match is made. You will note there are three sets of brackets; these define capturing groups -- ie. if there is a match of a pattern as a whole, you will end up with three submatches, which in this case will contain the area code, prefix and suffix of the phone number. In general, anything contained in brackets in a regular expression is capturing (while there are exceptions, they are beyond the scope of this explanation). Groups can be useful for other things too, without wanting the overhead of capturing, so a group can be made non-capturing by prefacing it with ?: (ie. (?:...)).
Each group does a similar thing: [0-9]{3} or [0-9]{4}. As we saw above, [0-9] defines a character class containing the digits between 0 and 9 (as the classes here don't start with ^, these are not negated classes). The {3} or {4} is a repetition operator, which says "match exactly 3 (or 4) of the previous token (or group)". So [0-9]{3} will match exactly three digits in a row, and [0-9]{4} will match exactly four digits in a row. Note that the digits don't have to be all the same (ie. 111), because the character class is evaluated for each repetition (so 123 will match because 1 matches [0-9], then 2 matches [0-9], and then 3 matches [0-9]).
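The two steps can be sketched end to end as follows. The 407-888-9999 output format and the variable names $digits and $formatted are assumptions for illustration; the question does not say what the final format should be.

```php
<?php
$phone = "(407)888-9999";

// Step 1: delete every character that is not a digit.
$digits = preg_replace('~[^0-9]~', '', $phone); // "4078889999"

// Step 2: split the digits into area code, prefix and suffix.
$formatted = '';
if (preg_match('~([0-9]{3})([0-9]{3})([0-9]{4})~', $digits, $matches)) {
    // $matches[0] is the whole match; [1], [2], [3] are the three groups.
    $formatted = sprintf('%s-%s-%s', $matches[1], $matches[2], $matches[3]);
}
echo $formatted, "\n"; // 407-888-9999
```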
The preg_replace looks for anything that is not 0-9 (the ^ inside the [] negates the class), so basically anything that is not a digit, and removes it from the string, since the replacement is "".
In the preg_match, the first section, ([0-9]{3}), pulls out the first 3 digits: the items inside the [] are what to match, the {3} is the number of characters to match, and since this is inside parentheses () it is stored as a match in the array $matches. The second part pulls out the next 3 digits, and the last part pulls out the last 4 digits from $phone; all of them are stored in $matches.
The ~ are delimiters for the regular expressions.
You know it's a regular expression from the regex tag.
So, you are pattern matching.
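There are two patterns here. The first, [^0-9], is used by preg_replace; the second is matched against the phone number.
[^0-9] is NOT (the '^') any one digit
The second pattern is any 3 digits, followed by any 3 digits, followed by any 4 digits.
On the original string that pattern would not match, because the () around the area code and the dash are not accounted for; it only matches here because preg_replace has already stripped them.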
To match the original string directly, I'd do this:
'~\(([0-9]{3})\)([0-9]{3})-([0-9]{4})~'
"[^0-9]" means everything but the digits 0 to 9. So basically, the first line replaces everything but digits with "" (nothing).
[0-9]{3} means a digit from 0 to 9, 3 times in a row.
So the second line checks whether you have 3 digits, then 3 digits, then 4 digits, and stores what it matched in $matches.
Check this tutorial:
Using Regular Expressions with PHP
http://www.webcheatsheet.com/php/regular_expressions.php
$phone = "(407)888-9999";
$phone = preg_replace("~[^0-9]~", "", $phone);
In PHP you have to delimit a regex pattern with some non-alphanumeric character; "~" is used here.
[^0-9] is the regex pattern used to remove anything from $phone that is not in the 0-9 range. Remember that a leading ^ inside [...] negates the character class.
preg_match('~([0-9]{3})([0-9]{3})([0-9]{4})~', $phone, $matches);
Again, in this line of code you have "~" as the delimiter.
([0-9]{3}): this part of the pattern captures 3 digits from the string (note: {} specifies the number, or range, of repetitions to match). Using ( ) in a pattern creates groups/submatches, each of which ends up in its own element of the output array (check your $matches variable for the result).