Using RegEx to find a string (as variable number) - php

How can I find numbers inside certain strings in php?
For example, given this text inside a page, I would like to find
|||12345|||
or
|||354|||
I'm interested in the numbers, which change with the page I visit (each number is the id of the page and is 3-5 characters long).
So the only thing I know for sure is those pipes surrounding the numbers.
Thanks in advance.

Using this \|\|\|\K\d{3,5}(?=\|\|\|)
gives many advantages.
https://regex101.com/r/LtbKfM/1
First, three literals without a quantifier is a simple strncmp() C
call. Also, any time a regex starts with an assertion it is
inherently slower. Therefore, this is the fastest match for the 3
leading pipe symbols.
Second, using the \K construct excludes whatever was previously
matched from group 0. We don't want to get the 3 pipes in the
match, but we do want to match them.
edit
Note that capture group results are not stored in a special string
buffer.
Each group is really a pointer (or offset) and a length.
The pointer (or offset) is to somewhere in the source string.
When it comes time to extract a particular group string, the overload function for braces
matches[#] uses the pointer (or offset) and length to create and return a string instance.
Using the \K construct simply sets the group 0 pointer (or offset)
to the position in the string that represents the position that
matched after the \K construct.
Third, using a lookahead assertion for 3 pipe symbols does not
consume the symbols as far as the next match is concerned. This
makes these symbols available for the next match. I.e:
|||999|||888||| would get 2 matches as would
|||999|||||888|||.
The result is an array of just the numbers.
Formatted
\|\|\| # 3 pipe symbols
\K # Exclude previous items from the match (group 0)
\d{3,5} # 3-5 digits
(?= \|\|\| ) # Assertion, not consumed, 3 pipe symbols ahead
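As a minimal sketch of the pattern in use (the sample text is made up for illustration), the snippet below shows that the result holds only the digits and that, with PREG_OFFSET_CAPTURE, group 0 starts at the digits instead of at the pipes:

```php
<?php
// Pattern from the answer above: three pipes, \K, 3-5 digits, lookahead for three pipes.
$pattern = '/\|\|\|\K\d{3,5}(?=\|\|\|)/';

// Hypothetical sample text; real page content will differ.
$text = 'id |||999|||888||| and |||12345|||';

preg_match_all($pattern, $text, $matches);
// $matches[0] holds only the digits because \K dropped the leading pipes from group 0.
print_r($matches[0]);

// With PREG_OFFSET_CAPTURE, group 0 reports the offset of the digits, not of the pipes.
preg_match($pattern, 'ab|||123|||', $m, PREG_OFFSET_CAPTURE);
print_r($m[0]);
```

The shared pipes between 999 and 888 are reused because the lookahead does not consume them.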

While #S.Kablar's suggestion is pretty valid, it makes use of a syntax that may be difficult for a beginner.
The more casual way to achieve your goal would be as follows:
$text = 'your input string';
if (preg_match_all('~\|{3}(\d+)\|{3}~', $text, $matches)) {
    foreach ($matches[1] as $number) {
        var_dump($number); // prints something like string(3) "345"
    }
}
The breakdown of the regex:
~ and ~ surround the expression
\| stands for the pipe, which is a special character in regex and must be escaped with a backslash
{3} says 'the previous (the pipe) must be present exactly three times'
( and ) enclose a subpattern so that it is stored under $matches[1]
\d requires a digit
+ says 'the previous (a digit) may be repeated but must have at least one instance'

Related

Two or more occurrence of at least one in character set with PHP regex

I want to make PHP regex to find if text has two or more of at least one character in character set {-, l, s, i, a}.
I made like this.
preg_match("/[-lisa]{2,}/", $text);
But this doesn't work.
Please help me.
Matching two or more occurrences means matching two is enough for the check to be valid.
At least one in character set might mean either that the same char from the set must occur twice, or that any two chars from the set must occur. For the former, when the same char repeats, you can use preg_match('~([-lisa]).*?\1~', $string, $match) (note the single quotes delimiting the string literal; if you use double quotes, the backreference must have a double backslash). For the latter, i.e. you want to match ..l...i.., you can use preg_match('~[-lisa].*?[-lisa]~', $string, $match) or preg_match('~([-lisa]).*?(?1)~', $string, $match) (where (?1) is a regex subroutine that repeats the corresponding group pattern).
If your strings contain line breaks, do not forget to add s modifier, preg_match('~([-lisa]).*?\1~s', $string, $match).
More than that, if you want to check for consecutive character repetition, you should remove .*? from the above patterns, i.e. 1) must be preg_match('~([-lisa])\1~', $string, $match) and 2) must be preg_match('~[-lisa]{2}~', $string, $match) (though this is not what you want judging by your own feedback, so this example is here just for the record).
The ([-lisa])\1{2} pattern that you find useful matches a repeated -, l, i, s or a char three times (---, lll, sss, etc.), thus only use it if it fits your requirements.
Note that the preg_match function searches for a match anywhere inside a string and does not require a full string match (thus, there is no need to add .* (or ^.*, .*$) at the start and end of the pattern).
See a sample regex demo, feel free to test your strings in this environment.
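To compare the interpretations side by side, here is a small sketch. The three function names are hypothetical helpers invented for this illustration; the patterns are the ones given above.

```php
<?php
// Hypothetical helper names; only the patterns come from the answer.

// True when one character from the set occurs at least twice anywhere.
function hasSameCharTwice(string $s): bool {
    return preg_match('~([-lisa]).*?\1~s', $s) === 1;
}

// True when any two characters from the set occur (they may differ).
function hasAnyTwoFromSet(string $s): bool {
    return preg_match('~[-lisa].*?[-lisa]~s', $s) === 1;
}

// True when one character from the set is immediately repeated.
function hasConsecutiveSameChar(string $s): bool {
    return preg_match('~([-lisa])\1~', $s) === 1;
}

var_dump(hasSameCharTwice('lol'));        // l occurs twice
var_dump(hasSameCharTwice('list'));       // l, i, s occur once each
var_dump(hasAnyTwoFromSet('list'));       // l and i are both in the set
var_dump(hasConsecutiveSameChar('class')); // ss is a direct repetition
```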

Regex for matching token wrapped in %

I have user-entered text with potentially mistyped "tokens" I'm trying to find using PHP.
A valid "token" is any number of word characters wrapped in percent signs - so %blah% %blah_moreblah%. Basically I'm looking for tokens where the user may have forgotten to put a leading or trailing '%'. I'm also looking for tokens in the valid format - as at this point in my code, all replaceable tokens have already been replaced.
So, the 3 situations I'm looking for are (to borrow regex syntax): %\w+, %\w+%, or \w+%.
In English, what I'm looking for is "a string that starts with a % and/or ends with a % and contains only word characters".
The regex I have this far is: (%*\w+%*), but you'll notice it matches every single word. I'm stuck on making a match require at least a leading or a trailing %.
Edit: Initially I tried to have all 3 situations found with their own regex. However, I was finding that the regex for finding tokens in the first situation would also find tokens in the second situation, just without the trailing %. For example, /(%\w+)/, when checked against %before %both%, would match %before and %both.
To match tokens enclosed with %, or having % on either side, use
(?=\w*%)%*\w+%*
See another regex demo.
This is your pattern with a positive lookahead added. The (?=\w*%) restricts matches to those where a % appears after zero or more word characters.
Note also that %* will match zero or more percent signs, so it may match %%%word%%. If that is not what you need, and you need to match 1 or 0 %s, just replace the * quantifier with ?.
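A quick sketch of both variants, run against made-up sample strings (the variable names are only for illustration):

```php
<?php
// Pattern from the answer: the lookahead requires a % after zero or more word chars.
$pattern = '/(?=\w*%)%*\w+%*/';

// Hypothetical sample text with one token of each kind plus plain words.
$text = 'Hello %before plain %both% and after% done';
preg_match_all($pattern, $text, $matches);
print_r($matches[0]);

// Variant with ? instead of *: at most one % on each side is included.
preg_match_all('/(?=\w*%)%?\w+%?/', '%%%word%%', $strict);
print_r($strict[0]);
```

Plain words such as Hello, plain, and, and done are skipped because no % follows their word characters.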
Try this:
$input_lines = "Hello this is a %string% with %some_words in it just for demo% purposes.";
preg_match_all("/\s[\w_\-]+%\.?|%[\w_\-]+(%|\s|\.)/", $input_lines, $output_array);
That will output this (the second and third matches also include the adjacent space, so trim() them if needed):
array(
0 => %string%
1 => %some_words
2 => demo%
)
Note that this will catch the valid cases, as well as the typos you are looking for.

What is the use of '\G' anchor in regex?

I'm having a difficulty with understanding how \G anchor works in PHP flavor of regular expressions.
I'm inclined to think (even though I may be wrong) that \G is used instead of ^ in situations when multiple matches of the same string are taking place.
Could someone please show an example of how \G should be used, and explain how and why it works?
UPDATE
\G forces the pattern to only return matches that are part of a continuous chain of matches. From the first match each subsequent match must be preceded by a match. If you break the chain the matches end.
<?php
$pattern = '#(match),#';
$subject = "match,match,match,match,not-match,match";
preg_match_all( $pattern, $subject, $matches );
//Will output match 5 times: 4 at the start, plus the "match," found inside "not-match,"
foreach ( $matches[1] as $match ) {
echo $match . '<br />';
}
echo '<br />';
$pattern = '#(\Gmatch),#';
$subject = "match,match,match,match,not-match,match";
preg_match_all( $pattern, $subject, $matches );
//Will only output match 4 times because at not-match the chain is broken
foreach ( $matches[1] as $match ) {
echo $match . '<br />';
}
?>
This is straight from the docs
The fourth use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a particular
point in a match, without consuming any characters from the subject
string. The use of subpatterns for more complicated assertions is
described below. The backslashed assertions are
\G
first matching position in subject
The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the offset argument of
preg_match(). It differs from \A when the value of offset is non-zero.
http://www.php.net/manual/en/regexp.reference.escape.php
You will have to scroll down that page a bit but there it is.
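The difference from \A that the docs mention can be seen with the offset argument of preg_match(). This is a small sketch with a made-up subject string:

```php
<?php
$subject = 'xxabc';

// Start matching at offset 2. \G is true at the offset where matching starts.
$withG = preg_match('/\Gabc/', $subject, $m1, 0, 2);

// \A is only true at the real start of the subject, so this cannot match.
$withA = preg_match('/\Aabc/', $subject, $m2, 0, 2);

var_dump($withG, $withA);
```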
There is a really good example in ruby but it is the same in php.
How the Anchor \z and \G works in Ruby?
\G will match the match boundary, which is either the beginning of the string, or the point where the last character of last match is consumed.
It is particularly useful when you need to do complex tokenization, while also making sure that the tokens are valid.
Example problem
Let us take the example of tokenizing this input:
input 'some input in quote' more input '\'escaped quote\'' lots#_$of_fun ' \' \\ ' crazy'stuff'
Into these tokens (I use ~ to denote end of string):
input~
some input in quote~
more~
input~
'escaped quote'~
lots#_$of_fun~
' \ ~
crazy~
stuff~
The string consists of a mix of:
Singly quoted string, which allows the escape of \ and ', and spaces are conserved. Empty string can be specified using singly quoted string.
OR unquoted string, which consists of a sequence of non-white-space characters, and does not contain \ or '.
Space between 2 unquoted string will delimit them. Space is not necessary to delimit other cases.
For the sake of simplicity, let us assume the input does not contain new line (in real case, you need to consider it). It will add to the complexity of the regex without demonstrating the point.
The RAW regex for singly quoted string is '(?:[^\\']|\\[\\'])*+'
And the RAW regex for unquoted string is [^\s'\\]++
You don't need to care too much about the 2 pieces of regex above, though.
The solution below with \G makes sure that when the engine fails to find any match, all characters from the beginning of the string to the position of the last match have been consumed. Since it cannot skip characters, the engine will stop matching when it fails to find a valid match for either token specification, rather than grabbing random stuff in the rest of the string.
Construction
At the first step of construction, we can put together this regex:
\G(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))
Or simply put (this is not regex - just to make it easier to read):
\G(Singly_quote_regex|Unquoted_regex)
This will match the first token only, since when it attempts matching for the 2nd time, the match stops at the space before 'some input....
We just need to add a bit to allow for 0 or more space, so that in the subsequent match, the space at the position left off by the last match is consumed:
\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))
The regex above will now correctly identify the tokens, as seen here.
The regex can be further modified so that it returns the rest of the string when the engine fails to retrieve any valid token:
\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++)|((?s).+$))
Since the alternation is tried in order from left to right, the last alternative ((?s).+$) will be matched if and only if the string ahead doesn't make up a valid single quoted or unquoted token. This can be used to check for errors.
The first capturing group will contain the text inside single quoted string, which needs extra processing to turn into the desired text (it is not really relevant here, so I leave it as an exercise to the readers). The second capturing group will contain the unquoted string. And the third capturing group acts as an indicator that the input string is not valid.
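One way the final regex might be driven from PHP is sketched below. The tokenize() function is a hypothetical helper written for this illustration, and the unescaping step with strtr() is just one possible way to do the extra processing of group 1.

```php
<?php
// Final pattern from the text, in a nowdoc so the backslashes stay literal.
$pattern = <<<'RE'
~\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++)|((?s).+$))~
RE;

// tokenize() is a hypothetical helper name, not part of PHP.
function tokenize(string $pattern, string $input): array {
    preg_match_all($pattern, $input, $sets, PREG_SET_ORDER);
    $tokens = [];
    foreach ($sets as $m) {
        // Group 3 only matches when no valid token starts here.
        if (isset($m[3]) && $m[3] !== '') {
            throw new InvalidArgumentException('Invalid input near: ' . $m[3]);
        }
        // Group 2 is an unquoted token; it can never be empty when it matched.
        if (isset($m[2]) && $m[2] !== '') {
            $tokens[] = $m[2];
            continue;
        }
        // Otherwise group 1 matched: unescape \\ and \' inside the quoted text.
        $tokens[] = strtr($m[1], ['\\\\' => '\\', "\\'" => "'"]);
    }
    return $tokens;
}

// The example input from the text.
$input = <<<'TXT'
input 'some input in quote' more input '\'escaped quote\'' lots#_$of_fun ' \' \\ ' crazy'stuff'
TXT;

$tokens = tokenize($pattern, $input);
print_r($tokens);
```

Because every match must start where the previous one ended, an unterminated quote ends up in group 3 and raises the exception instead of being silently skipped.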
Demo for the final regex
Conclusion
The above example demonstrates one scenario where \G is used in tokenization. There may be other uses that I haven't come across.

Regex replace for wrapping numbers with pound signs

I have numbers wrapped with curly brackets in my text i.e. {123} or {456ABC}. I also have numbers not wrapped with brackets i.e. 789. I want to match these not-yet wrapped numbers and use PHP's preg_replace to wrap them with pound signs i.e. #789#. The numbers usually range from 1-3 digits.
print(preg_replace('/\d+/','#$0#',
'1) I can count to 2997510. You can only count to {456ABC}.'));
Desired output:
#1#) I can count to #2997510#. You can only count to {456ABC}.
What regex would match the numbers? I've tried negative lookahead (?![^\{])\d+ and [^\{](\d+)[^\{]
[^\{\dA-F]([A-F\d]+)[^\}\dA-F]
(I'm assuming that you're trying to match hex numbers with capital letters; if not, just alter the character class appropriately.)
The extra \d's are in the negative character classes because if they aren't there, then the engine will avoid brackets by cutting off the outermost digits. For instance, [^\{](\d+)[^\}] will match the 456 in {34567}.
The number itself is "group 1" of any match. If you need the entire match itself to be the number, use a lookahead and a lookbehind:
(?<=[^\{\dA-F])([A-F\d]+)(?=[^\}\dA-F])
Here is a Perl-style search-and-replace to insert the #'s, with no lookahead or lookbehind:
s/([^\{\dA-F])([A-F\d]+)([^\}\dA-F])/$1#$2#$3/g
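For PHP, a possible variant with preg_replace is sketched below. Note that it is not the exact pattern above: it swaps the consuming character classes for negative lookarounds so that numbers at the very start or end of the string also qualify, and it assumes a number always begins with a digit.

```php
<?php
// Assumption: a number starts with a digit and may continue with digits or A-F.
// (?<![{\dA-F]) : not preceded by {, a digit, or A-F
// (?![}\dA-F])  : not followed by }, a digit, or A-F
$pattern = '/(?<![{\dA-F])\d[A-F\d]*(?![}\dA-F])/';

$text = '1) I can count to 2997510. You can only count to {456ABC}.';

// $0 is the whole match, so no surrounding characters need to be put back.
$result = preg_replace($pattern, '#$0#', $text);
echo $result, PHP_EOL;
```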
(\A|[^{\d])(\d[\d\w]*)(\z|[^\}\d\z]) should do it for you.
Used like:
print(preg_replace('/(\A|[^{\d])(\d[\d\w]*)(\z|[^\}\d\z])/','$1#$2#$3',
'1) I can count to 2997510. You can only count to {456ABC}.'));
Explanation:
The first part (\A|[^{\d]) matches either the start of the input (to catch numbers at the beginning of the string) or a non { or digit. This part ensures the numbers aren't already wrapped.
The second part (\d[\d\w]*) does the actual matching of the number. It matches anything that starts with a digit followed by any number of contiguous digits or letters.
The last part (\z|[^\}\d\z]) is analogous to the first part, except looks for the end of the input.
Because this regular expression can capture a character before and after the target number, it is important to add those characters back in using the 1st and 3rd matched subgroups (as seen in the PHP example).

How to get number data with dots within a string?

There is a string variable containing number data with dots, say $x = "OP/1.1.2/DIR";. The user can change the position of the number data at any time by modifying it inside the application, and the slash may be replaced by any other character, but the dotted number data is mandatory. So how can I extract the dotted number data, here 1.1.2, from the string?
Use a regular expression:
(\d+(?:\.\d+)*)
Breakdown:
\d+ look for one or more digits
\. a literal decimal . character
\d+ followed by one or more digits again
(...)* this means match 0 or more occurrences of this pattern
(?:...) this tells the engine not to create a backreference for this group (basically, we don't use the reference, so it's pointless to have one)
You haven't given much information about the data, so I've made the following assumptions:
The data will always contain at least one number
The data may contain only a number without a dot
The data may contain multi-digit numbers
The numbers themselves may contain any number of dot/digit pairs
If any of these assumptions are incorrect, you'll have to modify the regular expression.
Example usage:
$x = "OP/1.1.2/DIR";
if (!preg_match('/(\d+(?:\.\d+)*)/', $x, $matches)) {
// Could not find a matching number in the data - handle this appropriately
} else {
var_dump($matches[1]); // string(5) "1.1.2"
}
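Since the separator and the position of the number may vary, the same pattern can be wrapped in a small function. This is only a sketch: extractDottedNumber() is a hypothetical name, and the extra sample strings are invented.

```php
<?php
// extractDottedNumber() is a hypothetical helper name, not part of PHP.
// Returns the first dotted number found, or null when there is none.
function extractDottedNumber(string $s): ?string {
    return preg_match('/\d+(?:\.\d+)*/', $s, $m) === 1 ? $m[0] : null;
}

var_dump(extractDottedNumber('OP/1.1.2/DIR')); // slash separators, number in the middle
var_dump(extractDottedNumber('DIR-10.4-OP'));  // other separator, multi-digit part
var_dump(extractDottedNumber('v7'));           // a number without any dot
var_dump(extractDottedNumber('none'));         // no digits at all
```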
