Regex for matching token wrapped in % - php

I have user-entered text with potentially mistyped "tokens" I'm trying to find using PHP.
A valid "token" is any number of word characters wrapped in percent signs - so %blah% %blah_moreblah%. Basically I'm looking for tokens where the user may have forgotten to put a leading or trailing '%'. I'm also looking for tokens in the valid format - as at this point in my code, all replaceable tokens have already been replaced.
So, the 3 situations I'm looking for are (to borrow regex syntax): %\w+, %\w+%, or \w+%.
In English, what I'm looking for is "a string that starts with a % and/or ends with a % and contains only word characters".
The regex I have so far is (%*\w+%*), but you'll notice it matches every single word. I'm stuck on making a match require at least a leading or a trailing %.
Edit: Initially I tried to have all 3 situations found with their own regex. However, I was finding that the regex for finding tokens in the first situation would also find tokens in the second situation, just without the trailing %. For example, /(%\w+)/, when checked against %before %both%, would match %before and %both.

To match tokens enclosed with %, or having % on either side, use
(?=\w*%)%*\w+%*
This is your pattern with a positive lookahead added. The (?=\w*%) restricts matches to positions where a % appears after zero or more word characters, so at least one leading or trailing % is required.
Note also that %* matches zero or more percent signs, so it may match %%%word%%. If that is not what you need, and you want to match 0 or 1 % signs, just replace the * quantifier with ?.
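As a rough illustration of the lookahead approach with the ? quantifier, the sketch below collects every token-like string and then separates the malformed ones. The sample text and the variable names ($text, $malformed) are made up for this example, and the second check simply assumes a valid token is exactly %\w+%.

```php
<?php
// Lookahead requires a % right after zero or more word characters,
// so plain words are skipped. "?" allows at most one % on each side.
$pattern = '/(?=\w*%)%?\w+%?/';

// Hypothetical user input with one valid and two mistyped tokens.
$text = 'Hi %first_name, your %order% ships to city% today';

preg_match_all($pattern, $text, $matches);

// Keep only the tokens that are NOT in the valid %word% form.
$malformed = array_values(array_filter(
    $matches[0],
    function ($token) {
        return preg_match('/^%\w+%$/', $token) !== 1;
    }
));

print_r($matches[0]); // all token-like strings
print_r($malformed);  // only the mistyped ones
```

Since all replaceable tokens have already been replaced at this point, anything left in $matches[0] is either an unknown token or a typo.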

Try this:
$input_lines = "Hello this is a %string% with %some_words in it just for demo% purposes.";
preg_match_all("/\s[\w_\-]+%\.?|%[\w_\-]+(%|\s|\.)/", $input_lines, $output_array);
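That will produce these matches in $output_array[0] (shown trimmed):
array(
0 => %string%
1 => %some_words
2 => demo%
)
Note that this will catch the valid cases, as well as the typos you are looking for. Because the pattern consumes the neighbouring whitespace, the second and third matches are actually "%some_words " and " demo%", so trim() each match before using it.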


Using RegEx to find a string (as variable number)

How can I find numbers inside certain strings in php?
For example, having this text inside a page, I would like to find for
|||12345|||
or
|||354|||
I'm interested in the numbers, they always change according to the page I visit (numbers being the id of the page and 3-5 characters length).
So the only thing I know for sure is those pipes surrounding the numbers.
Thanks in advance.
Using this pattern has several advantages:
\|\|\|\K\d{3,5}(?=\|\|\|)
https://regex101.com/r/LtbKfM/1
First, three literals without a quantifier can be matched with a simple
strncmp()-style C call. Also, a regex that starts with an assertion is
inherently slower. Therefore, this is the fastest way to match the 3
leading pipe symbols.
Second, using the \K construct excludes whatever was previously
matched from group 0. We don't want to get the 3 pipes in the
match, but we do want to match them.
Edit:
Note that capture group results are not stored in a special string
buffer.
Each group is really a pointer (or offset) and a length into the source string.
When it comes time to extract a particular group, the overloaded bracket operator
matches[#] uses that pointer (or offset) and length to create and return a string instance.
The \K construct simply sets the group 0 pointer (or offset)
to the position in the string reached when \K was matched.
Third, using a lookahead assertion for the 3 trailing pipe symbols does not
consume them, so they remain available as the leading pipes of the
next match. For example,
|||999|||888||| would get 2 matches, as would
|||999|||||888|||.
The result is an array of just the numbers.
Formatted
\|\|\| # 3 pipe symbols
\K # Exclude previous items from the match (group 0)
\d{3,5} # 3-5 digits
(?= \|\|\| ) # Assertion, not consumed, 3 pipe symbols ahead
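A minimal PHP sketch of the pattern above, assuming it is run through preg_match_all(). The sample strings and variable names ($page, $first) are invented for illustration. The second call uses PREG_OFFSET_CAPTURE to show that \K moves the start of group 0 past the pipes.

```php
<?php
$pattern = '/\|\|\|\K\d{3,5}(?=\|\|\|)/';

// Hypothetical page text; the last two ids share a pipe delimiter.
$page = 'id=|||12345||| ref=|||999|||888|||';

// Group 0 holds only the digits, because \K drops the leading pipes
// and the lookahead never consumes the trailing ones.
preg_match_all($pattern, $page, $matches);
print_r($matches[0]);

// With PREG_OFFSET_CAPTURE each match is a [string, offset] pair.
// The offset points at the first digit, not at the first pipe.
preg_match($pattern, 'x|||354|||', $first, PREG_OFFSET_CAPTURE);
print_r($first[0]);
```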
While @S.Kablar's suggestion is perfectly valid, it uses syntax that may be difficult for a beginner.
A more casual way to achieve your goal would be as follows:
$text = 'your input string';
if (preg_match_all('~\|{3}(\d+)\|{3}~', $text, $matches)) {
    foreach ($matches[1] as $number) {
        var_dump($number); // prints something like string(3) "345"
    }
}
The breakdown of the regex:
~ and ~ surround the expression
\| stands for the pipe, which is a special character in regex and must be escaped with a backslash
{3} says 'the previous (the pipe) must be present exactly three times'
( and ) enclose a subpattern so that it is stored under $matches[1]
\d requires a digit
+ says 'the previous (a digit) may be repeated but must have at least one instance'

Regex OR matching stuff that I dont want

I am using PHP.
I have strings like:
example.123.somethingelse
example.1234.somethingelse
example.2015.123.somethingelse
example.2015.1234.somethingelse
and I came up with this regex
/example\.(2015\.|)([0-9]{3,4})\./
What I want to get is "123" or "1234" and it works for these strings. But when the string is
example.2015.A01.somethingelse
the result is "2015".
The way I see it, after "2015." comes "A", which should not be matched by the regex, but it is (I suppose there is a solid reason for this that I don't understand yet).
How can I fix it (make the regex match nothing, since the last string does not follow the same structure as the others)?
Your regex is this:
/example\.(2015\.|)([0-9]{3,4})\./
That says
First match "example" followed by a period
Then match either "2015" followed by a period OR nothing at all.
Then match 3 or 4 digits in a row followed by a period
When you have the string example.2015.A01.somethingelse it matches the "example.2015." but then, as you said, the "A" messes it up so it backtracks and matches just "example." (remember the "OR" allowed for nothing to be matched). So it matches "example." followed by NOTHING followed by 3 or 4 numeric digits and a period -- since "2015" is 4 numeric digits it comfortably matches "example.2015.".
It's hard to tell from your description, but I think you've just got a mis-placed vertical bar:
/example\.(2015\.)|([0-9]{3,4})\./
That should match EITHER "example.2015." OR numbers like 123 -- but "2015" is still 4 numeric digits in a row, so it will still match. I don't have a clear enough idea of the pattern to figure out how that could be avoided.
Maybe use \d+ and take the first result in the array.
In your regex, you use the following:
(2015\.|)
This allows the regex to match either 2015. or the empty string (zero characters).
When the regex example\.(2015\.|)([0-9]{3,4})\. is applied to the following example:
example.2015.A01.somethingelse
it will match the literal characters example., then the empty string with (2015\.|), and then use ([0-9]{3,4})\. to match the string 2015. (4 numerical characters followed by a period). Thus your expression matches the following:
example.2015.
Looks like you need a possessive quantifier:
/example\.(2015\.)?+([0-9]{3,4})\./
The 2015. is still optional, but once the regex has matched it, it won't give it up, even if that causes the match to fail. I'm assuming the substring you're trying to capture with ([0-9]{3,4}) can never have the value 2015. That is, you won't need to match something like this:
example.2015.somethingelse
If that's not the case, it's going to be much more complicated.
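To see the effect of the possessive quantifier, here is a small sketch that runs the pattern against the five strings from the question. The $samples and $results names are only for this example. A null result means the string did not match at all.

```php
<?php
// (2015\.)?+ is possessive: once "2015." is consumed, the engine will
// not backtrack and retry with the optional group left empty.
$pattern = '/example\.(2015\.)?+([0-9]{3,4})\./';

$samples = [
    'example.123.somethingelse',
    'example.1234.somethingelse',
    'example.2015.123.somethingelse',
    'example.2015.1234.somethingelse',
    'example.2015.A01.somethingelse',
];

$results = [];
foreach ($samples as $sample) {
    // Group 2 is the 3-4 digit number; null marks "no match".
    $results[$sample] = preg_match($pattern, $sample, $m) === 1 ? $m[2] : null;
}

var_dump($results);
```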
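Here is one more pattern, which uses \K to drop the prefix from the match (note that, without a possessive quantifier, it can still backtrack and return 2015 for example.2015.A01.somethingelse):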
example\.(?:2015\.)?\K(\d+)
or to your specific amount of digits
example\.(?:2015\.)?\K(\d{3,4})

Using preg_match to validate a string format

I have an html form with an input for a sales order number which should have the format of K1234/5678. It should always start with the letter K then 4 numbers, a / and followed by another set of 4 numbers.
I'm trying to validate the formatting using preg_match and I'm getting lost in the syntax of preg_match. From http://php.net/manual/en/function.preg-match.php I've gotten close. With the following code I'm able to verify that it contains at least 1 letter, some numbers and at least 1 non-alphanumeric value.
$so = $_POST['so'];
if (preg_match("/^(?=.*[a-z]{1})(?=.*[0-9]{4})(?=.*[^a-z0-9]{1})/i", $so))
{
    print $so;
}
What is the correct syntax to use for this? Is preg_match even the best way to do this?
Try this:
preg_match("#^K[0-9]{4}/[0-9]{4}$#i", $so)
Explanation:
The # characters are regular expression delimiters - they indicate the start/end of the pattern. The ^ and $ indicate the start and end of the string - this means that it will only match if your sales order number is the only thing in the string. The letter K means match that letter, [0-9]{4} means match a digit exactly 4 times. The i at the end means a case-insensitive match - the K will match either "K" or "k".
When developing regular expressions, I often use regular expression testers - these allow you to enter your data and try a bunch of different things to refine your regex. Google PHP regex tester to find a list of tools. Also, there's a very complete reference to regular expressions at http://www.regular-expressions.info/.
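For reference, a minimal sketch of how the suggested pattern could be wrapped in a helper. The function name isValidSalesOrder is hypothetical, and the D modifier is an addition not in the answer above: it makes $ match only at the very end of the string, so a trailing newline is rejected as well.

```php
<?php
// Returns true only for the K1234/5678 format (K may be lower case).
function isValidSalesOrder(string $so): bool
{
    // # is the delimiter, so the / inside the pattern needs no escaping.
    // i = case-insensitive, D = "$" matches only at the absolute end.
    return preg_match('#^K[0-9]{4}/[0-9]{4}$#iD', $so) === 1;
}

var_dump(isValidSalesOrder('K1234/5678'));  // well-formed
var_dump(isValidSalesOrder('K123/5678'));   // too few digits
var_dump(isValidSalesOrder('XK1234/5678')); // extra leading character
```

In the form handler this could be called as isValidSalesOrder($_POST['so']) before the value is used.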

Matching ugly extra abbreviations and numbers in titles with PHP regex

I have to create regex to match ugly abbreviations and numbers. These can be one of following "formats":
1) [any alphabet char length of 1 char][0-9]
2) [double][whitespace][2-3 length of any alphabet char]
I tried to match double:
preg_match("/^-?(?:\d+|\d*\.\d+)$/", $source, $matches);
But I couldn't get it to match the following example: 1.1 AA My test title. What is wrong with my regex and how can I add the other formats to it too?
Your regex says: "start of string, followed by maybe a -, followed by either at least one digit or 0 or more digits, a dot and at least one digit, followed by the end of string."
So your regex could match, for example, 4.5, -.1 etc. This is exactly what you tell it to do.
Your test input string does not match since there are other characters present after the number 1.1, and even if it somehow magically matched, your "double" matching regex is wrong.
For a double without scientific notation you usually use this regex :
[-+]?\b[0-9]+(\.[0-9]+)?\b
Now that we have this out of our way we need a whitespace \s and
[2-3 length of alphabet]
Now I have no idea what [2-3 length of alphabet] means but by combining the above you get a regex like this :
[-+]?\b[0-9]+(\.[0-9]+)?\b\s[2-3 length of alphabet]
You can also place anchors ^$ if you want the string to match entirely :
^[-+]?\b[0-9]+(\.[0-9]+)?\b\s[2-3 length of alphabet]$
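If "[2-3 length of alphabet]" is read as two or three ASCII letters, which is only an assumption based on the question, the placeholder could be written as [A-Za-z]{2,3}. The sketch below anchors only at the start so that the rest of the title may follow, and adds a trailing \b so longer words are not partially matched. The $source value is the example from the question.

```php
<?php
// Assumed interpretation: a double, one whitespace, then 2-3 letters.
$pattern = '/^[-+]?\b[0-9]+(\.[0-9]+)?\b\s[A-Za-z]{2,3}\b/';

$source = '1.1 AA My test title';

if (preg_match($pattern, $source, $matches)) {
    // $matches[0] is the whole "number + abbreviation" prefix.
    var_dump($matches[0]);
}

// A title without such a prefix does not match.
$plain = preg_match($pattern, 'My test title');
var_dump($plain);
```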
Feel free to ask if you are stuck! :)
I see multiple issues with your regex:
You try to match the whole string (as a number) by the anchors: ^ at the beginning and $ at the end. If you don't want that, remove those.
The number group is non-capturing. It will be checked for matches, but those won't be added to $matches. That's because the ?: in (?:...) marks the group as non-capturing. Remove ?: to make that group capturing.
You place the shorter digit-pattern before the longer one. If you swap the order, the regex engine will look for it first and on success prefer it over the shorter one.
Maybe this already solves your issue:
preg_match("/-?(\d*\.\d+|\d+)/", $source, $matches);
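The effect of the alternation order can be checked with a short sketch. Both patterns are unanchored and capturing, and differ only in which alternative comes first. The variable names are made up for this example.

```php
<?php
$source = '1.1 AA My test title';

// Shorter alternative first: \d+ succeeds on "1" and the engine stops.
preg_match('/-?(\d+|\d*\.\d+)/', $source, $shortFirst);

// Longer alternative first: \d*\.\d+ is tried first and takes "1.1".
preg_match('/-?(\d*\.\d+|\d+)/', $source, $longFirst);

var_dump($shortFirst[1]);
var_dump($longFirst[1]);
```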

regex validation

I am trying to validate a string of 3 numbers followed by / then 5 more numbers
I thought this would work
(/^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/i)
but it doesn't. Any ideas what I'm doing wrong?
Try this
preg_match('#^\d{3}/\d{5}#', $string)
One problem with yours is the + symbols, which match "one or more" of the preceding character or character class, so the pattern does not enforce exactly 3 and exactly 5 digits.
Also, when using forward-slash delimiters (the characters at the start and end of your expression), you need to escape any forward-slashes in the pattern by prefixing them with a backslash, eg
/foo\/bar/
PHP allows you to use alternate delimiters (as in my answer) which is handy if your expression contains many forward-slashes.
First of all, you're using / as the regexp delimiter, so you can't use it in the pattern without escaping it with a backslash. Otherwise, PHP will think that your pattern ends at the / in the middle (you can see that even StackOverflow's syntax highlighting thinks so).
Second, the + means "one or more", so [0-9]+[0-9]+[0-9]+ matches three or more digits rather than exactly three. The + is greedy, but the engine backtracks so that each part still gets at least one digit, which means overly long input such as 1234/567890 is accepted too.
Third, there's no need to use i, since you're dealing with numbers which aren't upper- or lowercase, so case-sensitivity is a moot point.
Try this instead
/^\d{3}\/\d{5}$/
The \d is shorthand for writing [0-9], and the {3} and {5} means repeat 3 or 5 times, respectively.
(This pattern is anchored to the start and the end of the string. Your pattern was only anchored to the beginning, and if that was on purpose, then remove the $ from my pattern.)
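A quick way to compare the two behaviours is to run both patterns over a few inputs. In this sketch $loose is the original expression with only the slash escaped, $strict is the anchored pattern above, and the input list is invented for the example.

```php
<?php
// Original pattern with the delimiter problem fixed (slash escaped).
$loose = '/^[0-9]+[0-9]+[0-9]+\/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]/';
// Exactly 3 digits, a slash, exactly 5 digits, nothing else.
$strict = '/^\d{3}\/\d{5}$/';

$inputs = ['123/45678', '1234/567890', '12/45678'];

$results = [];
foreach ($inputs as $input) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$input] = [
        'loose'  => preg_match($loose, $input),
        'strict' => preg_match($strict, $input),
    ];
}

print_r($results);
```

Only the strict pattern rejects 1234/567890, which has too many digits on both sides of the slash.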
I recently found this site useful for debugging regexes:
http://www.regextester.com/index2.html
It assumes use of /.../ (meaning you should not include those slashes in the regex you paste in).
So, after I put your regex ^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]) in the Regex box and 123/45678 in the Test box I see no match. When I put a backslash in front of the forward slash in the middle, then it recognizes the match. You can then try matching 1234/567890 and discover it still matches. Then you go through and remove all the plus signs and then it correctly stops matching.
What I particularly like about this particular site is the way it shows the partial matches in red, allowing you to see where your regex is working up to.
