I'm building a comment board feature that lets users reference post IDs; a regex should automatically turn each reference into a hyperlink to the relevant post.
Post references are formatted as follows, using the double-arrow symbol: »1234
At most 6 digits can follow the double arrow for the reference to be hyperlinked, so »1234567 would not hyperlink, but »1, »12, »123, etc. would.
How would I go about doing this with regex?
Match the special character followed by 1-6 digits and then a word boundary, so it won't match when the digits run straight into further digits or other word characters (as in »1234567).
»\d{1,6}\b
Here is one solution: » matches the arrow character, \d matches a digit between 0 and 9, and {1,6} specifies that at least 1 and at most 6 digits should follow. Without a boundary, the pattern still matches the first six digits of »1234567; to reject such references, add a word boundary (\b) after the digits (a \b in front of » would only match after a word character, because » itself is not one). If you want to check whether the whole string consists only of this pattern, use anchors (^ at the beginning, $ at the end).
»\d{1,6}
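For illustration, here is a minimal Python sketch of the hyperlinking step, using the word-boundary version of the pattern. The function name and the `/posts/{post_id}` URL template are assumptions made up for this example, not something from the question.

```python
import re

# "»" followed by 1-6 digits, then a word boundary so longer runs are rejected.
POST_REF = re.compile(r"»(\d{1,6})\b")

def link_post_refs(text, url_template="/posts/{post_id}"):
    """Replace each »NNN reference with an HTML link (url_template is hypothetical)."""
    def to_link(match):
        url = url_template.format(post_id=match.group(1))
        return '<a href="{}">{}</a>'.format(url, match.group(0))
    return POST_REF.sub(to_link, text)
```

A reference with seven or more digits is left as plain text, because no 1-6 digit prefix of it ends at a word boundary.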
I have a string with the following "valid" pattern which is repeated multiple times:
A specific group of characters, say "ab", any number of other characters, say "xx", a different specific group of characters, say "cd", any number of other characters, say "xx".
So a valid sequence would be:
"abxcdabxxcdabxcdxx"
I'm trying to detect invalid sequences of this specific form: "abxxcdxxcd", and remove the middle "cd" to make it valid: "abxxxxcd"
I have tried the following regex:
/(?<=ab).*(cd).*(?=ab)/gsU
It works for a single sequence, but it fails for the following string:
"abxxcdxcdxxabxcdxxabxcdxxcd", which contains an invalid sequence, followed by a valid sequence, followed by another invalid sequence. I want to capture both groups in bold.
Note that the other characters "xx" may contain anything, including line breaks. They will never, however, contain the strings "ab" or "cd", except in the invalid case I specified.
Here's the corresponding regex101 link: https://regex101.com/r/U9pRfo/1
Edit:
Wiktor's answer worked for me. However, I was getting PREG_JIT_STACKLIMIT_ERROR in PHP when using that regex on a very large string. I ended up splitting the string into smaller chunks and rebuilding it afterwards, which worked perfectly.
You may use
'~(?:\G(?!^)|ab)(?:(?!ab).)*?\Kcd(?=(?:(?!ab).)*?cd)~s'
See the regex demo
(?:\G(?!^)|ab) - a non-capturing group matching ab or the end of the previous match
(?:(?!ab).)*? - matches any char, 0 or more times, as few as possible, that does not start an ab char sequence
\K - match reset operator: discards the text matched so far from the returned match
cd - a literal substring
(?=(?:(?!ab).)*?cd) - a positive lookahead that requires any chars, 0 or more, as few as possible, none of which starts an ab char sequence, followed by a cd char sequence.
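Python's built-in re module has no \G or \K, so the pattern above cannot be used there as-is. The sketch below swaps in a different technique that should give the same result under the question's assumptions: split the input into segments that each start at ab, and drop every cd except the last one in each segment. The names are made up for this example.

```python
import re

# One segment: "ab" plus everything up to (not including) the next "ab".
SEGMENT = re.compile(r"ab(?:(?!ab).)*", re.DOTALL)

def drop_extra_cd(text):
    """Within each ab-segment, remove every "cd" except the last one."""
    def fix(match):
        segment = match.group(0)
        last = segment.rfind("cd")
        if last == -1:
            return segment  # no "cd" at all, nothing to repair
        # Strip earlier "cd" occurrences, keep the final one untouched.
        return segment[:last].replace("cd", "") + segment[last:]
    return SEGMENT.sub(fix, text)
```

Because each segment is processed on its own, this also works chunk by chunk, as long as chunks are cut just before an ab.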
I am trying to find 8-character words using regex.
Each word may contain digits, letters, or both.
The 8 characters must be followed by .php.
It must be exactly 8 characters, not 7 or 6.
I tried \b\d{8}\b.php
but it only works for all-digit names, for example:
12121212.php
23232323.php
Also, I don't want to match:
something-catergory.php
AB787C-category.php
has-bookshok.php
The final result should be like abcd1234.php rather than something-abcd1234.php
You can use a character class:
\b[a-zA-Z0-9]{8}\b
\b - word boundary.
[a-zA-Z0-9]{8} - matches digits, letters, or both ({8} means the length must be exactly 8 characters).
Update
The final result should be like abcd1234.php rather than
something-abcd1234.php
\b[a-zA-Z0-9]{8}\.php$
Demo
If you want the complete string to match, use the ^ anchor at the start and $ at the end instead of \b:
^[a-zA-Z0-9]{8}\.php$
If your data is a list of filenames then this regex will work:
/^[a-z0-9]{8}\.php$/i
It asserts that the filename is exactly 8 [a-zA-Z0-9] characters followed by .php. Note that the i modifier makes it case insensitive so we don't have to specify A-Z in the character class as well.
Here's a demo on 3v4l.org
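As a rough Python equivalent of the anchored, case-insensitive pattern, the sketch below filters a list of file names. The function name and the sample list are assumptions for illustration only.

```python
import re

# Exactly 8 letters/digits followed by ".php", anchored at both ends.
EIGHT_CHAR_PHP = re.compile(r"^[a-z0-9]{8}\.php$", re.IGNORECASE)

def eight_char_php_files(names):
    """Return only the names made of exactly 8 alphanumerics plus .php."""
    return [name for name in names if EIGHT_CHAR_PHP.match(name)]
```

Names such as something-abcd1234.php are rejected because ^ forces the 8 characters to start at the beginning of the name.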
I have a piece of data retrieved from the database that contains information I need. The text is entered in free form, so it is written in many different ways. The only thing I know for sure is that I'm looking for the first number after a given string; any text may appear between that string and the number.
I tried these (where mytoken is the string I know for sure is there), but none of them work.
/(mytoken|MYTOKEN)(.*)\d{1}/
/(mytoken|MYTOKEN)[a-zA-Z]+\d{1}/
/(mytoken|MYTOKEN)(.*)[0-9]/
/(mytoken|MYTOKEN)[a-zA-Z]+[0-9]/
mytoken may also be written in capitals, lowercase, or a mix of both. Can the expression be case insensitive?
You do not need any lazy matching since you want to match any number of non-digit symbols up to the first digit. It is better done with a \D*:
/(mytoken)(\D*)(\d+)/i
See the regex demo
The pattern details:
(mytoken) - Group 1 matching mytoken (case insensitively, as there is a /i modifier)
(\D*) - Group 2 matching zero or more characters other than a digit
(\d+) - Group 3 matching 1 or more digits.
Note that \D also matches newlines, whereas . needs a DOTALL modifier to match across newlines.
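A minimal Python sketch of the same idea, assuming the token is literally mytoken and that the first number should come back as an integer (both the function name and the return type are choices made for this example):

```python
import re

# Token (any case), then any non-digits (including newlines), then the first number.
FIRST_NUMBER = re.compile(r"(mytoken)(\D*)(\d+)", re.IGNORECASE)

def first_number_after_token(text):
    """Return the first number after the token, or None if there is none."""
    match = FIRST_NUMBER.search(text)
    return int(match.group(3)) if match else None
```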
You need to use a lazy quantifier, which you get by putting a question mark after the star quantifier: .*?. Otherwise, the greedy .* consumes everything up to the last digit, and \d matches only that last digit.
Regex: /(mytoken|MYTOKEN)(.*?)\d/
Regex demo
You can use the opposite:
/(mytoken|MYTOKEN)(\D+)(\d)/
This says: mytoken, followed by one or more non-digit characters, followed by a digit. The (lazy) dot-star soup is not always your best bet. The desired digit will be in $3 in this example.
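To see the difference between the greedy, lazy, and negated-class variants, here is a small Python comparison. The helper name and the sample text are made up for this example, and the digit is captured in its own group so it can be inspected.

```python
import re

def captured_digit(pattern, text):
    """Return the digit captured by the last group of pattern, or None."""
    match = re.search(pattern, text)
    return match.group(match.lastindex) if match else None

SAMPLE = "mytoken abc 42 and 7"

greedy = captured_digit(r"mytoken(.*)(\d)", SAMPLE)    # .* runs to the last digit
lazy = captured_digit(r"mytoken(.*?)(\d)", SAMPLE)     # .*? stops at the first digit
negated = captured_digit(r"mytoken(\D+)(\d)", SAMPLE)  # \D+ cannot cross a digit
```

Here the greedy version captures 7, while the lazy and negated-class versions both capture 4.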
I've been trying for a couple of days to create a regex that finds the correct picture for a product barcode in the pictures folder.
The folder contains something like 4500 pictures.
The file name can be in one of 4 formats:
1. XXXXXX.jpg/png - the short barcode, an unknown number of characters (digits only).
2. 00000 (one or more leading zeros, count unknown) XXXX (then the short barcode).jpg/png
3. 729 (as leading number) 00000 (one or more leading zeros, count unknown) XXXX (then the short barcode).jpg/png
4. 72900000XXXXXXYYY YYY YYY.jpg/png - same as option 3 but followed by some characters (Y represents a character).
I came up with something like this:
$i = new RegexIterator($a, '($barcode)\D*|^([0][0-9]+$barcode)\D+|(729[0-9][0-9]+$barcode)\D+|(729[0-9][0-9]+$barcode).+/', RegexIterator::GET_MATCH);
$barcode - can be 7290000232 or 0000232 or 232
But it doesn't work.
Any ideas?
You have four cases that build on each other:
1. Only numbers, 1 to unlimited times: \d+
2. Case 1 with leading zeros: effectively the same as case 1, as zeros are numbers ;) No need for a special case here.
3. Case 1 optionally preceded by 729: (?:729)?\d+ (this already covers cases 1-3)
4. Case 3 with optional characters (zero to unlimited): (?:729)?\d+(?:[a-zA-Z])*
Only the extension is left to be added:
((?:729)?\d+(?:[a-zA-Z])*\.(?:jpg|png))
Now there's one thing left. This regex would match on abc123.jpg, as 123.jpg is perfectly valid. To counter this we add ^ (this denotes the start of the input):
^((?:729)?\d+(?:[a-zA-Z])*\.(?:jpg|png))
demo @ regex101
As you insert the barcode (from case 1) yourself, there are a few adjustments to be made:
^((?:729)?0*?$barcode(?:[a-zA-Z])*\.(?:jpg|png))
Here we have to insert the second case with 0*? (0 zero to unlimited times, lazy).
Regarding the [a-zA-Z]: you have to decide what to allow here. Currently it only allows lowercase and uppercase letters. If you want to allow spaces (for example), then simply add them to the character group: [a-zA-Z ].
For non-latin characters you can use [\x{00BF}-\x{1FFF}\x{2C00}-\x{D7FF}a-zA-Z] (credits to this comment) as your character group, so your regex would then look like:
^((?:729)?0*?123(?:[\x{00BF}-\x{1FFF}\x{2C00}-\x{D7FF}a-zA-Z])*\.(?:jpg|png))
demo @ regex101
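As a rough Python sketch of building the pattern from a barcode: it assumes the short barcode (e.g. 232) is passed in, allows letters and spaces after it, and adds a $ anchor that the pattern above does not have. The function names and the extra_chars parameter are hypothetical.

```python
import re

def barcode_pattern(short_barcode, extra_chars="a-zA-Z "):
    """Compile: optional 729, optional zeros, the barcode, optional chars, extension."""
    return re.compile(
        r"^(?:729)?0*?" + re.escape(short_barcode)  # escape in case of odd input
        + "[" + extra_chars + r"]*\.(?:jpg|png)$"   # trailing $ is an added assumption
    )

def find_pictures(short_barcode, filenames):
    """Return the file names that match the given short barcode."""
    pattern = barcode_pattern(short_barcode)
    return [name for name in filenames if pattern.match(name)]
```

Passing the barcode through re.escape keeps any regex metacharacters in the input from changing the pattern.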
From what I understand, options 1-3 are all the same (729 is a digit string like any other):
^\d+[.](?:jpg|png)$
With option 4 you are saying 'allow word characters and whitespace, but only if the name starts with 729'. So it is now:
(?:(?:^\d+[.](?:jpg|png)$)|(?:^729\d*[\w\s]+[.](?:jpg|png)$))
Demo here.
\s matches whitespace, \w matches word characters.
I have numbers wrapped in curly brackets in my text, e.g. {123} or {456ABC}. I also have numbers not wrapped in brackets, e.g. 789. I want to match these not-yet-wrapped numbers and use PHP's preg_replace to wrap them in pound signs, e.g. #789#. The numbers usually range from 1-3 digits.
print(preg_replace('/\d+/','#$0#',
'1) I can count to 2997510. You can only count to {456ABC}.'));
Desired output:
#1#) I can count to #2997510#. You can only count to {456ABC}.
What regex would match the numbers? I've tried negative lookahead (?![^\{])\d+ and [^\{](\d+)[^\{]
[^\{\dA-F]([A-F\d]+)[^\}\dA-F]
(I'm assuming that you're trying to match hex numbers with capital letters; if not, just alter the character class appropriately.)
The extra \d's are in the negative character classes because if they aren't there, then the engine will avoid brackets by cutting off the outermost digits. For instance, [^\{](\d+)[^\}] will match the 456 in {34567}.
The number itself is "group 1" of any match. If you need the entire match itself to be the number, use a lookahead and a lookbehind:
(?<=[^\{\dA-F])([A-F\d]+)(?=[^\}\dA-F])
Here is a Perl-style search-and-replace to insert the #'s, with no lookahead or lookbehind:
s/([^\{\dA-F])([A-F\d]+)([^\}\dA-F])/$1#$2#$3/g
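Note that the patterns above require an actual character before and after the number, so a number at the very start or end of the string is not matched. One possible alternative, shown here as a Python sketch rather than as this answer's method, is to use negative lookarounds, which also succeed at the string edges. The function name is made up, and like the pattern above it treats capital A-F as hex digits.

```python
import re

# Hex-style run not preceded by "{" or a hex char, not followed by "}" or a hex char.
UNWRAPPED = re.compile(r"(?<![{\dA-F])[A-F\d]+(?![}\dA-F])")

def wrap_numbers(text):
    """Wrap every not-yet-bracketed number in pound signs."""
    return UNWRAPPED.sub(lambda m: "#" + m.group(0) + "#", text)
```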
(\A|[^{\d])(\d[\d\w]*)(\z|[^\}\d\z]) should do it for you.
Used like:
print(preg_replace('/(\A|[^{\d])(\d[\d\w]*)(\z|[^\}\d\z])/','$1#$2#$3',
'1) I can count to 2997510. You can only count to {456ABC}.'));
Explanation:
The first part (\A|[^{\d]) matches either the start of the input (to catch numbers at the beginning of the string) or a character that is neither { nor a digit. This part ensures the numbers aren't already wrapped.
The second part (\d[\d\w]*) does the actual matching of the number. It matches anything that starts with a digit followed by any number of contiguous digits or letters.
The last part (\z|[^\}\d\z]) is analogous to the first part, except that it looks for the end of the input.
Because this regular expression can capture a character before and after the target number, it is important to add those characters back in using the 1st and 3rd matched subgroups (as seen in the PHP example).
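For comparison, here is a Python adaptation of the same three-group approach. It is only a sketch: Python spells the end-of-input anchor \Z here, and the \z inside the last character class is dropped, so the pattern is not byte-for-byte the one above. The function name is made up.

```python
import re

# Group 1: start of input or a char that is neither "{" nor a digit.
# Group 2: the number itself. Group 3: end of input or a char that is neither "}" nor a digit.
WRAP = re.compile(r"(\A|[^{\d])(\d[\d\w]*)(\Z|[^}\d])")

def wrap_unbracketed(text):
    """Re-insert the neighbouring characters around the wrapped number."""
    return WRAP.sub(r"\1#\2#\3", text)
```

Since the neighbouring characters are consumed by the match, two numbers separated by a single character may not both be wrapped in one pass.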