I'm trying to capture data in a web url with regex

I'm trying to build a regex to match my URLs.
Here are two example URLs:
category/sorganiser/bouger/escalade/offre/78934/
category/sorganiser/savourer/offre/8040/
I would like to get the number just after offre (78934 and 8040),
as well as the word just before offre (escalade and savourer).
I tried several patterns, but none of them matched:
^category/(((\w)+/){1,3})(\d+)/?$
^category/(((\w)+/){1,3})/offre/(\d+)/?$
https://regex101.com/r/S4MTvK/1
Thank you

Instead of repeating a group that holds a single word character, (\w)+, you can match 1 or more word characters inside a single group: (\w+)
Do not match an extra / before offre, as the trailing / is already consumed by each iteration of ^category/(?:(\w+)/){1,3}
When a capture group sits inside a repeated non-capture group (?:...), it keeps only the value of the last iteration, which here is the word just before offre.
^category/(?:(\w+)/){1,3}offre/(\d+)
The pattern matches
^ Start of string
category/ Match literally
(?: Non capture group
(\w+)/ Capture group 1, match 1+ word chars and match /
){1,3} Close the non capture group and repeat it 1-3 times; capture group 1 keeps the last iteration of 1+ word chars, which is escalade or savourer
offre/ Match literally
(\d+) Capture group 2, match 1+ digits
To also match an optional / before the end of the string:
^category/(?:(\w+)/){1,3}offre/(\d+)/?$
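As a quick check, the anchored pattern can be run against both example URLs with preg_match. This is only a minimal sketch: the variable names and the ~ delimiter are my own choices and not part of the answer above.

```php
<?php
// "~" is used as the delimiter so that "/" needs no escaping (an arbitrary choice).
$pattern = '~^category/(?:(\w+)/){1,3}offre/(\d+)/?$~';

$urls = [
    'category/sorganiser/bouger/escalade/offre/78934/',
    'category/sorganiser/savourer/offre/8040/',
];

$results = [];
foreach ($urls as $url) {
    if (preg_match($pattern, $url, $m) === 1) {
        // $m[1] is the last repetition of group 1, $m[2] the digits after "offre/".
        $results[] = ['word' => $m[1], 'id' => $m[2]];
    }
}

print_r($results);
```

The first URL should give escalade and 78934, the second savourer and 8040.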

Related

regex repetitive pattern with monetary amount

(?:[.,]?\d{3})* is part of a monetary amount pattern; it matches the thousands separator and the 3 digits after it. The thousands separator can be ., , or nothing.
But how do I make the repetition consistent? If the first thousands separator is ., then the rest should also be .

You can use a capture group with a repeating backreference:
^(?:\d{4,}|\d{1,3}(?:([,.])\d{3}(?:\1\d{3})*)?)$
Explanation
^ Start of string
(?: Non capture group for the alternatives
\d{4,} Match 4 or more digits
| Or
\d{1,3} Match 1-3 digits
(?: Non capture group
([,.])\d{3} Capture either a . or a comma in group 1, then match 3 digits
(?:\1\d{3})* Optionally repeat the backreference to group 1 (the same separator char) followed by 3 digits
)? Close the non capture group and make it optional (to also match 1-3 digits)
) Close the non capture group
$ End of string
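A small sketch of how the backreference pattern behaves in PHP; isAmount() is a hypothetical helper name and the sample values are made up for illustration.

```php
<?php
// Hypothetical helper; the pattern is the one explained above.
function isAmount(string $value): bool
{
    $pattern = '/^(?:\d{4,}|\d{1,3}(?:([,.])\d{3}(?:\1\d{3})*)?)$/';
    return preg_match($pattern, $value) === 1;
}

var_dump(isAmount('1.234.567')); // same separator throughout: true
var_dump(isAmount('1,234,567')); // same separator throughout: true
var_dump(isAmount('1.234,567')); // mixed separators: false
var_dump(isAmount('1234567'));   // no separator at all: true
```

The mixed case fails because \1 only matches the separator that group 1 captured first.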

Take it block by block: the beginning, the middle and the end. Then use an alternation to match the numbers below 1000.
^((\d{1,3}[.,]){,2}(\d{3}[,.])*(\d{3})|(\d{1,3}))$
(This works with the Python flavor on this tool: https://regex101.com/)
For PHP I had to modify it like this:
^(\d{1,3}|(\d{1,3}[.,])+(\d{3}[.,])*\d{3})$

Maximum character length for PHP multiline regular expressions?

I'm trying to evaluate a multiline RegExp with preg_match_all.
Unfortunately there seems to be a character limit around 24,000 characters (24,577 to be specific).
Does anyone know how to get this to work?
Simplified example:
<?php
// Build a subject longer than the observed limit of 24,577 characters.
$data = 'TRACE: ' . str_repeat('a', 24577) . "\n";
preg_match_all('/([A-Z]+): ((?:(?![A-Z]+:).)*)\n/s', $data, $matches);
var_dump($matches);
?>
Working example (with < 24,577 characters): https://3v4l.org/8iRCc
Example that's NOT working (with > 24,577 characters): https://3v4l.org/ceKn6

You might rewrite the pattern using a negated character class instead of the tempered greedy token approach with the negative lookahead:
([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n
([A-Z]+): Capture group 1 matches 1+ uppercase chars, then match : and a space
( Capture group 2
[^A-Z\r\n]* Match 0+ times any char except A-Z or a newline
(?> Atomic group
(?: Non capture group
\r?\n Match a newline
| Or
[A-Z] Match a single char A-Z
(?![A-Z]*:) Negative lookahead, assert that what follows is not optional chars A-Z and then :
) Close non capture group
[^A-Z\r\n]* Optionally match any char except A-Z or a newline
)* Close atomic group and optionally repeat
)\r?\n Close group 2 and match a newline
If the TRACE: is at the start of the string, you can also add an anchor:
^([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n
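To try the unanchored negated character class pattern on a long subject, something like the following could be used. It is a sketch under my own assumptions: the 30,000 character length and the ERROR record are invented test data, chosen only to exceed the size mentioned in the question.

```php
<?php
// Invented test data: one long TRACE record followed by a short ERROR record.
$data = 'TRACE: ' . str_repeat('a', 30000) . "\nERROR: b\n";

$re = '/([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n/';
$count = preg_match_all($re, $data, $matches, PREG_SET_ORDER);

if ($count === false) {
    // preg_match_all() returns false when PCRE gives up (for example on a limit).
    echo 'PCRE error code: ', preg_last_error(), PHP_EOL;
} else {
    foreach ($matches as $m) {
        echo $m[1], ' => ', strlen($m[2]), ' chars', PHP_EOL;
    }
}
```

Checking the return value for false is worth keeping, since a PCRE failure is otherwise easy to mistake for "no matches".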
Edit
If every record starts with the same format, you can match that opening and then capture all following lines that do not start with it.
^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)
The pattern matches:
^ Start of string (or start of a line when the m flag is set)
([A-Z]+): Capture group 1 matches 1+ uppercase chars, then match : and a space
( Capture group 2
.* Match the rest of the line
(?:\r?\n(?![A-Z]+: ).*)* Repeat matching all lines that do not start with the pattern [A-Z]+: followed by a space
) Close group 2
In PHP you can use:
$re = '/^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)/m';
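Here is a rough sketch of the line-based pattern applied to a multi-line record. The sample data (a 30,000 character first line, a continuation line and an ERROR record) is invented for illustration.

```php
<?php
// Invented test data: a long TRACE record spanning two lines, then an ERROR record.
$data = 'TRACE: ' . str_repeat('a', 30000) . "\nsecond line of the trace\nERROR: boom";

$re = '/^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)/m';
preg_match_all($re, $data, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the label, $m[2] the body including any continuation lines.
    echo $m[1], ': ', strlen($m[2]), ' characters', PHP_EOL;
}
```

The continuation line ends up in group 2 of the TRACE record because it does not start with an uppercase label followed by a colon and a space.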

Try this
preg_match('/\A(?>[^\r\n]*(?>\r\n?|\n)){0,4}[^\r\n]*\z/',$data)

How to add additional capture group to lookahead, lookbehind regex

I am using this regex: (?<=\[).+?(?=\]) to match data in my test string below.
This regex matches everything between my brackets. I also need to match the '1234567890ABC...' portion of my string. How would I do that?
This is my test string:
[one] [two] [three] 1234567890ABC...

You could make use of the \G anchor and match any char except square brackets and whitespace, or use \w+ instead to match only word characters.
(?:\[|\G(?!^)]\h\[?)\K[^][\s]+
(?: Non capture group
\[ Match [
| Or
\G(?!^) Assert the position at the end of the previous match, but not at the start of the string
]\h\[? Match ], a horizontal whitespace char and an optional [
)\K Close the group and forget what was matched so far, so it is not part of the reported match
[^][\s]+ Match 1+ times any char except square brackets or a whitespace char
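A minimal sketch of the \G and \K pattern with preg_match_all, using the test string from the question. The variable names are my own.

```php
<?php
$pattern = '/(?:\[|\G(?!^)]\h\[?)\K[^][\s]+/';
$subject = '[one] [two] [three] 1234567890ABC...';

// Because of \K, each full match in $matches[0] is only the part after the bracket or space.
preg_match_all($pattern, $subject, $matches);

print_r($matches[0]); // one, two, three, 1234567890ABC...
```

No capture group is needed here, as \K drops the opening bracket and the separator from each match.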

You could try this pattern. It is the same as the one you are using, with an added alternative that also matches the word characters after the brackets:
(?<=\[).+?(?=\])|\w+

Regex pattern for splitting BEM string into parts (PHP)

I would like to isolate the block, element and modifier parts of a string via PHP regex. The flavour of BEM I'm using is lowercase and hyphenated. For example:
this-defines-a-block__this-defines-an-element--this-defines-a-modifier
My string is always formatted as above, so the regex does not need to filter out invalid BEM. For example, I will never have dirty strings such as:
This.defines-a-block__this-Defines-an-ELEMENT--090283
Block, Element and Modifier names could contain numbers, so we could have any combination of the following:
this-is-block-001__this-is-element-001--modifier-002
Finally, a modifier is optional, so not every string will have one. For example:
this-is-a-block-001__this-is-an-element
this-is-a-block-002__this-is-an-element--this-is-an-optional-modifier
I am looking for a regex that returns each section of the BEM markup. Each string will be isolated and sent to the regex individually, not as a group or as multiline strings. The following strings, sent individually:
# String 1
block__element--modifier
# String 2
block-one__element-one--modifier-one
# String 3
block-one-big__element-one-big--modifier-one-big
# String 4
block-one-001__element-one-001
Would return:
# String 1
block
element
modifier
# String 2
block-one
element-one
modifier-one
# String 3
block-one-big
element-one-big
modifier-one-big
# String 4
block-one-001
element-one-001

You could use 3 capturing groups and make the third one optional using ?
As all 3 parts are lowercase, can contain numbers and use the hyphen as a delimiter, you might use the character class [a-z0-9].
You could reuse the pattern of group 1 with the subroutine call (?1)
\b([a-z0-9]+(?:-[a-z0-9]+)*)__((?1))(?:--((?1)))?\b
Explanation
\b Word boundary
( First capturing group
[a-z0-9]+ Repeat 1+ times what is listed in the character class
(?:-[a-z0-9]+)* Repeat 0+ times matching - and 1+ times what is in the character class
) Close group 1
__ Match literally
((?1)) Capturing group 2, reuse the pattern of group 1
(?: Non capturing group
-- Match literally
((?1)) Capture group 3, reuse the pattern of group 1
)? Close non capturing group and make it optional
\b Word boundary
Or using named groups:
\b(?<block>[a-z0-9]+(?:-[a-z0-9]+)*)__(?<element>(?&block))(?:--(?<modifier>(?&block)))?\b
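The named-group version could be wrapped like this in PHP. splitBem() is a hypothetical helper name, and returning null for the missing modifier is my own choice, not something the question asks for.

```php
<?php
// splitBem() is a hypothetical helper; the pattern is the named-group version above.
function splitBem(string $bem): ?array
{
    $pattern = '/\b(?<block>[a-z0-9]+(?:-[a-z0-9]+)*)__(?<element>(?&block))(?:--(?<modifier>(?&block)))?\b/';
    if (preg_match($pattern, $bem, $m) !== 1) {
        return null; // not a block__element[--modifier] string
    }
    return [
        'block'    => $m['block'],
        'element'  => $m['element'],
        // The modifier key is absent from $m when the optional part did not match.
        'modifier' => $m['modifier'] ?? null,
    ];
}

print_r(splitBem('block-one-big__element-one-big--modifier-one-big'));
print_r(splitBem('block-one-001__element-one-001'));
```

For the second string, the modifier entry comes back as null while block and element are filled in.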

Update a regex that matches twitter like mentions to allow for dots

I have already found helpful answers for a regex that matches twitter like username mentions in this answer and this answer
(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)
(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9-_]+)
However, I need to update this regex to also include usernames that have dots.
One or more dots are allowed in a username.
The username must not start or end with a dot.
No two consecutive dots are allowed.
Example of a matched string:
#valid.user.name
^^^^^^^^^^^^^^^^
Examples of non-matched strings:
#.user.name // starts with a dot
#user.name. // ends with a dot
#user..name // has two consecutive dots

You can use this refactored regex:
(?<=[^\w.-]|^)#([A-Za-z]+(?:\.\w+)*)$
RegEx Details:
(?<=[^\w.-]|^): Lookbehind to assert that we have start of line or any non-word, non-dot, non-hyphen character before current position
#: Match a literal #
(: Start capture group
[A-Za-z]+: Match 1+ ASCII letters
(?:\.\w+)*: Match 0 or more instances of a dot followed by 1+ word characters
): End capture group
$: End

The (?<=^|(?<=[^a-zA-Z0-9-_\.])) is a positive lookbehind that requires the match to be at the start of the string or right after a char other than an alphanumeric, -, _ or .; you may write it in a more compact way as (?<![\w.-]), a negative lookbehind.
Next, ([A-Za-z]+[A-Za-z0-9_]+) captures 1+ ASCII letters and then 1+ ASCII letters, digits or underscores. You seem to want to make sure the first char is a letter, after which any number of sequences of . and 1+ word chars are allowed; that is, you may use [A-Za-z]\w*(?:\.\w+)*.
As you do not want a match if there is a . right after the expected match, you need a lookahead that requires whitespace or the end of the string, (?!\S).
So, combining it, you can use
'~(?<![\w.-])#([A-Za-z]\w*(?:\.\w+)*)(?!\S)~'
Details
(?<![\w.-]) - no letters, digits, _, . and - immediately to the left of the current location are allowed
# - a # char
([A-Za-z]\w*(?:\.\w+)*) - Group 1:
[A-Za-z] - an ASCII letter
\w* - 0+ letters, digits, _
(?:\.\w+)* - 0+ sequences of
\. - dot
\w+ - 1+ letters, digits, _
(?!\S) - whitespace or end of string are required immediately to the right of the current location.
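To see the combined pattern at work, here is a short sketch. findMentions() is a hypothetical helper name and the sample sentence is made up.

```php
<?php
// findMentions() is a hypothetical helper; the pattern is the one assembled above.
function findMentions(string $text): array
{
    preg_match_all('~(?<![\w.-])#([A-Za-z]\w*(?:\.\w+)*)(?!\S)~', $text, $m);
    return $m[1]; // group 1: the username without the leading #
}

print_r(findMentions('ping #john.doe and #jane today')); // john.doe, jane
print_r(findMentions('#user.name.'));                    // empty: ends with a dot
print_r(findMentions('#user..name'));                    // empty: two consecutive dots
```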

EDIT: Simpler version (same result)
^#[a-zA-Z](\.?[\w-]+)*$
Original version:
^#[a-zA-Z][a-zA-Z_-]?(\.?[\w\d-]+){0,}$
^# starts with #
[a-zA-Z] first char
[a-zA-Z_-]? match a-zA-Z_- 0 or 1 time
( start group
\.? match . (optional)
[\w\d-]+ match a-zA-Z0-9-_ 1 or more times
) end group
{0,} repeat group 0 to infinite times
$ end
Tests
valid:
#validusername
#valid.user.name
#valid-user-name
#valid_user-name
#valid-user123_name
#a.valid-user123_name
not valid:
#-invalid.user
#_invalid.user
#1notvalid-user_123name33
#.user.name
#user.name.
#user..name
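The two lists above can be checked against the simpler pattern with a few lines of PHP. This is only a sketch; the $failures array is my own bookkeeping and stays empty when every name behaves as listed.

```php
<?php
$pattern = '/^#[a-zA-Z](\.?[\w-]+)*$/';

$valid = [
    '#validusername', '#valid.user.name', '#valid-user-name',
    '#valid_user-name', '#valid-user123_name', '#a.valid-user123_name',
];
$invalid = [
    '#-invalid.user', '#_invalid.user', '#1notvalid-user_123name33',
    '#.user.name', '#user.name.', '#user..name',
];

$failures = [];
foreach ($valid as $name) {
    if (preg_match($pattern, $name) !== 1) {
        $failures[] = $name; // expected a match, got none
    }
}
foreach ($invalid as $name) {
    if (preg_match($pattern, $name) === 1) {
        $failures[] = $name; // expected no match, got one
    }
}

echo $failures ? 'Unexpected: ' . implode(', ', $failures) : 'All names behave as listed', PHP_EOL;
```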
