regex for matching url pattern not working - php

I am having a hard time creating a regular expression that should match all URLs built on a particular pattern, such as http://subdomain.example.com/reply/aoo/spo/4429163785
Here is the regex I have written so far:
"/(^https?)(\:\/\/)([a-z]+)(\.craigslist\.org/reply)\/([a-z]{3}\/spo)\/([0-9]{10}$)/"
Please help me improve my regex.

"/(^https?)(\:\/\/)([a-z]+)(\.craigslist\.org\/reply)\/([a-z]{3}\/spo)\/([0-9]{10}$)/"
You forgot to escape the / after the .org part, right before reply.

Change the delimiter to something else so you don’t need to escape every forward slash. Also, don’t capture those parts of the string where no variations are allowed (e.g. .craigslist.org/reply/):
"~^(https?)://([a-z]+)\\.craigslist\\.org/reply/([a-z]{3})/spo/([0-9]{10})$~"
Explanation:
~ The opening delimiter
^ Match the beginning of the string
(https?) Match either http or https – capture it
:// Match a colon followed by two forward slashes
([a-z]+) Match one or more lowercase letters – capture it
\\. Match a period
craigslist Match the characters exactly as given
\\. Match a period
org/reply/ Match the characters exactly as given
([a-z]{3}) Match three lowercase letters – capture it
/spo/ Match the characters exactly as given
([0-9]{10}) Match ten digits – capture it
$ Match the ending of the string
~ The closing delimiter
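A minimal sketch of how the pattern above could be used with preg_match. The function name parseReplyUrl, the array keys and the sample URLs are made up for illustration. The pattern is written in single quotes here, so each backslash only needs to be written once.

```php
<?php
// Hypothetical helper: returns the captured parts of a reply URL, or null if it does not match.
function parseReplyUrl(string $url): ?array
{
    // Same pattern as above, in a single-quoted string (no double escaping needed).
    $pattern = '~^(https?)://([a-z]+)\.craigslist\.org/reply/([a-z]{3})/spo/([0-9]{10})$~';

    if (preg_match($pattern, $url, $m) !== 1) {
        return null;
    }

    // $m[0] is the whole match; $m[1]..$m[4] are the four capture groups.
    return ['scheme' => $m[1], 'subdomain' => $m[2], 'area' => $m[3], 'id' => $m[4]];
}

// Sample URLs invented for this example.
print_r(parseReplyUrl('https://newyork.craigslist.org/reply/aoo/spo/4429163785'));
var_dump(parseReplyUrl('https://newyork.craigslist.org/reply/aoo/spo/442916378')); // NULL (only 9 digits)
```

Because only the variable parts are captured, the result array contains just the scheme, subdomain, three-letter code and ten-digit id.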

Related

Special Expression not allowed - Regular Expression in PHP

I am trying to write a pattern that rejects image names ending in dimensions, for example the 150x150 in the image name below:
test-string-150x150.png
I am using the following pattern to match this String:
/^([^0-9x0-9]+)\..+/
It works fine, except in a case like this:
teststring.com-150x150.jpg
What I need: the pattern must disallow dimensions only at the end of the string. Here are some examples:
test-string-150x150.png > must disallow
any-string.png > allow
200x200-test.png > allow
1x1.png-100x100.jpg > disallow
You could use a negative lookahead to assert that the string does not contain the sizes followed by a dot and 1+ word characters till the end of the string.
^(?!.*\d+x\d+\.\w+$).+$
Explanation
^ Start of string
(?! Negative lookahead, assert what is on the right is not
.* Match 0+ occurrences of any char except a newline
\d+x\d+ Match the sizes format, where \d+ means 1 or more digits
\.\w+$ Match a dot, 1+ word characters and assert the end of the string $
) Close lookahead
.+ Match 1+ occurrences of any char except a newline
$ End of string
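As a quick check, the lookahead pattern can be run against the four example names from the question. This is only a sketch; the variable names are made up for illustration.

```php
<?php
// Pattern from the answer above: reject names ending in <digits>x<digits>.<extension>.
$pattern = '/^(?!.*\d+x\d+\.\w+$).+$/';

// The four example file names from the question.
$names = [
    'test-string-150x150.png',
    'any-string.png',
    '200x200-test.png',
    '1x1.png-100x100.jpg',
];

$allowed = [];
foreach ($names as $name) {
    // preg_match returns 1 when the name passes the negative lookahead.
    $allowed[$name] = preg_match($pattern, $name) === 1;
    echo $name, ' => ', $allowed[$name] ? 'allow' : 'disallow', "\n";
}
```

Only any-string.png and 200x200-test.png are allowed, because in those names the dimensions are either absent or not directly followed by the extension.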
If I understand your question, you're trying to find image names that do not include the image dimensions. If so, try this:
/^(?![\w-\.]+(\d+x\d+))[\w-\.]+\.\w+$/gm
For details about this code, please see regexr.com/4tmd1. This site is a great place to play around with regexes to make sure you're getting the results you expect.
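Be aware that the exact syntax of the regular expression depends on the regex engine used by whatever program you're running. In PHP in particular, g is not a valid modifier for the preg_* functions (use preg_match_all to find every match), and newer PCRE versions (as bundled with PHP 7.3 and later) may reject the hyphen in [\w-\.] as an invalid range, so write the class as [\w.-] instead. Note also that this pattern rejects dimensions anywhere after the first character of the name, so 200x200-test.png is rejected as well.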

preg_match wildcard, require at least one character

The preg_match below matches 'empty' (0 characters) against the wildcard. I want to disable that:
preg_match('/site.com\/subsection\/.*?/', $page_url);
So the thing above should match site.com/subsection/subpage, but shouldn't match the root dir site.com/subsection/
How can I adjust the regex above? Thanks in advance!
The .*? at the end of the pattern can match an empty string. You need to make it match one or more characters using .+:
'/site\.com\/subsection\/.+/'
                          ^
Now, it requires at least 1 char after site.com/subsection/.
Note the dot must be escaped to match a literal dot.
Also, it might be a good idea to use regex delimiters other than / (as OcuS suggests in the comments below) if you have many slashes in the pattern itself. I usually use tildes:
'~site\.com/subsection/.+~'
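A small sketch of the tilde-delimited version in use. The helper name hasSubpage is hypothetical; the two URLs are the ones from the question.

```php
<?php
// Hypothetical helper: true only if something follows "site.com/subsection/".
function hasSubpage(string $pageUrl): bool
{
    // .+ requires at least one character after the trailing slash.
    return preg_match('~site\.com/subsection/.+~', $pageUrl) === 1;
}

var_dump(hasSubpage('site.com/subsection/subpage')); // bool(true)
var_dump(hasSubpage('site.com/subsection/'));        // bool(false)
```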

Remove nested bbcode style tags and anything inside them

I need help with a regex to remove something. I can't get it to work as I want.
Let's say I have this text:
[quote=test]
[quote=test]for sure[/quote]
Test
[/quote]
[this should not be removed]
Dont remove me
How can I remove everything above [this should not be removed]? Please note that Test can be anything.
So I want to remove anything inside:
[quote=*][/quote]
I've come this far:
preg_replace('#\[quote=(.+)](.+)\[/quote]#Usi', '', $message);
But it keeps: Test [/quote]
Matching nested bbcode-style tags is rather complex and usually calls for a string parser that is not based on regular expressions.
Since you seem to be using PHP, though, you can use the (?R) "recursion" syntax that its regular expressions support, which lets the pattern handle nested bbcode like this.
Note that unbalanced opening [quote=*] and closing [/quote] tags will not be matched.
Regular Expression
\[(quote)=[^]]+\](?>(?R)|.)*?\[/quote]
https://regex101.com/r/xF3oR6/1
Code
$result = preg_replace('%\[(quote)=[^]]+\](?>(?R)|.)*?\[/quote]%si', '', $subject);
Human Readable
# \[(quote)=[^]]+\](?>(?R)|.)*?\[/quote]
#
# Options: Case insensitive; Exact spacing; Dot matches line breaks; ^$ don’t match at line breaks; Greedy quantifiers; Regex syntax only
#
# Match the character “[” literally «\[»
# Match the regex below and capture its match into backreference number 1 «(quote)»
# Match the character string “quote” literally (case insensitive) «quote»
# Match the character “=” literally «=»
# Match any character that is NOT a “]” «[^]]+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Match the character “]” literally «\]»
# Match the regular expression below; do not try further permutations of this group if the overall regex fails (atomic group) «(?>(?R)|.)*?»
# Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match this alternative (attempting the next alternative only if this one fails) «(?R)»
# Match the entire regular expression (recursion; restore capturing groups upon exit; do not try further permutations of the recursion if the overall regex fails) «(?R)»
# Or match this alternative (the entire group fails if this one fails to match) «.»
# Match any single character «.»
# Match the character “[” literally «\[»
# Match the character string “/quote]” literally (case insensitive) «/quote]»
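For reference, here is a sketch that runs the replacement above on the sample text from the question. The line breaks in $subject are an assumption about how the original message was laid out.

```php
<?php
// Sample text from the question, with assumed line breaks.
$subject = "[quote=test]\n"
         . "[quote=test]for sure[/quote]\n"
         . "Test\n"
         . "[/quote]\n"
         . "[this should not be removed]\n"
         . "Dont remove me";

// (?R) recurses into the whole pattern, so the inner quote is consumed as one unit
// before the outer [/quote] is looked for.
$result = preg_replace('%\[(quote)=[^]]+\](?>(?R)|.)*?\[/quote]%si', '', $subject);

// A leading line break is left over from after the outer [/quote]; strip it for display.
echo ltrim($result), "\n";
```

The outer quote and everything inside it is removed, while the bracketed line that is not a quote tag is left alone.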

What does this Regex pattern mean: '/&\w;/'

Can someone explain what this function
preg_replace('/&\w;/', '', $buf)
does? I have looked at various tutorials and found that it replaces the pattern /&\w;/ with string ''. But I can't understand the pattern /&\w;/. What does it represent?
Similarly in
preg_match_all("/(\b[\w+]+\b)/", $buf, $words)
I can't understand what the string "/(\b[\w+]+\b)/" represents.
Please help. Thanks in advance :)
The explanation of your first expression is simple, it is:
& # Match the character “&” literally
\w # Match a single character that is a “word character” (letters, digits, and underscores)
; # Match the character “;” literally
The second one is:
( # Match the regular expression below and capture its match into backreference number 1
\b # Assert position at a word boundary
[\w+] # Match a single character present in the list below
# A word character (letters, digits, and underscores)
# The character “+”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
)
The preg_replace function makes use of regular expressions. Regular expressions allow you to find patterns in text in a really powerful way.
To be able to use functions like preg_replace or preg_match, I recommend first taking a look at how regular expressions work.
You can gather a lot of info on this site http://www.regular-expressions.info/
And you can use software tools to help you understand the regex (like RegexBuddy)
In regular expressions, \w stands for any "word" character. That is: a-z, A-Z, 0-9 and underscore. \b stands for "word boundary", that is the beginning and end of a word (a series of word characters).
So, /&\w;/ is a regular expression that matches the & sign, followed by exactly one word character (there is no quantifier after \w), followed by a ;. For example, &a; would match, and preg_replace will replace it with an empty string, but &foobar; would not match; matching a whole series of word characters would require /&\w+;/.
In the same manner, /(\b[\w+]+\b)/ matches a word boundary, followed by one or more characters that are each a word character or a literal + (inside a character class, + is not a quantifier), followed by another word boundary. The matches are captured using the parentheses. So, this regular expression will simply return the words in a string as an array.
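A short sketch to check both patterns. The input strings are made up for illustration, and /&\w+;/ is shown only for comparison; it is not part of the original code.

```php
<?php
// Made-up input containing a one-letter entity and a longer one.
$buf = 'a &x; b &amp; c';

// \w without a quantifier: only "&x;" is removed.
$single = preg_replace('/&\w;/', '', $buf);
echo $single, "\n"; // a  b &amp; c

// With \w+ (shown for comparison) both entities are removed.
$multi = preg_replace('/&\w+;/', '', $buf);
echo $multi, "\n";  // a  b  c

// Inside [...] the + is a literal plus sign, so "foo+bar" is returned as one word.
preg_match_all('/(\b[\w+]+\b)/', 'foo+bar baz', $words);
print_r($words[1]); // foo+bar, baz
```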

What does this regex mean in PHP?

/(?![a-z]+:)/
Does anyone know?
the / are delimiters.
?! is negative lookahead.
[a-z] is a character class (any character in the a-z range)
+ is one-or-more times of the preceding pattern ([a-z] in this case)
: is just the colon literal
It roughly means "look ahead and make sure there are no alpha characters followed by a colon".
This regex would make more sense if it had a start of string anchor: /^(?![a-z]+:)/, so it wouldn't match abc: (as one of the other answers says), but without the ^ I don't know how useful this is.
According to RegexBuddy (a product I highly recommend):
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?![a-z]+:)»
Match a single character in the range between “a” and “z” «[a-z]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “:” literally «:»
(?!REGEX) is the syntax for negative lookahead. Check the link for an explanation of lookaheads.
The regex fails if the pattern [a-z]+: appears in the string starting at the current position. If the pattern is not found, the regex succeeds but doesn't consume any characters.
Anchored at the start of the string, it would match 123: or abc but not abc:
Without the anchor, it still finds a zero-length match in abc:, at the position just before the :.
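The difference between the anchored and unanchored forms can be checked with a few lines of PHP. This is a sketch only; the variable names are made up for illustration.

```php
<?php
// Anchored version: the lookahead is only tested at the start of the string.
$anchored = '/^(?![a-z]+:)/';

$results = [];
foreach (['123:', 'abc', 'abc:'] as $s) {
    $results[$s] = preg_match($anchored, $s);
    echo $s, ' => ', $results[$s], "\n"; // 1, 1, 0
}

// Unanchored version: the engine tries every position until the lookahead passes.
preg_match('/(?![a-z]+:)/', 'abc:', $m, PREG_OFFSET_CAPTURE);
$matchedText = $m[0][0]; // '' (lookaheads consume nothing)
$offset      = $m[0][1]; // 3  (just before the colon)
echo "offset: $offset\n";
```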
