PHP preg_replace_callback match string but exclude urls - php

What I'm trying to do is find all the matches within a content block, but ignore anything that is inside tags, for use inside preg_replace_callback().
For example:
test
test title
test
In this case, I want the first line to match, and the third line to match, but NOT the url match, nor the title match in between the a tags.
I've got a regex that I feel like is close:
#(?!<.*?)(\btest\b)(?![^<>]*?>)#si
(and this will not match the url part)
But how do I modify the regex to also exclude the "test" between a and /a?

If it's always the same pattern you can use [A-Z] or a combination like [A-Za-z]

I ended up solving it myself. This regex pattern will do what I wanted:
#(?!<a[^>]*?>)(\btest\b)(?![^<]*?<\/a>)#si
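A minimal sketch of how the pattern above might be used with preg_replace_callback(). The sample HTML and the bracket-wrapping callback are assumptions for illustration, since the markup of the original example is not shown.

```php
<?php
// Pattern from the answer above: match "test" unless only non-"<" text
// separates it from a closing </a>.
$pattern = '#(?!<a[^>]*?>)(\btest\b)(?![^<]*?<\/a>)#si';

// Hypothetical input: "test" appears in plain text, in a URL, and in link text.
$html = "test\n<a href=\"http://test.com\">test title</a>\ntest";

// Wrap each accepted match in brackets so the effect is visible.
$result = preg_replace_callback(
    $pattern,
    function (array $m): string {
        return '[' . $m[1] . ']';
    },
    $html
);

echo $result, "\n";
```

Only the first and last "test" are wrapped here. The pattern only looks ahead for a closing </a>, so link text that contains nested tags, or attributes of other elements, would need a different approach.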

Related

PHP regex last occurrence of words

My string is: /var/www/domain.com/public_html/foo/bar/folder/another/..
I want to remove the root folder from this string, to get only public folder, because some servers have multiple websites inside.
My actual regex is: /^(.*?)(www|public_html|public|html)/s
My actual result is: /domain.com/public_html/foo/bar/folder/another/..
But I want to remove everything up to the last occurrence, and get something like this: /foo/bar/folder/another/..
Thanks!
You have to use a greedy quantifier and to check if the alternative is enclosed between slashes using lookarounds:
/^.*(?<![^\/])(?:www|public(?:_html)?|html)(?![^\/])/
About the lookarounds: I use negative lookarounds with a negated character class to check if there is a slash or the limit of the string at the same time. This way you are sure that for instance html is a folder and not the part of another folder name.
I removed the s modifier that is useless. I removed the capture groups too since the goal is to replace all with an empty string.
The ? makes your expression non-greedy which is not actually what you want here. Try:
^(.*)(www|public_html|public|html)
which should keep going until the last match.
Demo: https://regex101.com/r/v5WbB3/1/
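To compare the two answers, here is a small sketch that runs both patterns through preg_replace(). The second path ("htmlstuff") is a made-up example, chosen to show why the first answer adds the slash lookarounds.

```php
<?php
// Answer 1: greedy .* plus lookarounds that require a slash (or the string
// boundary) on both sides of the folder name.
$strict = '/^.*(?<![^\/])(?:www|public(?:_html)?|html)(?![^\/])/';

// Answer 2: greedy .* only, with no check on what surrounds the folder name.
$loose = '/^(.*)(www|public_html|public|html)/';

// Path from the question.
$path = '/var/www/domain.com/public_html/foo/bar/folder/another/..';
$strictResult = preg_replace($strict, '', $path);
$looseResult = preg_replace($loose, '', $path);

// Hypothetical path where "html" is only part of a folder name.
$tricky = '/var/www/site/htmlstuff/x';
$strictTricky = preg_replace($strict, '', $tricky);
$looseTricky = preg_replace($loose, '', $tricky);

echo $strictResult, "\n", $looseResult, "\n", $strictTricky, "\n", $looseTricky, "\n";
```

Both patterns give /foo/bar/folder/another/.. for the path in the question. On the made-up path, the pattern without lookarounds cuts in the middle of "htmlstuff", while the lookaround version falls back to the www folder.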

Regex Including the next occurence of word

The regex works, but the problem is that it also includes the next occurrence instead of ending at the first occurrence and then starting again from the
Regex : (?=<appView)\s{0,1}(.*)(?<=<\/appView>)
String: <appView></appView> <appView></appView>
But my problem is that it matches the whole string, like
(Match 1)<appView></appView> <appView></appView>
I want it to match each group separately, but I can't make it work.
Desired output : (Match 1) <appView></appView> (Match 2)<appView></appView>
\s{0,1} is the same as \s?. You need to use (.*?), which is lazy, instead of (.*), which is greedy.
Use this pattern: ~(?=<appView)\s?(.*?)(?<=</appView>)~
*note, you don't have to escape / in the closing tag if you use something other than a slash as your pattern delimiter. I am using ~ at the beginning and end of my pattern to avoid escaping.
I strongly recommend switching from regex to an actual sequential XML parser. Regex is awful for parsing XML-based files, for example because of the problems below.
That said, you can "fix" your regex by using ([^<>]*). This will match all characters without < or >, which will make sure that no other tags are nested inside. If done with all tags, you cannot match something like <appview><unclosedTag></appView>, because it is invalid. If you can be certain that the structure is correct, this is slightly less of an issue.
Another problem your approach has is that if you have nested tags like so: <appView> something <appView> something else </appView> else </appView>, your approach will make you end up with [replaced] else </appView>.
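The difference between the greedy and lazy quantifier can be checked with preg_match_all(). This is only a sketch using the string from the question; the variable names are made up.

```php
<?php
$input = '<appView></appView> <appView></appView>';

// Greedy (.*) runs to the last </appView>, so everything is one match.
$greedyCount = preg_match_all('~(?=<appView)\s?(.*)(?<=</appView>)~', $input, $greedy);

// Lazy (.*?) stops at the first </appView>, giving one match per element.
$lazyCount = preg_match_all('~(?=<appView)\s?(.*?)(?<=</appView>)~', $input, $lazy);

echo $greedyCount, ' ', $lazyCount, "\n";
print_r($lazy[0]);
```

As noted above, neither version copes with nested <appView> elements, which is where a real XML parser becomes the safer choice.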

How do I extract one group from a URL using regex for use in a redirect?

I've read the Best RegEx Trick Ever and tried to wrap my head around the other answers here on Stack Exchange and just can't seem to get it right. Take these three strings:
http://www.test.com/newyork/class-schedule
http://www.test.com/location/newyork/class-schedule
http://www.test.com/location/newyork/training
I need a regex that will extract the newyork from the first string and save it for a replace later, but will NOT match any part of the other strings. Also, for obscure reasons, I can not include http://www.test.com as a condition for matching (so I can't use anything before the slash that precedes newyork). Note that in this scenario, newyork could easily be chicago, atlanta, or any other city name with no spaces or punctuation.
The only thing I've been able to figure out that isolates only newyork in the first string is the following:
/.*\.com\/(.[^\/]*)\/class-schedule/g
However, this relies on using the URL first which I can't use.
Any ideas on how to achieve this WITHOUT using the URL?
[EDIT]
To clarify what I'm looking for, I'm trying to take the results from the first string and add "location" to it, still using regex. So:
http://www.test.com/newyork/class-schedule
would become
http://www.test.com/location/newyork/class-schedule
using something like
http://www.test.com/location/$1/class-schedule
Try this: ~/(\w+)/[-a-z]+?/?(?:\?.*?)*(:?\s|$)~gm
See it working here: https://regex101.com/r/4VMazZ/3.
So it will use the end of URL instead of the beginning and match only the word between slash 2 and 3 from the end. There can be a query string it will still work.
[EDIT 1]
I swapped two characters by mistake at the end, so it was capturing one extra group. The corrected pattern is /(\w+)/[-a-z]+?/?(?:\?.*?)*(?:\s|$). See it here: https://regex101.com/r/4VMazZ/4
If you use preg_match($pattern, $string, $matches); the result you want (newyork) will be in $matches[1];, $matches[0] contains everything.
You can see the captures in 'MATCH INFORMATION' panel on regex101 in my example!
[EDIT 2] after your comment.
If you want to replace the whole url you have to match the whole URL, something like this: .*?/(\w+)/[-a-z]+?/?(?:\?.*?)*(?:\s|$) will do in this example. See it working here: https://regex101.com/r/4VMazZ/5
[EDIT 3] Add capturing of last part for replacement.
So as you want to reuse last part you need to add capturing parenthesis: .*?/(\w+)/([-a-z]+?)/?(?:\?.*?)*(?:\s|$).
See it working here: https://regex101.com/r/4VMazZ/6
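For the rewrite described in the question's edit, one possible sketch is shown below. Note that this pattern is not one of those given in the answers: it is an assumption that swaps in a negative lookbehind to skip URLs that already contain /location, and it anchors on the literal /class-schedule instead of the host name.

```php
<?php
// Not one of the patterns from the answers: the negative lookbehind rejects a
// city segment that directly follows "/location", and the literal
// "/class-schedule" rejects other endings. No host name is needed.
$pattern = '~(?<!/location)/([^/\s]+)/class-schedule~';

// The three URLs from the question.
$urls = [
    'http://www.test.com/newyork/class-schedule',
    'http://www.test.com/location/newyork/class-schedule',
    'http://www.test.com/location/newyork/training',
];

$rewritten = [];
foreach ($urls as $url) {
    // $1 is the captured city; URLs that do not match come back unchanged.
    $rewritten[] = preg_replace($pattern, '/location/$1/class-schedule', $url);
}

print_r($rewritten);
```

Only the first URL is changed; the other two are returned as they were.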
Could this work? See it here.
(?<=location\/|\.\w{3}\/|\.\w{2}\/)(?!location).*?(?=\/|$)
It matches everything following .xxx/ or .xx/ or location/. I don't know whether one-letter domains exist; if they do, you can add |\.\w\/ to the lookbehind at the start of the regex.
(?<=location\/|\.\w{3}\/|\.\w{2}\/) is a lookbehind, so it matches the following pattern only if it is preceded by location/ or .xxx/ or .xx/
.*? matches every character (lazy)
(?=\/|$) end match if next character is / or on line end
Note: If location is counted as part of the url, I don't think what you are asking is possible in regex, as the city name could be anywhere in string. If so, then you could have a list of cities and check what part of the url matches one of them.
EDIT: You need the multiline m flag so $ also matches end of line

Regex ignore URL already in HTML tags

I'm having a little problem with my Regex
I've made a custom BBcode for my website, however I also want URLs to be parsed too.
I'm using preg_replace and this is the pattern used to identify URLS:
/([\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])/is
Which works great, however if a URL is within a [img][/img] block, the above pattern also picks it up and produces a result like this:
//[img]http://url.com/toimg.jeg[/img] will produce this result:
<img src="<a href="http://url.com/toimg.jeg" target="_blank">/>
//When it should produce:
<img src="http://url.com/toimg.jeg"/>
I tried using this:
/([^"][\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/][^"])/is
With no luck.
Any help will be appreciated.
Edit:
For solution See the 2nd comment on stema's answer.
Try this
(?<!href=")(\b[\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])
See it here on Regexr
To make it more general you can simplify your lookbehind to check only for ="
(?<!=")(\b[\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])
See it on Regexr
(?<!href=") is a negative lookbehind assertion; it ensures that there is no href=" before your pattern.
\b is a word boundary that anchors the start of your link to a change from a non-word to a word character. Without it the lookbehind would be useless, because the pattern would simply match from "ttp://..." on.
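A short sketch of the lookbehind approach with preg_replace(). The input text is made up, and the hyphen in the character class is escaped here, which is an adjustment for this sketch rather than part of the answer.

```php
<?php
// Lookbehind version from the answer, with the hyphen in the character class
// escaped so it cannot be read as a range (an adjustment made for this sketch).
$pattern = '/(?<!=")(\b\w+:\/\/[\w\-?&;#~=.\/]+[\w\/])/i';

// Hypothetical text after the [img] tag has already become an <img> element.
$text = 'See http://example.com/page and <img src="http://url.com/toimg.jeg"/>';

// Only the bare URL is wrapped; the one preceded by =" is left alone.
$result = preg_replace($pattern, '<a href="$1" target="_blank">$1</a>', $text);

echo $result, "\n";
```

This assumes the BBcode replacement runs before the URL replacement, so that image URLs are already inside an attribute when this pattern is applied.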

PHP Regex of Anchor with Class to get Inner Text

<a href="/search?hl=en&pwst=1&sa=X&ei=RCPqTqkHycryA_bK_f0J&ved=0CCUQvwUoAQ&q=psychology&spell=1" class=spell><b><i>psychology</i></b></a>
Hi, I'm looking to create a regex which matches this anchor and returns the inner text of it.
This is what I've been trying as a regex but without success.
'/<a[^>]+class=\"spell\"[^>]*>(.*?)<\/a>/isU'
It's probably something really silly. Thanks.
Problem was missing quotes surrounding the class. Not proper html markup but I neglected to notice so I just changed my regex to have quotes as optional.
Final regex:
'/<a[^>]+class=\"?spell\"?[^>]*>(.*?)<\/a>/is'
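As a quick check, the sketch below runs both the original attempt and the final regex against the anchor from the question. The use of strip_tags() to drop the <b> and <i> wrappers is an addition for illustration, not part of the original answer.

```php
<?php
// Anchor from the question; note that class=spell has no quotes.
$subject = '<a href="/search?hl=en&pwst=1&sa=X&ei=RCPqTqkHycryA_bK_f0J&ved=0CCUQvwUoAQ&q=psychology&spell=1" class=spell><b><i>psychology</i></b></a>';

// Original attempt: quotes are required, so it does not match this markup.
$strictFound = preg_match('/<a[^>]+class=\"spell\"[^>]*>(.*?)<\/a>/isU', $subject);

// Final regex: quotes are optional.
$found = preg_match('/<a[^>]+class=\"?spell\"?[^>]*>(.*?)<\/a>/is', $subject, $m);

$inner = $found ? $m[1] : null;             // inner HTML of the anchor
$text = $found ? strip_tags($m[1]) : null;  // plain text without <b>/<i>

var_dump($strictFound, $inner, $text);
```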
The regex looks OK, although you don't need to escape the quotes. Perhaps PHP doesn't like it if you use unnecessary escapes, although I doubt it. The problem is more likely the way you're using the regex. Did you access group number 1?
if (preg_match('%<a[^>]+class="spell"[^>]*>(.*?)</a>%', $subject, $regs)) {
    $result = $regs[1];
}
Your problem might be the combination of (.*?) and the /isU modifiers. The U modifier inverts the meaning of ?, so your lazy group (.*?) actually becomes greedy. It will then match past the first <\/a> end marker and run up to the last one it can find.
If you remove the /U it works as expected. With your given input text, at least.
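The effect of U can be seen with a small sketch. The two-anchor input is made up, since the over-match only shows when more than one closing </a> is present.

```php
<?php
// Hypothetical input with two matching anchors.
$html = '<a class="spell">one</a> <a class="spell">two</a>';

// With U, the trailing ? flips (.*?) back to greedy.
preg_match('/<a[^>]+class="spell"[^>]*>(.*?)<\/a>/isU', $html, $withU);

// Without U, (.*?) is lazy and stops at the first </a>.
preg_match('/<a[^>]+class="spell"[^>]*>(.*?)<\/a>/is', $html, $withoutU);

var_dump($withU[1], $withoutU[1]);
```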
Here are two options to fix your expression:
For starters, you can simplify your expression to:
class=\"spell\"[^>]*>(.*?)<\/a>
This captures
<b><i>psychology</i></b>
in Group 1. I assume this is what you want to achieve.
Then, if you want to capture "psychology" without the bold and italic tags, you can use:
class=\"spell\"[^>]*>\s*<(\w+)>?\s*<(\w+)>?\s*(.*?)<\/\2>\s*<\/\1>\s*<\/a>
This captures "psychology" in group 3.
In group 1, you will find the first optional tag, whether it be "b", "strong" or nothing.
In group 2, you will find the second optional tag, which was "i" in your example.
The multiple instances of \s* allow for optional space between the tags.
Is this what you were looking for?
