Regular expression to replace all URLs in a string but skip one - PHP

I have a regular expression that removes all URLs from a string, but I want to change it and add an exception for my own site's link.
$url = 'This is url for example to remove www.somewbsite.com but i want to skip removing this url www.mywebsite.com';
$no_url = preg_replace("/(https|http|ftp)\:\/\/|([a-z0-9A-Z]+\.[a-z0-9A-Z]+\.[a-zA-Z]{2,4})|([a-z0-9A-Z]+\.[a-zA-Z]{2,4})|\?([a-zA-Z0-9]+[\&\=\#a-z]+)/i", "★", $url);

First of all, since you are replacing with a hard-coded symbol, and you are using a case-insensitive modifier, your regex can be reduced to
'~(?:https?|ftp)://|(?:[a-z0-9]+\.)?[a-z0-9]+\.[a-z]{2,4}|\?[a-z0-9]+[&=#a-z]+~i'
(whatever it is meant to match). Note that two of the alternatives were too similar, ([a-z0-9A-Z]+\.[a-z0-9A-Z]+\.[a-zA-Z]{2,4})|([a-z0-9A-Z]+\.[a-zA-Z]{2,4}), so they are merged into one with the help of an optional non-capturing group, (?:[a-z0-9]+\.)?.
Now, if you want to avoid matching a specific pattern, you may use the SKIP-FAIL technique: match what you want to preserve first, then skip it with (*SKIP)(*FAIL) so that the other alternatives cannot match inside it.
'~www\.mywebsite\.com(*SKIP)(*FAIL)|(?:https?|ftp)://|(?:[a-z0-9]+\.)?[a-z0-9]+\.[a-z]{2,4}|\?[a-z0-9]+[&=#a-z]+~i'
See this regex demo.
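As a quick check of the SKIP-FAIL pattern above, here is a minimal sketch that runs it through preg_replace on the sample string from the question. The variable names are illustrative only.

```php
<?php
// Sample string from the question.
$text = 'This is url for example to remove www.somewbsite.com but i want to skip removing this url www.mywebsite.com';

// The whitelisted host is matched first and then discarded with (*SKIP)(*FAIL),
// so the remaining alternatives never get a chance to match inside it.
$pattern = '~www\.mywebsite\.com(*SKIP)(*FAIL)|(?:https?|ftp)://|(?:[a-z0-9]+\.)?[a-z0-9]+\.[a-z]{2,4}|\?[a-z0-9]+[&=#a-z]+~i';

$result = preg_replace($pattern, '★', $text);
echo $result, PHP_EOL;
// This is url for example to remove ★ but i want to skip removing this url www.mywebsite.com
```

To whitelist more hosts, one option is to extend the first alternative, for example (?:www\.mywebsite\.com|other\.example)(*SKIP)(*FAIL), where other.example is a placeholder.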

Related

PHP preg_replace_callback match string but exclude urls

What I'm trying to do is find all the matches within a content block, but ignore anything that is inside tags, for use inside preg_replace_callback().
For example:
test
test title
test
In this case, I want the first line to match, and the third line to match, but NOT the url match, nor the title match in between the a tags.
I've got a regex that I feel like is close:
#(?!<.*?)(\btest\b)(?![^<>]*?>)#si
(and this will not match the url part)
But how do I modify the regex to also exclude the "test" between a and /a?
If it's always the same pattern you can use [A-Z] or a combination like [A-Za-z]
I ended up solving it myself. This regex pattern will do what I wanted:
#(?!<a[^>]*?>)(\btest\b)(?![^<]*?<\/a>)#si
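For illustration, here is a sketch of that pattern inside preg_replace_callback(). The input markup is an assumption, since the question's original HTML is not shown, and the callback simply upper-cases each surviving match.

```php
<?php
// Hypothetical input: the question's original markup was not preserved.
$html = 'test <a href="/test" title="test">test title</a> test';

$pattern = '#(?!<a[^>]*?>)(\btest\b)(?![^<]*?<\/a>)#si';

// Upper-case every "test" that is not followed by "...</a>" without an
// intervening "<", i.e. every match outside the link.
$result = preg_replace_callback($pattern, function (array $m): string {
    return strtoupper($m[1]);
}, $html);

echo $result, PHP_EOL;
// TEST <a href="/test" title="test">test title</a> TEST
```

The trailing lookahead relies on there being no other tag between the match and the closing </a>, so link text that contains nested tags would likely need a different approach, such as a DOM parser.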

Regex - Match characters but don't include within results

I have got the following Regex, which ALMOST works...
(?:^https?:\/\/)(?:www|[a-z]+)\.([^.]+)
I need the result to be the only result, or within the same position in the Array.
So, for example, http://m.facebook.com/ matches perfectly; there is only 1 group.
However, if I change it to http://facebook.com/ then I get com/ in place of where facebook should be. So I need to have (?:www|[a-z]+) as an optional check really.
Edit:
What I expect is just to match facebook, if ANY of the strings are as follows:
http://www.facebook.com
http://facebook.com
http://m.facebook.com
And obviously the https counterparts.
This is my Regex now
(?:^https?:\/\/)(?:www)?\.?([^.]+)
This is close; however, it matches the m when I try http://m.facebook.com
https://regex101.com/r/GDapY5/1
So I need to have (?:www|[a-z]+) as an optional check really.
A ? at the end of a pattern is generally used for "optional" bits -- it means "match zero or one" of that thing, so your subpattern would be something like this:
(?:www|[a-z]+)?
If you're simply trying to get the second level domain, I wouldn't bother with regex, because you'll be constantly adjusting it to handle special cases you come across. Just split on dots and take the penultimate value:
$domain = array_reverse(explode('.', parse_url($str)['host']))[1];
Or:
$domain = array_reverse(explode('.', parse_url($str, PHP_URL_HOST)))[1];
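A slightly more defensive sketch of the same split-on-dots idea follows. The function name secondLevelDomain is hypothetical, and the null checks are an addition for URLs without a host.

```php
<?php
// Hypothetical helper name; the core one-liner comes from the answer above.
function secondLevelDomain(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // no host component, or a malformed URL
    }
    $labels = array_reverse(explode('.', $host));
    return $labels[1] ?? null; // penultimate label, if there is one
}

foreach (['http://www.facebook.com', 'http://facebook.com', 'https://m.facebook.com/'] as $url) {
    echo $url, ' => ', secondLevelDomain($url), PHP_EOL;
}
```

Note that taking the penultimate label assumes a single-label TLD; for a host such as example.co.uk it would return co.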
Perhaps you could make the first m. part optional with (?:\w+\.)?.
Instead of a capturing group you could use \K to reset the starting point of the reported match.
Then match one or more word characters \w+ and use a positive lookahead to assert that what follows is a dot (?=\.)
For example:
^https?://(?:www)?(?:\w+\.)?\K\w+(?=\.)
Edit: Or you could match for m. or www. using an alternation:
^https?://(?:m\.|www\.)?\K\w+(?=\.)
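A small sketch of how the \K variant behaves with preg_match on the URLs from the question. The array and variable names are illustrative.

```php
<?php
$pattern = '~^https?://(?:m\.|www\.)?\K\w+(?=\.)~';

$urls = [
    'http://www.facebook.com',
    'http://facebook.com',
    'http://m.facebook.com',
    'https://www.facebook.com',
];

$names = [];
foreach ($urls as $url) {
    // Because of \K, $m[0] holds only the text matched after the reset point.
    $names[$url] = preg_match($pattern, $url, $m) ? $m[0] : null;
}

print_r($names); // every value is "facebook"
```

Since there is no capturing group, the result always sits at index 0 of the match array, which is the "same position" the question asks for.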

PHP regex last occurrence of words

My string is: /var/www/domain.com/public_html/foo/bar/folder/another/..
I want to remove the root folder from this string, to get only public folder, because some servers have multiple websites inside.
My current regex is: /^(.*?)(www|public_html|public|html)/s
My current result is: /domain.com/public_html/foo/bar/folder/another/..
But I want to remove everything up to the last occurrence, and get something like this: /foo/bar/folder/another/..
Thanks!
You have to use a greedy quantifier and to check if the alternative is enclosed between slashes using lookarounds:
/^.*(?<![^\/])(?:www|public(?:_html)?|html)(?![^\/])/
About the lookarounds: I use negative lookarounds with a negated character class to check if there is a slash or the limit of the string at the same time. This way you are sure that for instance html is a folder and not the part of another folder name.
I removed the s modifier that is useless. I removed the capture groups too since the goal is to replace all with an empty string.
The ? makes your expression non-greedy which is not actually what you want here. Try:
^(.*)(www|public_html|public|html)
which should keep going until the last match.
Demo: https://regex101.com/r/v5WbB3/1/
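To see the greedy version with lookarounds in action, here is a minimal sketch using preg_replace on the path from the question. It uses ~ as the delimiter so the slashes need no escaping; the variable names are illustrative.

```php
<?php
$path = '/var/www/domain.com/public_html/foo/bar/folder/another/..';

// Greedy .* runs to the end of the string and backtracks, so the LAST folder
// name that is delimited by slashes (or a string boundary) ends the match.
$pattern = '~^.*(?<![^/])(?:www|public(?:_html)?|html)(?![^/])~';

$relative = preg_replace($pattern, '', $path);
echo $relative, PHP_EOL; // /foo/bar/folder/another/..
```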

Including a literal string in the regex

Current URLs:
http://domain.com/index.php?route=common/home
http://domain.com/index.php?route=account/register
http://domain.com/index.php?route=checkout/cart
http://domain.com/index.php?route=checkout/checkout
Desired URLs:
http://domain.com/home
http://domain.com/register
http://domain.com/cart
http://domain.com/checkout
Regex:
(?=\=)(.*?)(?<=\/).+$
... almost works, but it matches (for example the last URL) =checkout/ whereas I need it to match index.php?route= as well so I can remove the whole index.php?route=checkout/ from the URL.
I tried index.php?route=(?=\=)(.*?)(?<=\/).+$ but of course it doesn't work.
To remove the desired part, you should substitute:
index\.php\?route\=[^\/]*\/
with an empty string.
To be more strict and precise, you should use lookbehind:
(?<=http:\/\/domain\.com\/)index\.php\?route\=[^\/]*\/
Check the regex here: https://regex101.com/r/sY7aV6/1
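As a sketch of the substitution in PHP (assuming the rewrite is done with preg_replace, which the question does not state), the first pattern can be applied to all four URLs at once, here with the dot in index.php escaped and ~ as the delimiter.

```php
<?php
$urls = [
    'http://domain.com/index.php?route=common/home',
    'http://domain.com/index.php?route=account/register',
    'http://domain.com/index.php?route=checkout/cart',
    'http://domain.com/index.php?route=checkout/checkout',
];

// Matches "index.php?route=" plus the first path segment and its slash.
$pattern = '~index\.php\?route=[^/]*/~';

// preg_replace accepts an array subject and returns an array of results.
$clean = preg_replace($pattern, '', $urls);
print_r($clean);
// http://domain.com/home, http://domain.com/register,
// http://domain.com/cart, http://domain.com/checkout
```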

How can I replace only the captured elements of a regex?

I'm trying to extract only certain elements of a string using regular expressions and I want to end up with only the captured groups.
For example, I'd like to run something like (is|a) on a string like "This is a test" and be able to return only "is is a". The only way I can partially do it now is if I find the entire beginning and end of the string but don't capture it:
.*?(is|a).*? replaced with $1
However, when I do this, only the characters preceding the final found/captured group are eliminated--everything after the last found group remains.
is is a test.
How can I isolate and replace only the captured strings (so that I end up with "is is a"), in both PHP and Perl?
Thanks!
Edit:
I see now that it's better to use m// rather than s///, but how can I apply that to PHP's preg_match? In my real regex I have several captured groups, resulting in $1, $2, $3 etc -- preg_match only deals with one captured group, right?
If all you want are the matches, then there is no need for the s/// operator. You should use m//. You might want to expand on your explanation a little if the example below does not meet your needs:
#!/usr/bin/perl
use strict;
use warnings;
my $text = 'This is a test';
my @matches = ( $text =~ /(is|a)/g );
print "@matches\n";
__END__
C:\Temp> t.pl
is is a
EDIT: For PHP, you should use preg_match_all and specify an array to hold the match results as shown in the documentation.
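A minimal PHP sketch of that suggestion, using the example string from the question:

```php
<?php
$text = 'This is a test';

// With the default PREG_PATTERN_ORDER flag, $matches[0] holds the full
// matches and $matches[1] holds every capture of the first group.
$count = preg_match_all('/(is|a)/', $text, $matches);

echo implode(' ', $matches[1]), PHP_EOL; // is is a
```

With several capturing groups in the pattern, $matches[2], $matches[3] and so on hold the captures of the later groups in the same way.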
You can't replace only captures. s/// always replaces everything included in the match. You need to either capture the additional items and include them in the replacement or use assertions to require things that aren't included in the match.
That said, I don't think that's what you're really asking. Is Sinan's answer what you're after?
You put everything into captures and then replace only the ones you want.
(.*?)(is|a)(.*?)
