REGEX remove any kind of domain from a URL - php

I need to remove the domain, whatever its format. I receive all kinds of URL formats, but I need to make sure I get only the URL content without the domain.
I receive URLS in any of these formats:
https://www.asdfasd.com/asdfasd/fa/sd/fa
http://www.asdfasd.com/asdfasd/fa/sd/fa
www.asdfasd.com/asdfasd/fa/sd/fa
asdfasd.com/asdfasd/fa/sd/fa
/asdfasd/fa/sd/fa
asdfasd/fa/sd/fa
The result for any of these scenarios should always be:
asdfasd/fa/sd/fa
I have tried to write it myself but I am not getting 100% results:
^[^#]*?:(\/\/).*?(?=\/)(\/)

Use the following:
^(?:\S+\.\S+?\/|\/)?(\S+)$
The capturing group contains the required text. The pattern uses alternation to handle the presence or absence of a domain, then captures the rest.
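A minimal PHP sketch applying this pattern to the question's sample inputs (the stripDomain wrapper name is mine, not part of the answer):

```php
<?php
// Strip any leading domain (with or without scheme) or leading slash,
// keeping only the path content, using the answer's pattern.
function stripDomain(string $url): string
{
    if (preg_match('~^(?:\S+\.\S+?/|/)?(\S+)$~', $url, $m)) {
        return $m[1];
    }
    return $url;
}

$inputs = [
    'https://www.asdfasd.com/asdfasd/fa/sd/fa',
    'http://www.asdfasd.com/asdfasd/fa/sd/fa',
    'www.asdfasd.com/asdfasd/fa/sd/fa',
    'asdfasd.com/asdfasd/fa/sd/fa',
    '/asdfasd/fa/sd/fa',
    'asdfasd/fa/sd/fa',
];
foreach ($inputs as $url) {
    echo stripDomain($url), "\n"; // each line prints: asdfasd/fa/sd/fa
}
```

Note the `~` delimiters avoid having to escape the forward slashes inside the pattern.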

Regular Expression to match markdown and regular href sources from specific domain(s)

I'm trying to append something to the query string to various assets hosted on filepicker which serves from a few specific domains, some of which already contain a query string. All other URLs should be left untouched.
For example, we might have the following in markdown:
![image](https://www.filepicker.io/api/file/12x3DD5667dxfjdf/convert?w=600)
We could also have
<img src="https://www.filepicker.io/api/file/someotherfile" />
Or
<a href='https://www.filestack.com/api/file/anothersdf?sdf=3&dfdf=1'>link</a>
Just trying to match against one domain for the moment, I have the following regular expression, which isn't matching all cases.
I do not want to match any references to other domains.
I've had mixed success with the following:
/(https:\/\/www\.filepicker.io\/api\/file\/[a-zA-Z0-9]+(\/convert)*[^)])$/is
This will match what you want:
/(https?:\/\/www\.(?:filepicker\.io\/|filestack\.com\/)api\/file\/[\w+?&=\/]+)/
https://regex101.com/r/DA8Gok/2/
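As a sketch of the original goal of appending something to the query string, the answer's pattern can be used with preg_replace_callback; the signature parameter name and value here are made up for illustration:

```php
<?php
// Append a query parameter to matched filepicker/filestack asset URLs,
// choosing '?' or '&' depending on whether a query string already exists.
$pattern = '/(https?:\/\/www\.(?:filepicker\.io\/|filestack\.com\/)api\/file\/[\w+?&=\/]+)/';

$markdown = '![image](https://www.filepicker.io/api/file/12x3DD5667dxfjdf/convert?w=600)';

$result = preg_replace_callback($pattern, function ($m) {
    $url = $m[1];
    $sep = strpos($url, '?') === false ? '?' : '&';
    return $url . $sep . 'signature=abc123'; // hypothetical parameter
}, $markdown);

echo $result, "\n";
// ![image](https://www.filepicker.io/api/file/12x3DD5667dxfjdf/convert?w=600&signature=abc123)
```

URLs on other domains are left untouched because the host part of the pattern is literal.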
But regex by itself is not well suited to parsing structured data; consider writing a lexer/parser instead.
Here are some examples of ones I have written:
https://github.com/ArtisticPhoenix/MISC/tree/master/Lexers
http://artisticphoenix.com/2018/11/11/output-converter/

RegExp to match a segment of a URL

I'm trying to use RegExp to match a segment of a URL.
The URL in question is this:
http://www.example.com/news/region/north-america/
As I need this regex for the WordPress URL Rewrite API, the subject will only be the path section of the URL:
news/region/north-america
In the above example I need to be able to extract the north-america portion of the path, however when pagination is used the path becomes something like this:
news/region/north-america/page/2
Where I still only need to extract the north-america portion.
The RegExp I've come up with is as follows:
^news/region/(.*?)/(.*?)?/?(.*?)?$
However, this does not match news/region/north-america, only news/region/north-america/page/2.
From what I can tell I need to make the trailing slash after north-america optional, but adding /? doesn't seem to work.
Try this:
preg_match('/news\/region\/(.*?)\//',"http://www.example.com/news/region/north-america/page/2",$matches);
$matches[1] will give you the output, "north-america".
You should match using this regex:
^news/region/([^/]+)
This will give you news/region/north-america even when URI becomes /news/region/north-america/page/2
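A quick sketch verifying that pattern against both path forms from the question:

```php
<?php
// [^/]+ stops at the first slash, so the pagination suffix never leaks
// into the captured segment.
foreach (['news/region/north-america', 'news/region/north-america/page/2'] as $path) {
    if (preg_match('~^news/region/([^/]+)~', $path, $m)) {
        echo $m[1], "\n"; // north-america
    }
}
```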
georg's suggested rule works like a charm:
^news/region/(.*?)(?:/(.*?)/(.*?))?$
For those interested in the application of this regex, I used it in the WP Rewrite API to grab the custom taxonomy and page number (if present) and assign the relevant matches to the WP rewrite:
$newRules['news/region/(.*?)(?:/(.*?)/(.*?))?$']='index.php?region=$matches[1]&forcetemplate=news&paged=$matches[3]';

Validate URL with or without protocol

Hi, I would like to validate the following URLs, so they would all pass with or without the http/www part, as long as a TLD is present, like .com, .net, .org, etc.
Valid URLs Should Be:
http://www.domain.com
http://domain.com
https://www.domain.com
https://domain.com
www.domain.com
domain.com
To support long tlds:
http://www.domain.com.uk
http://domain.com.uk
https://www.domain.com.uk
https://domain.com.uk
www.domain.com.uk
domain.com.uk
To support dashes (-):
http://www.domain-here.com
http://domain-here.com
https://www.domain-here.com
https://domain-here.com
www.domain-here.com
domain-here.com
Also to support numbers in domains:
http://www.domain1-test-here.com
http://domain1-test-here.com
https://www.domain1-test-here.com
https://domain1-test-here.com
www.domain1-test-here.com
domain1-test-here.com
Also maybe allow even IPs:
127.127.127.127
(but this is extra!)
Also allow dashes (-), forgot to mention that =)
I've found many functions that validate one or the other but not both at the same time.
If anyone knows a good regex for it, please share. Thank you for your help.
The above answer is right, but it does not work on all domains, like .me, .it, .in, so please use the pattern below for URL matching:
$pattern = '/(?:https?:\/\/)?(?:[a-zA-Z0-9.-]+?\.[a-zA-Z]{2,}|\d+\.\d+\.\d+\.\d+)/';
if (preg_match($pattern, "http://website.in")) {
    echo "valid";
} else {
    echo "invalid";
}
When you ignore the path part and look for the domain part only, a simple rule would be
(?:https?://)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+)
If you want to support country TLDs as well you must either supply a complete (current) list or append |.. to the TLD part.
With preg_match you must wrap it between some delimiters
$pattern = ';(?:https?://)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+);';
$index = preg_match($pattern, $url);
Usually, you use /. But in this case, slashes are part of the pattern, so I have chosen some other delimiter. Otherwise I must escape the slashes with \
$pattern = '/(?:https?:\/\/)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+)/';
Don't use a regular expression. Not every problem that involves strings needs to use regexes.
Don't write your own URL validator. URL validation is a solved problem, and there is existing code that has already been written, debugged and tested. In fact, it comes standard with PHP.
Look at PHP's built-in filtering functionality: http://us2.php.net/manual/en/book.filter.php
I think you can use flags for filter_var.
For FILTER_VALIDATE_URL there are several flags available:
FILTER_FLAG_SCHEME_REQUIRED: Requires the URL to contain a scheme part.
FILTER_FLAG_HOST_REQUIRED: Requires the URL to contain a host part.
FILTER_FLAG_PATH_REQUIRED: Requires the URL to contain a path part.
FILTER_FLAG_QUERY_REQUIRED: Requires the URL to contain a query string.
FILTER_FLAG_SCHEME_REQUIRED and FILTER_FLAG_HOST_REQUIRED are used by default.
Let's say you want to check for the path part and do not want to check for the scheme part; you can do something like this (the flag argument is a bitmask):
filter_var($url, FILTER_VALIDATE_URL, ~FILTER_FLAG_SCHEME_REQUIRED | FILTER_FLAG_PATH_REQUIRED)
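A runnable sketch of that approach, simplified to just FILTER_FLAG_PATH_REQUIRED since scheme and host are required by default in current PHP versions; the URL is an arbitrary example:

```php
<?php
// filter_var returns the URL on success and false on failure,
// so compare strictly against false.
$url = 'http://example.com/some/path';
$valid = filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED) !== false;
echo $valid ? "valid\n" : "invalid\n";
```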

Optional regular expression segment, but list of requirements if present?

I have a small routing engine in PHP. I'm trying to allow it to optionally match different "formats", such as requests to "/user/profile.json" or "/user/profile.xml". However, it should also match just a plain "/user/profile".
So, if the format is present, it must be ".json" or ".xml". But it isn't required to be present at all.
Here is what I have so far:
#^GET /something/([a-zA-Z0-9\.\-_]+)(\.(html|json))?$#
Obviously, this doesn't work. This allows any "format" to be requested since the entire format segment is optional. How can I keep it optional, but constrain the formats that can be requested?
^GET /something/([a-zA-Z0-9._-]+)(\.(html|json))?$
allows dots in the first character class, so any file extension is legal. I expect you did that on purpose so filenames with dots in them are possible.
However, this means that if a filename contains a dot, it must end in either .html or .json. Right?
So change the regex to (using the \w shorthand for [A-Za-z0-9_]):
^GET /something/([\w.-]+\.(html|json)|[\w-]+)$
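Trying that alternation quickly against a few route strings (the route names here are illustrative, not from the question):

```php
<?php
// A name containing dots must end in .html or .json;
// a dot-free name may omit the extension entirely.
$pattern = '#^GET /something/([\w.-]+\.(html|json)|[\w-]+)$#';

var_dump((bool) preg_match($pattern, 'GET /something/profile'));      // true
var_dump((bool) preg_match($pattern, 'GET /something/profile.json')); // true
var_dump((bool) preg_match($pattern, 'GET /something/profile.csv'));  // false
```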
Alternative suggestion:
Instead of putting the desired output format into the URL, have the client specify it via the Accept Header in the HTTP Request (where it belongs). Content negotiation is baked into the HTTP protocol, so you do not have to reinvent it via URLs. Technically, it is wrong to put the format into the URL. Your URIs should point to the resource itself and not the resource representation.
Also see W3C: Content Negotiation: why it is useful, and how to make it work
The issue you're getting arises from the fact that most extensions are alphanumeric, yet your regex allows a dot inside the character class:
#^GET /something/[a-zA-Z0-9\.\-_]+(\.(html|json))?$#
The problem section is [a-zA-Z0-9\.\-_]+. A request like .csv makes it through because it still matches that character range.
If something has dots in its file name, then by default it has a file extension (intentional or unintentional). The file My.Finance.Documents has the extension ".Documents", even though you'd assume it to be a text file or something else.
I hate doing it, but I think you might want to have a larger conditional in your regex, something along the lines of (this is an example, I haven't tested it):
#^GET /something/([^\.]+|.*\.(?:html|json))$#
Basically, if the file name has no dots in it, it's OK. If it does have a dot in it (which guarantees it has an extension), it must end with .html or .json.

Checking for valid web address, regular expressions in PHP

I have this text input, and I need to check if the string is a valid web address, like http://www.example.com. How can this be done with regular expressions in PHP?
Use the filter extension:
filter_var($url, FILTER_VALIDATE_URL);
This will be far more robust than any regex you can write.
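For example (the URL is arbitrary):

```php
<?php
// filter_var returns the filtered URL on success and false on failure,
// so compare strictly against false.
$url = 'http://www.example.com';
echo filter_var($url, FILTER_VALIDATE_URL) !== false ? "valid\n" : "invalid\n";
```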
Found this:
(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?
From Here:
A regex that validates a web address and matches an empty string?
You need to first understand a web address before you can begin to parse it effectively. Yes, http://www.example.com is a valid address. So is www.example.com. Or example.com. Or http://example.com. Or prefix.example.com.
Have a look at the specifications for a URI, especially the Syntax components.
I found the below from http://www.roscripts.com/PHP_regular_expressions_examples-136.html
//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into backreferences 1 through 4
'\b((?#protocol)https?|ftp)://((?#domain)[-A-Z0-9.]+)((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?((?#parameters)\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'
//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into named capturing groups.
//Works as it is with .NET, and after conversion by RegexBuddy on the Use page with Python, PHP/preg and PCRE.
'\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)(?<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(?<parameters>\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'
//URL: Find in full text
//The final character class makes sure that if an URL is part of some text, punctuation such as a
//comma or full stop after the URL is not interpreted as part of the URL.
'\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]'
//URL: Replace URLs with HTML links
preg_replace('{\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]}i', '<a href="\0">\0</a>', $text);
In most cases you don't have to check if a string is a valid address.
Either it is valid and a website will be available, or it isn't and the user will simply go back.
You should only escape illegal characters to avoid XSS; if your user doesn't want to give a valid website, that's their problem (in most cases).
PS: If you still want to check URLs, look at nikic's answer.
To match more protocols, you can do:
((https?|s?ftp|gopher|telnet|file|notes|ms-help)://)?[\w:@#%/;$()~=\.&-]+
