Properly Matching an IDN URL - php

I need help building a regular expression that can properly match a URL inside free text. The expression should support:
scheme: one of the following: ftp, http, https (is ftps a protocol?)
optional user (and optional pass)
host (with support for IDNs)
support for www and sub-domain(s) (with support for IDNs)
basic filtering of TLDs ([a-zA-Z]{2,6} is enough I think)
optional port number
path (optional, with support for Unicode chars)
query (optional, with support for Unicode chars)
fragment (optional, with support for Unicode chars)
Here is what I could find out about sub-domains:
A "subdomain" expresses relative
dependence, not absolute dependence:
for example, wikipedia.org comprises a
subdomain of the org domain, and
en.wikipedia.org comprises a subdomain
of the domain wikipedia.org. In
theory, this subdivision can go down
to 127 levels deep, and each DNS label
can contain up to 63 characters, as
long as the whole domain name does not
exceed a total length of 255
characters.
Regarding the domain name itself, I couldn't find any reliable source, but I think the regular expression for non-IDNs (I'm not sure how to write an IDN-compatible version) is something like:
[0-9a-zA-Z][0-9a-zA-Z\-]{2,62}
Can someone help me out with this regular expression or point me in the right direction?

John Gruber, of Daring Fireball fame, recently had a post detailing his quest for a good URL-recognizing regex string. What he came up with was this:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
It apparently does OK with Unicode-containing URLs as well. You'd need to modify it slightly to get the rest of what you're looking for -- the scheme, username, password, etc. Alan Storm wrote a piece explaining Gruber's regex pattern, which I definitely needed (regex is so write-once-have-no-clue-how-to-read-ever-again!).
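If it helps, here is a rough sketch of how Gruber's pattern could be dropped into preg_match_all() in PHP; the sample text and the ~ delimiters are my own additions, not part of his post.

<?php
// Run Gruber's liberal URL pattern over some free text (sample text is made up).
$gruber = '~\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))~';
$text   = 'Read http://example.com/some/page (and also www.example.org) for details.';

if (preg_match_all($gruber, $text, $matches)) {
    print_r($matches[1]); // capture group 1 holds the full URL matches
}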

If you require the protocol and aren't worried too much about false positives, by far the easiest thing to do is match all non-whitespace characters around ://
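As a minimal sketch of that idea in PHP (the sample text is invented), something like this would pull out any run of non-whitespace characters surrounding :// --

<?php
// Grab every run of non-whitespace characters that contains :// .
// Deliberately loose: trailing punctuation will be included in the match.
$text = 'See http://example.com/page?x=1, or ftp://files.example.org for details.';

if (preg_match_all('~\S+://\S+~', $text, $matches)) {
    print_r($matches[0]);
}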

This will get you most of the way there. If you need it more refined please provide test data.
(ftp|https?)://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?
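For example, a hedged usage sketch of that pattern in PHP might look like this (the test URLs are made up):

<?php
// Wrap the suggested pattern in delimiters and try it on a few samples.
$pattern = '~(ftp|https?)://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?~';

$samples = [
    'http://example.com',
    'https://example.com:8080/path/file.txt?x=1',
    'not a url at all',
];

foreach ($samples as $candidate) {
    var_dump((bool) preg_match($pattern, $candidate));
}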


How can I manage a URL with special characters in cURL? [duplicate]

When one creates web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges.
I'm wondering whether it is best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or whether it is better to stick with non-English characters where appropriate, sacrificing the readability of those URLs in less capable environments (e.g. MSIE, view source).
"Exotic" letters could appear anywhere: in titles of documents, in tags, in user names, etc., so they're not always under the complete supervision of the maintainer of the website.
A possible approach, of course, would be to set up alternate -- unaccented -- URLs as well, which would point to the original destination, but I would like to hear your opinions about using accented URLs as primary document identifiers.
There's no ambiguity here: RFC 3986 says no; that is, URIs cannot contain Unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appear to be Unicode chars in URLs is really only 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or Unicode, the non-ASCII version won't work, because the underlying definition of URLs simply doesn't support it; so for it to work consistently, you need to %-encode.
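If you have PHP's intl extension available, that punycode round trip can be seen directly with idn_to_ascii() and idn_to_utf8(); this is just an illustrative sketch, not part of the original answer.

<?php
// Punycode round trip for an IDN host (requires the intl extension).
$ascii = idn_to_ascii('café.com', IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
echo $ascii, "\n"; // xn--caf-dma.com

echo idn_to_utf8($ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46), "\n"; // café.com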
When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting+character translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
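A minimal de-accenting helper along those lines might look like the following; the function name is invented and iconv()'s //TRANSLIT behaviour is locale-dependent, so treat this as a sketch rather than the actual rewriting code used by the answerer.

<?php
// Turn an accented resource name into a plain ASCII slug.
function unaccent_slug(string $text): string
{
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
    $ascii = preg_replace('/[^0-9A-Za-z]+/', '-', $ascii);
    return strtolower(trim($ascii, '-'));
}

echo unaccent_slug('myresumé'), "\n"; // e.g. "myresume" (exact transliteration depends on locale)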
Considering URLs with accents often tend to end up looking like this:
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Though, things should get better, as accented URLs are now accepted by web browsers, it seems.
The Firefox 3.5 I'm currently using displays the URL the nice way, not with %stuff, by the way; this seems to be "new" since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar), so it's probably not supported in IE 6, at least -- and there are still far too many people using that one :-(
Maybe URLs with no accents don't look the best they could; but, still, people are used to them and seem to generally understand them quite well.
You should avoid non-ASCII characters in URLs that users may enter manually in the browser. It's OK for embedded links pre-encoded by the server.
We found out that browsers can encode the URL in different ways, and it's very hard to figure out which encoding is used. See my question on this issue:
Handling Character Encoding in URI on Tomcat
There are several areas in a full URL, and each one might have different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (International Domain Names) rules and can contain (most) Unicode characters.
The path (after the first /), the user name, and the password can again be almost anything. They are escaped (as %XX), but those are just bytes; which encoding those bytes are in is difficult to know (it is interpreted by the HTTP server).
The parameters part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that interprets the bytes is another story.
It is recommended that the path/user/password/arguments be UTF-8, but this is not mandatory, and not everyone respects it.
So you should definitely allow for non-ASCII (we are not in the 80s anymore), but exactly what you do with it might be tricky. Try to use Unicode and stay away from legacy code pages, and tag your content with the proper encoding/charset if you can (using meta in HTML, language directives for ASP/JSP, etc.).
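To make that concrete, here is a small hedged example of building a URL whose path and query are UTF-8 and percent-encoded per segment (the Wikipedia URL mirrors the example used further up this page):

<?php
// Percent-encode each UTF-8 path segment and the query separately.
$segments = ['wiki', 'Éléphant'];
$path     = implode('/', array_map('rawurlencode', $segments));
$query    = http_build_query(['q' => 'café'], '', '&', PHP_QUERY_RFC3986);

echo "https://fr.wikipedia.org/{$path}?{$query}\n";
// https://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant?q=caf%C3%A9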

Looking for tips to better understand Perl Compatible Regular Expression operators and syntax

My question is about Perl Compatible Regular Expression operators and syntax. I've learned the basic syntax of '/hello/' and that /i means case-insensitive. I looked into this at jotform.com and will study it until I have a greater understanding. But I was hoping someone could give me a head start on understanding the Perl syntax and operators in the two PCREs I've posted below. They both work to keep users from posting links in the form textarea, but are very different in syntax and operators. I just want to know if one regex is preferred over the other. Which is best, and why?
Update: After several months of live testing, it appears that PCRE 1 does not prevent URLs in a PHP contact form. PCRE 2 does seem to prevent URLs over the same live-testing period.
The two regexes below were originally found at How to prevent spam URLs in a PHP contact form.
Is there a better regex than PCRE 2? Any help or advice would be greatly appreciated.
Thanks.
<?php
// PCRE 1 - Does not work to prevent URLs
if (preg_match('/www\.|http:|https:\/\/[a-z0-9_]+([\-\.]{1}[a-z_0-9]+)*\.[_a-z]{2,5}'.'((:[0-9]{1,5})?\/.*)?$/i', $_POST['message'])) {
    echo 'error please remove URLs';
} else {....

// PCRE 2 - Does work to prevent URLs
if (preg_match("/\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&##\/%?=~_|!:,.;]*[-a-z0-9+&##\/%=~_|]/i", $_POST['message'])) {
    echo 'error please remove URLs';
} else {....
?>
For the sake of offering an answer so that this page can be marked as resolved (instead of abandoned), I'll offer a refinement of the second pattern.
/\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&##\/%?=~_|!:,.;]*[-a-z0-9+&##\/%=~_|]/i
can be rewritten as:
\b(?:(?:f|ht)tps?:\/\/)[-\w+&##\/%?=~|!:,.;]*[-\w+&##\/%=~|]
The first segment matches https, http, ftps, or ftp as a "whole word" (\b) using alternation (|) and the zero-or-one quantifier (?). Your original pattern requires the "protocol" portion of the URL to exist, so I will not change that part of the pattern logic.
Your pattern requires the subdomain to be www., although a subdomain is not required in a valid URL and there are valid values other than www. that can be used. I am going to change the logic of this segment to make the subdomain optional and more flexible.
The character class (whitelisted characters) incorporates the characters in www., so the literal match can be omitted from the pattern.
I have reduced the length of both of your character classes by employing \w -- it includes all alphanumeric characters (uppercase and lowercase) as well as the underscore.
Here is a demonstration of what is matched: https://regex101.com/r/TP16iB/1 -- you will find that a valid URL like www.example.com is not matched by your preferred pattern nor by my pattern. To overcome this, you could hardcode www. as the required subdomain and make the protocol optional, but then you would not be matching variable subdomains. So you see, this is a bit of a rabbit hole where you will need to weigh how much time you wish to invest against what your application really needs. Be warned: the more accurate your pattern becomes, the longer and more convoluted it grows.
\b(?:(?:(?:f|ht)tps?:\/\/)|(?:www\.))[-\w+&##\/%?=~|!:,.;]*[-\w+&##\/%=~|]
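A quick hedged usage sketch of that final pattern in PHP (the test messages are invented):

<?php
// Apply the refined pattern to a few sample messages.
$pattern = '/\b(?:(?:(?:f|ht)tps?:\/\/)|(?:www\.))[-\w+&##\/%?=~|!:,.;]*[-\w+&##\/%=~|]/i';

$messages = [
    'Check out https://example.com/page please',
    'Visit www.example.com for details',
    'No links in this one',
];

foreach ($messages as $message) {
    echo preg_match($pattern, $message) ? "error please remove URLs\n" : "ok\n";
}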

php regexp for national domains

There are new national domains and TLDs, like "http://президент.рф/" for Russian Federation domains, or http://example.新加坡 for Singapore...
Is there a regex to validate these domains?
I have found this one: What is the best regular expression to check if a string is a valid URL?
But when I try to use one of the expressions listed there, PHP chokes :)
preg_match(): Compilation failed: character value in \x{...} sequence is too large at offset 81
P.S.
1) The last part was solved by @OmnipotentEntity.
2) But the main problem -- validating international domains -- still exists, because the example regexp doesn't validate well.
Use the "u" modifier to match unicode characters. The example you gave only uses the "i" modifier.
No, there's no regexp to validate those domains. Each TLD has different rules about which Unicode code points are permissible within their IDNs (if any). You would need a very big lookup table which would have to be kept up-to-date to know which specific characters are legal.
Furthermore there are rules about whether left-to-right written characters and right-to-left characters can be combined within a single DNS label.
BTW, the RFCs mentioned in the other comments are obsolete. The recently approved set are RFCs 5890 - 5895.

REGEX URL regular expression [duplicate]

Possible Duplicate:
Regular expression for browser Url
Is this regex perfect for any url ?
preg_match_all(
'/([www]+(\.|dot))?[a-zA-Z0-9_\.-]+(\.|dot){1,}[com|net|org|info\.]+((\.|dot){0,}[a-zA-Z]){0,}+/i',
$url, $regp);
Don't use regex for that. If you can't resist, a valid one can be found here:
What is the best regular expression to check if a string is a valid URL?
but that regex is ridiculous. Try to use your framework for that if you can (the Uri class in .NET, for example).
No. In fact it doesn't match URLs at all. It's trying to detect hostnames written in text, like www.example.com.
Its approach is to try to detect some common known TLDs, but:
[com|net|org|info\.]+
is actually a character group, allowing any sequence of characters from the list |.comnetrgif. Probably this was meant:
((com|net|org|info)\.)+
and also [www] is similarly wrong, plus the business with dot doesn't really make any sense.
But this is in general a really bad idea. There are way more TLDs in common use than just those and the 2-letter CCTLDs. Also many/most of the CCTLDs don't have a second-level domain of com/net/org/info. This expression will fail to match those, and will match a bunch of other stuff that's not supposed to be a hostname.
In fact the task of detecting hostnames is basically impossible to do, since a single word can be a hostname, as can any dot-separated sequence of words. (And since internationalised domain names were introduced, almost anything can be a hostname, eg. 例え.テスト.)
'Any' URL is a tough call. In Oz you have .com.au; in the UK it is .co.uk. Each country has its own set of rules, and they can change. .xxx has just been approved, and non-ASCII characters have been approved now, but I suspect you don't need that.
I would wonder why you want validation that tight. Many URLs that are right will be excluded, and it does not exclude all incorrect URLs: www.thisisnotavalidurl.com would still be accepted.
I would suggest
A) using a looser check, just for ([a-zA-Z0-9_.-].)*[a-zA-Z0-9_.-] (or something), purely as a sanity check
B) using a reverse lookup to check whether the URL is actually valid, if you want to allow only real URLs (see the sketch below)
Oh, and I find this: http://www.fileformat.info/tool/regex.htm to be a really useful tool if I am developing regex, which I am not great at.
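A rough sketch of the two-step idea from points A) and B) above in PHP; the loose pattern is my own adaptation, and checkdnsrr() needs network access to actually resolve anything.

<?php
// Step A: loose sanity check. Step B: DNS lookup to see if the host exists.
$host = 'www.thisisnotavalidurl.com';

$looksPlausible = (bool) preg_match('/^([a-zA-Z0-9_-]+\.)+[a-zA-Z0-9_-]+$/', $host);
$resolves       = $looksPlausible && checkdnsrr($host, 'A');

var_dump($looksPlausible, $resolves);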
[www]+ should be changed to (www)?
(\.|dot){1,} - one or more? Maybe you wanted ([a-zA-Z0-9_\.-]+(\.|dot)){1,}
A URL also has a protocol like http, which you're missing. You're also missing a lot of TLDs, as already mentioned.
Something like an escaped space (%20) would also not be recognized.
Port numbers can also appear in an URL (e.g. :80)
No, and you can't create a REGEX that will parse any URI (or URL or URN) - the only way to parse them properly is to read them as per the spec of RFC-3986

Regex to validate URL - Not checking for HTTP?

I know there are tons of questions on here about validating a web address with something like this:
/^[a-zA-Z]+[:\/\/]+[A-Za-z0-9\-_]+\\.+[A-Za-z0-9\.\/%&=\?\-_]+$/i
The only problem is, not everybody uses http:// or whatever comes before, so I wanted a way to use preg_match() without treating http as a must-have -- more of a doesn't-really-matter. I modified it to this, but then it rejects the URL if it does have http:// in it:
/^[A-Za-z0-9\-_]+\\.+[A-Za-z0-9\.\/%&=\?\-_]+$/i
I was hoping to validate it against these conditions instead:
If it has http:// or www then just ignore this
If the .extension is longer than 9 then reject
If it contains no full stops
Anybody got an idea? Thanks :)
Can't you just use the built-in filter_var function?
filter_var('example.com', FILTER_VALIDATE_URL);
Not sure about the nine chars extension limit, but I guess you could easily check this in an additional step.
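As a sketch of how filter_var() could be combined with the extra conditions from the question (the helper name and the prepend-http:// trick are assumptions on my part):

<?php
// Validate a URL, tolerating a missing scheme, and apply the question's extra rules.
function looks_like_url(string $input): bool
{
    // Prepend a scheme if none is present so filter_var() can cope (assumption).
    $url = preg_match('~^[a-z][a-z0-9+.-]*://~i', $input) ? $input : 'http://' . $input;

    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || strpos($host, '.') === false) {
        return false; // must contain at least one full stop
    }

    $labels = explode('.', $host);
    return strlen(end($labels)) <= 9; // reject "extensions" longer than 9 characters
}

var_dump(looks_like_url('example.com'));        // true
var_dump(looks_like_url('http://example.com')); // true
var_dump(looks_like_url('no-dots-here'));       // false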
Why not have a stage before the regexp to simply remove the http:// if present? The same would apply to the www. That may make your life a bit easier.
/^(http\://|www\.)/
/^.+?\.\S{0,9}\./
/\./
Those should work for your bullet points?
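A small hedged sketch of that pre-processing step; repeating the replace so inputs carrying both http:// and www. get fully stripped is my own addition.

<?php
// Strip a leading http:// or www. before running the remaining checks.
$input = 'http://www.example.com/index.html';

do {
    $previous = $input;
    $input    = preg_replace('/^(http:\/\/|www\.)/', '', $input);
} while ($input !== $previous);

echo $input, "\n"; // example.com/index.html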
not everybody uses the http://
They should. Without a scheme it simply isn't a URL, and omitting it can cause weird problems. For example:
www.example.com:8080/file.txt
This is a valid URL with the non-existent scheme www.example.com:.
If you are sure that the normal scheme should be http:, you could try automatically prepending http:// to 'fix up' any URL that doesn't begin with https?:, before validation. But you shouldn't allow/keep/return schemeless URLs over the longer term.
Incidentally, the current regex you are using is a long way from accurate according to the official URI syntax (see RFC 3986). It will disallow many valid URI characters, not to mention Unicode characters in IRIs. If you want proper validation you should use a real URL parser; if you just want a quick check for obvious problems, you should use something much more permissive -- for example, just checking for the absence of categorically invalid characters like space and ".
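For instance, a permissive check along those lines could be as small as this (a sketch, not a full validator):

<?php
// Reject only categorically invalid characters such as whitespace and double quotes.
$url = 'http://example.com/some path';

var_dump(!preg_match('/[\s"]/', $url)); // false: the URL contains a space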
