Regex to validate URL - Not checking for HTTP? - php

I know there are tons of questions on here about validating a web address with something like this:
/^[a-zA-Z]+[:\/\/]+[A-Za-z0-9\-_]+\\.+[A-Za-z0-9\.\/%&=\?\-_]+$/i
The only problem is that not everybody uses the http:// (or whatever comes before), so I wanted to find a way to use preg_match() without treating http as a must-have, more of a doesn't-really-matter. I modified it to this, but then it rejects the URL if it does have http:// in it:
/^[A-Za-z0-9\-_]+\\.+[A-Za-z0-9\.\/%&=\?\-_]+$/i
I was hoping more to validate it on these conditions:
If it has http:// or www then just ignore this
If the .extension is longer than 9 characters then reject
If it contains no full stops then reject
Anybody got an idea, thanks :)

Can't you just use the built-in filter_var function?
filter_var('example.com', FILTER_VALIDATE_URL);
Not sure about the nine chars extension limit, but I guess you could easily check this in an additional step.
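Note that FILTER_VALIDATE_URL only accepts URLs that include a scheme, so a rough sketch of that additional step might look like this (my own wiring, not part of the answer):
<?php
$url = 'http://example.com/page';

if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
    $host   = parse_url($url, PHP_URL_HOST);   // "example.com"
    $labels = explode('.', (string) $host);
    $extOk  = strlen(end($labels)) <= 9;       // the ".extension" must be 9 chars or fewer
    var_dump($extOk);                          // bool(true)
}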

Why not have a stage before the regexp to simply remove the http:// if present? The same would apply to the www. That may make your life a bit easier.
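A rough sketch of that idea, reusing the question's second pattern unchanged (the stripping regex is my own guess):
<?php
$input = 'http://www.example.com/page?x=1';
$bare  = preg_replace('#^(https?://)?(www\.)?#i', '', $input); // drop scheme and www.

var_dump(preg_match('/^[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+$/i', $bare)); // int(1)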

/^(http\://|www\.)/
/^.+?\.\S{0,9}\./
/\./
Those should work for your bullet points?

not everybody uses the http://
They should. Without a scheme it simply isn't a URL, and omitting it can cause weird problems. For example:
www.example.com:8080/file.txt
This is a valid URL with the non-existent scheme www.example.com:.
If you are sure that the normal scheme should be http:, you could try automatically prepending http:// to ‘fix up’ any URL that doesn't begin with https?: before validation. But you shouldn't allow/keep/return schemeless URLs over the longer term.
Incidentally the current regex you are using is a long way from accurate according to the official URI syntax (see RFC 3986). It will disallow many valid URI characters, not to mention Unicode characters in IRI. If you want a proper validation you should use a real URL-parser; if you just want a quick check for obvious problems you should use something much more permissive. For example just checking for the absence of categorically-invalid characters like space and ".
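A minimal sketch of that fix-up-then-validate approach (the helper name is my own, and filter_var() is just one possible validator):
<?php
// Prepend a scheme if one is missing, then validate the result.
function normalize_url($input)
{
    return preg_match('#^https?:#i', $input) ? $input : 'http://' . $input;
}

$url = normalize_url('www.example.com:8080/file.txt');
var_dump($url);                                            // "http://www.example.com:8080/file.txt"
var_dump(filter_var($url, FILTER_VALIDATE_URL) !== false); // bool(true)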

Related

Is it safe to use (strip_tags, stripslashes, trim) to clear variable that holds URLs

It's quite a pleasure to be posting my first question here :-)
I'm running a URL shortening / redirecting service, written in PHP.
I aim to store and handle valid URL data as much as possible within my service.
I noticed that sometimes invalid URL data is handed over to the database, containing invalid characters (like spaces at the beginning or end of the URL).
I decided to make my URL-check mechanism run trim(), stripslashes() and strip_tags() on the values before storing them.
As far as I can tell, these functions will not remove valid characters that any URL may have.
Kindly, just correct me or advise me if I'm going into the wrong direction.
Regards..
If you're already trimming the incoming variable, as well as filtering it with the other built-in PHP methods, and STILL running into issues, try changing the collation of your table to UTF-8 and see if that helps you get rid of the special characters you mention. (Could you paste a few examples to let us know?)
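For what it's worth, the clean-up described in the question might look like this, with FILTER_SANITIZE_URL added as an extra, optional step (my suggestion, not part of the answer):
<?php
$raw = "  http://example.com/page?id=1 \n";
$url = trim(strip_tags(stripslashes($raw)));   // the question's three functions
$url = filter_var($url, FILTER_SANITIZE_URL);  // optionally strip anything not URL-safe
var_dump($url);                                // "http://example.com/page?id=1"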

PHP Security Advice on $_GET (combining clean URLs with query string)

I am using "clean" URLs like this:
http://localhost/controller/action/param
I access the parameters with a custom function, like this: my_get(1), my_get(2), etc.
However, there are times when I think I need to combine them with query strings.
For example: If I need parameter values containing paths with several slashes like:
http://localhost/controller/action/param?mypath=foo/bar/qux.jpg
I do that because it would be a little harder to implement if done with clean URLs only.
Now my question is: in combining clean URLs with query strings, I only intend to allow this character class:
[.&=a-z0-9\/_-]
I was wondering, would there be any security issues with it? Should I disallow certain characters?
Don't worry too much about string formatting, but please validate the path passed... You said: "in the example above, mypath's value will be deleted with unlink()". Well, if you don't validate it, in the worst case an attacker could delete any file on the server's filesystem... ;)
So don't bother validating the string with a regex; validate the content of the string and make it safe for your environment... :)
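A hypothetical sketch of the kind of validation meant here; the base directory and the use of $_GET['mypath'] are assumptions based on the question's example URL:
<?php
// Resolve the requested path and make sure it stays inside an allowed
// base directory before ever calling unlink().
$base = realpath('/var/www/uploads');
$path = realpath($base . '/' . $_GET['mypath']);

if ($path !== false && strpos($path, $base . DIRECTORY_SEPARATOR) === 0) {
    unlink($path); // safe: the resolved path is inside $base
} else {
    http_response_code(400); // reject anything that escapes the base directory
}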

codeigniter disallowed characters error

If I try to access this URL http://localhost/common/news/33/+%E0%B0%95%E0%B1%87%E0%B0%B8.html, it shows "An Error Was Encountered: The URI you submitted has disallowed characters." I have set $config['permitted_uri_chars'] = 'a-z 0-9~%.:??_=+-?'; What should I do?
Yeah, if you want to allow non-ASCII bytes you would have to add them to permitted_uri_chars. This feature operates on URL-decoded strings (normally, unless there is something unusual about the environment), so you have to put the verbatim bytes you want in the string and not merely % and the hex digits. (Yes, I said bytes: _filter_uri doesn't use Unicode regex, so you can't use a Unicode range.)
Trying to filter incoming values (instead of encoding outgoing ones) is a ludicrously basic error that it is depressing to find in a popular framework. You can turn this misguided feature off by setting permitted_uri_chars to an empty string, or maybe you would like a range of all bytes except for control codes ("\x20-\xFF"). Unfortunately the _filter_uri function still does crazy, crazy, broken things with some input, HTML-encoding some punctuation on the way in for some unknown bizarre reason. And you don't get to turn this off.
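For reference, the two alternatives just mentioned would look roughly like this in application/config/config.php (illustration only, pick one):
<?php
// Disable the character filter entirely ("leave blank to allow all characters"):
$config['permitted_uri_chars'] = '';

// Or allow every byte except control characters, so UTF-8 segments pass through:
$config['permitted_uri_chars'] = "\x20-\xFF";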
This, along with the broken “anti-XSS” mangler, makes me believe the CodeIgniter team have quite a poor understanding of how string escaping and security issues actually work. I would not trust anything they say on security ever.
What to do?
Stop using Unicode characters in a URL - for the same reasons you shouldn't name files on a filesystem with Unicode characters.
But, if you really need it, I'll copy/paste some lines from the config:
Leave blank to allow all characters -- but only if you are insane.
I would NOT suggest trying to decode them or use any other tricks, instead I would suggest using urlencode() and urldecode() functions.
Since I don't have a copy of your code, I can't add examples, if you could provide me some, I can show you an example how to do it.
However, it's pretty straightforward to use, and it's built into PHP 4 and PHP 5.
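Since the original code isn't available, here is only a guessed illustration of where urlencode() and urldecode() would go:
<?php
// Encode the slug when the link is built, decode it where it is read back.
$slug    = 'కేస';                         // a non-ASCII slug like the one in the question's URL
$encoded = urlencode($slug);               // "%E0%B0%95%E0%B1%87%E0%B0%B8"
$link    = 'http://localhost/common/news/33/' . $encoded . '.html';

// ...and on the receiving side:
$original = urldecode($encoded);           // back to "కేస"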
I had a similar problem and wanted to share the solution. It was a password reset, and I had to send the username and time, as the URL would only be active for an hour. CodeIgniter will not accept certain characters in URLs for security reasons, and I did not want to change that. So here is what I did:
Concatenate the username, '__' and time() into a var $str.
Encrypt $str using MCRYPT_BLOWFISH; the result may contain '/', '+'.
Re-encrypt using str2hex (got it from here).
Put the encoded string as the 3rd argument in the link sent by email, like http://xyz.com/users/resetpassword/3123213213ABCDEF238746238469898 - you can see that the URL contains only 0-9 and A-Z.
When the link from the email is clicked, get the 3rd URI segment, use hex2str() to get back the Blowfish-encrypted string, and then apply Blowfish decryption to get the original string.
Split on '__' to get the username and time.
I know it's been almost a year since this question was asked, but I am hoping that someone will find this solution helpful after coming here from Google.
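A condensed sketch of those steps; str2hex()/hex2str() from the answer are replaced with PHP's built-in bin2hex()/hex2bin(), the key and username are placeholders, and the mcrypt extension it relies on was removed in PHP 7.2:
<?php
$key      = 'your-secret-key';
$username = 'some_user';

// Build the token for the email link:
$str       = $username . '__' . time();
$encrypted = mcrypt_encrypt(MCRYPT_BLOWFISH, $key, $str, MCRYPT_MODE_ECB);
$token     = strtoupper(bin2hex($encrypted)); // only 0-9 and A-F, safe in a CodeIgniter URL
$link      = 'http://xyz.com/users/resetpassword/' . $token;

// Inside the resetpassword() controller method, reverse the steps
// ($segment would come from $this->uri->segment(3)):
$segment   = $token;
$encrypted = hex2bin($segment);
$plain     = rtrim(mcrypt_decrypt(MCRYPT_BLOWFISH, $key, $encrypted, MCRYPT_MODE_ECB), "\0");
list($username, $timestamp) = explode('__', $plain);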

REGEX URL regular expression [duplicate]

Possible Duplicate:
Regular expression for browser Url
Is this regex perfect for any URL?
preg_match_all(
'/([www]+(\.|dot))?[a-zA-Z0-9_\.-]+(\.|dot){1,}[com|net|org|info\.]+((\.|dot){0,}[a-zA-Z]){0,}+/i',
$url, $regp);
Don't use regex for that. If you can't resist, a valid one can be found here:
What is the best regular expression to check if a string is a valid URL?
but that regex is ridiculous. Try to use your framework for that if you can (the Uri class in .NET, for example).
No. In fact it doesn't match URLs at all. It's trying to detect hostnames written in text, like www.example.com.
Its approach is to try to detect some common known TLDs, but:
[com|net|org|info\.]+
is actually a character group, allowing any sequence of characters from the list |.comnetrgif. Probably this was meant:
((com|net|org|info)\.)+
and also [www] is similarly wrong, plus the business with dot doesn't really make any sense.
But this is in general a really bad idea. There are way more TLDs in common use than just those and the 2-letter CCTLDs. Also many/most of the CCTLDs don't have a second-level domain of com/net/org/info. This expression will fail to match those, and will match a bunch of other stuff that's not supposed to be a hostname.
In fact the task of detecting hostnames is basically impossible to do, since a single word can be a hostname, as can any dot-separated sequence of words. (And since internationalised domain names were introduced, almost anything can be a hostname, eg. 例え.テスト.)
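A quick illustration of that character-class mistake (my own example strings):
<?php
// The bracketed version is a character class, so it happily matches nonsense
// made from those letters; the grouped version does not.
var_dump(preg_match('/^[com|net|org|info\.]+$/', 'motconfig'));   // int(1) -- matches!
var_dump(preg_match('/^((com|net|org|info)\.)+$/', 'motconfig')); // int(0)
var_dump(preg_match('/^((com|net|org|info)\.)+$/', 'com.net.'));  // int(1)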
'Any' URL is a tough call. In Oz you have .com.au; in the UK it is .co.uk. Each country has its own set of rules, and they can change. .xxx has just been approved. And non-ASCII characters have been approved now, but I suspect you don't need that.
I would wonder why you want validation which is that tight? Many URLs that are right will be excluded, and it does not exclude all incorrect URLs. www.thisisnotavalidurl.com would still be accepted.
I would suggest
A) using a looser check, just for ([a-zA-Z0-9_.-].)*[a-zA-Z0-9_.-] (or something), just as a sanity check
B) using a reverse lookup to check whether the URL is actually valid, if you want to allow only real URLs (sketched below)
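A rough sketch of suggestion B (my own code; checkdnsrr() needs network access to be meaningful):
<?php
// After a loose syntax check, ask DNS whether the host part actually exists.
$host = parse_url('http://www.thisisnotavalidurl.com', PHP_URL_HOST);

if ($host && (checkdnsrr($host, 'A') || checkdnsrr($host, 'MX'))) {
    echo "host resolves\n";
} else {
    echo "host does not resolve\n";
}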
Oh, and I find this: http://www.fileformat.info/tool/regex.htm to be a really useful tool if I am developing regex, which I am not great at.
[www]+ should be changed to (www)?
(\.|dot){1,} - one or more? Maybe you wanted ([a-zA-Z0-9_\.-]+(\.|dot)){1,}
A URL also has a protocol like http, which you're missing. You're also missing a lot of TLDs, as already mentioned.
Something like an escaped space (%20) would also not be recognized.
Port numbers can also appear in a URL (e.g. :80).
No, and you can't create a regex that will parse any URI (or URL or URN) - the only way to parse them properly is to read them as per the spec in RFC 3986.

How can I convert URLs to Markdown syntax, but NOT interfere with URLs already in Markdown syntax?

A system I am writing uses Markdown to modify links, but I also want to make plain links active, so that typing http://www.google.com would become an active link. To do this, I am using a regex replacement to find URLs and rewrite them in Markdown syntax. The problem is that I cannot get the regex to avoid also parsing links that are already in Markdown syntax.
I'm using the following code:
$value = preg_replace('#((?!\()https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)#', '[$1]($1)', $value);
This works well for plain links, such as http://www.google.com, but I need it to ignore links already in Markdown format. I thought the (?!\() section would prevent it from matching URLs that follow a parenthesis, but it would seem that I am in error.
I realize that even this is not an ideal solution (if it worked), but this is pushing beyond my regex abilities.
I think (?<!\() is what you meant. If the match position is at the beginning of http://www.google.com, it's not the next character you need to check, but the previous one. In other words you need a negative lookbehind, not a negative lookahead.
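Applied to the question's code, that change might look like this (the pattern is otherwise unchanged):
<?php
// Negative lookbehind: skip any URL immediately preceded by "(",
// i.e. one that is already inside a Markdown link.
$value = preg_replace(
    '#(?<!\()(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)#',
    '[$1]($1)',
    $value
);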
Regexes are notoriously bad at stuff like this; you might end up with all sorts of clever HTML exploits you never could have thought of. IMO you should mod the Markdown script to flag Markdown URLs as it sees them, so you can ignore the flagged URLs when you find them all with a very simple search that doesn't leave complexity to hack.
