Odd HTTP URI Requests with Misplaced Periods - php

I received an error report from my system because of a request that looked like this:
https://www.example.com./
Note the added period before the third forward-slash.
I would not have imagined this to be valid, though the server reports $_SERVER['HTTP_HOST'] as www.example.com. (note the trailing dot).
Is this technically valid?
Should I be trimming odd characters and redirecting to the actual host name URL?
Are there other odd ways that an $_SERVER['HTTP_HOST'] could be requested that I should try to have my system compensate for?

Yes, it's valid! Check out https://stackoverflow.com./.
Technically I believe the two URIs are identical, so I don't see a strong reason to redirect from one to the other. If it works, I wouldn't touch it. Note that Stack Overflow, for example, does not redirect.
The HTTP Host header is controlled by the client and could be any string. So if you're doing anything with that header, such as inserting it into your HTML or a SQL string, you need to treat it like user input and escape it. You should assume this for every header; it's always possible to make a request with cURL and change any of them.
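For example, a minimal sketch of treating the header as untrusted input before echoing it into HTML (the echo itself is hypothetical, not something from the question):

<?php
// The Host header is client-controlled, so escape it before placing it in
// HTML output, exactly as you would any other user input.
$host = $_SERVER['HTTP_HOST'] ?? '';
echo 'You requested: ' . htmlspecialchars($host, ENT_QUOTES, 'UTF-8');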

Related

Slashes in GET request (to be used with PHP back end)

I have to send a GET request to my Apache server. Whenever the parameter values are just one word, things work smoothly. Whenever there are spaces, I change them to %20 and it does the trick.
However, whenever I have slashes in my parameter values, things do not work.
For example, the URL I want to send to my server is:
https://randomness.com?path=/var/images/sub%20images/&name=image%2001.jpg
How can I work around this?
Many characters are specially interpreted by the web server in URLs, and the / character is one of them.
You can translate your / characters to %2F, like you translate to %20.
PHP's urlencode function can also handle these translations for you automatically.
A handy reference for these encodings can be found here, should you wish to handle it manually.
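As a rough sketch of that approach (the parameter names come from the example above; the rest is illustrative), encode each value before assembling the query string:

<?php
// rawurlencode() turns '/' into %2F and ' ' into %20, so slashes in the
// values no longer clash with the URL's own separators.
// urlencode() would work too, but produces '+' instead of %20 for spaces.
$path = '/var/images/sub images/';
$name = 'image 01.jpg';

$url = 'https://randomness.com?path=' . rawurlencode($path)
     . '&name=' . rawurlencode($name);

echo $url;
// https://randomness.com?path=%2Fvar%2Fimages%2Fsub%20images%2F&name=image%2001.jpg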

Force GET variable (url) to not be encoded

I've been coding in PHP for a while, and this is the first time I came across this issue.
My goal is to pass a GET variable (a URL) without encoding or decoding it, which means that "%2F" will not turn into "/" and vice versa. The reason for that is that I'm passing this variable to a 3rd party website and the variable must stay exactly the way it is.
Right now what's happening is that this URL (passed as a GET variable): http://example.com/something%2Felse turns into http://example.com/something/else.
How can I prevent php from encoding what's passed in GET?
Apache denies all URLs with %2F in the path part, for security reasons: scripts can't normally (i.e. without rewriting) tell the difference between %2F and / due to the PATH_INFO environment variable being automatically URL-decoded (which is stupid, but a long-standing part of the CGI specification, so there's nothing that can be done about it).
You can turn this feature off using the AllowEncodedSlashes directive, but note that other web servers will still disallow it (with no option to turn that off), and that other characters may also be taboo (eg. %5C), and that %00 in particular will always be blocked by both Apache and IIS. So if your application relied on being able to have %2F or other characters in a path part you'd be limiting your compatibility/deployment options.
I am using urlencode() while preparing the search URL
You should use rawurlencode(), not urlencode(), for escaping path parts. urlencode() is misnamed; it is actually for application/x-www-form-urlencoded data, such as in the query string or the body of a POST request, and not for other parts of the URL.
The difference is that + doesn't mean space in path parts. rawurlencode() will correctly produce %20 instead, which will work both in form-encoded data and other parts of the URL.
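A quick illustration of that difference (the sample value is made up):

<?php
$segment = 'sub images/photo 01.jpg';

echo urlencode($segment), "\n";     // sub+images%2Fphoto+01.jpg     (form-encoding style)
echo rawurlencode($segment), "\n";  // sub%20images%2Fphoto%2001.jpg (RFC 3986 style)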
Hex (base16) encoding is part of the HTTP protocol; you can't prevent it, or it would break the actual HTTP request to the server.
Use:
urlencode() to encode
urldecode() to decode
Please show an actual example of how you are sending the URL to the 3rd party.
It should read http%3A%2F%2Fexample.com%2Fsomething%2Felse, not just the odd %2F as in your example.
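For instance (a small illustration, not from the original answer), encoding the whole URL as one query value gives exactly that form:

<?php
echo urlencode('http://example.com/something/else');
// http%3A%2F%2Fexample.com%2Fsomething%2Felse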

Is it safe to use (strip_tags, stripslashes, trim) to clear variable that holds URLs

It's quite a pleasure to be posting my first question here :-)
I'm running a URL Shortening / Redirecting service, PHP written.
I aim to store and handle valid URL data as much as possible within my service.
I noticed that sometimes invalid URL data is handed over to the database, containing invalid characters (like spaces at the beginning or end of the URL).
I decided to make my URL-check mechanism run trim, stripslashes and strip_tags on the values before storing them.
As far as I can tell, these functions will not remove valid characters that any URL may have.
Kindly correct me or advise me if I'm going in the wrong direction.
Regards..
If you're already trimming the incoming variable, as well as filtering it with the other built-in PHP methods, and STILL running into issues, try changing the collation of your table to UTF-8 and see if that helps you get rid of the special characters you mention. (Could you paste a few examples to let us know?)
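As a complementary step (a sketch of one possible approach, not part of the answer above), you can also validate the cleaned value before it reaches the database:

<?php
// Trim surrounding whitespace, strip tags/slashes as described in the
// question, then validate the result before storing it.
function cleanUrlForStorage(string $raw): ?string
{
    $url = trim(strip_tags(stripslashes($raw)));
    return filter_var($url, FILTER_VALIDATE_URL) !== false ? $url : null;
}

var_dump(cleanUrlForStorage('  http://example.com/page  ')); // "http://example.com/page"
var_dump(cleanUrlForStorage('not a url'));                   // NULL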

REGEX URL regular expression [duplicate]

Possible Duplicate:
Regular expression for browser Url
Is this regex perfect for any URL?
preg_match_all(
'/([www]+(\.|dot))?[a-zA-Z0-9_\.-]+(\.|dot){1,}[com|net|org|info\.]+((\.|dot){0,}[a-zA-Z]){0,}+/i',
$url, $regp);
Don't use regex for that. If you can't resist, a valid one can be found here:
What is the best regular expression to check if a string is a valid URL?
but that regex is ridiculous. Try to use your framework for that, if you can (the Uri class in .NET, for example).
No. In fact it doesn't match URLs at all. It's trying to detect hostnames written in text, like www.example.com.
Its approach is to try to detect some common known TLDs, but:
[com|net|org|info\.]+
is actually a character group, allowing any sequence of characters from the list |.comnetrgif. Probably this was meant:
((com|net|org|info)\.)+
and also [www] is similarly wrong, plus the business with dot doesn't really make any sense.
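A quick demonstration of the difference (illustration only, not part of the original answer):

<?php
// The character class matches any jumble of the listed letters and '|',
// while the grouped alternation only matches whole labels like 'com.' or 'net.'.
var_dump(preg_match('/^[com|net|org|info\.]+$/', 'motni'));      // int(1) - nonsense matches
var_dump(preg_match('/^((com|net|org|info)\.)+$/', 'motni'));    // int(0)
var_dump(preg_match('/^((com|net|org|info)\.)+$/', 'com.net.')); // int(1)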
But this is in general a really bad idea. There are way more TLDs in common use than just those and the 2-letter CCTLDs. Also many/most of the CCTLDs don't have a second-level domain of com/net/org/info. This expression will fail to match those, and will match a bunch of other stuff that's not supposed to be a hostname.
In fact the task of detecting hostnames is basically impossible to do, since a single word can be a hostname, as can any dot-separated sequence of words. (And since internationalised domain names were introduced, almost anything can be a hostname, eg. 例え.テスト.)
'Any' URL is a tough call. In OZ you have .com.au, in the UK it is .co.uk. Each country has its own set of rules, and they can change. .xxx has just been approved. And non-ASCII characters have been approved now, but I suspect you don't need that.
I would wonder why you want validation that is that tight? Many URLs that are right will be excluded, and it does not exclude all incorrect URLs: www.thisisnotavalidurl.com would still be accepted.
I would suggest
A) using a looser check, just for ([a-zA-Z0-9_.-].)*[a-zA-Z0-9_.-] (or something), just as a sanity check
B) using a reverse lookup to check if the URL is actually valid, if you want to allow only real URLs (see the sketch below)
Oh, and I find this: http://www.fileformat.info/tool/regex.htm to be a really useful tool if I am developing regex, which I am not great at.
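Here is a rough sketch of that two-step idea (the helper name and the exact checks are assumptions, not from the answer):

<?php
// Step A: a loose syntax sanity check; step B: a DNS lookup to confirm the
// host actually resolves, so made-up names fail.
function looksLikeRealHost(string $host): bool
{
    if (!preg_match('/^[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$/', $host)) {
        return false;
    }
    return checkdnsrr($host, 'A') || checkdnsrr($host, 'AAAA') || checkdnsrr($host, 'MX');
}

var_dump(looksLikeRealHost('www.example.com'));            // bool(true)
var_dump(looksLikeRealHost('www.thisisnotavalidurl.com')); // bool(false), assuming it doesn't resolve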
[www]+ should be changed to (www)?
(\.|dot){1,} - one or more? Maybe you wanted to do ([a-zA-Z0-9_\.-]+(\.|dot)){1,}
A URL also has a protocol like http, which you're missing. You're also missing a lot of TLDs, as already mentioned.
Something like an escaped space (%20) would also not be recognized.
Port numbers can also appear in a URL (e.g. :80)
No, and you can't create a regex that will parse any URI (or URL or URN) - the only way to parse them properly is to read them as per the spec in RFC 3986.
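In PHP, one practical starting point (an aside, not from the answer above, and not strictly RFC 3986 compliant) is the built-in parse_url(), which splits a URI into components instead of trying to validate the whole thing with a single pattern:

<?php
$parts = parse_url('https://user@example.co.uk:8080/path/file.txt?q=1#frag');
print_r($parts);
// Components include: scheme => https, host => example.co.uk, port => 8080,
// user => user, path => /path/file.txt, query => q=1, fragment => frag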

Regex to validate URL - Not checking for HTTP?

I know there are tons of questions on here about validating a web address with something like this:
/^[a-zA-Z]+[:\/\/]+[A-Za-z0-9\-_]+\\.+[A-Za-z0-9\.\/%&=\?\-_]+$/i
The only problem is, not everybody uses the http:// or whatever comes before, so I wanted to find a way to use preg_match() but treat http:// not as a must-have but as something that doesn't really matter. I modified it to this, but then it rejects the URL if it does have http:// in it:
/^[A-Za-z0-9\-_]+\\.+[A-Za-z0-9\.\/%&=\?\-_]+$/i
I was hoping to validate it on these conditions:
If it has http:// or www then just ignore this
If the .extension is longer than 9 then reject
If it contains no full stops then reject
Anybody got an idea, thanks :)
Can't you just use the built-in filter_var function?
filter_var('example.com', FILTER_VALIDATE_URL);
Not sure about the nine chars extension limit, but I guess you could easily check this in an additional step.
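A small sketch of how that could look combined with the extension rule from the question (the helper name and the TLD-length check are assumptions):

<?php
// filter_var() does the general URL validation; the extra step applies the
// "extension longer than 9 characters" and "no full stops" rules to the host.
function isAcceptableUrl(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $host = (string) parse_url($url, PHP_URL_HOST);
    $dot  = strrpos($host, '.');
    if ($dot === false) {
        return false;                             // no full stop at all: reject
    }
    return strlen(substr($host, $dot + 1)) <= 9;  // overly long "extension": reject
}

var_dump(isAcceptableUrl('http://example.com/page')); // bool(true)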
Why not have a stage before the regexp to simply remove the http:// if present? The same would apply to the www. That may make your life a bit easier.
/^(http\://|www\.)/
/^.+?\.\S{0,9}\./
/\./
Those should work for your bullet points?
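For example, a rough sketch of the pre-processing step (only the patterns above come from the answer; the surrounding code is illustrative):

<?php
// Strip a leading http:// or www. before running the remaining checks.
$url  = 'http://www.example.co.uk/page';
$bare = preg_replace('/^(http:\/\/|www\.)+/i', '', $url);

echo $bare, "\n";                                        // example.co.uk/page
echo preg_match('/\./', $bare) ? 'has a full stop' : 'reject';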
not everybody uses the http://
They should. Without a scheme it simply isn't a URL, and omitting it can cause weird problems. For example:
www.example.com:8080/file.txt
This is a valid URL with the non-existent scheme www.example.com:.
If you are sure that the normal scheme should be http:, you could try automatically appending http:// to ‘fix up’ any URL that doesn't begin with https?:, before validation. But you shouldn't allow/keep/return schemeless URLs over the longer term.
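A minimal sketch of that fix-up step (the function name is made up for illustration):

<?php
// Prepend http:// when the value doesn't already start with http: or https:.
function normalizeScheme(string $url): string
{
    if (!preg_match('#^https?:#i', $url)) {
        $url = 'http://' . $url;
    }
    return $url;
}

echo normalizeScheme('www.example.com:8080/file.txt');
// http://www.example.com:8080/file.txt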
Incidentally the current regex you are using is a long way from accurate according to the official URI syntax (see RFC 3986). It will disallow many valid URI characters, not to mention Unicode characters in IRI. If you want a proper validation you should use a real URL-parser; if you just want a quick check for obvious problems you should use something much more permissive. For example just checking for the absence of categorically-invalid characters like space and ".
