I'm interested in extracting the directory from $_SERVER['REQUEST_URI']. I believe I specifically need to use $_SERVER['REQUEST_URI'] in case the URL is being mod_rewrite'd.
If $_SERVER['REQUEST_URI'] = 'http://example.com/path/to/resource';, then I can just use dirname() on it and get http://example.com/path/to.
If the URI is the directory index page/equivalent, but a query string or hash is appended to it, then $_SERVER['REQUEST_URI'] might be something like 'http://example.com/path/?query' or http://example.com/path/#hash. In this case, dirname() works as well -- both of those resolve to http://example.com/path.
If, however, the URI has no query string or hash, dirname() strips off the bottom-level directory, e.g. http://example.com/path/ becomes http://example.com. My solution was to simply append a hash symbol to every URI, i.e. dirname($_SERVER['REQUEST_URI'].'#'). dirname('http://example.com/path/'.'#') yields http://example.com/path.
But I don't understand how URIs work, so I don't really know if this is safe.
If a URI points to a directory, does it always end in the character /? If I type http://example.com/path in my browser, it redirects to http://example.com/path/, adding the trailing slash. Will this always happen, though?
Do query strings and hashes always get URL-encoded? For example, I can type http://example.com/path/?query=a/b/c in my browser and it gets interpreted as http://example.com/path/?query=a%2Fb%2Fc. But if that conversion didn't happen, then dirname() would return http://example.com/path/?query=a/b, which would be bad.
Can URIs have \s or are directories always separated with /?
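One way to avoid depending on these details is to separate the path from the query string and fragment before looking at slashes. The following is only a sketch: requestDirectory() is a hypothetical helper name, and it assumes that a path ending in / denotes a directory.

```php
<?php
// Hypothetical helper: returns the directory part of a request URI.
// Assumption: a path that ends in "/" refers to a directory.
function requestDirectory(string $uri): string
{
    // Keep only the path, so slashes in the query string or fragment are ignored.
    $path = parse_url($uri, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return '/';
    }

    if (substr($path, -1) === '/') {
        // Already a directory: just drop the trailing slash.
        $dir = rtrim($path, '/');
    } else {
        // A resource: strip the last path segment.
        $dir = dirname($path);
    }

    return $dir === '' ? '/' : $dir;
}

echo requestDirectory('/path/to/resource'), "\n";   // /path/to
echo requestDirectory('/path/?query=a/b/c'), "\n";  // /path
echo requestDirectory('/path/#hash'), "\n";         // /path
echo requestDirectory('/path/'), "\n";              // /path
```

Because the query string is removed first, an unencoded / inside it no longer affects the result.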
I have the following rewrite rule in .htaccess :-
RewriteRule ^.*/-y.* /handleurl.php [L]
Its purpose is to display appropriate pages depending on the values in the url, for example:
example.com/books/BookA/-y?act=x will display bookA page
the variable holding the book name is encoded such that ...
example.com/books/Book B/-y?act=x becomes example.com/books/book+B/-y?act=x
... which is fine (it's decoded in handleurl.php)
however if the book is called Book A/B I have ...
example.com/books/Book A/B/-y?act=x which becomes example.com/books/Book+A%2FB/-y?act=x
It appears that htaccess decodes this before the rewrite rule, so the rewrite rule sees too many elements in the URL delineated by the /.
Is there any way I can get the rewrite rule to ignore the encoded / as intended?
I have seen a previous response to a similar question, but I only need the / to be ignored, not other encoded characters.
It appears that htaccess decodes this before the rewrite rule, so the rewrite rule sees too many elements in the URL delineated by the /
This is not the problem. Whether the URL-path /books/Book+A%2FB/-y is decoded or not makes no difference here*1. Both forms would match the (rather generous) regex ^.*/-y.* in the RewriteRule pattern.
(*1 But yes, the URL-path matched by the RewriteRule pattern is URL decoded, ie. %-decoded.)
The problem is likely to be that Apache (by default) rejects - with a 404 - any URL that contains a %-encoded slash ie. %2F (or backslash %5C) in the URL-path portion of the URL. This is a security feature, that otherwise "could potentially allow unsafe paths" (source).
However, this can be overridden with the AllowEncodedSlashes directive. But this directive can only be used in a server or virtualhost context. It cannot be used in .htaccess.
You need to either set AllowEncodedSlashes On, which allows encoded slashes and decodes them like any other character, or set AllowEncodedSlashes NoDecode, which permits encoded slashes but does not decode them. The latter is preferred and probably what you are expecting.
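As a rough illustration, the directive would sit in the virtual host definition rather than in .htaccess. The port and ServerName below are placeholders, not taken from your setup:

```apache
<VirtualHost *:80>
    # Placeholder host name for illustration only.
    ServerName example.com

    # Accept %2F in the URL-path, but leave it encoded.
    AllowEncodedSlashes NoDecode
</VirtualHost>
```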
Aside#1:
RewriteRule ^.*/-y.* /handleurl.php [L]
The regex ^.*/-y.* is very generic, possibly too generic. This is the same as simply /-y. What is the .* after -y intended to match? From your example URLs it looks like -y is always at the end of the URL-path, so this could be anchored, eg. /-y$. And if the URL that you need to match always starts /books/ then maybe this should also be included in the regex?
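To see why the two patterns are equivalent, here is a small PHP check using preg_match(), which uses the same PCRE syntax as mod_rewrite. The sample paths and the patternMatches() helper are made up for illustration, and the anchored pattern ^books/.+/-y$ is only one possible tightening:

```php
<?php
// Wraps a mod_rewrite-style pattern in PCRE delimiters and tests one path.
function patternMatches(string $pattern, string $path): bool
{
    return preg_match('#' . $pattern . '#', $path) === 1;
}

// Sample URL-paths as a RewriteRule in .htaccess sees them (no leading slash).
$paths = [
    'books/BookA/-y',
    'books/Book A/B/-y',
    'books/BookA/-yes/more',
    'books/BookA',
];

foreach ($paths as $path) {
    printf(
        "%-24s original=%s short=%s anchored=%s\n",
        $path,
        var_export(patternMatches('^.*/-y.*', $path), true),
        var_export(patternMatches('/-y', $path), true),
        var_export(patternMatches('^books/.+/-y$', $path), true)
    );
}
```

The first two columns always agree. Only the anchored pattern rejects books/BookA/-yes/more.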
Aside#2:
...the book name is encoded such that ...
example.com/books/Book B/-y?act=x becomes example.com/books/book+B/-y?act=x ... which is fine (it's decoded in handleurl.php)
This isn't strictly "URL encoded", you have converted the space into a + in the URL-path. The + is a valid "URL encoding" for a space when used in the query string only. A + in the URL-path is a literal + (and will be seen by search engines as such). In the URL-path, a space would be URL encoded as %20. (You may have used the wrong PHP encoding functions, eg. urlencode() instead of rawurlencode()?)
Of course, you are free to convert/encode the URL however you wish to create a more readable URL - providing it's valid.
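For reference, this is how the two PHP encoding functions differ on a title like the one in the question (a quick sketch, using the book name from the example):

```php
<?php
$title = 'Book A/B';

// urlencode() targets query strings: a space becomes "+".
echo urlencode($title), "\n";           // Book+A%2FB

// rawurlencode() follows the URL-path rules: a space becomes "%20".
echo rawurlencode($title), "\n";        // Book%20A%2FB

// Decoding a path segment with rawurldecode() leaves "+" as a literal plus.
echo rawurldecode('Book+A%2FB'), "\n";  // Book+A/B

// urldecode() turns "+" back into a space.
echo urldecode('Book+A%2FB'), "\n";     // Book A/B
```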
The rewrite rule was never the problem. I think it was Apache not liking the encoded '/' and the fact that the downstream URL handling program was using '/' as a delimiter when identifying the individual URL elements. I have to work out: 1) whether I want to allow '/' in the variables that make up the elements of the friendly URL, and 2) if so, how to pass it without upsetting Apache and how to subsequently dissect the URL. Maybe I will convert '/' to '~' for the benefit of the URL, then convert back to '/' prior to subsequent display. Thank you Mr White.
Users can search my site. Sometimes they might use a search term containing a forward slash (search with / slash) which when submitted by the form turns into %2F in the url.
eg. www.mysite.com/search/search+with+%2F+slash
I have used the answer from here which works great to give me the right page and not send me to a 404.
My problem now is I use pagination on the page and custom filters which are both passed as get vars in the url and when accessing the GET var it's empty.
eg. www.mysite.com/search/search+with+%2F+slash?page=2
This is my current route
$this->get('search/{search_term}', ['uses' => 'SearchController@search'])
->where('search_term', '(.*(?:%2F:)?.*)');
Not sure what to do from here.
Including an encoded slash (%2F) in the path component of a URL is not a good idea. The HTTP RFC says that this sequence should be "equivalent" to a real slash:
Characters other than those in the "reserved" and "unsafe" sets (see
RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.
In practice, the handling of these URLs is inconsistent. Some web servers (and even some browsers!) will treat %2F as equivalent to a real slash, some will treat it differently, and some tools, including some web application firewalls and proxies, will simply reject URLs which contain such a sequence.
If you need to include user input like this in a URL, you should probably put it in a query string (/search/?q=search+with+%2f+slash).
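A minimal sketch of that approach in plain PHP, assuming the parameter names q and page (searchUrl() is a hypothetical helper, not part of any framework):

```php
<?php
// Hypothetical helper: builds a search URL with the term in the query string.
function searchUrl(string $term, int $page = 1): string
{
    // http_build_query() encodes each value, so "/" becomes %2F safely.
    return '/search/?' . http_build_query(['q' => $term, 'page' => $page], '', '&');
}

$url = searchUrl('search with / slash', 2);
echo $url, "\n"; // /search/?q=search+with+%2F+slash&page=2

// Reading it back: the slash survives and the pagination value is intact.
parse_str(parse_url($url, PHP_URL_QUERY), $params);
echo $params['q'], "\n";    // search with / slash
echo $params['page'], "\n"; // 2
```

With the term in the query string, the path stays free of encoded slashes, and the pagination and filter values travel alongside it as ordinary GET variables.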
I'm always confused about whether I should add a trailing slash at the end of a path, and I often mix it up, which leads to "file not found" errors.
Example with drupal:
$base_theme = '/sites/all/themes/myTheme/';
or
$base_theme = '/sites/all/themes/myTheme';
The image path could extend the base theme:
$base_image = $base_theme.'images/';
or
$base_image = $base_theme.'/images';
Is there any convention? Or I can pick which one I prefer?
I would choose to finish all paths with a trailing slash, since too many slashes are better than no slash.
TL;DR: There's no real convention. A trailing slash is the more widely recognized format. The important thing is that you're consistent throughout your design and that you convey your usage clearly.
There's no real convention; but there are considerations to make.
Advantages in trailing slash:
Trailing slash usually indicates a folder path (or a prettified URL) whereas a file extension denotes a direct file link. (Think example.com/home/ VS example.com/style.css).
This is usually friendlier for people coming from UNIX and such, as in the terminal a clear convention is to leave a trailing slash for directories.
As a programmer, adding a trailing slash makes errors less likely. For example, accidentally adding a second slash looks ugly (http://example.com/styles//myfile.css) but will not break the file link, whereas forgetting a slash will (http://example.com/stylesmyfile.css). However, the behavior might be confusing for query strings: http://example.com/thread?id=1 vs. http://example.com/thread/?id=1; the result really depends on how you handle your .htaccess.
Advantages in no trail:
Prettier, some might say
It's easier to remember and it's more readable to always add a slash when appending paths to a variable string than not. i.e. it's easier to remember $baseURL . '/path.php' than $baseURL . 'path.php'
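If you'd rather not rely on remembering either convention, a small join helper can normalize the slash at the boundary. This is just a sketch; joinPath() is a made-up name, not a Drupal or PHP built-in:

```php
<?php
// Hypothetical helper: joins two path parts with exactly one slash between them.
function joinPath(string $base, string $child): string
{
    return rtrim($base, '/') . '/' . ltrim($child, '/');
}

// Both conventions from the question give a consistent result.
echo joinPath('/sites/all/themes/myTheme/', 'images/'), "\n"; // /sites/all/themes/myTheme/images/
echo joinPath('/sites/all/themes/myTheme', '/images'), "\n";  // /sites/all/themes/myTheme/images
```

Note that the helper only fixes the join point. Whether the result itself ends in a slash still follows whatever the second argument had.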
http://localhost/foo/profile/%26lt%3Bi%26gt%3Bmarco%26lt%3B%2Fi%26gt%3B
The url above gives me a 404 Error, the url code is this: urlencode(htmlspecialchars($foo));, as for the $foo: <i>badhtml</i>
The url works fine when there's nothing to encode e.g. marco.
Thanks. =D
Update: I'm supposed to capture the segment in the encoded part of the uri, so a 404 isn't supposed to appear.
There isn't any document there, marco is simply the string that I needed to fetch that person's info from db. If the user doesn't exist, it won't throw that ugly error anyways.
Slight idea what's wrong: I found out that if I used <i>badhtml<i>, it works just fine but <i>badhtml</i> won't, what do I do so that I can maintain the / in the <i>?
It probably thinks of the request as http://localhost/foo/profile/<i>badhtml</i>
Since there is a slash / in the parameter, this is getting interpreted as a path name separator.
The solution, therefore, is to replace all occurrences of a slash with something that doesn't get interpreted as a separator. \u2044 or something. And when reading the parameter back in, change all \u2044s back to normal slashes.
(I chose \u2044 because this character looks remarkably like a normal slash, but you can use anything that would never occur in the parameter, of course.)
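Roughly, the round trip could look like this in PHP 7 or later. The function names are hypothetical, and the sketch assumes the parameter never contains U+2044 itself:

```php
<?php
// U+2044 FRACTION SLASH stands in for "/" inside a URL segment.
const FRACTION_SLASH = "\u{2044}";

// Hypothetical helper: makes a value safe to use as a single path segment.
function encodeSegment(string $value): string
{
    return rawurlencode(str_replace('/', FRACTION_SLASH, $value));
}

// Hypothetical helper: reverses encodeSegment().
function decodeSegment(string $segment): string
{
    return str_replace(FRACTION_SLASH, '/', rawurldecode($segment));
}

$encoded = encodeSegment('<i>marco</i>');
echo $encoded, "\n";                // %3Ci%3Emarco%3C%E2%81%84i%3E
echo decodeSegment($encoded), "\n"; // <i>marco</i>
```

The encoded segment contains neither a literal / nor %2F, so it is not split into extra path elements.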
It is most likely that the regex responsible for handling the URL rewrite does not like some of the characters in the URL-encoded string. This is most likely an httpd/Apache question rather than a PHP one. Your best bet is to start by looking at the .htaccess (the file containing the URL rewrite rules).
This answer assumes that you are trying to pass an argument through the URL, rather than access a file named <i>badhtml</i>.
Mr. Lister, you rocked.
Why would $_SERVER['PHP_SELF'] return a filename in one instance as /test/foo.bar and another instance (executed from the same php script) as //test/foo.bar (with double leading forward slashes)?
form.php sends $_GET to login.php. login.php redirects to
header('Location: test/foo.bar');
foo.bar includes:
$page = filter_var($_SERVER['SCRIPT_NAME'], FILTER_SANITIZE_STRING);
(Additionally, I cannot replicate it on demand. )
I'd guess you're building links in code somewhere (or maybe someone just typed in an extra slash somewhere). You might have some code along these lines:
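function buildLink($site, $relPath, $text) {
    // Assumed reconstruction: the anchor markup appears to have been lost from the original snippet.
    return "<a href=\"$site/$relPath\">$text</a>";
}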
If $site is passed in with a trailing slash in some cases, or if an absolute path is passed in, you'd end up with a leading double slash once the server name is removed. An extra slash won't affect which page is displayed, but it will still show up in the parsed URL.
The value of $_SERVER['PHP_SELF'] depends on the actual request sent by the client. Apache allows multiple slashes between directory names, so it treats http://example/foo.php the same as http://example//foo.php -- both will call foo.php but the request URI will contain whatever the client requested.
If your script expects only one slash, you will have to strip the extra ones manually.
You could try using $_SERVER['SCRIPT_NAME']. Take note that PHP_SELF is vulnerable to a few types of attack, so be careful when using it. You have to treat it as if it was user supplied input.
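As a sketch of both points, the following collapses repeated slashes and escapes the value before output. normalizeSlashes() is a hypothetical helper, and the sample string stands in for the real server variable:

```php
<?php
// Hypothetical helper: collapses any run of slashes into a single slash.
function normalizeSlashes(string $path): string
{
    return preg_replace('#/{2,}#', '/', $path) ?? $path;
}

// Stand-in for $_SERVER['PHP_SELF'], which reflects what the client requested.
$requested = '//test/foo.bar';
$page = normalizeSlashes($requested);

// Escape before echoing, since the value is effectively user input.
echo htmlspecialchars($page, ENT_QUOTES, 'UTF-8'), "\n"; // /test/foo.bar
```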