PHP redirect without # [duplicate]

For some reason, non-IE browsers seem to persist a URL hash (if present) when a server-side redirect is sent (using the Location header). Example:
Test.aspx contains a simple redirect: Response.Redirect("http://www.yahoo.com");
If I visit:
Test.aspx#foo
In Firefox/Chrome, I'm taken to:
http://www.yahoo.com#foo
Can anyone explain why this happens? I've tried this with various server-side redirects on different platforms as well (all of them resulting in a Location header) and it always seems to happen. I don't see it anywhere in the HTTP spec, and it really seems to be behaviour of the browsers themselves. The URL hash (as expected) is never sent to the server, so the server redirect isn't polluted by it; the browsers are just persisting it for some reason.
Any ideas?

I suggest that this is the correct behaviour. The 302 and 307 status codes indicate that the resource is to be found elsewhere. #bookmark is a location within the resource.
Once the resource (HTML document) has been located, it is for the browser to locate the #bookmark within the document.
The analogy is this: You want to look something up in a book in chapter 57, so you go to the library to get the book. But there is a note on the shelf saying the book has moved, it is now in the other building. So you go to the new location. You still want chapter 57 - it is irrelevant where you got the book.

This is an aspect that was not covered by previous HTTP specifications but has been addressed in the later HTTP development:
If the server returns a response code of 300 ("multiple choice"), 301 ("moved permanently"), 302 ("moved temporarily") or 303 ("see other"), and if the server also returns one or more URIs where the resource can be found, then the client SHOULD treat the new URIs as if the fragment identifier of the original URI was added at the end. The exception is when a returned URI already has a fragment identifier. In that case the original fragment identifier MUST NOT be added to it.
So the fragment of the original URI should also be used for the redirection URI unless it also contains a fragment.
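The quoted exception also suggests a workaround: if the server puts its own fragment on the Location URL, the original fragment is not carried over. A minimal PHP sketch of that idea (the target URL and the #content anchor are just placeholders):
<?php
// Per the quoted draft: a fragment already present on the Location URL wins,
// so the browser will not carry the original #foo over to the new location.
// "#content" is just a placeholder anchor on the target page.
header('HTTP/1.1 302 Found');
header('Location: https://www.yahoo.com/#content');
exit;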
Although this was just a draft that expired in 2000, it seems that the behavior described above is the de-facto standard behavior among today's web browsers.
@Julian Reschke or @Mark Nottingham probably know more about this.

From what I have found, it doesn't seem clear what the exact behaviour should be. There are plenty of people having problems with this; some of them want to keep the bookmark through the redirect, some want to get rid of it.
Different browsers handle this differently, so in practice it's not useful to rely on either behaviour.
It definitely is a browser issue. The browser never sends the bookmark part of the URL to the server, so there is nothing that the server could do to find out if there is a bookmark or not, and nothing that could be done about it reliably.

When I put the full URL in the action attribute of the form, it keeps the hash. But when I use just the query string, it drops the hash. E.g.,
Keeps the hash:
https://example.com/edit#alrighty
<form action="https://example.com/edit?ok=yes">
Drops the hash:
https://example.com/edit
<form action="?ok=yes">

Convert URL into one standard format

Here are a few URLs:
http://sub.example.com/?feed=atom&hello=world
http://www.sub.example.com/?feed=atom&hello=world
http://sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom#123
As you can see, they all lead to the exact same page but the URL format is different. Here are two other basic examples:
http://example.com/hello/
http://example.com/hello
Both are the same.
I want to convert the URL into one standard format so that when I store the URL in the database, I can easily check whether the URL string already exists in the database.
Because of the various ways a URL can be formatted, this can be puzzling.
What's the definitive approach to converting a URL into one standard format? Maybe the parse_url() route...?
Edit
As outlined in the comments, there is no definitive solution to this, but the aim is to get as close as possible with what we have without "retrieving" the page. Please read comments before posting an answer to this bounty.
After you parse_url:
Remove the www prefix from the domain name
If the path is not empty - remove the trailing slash from it
Sort query parameters alphabetically by their name - if there are any
Combine these parts in order to get a canonical URL (a rough sketch follows).
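A rough PHP sketch of those steps, built on parse_url() as the question suggests (the exact normalization rules are up to you):
<?php
// Rough sketch of the steps above; adjust the rules to your own
// definition of "the same URL".
function canonicalize(string $url): string
{
    $p = parse_url($url);

    // 1. Drop a leading "www." from the host
    $host = preg_replace('/^www\./i', '', strtolower($p['host'] ?? ''));

    // 2. Trim the trailing slash from a non-empty path
    $path = rtrim($p['path'] ?? '', '/');

    // 3. Sort query parameters by name, if any
    $query = '';
    if (!empty($p['query'])) {
        parse_str($p['query'], $params);
        ksort($params);
        $query = '?' . http_build_query($params);
    }

    // Combine the parts
    return $host . $path . $query;
}

echo canonicalize('http://www.sub.example.com/?hello=world&feed=atom#123');
// sub.example.com?feed=atom&hello=world
Scheme and fragment are deliberately dropped here; keep them if they matter for your definition of "the same URL".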
I had the same issue with a report-configuration-save feature. In our system, users can design their own sales reports (like JQL in Jira); for that, we use GET params as conditions and the fragment identifier (after #) as layout setup, like this:
http://example.com/report.php?since=20180101&until=20180806#sort=amount&color=blue
For our system, the order of the GET params and of the params after the # is irrelevant: you reach the same report configuration if you set "until" before "since", so for us those are the same request.
Considering this, subdomains are out of the discussion, because you must solve those with rewrite techniques (like mod_rewrite with a 301 in Apache) or maintain a pool of domain exceptions at the software level. Also, different domains can point to different websites, so you must decide whether merging them is a good idea; for the "www" subdomain it is easy to figure out, but other cases will take you more time.
The server side can help with the query section. For example, in PHP you can use parse_str() on $_SERVER['QUERY_STRING'] to get an array, sort it by key, and finally compare whether two requests are the same (see the sketch below).
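A minimal sketch of that comparison (ksort() is used here to order by parameter name; swap in whatever ordering and comparison fits your data):
<?php
// Parse both query strings into arrays, sort them by parameter name,
// then compare keys, values and (sorted) order.
function sameQuery(string $a, string $b): bool
{
    parse_str($a, $qa);
    parse_str($b, $qb);
    ksort($qa);
    ksort($qb);
    return $qa === $qb;
}

var_dump(sameQuery('since=20180101&until=20180806', 'until=20180806&since=20180101')); // true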
Unfortunately, the server side alone is not enough, since it has no way to see the content after the hash (#), and that is before considering other problems such as the script name being included, protocols, or ports:
http://www.sub.example.com/index.php?hello=world&feed=atom
https://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com:8081/?hello=world&feed=atom
In my personal experience, the closest solution is JavaScript: handle the URL there, parse the query section into an array, compare them, and do the same with the fragment identifier. If you need this on the server side, every page load must be followed by an AJAX request sending this data to the server.
Apologies in advance for the length of my answer, but it is what I had to go through to solve the same problems you have. Greetings!
Adding the preferred <link rel="canonical" ...> tag to the HTML head is the only reliable solution for tying unique content to a single SEF URL. See Google's documentation on consolidating duplicate URLs, which probably answers the whole question more authoritatively and reliably than I ever could.
The idea of knowing the canonical URL, or of resolving a bunch of external URLs, without parsing those servers' .htaccess rewrite rules or the HTML head does not appear workable (simply because one can maintain a table of URL aliases, which makes it impossible to guess how an HTTP request might have been rewritten).
This question might belong on https://webmasters.stackexchange.com/search?q=cannonical.
Since the question is tagged "PHP", I assume you are on the backend.
There are enough answers on how to compare URLs (protocol, host, port, path, list of request params), where the path is case-sensitive but protocol and host are not. Strictly speaking, changing the order of request parameters also changes the URL.
My impression is that you want to differentiate by the RESOURCE which the server is serving (http://www.sub.example.com/ serves the same resource as http://sub.example.com/, and .../hello serves the same resource as .../hello/).
Which resource is served is something you know perfectly well at the backend level, since you (the backend) know what you are serving. Find the right ID for the resource and use it.
PS: the URL is not a good identifier for that. But if you must use it, use a sanitized version: sanitize to your preferred host, strip or add slashes at the end of paths, drop things like /../ from the path (a security issue anyway), bring the request params into a fixed order; whatever is right for your purpose.
Best regards, iPirat
This is a case of duplicate URLs, and you can avoid this kind of duplicate URL by using a URL factory that redirects every URL that is not the proper one to the proper URL.
And the same thing is explained in this article:
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
Any other URL leading to the same page is 301-redirected to the proper version of the URL.
This is a best practice of search engine optimization (SEO). Here are a couple of examples.
Consider the URLs of this website; for example, the wrong links for this page are
https://stackoverflow.com/questions/51685850
https://stackoverflow.com/questions/51685850/convert-url-into-one-s
https://stackoverflow.com/questions/51685850/
If you go to the above wrong URLs of this page, you'll be redirected to the proper URL which is
https://stackoverflow.com/questions/51685850/convert-url-into-one-standard-format
And if you change the title of this question, all the old URLs are 301-redirected to the new proper URL. The idea here is the 301 redirect, which tells search engines to replace the old URL with the new one; otherwise the search engines find different URLs serving the same content.
The real key here is the id of the question, 51685850. This id is used to build the proper URL from information in the database. With the URL factory described in the linked article, you do not even need to store URLs in the database.
You can read more on duplicate content here:
https://moz.com/learn/seo/duplicate-content
The same rules apply to tinywebhut.com as well; the wrong URLs are
https://www.tinywebhut.com/remove-duplicate-38
https://www.tinywebhut.com/some-text-38
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38/
In the above URLs the ID, 38, is appended to the end of the URL, and if you go to any of them you'll be 301-redirected to the proper version, which is
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
I didn't make any functions to explain this here because it is already done in this article:
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
You can achieve the goal with a couple of really simple functions, and you can apply the same idea to remove other duplicate URLs such as /about.php, /about, /about.php/, /about/ and so on. To achieve this you just need to add a little more code to your existing functions.
One alternative is adding a canonical tag: even if you have more than one URL going to the same page, you just apply the canonical tag and point its link at the proper URL.
<link rel="canonical" href="https://stackoverflow.com/questions/51685850/convert-url-into-one-standard-format" />
This way you are telling the search engines that the multiple URLs should be considered as one, and the search engines show the link used in the canonical tag in their search results. You can read more on canonicalization here:
https://moz.com/learn/seo/canonicalization
Still, the best way to get rid of duplicate content is the 301 redirect. If you have a 301 redirect like the one I described at the beginning, all problems are solved without surprises.
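A stripped-down sketch of that URL-factory idea; the /questions/{id}/{slug} pattern and getQuestionSlug() are hypothetical stand-ins for your own URL scheme and database lookup:
<?php
// Extract the numeric id from a request like /questions/51685850/whatever
if (preg_match('#^/questions/(\d+)#', $_SERVER['REQUEST_URI'], $m)) {
    $id   = (int) $m[1];
    $slug = getQuestionSlug($id);              // hypothetical database lookup
    $properPath = "/questions/{$id}/{$slug}";

    // Any other variant of the URL gets a 301 to the proper one
    // (query strings ignored here for brevity)
    if (strtok($_SERVER['REQUEST_URI'], '?') !== $properPath) {
        header('HTTP/1.1 301 Moved Permanently');
        header('Location: https://example.com' . $properPath);
        exit;
    }
}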
My original answer assumes that the pages are all owned by the OP, as per the line "As you can see, they all lead to the exact same page but the URL format is different...". I am adapting the answer to handle multiple options and adding a list of assumptions you can and cannot make about URLs.
As others have pointed out there is no definitive easy answer to this if you do not know that the page(s) are the same. However, if you follow these assumptions, you should be safe standardizing some things:
CAN ASSUME
Query strings with the same values point to the same location regardless of order. Example: https://example.com/?fruit=apple&color=red is the same as https://example.com/?color=red&fruit=apple
301 redirects to a specific source can be followed. If you receive a 301 redirect response, follow the redirect and use that URL. You can safely assume that if a URL actually does point to the same page, and page rank is optimized, then you can follow it.
If there is a single <link rel="canonical"> tag in the HTML, that too can be used to cover the canonical link (see below for why).
CANNOT ASSUME
That any URL is the same as any other URL if they differ (by URL in this case I am talking about everything before the query string).
http://example.com can be different from https://example.com, which can be different from http://www.example.com or https://www.example.com. There is no restriction against showing a different website when putting "www" in or leaving it out. That's also why search ranking gets hurt badly by this.
That any two URLs which currently have exactly the same content will keep exactly the same content. An example would be https://example.com/test and https://sub.example.com/test. Both may feasibly be set to the same generic test page content today. In the future, https://sub.example.com/test may be changed. You can't assume it won't be.
If you own the site
First, redirect all traffic to the one URL format you want: Do you want www.example.com or example.com or sub.example.com? Do you want a trailing slash or not? Redirect this first, either using server rules or PHP. This is also highly beneficial for search ranking (if that matters to you).
An example of this would be something like this:
// Force HTTPS, the bare host and no trailing slash, all in one 301
if (empty($_SERVER['HTTPS']) || 'example.com' !== $_SERVER['HTTP_HOST'] || rtrim($_SERVER['PHP_SELF'], '/') !== $_SERVER['PHP_SELF']) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: https://example.com' . rtrim($_SERVER['PHP_SELF'], '/'));
    exit;
}
Finally, to manage any remaining SEO concerns, you can add this HTML tag:
`<link rel="canonical" href="<?php echo $url; ?>">`
Whether you own the site or not, you can standardize query order
Even if you don't control the site, you can assume that query order does not matter. To standardize this, take your query and rebuild the parameters, appending it to your normalized URL.
function getSortedQuery()
{
$url = [];
parse_str($_SERVER['QUERY_STRING'], $url);
ksort($url);
return http_build_query($url);
}
$url = $_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF'].'?'.getSortedQuery();
Another option is to grab the contents of the page and see if there is a <link rel="canonical"> string, and use that string to log your data. This is a bit more costly as it requires a full page load.
To repeat, do make sure you grab 301 redirects as they are not suggestions, but directives, as to the end result URL.
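A rough sketch of that approach using cURL (which follows the redirects) and DOMDocument to pull out the canonical link; the two returned values map onto the two columns suggested below:
<?php
// Fetch the page (following 301/302 redirects), remember the effective URL,
// and pull out a <link rel="canonical"> if one is present.
function fetchCanonical(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,   // redirects are directives, follow them
        CURLOPT_MAXREDIRS      => 5,
    ]);
    $html = curl_exec($ch);
    $effectiveUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);

    $canonical = null;
    if ($html !== false) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);           // suppress warnings from sloppy HTML
        foreach ($dom->getElementsByTagName('link') as $link) {
            if (strtolower($link->getAttribute('rel')) === 'canonical') {
                $canonical = $link->getAttribute('href');
                break;
            }
        }
    }

    return ['effective_url' => $effectiveUrl, 'canonical_url' => $canonical];
}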
One final suggestion
I might recommend using two columns, one being "canonical_url" and another being "effective_url". Sometimes a URL works and then later becomes a 301 redirect. This is just my take but I would like to know these things.
All of the answers have great information. Assuming you are using an Apache-like server, for the URL bit I would use .htaccess (or, preferably, the equivalent Apache server config file if you can change it) to do the rewrites. For a simple example:
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule (.*) http://example.com/$1 [R=Permanent]
In this example, the "R=Permanent" DOES do a redirect. This is usually not a big issue as, a) it tells the browser to remember the redirect, and b) your internal links are presumably relative, so protocol (http or https) and the server (example.com or whatever) are preserved. So generally the redirect will be once per session or less - time well spent, IMO, to avoid doing all this in PHP.
I guess you could use it to rewrite the order of the query bits as well, though when the query bits are significant, I tend to (not recommending you do, just sayin') add them to my path (eg rewrite ".../blah/atom" to ".../blah.php?feed=atom"). At any rate, there are loads of rewrite tricks available, and I recommend you read about them in
Apache mod_rewrite.
If you do go this route, be sure to carefully think through what you want to happen - once you start mucking with the URL's, you are usually stuck with your decisions for a long while.
As several have pointed out, while the URLs you show may currently point to the same content, there is no way to tell if they will in the future. A change in either protocol or hostname can get you different sets of content, even example.com vs. www.example.com, even if served up by the same machine at the same IP. Not common, but it can happen...
So if I wanted to maintain a list of URLs, I would store the protocol, hostname, directory path, filename if present (i.e. "whatever comes after the last slash and before a question mark"), and a key-sorted set of key/value pairs for the GET arguments.
And then don't forget that you can go to https://www.google.com and not have anything BUT the protocol and hostname...
Avoid passing the parameters in the URL. Pass your parameters to the web page using JSON.

$_SERVER["HTTP_REFERER"] not returning full URL

I'm trying to apply a quick patch to address an issue with an extension we're using. As a result, please pardon this "bandaid-like" fix that I'm requesting assistance with. This is merely an effort to fix an issue in about 20 minutes or less and schedule in a permanent fix for later in the week.
That being said, I am struggling to grab the value I would expect when using $_SERVER["HTTP_REFERER"]. Our URL is somewhat odd at the moment. A URL example is below:
http://domain.com/custom-wheels-performance-tires/custom-wheels.html#/custom-wheels-performance-tires/custom-wheels.html?wheel_diameter=2663
When using $_SERVER["HTTP_REFERER"], the value I'm getting (for the URL above) is:
http://domain.com/custom-wheels-performance-tires/custom-wheels.html
Evidently, it is being cut off at the # in the URL. Common sense would be to remove that from the URL, but I'm going to have to dig into someone else's code to do that and it exceeds the time allocated for this patch. Is there a way to get the full URL (even if it isn't $_SERVER["HTTP_REFERER"])?
I appreciate any and all assistance!
Due to the way URLs are handled by browsers, the server never receives anything past the hash fragment identifier (#). The fragment is intended to be used by the browser to scroll a page to an anchor.
However, it is possible to use JavaScript to get the fragment and send it to the server.
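Only the server half of that idea can be shown in PHP; the client-side JavaScript that actually reads location.hash and sends it is sketched in the comment. The endpoint name and parameter are illustrative, not part of any existing API:
<?php
// Server-side half only: PHP never sees the fragment by itself, so some
// client-side JavaScript has to send it along, for example something like
//   fetch('save_fragment.php?fragment=' + encodeURIComponent(location.hash));
// "save_fragment.php" and the "fragment" parameter are illustrative names.
$fragment    = $_GET['fragment'] ?? '';
$fullReferer = ($_SERVER['HTTP_REFERER'] ?? '') . $fragment;
// $fullReferer now holds the referring URL including the part after the #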

XSS Vulnerability in PHP scripts

I have been searching everywhere to try and find a solution to this. I have recently been running scans on our websites to find any vulnerabilities to XSS and SQL Injection. Some items have been brought to my attention.
Any user-inputted data is now validated and sanitized using filter_var().
My issue now is with XSS and persons manipulating the URL. The simple one which seems to be everywhere is:
http://www.domainname.com/script.php/">< script>alert('xss');< /script >
This then changes some of the $_SERVER variables and causes all of my relative paths to CSS, links, images, etc.. to be invalid and the page doesn't load correctly.
I clean any variables that are used within the script, but I am not sure how I get around removing this unwanted data in the URL.
Thanks in advance.
Addition:
This then causes a simple link in a template file:
<a href="anotherpage.php">Link</a>
to actually link to:
"http://www.domainname.com/script.php/">< script>alert('xss');< /script >/anotherpage.php
This then changes some of the $_SERVER variables and causes all of my relative paths to CSS, links, images, etc.. to be invalid and the page doesn't load correctly.
This sounds like you made a big mistake with your website and should rethink how you inject link information from the input into your output.
Filtering input alone does not help here, you need to filter the output as well.
Often it's easier to return a 404 error if your application receives a request that does not match the set of allowed requests.
I am not sure how I get around removing this unwanted data in the URL.
Actually, the request has already been sent, so the URL is set. You can't "change" it. It's just the information about what was requested.
It's now your job to deal with it, not to blindly pass it around any longer, e.g. into your output (which is how your links end up broken).
Edit: You have now written more specifically about what concerns you. I agree with dqhendricks here: who cares?
If you really feel uncomfortable with the fact that a user is just using her browser and entering any URL she likes, well, the technically correct response is:
400 Bad Request (ref)
And return a page with no URIs, only fully-qualified (absolute) URIs, or a redefinition of the base URI; otherwise the browser will take the URI entered into its address bar as the base URI. See Uniform Resource Identifier (URI): Generic Syntax, RFC 3986, Section 5 (Reference Resolution).
First, if someone adds that crap to their URL, who cares if the page doesn't load images correctly? Also, if the request isn't valid, why would it load any page at all? And why are you using $_SERVER vars to get paths anyway?
Second, you should also be escaping any user-submitted database input with the appropriate method for your particular database to avoid SQL injection. filter_var() generally will not help with that.
Third, XSS is simple to protect against. Any user-submitted data that is to be displayed on any page needs to be escaped with htmlspecialchars(). This is easier to ensure if you use a view class that you can build this escaping into.
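A minimal example of that output escaping (the field name is arbitrary):
<?php
// Escape user-supplied data when it is output, not only when it is received.
$comment = $_POST['comment'] ?? '';
echo '<p>' . htmlspecialchars($comment, ENT_QUOTES, 'UTF-8') . '</p>';
// An input of <script>alert('xss');</script> is rendered as harmless text.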
To your concern about XSS: The altered URL won't get into your page unless you blindly use the related $_SERVER variables. The fact that the relative links seem to include the URL-injected script is a browser behavior that only risks breaking your relative links. Since you are not blindly using the $_SERVER variables, you don't have to worry.
To your concern about your relative paths breaking: Don't use relative paths. Reference all your resources with at least a root-of-domain path (starting with a slash) and this sort of URL corruption will not break your site in the way you described.

Efficient Method for Preventing Hotlinking via .htaccess

I need to confirm something before I go accuse someone of ... well I'd rather not say.
The problem:
We allow users to upload images and embed them within text on our site. In the past we allowed users to hotlink to our images as well, but due to server load we unfortunately had to stop this.
Current "solution":
The method the programmer used to solve our "too many connections" issue was to rename the file that receives and processes image requests (image_request.php) to image_request2.php, and replace the contents of the original with
<?php
header("HTTP/1.1 500 Internal Server Error") ;
?>
Obviously this has caused all images with their src attribute pointing to the original image_request.php to be broken, and is also the wrong code to be sending in this case.
Proposed solution:
I feel a more elegant solution would be:
In .htaccess
If the request is for image_request.php
Check referrer
If referrer is not our site, send the appropriate header
If referrer is our site, proceed to image_request.php and process image request
What I would like to know is:
Compared to simply returning a 500 for each request to image_request.php:
How much more load would be incurred if we were to use my proposed alternative solution outlined above?
Is there a better way to do this?
Our main concern is that the site stays up. I am not willing to agree that breaking all internally linked images is the best / only way to solve this. I refuse to tell our users that because of something WE changed they must now manually change the embed code in all their previously uploaded content.
OK, then you can use Apache's mod_rewrite capability to prevent hotlinking:
http://www.cyberciti.biz/faq/apache-mod_rewrite-hot-linking-images-leeching-howto/
Using mod_rewrite will probably give you less load than running a PHP script. I think your solution would be lighter.
Make sure that you only block access in step 3 if the referer header is not empty. Some browsers and firewalls block the referer header completely and you wouldn't want to block those.
I assume you store image paths in the database with image ids, right?
And then you query the database for the image path given the image id.
I suggest you install Memcached on the server and cache user requests. It's easy to do in PHP. After that you can watch the server load and decide whether you need to stop hotlinking at all.
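A minimal sketch of that caching idea, assuming the Memcached PECL extension and a hypothetical getImagePathFromDb() wrapper around your existing query:
<?php
// Cache the id -> path lookup so repeated (hotlinked) requests don't hit
// the database every time.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$imageId  = (int) ($_GET['id'] ?? 0);
$cacheKey = 'image_path_' . $imageId;

$path = $memcached->get($cacheKey);
if ($path === false) {                        // cache miss: query the database once
    $path = getImagePathFromDb($imageId);     // hypothetical existing lookup
    $memcached->set($cacheKey, $path, 300);   // cache for 5 minutes
}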
Your increased load is equal to that of a string comparison in PHP (zilch).
The obfuscation solution doesn't even solve the problem to begin with, as it doesn't stop future hotlinking from happening. If you do check the referrer header, make absolutely certain that all major mainstream browsers will set the header as you expect. It's an optional header, and the behavior might vary from browser to browser for images embedded in an HTML document.
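If you do go the referrer route, a sketch of such a check at the top of image_request.php might look like this (example.com stands in for your own host; an empty referrer is deliberately allowed, as advised above):
<?php
// Allow requests with no referrer (some browsers and firewalls strip the
// header), block requests referred from other sites.
$referer = $_SERVER['HTTP_REFERER'] ?? '';
$host    = parse_url($referer, PHP_URL_HOST);

if ($referer !== '' && !in_array($host, ['example.com', 'www.example.com'], true)) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
// ...continue processing the image request as before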
You likely have sessions enabled for all requests (whether they're authenticated or not) -- as a backup plan, you can also rename your session cookie name to something obscure (edit: obscurity here actually doesn't matter as long as the cookie is set for your host only (and it is)) and check that a cookie by that name is set in image_request.php (no cookie set would indicate that it's a first-request to your site). Only use that as a fallback or redundancy check. It's worse than checking the referrer.
If you were generating the IMG HTML on the fly from Markdown or something else, you could use a private-key hash strategy with a short-lived expiry time attached to the query string. Completely airtight, but it seems way over the top for what you're doing.
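A sketch of that signed-URL idea, using an HMAC over the image id and an expiry timestamp (SECRET_KEY and the parameter names are placeholders):
<?php
// SECRET_KEY is a placeholder for a key stored in your configuration.
define('SECRET_KEY', 'change-me');

// Used when generating the <img> markup.
function signedImageUrl(int $imageId, int $ttl = 300): string
{
    $expires = time() + $ttl;
    $sig = hash_hmac('sha256', $imageId . '|' . $expires, SECRET_KEY);
    return "/image_request.php?id={$imageId}&expires={$expires}&sig={$sig}";
}

// Used inside image_request.php before serving anything.
function isValidSignature(array $get): bool
{
    $expected = hash_hmac('sha256', ($get['id'] ?? '') . '|' . ($get['expires'] ?? ''), SECRET_KEY);
    return isset($get['sig'], $get['expires'])
        && hash_equals($expected, $get['sig'])
        && (int) $get['expires'] > time();
}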
Also, there is no "appropriate header" for lying to a client about the availability of a resource ;) Just send a 404.

PHP creating back-link with $_SERVER['HTTP_REFERER']

Is it safe to create a back link with:
$backLink = htmlentities($_SERVER['HTTP_REFERER']);
or is there a better solution?
An easier way might be to do something like this:
<a href="javascript:history.back()">Go back</a>
That does not rely on the browser populating the Referer header, but instead does exactly the same thing as pressing the browser "Back" button.
This may be considered better since it actually goes back in the browser history, instead of adding the previous page to the browser history in the forward direction. It acts just as you would expect the Back button to act.
It's quite safe, as long as you check for its existence. In some browsers it can be turned off, and I'm not sure that it's mandatory for browsers anyhow. But the baseline is, you can't count on it existing. (RFC 2616 doesn't say the Referer header must exist.)
If you really need reverse navigation, perhaps you could instead use a session variable to save the previously visited page (really the current one, but only update it after displaying the back link).
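A small sketch of that session-based approach; note the order of operations, with the update happening only after the link has been rendered:
<?php
session_start();

// Page remembered from the previous request, if any
$backLink = $_SESSION['previous_page'] ?? null;

// Only show the link when we actually know where "back" is
if ($backLink !== null) {
    echo '<a href="' . htmlspecialchars($backLink, ENT_QUOTES, 'UTF-8') . '">Go back</a>';
}

// Update only after the link has been rendered, so the next page
// sees the current one as "previous"
$_SESSION['previous_page'] = $_SERVER['REQUEST_URI'];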
Given that:
The referer header is optional
Some security software will rewrite the referer header (e.g. to XXXX:XXXXXXXX or Advert For Product)
Linking to the referer will, at best, duplicate the built in functionality of the back button
Users will often expect a link marked 'back' to take them to the previous page in a sequence, not to the previous page they were on
No, it isn't safe. The dangers are not great, but the benefits are tiny.
It will work in some cases. However, you should be aware that the HTTP Referer header is not guaranteed. User agents (browsers, search spiders, etc.) cannot be relied on to send anything, correct or not. In addition, if a user browses directly to the page, no Referer header will be present. Some internet security software products even strip out the HTTP referer for "security" reasons.
If you wish to use this solution, be sure to have a fallback in place such as not showing the back link, or linking to a default start page or something (it would depend on the situation this is to be used in).
An alternative solution might be to use JavaScript to navigate via history.back(). This will use the browser's back/history function to return to the previous page the user was on.
I think Facebook uses a similar technique to redirect the user.
They use a GET variable called 'from'.
You must be careful with htmlentities because it can corrupt non-ASCII text. For example,
echo(htmlentities("Привет, друг!")); // Contains Russian letters
is displayed as
Ïðèâåò,
äðóã!
Which is of course incorrect.
Every browser sends non-ASCII chars in URLs however it wants: Mozilla in Unicode, IE in the system's current charset (Windows-1251 in Russia).
So it might be useful to replace htmlentities with htmlspecialchars.
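For comparison, a short example showing that htmlspecialchars() leaves non-ASCII text alone when the charsets line up:
<?php
header('Content-Type: text/html; charset=UTF-8');
// htmlspecialchars() only touches &, <, >, " (and ' with ENT_QUOTES),
// so non-ASCII text passes through untouched when the charsets match.
echo htmlspecialchars("Привет, друг!", ENT_QUOTES, 'UTF-8');   // prints: Привет, друг!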
