$_SERVER["HTTP_REFERER"] not returning full URL

I'm trying to apply a quick patch to address an issue with an extension we're using. As a result, please pardon this "bandaid-like" fix that I'm requesting assistance with. This is merely an effort to fix an issue in about 20 minutes or less and schedule in a permanent fix for later in the week.
That being said, I am struggling to get the value I would expect from $_SERVER["HTTP_REFERER"]. Our URL is somewhat odd at the moment. An example URL is below:
http://domain.com/custom-wheels-performance-tires/custom-wheels.html#/custom-wheels-performance-tires/custom-wheels.html?wheel_diameter=2663
When using $_SERVER["HTTP_REFERER"], the value I'm getting (for the URL above) is:
http://domain.com/custom-wheels-performance-tires/custom-wheels.html
Evidently, it is being cut off at the # in the URL. Common sense would be to remove that from the URL, but I'm going to have to dig into someone else's code to do that and it exceeds the time allocated for this patch. Is there a way to get the full URL (even if it isn't $_SERVER["HTTP_REFERER"])?
I appreciate any and all assistance!

Due to the way URLs are handled by browsers, the server never receives the fragment identifier (the # and everything after it). The fragment is intended to be used by the browser, for example to scroll the page to an anchor.
However, it is possible to use JavaScript to read the fragment and send it to the server.
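As a rough illustration of the point above, the sketch below shows what a browser would actually transmit for the URL in the question. The helper name urlAsSeenByServer is hypothetical and exists only for this example; parse_url() is standard PHP.

```php
<?php
// Hypothetical helper: returns the URL as a browser would transmit it,
// i.e. with the fragment (the '#' and everything after it) removed.
function urlAsSeenByServer(string $url): string
{
    $hashPos = strpos($url, '#');
    return $hashPos === false ? $url : substr($url, 0, $hashPos);
}

$full = 'http://domain.com/custom-wheels-performance-tires/custom-wheels.html'
      . '#/custom-wheels-performance-tires/custom-wheels.html?wheel_diameter=2663';

// parse_url() can split the fragment out of a string you already hold...
$fragment = parse_url($full, PHP_URL_FRAGMENT);

// ...but the browser drops it before sending, so this is all the server
// (and therefore $_SERVER["HTTP_REFERER"]) can ever contain.
$sent = urlAsSeenByServer($full);
```

To get the fragment on the server side, a client-side script would have to read it and pass it along explicitly, for example as an ordinary query or POST parameter.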

Related

PHP redirect without # [duplicate]

For some reason, non-IE browsers seem to persist a URL hash (if present) when a server-side redirect is sent (using the Location header). Example:
// a simple redirect using Response.Redirect("http://www.yahoo.com");
Test.aspx
If I visit:
Test.aspx#foo
In Firefox/Chrome, I'm taken to:
http://www.yahoo.com#foo
Can anyone explain why this happens? I've tried this with various server side redirects in different platforms as well (all resulting in the Location header, though) and this always seems to happen. I don't see it anywhere in the HTTP spec, but it really seems to be a problem with the browsers themselves. The URL hash (as expected) is never sent to the server, so the server redirect isn't polluted by it, the browsers are just persisting it for some reason.
Any ideas?

I suggest that this is the correct behaviour. The 302 and 307 status codes indicate that the resource is to be found elsewhere. #bookmark is a location within the resource.
Once the resource (html document) has been located it is for the browser to locate the #bookmark within the document.
The analogy is this: You want to look something up in a book in chapter 57, so you go to the library to get the book. But there is a note on the shelf saying the book has moved, it is now in the other building. So you go to the new location. You still want chapter 57 - it is irrelevant where you got the book.

This is an aspect that was not covered by previous HTTP specifications but has been addressed in the later HTTP development:
If the server returns a response code of 300 ("multiple choice"), 301
("moved permanently"), 302 ("moved temporarily") or 303 ("see
other"), and if the server also returns one or more URIs where the
resource can be found, then the client SHOULD treat the new URIs as
if the fragment identifier of the original URI was added at the end.
The exception is when a returned URI already has a fragment
identifier. In that case the original fragment identifier MUST NOT be
added to it.
So the fragment of the original URI should also be used for the redirection URI unless it also contains a fragment.
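A minimal sketch of that rule, as a browser might apply it, could look like the following. The function name inheritFragment is hypothetical, and the code only models the behaviour described in the quoted draft; it is not taken from any browser.

```php
<?php
// Hypothetical helper modelling the quoted rule: the redirect target
// inherits the original fragment unless it already has one of its own.
function inheritFragment(string $originalUrl, string $locationUrl): string
{
    // If the Location URL already carries a fragment, leave it untouched.
    if (strpos($locationUrl, '#') !== false) {
        return $locationUrl;
    }

    $hashPos = strpos($originalUrl, '#');
    if ($hashPos === false) {
        return $locationUrl; // nothing to carry over
    }

    // Append the original fragment, including its leading '#'.
    return $locationUrl . substr($originalUrl, $hashPos);
}

// Mirrors the example from the question.
echo inheritFragment('http://example.com/Test.aspx#foo', 'http://www.yahoo.com'), "\n";
// prints http://www.yahoo.com#foo
```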
Although this was just a draft that expired in 2000, the behavior described above seems to be the de facto standard among today's web browsers.
@Julian Reschke or @Mark Nottingham probably know more about this.

From what I have found, it doesn't seem clear what the exact behaviour should be. There are plenty of people having problems with this; some of them want to keep the bookmark through the redirect, and some of them want to get rid of it.
Different browsers handle this differently, so in practice it's not useful to rely on either behaviour.
It definitely is a browser issue. The browser never sends the bookmark part of the URL to the server, so there is nothing that the server could do to find out if there is a bookmark or not, and nothing that could be done about it reliably.

When I put the full URL in the action attribute of the form, it keeps the hash. But when the action contains only the query string, it drops the hash. E.g.,
Keeps the hash:
https://example.com/edit#alrighty
<form action="https://example.com/edit?ok=yes">
Drops the hash:
https://example.com/edit
<form action="?ok=yes">

cURL source retrieval (ignore files/images)

We've developed an IRC bot in PHP which (among many other functions) will respond with the page title of any URL a user sends to the channel. The problem I'm having is that when someone posts the URL of an image or a file, the bot tries to retrieve that file or image.
I'm trying to determine the best way to go about solving this issue. Should I filter the URL inputs and regex them for all possible file types? That seems daunting and exhaustive, to say the least. If anyone caught on to it, they could simply put a huge file somewhere with a senseless extension, post that URL in the channel, and time the bot out.
I feel like I'm missing a cURL option which could make it simply ignore retrievals of anything that isn't plain text. Any advice or suggestions?

One idea is to do a HEAD request first: if the content type is text/html you download the page, otherwise you don't. Or you could just read the first 1000 characters (or some other small amount) and check whether the title is there. If it isn't, you assume it is something other than HTML.
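The HEAD-request idea could be sketched roughly as follows, assuming the bot has PHP's cURL extension available. The function names isHtmlContentType and urlLooksLikeHtml, the 5-second timeout, and the redirect limit are assumptions made for this example, not part of the original bot.

```php
<?php
// Hypothetical helper: true when a Content-Type value denotes an HTML page.
function isHtmlContentType(?string $contentType): bool
{
    if ($contentType === null || $contentType === '') {
        return false;
    }
    // Drop parameters such as "; charset=UTF-8" before comparing.
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));
    return $mediaType === 'text/html';
}

// Hypothetical helper: issues a HEAD request and reports whether the target
// looks like HTML. Requires the cURL extension.
function urlLooksLikeHtml(string $url, int $timeoutSeconds = 5): bool
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true, // HEAD: fetch headers only, no body
        CURLOPT_FOLLOWLOCATION => true, // follow redirects to the final resource
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $ok = curl_exec($ch) !== false;
    $contentType = $ok ? curl_getinfo($ch, CURLINFO_CONTENT_TYPE) : null;

    return is_string($contentType) && isHtmlContentType($contentType);
}
```

The bot would then fetch the page and extract the title only when urlLooksLikeHtml() returns true. Because a server can report any content type it likes, keeping a timeout and a cap on the download size for the follow-up GET is still advisable.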

pseudo-random URL generation

I'm writing Python code to parse data from http://www.istockphoto.com/ and the URL generated by a search seems to be pseudo-random. For example, if you do a 'photos' search for 'meow' you get the URL: http://www.istockphoto.com/search/text/meow/filetype/photos/source/basic#e2430b3
I've looked at the source code carefully, but since I don't know much about PHP/javascript (I assume that's how the URL is being generated), I can't figure out exactly which lines of code are generating this URL. Could someone please point me in the right direction and show me which lines of code are responsible for the URL?

It's not a (pseudo-)random URL, as the first part is clearly specific to your search: http://www.istockphoto.com/search/text/meow/filetype/photos/source/basic
The last part, #e2430b3, is just an anchor to somewhere on the page, or is used by some scripts.
It is not used by the query, as you can type the URL without this part and it works the same.
Since the fragment is never sent to the server, any use of this part, such as a cache identifier to speed up repeated requests, would have to happen in the browser rather than on the server.

How do I get this URL without considering the Apache settings?

Hello, I have this URL that I need to get with PHP:
http://www.domain.com/forum/#forum/General-discussions-0.htm
The problem is that this is not a real URL, but the mask created by the .htaccess.
I need to get the visible URL and not the real path of the file, because I need to compare it with some PHP variables I have.
In fact the real path will look like this:
http://domain.com/modules/boonex/forum/index.php
And in that form it is totally useless to me.
How do I get the first URL as it is?

You can't get that from http://www.domain.com/forum/#forum/General-discussions-0.htm. Everything after the fragment marker (#) is not even sent to the server, so there is no way to retrieve it, save for a delayed update with JavaScript. All that gets sent to the server is http://www.domain.com/forum/, and on the onload event of your document you can possibly load something in with JavaScript.

Look into the source code, because the site may not have real URLs at all. The # part is for Ajax-based navigation. It may mean that there are no real URLs on that site, and if there are, they should be extracted from <a href="someurl"> as they might be masked using JavaScript.

With
file_get_contents();
for example. Neither the user nor your server needs to care about .htaccess.
It is the server processing the request that has to direct you to the correct address.
However, everything after # is ignored, so in this case you have no chance of getting it without the real URL.
As @Wrikken said, there is no way to get the part of the URL after the # fragment marker.

Efficient Method for Preventing Hotlinking via .htaccess

I need to confirm something before I go accuse someone of ... well I'd rather not say.
The problem:
We allow users to upload images and embed them within text on our site. In the past we allowed users to hotlink to our images as well, but due to server load we unfortunately had to stop this.
Current "solution":
The method the programmer used to solve our "too many connections" issue was to rename the file that receives and processes image requests (image_request.php) to image_request2.php, and replace the contents of the original with
<?php
header("HTTP/1.1 500 Internal Server Error") ;
?>
Obviously this has caused all images with their src attribute pointing to the original image_request.php to be broken, and is also the wrong code to be sending in this case.
Proposed solution:
I feel a more elegant solution would be:
In .htaccess:
1. If the request is for image_request.php
2. Check referrer
3. If referrer is not our site, send the appropriate header
4. If referrer is our site, proceed to image_request.php and process image request
What I would like to know is:
Compared to simply returning a 500 for each request to image_request.php:
How much more load would be incurred if we were to use my proposed alternative solution outlined above?
Is there a better way to do this?
Our main concern is that the site stays up. I am not willing to agree that breaking all internally linked images is the best / only way to solve this. I refuse to tell our users that because of something WE changed they must now manually change the embed code in all their previously uploaded content.

OK, then you can use Apache's mod_rewrite to prevent hotlinking:
http://www.cyberciti.biz/faq/apache-mod_rewrite-hot-linking-images-leeching-howto/

Using mod_rewrite will probably give you less load than running a PHP script. I think your solution would be lighter.
Make sure that you only block access in step 3 if the referer header is not empty. Some browsers and firewalls block the referer header completely and you wouldn't want to block those.
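For illustration only, the same decision logic can be written in PHP, for instance at the top of image_request.php. This is not the mod_rewrite approach recommended above, which should be lighter, but a sketch of the rule itself. The function name isAllowedReferer and the host example.com are hypothetical.

```php
<?php
// Hypothetical helper: decides whether an image request should be served,
// based on the Referer header and the site's own host name.
function isAllowedReferer(?string $referer, string $ownHost): bool
{
    // An absent or empty Referer is allowed: some browsers and firewalls strip it.
    if ($referer === null || $referer === '') {
        return true;
    }

    $host = parse_url($referer, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // present but unparseable: treat as foreign
    }

    $host    = strtolower($host);
    $ownHost = strtolower($ownHost);
    $suffix  = '.' . $ownHost;

    // Accept the bare host and any subdomain of it (e.g. www.).
    return $host === $ownHost || substr($host, -strlen($suffix)) === $suffix;
}

// Possible use in image_request.php (example.com is a placeholder):
// if (!isAllowedReferer($_SERVER['HTTP_REFERER'] ?? null, 'example.com')) {
//     http_response_code(404); // or whichever status you settle on
//     exit;
// }
```

Keep in mind that the Referer header is supplied by the client and is trivial to forge, so this reduces casual hotlinking rather than preventing it outright.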

I assume you store image paths in database with ids of images, right?
And then you query database for image path giving it image id.
I suggest you install Memcached on the server and cache user requests. It's easy to do in PHP. After that you can look at the server load and decide whether you need to stop this hotlinking thing at all.

Your increased load is equal to that of a string comparison in PHP (zilch).
The obfuscation solution doesn't even solve the problem to begin with, as it doesn't stop future hotlinking from happening. If you do check the referrer header, make absolutely certain that all major mainstream browsers will set the header as you expect. It's an optional header, and the behavior might vary from browser to browser for images embedded in an HTML document.
You likely have sessions enabled for all requests (whether they're authenticated or not) -- as a backup plan, you can also rename your session cookie name to something obscure (edit: obscurity here actually doesn't matter as long as the cookie is set for your host only (and it is)) and check that a cookie by that name is set in image_request.php (no cookie set would indicate that it's a first-request to your site). Only use that as a fallback or redundancy check. It's worse than checking the referrer.
If you were generating the IMG HTML on the fly from Markdown or something else, you could use a private-key hash strategy with a short-lived expiry time attached to the query string. Completely airtight, but it seems way over the top for what you're doing.
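A minimal sketch of such a signed, expiring link, assuming an HMAC over the image id and the expiry time, might look like this. The function names, the parameter names (id, expires, sig), and the choice of SHA-256 are all assumptions made for the example; hash_hmac() and hash_equals() are standard PHP.

```php
<?php
// Hypothetical helper: builds a link carrying an expiry time and an
// HMAC signature computed over the image id and that expiry.
function signImageUrl(string $imageId, string $secretKey, int $ttlSeconds, ?int $now = null): string
{
    $expires   = ($now ?? time()) + $ttlSeconds;
    $signature = hash_hmac('sha256', $imageId . '|' . $expires, $secretKey);

    return 'image_request.php?' . http_build_query([
        'id'      => $imageId,
        'expires' => $expires,
        'sig'     => $signature,
    ]);
}

// Hypothetical helper: checks the expiry and the signature of a request.
function verifyImageRequest(array $query, string $secretKey, ?int $now = null): bool
{
    $id      = $query['id'] ?? null;
    $expires = $query['expires'] ?? null;
    $sig     = $query['sig'] ?? null;

    if (!is_string($id) || !is_string($sig) || !is_numeric($expires)) {
        return false;
    }
    if ((int) $expires < ($now ?? time())) {
        return false; // link has expired
    }

    $expected = hash_hmac('sha256', $id . '|' . (int) $expires, $secretKey);

    // hash_equals() compares in constant time to avoid timing leaks.
    return hash_equals($expected, $sig);
}
```

The page that renders the IMG tag would call signImageUrl() with a short lifetime, and image_request.php would call verifyImageRequest($_GET, $secretKey) before serving the image. A copied link stops working once it expires, which is what defeats hotlinking here.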
Also, there is no "appropriate header" for lying to a client about the availability of a resource ;) Just send a 404.
