Dealing with extra '/' in URL - php

So I have a custom made site that uses this type of input:
example.com/?id=4e2dc982
Or this would also work:
example.com/index.php?id=4e2dc982
But now I've started seeing hits in my log from GoogleBot trying to retrieve this for some reason:
example.com/index.php/?id=4e2dc982
The worst thing is that this actually works: it pulls the page with the right GET parameter, but because of the extra '/' all the relative links and references break. When the page tries to load "image.jpg", instead of loading the proper "example.com/image.jpg" it tries to load "example.com/index.php/image.jpg". How do I best fix this? I know I could go back and rewrite every link to use an absolute path, but that's silly. The link with an extra '/' shouldn't work in the first place.
Update:
I found the fix, but still don't know why this is even allowed. I went to:
http://ca1.php.net/manual-lookup.php?pattern=test
And tried to see if the following was possible, and sure enough it works:
http://ca1.php.net/manual-lookup.php/?pattern=test
But their page doesn't break. So I looked at it and found out why:
<base href="http://ca1.php.net/manual-lookup.php" />
So basically, ANY PHP script seems to accept an extra '/' (the extra segment gets handed to the script as path info), but if you didn't code all your links with absolute paths, or use a base tag, your site will break whenever someone adds an extra '/'.

The URL must be linked from somewhere, and you need to figure out where. You can use a Google site search (i.e. site:yoursite.com) to find it.
One suggestion for now is to use the canonical tag:
http://googlewebmastercentral.blogspot.com.au/2009/02/specify-your-canonical.html
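For example, a minimal sketch of the canonical approach (assuming the clean /?id=... form is the one you want indexed):

<?php
// Emit a canonical link so crawlers index the clean URL even when
// they arrive via /index.php/?id=...
$id = isset($_GET['id']) ? $_GET['id'] : '';
echo '<link rel="canonical" href="http://example.com/?id=' . urlencode($id) . '" />';
?>

Alternatively, since the stray segment shows up in $_SERVER['PATH_INFO'], the script could detect a non-empty path info and 301-redirect back to the clean URL.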

One thing you could actually do is read the User-Agent header (although some clients don't send it). Then, if the header contains anything like Google, don't allow the bot to crawl the page; otherwise serve the page to the user as normal.
Below is an example:
$browser = $_SERVER['HTTP_USER_AGENT'];
checkbrowser($browser); // Calls checkbrowser() with the user agent string.

function checkbrowser($analyze) {
    $searchwords = array("bot", "google", "crawler");
    $matches = array();
    $matchFound = preg_match_all(
        "/\b(" . implode("|", $searchwords) . ")\b/i", // note: implode() takes the glue first
        $analyze,
        $matches
    );
    if ($matchFound) {
        $words = array_unique($matches[0]);
        foreach ($words as $word) {
            if (strtolower($word) == "bot") { // lowercase it, since the pattern matches case-insensitively
                echo "Sorry, bots are not allowed to crawl this specific page.";
                die(); // Terminate the script and leave the bot with that message so it cannot crawl.
            }
        }
    }
}
This is how I often do it, though I use this method for different things. You can adapt the function by changing $searchwords to whatever fits you best.

Related

Get Page URL In Order To Use It To Include

So I made a script so that I can just use includes to get my header, pages, and then footer, and if a file doesn't exist, a 404. That all works. Now my issue is how I'm supposed to get the end of the URL to use as the page name. For example,
I want to make it so that when someone goes to example.com/home/test, it will automatically just include test.php.
Moral of the story: how do I get the page name, and then use it to "mask" the end of the URL so that I don't need every URL to be something.com/home/?p=home?
Here's my code so far.
<?php
include($_SERVER['DOCUMENT_ROOT'].'/home/lib/php/_dc.php');
include($_SERVER['DOCUMENT_ROOT'].'/home/lib/php/_home_fns.php');
$script = $_SERVER['SCRIPT_NAME']; // This returns /home/index.php for example =/
error_reporting(E_ALL);
include($_SERVER['DOCUMENT_ROOT'].'/home/default/header.php');
if (!isset($_GET["p"]) || $_GET["p"] == 'home') { // check isset() first to avoid an undefined-index notice
    include($_SERVER['DOCUMENT_ROOT'].'/home/pages/home.php');
} else if (file_exists($_SERVER['DOCUMENT_ROOT'].'/home/pages/'.basename($_GET["p"]).'.php')) {
    // basename() keeps a user-supplied "p" from traversing outside /home/pages/
    include($_SERVER['DOCUMENT_ROOT'].'/home/pages/'.basename($_GET["p"]).'.php');
} else {
    include($_SERVER['DOCUMENT_ROOT'].'/home/default/404.php');
}
include($_SERVER['DOCUMENT_ROOT'].'/home/default/footer.php');
?>
PHP by itself wouldn't be the best choice here unless you want your website littered with empty "redirect" PHP files. I would recommend looking into the Apache server's mod_rewrite module; there are plenty of guides out there to get you started. Hope this helps!
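For instance, a minimal .htaccess sketch of the rewrite idea (assuming the site lives under /home/ and everything routes through index.php?p=...):

RewriteEngine On
RewriteBase /home/
# Send anything that isn't a real file or directory to index.php as ?p=...
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)$ index.php?p=$1 [L,QSA]

With that in place, example.com/home/test is served by index.php?p=test behind the scenes.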
The simplest way would be to have an index.php file inside each /home/whatever folder. Then use something like $_SERVER['PHP_SELF'] and extract the name if you want to automate it, or, since you are already writing the file yourself, hardcode it.
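A minimal sketch of that idea (assuming one index.php per page folder; the paths are hypothetical):

<?php
// /home/test/index.php: derive the page name from this script's own path,
// e.g. /home/test/index.php -> "test"
$page = basename(dirname($_SERVER['PHP_SELF']));
include($_SERVER['DOCUMENT_ROOT'].'/home/pages/'.$page.'.php');
?>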
That however looks plain wrong; you should probably look into mod_rewrite if you are up to creating a more complex/serious app.
I would also recommend the CakePHP framework, which has the whole path-to-controller thing worked out.

If URL contains something include .php file else do nothing?

I wish to include a Smart PHP Cache layer on top of the main script on my site. It works great, but Smart PHP Cache also caches some pages which should not be cached (search results, the admin area...).
I looked into the Smart PHP Cache source code, and I am not sure if there is some way to configure which pages should be excluded from the cache, or how to configure it.
So, what I need is some PHP code, inserted at the top of the site's main script before the Smart PHP Cache code, which will first check whether the URL contains, for example:
"/search/"
"/admin/"
"/latest/"
"/other-live-pages/live-page.php"
and then, if something from the above examples is in the URL, do nothing (don't include smart_cache.php and continue with the other normal code, so the user sees live results); otherwise, if nothing from the above matches, include smart_cache.php.
Or:
If you know Smart PHP Cache better, you could modify it to exclude some URLs from the caching mechanism (or tell me how to do that; it looks like there is something in the configuration of Smart PHP Cache that can bypass the cache layer, but I am not sure how to use it).
Best regards.
Question update:
Thanks for the answer. It works nicely; I just wish to ask if you could change the code a little to do this:
if "pos1" (the URL contains "/search"), then nothing, false, like it is now
if "pos2" (the URL contains "/admin"), then nothing, false, like it is now
if "pos3" (the URL contains "/latest"), include the file "smart_cache_latest.php"
and after that, like it is now, include "smart_cache.php" for any other URLs.
So practically the only change is for URLs with "/latest", which should be cached too, by including "smart_cache_latest.php".
Best regards.
$currenturl = $_SERVER['REQUEST_URI'];
$pos1 = strpos($currenturl, "/search");
$pos2 = strpos($currenturl, "/admin");
$pos3 = strpos($currenturl, "/latest");
// strpos() returns false when the needle is missing (and 0 when it sits at
// the very start of the string), so always compare with === / !== false.
if ($pos1 !== false || $pos2 !== false) {
    // /search and /admin pages: do nothing, serve live results
} elseif ($pos3 !== false) {
    require '/path/to/smart_cache_latest.php';
} else {
    require '/path/to/smart_cache.php';
}

Passing variable through subdomain wildcard redirect?

I am working on a new script: instead of a search on my website going here, as it normally would:
http://domain.com/index.php?q=apples
it now redirects to:
http://apples.domain.com
I have made this work perfectly in PHP as well as .htaccess, but the problem I am having is using the original keyword afterwards on the new subdomain page.
Right now I can use parse_url to get the keyword out of the URL, but my script also filters out potential problems, like this:
public function sanitise($v, $separator = '-')
{
    return trim(
        preg_replace('#[^\w\-]+#', $separator, $v),
        $separator
    );
}
So if somebody searches for netbook v1.2
The new subdomain would be:
http://netbook-v1-2.domain.com
Now I can take the keyword out, but it has the dashes and is not the original. I am looking for a way to send over the original keyword with the 301 redirect as well.
Thanks!
You can either just replace the hyphens with spaces when they visit the new subdomain or, since you're on the same top-level domain, you can just put the keyword in a cookie when redirecting them:
setcookie('clientkeyword', 'netbook-v1-2.domain.com:netbook v1.2', 0, '/', '.domain.com');
Look at this answer: https://stackoverflow.com/a/358334/992437
And see if you can use the POST or GET data that's already there. If so, that might be your best bet.
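If the GET route works for you, a minimal sketch (variable names hypothetical; sanitise() is the method from the question) of carrying the original keyword through the 301:

<?php
// Redirect to the sanitised subdomain, passing the original keyword
// along in the query string so the subdomain page can read it.
$original = $_GET['q'];             // e.g. "netbook v1.2"
$sub = sanitise($original);         // e.g. "netbook-v1-2"
header('Location: http://' . $sub . '.domain.com/?q=' . urlencode($original), true, 301);
exit;
?>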

HTML5 domain locking?

I've got a project where we're creating a dynamic HTML5-based video player with a bunch of Ajax controls and features. The intention is that this player will be used by other domains. Our old player used Flash and could easily be domain-locked, but now is there any method at all to do domain locking in HTML5?
Keep in mind that it's not just the video itself; we also want to load HTML content for our Ajax-based controls. It seems like an iframe is the obvious choice for this, but then there's no way to do domain locking.
Any ideas?
You could use the function in the answer below, but it's pretty obvious what it's doing, so anyone can just remove the domain lock.
There are services out there that will lock your page to a domain name; I know of two off the top of my head.
jscrambler.com - this is a paid tool, but it might be a bit of overkill if all you want to do is lock your domain.
DomainLock JS - this is a free domain locking tool.
I came here looking for the same thing. But I think I have an answer worked out.
The best way I found so far is to strip the location.href of its http:// and then check the first few characters against a whitelist of domains. So:
if (checkAllowedDomains())
{
    initApplication();
}
else
{
    // it's not the right domain, so redirect them!
    top.location.href = "http://www.snoep.at";
}

function checkAllowedDomains()
{
    var allowed_domains = new Array();
    allowed_domains.push("www.snoep.at");
    allowed_domains.push("www.makinggames.nl");
    allowed_domains.push("www.google.com");
    // add whatever domain here!
    var domain = top.location.href;
    domain = domain.replace('http://', ''); // replace() returns a new string, so assign it back
    var pass = false;
    for (var i = 0; i < allowed_domains.length; i++)
    {
        // look at only the first characters, up to the length of the
        // allowed domain, so google.com.mydomain.com cannot sneak through
        var shortened_domain = domain.substr(0, allowed_domains[i].length);
        if (shortened_domain.indexOf(allowed_domains[i]) != -1)
        {
            pass = true;
        }
    }
    return pass; // without this, the caller would always receive undefined
}
This bit of code checks several allowed_domains; you can easily extend the array.
That is the problem with the code: it's very readable. So I'd advise you to put it through a JS minimizer to make it less obvious, and include it in EVERY js file on your page. initApplication() is a function that starts your page or application.
Because you strip http:// from the location (which may or may not be there) and then check only the first characters, up to the exact length of the allowed domain (including the www!), you rule out lookalike subdomains such as google.com.mydomain.com that would otherwise throw the check off!
Hope this helps.
Try reading the Referer header, and if the site isn't whitelisted, don't display the player.
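On the server side, a minimal sketch of that Referer check in PHP (domain names hypothetical; note the header can be absent or spoofed, so treat this as a speed bump rather than real security):

<?php
// Serve the player markup only when the request's Referer host is whitelisted.
$allowed = array('www.snoep.at', 'www.makinggames.nl');
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$host = parse_url($referer, PHP_URL_HOST);
if (!in_array($host, $allowed, true)) {
    header('HTTP/1.1 403 Forbidden');
    exit('This player is not available on this domain.');
}
?>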

PHP Input validation for a single input for a url

I have this very simple script that allows the user to specify the URL of any site. The script then replaces the URL in the "data" attribute of an object tag to display the site of the user's choice inside the object on the HTML page.
How could I validate the input so the user can't load any page from my own site inside the object? I have noticed that doing so will display my code.
The code:
<?php
$url = 'http://www.google.com';
if (array_key_exists('_check', $_POST)) {
    $url = $_POST['url'];
}
// gets the title from the selected page
$file = @fopen($url, "r") or die("Can't read input stream"); // @ suppresses fopen() warnings
$text = fread($file, 16384);
if (preg_match('/<title>(.*?)<\/title>/is', $text, $found)) {
    $title = $found[1];
} else {
    $title = "Untitled Document";
}
?>
Edit (more details):
This is NOT meant to be a proxy. I am letting the users decide which website is loaded into an object tag (similar to an iframe). The only thing PHP is going to read is the title tag from the input URL, so it can be used in the title of my site. (Don't worry, it's not to trick the user.) Although it may display the title of any site, it will not bypass any filters in any other way.
I am also aware of the vulnerabilities involved with what I am doing; that's why I'm looking into validation.
As gahooa said, I think you need to be very careful with what you're doing here, because you're playing with fire. It's possible to do safely, but be very cautious with what you do with the data from the URL the user gives you.
For the specific problem you're having, though, I assume it happens when the input is a bare filename, for example if someone types "index.php" into the box. All you need to do is make sure the URL starts with "http://" so that fopen uses the network wrapper instead of opening a local file. Something like this before the fopen line should do the trick:
if (!preg_match('/^http:\/\//', $url))
$url = 'http://'.$url;
parse_url: http://us3.php.net/parse_url
You can check for the scheme and host.
If the scheme is http, then make sure the host is not your website. I would suggest using preg_match to grab the part between the dots: as in www.google.com or google.com, use preg_match to get the word google.
If the host is an IP, I am not sure what you want to do in that situation. By default, the preg_match would only get the middle two numbers and the dot (assuming you try to use preg_match to get the site name before the .com).
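A minimal sketch of that check (the blocked host name is hypothetical):

<?php
// Reject anything that isn't plain http, or that points back at this site.
$parts = parse_url($url);
if ($parts === false
    || !isset($parts['scheme'], $parts['host'])
    || $parts['scheme'] !== 'http'
    || strcasecmp($parts['host'], 'www.mysite.com') === 0) {
    die('Invalid or disallowed URL');
}
?>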
Are you aware that you are creating an open HTTP proxy, which can be a really bad idea?
Do you even need to fetch the contents of the URL? Why don't you let your user's browser do that by supplying it with the URL?
Assuming you do need to fetch the URL, consider validating against a known "whitelist" of URLs. If you can't restrict it to a known list, then you are back to the open proxy again...
Use a regular expression (preg) to ensure it is a well-formed HTTP URL, and then use the cURL extension to do the actual request.
Mixing the fopen() family of functions with user-supplied parameters is a recipe for potential disaster.
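A minimal sketch of the whitelist-plus-cURL idea (the whitelist entries are hypothetical):

<?php
// Only fetch URLs from a known whitelist, and fetch them with cURL
// rather than fopen().
$whitelist = array('http://www.google.com/', 'http://www.php.net/');
if (in_array($url, $whitelist, true)) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // don't hang on slow hosts
    $text = curl_exec($ch);
    curl_close($ch);
}
?>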
You could use the PHP filter functions:
filter_var($url, FILTER_VALIDATE_URL) or
filter_input(INPUT_POST, 'url', FILTER_VALIDATE_URL);
http://php.net/manual/en/function.filter-input.php
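A minimal usage sketch (note that filter_input() returns null when the field is missing and false when it fails validation):

<?php
$url = filter_input(INPUT_POST, 'url', FILTER_VALIDATE_URL);
if ($url === null || $url === false || strpos($url, 'http://') !== 0) {
    die('Please supply a valid http:// URL');
}
?>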
Also try these documents, referenced by this PHP wiki post related to filter (by Yasuo Ohgaki):
https://wiki.php.net/rfc/add_validate_functions_to_filter?s[]=filter
https://www.securecoding.cert.org/confluence/display/seccode/Top+10+Secure+Coding+Practices
https://www.owasp.org/index.php/OWASP_Secure_Coding_Practices_-_Quick_Reference_Guide
http://cwe.mitre.org/top25/mitigations.html
