I am developing a website crawler using golang. When I crawl some websites, I get odd results: the root URL of some sites returns only a script tag, as shown below.
<script>window.location="index.php";</script>
The browser then redirects to the index.php page. Why do people use this approach to redirect users to the index page? Is there any security vulnerability with this approach? And how can I handle this situation in a crawler?
Well, if you want to hide a page by redirecting the user elsewhere, you cannot rely on this method: anyone can turn JavaScript off and see the original page, so using it for access control is a security risk. If you simply want to redirect for some other reason, it is fine.
As for your crawler, you can search the page source with a regex for redirections like that, but covering all cases can be very challenging.
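As a rough illustration of the regex approach, here is a minimal sketch in JavaScript. The function name findClientRedirect and the pattern list are assumptions made for this example, not part of any library. It only catches redirects whose target is a literal string; targets computed at runtime would need a real JavaScript engine. The patterns avoid backreferences, so they should port to Go's regexp package with little change.

```javascript
// Patterns for the most common literal-string client-side redirects.
// Computed targets (e.g. location = base + path) are deliberately not handled.
const REDIRECT_PATTERNS = [
  // window.location = "x", location.href = 'x', document.location = "x"
  /\blocation(?:\.href)?\s*=\s*["']([^"']+)["']/i,
  // location.replace("x") and location.assign("x")
  /\blocation\.(?:replace|assign)\(\s*["']([^"']+)["']\s*\)/i,
  // <meta http-equiv="refresh" content="0; URL=x">
  /<meta[^>]+http-equiv=["']?refresh["']?[^>]*content=["'][^"']*url=([^"'>\s]+)/i,
];

// Returns the absolute redirect target found in html, or null if none matched.
// pageUrl is the URL the html was fetched from; relative targets resolve against it.
function findClientRedirect(html, pageUrl) {
  for (const pattern of REDIRECT_PATTERNS) {
    const match = pattern.exec(html);
    if (!match) continue;
    try {
      return new URL(match[1].trim(), pageUrl).href;
    } catch (err) {
      return null; // the captured target was not a usable URL
    }
  }
  return null;
}
```

For the page in the question, findClientRedirect('<script>window.location="index.php";</script>', "http://example.com/") would give "http://example.com/index.php", which the crawler can then queue like any other link.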
I have http://mysite.com/index.php.
And a sub menu
home => http://mysite.com/index.php
about us => http://mysite.com/about.us.php
products => http://mysite.com/products.php
But I want http://mysite.com/index.php to process every request and just change the content using Ajax requests. This way the site only loads the content part, which is much faster and easier to navigate.
The problem here is SEO, because the only URL google will see is http://mysite.com/index.php and I would like to associate http://mysite.com/about-us to the About Us content, http://mysite.com/product to the Products content, etc.
I know I can do this with PHP just reading the URL and writing the Ajax on the fly, but doing so the whole page is going to be reloaded every time.
Is there a way to do this without reloading the whole page?
What I think I need is a regular anchor in the submenu, for example pointing to "http://mysite.com/contact-us", that processes the Ajax request when clicked instead of opening that page.
And if this is possible, Google will probably see it as black hat, right?
Regards
Alex
Here is a solution:
window.history.pushState(data, title, url)
Here Rob explains how it works, and you have a working example:
http://moz.com/blog/create-crawlable-link-friendly-ajax-websites-using-pushstate
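A minimal sketch of how pushState could be wired to the submenu, assuming links marked with a hypothetical data-ajax attribute, a container with the hypothetical id "content", and the /index.php?area=... endpoint used as an example elsewhere in this thread. The browser wiring is guarded so the helper functions can also be exercised outside a browser.

```javascript
// Hypothetical mapping from a clean path to the URL that returns its content.
function fragmentUrlFor(path) {
  const area = path.replace(/^\/+/, "") || "home";
  return "/index.php?area=" + encodeURIComponent(area);
}

// historyApi needs a pushState method (window.history in a browser);
// render is called with the path whose content should be shown.
function createNavigator(historyApi, render) {
  return {
    // Change the address bar without a page load, then show the content.
    go(path) {
      historyApi.pushState({ path: path }, "", path);
      render(path);
    },
    // Back/forward buttons: re-render whatever state the browser restored.
    onPopState(event) {
      if (event.state && event.state.path) render(event.state.path);
    },
  };
}

if (typeof window !== "undefined") {
  const nav = createNavigator(window.history, (path) => {
    fetch(fragmentUrlFor(path))
      .then((response) => response.text())
      .then((html) => {
        document.getElementById("content").innerHTML = html;
      });
  });
  document.addEventListener("click", (event) => {
    const link = event.target.closest("a[data-ajax]");
    if (!link) return;
    event.preventDefault(); // keep the normal href as a fallback for crawlers
    nav.go(link.getAttribute("href"));
  });
  window.addEventListener("popstate", (event) => nav.onPopState(event));
}
```

Because the anchors keep real href values, a crawler or a browser without JavaScript still follows them as ordinary links, so the server must also be able to answer those URLs directly.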
You can't change the URL in the address bar without changing the page. If you could, I could let you visit me at http://www.imhackingyou.com/sucker but change the address bar to read http://www.bankofamerica.com/login
This is a routing issue, not an AJAX issue.
If you were using another tool (cough ASP.NET MVC cough), you'd just add a route (and I'm hopeful there's a way to do this in PHP) that accepted URLs like
/home
/products
...
and routed them to, say,
/index.php?area=home
/index.php?area=products
This is typically accomplished with a rewrite engine when used outside of a good MVC or RESTful URL system. I use ISAPI Rewrite on IIS, but if you're working on the LAMP stack, I think Apache provides a module (mod_rewrite) with the same capabilities. (Google ".htaccess".)
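To make the mapping concrete, here is a small sketch of what such a route does, written as a plain JavaScript function purely for illustration. In practice the rewrite engine or framework router does this work on the server. The AREAS whitelist and the function name rewrite are assumptions for the example.

```javascript
// Hypothetical list of areas the site actually serves.
const AREAS = new Set(["home", "about-us", "products"]);

// Maps a clean request path to the internal URL that handles it,
// or returns null so the server can answer with a 404.
function rewrite(requestPath) {
  const area = requestPath.replace(/^\/+|\/+$/g, "") || "home";
  if (!AREAS.has(area)) return null;
  return "/index.php?area=" + encodeURIComponent(area);
}
```

Here rewrite("/products") gives "/index.php?area=products", and an unknown path such as "/nope" gives null rather than being passed through to index.php.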
WARNING: RANT FOLLOWS
And, for what it's worth,
Avoid trying to write your entire application in JavaScript. The server's there for a reason. Part of your job as a web developer is to absorb as much of the work onto your server as possible. Browser performance and compatibility issues will drive you mad when you try to do everything on the client.
Avoiding postbacks makes sense in a lot of circumstances, but it's not a silver bullet that you should try to apply to every page. Usually it makes sense to load a new page when a link is clicked. It's what the user expects, it's more stable (since most of the infrastructure required is server-side) and it's not slower than an AJAX request to retrieve the same thing.
Rules:
NEVER break the back button. Without careful planning, most AJAX apps break this rule.
See rule #1.
This sounds like it should be accomplished with a rewrite engine, but assuming that you have a good reason to use AJAX, you can change urls with javascript by modifying the portion after the hash, or better yet, the hashbang:
window.location.hash = "#!about-us";
http://mysite.com/
http://mysite.com/#!about-us
http://mysite.com/#!products
For more info on the hashbang from an SEO perspective, check out http://www.seomoz.org/blog/how-to-allow-google-to-crawl-ajax-content
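A minimal sketch of the client side of this approach, assuming a container with the hypothetical id "content" and the /index.php?area=... endpoint used as an example elsewhere in this thread. The function name routeFromHash is also just for illustration.

```javascript
// Extracts the route from a hashbang fragment such as "#!about-us".
// Anything that is not a hashbang (empty hash, plain "#anchor") maps to "home".
function routeFromHash(hash) {
  const match = /^#!\/?(.*)$/.exec(hash || "");
  return match && match[1] ? match[1] : "home";
}

if (typeof window !== "undefined") {
  const show = () => {
    const route = routeFromHash(window.location.hash);
    fetch("/index.php?area=" + encodeURIComponent(route))
      .then((response) => response.text())
      .then((html) => {
        document.getElementById("content").innerHTML = html;
      });
  };
  // Fires on link clicks that change the hash and on back/forward navigation.
  window.addEventListener("hashchange", show);
  show(); // handle a deep link such as http://mysite.com/#!products on first load
}
```

Listening for hashchange rather than handling clicks directly means the back button keeps working, since the browser records each hash change in its history.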
How does Shopify do it then? Go to their website, click on the Features link and you'll see the URL says:
http://www.shopify.com/tour/sell-online
Then click on any of the sub links and you'll see that the address in the URL bar changes without using a hash, yet there is no page flip.
I don't think they are using ajax to change the content because it all appears to be included in hidden divs on the page, but regardless, you can apparently change the URL using client side tricks.
I am building a shopping cart front and am at an impasse with this problem:
The secure side of the shopping cart is hosted on another site, and I need to be able to get access to their PHP GET variables that are placed in two places:
In the url
In the page itself under a meta tag.
The only problem, as I was going to do this with an iframe src, is that the variables are generated after a page redirect from the url supplied. For instance, I give the browser:
https://www.domain.com/file.php
and the page will redirect to:
https://www.domain.com/file2.php?CFTOKEN=231351332&CFID=23423235
I need to get those two PHP GET variables, but I cannot do it through an iframe because of the same-origin policy (Access-Control-Allow-Origin). I also cannot do it by fetching the source code through fopen, file_get_contents, or cURL, because the request times out.
How can I do this? It doesn't seem like it would be that difficult.
If I could get access to the URL or the source code I could accomplish this.
If you really have access to both sides, you can use web messaging: after your iframe has loaded, you can ask it for the variables ;)
A simple example of code is here http://dev.opera.com/articles/view/window-postmessage-messagechannel/
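A rough sketch of that exchange, assuming you can add a script to the framed page. The origin https://shop.example.com and the message name "get-session-vars" are placeholders invented for this example; CFTOKEN and CFID are the variable names from the question.

```javascript
// Framed (secure cart) side: answer requests, but only from the trusted shop origin.
function makeResponder(trustedOrigin, getVars) {
  return function onMessage(event) {
    if (event.origin !== trustedOrigin) return;     // ignore other sites
    if (event.data !== "get-session-vars") return;  // placeholder message name
    event.source.postMessage(getVars(), event.origin);
  };
}

// Reads the two variables from a query string such as "?CFTOKEN=1&CFID=2".
function varsFromQuery(search) {
  const params = new URLSearchParams(search);
  return { CFTOKEN: params.get("CFTOKEN"), CFID: params.get("CFID") };
}

// Parent (shop) side: ask the frame once it has loaded.
function askFrameForVars(frameWindow, cartOrigin) {
  frameWindow.postMessage("get-session-vars", cartOrigin);
}

// Wiring for the framed page; it only runs in a browser, inside a frame.
if (typeof window !== "undefined" && window.parent !== window) {
  window.addEventListener(
    "message",
    makeResponder("https://shop.example.com", () =>
      varsFromQuery(window.location.search)
    )
  );
}
```

The parent page would call askFrameForVars(iframe.contentWindow, "https://www.domain.com") from the iframe's load event and read the reply in its own "message" listener, again checking event.origin before trusting the data.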
I am looking to create an SEO friendly URL after a filter has been submitted in an html form.
After playing around for a while I have found a way; however, I was wondering what other people think of it, as I'm new to development.
I have added some rewrite rules in the .htaccess file to make the urls more friendly. Examples below:
Original URL:
site-name/list.php?brand=brand1&min-price=0&max-price=2000
URL after rewrite:
site-name/section/brand1/0-2000
Currently I have the form that submits the information to a separate php page which collects the variables and creates a new url from it which then redirects with a 301. Example of php below:
$min = $_GET['min-price'];
$max = $_GET['max-price'];
$brand = $_GET['brand'];
header ('HTTP/1.1 301 Moved Permanently');
header('Location: http://site-name/section/' . $brand . '/'. $min.'-'.$max );
exit();
As you can see it collects the info and takes you back to the page and declares the previous page has permanently moved.
Questions:
Although this may be quite primitive, will it still be OK to use without causing too much trouble?
Will Google hate me for creating so many 301s?
I just noticed the code header("Location: /foo.php",TRUE,301); would it be best to use this, or is there no difference?
Yes, I see no issues with your solution. Even if malicious user input was given it would just redirect to a non-existing page.
I don't think so. You already use the right code, 301, instead of the default 302, which might cause some trouble (it did create some havoc with regard to Google, stolen PR and SEO).
Using header("Location:...", true, 301); is advisable. This way PHP can make decisions automatically based on the environment. E.g. on an HTTP/1.0 connection PHP could send the 301 code with HTTP/1.0 instead of the fixed HTTP/1.1 in your solution. But still, either way is fine.
But one question: why don't you link directly to your nice URL? mod_rewrite which you are using would then already take care of assigning the parameters given with the URL to variables that you could access via $_GET as usual.
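One way to go straight to the nice URL is to build it in the browser when the form is submitted, keeping the PHP redirect as the fallback for visitors without JavaScript. This is only a sketch: the form id "filter-form" and the function name friendlyFilterUrl are assumptions, while the field names and the /section/brand/min-max shape come from the question.

```javascript
// Builds "/section/<brand>/<min>-<max>" or returns null when the input is unusable,
// in which case the form should submit normally and let the server decide.
function friendlyFilterUrl(brand, minPrice, maxPrice) {
  const min = Number.parseInt(minPrice, 10);
  const max = Number.parseInt(maxPrice, 10);
  const brandOk = typeof brand === "string" && /^[a-z0-9-]+$/i.test(brand);
  if (!brandOk || Number.isNaN(min) || Number.isNaN(max) || min < 0 || min > max) {
    return null;
  }
  return "/section/" + brand + "/" + min + "-" + max;
}

if (typeof document !== "undefined") {
  const form = document.getElementById("filter-form");
  if (form) {
    form.addEventListener("submit", (event) => {
      const data = new FormData(form);
      const url = friendlyFilterUrl(
        data.get("brand"),
        data.get("min-price"),
        data.get("max-price")
      );
      if (url) {
        event.preventDefault(); // skip the intermediate redirect page
        window.location.assign(url);
      }
    });
  }
}
```

The brand check is intentionally strict, so a value like "../x" never ends up in the path; whether that character set fits your brand names is something to adjust.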
The way you've done things is a permutation of the post redirect get (PRG) pattern. In your case, it's Get redirect get.
What PRG normally means is:
User POSTs the form to a controller, which does whatever you want and builds the desired URL
Script REDIRECTs the user to the desired URL
User GETs the url, sees the result.
Generally speaking, it's a good pattern to follow, it allows you the control to - e.g. - remove default values from the resultant search url.
Regarding your specific questions
If it works, it's fine
Google won't be submitting forms, so it won't be finding the pre-processed urls to get 301-redirected unless those urls exist somewhere else already. However, you need to be linking to "http://site-name/section/brand/min-max" for any search engine to know the pages exist.
Either is fine, one line of code is easier to type though
Search engines do not submit forms.
Thus, no form action has to be friendly. (And, as another consequence: no, Google won't hate you.)
There is just no point in doing an additional redirect instead of displaying the form results.
I am building an AJAX deep-linked site.
I want PHP to return the full HTML of the page if the visitor is a search crawler or uses a browser without JavaScript support.
Conversely, when the visitor's browser supports JavaScript, I want PHP to return only the template code and let JavaScript (AJAX) take care of the rest. In other words, PHP loads only the design elements and JavaScript populates them with content.
I looked into PHP's get_browser() function, but it does not seem to be a reliable tool. What is the industry practice for detecting, in PHP, whether the browser supports JavaScript or whether the visitor is a search crawler?
Background:
Why I want the site to have this behavior.
Since I want the home page to load just from the address example.com, which sends no query to PHP, PHP returns the HTML of the home page. This causes a problem when the user loads example.com#foo: PHP returns the home page, and only once it has loaded does JavaScript (AJAX) swap the content to what #foo should show. The user briefly sees the home page, so the load is slower and the experience is worse. If my PHP script could tell that a JavaScript-capable browser is requesting the page, it could return only the site template (with no content) and let JavaScript populate it with whatever belongs to #foo. If instead a browser without JavaScript support or a crawler requests example.com#foo, the home page would be returned.
I am using SWFaddress (http://www.asual.com/swfaddress/) library for the deep-linking.
Edit
Thank you guys. I did not think of using <noscript></noscript> before.
Here is what I decided to do. PHP by default will load pages such as example.com or example.com#foo (which is essentially the same as example.com from PHP's point of view since fragments by definition are not sent to the server) blank (just visual template) with <noscript> tag inside for the content of the home page. This way users with javascript will not see the home page and AJAX will populate the content of the page according to the #foo fragment. On the other hand, search crawlers and users without javascript will see a home page.
Thank you again. I think this is a pretty simple and elegant solution. If you have any further suggestions, please post a comment or another answer.
You can't do this using PHP. What you can do though is use a noscript tag to redirect to another php page if they don't have javascript:
<noscript>
<meta http-equiv="refresh" content="0; URL=nojavascript.php">
</noscript>
It's not possible to accomplish this in the way you're trying to do it.
It's rare that someone has JS turned off and doesn't know it.
PHP doesn't get passed anything after #, only javascript can do anything with that. So even if PHP could determine if the browser has javascript turned on then it still couldn't read # anyways.
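A quick way to see this is to split an address the way a browser does. The sketch below uses the standard URL class; the function name requestTarget is made up for the example. The fragment is kept on the client, and only the path and query string go into the HTTP request.

```javascript
// Returns the part of an address that is actually sent to the server.
function requestTarget(address) {
  const url = new URL(address);
  return url.pathname + url.search; // url.hash is never transmitted
}

const example = new URL("http://example.com/page.php?a=1#foo");
console.log(example.hash);                            // "#foo" (client only)
console.log(requestTarget("http://example.com/page.php?a=1#foo")); // "/page.php?a=1"
console.log(requestTarget("http://example.com/#foo"));             // "/"
```

So example.com and example.com#foo produce the same request, and nothing PHP can inspect distinguishes them.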
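You could include a link inside a <NOSCRIPT> tag that points the user to something like example.com?javascript=disabled#foo (the query string has to come before the #, otherwise it is part of the fragment and never reaches the server).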
Unfortunately, browsers do not report whether JS is enabled or not, so there's no way to know from a simple HTTP GET whether or not you should send JS reliant pages.
You should just build an AJAX query that sets a session variable for javascript enabled.
Run this AJAX query before any other information on the site is loaded and then do a simple redirect to the actual site.
You could do something like this pseudo code:
index.php:
ajax(check_js.php);      // only runs when JavaScript is enabled
redirect(main_page.php); // do this after the AJAX call has completed
check_js.php:
session_start();
$_SESSION['js_enable'] = true;
main_page.php:
session_start();
if (!empty($_SESSION['js_enable'])) {
//execute page
} else {
header("Location: no_js_error.php");
exit;
}
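The browser side of index.php could look roughly like the sketch below. The function name flagJsThenContinue is an assumption, and the two dependencies are passed in so the ordering is explicit: the redirect only happens after the request to check_js.php has finished. Visitors without JavaScript never run this script, so index.php would still need a <noscript> link or meta refresh to send them on.

```javascript
// fetchFn: a fetch-compatible function; redirect: called with the next URL.
async function flagJsThenContinue(fetchFn, redirect) {
  try {
    // Same-origin credentials so the PHP session cookie is sent and stored.
    await fetchFn("check_js.php", { credentials: "same-origin" });
  } catch (err) {
    // The flag was not set; main_page.php will fall back to its no-JS branch.
  }
  redirect("main_page.php");
}

if (typeof window !== "undefined") {
  flagJsThenContinue(window.fetch.bind(window), (url) =>
    window.location.replace(url)
  );
}
```

Using location.replace keeps the intermediate index.php out of the history, so the back button does not bounce the user through the check again.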
Instead of the server trying to sniff out the user's settings, how about using unobtrusive JavaScript in the first place? This way, the page will degrade gracefully (to the desired state) if JS is not available.
This is a noob question, I believe. In a content management system, and in several other types of sites that work on submissions (a URL shortening website, for instance), once you submit a URL, how do you use PHP to redirect to the appropriate URL without a 404 and without using an htaccess file?
Based on the simple URL shortening scripts I've found online, an htaccess file is always used to redirect 404s to a PHP file, which processes the URL and goes to the specific page. How do you do this without an htaccess file?
Another example would be any blog software: once you submit a post, going to its specific URL retrieves the appropriate post without the use of an htaccess file.
I hope I'm being clear, thanks.
You are talking about two different concepts here. One is "url rewriting" the other is "redirection".
URL rewriting is the process of transforming one URL into another, and it may or may not involve a redirect. It happens server-side, before PHP kicks in; in fact, PHP is not aware of it at all. It is configured with htaccess directives. The usual result is that a clean, nested URL is mapped onto a simple script URL with a query string.
For example: /blog/2010/10/30 rewritten to blog.php?year=2010&month=10&day=30
This is a beautification, in the sense that PHP responds to the second URL, and you could skip entirely the url rewriting, which is just for the sake of search engines and URL usability.
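To illustrate the transformation itself, here is the same rule written as a small JavaScript function. This is only a sketch of what a rewrite rule does with a pattern and its captured groups; the function name rewriteBlogUrl is invented, and on a real server the rewrite engine applies the rule before PHP runs.

```javascript
// Matches /blog/YYYY/MM/DD with an optional trailing slash.
const BLOG_DATE = /^\/blog\/(\d{4})\/(\d{2})\/(\d{2})\/?$/;

// Returns the internal script URL for a pretty blog URL, or null if it does not match.
function rewriteBlogUrl(path) {
  const match = BLOG_DATE.exec(path);
  if (!match) return null;
  const [, year, month, day] = match;
  return `blog.php?year=${year}&month=${month}&day=${day}`;
}

console.log(rewriteBlogUrl("/blog/2010/10/30")); // "blog.php?year=2010&month=10&day=30"
```

The visitor's address bar keeps showing /blog/2010/10/30; only the server sees the blog.php form, which is why no redirect and no 404 are involved.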
All of this happens before PHP starts. PHP can then perform its own redirections, either with a call to header("Location: ..."), or through JavaScript or an HTML meta refresh tag in the page it outputs.
None of this involves any 404.