What "version" of a php page would crawlers see? - php

I am considering building a website using PHP to deliver different HTML depending on the browser and version. A question that came to mind was: which version would crawlers see? And if the content were made different for each version, how would it be indexed?

The crawlers see the page you show them.
See this answer for info on how Googlebot identifies itself. Also remember that if you show different content to the bot than what users see, your page might be excluded from Google's search results.
As a side note, in most cases it's really not necessary to build separate HTML for different browsers, so it might be best to rethink that strategy altogether, which would solve the search engine indexing issue as well.

The crawlers would see the page that you have specified for them to see via your user-agent handling.
Your idea seems to suggest trying to trick the indexer somehow; don't do that.

You'd use the User-Agent HTTP header, which browsers send with their requests, to identify the browsers/versions that interest you, and serve content that differs in some cases.
So the crawlers would receive the content you'd send for their specific User-Agent string, or, if you don't code a specific case for them, your default content.
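As a rough illustration, such a dispatch might look like the sketch below; the detection patterns and template names are simplified assumptions, not a robust detection scheme.

    <?php
    // Minimal sketch of User-Agent-based dispatch.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    if (stripos($ua, 'MSIE 6') !== false) {
        $template = 'page-ie6.html';       // special case for an old browser
    } elseif (stripos($ua, 'Firefox') !== false) {
        $template = 'page-firefox.html';
    } else {
        // Crawlers (Googlebot, Bingbot, ...) send their own UA strings, so
        // unless you add cases for them they fall through to this default.
        $template = 'page-default.html';
    }

    readfile($template);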
Still, note that Google doesn't really appreciate you sending it content that is not the same as what real users get (and if someone using one browser sends a link to a friend who sees something different because they use another browser, that won't feel "right").
Basically: sending content that differs by browser is not good practice, and should in most (if not all) cases be avoided.

That depends on what content you'll serve to bots. Crawlers usually identify themselves as some bot or other in the user agent header, not as a regular browser. Whatever you serve these clients is what they'll index.

The crawler obviously only sees the version your server hands to it.

If you create a designated version for the search engine, that version would be indexed (and may eventually get you banned from the index).
If you have a version for the default/undetected browser, that one.
If you have no default version, nothing would be indexed.
Sincerely yours, Colonel Obvious.
P.S. Assuming you are talking about content, not markup. Search engines do not index markup.

Related

How to check for the mobile version of a website?

I would like to check whether a mobile version exists for a specific website or not. To my understanding, we cannot be sure that every website has a mobile version located at http://m.example.com/, therefore I am testing through a cURL request. Here is how I am doing it:
* I send mobile browser headers in the cURL request; this returns the contents of the requested URL.
* If the site has a mobile version, it would return the contents of the mobile version.
* I then check whether the content includes the @media keyword; if it exists, I assume the site has a mobile version.
The problem is, if its CSS loads externally, then I will have to send further cURL requests for the CSS files as well, which will make it even slower. Is there any specific solution to my problem, or a way to speed this process up a bit?
Any help would be appreciated. Thanks.
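For reference, the approach described above boils down to something like the following sketch; the iPhone user-agent string and the URL are just examples, and the heuristic remains as weak as described.

    <?php
    // Fetch a page with a mobile User-Agent and look for an @media rule
    // in the returned markup. External stylesheets are not checked.
    $ch = curl_init('http://example.com/');
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) '
                                . 'AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10A403',
    ));
    $html = curl_exec($ch);
    curl_close($ch);

    $looksMobile = ($html !== false) && (stripos($html, '@media') !== false);
    var_dump($looksMobile);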
The problem with your approach, which smells a bit like an XY Problem, is that it is simply unreliable.
A website has many ways of implementing a mobile version, which include:
1. Using CSS media queries
The problem with this method is twofold. For starters, you would have to scan every single CSS file and <link> declaration. Secondly, the site can dynamically introduce stylesheets to the page using JavaScript, which you will never see using cURL because it lacks a JavaScript parser.
2. Browser sniffing using (client side) JavaScript, or screen width sniffing using JavaScript
Again, this JavaScript will never get executed, so you will never see that result.
3. Browser sniffing using server side code
Well, I guess you could try to use a mobile user-agent string with your cURL request, and see where that takes you, but all of these methods are hackish and unreliable.
4. The page could be mobile friendly from the get-go (credit to @Quentin)
As @Quentin mentioned in the comments, the page could be mobile friendly without any additional checks on the client/server side (responsive design without media queries, by simply using percentage-based values, for example).

Should I set up a custom user-agent for SimplePie?

I'm creating a feed aggregator. I will be crawling blogs and checking, sometimes every hour or every two hours, to see if they have new posts. I am using SimplePie for this.
I want to know whether I should change the custom user-agent that SimplePie has (SIMPLEPIE_USERAGENT). Also, if I should change it, what are best practices for user-agents? Thanks!
Yes, you should, otherwise they might start complaining about it to the SimplePie maintainer (i.e. me :) ). Using a custom useragent lets them know who to contact if something breaks.
The ideal format is "Your Program Name/1.0" where 1.0 is the version. You can also include URLs (put a + in front of them if you do so) and contact addresses, making it "Your Program Name/1.0 (+http://example.com/)"
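For example, a minimal SimplePie setup using that format might look like this; the program name and URL are placeholders, and set_useragent() is the relevant call:

    <?php
    require_once 'simplepie.inc';

    $feed = new SimplePie();
    $feed->set_feed_url('http://example.com/feed/');
    // "Name/version (+URL)" as suggested above; name and URL are placeholders.
    $feed->set_useragent('MyAggregator/1.0 (+http://example.com/bot.html)');
    $feed->init();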
Should you change it? Well, that depends on what you're doing. Some sites will block you based on the UA. That's their right.
If you're trying to scrape data and don't care about obeying rules, then you can change it to whatever you want.
Best practice is to identify yourself and obey robots.txt
I always put the name of my app as the user agent, that way server admins can contact me if my script ever causes problems with their server. (Which is the only reason anybody would care)
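On the robots.txt point, a naive sketch of such a check might look like this; real parsers handle per-agent groups, Allow rules, and wildcards, while this only honours "User-agent: *" Disallow lines:

    <?php
    // Naive check: is $path disallowed for all agents ("*") by robots.txt?
    function is_disallowed($base, $path) {
        $robots = @file_get_contents($base . '/robots.txt');
        if ($robots === false) return false; // no robots.txt: assume allowed
        $applies = false;
        foreach (preg_split('/\r?\n/', $robots) as $line) {
            $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
            if (stripos($line, 'User-agent:') === 0) {
                $applies = trim(substr($line, 11)) === '*';
            } elseif ($applies && stripos($line, 'Disallow:') === 0) {
                $rule = trim(substr($line, 9));
                if ($rule !== '' && strpos($path, $rule) === 0) return true;
            }
        }
        return false;
    }

    var_dump(is_disallowed('http://example.com', '/private/page.html'));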

So the user agent can be faked... OK... is there a valid reason why I shouldn't use PHP to detect the browser?

I have never understood why some people say making custom CSS for each browser is a bad thing. To keep my page size down and download times fast, it makes perfect sense to me to make a custom CSS file for each of the major browsers (especially IE in its many different forms), and a general catch-all CSS file for everything else.
If you want to send out a bloated, huge Swiss army knife of the CSS world for all situations, then go right ahead; I'm not going to stop you.
Fast detection of the browser is important when doing this. Loading a JavaScript file to detect the browser seems slow, so I would prefer to use PHP to detect the browser and send out the appropriate CSS, or at least a general browser-specific CSS file, then use JavaScript to load a more detailed version.
But I've read article after article about why this is a bad thing. The main reason behind each of these articles is that the user agent can be faked, or that someone is using Firefox but the server thinks they're using IE7 and sends out the wrong CSS file.
As a developer/designer of web apps, why is this my problem? If you want to use Firefox but tell my server you're using Safari or IE*, and you get a crappy-looking page, why is it my problem?
And don't throw that whole "if the user can't see your site right they'll never come back" argument, or anything similar, at me. A normal user isn't going to be doing this; it's only going to be people who know how to do this, and they will know what's wrong when my site looks crappy.
This is similar to looking at my site on an old Apple II (I have no clue how) and yelling at me because everything looks green.
So is there a good reason, not a personal preference, why I shouldn't use PHP to detect the browser and send out customized CSS files?
I do this mostly for the different versions of IE. For some sites, adding the if-IE6 and if-IE7 parts just doubles or triples the size of the CSS file.
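For concreteness, a minimal sketch of the approach in question might look like this; the patterns and file names are illustrative assumptions, and a spoofed UA would simply receive the default stylesheet:

    <?php
    // stylesheet.php - picks a CSS file based on the User-Agent.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    if (strpos($ua, 'MSIE 6') !== false) {
        $css = 'ie6.css';
    } elseif (strpos($ua, 'MSIE 7') !== false) {
        $css = 'ie7.css';
    } else {
        $css = 'base.css';   // the general catch-all
    }

    header('Content-Type: text/css');
    readfile($css);

The page would then load it with something like <link rel="stylesheet" type="text/css" href="stylesheet.php">.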
Typically, when a user intentionally fakes the user agent string, it is because something that should be viewable in the user's browser is not. For example, some sites may restrict users to IE or Firefox, but the user is using Iceweasel on Debian. Iceweasel is just Firefox renamed for trademark reasons (there are a few other changes as well), so there is no reason the site should not work.
Realize that this happens because of (bad) browser detection, not despite it. I would say you don't need to be terribly concerned about this issue. Further, if you can just make your site reasonably cross-browser compatible, it won't matter at all. If you really want to use browser-specific CSS, and you don't want to do so all in one CSS file, don't let a fake user agent stop you.
As long as the only thing you're doing is changing style sheets, there is no valid reason as far as I can tell. If you're attempting to deliver custom security measures by browser, then you'll have issues.
Not sure about PHP, but in Rails it is normal and dead simple practice to provide CSS files and different layouts based on the user agent, particularly considering that your site is just as likely to be accessed by any of the myriad available mobile devices as by the most popular browsers (currently Firefox). Even serving custom MIME types, if need be, is dead simple.
IMO, not doing so is pure laziness on the coder's part, but then not all sites are developed by professional teams of developers with styling gurus at hand. Also, in languages other than Rails it might not be so simple. Sorry, I haven't a clue about PHP, so this may not be an appropriate reply.
In my opinion, starting with normalize.css and a base style sheet, then overriding the base styles as needed, usually works, along with making sure you set appropriate fallbacks. If you really need it, a few media queries and feature detection can go a long way.
One reason you shouldn't base things on the browser is that major search engines like Google and Yahoo prohibit cloaking: serving their crawlers content that differs from what users see. Googlebot can detect different CSS and HTML, and you may get bad search positioning. Additionally, if you use any advertising services, you may be in breach of their contract by displaying varying content.

Can a PHP script that blocks old browsers from accessing a website also block search engine spiders?

I was looking for a way to block old browsers from accessing the contents of a page, because the page isn't compatible with old browsers like IE 6.0, and to return a message saying that the browser is outdated and an upgrade is needed to see that webpage.
I know a bit of PHP, and writing a little script that serves this purpose isn't hard. I was just about to start doing it when a huge question popped into my mind.
If I write a PHP script that blocks browsers based on their name and version, is it possible that it may block some search engine spiders or something?
I was thinking about doing the browser identification via this function: http://php.net/manual/en/function.get-browser.php
A crawler will probably be identified as a crawler, but is it possible that a crawler supplies some kind of browser name and version?
If nobody has tested this stuff before or played a bit with this kind of function, I will probably not risk it; or I will make a test folder inside a website to see if the pages there get indexed, and if not, I'll abandon this idea or try to modify it in a way that works. But to save myself the trouble, I figured it would be best to ask around, especially because I didn't find this info after a lot of searching.
No, it shouldn't affect any of the major crawlers. get_browser() relies on the User-Agent string sent with the request, and thus it shouldn't be a problem for crawlers, which use custom user-agent strings (e.g. Google's spiders have "Google" in their names).
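A minimal sketch of such a check might look like this, assuming a full browscap.ini is configured in php.ini (the 'crawler' flag comes from the browscap data, and the IE 6 cutoff is just an example):

    <?php
    $info = get_browser(null, true); // info for the current request's UA

    $isCrawler = !empty($info['crawler']);
    $isOldIE   = isset($info['browser'], $info['version'])
               && $info['browser'] === 'IE'
               && (float) $info['version'] <= 6;

    if (!$isCrawler && $isOldIE) {
        // Block only real, identified old browsers - never crawlers.
        echo 'Your browser is outdated. Please upgrade to view this page.';
        exit;
    }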
Now, I personally think it's a bit unfriendly to completely block a website for someone on IE. I'd just put a red banner at the top saying "This site might not function correctly. Please update your browser or get a new one," or something to that effect.

How to disable or encrypt "View Source" for my site

Is there any way to disable or encrypt "View Source" for my site so that I can secure my code?
Fero,
Your question doesn't make much sense. "View Source" shows the HTML source—if you encrypt that, the user (and the browser) won't be able to read your content anymore.
If you want to protect your PHP source, then there are tools like Zend Guard. It would encrypt your source code and make it hard to reverse engineer.
If you want to protect your JavaScript, you can minify it with, for example, YUI Compressor. It won't prevent the user from using your code since, like the user, the browser needs to be able to read the code somehow, but at least it would make the task more difficult.
If you are more worried about user privacy, you should use SSL to make sure the sensitive information is encrypted when on the wire.
Finally, it is technically possible to encrypt the content of a page and use JavaScript to decrypt it, but since this relies on JavaScript, an experienced user could defeat this in a couple of minutes. Plus all these problems would appear:
Search engines won't be able to index your pages...
Users with JavaScript disabled would see the encrypted page
It could perform really poorly depending on the amount of content you have
So I don't advise you to use this solution.
You can't really disable that, because the browser will always need to read and parse the source in order to render the page.
If there is something so important in your source code, I recommend you keep it on the server side.
Even if you encrypt or obfuscate your HTML source, anyone can still evaluate and view it. Using Firebug, for instance, we can see the source code no matter what.
If you are selling PHP software, you can consider Software as a Service (SaaS).
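Building on the server-side point above, a tiny hypothetical sketch of the idea: the browser calls an endpoint and gets a result, while the logic itself never leaves the server. The endpoint name and pricing rule here are made up for illustration.

    <?php
    // price.php - the pricing rule stays on the server; the client
    // only ever receives the computed result.
    $qty = isset($_GET['qty']) ? (int) $_GET['qty'] : 0;
    $price = $qty * 9.99 * ($qty >= 10 ? 0.9 : 1.0); // "secret" bulk discount
    header('Content-Type: application/json');
    echo json_encode(array('price' => round($price, 2)));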
So you want to encrypt your HTML source. You can encrypt it using some JavaScript tool, but beware that if the user is smart enough, he will always be able to decrypt it by doing the same thing the browser does: running the JavaScript and viewing the generated HTML.
EDIT: See this HTML scrambler as an example on how to encrypt it:
http://www.voormedia.com/en/tools/html-obfuscate-scrambler.php
EDIT 2: And... see this one for how to decrypt it :)
http://www.gooby.ca/decrypt/
Short answer: no. HTML is an open text format; whatever you do, if the page renders, people will be able to see your source code. You can use JavaScript to disable the right click, which will work in some browsers, but anyone wanting to use your code will know how to avoid this. You can also have JavaScript emit the HTML after storing it encoded, but this has bad impacts on development, accessibility, and load speed. After all that, anyone with Firebug installed will still be able to see your HTML code.
There is also rarely a lot of value in your HTML; your real IP (intellectual property) is in your server code, which stays safe and sound on your server.
This is fundamentally impossible. As (almost) everybody has said, the web browser of your user needs to be able to read your html and Javascript, and browsers exist to serve their users -- not you.
What this means is that no matter what you do there is eventually going to be something on a user's machine that looks like:
<html>
<body>
<div id="my secret page layout trick"> ...
</div>
</body>
</html>
because otherwise there is nothing to show the user. If that exists on the client side, then you have lost control of it. Even if you managed to convince every browser-maker on the planet not to make that available through a "view source" option -- which is, you know, unlikely -- the text will still exist on that user's machine, and somebody will figure out how to get to it. And that convincing will never happen anyway; browsers will always exist to serve their users before all others. (Hopefully.)
The same thing is true for all of your Javascript. Let me say it again: nothing that you send to a user is secure or secret from that user. The encryption via Javascript hack is stupid and cannot work in any meaningful sense.
(Well, actually, Flash and Silverlight ship binaries, but I don't think that they're encrypted. So they are at the least irritating to get data out of.)
As others have said, the only way to keep something secret from your users is to not give it to them: put the logic in your server and make sure that it is never sent. For example, all of the code that you write in PHP (or Python/Ruby/Perl/Java/C...) should never be seen by your users. This is e.g. why Google still has a business. What they give you is fundamentally uninteresting compared to what they never send to you. And, because they realize this, they try to make most things that they send you as open and as useful as possible. Because it's the infrastructure -- the terabyte-huge maps database and pathfinding software, as opposed to the snazzy map that you can click and drag -- that you are trading your privacy for.
Another example: I'm not sure if you remember how many tricks people employed in the early days of the web to try and keep people from saving images to disk. When was the last time you ran across one of those? Know why? Because once data is on your user's machine, she controls it. Not you.
So, in short: if you want to keep something secret from your user, don't give it to her.
You can't. The browser needs the source to render the page. If the user wishes, the browser can show them the source. Firefox can also show you the DOM of the page. You can obfuscate the source, but you cannot encrypt it or lock the user out.
Also, why would you want this? It seems like a lame-ass thing to do :P
I don't think there is a way to do this. Because if you encrypt it, how will the browser understand the HTML?
No. Browsers offer no way for the HTML/JavaScript to disable that feature (thankfully). Plus, even if you could, the HTML is still transmitted in plain text, ready for an HTTP sniffer to read.
The best you could do would be to somehow obscure the HTML/JavaScript to make it hard to read. But then debuggers like Firebug and IE 8's debugger will reconstruct it from the DOM, making it easy to read.
You can, in fact, disable the right-click function. It is useless to do so, however, as most browsers now have built-in inspector tools which show the source anyway. Not to mention that other workarounds (such as saving the page and then opening the source, or simply using hotkeys) exist for viewing the HTML source. Tutorials for disabling the right-click function abound across the web, so a quick Google search will point you in the right direction if you feel an overwhelming urge to waste your time.
There is no foolproof way.
But you can fool many people with a simple hack using the methods below:
"window.history.pushState()" and
adding oncontextmenu="return false" to the body tag as an attribute
Detail here - http://freelancer.usercv.com/blog/28/hide-website-source-code-in-view-source-using-stupid-one-line-chinese-hack-code
You can also use “javascript obfuscation” to further complicate things, but it won’t hide it completely.
“Inspect Element” can reveal everything beyond view-source.
Yes, you can have your whole website rendered dynamically via JavaScript that is encrypted/packed/obfuscated like there is no tomorrow.
