There are lots of web pages that simply run a script without having any real content on them.
Is there any way of seeing the page source without actually visiting the page, since it just redirects you?
Will using an HTML parser work for this? I'm using simpleHTMLdom to parse the page.
In Firefox you can use the view-source protocol to view only the source code of a site without actually rendering it or executing any JavaScript on it.
Example: view-source:http://stackoverflow.com/q/5781021/298479 (copy it to your address bar)
Yes, simply parsing the HTML will get you the client-side (JavaScript) code.
When these pages are accessed through a browser, the browser runs the code and follows the redirect, but when you fetch them with a scraper or your own program, the code is not run and you get the static script back as text.
Of course you can't access the server-side (PHP) code. That's impossible.
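For example (a rough sketch, assuming allow_url_fopen is enabled and the simpleHTMLdom library from the question is available; the URL is a placeholder), you can fetch and inspect the raw markup like this:
<?php
// Fetch the raw markup; no JavaScript is executed, so a client-side redirect
// script comes back as plain text instead of redirecting you.
$source = file_get_contents('http://example.com/page-that-redirects');

// Optionally inspect it with simpleHTMLdom (str_get_html() is its parser entry point).
include 'simple_html_dom.php';
$dom = str_get_html($source);
foreach ($dom->find('script') as $script) {
    echo $script->outertext, "\n";   // print each script block found in the page
}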
If you need a quick & dirty fix, you could disable JavaScript and meta redirects (Internet Explorer can disable these in the Internet Options dialog; Firefox can use the NoScript add-on for the same effect).
This won't stop any server-side redirects, but it will prevent client-side redirects and let you see the document's HTML source.
The only way to get the page's HTML source is to send an HTTP request to the web server and receive its answer, which is effectively the same as visiting the page.
If you're on a *nix based operating system, try using curl from the terminal.
curl http://www.google.com
wget or lynx will also work well if you have access to a command-line Linux shell:
wget http://myurl
lynx -dump http://myurl
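If you would rather do the same thing from PHP instead of the shell, the cURL extension returns the raw source as well; a minimal sketch (the URL is a placeholder):
<?php
// Fetch the raw HTML without executing any JavaScript on it.
$ch = curl_init('http://example.com/page-that-redirects');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);  // don't follow server-side Location redirects either
$source = curl_exec($ch);
curl_close($ch);
echo $source;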
If you are trying to HTML-scrape the contents of a page that builds 90%+ of its content/view by executing JavaScript, you are going to encounter issues unless you render the page (even to a hidden screen) and then scrape that. Otherwise you'll end up scraping a few script tags, which does you little good.
e.g. if I try to scrape my Gmail inbox page, it is an empty HTML page with just a few scattered script tags (likely typical of almost all GWT-based apps).
Does the page/site you are scraping have an API? If not, is it worth asking them if they have one in the works?
Typically these types of tools run along a fine line between "stealing" information and "sharing" information, so you may need to tread lightly.
Related
Is it possible to hide the .php file on the server...?
I have a website which sometimes calls PHP files inside iframes. I wouldn't like it if somebody copied that code, so how would I hide it?
Or do I have to encrypt it?
Speed is a huge matter in my case, so anything that doesn't affect performance is appreciated!
Thanks
With a correctly configured web server, the PHP code isn't visible to your website visitors. For the PHP code to be accessible by people who visit your website, the server would have to be configured to display it as text instead of processing it as PHP code.
So, in other words, if you visit your website and you see an HTML page and not PHP code, your server is working correctly and no one can get to the PHP code.
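To make that concrete, here is a tiny illustrative file (not from the question) showing what sits on the server versus what a visitor receives:
<?php
// example.php as it exists on the server; a visitor never sees these lines
// when the server processes PHP correctly.
$secret = 'database-password';     // stays on the server
echo '<p>Hello, visitor!</p>';     // only this generated HTML reaches the browser
A visitor requesting example.php gets back only the single "<p>Hello, visitor!</p>" line, not the PHP that produced it.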
Which code? Your PHP source code? The only code a user sees is your HTML code; PHP is processed on the server side!
If your PHP files are parsed by the HTTP server, nobody can get their source.
If you're still paranoid after the assurances provided here, you can make your code much more difficult for someone else to read by "obfuscating" it (Wikipedia link).
If you Google "php obfuscator", you'll find tons of PHP obfuscator products, many of them free.
Some examples:
PHP Obfuscator
Code Eclipse
Professional PHP Obfuscator/Encoder
Obfuscation does not affect performance. Only readability for humans.
If someone accesses a PHP file on your site, all they will see is the output of the PHP script (e.g. any HTML or JavaScript) - they won't see the source for the PHP page itself (and will have no way to access it).
If you are concerned about them seeing the output (e.g. the HTML the PHP script generates) from a practical point of view, there isn't anything you can do about that (the most you can do is obfuscate it, but that is largely pointless).
I have a website which sometimes calls php files inside iframes, now I wouldn't like it if somebody copied that code, so how would I hide it? Or do I have to encrypt it?
No, that makes no sense and would not work. You have to realize that the PHP code is executed on your server to serve an HTTP request, and that the iframe results in a separate HTTP request from the main page.
If you want to prevent others from including the iframe in their own page, you could check the referrer header and have the iframe page show an error if the referrer is not from your site, but that could cause problems for some legitimate users and can also be circumvented.
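A sketch of that referrer check (the host name and file name are placeholders; remember the Referer header can be empty or spoofed, so this is easy to bypass):
<?php
// iframe-content.php - refuse to render when embedded from a foreign site.
$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
if (parse_url($referrer, PHP_URL_HOST) !== 'www.example.com') {
    header('HTTP/1.1 403 Forbidden');
    exit('This content may only be embedded from example.com.');
}
// ...normal iframe content below...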
Alternative solution: do not use iframes; instead, integrate the PHP code that currently displays the iframe's content in your main page. This will work for all users and cannot be circumvented.
Of course, you still can't prevent others from requesting your page, extracting the content from the HTML and displaying it on their page - that's just how the internet works.
Put your important files (passwords, login credentials, etc.) into a folder outside the web root, e.g. directly under C:\, and set that folder as an include path in the php.ini file. Then you are pretty safe. You should definitely store your MySQL access credentials outside the htdocs folder and pull them in with include. Good luck.
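A sketch of that layout (the paths and variable names are examples only):
<?php
// Anywhere in your web code: pull credentials in from a folder outside the
// web root, e.g. C:\private\ or /var/www/private/, which no URL can reach.
require 'C:/private/config.php';   // config.php defines e.g. $dbUser and $dbPass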
So I am using AJAX to call a server file which uses WordPress to populate a page's content and return it, which I then use to populate fields. Now what I am confused about is: how do I create the snapshot, and what do I have to do, besides using #!, to let Google know I am creating one? And why do I do this at all? The _escaped_fragment_ part is a little unclear to me too, and I hope I can get a more detailed explanation. Does anyone have any tutorials that walk through this process for a setup similar to mine?
David
Google's crawlers don't typically run your JavaScript. They hit your page, scrape your HTML, and move on. This is much more efficient than loading your page and all of its resources, running your JavaScript, guessing at when everything finished loading, and then scraping data out of the DOM.
If your site uses AJAX to populate the page with content, this is a problem for Google and others. Your page is effectively empty... void of any content... in its HTML state. It requires your JavaScript to fill it in. Since the crawlers don't run your JavaScript, your page isn't all that useful to the crawler.
These days, there are an awful lot of sites that blur the line between web-based applications and content-driven sites. These sites (like yours) require client-side code to run to produce the content. Google doesn't have the resources to do this on every site they encounter, but they did provide an option. That's the info you found about escaped anchor fragments.
Google has given you the opportunity to do the work of scraping the full finished DOM for them. They have put the CPU and memory burden of running your JavaScript back on you. You can signal to Google that this is encouraged by using links with #!. Google sees this and knows that they can then request the same page, but convert everything after #! (which isn't sent to the server) to ?_escaped_fragment_= and make a request to your server. At this point, your server should generate a snapshot of the complete finished DOM, after the JavaScript has run.
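As a rough sketch of the routing side (render_snapshot() is a placeholder for whatever produces the finished HTML, e.g. a PhantomJS service or a cached file):
<?php
// Front controller: when the crawler asks for ?_escaped_fragment_=...,
// serve a pre-rendered snapshot instead of the empty JavaScript-driven shell.
if (isset($_GET['_escaped_fragment_'])) {
    $fragment = $_GET['_escaped_fragment_'];   // the part that followed #! in the original URL
    echo render_snapshot($fragment);           // hypothetical snapshot generator
    exit;
}
// ...otherwise serve the normal AJAX-driven page...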
The good news is that these days you don't have to hack a lot of code in place to do it. I've written a server to do this using PhantomJS. (I'm trying to get permission to open up the source code, but it's in legal limbo, sorry!) Basically, PhantomJS is a full WebKit web browser, but it runs without a GUI. You can use PhantomJS to load your site, run all the JavaScript, and then, when it's ready, scrape the HTML back out of the page and send that version to Google. This doesn't require you to do anything special, other than fixing your routing to point requests with _escaped_fragment_ at your snapshot server.
You can do this in about 20 lines of code. PhantomJS even has a mini web server built into it, but they recommend not using it for production code.
I hope this helps clear up some confusion!
In my project I have to do some screen scraping. The source pages return data after executing the JavaScript embedded within them. In my PHP script I fetch the page using file_get_contents() and, as usual, it returns the page simply as text. My question is: is there a way to get the final output from the webpage (the output after executing the JavaScript)?
I know some of you might suggest embedding a web browser and using that to execute the page. But how do I do that? Is there a working browser available for this? Or are there executable non-GUI versions of open-source browsers such as Chromium, so that I can run one as a CGI script or something?
You will have to have some real browser-like client for this; PHP alone won't cut it. For automation purposes you most likely want a "headless" (without GUI) browser like PhantomJS (the new hotness). Check out this answer.
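If you want to drive it from PHP, one common pattern is to shell out to PhantomJS; a sketch (render.js is a hypothetical PhantomJS script that loads the URL, waits for the page's JavaScript to finish, and prints the resulting HTML):
<?php
// Run the headless browser and capture the fully rendered markup.
$url  = 'http://example.com/js-heavy-page';
$html = shell_exec('phantomjs render.js ' . escapeshellarg($url));
echo $html;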
I need to load content from a remote uri into a PHP variable locally. The remote page only shows content when JavaScript is turned on. How can I get around this?
Essentially, how can I use cURL for pages requiring JavaScript loaded content?
Mink was the only PHP headless browser that I could find.
As noted, Selenium is another popular choice. I don't know how good the performance of these will be, though, if you have a lot of scraping to do; they seem to be more geared towards testing.
A number of other languages have headless browsers, which are listed in the link below. Since PHP does not process JavaScript, you will need another tool. Headless browsers expose the JavaScript engine and allow you to interact with the browser programmatically.
headless internet browser?
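A minimal sketch of Mink driving a browser through Selenium (this assumes Mink with its Selenium2 driver is installed via Composer and a Selenium server is running; the URL and browser name are placeholders):
<?php
require 'vendor/autoload.php';

use Behat\Mink\Session;
use Behat\Mink\Driver\Selenium2Driver;

// Start a real browser through Selenium, let it run the page's JavaScript,
// then pull the rendered HTML back into PHP.
$driver  = new Selenium2Driver('firefox');
$session = new Session($driver);
$session->start();
$session->visit('http://example.com/js-heavy-page');
$html = $session->getPage()->getContent();   // HTML after JavaScript has run
$session->stop();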
To do this you have to automate a real browser with a tool such as Selenium. This will involve slightly more than just a simple GET request, though.
http://seleniumhq.org/