Pointing crawler to HTML snapshot

I'm trying to make my AJAX website crawlable:
Here is the website in question.
I've created an htmlsnapshot.php that generates the page (this file needs to be passed the hash fragment to be able to generate the right content).
I don't know how to get the crawler to load this file while normal users load the normal file.
I don't really understand what the crawler does with the hash fragment (and this is probably part of my problem).
Does anybody have any tips?

The crawler will divert itself. You just need to configure your PHP script to handle the GET parameter that Google sends to your site (instead of relying on the AJAX).
When Google finds a link to yourdomain.com/#!something, it does not request / and run the JavaScript that would make an AJAX request for something. Instead, Google automatically (WITHOUT you doing anything) translates whatever comes after #! in the URL into a query parameter and requests yourdomain.com/?_escaped_fragment_=something.
In your PHP script, check whether $_GET['_escaped_fragment_'] is set and, if so, display the content for that value of something.
It's actually very easy.
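
To make the translation described above concrete, here is a rough sketch in Python (not PHP) of both sides: how a #! URL becomes an _escaped_fragment_ URL, and how a server could read the value back. The function names are made up for illustration, and the percent-escaping is a simplification rather than the exact rule the crawler applies.

```python
from urllib.parse import urlsplit, urlunsplit, quote, parse_qs


def to_escaped_fragment_url(url):
    """Approximate the crawler's rewrite of a #! URL (hypothetical helper)."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hashbang URL, so nothing to translate
    # Simplified escaping: keep "=" and "/" readable, escape the rest.
    value = quote(parts.fragment[1:], safe="=/")
    extra = "_escaped_fragment_=" + value
    query = parts.query + "&" + extra if parts.query else extra
    # The fragment is dropped: it now travels in the query string instead.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def snapshot_key(query_string):
    """Return the _escaped_fragment_ value, or None for a normal visitor."""
    params = parse_qs(query_string, keep_blank_values=True)
    values = params.get("_escaped_fragment_")
    return values[0] if values else None


print(to_escaped_fragment_url("http://yourdomain.com/#!something"))
print(snapshot_key("_escaped_fragment_=something"))
```

In PHP the second function collapses to the isset($_GET['_escaped_fragment_']) check described above: a value means "serve the snapshot", no value means "serve the normal page".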

Related

How to retrieve webpage data (not source code) using Android?

I am trying to pass HTTP variables to a page on my website containing some PHP code, and retrieve a response using Android.
I pointed URL.openStream() at the desired website and read the first string using BufferedReader, but it gave me the first line of the source code, as opposed to what a browser would show when navigating to the page.
This question is difficult to ask because I am not familiar enough with web terminology to describe exactly what I want, but...
Using Android, how would I retrieve what my browser sees on a page, and not the actual source code for the page?

I think WebView is what you are looking for.
Useful link: http://developer.android.com/reference/android/webkit/WebView.html

How to get javascript-generated content from another website using cURL?

Basically, a page generates some dynamic content, and I want to get that dynamic content, not just the static HTML. I am not able to do this with cURL. Help please.

You can't with just cURL.
cURL will grab the specific raw (static) files from the site, but to get JavaScript-generated content, you would have to put that content into a browser-like environment that supports JavaScript and all the other host objects that the script uses, so that the script can run.
Then once the script runs, you would have to access the DOM to grab whatever content you wanted from it.
This is why most search engines don't index javascript-generated content. It's not easy.
If this is one specific site that you're trying to gather info on, you may want to look into exactly how the site gets the data itself and see if you can get the data directly from that source. For example, is the data embedded in JS in the page (in which case you can just parse it out of that JS), is it obtained from an ajax call (in which case you can maybe just make that ajax call directly), or does it arrive by some other method?
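
As a rough illustration of the "data embedded in JS" case, the sketch below (in Python) pulls a JSON object out of a script block once the page has been downloaded with cURL or anything else. The variable name pageData and the sample HTML are invented for the example, and the regular expression only copes with a simple `var name = {...};` assignment.

```python
import json
import re


def extract_embedded_json(html, var_name):
    """Find `var <var_name> = {...};` in the page and parse the object."""
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;"
    match = re.search(pattern, html, re.DOTALL)
    # Return None when the page does not contain the assignment at all.
    return json.loads(match.group(1)) if match else None


# Hypothetical page source, as it would come back from a plain HTTP fetch.
sample_html = '<script>var pageData = {"title": "Hello", "items": [1, 2, 3]};</script>'
print(extract_embedded_json(sample_html, "pageData"))
```

If the data comes from an ajax call instead, the browser's network tab usually shows the URL being requested, and that URL can often be fetched directly in the same way.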

You could try Selenium at http://seleniumhq.org, which supports JavaScript.

What is the use of # in url

I noticed that many web apps use # in their URLs.
For example, Google Analytics.
This address is in the URL bar when I am viewing the visitor's language page:
https://www.google.com/analytics/web/?hl=en#report/visitors-language/a33185827w60383872p61754588/
This address is in the address bar when I am viewing the visitors' geolocation page:
https://www.google.com/analytics/web/?hl=en#report/visitors-geo/a33185827w60383872p61754588/
I think that this is the Google Analytics web app passing #report/visitors-language and #report/visitors-geo.
I know that Google Analytics is using an <iframe>. It seems that only the main content box changes when displaying content.
Is # used because of the <iframe> functionality?

There are several answers but none cover the backend part.
Here is a URL, one from your own example:
www.google.com/analytics/web/?hl=en#report/visitors-language/a33185827w60383872p61754588/
You can think about the post-hash (including the hash #) part as a client-side request.
The web server will never know what was entered after the hash sign. It is the browser pointing to a specific ID on the page.
For basic web pages, if you have this HTML: <a name="main">welcome</a>
on a web page at www.example.com/welcome, going to www.example.com/welcome#main will scroll your browser viewport to the welcome text in the <a> HTML tag.
The web server will not know whether #main was in the URL or not.
Values in the URL after a question mark are called URL parameters, e.g. www.example.com/?foo=bar. The web server can deliver different content based on those values.
However, there is a technique called AJAX (Asynchronous JavaScript and XML) that lets a page fetch and show different content without a full page load, and web apps built on it often use the # part of the URL to record which content is being shown. It's not using an <iframe>.
Using JavaScript, you can react to a change in the URL's post-hash part and make a request to the server to get a specific part of the page, for example for the URL www.example.com/welcome#main2. Even if an element named main2 does not exist, you can show one using JavaScript.
A hashbang is #!. It is used to make search engine indexing easier by indicating that the part after it identifies dynamically loaded content.
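
To see the split between what the server receives and what stays in the browser, here is a small Python sketch using the Analytics URL from the question. The helper name is made up; the point is only that the path and query form the request, while the fragment never leaves the client.

```python
from urllib.parse import urlsplit, parse_qs


def split_for_server(url):
    """Return (what the server is asked for, what only the browser sees)."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return target, parts.fragment


url = "https://www.google.com/analytics/web/?hl=en#report/visitors-language/a33185827w60383872p61754588/"
target, fragment = split_for_server(url)
print(target)    # sent to the web server
print(fragment)  # kept by the browser, available to JavaScript
print(parse_qs(urlsplit(url).query))  # URL parameters the server can act on
```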

This is the "hash" in the URL.
Many browsers support the hashchange event in JavaScript.
As far as I know, the hash change event was a revolution for AJAX callbacks.
When the user follows any link with a hash, the hash change fires the event, and you can do anything you like in JavaScript in response.
One more thing: hash changes are recorded in the browser history.

See this question:
SEO and the use of !# in a url
Or read on:
'#! is called a "hashbang" and they are the root of all that is evil in web development.'
Basically, weak web developers decided to use #anchor names as a kludgy hack to get "web 2.0" things to work on their page, then complained to Google that their page rank suffered. Google made a workaround for their kludge by enabling the hashbang.
Weak web developers took this workaround as gospel. Don't use it. It is a crutch.
Web development that depends on hashbangs is web development done wrong.
This article is far better worded than I could ever be, and deals with the Gawker Media fiasco from their migration to a (failed) hashbang-centric website. It tells you WHAT is happening and why it's bad.
http://isolani.co.uk/blog/javascript/BreakingTheWebWithHashBangs

Correct me if I'm wrong, but the hash in that URL would be used as an anchor to scroll the page to an element with a matching id. For example, if I send you to the URL http://example.com/sample#example, the page would scroll to (just display at) the element <div id="example"> (I'm using a div as an arbitrary example; it could be anything).

AJAX and the hash mark in the URL are mostly used for quick actions.
If part of your site becomes visible only when an event fires (usually a click), it is hard to share a link to it. With a hash mark in the URL, you can use JavaScript to make the page behave as if the required action had happened and display the relevant part.

Normally, a '#' in a URL makes the browser find the element on that page whose id follows the '#'. This lets us jump straight to particular content in the middle of the page.

Using YQL in javascript/php to scrape article html?

I'm new to YQL, and just trying to learn how to do some fairly simple tasks.
Let's say I have a list of URLs and I want to get their HTML source as a string in JavaScript (so I can later insert it into a database via ajax). How would I go about getting this info back in JavaScript? Or would I have to do it in PHP? I'm fine with either, really, whatever works.
Here's an example query I'd run on their console:
select * from html where url="http://en.wikipedia.org/wiki/Baroque_music"
And the goal is to essentially save the HTML or maybe just the text or something, as a string.
How would I go about doing this? I somewhat understand how the querying works, but not really how to integrate with javascript and/or php (say I have a list of URLs and I want to loop through them, getting the html at each one and saving it somewhere).
Thanks.

You can't read other pages with JavaScript due to a built-in security feature in web browsers called the same-origin policy.
The usual method is to scrape the content of these sites from the server using PHP.
There is another option with JavaScript called a bookmarklet.
You can add the bookmarklet to your bookmarks bar and click it each time you want the content of a site.
A script will be loaded in the host page; it can read the content and post it back to your server.
Oddly enough, the same-origin policy does not prevent you from POSTing data from this host page to your domain. You need to POST a FORM to an IFRAME that has a source hosted on your domain.
You won't be able to read the response you get back from the POST.
But you can poll with setInterval, making a JSONP call to your domain to find out whether the POST was successful.
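
For the server-side route mentioned above, the loop over a list of URLs could look roughly like the following sketch. It is written in Python rather than PHP, the function names are made up, and storing the result in a dictionary stands in for whatever database insert you would actually do.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download one page and decode it to a string."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def collect_pages(urls, fetch=fetch_html):
    """Fetch every URL; a page that fails to load is recorded as None."""
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except OSError:  # urllib's URLError and HTTPError derive from OSError
            pages[url] = None
    return pages


if __name__ == "__main__":
    result = collect_pages(["http://en.wikipedia.org/wiki/Baroque_music"])
    for page_url, html in result.items():
        print(page_url, "failed" if html is None else len(html))
```

The fetch parameter is there only so the loop can be tried with a stand-in function instead of real network requests.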

Detecting when a link has been clicked with PHP

I have built an in-browser engine that will retrieve pages without executing server-side scripting... seems ridiculous, I know, but I'm doing this as part of a school project.
The problem I am having is that once it displays the page, clicking a link takes you to www.their-site.com instead of www.my-site.com?site=www.their-site.com.
Basically I need my PHP page to detect whether a link is clicked and, if so, add "www.my-site.com?site=" before it so that all sites will still be rendered without the server-side scripting. Is there any way to do this?
EDIT:
OK, I guess I wasn't clear enough the first time. Sorry about that.
I have made a PHP page that will display the contents of any site without executing the server-side scripting that belongs to that page. This lets you get around those annoying news articles that give you a glimpse for two seconds before a login box appears. The problem is that once you've accessed a page, clicking any link connects you to their server and the scripts turn back on. I want MY PHP to execute, not THEIRS.

You need to know what you want first.
You say no server side scripting, then you mention php.
I don't think you can do this with just JS.
You need to fetch the pages using PHP and, depending on what exactly you want, modify them so that clicking a link sends an ajax call to another page. This will require either regex replacement or the use of an HTML DOM library.
When a link is clicked, it should send an ajax request to the PHP page, which can then request the page, make the modifications, and send it back to the browser. You can then use JS to replace the page contents.
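
A simpler variant of the regex replacement idea is to rewrite every link before the page is sent to the browser, so that clicks come back to your own script without any ajax. The Python sketch below shows the rewriting step only. The prefix http://www.my-site.com/?site= is an assumption based on the URLs in the question, and the pattern handles only double-quoted href attributes on <a> tags.

```python
import re
from urllib.parse import urljoin, quote

# Assumed entry point of the proxy page; adjust to the real script URL.
PROXY_PREFIX = "http://www.my-site.com/?site="

# Captures: opening of the tag up to href=", the link target, the closing quote.
LINK_PATTERN = re.compile(r'(<a\b[^>]*?\bhref=")([^"]*)(")', re.IGNORECASE)


def rewrite_links(html, page_url, proxy_prefix=PROXY_PREFIX):
    """Point every <a href="..."> at the proxy instead of the original site."""
    def replace(match):
        # Resolve relative links against the page they came from.
        absolute = urljoin(page_url, match.group(2))
        # Percent-encode the target so it survives as a single query value.
        return match.group(1) + proxy_prefix + quote(absolute, safe="") + match.group(3)

    return LINK_PATTERN.sub(replace, html)


print(rewrite_links('<a href="/news">News</a>', "http://www.their-site.com/index.html"))
```

A real HTML parser is more robust than a regex for this, especially for single-quoted attributes, forms, and links built by scripts.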
