Using YQL in JavaScript/PHP to scrape article HTML? - php

I'm new to YQL, and just trying to learn how to do some fairly simple tasks.
Let's say I have a list of URLs and I want to get their HTML source as a string in JavaScript (so I can later insert it into a database via AJAX). How would I go about getting this info back in JavaScript? Or would I have to do it in PHP? I'm fine with either, really - whatever works.
Here's an example query I'd run on their console:
select * from html where url="http://en.wikipedia.org/wiki/Baroque_music"
The goal is essentially to save the HTML, or maybe just the text, as a string.
How would I go about doing this? I somewhat understand how the querying works, but not really how to integrate it with JavaScript and/or PHP (say I have a list of URLs and I want to loop through them, getting the HTML at each one and saving it somewhere).
Thanks.

You can't read other pages with JavaScript due to a built-in security feature in web browsers called the same-origin policy.
The usual method is to scrape the content of these sites from the server using PHP.
There is another option with JavaScript, called a bookmarklet.
You can add the bookmarklet in your bookmarks bar, and each time you want the content of a site click the bookmark.
A script will be loaded into the host page; it can read the content and post it back to your server.
Oddly enough, the same-origin policy does not prevent you from POSTing data from this host page to your domain. You need to POST a form to an iframe whose source is hosted on your domain.
You won't be able to read the response you get back from the POST.
But you can poll with setInterval, making a JSONP call to your domain, to know whether the POST was successful.
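For reference, here is a minimal sketch of what the server side of that approach could look like, as a single PHP script on your domain. The field names, the storage scheme, and the JSONP callback parameter are all assumptions, not part of the answer above.
<?php
// Hypothetical sketch: the bookmarklet POSTs the page HTML here through the
// hidden form/iframe; the host page then polls this same script with JSONP
// to learn whether the POST arrived.
if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_POST['url'], $_POST['html'])) {
    // Store the scraped HTML keyed by the source URL (storage is an assumption).
    file_put_contents('scraped/' . md5($_POST['url']) . '.html', $_POST['html']);
    exit;
}

// JSONP poll:  receive.php?url=...&callback=cb   ->   cb({"done":true})
$done     = isset($_GET['url']) && file_exists('scraped/' . md5($_GET['url']) . '.html');
$callback = preg_replace('/[^\w.]/', '', isset($_GET['callback']) ? $_GET['callback'] : 'cb');
header('Content-Type: application/javascript');
echo $callback . '(' . json_encode(array('done' => $done)) . ');';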


PHP: download complete HTML code from URL (with closed tags) [duplicate]

Please let me know: is it possible to scrape some info after AJAX has loaded, using PHP? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the AJAX calls.
Something like Selenium should do the trick. A quick Google search turns up numerous examples of how to set it up.
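As a rough sketch of that idea, assuming the facebook/php-webdriver client and a Selenium server running locally (the target URL and the fixed wait are just illustrations):
<?php
require 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

// Drive a real browser through Selenium so the page's JavaScript actually runs.
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::firefox());
$driver->get('http://www.example.com/page-with-ajax');
sleep(2);                             // crude wait for the AJAX content to render
$html = $driver->getPageSource();     // full DOM after JavaScript has run
$driver->quit();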
Scraping JUST the Ajax calls
Though I wouldn't consider this scraping, you can always examine an AJAX call using your browser's dev tools. In Chrome, while on the site, hit F12 to open the dev tools.
Hit the Network tab and then hit Chrome's refresh button. This will show every request made between you and the site. You can then filter out specific requests.
For example, if you are interested in AJAX calls you can select XHR.
You can then click on any of the listed requests to get more information.
file_get_contents on an AJAX call
Depending on how robust the APIs behind these AJAX calls are, you could do something like the following.
<?php
// Request the AJAX endpoint directly, exactly as the browser would
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the return is JSON, then add:
$data = json_decode($content);
However, you are going to have to do this for each AJAX request on the site. Beyond that, you will have to use a solution similar to the ones presented [here].
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific AJAX calls, you might be able to get it using file_get_contents. However, if you are trying to scrape an entire site that also uses AJAX to manipulate the document, then SIMPLE_HTML_DOM alone will NOT be enough.
In the end I worked around my problem: I just grabbed the POST URL, with all of its parameters, from the AJAX call and made the same request using the SIMPLE_HTML_DOM class.
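A rough sketch of that workaround, assuming the simplehtmldom library; the endpoint URL, the POST fields, and the selector are placeholders copied from the browser's Network tab, not values from the answer:
<?php
include 'simple_html_dom.php';

// Replay the site's own AJAX request with the same parameters.
$context = stream_context_create(array(
    'http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => http_build_query(array('page' => 2, 'category' => 5)),
    ),
));

$html = file_get_html('http://www.example.com/ajax/list.php', false, $context);
foreach ($html->find('div.item') as $item) {   // the selector is an assumption
    echo $item->plaintext, "\n";
}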

How should I implement hashbangs (AJAX) in content page tabs?

As some of you may know, Google is now crawling AJAX. The implementation is far from elegant, but at least it also applies to Yahoo and Bing, AFAIK.
Context: My site is driven by WordPress & HTML5. A Custom Post Type has three types of content, and the contents of these are driven by AJAX. The solution I came up with for not using hashbangs (#!) until I fully understand how to implement them is rather risky. Every link has an HREF pointing to site.com/article-one/?tab=first_tab, which shows only the contents of the selected tab (<div>Content...</div>). Like this:
<a href="http://site.com/article-one/?tab=first_tab" data-tab="first_tab">This First Tab</a>
As you may note, data-tab is the value that JavaScript sends with the AJAX GET; it fetches the related content and renders it inside a container. On the server side, the server gets the variable and does a <?php get_template_part('tab-first-tab'); ?> to deliver the content.
About the risk: I can see that Google and other search engines will index http://site.com/article-one/?tab=first_tab instead of http://site.com/article-one/, making users land on that URL instead of the full article page with the tab content selected automatically.
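For illustration, a minimal sketch of that server-side handler (the default tab name and the "tab-<name>.php" template naming convention are assumptions based on the question):
<?php
// Serve only the template part for the requested tab.
$tab = isset($_GET['tab']) ? sanitize_key($_GET['tab']) : 'first_tab';
// Map e.g. "first_tab" to the template part "tab-first-tab.php".
get_template_part('tab-' . str_replace('_', '-', $tab));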
The problem now is the implementation to avoid that.
Hashbang: From what I learned, I should do this.
HREF should become site.com/article-one/#!first-tab
JS should extract the "first-tab" part from the href and send it as the GET parameter (just for the sake of not using "data-tab").
JS should change the URL to site.com/article-one/#!first-tab
JS should detect if the URL has #!first-tab, and show the selected tab instead of the default one.
Now, for the server-side implementation, this is where I'm kind of lost in the woods.
How will WordPress handle site.com/article-one/?_escaped_fragment_=first-tab?
Do I have to change something in .htaccess?
What should the HTML snapshot contain? My guess is the whole page, but with the requested tab showing, instead of only the tab content.
I think I can separate what WordPress handles when it detects the _escaped_fragment_. If it is requested, e.g. by Google, it will show the whole page plus the selected tab's content; if not, it's because AJAX is requesting it, and it will show only the tab content. Is that right?
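As a sketch of that split (the helper function and the template naming are hypothetical, not part of the question):
<?php
if (isset($_GET['_escaped_fragment_'])) {
    // Google asked for the snapshot: render the full page with this tab in place.
    $tab = sanitize_key($_GET['_escaped_fragment_']);        // e.g. "first-tab"
    render_full_page_with_tab($tab);                         // hypothetical helper
} elseif (isset($_GET['tab'])) {
    // Our own AJAX call: return only the tab fragment.
    get_template_part('tab-' . sanitize_key($_GET['tab']));
} else {
    // Normal visit: full page with the default tab.
    render_full_page_with_tab('first-tab');                  // hypothetical helper
}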
I'm gonna talk in the third person.
Since this has no responses, I have a good reason why you should not do this. Yes, the same reason why Twitter dropped them:
http://danwebb.net/2011/5/28/it-is-about-the-hashbangs
Instead of doing hashbangs, you should make normal URIs. For example, an article with the summary tab open should be "site.com/article/summary", and if it is the default tab that shows (or the one already requested), the URL should also change to that URI using pushState().
If the user selects the "exercises" tab, the URL should change to "site.com/article/exercises" using pushState() while the site loads the content through AJAX, and while you still keep the original href pointing to "site.com/article/exercises". Without JavaScript the user should still see the content - not only the content, but the whole page with the tab selected.
For that to work, some editing of the .htaccess to handle the /[tab] part of the URL needs to be done.
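In a WordPress setup you can also register the rewrite in PHP instead of editing .htaccess directly; a minimal sketch, where the tab names, the regex, and the assumption that the first URL segment is the post slug are all illustrative:
<?php
add_action('init', function () {
    // Expose the trailing /<tab>/ segment as a "tab" query var.
    add_rewrite_tag('%tab%', '([^&/]+)');
    add_rewrite_rule(
        '^([^/]+)/(summary|exercises)/?$',
        'index.php?name=$matches[1]&tab=$matches[2]',
        'top'
    );
});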

Load external URL inside div like Google search

I want to load external websites inside a div and make them a bit smaller so they fit inside the div more properly, just like Google search does.
I tried this:
$("#targetDiv").load("www.google.com");
but it is not working.
I tried an iframe but it still has 2 problems:
scrolling is still enabled by pressing the arrow keys & PGUP/PGDOWN
how to make the contents inside the iframe smaller
I don't know which method I should use, which is more optimized, or whether there is any alternative.
What you're trying to do is not going to work. Unfortunately, JavaScript isn't allowed to make cross-domain requests for security reasons (reference: http://en.wikipedia.org/wiki/Same_origin_policy).
If you create a script written in PHP that resides on your own server and submits the request, that could work, but the user wouldn't have a valid session, and there's a risk that links from the other site won't work if they're relative.
Example:
$('#targetDiv').load('load.php?url=www.google.com')
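The load.php script itself isn't shown in the answer; a rough sketch of what it could look like follows. Note that an open proxy like this should be locked down (e.g. whitelist allowed hosts) before using it for real, and relative links in the fetched page will break.
<?php
// load.php - fetch the remote page server-side and echo it back, so the
// browser only ever talks to your own domain.
$url = isset($_GET['url']) ? $_GET['url'] : '';
if ($url !== '' && strpos($url, 'http') !== 0) {
    $url = 'http://' . $url;   // the example passes "www.google.com" with no scheme
}
echo file_get_contents($url);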
You could also have a look at jquery-crossframe. I've never used it but it claims to do what you're looking for.
The best option is to use an iframe element.
You are not going to be able to make a cross-domain AJAX call like that with jQuery. From http://api.jquery.com/load/:
Additional Notes:
Due to browser security restrictions, most "Ajax" requests are subject to the same origin policy; the request can not successfully retrieve data from a different domain, subdomain, or protocol.
If an iframe is not an option, you can retrieve the data via an AJAX call to a PHP page that uses cURL.
Francois is right in that your AJAX requests are restricted by the same-origin policy. That means you cannot load content from other websites directly. What you are trying to achieve, however, is possible if your source supports JSONP. If you want to specifically load Google search engine results, check out the Google Custom Search API.
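A minimal sketch of that PHP page, assuming a hard-coded example URL (in practice you would validate whatever URL is requested):
<?php
// The browser makes a same-origin AJAX call to this script; the script
// fetches the external site with cURL and returns the body.
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
$body = curl_exec($ch);
curl_close($ch);

echo $body;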

Ajax page fetch design requires physical address

I am creating a web app in PHP. I am loading content through an AJAX-based request.
When I click on a hyperlink, the corresponding page gets fetched through AJAX and the content is replaced by the fetched page.
Now the issue is, I need a physical href so that I can implement Facebook "like" functionality and also maintain the browser history. I cannot do an old-school postback to the PHP page, as I am doing a transition animation in which the current page slides away and the new page slides in.
Is there a way I can keep the animation and still have a valid physical href and history?
The design of the application is as follows:
the app grabs an RSS feed;
it creates the DOM for those RSS items;
upon clicking on any headline, the page animates and takes you to the full story of the RSS item.
I need to create a "Like" button on the full-story page, but I don't have a valid URL for it.
While Alexander's answer works great on the client side, Facebook's linter tool does not run JavaScript, so it will get the old content. Neither of the two links provides a solution to this.
What amit needs to implement is server-side parsing of the URL. See http://www.php.net/manual/en/function.parse-url.php. Fragment is what the server sees as the hash tag value. In your PHP code, render the correct og: tags based upon the fragment.
Firstly, if you need a URL for Facebook then think up a structure that gives you one, such that your server-side code will load the correct page when given that URL. This could be something like http://yourdomain.com/page.php?feed=<feedname>&link=<linknumber>, which would allow you to check the parameters using the PHP $_GET array. If you don't have the parameters then load the index page; if you do, load the relevant article.
Secondly, use something like history.js to give you cross-browser support for the HTML5 pushState() functionality so that you can set the page URL when you do the AJAX call, without requiring the browser to do a full reload.
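For reference, a minimal sketch of the page.php structure that the first point suggests; render_article() and render_index() are hypothetical helpers standing in for whatever actually outputs the markup:
<?php
$feed = isset($_GET['feed']) ? $_GET['feed'] : null;
$link = isset($_GET['link']) ? (int) $_GET['link'] : null;

if ($feed !== null && $link !== null) {
    render_article($feed, $link);   // full story page - a real, shareable URL
} else {
    render_index();                 // no parameters: show the feed listing
}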
You have to implement hash navigation.
Here is a short tutorial.
Here is a more conceptual introduction.
If you're using jQuery, I can recommend BBQ for hash navigation:
http://benalman.com/projects/jquery-bbq-plugin/
This actually sounds pretty straightforward to me.
You have the URLs as usual; using the hash (#) you can extract the info on both the client and server side.
There is only one thing missing: on the server side, before you return the content, check the user agent string and compare it to the Facebook bot (if I'm not mistaken it's something like "facebookexternalhit"). If it turns out to be the Facebook bot, return whatever you want that describes the URL for a like/share (Open Graph meta data); if it's any other user agent string, return the content as usual.
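A minimal sketch of that user-agent check; the og: values are placeholders to be filled in from whatever story the URL identifies:
<?php
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (stripos($ua, 'facebookexternalhit') !== false) {
    // Facebook crawler: return the Open Graph description of this URL.
    echo '<meta property="og:title" content="Story title" />';
    echo '<meta property="og:description" content="Story summary" />';
    echo '<meta property="og:type" content="article" />';
} else {
    // Any other user agent: return the content as usual.
}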

How do you do dynamic refresh of a single div tab using php/ajax and have the content actually change the local html on the page

How do you do a dynamic refresh of a single div tab using PHP/AJAX and have the content actually change the local HTML on the page (so that it is changed when you go to ‘view source’ in a browser), instead of just putting the change in a JavaScript object? I am trying to design a webpage that loads search results without refreshing the entire page. I use a simple hash followed by a GET/query string request to determine what content to load. This gets passed to a JavaScript XMLHttpRequest, then to some PHP which picks up the GET and passes it to a SOAP service, and finally echoes the SOAP results back to the XMLHttpRequest to be displayed via a document.getElementById div change. This works fine for normal display in conventional browsers. However, I am concerned that search bots and screen readers are not going to see the majority of the content that shows in browsers, because it is all contained within a client-side JavaScript object.
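For reference, a minimal sketch of the PHP step in that flow; the WSDL location, the "q" parameter, and the "search" operation are placeholders, not the actual service:
<?php
$query = isset($_GET['q']) ? $_GET['q'] : '';

$client = new SoapClient('http://example.com/search.wsdl');
$result = $client->search(array('term' => $query));   // call the SOAP service

echo json_encode($result);   // echoed back to the XMLHttpRequest and written into the div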
So, I guess my first question is: is this a valid concern? If it is, is there a work around?
Thanks!
AJAX content is very hard to get indexed. Google has webmaster guidelines for AJAX. This should get you started in the right direction on getting your content indexed.
I'm inexperienced with search engine behavior, but as far as I know the best option is to render the full content of your div in a PHP page; when the page loads you can include that page inside the div, and then use JS/jQuery to refresh it every so many seconds.
This way, when a search bot hits the site it will see the current content, and users will see it update.
Updating the div can be done quite easily using an AJAX call with jQuery.
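A sketch of that idea; results.php is a hypothetical script that echoes the current results markup:
<div id="results">
    <?php
        // Render the results server-side so crawlers see real markup; the same
        // file can be re-fetched later by jQuery to refresh the div.
        include 'results.php';
    ?>
</div>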
