I want to parse data from a website like this (the website is a dictionary / search site).
When I search on the website, the page does not refresh; the results are shown and the URL does not change.
Now I want to know: is it possible to use cURL or anything else to parse this website?
There are two options:
Check the AJAX calls that are triggered in the console and try to fetch those URLs yourself, using the correct method and data (a rough sketch of this follows below).
Disable JavaScript and see if the site still works. If it does, you will see which URLs you need to fetch.
However, if the site does not want you to get the data, they can prevent it and / or block you if you make too many requests.
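For the first option, here is a rough sketch of what replicating such a call might look like in PHP with cURL. The URL, parameter names, and headers below are made up; use whatever the network tab shows for the real site.
<?php
// Hypothetical example: replicate an AJAX search request discovered in the
// browser's network tab. URL, parameters, and headers are placeholders.
$ch = curl_init('http://www.example.com/search_ajax.php');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query(array('q' => 'hello', 'lang' => 'en')),
    // Many endpoints check this header to tell AJAX requests apart
    CURLOPT_HTTPHEADER     => array('X-Requested-With: XMLHttpRequest'),
));
$response = curl_exec($ch);
curl_close($ch);

echo $response; // raw HTML fragment or JSON, depending on the endpoint
?>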
Please let me know: is it possible to scrape some info after AJAX has loaded, using PHP? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for any advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the AJAX calls.
Something like Selenium should do the trick; a quick Google search turns up numerous examples of how to set it up.
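If you want to stay in PHP, the php-webdriver package can drive a Selenium server. A minimal sketch, assuming a Selenium server is already running on localhost:4444 and the library is installed via Composer:
<?php
require 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

// Connect to the (assumed) local Selenium server
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::chrome());

// The browser loads the page and runs its JavaScript / AJAX calls
$driver->get('http://www.example.com/');

// getPageSource() returns the rendered DOM, not the raw HTTP response
$html = $driver->getPageSource();
$driver->quit();
?>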
Scraping JUST the AJAX calls
Though I wouldn't consider this scraping, you can always examine an AJAX call using your browser's dev tools. In Chrome, while on the site, hit F12 to open the dev tools console.
You should then see the dev tools window. Hit the Network tab and then hit Chrome's refresh button. This will show every request made between you and the site. You can then filter out specific requests.
For example, if you are interested in AJAX calls you can select XHR.
You can then click on any of the listed items in the table to get more information.
file_get_contents on an AJAX call
Depending on how robust the APIs are on these ajax calls you could do something like the following.
<?php
// Request the AJAX endpoint directly and capture the raw response body
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the response is JSON, then add
$data = json_decode($content);
However, you are going to have to do this for each AJAX request on a site. Beyond that you are going to have to use a solution similar to the ones presented [here].
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific ajax calls you might be able to get them using file_get_contents. However, if you are trying to scrape the entire site that happens to also use AJAX to manipulate the document then you will NOT be able to use SIMPLE_HTML_DOM.
Finally I worked around my problem. I just took the POST URL with all its parameters from the AJAX call and made the same request using the SIMPLE_HTML_DOM class.
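For reference, a sketch of that workaround, with a hypothetical endpoint and parameter names; SIMPLE_HTML_DOM's str_get_html() can then parse whatever the call returns.
<?php
require 'simple_html_dom.php';

// Hypothetical POST endpoint and parameters copied from the AJAX call seen
// in the browser's network tab
$context = stream_context_create(array(
    'http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => http_build_query(array('term' => 'foo', 'page' => 1)),
    ),
));

$response = file_get_contents('http://www.example.com/ajax/search.php', false, $context);

// Parse the returned HTML fragment with SIMPLE_HTML_DOM
$html = str_get_html($response);
foreach ($html->find('a') as $link) {
    echo $link->href, "\n";
}
?>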
I'm making a PHP crawler to explore the e-shop alza.cz. I want links to all products in that e-shop. I'm at the address http://www.alza.cz/notebooky/18842920.htm, but this displays only the first 21 items. To get all items I must go to http://www.alza.cz/notebooky/18842920.htm#f&pg=1/10000.
The crawler uses file_get_contents to get the HTML of the page, which is then parsed using DOM. The problem is that file_get_contents seems to ignore the part after the # (it returns only the first 21 items instead of all of them). Any ideas?
file_get_contents ignores the #xxxxx part of the URL (the fragment identifier) and does not include it in the requested URL. The fragment is something a user agent uses on the client side - most likely, the website has some JavaScript which uses AJAX to load a new page of results.
You could see if the page obeys the Google AJAX Crawling Specification, though based on your example, it doesn't look like it. If you see "hash bang" fragment identifiers like #!foo=bar, that's a good sign.
So, you'll need to observe the AJAX requests in Firebug or similar and replicate the same requests yourself.
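For example, if the network tab shows the product list coming from a paged endpoint, you could loop over the pages yourself. The URL and parameter below are placeholders, not the real alza.cz endpoint.
<?php
// Hypothetical paged endpoint discovered in the network tab; the real
// request will have a different URL and parameters.
$allHtml = '';
for ($page = 1; $page <= 10; $page++) {
    $chunk = file_get_contents('http://www.example.com/products_ajax.php?pg=' . $page);
    if ($chunk === false || $chunk === '') {
        break; // stop when a page fails or comes back empty
    }
    $allHtml .= $chunk;
}

// $allHtml can now be fed to the existing DOM parsing code
?>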
I'm trying to make my AJAX website crawlable:
Here is the website in question.
I've created a htmlsnapshot.php that generates the page (this file needs to be passed the hash fragment to be able to generate the right content).
I don't know how to get the crawler to load this file while getting normal users to load the normal file.
I don't really understand what the crawler does to the hash fragment (and this probably is part of my problem.)
Does anybody have any tips?
The crawler will redirect itself. You just need to configure your PHP script to handle the GET parameters that Google will be sending your site (instead of relying on the AJAX).
Basically, when Google finds a link to yourdomain.com/#!something, then instead of requesting / and running the JavaScript to make an AJAX request for something, Google will automatically (without you doing anything) translate everything that comes after #! in your URL into ?_escaped_fragment_=something.
You just need to check (in your PHP script) whether $_GET['_escaped_fragment_'] is set, and if so, display the content for that value of something.
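A minimal sketch of that check (index.php and index.html are assumptions; htmlsnapshot.php is your existing snapshot script):
<?php
// index.php - decide which version of the page to serve
if (isset($_GET['_escaped_fragment_'])) {
    // Google requested ?_escaped_fragment_=something: serve the snapshot
    $fragment = $_GET['_escaped_fragment_'];
    include 'htmlsnapshot.php'; // assumed to render based on $fragment
    exit;
}

// Regular visitors get the normal AJAX-driven page
readfile('index.html');
?>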
It's actually very easy.
I want to load external websites inside a div and make the content a bit smaller so it fits inside the div more neatly, just like Google search does.
I tried this:
$("#targetDiv").load("www.google.com");
but it is not working.
I tried an iframe but it still has 2 problems:
scrolling is still enabled by pressing the arrow keys and PgUp/PgDn
I don't know how to make the contents inside the iframe smaller
I don't know which method I should use, which one is more optimized, or whether there is any alternative.
What you're trying to do is not going to work. Unfortunately, JavaScript isn't allowed to make cross-domain requests for security reasons (reference: http://en.wikipedia.org/wiki/Same_origin_policy).
If you create a script written in PHP that resides on your own server and submits the request, that could work, but the user wouldn't have a valid session, and there's a risk that URLs (links) from the other site won't work if they're relative.
Example:
$('#targetDiv').load('load.php?url=www.google.com')
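load.php here would be a small proxy script you write yourself. A rough sketch using cURL (the script name and parameter are assumptions, and it needs input validation before real use):
<?php
// load.php - hypothetical proxy script; fetches a remote page server-side so
// the browser's same-origin policy no longer applies.
// NOTE: validate/whitelist $_GET['url'] in real use, or this is an open proxy.
$url = 'http://' . $_GET['url']; // e.g. load.php?url=www.google.com

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

echo $html; // jQuery's .load() injects this into #targetDiv
?>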
You could also have a look at jquery-crossframe. I've never used it but it claims to do what you're looking for.
The best option is to use an iframe element.
You are not going to be able to make a cross-domain AJAX call like that with jQuery. From http://api.jquery.com/load/:
Additional Notes:
Due to browser security restrictions, most "Ajax" requests are subject to the same origin policy; the request can not successfully retrieve data from a different domain, subdomain, or protocol.
If an iframe is not an option you can retrieve the data via an AJAX call to a PHP page that uses cURL.
Francois is right in that your AJAX requests are restricted by the same-origin policy. That means you cannot load content from other websites directly. What you are trying to achieve, however, is possible if your source supports JSONP. If you specifically want to load Google search engine results, check out the Google Custom Search API.
I'm new to YQL, and just trying to learn how to do some fairly simple tasks.
Let's say I have a list of URLs and I want to get their HTML source as a string in JavaScript (so I can later insert it into a database via AJAX). How would I go about getting this info back in JavaScript? Or would I have to do it in PHP? I'm fine with either, really - whatever works.
Here's an example query I'd run on their console:
select * from html where url="http://en.wikipedia.org/wiki/Baroque_music"
And the goal is to essentially save the HTML or maybe just the text or something, as a string.
How would I go about doing this? I somewhat understand how the querying works, but not really how to integrate it with JavaScript and/or PHP (say I have a list of URLs and I want to loop through them, getting the HTML at each one and saving it somewhere).
Thanks.
You can't read other pages with Javascript due to a built-in security feature in web browsers. It is called the Same origin policy.
The usual method is to scrape the content of these sites from the server using PHP.
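A rough sketch of that server-side approach, assuming you simply want to save the raw HTML of each URL (the URL list and storage location are placeholders):
<?php
// Hypothetical list of pages to fetch; in practice it could come from a
// database or be submitted by your JavaScript front end.
$urls = array(
    'http://en.wikipedia.org/wiki/Baroque_music',
    'http://en.wikipedia.org/wiki/Classical_period_(music)',
);

foreach ($urls as $url) {
    $html = file_get_contents($url);
    if ($html === false) {
        continue; // skip pages that could not be fetched
    }
    // Store the source however you like; here, one file per URL
    // (assumes a writable pages/ directory)
    file_put_contents('pages/' . md5($url) . '.html', $html);
}
?>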
There is another option with JavaScript, called a bookmarklet.
You can add the bookmarklet in your bookmarks bar, and each time you want the content of a site click the bookmark.
A script will be loaded in the host page; it can read the content and post it back to your server.
Oddly enough, the same-origin policy does not prevent you from POSTing data from this host page to your domain. You need to POST a form to an iframe whose source is hosted on your domain.
You won't be able to read the response you get back from the POST.
But you can poll with a setInterval making a JSONP call to your domain to know if the POST was successful.
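On your own domain that would mean one small endpoint the form POSTs to and another that answers the JSONP poll. A very rough sketch, where the script names and field names are assumptions:
<?php
// receive.php - the bookmarklet's form (inside the iframe) POSTs here.
// Field names are hypothetical; assumes a writable captures/ directory.
if (isset($_POST['page_url'], $_POST['page_html'])) {
    file_put_contents('captures/' . md5($_POST['page_url']) . '.html', $_POST['page_html']);
}
?>

<?php
// status.php - answers the JSONP poll so the bookmarklet can tell whether
// the POST above has been stored yet.
header('Content-Type: application/javascript');
$saved = isset($_GET['url']) && file_exists('captures/' . md5($_GET['url']) . '.html');
echo $_GET['callback'] . '(' . json_encode(array('saved' => $saved)) . ');';
?>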