I want to crawl a site whose pagination is loaded by AJAX. I'm using FriendsOfPHP/Goutte for crawling in Laravel 5.
Is it possible to do this with Goutte?
I tried out the following code,
$link = $crawler->selectLink('Next>')->link();
$crawler = $client->click($link);
but it's not working.
How can I crawl an AJAX site using PHP / Laravel 5?
A crawler usually works without a hitch on server-rendered HTML content (what you see with the browser's view-source option), but if you want to crawl AJAX-loaded content you need to do it manually by studying that site's AJAX calls. Inspect the pagination requests and find the pattern they follow; once you have it, you can definitely scrape that content.
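For example, once you know which URL the pagination requests hit, you can call that endpoint directly with Goutte's client and parse the returned HTML fragment. This is only a sketch; the endpoint and the "page" parameter below are hypothetical and should be replaced with whatever you find in the browser's network tab:

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();

// Hypothetical AJAX pagination endpoint discovered via the browser's network tab.
for ($page = 1; $page <= 5; $page++) {
    $crawler = $client->request('GET', 'http://example.com/items/ajax?page=' . $page);

    // Pull whatever you need out of the returned HTML fragment.
    $crawler->filter('.item-title')->each(function ($node) {
        echo $node->text() . PHP_EOL;
    });
}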
Related
I'm scraping an Amazon seller profile like this one: https://www.amazon.es/sp?_encoding=UTF8&asin=B07KS22WVT&isAmazonFulfilled=1&isCBA=&marketplaceID=A1RKKUPIHCS9HS&orderID=&seller=A1KD8FXP0BE5W2&tab=&vasStoreID=
I'm using PHP and Goutte. The problem is that in the comments section, when I click on "Siguiente" (Next) the URL doesn't change, and I can't scrape the next comments.
I saw that Goutte supports clicking links. I tried:
$link = $crawler->selectLink('Siguiente')->link();
$crawler = $client->click($link);
but it doesn't work. Is there any other solution?
Goutte can only load pages which are rendered server-side (with PHP, for instance). Anything that changes without a new page load is probably done with JavaScript, which is not supported. You could look at this question. It's probably better to use something like PhantomJS for crawling pages, as a lot of pages depend on JavaScript.
Please let me know: is it possible to scrape some info after an AJAX load with PHP? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for the advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the AJAX calls.
Something like Selenium should do the trick. A quick Google search turns up numerous examples of how to set it up. Here is one.
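As a rough sketch (assuming a Selenium server running on localhost:4444 and the php-webdriver package; the server address and page URL are placeholders), driving a real browser and reading the rendered HTML could look like this:

<?php
require 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

// Connect to a Selenium server that is already running.
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::chrome());

// The browser executes the JavaScript and the AJAX calls for you.
$driver->get('http://www.example.com/some-dynamic-page');

// Grab the fully rendered HTML and feed it to whatever parser you like.
$html = $driver->getPageSource();

$driver->quit();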
Scraping JUST the Ajax calls
Though I wouldn't consider this scraping, you can always examine an AJAX call by using your browser's dev tools. In Chrome, while on the site, hit F12 to open up the dev tools console.
You should then see the dev tools window. Hit the Network tab and then hit Chrome's refresh button. This will show every request made between your browser and the site. You can then filter for specific requests.
For example, if you are interested in AJAX calls you can select the XHR filter.
You can then click on any of the listed items in the table to get more information.
file_get_contents on an AJAX call
Depending on how robust the APIs are on these ajax calls you could do something like the following.
<?php
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the response is JSON, then add
$data = json_decode($content);
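Putting the two together, here is a sketch with a made-up endpoint and field names (decoding into an associative array so the keys are easy to loop over):

<?php
// Hypothetical AJAX endpoint and response structure, purely for illustration.
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);

if ($content === false) {
    die("Request failed");
}

// true = decode into associative arrays instead of stdClass objects.
$data = json_decode($content, true);

foreach ($data['items'] as $item) {
    echo $item['title'] . PHP_EOL;
}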
However, you are going to have to do this for each AJAX request on the site. Beyond that, you are going to have to use a solution similar to the ones presented [here].
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific AJAX calls, you might be able to get it using file_get_contents. However, if you are trying to scrape an entire site that also uses AJAX to manipulate the document, then you will NOT be able to use SIMPLE_HTML_DOM.
Finally, I worked around my problem. I just took the POST URL with all its parameters from the AJAX call and made the same request using the SIMPLE_HTML_DOM class.
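Roughly, that approach looks like the following sketch; the endpoint and POST parameters are hypothetical and should be copied from the real AJAX request shown in the browser's network tab:

<?php
// simple_html_dom.php provides str_get_html(); include it however you normally do.
require 'simple_html_dom.php';

// Hypothetical POST endpoint and parameters taken from the AJAX call.
$postData = http_build_query([
    'page'   => 2,
    'action' => 'load_items',
]);

$context = stream_context_create([
    'http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => $postData,
    ],
]);

// Replay the same request the page makes via AJAX...
$response = file_get_contents('http://www.example.com/ajax/items.php', false, $context);

// ...and parse the returned HTML fragment with SIMPLE_HTML_DOM.
$html = str_get_html($response);
foreach ($html->find('div.item') as $item) {
    echo $item->plaintext . PHP_EOL;
}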
I'm converting our website to Google AMP, but one of the main pages uses a "load more" button that loads more results using AJAX.
Is this possible in AMP? I couldn't find anything in the documentation.
I need to create a contact form in a Facebook tab that is supposed to store every entry/submission in a MySQL database, but I'm quite stuck.
Doing the HTML part, creating the form and the inputs, wasn't that hard, but the main problem comes from Facebook's iFrame/HTML applications (I used some already-made, popular apps that I found around).
I cannot include PHP in the Facebook tab, because it only has support for HTML/CSS/JS. I tried using but failed, as they are not displaying their source inside the Facebook tabs, so how am I supposed to accomplish this task?
LE: I managed to self-sign a certificate. Now the application starts, as I provided secure URLs, but instead of loading the index.php in the directory I pointed it to, it gives me a 404 from the main site.
The main site is quart.ro; the URL / secure URLs are: http://quart.ro/beautydistrict and https://quart.ro/beautydistrict.
Should I write an .htaccess only for that folder? Or should I change the URLs to point directly to the file, e.g. http/s://quart.ro/beautydistrict/index.php/?
LE2: On the application's page - https://apps.facebook.com/beautydistrict/ - it displays the content correctly (this content - http://quart.ro/beautydistrict). This app is a page tab. If I install the app on a Facebook page, then instead of displaying the content correctly it gives the 404 of the main website (instead of http://quart.ro/beautydistrict I get http://www.quart.ro/404).
You can post your HTML form data to an external PHP file using jQuery and AJAX. Use this guide: http://api.jquery.com/jQuery.post/
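On the PHP side, the external file that receives that POST could store each submission in MySQL roughly like this (a sketch only; the file name, table, columns, and credentials are placeholders):

<?php
// save_contact.php - receives the jQuery.post() request from the Facebook tab.
// Hypothetical database credentials and schema; adjust to your setup.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'dbuser', 'dbpass');

$stmt = $pdo->prepare(
    'INSERT INTO contact_submissions (name, email, message) VALUES (?, ?, ?)'
);

$stmt->execute([$_POST['name'], $_POST['email'], $_POST['message']]);

echo 'ok';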
I have created an AJAX-driven website which can load any page when given the correct parameters. For instance, www.mysite.com/?page=blog&id=7 opens a blog post.
If I create a sitemap with links to all pages within the website, will they be indexed?
Many thanks.
If you provide a URL for each page that will actually display the full page, then yes. If those requests just respond with JSON, or with only part of a page, then no. In reality this is probably a poor design SEO-wise. Each page should have its own URL, e.g. www.mysite.com/unicorns instead of www.mysite.com/?page=blog&id=1, and the links on the page should point to those. Then you should use JavaScript to capture the click events on the AJAX links and update the page however you like. Or better yet, try out PJAX, which loads just the content of a page instead of doing a full page refresh, speeding things up a little without really any changes from your normal site setup.
You do realize that if you make that sitemap, all your search engine links will be ugly.
As Google said, a page can still be crawled with a nice URL if you use a fragment identifier:
<meta name="fragment" content="!"> // for meta fragment
and when you generate your page by AJAX, append the fragment to the URL:
www.mysite.com/#!page=blog-7 //(and split them)
The page should load content directly in PHP by using $_GET['_escaped_fragment_']
From what I've read, Bing and Yahoo started crawling with the same process.
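A minimal server-side sketch of that last step, assuming the crawler rewrites #!page=blog-7 into the _escaped_fragment_ query parameter (the page/ID split and render_page() are hypothetical placeholders):

<?php
// Google's AJAX-crawling scheme turns www.mysite.com/#!page=blog-7
// into www.mysite.com/?_escaped_fragment_=page=blog-7 for the crawler.
if (isset($_GET['_escaped_fragment_'])) {
    // e.g. "page=blog-7" -> page "blog", id "7" (hypothetical format).
    list($key, $value) = explode('=', $_GET['_escaped_fragment_'], 2);
    list($page, $id)   = explode('-', $value, 2);

    // Render the full HTML for this page server-side so the crawler can index it.
    render_page($page, $id); // render_page() is a placeholder for your own code.
    exit;
}

// Otherwise serve the normal AJAX-driven page.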