Basically, a page generates some dynamic content, and I want to get that dynamic content, not just the static HTML. I haven't been able to do this with cURL. Help, please.
You can't with just cURL.
cURL will grab the specific raw (static) files from the site, but to get JavaScript-generated content, you would have to put that content into a browser-like environment that supports JavaScript and all the other host objects the JavaScript uses, so the script can run.
Then once the script runs, you would have to access the DOM to grab whatever content you wanted from it.
This is why most search engines don't index javascript-generated content. It's not easy.
If this is one specific site that you're trying to gather info on, you may want to look into exactly how the site gets the data itself and see if you can't get the data directly from that source. For example, is the data embedded in JS in the page (in which case you can just parse out that JS), or is the data obtained from an Ajax call (in which case you can maybe just make that Ajax call directly), or some other method?
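For the Ajax case, here's a minimal sketch of calling the endpoint directly with cURL (the URL and parameter below are hypothetical; take the real ones from your browser's network tools):

<?php
// Call the site's data endpoint directly instead of scraping the
// rendered page. The endpoint URL here is hypothetical.
$ch = curl_init("http://www.example.com/data/endpoint?item=123");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

// If the endpoint returns JSON, decode it into a PHP array.
$data = json_decode($response, true);
?>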
You could try Selenium (http://seleniumhq.org), which supports JS.
There is a site that I want to scrape: https://tse.ir/MarketWatch.html
I know that I have to use:
file_get_contents("https://examplesite.html")
to get the HTML of the site, but how can I find a specific part of the site? For example, this part in the text file:
<td title="دالبر" class="txt namad">دالبر</td>
When I open the text file, I never see this part, and I think it is because the website uses a JavaScript file. How can I get all the information on the website, including every part I want?
The content is loaded by an Ajax request via JavaScript. This means you can't get this data by simply grabbing the page contents.
There are two ways of collecting the data you need:
Use a solution based on Selenium WebDriver to load the page in a real browser (which will execute the JS), and collect data from the rendered DOM.
Research what kind of requests the website sends to get this data. You can use the network activity tab in your browser's dev tools. Here is an example for Chrome; other browsers are the same or similar. Then you send the same request yourself and parse the response according to your needs.
In your specific case, you could probably use this URL: https://tseest.ir/json/MarketWatch/data_211111.json to access the JSON object with the data you need.
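For example, a minimal sketch using that URL (this assumes the endpoint returns plain JSON; adjust to whatever the Network tab actually shows):

<?php
// Fetch the JSON feed the page itself uses, instead of the HTML.
$json = file_get_contents("https://tseest.ir/json/MarketWatch/data_211111.json");
$data = json_decode($json, true);

if ($data !== null) {
    // Work with the decoded market data here.
    print_r($data);
}
?>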
You have three options for scraping the data:
There's an export to an Excel file: https://tse.ir/json/MarketWatch/MarketWatch_1.xls?1582392259131. Parse through it; just remember that the number in the query string is a Unix timestamp in milliseconds (the first 10 digits are the standard seconds-precision timestamp).
There's also probably a refresh function (or several) for the market data somewhere in the .js files loaded by the page. Find it and see if you can connect directly to the source (usually a .json).
Download the page at your chosen interval and scrape each table row using PHP's DOMXPath::query (see the sketch below).
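A minimal sketch of option 3 (the XPath expression is a guess; adjust it to the page's actual markup, and note this only finds rows present in the initial HTML rather than injected later by Ajax):

<?php
// Download the page and walk its table rows with DOMXPath.
$html = file_get_contents("https://tse.ir/MarketWatch.html");

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query("//table//tr") as $row) {
    $cells = array();
    foreach ($xpath->query(".//td", $row) as $cell) {
        $cells[] = trim($cell->textContent);
    }
    // Each $cells array holds one table row's text.
}
?>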
Please let me know: is it possible to scrape info that is loaded via Ajax using PHP? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the Ajax calls.
Something like Selenium should do the trick. A quick Google search turns up numerous examples of how to set it up. Here is one.
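A minimal sketch with PHP, assuming the php-webdriver package (composer require php-webdriver/webdriver) and a Selenium/ChromeDriver server listening on localhost:4444 — both of which are assumptions beyond what the answer above specifies:

<?php
require "vendor/autoload.php";

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

// A real browser loads the page and executes its JavaScript.
$driver = RemoteWebDriver::create("http://localhost:4444/wd/hub", DesiredCapabilities::chrome());
$driver->get("https://tse.ir/MarketWatch.html");

// The rendered DOM now includes the Ajax-loaded content.
$html = $driver->getPageSource();
$driver->quit();
?>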
Scraping JUST the Ajax calls
Though I wouldn't consider this scraping, you can always examine an Ajax call using your browser's dev tools. In Chrome, while on the site, hit F12 to open up the dev tools console.
Then open the Network tab and hit Chrome's refresh button. This will show every request made between you and the site. You can then filter for specific requests.
For example, if you are interested in Ajax calls, you can select XHR.
You can then click on any of the listed items in the table to get more information.
Using file_get_contents on an AJAX call
Depending on how robust the APIs behind these Ajax calls are, you could do something like the following.
<?php
// Request the Ajax endpoint directly and capture its response.
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the return is JSON, then add:
$data = json_decode($content);
However, you are going to have to do this for each Ajax request on the site. Beyond that, you are going to have to use a solution similar to the ones presented [here].
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific Ajax calls, you might be able to get it using file_get_contents. However, if you are trying to scrape an entire site that also happens to use AJAX to manipulate the document, then you will NOT be able to use SIMPLE_HTML_DOM.
Finally, I worked around my problem. I just took the POST URL with all its parameters from the Ajax call and made the same request using the SIMPLE_HTML_DOM class.
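For anyone trying the same thing, here's a minimal sketch of that workaround (the endpoint URL and POST fields are hypothetical; copy the real ones from the Ajax call shown in the Network tab):

<?php
include "simple_html_dom.php";

// Recreate the Ajax POST request with a stream context.
$context = stream_context_create(array(
    "http" => array(
        "method"  => "POST",
        "header"  => "Content-Type: application/x-www-form-urlencoded\r\n",
        "content" => http_build_query(array("param" => "value")),
    ),
));

$response = file_get_contents("http://www.example.com/ajax/endpoint", false, $context);

// Parse the returned HTML fragment with SIMPLE_HTML_DOM as usual.
$html = str_get_html($response);
?>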
I'm trying to make my AJAX website crawlable:
Here is the website in question.
I've created a htmlsnapshot.php that generates the page (this file needs to be passed the hash fragment to be able to generate the right content).
I don't know how to get the crawler to load this file while getting normal users to load the normal file.
I don't really understand what the crawler does to the hash fragment (and this probably is part of my problem.)
Does anybody have any tips?
The crawler will divert itself. You just need to configure your PHP script to handle the GET parameters that Google will be sending your site (instead of relying on the AJAX).
Basically, when Google finds a link to yourdomain.com/#!something, then instead of requesting / and running the JavaScript to make an AJAX request for the something content, Google will automatically (WITHOUT you doing anything) translate whatever comes after #! in your URL to ?_escaped_fragment_=something.
You just need to (in your PHP script) check if $_GET['_escaped_fragment_'] is set, and if so, display the content for that value of something.
It's actually very easy.
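A minimal sketch of that dispatch in PHP (renderSnapshot() is a hypothetical helper standing in for whatever your htmlsnapshot.php does to build the markup for a given fragment):

<?php
if (isset($_GET['_escaped_fragment_'])) {
    // Google requested yourdomain.com/?_escaped_fragment_=something
    // in place of the pretty URL yourdomain.com/#!something
    echo renderSnapshot($_GET['_escaped_fragment_']); // hypothetical helper
} else {
    // Normal users get the regular AJAX-driven page.
    include "index.html";
}
?>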
I'm new to YQL, and just trying to learn how to do some fairly simple tasks.
Let's say I have a list of URLs and I want to get their HTML source as a string in JavaScript (so I can later insert it into a database via Ajax). How would I go about getting this info back in JavaScript? Or would I have to do it in PHP? I'm fine with either, really - whatever works.
Here are the example queries I'd run on their console:
select * from html where url="http://en.wikipedia.org/wiki/Baroque_music"
And the goal is to essentially save the HTML or maybe just the text or something, as a string.
How would I go about doing this? I somewhat understand how the querying works, but not really how to integrate it with JavaScript and/or PHP (say I have a list of URLs and I want to loop through them, getting the HTML at each one and saving it somewhere).
Thanks.
You can't read other pages with JavaScript due to a built-in security feature in web browsers called the same-origin policy.
The usual method is to scrape the content of these sites from the server using PHP.
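A minimal sketch of that server-side approach, looping over a list of URLs with cURL (where you save the HTML afterwards is up to you):

<?php
$urls = array(
    "http://en.wikipedia.org/wiki/Baroque_music",
    // ... the rest of your list
);

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html !== false) {
        // Save $html (or just its text) to your database here.
    }
}
?>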
There is another option with JavaScript called a bookmarklet.
You can add the bookmarklet to your bookmarks bar, and each time you want the content of a site, click the bookmark.
A script will be loaded into the host page; it can read the content and post it back to your server.
Oddly enough, the same-origin policy does not prevent you from POSTing data from this host page to your domain. You need to POST a form to an iframe whose source is hosted on your domain.
You won't be able to read the response you get back from the POST.
But you can poll with setInterval, making a JSONP call to your domain, to find out whether the POST was successful.
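The receiving end on your domain can be plain PHP. Here's a rough sketch of both endpoints in a single hypothetical collect.php (names and fields are assumptions, not from the answer above): the iframe form POST stores the captured content, and a JSONP GET reports whether it has arrived.

<?php
session_start();

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    // The bookmarklet POSTs the host page's content into an
    // iframe pointed at this script.
    $_SESSION['captured'] = isset($_POST['content']) ? $_POST['content'] : null;
    exit;
}

// JSONP poll from the host page: collect.php?callback=checkDone
$done = isset($_SESSION['captured']);
header("Content-Type: application/javascript");
echo $_GET['callback'] . "(" . json_encode(array("done" => $done)) . ");";
?>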
I have to load an HTML file using jQuery. When I searched Google, I found this snippet:
$("#feeds").load("feeds.html");
But I don't want to load the content of feeds.html into the #feeds element. Instead, I need to load that page entirely. How can I load that page? Please help.
If you're not wanting to load() some HTML into an element on the existing page, maybe you mean that you want to redirect to another page?
url = "feeds.html";
window.location = url;
Or maybe you just want to fill an entire body? You could load() into the body tag if you wanted.
$("body").load("feeds.html");
$.get()
Here's a summary:
Load a remote page using an HTTP GET request. This is an easy way to send a simple GET request to a server without having to use the more complex $.ajax function. It allows a single callback function to be specified that will be executed when the request is complete (and only if the response has a successful response code). If you need to have both error and success callbacks, you may want to use $.ajax.
$.get() returns the XMLHttpRequest that it creates. In most cases you won't need that object to manipulate directly, but it is available if you need to abort the request manually.
Are you sure feeds.html is located under the same domain as your script?
If so, that line is OK, and you should look for the problem elsewhere. Try debugging it with Firebug under the Net panel.
If not, then you can only send JSONP requests to URLs under other domains. But you can easily write a proxy script with PHP (I noticed the php tag under your question :) ) to load your content with cURL. Take a look at the CatsWhoCode tutorial, point 9; hope it'll help.
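A minimal sketch of such a proxy (in production you would whitelist the allowed hosts rather than proxy arbitrary URLs):

<?php
// proxy.php?url=http://example.com - fetches a remote page with
// cURL and serves it from your own domain, so the same-origin
// policy no longer gets in the way.
$url = isset($_GET['url']) ? $_GET['url'] : '';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
echo curl_exec($ch);
curl_close($ch);
?>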