I am trying to scrape a website for article titles. However, the page only loads the first five titles and loads more when the user scrolls down (a JSON call fetches more articles and injects them into the page).
The web scraper that I built works perfectly, but it only finds the first 5 default articles, and I want to load more than 5. Is there any way of achieving that using PHP? If you can explain to me why/how it works I would really appreciate it, because I love to learn these things.
You can use Chrome's network monitor to find the source of the AJAX requests and then request those URLs from your web scraper. This really is a makeshift API, though, and it will break if the site changes its JSON format. You can use the PHP function json_decode to decode the JSON.
http://php.net/manual/en/function.json-decode.php
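As a minimal sketch of the decoding step: the response shape below (an `articles` list with a `title` field) and the helper name `extractTitles` are assumptions for illustration only. The real field names have to be read from the response shown in the network monitor.

```php
<?php
// Hypothetical response body; every site names its fields differently.
$json = '{"articles":[{"title":"First"},{"title":"Second"}],"next_page":2}';

/**
 * Returns the list of titles from a JSON body, or an empty array if the
 * body is not valid JSON or does not have the assumed shape.
 */
function extractTitles(string $json): array
{
    // true => decode JSON objects into associative arrays
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['articles']) || !is_array($data['articles'])) {
        return [];
    }
    // Pull the "title" value out of every article entry
    return array_column($data['articles'], 'title');
}

print_r(extractTitles($json));
```

Returning an empty array on unexpected input keeps the scraper from crashing when the site changes its JSON format, which is the main risk of this approach.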
To retrieve the data in the first place, you can use file_get_contents
http://php.net/manual/en/function.file-get-contents.php
but called with only a URL it issues a plain GET request.
If you want more "advanced" options (like POST or custom headers), look into cURL
http://php.net/manual/en/book.curl.php
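A rough sketch of replaying an AJAX POST with cURL could look like this. The function name `postForm`, the example URL and the field names `page` and `limit` are all hypothetical; the real ones come from the request you see in the network monitor. It assumes the cURL extension is installed.

```php
<?php
/**
 * Sends a form-encoded POST and returns the response body,
 * or null if the request failed.
 */
function postForm(string $url, array $fields, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // e.g. "page=2&limit=5"
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        // Some endpoints only answer requests that look like AJAX calls
        CURLOPT_HTTPHEADER     => ['X-Requested-With: XMLHttpRequest'],
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical usage (not executed here because it needs network access):
// $json = postForm('http://www.example.com/articles/more', ['page' => 2, 'limit' => 5]);
// $data = $json === null ? null : json_decode($json, true);
```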
Related
There is a site that I want to scrape: https://tse.ir/MarketWatch.html
I know that I have to use:
file_get_contents("https://examplesite.html")
to get the HTML of the site, but how can I find a specific part of it, for example this part, in the text file:
<td title="something" class="someclass">Name</td>
When I open the text file, I never see this part, and I think that is because the website uses a JavaScript file. How can I get all the information on the website, including every part I want?
The content is loaded by an AJAX request via JavaScript. This means you can't get this data by simply grabbing the page contents.
There are two ways of collecting the data you need:
Use a solution based on Selenium WebDriver to load the page in a real browser (which will execute the JS), and collect the data from the rendered DOM.
Research what kind of requests the website sends to get this data. You can use the network activity tab in the browser dev tools; in Chrome it is the Network tab, and other browsers are the same or similar. Then you send the same request and parse the response according to your needs.
In your specific case, you could probably use this URL: https://tseest.ir/json/MarketWatch/data_211111.json to access the JSON object with the data you need.
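The second approach can be sketched with file_get_contents and a stream context, assuming the endpoint answers a plain GET with JSON. The helper names `ajaxContext` and `fetchJson`, the headers and the user-agent string are illustrative assumptions; copy the real headers from the request shown in the dev tools if the site requires them.

```php
<?php
/**
 * Builds a stream context that makes the request look like the browser's
 * AJAX call. The header values here are assumptions, not site requirements.
 */
function ajaxContext(string $userAgent = 'Mozilla/5.0 (compatible; ExampleScraper/1.0)')
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'header'     => "Accept: application/json\r\nX-Requested-With: XMLHttpRequest\r\n",
            'user_agent' => $userAgent,
            'timeout'    => 10,
        ],
    ]);
}

/**
 * Fetches a URL and decodes the body as JSON.
 * Returns null if the request fails or the body is not a JSON object/array.
 */
function fetchJson(string $url): ?array
{
    // @ suppresses the warning on failure; the false return is checked instead
    $body = @file_get_contents($url, false, ajaxContext());
    if ($body === false) {
        return null;
    }
    $data = json_decode($body, true);
    return is_array($data) ? $data : null;
}

// Hypothetical usage, with the URL taken from the network tab:
// $rows = fetchJson('https://www.example.com/json/data.json');
```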
You have three options for scraping the data:
There's an export to an Excel file: https://tse.ir/json/MarketWatch/MarketWatch_1.xls?1582392259131. Parse through it; just remember that the number after the ? is a Unix timestamp in milliseconds, so its first 10 digits are the seconds since the epoch, which encode the date and time of the request (here 2020-02-22 17:24:19 UTC).
Also there's probably a refresh function(s) for the market data somewhere in all .js files loaded in the page. Just find it and see if you can connect directly to the source (usually a .json)
Download the page at your specific interval and scrape each table row using PHP's DOMXPath::query
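For the third option, a minimal sketch of reading table cells with DOMXPath::query might look like the following. The sample markup and the helper name `cellsByClass` are made up for illustration, and this only works on HTML that is actually present in the downloaded page, not on rows injected later by JavaScript.

```php
<?php
// Made-up sample; in practice this would be the downloaded page source.
$html = '<table><tr>'
      . '<td title="something" class="someclass">Name</td>'
      . '<td class="other">1</td>'
      . '</tr></table>';

/**
 * Returns the title attribute and text of every <td> whose class attribute
 * is exactly $class. $class must not contain double quotes.
 */
function cellsByClass(string $html, string $class): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings silently
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $cells = [];
    foreach ($xpath->query('//td[@class="' . $class . '"]') as $td) {
        $cells[] = [
            'title' => $td->getAttribute('title'),
            'text'  => trim($td->textContent),
        ];
    }
    return $cells;
}

print_r(cellsByClass($html, 'someclass'));
```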
Please let me know: is it possible to scrape some info with PHP after it has been loaded via AJAX? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the AJAX calls.
Something like Selenium should do the trick. A quick Google search turns up numerous examples of how to set it up.
Scraping JUST the Ajax calls
Though I wouldn't consider this scraping, you can always examine an AJAX call using your browser's dev tools. In Chrome, while on the site, hit F12 to open the dev tools.
Select the Network tab and then hit Chrome's refresh button. This will show every request made between you and the site. You can then filter for specific requests.
For example, if you are interested in AJAX calls you can select XHR.
You can then click on any of the listed items in the table to get more information.
File get content on AJAX call
Depending on how robust the APIs are on these ajax calls you could do something like the following.
<?php
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the return is JSON then add
$data = json_decode($content);
However, you are going to have to do this for each AJAX request on a site. Beyond that you are going to have to use a solution similar to the ones presented [here].
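If the endpoint happens to be paged, repeating the call can be a simple loop. This is only a sketch: the `page` query parameter, the `articles`/`title` response shape and the helper name `collectTitles` are assumptions, and the fetcher is passed in as a callable so the loop can be tried without network access.

```php
<?php
/**
 * Collects titles page by page until the endpoint returns an empty page.
 * $fetchPage takes a page number and returns the raw JSON body as a string.
 */
function collectTitles(callable $fetchPage, int $maxPages = 20): array
{
    $titles = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $data = json_decode($fetchPage($page), true);
        // Assumed shape: {"articles": [{"title": "..."}, ...]}
        $articles = is_array($data) ? ($data['articles'] ?? []) : [];
        if ($articles === []) {
            break; // nothing more to load
        }
        foreach ($articles as $article) {
            $titles[] = $article['title'];
        }
    }
    return $titles;
}

// Stand-in fetcher for demonstration. A real one might return
// file_get_contents("http://www.example.com/test.php?ajax=call&page=" . $page).
$fakePages = [
    1 => '{"articles":[{"title":"A"},{"title":"B"}]}',
    2 => '{"articles":[{"title":"C"}]}',
];
$fetchPage = function (int $page) use ($fakePages): string {
    return $fakePages[$page] ?? '{"articles":[]}';
};

print_r(collectTitles($fetchPage));
```

The `$maxPages` cap is there so a misbehaving endpoint cannot keep the loop running forever.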
Finally you can also implement PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific ajax calls you might be able to get them using file_get_contents. However, if you are trying to scrape the entire site that happens to also use AJAX to manipulate the document then you will NOT be able to use SIMPLE_HTML_DOM.
Finally I worked around my problem: I took the POST URL with all its parameters from the AJAX call and made the same request using the SIMPLE_HTML_DOM class.
I'm new to YQL, and just trying to learn how to do some fairly simple tasks.
Let's say I have a list of URLs and I want to get their HTML source as a string in javascript (so I can later insert it to a database via ajax). How would I go about getting this info back in Javascript? Or would I have to do it in PHP? I'm fine with either, really - whatever can work.
Here's the example queries I'd run on their console:
select * from html where url="http://en.wikipedia.org/wiki/Baroque_music"
And the goal is to essentially save the HTML or maybe just the text or something, as a string.
How would I go about doing this? I somewhat understand how the querying works, but not really how to integrate with javascript and/or php (say I have a list of URLs and I want to loop through them, getting the html at each one and saving it somewhere).
Thanks.
You can't read other pages with Javascript due to a built-in security feature in web browsers. It is called the Same origin policy.
The usual method is to scrape the content of these sites from the server using PHP.
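A server-side version of the loop from the question could be sketched like this. The helper names `fetchAll` and `htmlToText` are hypothetical, and the fetcher is injectable so the loop can be tested with canned HTML instead of live requests.

```php
<?php
/**
 * Fetches each URL and returns [url => html, or null on failure].
 * By default file_get_contents is used; pass $fetch to override it.
 */
function fetchAll(array $urls, ?callable $fetch = null): array
{
    $fetch = $fetch ?? function (string $url) {
        $html = @file_get_contents($url);
        return $html === false ? null : $html;
    };
    $pages = [];
    foreach ($urls as $url) {
        $pages[$url] = $fetch($url);
    }
    return $pages;
}

/**
 * Reduces a page to plain text, e.g. for storing in a database column.
 */
function htmlToText(string $html): string
{
    // Drop script/style bodies first, since strip_tags() keeps their text
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace into single spaces
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Hypothetical usage (needs network access):
// $pages = fetchAll(['http://en.wikipedia.org/wiki/Baroque_music']);
// foreach ($pages as $url => $html) { /* save $html or htmlToText($html) */ }
```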
There is another option with JavaScript called a bookmarklet.
You can add the bookmarklet in your bookmarks bar, and each time you want the content of a site click the bookmark.
A script will be loaded in the host page, it can read the content and post it back to your server.
Oddly enough, the same origin policy does not prevent you from POSTing data from this host page to your domain. You need to POST a FORM to an IFRAME that has a source hosted on your domain.
You won't be able to read the response you get back from the POST.
But you can poll with a setInterval making a JSONP call to your domain to know if the POST was successful.
Can someone tell me what happens when I enter a link into the Facebook Status Update form and it loads up a mini info box for the website (I'm guessing it's RSS or something)?
How do I implement this on my site using PHP?
What do I need to learn to be able to implement that?
It scrapes the page you are linking to. It doesn't have anything to do with RSS.
By looking at the HTML of the page it can get the page title for you and find all the images that can be used as a thumbnail.
Take a look at HTTP or cURL in the PHP manual for methods to get webpage content.
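Once the page has been fetched, pulling out the title and candidate thumbnails can be sketched with DOMDocument and DOMXPath. The function name `extractPreview` and the sample markup are illustrative only; this is not how Facebook itself does it, just one way to get a similar result.

```php
<?php
/**
 * Extracts the page title and all image URLs from an HTML string.
 */
function extractPreview(string $html): array
{
    $doc = new DOMDocument();
    // Tolerate invalid markup without printing parser warnings
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // string(...) yields the text of the first <title>, or '' if there is none
    $title = trim($xpath->evaluate('string(//title)'));

    $images = [];
    foreach ($xpath->query('//img[@src]') as $img) {
        $images[] = $img->getAttribute('src');
    }
    return ['title' => $title, 'images' => $images];
}

// Made-up sample page; in practice $html comes from cURL or file_get_contents.
$html = '<html><head><title> Example page </title></head>'
      . '<body><img src="/a.png"><img alt="no source"><img src="/b.jpg"></body></html>';
print_r(extractPreview($html));
```

Relative image paths such as `/a.png` would still need to be resolved against the page URL before they can be shown as thumbnails.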
For one project, I need to get the source (HTML) of a Facebook page via a PHP application.
I have tried a lot of methods, like cURL, file_get_contents, changing my ini_set, etc., but Facebook never lets me get the HTML result.
Can anyone help?
For example, this page:
ini_set('user_agent', $_SERVER['HTTP_USER_AGENT']);
$data = file_get_contents("http://apps.facebook.com/is_cool/?cafe_action=album&view=scroll",0);
Print strip_tags($data,"");
Thanks a lot.
Damien
Comment 1 :
- I need to create two applications. I want to parse the HTML code to pass some information from one to the other. I don't want to duplicate or take the Facebook code. I just want to do a "view source" (like in IE or Firefox) and save it to a file, without asking my users. When my user is logged in to my first application, I just want to use his credentials to get the other content.
The reason you're having problems is that the majority of the facebook homepage content is loaded via AJAX. The data is not hardcoded into what your browser renders.
You should think of a different way to accomplish your goals. If you tell us a little more about what you're trying to do, we can probably help you find an alternate method.