php crawler for website with ajax content and https

I'm trying to grab the content of a website that loads its content via Ajax and is served over HTTPS, but with no luck.
Is this possible?
The website I'm trying to crawl is this:
https://www.bet3000.com/en/html/home.html#!https://www.bet3000.com/html/en/eventssportsbook.html?category_id=2117
Thanks

If you take a look at the HTTP requests this page makes (using, for example, Firebug for Firefox), you'll notice that it sends several Ajax requests.
Instead of trying to execute the JavaScript code, a possible solution is to request one of those URLs yourself and get the data directly. This way you also don't have to parse the HTML.
In this specific case, one of those requests is made to the following URL:
https://www.bet3000.com/ajax/en/sportsbook.json.html?category_id=2117&offset=&live=&sportsbook_id=0
This URL seems to return some JSON data that should interest you quite a bit ;-)
(There are a few characters before and after the JSON that will need to be removed but, aside from that, I don't see anything that doesn't look good.)
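As a minimal sketch of the "remove the extra characters, then decode" step: the helper below is hypothetical (the name decode_padded_json is not from the site or any library), and it assumes the padding contains no braces or brackets of its own, which has not been checked against the real response.

```php
<?php
// Hypothetical helper: strips whatever surrounds a JSON payload and decodes it.
// Assumption: the padding itself contains no '{', '[', '}' or ']' characters.
function decode_padded_json(string $raw): ?array
{
    // Offset of the first '{' or '[' (strcspn counts the characters before it).
    $start = strcspn($raw, '{[');
    if ($start === strlen($raw)) {
        return null; // no JSON-looking content at all
    }

    // The payload should end with the bracket that matches its first character.
    $close = $raw[$start] === '{' ? '}' : ']';
    $end = strrpos($raw, $close);
    if ($end === false || $end < $start) {
        return null;
    }

    $data = json_decode(substr($raw, $start, $end - $start + 1), true);
    return is_array($data) ? $data : null;
}
```

The raw body would typically come from curl_exec() with CURLOPT_RETURNTRANSFER enabled; cURL handles the HTTPS part as long as it was built with SSL support.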

Related

Get response from ajax and parse the response with html on webpage

I need to run a search using Ajax such that the response contains only data, with no HTML. The web page should then fetch that response, combine it with the HTML already on the page, and display it.
I want to know whether this can be done and, if so, how. Would it also make the process run faster or consume fewer resources on the server?

Since you have made no attempt to code this, I will give you a couple of pointers.
1.) It is very possible; I do it on login forms.
2.) Post the data to an external page, then encode the response on that page as a JSON array and echo the JSON out on the external page.
3.) After your Ajax post is finished, you can run a callback similar to this:
function(data){alert(data.given_name_on_external_page)}
or something similar. Once you google around for Ajax form examples you should be able to grasp it a little better.
4.) Displaying this on a web page is fairly easy.
HTML
<div id='response'></div>
JavaScript
function(data){document.getElementById('response').innerHTML = data.data;}
That should be enough for you to understand what needs to be done. I will leave the rest to you and your ability to use Google :).
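To illustrate step 2, here is a rough sketch of what the "external page" could look like in PHP. The function names, the 'data' key (chosen to match data.data in the callback above) and the 'q' POST field are all assumptions for illustration, not part of any existing code.

```php
<?php
// Hypothetical search endpoint that returns data only, no HTML.
function search_items(array $items, string $query): array
{
    // Case-insensitive substring match. array_values() reindexes the result
    // so that json_encode() emits a JSON array rather than an object.
    return array_values(array_filter(
        $items,
        fn(string $item): bool => $query !== '' && stripos($item, $query) !== false
    ));
}

function json_response(array $results): string
{
    // The browser-side callback would read this as data.data.
    return json_encode(['data' => $results]);
}

// In the real endpoint the query would come from the Ajax POST, e.g.:
// header('Content-Type: application/json');
// echo json_response(search_items($items, $_POST['q'] ?? ''));
```

Sending only JSON keeps the response small; whether it is noticeably faster depends mostly on how expensive the search itself is.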

Use escape() together with jQuery's load() function.

How to interact with page elements while crawling a website with PHP?

I need to go to http://butlercountyclerk.org/bcc-11112005/ForeclosureSearch.aspx, enter data in the fields, then click the button to get results. When taken to the result page, I'm given a table of data but it's paginated into 5 different pages.
I'm able to do the above using cURL, but it's at this point that I get stuck.
Once I'm on the result page, I need to click the "date" header twice to make the data order by decreasing date, then skim off the current day's results.
Any idea how to do this, either in detail or in concept? Either way should help.
Thanks!

The problem is that the click actually performs a postback using JavaScript. Given the limitations of PHP and cURL, you will need to inspect the HTTP requests sent by the browser (GET and POST parameters and cookies) and emulate them, bearing in mind that some values might be session-dependent. Right now I don't have time to do this for you, but I know it can be quite tricky with ASP.NET websites in some cases. There might be easier ways to do it, but that's what it will always come down to, because that's what happens under the hood.
If you weren't tied to PHP, a whole world of options would open up. For example, the aggregator in the project I'm working on is capable of executing (controlled) JavaScript specifically for these kinds of tasks and pages (albeit on a grander scale).
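One way to emulate such a postback, sketched under assumptions: ASP.NET WebForms pages usually carry their state in hidden inputs such as __VIEWSTATE, and a header click is submitted as a form POST with __EVENTTARGET set. The helper name below is hypothetical, and the exact set of fields this particular site expects has not been verified.

```php
<?php
// Hypothetical helper: collects the hidden <input> fields of a page
// (such as __VIEWSTATE) and sets __EVENTTARGET / __EVENTARGUMENT so that
// the resulting POST mimics a JavaScript-triggered postback.
function build_postback_fields(string $html, string $eventTarget, string $eventArgument = ''): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $fields = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"]') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }

    $fields['__EVENTTARGET'] = $eventTarget;
    $fields['__EVENTARGUMENT'] = $eventArgument;
    return $fields;
}
```

The result could be sent with curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields)), reusing the same cookie file via CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE. Visible form fields (the month, year and name inputs) would have to be added to the array as well.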

I can't get a working set of results - if you could post some dummy data that gives results, that would help.
As a generic answer, you need something that can manipulate the DOM. You can go server-side with something like PHP and Webdriver, or purely client-side with Selenium. Simulate the click, get the resulting HTML and parse that.
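If only the current day's rows are needed, an alternative to simulating the sort click is to fetch each result page and filter the rows locally. The sketch below shows the parsing half only; the function name, the column position of the date and the m/d/yyyy date format are assumptions that would need checking against the real result table.

```php
<?php
// Hypothetical helper: returns the cells of every table row whose date
// column equals $date. $dateColumn is a zero-based index and an assumption.
function rows_for_date(string $html, string $date, int $dateColumn = 1): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->textContent);
        }
        // Header rows (using <th>) and short rows are skipped automatically.
        if (isset($cells[$dateColumn]) && $cells[$dateColumn] === $date) {
            $rows[] = $cells;
        }
    }
    return $rows;
}
```

Calling this once per result page and merging the arrays avoids depending on the server-side sort order at all.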

This should work. Try this:
$url = 'http://butlercountyclerk.org/bcc-11112005/ForeclosureSearch.aspx';
## Request this URL with cURL, with cookies enabled.
## After that, append the postback parameters:
$url =$url.'?'.'__EVENTTARGET=Search%3AdgSearch%3A_ctl2%3A_ctl1&__EVENTARGUMENT=&__VIEWSTATE=dDwtMjk2Mjk5NzczO3Q8O2w8aTwxPjs%2BO2w8dDw7bDxpPDE%2BOz47bDx0PDtsPGk8Mz47aTwxNz47aTwxOT47PjtsPHQ8dDw7cDxsPGk8MD47aTwxPjtpPDI%2BO2k8Mz47aTw0PjtpPDU%2BOz47bDxwPDIwMDY7MjAwNj47cDwyMDA3OzIwMDc%2BO3A8MjAwODsyMDA4PjtwPDIwMDk7MjAwOT47cDwyMDEwOzIwMTA%2BO3A8MjAxMTsyMDExPjs%2BPjs%2BOzs%2BO3Q8cDxwPGw8VmlzaWJsZTs%2BO2w8bzx0Pjs%2BPjs%2BOzs%2BO3Q8QDA8cDxwPGw8Q3VycmVudFBhZ2VJbmRleDtQYWdlQ291bnQ7XyFJdGVtQ291bnQ7XyFEYXRhU291cmNlSXRlbUNvdW50O0RhdGFLZXlzOz47bDxpPDA%2BO2k8ND47aTwxMD47aTw0MD47bDw%2BOz4%2BOz47Ozs7Ozs7Ozs7PjtsPGk8MD47PjtsPHQ8O2w8aTwyPjtpPDM%2BO2k8ND47aTw1PjtpPDY%2BO2k8Nz47aTw4PjtpPDk%2BO2k8MTA%2BO2k8MTE%2BOz47bDx0PDtsPGk8MD47aTwxPjtpPDI%2BO2k8Mz47aTw0Pjs%2BO2w8dDw7bDxpPDA%2BOz47bDx0PHA8cDxsPFRleHQ7TmF2aWdhdGVVcmw7PjtsPENWIDIwMTEgMDUgMTQzNjtodHRwOi8vd3d3LmJ1dGxlcmNvdW50eWNsZXJrLm9yZy9wYS9wYS51cmQvcGFtdzIwMDAtb19jYXNlX3N1bT8xNjE3NzE0OSAgICAgICAgICAgIDs%2BPjs%2BOzs%2BOz4%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8NS8zLzIwMTE7Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPFNVTlRSVVNUIE1PUlRHQUdFIElOQzs%2BPjs%2BOzs%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8TkFUSEFOSUVMIEdBQkJBUkQ7Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPDEzNTQgVkFOREVSVkVFUiBBVkUgSEFNSUxUT04sIE9IIDQ1MDExOz4%2BOz47Oz47Pj47dDw7bDxpPDA%2BO2k8MT47aTwyPjtpPDM%2BO2k8ND47PjtsPHQ8O2w8aTwwPjs%2BO2w8dDxwPHA8bDxUZXh0O05hdmlnYXRlVXJsOz47bDxDViAyMDExIDA1IDE0MTU7aHR0cDovL3d3dy5idXRsZXJjb3VudHljbGVyay5vcmcvcGEvcGEudXJkL3BhbXcyMDAwLW9fY2FzZV9zdW0%2FMTk2MzQ4ODUgICAgICAgICAgICA7Pj47Pjs7Pjs%2BPjt0PHA8cDxsPFRleHQ7PjtsPDUvMi8yMDExOz4%2BOz47Oz47dDxwPHA8bDxUZXh0Oz47bDxUSElSRCBGRURFUkFMIFNBVklOR1MgQU5EIExPQU4gQVNTTiBPRiBDTEVWRUxBTkQ7Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPEdBWUxFIE5BU0g7Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPDg5NDEgQ09YIFJEIFdFU1QgQ0hFU1RFUiwgT0ggNDUwNjk7Pj47Pjs7Pjs%2BPjt0PDtsPGk8MD47aTwxPjtpPDI%2BO2k8Mz47aTw0Pjs%2BO2w8dDw7bDxpPDA%2BOz47bDx0PHA8cDxsPFRleHQ7TmF2aWdhdGVVcmw7PjtsPENWIDIwMTEgMDUgMTUwMztodHRwOi8vd3d3LmJ1dGxlcmNvdW50eWNsZXJrLm9yZy9wYS9wYS51cmQvcGFtdzIwMDAtb19jYXNlX3N1bT8yMjY1MTY
xMiAgICAgICAgICAgIDs%2BPjs%2BOzs%2BOz4%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8NS85LzIwMTE7Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPFUgUyBCQU5LIE4gQTs%2BPjs%2BOzs%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8TE9VSVMgTUlSTUFOOz4%2BOz47Oz47dDxwPHA8bDxUZXh0Oz47bDw2OTkxIEdBUlkgTEVFIERSIFdFU1QgQ0hFU1RFUiwgT0ggNDUwNjk7Pj47Pjs7Pjs%2BPjt0PDtsPGk8MD47aTwxPjtpPDI%2BO2k8Mz47aTw0Pjs%2BO2w8dDw7bDxpPDA%2BOz47bDx0PHA8cDxsPFRleHQ7TmF2aWdhdGVVcmw7PjtsPENWIDIwMTEgMDUgMTQ5MjtodHRwOi8vd3d3LmJ1dGxlcmNvdW50eWNsZXJrLm9yZy9wYS9wYS51cmQvcGFtdzIwMDAtb19jYXNlX3N1bT8yMzk3NTc5MiAgICAgICAgICAgIDs%2BPjs%2BOzs%2BOz4%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8NS82LzIwMTE7Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPEZJRlRIIFRISVJEIE1PUlRHQUdFIENPOz4%2BOz47Oz47dDxwPHA8bDxUZXh0Oz47bDxSQVlNT05EIFNURUlOOz4%2BOz47Oz47dDxwPHA8bDxUZXh0Oz47bDwyMzU5IFRIUlVTSCBBVkUgRkFJUkZJRUxELCBPSCA0NTAxNDs%2BPjs%2BOzs%2BOz4%2BO3Q8O2w8aTwwPjtpPDE%2BO2k8Mj47aTwzPjtpPDQ%2BOz47bDx0PDtsPGk8MD47PjtsPHQ8cDxwPGw8VGV4dDtOYXZpZ2F0ZVVybDs%2BO2w8Q1YgMjAxMSAwNSAxNDM4O2h0dHA6Ly93d3cuYnV0bGVyY291bnR5Y2xlcmsub3JnL3BhL3BhLnVyZC9wYW13MjAwMC1vX2Nhc2Vfc3VtPzI0NzgyOTYzICAgICAgICAgICAgOz4%2BOz47Oz47Pj47dDxwPHA8bDxUZXh0Oz47bDw1LzMvMjAxMTs%2BPjs%2BOzs%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8V0VMTFMgRkFSR08gQkFOSyBOIEE7Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPEpBTkVUIEJPRUhNOz4%2BOz47Oz47dDxwPHA8bDxUZXh0Oz47bDw4NjA4IEdPTERGSU5DSCBXQVkgV0VTVCBDSEVTVEVSLCBPSCA0NTA2OTs%2BPjs%2BOzs%2BOz4%2BO3Q8O2w8aTwwPjtpPDE%2BO2k8Mj47aTwzPjtpPDQ%2BOz47bDx0PDtsPGk8MD47PjtsPHQ8cDxwPGw8VGV4dDtOYXZpZ2F0ZVVybDs%2BO2w8Q1YgMjAxMSAwNSAxNDQwO2h0dHA6Ly93d3cuYnV0bGVyY291bnR5Y2xlcmsub3JnL3BhL3BhLnVyZC9wYW13MjAwMC1vX2Nhc2Vfc3VtPzI1NTkwMjAzICAgICAgICAgICAgOz4%2BOz47Oz47Pj47dDxwPHA8bDxUZXh0Oz47bDw1LzQvMjAxMTs%2BPjs%2BOzs%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8RklGVEggVEhJUkQgQkFOSzs%2BPjs%2BOzs%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8VEhFT0RPUkUgQ09PSzs%2BPjs%2BOzs%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8UE8gQk9YIDE3MTEgV0VTVCBDSEVTVEVSLCBPSCA0NTA3MTs%2BPjs%2BOzs%2BOz4%2BO3Q8O2w8aTwwPjtpPDE%2BO2k8Mj47aTwzPjtpPDQ%2BOz47bDx0PDtsPGk8MD47PjtsPHQ8cDxwPGw8VGV4dDtOYXZpZ2F0ZVVybDs%2BO2w8Q1YgMjAxMSAwNSAxN
DkwO2h0dHA6Ly93d3cuYnV0bGVyY291bnR5Y2xlcmsub3JnL3BhL3BhLnVyZC9wYW13MjAwMC1vX2Nhc2Vfc3VtPzI2ODY3MDkxICAgICAgICAgICAgOz4%2BOz47Oz47Pj47dDxwPHA8bDxUZXh0Oz47bDw1LzYvMjAxMTs%2BPjs%2BOzs%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8Q0lUSUZJTkFOQ0lBTCBJTkM7Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPERPTk5BIE1BUkRJUzs%2BPjs%2BOzs%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8NjU0OSBDQU5BU1RPVEEgRFJJVkUgSEFNSUxUT04sIE9IIDQ1MDExOz4%2BOz47Oz47Pj47dDw7bDxpPDA%2BO2k8MT47aTwyPjtpPDM%2BO2k8ND47PjtsPHQ8O2w8aTwwPjs%2BO2w8dDxwPHA8bDxUZXh0O05hdmlnYXRlVXJsOz47bDxDViAyMDExIDA1IDE0Njg7aHR0cDovL3d3dy5idXRsZXJjb3VudHljbGVyay5vcmcvcGEvcGEudXJkL3BhbXcyMDAwLW9fY2FzZV9zdW0%2FMjk4NzU2MDIgICAgICAgICAgICA7Pj47Pjs7Pjs%2BPjt0PHA8cDxsPFRleHQ7PjtsPDUvNS8yMDExOz4%2BOz47Oz47dDxwPHA8bDxUZXh0Oz47bDxDSVRJTU9SVEdBR0UgSU5DOz4%2BOz47Oz47dDxwPHA8bDxUZXh0Oz47bDxNQVRUSEVXIEJMVU5ERUxMOz4%2BOz47Oz47dDxwPHA8bDxUZXh0Oz47bDwxNDEyIEhFTE1BIEFWRSBIQU1JTFRPTiwgT0ggNDUwMTM7Pj47Pjs7Pjs%2BPjt0PDtsPGk8MD47aTwxPjtpPDI%2BO2k8Mz47aTw0Pjs%2BO2w8dDw7bDxpPDA%2BOz47bDx0PHA8cDxsPFRleHQ7TmF2aWdhdGVVcmw7PjtsPENWIDIwMTEgMDUgMTQzMjtodHRwOi8vd3d3LmJ1dGxlcmNvdW50eWNsZXJrLm9yZy9wYS9wYS51cmQvcGFtdzIwMDAtb19jYXNlX3N1bT8zMjI0MzYxNyAgICAgICAgICAgIDs%2BPjs%2BOzs%2BOz4%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8NS8zLzIwMTE7Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPFdFTExTIEZBUkdPIEJBTksgTiBBOz4%2BOz47Oz47dDxwPHA8bDxUZXh0Oz47bDxKT0hOIEJPV01BTjs%2BPjs%2BOzs%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8Jm5ic3BcOzs%2BPjs%2BOzs%2BOz4%2BO3Q8O2w8aTwwPjtpPDE%2BO2k8Mj47aTwzPjtpPDQ%2BOz47bDx0PDtsPGk8MD47PjtsPHQ8cDxwPGw8VGV4dDtOYXZpZ2F0ZVVybDs%2BO2w8Q1YgMjAxMSAwNSAxNDYzO2h0dHA6Ly93d3cuYnV0bGVyY291bnR5Y2xlcmsub3JnL3BhL3BhLnVyZC9wYW13MjAwMC1vX2Nhc2Vfc3VtPzQyMjcwMTE5ICAgICAgICAgICAgOz4%2BOz47Oz47Pj47dDxwPHA8bDxUZXh0Oz47bDw1LzQvMjAxMTs%2BPjs%2BOzs%2BO3Q8cDxwPGw8VGV4dDs%2BO2w8VSBTIEJBTksgTkFUSU9OQUwgQVNTT0NJQVRJT047Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPEJSWUFOIFNDSE1JRFQ7Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPDI4OTUgV0VFUElORyBXSUxMT1cgRFJJVkUgSEFNSUxUT04sIE9IIDQ1MDExOz4%2BOz47Oz47Pj47Pj47Pj47Pj47Pj47Pj47PtVTse1TdIXrxq%2FXrY%2Fp22QQ7pAh&Search%3AddlMonth=
5&Search%3AddlYear=2011&Search%3AtxtCompanyName=&Search%3AtxtLastName=&Search%3AtxtCaseNumber=';
## Request the new URL with cURL again, with cookies still enabled.

How to parse content loaded by javascript after the dom is complete

I have been working on parsing some of the data from the WoW Armory and have run into a bit of a snag. When the site serves up the achievements that players have received, it uses JavaScript to interpret a string such as #73:1283 to display the requested information. (I made this number up, but the data for the requests is formatted like this.)
Is it possible to pull data from a page that requires javascript to display its data with php?
How do you parse data from a site that has been loaded after the dom is ready or complete using php?

By using Firebug, I was able to look at the HTTP headers to see what AJAX calls were being made to generate the content on these pages: http://us.battle.net/wow/en/character/black-dragonflight/glitchshot/achievement#96:14861 and http://us.battle.net/wow/en/character/black-dragonflight/glitchshot/achievement#96
It looks like the page makes an asynchronous call to load this page: http://us.battle.net/wow/en/character/black-dragonflight/glitchshot/achievement/14861 when the part after the hash is 96:14861, and a call to http://us.battle.net/wow/en/character/black-dragonflight/glitchshot/achievement/96 when the part after the hash is just 96. Both of those pages return XML that can be parsed to render HTML.
So generally speaking, if there's just one number after the hash, just put http://.../achievement/<number here> as the URL. If there are two numbers, put the second number at the end of the URL instead.
What you'll need to do, rather than pulling the Javascript and interpreting it, is make HTTP requests to those URLs by yourself in PHP (using cURL, for example) and parse the data on your own.
I would really recommend learning JavaScript and jQuery, since it will be very hard for you to really build a good site that pulls information from the WoW Armory without understanding all the AJAX loads that are going on in the background.
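The rule described above (one number after the hash: use it; two numbers: use the second) could be sketched in PHP like this. The function name is hypothetical, and the mapping is only what was observed for these two URLs, not a documented API.

```php
<?php
// Hypothetical helper: turns a page URL with a "#96" or "#96:14861" fragment
// into the URL that the page appears to request in the background.
function achievement_url(string $pageUrl): ?string
{
    $hashPos = strpos($pageUrl, '#');
    if ($hashPos === false) {
        return null; // nothing after a hash to translate
    }
    $base = substr($pageUrl, 0, $hashPos);
    $fragment = substr($pageUrl, $hashPos + 1);

    // Accept "number" or "number:number" only.
    if (!preg_match('/^(\d+)(?::(\d+))?$/', $fragment, $m)) {
        return null;
    }
    // With two numbers the second one goes on the URL, otherwise the first.
    $id = $m[2] ?? $m[1];
    return $base . '/' . $id;
}
```

The returned URL can then be fetched with cURL and the XML response parsed, for example with SimpleXML or DOMDocument.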

I would recommend seeing if you can replicate in PHP the query sent by the JavaScript. I don't believe there is a way to process JavaScript in PHP, and there is definitely no simple or scalable way.
I would scan the source of the first page, downloaded with PHP, for strings in the format you mention. Then, if the JS on their site is querying something like http://www.wow.com/armory.php?id=#72:1284, you can just download the source of that next. You can find out how the JS is querying the server with something like Firebug or the Inspector in Chrome or Safari.
So in summary:
Check to find the JS URL format and if you can replicate it.
Create PHP to get main page and extract all strings.
Create PHP to loop through these strings and get these pages (with URL that JS requests).
Do whatever you wanted to with that information.
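The "extract all strings" step from the summary above might look like the following. It is a sketch that assumes the identifiers always have the #number:number shape mentioned in the question; the function name and the array keys are made up for illustration.

```php
<?php
// Hypothetical helper: finds every "#73:1283"-style identifier in a page's
// source and returns each distinct one as a pair of integers.
function extract_hash_ids(string $html): array
{
    preg_match_all('/#(\d+):(\d+)/', $html, $matches, PREG_SET_ORDER);

    $ids = [];
    foreach ($matches as $m) {
        // Keying by the full match ("#73:1283") removes duplicates.
        $ids[$m[0]] = ['category' => (int) $m[1], 'item' => (int) $m[2]];
    }
    return array_values($ids);
}
```

Each pair can then be turned into whatever URL the JavaScript requests and fetched in a loop; that URL format has to be discovered with the browser's network inspector first.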

You can try jQuery's $(document).ready() function, which runs
JavaScript code once the web page has loaded.
Example:
<div id="wowoData">#4325325</div>
<script>
$(document).ready(
function(){
$("#wowoData").css("border","1px solid red");
}
)
</script>

Fetching content from Website on another Server

What I basically want to do is get content from a website and load it into a div on another website. This should be no problem so far.
The problem is that the content to be fetched is located on a different server, and I have no source access to it.
I'd prefer a solution using JavaScript or jQuery.
Can I use a .htaccess redirect to fetch the content from a remote server with client-side (JS) techniques?
I'm open to other solutions as well.
Thanks a lot in advance!

You can't execute an AJAX call against a different domain, due to the same-origin policy. You can add a <script> tag to the DOM which points at a Javascript file on another domain. If this JS file contains some JSON data that you can use, you're all set.
The only problem is you need to get at the JSON data somehow, which is where JSON-P callbacks come into the picture. If the foreign resource supports JSON-P, it will give you something that looks like
your_callback({ /* JSON data */ });
You then specify your code in the callback.
See JSONP for more.
If JSONP isn't an option, then the best bet is to probably fetch the data server-side, say with a cron job every few minutes, and store it locally on your own site.

You can use a server-side XMLHTTP request to grab your content from the other server. You can then parse it on your server (a.k.a. screen scraping) and serve up the portion you want along with your web page.

If the content from the other website is just an HTML doc that you want to display on your site, you could also use an iframe to pull it in. You won't have access to any of its content because of browser security rules.

You will likely have to "scrape" the data you need and store it on your server.
This is a great tutorial on how to cache data from an external site. It is actually written to fetch and store XML, so it'll need some modification. Also, if your host doesn't allow file_get_contents on remote URLs, you may have to modify it to use cURL.
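A rough sketch of the fetch-and-cache idea, not taken from the tutorial mentioned above: the function names, the cache file path and the time-to-live are placeholders, and the optional $fetch parameter exists only so the download step can be swapped out.

```php
<?php
// Hypothetical helper: returns the cached copy while it is fresh,
// otherwise downloads the URL again and rewrites the cache file.
function cached_fetch(string $url, string $cacheFile, int $ttlSeconds, ?callable $fetch = null): string
{
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttlSeconds) {
        return file_get_contents($cacheFile);
    }
    $fetch = $fetch ?? 'fetch_remote';
    $body = $fetch($url);
    file_put_contents($cacheFile, $body);
    return $body;
}

// Downloads a URL with file_get_contents when remote URLs are allowed,
// falling back to cURL otherwise.
function fetch_remote(string $url): string
{
    if (ini_get('allow_url_fopen')) {
        $body = file_get_contents($url);
    } else {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $body = curl_exec($ch);
    }
    if ($body === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $body;
}
```

The same function works whether it is called from a cron job or lazily on page load; the page that displays the content then only ever reads the local copy.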

emulating LiveHTTPheader in server side script or javascript?

I ran into this problem when scraping sites that make heavy use of JavaScript to obfuscate their data.
For example,
<a href="javascript:void(0)" onClick="grabData(23)">VIEW DETAILS</a>
This href attribute reveals no information about the actual URL. You'd have to manually examine the grabData() JavaScript function to get a clue.
OR
The old-school way is to manually open the Live HTTP Headers add-on for Firefox and monitor the POST parameters, which reveals the actual URL being POSTed to.
So I'm wondering: is there a way to capture the outgoing and incoming POST parameters in a server-side script or in JavaScript, as Live HTTP Headers does? This would make even the most JavaScript-obfuscated web pages easily scrapable.
thanks.

I'm not sure I understand the question but...
In PHP, incoming POST parameters are stored in the $_POST array, you can display them with print_r($_POST);.
