I ran into this problem when scraping sites that make heavy use of JavaScript to obfuscate their data.
For example,
"a href="javascript:void(0)" onClick="grabData(23)"> VIEW DETAILS
This href attribute reveals no information about the actual URL; you'd have to manually examine the grabData() JavaScript function to get a clue.
OR
The old-school way is manually opening up the Live HTTP Headers add-on for Firefox and monitoring the POST parameters, which reveals the actual URL being POSTed.
So I'm wondering: is there a way to capture the outgoing and incoming POST parameters in a server-side script or in JavaScript, the way Live HTTP Headers does? This would make even the most JavaScript-obfuscated web pages easily scrapable.
thanks.
I'm not sure I understand the question but...
In PHP, incoming POST parameters are stored in the $_POST array; you can display them with print_r($_POST).
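For example, a minimal sketch (the log path is an arbitrary choice) that appends every incoming POST to a file so you can inspect what a form or AJAX call actually sent to your script:

    <?php
    // Append every incoming POST to a log file for inspection.
    // /tmp/post_log.txt is just an example path.
    if ($_SERVER['REQUEST_METHOD'] === 'POST') {
        file_put_contents(
            '/tmp/post_log.txt',
            date('c') . ' ' . print_r($_POST, true),
            FILE_APPEND
        );
    }

Keep in mind this only sees requests made to your own server; to discover what another site receives, you still have to watch the browser's traffic and replicate the request yourself.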
Related
I am building an app that extracts data from a website and displays it in my app. I am using phpQuery to extract the data in my server-side code.
However, one page contains an .asp form with two dropdown menus. I need to select an option in both of them and then parse the resulting HTML. I need to do this server-side, so JavaScript doesn't seem to be an option.
How can I do this? Can it be done using phpQuery, or is some other technology required?
The page in question is: http://www.bput.ac.in/exam_schedule_ALL.asp
Since you're using PHP and phpQuery, I suggest you also try cURL.
Explore what the form submits via JavaScript to learn the format of the (presumably) POSTed data, then replicate that in a cURL request to the same endpoint. JavaScript won't be necessary, and you can get the same results you need. In this case, you won't need the approach mentioned next.
Alternatively, if you have a scriptable browser engine such as WebKit or PhantomJS, you can write an automation script to run those steps and return the results, depending on exactly what you need returned. See a more complete example here: https://stackoverflow.com/a/17449757/573688 for how others suggest you do this. Note that this is usually not necessary if you just need to emulate POST requests.
This answer is mostly prose because it's not entirely clear which direction helps you the most.
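That said, here is a minimal sketch of the cURL idea, assuming the form posts back to the same page; the field names are pure placeholders for whatever your browser's network tools show the form really sends:

    <?php
    $ch = curl_init('http://www.bput.ac.in/exam_schedule_ALL.asp'); // assumed endpoint
    curl_setopt_array($ch, array(
        CURLOPT_POST           => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POSTFIELDS     => http_build_query(array(
            'dropdown1' => 'some_value',  // hypothetical field names --
            'dropdown2' => 'other_value', // replace with the real ones
        )),
    ));
    $html = curl_exec($ch);
    curl_close($ch);
    // $html now holds the resulting page, ready for phpQuery to parse.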
JavaScript runs on the page; it should make an AJAX request to the server and pass along the values selected in both SELECT elements.
The server receives the AJAX request, performs a request to the correct address (you can use phpQuery), and prints the result (which becomes the response to the AJAX request).
The JavaScript on the page receives the response to the AJAX request and acts on it.
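A rough sketch of the server-side half of that flow; the parameter names and the POST format are assumptions:

    <?php
    // Receive the two selected values from the AJAX call, forward them to
    // the target site, and print the result as the AJAX response.
    $fields = http_build_query(array(
        'select1' => isset($_POST['select1']) ? $_POST['select1'] : '',
        'select2' => isset($_POST['select2']) ? $_POST['select2'] : '',
    ));
    $context = stream_context_create(array('http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => $fields,
    )));
    echo file_get_contents('http://www.bput.ac.in/exam_schedule_ALL.asp', false, $context);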
I'm trying to grab the content of a website based on AJAX and HTTPS, but with no luck.
Is this possible?
The website I'm trying to crawl is this:
https://www.bet3000.com/en/html/home.html#!https://www.bet3000.com/html/en/eventssportsbook.html?category_id=2117
Thanks
If you take a look at the HTTP requests this page makes (using, for example, Firebug for Firefox), you'll notice it makes several Ajax requests.
Instead of trying to execute the JavaScript code, a possible solution is to request one of those URLs directly and get the data; this way, you also won't have to parse the HTML.
In this specific case, one of those requests is made to the following URL:
https://www.bet3000.com/ajax/en/sportsbook.json.html?category_id=2117&offset=&live=&sportsbook_id=0
This URL seems to return some JSON data that should interest you quite a bit ;-)
(There are a few characters before and after the JSON that will need to be removed, but aside from that, I don't see anything that looks problematic.)
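A minimal sketch of that approach; stripping everything outside the first '{' and the last '}' is my guess at how to remove those extra characters:

    <?php
    // Fetch the Ajax URL directly and decode the JSON payload.
    // (Requires allow_url_fopen and OpenSSL for https; use cURL otherwise.)
    $url  = 'https://www.bet3000.com/ajax/en/sportsbook.json.html?category_id=2117&offset=&live=&sportsbook_id=0';
    $body = file_get_contents($url);
    // Keep only the text between the first '{' and the last '}'.
    $start = strpos($body, '{');
    $end   = strrpos($body, '}');
    $data  = json_decode(substr($body, $start, $end - $start + 1), true);
    print_r($data);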
I am unable to get a lot of referral URLs using document.referrer, and I'm not sure what is going on. I would appreciate any info on its limitations (like which browsers don't support what).
Is there something else I could use (in a different language, perhaps) that covers more browsers?
I wouldn't put any faith in document.referrer in your JavaScript code. The value is sent in a client-side request header (Referer), and as such it can be spoofed and manipulated.
For more info, see my answer to this question about the server-side HTTP_REFERER variable:
How reliable is HTTP_REFERER
Which browser are you looking in? If the referring website is sending the traffic via window.open('some link') instead of a regular <a> tag, then IE will not see a referrer. It thinks it's a new request at that point, similar to you simply going to a URL directly (in which case there is no referrer). Firefox and Chrome do not have the same issue.
This is NOT just a JavaScript limitation; HTTP_REFERER will NOT work in this specific scenario either.
Just to make sure you're on the same page: you do know that if someone types a URL directly into their web browser, the document.referrer property is empty, right? That said, you might be interested in a JavaScript method to get all HTTP headers. If you prefer PHP (since you're using that tag), the standard $_SERVER variable will provide what information is available. Note that the information is only as reliable as the reporting web browser and server, as noted by Kev.
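For instance, a minimal PHP sketch; the header may simply be absent, so always guard for that:

    <?php
    // The header uses the one-R spelling "Referer"; PHP exposes it as below.
    $referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    if ($referrer === '') {
        // Direct visit, bookmark, or a browser/proxy that stripped the header.
    }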
The document.referrer will be an empty string if:
You access the site directly, by entering the URL;
You access the site by clicking on a bookmark;
The source link contains rel="noreferrer";
The source is a local file.
Check out https://developer.mozilla.org/en-US/docs/Web/API/Document/referrer
What I basically want to do is get content from a website and load it into a div on another website. This should be no problem so far.
The problem is that the content to be fetched is located on a different server, and I have no source access to it.
I'd prefer a solution using JavaScript or jQuery.
Can I use a .htaccess redirect to fetch the content from a remote server with client-side (JS) techniques?
I'm open to other solutions, though.
Thanks a lot in advance!
You can't execute an AJAX call against a different domain, due to the same-origin policy. You can, however, add a <script> tag to the DOM which points at a JavaScript file on another domain. If this JS file contains some JSON data that you can use, you're all set.
The only problem is that you need to get at the JSON data somehow, which is where JSON-P callbacks come into the picture. If the foreign resource supports JSON-P, it will give you something that looks like:
your_callback({ /* JSON data */ });
You then specify your code in the callback.
See JSONP for more.
If JSONP isn't an option, then your best bet is probably to fetch the data server-side, say with a cron job every few minutes, and store it locally on your own site.
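A minimal sketch of that cron-driven fetch; the URL and cache path are placeholders, and the crontab schedule is an assumption:

    <?php
    // fetch.php -- run from cron, e.g.: */5 * * * * php /path/to/fetch.php
    $remote = 'http://example.com/remote-page.html'; // the foreign resource
    $cache  = __DIR__ . '/cache/remote-page.html';   // local copy on your site

    $body = file_get_contents($remote);
    if ($body !== false) {
        file_put_contents($cache, $body);
    }
    // Your pages then read the local copy, which is same-origin,
    // so a plain AJAX call against it works.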
You can use a server-side XMLHTTP request to grab your content from the other server. You can then parse it on your server (a.k.a. screen scraping) and serve up the portion you want along with your web page.
If the content from the other website is just an HTML doc that you want to display on your site, you could also use an iframe to pull it in, though your scripts won't have access to any of its content because of browser security rules.
You will likely have to "scrape" the data you need and store it on your server.
This is a great tutorial on how to cache data from an external site. It is actually written to fetch and store XML, so it'll need some modification. Also, if your host doesn't allow file_get_contents on remote URLs, you may have to modify it to use cURL.
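If you do need that modification, here's a sketch of a cURL-based stand-in for file_get_contents (the function name is mine):

    <?php
    // Fetch a URL with cURL, for hosts where allow_url_fopen is disabled.
    function fetch_url($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return, don't echo
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        $body = curl_exec($ch);
        curl_close($ch);
        return $body;
    }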
Specifically, I am making an AJAX app and trying to preserve the back button. My JavaScript is working properly and registering a new URL in the address bar with an anchor-like hash in the URL:
http://t2b.localhost/#/clients/
I can catch the URL when the page loads with JavaScript and load the "clients" page, but I want to know: is there a way to read the entire URL with PHP or with .htaccess? Looking at the normal variables, I only seem to be able to get the URL up to the occurrence of the "#" (http://t2b.localhost/).
The browser doesn't send the fragment part of the URL (the text after the #) to the server.
It is intended to be used locally by the client.
In Firefox (and in Internet Explorer too) there is document.location.hash, which contains the fragment part of the URL. With JavaScript you can read it and pass its value to the server as an ordinary request variable.
Please use one of the available JavaScript libraries to track the history state or to navigate via AJAX requests. There are so many problems involved (certain browsers not notifying scripts when the hash changes, or not adding a pseudo-navigation entry to the browser's history list, etc.) that you'd end up reinventing an expensive wheel that wouldn't work very well. I recommend YUI's History library, although it has problems on Google Chrome.
I'm pretty sure you can't read it with PHP alone, because the hash part is handled only on the client side (JavaScript).
For history I'd recommend Ben Alman's BBQ plugin.
See: Can I read the hash portion of the URL on my server-side application (PHP, Ruby, Python, etc.)?
You could use JavaScript to set a cookie containing the current URL, then read it with PHP.
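A rough sketch of that idea; the cookie name is arbitrary, and note the cookie only reaches PHP on the next request to the server:

    <?php
    // The client side would first run something like:
    //   document.cookie = 'fragment=' + encodeURIComponent(location.hash) + '; path=/';
    // On the next request, PHP can read it:
    $fragment = isset($_COOKIE['fragment'])
        ? urldecode($_COOKIE['fragment'])   // e.g. "#/clients/"
        : '';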