I am building an application that grabs html source from various sites.
Using xpath or simple html dom, I can then quite easily parse this html and dump it to a database etc.
Unfortunately this approach does not work for one particular site.
This is because the site loads its content with JavaScript and so most of its content is not visible in the html source.
Having googled this over and over and read loads of threads covering the subject here on Stackoverflow, I'm still not sure how to go about solving this problem.
Here is the important part of the code this site is using to display its content.
<script type="text/javascript" src="/jquery-1.3.2.min.js"></script>
<script>
var example = {
getServiceCall:function(url) {
{
var srtPos=url.indexOf('Filter');
var endPos=url.indexOf('/',srtPos);
var filter = $.getUrlVar("Filter");
var filterInServiceUrl=url.slice(srtPos,endPos).split(":");
url = (filter)
? url.slice(0,srtPos) + filter + url.slice(endPos,url.length)
: url.slice(0,srtPos) + filterInServiceUrl[1] + url.slice(endPos,url.length);
}
document.writeln('<scri'+'pt src="'+url+'" type="text/javascript"> </sc' + 'ript>');
},
};
$.extend({
getUrlVars: function(){
var hashes = window.location.href.slice(window.location.href.indexOf('?') + 1).split('&');
},
getUrlVar: function(name){
}
});
</script>
<div id="content">
<script language="javascript" type="text/javascript">
function doPerItem(html){ $("#content").html(html.toString()); }
example.getServiceCall('http://www.example.com/?callback=doPerItem');
</script>
</div>
Using Inspect Element in Google Chrome I can see that there is a file that contains html source that I want.
How can I use php to make the same request, with the same arguments, to the remote server and then save the response to a file?
I will then be in a position to parse it with xpath or simple html dom just like the other sites.
Your help will be much appreciated.
I don't know of any PHP-based remote access tool (including cURL) which interprets JavaScript. Selenium (normally used for testing) might do this, but Selenium-RC did not work for me at all with PHP and had bugs in the IDE.
You cannot practically use Ajax because that doesn't resolve JavaScript either (maybe you can resolve it somehow with eval() which has its security concerns), and JSONP will only work if the remote server is deliberately offering an API for getting its data (you could write your own proxy and then give the data as JSONP but then you'd still have the problem of resolving JavaScript).
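As a rough sketch of the JSONP case: if the remote server does answer with a callback wrapper such as doPerItem(...), whatever fetches the response only needs to strip that wrapper. The helper name unwrapJsonp is hypothetical, and the sketch assumes the callback argument is valid JSON (for example a quoted string of HTML), which the real service may not guarantee.

```javascript
// Hypothetical helper: pull the payload out of a JSONP response body.
// Assumes the body looks like  callbackName(<valid JSON>);
function unwrapJsonp(body, callbackName) {
  var prefix = callbackName + '(';
  var start = body.indexOf(prefix);
  var end = body.lastIndexOf(')');
  if (start === -1 || end < start + prefix.length) {
    throw new Error('Not a JSONP response for ' + callbackName);
  }
  // Everything between the outer parentheses is the callback argument.
  return JSON.parse(body.slice(start + prefix.length, end));
}

// Example input shaped like the doPerItem callback from the question.
var html = unwrapJsonp('doPerItem("<ul><li>item</li></ul>");', 'doPerItem');
console.log(html);
```

If the argument turns out to be a JavaScript expression rather than JSON, this will throw, and you are back to needing something that can actually evaluate JavaScript.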
What you could do (though it has real security risks for your site):
1) Write a file in PHP which simply gets the remote site's contents using file_get_contents() and then outputs them (i.e., make a proxy).
2) Dynamically insert a hidden iframe via JavaScript to load your proxy page and then wait for the iframe's load event.
3) Get the resulting HTML of the hidden iframe from the parent and send the result back to the server.
You can't avoid step 1 unfortunately because you can't listen in on an iframe unless it comes from the same domain as yours.
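Steps 2 and 3 might look roughly like the sketch below. It assumes the proxy from step 1 lives at a same-origin URL such as /proxy.php (a made-up path), and the function name and the onHtml callback are placeholders. The document is passed in as a parameter only so the function can be exercised outside a browser.

```javascript
// Sketch only: load a same-origin proxy page in a hidden iframe and hand
// its rendered HTML to a callback. Names and paths are hypothetical.
function loadViaHiddenIframe(doc, proxyUrl, onHtml) {
  var iframe = doc.createElement('iframe');
  iframe.style.display = 'none';
  iframe.onload = function () {
    // Readable only because the proxy page is served from our own domain.
    var inner = iframe.contentWindow.document;
    onHtml(inner.documentElement.outerHTML);
  };
  iframe.src = proxyUrl;
  doc.body.appendChild(iframe);
  return iframe;
}

// In a browser this could be used as:
// loadViaHiddenIframe(document, '/proxy.php', function (html) {
//   $.post('/save.php', { html: html }); // '/save.php' is a made-up endpoint
// });
```

If the remote script fills in the page after the load event has fired, you would need to wait or poll before reading the HTML.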
Note that if the site you are visiting crafts their JavaScript in a certain way, they could access your containing HTML, and do things like grab your user's cookies so as to steal passwords, find out your domain or what's showing on your page, etc.
There may be better solutions out there, but I'm not aware of any.
I have some ajax that loads php script output into a div. I would like the user then to be able to click on links in the output and rewrite the div without reloading the whole page. Is this possible in principle? Imagine code would look like:
html
<div id="displayhere"></div>
php1 output
echo 'ChangeToNew';
JS
function reLoad(par1,par2,par3) {
...
document.getElementById("displayhere").innerHTML=xmlhttp.responseText;
xmlhttp.open("GET","php2.php?par1="+par1 etc.,true);
xmlhttp.send();
php2
$par1 = $_GET['par1'];
change database
echo ''.$par1.'';
Could this in principle work or is the approach flawed?
Thanks.
What you describe is standard, everyday AJAX. The PHP is irrelevant to the equation; the JS will simply receive whatever the server sends it. It just happens that, in your case, the server response is being handled by PHP. The JS and PHP do not - cannot - have a direct relationship, however.
So the principle is fine. What you actually do with it, though, will of course impact on how well it works.
Things to consider:
what will the PHP be doing? This may affect the load times
what about caching responses, if this is applicable, so the PHP doesn't have to compute something it's previously generated?
the UI - will the user be made aware that content is being fetched?
Etc.
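On the caching point, one option on the client side is to remember responses by URL so that the same link never triggers a second request. This is only a sketch: createCachedLoader and the load function passed into it are placeholder names, the URLs are invented, and it assumes the response for a given URL does not change while the page is open.

```javascript
// Wrap any "load this URL" function so repeated URLs are served from memory.
function createCachedLoader(load) {
  var cache = {};
  return function (url) {
    if (!Object.prototype.hasOwnProperty.call(cache, url)) {
      cache[url] = load(url); // first request for this URL: do the real work
    }
    return cache[url];
  };
}

// Stand-in for a real request. With jQuery the loader could return $.get(url),
// in which case the cached value is the pending or completed request.
var requests = 0;
var cachedLoad = createCachedLoader(function (url) {
  requests += 1;
  return 'response for ' + url;
});

cachedLoad('list.php?page=1');
cachedLoad('list.php?page=1'); // served from the cache, no new request
cachedLoad('list.php?page=2');
console.log(requests);
```

This only makes sense for requests that read data. A request that changes the database should always go to the server.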
I'm used to using jQuery so will give examples using it.
If you create your links as
<a href="..." id="do_this">Click Me</a>
You could then write your code as
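<script>
$("#do_this").live('click', function(){
var link_url = $(this).attr('href');
// fetch the link target and put the response into the div
$.ajax({
url: link_url,
success: function(data) {
$('#displayhere').html(data);
}
});
// stop the browser from following the link
return false;
});
</script>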
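If you use jQuery, make sure you use the .live('click', function(){}) method rather than the .click(function(){}) method, otherwise it won't recognize dynamically created elements. (.live() was removed in jQuery 1.9; in newer versions the delegated form $(document).on('click', '#do_this', function(){}) does the same job.) Also make sure you return false so the browser doesn't follow the link.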
I'm working on a site that passes information to my server, which returns a page. However, I have to re-define the click listener every time I reload the page, because jQuery controls all my clicks on every page. Is there a way to permanently define a function?
jQuery code:
$(function(){
$('.lvl1Links').on('click',function(event) {
event.preventDefault();
$('pload').html('<img src="source/image/lbl.gif">');
var page = $(this).attr('id');
var huh = $('input:hidden').val();
var data = 'pop='+huh+'&page='+page;
$.post('source/php/bots/authorize.php',data,function(data){
$('#pager_master_div').html(data).slideDown();
$('pload').html('');
});
});
});
Because the web is a stateless platform, you need to rebind things like this every time the page loads. Here's the pattern I use to make it easier, though:
If this is common across an area of your site, put this type of stuff into an init function in the common file. e.g.
global.js:
function InitSalesPageOrWhatever(){
$(function(){ foo; });
OtherStuffThatRunsOnEverySalesPageLoad();
}
Then in the script block on your pages, e.g. SalesPage:
InitSalesPageOrWhatever();
That's it--just one line in your content pages. Beyond the benefit of the content pages being nice and clean, that big clump of JS can now be cached by the user's browser, making the load on you less and their experience faster.
jQuery (and all Javascript) runs on the client side where permanence is unavailable. There are two ways to approach the permanence you seek.
Write a jQuery plugin and include it in your page.
Write your click handler once, and use your server-side code/scripting language to include it in every HTML page. An example PHP include is here.
This may be a good time to consider HTML templates -- documents that contain standard HTML (header, footer, navigation, etc) that should be included in every page of your site.
I've looked briefly into the problems of having a site that changes dynamically via javascript or php. However, I'm not interested in url link-backs, getting Google to spider the site, or general url navigation. I will, however, attend to those who do not use javascript on my site.
To the question: if I implement a dynamically changing page using jQuery and Ajax, will that cause vulnerability problems with PHP in the way I am implementing it?
Example jQuery:
<script type="text/javascript">
$(document).ready(function(){
$("div#text").hide();
$("div#text").fadeIn("slow");
$("li#button").click(function(){
var page = $(this).attr("page");
$.ajax({
url: page,
success: function(contents){
$("div#text").empty();
$("div#text").hide();
$("div#text").html(contents);
$("div#text").fadeIn("slow");
}
});
});
});
</script>
Called PHP/HTML:
<h1>Hello</h1>
<?php /* Do mysql/secure things here */ ?>
If there are more efficient/standard ways of doing what I want, I'm open to suggestions. I am not a jQuery programmer by any means.
So long as your PHP script is correctly sanitizing any REQUEST variables before use (and not returning unencrypted sensitive data, of course), this approach should be fine. The input is coming from the page just as any other URL request or form input would.
Using Ajax doesn't make the request any less secure than it would be otherwise.
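As one illustration of what "sanitizing" can mean here, a common approach is to map the requested value onto a fixed allow-list instead of using it directly. The check has to run on the server, so in this case it would be written in PHP; it is shown in JavaScript only to match the snippet in the question. The page names and file names are invented.

```javascript
// Hypothetical allow-list: request value -> file the server may serve.
var ALLOWED_PAGES = {
  home: 'home.php',
  about: 'about.php'
};

// Return the file for a requested page, or null if it is not on the list.
function resolvePage(requested) {
  if (typeof requested !== 'string') {
    return null;
  }
  // hasOwnProperty avoids matching inherited keys such as "constructor".
  return Object.prototype.hasOwnProperty.call(ALLOWED_PAGES, requested)
    ? ALLOWED_PAGES[requested]
    : null;
}

console.log(resolvePage('about'));
console.log(resolvePage('../../etc/passwd'));
```

Anything that is not on the list is rejected outright, so there is nothing left to escape or clean up.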
I'm trying to parse the courses from this page: http://college.usc.edu/cf/course-guide/genelects.cfm. Specifically, the category II courses.
I'm not too familiar with the javascript, but it seems that when the cat II link is clicked, this method is called:
function GetClassList(catid,sem,semester)
{
jQuery('#FallClasses_'+catid).hide();
jQuery('#SpringClasses_'+catid).toggle();
jQuery('#SpringClasses_'+catid).load('genelects-ajax-getclasslist.cfm', {catid:catid,sem:sem});
}
The problem is I don't see the courses anywhere in the html. It seems to all be done on the server side.
EDIT!
So I found where in the DOM the data is being placed. I used firebug.
I looked at the DOM associated with this div:
<div id="SpringClasses_2" style="display: none; "/>
Then in the Firebug DOM tab, I:
1) Clicked +children.
2) Found the html I need under +innerHTML.
I understand now how to find the data. But I need to write a script (run on another domain) to parse that DOM. How can I do this? How can I get that DOM from the college page, and then parse it?
Your code might look like:
function GetClassList(catid,sem,semester)
{
$('#FallClasses_'+catid).hide();
$('#SpringClasses_'+catid).toggle();
$.ajax({
type: 'POST',
url: 'genelects-ajax-getclasslist.cfm',
data: 'catid='+encodeURIComponent(catid)+'&sem='+encodeURIComponent(sem),
success: function(data){
jQuery('#SpringClasses_'+catid).html(data);
}
});
}
Just be careful that the genelects-ajax-getclasslist.cfm script returns only the html you'd like to put into the '#SpringClasses_'+catid container.
The script genelects-ajax-getclasslist.cfm should, of course, be located on the same Internet domain as this javascript source; in other words, it should be local, not remote.
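The data string above is assembled by hand. If there are more than a couple of parameters, a small helper keeps the encoding consistent. toQueryString is a made-up name and the sample values are only illustrative; jQuery's own $.param does a similar job if you would rather not write one.

```javascript
// Build an application/x-www-form-urlencoded style string from a plain object.
function toQueryString(params) {
  return Object.keys(params).map(function (key) {
    // Encode both sides so characters like & and = cannot break the string.
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  }).join('&');
}

// Illustrative values only.
console.log(toQueryString({ catid: 2, sem: 'Spring 2011' }));
```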
I would have probably acted differently...
(i.e. a small php or perl command line client parsing with regexps)
But given what you got, you can add a hidden form of yours to that page and use an <input> element to store the data you obtain with javascript. Then, submit() it to a server you control. Even a local one. Even on localhost.
AFAIK, no obscure security mechanism should be triggered, this way.
IIRC something like
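var input = $('<input name="data" value=""/>');
// a JavaScript string literal cannot span lines, so join the pieces
var form = $('<form style="display:none" ' +
'action="http://myserver.example.com/post-junk-here.php" ' +
'method="post">' +
'<input type="submit">' +
'</form>');
$('body').append(form.append(input));
// input is a jQuery object, so set the value with .val()
input.val(my_hard_earned_data);
form.submit();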
should do it.
Here is what I am trying to accomplish. I have a form that uses jQuery to make an AJAX call to a PHP file. The PHP file interacts with a database, and then creates the page content to return as the AJAX response; i.e. this page content is written to a new window in the success function for the $.ajax call. As part of the page content returned by the PHP file, I have a straightforward HTML script tag that has a JavaScript file. Specifically:
<script type="text/javascript" src="pageControl.js"></script>
This is not echoed in the php (although I have tried that), it is just html. The pageControl.js is in the same directory as my php file that generates the content.
No matter what I try, I can't seem to get the pageControl.js file included or working in the resulting new window created in response to success in the AJAX call. I end up with errors like "Object expected" or variable not defined, leading me to believe the file is not getting included. If I copy the JavaScript directly into the PHP file, rather than using the script tag with src, I can get it working.
Is there something I am missing here about scope resolution between calling file, php, and the jQuery AJAX? I am going to want to include javascript files this way in the future and would like to understand what I am doing wrong.
Hello again:
I have worked away at this issue, and still no luck. I am going to try and clarify what I am doing, and maybe that will bring something to mind. I am including some code as requested to help clarify things a bit.
Here is the sequence:
User selects some options, and clicks submit button on form.
The form button click is handled by jQuery code that looks like this:
$(document).ready(function() {
$("#runReport").click(function() {
var report = $("#report").val();
var program = $("#program").val();
var session = $("#session").val();
var students = $("#students").val();
var dataString = 'report=' +report+
'&program=' +program+
'&session=' +session+
'&students=' +students;
$.ajax({
type: "POST",
url: "process_report_request.php",
cache: false,
data: dataString,
success: function(pageContent) {
if (pageContent) {
$("#result_msg").addClass("successMsg")
.text("Report created.");
var windowFeatures = "width=800,menubar=yes,scrollbars=1,resizable=1,status=yes";
// open a new report window
var reportWindow = window.open("", "newReportWindow", windowFeatures);
// add the report data itself returned from the AJAX call
reportWindow.document.write(pageContent);
reportWindow.document.close();
}
else {
$("#result_msg").addClass("failedMsg")
.text("Report creation failed.");
}
}
}); // end ajax call
// return false from click function to prevent normal submit handling
return false;
}); // end click call
}); // end ready call
This code performs an AJAX call to a PHP file (process_report_request.php) that creates the page content for the new window. This content is taken from a database and HTML. In the PHP file I want to include another javascript file in the head with javascript used in the new window. I am trying to include it as follows
<script src="/folder1/folder2/folder3/pageControl.js" type="text/javascript"></script>
Changed path folder names to protect the innocent :)
The pageControl.js file is actually in the same folder as the jQuery code file and the php file, but I am trying the full path just to be safe. I am also able to access the js file using the URL in the browser, and I can successfully include it in a static html test page using the script src tag.
After the javascript file is included in the php file, I have a call to one of its functions as follows (echo from php):
echo '<script type="text/javascript" language="javascript">writePageControls();</script>';
So, once the php file sends all the page content back to the AJAX call, then the new window is opened, and the returned content is written to it by the jQuery code above.
The writePageControls line is where I get the error "Error: Object expected" when I run the page. However, since the JavaScript works fine in both the static HTML page and when included "inline" in the PHP file, it is leading me to think this is a path issue of some kind.
Again, no matter what I try, my calls to the functions in the pageControl.js file do not work. If I put the contents of the pageControl.js file in the php file between script tags and change nothing else, it works as expected.
Based on what some of you have already said, I am wondering if the path resolution to the newly opened window is not correct. But I don't understand why because I am using the full path. Also to confuse matters even more, my linked stylesheet works just fine from the PHP file.
Apologies for how long this is, but if anyone has the time to look at this further, I would greatly appreciate it. I am stumped. I am a novice when it comes to a lot of this, so if there is just a better way to do this and avoid this problem, I am all ears (or eyes I suppose...)
I have also had problems with a similar issue to this, and this was a real headache. The following approach may not be elegant, but it worked for me.
Make sure that your php file outputs only what you want in your body.
Add jQuery to the window head dynamically.
Add any external script files to the window head dynamically.
Call jQuery's html() on the new window's body with your loaded content, so that scripts are evaluated.
For example, in your ajax success:
success: function(pageContent) {
var windowFeatures = "width=800,menubar=yes,scrollbars=1,resizable=1,status=yes";
var reportWindow = window.open("", "newReportWindow", windowFeatures);
// boilerplate
var boilerplate = "<html><head></head><body></body></html>";
reportWindow.document.write(boilerplate);
var head = reportWindow.document.getElementsByTagName("head")[0];
var jquery = reportWindow.document.createElement("script");
jquery.type = "text/javascript";
jquery.src = "http://code.jquery.com/jquery-1.7.min.js";
head.appendChild(jquery);
var js = reportWindow.document.createElement("script");
js.type = "text/javascript";
js.src = "/folder1/folder2/folder3/pageControl.js";
js.onload= function() {
reportWindow.$("body").html(pageContent);
};
head.appendChild(js);
reportWindow.document.close();
}
Good luck!
It probably isn't looking where you think it is looking to grab your javascript file.
Try a server-relative format like this:
<script src="/some/path/to/pageControl.js"></script>
If that still isn't working, verify that you can type the url to your script file into your browser and get it to download.
Make sure that you have that within either <head> or <body> of the HTML page. Also, I'd double check the path to the .js file. You could do that by pasting "pageControl.js" at the root of your web address.
Things to look for:
Use Firebug (NET tab) to check if the js file is loaded with status 200. Also check in the Console tab for any javascript errors.
Are you using the HTML5 offline cache? If so, it may be serving a cached version that doesn't include your <script> tag.
View the page source and make sure it includes the script tag.
Change the source attribute to absolute path: <script src="http://www.example.com/js/pageControl.js" type="text/javascript"></script>
Visit http://www.example.com/js/pageControl.js and make sure it shows correctly.
Try to place the <script> right after the <head> so that it loads first.
This is all I could think of.
You can dynamically load script by creating the element and then append it to head or other element:
reportWindow.document.write(pageContent);
// create the element in the new window's document, not the opener's
var script = reportWindow.document.createElement('script');
script.src = 'pageControl.js';
script.type = 'text/javascript';
reportWindow.document.getElementsByTagName('head')[0].appendChild(script);
reportWindow.document.close();
Have you tried using jQuery's $("#target_div").load(...)?
This also executes JS inside the output...
Read this doc to find out how to use it :
http://api.jquery.com/load/
To me it sounds like you're expecting an unloaded script to work.
Try taking a look here: http://ensure.codeplex.com/SourceControl/changeset/view/9070#201379
This is a bit of javascript that ensures that the script is loaded properly before access is attempted. You can use this either as lazy loading (loading javascript files only when required), or, as I interpret your problem, loading a script based on the result of ajax calls.
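The general idea, waiting for each script's load event before touching anything it defines, can be sketched without a library. loadScriptsInOrder is a hypothetical name, the document is passed in so the function also works on a popup's document, and the sketch assumes the browser fires onload on script elements (very old Internet Explorer versions used onreadystatechange instead).

```javascript
// Sketch: append scripts one at a time and call done() once all have loaded.
function loadScriptsInOrder(doc, urls, done) {
  var head = doc.getElementsByTagName('head')[0];
  function loadNext(index) {
    if (index >= urls.length) {
      done(null); // every script has loaded
      return;
    }
    var script = doc.createElement('script');
    script.type = 'text/javascript';
    // Only start the next script once this one has finished loading.
    script.onload = function () { loadNext(index + 1); };
    script.onerror = function () {
      done(new Error('Failed to load ' + urls[index]));
    };
    script.src = urls[index];
    head.appendChild(script);
  }
  loadNext(0);
}

// With the popup from the question this could be used as:
// loadScriptsInOrder(reportWindow.document,
//   ['http://code.jquery.com/jquery-1.7.min.js',
//    '/folder1/folder2/folder3/pageControl.js'],
//   function (err) {
//     if (!err) { reportWindow.$('body').html(pageContent); }
//   });
```

Loading one script at a time is slower than loading them in parallel, but it guarantees that jQuery is available before pageControl.js or the page content tries to use it.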
What's probably happening is, you're echoing a string via an ajax callback, not inserting an element. External scripts require a second GET call to load their contents, which isn't happening - only the first call happened. So, when the first call includes the inline code, the DOM doesn't have to make an additional GET request to fetch the contents. If the DOM doesn't see the script, the DOM won't execute it, which means it's just some random tag.
There's a very fast way to find out. In Chrome (or Firefox with the Firebug plugin installed), check the console > scripts dropdown to see all the loaded scripts. If it's not listed, it's not loaded and the script tag you see in the markup is otherwise inert.
Since it's probably just a string as far as PHP cares, you could create it as a PHP DOM object and insert it properly (although this could be laborious). Instead, maybe place it at the very end of the page, just before the closing body tag. (This is the preferred position for js anyway: dead last, after all the other elements on the page have loaded and are available to the DOM.)
HTH :)