I have an automated archive of the front pages of several (media) websites, written in PHP. Specifically, I copy the HTML inside the <body> tag twice a day and keep a copy of all their CSS and JS files, so I can recreate the front page from any point in the past. I have now run into a problem with one of those websites: it loads the main slider content (the most important news) with an Ajax call. I would like that Ajax call to be executed before I parse the data, so that I capture the slider content and not just a blank div. Looking around, I found that they use a WordPress plugin named lof-jslidernews2, but I can't find the specific Ajax call to see the URL and make a cURL request. Any ideas how to achieve this?
The website: http://fokus.mk/
My code (I had to parse manually like this because of some problems with DOMDocument and invalid HTML):
// ...
if($html = file_get_contents ($row['page_url'])) {
$content = strstr($html, '<body');
$content = str_before($content, '</body>') . '</body>';
$filename = date('YmdHis') . $row['page_name'];
if($success = file_put_contents ('app/webroot/files/' . $filename, $content)) {
// ....
** There is nothing illegal about my project, I am not stealing content, just freezing frontpages for later comparison. I have consulted a lawyer about this. :)
I don't know why, but the guy that actually solved my problem deleted his answer. So, here it is:
He suggested using an emulator, specifically Mink. It was easy to install (using composer) and did the job on the first try. Awesome library.
Mink is an open source browser controller/emulator for web applications, written in PHP 5.3.
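For anyone landing here later, a minimal sketch of what that can look like. It assumes Mink and a JavaScript-capable driver (here Selenium2Driver talking to a Selenium server on its default address) are installed via Composer. The function name, the 10 second timeout and the '#slider li' selector in the usage comment are my own placeholders, not something taken from fokus.mk.

```php
<?php
// Sketch: let a real browser run the page's JavaScript, then keep only <body>.
// $session is expected to be a started Behat\Mink\Session (see usage below).
// fetchRenderedBody() and its parameters are illustrative names, not Mink API.
function fetchRenderedBody($session, string $url, string $readyCondition, int $timeoutMs = 10000): string
{
    $session->visit($url);
    // Block until the JS expression becomes truthy or the timeout expires.
    $session->wait($timeoutMs, $readyCondition);
    $html = $session->getPage()->getContent();

    // Same manual slicing as in the question, since the markup may be invalid.
    $start = stripos($html, '<body');
    $end   = strripos($html, '</body>');
    if ($start === false || $end === false || $end < $start) {
        return $html; // no recognisable <body>, keep everything
    }
    return substr($html, $start, $end - $start + strlen('</body>'));
}

// Usage (needs Composer's autoloader and a running Selenium server):
//   require 'vendor/autoload.php';
//   $driver  = new \Behat\Mink\Driver\Selenium2Driver('firefox');
//   $session = new \Behat\Mink\Session($driver);
//   $session->start();
//   $body = fetchRenderedBody($session, 'http://fokus.mk/',
//       "document.querySelectorAll('#slider li').length > 0"); // hypothetical selector
//   $session->stop();
```

The wait condition is the important part: without it the page content may be read before the Ajax response has arrived, which gives the same blank div as file_get_contents.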
I need to save the page source of an external link using PHP, the way a browser does when you save a page to your PC.
P.S. The saved folder should contain the images as well as the HTML.
I tried the code below. It just puts the source in tes.html; I need to save all the images too, so the page can be accessed offline.
<?php
include 'curl.php';
$game = load("https://otherdomain.com/");
echo $game;
?>
<?php
file_put_contents('tes.html', $game);
?>
What you are trying to do is mirroring a web site.
I would use the program wget to do so instead of reinventing the wheel.
exec( 'wget -mk -w 20 http://www.example.com/' );
See:
http://en.wikipedia.org/wiki/Wget
http://fosswire.com/post/2008/04/create-a-mirror-of-a-website-with-wget/
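If the URL comes from anywhere other than a hard-coded string, it is worth escaping it before it reaches the shell. A small sketch, assuming wget is installed and on the PATH; buildMirrorCommand() is a made-up helper name, and -p is added on the assumption that you want the images, CSS and JS each page needs:

```php
<?php
// Sketch: build the wget call with the URL shell-escaped before handing it to exec().
// buildMirrorCommand() is an illustrative helper, not part of any library.
function buildMirrorCommand(string $url, int $waitSeconds = 20): string
{
    // -m mirror, -k convert links for offline viewing, -p fetch page requisites
    // (images, CSS, JS), -w pause between requests.
    return sprintf('wget -m -k -p -w %d %s', $waitSeconds, escapeshellarg($url));
}

$command = buildMirrorCommand('http://www.example.com/');
// exec($command, $outputLines, $exitCode); // uncomment to actually run wget
```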
Either write your own solution to parse all the CSS, image and JS links (and save them) or check this answer to a similar question: https://stackoverflow.com/a/1722513/143732
You need to write a scraper and, by the looks of it, you're not yet skilled for such an endeavor. Consider studying:
Web Scraping (cURL, StreamContext in PHP, HTTP theory)
URL paths (relative, absolute, resolving)
DOMDocument and DOMXPath (for parsing HTML and easy tag querying)
Overall HTML structure (IMG, LINK, SCRIPT and other tags that load external content)
Overall CSS structure (like url('...') in CSS that loads resources the page depends on)
And only then will you be able to mirror a site properly. But if they load content dynamically, like with Ajax, you're out of luck with this approach, because the HTML you fetch will not contain that content.
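To give a feel for the DOMDocument and DOMXPath part of that list, here is a rough sketch that lists the resources a page pulls in. The helper names are mine, the URL resolution only covers the simplest cases (it ignores ports, ../ segments and <base> tags), and url(...) references inside CSS are not handled at all.

```php
<?php
// Sketch: list the external resources a page depends on.
// listPageResources() and resolveUrl() are illustrative helpers.
function listPageResources(string $html, string $baseUrl): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; keep libxml quiet instead of failing.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $urls = [];
    // IMG and SCRIPT use src, LINK uses href.
    foreach ($xpath->query('//img/@src | //script/@src | //link/@href') as $attr) {
        $urls[] = resolveUrl($attr->nodeValue, $baseUrl);
    }
    return array_values(array_unique($urls));
}

function resolveUrl(string $ref, string $baseUrl): string
{
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $ref)) {
        return $ref; // already absolute
    }
    $parts  = parse_url($baseUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (strpos($ref, '//') === 0) {
        return $parts['scheme'] . ':' . $ref; // protocol-relative
    }
    if (strpos($ref, '/') === 0) {
        return $origin . $ref; // root-relative
    }
    // Relative to the directory of the base path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $ref;
}

$sample = '<html><head><link rel="stylesheet" href="/css/site.css"></head>'
        . '<body><img src="img/logo.png">'
        . '<script src="http://cdn.example.com/app.js"></script></body></html>';
$resources = listPageResources($sample, 'http://www.example.com/news/index.html');
```

Each URL in $resources would then be downloaded and the references in the saved HTML rewritten to point at the local copies.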
file_get_contents() also supports http(s). Example:
$game = file_get_contents('https://otherdomain.com');
I had a big PHP script written out to scrape images from this site: "http://www.mcso.us/paid/", but when it didn't work I butchered my code to simply echo the whole page.
I found that the table with the image links I want doesn't show up. I believe it's because the remote site uses ASP to generate the table. Is there a way around this? Am I wrong? Please help.
<?php
include("simple_html_dom.php");
set_time_limit(0);
$baseURL = "http://www.mcso.us/paid/";
$html = file_get_html($baseURL);
echo $html;
?>
There's no obvious reason why their use of ASP would cause this. Have you tried navigating the page with JavaScript turned off? It's more likely that the tables are generated through JS.
Do note that the search results are retrieved through Ajax by making a POST request to http://www.mcso.us/paid/default.aspx. You can reproduce that request with cURL (http://php.net/manual/en/book.curl.php). In Chrome, right-click, choose Inspect Element, open the Network tab and run a search; you will see all the details there (POST variables etc.).
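A rough sketch of the cURL side, assuming the curl extension is enabled. postForm() is just an illustrative helper, and the field name in the usage comment is a placeholder: copy the real POST variables from the Network tab, since I don't know what the form actually sends.

```php
<?php
// Sketch: send the same POST the browser sends and return the response body.
// postForm() is an illustrative helper, not a library function.
function postForm(string $url, array $fields): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

// Usage:
//   $html = postForm('http://www.mcso.us/paid/default.aspx', [
//       'SomeField' => 'value', // hypothetical field name
//   ]);
//   The returned HTML can then be handed to simple_html_dom's str_get_html().
```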
Try this sample code I threw together to illustrate a point:
<?php
$url = "http://www.amazon.com/gp/offer-listing/B003WSNV4E/";
$html = file_get_contents($url);
echo($html);
?>
The amazon homepage works fine using this method (it is echoed in the browser), but this page just doesn't output anything. Is there a reason for this, and how can I fix it?
I think your problem is that you're misunderstanding your own code.
You made this comment on the question (emphasis mine):
I've never used those utilities before, so maybe I'm doing it wrong but it only seems to be downloading this page: https://www.amazon.com/gp/offer-listing/B003WSNV4E/ref=dp_olp_new?ie=UTF8&condition=new
This implies to me that an Amazon page is appearing in your browser when you run this code. This is entirely expected.
When you try to download https://rads.stackoverflow.com/amzn/click/B003WSNV4E, you're being redirected to https://www.amazon.com/gp/offer-listing/B003WSNV4E/ref=dp_olp_new?ie=UTF8&condition=new which is the intent of StackOverflow's RADS system.
What happens from there is that your code loads the raw HTML into your $html variable and dumps it straight to the browser. Because you're passing raw HTML to the browser, the browser interprets it as such and tries (successfully) to render the page.
If you just want to see the code, but not render it, then you need to convert it into html entities first:
echo htmlentities($html);
Hey everybody, this issue has had me stumped for the last week or so, here's the situation:
I've got a site hosted using GoDaddy hosting. The three files used in this issue are index.html , milktruck.js , and xml_http_request.php all hosted in the same directory.
The index.html file makes reference to the milktruck.js file with the following code:
<script type="text/javascript" src="milktruck.js"></script>
The milktruck.js file automatically fires when the site is opened. The xml_http_request.php has not fired at this point.
On line 79 out of 2000 I'm passing the variable "simple" to a function within the milktruck.js file with:
placem('p2','pp2', simple, window['lla0_2'],window['lla1_2'],window['lla2_2']);
"simple" was never initialized within the milktruck.js file. Instead I've included the following line of code in the xml_http_request.php file:
echo "<script> var simple = 'string o text'; </script>";
At this point I have not made any reference whatsoever to the xml_http_request.php file within the milktruck.js file. I don't reference that file until line 661 of the milktruck.js file with the following line of code:
xmlhttp.open('GET',"xml_http_request.php?pid="+pid+"&unLoader=true", false);
Everything compiles (I'm assuming because my game runs) , however the placem function doesn't run properly because the string 'string o text' never shows up.
If I was to comment out the line of code within the php file initializing "simple" and include the following line of code just before I call the function placem, everything works fine and the text shows up:
var simple = 'string o text';
Where do you think the problem is here? Do I need to call the php file before I try using the "simple" variable in the javascript file? How would I do that? Or is there something wrong with my code?
So, we meet again!
Buried in the question comments is the link to the actual Javascript file. It's 2,200 lines, 73kb, and poorly formatted. It's also derived from a demo for the Google Earth API.
As noted in both the comments here and in previous questions, you may be suffering from a fundamental misunderstanding about how PHP works, and how PHP interacts with Javascript.
Let's take a look at lines 62-67 of milktruck.js:
//experiment with php and javascript interaction
//'<?php $simpleString = "i hope this works"; ?>'
//var simple = "<?php echo $simpleString; ?>";
The reason this never worked is because files with the .js extension are not processed by PHP without doing some bizarre configuration changes on your server. Being on shared hosting, you won't be able to do that. Instead, you can rename the file with the .php extension. This will allow PHP to process the file, and allow the commands you entered to actually work.
You will need to make one more change to the file. At the very top, the very very top, before anything else, you will need the following line:
<?php header('Content-Type: text/javascript'); ?>
This command tells the browser that the file being returned is Javascript. It is needed because PHP sends a Content-Type of text/html by default, and some browsers will not run the script if it isn't identified as Javascript.
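Put together, the top of the renamed file could look roughly like this. It is only a sketch: the file name milktruck.php and the use of json_encode() are my suggestions, and the script tag in index.html would have to point at milktruck.php instead of milktruck.js.

```php
<?php
// milktruck.php (renamed from milktruck.js). Sketch only.
// The header must be sent before any output.
header('Content-Type: text/javascript');

$simpleString = 'i hope this works';

// json_encode() produces a correctly quoted and escaped JavaScript string literal.
$line = 'var simple = ' . json_encode($simpleString) . ';';
echo $line, "\n";

// ...the rest of the original Javascript follows after the closing PHP tag.
```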
Now that we've got that out of the way...
Instead I've included the following line of code in the xml_http_request.php file: <a script tag>
This is very unlikely to work. If it does work, it's probably by accident. We're not dealing with a normal ajax library here. We're dealing with some wacky thing created by the Google Earth folks a very, very long time ago.
Except for one or two in that entire monolithic chunk of code, there are no ajax requests that actually process the result. This means that it's unlikely that the script tag could be processed. Further, the one or two that do process the result actually treat it as XML and return a document. It's very unlikely that the script tag is processed there either.
This is going to explain why the variable never shows up reliably in Javascript.
If you need to return executable code from your ajax calls, and do so reliably, you'll want to adopt a mature, well-tested Javascript library like jQuery. Don't worry, you can mix and match the existing code and jQuery if you really wanted to. There's an API call just to load additional scripts. If you just wanted to return data, that's what JSON is for. You can have PHP code emit JSON and have jQuery fetch it. That's a heck of a lot faster, easier, and more convenient than your current unfortunate mess.
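The PHP half of the JSON approach is tiny. A sketch, assuming the endpoint stays at xml_http_request.php and keeps the pid parameter from the question; buildPayload() and the keys in the payload are made up for illustration:

```php
<?php
// Sketch of a data-only endpoint: return values as JSON instead of a <script> tag.
// buildPayload() and the 'pid'/'simple' keys are illustrative.
function buildPayload(array $query): array
{
    $pid = isset($query['pid']) ? (int) $query['pid'] : 0;
    return ['pid' => $pid, 'simple' => 'string o text'];
}

$json = json_encode(buildPayload($_GET));
header('Content-Type: application/json');
echo $json;
```

On the Javascript side, jQuery's $.getJSON() would fetch this and hand the decoded object to a callback, where the value is available as data.simple.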
Oh, and get Firebug or use Chrome / Safari's dev tools, they will save you a great deal of Javascript pain.
However...
I'm going to be very frank here. This is bad code. This is horrible code. It's poorly formatted, the commenting is a joke, and there are roughly one point seven billion global variables. The code scares me. It scares me deeply. I would be hesitant to touch it with a ten foot pole.
I would not wish maintenance of this code on my worst enemy, and here you are, trying to do something odd with it.
I heartily encourage you to hone your skills on a codebase that is less archaic and obtuse than this one before returning to this project. Save your sanity, get out while you still can!
Perhaps initialize your values like this:
window.simple = 'blah blah blah'
Then pass window.simple.
You could also use a debugger such as Firebug to see what is going on.
I am using Ajax to load pages into a div. The page loads fine, but I can't run the PHP and JavaScript in the loaded page.
On the server I load the page like this:
file_get_contents('../' . $PAGE_URL);
In the browser I set the content of the div using:
eval("var r = " + response.responseText);
and then setting the innerHTML of that div to the retrieved content.
But when I get the new inner page, no PHP or JavaScript is working. Is it supposed to be like that?
Well, the PHP is not going to work because, the way you are handling it, it is just text. I would suggest using something like include('../' . $PAGE_URL); which runs the file through PHP.
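A quick way to see the difference for yourself. This sketch writes a throwaway file to the temp directory (the file contents are made up) and reads it both ways; note that it only applies to local paths, since fetching a PHP page over http:// returns whatever the remote server already rendered.

```php
<?php
// Sketch: the same file read as text versus executed through include.
$page = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($page, '<p><?php echo 1 + 1; ?></p>');

// file_get_contents() returns the source verbatim; the PHP tag is just text.
$asText = file_get_contents($page);

// include runs the file; output buffering captures what it prints.
ob_start();
include $page;
$executed = ob_get_clean();

unlink($page);
```

Here $asText still contains the literal PHP tag, while $executed holds the HTML that the PHP code produced.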
The JavaScript problem probably has to do with the fact that you are loading <html>, <head> and <body> tags into a div. I'm not sure what happens when you do that, but it shouldn't work properly. Also, <script> elements inserted through innerHTML are not executed by the browser. Try using an <iframe> instead.
For your JavaScript to run properly, you have to wait until the browser has finished loading the page.
The browser fires the load event at that point (window.onload). Your code should run in a handler for that event.
<?php
$file = false;
if(isset($_GET['load']) && is_string($_GET['load'])) {
$tmp = stripslashes($_GET['load']);
$tmp = str_replace(".","",$tmp);
$file = $tmp . '.php';
}
if($file != false && file_exists($file) && is_readable($file)) {
require_once $file;
}
?>
This is called via file.php?load=test
That processes the PHP file, and as long as the file outputs HTML, you can simply do:
target = document.getElementById('page');
target.innerHTML = response.responseText;
Now, I'm fairly certain parts of that are insecure; you could add a whitelist of allowable requires. Ideally it should also look for the files in one specific directory. I'm honestly not too sure about dumping the responseText straight into a div either; security-wise it's ripe for XSS, but it's the end of the day and I haven't looked anything up on that one. Be aware that, without any kind of checking, file_get_contents could send a user to a third-party site, which would be a Very Bad Thing. You could also eval the result of a file_get_contents request in PHP, which is, well, Very Very Bad. For example, try:
<?php
echo file_get_contents("http://www.google.com");
?>
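As for the whitelist idea, a rough sketch of how the loader could be tightened up. resolvePage(), the page names and the pages/ directory are placeholders I picked; the point is only that the requested name has to match an entry on a fixed list exactly.

```php
<?php
// Sketch: only load pages that appear on an explicit whitelist.
// resolvePage(), the page names and the pages/ directory are illustrative.
function resolvePage($requested, array $allowed, string $dir): ?string
{
    if (!is_string($requested) || !in_array($requested, $allowed, true)) {
        return null; // unknown name, array input, traversal attempt...
    }
    return rtrim($dir, '/') . '/' . $requested . '.php';
}

$allowed = ['home', 'about', 'test'];
$file = resolvePage($_GET['load'] ?? null, $allowed, __DIR__ . '/pages');

if ($file !== null && is_readable($file)) {
    require $file;
} else {
    http_response_code(404);
    echo 'Page not found';
}
```

Because the name must match the list exactly, there is no need to strip dots or slashes from the input.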
But I fear I must ask here: why are you doing it this way? This seems a very roundabout way to achieve a hyperlink.
Is this AJAX for AJAX's sake?