Save whole page source using PHP [duplicate]

Possible Duplicate:
Save full webpage
I need to save the page source of an external link using PHP, like saving a page to the PC.
P.S.: the saved folder should contain the images and HTML content.
I tried the code below. It only puts the source in tes.html; I need to save all images too, so the page can be accessed offline.
<?php
include 'curl.php';                        // custom cURL wrapper that defines load()

$game = load("https://otherdomain.com/");
echo $game;

file_put_contents('tes.html', $game);      // saves only the HTML, not the images
?>

What you are trying to do is mirror a web site.
I would use the program wget for that instead of reinventing the wheel:
exec( 'wget -mk -w 20 http://www.example.com/' );
See:
http://en.wikipedia.org/wiki/Wget
http://fosswire.com/post/2008/04/create-a-mirror-of-a-website-with-wget/
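For example, a minimal PHP wrapper around that command might look like the sketch below (the URL and the extra -p flag for page requisites are my additions; wget must be installed and exec() enabled on the host):
<?php
// Sketch: mirror a site with wget from PHP.
// -m    : mirror (recursive download with timestamping)
// -k    : convert links so the local copy works offline
// -p    : also fetch page requisites (images, CSS, JS)
// -w 20 : wait 20 seconds between requests to be polite
$url = 'http://www.example.com/';            // placeholder target
$cmd = 'wget -mkp -w 20 ' . escapeshellarg($url);
exec($cmd, $output, $exitCode);

if ($exitCode !== 0) {
    echo "wget failed with exit code $exitCode\n";
}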

Either write your own solution to parse all the CSS, image and JS links (and save them), or check this answer to a similar question: https://stackoverflow.com/a/1722513/143732

You need to write a scraper and, by the looks of it, you're not yet skilled for such an endeavor. Consider studying:
Web Scraping (cURL, StreamContext in PHP, HTTP theory)
URL paths (relative, absolute, resolving)
DOMDocument and DOMXPath (for parsing HTML and easy tag querying)
Overall HTML structure (IMG, LINK, SCRIPT and other tags that load external content)
Overall CSS structure (like url('...') in CSS that loads resources the page depends on)
Only then will you be able to mirror a site properly. But if the site loads content dynamically, e.g. with Ajax, you're out of luck.
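As a rough sketch of just the parsing step from the list above (not a complete mirroring tool; the URL is a placeholder and resolving relative paths is left out):
<?php
// Sketch: collect the external resources a page depends on.
$html = file_get_contents('https://otherdomain.com/');   // placeholder URL

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate real-world, invalid HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$assets = array();

// Images and scripts use src, stylesheets use href.
foreach ($xpath->query('//img[@src] | //script[@src]') as $node) {
    $assets[] = $node->getAttribute('src');
}
foreach ($xpath->query('//link[@rel="stylesheet"][@href]') as $node) {
    $assets[] = $node->getAttribute('href');
}

// Each entry still has to be resolved against the page URL (relative vs.
// absolute) and downloaded, and CSS files need their url(...) values parsed
// as well -- that is the bulk of the work.
print_r(array_unique($assets));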

file_get_contents() also supports http(s). Example:
$game = file_get_contents('https://otherdomain.com');
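If the remote host is picky about clients, a stream context lets you set request options. A small sketch (the user-agent string and timeout are arbitrary examples):
<?php
// file_get_contents() over HTTP(S) with a few request options.
$context = stream_context_create(array(
    'http' => array(
        'method'     => 'GET',
        'timeout'    => 10,                         // seconds
        'user_agent' => 'Mozilla/5.0 (compatible)', // arbitrary example
    ),
));

$game = file_get_contents('https://otherdomain.com', false, $context);
if ($game === false) {
    die('Download failed');
}
file_put_contents('tes.html', $game);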

Related

Any way to get source code in actual php format from iframe

I am working with an iframe in my PHP project, and I use the iframe to load another PHP project, also from htdocs (local storage).
I use this JavaScript to get the source code of a single page of the project from the iframe window:
var code = document.getElementById("myframe").contentWindow.document.documentElement.outerHTML;
I do get the code, but all the PHP statements have already been evaluated. For example,
<?php echo "hi"; ?> is changed to hi
My question is: is there any way to get the actual PHP source, like
<?php echo "hi"; ?>, into the code variable?
If you want to get the PHP source code, you will have to write a script for that.
Example (reader.php):
<?php
echo file_get_contents('example.php'); // or $_GET['file'] instead of 'example.php'
The iframe would then have to call reader.php?file=example.php.
Be warned
This is highly insecure. You will have to take care of the following things (a sketch of such checks follows the list):
Sanitize user input
Prevent directory traversal
Limit access to only the files you want to be readable
Be very, very sure there is no sensitive information in your PHP files.
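A sketch of such checks, assuming reader.php only ever needs to expose files from one fixed directory (the directory name and whitelist are examples):
<?php
// reader.php -- only serve whitelisted files from a fixed directory.
$allowed = array('example.php', 'other.php');        // explicit whitelist
$baseDir = __DIR__ . '/viewable';                    // example directory

$file = isset($_GET['file']) ? basename($_GET['file']) : '';  // strip path parts
if (!in_array($file, $allowed, true)) {
    http_response_code(404);
    exit('Not found');
}

$path = realpath($baseDir . '/' . $file);
// realpath() resolves symlinks and ".."; make sure we are still inside $baseDir.
if ($path === false || strpos($path, $baseDir . DIRECTORY_SEPARATOR) !== 0) {
    http_response_code(404);
    exit('Not found');
}

header('Content-Type: text/plain; charset=utf-8');
echo file_get_contents($path);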

Get ajax generated content from another website

I have an automated archive of several (media) websites' front pages, written in PHP. Specifically, I am copying the HTML in the <body> tag twice a day, and I have a copy of all their CSS and JS files, so I can recreate the front page from any point in the past.
Now I have run into a problem with one of those websites: they load the main slider content (most important news) with an Ajax call. I would like this Ajax call to be executed before I parse the data, rather than getting just a blank div. By looking around, I found out they use a WordPress plugin named lof-jslidernews2, but I can't find the specific Ajax call to see the URL and make a cURL request. Any ideas how to achieve this?
The website: http://fokus.mk/
My code (I had to parse manually like this because of some problems with DOMDocument and invalid HTML):
// ...
if ($html = file_get_contents($row['page_url'])) {
    // str_before() is a custom helper: returns everything before the needle
    $content = strstr($html, '<body');
    $content = str_before($content, '</body>') . '</body>';
    $filename = date('YmdHis') . $row['page_name'];
    if ($success = file_put_contents('app/webroot/files/' . $filename, $content)) {
// ....
** There is nothing illegal about my project, I am not stealing content, just freezing frontpages for later comparison. I have consulted a lawyer about this. :)
I don't know why, but the guy that actually solved my problem deleted his answer. So, here it is:
He suggested using an emulator, specifically Mink. It was easy to install (using composer) and did the job on the first try. Awesome library.
Mink is an open source browser controller/emulator for web applications, written in PHP 5.3.
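For reference, a minimal sketch of that approach (the driver choice and timing are illustrative; check Mink's own documentation for setup details):
<?php
require 'vendor/autoload.php';

use Behat\Mink\Session;
use Behat\Mink\Driver\Selenium2Driver;

// A JavaScript-capable driver is needed so the Ajax slider actually runs;
// cURL-style drivers would only return the empty placeholder div.
$driver  = new Selenium2Driver('firefox');
$session = new Session($driver);

$session->start();
$session->visit('http://fokus.mk/');
sleep(5);   // crude: give the Ajax call time to finish (Mink also offers wait())

$html = $session->getPage()->getContent();   // rendered HTML, slider included
$session->stop();

file_put_contents('frontpage.html', $html);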

Grabbing content through php curl

I am trying to develop a content grabber using PHP cURL. I need to retrieve content from a URL, e.g. http://mashable.com/2011/10/31/google-reader-backlash-sharebros-petition/, and store it in a CSV file. For example, if I enter a URL to extract data from, it should store the title, content and tags in the CSV, then do the same for the next URL. Is there any snippet like that?
The following code returns all the content; I need to specifically pull out the title and content of the post:
<?php
$homepage = file_get_contents('http://mashable.com/2011/10/28/occupy-wall-street-donations/');
echo strip_tags($homepage);
?>
There are many ways. In effect, you want to parse an HTML file. strip_tags is one way, but a dirty one.
I recommend using the DOMDocument class for this (there should be many other examples here on stackoverflow.com). The rest is standard PHP; writing to and reading from a CSV is well documented on php.net.
Example for getting links on a website (not by me):
http://php.net/manual/en/class.domdocument.php#95894
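A rough sketch of that approach (the XPath expression for the article body is a guess at the markup and will need adjusting per site):
<?php
// Sketch: pull the title and article text out of a page and append to a CSV.
$url  = 'http://mashable.com/2011/10/28/occupy-wall-street-donations/';
$html = file_get_contents($url);

$doc = new DOMDocument();
libxml_use_internal_errors(true);       // the markup is not valid XML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$title = '';
$node  = $xpath->query('//title')->item(0);
if ($node) {
    $title = trim($node->textContent);
}

// Guess: article body lives in a container like <div class="post-content">.
$content = '';
foreach ($xpath->query('//div[contains(@class, "post-content")]//p') as $p) {
    $content .= trim($p->textContent) . "\n";
}

$fh = fopen('articles.csv', 'a');
fputcsv($fh, array($url, $title, $content));
fclose($fh);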

How to compress HTML output except on a certain div?

I'd like to use this function:
ob_start('no_returns');
function no_returns($a) {
    // Strip line breaks and tabs from the buffered output.
    return str_replace(array("\r\n", "\r", "\n", "\t"), '', $a);
}
But when I do, it completely kills Disqus comments, so I'd like to ignore the div "disqus_thread". How would I go about doing that without some heavy searching?
If you are looking to speed up the download of the web page, you might try another method:
<?php
ob_start('ob_gzhandler');
// html code here
This compresses the output much more efficiently, and the visitor's browser will automatically decompress it in real time.
A related thread is here: http://bytes.com/topic/php/answers/621308-compress-html-output-php
(This is the PHP way to compress web pages without touching the web server configuration, e.g. gzip/mod_deflate on Apache, as mentioned above.)
Try a regular expression with preg_replace.
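A rough sketch of that idea (not from the original answer), using preg_replace_callback to protect the Disqus div before stripping whitespace; the pattern assumes the div contains no nested <div> tags:
<?php
function no_returns($html) {
    $placeholder = '<!--DISQUS_PLACEHOLDER-->';
    $protected   = null;

    // Pull out the div we must not touch and park a placeholder in its spot.
    $html = preg_replace_callback(
        '#<div id="disqus_thread".*?</div>#s',
        function ($m) use (&$protected, $placeholder) {
            $protected = $m[0];
            return $placeholder;
        },
        $html
    );

    // Collapse line breaks and tabs everywhere else.
    $html = str_replace(array("\r\n", "\r", "\n", "\t"), '', $html);

    // Put the untouched block back.
    if ($protected !== null) {
        $html = str_replace($placeholder, $protected, $html);
    }
    return $html;
}

ob_start('no_returns');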

How can I take a snapshot of a web page's DOM structure?

I need to compare a web page's DOM structure at various points in time. What are the ways to retrieve and snapshot it?
I need the DOM on server-side for processing.
I basically need to track structural changes to a web page, such as the removal of a div tag or the insertion of a p tag. Changing the data (innerHTML) inside those tags should not be seen as a difference.
$html_page = file_get_contents("http://awesomesite.com");
$html_dom = new DOMDocument();
$html_dom->loadHTML($html_page);
That uses PHP DOM. Very simple and actually a bit fun to use. Reference
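If only the structure matters (text changes inside tags should not count), one rough approach, not part of the original answer, is to strip text nodes before serializing and comparing:
<?php
// Sketch: reduce a page to its tag structure so innerHTML changes are ignored.
function structure_only($url) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML(file_get_contents($url));
    libxml_clear_errors();

    remove_text_nodes($doc->documentElement);
    return $doc->saveHTML();
}

function remove_text_nodes(DOMNode $node) {
    // Copy the child list first, because removing children mutates childNodes.
    $children = array();
    foreach ($node->childNodes as $child) {
        $children[] = $child;
    }
    foreach ($children as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $node->removeChild($child);
        } else {
            remove_text_nodes($child);
        }
    }
}

// Two snapshots taken at different times can then be compared directly:
$before = structure_only('http://awesomesite.com');
// ... later ...
$after  = structure_only('http://awesomesite.com');
var_dump($before === $after);   // false when a tag was added or removed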
EDIT: After clarification, a better answer lies here.
Perform the following steps on server-side:
Retrieve a snapshot of the webpage via HTTP GET
Save consecutive snapshots of a page with different names for later comparison
Compare the files with an HTML-aware diff tool (see HtmlDiff tool listing page on ESW wiki).
As a proof-of-concept example with Linux shell, you can perform this comparison as follows:
wget --output-document=snapshot1.html http://example.com/
wget --output-document=snapshot2.html http://example.com/
diff snapshot1.html snapshot2.html
You can of course wrap these commands up in a server-side program or script, for example:
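A minimal PHP wrapper around those commands might look like this (paths and URL are placeholders; wget and diff must be available to the web server user):
<?php
// Sketch: take a timestamped snapshot and diff it against the previous one.
$url  = 'http://example.com/';                       // placeholder
$dir  = __DIR__ . '/snapshots';
$file = $dir . '/' . date('YmdHis') . '.html';

exec('wget --output-document=' . escapeshellarg($file) . ' ' . escapeshellarg($url));

$snapshots = glob($dir . '/*.html');
sort($snapshots);

if (count($snapshots) >= 2) {
    $prev = $snapshots[count($snapshots) - 2];
    $last = $snapshots[count($snapshots) - 1];
    // Plain diff here; swap in an HTML-aware tool (e.g. DaisyDiff) for real use.
    exec('diff ' . escapeshellarg($prev) . ' ' . escapeshellarg($last), $out);
    echo implode("\n", $out);
}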
For PHP, I would suggest taking a look at daisydiff-php. It provides a PHP class that makes it easy to create an HTML-aware diff tool. Example:
<?php
require_once('HTMLDiff.php');

$file1 = file_get_contents('snapshot1.html');
$file2 = file_get_contents('snapshot2.html');   // the original loaded snapshot1.html twice

// Class and method names as given in the original answer; check the
// daisydiff-php documentation for the exact API.
$differ = new HTMLDiffer();
$differ->htmlDiffer($file1, $file2);
?>
Note that with file_get_contents you can also retrieve data from a given URL.
DaisyDiff itself is also a very fine tool for visualising structural changes.
If you use Firefox, Firebug lets you view the DOM structure of any web page.
