I have this code and it works, but it saves the page "too fast": in the middle of the saved HTML file I just see a "Loading image" placeholder. How can I delay the script, or make it wait until everything on the page has loaded, before saving it to a file?
<?php
$file = fopen("brawl2.html", "w");
$c = curl_init();
curl_setopt($c, CURLOPT_URL, "https://brawlstats.com/club/8LG08L");
curl_setopt($c, CURLOPT_FILE, $file);
curl_exec($c);
curl_close($c);
fclose($file);
?>
Thanks for the help!
Curl is not emulating a browser, it is just downloading a single file from the server, so it will never load these images.
In HTTP, a user agent (normally a browser, but in this case the curl library) sends a request for a particular resource (URL); then the server does whatever it needs to do, and then returns a response; and then you're done.
In your case, the server is responding with an HTML page that contains some JavaScript. When loaded by a browser, this JavaScript will run, and load the images; but curl is not a browser, so will not run this JavaScript.
There are libraries that do emulate a browser, which would be able to run this; they are referred to as "headless browsers", and a quick search turned up this attempt at a comprehensive list.
It's also worth remembering that even once the JavaScript is run, the images are probably not part of the HTML, but references to other files. If you don't save those, your saved HTML won't show any images if you unplug your internet, so you may also need to think about how to archive all the resources needed to display the page, not just the page itself.
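For example, if you have a Chrome or Chromium binary available on the server, you can let it render the page (running its JavaScript) and dump the resulting DOM, then save that instead of the raw cURL output. This is only a sketch under that assumption; the binary name and path may differ on your system:
<?php
// Sketch only: assumes a Chromium/Chrome binary is installed on the server.
// --headless --dump-dom renders the page (executing its JavaScript) and
// prints the resulting DOM to stdout, which we then write to the file.
$url  = 'https://brawlstats.com/club/8LG08L';
$html = shell_exec('chromium --headless --disable-gpu --dump-dom ' . escapeshellarg($url));
if (!empty($html)) {
    file_put_contents('brawl2.html', $html);
}
?>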
Related
I'm scraping a site, searching for JPGs to download.
Scraping the site's HTML pages works fine.
But when I try getting the JPGs with cURL, copy(), fopen(), etc., I get a 403 Forbidden status.
I know that's because the site owners don't want their images scraped, so I understand a good answer would be just don't do it, because they don't want you to.
Ok, but let's say it's ok and I try to work around this, how could this be achieved?
If I get the same URL with a browser, I can open the image perfectly, it's not that my IP is banned or anything, and I'm testing the scraper one file at a time, so it's not blocking me because I make too many requests too often.
From my understanding, it could be that either the site is checking for some cookies that confirm that I'm using a browser and browsing their site before I download a JPG.
Or that maybe PHP is using some user agent for the requests that the server can detect and filter out.
Anyway, do you have any ideas?
Actually, it was quite simple.
As @Leigh suggested, it only took spoofing an HTTP referer with the CURLOPT_REFERER option.
In fact, for every request I just provided the domain name as the referer, and it worked.
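A minimal sketch of what that looks like (the image URL and referer value are placeholders):
<?php
// Sketch: fetch an image while sending a Referer header so the server sees
// the request as coming from a page on its own domain.
$imageUrl = 'https://example.com/images/photo.jpg'; // placeholder
$referer  = 'https://example.com/';                 // placeholder
$ch = curl_init($imageUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_REFERER, $referer);
// Some sites also check the user agent, so send a browser-like one too.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)');
$data = curl_exec($ch);
curl_close($ch);
if ($data !== false) {
    file_put_contents('photo.jpg', $data);
}
?>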
Are you able to view the page through a browser? Wouldn't a simple search of the page source find all images?
$findme = '.jpg';
$pos = strpos($html, $findme);
if ($pos === false) {
    echo "The string '$findme' was not found in the string '$html'";
} else {
    echo "Images found...";
    // grab image location code
}
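If you need every image rather than a single match, one way to collect all the image locations from the fetched HTML is to parse it with DOMDocument instead of searching for substrings (a sketch, assuming $html already holds the page source):
<?php
// Sketch: collect the src attribute of every <img> tag in the fetched HTML.
// Assumes $html already contains the page source (e.g. returned by cURL).
$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid markup
$doc->loadHTML($html);
libxml_clear_errors();
$imageUrls = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    if ($src !== '') {
        $imageUrls[] = $src;
    }
}
// $imageUrls now holds the image locations to feed into your download code.
?>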
Basic image retrieval:
This uses the GD library extension, commonly installed by default with many web hosts. It is something of an ugly hack, but some may find it useful to know it can be done this way.
$remote_img = 'http://www.somewhere.com/images/image.jpg';
$img = imagecreatefromjpeg($remote_img); // reading a remote URL requires allow_url_fopen
$path = 'images/image.jpg';              // destination must be a file name, not just a directory
imagejpeg($img, $path);
Classic cURL image-grabbing function for when you have extracted the location of the image from the donor page's HTML.
function save_image($img, $fullpath){
    $ch = curl_init($img);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $rawdata = curl_exec($ch);
    curl_close($ch);
    if(file_exists($fullpath)){
        unlink($fullpath);
    }
    $fp = fopen($fullpath, 'x');
    fwrite($fp, $rawdata);
    fclose($fp);
}
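Used, for example, like this once you have extracted an image URL from the page (the URL and destination path are placeholders):
// Example call (placeholder URL and destination path):
save_image('http://www.somewhere.com/images/image.jpg', 'images/image.jpg');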
If the basic cURL image grabbing function fails, the donor site probably has some form of server-side defences in place to prevent retrieval, and you are probably breaching its terms of service by proceeding further. Though rare, some sites do create images 'on the fly' using the GD library, so what may look like a link to an image is actually a PHP script, and that script could be checking for things like a cookie, referer or session value being passed to it before the image is created and output.
Whenever I use curl (PHP) to download a page, it seems to download everything on the page: images, CSS files, JavaScript files. Sometimes I don't want to download these. Can I control which resources are downloaded through curl? I have gone through the manual but I haven't found an option for this. Please don't suggest getting the whole page and then using some regex magic, because that would still download the page and increase load time.
Here is demo code where I download a page from mozilla.com:
<?php
$url = "http://www.mozilla.com/en-US/firefox/new/";
$userAgent = "Mozilla/5.0 (Windows NT 5.1; rv:2.0) Gecko/20100101 Firefox/4.0";
$encoding = "gzip, deflate";
// CURLOPT_HTTPHEADER expects complete "Name: value" strings
//$header[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 115";
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_ENCODING, $encoding);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
$content = curl_exec($ch);
curl_close($ch);
echo $content;
?>
When I echo the content, it shows the images too. I saw in Firebug's network tab that images and external JS files are being downloaded.
PHP's curl only fetches what you tell it to. It doesn't parse the HTML to look for JavaScript/CSS <link> tags or <img> tags, and it doesn't fetch them automatically.
If you have curl downloading those resources, then it's your code telling it to do so, and it's up to you to decide what to fetch and what not to. Curl only does what you tell it to.
You can stop the browser from downloading those resources by escaping the markup before you output it:
echo htmlentities($content);
I want users to be able to upload an image, or just paste the URL of an image, to set it as their profile picture on my website.
The point is, I don't want to store the URL; I want a copy of that image on my server, because if the external image is lost I don't want to lose it either.
I believe Facebook, Tumblr, etc. do this. What is the PHP script or best practice to do that?
Thanks!
You can get the contents (bytes) of the image using PHP's file_get_contents function (http://php.net/manual/en/function.file-get-contents.php):
$contents = file_get_contents('http://www.google.com/images/logos/ps_logo2.png');
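To keep a copy on your own server, you can then write those bytes straight back to disk (a sketch; the destination path is only an example and must be writable):
<?php
// Sketch: fetch the remote image and store a local copy of it.
$contents = file_get_contents('http://www.google.com/images/logos/ps_logo2.png');
if ($contents !== false) {
    file_put_contents('uploads/avatars/ps_logo2.png', $contents);
}
?>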
You can use the cURL library as well. Here's an example of how you can download an image from a URL and save it to a local location:
function downloadImageFromUrl($imageLinkURL, $saveLocationPath) {
    $channel = curl_init();
    curl_setopt($channel, CURLOPT_URL, $imageLinkURL);
    curl_setopt($channel, CURLOPT_POST, 0);
    curl_setopt($channel, CURLOPT_RETURNTRANSFER, 1);
    $fileBytes = curl_exec($channel);
    curl_close($channel);
    $fileWriter = fopen($saveLocationPath, 'w');
    fwrite($fileWriter, $fileBytes);
    fclose($fileWriter);
}
You can use this as follows:
downloadImageFromUrl("http://www.google.com/images/logos/ps_logo2.png", "/tmp/ps_logo2.png");
You can also keep the image's original file name by parsing it out of the URL.
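For instance (a sketch; basename() on the URL path gives you the original file name):
<?php
// Sketch: derive the original file name from the image URL before saving.
$imageLinkURL = 'http://www.google.com/images/logos/ps_logo2.png';
$fileName = basename(parse_url($imageLinkURL, PHP_URL_PATH)); // "ps_logo2.png"
downloadImageFromUrl($imageLinkURL, '/tmp/' . $fileName);
?>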
I think this is what you are looking for: Copy Image from Remote Server Over HTTP
You can use a function such as imagecreatefromjpeg. It takes a URL or path to the image and creates a new image resource from it. Have a look at http://php.net/manual/en/function.imagecreatefromjpeg.php
There are different functions for different extensions, though (if you prefer this approach). You may need to check the image extension from the URL and use the appropriate function, as sketched below.
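A sketch of what that dispatch might look like (the URL is a placeholder, only the common formats are handled, and reading a remote URL this way requires allow_url_fopen):
<?php
// Sketch: pick the right GD loader based on the extension in the URL.
$url = 'http://www.example.com/images/avatar.png'; // placeholder
$ext = strtolower(pathinfo(parse_url($url, PHP_URL_PATH), PATHINFO_EXTENSION));
switch ($ext) {
    case 'jpg':
    case 'jpeg':
        $img = imagecreatefromjpeg($url);
        break;
    case 'png':
        $img = imagecreatefrompng($url);
        break;
    case 'gif':
        $img = imagecreatefromgif($url);
        break;
    default:
        $img = false; // unsupported format
}
?>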
Handling uploads is covered in this documentation. If users paste a URL, I'd recommend using file_get_contents to save a copy of the image to your server; then you can simply store the path to that local copy rather than the external image.
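Putting the two cases together might look roughly like this (a sketch; the field names, upload directory and missing validation are placeholders you would adapt):
<?php
// Sketch: accept either an uploaded file or a pasted image URL and keep a
// local copy either way. Field names and the target directory are examples.
$dir = 'uploads/avatars/';
$target = null;
if (!empty($_FILES['avatar']['tmp_name'])) {
    // Case 1: a file was uploaded through the form.
    $target = $dir . basename($_FILES['avatar']['name']);
    move_uploaded_file($_FILES['avatar']['tmp_name'], $target);
} elseif (!empty($_POST['avatar_url'])) {
    // Case 2: the user pasted a URL; fetch it and store our own copy.
    $contents = file_get_contents($_POST['avatar_url']);
    if ($contents !== false) {
        $target = $dir . basename(parse_url($_POST['avatar_url'], PHP_URL_PATH));
        file_put_contents($target, $contents);
    }
}
// Store $target (the local path) with the user record, not the external URL.
?>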
I am having trouble with the download helper in IE. Basically, I built a site that dynamically creates PDF invoices and PDF proofs. In both cases the forced download works great in Firefox, Chrome and Opera. In IE it fails every time and I get the following error:
Unable to download $filename from mysite.com
Unable to open this Internet site. The requested site is either unavailable or cannot be found. Please try again later.
To begin the force_download, I have an anchor with target _blank whose URL points to the following controller:
function view_uploaded_file($order = 0, $name = NULL){
    $this->load->helper('directory');
    $params['where'] = array('id' => id_clean($order));
    $data['order'] = $this->MOrders->get($params);
    if($data['order']->id < 1){
        redirect('admin/orders');
    }
    $name = db_clean(urldecode($name));
    $map = directory_map('./uploads/customer_order_uploads/'.$data['order']->user_id.'/'.$data['order']->id, 1);
    if(is_array($map) && in_array($name, $map)){
        $this->load->helper('download');
        $data = file_get_contents('./uploads/customer_order_uploads/'.$data['order']->user_id.'/'.$data['order']->id.'/'.urldecode($name));
        force_download($name, $data);
    } else {
        redirect('admin/orders');
    }
}
Originally I thought it might be a problem with my copy of IE, but I am able to download PDFs on other sites. I then thought it could be a problem with CodeIgniter's download helper, but I see they already made special provisions for IE in the helper.
If you have any ideas please let me know. Thank you.
Frankly, I am not sure why we bothered with a helper for downloads in CodeIgniter.
It's not that hard to do in pure PHP:
This Wonderful Question/Answer outlines how to do it quite nicely.
The real thing to remember is the Content-Disposition: attachment part of the headers. It's what tells the browser that the file should be downloaded and saved rather than shown in the browser.
All browsers handle things differently; maybe you have something in your IE install that's overriding the behaviour, but if you follow the instructions in the linked article, you should get files downloaded correctly in all browsers.
Essentially there are 3 things we need to tell the browser:
Content Type
File Name
How to treat the incoming data
(Optional Fourth, if you have it) File Size (Content-Length)
Then you just dump that data right out to the output buffer.
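In plain PHP that boils down to something like this (a sketch; the file path and name are placeholders):
<?php
// Sketch: force a PDF download with plain headers. Path and name are examples.
$filePath = '/path/to/invoice.pdf';
$fileName = 'invoice.pdf';
header('Content-Type: application/pdf');                                 // content type
header('Content-Disposition: attachment; filename="' . $fileName . '"'); // treat as a download
header('Content-Length: ' . filesize($filePath));                        // optional file size
readfile($filePath); // dump the data straight out to the output buffer
exit;
?>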
Response
In response to your replies: it's probably a security feature not to automatically download something in a popup window, probably one of the things IE introduced to combat its previous security holes.
Well, I have found at least a temporary fix for the problem. All my links for forced downloads had target _blank; once I created standard, non-popup links, the file downloads worked in IE. There is probably some kind of workaround, but I also realized there is really no need for a popup window for the download anyway; the download dialog box already serves that purpose.
I have an application which has one input=file.
Now I need to upload the file to my server and then move it to some other server. How can I avoid a timeout?
Also, any good suggestions for an AJAX uploader? Thanks.
Flash Uploader: Undoubtedly, SWFUpload or Uploadify (based on the latter).
File transfer: use PHP cURL to do an HTTP POST form transfer (see the second example at http://www.php.net/manual/en/function.curl-setopt.php).
Before doing the transfer do the following:
set_time_limit(-1); // PHP won't timeout
ignore_user_abort(true); // PHP won't quit if the user aborts
Edit: I don't see a valid reason why you would need a cron job unless the file in question changes at some point (which is the real definition of syncing). On the other hand, if what you want is to just copy the file to a remote server, there's no reason you can't do it with plain PHP.
Also, one thing you should be aware of is file size. If the file size is anything less than 20 MB, you're safe.
Edit 2: By the way, with the right conditions (output buffering off and implicit output on), you can show the user the current remote transfer progress. I've done it; it isn't hard, really. You just need a hidden iframe which sends progress updates to the parent window.
It works kind of like AJAX, but using an iframe in place of XHR (since XHR returns its response in bulk, not in blocks, unlike an iframe).
If interested, I can help you out with this, just ask.
Edit 3: Dynamic remote upload example/explanation:
To make things short, I'll assume that your file has already been uploaded to your server by the user, but not to the target remote server. I'll also assume the user lands on handle.php after uploading the file.
handle.php would look like:
<?php
// This current script is only cosmetic - though you might want to
// handle the user upload here (as I did)
$name = 'userfile'; // name of uploaded file (input box) YOU MUST CHANGE THIS
$new_name = time().'.'.pathinfo($_FILES[$name]['name'], PATHINFO_EXTENSION); // the (temporary) filename
move_uploaded_file($_FILES[$name]['tmp_name'], 'uploads/'.$new_name);
$url = 'remote.php?file='.$new_name; ?>
<iframe src="<?php echo $url; ?>" width="1" height="1" frameborder="0" scrolling="no"></iframe>
<div id="progress">0%</div>
<script type="text/javascript">
function progress(percent){
    document.getElementById('progress').innerHTML=percent+'%';
}
</script>
Doesn't look difficult so far, no?
The next part is a little more complex. The file remote.php would look like:
<?php
set_time_limit(0); // PHP won't timeout
// if you want the user to be able to cancel the upload, simply comment out the following line
ignore_user_abort(true); // PHP won't quit if the user aborts
// to make this system work, we need to tweak output buffering
while(ob_get_level())ob_end_clean(); // remove all buffers
ob_implicit_flush(true); // ensures everything we output is sent to browser directly
function progress($percent){
    // since we're in an iframe, we need "parent" to be able to call the js
    // function "progress" which we defined in the other file.
    echo '<script type="text/javascript">parent.progress('.$percent.');</script>';
}
function curlPostFile($url,$file=null,$onprogress=null){
    $ch = curl_init(); // this call was missing in the original snippet
    curl_setopt($ch,CURLOPT_URL,$url);
    if(substr($url,0,8)=='https://'){
        curl_setopt($ch,CURLOPT_HTTPAUTH,CURLAUTH_ANY);
        curl_setopt($ch,CURLOPT_SSL_VERIFYPEER,false);
    }
    if($onprogress){
        curl_setopt($ch,CURLOPT_NOPROGRESS,false);
        curl_setopt($ch,CURLOPT_PROGRESSFUNCTION,$onprogress);
    }
    curl_setopt($ch,CURLOPT_HEADER,false);
    // K2FUAGENT was an undefined constant here; any user agent string will do
    curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (compatible; remote-transfer)');
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);
    curl_setopt($ch,CURLOPT_FOLLOWLOCATION,true);
    curl_setopt($ch,CURLOPT_MAXREDIRS,50);
    $fh = null;
    if($file){
        $fh = fopen($file,'rb'); // open the local file for reading
        curl_setopt($ch,CURLOPT_UPLOAD,true); // stream the file as the request body
        curl_setopt($ch,CURLOPT_INFILE,$fh);
        curl_setopt($ch,CURLOPT_INFILESIZE,filesize($file));
    }
    $data = curl_exec($ch);
    curl_close($ch);
    if($fh){
        fclose($fh);
    }
    return $data;
}
$file = 'uploads/'.basename($_REQUEST['file']);
// Note: on PHP 5.5+ the progress callback also receives the cURL handle as
// its first argument, before these four size parameters.
function onprogress($download_size,$downloaded,$upload_size,$uploaded){
    if($upload_size > 0){
        progress($uploaded/$upload_size*100); // call our progress function
    }
}
curlPostFile('http://someremoteserver.com/handle-uploads.php',$file,'onprogress');
progress(100); // finished!
Use e.g. scp or rsync to transfer the file to another server. Do that with a cron job every couple of minutes, not from your PHP script; that will prevent any timeouts occurring if the server-to-server transfer takes too long.
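A minimal sketch of that approach, assuming SSH key authentication is already set up between the two servers; the directories, host and cron schedule are placeholders:
<?php
// transfer.php - run from cron, e.g.: */5 * * * * php /var/www/scripts/transfer.php
// Sketch only: the local directory and the remote user/host/path are placeholders.
$localDir = '/var/www/uploads/outgoing/';
$remote   = 'deploy@other-server.example.com:/var/www/uploads/incoming/';
// -a preserves permissions and timestamps, -z compresses during transfer.
$cmd = 'rsync -az ' . escapeshellarg($localDir) . ' ' . escapeshellarg($remote);
exec($cmd, $output, $exitCode);
if ($exitCode !== 0) {
    error_log('rsync failed with exit code ' . $exitCode);
}
?>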