Running Out Of Memory With Fread - php

I'm using Backblaze B2 to store files and am using their documentation code to upload via their API. However, their code uses fread to read the file, which causes problems for files larger than 100MB because it loads the entire file into memory. Is there a better way to do this that doesn't load the entire file into RAM?
$file_name = "file.txt";
$my_file = "<path-to-file>" . $file_name;
$handle = fopen($my_file, 'r');
$read_file = fread($handle,filesize($my_file));
$upload_url = ""; // Provided by b2_get_upload_url
$upload_auth_token = ""; // Provided by b2_get_upload_url
$bucket_id = ""; // The ID of the bucket
$content_type = "text/plain";
$sha1_of_file_data = sha1_file($my_file);
$session = curl_init($upload_url);
// Add read file as post field
curl_setopt($session, CURLOPT_POSTFIELDS, $read_file);
// Add headers
$headers = array();
$headers[] = "Authorization: " . $upload_auth_token;
$headers[] = "X-Bz-File-Name: " . $file_name;
$headers[] = "Content-Type: " . $content_type;
$headers[] = "X-Bz-Content-Sha1: " . $sha1_of_file_data;
curl_setopt($session, CURLOPT_HTTPHEADER, $headers);
curl_setopt($session, CURLOPT_POST, true); // HTTP POST
curl_setopt($session, CURLOPT_RETURNTRANSFER, true); // Receive server response
$server_output = curl_exec($session); // Let's do this!
curl_close ($session); // Clean up
echo ($server_output); // Tell me about the rabbits, George!
I have tried using:
curl_setopt($session, CURLOPT_POSTFIELDS, array('file' => '@'.realpath('file.txt')));
However I get an error response: Error reading uploaded data: SocketTimeoutException(Read timed out)
Edit: Streaming the filename within cURL also doesn't seem to work.

The issue you are having is related to this line:
fread($handle,filesize($my_file));
With the filesize in there you might as well just call file_get_contents. It is much better memory-wise to read one line at a time with fgets:
$handle = fopen($my_file, 'r');
while (($line = fgets($handle)) !== false) {
    // process $line here; only one line is held in memory at a time
}
fclose($handle);
This way you only read one line into memory, but if you need the full file contents you will still hit a bottleneck.
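For a binary file, fixed-size chunks with fread are usually a better fit than lines. Here is a minimal sketch of that idea, computing the SHA-1 incrementally so memory use stays flat. The function name and the 8192-byte chunk size are my own choices for illustration; as far as I know sha1_file() already reads the file in small pieces, so this is only meant to show the pattern.

```php
<?php
// Sketch: hash a file in fixed-size chunks so only one chunk is in memory.
// sha1_in_chunks() and $chunk_size are illustrative, not part of the B2 code.
function sha1_in_chunks($path, $chunk_size = 8192)
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $context = hash_init('sha1');
    while (!feof($handle)) {
        $chunk = fread($handle, $chunk_size);
        if ($chunk === false) {
            break; // read error; stop instead of looping forever
        }
        hash_update($context, $chunk);
    }
    fclose($handle);
    return hash_final($context); // same hex format as sha1_file()
}
```

The same loop shape works for anything that can consume the file piece by piece.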
The only real way is to stream the upload.
I did a quick search, and it seems cURL will stream the file by default if you give it the filename:
$post_data['file'] = 'myfile.csv';
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
You can see the previous answer for more details
Is it possible to use cURL to stream upload a file using POST?
So as long as you can get past the sha1_file call, it looks like you can just stream the file, which should avoid the memory issues. There may be issues with the time limit, though. Also, I can't really think of a way around getting the hash if that fails.
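Another way to stream, which I have not tested against B2, is to hand cURL an open file handle and let it read the request body in chunks. This is a sketch under assumptions: the two function names are made up for illustration, and it relies on the upload endpoint accepting a raw POST body with a Content-Length header. CURLOPT_UPLOAD normally produces a PUT, so CURLOPT_CUSTOMREQUEST is used to switch the method back to POST.

```php
<?php
// Sketch only: build cURL options that stream a file as the raw request body.
// Function names are hypothetical; untested against the real B2 endpoint.
function streaming_upload_options($upload_url, $handle, $size, array $headers)
{
    return array(
        CURLOPT_URL            => $upload_url,
        CURLOPT_UPLOAD         => true,     // read the body from CURLOPT_INFILE
        CURLOPT_CUSTOMREQUEST  => 'POST',   // CURLOPT_UPLOAD alone would send PUT
        CURLOPT_INFILE         => $handle,  // open file handle, read in chunks
        CURLOPT_INFILESIZE     => $size,    // lets cURL send Content-Length
        CURLOPT_HTTPHEADER     => $headers,
        CURLOPT_RETURNTRANSFER => true,
    );
}

function stream_upload($upload_url, $path, array $headers)
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $session = curl_init();
    curl_setopt_array(
        $session,
        streaming_upload_options($upload_url, $handle, filesize($path), $headers)
    );
    $response = curl_exec($session); // false on failure
    fclose($handle);
    return $response;
}
```

You would pass the same $headers array as in the question (Authorization, X-Bz-File-Name, Content-Type, X-Bz-Content-Sha1) and drop the fread and CURLOPT_POSTFIELDS lines.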
Just FYI, I have never tried this personally; typically I just use SFTP for large file transfers. So I don't know if it has to be specifically $post_data['file']; I just copied that from the other answer.
Good luck...
UPDATE
Seeing as streaming seems to have failed (see comments), you may want to test the streaming on its own to make sure it works. I don't know what all that would involve; maybe stream a file to your own server? I am also not sure why it wouldn't work "as advertised", and you may have tested it already. But it never hurts to test something: never assume something works until you know for sure. It's very easy to try something new as a solution, only to miss a setting or get a path wrong, and then fall back to thinking it's all caused by the original issue.
I've spent a lot of time tearing things apart only to realize I had a spelling error. I'm pretty adept at programming these days, so I typically overthink the errors too. My point is, be sure it's not a simple mistake before moving on.
Assuming everything is set up right, I would try file_get_contents. I don't know if it will be any better, but it's meant for reading whole files. It would also be more readable in the code, because then it's clear that the whole file is needed. It just seems more semantically correct, if nothing else.
You can also increase the RAM PHP has access to by using
ini_set('memory_limit', '512M');
You can even go higher than that, depending on your server. The highest I have gone was 3G, but the server I use has 54GB of RAM and that was a one-time thing (we migrated 130 million rows from MySQL to MongoDB, and the InnoDB index was eating up 30+GB). Typically I run with 512M and have some scripts that routinely need 1G. But I wouldn't just raise the memory limit willy-nilly; that is usually a last resort for me after optimizing and testing. We do a lot of heavy processing, which is why we have such a big server; we also have 2 slave servers (among other things) that run with 16GB each.
As for what size to set, I typically increase it by 128M until it works, then add an extra 128M just to be sure, but you might want to go in smaller steps. People typically use multiples of 8, but I don't know if that makes much difference these days.
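If you would rather measure than guess, you can compare the script's peak usage against the configured limit. This is just a rough sketch; ini_shorthand_to_bytes() is a helper name I made up, and it only handles the plain K/M/G suffixes.

```php
<?php
// Sketch: convert a php.ini shorthand value such as "512M" to bytes.
// ini_shorthand_to_bytes() is a hypothetical helper, not a built-in.
function ini_shorthand_to_bytes($value)
{
    $value  = trim($value);
    $number = (int) $value; // (int) "512M" yields 512, (int) "-1" yields -1
    switch (strtoupper(substr($value, -1))) {
        case 'G':
            return $number * 1024 * 1024 * 1024;
        case 'M':
            return $number * 1024 * 1024;
        case 'K':
            return $number * 1024;
        default:
            return $number; // plain byte count, or -1 for "no limit"
    }
}

$limit = ini_shorthand_to_bytes(ini_get('memory_limit'));
$peak  = memory_get_peak_usage(true); // bytes allocated from the system so far
echo "Peak usage: $peak bytes, limit: "
    . ($limit < 0 ? 'unlimited' : "$limit bytes") . "\n";
```

Run it at the end of the script that is hitting the limit to see how close you actually get.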
Again, Good Luck.

Related

Best method for bulk downloading images from website

I will download a lot of images (20,000+) from a website to my server, and I'm trying to figure out the best way to do this since there are so many images to download.
Currently I have the code below which works in testing. But is there a better solution or should I use some software to do this?
foreach ($products as $product) {
$url = $product->img;
$imgName = $product->product_id;
$path = "images/";
$img = $path . $imgName . ".png";
file_put_contents($img, file_get_contents($url));
}
Also, is there a chance that I will break something or crash the website when I download that many images at once?
First off, I agree with @Rudy Palacois here: wget would probably be better. That said, if you want to do it in PHP, curl would be much faster than file_get_contents, for 2 reasons.
1: Unlike file_get_contents, curl can reuse the same connection to download multiple files, while file_get_contents will create and close a new connection for each download. That takes time, so curl will be faster (as long as you're not using CURLOPT_FORBID_REUSE / CURLOPT_FRESH_CONNECT, anyway).
2: curl stops the download when the number of bytes given in the Content-Length HTTP header has been downloaded, but file_get_contents completely ignores this header and keeps downloading everything it can until the connection is closed. This can again be much slower than curl's approach, because it's up to the web server when the connection closes; on some servers, that is A LOT slower than reading Content-Length bytes.
(Generally, curl is also faster than file_get_contents because curl supports compressed transfers, gzip and deflate, which file_get_contents does not. But that's generally not applicable to images, since most common image formats are already compressed; .bmp images are a notable exception.)
like this:
$ch = curl_init ();
curl_setopt ( $ch, CURLOPT_ENCODING, '' ); // if you're downloading files that benefit from compression (like .bmp images), this line enables compressed transfers.
foreach ( $products as $product ) {
$url = $product->img;
$imgName = $product->product_id;
$path = "images/";
$img = $path . $imgName . ".png";
$img=fopen($img,'wb');
curl_setopt_array ( $ch, array (
CURLOPT_URL => $url,
CURLOPT_FILE => $img
) );
curl_exec ( $ch );
fclose($img);
// file_put_contents ( $img, file_get_contents ( $url ) );
}
curl_close ( $ch );
edit: fixed a code-breaking typo, it's called CURLOPT_FILE, not CURLOPT_OUTFILE
edit 2: CURLOPT_FILE wants a file resource, not a filepath, fixed that too x.x
If you have shell access, you could use wget. The main problem with PHP, if you are executing this code from a browser, is the execution time: it will stop after a few minutes, or it can load forever and get stuck. But if you have complete URLs and a pattern, as I can see you do, you can create a file with the URLs, one URL per line (list.txt, for example), and then execute
wget -i list.txt
Check this answer too https://stackoverflow.com/a/14578517/5415074
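Generating list.txt from PHP could look roughly like this. It is a sketch that assumes each product object has an img property, as in the question; write_url_list() is a name I made up.

```php
<?php
// Sketch: write one image URL per line so `wget -i list.txt` can consume it.
// Assumes each product exposes an `img` property, as in the question's code.
function write_url_list(array $products, $list_file)
{
    $urls = array();
    foreach ($products as $product) {
        $urls[] = $product->img;
    }
    // Returns the number of bytes written, or false on failure.
    return file_put_contents($list_file, implode("\n", $urls) . "\n");
}
```

Call it once with your $products array, then run wget from the shell against the resulting file.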

XML fails to load, without any errors message

I have a XML structure in my PHP file.
For example:
$file = file_get_contents($myFile);
$response = '<?xml version="1.0"?>';
$response .= '<responses>';
$response .= '<file>';
$response .= '<name>';
$response .= '</name>';
$response .= '<data>';
$response .= base64_encode($file);
$response .= '</data>';
$response .= '</file>';
$response .= '</responses>';
echo $response;
If I create a .doc file (or a file with another extension) and put a little text in it, it works. But if the user loads a file with a complex structure (not only text), the XML just does not load, and I get an empty file without errors.
But the same files work on my other server.
I have tried using simplexml_load_string to output errors, but I get no errors.
The server with PHP 5.3.3 has the problem; the one with PHP 5.6 doesn't. It works if I try it with 5.3.3 on my local server.
Is the problem due to the PHP version? If so, how exactly?
There are basically three things that can be improved in your code:
Configure error reporting to actually see error messages.
Generate XML with a proper library, to ensure you cannot send malformed data.
Be conservative in memory usage (you're currently storing the complete file in RAM three times, two of them in a plain text representation that, depending on the file type, can be significantly larger).
Your overall code could look like this:
// Quick code, needs more error checking and refining
$fp = fopen($myFile, 'rb');
if ($fp) {
    $writer = new XMLWriter();
    $writer->openURI('php://output');
    $writer->startDocument('1.0');
    $writer->startElement('responses');
    $writer->startElement('file');
    $writer->startElement('name');
    $writer->endElement(); // </name>
    $writer->startElement('data');
    while (!feof($fp)) {
        // Base64 turns every 3 input bytes into 4 output characters, so the
        // chunk size must be a multiple of 3 (except for the last chunk);
        // otherwise padding characters end up in the middle of the data.
        $writer->text(base64_encode(fread($fp, 10239)));
    }
    $writer->endElement(); // </data>
    $writer->endElement(); // </file>
    $writer->endElement(); // </responses>
    $writer->endDocument();
    $writer->flush(); // send anything still buffered to php://output
    fclose($fp);
}
I've tried this code with a 316 MB file and used 256 KB on my PC.
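On the chunk size: Base64 encodes every 3 input bytes as 4 output characters, so encoding chunk by chunk only matches encoding the whole file when each chunk (except the last) is a multiple of 3 bytes. A small sketch to convince yourself; the helper name and the sample data are made up for the demonstration.

```php
<?php
// Sketch: encode data piece by piece and concatenate the results.
// base64_encode_in_chunks() is a hypothetical helper for this demonstration.
function base64_encode_in_chunks($data, $chunk_size)
{
    $encoded = '';
    foreach (str_split($data, $chunk_size) as $chunk) {
        $encoded .= base64_encode($chunk);
    }
    return $encoded;
}

$data = str_repeat("\x00\x01\x02\x03\x04", 1000); // 5000 bytes of sample data

// 3072 is a multiple of 3, so the chunked result equals the one-shot result.
var_dump(base64_encode_in_chunks($data, 3072) === base64_encode($data)); // bool(true)

// 1024 is not a multiple of 3, so "=" padding lands mid-stream and they differ.
var_dump(base64_encode_in_chunks($data, 1024) === base64_encode($data)); // bool(false)
```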
As a side note, inserting binary files inside XML is pretty troublesome when files are large. It makes extraction problematic because you can't use most of the usual tools due to extensive memory usage.

Using file_get_contents vs curl for file size

I have a file uploading script running on my server which also features remote uploads. Everything works fine, but I am wondering what the best way to upload via URL is. Right now I am using fopen to get the file from the remote URL pasted in the text box named "from". I have heard that fopen isn't the best way to do it. Why is that?
Also I am using file_get_contents to get the file size of the file from the URL. I have heard that curl is better on that part. Why is that and also how can I apply these changes to this script?
<?php
$from = htmlspecialchars(trim($_POST['from']));
if ($from != "") {
$file = file_get_contents($from);
$filesize = strlen($file);
while (!feof($file)) {
$move = "./uploads/" . $rand2;
move_upload($_FILES['from']['tmp_name'], $move);
$newfile = fopen("./uploads/" . $rand2, "wb");
file_put_contents($newfile, $file);
}
}
?>
You can use filesize to get the file size of a file on disk.
file_get_contents actually loads the file into memory, so $filesize = strlen(file_get_contents($from)); already downloads the file; you just don't do anything with it other than find its size. You can replace your fopen/write logic with a single file_put_contents call.
See: file_get_contents and file_put_contents .
curl is used when you need more access to the HTTP protocol. There are many questions and examples on StackOverflow using curl in PHP.
So we can first download the file (in this example I will use file_get_contents), get its size, then put the file in a directory on your local disk.
$tmpFile = file_get_contents($from);
$fileSize = strlen($tmpFile);
// you could do a check for file size here
$newFileName = "./uploads/$rand2";
file_put_contents($newFileName, $tmpFile);
In your code you have move_upload($_FILES['from']['tmp_name'], $move); but $_FILES is only applicable when you have an <input type="file"> element, which it doesn't seem you have.
P.S. You should probably whitelist the characters that you allow in a filename, for instance $goodFilename = preg_replace("/[^a-zA-Z0-9]+/", "-", $filename); This is often easier to read and safer.
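A slightly fuller sketch of the whitelist idea follows. Allowing dots, underscores and hyphens is my own assumption (so that extensions survive), and sanitize_filename() is a made-up name. Note that the pattern has no ^ and $ anchors; with anchors it would only match when the entire name consists of disallowed characters.

```php
<?php
// Sketch: keep only whitelisted characters in a user-supplied filename.
// The allowed set (letters, digits, dot, underscore, hyphen) is an assumption.
function sanitize_filename($filename)
{
    // Replace each run of disallowed characters with a single hyphen.
    $clean = preg_replace('/[^a-zA-Z0-9._-]+/', '-', $filename);
    // Strip leading/trailing dots and hyphens so "../" tricks and
    // hidden-file names do not survive.
    return trim($clean, '-.');
}

echo sanitize_filename('my file (1).png'), "\n";  // my-file-1-.png
echo sanitize_filename('../../etc/passwd'), "\n"; // etc-passwd
```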
Replace:
while (!feof($file)) {
$move = "./uploads/" . $rand2;
move_upload($_FILES['from']['tmp_name'], $move);
$newfile = fopen("./uploads/" . $rand2, "wb");
file_put_contents($newfile, $file);
}
With:
$newFile = "./uploads/" . $rand2;
file_put_contents($newFile, $file);
The whole file is read in by file_get_contents, and the whole file is written by file_put_contents.
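If the remote files can be large, you can avoid holding the whole thing in memory by copying stream to stream instead. A minimal sketch, assuming allow_url_fopen is enabled when $source is a URL; copy_stream_to_file() is a name I made up.

```php
<?php
// Sketch: copy from one stream to a local file without loading it all at once.
// $source may be a local path, or a URL if allow_url_fopen is enabled.
function copy_stream_to_file($source, $destination)
{
    $in = fopen($source, 'rb');
    if ($in === false) {
        return false;
    }
    $out = fopen($destination, 'wb');
    if ($out === false) {
        fclose($in);
        return false;
    }
    $bytes = stream_copy_to_stream($in, $out); // number of bytes copied
    fclose($in);
    fclose($out);
    return $bytes;
}
```

The return value doubles as the file size, so you get the size check for free after the copy.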
As far as I understand your question: you want to get the file size of a remote file given by a URL, and you're not sure which solution is best/fastest.
First, the biggest difference between cURL, file_get_contents() and fopen() in this context is that cURL (when it returns the response as a string) and file_get_contents() put the whole thing into memory, while fopen() gives you more control over which parts of the file you want to read. I think fopen() and file_get_contents() are nearly equivalent in your case, because you're dealing with small files and you actually want to get the whole file. So it doesn't make any difference in terms of memory usage.
cURL is just the big brother of file_get_contents(). It is a complete HTTP client rather than a wrapper for simple functions.
And talking about HTTP: don't forget there's more to HTTP than GET and POST. Why don't you just use the resource's metadata to check its size before you even get it? That's one thing the HTTP method HEAD is meant for. PHP even comes with a built-in function for getting the headers: get_headers(). It has some flaws though: by default it still sends a GET request, which probably makes it a little slower, and it follows redirects, which may cause security issues. But you can fix this pretty easily by adjusting the default context:
$opts = array(
'http' =>
array(
'method' => 'HEAD',
'max_redirects'=> 1,
'ignore_errors'=> true
)
);
stream_context_set_default($opts);
Done. Now you can simply get the headers:
$headers = get_headers('http://example.com/pic.png', 1);
//set the keys to lowercase so we don't have to deal with lower- and upper case
$lowerCaseHeaders = array_change_key_case($headers);
// 'content-length' is the header we're interested in:
$filesize = $lowerCaseHeaders['content-length'];
NOTE: filesize() will not work on a http / https stream wrapper, because stat() is not supported (http://php.net/manual/en/wrappers.http.php).
And that's pretty much it. Of course you can achieve the same with cURL just as easily if you like it better. The approach would be the same (reading the headers).
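A rough sketch of the header-only approach with cURL: CURLOPT_NOBODY makes it send a HEAD request, and the reported length is read back afterwards. remote_content_length() is a made-up name, and the result is only as reliable as the server's Content-Length header.

```php
<?php
// Sketch: ask for headers only and read the reported Content-Length.
// Returns -1 when the request fails or the server does not send a length.
function remote_content_length($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request, no body
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // do not follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if (curl_exec($ch) === false) {
        return -1;
    }
    // cURL itself reports -1 here when the length is unknown.
    return (int) curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
}
```

Usage would be something like $filesize = remote_content_length('http://example.com/pic.png');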
And here's how you get the file and its size (after downloading) with cURL:
// Create a CURL handle
$ch = curl_init();
// Set all the options on this handle
// find a full list on
// http://au2.php.net/manual/en/curl.constants.php
// http://us2.php.net/manual/en/function.curl-setopt.php (for actual usage)
curl_setopt($ch, CURLOPT_URL, 'http://example.com/pic.png');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Send the request and store what is returned to a variable
// This actually contains the raw image data now, you could
// pass it to e.g. file_put_contents();
$data = curl_exec($ch);
// get the required info about the request
// find a full list on
// http://us2.php.net/manual/en/function.curl-getinfo.php
$filesize = curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD);
// close the handle after you're done
curl_close($ch);
Pure PHP approach: http://codepad.viper-7.com/p8mlOt
Using CURL: http://codepad.viper-7.com/uWmsYB
For a nicely formatted and human readable output of the file size I've learned this amazing function from Laravel:
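function get_file_size($size)
{
    $units = array('Bytes', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB');
    if ($size <= 0) {
        return '0 Bytes'; // log(0) is undefined, so handle empty files up front
    }
    $i = (int) floor(log($size, 1024)); // index of the largest fitting unit
    return round($size / pow(1024, $i), 2) . ' ' . $units[$i];
}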
If you don't want to deal with all this, you should check out Guzzle. It's a very powerful and extremely easy-to-use library for any kind of HTTP work.

PHP filesize of dynamically chosen file

I have a php script that needs to determine the size of a file on the file system after being manipulated by a separate php script.
For example, there exists a zip file that has a fixed size but gets an additional file of unknown size inserted into it based on the user that tries to access it. So the page that's serving the file is something like getfile.php?userid=1234.
So far, I know this:
filesize('getfile.php'); //returns the actual file size of the php file, not the result of script execution
readfile('getfile.php'); //same as filesize()
filesize('getfile.php?userid=1234'); //returns false, as it can't find the file matching the name with GET vars attached
readfile('getfile.php?userid=1234'); //same as filesize()
Is there a way to read the result size of the php script instead of just the php file itself?
filesize
As of PHP 5.0.0, this function can also be used with some URL
wrappers.
something like
filesize('http://localhost/getfile.php?userid=1234');
should be enough
Someone had posted an option for using curl to do this but removed their answer after a downvote. Too bad, because it's the one way I've gotten this to work. So here's their answer that worked for me:
$ch = curl_init('http://localhost/getfile.php?userid=1234');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); //This was not part of the poster's answer, but I needed to add it to prevent the file being read from outputting with the requesting script
curl_exec($ch);
$size = 0;
if(!curl_errno($ch))
{
$info = curl_getinfo($ch);
$size = $info['size_download'];
}
curl_close($ch);
echo $size;
The only way to get the size of the output is to run the script and then look. The result may differ depending on the script, though, so for practical use the best thing to do is to estimate based on what you know: if you have a 5MB file and add another 5k of user-specific content, it's still about 5MB in the end.
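If both scripts live on the same server, "run it and then look" can be done with output buffering instead of an HTTP request. This is a sketch; output_size() is a name I made up, and it assumes the target script simply echoes its content.

```php
<?php
// Sketch: run some code, capture what it prints, and report the byte count.
function output_size(callable $producer)
{
    ob_start();
    try {
        $producer();
        $size = ob_get_length(); // bytes currently held in the buffer
    } finally {
        ob_end_clean();          // discard the captured output
    }
    return $size;
}

// Stand-in for the real script: prints 5120 bytes.
echo output_size(function () {
    echo str_repeat('x', 5120);
}), "\n"; // 5120
```

For the case in the question, the callback would set $_GET['userid'] and include 'getfile.php'; whether that works depends on what the script does (headers, exit calls and so on).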
To expand on Ivan's answer:
Your string is 'getfile.php', with or without GET parameters. This is treated as a local file, so you are retrieving the file size of the PHP file itself.
It is treated as a local file because it doesn't start with the http protocol. See http://us1.php.net/manual/en/wrappers.php for supported protocols.
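To make the distinction concrete, here is a small sketch that checks whether a string carries one of a few URL schemes. The function name and the list of schemes are my own choices for illustration.

```php
<?php
// Sketch: decide whether a path would go through a URL wrapper.
// The scheme whitelist here is an assumption; extend it as needed.
function has_url_scheme($path)
{
    $scheme = parse_url($path, PHP_URL_SCHEME); // null when there is no scheme
    return is_string($scheme)
        && in_array(strtolower($scheme), array('http', 'https', 'ftp'), true);
}

var_dump(has_url_scheme('getfile.php?userid=1234'));                  // bool(false)
var_dump(has_url_scheme('http://localhost/getfile.php?userid=1234')); // bool(true)
```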
When using filesize() I got a warning:
Warning: filesize() [function.filesize]: stat failed for ...link... in ..file... on line 233
Instead of filesize() I found two working options to replace it:
1)
$headers = get_headers($pdfURL, 1);
$fileSize = $headers['Content-Length'];
echo $fileSize;
2)
echo strlen(file_get_contents($pdfURL));
Now it's working fine.
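One caveat with option 1: when get_headers() is called with the second argument set to 1 and the request goes through a redirect, a header can come back as an array with one value per response, and the key's case depends on the server. A defensive sketch; content_length_from_headers() is a made-up helper name.

```php
<?php
// Sketch: pull Content-Length out of a get_headers($url, 1) style array.
// Returns null when the header is missing.
function content_length_from_headers(array $headers)
{
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (!isset($headers['content-length'])) {
        return null;
    }
    $value = $headers['content-length'];
    if (is_array($value)) {
        $value = end($value); // after redirects, the last response's value
    }
    return (int) $value;
}
```

You would call it as content_length_from_headers(get_headers($url, 1)), after checking that get_headers() did not return false.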

How to work around a site forbidding me to scrape their images with PHP

I'm scraping a site, searching for JPGs to download.
Scraping the site's HTML pages works fine.
But when I try getting the JPGs with CURL, copy(), fopen(), etc., I get a 403 Forbidden status.
I know that's because the site owners don't want their images scraped, so I understand a good answer would be just don't do it, because they don't want you to.
Ok, but let's say it's ok and I try to work around this, how could this be achieved?
If I get the same URL with a browser, I can open the image perfectly, it's not that my IP is banned or anything, and I'm testing the scraper one file at a time, so it's not blocking me because I make too many requests too often.
From my understanding, it could be that the site is checking for some cookies that confirm that I'm using a browser and browsing their site before I download a JPG.
Or maybe PHP is using some user agent for the requests that the server can detect and filter out.
Anyway, any ideas?
Actually it was quite simple.
As @Leigh suggested, it only took spoofing an HTTP referrer with the option CURLOPT_REFERER.
In fact, for every request I just provided the domain name as the referrer, and it worked.
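For reference, a minimal sketch of what that looks like. The function name and parameters are made up for illustration, and the optional user agent is an extra I added in case the server also filters on that.

```php
<?php
// Sketch: fetch a URL while sending a Referer (and optionally a User-Agent).
// Returns the response body, or false on a transfer error or non-200 status.
function fetch_with_referer($url, $referer, $user_agent = null)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    if ($user_agent !== null) {
        curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    }
    $data   = curl_exec($ch);                        // false on failure
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // e.g. 200 or 403
    return ($data !== false && $status === 200) ? $data : false;
}
```

Pass the site's own domain as $referer, as described above, and write the returned data to disk with file_put_contents.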
Are you able to view the page through a browser? Wouldn't a simple search of the page source find all images?
$findme = '.jpg';
$pos = strpos($html, $findme);
if ($pos === false) {
    echo "The string '$findme' was not found in the string '$html'";
} else {
    echo "Images found..";
    // grab image location code
}
Basic image retrieval:
Using the GD library, which is commonly installed by default on many web hosts. This is something of an ugly hack, but some may find it useful that it can be done this way.
$remote_img = 'http://www.somwhere.com/images/image.jpg';
$img = imagecreatefromjpeg($remote_img);
$path = 'images/image.jpg'; // imagejpeg() needs a file path, not a directory
imagejpeg($img, $path);
Classic cURL image grabbing function for when you have extracted the location of the image from the donor page's HTML.
function save_image($img,$fullpath){
$ch = curl_init ($img);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
$rawdata=curl_exec($ch);
curl_close ($ch);
if(file_exists($fullpath)){
unlink($fullpath);
}
$fp = fopen($fullpath,'x');
fwrite($fp, $rawdata);
fclose($fp);
}
If the basic cURL image grabbing function fails, then the donor site probably has some form of server-side defences in place to prevent retrieval, so you are probably breaching the terms of service by proceeding further. Though rare, some sites do create images 'on the fly' using the GD library, so what may look like a link to an image is actually a PHP script, and that script could be checking for things like a cookie, referer or session value being passed to it before the image is created and output.
