PHP: get remote file size with strlen? (html)

I was looking at the PHP docs for fsockopen and they say you can't use filesize() on a remote file without doing some crazy things with ftell or something (not sure what they said exactly). But I had a thought about how to do it:
$file = file_get_contents("http://www.google.com");
$filesize = mb_strlen($file) / 1000; //KBs, mb_* in case file contains unicode
Would this be a good method? It seemed so simple at the time; I just want thoughts on whether this could run into problems or fail to report the true file size.
By the way, I only wish to use this on text (websites), not binary files.
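(Note for reference: strlen() counts bytes while mb_strlen() counts characters, and a file's size is measured in bytes; a quick illustration with a hypothetical string:)
// strlen() counts bytes; mb_strlen() counts characters.
// For a file *size* in bytes, plain strlen() is the relevant one.
$s = "héllo";                 // 'é' is 2 bytes in UTF-8
echo strlen($s);              // 6 (bytes)
echo mb_strlen($s, 'UTF-8');  // 5 (characters)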

This answer requires PHP 5 and cURL. It first checks the headers. If Content-Length isn't specified, it uses cURL to download the file and check the size (the file is not saved anywhere, though; it only lives temporarily in memory).
<?php
echo get_remote_size("http://www.google.com/");

function get_remote_size($url) {
    $headers = get_headers($url, 1);
    if (isset($headers['Content-Length'])) return $headers['Content-Length'];
    if (isset($headers['Content-length'])) return $headers['Content-length'];

    $c = curl_init();
    curl_setopt_array($c, array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER => array('User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3'),
    ));
    curl_exec($c);
    return curl_getinfo($c, CURLINFO_SIZE_DOWNLOAD);
}
?>

You should look at the get_headers() function. It will return a hash of HTTP headers from an HTTP request. The Content-Length header may be a better judge of the size of the actual content, if it's present.
That being said, you really should use either cURL or streams to do a HEAD request instead of a GET. If Content-Length is present in the response, that saves you the whole transfer. It will be both faster and more accurate.
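For illustration, a minimal sketch of that HEAD-request approach with cURL (the URL is just an example; CURLINFO_CONTENT_LENGTH_DOWNLOAD returns -1 when the server sends no Content-Length):
$ch = curl_init('http://www.google.com/');
curl_setopt($ch, CURLOPT_NOBODY, true);         // send a HEAD request: headers only, no body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
$size = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD); // -1 if not reported
curl_close($ch);
echo $size;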

This will fetch the whole file and then calculate the file size (or rather the string length) from the retrieved data. A local filesize() call can usually read the size directly from the filesystem without reading the whole file first.
So this will be rather slow, and it will fetch the whole file every time before it can report the size (string length).
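For contrast, a quick sketch of the difference (paths and URLs are placeholders):
// Local file: filesize() just stats the filesystem, no read required.
echo filesize('/var/www/somefile.txt');

// Remote URL: the whole body has to be downloaded before you can measure it.
echo strlen(file_get_contents('http://www.example.com/somefile.txt'));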

Related

Running Out Of Memory With Fread

I'm using Backblaze B2 to store files and am using their documentation code to upload via their API. However, their code uses fread to read the file, which causes issues for files larger than 100MB because it tries to load the entire file into memory. Is there a better way to do this that doesn't try to load the entire file into RAM?
$file_name = "file.txt";
$my_file = "<path-to-file>" . $file_name;
$handle = fopen($my_file, 'r');
$read_file = fread($handle,filesize($my_file));
$upload_url = ""; // Provided by b2_get_upload_url
$upload_auth_token = ""; // Provided by b2_get_upload_url
$bucket_id = ""; // The ID of the bucket
$content_type = "text/plain";
$sha1_of_file_data = sha1_file($my_file);
$session = curl_init($upload_url);
// Add read file as post field
curl_setopt($session, CURLOPT_POSTFIELDS, $read_file);
// Add headers
$headers = array();
$headers[] = "Authorization: " . $upload_auth_token;
$headers[] = "X-Bz-File-Name: " . $file_name;
$headers[] = "Content-Type: " . $content_type;
$headers[] = "X-Bz-Content-Sha1: " . $sha1_of_file_data;
curl_setopt($session, CURLOPT_HTTPHEADER, $headers);
curl_setopt($session, CURLOPT_POST, true); // HTTP POST
curl_setopt($session, CURLOPT_RETURNTRANSFER, true); // Receive server response
$server_output = curl_exec($session); // Let's do this!
curl_close ($session); // Clean up
echo ($server_output); // Tell me about the rabbits, George!
I have tried using:
curl_setopt($session, CURLOPT_POSTFIELDS, array('file' => '#'.realpath('file.txt')));
However I get an error response: Error reading uploaded data: SocketTimeoutException(Read timed out)
Edit: Streaming the filename within the cURL call also doesn't seem to work.
The issue you are having is related to this line:
fread($handle, filesize($my_file));
With the filesize in there you might as well just use file_get_contents. It's much better memory-wise to read one line at a time with fgets:
$handle = fopen($my_file, 'r');
while (!feof($handle)) {
    $line = fgets($handle);
    // process $line here
}
fclose($handle);
This way you only read one line into memory at a time, but if you need the full file contents you will still hit a bottleneck.
The only real way is to stream the upload.
I did a quick search, and it seems the default for cURL is to stream the file if you give it the filename:
$post_data['file'] = 'myfile.csv';
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
You can see the previous answer for more details
Is it possible to use cURL to stream upload a file using POST?
So as long as you can get past the sha1_file() call, it looks like you can just stream the file, which should avoid the memory issues. There may be issues with the time limit, though. Also, I can't really think of a way around computing the hash if that fails.
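For what it's worth, here is an untested sketch of what streaming the raw POST body might look like with cURL's read callback. This is my assumption of how to adapt it to B2 (which, as I understand their docs, wants the raw file bytes as the body rather than a multipart form), so treat it as a starting point only:
// Untested sketch: feed the POST body from disk in chunks instead of fread()ing
// the whole file. Reuses $my_file, $headers and $session from the code above.
$fh = fopen($my_file, 'rb');
$headers[] = "Content-Length: " . filesize($my_file); // exact size avoids chunked encoding

curl_setopt($session, CURLOPT_POST, true);
curl_setopt($session, CURLOPT_HTTPHEADER, $headers);
curl_setopt($session, CURLOPT_INFILE, $fh);
curl_setopt($session, CURLOPT_READFUNCTION, function ($ch, $fd, $length) use ($fh) {
    // cURL calls this repeatedly; returning '' signals the end of the body.
    return (string) fread($fh, $length);
});
$server_output = curl_exec($session);
fclose($fh);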
Just FYI, personally I've never tried this; typically I just use sFTP for large file transfers. So I don't know if it has to be specifically $post_data['file'], I just copied that from the other answer.
Good luck...
UPDATE
Seeing as streaming seems to have failed (see comments).
You may want to test the streaming separately to make sure it works; I don't know what all that would involve, maybe stream a file to your own server? I'm not sure why it wouldn't work "as advertised", and you may have tested it already, but it never hurts to test something; never assume it works until you know for sure. It's very easy to try something new as a solution, only to miss a setting or put a path in wrong, and then fall back to thinking it's all caused by the original issue.
I've spent a lot of time tearing things apart only to realize I had a spelling error. I'm pretty adept at programming these days, so I typically overthink the errors too. My point is: be sure it's not a simple mistake before moving on.
Assuming everything is set up right, I would try file_get_contents. I don't know if it will be any better, but it's more meant for opening whole files. It also reads better in the code, because then it's clear that the whole file is needed. It just seems more semantically correct, if nothing else.
You can also increase the RAM PHP has access to by using
ini_set('memory_limit', '512M');
You can even go higher than that, depending on your server. The highest I've gone before is 3G, but the server I use has 54GB of RAM and that was a one-time thing (we migrated 130 million rows from MySQL to MongoDB; the InnoDB index was eating up 30+GB). Typically I run with 512M and have some scripts that routinely need 1G. But I wouldn't just raise the memory willy-nilly; that is usually a last resort for me after optimizing and testing. We do a lot of heavy processing, which is why we have such a big server; we also have two slave servers (among other things) that run with 16GB each.
As far as what size to use: typically I increase it by 128M until it works, then add an extra 128M just to be sure, though you might want to go in smaller steps. People typically use multiples of 8, but I don't know if that makes much difference these days.
Again, Good Luck.

Best method for bulk downloading images from website

I will download a lot of images (20,000+) from a website to my server, and I'm trying to figure out the best way to do this since there are so many images to download.
Currently I have the code below which works in testing. But is there a better solution or should I use some software to do this?
foreach ($products as $product) {
    $url = $product->img;
    $imgName = $product->product_id;
    $path = "images/";
    $img = $path . $imgName . ".png";
    file_put_contents($img, file_get_contents($url));
}
Also, is there a chance that I will break something or crash the website when I download that many images at once?
First off, I agree with @Rudy Palacois here; wget would probably be better. That said, if you want to do it in PHP, curl would be much faster than file_get_contents, for two reasons.
1: Unlike file_get_contents, curl can reuse the same connection to download multiple files, while file_get_contents creates and closes a new connection for each download. That takes time, so curl will be faster (as long as you're not using CURLOPT_FORBID_REUSE / CURLOPT_FRESH_CONNECT, anyway).
2: curl stops the download once Content-Length bytes have been downloaded, but file_get_contents ignores this header and keeps downloading everything it can until the connection is closed. This can again be much slower than curl's approach, because it's up to the web server when the connection closes; on some servers, that is A LOT slower than reading Content-Length bytes.
(And generally, curl is faster than file_get_contents because curl supports compressed transfers, gzip and deflate, which file_get_contents does not... but that's generally not applicable for images; most common image formats are already compressed. Notable exceptions include .bmp images.)
like this:
$ch = curl_init();
// If you're downloading files that benefit from compression (like .bmp images),
// this line enables compressed transfers.
curl_setopt($ch, CURLOPT_ENCODING, '');
foreach ($products as $product) {
    $url = $product->img;
    $imgName = $product->product_id;
    $path = "images/";
    $img = $path . $imgName . ".png";
    $img = fopen($img, 'wb');
    curl_setopt_array($ch, array(
        CURLOPT_URL  => $url,
        CURLOPT_FILE => $img
    ));
    curl_exec($ch);
    fclose($img);
    // file_put_contents($img, file_get_contents($url));
}
curl_close($ch);
Edit: fixed a code-breaking typo; it's called CURLOPT_FILE, not CURLOPT_OUTFILE.
Edit 2: CURLOPT_FILE wants a file resource, not a filepath; fixed that too x.x
If you have access to a shell, you could use wget. The main problem with PHP, if you are executing this code from a browser, is the execution time: it will stop after a few minutes, or it can load forever and get stuck. But if you have a complete URL and a pattern, as I can see you do, you can create a file with the URLs, one URL per line (list.txt, for example), and then execute
wget -i list.txt
Check this answer too https://stackoverflow.com/a/14578517/5415074

cURL won't get file size for a particular PDF

I've got a script that uses cURL to check the file size of a bunch of PDFs. It works great, except with one particular PDF, for which it tells me the file size is -1. The PDF is https://www.panerabread.com/content/dam/panerabread/documents/nutrition/Panera-Nutrition.pdf
Here is my code:
$ch = curl_init($pdf_url);
$fh = fopen('/dev/null', 'w');
curl_setopt($ch, CURLOPT_FILE, $fh);
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
// Execute
curl_exec($ch);
// Check if any error occurred
if (!curl_errno($ch)) {
    $info = curl_getinfo($ch);
    $size = $info['download_content_length'];
} else {
    echo 'error!';
}
// Close handle
curl_close($ch);
This has been asked before. I found this and a bunch of other answers by searching Google for "download_content_length is negative one":
curl_getinfo returning -1 as content-length
Apparently, if the server doesn't set the Content-Length header, you'll get -1. Have you tried looking at the response headers when you download the PDF?
EDIT: I checked and the Content-Length header isn't set in the response from the server.
You could always download the file to a temporary location and get the size using the filesize() function.
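That fallback might look something like this (a sketch; the temp path is whatever tempnam() picks):
// Download to a temp file, measure it on disk, then clean up.
$tmp = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents($tmp, file_get_contents($pdf_url));
$size = filesize($tmp);
unlink($tmp);
echo $size;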
The field will contain -1 if something went wrong while "calculating" it. The problem with this particular file is that the server sends the wrong Content-Length header: the header says 112189 bytes, where the real size of the downloaded file is 144418 bytes.
I suggest using
$info['size_download']
This is independent of what the remote server sends; it is the measured download size in bytes.
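In the asker's snippet above, that would just mean reading a different key from curl_getinfo():
$info = curl_getinfo($ch);
$size = $info['size_download']; // bytes actually received, as measured by cURL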

Using file_get_contents vs curl for file size

I have a file-uploading script running on my server which also features remote uploads. Everything works fine, but I am wondering what the best way is to upload via URL. Right now I am using fopen to get the file from the remote URL pasted into the text box named "from". I have heard that fopen isn't the best way to do it. Why is that?
Also, I am using file_get_contents to get the file size of the file from the URL. I have heard that curl is better for that part. Why is that, and how can I apply these changes to this script?
<?php
$from = htmlspecialchars(trim($_POST['from']));
if ($from != "") {
    $file = file_get_contents($from);
    $filesize = strlen($file);
    while (!feof($file)) {
        $move = "./uploads/" . $rand2;
        move_upload($_FILES['from']['tmp_name'], $move);
        $newfile = fopen("./uploads/" . $rand2, "wb");
        file_put_contents($newfile, $file);
    }
}
?>
You can use filesize to get the file size of a file on disk.
file_get_contents actually reads the file into memory, so $filesize = strlen(file_get_contents($from)); already fetches the file; you just don't do anything with it other than find its size. You can then write it out with file_put_contents.
See: file_get_contents and file_put_contents.
curl is used when you need more access to the HTTP protocol. There are many questions and examples on StackOverflow using curl in PHP.
So we can first download the file (in this example I will use file_get_contents), get its size, then put the file in a directory on your local disk.
$tmpFile = file_get_contents($from);
$fileSize = strlen($tmpFile);
// you could do a check for file size here
$newFileName = "./uploads/$rand2";
file_put_contents($newFileName, $tmpFile);
In your code you have move_upload($_FILES['from']['tmp_name'], $move); but $_FILES is only populated when you have an <input type="file"> element, which it doesn't seem you have.
P.S. You should probably white-list the characters that you allow in a filename, for instance $goodFilename = preg_replace("/[^a-zA-Z0-9]+/", "-", $filename); This is often easier to read and safer.
Replace:
while (!feof($file)) {
    $move = "./uploads/" . $rand2;
    move_upload($_FILES['from']['tmp_name'], $move);
    $newfile = fopen("./uploads/" . $rand2, "wb");
    file_put_contents($newfile, $file);
}
With:
$newFile = "./uploads/" . $rand2;
file_put_contents($newFile, $file);
The whole file is read by file_get_contents, and the whole file is written by file_put_contents.
As far as I understand your question: you want to get the filesize of a remote file given by a URL, and you're not sure which solution is best/fastest.
At first, the biggest difference between cURL, file_get_contents() and fread() in this context is that cURL and file_get_contents() put the whole thing into memory, while fopen() gives you more control over which parts of the file you want to read. I think fopen() and file_get_contents() are nearly equivalent in your case, because you're dealing with small files and you actually want the whole file, so it makes no difference in terms of memory usage.
cURL is just the big brother of file_get_contents(): it is actually a complete HTTP client rather than a wrapper around a few simple functions.
And talking about HTTP: don't forget there's more to HTTP than GET and POST. Why not use the resource's metadata to check its size before you even fetch it? That's one thing the HTTP HEAD method is meant for. PHP even comes with a built-in function for getting the headers: get_headers(). It has some flaws, though: it still sends a GET request by default, which makes it probably a little slower, and it follows redirects, which may cause security issues. But you can fix this pretty easily by adjusting the default context:
$opts = array(
    'http' => array(
        'method'        => 'HEAD',
        'max_redirects' => 1,
        'ignore_errors' => true
    )
);
stream_context_set_default($opts);
Done. Now you can simply get the headers:
$headers = get_headers('http://example.com/pic.png', 1);
//set the keys to lowercase so we don't have to deal with lower- and upper case
$lowerCaseHeaders = array_change_key_case($headers);
// 'content-length' is the header we're interested in:
$filesize = $lowerCaseHeaders['content-length'];
NOTE: filesize() will not work on an http/https stream wrapper, because stat() is not supported (http://php.net/manual/en/wrappers.http.php).
And that's pretty much it. Of course you can achieve the same with cURL just as easily if you like it better; the approach would be the same (reading the headers).
And here's how you get the file and its size (after downloading it) with cURL:
// Create a CURL handle
$ch = curl_init();
// Set all the options on this handle
// find a full list on
// http://au2.php.net/manual/en/curl.constants.php
// http://us2.php.net/manual/en/function.curl-setopt.php (for actual usage)
curl_setopt($ch, CURLOPT_URL, 'http://example.com/pic.png');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Send the request and store what is returned to a variable
// This actually contains the raw image data now, you could
// pass it to e.g. file_put_contents();
$data = curl_exec($ch);
// get the required info about the request
// find a full list on
// http://us2.php.net/manual/en/function.curl-getinfo.php
$filesize = curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD);
// close the handle after you're done
curl_close($ch);
Pure PHP approach: http://codepad.viper-7.com/p8mlOt
Using CURL: http://codepad.viper-7.com/uWmsYB
For nicely formatted, human-readable output of the file size, I've learned this amazing function from Laravel:
function get_file_size($size)
{
    $units = array('Bytes', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB');
    if ($size <= 0) {
        return '0 Bytes'; // guard: log(0) is undefined
    }
    $i = (int) floor(log($size, 1024));
    return round($size / pow(1024, $i), 2) . ' ' . $units[$i];
}
If you don't want to deal with all this, you should check out Guzzle. It's a very powerful and extremely easy-to-use library for any kind of HTTP work.
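For example, a quick sketch of the HEAD approach with Guzzle (assuming guzzlehttp/guzzle is installed via Composer; the URL is a placeholder):
require 'vendor/autoload.php';

$client = new GuzzleHttp\Client();
$response = $client->head('http://example.com/pic.png'); // HEAD request: headers only
$filesize = $response->getHeaderLine('Content-Length');  // '' if the header is absent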

Downloading files using GZIP

I have many XML files that I downloaded using file or file_get_contents, but the server administrator told me that downloading them with GZIP compression is more efficient. My question is how I can enable GZIP; I've never done this before, so this solution is really new to me.
You shouldn't need to do any decoding yourself if you use cURL. Just use the basic cURL example code, with the CURLOPT_ENCODING option set to "", and it will automatically request the file using gzip encoding, if the server supports it, and decode it.
Or, if you want the content in a string instead of in a file:
$ch = curl_init("http://www.example.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, ""); // accept any supported encoding
$content = curl_exec($ch);
curl_close($ch);
I've tested this, and it indeed downloads the content in gzipped format and decodes it automatically.
(Also, you should probably include some error handling.)
I don't understand your question.
You say that you downloaded these files; you can't unilaterally enable compression client-side.
OTOH you can control it server-side. Since you've flagged the question as PHP, and it doesn't make sense for your administrator to recommend compression where you don't have control over the server, I assume this is what you are talking about.
In which case you'd simply do something like:
<?php
ob_start("ob_gzhandler");
...your code for generating the XML goes here
...or maybe this has nothing to do with PHP and the XML files are static, in which case you'd need to configure your webserver to compress on the fly.
Unless you mean that compression is available on the server and you are fetching data over HTTP using PHP as the client - in which case the server will only compress the data if the client provides an "Accept-Encoding" request header including "gzip". In that case, instead of file_get_contents() you might use:
function gzip_get_contents($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}
cURL can probably fetch a gzipped file:
http://www.php.net/curl
Try using it instead of file_get_contents.
Edit: tested:
curl_setopt($c, CURLOPT_ENCODING, 'gzip');
Note that with CURLOPT_ENCODING set, cURL requests gzip and decodes the response for you automatically. If you instead request gzip via a raw Accept-Encoding header, decode the body yourself with:
gzdecode($responseContent);
Send an Accept-Encoding: gzip header in your HTTP request and then uncompress the result as shown here:
http://de2.php.net/gzdecode
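A minimal sketch of that manual approach (the URL is a placeholder; note the server may ignore the header and send plain data, in which case gzdecode() will fail):
// Ask for gzip explicitly, then decode the raw body ourselves.
$context = stream_context_create(array(
    'http' => array('header' => "Accept-Encoding: gzip\r\n"),
));
$raw = file_get_contents('http://www.example.com/feed.xml', false, $context);
$xml = gzdecode($raw); // check $http_response_header first to confirm the response is gzipped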
