Problems getting gzipped XML files with curl in PHP

I'm trying to grab data from an xml.gz file with curl. I'm able to download the file, but can't get usable XML out of it with any of my attempts. When I try to print the XML, I get a long string of garbled special characters such as:
‹ì½ûrâÈ–7ú?E~{Çž¨Ši°î—Ù5=ÁÍ6]`Ø€ë²ãDLÈ u
Is there a simple way to just uncompress and encode this xml? Possibly through SimpleXML? The files are large and do require authentication. Here's my current code:
$username='username';
$password='password';
$location='http://www.example.com/file.xml.gz';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $location);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERPWD, "$username:$password");
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_HEADER, 0);
$xmlcontent = curl_exec($ch);
curl_close($ch);
print_r($xmlcontent);
Thanks for your help!

You will need to pass the string through gzuncompress: http://www.php.net/manual/en/function.gzuncompress.php

You first need to save the file to disk. As it's gz-compressed, you need to uncompress it before you can access the (uncompressed) XML. This can be done with the zlib:// / bzip2:// / zip:// compression stream wrappers in PHP:
$file = 'compress.zlib://file.xml.gz';
$xml = simplexml_load_file($file);
To get this to work, you need to have the ZLib extension installed/configured.
Using the wrapper means that you're not creating an uncompressed variant of the file first (creating a second file can be a solution, too); instead, the wrapper uncompresses the file's data transparently on the fly so that the SimpleXML library can operate on the uncompressed XML, which is what that library needs.
See as well:
Sorting and grouping SimpleXML Data (example of using xml.gz file with SimpleXMLElement)
Parsing extremely large XML files in php (example of using xml.gz file with XMLReader)
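Since the file in the question needs HTTP authentication, one way to combine the two steps is to let curl save the .xml.gz to disk first and then point SimpleXML at it through the wrapper. A rough sketch (paths and file names are placeholders):
// Download the password-protected .xml.gz to a local file with curl
$fp = fopen('/tmp/file.xml.gz', 'wb');
$ch = curl_init('http://www.example.com/file.xml.gz');
curl_setopt($ch, CURLOPT_USERPWD, "$username:$password");
curl_setopt($ch, CURLOPT_FILE, $fp);      // write the response body straight to the file
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_exec($ch);
curl_close($ch);
fclose($fp);

// The zlib wrapper decompresses transparently while SimpleXML parses
$xml = simplexml_load_file('compress.zlib:///tmp/file.xml.gz');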

Not sure why, but none of the other answers worked for me in the end. zlib was installed on the server, but the gzdecode() function was not defined, gzuncompress() gave me errors, and so did compress.zlib://. They might work for you though, so give them a try as well.
If you need to check whether zlib is installed, this Stack Overflow answer or this answer can help. They provide this script:
<?php
echo phpversion() . ", ";
if (function_exists("gzdecode")) {
    echo "gzdecode OK, ";
} else {
    echo "gzdecode not OK, ";
}
if (extension_loaded('zlib')) {
    echo "zlib extension loaded";
} else {
    echo "zlib extension not loaded";
}
?>
This site gives another script that shows which zlib functions are installed:
var_dump(get_extension_funcs('zlib'));
SOLUTION! These two approaches did the trick for me. Just use curl or file_get_contents to grab the xml file, then use this script:
// strips the (minimal, 10-byte) gzip header and 8-byte trailer, then inflates the raw deflate data
$xmlcontent = gzinflate(substr($xmlcontent, 10, -8));
OR use this script to grab the xml file and get the contents (see more here):
$zd = gzopen($filename,"r");
$contents = gzread($zd,$fileSize);
gzclose($zd);
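For completeness, a small sketch of the gzopen() route that avoids needing to know the uncompressed size up front (the file name is a placeholder):
$zd = gzopen('file.xml.gz', 'r');
$contents = '';
while (!gzeof($zd)) {
    $contents .= gzread($zd, 8192);   // gzread() returns decompressed data
}
gzclose($zd);

$xml = simplexml_load_string($contents);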
Thanks to all who helped me get this answer. Hope this helps someone else!

I suggest you just decompress the result you fetch:
//[...]
$xmlcontent = gzdecode(curl_exec($ch));
curl_close($ch);
print_r($xmlcontent);
Obviously you should do some additional error checking, this is just the shortened general approach.
Note that there are two similar functions provided by PHP:
gzuncompress()
gzdecode()
Most likely you have to use the second one, if the file really is a gzip-compressed file delivered by an HTTP server.
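A slightly fuller sketch of the same approach with the basic error checks mentioned above filled in (variable names are illustrative):
$raw = curl_exec($ch);
if ($raw === false) {
    die('curl request failed: ' . curl_error($ch));
}
curl_close($ch);

$xmlcontent = gzdecode($raw);
if ($xmlcontent === false) {
    die('response was not valid gzip data');
}

$xml = simplexml_load_string($xmlcontent);
if ($xml === false) {
    die('response was not valid XML');
}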

Related

PHP: GET # of characters from a URL and then stop/exit?

For parsing large files on the internet, or just wanting to get the opengraph tags of a website, is there a way to GET a webpage's first 1000 characters and then to stop downloading anything else from the page?
When a file is several megabytes, it can take the server a while to parse it, especially when working with many such files. Even more troublesome than bandwidth are CPU/RAM constraints, as files that are too large are difficult to work with in PHP and the server can run out of memory.
Here are some PHP commands that can open a webpage:
fopen
file_get_contents
include
fread
url_get_contents
curl_init
curl_setopt
parse_url
Can any of these be set to download a specific number of characters and then exit?
Something like this?
<?php
if ($handle = fopen("http://www.example.com/", "rb")) {
    echo fread($handle, 8192);
    fclose($handle);
}
Taken from the examples in the official php.net function documentation...
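If you want the same early stop with curl, one option (a sketch, not from the original answer) is a write callback that aborts the transfer once enough bytes have arrived; returning fewer bytes than curl passed in makes the transfer stop:
$limit  = 1000;   // stop after roughly this many bytes
$buffer = '';

$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$buffer, $limit) {
    $buffer .= $chunk;
    if (strlen($buffer) >= $limit) {
        return 0;                 // returning less than strlen($chunk) aborts the transfer
    }
    return strlen($chunk);        // keep going
});
curl_exec($ch);                   // returns false because we abort on purpose
curl_close($ch);

echo substr($buffer, 0, $limit);
Some servers also honor a Range request (CURLOPT_RANGE, e.g. '0-999'), but support for that isn't guaranteed, so the callback approach is the safer bet.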

XML fails to load, without any error message

I have an XML structure in my PHP file.
For example:
$file = file_get_contents($myFile);
$response = '<?xml version="1.0"?>';
$response .= '<responses>';
$response .= '<file>';
$response .= '<name>';
$response .= '</name>';
$response .= '<data>';
$response .= base64_encode($file);
$response .= '</data>';
$response .= '</file>';
$response .= '</responses>';
echo $response;
If I create a .doc file (or a file with another extension) and put a little text in it, it works. But if the user uploads a file with a more complex structure (not just text), the XML simply doesn't load, and I get an empty file without errors.
But the same files work on my other server.
I have tried using simplexml_load_string to output errors, but I get no errors.
The server with PHP 5.3.3 has the problem; the one with PHP 5.6 doesn't. It works if I try it with 5.3.3 on my local server.
Is the problem due to the PHP version? If so, how exactly?
There are basically three things that can be improved in your code:
Configure error reporting to actually see error messages (a minimal sketch follows this list).
Generate the XML with a proper library, to ensure you cannot send malformed data.
Be conservative with memory usage (you're currently storing the complete file in RAM three times, two of them in a plain-text representation that, depending on the file type, can be significantly larger).
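For the first point, a minimal development-time setup could look like this (libxml_use_internal_errors() additionally lets you collect XML parse errors instead of having simplexml_load_string() fail silently):
// Development settings: surface every PHP error
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Collect libxml errors explicitly instead of getting a silent false
libxml_use_internal_errors(true);
$xml = simplexml_load_string($response);
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
}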
Your overall code could look like this:
// Quick code, needs more error checking and refining
$fp = fopen($myFile, 'rb');
if ($fp) {
    $writer = new XMLWriter();
    $writer->openURI('php://output');
    $writer->startDocument('1.0');
    $writer->startElement('responses');
    $writer->startElement('file');
    $writer->startElement('name');
    $writer->endElement();
    $writer->startElement('data');
    while (!feof($fp)) {
        // Read in multiples of 3 bytes so that base64_encode() never emits
        // padding in the middle of the stream (only the last chunk may be shorter)
        $writer->text(base64_encode(fread($fp, 3 * 4096)));
    }
    $writer->endElement();
    $writer->endElement();
    $writer->endElement();
    $writer->endDocument();
    fclose($fp);
}
I've tried this code with a 316 MB file and it used 256 KB of memory on my PC.
As a side note, inserting binary files inside XML is pretty troublesome when files are large. It makes extraction problematic because you can't use most of the usual tools due to extensive memory usage.
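If you ever need to pull the file back out of such a response, a rough sketch with XMLReader (element names taken from the answer above; note that readString() still loads the whole base64 payload of that one element into memory):
$reader = new XMLReader();
$reader->open('response.xml');

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'data') {
        // Decode the base64 payload of <data> and write the original file back to disk
        file_put_contents('extracted.bin', base64_decode($reader->readString()));
        break;
    }
}

$reader->close();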

Using file_get_contents vs curl for file size

I have a file-uploading script running on my server which also features remote uploads. Everything works fine, but I am wondering what the best way is to upload via URL. Right now I am using fopen to get the file from the remote URL pasted into the text box named "from". I have heard that fopen isn't the best way to do it. Why is that?
Also, I am using file_get_contents to get the file size of the file from the URL. I have heard that curl is better for that part. Why is that, and how can I apply these changes to this script?
<?php
$from = htmlspecialchars(trim($_POST['from']));
if ($from != "") {
$file = file_get_contents($from);
$filesize = strlen($file);
while (!feof($file)) {
$move = "./uploads/" . $rand2;
move_upload($_FILES['from']['tmp_name'], $move);
$newfile = fopen("./uploads/" . $rand2, "wb");
file_put_contents($newfile, $file);
}
}
?>
You can use filesize to get the file size of a file on disk.
file_get_contents actually reads the whole file into memory, so $filesize = strlen(file_get_contents($from)); already fetches the file; you just don't do anything with it other than find its size. You can substitute file_put_contents for your fwrite call.
See: file_get_contents and file_put_contents.
curl is used when you need more access to the HTTP protocol. There are many questions and examples on Stack Overflow using curl in PHP.
So we can first download the file (in this example I will use file_get_contents), get its size, then put the file in the uploads directory on your local disk.
$tmpFile = file_get_contents($from);
$fileSize = strlen($tmpFile);
// you could do a check for file size here
$newFileName = "./uploads/$rand2";
file_put_contents($newFileName, $tmpFile);
In your code you have move_upload($_FILES['from']['tmp_name'], $move); but $_FILES is only populated when you have an <input type="file"> element, which it doesn't seem you have.
P.S. You should probably white-list the characters that you allow in a filename, for instance $goodFilename = preg_replace("/[^a-zA-Z0-9]+/", "-", $filename); this is often easier to read and safer.
Replace:
while (!feof($file)) {
$move = "./uploads/" . $rand2;
move_upload($_FILES['from']['tmp_name'], $move);
$newfile = fopen("./uploads/" . $rand2, "wb");
file_put_contents($newfile, $file);
}
With:
$newFile = "./uploads/" . $rand2;
file_put_contents($newfile, $file);
The whole file is read by file_get_contents, and the whole file is written out by file_put_contents.
As far as I understand your question: you want to get the filesize of a remote file given by a URL, and you're not sure which solution is best/fastest.
First of all, the biggest difference between cURL, file_get_contents() and fopen()/fread() in this context is that cURL and file_get_contents() put the whole thing into memory, while fopen() with fread() gives you more control over which parts of the file you want to read. I think fopen() and file_get_contents() are nearly equivalent in your case, because you're dealing with small files and you actually want the whole file, so it doesn't make any difference in terms of memory usage.
cURL is just the big brother of file_get_contents(). It is actually a complete HTTP client rather than a wrapper around a few simple functions.
And talking about HTTP: don't forget there's more to HTTP than GET and POST. Why not use the resource's metadata to check its size before you even fetch it? That's one thing the HTTP method HEAD is meant for. PHP even comes with a built-in function for getting the headers: get_headers(). It has some flaws, though: by default it still sends a GET request, which makes it a little slower, and it follows redirects, which may cause security issues. But you can fix this pretty easily by adjusting the default context:
$opts = array(
    'http' => array(
        'method'        => 'HEAD',
        'max_redirects' => 1,
        'ignore_errors' => true
    )
);
stream_context_set_default($opts);
Done. Now you can simply get the headers:
$headers = get_headers('http://example.com/pic.png', 1);
//set the keys to lowercase so we don't have to deal with lower- and upper case
$lowerCaseHeaders = array_change_key_case($headers);
// 'content-length' is the header we're interested in:
$filesize = $lowerCaseHeaders['content-length'];
NOTE: filesize() will not work on a http / https stream wrapper, because stat() is not supported (http://php.net/manual/en/wrappers.http.php).
And that's pretty much it. Of course you can achieve the same with cURL just as easily if you like it better; the approach would be the same (reading the headers).
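For reference, a cURL version of that header-only check might look roughly like this (CURLOPT_NOBODY turns the request into a HEAD request, so no body is downloaded):
$ch = curl_init('http://example.com/pic.png');
curl_setopt($ch, CURLOPT_NOBODY, true);           // HEAD request, skip the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_exec($ch);

// -1 if the server didn't report a Content-Length
$filesize = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
curl_close($ch);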
And here's how you get the file and its size (after downloading) with cURL:
// Create a CURL handle
$ch = curl_init();
// Set all the options on this handle
// find a full list on
// http://au2.php.net/manual/en/curl.constants.php
// http://us2.php.net/manual/en/function.curl-setopt.php (for actual usage)
curl_setopt($ch, CURLOPT_URL, 'http://example.com/pic.png');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Send the request and store what is returned to a variable
// This actually contains the raw image data now, you could
// pass it to e.g. file_put_contents();
$data = curl_exec($ch);
// get the required info about the request
// find a full list on
// http://us2.php.net/manual/en/function.curl-getinfo.php
$filesize = curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD);
// close the handle after you're done
curl_close($ch);
Pure PHP approach: http://codepad.viper-7.com/p8mlOt
Using CURL: http://codepad.viper-7.com/uWmsYB
For a nicely formatted and human readable output of the file size I've learned this amazing function from Laravel:
function get_file_size($size)
{
    $units = array('Bytes', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB');
    return @round($size / pow(1024, ($i = floor(log($size, 1024)))), 2) . ' ' . $units[$i];
}
If you don't want to deal with all this, you should check out Guzzle. It's a very powerful and extremely easy-to-use library for any kind of HTTP work.
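As a taste, here is a sketch of the same HEAD idea with Guzzle (assuming Guzzle 6 or 7 installed via Composer; not part of the original answer):
require 'vendor/autoload.php';

$client   = new GuzzleHttp\Client();
$response = $client->head('http://example.com/pic.png');

// Content-Length as reported by the server (may be an empty string if absent)
$filesize = $response->getHeaderLine('Content-Length');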

PHP filesize of dynamically chosen file

I have a php script that needs to determine the size of a file on the file system after being manipulated by a separate php script.
For example, there exists a zip file that has a fixed size but gets an additional file of unknown size inserted into it based on the user that tries to access it. So the page that's serving the file is something like getfile.php?userid=1234.
So far, I know this:
filesize('getfile.php'); //returns the actual file size of the php file, not the result of script execution
readfile('getfile.php'); //same as filesize()
filesize('getfile.php?userid=1234'); //returns false, as it can't find the file matching the name with GET vars attached
readfile('getfile.php?userid=1234'); //same as filesize()
Is there a way to read the result size of the php script instead of just the php file itself?
filesize
As of PHP 5.0.0, this function can also be used with some URL
wrappers.
something like
filesize('http://localhost/getfile.php?userid=1234');
should be enough
Someone had posted an option for using curl to do this but removed their answer after a downvote. Too bad, because it's the one way I've gotten this to work. So here's their answer that worked for me:
$ch = curl_init('http://localhost/getfile.php?userid=1234');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); //This was not part of the poster's answer, but I needed to add it to prevent the file being read from outputting with the requesting script
curl_exec($ch);
$size = 0;
if (!curl_errno($ch)) {
    $info = curl_getinfo($ch);
    $size = $info['size_download'];
}
curl_close($ch);
echo $size;
The only way to get the size of the output is to run it and then look. Depending on the script, the result might differ, though for practical use the best thing to do is to estimate based on your knowledge: e.g. if you have a 5 MB file and add another 5 KB of user-specific content, it's still about 5 MB in the end.
To expand on Ivan's answer:
Your string is 'getfile.php', with or without GET parameters; this is being treated as a local file, and therefore you're retrieving the filesize of the PHP file itself.
It is being treated as a local file because it doesn't start with the http:// protocol. See http://us1.php.net/manual/en/wrappers.php for supported protocols.
When using filesize() I got a warning:
Warning: filesize() [function.filesize]: stat failed for ...link... in ..file... on line 233
Instead of filesize() I found two working options to replace it:
1)
$headers = get_headers($pdfULR, 1);
$fileSize = $headers['Content-Length'];
echo $fileSize;
2)
echo strlen(file_get_contents($pdfULR));
Now it's working fine.

Downloading files using GZIP

I have many XML files that I downloaded using file() or file_get_contents(), but the server administrator told me that downloading through GZIP is more efficient. My question is: how can I use GZIP? I have never done this before, so this solution is really new to me.
You shouldn't need to do any decoding yourself if you use cURL. Just use the basic cURL example code, with the CURLOPT_ENCODING option set to "", and it will automatically request the file using gzip encoding, if the server supports it, and decode it.
Or, if you want the content in a string instead of in a file:
$ch = curl_init("http://www.example.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, ""); // accept any supported encoding
$content = curl_exec($ch);
curl_close($ch);
I've tested this, and it indeed downloads the content in gzipped format and decodes it automatically.
(Also, you should probably include some error handling.)
I don't understand your question.
You say that you downloaded these files; you can't unilaterally enable compression client-side.
On the other hand, you can control it server-side, and since you've tagged the question as PHP, and it wouldn't make sense for your administrator to recommend compression on a server you don't control, I assume this is what you are talking about.
In which case you'd simply do something like:
<?php
ob_start("ob_gzhandler");
// ...your code for generating the XML goes here
Or maybe this has nothing to do with PHP and the XML files are static, in which case you'd need to configure your webserver to compress on the fly.
Unless you mean that compression is available on the server and you are fetching the data over HTTP with PHP as the client. In that case the server will only compress the data if the client sends an "Accept-Encoding" request header that includes "gzip". Then, instead of file_get_contents(), you might use:
function gzip_get_contents($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}
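A possible usage, tying this back to the XML downloads in the question (the URL is a placeholder):
// cURL has already decompressed the response, so this is plain XML
$xml = simplexml_load_string(gzip_get_contents('http://www.example.com/feed.xml'));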
Probably curl can get a gzipped file:
http://www.php.net/curl
Try to use this instead of file_get_contents.
Edit: tested
curl_setopt($c, CURLOPT_ENCODING, 'gzip');
then:
gzdecode($responseContent);
Send an Accept-Encoding: gzip header in your HTTP request and then uncompress the result with gzdecode(), as shown here:
http://de2.php.net/gzdecode
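If you'd rather stay with file_get_contents(), a rough sketch of that idea using a stream context (the URL is a placeholder):
// Explicitly ask the server for a gzip-compressed response
$context = stream_context_create(array(
    'http' => array(
        'header' => "Accept-Encoding: gzip\r\n",
    ),
));

$raw = file_get_contents('http://www.example.com/file.xml', false, $context);

// If the server honored the request, the body is gzip-encoded; decode it.
// (Check the Content-Encoding entry in $http_response_header to be sure.)
$xml = gzdecode($raw);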
