XML fails to load without any error message - PHP

I have an XML structure in my PHP file.
For example:
$file = file_get_contents($myFile);
$response = '<?xml version="1.0"?>';
$response .= '<responses>';
$response .= '<file>';
$response .= '<name>';
$response .= '</name>';
$response .= '<data>';
$response .= base64_encode($file);
$response .= '</data>';
$response .= '</file>';
$response .= '</responses>';
echo $response;
If I create a .doc file (or one with another extension) and put a little text in it, it works. But if the user uploads a file with a more complex structure (not only text), the XML just doesn't load, and I get an empty result without any errors.
The same files work on my other server.
I have tried using simplexml_load_string to output errors, but I get no errors.
The server with PHP 5.3.3 has the problem; the one with PHP 5.6 doesn't. It also works if I try it with 5.3.3 on my local server.
Is the problem due to the PHP version? If so, how exactly?

There are basically three things that can be improved in your code:
Configure error reporting so that you actually see error messages (a minimal snippet follows this list).
Generate XML with a proper library, to ensure you cannot send malformed data.
Be conservative with memory usage (you're currently storing the complete file in RAM three times, two of them in a plain-text representation that, depending on the file type, can be significantly larger).
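For the first point, a minimal sketch for a development environment (these are assumed settings; don't leave display_errors enabled in production):
// Development only: surface warnings and notices instead of failing silently
error_reporting(E_ALL);
ini_set('display_errors', '1');
// In production, check the web server's PHP error log instead.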
Your overall code could look like this:
// Quick code, needs more error checking and refining
$fp = fopen($myFile, 'rb');
if ($fp) {
    $writer = new XMLWriter();
    $writer->openURI('php://output');
    $writer->startDocument('1.0');
    $writer->startElement('responses');
    $writer->startElement('file');
    $writer->startElement('name');
    $writer->endElement();
    $writer->startElement('data');
    while (!feof($fp)) {
        // Read chunks whose size is a multiple of 3 so that each chunk
        // base64-encodes without padding (only the last chunk may be padded)
        $writer->text(base64_encode(fread($fp, 12288)));
    }
    $writer->endElement(); // data
    $writer->endElement(); // file
    $writer->endElement(); // responses
    $writer->endDocument();
    fclose($fp);
}
I've tried this code with a 316 MB file and it used 256 KB of memory on my PC.
As a side note, inserting binary files inside XML is pretty troublesome when files are large. It makes extraction problematic because you can't use most of the usual tools due to extensive memory usage.

Related

Running Out Of Memory With Fread

I'm using Backblaze B2 to store files and am using their documentation code to upload via their API. However, their code uses fread to read the file, which causes issues for files larger than 100MB as it tries to load the entire file into memory. Is there a better way to do this that doesn't try to load the entire file into RAM?
$file_name = "file.txt";
$my_file = "<path-to-file>" . $file_name;
$handle = fopen($my_file, 'r');
$read_file = fread($handle,filesize($my_file));
$upload_url = ""; // Provided by b2_get_upload_url
$upload_auth_token = ""; // Provided by b2_get_upload_url
$bucket_id = ""; // The ID of the bucket
$content_type = "text/plain";
$sha1_of_file_data = sha1_file($my_file);
$session = curl_init($upload_url);
// Add read file as post field
curl_setopt($session, CURLOPT_POSTFIELDS, $read_file);
// Add headers
$headers = array();
$headers[] = "Authorization: " . $upload_auth_token;
$headers[] = "X-Bz-File-Name: " . $file_name;
$headers[] = "Content-Type: " . $content_type;
$headers[] = "X-Bz-Content-Sha1: " . $sha1_of_file_data;
curl_setopt($session, CURLOPT_HTTPHEADER, $headers);
curl_setopt($session, CURLOPT_POST, true); // HTTP POST
curl_setopt($session, CURLOPT_RETURNTRANSFER, true); // Receive server response
$server_output = curl_exec($session); // Let's do this!
curl_close ($session); // Clean up
echo ($server_output); // Tell me about the rabbits, George!
I have tried using:
curl_setopt($session, CURLOPT_POSTFIELDS, array('file' => '#'.realpath('file.txt')));
However I get an error response: Error reading uploaded data: SocketTimeoutException(Read timed out)
Edit: Streaming the file by passing its filename to cURL also doesn't seem to work.
The issue you are having is related to this line:
fread($handle,filesize($my_file));
With the full file size in there, you might as well just use file_get_contents. It's much better memory-wise to read one line at a time with fgets:
$handle = fopen($myfile, 'r');
while (!feof($handle)) {
    $line = fgets($handle);
}
This way you only read one line into memory, but if you need the full file contents you will still hit a bottleneck.
The only real way is to stream the upload.
I did a quick search, and it seems cURL will stream the file by default if you give it the filename:
$post_data['file'] = 'myfile.csv';
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
You can see this previous answer for more details:
Is it possible to use cURL to stream upload a file using POST?
So as long as you can get past the sha1_file call, it looks like you can just stream the file, which should avoid the memory issues. There may be issues with the time limit, though. Also, I can't really think of a way around computing the hash if that fails.
Just FYI, I've personally never tried this; typically I just use SFTP for large file transfers. So I don't know if the key has to be specifically $post_data['file']; I just copied that from the other answer.
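Alternatively, here is a rough, untested sketch that skips CURLOPT_POSTFIELDS entirely and streams the request body with CURLOPT_INFILE. It assumes the B2 upload endpoint accepts the raw file bytes as the POST body (which is what your original code sends), and it reuses the variable names from the question:
// Sketch only: stream the request body from disk instead of fread()-ing it all
$handle = fopen($my_file, 'rb');
$session = curl_init($upload_url);
curl_setopt($session, CURLOPT_HTTPHEADER, $headers);          // same headers as before
curl_setopt($session, CURLOPT_UPLOAD, true);                  // read the body from CURLOPT_INFILE
curl_setopt($session, CURLOPT_INFILE, $handle);
curl_setopt($session, CURLOPT_INFILESIZE, filesize($my_file));
curl_setopt($session, CURLOPT_CUSTOMREQUEST, 'POST');         // CURLOPT_UPLOAD defaults to PUT
curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec($session);
curl_close($session);
fclose($handle);
Note that sha1_file() reads the file from disk incrementally rather than loading it into PHP memory, so it shouldn't be part of the RAM problem.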
Good luck...
UPDATE
Seeing as streaming seems to have failed (see comments).
You may want to test the streaming on its own to make sure it works, for example by streaming a file to your own server. I'm not sure why it wouldn't work "as advertised", and you may have tested it already, but it never hurts to verify: it's easy to try something new as a solution, miss a setting or mistype a path, and then fall back to thinking it's all caused by the original issue.
I've spent a lot of time tearing things apart only to realize I had a spelling error. I'm pretty adept at programming these days, so I typically overthink errors too. My point is: be sure it's not a simple mistake before moving on.
Assuming everything is set up right, I would try file_get_contents. I don't know if it will be any better memory-wise, but it's meant for reading whole files, and it also reads better in the code because it makes clear that the whole file is needed. It just seems more semantically correct, if nothing else.
You can also increase the amount of RAM PHP has access to by using:
ini_set('memory_limit', '512M');
You can even go higher than that, depending on your server. The highest I've gone before was 3G, but the server I use has 54GB of RAM and that was a one-time thing (we migrated 130 million rows from MySQL to MongoDB; the InnoDB index was eating up 30+ GB). Typically I run with 512M and have some scripts that routinely need 1G. But I wouldn't just raise the memory limit willy-nilly; that is usually a last resort for me after optimizing and testing. We do a lot of heavy processing, which is why we have such a big server; we also have two slave servers (among other things) that run with 16GB each.
As far as what size to use, I typically increase it by 128M until it works, then add an extra 128M just to be sure, but you might want to go in smaller steps. People usually use multiples of 8, but I don't know if that makes much difference these days.
Again, Good Luck.

MongoDB GridFS PHP append text

I'm trying to build a logging system using MongoDB GridFS in PHP. I can initially write data to my file, but once I close the stream I don't know how to append data to it later. This is how I write to it:
$bucket= DB::connection('mongodb')->getMongoDB()->selectGridFSBucket();
$stream = $bucket->openUploadStream('my-file-name.txt');
$contents = 'whatever text here \n';
fwrite($stream, $contents);
fclose($stream);
I tried retrieving it and appending data to the stream, but it doesn't work. This is my attempt:
$bucket= DB::connection('mongodb')->getMongoDB()->selectGridFSBucket();
$stream = $bucket->openDownloadStreamByName('my-file-name.txt');
fwrite($stream, $contents);
I also tried fopen on the stream, but no luck. I also don't know how to retrieve the file's _id after writing data to it.
I couldn't find a simple way to append, so I ended up replacing the previous file with a new one and deleting the old one:
// Read the old file's contents
$stream = $bucket->openDownloadStream($this->file_id);
$prev_content = stream_get_contents($stream);
// Delete the old data chunks
DB::connection('mongodb')->collection('fs.chunks')->where('files_id', $this->file_id)->delete();
// Delete the old file document
DB::connection('mongodb')->collection('fs.files')->where('_id', $this->file_id)->delete();
$new_content = $prev_content.$dt.$msg."\n"; // append to the log body
$new_filename = $filename_prefix;
// Write a new file containing the old content plus the appended line
$stream = $bucket->openUploadStream($new_filename);
fwrite($stream, $new_content);
fclose($stream);
// Look up the new file's document to get its _id
$new_file = DB::connection('mongodb')->collection('fs.files')->where('filename', $new_filename)->first();
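For what it's worth, here is a hedged variant of the same replace-and-delete idea that avoids pulling the whole log into a PHP string. It is only a sketch and assumes the mongodb/mongodb library's Bucket::downloadToStream(), uploadFromStream() and delete() methods:
// Sketch: append via a temporary stream instead of string concatenation
$tmp = fopen('php://temp', 'r+b');
$bucket->downloadToStream($this->file_id, $tmp);  // copy the old contents
fwrite($tmp, $dt.$msg."\n");                      // append the new log line
rewind($tmp);
$new_id = $bucket->uploadFromStream($filename_prefix, $tmp); // returns the new file's _id
$bucket->delete($this->file_id);                  // removes the old fs.files and fs.chunks entries
fclose($tmp);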
UPDATE:
For future readers: after using this system for a few months, I realized it is not a good solution. Files and chunks can get corrupted, and the process is super slow. I simply switched to logging into a plain text file and just storing a reference to that file in MongoDB, which works much better :)
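For readers going the same route, a minimal sketch of that setup (the path and collection name are just placeholders, and it assumes the same Laravel MongoDB query builder used above):
// Append to a plain text log; FILE_APPEND avoids any read-modify-write cycle
file_put_contents('/var/log/myapp/'.$log_name.'.log', $dt.$msg."\n", FILE_APPEND | LOCK_EX);
// Store only a reference to the log file in MongoDB
DB::connection('mongodb')->collection('logs')->insert([
    'name' => $log_name,
    'path' => '/var/log/myapp/'.$log_name.'.log',
]);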

Using file_get_contents vs curl for file size

I have a file-uploading script running on my server which also supports remote uploads. Everything works fine, but I am wondering what the best way to upload via URL is. Right now I am using fopen to get the file from the remote URL pasted into the text box named "from". I have heard that fopen isn't the best way to do it. Why is that?
Also, I am using file_get_contents to get the size of the file from the URL. I have heard that curl is better for that part. Why is that, and how can I apply these changes to this script?
<?php
$from = htmlspecialchars(trim($_POST['from']));
if ($from != "") {
$file = file_get_contents($from);
$filesize = strlen($file);
while (!feof($file)) {
$move = "./uploads/" . $rand2;
move_upload($_FILES['from']['tmp_name'], $move);
$newfile = fopen("./uploads/" . $rand2, "wb");
file_put_contents($newfile, $file);
}
}
?>
You can use filesize to get the file size of a file on disk.
file_get_contents actually reads the file into memory, so $filesize = strlen(file_get_contents($from)); already fetches the file; you just don't do anything with it other than find its size. You can substitute file_put_contents for your fwrite call.
See: file_get_contents and file_put_contents .
curl is used when you need more access to the HTTP protocol. There are many questions and examples on StackOverflow using curl in PHP.
So we can first download the file (in this example I will use file_get_contents), get its size, and then put the file in a directory on your local disk:
$tmpFile = file_get_contents($from);
$fileSize = strlen($tmpFile);
// you could do a check for file size here
$newFileName = "./uploads/$rand2";
file_put_contents($newFileName, $tmpFile);
In your code you have move_upload($_FILES['from']['tmp_name'], $move);, but $_FILES is only populated when you have an <input type="file"> element, which you don't seem to have.
P.S. You should probably whitelist the characters you allow in a filename, for instance $goodFilename = preg_replace("/[^a-zA-Z0-9]+/", "-", $filename);. This is often easier to read and safer.
Replace:
while (!feof($file)) {
$move = "./uploads/" . $rand2;
move_upload($_FILES['from']['tmp_name'], $move);
$newfile = fopen("./uploads/" . $rand2, "wb");
file_put_contents($newfile, $file);
}
With:
$newFile = "./uploads/" . $rand2;
file_put_contents($newFile, $file);
The whole file is read in by file_get_contents, and the whole file is written out by file_put_contents.
As far as I understand your question: you want to get the file size of a remote file given by a URL, and you're not sure which solution is best/fastest.
First of all, the biggest difference between cURL, file_get_contents() and fread() in this context is that cURL and file_get_contents() put the whole thing into memory, while fopen() gives you more control over which parts of the file you want to read. I think fopen() and file_get_contents() are nearly equivalent in your case, because you're dealing with small files and you actually want the whole file, so it doesn't make any difference in terms of memory usage.
cURL is just the big brother of file_get_contents(); it is actually a complete HTTP client rather than a wrapper around a few simple functions.
And talking about HTTP: don't forget there's more to HTTP than GET and POST. Why don't you just use the resource's metadata to check its size before you even fetch it? That's exactly what the HTTP method HEAD is meant for. PHP even comes with a built-in function for getting the headers: get_headers(). It has some flaws, though: it sends a GET request by default, which makes it probably a little slower, and it follows redirects, which may cause security issues. But you can fix this pretty easily by adjusting the default context:
$opts = array(
    'http' => array(
        'method'        => 'HEAD',
        'max_redirects' => 1,
        'ignore_errors' => true
    )
);
stream_context_set_default($opts);
Done. Now you can simply get the headers:
$headers = get_headers('http://example.com/pic.png', 1);
//set the keys to lowercase so we don't have to deal with lower- and upper case
$lowerCaseHeaders = array_change_key_case($headers);
// 'content-length' is the header we're interested in:
$filesize = $lowerCaseHeaders['content-length'];
NOTE: filesize() will not work on an http:// or https:// stream wrapper, because stat() is not supported (http://php.net/manual/en/wrappers.http.php).
And that's pretty much it. Of course you can achieve the same with cURL just as easily if you like it better; the approach would be the same (reading the headers).
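For example, a rough cURL equivalent of the HEAD request (CURLOPT_NOBODY turns the request into a HEAD, so no body is transferred):
$ch = curl_init('http://example.com/pic.png');
curl_setopt($ch, CURLOPT_NOBODY, true);            // send a HEAD request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_exec($ch);
// Content-Length as reported by the server; -1 if the header is missing
$filesize = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
curl_close($ch);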
And here's how you get the file and its size (after downloading it) with cURL:
// Create a CURL handle
$ch = curl_init();
// Set all the options on this handle
// find a full list on
// http://au2.php.net/manual/en/curl.constants.php
// http://us2.php.net/manual/en/function.curl-setopt.php (for actual usage)
curl_setopt($ch, CURLOPT_URL, 'http://example.com/pic.png');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Send the request and store what is returned to a variable
// This actually contains the raw image data now, you could
// pass it to e.g. file_put_contents();
$data = curl_exec($ch);
// get the required info about the request
// find a full list on
// http://us2.php.net/manual/en/function.curl-getinfo.php
$filesize = curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD);
// close the handle after you're done
curl_close($ch);
Pure PHP approach: http://codepad.viper-7.com/p8mlOt
Using CURL: http://codepad.viper-7.com/uWmsYB
For a nicely formatted and human-readable output of the file size, here's an amazing function I learned from Laravel:
function get_file_size($size)
{
    $units = array('Bytes', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB');
    return @round($size / pow(1024, ($i = floor(log($size, 1024)))), 2).' '.$units[$i];
}
If you don't want to deal with all this, you should check out Guzzle. It's a very powerful and extremely easy-to-use library for any kind of HTTP work.
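As a hedged sketch of the same HEAD-request idea with Guzzle (assuming Guzzle is installed via Composer):
require 'vendor/autoload.php';

$client = new GuzzleHttp\Client();
$response = $client->head('http://example.com/pic.png');
// Content-Length header of the response (empty string if the server omits it)
$filesize = $response->getHeaderLine('Content-Length');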

Problems getting gzipped XML files with curl PHP

I'm trying to grab data from an xml.gz file with curl. I'm able to download the file, but I can't get usable XML out of it with any of my attempts. When I try to print the XML, I get a long list of garbled special characters such as:
‹ì½ûrâÈ–7ú?E~{Çž¨Ši°î—Ù5=ÁÍ6]`Ø€ë²ãDLÈ u
Is there a simple way to just decompress and parse this XML, possibly through SimpleXML? The files are large and require authentication. Here's my current code:
$username='username';
$password='password';
$location='http://www.example.com/file.xml.gz';
$ch = curl_init ();
curl_setopt($ch,CURLOPT_URL,$location);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_USERPWD,"$username:$password");
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_HEADER, 0);
$xmlcontent = curl_exec ($ch);
curl_close($ch);
print_r($xmlcontent);
Thanks for your help!
You will need to pass the string through gzuncompress: http://www.php.net/manual/en/function.gzuncompress.php
You first need to save the file to disk. As it's gzip-compressed, you need to decompress it before you can access the (uncompressed) XML. This can be done with PHP's compression stream wrappers (zlib://, bzip2://, zip://):
$file = 'compress.zlib://file.xml.gz';
################
$xml = simplexml_load_file($file);
To get this to work, you need to have the ZLib extension installed/configured.
Using the wrapper means you're not creating an uncompressed variant of the file first (creating a second file can be a solution, too); instead, the wrapper decompresses the file's data transparently on the fly, so that the SimpleXML library can operate on the uncompressed XML (which is exactly what that library needs: uncompressed XML).
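Putting the two steps together, a minimal sketch (download with cURL straight to a temporary file, then let the wrapper decompress on the fly; the paths are just placeholders):
// 1) Save the .gz file to disk without holding it in memory
$fp = fopen('/tmp/file.xml.gz', 'wb');
$ch = curl_init('http://www.example.com/file.xml.gz');
curl_setopt($ch, CURLOPT_USERPWD, "$username:$password");
curl_setopt($ch, CURLOPT_FILE, $fp);   // write the response body straight into the file
curl_exec($ch);
curl_close($ch);
fclose($fp);

// 2) Parse it through the zlib stream wrapper
$xml = simplexml_load_file('compress.zlib:///tmp/file.xml.gz');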
See as well:
Sorting and grouping SimpleXML Data (example of using xml.gz file with SimpleXMLElement)
Parsing extremely large XML files in php (example of using xml.gz file with XMLReader)
Not sure why, but none of the other answers worked for me in the end. zlib was installed on the server, but the gzdecode() function was not defined, gzuncompress gave me errors, and so did compress.zlib://. They might work for you, so give them a try as well.
If you need to check whether zlib is installed, this Stack Overflow answer or this answer can help. They provide this script:
<?php
echo phpversion().", ";
if (function_exists("gzdecode")) {
    echo "gzdecode OK, ";
} else {
    echo "gzdecode not OK, ";
}
if (extension_loaded('zlib')) {
    echo "zlib extension loaded ";
} else {
    echo "zlib extension not loaded ";
}
?>
This site gives another script that shows which zlib functions are installed:
var_dump(get_extension_funcs('zlib'));
SOLUTION!!! These two snippets did the trick for me. Just use curl or file_get_contents to grab the XML file, then use this script:
// Strip the 10-byte gzip header and 8-byte trailer, then inflate the raw deflate data
$xmlcontent = gzinflate(substr($xmlcontent, 10, -8));
Or use this script to open the XML file and read its contents (see more here):
$zd = gzopen($filename, "r");
// $fileSize is the number of uncompressed bytes to read (an upper bound works too)
$contents = gzread($zd, $fileSize);
gzclose($zd);
Thanks to all who helped me get this answer. Hope this helps someone else!
I suggest you just decompress the result you fetch:
//[...]
$xmlcontent = gzdecode ( curl_exec($ch) );
curl_close($ch);
print_r($xmlcontent);
Obviously you should do some additional error checking; this is just the shortened general approach.
Note that there are two similar functions provided by php:
gzuncompress()
gzdecode()
Most likely you have to use the second one, if the file really is a physical gzip-compressed file delivered by an HTTP server.
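If you are unsure which format you actually have, checking the gzip magic bytes is a quick (if rough) way to decide:
// Gzip data starts with the two magic bytes 0x1f 0x8b
if (substr($xmlcontent, 0, 2) === "\x1f\x8b") {
    $xml = gzdecode($xmlcontent);      // gzip format (.gz files, Content-Encoding: gzip)
} else {
    $xml = gzuncompress($xmlcontent);  // raw zlib/deflate data
}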

Validating a large XML file ~400MB in PHP

I have a large XML file (around 400MB) that I need to ensure is well-formed before I start processing it.
The first thing I tried was something similar to the code below, which is great as I can find out whether the XML is not well-formed and which parts of it are 'bad':
$doc = simplexml_load_string($xmlstr);
if (!$doc) {
    $errors = libxml_get_errors();
    foreach ($errors as $error) {
        echo display_xml_error($error);
    }
    libxml_clear_errors();
}
I also tried...
$doc->load( $tempFileName, LIBXML_DTDLOAD|LIBXML_DTDVALID )
I tested this with a file of about 60MB, but anything much larger (~400MB) causes something new to me, the "OOM killer", to kick in and terminate the script after what always seems like 30 seconds.
I thought I might need to increase the memory available to the script, so I figured out the peak usage when processing the 60MB file, adjusted it accordingly for the larger one, and also turned the script time limit off just in case that was the cause:
set_time_limit(0);
ini_set('memory_limit', '512M');
Unfortunately this didn't work, as the OOM killer appears to be a Linux thing that kicks in if memory load (is that even the right term?) is consistently high.
It would be great if I could load the XML in chunks somehow, as I imagine this would reduce the memory load so that the OOM killer doesn't stick its fat nose in and kill my process.
Does anyone have any experience validating a large XML file and capturing errors about where it's badly formed? A lot of posts I've read point to SAX and XMLReader, which might solve my problem.
UPDATE
So @chiborg pretty much solved this issue for me... the only downside to this method is that I don't get to see all of the errors in the file, just the first one that failed, which I guess makes sense, as the parser can't continue past the first point of failure.
When using SimpleXML, it was able to capture most of the issues in the file and show them to me at the end, which was nice.
Since the SimpleXML and DOM APIs will always load the document into memory, using a streaming parser like SAX or XMLReader is the better approach.
Adapting the code from the example page, it could look like this:
$errors = array();
$xml_parser = xml_parser_create();
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        $errors[] = array(
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser));
        break; // a fatal parse error cannot be recovered from
    }
}
fclose($fp);
xml_parser_free($xml_parser);
For big files, the XMLReader class is the perfect fit.
But if you like the SimpleXML syntax: https://github.com/dkrnl/SimpleXMLReader/blob/master/library/SimpleXMLReader.php
Usage example: http://github.com/dkrnl/SimpleXMLReader/blob/master/examples/example1.php
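For completeness, a hedged sketch of a plain XMLReader well-formedness check (like the accepted approach, it stops reporting at the first fatal error):
libxml_use_internal_errors(true);

$reader = new XMLReader();
$reader->open($file);          // streams the document instead of loading it into memory
while ($reader->read()) {
    // just walk the document; structural problems are collected as libxml errors
}
$reader->close();

foreach (libxml_get_errors() as $error) {
    printf("Line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}
libxml_clear_errors();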
