Problem reading files greater than 1GB with XMLReader - php

Is there a maximum file size the XMLReader can handle?
I'm trying to process an XML feed that is about 3GB in size. There are certainly no PHP errors: the script runs fine and successfully loads into the database once it has been run.
The script also runs fine with smaller test feeds of 1GB and below. However, when processing larger feeds the script stops reading the XML file after about 1GB and continues running the rest of the script.
Has anybody experienced a similar problem? And if so, how did you work around it?
Thanks in advance.

I had the same kind of problem recently and thought I'd share my experience.
It seems that the problem lies in how PHP was compiled: whether it was compiled with support for 64-bit file sizes/offsets or only with 32-bit.
With 32 bits you can only address 4GB of data. You can find a somewhat confusing but good explanation here: http://blog.mayflower.de/archives/131-Handling-large-files-without-PHP.html
I had to split my files with the Perl utility xml_split, which you can find here: http://search.cpan.org/~mirod/XML-Twig/tools/xml_split/xml_split
I used it to split my huge XML file into manageable chunks. The good thing about the tool is that it splits XML files over whole elements. Unfortunately it's not very fast.
I only needed to do this once and it suited my needs, but I wouldn't recommend it for repetitive use. After splitting, I used XMLReader on the smaller files of about 1GB in size.
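For what it's worth, a minimal sketch of how the resulting chunks could then be processed (the glob pattern and the <item> element name are assumptions, not xml_split's actual output naming):

<?php
// Sketch: run XMLReader over each chunk produced by xml_split.
foreach (glob('feed-*.xml') as $chunkFile) {
    $reader = new XMLReader();
    $reader->open($chunkFile);
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
            // readOuterXml() hands the whole element to SimpleXML for convenience
            $item = simplexml_load_string($reader->readOuterXml());
            // ... insert $item into the database here ...
        }
    }
    $reader->close();
}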

Splitting up the file will definitely help. Other things to try...
Adjust the memory_limit setting in php.ini: http://php.net/manual/en/ini.core.php
Rewrite your parser using SAX: http://php.net/manual/en/book.xml.php . This is a stream-oriented parser that doesn't need to build the whole tree. It is much more memory-efficient but slightly harder to program (a minimal sketch follows below).
Depending on your OS, there might also be a 2GB limit on the RAM chunk that you can allocate. Very possible if you're running on a 32-bit OS.
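To illustrate the SAX route, here is a minimal sketch (the feed.xml path and the <item> element name are placeholders, not taken from the question): the expat-based parser fires callbacks while streaming the file in small chunks, so memory stays flat regardless of file size.

<?php
$count  = 0;
$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$count) {
        if ($name === 'ITEM') {   // expat upper-cases element names by default
            $count++;
        }
    },
    function ($parser, $name) {}  // end-element handler (unused here)
);
$fp = fopen('feed.xml', 'rb');
while (!feof($fp)) {
    $chunk = fread($fp, 1024 * 1024);   // 1 MB at a time
    if (!xml_parse($parser, $chunk, feof($fp))) {
        die(xml_error_string(xml_get_error_code($parser)));
    }
}
fclose($fp);
xml_parser_free($parser);
echo "items: $count\n";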

It should be noted that PHP in general has a maximum file size. PHP does not have unsigned integers or long integers, meaning you're capped at 2^31 (or 2^63 on 64-bit systems) for integers. This is important because PHP uses an integer for the file pointer (your position in the file as you read through it), so it cannot process a file larger than 2^31 bytes in size.
However, this should be more than 1 gigabyte. I ran into issues at two gigabytes (as expected, since 2^31 is roughly 2 billion).
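A quick way to check which of those two limits applies to your build (just a sanity-check snippet, not part of the original answer):

printf("%d-bit integers, PHP_INT_MAX = %d\n", PHP_INT_SIZE * 8, PHP_INT_MAX);
// prints "32-bit integers, PHP_INT_MAX = 2147483647" on a 32-bit build
// and "64-bit integers, PHP_INT_MAX = 9223372036854775807" on a 64-bit build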

I've run into a similar issue when parsing large documents. What I wound up doing is breaking the feed into smaller chunks using filesystem functions, then parsing those smaller chunks... So if you have a bunch of <record> tags that you are parsing, parse them out with string functions as a stream, and when you get a full record in the buffer, parse that using the xml functions... It sucks, but it works quite well (and is very memory efficient, since you only have at most 1 record in memory at any one time)...
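A rough sketch of that idea, assuming the feed is a flat sequence of <record> elements with no nesting (the buffer size and the simplexml_load_string() call are my own choices, not the answerer's code):

<?php
$buffer = '';
$fp = fopen('feed.xml', 'rb');
while (!feof($fp)) {
    $buffer .= fread($fp, 65536);
    // pull complete <record>...</record> elements out of the buffer
    while (($start = strpos($buffer, '<record')) !== false
        && ($end = strpos($buffer, '</record>', $start)) !== false) {
        $end      += strlen('</record>');
        $recordXml = substr($buffer, $start, $end - $start);
        $buffer    = substr($buffer, $end);
        $record = simplexml_load_string($recordXml);  // only one record in memory
        // ... process/insert $record here ...
    }
}
fclose($fp);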

Do you get any errors with
libxml_use_internal_errors(true);
libxml_clear_errors();
// your parser stuff here....
$r = new XMLReader(...);
// ....
foreach( libxml_get_errors() as $err ) {
printf(". %d %s\n", $err->code, $err->message);
}
when the parser stops prematurely?

Using Windows XP, NTFS as the filesystem, and PHP 5.3.2, there was no problem with this test script:
<?php
define('SOURCEPATH', 'd:/test.xml');

if ( 0 ) {
    build();
}
else {
    echo 'filesize: ', number_format(filesize(SOURCEPATH)), "\n";
    timing('read');
}

function timing($fn) {
    $start = new DateTime();
    echo 'start: ', $start->format('Y-m-d H:i:s'), "\n";
    $fn();
    $end = new DateTime();
    echo 'end: ', $start->format('Y-m-d H:i:s'), "\n";
    echo 'diff: ', $end->diff($start)->format('%I:%S'), "\n";
}

function read() {
    $cnt = 0;
    $r = new XMLReader;
    $r->open(SOURCEPATH);
    while( $r->read() ) {
        if ( XMLReader::ELEMENT === $r->nodeType ) {
            if ( 0===++$cnt%500000 ) {
                echo '.';
            }
        }
    }
    echo "\n#elements: ", $cnt, "\n";
}

function build() {
    $fp = fopen(SOURCEPATH, 'wb');
    $s = '<catalogue>';
    //for($i = 0; $i < 500000; $i++) {
    for($i = 0; $i < 60000000; $i++) {
        $s .= sprintf('<item>%010d</item>', $i);
        if ( 0===$i%100000 ) {
            fwrite($fp, $s);
            $s = '';
            echo $i/100000, ' ';
        }
    }
    $s .= '</catalogue>';
    fwrite($fp, $s);
    fflush($fp);
    fclose($fp);
}
output:
filesize: 1,380,000,023
start: 2010-08-07 09:43:31
........................................................................................................................
#elements: 60000001
end: 2010-08-07 09:43:31
diff: 07:31
(as you can see I screwed up the output of the end time, but I don't want to run this script for another 7+ minutes ;-))
Does this also work on your system?
As a side note: the corresponding C# test application took only 41 seconds instead of 7.5 minutes, and my slow hard drive might have been the (or one) limiting factor in this case.
filesize: 1.380.000.023
start: 2010-08-07 09:55:24
........................................................................................................................
#elements: 60000001
end: 2010-08-07 09:56:05
diff: 00:41
and the source:
using System;
using System.IO;
using System.Xml;
namespace ConsoleApplication1
{
class SOTest
{
delegate void Foo();
const string sourcepath = #"d:\test.xml";
static void timing(Foo bar)
{
DateTime dtStart = DateTime.Now;
System.Console.WriteLine("start: " + dtStart.ToString("yyyy-MM-dd HH:mm:ss"));
bar();
DateTime dtEnd = DateTime.Now;
System.Console.WriteLine("end: " + dtEnd.ToString("yyyy-MM-dd HH:mm:ss"));
TimeSpan s = dtEnd.Subtract(dtStart);
System.Console.WriteLine("diff: {0:00}:{1:00}", s.Minutes, s.Seconds);
}
static void readTest()
{
XmlTextReader reader = new XmlTextReader(sourcepath);
int cnt = 0;
while (reader.Read())
{
if (XmlNodeType.Element == reader.NodeType)
{
if (0 == ++cnt % 500000)
{
System.Console.Write('.');
}
}
}
System.Console.WriteLine("\n#elements: " + cnt + "\n");
}
static void Main()
{
FileInfo f = new FileInfo(sourcepath);
System.Console.WriteLine("filesize: {0:N0}", f.Length);
timing(readTest);
return;
}
}
}

Related

One-pass algorithm: why is the space complexity O(1)?

From en.wikipedia:
A one-pass algorithm generally requires O(n) (see 'big O' notation) time and less than O(n) storage (typically O(1)), where n is the size of the input.
I made a test with xdebug.profiler_enable=1:
function onePassAlgorithm(array $inputArray): int
{
    $size = count($inputArray);
    for ($countElements = 0; $countElements < $size; ++$countElements) {
    }
    return $countElements;
}
$range = range(1, 1_000_000);
$result = onePassAlgorithm($range);
The memory usage of this code in qcachegrind is 33,558,608 bytes, and 100% of it was used by the range() function.
That part seems fine to me, because inside the onePassAlgorithm function we have only two int variables.
And that's the reason why the space complexity is O(1).
Then I made another test:
function onePassAlgorithm(array $inputArray, int $twoSum): array
{
    $seen_nums = [];
    foreach ($inputArray as $key => $num) {
        $complement = $twoSum - $num;
        if (isset($seen_nums[$complement])) {
            return [$seen_nums[$complement], $key];
        }
        $seen_nums[$num] = $key;
    }
    return [];
}
$range = range(1, 1_000_000);
$result = onePassAlgorithm($range, 1_999_999);
In qcachegrind we can see that the onePassAlgorithm function uses only 376 bytes (the size of the return statement). Don't we need more memory to sequentially store values in the $seen_nums variable? So again the space complexity is O(1)?
My question is: why does qcachegrind show that the variable $seen_nums, into which we copy the entire $inputArray, consumes no memory?
Or in other words, why is the storage complexity of my second implementation of this algorithm O(1)?
From the Xdebug documentation:
[2007-05-17]: Removed support for memory profiling as that didn't work properly.
[2015-02-22], Xdebug 2.3.0:
Added the time index and memory usage for function returns in normal tracefiles.
So the reason for my confusion was that the Xdebug profiler shows only the memory usage of function returns, and not the full memory profiling that I expected.
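A simple way to see the real cost, measured at runtime rather than from the profile (a sketch based on the second test case above):

<?php
$range  = range(1, 1_000_000);
$before = memory_get_usage();
$result = onePassAlgorithm($range, 1_999_999);  // fills $seen_nums almost completely
$peak   = memory_get_peak_usage();
// The difference is far from constant: $seen_nums really does grow with the input,
// so the space complexity of this implementation is O(n), not O(1).
echo 'extra peak memory inside the call: ', ($peak - $before), " bytes\n";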

Nested loops in PHP extremely slow

I have 6 nested loops in a PHP program, and the calculation time of the script is extremely slow. I would like to ask if there is a better way of implementing the 6 loops and reducing the computation time, even if it means switching to another language. The nature of the algorithm I'm implementing requires iteration, so I don't know how to implement it better.
Here's the code.
<?php
$time1 = microtime(true);
$res = 16;
$imageres = 128;

for ($x = 0; $x < $imageres; ++$x) {
    for ($y = 0; $y < $imageres; ++$y) {
        $pixels[$x][$y] = 1;
    }
}

$quantizermatrix = 1;
$scalingcoefficient = 1 / ($res / 2);

for ($currentimagex = 0; $currentimagex < ($res * ($imageres / $res - 1) + 1); $currentimagex = $currentimagex + $res) {
    for ($currentimagey = 0; $currentimagey < ($res * ($imageres / $res - 1) + 1); $currentimagey = $currentimagey + $res) {
        for ($u = 0; $u < $res; ++$u) {
            for ($v = 0; $v < $res; ++$v) {
                for ($x = 0; $x < $res; ++$x) {
                    for ($y = 0; $y < $res; ++$y) {
                        if ($u == 0) { $a = 1 / (sqrt(2)); } else { $a = 1; }
                        if ($v == 0) { $b = 1 / (sqrt(2)); } else { $b = 1; }
                        $xes[$y] = $pixels[$x + $currentimagex][$y + $currentimagey] * cos((M_PI / $res) * ($x + 0.5) * $u) * cos(M_PI / $res * ($y + 0.5) * $v);
                    }
                    $xes1[$x] = array_sum($xes);
                }
                $xes2 = array_sum($xes1) * $scalingcoefficient * $a * $b;
                $dctarray[$u + $currentimagex][$v + $currentimagey] = round($xes2 / $quantizermatrix) * $quantizermatrix;
            }
        }
    }
}

foreach ($dctarray as $dct) {
    foreach ($dct as $dc) {
        echo $dc . " ";
    }
    echo "<br>";
}

$time2 = microtime(true);
echo 'script execution time: ' . ($time2 - $time1);
?>
I've removed a large portion of the code that's irrelevant, since this is the section that's problematic.
Essentially the code iterates through every pixel in a PNG image and outputs a computed matrix (2D array). This code takes around 2 seconds for a 128x128 image, which makes the program impractical for normal images larger than 128x128.
There is a function available in the Imagick library:
Imagick::exportImagePixels
Refer to the link below; it might help you out:
http://www.php.net/manual/en/imagick.exportimagepixels.php
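For reference, a minimal sketch of how that call could replace the per-pixel PHP loops (the input file name is a placeholder; the 'I' map asks ImageMagick for a single intensity channel):

$img    = new Imagick('input.png');
$pixels = $img->exportImagePixels(0, 0, 128, 128, 'I', Imagick::PIXEL_FLOAT);
// $pixels is a flat array of 128*128 grayscale values; index it as $pixels[$y * 128 + $x]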

Does a boolean value in PHP take up only 1 bit of memory?

As the question states, would the following array require 5 bits of memory?
$flags = array(true, false, true, false, false);
[EDIT]: Apologies, I just found this duplicate.
Each element in the array is stored in a separate memory location, and you also need to store the hashtable for the array, along with the keys, so no: it's going to be a lot more.
No. PHP has internal metadata attached to every variable/array element defined. PHP does not support bit fields directly, so the smallest actual allocation is a byte, plus metadata overhead.
I doubt there is an application that uses less than the system architecture's data word as the minimum data storage unit.
But I am sure it shouldn't be your concern at all.
It depends on the php interpreter. The standard interpreter is extremely wasteful, although this is not uncommon for a dynamic language. The massive overhead is caused by garbage collection, and the dynamic nature of every value; since the contents of an array can take arbitrary values of arbitrary types (i.e. you can write $ar[1] = 's';), the type and additional metainformation must be stored.
With the following test script:
<?php
$n = 20000000;
$ar = array();
$i = 0;
$before = memory_get_usage();
for ($i = 0; $i < $n; $i++) {
    $ar[] = ($i % 2 == 0);
}
$after = memory_get_usage();
echo 'Using ' . ($after - $before) . ' Bytes for ' . $n . ' values';
echo ', per value: ' . (($after - $before) / $n) . "\n";
I get about 150 Bytes per array entry (x64, php 5.4.0-2). This seems to be at the higher end of implementations; ideone reports 73 Bytes/entry (php 5.2.11), and so does codepad.

PHP filesize() On Files > 2 GB

I have been struggling with how to get the valid file size of a file that is >= 2 GB in PHP.
Example
Here I am checking the filesize of a file that is 3,827,394,560 bytes large with the filesize() function:
echo "The file is " . filesize('C:\MyFile.rar') . " bytes.";
Result
This is what it returns:
The file is -467572736 bytes.
Background
PHP uses signed integers, which means that the maximum number it can represent on a 32-bit build is 2,147,483,647 (about 2 GB).
This is where it is limited.
The solution I tried, and which apparently works, is to use the "Size" property of the file object returned by COM's Scripting.FileSystemObject. I am not entirely sure what type it uses.
This is my code:
function real_filesize($file_path)
{
    $fs = new COM("Scripting.FileSystemObject");
    return $fs->GetFile($file_path)->Size;
}
It's simply called as follows:
$file = 'C:\MyFile.rar';
$size = real_filesize($file);
echo "The size of the file is: $size";
Result
The size of the file is: 3,827,394,560 bytes
http://us.php.net/manual/en/function.filesize.php#102135 gives a complete and correct means for finding the size of a file larger than 2GB in PHP, without relying on OS-specific interfaces.
The gist of it is that you first use filesize to get the "low" bits, then open+seek the file to determine how many multiples of 2GB it contains (the "high" bits).
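A minimal sketch of that technique (untested; it assumes the file is local and seekable, and it accumulates the offset in a float so it never overflows a 32-bit integer):

<?php
function filesize_large($path)
{
    $fp = fopen($path, 'rb');
    if (!$fp) {
        return false;
    }
    $pos  = 0.0;          // current offset, tracked as a float
    $step = 1073741824;   // start with 1 GB hops
    while ($step >= 1) {
        if (fseek($fp, $step, SEEK_CUR) !== 0) {
            $step = (int) ($step / 2);           // seek failed, try a smaller hop
            continue;
        }
        if (fgetc($fp) !== false) {
            $pos += $step + 1;                   // the hop plus the byte just read fit in the file
        } else {
            fseek($fp, -$step, SEEK_CUR);        // overshot EOF, undo the hop
            $step = (int) ($step / 2);
        }
    }
    fclose($fp);
    return $pos;                                 // size in bytes, as a float
}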
I was using a different approach that saves precious server resources;
have a look at my GitHub repository: github.com/eladkarako/download.eladkarako.com.
It is a plain and complete download dashboard that overcomes the (rare) cases of filesize error by using a client-side HEAD request. Granted, the size will not be embedded into the page's HTML source, but rendered (filled in) some time later, so it is more suitable for, let's say, relaxed scenarios.
To make this solution work, an Apache .htaccess (or a header in PHP) should be added, allowing client-side access to the Content-Length value.
Essentially you can slim down the .htaccess to just exposing Content-Length, removing other CORS rules, which makes the website more secure.
No jQuery was used, and the whole thing was written in my Samsung text editor and uploaded by FTP from my smartphone, on a 1.5-hour train ride during my army reserve duty... and yet, still impeccable ;)
I have one "hacky" solution what works well.
Look please THIS function how I do it and you need also include this class to function can works well or change by your need.
example:
include_once 'class.os.php';
include_once 'function.filesize.32bit.php';
// Must be real path to file
$file = "/home/username/some-folder/yourfile.zip";
echo get_filesize($file);
This function is not an ideal solution, but here is how it works:
First it checks whether shell_exec is enabled in PHP. If it is enabled, it checks the real file size via a shell command.
If the shell fails and the OS is 64-bit, it returns the normal filesize() information.
If the OS is 32-bit, it falls back to a "chunking" method and calculates the file size by reading bytes.
NOTE!
After reading, keep the results in string format so they are easy to work with, because PHP can do arithmetic with numeric strings, but if you convert a result over 2GB into an integer you will have the same problem as before.
WARNING!
Chunking is really slow, and if you loop over it you may run into memory problems, or the script can take minutes to finish reading all the files. If you use this function on a server where shell_exec is enabled, reading will be super fast.
P.S.
If you have ideas for changes and improvements, feel free to commit.
For anyone who happens to be on a Linux host, the easiest solution I found is to use:
exec("stat --format=\"%s\" \"$file\"");
This assumes no quotation marks or newlines in the file name, and technically it returns a string instead of a number, but it works well with this method to get a human-readable file size.
The largest file I tested this with was about 3.6 GB.
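A slightly safer variant of the same idea (a sketch): escapeshellarg() takes care of spaces and quotes in the file name, and keeping the result as a string avoids the 32-bit integer overflow.

$size = trim(exec('stat --format=%s ' . escapeshellarg($file)));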
I know this is an oldie, but I'm using PHP x64 5.5.38 (for now) and don't want to upgrade to the latest 7.x version yet.
I read all these posts about finding file sizes larger than 2GB, but all the solutions were very slow for a large number of files.
So, yesterday I created the C/C++ PHP extension "php_filesize.dll", which uses the power of C/C++ to find file sizes with a few methods I found; it's also UTF-8 compatible and very fast.
You can try it:
http://www.jobnik.net/files/PHP/php_filesize.zip
Usage:
methods:
0 - using GetFileAttributesEx
1 - using CreateFile
2 - using FindFirstFile
-1 - using stat64 (default and optional)
$fsize = php_filesize("filepath", $method_optional);
Returns the file size as a string, up to 9 petabytes.
Credits:
FileSize methods: Check the file-size without opening file in C++?
UTF-8 support: https://github.com/kenjiuno/php-wfio
To get the correct file size I often use this piece of code, which I wrote some months ago. It uses exec/COM/stat where available. I know its limits, but it's a good starting point. The best option is using filesize() on a 64-bit architecture.
<?php
######################################################################
# Human size for files smaller or bigger than 2 GB on 32 bit Systems #
# size.php - 1.3 - 21.09.2015 - Alessandro Marinuzzi - www.alecos.it #
######################################################################
function showsize($file) {
    if (strtoupper(substr(PHP_OS, 0, 3)) == 'WIN') {
        if (class_exists("COM")) {
            $fsobj = new COM('Scripting.FileSystemObject');
            $f = $fsobj->GetFile(realpath($file));
            $size = $f->Size;
        } else {
            $size = trim(@exec("for %F in (\"" . $file . "\") do @echo %~zF"));
        }
    } elseif (PHP_OS == 'Darwin') {
        $size = trim(@exec("stat -f %z " . $file));
    } else {
        $size = trim(@exec("stat -c %s " . $file));
    }
    if ((!is_numeric($size)) || ($size < 0)) {
        $size = filesize($file);
    }
    if ($size < 1024) {
        echo $size . ' Byte';
    } elseif ($size < 1048576) {
        echo number_format(round($size / 1024, 2), 2) . ' KB';
    } elseif ($size < 1073741824) {
        echo number_format(round($size / 1048576, 2), 2) . ' MB';
    } elseif ($size < 1099511627776) {
        echo number_format(round($size / 1073741824, 2), 2) . ' GB';
    } elseif ($size < 1125899906842624) {
        echo number_format(round($size / 1099511627776, 2), 2) . ' TB';
    } elseif ($size < 1152921504606846976) {
        echo number_format(round($size / 1125899906842624, 2), 2) . ' PB';
    } elseif ($size < 1180591620717411303424) {
        echo number_format(round($size / 1152921504606846976, 2), 2) . ' EB';
    } elseif ($size < 1208925819614629174706176) {
        echo number_format(round($size / 1180591620717411303424, 2), 2) . ' ZB';
    } else {
        echo number_format(round($size / 1208925819614629174706176, 2), 2) . ' YB';
    }
}
?>
<?php include("php/size.php"); ?>
<?php showsize("files/VeryBigFile.tar"); ?>
I hope this helps.

PHP script to generate a file with random data of given name and size?

Does anyone know of one? I need to test some upload/download scripts and need some really large files generated. I was going to integrate the test utility with my debug script.
To start you could try something like this:
function generate_file($file_name, $size_in_bytes)
{
    $data = str_repeat(rand(0, 9), $size_in_bytes);
    file_put_contents($file_name, $data); // writes $data to a file
}
This creates a file filled with a single random digit (0-9) repeated for the whole length.
generate_file() from "Marco Demaio" is not memory friendly so I created file_rand().
function file_rand($filename, $filesize) {
    if ($h = fopen($filename, 'w')) {
        if ($filesize > 1024) {
            for ($i = 0; $i < floor($filesize / 1024); $i++) {
                fwrite($h, bin2hex(openssl_random_pseudo_bytes(511)) . PHP_EOL);
            }
            $filesize = $filesize - (1024 * $i);
        }
        $mod = $filesize % 2;
        fwrite($h, bin2hex(openssl_random_pseudo_bytes(($filesize - $mod) / 2)));
        if ($mod) {
            fwrite($h, substr(uniqid(), 0, 1));
        }
        fclose($h);
        umask(0000);
        chmod($filename, 0644);
    }
}
As you can see, line breaks are added every 1024 bytes to avoid problems with functions that are limited to 1024-9999 bytes, e.g. fgets() in PHP <= 4.3, and to make it easier to open the file in a text editor that has the same issue with super long lines.
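For example, a call like this (the path is just a placeholder) would create a roughly 1 MB file of random hex data:

file_rand('/tmp/random_1mb.txt', 1024 * 1024);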
Do you really need so much variation in filesize that you need a PHP script? I'd just create test files of varying sizes via the command line and use them in my unit tests. Unless the filesize itself is likely to cause a bug, it would seem you're over-engineering here...
To create a file in Windows:
fsutil file createnew d:\filepath\filename.txt 1048576
In Linux:
dd if=/dev/zero of=filepath/filename.txt bs=10000000 count=1
if is the input source (here /dev/zero, a stream of null bytes), of is the output file, bs is the block size (which here is also the final file size, since count is 1), and count defines how many blocks you want to copy.
generate_file() from #Marco Demaio caused the warning below when generating a 4GB file:
Warning: str_repeat(): Result is too big, maximum 2147483647 allowed
in /home/xxx/test_suite/handler.php on line 38
I found the function below on php.net and it works like a charm.
I have tested it up to 17.6 TB (see update below) in less than 3 seconds.
function CreatFileDummy($file_name, $size = 90294967296) {
    // 32 bits: 4 294 967 296 bytes MAX size
    $f = fopen('dummy/' . $file_name, 'wb');
    if ($size >= 1000000000) {
        $z = ($size / 1000000000);
        if (is_float($z)) {
            $z = round($z, 0);
            fseek($f, ($size - ($z * 1000000000) - 1), SEEK_END);
            fwrite($f, "\0");
        }
        while (--$z > -1) {
            fseek($f, 999999999, SEEK_END);
            fwrite($f, "\0");
        }
    }
    else {
        fseek($f, $size - 1, SEEK_END);
        fwrite($f, "\0");
    }
    fclose($f);
    return true;
}
Update:
I was trying to hit 120 TB, 1200 TB and more, but the file size was limited to 17.6 TB. After some googling I found that this is the max_volume_size of the ReiserFS file system, which was on my server.
Maybe PHP can handle 1200 TB as well in just a few seconds. :)
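Example usage (a sketch; note that the function writes into a dummy/ subdirectory, and because it only seeks and writes single null bytes the result is typically a sparse file, which is why such huge sizes finish in seconds):

CreatFileDummy('test_10gb.bin', 10000000000);  // ~10 GB dummy file in dummy/test_10gb.bin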
Why not have a script that streams out random data? The script can take parameters for file size, type etc.
This way you can simulate many scenarios, for example bandwidth throttling, premature file end etc. etc.
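A minimal sketch of such a streaming endpoint (the script name, size parameter, and chunk size are my own choices): requesting generate.php?size=104857600 would stream 100 MB of random bytes with a proper Content-Length header (PHP 7+ for random_bytes()).

<?php
$size = (int) ($_GET['size'] ?? 1048576);   // default: 1 MB
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="random_' . $size . '.bin"');
header('Content-Length: ' . $size);
$remaining = $size;
while ($remaining > 0) {
    $chunk = min($remaining, 65536);
    echo random_bytes($chunk);
    $remaining -= $chunk;
}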
Does the file really need to be random? If so, just read from /dev/urandom on a Linux system:
dd if=/dev/urandom of=yourfile bs=4096 count=1024 # for a 4MB file.
If it doesn't really need to be random, just find some files you have lying around that are the appropriate size, or (alternatively) use tar and make some tarballs of various sizes.
There's no reason this needs to be done in a PHP script: ordinary shell tools are perfectly sufficient to generate the files you need.
If you want really random data you might want to try this:
$data = '';
while ($byteSize-- > 0) {   // $byteSize holds the desired number of bytes
    $data .= chr(rand(0, 255));
}
Might take a while, though, if you want large file sizes (as with any random data).
I would suggest using a library like Faker to generate test data.
I took the answer of mgutt and shortened it a bit. Also, his answer has a little bug which I wanted to avoid.
function createRandomFile(string $filename, int $filesize): void
{
    $h = fopen($filename, 'w');
    if (!$h) return;
    for ($i = 0; $i < intdiv($filesize, 1024); $i++) {
        fwrite($h, bin2hex(random_bytes(511)) . PHP_EOL);
    }
    fwrite($h, substr(bin2hex(random_bytes(512)), 0, $filesize % 1024));
    fclose($h);
    chmod($filename, 0644);
}
Note: This works only with PHP >= 7. If you really want to run it on lower versions, use openssl_random_pseudo_bytes instead of random_bytes and floor($filesize / 1024) instead of intdiv($filesize, 1024).
