Parsing a large file using PHP on a Linux server

I am a PHP programmer and currently I am working with files. I have to parse the data and insert it into a MySQL database. Because of the sheer amount of data, PHP is unable to load or parse the file; I get a memory allocation error even though I have increased memory_limit up to 1500MB.
FATAL: emalloc(): Unable to allocate 456185835 bytes
My text file contains a mix of plain text and XML data, and I have to parse the XML data out of the text file, e.g.:
<ajax>some text goes here</ajax> unrelated text <ajax>other content</ajax>
In the above example I have to parse the content inside each tag. If anyone can give some advice on separating each tag into an individual file (e.g. 1.txt, 2.txt), that would be great (Perl, C, shell scripting, etc.).

Cough... a 1500 MB memory limit is a sure sign you have gone off the rails.
Where are you getting your file? I assume (given the size) that this is a local file. If you are trying to load the file into a string using file_get_contents(), it is worth noting that the docs are wrong and that said function does not in fact use memory-mapped I/O (cf. bug 52802). So this is not going to work for you.
What you might try is instead falling back to more C-like (but still PHP) constructs, in particular fopen(), fseek(), and fread(). If the file is of a known structure with newlines, you might also consider fgets().
These should allow you to read bytes in chunks into a reasonably sized buffer from which you can do your processing. Since it looks like you are processing tagged strings, you will have to play the usual games of keeping multiple buffers around in which you can accumulate data until it is processable. This is fairly standard stuff, covered in most introductions to, e.g., stream processing in C.
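A minimal sketch of that approach, assuming the <ajax> tags never nest and that each tagged span fits in memory once accumulated (the input filename and the 8192-byte chunk size here are illustrative, not from the question):

$in = fopen('large_input.txt', 'rb'); // hypothetical input file
$buffer = '';
$count = 0;
while (!feof($in)) {
    $buffer .= fread($in, 8192); // accumulate one chunk at a time
    // extract every complete <ajax>...</ajax> span currently in the buffer
    while (preg_match('~<ajax>(.*?)</ajax>~s', $buffer, $m, PREG_OFFSET_CAPTURE)) {
        file_put_contents(++$count . '.txt', $m[1][0]); // 1.txt, 2.txt, ...
        // drop everything up to and including the matched closing tag
        $buffer = substr($buffer, $m[0][1] + strlen($m[0][0]));
    }
}
fclose($in);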
Note that in PHP (or any other language, for that matter), you will also potentially have to consider issues of string encoding because, in general, it is no longer the case that 1 byte == 1 character (cf. Unicode).
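A trivial illustration of the pitfall (the mbstring extension is assumed to be available):

$s = "héllo"; // "é" occupies two bytes in UTF-8
var_dump(strlen($s));             // int(6) -- counts bytes
var_dump(mb_strlen($s, 'UTF-8')); // int(5) -- counts characters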
As you imply, PHP may well not be the best language for this task (though it certainly can do it). But your problem isn't really language-specific; you are running into a fundamental limitation of handling large files without memory-mapping.

You can actually parse XML with PHP a small block at a time, so you don't require much RAM at all:
set_time_limit(0);
define('__BUFFER_SIZE__', 131072);
define('__XML_FILE__', 'pf_1360591.xml');

function elementStart($p, $n, $a) {
    // handle opening of elements
}

function elementEnd($p, $n) {
    // handle closing of elements
}

function elementData($p, $d) {
    // handle cdata in elements
}

$xml = xml_parser_create();
xml_parser_set_option($xml, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($xml, XML_OPTION_SKIP_WHITE, 1);
xml_set_element_handler($xml, 'elementStart', 'elementEnd');
xml_set_character_data_handler($xml, 'elementData');

$f = fopen(__XML_FILE__, 'r');
if ($f) {
    while (!feof($f)) {
        $content = fread($f, __BUFFER_SIZE__);
        xml_parse($xml, $content, feof($f)); // final arg flags the last chunk
    }
    fclose($f);
}
xml_parser_free($xml);

Related

XML reading script using PHP incompletely reads some elements

I have an XML data source URL from which I am reading the data using fread(). It contains student information, from which I am extracting the grades and compiling them into an array.
The problem is that when I run this script locally, it works fine and all the grades are correctly collected in the array. However, when I run it on a shared server, I get some incorrectly read grades in addition to the normal grade names, for example "ergarten". The complete grade name "Kindergarten" is also recorded in the array, which means there is a problem reading only specific elements.
The first suspect I have in mind is the fread() byte length. I have changed it to 8192, but without luck.
Here is the relevant code chunk from the php file:
if (!($xml_parser = xml_parser_create())) die("Couldn't create parser.");
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");
// $fp is assumed to already be an open handle to the XML source
while ($data = fread($fp, 8192)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        break;
    }
}
xml_parser_free($xml_parser);
Any thoughts?
I found the problem and fixed it myself.
The problem was that in the loop where the data was being read in chunks using fread(), I was simultaneously feeding those chunks to the XML parser, and that was causing the problem, since a chunk boundary does not always fall on a tag boundary and the character-data handler can then be called with only part of a text node (which is how "Kindergarten" turned up as "ergarten"). I removed the parser from that point so that it runs only once all the data has been read by the script.
That solved the problem.
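For reference, the other common fix is to make the handlers themselves chunk-safe: expat may invoke the character-data handler several times for a single text node, so the handler should accumulate text and only consume it when the element closes. A sketch of that approach, wired up with the same handler registrations as above (the GRADE element name and the variable names are illustrative):

$grade = '';
$grades = array();
function startElementHandler($parser, $name, $attrs) {
    global $grade;
    if ($name === 'GRADE') $grade = ''; // reset the accumulator
}
function characterDataHandler($parser, $data) {
    global $grade;
    $grade .= $data; // may fire more than once per text node
}
function endElementHandler($parser, $name) {
    global $grade, $grades;
    if ($name === 'GRADE') $grades[] = $grade; // the text is only complete here
}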

PHP: How to solve ob_start() in combination with imagepng() issue?

I use the following code to create an image and encode it to base64. There is no direct output of the image.
ob_start(); // catching the output buffer
imagepng($imgSignature);
$base64Signature=base64_encode(ob_get_contents());
ob_end_clean();
ob_start() recently started to throw a 500 error and I have trouble figuring out the issue. The server uses PHP 5.4.11. I really don't know if it was running the same version when I installed the script, or if the memory is running full. I know that ob_start() has changed throughout the PHP versions. I really have a hard time wrapping my head around this. Is the script correct for PHP 5.4.11?
I really appreciate any help.
I'm not sure how to solve your issue with ob_start(), but I have an alternative for what you are doing that doesn't involve output buffers.
imagepng($imgSignature, 'php://memory/file.png');
$base64Signature = base64_encode(file_get_contents('php://memory/file.png'));
This basically saves the PNG image to a virtual temporary file that exists only in memory; then you read it back and get the same result.
My theory about your error:
At some point in your code, you will have this image stored multiple times in memory: in $imgSignature, in the internal buffer you created with ob_start(), in the buffer you read with ob_get_contents(), and in the resulting value of base64_encode(). Pretty much all in one line. God only knows how much memory it's using, not to mention you probably allocated more resources earlier while building this image.
It is important not to have too much allocated at the same time, especially when dealing with memory-consuming resources like images. If you unset() or overwrite variables you no longer need, you allow the garbage collector to dispose of those unreferenced resources from memory.
For instance, you can change the way this piece of code was written to this:
ob_start();
imagepng($imgSignature);
imagedestroy($imgSignature);
$data = ob_get_contents();
ob_end_clean();
$data = base64_encode($data);
I dropped $imgSignature as soon as I didn't need it anymore, ended and cleaned my buffer as soon as I was done getting what I wanted from it, and then disposed of the raw $data by overwriting it with the base64-encoded version that was really what I wanted.
Now this will use significantly less memory. If you extend this to the rest of your code, or at least to the parts that use a lot of memory, like the images you loaded or created with the GD2 library, it should optimize the memory usage of your script, giving you the extra space you need.

PHP zip_open() and php://temp, can't seem to open

Not sure if this is possible, but it's become an academic struggle now.
Using the __halt_compiler() trick to embed binary data in a PHP file, I've successfully created a self-opening script which will fseek() to __COMPILER_HALT_OFFSET__ (not too hard, seeing as this precise example is documented in the manual).
Anyway, I've stored a small lump of binary ZIP data (a single folder containing a single file that says "hello world") after my call to __halt_compiler().
What I've tried to do is copy the data directly to the php://temp stream, and have done so with success (if I rewind() and passthru() the temporary stream handle, it dumps the data):
$php = fopen(__FILE__, 'rb');
$tmp = fopen('php://temp', 'r+b');
fseek($php, __COMPILER_HALT_OFFSET__);
stream_copy_to_stream($php, $tmp);
My problem comes with trying to now open php://temp[1] with zip_open():
$zip = zip_open('php://temp');
[1] From what I can see (notwithstanding other possibilities, such as a lack of stream support in zip_open()), the problem here is the inherent non-permanence of data in php://memory and php://temp streams between handles. If this can be worked around, perhaps it is in fact possible.
It keeps kicking back error code 11, on which I have found no documentation[2] (seemingly, like most other possible error codes):
var_dump($zip); // int(11)
[2] As #cweiske pointed out, error code 11 = ZipArchive::ER_OPEN, "Can't open file".
Is this a consequence of my attempt at using the php://temp stream, or some other possible issue? I'm also aware there exists an OOP approach (ZipArchive, et al.), but I figured I'd start with the basics.
Any ideas?
11 is the constant ZipArchive::ER_OPEN, which the manual describes as "Can't open file".
Note that the manual does not state that stream wrappers may be used.
Please think about using PHP's phar extension - it does what you want, and is well tested.
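If you still want to experiment with the zip functions rather than switching to phar, one workaround consistent with the above is to dump the stream to a real temporary file first, since the zip functions expect a path on the ordinary filesystem. A sketch, reusing $tmp from the question (the archive entry name is an assumption):

$path = tempnam(sys_get_temp_dir(), 'zip');
rewind($tmp);
file_put_contents($path, stream_get_contents($tmp)); // persist to a real file
$zip = new ZipArchive();
if ($zip->open($path) === true) {
    echo $zip->getFromName('folder/hello.txt'); // hypothetical entry name
    $zip->close();
}
unlink($path); // clean up the temporary file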

php file random access and object to file saving

I have a CSV file with records sorted on the first field. I managed to write a function that does a binary search through that file, using fseek() for random access.
However, this is still a pretty slow process, since when I seek to some file position, I actually need to scan left, looking for the \n character, so I can make sure I'm reading a whole line (once the whole line is read, I can check its first field value, mentioned above).
Here is the function that returns the line containing the character at position $x:
function fgetLineContaining( $fh, $x ) {
    if( $x < 0 || $x > 125145411 ) // 125145411 is the last pos in my file
        return "";
    // now go as far left as possible, until a newline is found
    // or the beginning of the file
    $c = '';
    while( $x > 0 && $c != "\n" && $c != "\r" ) {
        fseek( $fh, $x );
        $x--; // go left in the file
        $c = fgetc( $fh );
    }
    $x += 2; // skip newline char
    fseek( $fh, $x );
    return fgets( $fh, 1024 ); // return the line from the beginning until \n
}
While this is working as expected, I have to say that my CSV file has ~1.5 million lines, and these left-seeks slow things down quite a lot.
Is there a better way to seek a line containing position x inside a file?
Also, it would be much better if an object of a class could be saved to a file without serializing it, thus enabling reading of the file object-by-object. Does PHP support that?
Thanks
I think you really should consider using SQLite or MySQL again (as others have suggested in the comments). Most of the suggestions about pre-calculating indexes are already implemented "properly" in these SQL engines.
You said the speed wasn't good enough in SQL. Did you have the fields indexed properly? How were you querying the data? Were you using bulk queries? Were you using prepared statements? Did the SQL process have enough memory to keep its indexes in RAM?
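To make that concrete, a sketch of the SQLite route (the table and column names are assumed); a single indexed lookup replaces the manual binary search entirely:

$db = new PDO('sqlite:records.db');
$db->exec('CREATE TABLE IF NOT EXISTS records (sort_key TEXT, line TEXT)');
$db->exec('CREATE INDEX IF NOT EXISTS idx_sort_key ON records (sort_key)');
// ...bulk-insert the CSV rows once, inside a single transaction...
$stmt = $db->prepare('SELECT line FROM records WHERE sort_key = ?');
$stmt->execute(array('some-key')); // hypothetical first-field value
$row = $stmt->fetch(PDO::FETCH_ASSOC);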
One thing you could try to speed things up under the current algorithm is to load the (~100 MB?) file onto a RAM disk. No matter what you choose, CSV or SQLite, this WILL help, especially if hard drive seek time is your bottleneck.
You could possibly even read the whole file into PHP arrays (assuming your computer has enough RAM for that). That would allow you to do your search via index ($big_array[$offset]) lookups.
Also, one thing to keep in mind: PHP isn't exactly super fast at low-level work. You might want to consider moving away from PHP in favor of C or C++.
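If you do stay with the flat file, here is a sketch of the pre-calculated index idea mentioned above, assuming a one-time scan of the file is acceptable: record the byte offset of every line start once, so the binary search can fseek() straight to a line and never scan left for \n:

function buildLineIndex($fh) {
    $index = array();
    rewind($fh);
    do {
        $index[] = ftell($fh); // offset of the line about to be read
    } while (fgets($fh) !== false);
    array_pop($index); // the last entry points past EOF
    return $index;
}

function getLine($fh, $index, $i) {
    fseek($fh, $index[$i]);
    return fgets($fh);
}

The binary search then compares the first field of getLine($fh, $index, $mid) against the search key directly, with no leftward scanning.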

Is it better to use fseek() fread() on individual lines, or fread() the entire file and substr to parse?

To make this clearer, here are code samples:
$file = fopen('filename.ext', 'rb');
// Assume $pos has been declared

// method 1
fseek($file, $pos);
$parsed = fread($file, 2);

// method 2
$data = '';
while (!feof($file)) {
    $data .= fread($file, 1000000); // accumulate the whole file
}
$data = bin2hex($data);
$parsed = substr($data, $pos, 2);

fclose($file);
There are about 40 fread() calls in method 1 (with maybe 15 fseek() calls) vs. one fread() in method 2. The only thing I am wondering is whether loading in 1,000,000 bytes is overkill when you're really only extracting maybe 100 bytes in total (all relatively close together in the middle of the file).
So which code is going to perform better? Which code makes more sense to use? A quick explanation would be greatly appreciated.
If you already know the offset you are looking for, fseek() is the best method here, as there is no reason to load the whole file into memory if you only need a few bytes of it. The first method is better because you skip right to what you want in the file stream and read out a small portion. The second method requires you to read the entire file into memory and then seek through that, when you could have just read it straight from the file. Hope this answers your question.
Files are read in units of clusters, and a cluster is usually something like 8 KB. Usually a few clusters are read ahead.
So, if the file is only a few kb there is very little to gain by using fseek compared to reading the entire file. The file system will read the entire file anyway.
If the file is considerably larger, as in your case, only a few of the clusters have to be read, so the first method should perform better. At worst all the data will still be read from the disk, but your application will use less memory.
It seems that seeking to the position you want and then reading only the bytes you need is the best approach.
But the correct answer is (as always) to test it for real instead of guessing. Run your two examples in your server environment and take some time measurements. Also check memory usage. Then make your optimization once you have some hard data to back it up.
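A minimal harness for that, as a sketch; methodOne() and methodTwo() are hypothetical wrappers around the two code samples above, and since memory_get_peak_usage() reports the peak so far, run each method in a separate process if you want a clean per-method memory figure:

foreach (array('methodOne', 'methodTwo') as $fn) {
    $start = microtime(true);
    $fn(); // variable function call
    printf("%s: %.4f s, peak memory %d bytes\n",
        $fn, microtime(true) - $start, memory_get_peak_usage(true));
}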
