Wonder if anyone can help me out with a little cron issue I am experiencing.
The problem is that the load can spike up to 5 and the CPU usage can jump to 40%, on a dual-core 'Xeon L5410 @ 2.33GHz' with 356MB RAM, and I'm not sure where I should be tweaking the code, and in which direction, to prevent that. Code sample below.
// Note: $productFile can be 40MB .gz compressed, 700MB uncompressed (XML text file)
if (file_exists($productFile)) {
    $fResponse = gzopen($productFile, "r");
    if ($fResponse) {
        while (!gzeof($fResponse)) {
            $sResponse = "";
            $chunkSize = 10000;
            while (!gzeof($fResponse) && (strlen($sResponse) < $chunkSize)) {
                $sResponse .= gzgets($fResponse, 4096);
            }
            $new_page .= $sResponse;
            $sResponse = "";
            $thisOffset = 0;
            unset($matches);
            if (strlen($new_page) > 0) {
                // Empty the buffer if it cannot possibly contain a product
                if (!(strstr($new_page, "<product "))) {
                    $new_page = "";
                }
                while (preg_match("/<product [^>]*>.*<\/product>/Uis", $new_page, $matches, PREG_OFFSET_CAPTURE, $thisOffset)) {
                    $thisOffset = $matches[0][1];
                    $thisLength = strlen($matches[0][0]);
                    $thisOffset = $thisOffset + $thisLength;
                    $new_page = substr($new_page, $thisOffset - 1);
                    $thisOffset = 0;
                    $new_page_match = $matches[0][0];
                    //- Save collected data here -//
                }
            } // End chunk processing
        }
        gzclose($fResponse);
    }
}
$chunkSize: should it be as small as possible to keep memory usage down and ease the regular expression, or should it be larger so the code doesn't take too long to run?
With 40,000 matches the load/CPU spikes. So does anyone have any advice on how to manage large feed uploads via cron?
Thanks in advance for your help
You have at least two problems. The first is you're trying to decompress the entire 700 MB file into memory. In fact, you're doing this twice.
while (!gzeof($fResponse) && (strlen($sResponse) < $chunkSize)) {
    $sResponse .= gzgets($fResponse, 4096);
}
$new_page .= $sResponse;
Both $sResponse and $new_page will hold a string that will eventually contain the entire 700 MB file. So that's 1.4 GB of memory you're eating up by the time the script finishes running, not to mention the cost of the string concatenation itself (while PHP handles strings better than some other languages, there are limits to what mutable vs. immutable strings will get you).
The second problem is you're running a regular expression over the increasingly large string in $new_page. This will put increased load on the server as $new_page gets larger and larger.
The easiest way to solve your problems is to split up the tasks.
Decompress the entire file to disk before doing any processing.
Use a stream-based XML parser, such as XMLReader or the old SAX event-based parser (see the sketch after this list).
Even with a stream/event based parser, storing the results in memory may end up eating up a lot of ram. In that case you'll want to take each match and store it on disk/in-a-database.
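For illustration, here is a minimal sketch of the stream-based approach, assuming the feed has already been decompressed to products.xml and that each record is a <product> element (the element name is taken from the question's regex; everything else is illustrative):

$reader = new XMLReader();
$reader->open('products.xml');
while ($reader->read()) {
    // only react to opening <product> tags; XMLReader never holds more
    // than the current node in memory
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'product') {
        $productXml = $reader->readOuterXml(); // just this one record
        //- Save collected data here, e.g. simplexml_load_string($productXml) -//
    }
}
$reader->close();

Memory use is then bounded by the size of a single <product> record rather than by the size of the feed.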
Since you said you're using LAMP, may I suggest an answer to one of my questions: Suggestions/Tricks for Throttling a PHP script.
He suggests using the nice command on the offending script to lower the chances of it bogging down the server.
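For what it's worth, a similar effect is available from inside PHP itself via proc_nice() (a sketch; proc_nice() only exists on Unix-like systems, and the value is added to the current priority):

if (function_exists('proc_nice')) {
    proc_nice(19); // drop to the lowest priority so the import yields the CPU to everything else
}

Putting that at the top of the cron script means you don't have to touch the crontab entry itself.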
An alternative would be to profile the script and see where the bottlenecks are. I would recommend xdebug and kcachegrind or webcachegrind. There are countless questions and websites available to help you set up script profiling.
You may also want to look at PHP's SAX event-based XML parser -
http://uk3.php.net/manual/en/book.xml.php
This is good for parsing large XML files (we use it for XML files of a similar size to the ones you are dealing with) and it does a pretty good job. No need for regexes then :)
You'd need to uncompress the file first before processing it.
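For example, a rough sketch of decompressing to disk without holding the whole file in memory (paths are illustrative):

$in  = gzopen('productFeed.xml.gz', 'rb');
$out = fopen('productFeed.xml', 'wb');
while (!gzeof($in)) {
    // copy in 256 KB chunks so memory stays flat regardless of file size
    fwrite($out, gzread($in, 262144));
}
gzclose($in);
fclose($out);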
Re Alan: the script will never hold 700MB in memory, as it looks like $sResponse is cleared immediately after it reaches $chunkSize and has been added to $new_page,
$new_page .= $sResponse;
$sResponse = "";
and $new_page is shortened each time a match is found, and cleared if there are no possible matches, for each $chunkSize chunk of data.
$new_page = substr($new_page, $thisOffset-1);
if (!(strstr($new_page, "<product "))) {
    $new_page = "";
}
Although I can't say I can see where the actual problem lies.
I am running a range of queries in BigQuery and exporting them to CSV via PHP. There are reasons why this is the easiest method for me to do this (multiple queries dependent on variables within an app).
I am struggling with memory issues when the result set is larger than 100MB. It appears that the memory usage of my code grows in line with the result set, which I thought would be avoided by paging. Here is my code:
$query = $bq->query($myQuery);
$queryResults = $bq->runQuery($query,['maxResults'=>5000]);
$FH = fopen($storagepath, 'w');
$rows = $queryResults->rows();
foreach ($rows as $row) {
    fputcsv($FH, $row);
}
fclose($FH);
The $queryResults->rows() function returns a Google Iterator which uses paging to scroll through the results, so I do not understand why memory usage grows as the script runs.
Am I missing a way to discard previous pages from memory as I page through the results?
UPDATE
I have noticed that since upgrading to the v1.4.3 BigQuery PHP API, the memory usage does cap out at 120MB for this process, even when the result set reaches far beyond this (currently processing a 1GB result set). But still, 120MB seems too much. How can I identify and fix where this memory is being used?
UPDATE 2
This 120MB works out to roughly 24KB per row of maxResults in the page. E.g. adding 1000 rows to maxResults adds 24MB of memory. So my question is now: why is one row of data using 24KB in the Google Iterator? Is there a way to reduce this? The data itself is < 1KB per row.
Answering my own question
The extra memory is used by a load of PHP type mapping and other data structure info that comes alongside the data from BigQuery. Unfortunately I couldn't find a way to reduce the memory usage below around 24KB per row multiplied by the page size. If someone finds a way to reduce the bloat that comes along with the data, please post below.
However, thanks to one of the comments, I realized you can extract a query result directly to CSV in a Google Cloud Storage bucket. This is really easy:
$query = $bq->query($myQuery);
$queryResults = $bq->runQuery($query);
$qJobInfo = $queryResults->job()->info();
$dataset = $bq->dataset($qJobInfo['configuration']['query']['destinationTable']['datasetId']);
$table = $dataset->table($qJobInfo['configuration']['query']['destinationTable']['tableId']);
$extractJob = $table->extract('gs://mybucket/'.$filename.'.csv');
$table->runJob($extractJob);
However, this still didn't solve my issue, as my result set was over 1GB, so I had to make use of the data sharding feature by adding a wildcard.
$extractJob = $table->extract('gs://mybucket/'.$filename.'*.csv');
This created ~100 shards in the bucket. These need to be recomposed using gsutil compose <shard filenames> <final filename>. However, gsutil only lets you compose 32 files at a time. Given I will have a variable number of shards, often above 32, I had to write some code to clean them up.
// Save the above job as a variable
$eJob = $table->runJob($extractJob);
$eJobInfo = $eJob->info();
// This bit of info from the job tells you how many shards were created
$eJobFiles = $eJobInfo['statistics']['extract']['destinationUriFileCounts'][0];
$composedFiles = 0; $composeLength = 0; $subfile = 0; $fileString = "";
while (($composedFiles < $eJobFiles) && ($eJobFiles > 1)) {
    while (($composeLength < 32) && ($composedFiles < $eJobFiles)) {
        // gsutil creates shards with a 12-digit number after the filename, so build a string of 32 such filenames at a time
        $fileString .= "gs://bucket/$filename" . str_pad($composedFiles, 12, "0", STR_PAD_LEFT) . ".csv ";
        $composedFiles++;
        $composeLength++;
    }
    $composeLength = 0;
    // Compose a batch of 32 into a subfile
    system("gsutil compose $fileString gs://bucket/" . $filename . "-" . $subfile . ".csv");
    $subfile++;
    $fileString = "";
}
if ($eJobFiles > 1) {
    // Compose all the subfiles
    system('gsutil compose gs://bucket/' . $filename . '-* gs://fm-sparkbeyond/YouTube_1_0/' . $filepath . '.gz');
}
Note that in order to give my Apache user access to gsutil I had to allow the user to create a .config directory in the web root. Ideally you would use the gsutil PHP library, but I didn't want the code bloat.
If anyone has a better answer, please post it. In particular:
Is there a way to get smaller output from the BigQuery library than 24KB per row?
Is there a more efficient way to clean up a variable number of shards?
This question already has answers here:
PHP what is the best way to write data to middle of file without rewriting file
I have a file that I'm reading with PHP. I want to look for some lines that start with some white space and then some key words I'm looking for (for example, "project_name:") and then change other parts of that line.
Currently, the way I handle this is to read the entire file into a string variable, manipulate that string, and then write the whole thing back to the file, fully replacing the entire file (via fopen(filepath, "wb") and fwrite()), but this feels inefficient. Is there a better way?
Update: After finishing my function I had time to benchmark it. I used a 1GB file for testing, but the results were unsatisfying :|
Yes, peak memory allocation is significantly smaller:
standard solution: 1.86 GB
custom solution: 653 KB (4096-byte buffer size)
But compared to the following solution there is just a slight performance boost:
ini_set('memory_limit', -1);
file_put_contents(
    'test.txt',
    str_replace('the', 'teh', file_get_contents('test.txt'))
);
The script above took ~16 seconds; the custom solution took ~13 seconds.
Summary: the custom solution is slightly faster on large files and consumes much less memory(!!!).
Also, if you want to run this in a web server environment, the custom solution is better, as many concurrent scripts would likely consume the whole available memory of the system.
Original Answer:
The only thing that comes to mind is to read the file in chunks that fit the file system's block size, and write the content (or modified content) to a temporary file. After processing finishes, you use rename() to overwrite the original file.
This would reduce the peak memory usage and should be significantly faster if the file is really large.
Note: On a Linux system you can get the file system block size using:
sudo dumpe2fs /dev/yourdev | grep 'Block size'
I got 4096
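If you would rather not shell out, PHP's stat() exposes the preferred I/O block size (usually the same value) where the OS supports it; a small sketch:

$st = stat('test.txt');
$buffersize = ($st !== false && $st['blksize'] > 0) ? $st['blksize'] : 4096; // fall back to 4096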
Here is the function:
function freplace($search, $replace, $filename, $buffersize = 4096) {
    $fd1 = fopen($filename, 'r');
    if (!is_resource($fd1)) {
        die('error opening file');
    }
    // the tempfile can be anywhere, but it must be on the same partition as the original
    $tmpfile = tempnam('.', uniqid());
    $fd2 = fopen($tmpfile, 'w+');
    // we store len(search) - 1 chars from the end of the buffer on each loop;
    // this is the maximum number of chars of the search string that can sit on the
    // border between two buffers
    $tmp = '';
    while (!feof($fd1)) {
        $buffer = fread($fd1, $buffersize);
        // prepend the rest from the last iteration
        $buffer = $tmp . $buffer;
        // replace
        $buffer = str_replace($search, $replace, $buffer);
        // store len(search) - 1 chars from the end of the buffer
        $tmp = substr($buffer, -1 * (strlen($search) - 1));
        // write the processed buffer (minus the rest)
        fwrite($fd2, $buffer, strlen($buffer) - strlen($tmp));
    }
    if (!empty($tmp)) {
        fwrite($fd2, $tmp);
    }
    fclose($fd1);
    fclose($fd2);
    rename($tmpfile, $filename);
}
Call it like this:
freplace('foo', 'bar', 'test.txt');
I am a PHP programmer and am currently working with files. I have to parse the data and insert it into a MySQL database. Since it is a large amount of data, PHP is unable to load or parse the file. I am getting a memory error even though I have increased memory_limit up to 1500MB.
FATAL: emalloc(): Unable to allocate 456185835 bytes
My text file contains text and XML data. I have to parse the XML data out of the text file.
eg: <ajax>some text goes here</ajax> non relativ text <ajax>other content</ajax>
In the above example I have to parse the content inside the tags. If anyone can give some advice on how to separate each tag into an individual file (e.g. 1.txt, 2.txt), that would be great (Perl, C, shell scripting, etc.).
Cough... a 1500 MB memory limit is a sure sign you have gone off the rails.
Where are you getting your file? I assume (given the size) that this is a local file. If you are trying to load the file into a string using file_get_contents(), it is worth noting that the docs are wrong and that said function does not in fact use memory-mapped I/O (cf. bug 52802). So this is not going to work for you.
What you might try is instead falling back to more C-like (but still PHP) constructs, in particular fopen(), fseek(), and fread(). If the file is of a known structure with newlines, you might also consider fgets().
These should allow you to read bytes in chunks into a reasonably sized buffer from which you can do your processing. Since it looks like you are processing tagged strings, you will have to play the usual games of keeping multiple buffers around in which you can accumulate data until it is processable. This is fairly standard stuff covered in most introductions to, e.g., stream processing in C.
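To make that concrete, here is a minimal sketch of the chunked-buffer approach, assuming the records look like the <ajax>...</ajax> example above and each record comfortably fits in memory on its own (file name, chunk size and tag are illustrative):

$fh = fopen('bigfile.txt', 'r');
$buffer = '';
$fileNo = 1;
while (!feof($fh)) {
    $buffer .= fread($fh, 65536); // accumulate one chunk
    // pull out every complete record currently in the buffer
    while (preg_match('~<ajax>(.*?)</ajax>~s', $buffer, $m, PREG_OFFSET_CAPTURE)) {
        file_put_contents($fileNo++ . '.txt', $m[1][0]); // one record per file
        // drop everything up to and including the record just handled
        $buffer = substr($buffer, $m[0][1] + strlen($m[0][0]));
    }
    // whatever is left in $buffer is an incomplete record; keep it for the next chunk
}
fclose($fh);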
Note that in PHP (or any other language for that matter), you are also going to have to potentially consider issues of string encoding because, in general, it is no longer the case that 1 byte == 1 character (cf. Unicode).
As you insinuate, PHP may well not be the best language for this task (though it certainly can do it). But your problem isn't really a language-specific one; you are running into a fundamental limitation of handling large files without memory-mapping.
You can actually parse XML with PHP a small block at a time, so you don't actually require much RAM at all:
set_time_limit(0);
define('__BUFFER_SIZE__', 131072);
define('__XML_FILE__', 'pf_1360591.xml');
function elementStart($p, $n, $a) {
    // handle opening of elements
}
function elementEnd($p, $n) {
    // handle closing of elements
}
function elementData($p, $d) {
    // handle cdata in elements
}

$xml = xml_parser_create();
xml_parser_set_option($xml, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($xml, XML_OPTION_SKIP_WHITE, 1);
xml_set_element_handler($xml, 'elementStart', 'elementEnd');
xml_set_character_data_handler($xml, 'elementData');

$f = fopen(__XML_FILE__, 'r');
if ($f) {
    while (!feof($f)) {
        $content = fread($f, __BUFFER_SIZE__);
        xml_parse($xml, $content, feof($f));
        unset($content);
    }
    fclose($f);
}
I've been having a major headache lately with parsing metadata from video files, and found that part of the problem is a disregard of various standards (or at least differences in interpretation) by video-production software vendors (among other reasons).
As a result I need to be able to scan through very large video (and image) files, of various formats, containers and codecs, and dig out the metadata. I've already got FFmpeg, ExifTool, Imagick and Exiv2 each handling different types of metadata in various file types, and I've been through various other options to fill some other gaps (please don't suggest libraries or other tools, I've tried them all :)).
Now I'm down to scanning the large files (up to 2GB each) for an XMP block (which is commonly written to movie files by the Adobe suite and some other software). I've written a function to do it, but I'm concerned it could be improved.
function extractBlockReverse($file, $searchStart, $searchEnd)
{
    $handle = fopen($file, "r");
    if ($handle)
    {
        $startLen = strlen($searchStart);
        $endLen = strlen($searchEnd);
        for ($pos = 0,
                 $output = '',
                 $length = 0,
                 $finished = false,
                 $target = '';
             $length < 10000 &&
                 !$finished &&
                 fseek($handle, $pos, SEEK_END) !== -1;
             $pos--)
        {
            $currChar = fgetc($handle);
            if (!empty($output))
            {
                $output = $currChar . $output;
                $length++;
                $target = $currChar . substr($target, 0, $startLen - 1);
                $finished = ($target == $searchStart);
            }
            else
            {
                $target = $currChar . substr($target, 0, $endLen - 1);
                if ($target == $searchEnd)
                {
                    $output = $target;
                    $length = $length + $endLen;
                    $target = '';
                }
            }
        }
        fclose($handle);
        return $output;
    }
    else
    {
        throw new Exception('file not found');
    }
    return false;
}

echo extractBlockReverse("very_large_video_file.mov",
                         '<x:xmpmeta',
                         '</x:xmpmeta>');
At the moment it's 'ok', but I'd really like to get the most out of PHP here without crippling my server, so I'm wondering if there is a better way to do this (or tweaks to the code that would improve it), as this approach seems a bit over the top for something as simple as finding a couple of strings and pulling out anything between them.
You can use one of the fast string searching algorithms, like Knuth-Morris-Pratt or Boyer-Moore, in order to find the positions of the start and end tags, and then read all the data between them.
You should measure their performance though, as with such small search patterns it might turn out that the constant factor of the chosen algorithm is not good enough for it to be worth it.
With files this big, I think that the most important optimization would be to NOT search the string everywhere. I don't believe that a video or image will ever have an XML block smack in the middle; or if it does, it will likely be garbage.
Okay, it IS possible - TIFF can do this, and JPEG too, and PNG; so why not video formats? But in real world applications, loose-format metadata such as XMP are usually stored last. More rarely, they are stored near the beginning of the file, but that's less common.
Also, I think that most XMP blocks will not have sizes too great (even if Adobe routinely pads them in order to be able to "almost always" quickly update them in-place).
So my first attempt would be to extract the first, say, 100 KB and the last 100 KB of the file, then scan these two blocks for "<x:xmpmeta".
If the search does not succeed, you will still be able to run the exhaustive search, but if it succeeds it will return in one ten-thousandth of the time. Conversely, even if this "trick" only succeeded one time in one thousand, it would still be worthwhile.
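A rough sketch of that head/tail shortcut (window size and function name are illustrative; the fallback to the exhaustive scan is left to the caller):

function findXmpQuick($file, $window = 102400)
{
    $size = filesize($file);
    $fh = fopen($file, 'r');
    $head = fread($fh, min($window, $size));   // first ~100 KB
    fseek($fh, max(0, $size - $window));
    $tail = fread($fh, $window);               // last ~100 KB
    fclose($fh);
    // XMP usually sits near the end, so check the tail first
    foreach ([$tail, $head] as $chunk) {
        if (preg_match('~<x:xmpmeta.*?</x:xmpmeta>~s', $chunk, $m)) {
            return $m[0];
        }
    }
    return false; // not in the head or tail; fall back to the full scan
}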
What is the best way to write to files in a large PHP application? Let's say there are lots of writes needed per second. What is the best way to go about this?
Could I just open the file and append the data, or should I open, lock, write and unlock?
What will happen if the file is being worked on and other data needs to be written? Will that activity be lost, or will it be saved? And if it is saved, will it halt the application?
If you have been, thank you for reading!
Here's a simple example that highlights the danger of simultaneous writes:
<?php
for ($i = 0; $i < 100; $i++) {
    $pid = pcntl_fork();
    // only spawn more children if we're not a child ourselves
    if (!$pid)
        break;
}

$fh = fopen('test.txt', 'a');

// The following is a simple attempt to get multiple processes to start at the same time.
$until = round(ceil(time() / 10.0) * 10);
echo "Sleeping until $until\n";
time_sleep_until($until);

$myPid = posix_getpid();
// create a line starting with pid, followed by 10,000 copies of
// a "random" char based on pid.
$line = $myPid . str_repeat(chr(ord('A') + $myPid % 25), 10000) . "\n";
for ($i = 0; $i < 1; $i++) {
    fwrite($fh, $line);
}
fclose($fh);
echo "done\n";
If appends were safe, you would get a file with 100 lines, all of which are roughly 10,000 chars long and begin with an integer. And sometimes, when you run this script, that's exactly what you'll get. Sometimes, however, a few appends will conflict and the file will get mangled.
You can find corrupted lines with grep '^[^0-9]' test.txt
This is because file append is only atomic if:
You make a single fwrite() call
and that fwrite() is smaller than PIPE_BUF (somewhere around 1-4k)
and you write to a fully POSIX-compliant filesystem
If you make more than a single call to fwrite during your log append, or you write more than about 4k, all bets are off.
Now, as to whether or not this matters: are you okay with having a few corrupt lines in your log under heavy load? Honestly, most of the time this is perfectly acceptable, and you can avoid the overhead of file locking.
I have a high-performance, multi-threaded application where all threads write (append) to a single log file. So far I have not noticed any problems with that; each thread writes multiple times per second and nothing gets lost. I think just appending to a huge file should be no issue. But if you want to modify already existing content, especially with concurrency, I would go with locking, otherwise a big mess can happen...
If concurrency is an issue, you should really be using databases.
If you're just writing logs, maybe you should take a look at the syslog() function, since syslog provides an API.
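A minimal example, in case it helps (identifiers are illustrative):

openlog('myapp', LOG_PID, LOG_LOCAL0);
syslog(LOG_INFO, 'something happened');
closelog();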
You could also delegate writes to a dedicated backend and do the job in an asynchronous manner.
These are my 2p.
Unless a unique file is needed for a specific reason, I would avoid appending everything to one huge file. Instead, I would wrap (rotate) the file by time and size. A couple of configuration parameters (wrap_time and wrap_size) could be defined for this.
Also, I would probably introduce some buffering to avoid waiting for the write operation to complete.
PHP is probably not the best-suited language for this kind of operation, but it is still possible.
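A rough sketch of the wrap-by-time-and-size idea above (the names wrap_time and wrap_size follow the suggestion; thresholds, file names and the function name are illustrative):

function writeWrapped($line, $wrap_time = 3600, $wrap_size = 10485760)
{
    $bucket = floor(time() / $wrap_time) * $wrap_time; // one base name per time window
    $part = 0;
    clearstatcache();
    // move on to a new part once the current one reaches wrap_size
    while (file_exists("app-$bucket.$part.log")
        && filesize("app-$bucket.$part.log") >= $wrap_size) {
        $part++;
    }
    file_put_contents("app-$bucket.$part.log", $line . "\n", FILE_APPEND);
}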
Use flock()
See this question
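In short, something along these lines (a sketch; file name and log line are illustrative):

$line = date('c') . " something happened\n";
$fh = fopen('app.log', 'a');
if ($fh && flock($fh, LOCK_EX)) { // block until we hold an exclusive lock
    fwrite($fh, $line);
    fflush($fh);                  // flush before releasing the lock
    flock($fh, LOCK_UN);
}
if ($fh) {
    fclose($fh);
}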
If you just need to append data, PHP should be fine with that, as the filesystem should take care of simultaneous appends.