file_get_contents and file_put_contents with large files - php

I'm trying to get file contents, replace some parts of it using regular expressions and preg_replace and save it to another file:
$content = file_get_contents('file.txt', true);
$content_replaced = preg_replace('/\[\/m\]{1}\s+(\{\{.*\}\})\s+[\x{4e00}-\x{9fa5}]+/u', 'replaced text', $content);
if ($content_replaced) {
    file_put_contents('file_new.txt', $content_replaced);
    echo "Successful!";
} else {
    echo "Some error occurred";
}
This piece of code works fine with small files, but when I try it on the original file, which is about 60 MB, it just keeps giving me the message "Some error occurred".
Any suggestions are greatly appreciated.
Update: no errors in the logs, and memory_limit is set to 1024M.

I've had max/limit issues with file_put_contents.
No idea what the limits might be, but using fwrite solved my troubles and I put down the bottle.
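For reference, a minimal sketch of that swap, reusing the variable names from the question (so treat it as an illustration, not a drop-in fix):
$fh = fopen('file_new.txt', 'wb'); // write with fopen()/fwrite() instead of file_put_contents()
if ($fh !== false) {
    fwrite($fh, $content_replaced);
    fclose($fh);
    echo "Successful!";
}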

You're probably running out of memory. What's the memory_limit set to? (phpinfo() will tell you). You may be able to increase the memory limit like:
ini_set('memory_limit','128M');

I'm pretty sure you're hitting some regex limit. Heck, some time ago I hit a limit with 1000 chars... with 60 MB of input I bet you'll hit regex limits everywhere, even with really simple patterns. I would at least try to simplify the pattern as much as possible, making it non-greedy with .*? instead of .* where possible.
To get more information, just check the return value of preg_last_error().
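For example, a sketch of that check, reusing the pattern and input from the question (the pcre.backtrack_limit tweak at the end is an assumption about which limit is being hit):
$pattern = '/\[\/m\]{1}\s+(\{\{.*\}\})\s+[\x{4e00}-\x{9fa5}]+/u';
$content_replaced = preg_replace($pattern, 'replaced text', $content);
if ($content_replaced === null) {
    // PREG_BACKTRACK_LIMIT_ERROR is a likely suspect with 60 MB of input
    var_dump(preg_last_error());
    // Assumption: raising the backtrack limit may help, at the cost of more work
    ini_set('pcre.backtrack_limit', '10000000');
}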


Is there a maximum number of lines readable by PHP functions

I have this file of 10 million words, one word per line. I'm trying to open that file, read every line, put it in an array, and count the number of occurrences of each word.
wartek
mei_atnz
sommerray
swaggyfeed
yo_bada
ronnieradke
… and so on (10M+ lines)
I can open the file, read its size, even parse it line by line and echo the line to the browser (it's very long, of course), but when I try to perform any other operation, the script just refuses to execute. No error, no warning, no die(…), nothing.
Accessing the file is always fine; it's the other operations that don't succeed. I tried this and it worked…
while (!feof($pointer)) {
    $row = fgets($pointer);
    print_r($row);
}
… but this didn't :
while (!feof($pointer)) {
    $row = fgets($pointer);
    array_push($dest, $row);
}
I also tried SplFileObject and file($source, FILE_IGNORE_NEW_LINES), with the same result every time (fails with the big file, works with a small one).
Guessing that the issue is not the size (150 KB) but the length (10M+ lines), I chunked the file down to ~20k lines without any improvement, then reduced it again to ~8k lines, and it worked.
I also removed the time limit with set_time_limit(0); and removed (almost) any memory limit, both in php.ini and in my script with ini_set('memory_limit', '8192M');. As for the errors I could have, I set error_reporting(E_ALL); at the top of my script.
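For reference, this is how those settings sit at the top of the script:
error_reporting(E_ALL);           // report every error and warning
set_time_limit(0);                // no execution time limit
ini_set('memory_limit', '8192M'); // effectively no memory limit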
So the questions are:
is there a maximum number of lines that can be read by PHP built-in functions?
why can I echo or print_r the lines, but not perform any other operations on them?
I think you might be running into a long execution time:
How to increase the execution timeout in php?
Different operations take different amounts of time. Printing might be a lot cheaper than pushing 10M items into an array one by one. It's strange that you don't get any error messages; you should be getting a 'maximum execution time exceeded' error somewhere.
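As a side note, if the end goal is just the per-word counts, a rough sketch that never builds the 10M-element array (assuming $source is the path to the word list, one word per line) would be:
$counts = array();
$fp = fopen($source, 'r');
while (($row = fgets($fp)) !== false) {
    $word = trim($row);
    if ($word === '') {
        continue; // skip blank lines
    }
    if (!isset($counts[$word])) {
        $counts[$word] = 0;
    }
    $counts[$word]++; // memory grows with distinct words, not with total lines
}
fclose($fp);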

Is it possible to change the behavior of PHP's print_r function [duplicate]

This question already has answers here:
making print_r use PHP_EOL
I've been coding in PHP for a long time (15+ years now), and I usually do so on a Windows OS, though most of the time it's for execution on Linux servers. Over the years I've run up against an annoyance that, while not important, has proved to be a bit irritating, and I've gotten to the point where I want to see if I can address it somehow. Here's the problem:
When coding, I often find it useful to output the contents of an array to a text file so that I can view its contents. For example:
$fileArray = file('path/to/file');
$faString = print_r($fileArray, true);
$save = file_put_contents('fileArray.txt', $faString);
Now when I open the file fileArray.txt in Notepad, the contents of the file are all displayed on a single line, rather than the nice, pretty structure you see if the file is opened in WordPad. This is because, regardless of OS, PHP's print_r function uses \n for newlines rather than \r\n. I can certainly perform such a replacement myself by adding just one line of code, and therein lies the problem. That one, single line of extra code translates back through my years into literally hundreds of extra steps that should not be necessary. I'm a lazy coder, and this has become unacceptable.
Currently, on my dev machine, I've got a different sort of work-around in place (shown below), but it has its own set of problems, so I'd like to find a way to "coerce" PHP into putting in the "proper" newline characters without all that extra code. I doubt this is possible, but I'll never find out if I never ask, so...
Anyway, my current work-around goes like this. I have, in my PHP include path, a file (print_w.php) which includes the following code:
<?php
function print_w($in, $saveToString = false) {
    $out = print_r($in, true);
    $out = str_replace("\n", "\r\n", $out);
    switch ($saveToString) {
        case true: return $out;
        default: echo $out;
    }
}
?>
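With that in place, the earlier example would become something like:
$fileArray = file('path/to/file');
print_w($fileArray);                    // echoes with \r\n line endings
$faString = print_w($fileArray, true);  // or capture it as a string instead
$save = file_put_contents('fileArray.txt', $faString); // now displays properly in Notepad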
I also have auto_prepend_file set to this same file in php.ini, so that it automatically includes it every time PHP executes a script on my dev machine. I then use the function print_w instead of print_r while testing my scripts. This works well, so long as when I upload a script to a remote server I make sure that all references to the function print_w are removed or commented out. If I miss one, I (of course) get a fatal error, which can prove more frustrating than the original problem, but I make it a point to carefully proofread my code prior to uploading, so it's not often an issue.
So after all that rambling, my question is: is there a way to change the behavior of print_r (or similar PHP functions) to use Windows newlines rather than Linux newlines on a Windows machine?
Thanks for your time.
OK, after further research, I've found a better work-around that suits my needs and eliminates the need to call a custom function instead of print_r. The new work-around goes like this:
I still have an included file (I've kept the same name so as not to have to mess with php.ini), and php.ini still has the auto_prepend_file setting in place, but the code in print_w.php changes a bit:
<?php
// rename_function() (from the APD extension) keeps the original implementation
// available under a new name so that print_r() can be redefined.
rename_function('print_r', 'print_rw');
function print_r($in, $saveToString = false) {
    $out = print_rw($in, true);
    $out = str_replace("\n", "\r\n", $out);
    switch ($saveToString) {
        case true: return $out;
        default: echo $out;
    }
}
?>
This effectively alters the behavior of the print_r function on my local machine, without my having to call a custom function or make sure that all references to that custom function are neutralized. Using PHP's rename_function, I was able to rewrite how print_r behaves, making it possible to address my problem.

PHP: How to solve an ob_start() in combination with imagepng() issue?

I use the following code to create an image and encode it to base64. There is no direct output of the image.
ob_start(); // catching the output buffer
imagepng($imgSignature);
$base64Signature = base64_encode(ob_get_contents());
ob_end_clean();
ob_start() recently started throwing a 500 error and I'm having trouble figuring out the issue. The server uses PHP 5.4.11. I really don't know whether it was running the same version when I installed the script, or whether the memory is running full. I know that ob_start has changed across PHP versions. I'm really having a hard time wrapping my head around this. Is the script correct for PHP 5.4.11?
I really appreciate any help.
I'm not sure how to solve your issue with ob_start(), but I have an alternative for what you are doing that doesn't involve output buffers.
imagepng($imgSignature, 'php://memory/file.png');
$base64Signature = base64_encode(file_get_contents('php://memory/file.png'));
This is basically saving the png image to a virtual temporary file that exists only in memory, then you read it back and have the same result.
My theory about your error:
At some point in your code, you will have this image stored in memory multiple times: in $imgSignature, in the internal buffer you created with ob_start(), in the buffer you read with ob_get_contents(), and in the resulting value of base64_encode(). Pretty much all in one line. God only knows how much memory it's using, not to mention you probably allocated more resources earlier while building this image.
It is important not to have too much allocated at the same time, especially when dealing with memory-consuming resources like images. If you unset() or overwrite variables you no longer need, you allow the garbage collector to dispose of those unreferenced resources and free the memory.
For instance, you can change the way this piece of code was written to this:
ob_start();
imagepng($imgSignature);
imagedestroy($imgSignature);
$data = ob_get_contents();
ob_end_clean();
$data = base64_encode($data);
I dropped $imgSignature as soon as I didn't need it anymore, ended and cleaned my buffer as soon as I was done getting what I wanted from it, and then disposed of the raw $data by overwriting it with the base64-encoded version that was really what I wanted.
Now this will use significantly less memory. If you extend this to the rest of your code, or at least to the parts that use a lot of memory, like the images you loaded or created with the GD2 lib, it should optimize the memory usage of your script and give you that extra space you need.

fgetcsv returns too many entries

I have the following code:
while (!feof($file)) {
    $arrayOfIdToBodyPart = fgetcsv($file, 0, "\t");
    if (count($arrayOfIdToBodyPart) == 2) {
the problem is, the contents of the file look like this:
39 ankle
40 tibia
41 Vastus Intermedius
and so on
Sometimes the test in the if will show three entries, with the first being the number, the second being the name, and the third being just... empty.
This causes the if block to fail, and me to be sad. I know I can just make the if test for >= 2, but is there any way I can get it to recognise that there are really only two items? I don't like that fgetcsv is finding "mystery" characters at the end of the line.
Is this possibly a case of a Unix server reading a Windows-formatted file? If so, and I'm running an Ubuntu server without dos2unix, where do I get it?
You probably have tabs at the end of a line:
value<tab>value<tab><newline>
If that's the case, dos2unix won't help you. You might have to do something like read each line into a variable, trim() the variable, and then use str_getcsv() to split it.
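Something along these lines (a sketch, assuming $file is the handle you already opened):
while (($line = fgets($file)) !== false) {
    $line = trim($line); // strips the trailing tab and the newline
    if ($line === '') {
        continue; // skip blank lines
    }
    $arrayOfIdToBodyPart = str_getcsv($line, "\t");
    if (count($arrayOfIdToBodyPart) == 2) {
        // same processing as before
    }
}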
Is it possible that you have a tab at the end of those lines? They are invisible and often hard to spot... you might want to double check.
Also, if you are working with CSV files while running Windows locally and the server is Unix, I found this line:
ini_set('auto_detect_line_endings', true);
saves a lot of headaches.

PHP Loop - expression/function causing serious delay

I was wondering if anybody could shed any light on this problem... PHP 5.3.0 :)
I have a loop which grabs the contents of a CSV file (large, 200 MB), handles the data, and builds a stack of variables for MySQL inserts; once the loop is complete and the variables are created, I insert the information.
Firstly, the MySQL insert performs perfectly, with no delays, so all is fine there; it's the LOOP itself that has the delay. I was originally using fgetcsv() to read the CSV file, but compared to file_get_contents() it had a serious delay, so I switched to file_get_contents(). The loop runs in a matter of seconds until I attempt to add a function (I've also tried the expression inside the loop without the function, to see if it helps) that creates an array from the CSV data of each line; this is what is causing the serious delay in parsing time (the difference is about 30 seconds on this 200 MB file, though I guess it depends on the size of the CSV file).
Here's some code so you can see what I'm doing:
$filename = "file.csv";
$content = file_get_contents($filename);
$rows = explode("\n", $content);
foreach ($rows as $data) {
$data = preg_replace("/^\"(.*)\"$/","$1",preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data))); //THIS IS THE CULPRIT CAUSING SLOW LOADING?!?
}
The above loop performs almost instantly without the line:
$data = preg_replace("/^\"(.*)\"$/","$1",preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data)));
I've also tried creating a function as below (outside the loop):
function csv_string_to_array($str) {
    $expr = "/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/";
    $results = preg_split($expr, trim($str));
    return preg_replace("/^\"(.*)\"$/", "$1", $results);
}
and calling the function instead of the one-liner:
$data = csv_string_to_array($data);
With again no luck :(
Any help on this would be appreciated. I'm guessing the fgetcsv function performs in a very similar way, given the delay it causes: looping through and creating an array from each line of data.
Danny
The regex subexpressions (bounded by "(...)") are the issue. It's trivial to show that adding these to an expression can greatly reduce its performance. The first thing I would try is to stop using preg_replace() to simply remove leading and trailing double quotes (trim() would be a better bet for that) and see how much that helps. After that you might need to try a non-regex way to parse the line.
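As a sketch of that first change, keeping the split pattern from the question but dropping the preg_replace() (note: unlike the original regex, trim() strips any number of surrounding quotes):
$fields = preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data));
$fields = array_map(function ($field) {
    return trim($field, '"'); // strip surrounding double quotes without a regex
}, $fields);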
I partially found a solution: I'm sending a batch to loop over only 1000 lines at a time (PHP loops in blocks of 1000 until it reaches the end of the file).
I'm then only setting:
$data = preg_replace("/^\"(.*)\"$/","$1",preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data)));
on those 1000 lines, so that it's not being applied to the WHOLE file at once, which was causing the issues.
It now loops and inserts 1000 rows into the MySQL database in 1-2 seconds, which I'm happy with. I've set up the script to process 1000 rows, remember its last location, then loop through the next 1000 until it reaches the end; it seems to be working OK!
I'd say the major culprit is the complexity of the preg_split() regexp.
And the explode() is probably eating some seconds.
$content = file_get_contents($filename);
$rows = explode("\n", $content);
could be replaced by:
$rows = file($filename); // returns an array
But I second the above suggestion from ITroubs: fgetcsv() would probably be a much better solution.
I would suggest using fgetcsv for parsing the data. It seems like memory may be your biggest constraint, so to avoid consuming 200 MB of RAM, parse the file line by line as follows:
$fp = fopen($input, 'r');
while (($row = fgetcsv($fp, 0, ',', '"')) !== false) {
    $out = '"' . implode('", "', $row) . '"'; // quoted, comma-delimited output
    // perform work
}
Alternatively: the lookahead assertions in that preg pattern are typically very expensive. It can sometimes be faster to process these lines using explode() and trim() with its $charlist parameter.
The other alternative, if you still want to use preg, is to add the S modifier to try to speed up the expression.
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
S
When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.
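Applied to the pattern from the question, that is just:
$expr = "/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/S"; // S: study the pattern once, reuse the analysis
$results = preg_split($expr, trim($str));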
By the way, I don't think your function is doing what you think it should: it won't actually modify the $rows array when you've exited from the loop. To do that, you need something more like:
foreach ($rows as $key => $data) {
    $rows[$key] = preg_replace("/^\"(.*)\"$/", "$1", preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data)));
}
