I was wondering if anybody could shed any light on this problem. PHP 5.3.0 :)
I have a loop that grabs the contents of a large (200 MB) CSV file, handles the data, and builds up a stack of variables for MySQL inserts; once the loop is complete and the variables are created, I insert the information.
Now firstly, the MySQL insert performs perfectly, with no delays; it's the LOOP itself that has the delay. I was originally using fgetcsv() to read the CSV file, but compared to file_get_contents() it had a serious delay, so I switched to file_get_contents(). The loop completes in a matter of seconds until I add a function (I've also tried the expression inline in the loop, without the function) to create an array from each line of CSV data. That is what is causing serious delays in parsing time! (The difference is about 30 seconds on this 200 MB file, but I guess it depends on the size of the CSV file.)
Here's some code so you can see what I'm doing:
$filename = "file.csv";
$content = file_get_contents($filename);
$rows = explode("\n", $content);
foreach ($rows as $data) {
$data = preg_replace("/^\"(.*)\"$/","$1",preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data))); //THIS IS THE CULPRIT CAUSING SLOW LOADING?!?
}
Running the above loop, will perform almost instantly without the line:
$data = preg_replace("/^\"(.*)\"$/","$1",preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data)));
I've also tried creating a function as below (outside of the loop):
function csv_string_to_array($str) {
$expr="/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/";
$results=preg_split($expr,trim($str));
return preg_replace("/^\"(.*)\"$/","$1",$results);
}
and calling the function instead of the one liner:
$data = csv_string_to_array($data);
With again no luck :(
Any help would be appreciated. I'm guessing the fgetcsv function performs in a very similar way, looping through and creating an array from each line of data, which would explain the delay it causes.
Danny
The regex subexpressions (bounded by "(...)") are the issue. It's trivial to show that adding these to an expression can greatly reduce its performance. The first thing I would try is to stop using preg_replace() to simply remove leading and trailing double quotes (trim() would be a better bet for that) and see how much that helps. After that you might need to try a non-regex way to parse the line.
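For example, here is a minimal sketch of that first change: keep your original preg_split() but swap the preg_replace() for trim() with a character list (note that trim() strips any number of leading/trailing quotes, which should be fine for simple CSV fields):
$fields = preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data));
$fields = array_map(function ($field) {
    return trim($field, '"'); // strip surrounding double quotes without a second regex pass
}, $fields);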
I've partially found a solution: I'm sending a batch to loop over only 1000 lines at a time (PHP loops in blocks of 1000 until it reaches the end of the file).
I'm then only setting:
$data = preg_replace("/^\"(.*)\"$/","$1",preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data)));
on those 1000 lines, so that it's not being run over the WHOLE file, which was causing the issues.
It is now looping and inserting 1000 rows into the MySQL database in 1-2 seconds, which I'm happy with. I've set up the script to loop over 1000 rows, remember its last location, then loop over the next 1000 until it reaches the end, and it seems to be working OK!
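Roughly, the batching idea looks like this; a sketch only, where $offset is hypothetical and would come from wherever the script stores its last position between runs:
$handle = fopen("file.csv", "r");
fseek($handle, $offset); // jump to where the previous batch stopped
$batch = array();
for ($i = 0; $i < 1000 && ($line = fgets($handle)) !== false; $i++) {
    $batch[] = preg_replace("/^\"(.*)\"$/", "$1",
        preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($line)));
}
$offset = ftell($handle); // save this somewhere for the next batch
fclose($handle);
// build the MySQL insert from $batch here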
I'd say the major culprit is the complexity of the preg_split() regexp.
And the explode() is probably eating some seconds.
$content = file_get_contents($filename);
$rows = explode("\n", $content);
could be replaced by:
$rows = file($filename); // returns an array of lines
But, I second the above suggestion from ITroubs, fgetcsv() would probably be a much better solution.
I would suggest using fgetcsv for parsing the data. It seems like memory may be your biggest problem, so to avoid consuming 200 MB of RAM you should parse line by line, as follows:
$fp = fopen($input, 'r');
while (($row = fgetcsv($fp, 0, ',', '"')) !== false) {
$out = '"' . implode($row, '", "') . '"'; // quoted, comma-delimited output
// perform work
}
Alternatively: Using conditionals in preg is typically very expensive. It can sometimes be faster to process these lines using explode() and trim() with its $charlist parameter.
The other alternative, if you still want to use preg, is to add the S modifier to try to speed up the expression.
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
S
When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.
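Applied to the expression in question, that just means appending S after the closing delimiter; a sketch with the same pattern:
$expr = "/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/S"; // trailing S enables the extra study step
$results = preg_split($expr, trim($str));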
By the way, I don't think your function is doing what you think it should: it won't actually modify the $rows array when you've exited from the loop. To do that, you need something more like:
foreach ($rows as $key => $data) {
    $rows[$key] = preg_replace("/^\"(.*)\"$/", "$1", preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data)));
}
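Or, equivalently, iterate by reference so the assignment writes back into $rows (just remember to unset the reference after the loop):
foreach ($rows as &$data) {
    $data = csv_string_to_array($data); // the function defined earlier in this thread
}
unset($data); // break the lingering reference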
I've made this script to extract data from a CSV file.
$url = 'https://flux.netaffiliation.com/feed.php?maff=3E9867FCP3CB0566CA125F7935102835L51118FV4';
$data = array_map(function($line) { return str_getcsv($line, '|'); }, file($url));
It's working exactly as I want but I've just been told that it's not the proper way to do it and that I really should use fgetcsv instead.
Is that right? I've tried many ways to do it with fgetcsv but couldn't manage to get anything close.
Here is an example of what I would like to get as output:
$data[4298][0] = 889698467841
$data[4298][1] = Figurine Funko Pop! - N° 790 - Disney : Mighty Ducks - Coach Bombay
$data[4298][2] = 108740
$data[4298][3] = 14.99
First of all, there is no single "proper" way to do things in programming. It is up to you and depends on your use case.
I just downloaded the CSV file and it is about 20 MB. In your solution you download the whole file at once. If you have no memory restrictions and you do not have to give fast feedback to the caller (i.e. the delay of downloading the whole file does not matter), your solution is the better one if you want to guarantee processing of the whole content: you read all the content at once, and the further processing of it no longer depends on other things like your Internet connection.
If you want to use fgetcsv, you would read from the URL line by line, sequentially. Your connection has to stay open until each line has been processed. In this case you do not need a big memory allocation, but it would take longer to process the whole content.
Both methods have their pros and contras. You should know what is your goal. How often would you run this script? You should consider your use case and make a decision which method is the best for you.
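For what it's worth, if you do want to try fgetcsv here, a minimal sketch could look like this (it assumes allow_url_fopen is enabled so fopen() can read the URL, and it streams the feed line by line instead of downloading everything first):
$url = 'https://flux.netaffiliation.com/feed.php?maff=3E9867FCP3CB0566CA125F7935102835L51118FV4';
$data = [];
if (($handle = fopen($url, 'r')) !== false) {
    while (($row = fgetcsv($handle, 0, '|')) !== false) {
        $data[] = $row; // same shape as the str_getcsv() version
    }
    fclose($handle);
}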
Here is the same result without array_map():
$url = 'https://flux.netaffiliation.com/feed.php?maff=3E9867FCP3CB0566CA125F7935102835L51118FV4';
$lines = file($url);
$data = [];
foreach($lines as $line)
{
$data[] = str_getcsv(trim($line), '|');
//optionally:
//$data[] = explode('|',trim($line));
}
$lines = null;
First of all, I am not looking for an answer saying "Check your PHP memory limit" or "You need to add more memory" or that kind of thing... I am on a dedicated machine with 8 GB of RAM; the memory limit is set to 512 MB. I always get an out-of-memory error on one single line:
To clarify: This part of the code belongs to Joomla! CMS.
function get($id, $group, $checkTime){
$data = false;
$path = $this->_getFilePath($id, $group);
$this->_setExpire($id, $group);
if (file_exists($path)) {
$data = file_get_contents($path);
if($data) {
// Remove the initial die() statement
$data = preg_replace('/^.*\n/', '', $data); // Out of memory here
}
}
return $data;
}
This is part of Joomla's caching... The function reads the cache file, removes the first line (which blocks direct access to the file), and returns the rest of the data.
As you can see, the line uses preg_replace to remove the first line of the cache file, which is always:
<?php die("Access Denied"); ?>
My question is: this seems like a simple operation (removing the first line from the file content), so could it really consume a lot of memory if the initial $data is huge? If so, what's the best way to work around that issue? I don't mind having the cache files without that die() line; I can take care of security and block direct access to the cache files myself.
Am I wrong?
UPDATE
As suggested in the posts, regex seems to create more problems than it solves. I've tried:
echo memory_get_usage() . "\n";
after the regex, then tried the same statement using substr(). The difference in memory usage is very slight; almost nothing.
Thanks for your contributions; I am still trying to find out why this happens.
Use substr() to avoid the memory-hungry preg_replace(), like this:
$data = substr($data, strpos($data, '?>') + 3);
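If you would rather just drop the first line whatever it contains, cutting at the first line break works too (a sketch; it assumes the cache file always has at least one newline after the die() statement):
$data = substr($data, strpos($data, "\n") + 1); // keep everything after the first line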
As general advice, don't use regular expressions if you can do the same task with other string/array functions; regular expression functions are slower and consume more memory than the core string/array functions.
The PHP docs explicitly warn about this too; see for example:
http://www.php.net/manual/en/function.preg-match.php#refsect1-function.preg-match-notes
http://www.php.net/manual/en/function.preg-split.php#refsect1-function.preg-split-notes
Don't use a string function to replace something in a huge string. You can cycle through the lines of a file and just break after you have found what you're looking for.
check the PHP docs here:
http://php.net/manual/en/function.fgets.php
basically what #cbuckley just said :p
If you just want to remove the first line of a file and return the rest, you can make use of file():
$lines = file($path, FILE_IGNORE_NEW_LINES); // drop the trailing newlines so implode() doesn't double them
array_shift($lines); // remove the initial die() line
$data = implode("\n", $lines);
Instead of using file_get_contents(), which reads the entire file at once (possibly too big to run regexps on), you should use fopen() in combination with fgets() (http://php.net/fgets). That function reads the file line by line.
You can then choose to do a regexp on a specific line. Or in your case just skip the entire line.
So instead of:
$data = file_get_contents($path);
if($data) {
// Remove the initial die() statement
$data = preg_replace('/^.*\n/', '', $data); // Out of memory here
}
try this:
$fileHandler = fopen($path, 'r');
$lineNumber = 0;
$data = '';
while (($line = fgets($fileHandler)) !== false) {
    if ($lineNumber++ != 0) { // skip the initial die() statement
        $data .= $line; // or maybe echo $line directly so $data doesn't take up too much memory as well
    }
}
fclose($fileHandler);
I suggest you use this in the file including the cached files:
define('INCLUDESALLOW', 1);
and in the file that will be included:
if( !defined('INCLUDESALLOW') ) die("Access Denied");
Then just use include instead of file_get_contents(). This would run PHP code in the included file though; not 100% sure that is what you need.
There are times when you will use more memory than the 8 MB PHP has allotted. If you're unable to use less memory by making your code more efficient, you might have to increase your available memory. This can be done in two ways.
The limit can be set to a global default in php.ini:
memory_limit = 32M
Or you can override it in your script like this:
<?php
ini_set('memory_limit', '64M');
...
For more on the PHP memory limit you can see this SO question or ini.memory-limit.
I have a 1.3 GB text file that I need to extract some information from in PHP. I have researched it and come up with a few ways to do what I need to do, but as always I'm after a little clarification on which method would be best, or whether a better one exists that I do not know about.
The information I need in the text file is only the first 40 characters of each line, and there are around 17million lines in the file. The 40 characters from each line will be inserted into a database.
The methods I have are below;
// REMOVE TIME LIMIT
set_time_limit(0);
// REMOVE MEMORY LIMIT
ini_set('memory_limit', '-1');
// OPEN FILE
$handle = @fopen('C:\Users\Carl\Downloads\test.txt', 'r');
if($handle) {
while(($buffer = fgets($handle)) !== false) {
$insert[] = substr($buffer, 0, 40);
}
if(!feof($handle)) {
// END OF FILE
}
fclose($handle);
}
The above reads one line at a time and gets the data. I have all the database inserts sorted, doing 50 inserts at a time, ten times over, in a transaction.
The next method is really the same as above, but calls file() to store all the lines in an array before doing a foreach to get the data. I am not sure about this method though, as the array would essentially have over 17 million values.
Another method would be to extract only part of the file, rewrite the file with the unused data, and after that part has been processed re-invoke the script using a header call.
What would be the best way to get this done in the quickest and most efficient manner? Or is there a better approach that I haven't thought of?
Also, I plan to use this script with WAMP, but running it in a browser while testing has caused timeout problems, even with the script timeout set to 0. Is there a way I can execute the script without accessing the page through a browser?
You have it good so far; don't use the file() function, as it would most probably hit the RAM usage limit and terminate your script.
I wouldn't even accumulate stuff into the $insert[] array, as that would waste RAM as well. If you can, insert into the database right away.
BTW, there is a nice tool called "cut" that you could use to process the file.
cut -c1-40 file.txt
You could even redirect cut's stdout to some PHP script that inserts into database.
cut -c1-40 file.txt | php -f inserter.php
inserter.php could then read lines from php://stdin and insert into DB.
"cut" is a standard tool available on all Linuxes, if you use Windows you can get it with MinGW shell, or as part of msystools (if you use git) or install native win32 app using gnuWin32.
Why are you doing this in PHP when your RDBMS almost certainly has bulk import functionality built in? MySQL, for example, has LOAD DATA INFILE:
LOAD DATA INFILE 'data.txt'
INTO TABLE `some_table`
FIELDS TERMINATED BY ''
LINES TERMINATED BY '\n'
( @line )
SET `some_column` = LEFT( @line, 40 );
One query.
MySQL also has the mysqlimport utility that wraps this functionality from the command line.
None of the above. The problem with using fgets() is that it does not work the way you expect: when the maximum length is reached, the next call to fgets() will continue on the same line. You have correctly identified the problem with using file(). The third method is an interesting idea, and you could pull it off with other solutions as well.
That said, your first idea of using fgets() is pretty close; however, we need to modify its behaviour slightly. Here's a customized version that will work as you'd expect:
function fgetl($fp, $len) {
    $l = 0;
    $buffer = '';
    // read character by character, stopping at the end of the line
    while (false !== ($c = fgetc($fp)) && PHP_EOL !== $c) {
        if ($l < $len)
            $buffer .= $c; // keep only the first $len characters of the line
        ++$l;
    }
    if (0 === $l && false === $c) {
        return false; // nothing read and end of file reached
    }
    return $buffer;
}
Execute the insert operation immediately or you will waste memory. Make sure you are using prepared statements to insert this many rows; this will drastically reduce execution time. You don't want to submit the full query on each insert when you can only submit the data.
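For instance, a sketch with PDO using the fgetl() helper above (the DSN, credentials, table and column names are placeholders):
$pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO imported (prefix) VALUES (?)'); // hypothetical table/column
$fp   = fopen('C:\Users\Carl\Downloads\test.txt', 'r');
while (false !== ($line = fgetl($fp, 40))) {
    $stmt->execute(array($line)); // insert right away instead of buffering everything in $insert[]
}
fclose($fp);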
This question was asked on a message board, and I want to get a definitive answer and intelligent debate about which method is more semantically correct and less resource intensive.
Say I have a file with each line in that file containing a string. I want to generate an MD5 hash for each line and write it to the same file, overwriting the previous data. My first thought was to do this:
$file = 'strings.txt';
$lines = file($file);
$handle = fopen($file, 'w+');
foreach ($lines as $line)
{
fwrite($handle, md5(trim($line))."\n");
}
fclose($handle);
Another user pointed out that file_get_contents() and file_put_contents() were better than using fwrite() in a loop. Their solution:
$thefile = 'strings.txt';
$newfile = 'newstrings.txt';
$current = file_get_contents($thefile);
$explodedcurrent = explode("\n", $current); // split the file contents (not the filename) on real newlines
$temp = '';
foreach ($explodedcurrent as $string)
$temp .= md5(trim($string)) . "\n"; // double quotes so this is a real newline
$newfile = file_put_contents($newfile, $temp);
My argument is that since the main goal of this is to get the file into an array, and file_get_contents() is the preferred way to read the contents of a file into a string, file() is more appropriate and allows us to cut out another unnecessary function, explode().
Furthermore, by directly manipulating the file using fopen(), fwrite(), and fclose() (which is the exact same as one call to file_put_contents()) there is no need to have extraneous variables in which to store the converted strings; you're writing them directly to the file.
My method is the exact same as the alternative - the same number of opens/closes on the file - except mine is shorter and more semantically correct.
What do you have to say, and which one would you choose?
This should be more efficient and less resource-intensive than the previous two methods:
$file = 'passwords.txt';
$passwords = file($file);
$converted = fopen($file, 'w+');
$i = 0; // initialize the counter once, before the loop
while (count($passwords) > 0)
{
    fwrite($converted, md5(trim($passwords[$i])) . "\n"); // one hash per line, as in the original
    unset($passwords[$i]); // free each processed line to keep memory usage down
    $i++;
}
fclose($converted);
echo 'Done.';
As one of the comments suggests, do what makes more sense to you, since you might come back to this code in a few months and you'll want to spend the least amount of time trying to understand it.
However, if speed is your concern, then I would create two test cases (you've pretty much already got them) and use timestamps: store a timestamp in a variable at the beginning of the script, then at the end of the script subtract it from the current timestamp to work out how long the script took to run. Prepare a few files (I would go for about three: two extremes and one normal file) to see which version runs faster.
http://php.net/manual/en/function.time.php
I would think that differences would be marginal, but it also depends on your file sizes.
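A convenient way to take those timestamps is microtime(true) rather than time(), since it gives sub-second precision; a minimal sketch:
$start = microtime(true); // timestamp at the beginning of the script
// ... run one of the two versions here ...
$elapsed = microtime(true) - $start;
echo 'Took ' . round($elapsed, 3) . " seconds\n";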
I'd propose writing a new temporary file while you process the input one. Once done, overwrite the input file with the temporary one.
I have a form that allows the user to either upload a text file or copy/paste the contents of the file into a textarea. I can easily differentiate between the two and put whichever one they entered into a string variable, but where do I go from there?
I need to iterate over each line of the string (preferably not worrying about newlines on different machines), make sure that it has exactly one token (no spaces, tabs, commas, etc.), sanitize the data, then generate an SQL query based off of all of the lines.
I'm a fairly good programmer, so I know the general idea about how to do it, but it's been so long since I worked with PHP that I feel I am searching for the wrong things and thus coming up with useless information. The key problem I'm having is that I want to read the contents of the string line-by-line. If it were a file, it would be easy.
I'm mostly looking for useful PHP functions, not an algorithm for how to do it. Any suggestions?
preg_split the variable containing the text, and iterate over the returned array:
foreach(preg_split("/((\r?\n)|(\r\n?))/", $subject) as $line){
// do stuff with $line
}
I would like to propose a significantly faster (and memory efficient) alternative: strtok rather than preg_split.
$separator = "\r\n";
$line = strtok($subject, $separator);
while ($line !== false) {
# do something with $line
$line = strtok( $separator );
}
Testing the performance, I iterated 100 times over a test file with 17 thousand lines: preg_split took 27.7 seconds, whereas strtok took 1.4 seconds.
Note that though the $separator is defined as "\r\n", strtok will separate on either character - and as of PHP 4.1.0, skip empty lines/tokens.
See the strtok manual entry:
http://php.net/strtok
If you need to handle newlines on different systems you can simply use the predefined PHP constant PHP_EOL (http://php.net/manual/en/reserved.constants.php) and use explode to avoid the overhead of the regular expression engine.
$lines = explode(PHP_EOL, $subject);
It's overly-complicated and ugly but in my opinion this is the way to go:
$fp = fopen("php://memory", 'r+');
fputs($fp, $data);
rewind($fp);
while($line = fgets($fp)){
// deal with $line
}
fclose($fp);
Potential memory issues with strtok:
One of the suggested solutions uses strtok, but unfortunately it doesn't point out a potential memory issue (though it claims to be memory efficient). According to the manual, when using strtok:
Note that only the first call to strtok uses the string argument.
Every subsequent call to strtok only needs the token to use, as it
keeps track of where it is in the current string.
It does this by keeping the whole string in memory. If you're working with large files, you need to flush that once you're done looping through the file.
<?php
function process($str) {
$line = strtok($str, PHP_EOL);
/*do something with the first line here...*/
while ($line !== FALSE) {
// get the next line
$line = strtok(PHP_EOL);
/*do something with the rest of the lines here...*/
}
//the bit that frees up memory
strtok('', '');
}
If you're only concerned with physical files (eg. datamining):
According to the manual, for the file-upload part you can use the file() function:
//Create the array
$lines = file( $some_file );
foreach ( $lines as $line ) {
//do something here.
}
foreach(preg_split('~[\r\n]+~', $text) as $line){
if(empty($line) or ctype_space($line)) continue; // skip only spaces
// if(!strlen($line = trim($line))) continue; // or trim by force and skip empty
// $line is trimmed and nice here so use it
}
^ this is how you break lines properly, cross-platform compatible with Regexp :)
Kyril's answer is best considering you need to be able to handle newlines on different machines.
"I'm mostly looking for useful PHP functions, not an algorithm for how
to do it. Any suggestions?"
I use these a lot:
explode() can be used to split a string into an array, given a single delimiter.
implode() is explode's counterpart, to go from array back to string.
Similar to #pguardiario's, but using a more "modern" (OOP) interface:
$fileObject = new \SplFileObject('php://memory', 'r+');
$fileObject->fwrite($content);
$fileObject->rewind();
while ($fileObject->valid()) {
$line = $fileObject->current();
$fileObject->next();
}
SplFileObject doc: https://www.php.net/manual/en/class.splfileobject.php
PHP IO streams: https://www.php.net/manual/en/wrappers.php.php