I have a form that allows the user to either upload a text file or copy/paste the contents of the file into a textarea. I can easily differentiate between the two and put whichever one they entered into a string variable, but where do I go from there?
I need to iterate over each line of the string (preferably not worrying about newlines on different machines), make sure that it has exactly one token (no spaces, tabs, commas, etc.), sanitize the data, then generate an SQL query based off of all of the lines.
I'm a fairly good programmer, so I know the general idea about how to do it, but it's been so long since I worked with PHP that I feel I am searching for the wrong things and thus coming up with useless information. The key problem I'm having is that I want to read the contents of the string line-by-line. If it were a file, it would be easy.
I'm mostly looking for useful PHP functions, not an algorithm for how to do it. Any suggestions?
preg_split the variable containing the text, and iterate over the returned array:
foreach(preg_split("/((\r?\n)|(\r\n?))/", $subject) as $line){
// do stuff with $line
}
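To connect this back to the question (one token per line, then an SQL query), here is a minimal sketch. The helper name `extract_tokens`, the allowed-character rule, and the `items`/`name` table and column are all assumptions for illustration, not something from the question. It uses `\R`, which matches any newline sequence, and builds placeholders so the values can be bound by the database driver rather than concatenated into the query.

```php
<?php
// Hypothetical helper: split text into lines and keep only single-token lines.
function extract_tokens(string $subject): array
{
    $tokens = [];
    // \R matches any newline sequence (\r\n, \n or \r).
    foreach (preg_split('/\R/', $subject) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // ignore blank lines
        }
        // Assumption: a valid token is letters, digits, underscore or hyphen only.
        if (!preg_match('/^[A-Za-z0-9_-]+$/', $line)) {
            throw new InvalidArgumentException("Not a single token: $line");
        }
        $tokens[] = $line;
    }
    return $tokens;
}

$tokens = extract_tokens("alpha\r\nbeta\n\ngamma\r");
// One placeholder per token; the values themselves are bound separately.
$placeholders = implode(', ', array_fill(0, count($tokens), '(?)'));
$sql = "INSERT INTO items (name) VALUES $placeholders"; // table/column are made up
```

The resulting `$sql` and `$tokens` would then go to a prepared statement (PDO or mysqli), which takes care of escaping.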
I would like to propose a significantly faster (and memory efficient) alternative: strtok rather than preg_split.
$separator = "\r\n";
$line = strtok($subject, $separator);
while ($line !== false) {
# do something with $line
$line = strtok( $separator );
}
Testing the performance, I iterated 100 times over a test file with 17 thousand lines: preg_split took 27.7 seconds, whereas strtok took 1.4 seconds.
Note that although $separator is defined as "\r\n", strtok treats it as a set of characters and will split on either one. As of PHP 4.1.0 it also skips empty lines/tokens.
See the strtok manual entry:
http://php.net/strtok
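A small sketch of that behaviour, using a made-up input string with mixed line endings and one blank line:

```php
<?php
$subject = "first\r\n\r\nsecond\nthird\r";
$lines = [];
$line = strtok($subject, "\r\n");
while ($line !== false) {
    $lines[] = $line;
    $line = strtok("\r\n");
}
// $lines now holds "first", "second", "third"; the blank line is skipped.
```

If blank lines are significant in your input, strtok is therefore not the right tool.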
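If you need to handle newlines on different systems you can use the PHP predefined constant PHP_EOL (http://php.net/manual/en/reserved.constants.php) with explode to avoid the overhead of the regular expression engine. Note that PHP_EOL is the newline of the platform PHP is running on, so this only works when the text uses the same line endings as the server.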
$lines = explode(PHP_EOL, $subject);
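If the text may come from a different platform than the one PHP runs on (for example pasted into a textarea), one regex-free option is to normalize the line endings first. This is a sketch with a made-up input string:

```php
<?php
$subject = "one\r\ntwo\rthree\nfour";
// Convert Windows (\r\n) and old Mac (\r) endings to \n; the order matters,
// otherwise \r\n would become two newlines.
$normalized = str_replace(["\r\n", "\r"], "\n", $subject);
$lines = explode("\n", $normalized);
```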
It's overly-complicated and ugly but in my opinion this is the way to go:
$fp = fopen("php://memory", 'r+');
fputs($fp, $data);
rewind($fp);
while(($line = fgets($fp)) !== false){ // strict check so a final line of "0" is not skipped
// deal with $line
}
fclose($fp);
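The same idea can be wrapped in a generator so the calling code just uses foreach. This is only a sketch; the name `string_lines` is made up, and it needs PHP 7 for the type declarations. Note that fgets keeps the trailing newline, so it is stripped here, and that fgets does not split on a lone "\r".

```php
<?php
// Hypothetical helper: yields each line of $data without its trailing newline.
function string_lines(string $data): Generator
{
    $fp = fopen('php://memory', 'r+');
    fwrite($fp, $data);
    rewind($fp);
    try {
        while (($line = fgets($fp)) !== false) {
            yield rtrim($line, "\r\n");
        }
    } finally {
        fclose($fp); // runs even if the caller stops iterating early
    }
}

$seen = [];
foreach (string_lines("a\r\nb\n0") as $line) {
    $seen[] = $line;
}
```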
Potential memory issues with strtok:
One of the suggested solutions uses strtok but, although it claims to be memory efficient, it doesn't point out a potential memory issue. According to the manual:
Note that only the first call to strtok uses the string argument.
Every subsequent call to strtok only needs the token to use, as it
keeps track of where it is in the current string.
It does this by keeping the string in memory between calls. If you're working with large strings (for example, the contents of a large file), you need to release it once you're done looping.
<?php
function process($str) {
$line = strtok($str, PHP_EOL);
while ($line !== FALSE) {
/*do something with $line here (this runs for the first line too)...*/
// get the next line; the check at the top of the loop stops on FALSE
$line = strtok(PHP_EOL);
}
//the bit that frees up memory
strtok('', '');
}
If you're only concerned with physical files (eg. datamining):
According to the manual, for the file upload part you can use the file() function:
//Create the array
$lines = file( $some_file );
foreach ( $lines as $line ) {
//do something here.
}
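file() keeps the trailing newline on each element unless told otherwise. A sketch with its two flags, using a temporary file as a stand-in for the upload (the file name and contents are made up):

```php
<?php
// Stand-in for an uploaded file; a real form would use $_FILES[...]['tmp_name'].
$some_file = tempnam(sys_get_temp_dir(), 'lines');
file_put_contents($some_file, "alpha\n\nbeta\n");

// Drop the trailing newline from each line and skip empty lines.
$lines = file($some_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
unlink($some_file);
```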
foreach(preg_split('~[\r\n]+~', $text) as $line){
if($line === '' or ctype_space($line)) continue; // skip empty or whitespace-only lines (empty() would also skip a line containing just "0")
// if(!strlen($line = trim($line))) continue; // or trim by force and skip empty
// $line is trimmed and nice here so use it
}
^ this is how you break lines properly, cross-platform compatible with Regexp :)
Kyril's answer is best considering you need to be able to handle newlines on different machines.
"I'm mostly looking for useful PHP functions, not an algorithm for how
to do it. Any suggestions?"
I use these a lot:
explode() can be used to split a string into an array, given a
single delimiter.
implode() is explode's counterpart, to go from array back to string.
Similar to #pguardiario's answer, but using a more "modern" (OOP) interface:
$fileObject = new \SplFileObject('php://memory', 'r+');
$fileObject->fwrite($content);
$fileObject->rewind();
while ($fileObject->valid()) {
$line = $fileObject->current();
$fileObject->next();
}
SplFileObject doc: https://www.php.net/manual/en/class.splfileobject.php
PHP IO streams: https://www.php.net/manual/en/wrappers.php.php
Related
I wanted to separate each line in the text file, reverse the order and echo it.
Got Method SplFileObject::__toString() must return a string value error.
Here's the code:
for ($x=$lines; $x>0; $x--) {
$file = new SplFileObject("post.txt");
$file -> seek($x);
echo $file;
}
I would use good old file() together with array_reverse():
foreach(array_reverse(file('file.txt')) as $line) {
echo $line;
}
You're simply seeking to a line past the end of the file you've opened.
This is an issue for you with these files, because you won't know how many lines are in a file until you've iterated to the end. A text file's number of lines isn't something stored in the file's metadata - it's not like the filesize. SplFileObject is intended to use its iterator, which can only seek() (go forward from the start to a specific line -- fairly slow if you're reversing lines this way), next(), or rewind() (go back to the start). It's an excellent class to use if you want to read through a file, but not great for going backwards the way you've indicated.
To get lines backwards from the end, you'll want to use an array where each item is a line in your file. That's the intended use of the old builtin file ( http://php.net/manual/en/function.file.php ).
If your heart is set on an SplFileObject, because you want to use it elsewhere in your code, you can build an array in reverse like so:
$lines = [];
foreach (new SplFileObject('test.txt') as $line) {
array_unshift($lines, $line);
}
echo implode('', $lines);
I am reading from log files which can be anything from a small log file up to 8-10mb of logs. The typical size would probably be 1mb. Now the key thing is that the keyword im looking for is normally near the end of the document, in probably 95% of the cases. Then i extract 1000 characters after the keyword.
If i use this approach:
$lines = explode("\n",$body);
$reversed = array_reverse($lines);
foreach($reversed AS $line) {
// Search for my keyword
}
Would it be more efficient than using:
$pos = stripos($body,$keyword);
$snippet_pre = substr($body, $pos, 1000);
What I am not sure about is how stripos works. Does it just search through the document one character at a time? If so, in theory, if there are 10,000 characters after the keyword then I won't have to read those into memory, whereas the first option has to read everything into memory even though it probably only needs the last 100 lines. Could I alter it to read 100 lines into memory, then search lines 101-200 if the first 100 were not successful? Or is the operation so light that it doesn't really matter?
I have a second question, which assumes that array_reverse is the best approach: how would I extract the next 1000 characters after I have found the keyword? Here is my woeful attempt:
$body = $this_is_the_log_content;
$lines = explode("\n",$body);
$reversed = array_reverse($lines);
foreach($reversed AS $line) {
$pos = stripos($line,$keyword);
$snippet_pre = substr($line, $pos, 1000);
}
The reason I don't think that will work is that each $line might only be a few hundred characters. Would the better solution be to split it every, say, 2,000 characters and also keep the previous $line as a backup variable? Something like this:
$body = $this_is_the_log_content;
$lines = str_split($body, 2000);
$reversed = array_reverse($lines);
$previous_line = $line;
foreach($reversed AS $line) {
$pos = stripos($line,$keyword);
if ($pos) {
$line = $previous_line . ' ' . $line;
$pos1 = stripos($line,$keyword);
$snippet_pre = substr($line, $pos, 1000);
}
}
Im probably massively over-complicating this?
I would strongly consider using a tool like grep for this. You can call this command line tool from PHP and use it to search the file for the word you are looking for and do things like give you the byte offset of the matching line, give you a matching line plus trailing context lines, etc.
Here is a link to grep manual. http://unixhelp.ed.ac.uk/CGI/man-cgi?grep
Play with the command a bit on the command line to get it the way you want it, then call it from PHP using exec(), passthru(), or similar depending on how you need to capture/display the content.
Alternatively, you can simply fopen() the file, use fseek() to place the file pointer near the end, and move it backward through the file, searching for the string as you go. Once you find your needle, you can then read the file from that offset until you get to the end of the file or the number of log entries you want.
Either of these might be preferable to reading the entire log file into memory and then trying to work with it.
The other thing to consider is whether 1000 characters is meaningful. Typically log files would have lines that vary in length. To me it would seem that you should be more concerned about getting the next X lines from the log file, not the next Y characters. What if a line has 2000 characters, are you saying you only want to get half of it? That may not be meaningful at all.
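A minimal sketch of the "read only the end of the file" idea, assuming the keyword usually sits in the last 64 KB and that the last occurrence is the one you want. The function name `snippet_from_tail` and its parameters are made up for illustration.

```php
<?php
// Hypothetical helper: look for $keyword in the last $tailBytes of a file and
// return up to $length bytes starting at the last match, or null if not found.
function snippet_from_tail(string $path, string $keyword, int $length = 1000, int $tailBytes = 65536): ?string
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        return null;
    }
    $start = max(0, filesize($path) - $tailBytes);
    fseek($fp, $start);
    $tail = stream_get_contents($fp); // only the tail is held in memory
    fclose($fp);

    $pos = strripos($tail, $keyword); // case-insensitive, last occurrence
    if ($pos === false) {
        return null; // the caller could retry with a larger $tailBytes
    }
    return substr($tail, $pos, $length);
}

// Usage with a throwaway log file.
$log = tempnam(sys_get_temp_dir(), 'log');
file_put_contents($log, str_repeat("noise line\n", 1000) . "ERROR disk full\nmore details\n");
$snippet = snippet_from_tail($log, 'error', 20);
unlink($log);
```

If null comes back, a fallback to searching the whole file covers the remaining cases where the keyword is not near the end.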
This works
$arr = array_merge(array_diff($words, array("the","an")));
Why doesn't this work?
$common consists of 40 words in an array.
$arr = array_merge(array_diff($words, $common));
Is there another solution for this?
For Reference:
<?php
error_reporting(0);
$str1= "the engine has two ways to run: batch or conversational. In batch, expert system has all the necessary data to process from the beginning";
common_words($str1);
function common_words(&$string) {
$file = fopen("common.txt", "r") or exit("Unable to open file!");
$common = array();
while(!feof($file)) {
array_push($common,fgets($file));
}
fclose($file);
$words = explode(" ",$string);
$arr = array_merge(array_diff($words, array("the","an")));
print_r($arr);
}
?>
White-spaces are evil, sometimes..
fgets with only one parameter will return one line of data from the filehandle provided.
Though, it will not strip off the trailing new-line ("\n" or whatever EOL character(s) is used) in the line returned.
Since common.txt seems to have one word per line, this is the reason why php won't find any matching elements when you use array_diff.
PHP: fgets - Manual
parameter: length
Reading ends when length - 1 bytes have been read, on a newline (which is included in the return value), or on EOF (whichever comes first). If no length is specified, it will keep reading from the stream until it reaches the end of the line.
Rephrase:
All entries of $common will have a trailing line-break the way you are doing it now.
Alternative solutions 1
If you are not going to process the entries in common.txt I'd recommend you to take a look at php's function file, and use that in conjunction with array_map to rtrim the lines for you.
$common = array_map ('rtrim', file ('common.txt')); // will do what you want
Alternative solutions 2
After #MarkBaker saw the solution above he made a comment saying that you might as well pass a flag to file to make it work in the same manner, there is no need to call array_map to "fix" the entries returned.
$common = file ('common.txt', FILE_IGNORE_NEW_LINES);
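To see the effect without needing common.txt, here is a small sketch with the file's lines simulated as an array (the sample words are made up):

```php
<?php
$words = explode(' ', 'the engine has an expert system');
$raw = ["the\n", "an\n"];           // what fgets() would return per line
$clean = array_map('rtrim', $raw);  // ["the", "an"]

// array_values() re-indexes the result, like array_merge() in the question.
$unfiltered = array_values(array_diff($words, $raw));   // nothing is removed
$filtered   = array_values(array_diff($words, $clean)); // "the" and "an" are removed
```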
This question was asked on a message board, and I want to get a definitive answer and intelligent debate about which method is more semantically correct and less resource intensive.
Say I have a file with each line in that file containing a string. I want to generate an MD5 hash for each line and write it to the same file, overwriting the previous data. My first thought was to do this:
$file = 'strings.txt';
$lines = file($file);
$handle = fopen($file, 'w+');
foreach ($lines as $line)
{
fwrite($handle, md5(trim($line))."\n");
}
fclose($handle);
Another user pointed out that file_get_contents() and file_put_contents() were better than using fwrite() in a loop. Their solution:
$thefile = 'strings.txt';
$newfile = 'newstrings.txt';
$current = file_get_contents($thefile);
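$explodedcurrent = explode("\n", $current); // split the file contents (not the filename); "\n" needs double quotes
$temp = '';
foreach ($explodedcurrent as $string)
$temp .= md5(trim($string)) . "\n"; // double quotes so a real newline is written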
$newfile = file_put_contents($newfile, $temp);
My argument is that since the main goal of this is to get the file into an array, and file_get_contents() is the preferred way to read the contents of a file into a string, file() is more appropriate and allows us to cut out another unnecessary function, explode().
Furthermore, by directly manipulating the file using fopen(), fwrite(), and fclose() (which is the exact same as one call to file_put_contents()) there is no need to have extraneous variables in which to store the converted strings; you're writing them directly to the file.
My method is the exact same as the alternative - the same number of opens/closes on the file - except mine is shorter and more semantically correct.
What do you have to say, and which one would you choose?
This should be more efficient and less resource-intensive than the previous two methods:
$file = 'passwords.txt';
$passwords = file($file);
$converted = fopen($file, 'w+');
$i = 0;
while (count($passwords) > 0)
{
fwrite($converted, md5(trim($passwords[$i])) . "\n"); // one hash per line
unset($passwords[$i]); // free each line once it has been written
$i++;
}
fclose($converted);
echo 'Done.';
As one of the comments suggests, do what makes more sense to you. You might come back to this code in a few months, and you will want to spend the least amount of time trying to understand it.
However, if speed is your concern, I would create two test cases (you pretty much already have them) and time them: store a timestamp in a variable at the beginning of the script, then subtract it from the timestamp at the end to work out how long the script took to run. Prepare a few files, I would go for about three (two extremes and one normal file), to see which version runs faster.
http://php.net/manual/en/function.time.php
I would think that differences would be marginal, but it also depends on your file sizes.
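A rough sketch of such a timing harness. It uses microtime(true) rather than time(), because time() only has one-second resolution; the helper name `average_seconds` and the dummy workload are made up.

```php
<?php
// Hypothetical helper: run $fn $runs times and return the average seconds per run.
function average_seconds(callable $fn, int $runs = 5): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return (microtime(true) - $start) / $runs;
}

// Replace the body with one of the two approaches being compared.
$elapsed = average_seconds(function () {
    md5(str_repeat('x', 1000));
});
printf("%.6f seconds per run\n", $elapsed);
```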
I'd propose to write a new temporary file, while you process the input one. Once done, overwrite the input file with the temporary one.
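Something along these lines, as a sketch only. The helper name `hash_lines_in_place` is made up, and error handling is left out. Because it reads with fgets, only one line is in memory at a time, and the original file stays intact until the new one is complete.

```php
<?php
// Hypothetical helper: replace each line of $path with its MD5 hash.
function hash_lines_in_place(string $path): void
{
    $tmp = tempnam(dirname($path), 'md5');
    $in = fopen($path, 'r');
    $out = fopen($tmp, 'w');
    while (($line = fgets($in)) !== false) {
        fwrite($out, md5(trim($line)) . "\n");
    }
    fclose($in);
    fclose($out);
    rename($tmp, $path); // swap the finished file into place
}

// Usage with a throwaway file.
$path = tempnam(sys_get_temp_dir(), 'strings');
file_put_contents($path, "secret\nhunter2\n");
hash_lines_in_place($path);
$result = file($path, FILE_IGNORE_NEW_LINES);
unlink($path);
```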
I was wondering if anybody could shed any light on this problem.. PHP 5.3.0 :)
I have a loop, which is grabbing the contents of a CSV file (large, 200mb), handling the data, building a stack of variables for mysql inserts and once the loop is complete and the variables created, I'm inserting the information.
Now firstly, the mysql insert is performing perfectly, no delays and all is fine, however it's the LOOP itself that has the delay. I was originally using fgetcsv() to read the CSV file, but compared to file_get_contents() this had a serious delay, so I switched to file_get_contents(). The loop will perform in a matter of seconds, until I attempt to add a function (I've also added the expression inside the loop without the function to see if it helps) to create an array with the CSV data from each line. This is what is causing serious delays in the parsing time! (The difference is about 30 seconds based on this 200mb file, but it depends on the filesize of the CSV file, I guess.)
Here's some code so you can see what I'm doing:
$filename = "file.csv";
$content = file_get_contents($filename);
$rows = explode("\n", $content);
foreach ($rows as $data) {
$data = preg_replace("/^\"(.*)\"$/","$1",preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data))); //THIS IS THE CULPRIT CAUSING SLOW LOADING?!?
}
Running the above loop, will perform almost instantly without the line:
$data = preg_replace("/^\"(.*)\"$/","$1",preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data)));
I've also tried creating a function as below (outside of loop):
function csv_string_to_array($str) {
$expr="/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/";
$results=preg_split($expr,trim($str));
return preg_replace("/^\"(.*)\"$/","$1",$results);
}
and calling the function instead of the one liner:
$data = csv_string_to_array($data);
With again no luck :(
Any help would be appreciated on this, I'm guessing the fgetcsv function is performing in a very similar way based on the delay it causes, looping through and creating an array from the line of data.
Danny
The regex subexpressions (bounded by "(...)") are the issue. It's trivial to show that adding these to an expression can greatly reduce its performance. The first thing I would try is to stop using preg_replace() to simply remove leading and trailing double quotes (trim() would be a better bet for that) and see how much that helps. After that you might need to try a non-regex way to parse the line.
I partially found a solution, I'm sending a batch to only loop 1000 lines at a time (php is looping by 1000 until it reaches the end of the file).
I'm then only setting:
$data = preg_replace("/^\"(.*)\"$/","$1",preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data)));
on the 1000 lines, so that it's not being set for the WHOLE file which was causing issues.
It is now looping and inserting 1000 rows into the mysql database in 1-2 seconds, which I'm happy with. I've setup the script to loop 1000 rows, remember its last location, then loop to the next 1000 until it reaches the end, it seems to be working ok!
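For reference, the batching can also be done in a single run with a generator, so there is no need to remember the last location between passes. This is a sketch rather than the code I'm actually running; `csv_batches` is a made-up name and it relies on fgetcsv for the parsing.

```php
<?php
// Hypothetical helper: yield arrays of up to $size parsed CSV rows.
function csv_batches(string $path, int $size = 1000): Generator
{
    $fp = fopen($path, 'r');
    $batch = [];
    while (($row = fgetcsv($fp, 0, ',', '"', '\\')) !== false) {
        $batch[] = $row;
        if (count($batch) === $size) {
            yield $batch;
            $batch = [];
        }
    }
    fclose($fp);
    if ($batch !== []) {
        yield $batch; // final partial batch
    }
}

// Usage with a tiny throwaway file and a batch size of 2.
$csv = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($csv, "1,a\n2,b\n3,c\n4,d\n5,e\n");
$sizes = [];
foreach (csv_batches($csv, 2) as $batch) {
    $sizes[] = count($batch); // build and run one multi-row INSERT per batch here
}
unlink($csv);
```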
I'd say the major culprit is the complexity of the preg_split() regexp.
And the explode() is probably eating some seconds.
$content = file_get_contents($filename);
$rows = explode("\n", $content);
could be replaced by:
$rows = file ($filename); // returns an array
But, I second the above suggestion from ITroubs, fgetcsv() would probably be a much better solution.
I would suggest using fgetcsv for parsing the data. It seems like memory may be your biggest impact. So to avoid consuming 200MB of RAM, you should parse line-by-line as follows:
$fp = fopen($input, 'r');
while (($row = fgetcsv($fp, 0, ',', '"')) !== false) {
$out = '"' . implode('", "', $row) . '"'; // quoted, comma-delimited output (separator first; the reversed order was removed in PHP 8)
// perform work
}
Alternatively: Using conditionals in preg is typically very expensive. It can sometimes be faster to process these lines using explode() and trim() with its $charlist parameter.
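A sketch of the explode()/trim() route, under the assumption that no field contains a comma inside its quotes (the function name `simple_csv_line` is made up). When quoted commas can occur, str_getcsv() parses a single line without a custom regex.

```php
<?php
// Assumption: no field contains a comma inside its quotes.
function simple_csv_line(string $line): array
{
    $fields = explode(',', trim($line));
    return array_map(function ($field) {
        return trim($field, '"'); // strip the surrounding double quotes
    }, $fields);
}

$row = simple_csv_line("\"a\",\"b c\",3\r\n");

// For quoted commas, let the built-in parser handle the line instead.
$quoted = str_getcsv('"x,y",z', ',', '"', '\\');
```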
The other alternative, if you still want to use preg, is to add the S modifier to try to speed up the expression. (This applies to the PHP 5.3 you are on; as of PHP 7.3 the modifier has no effect.)
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
S
When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.
By the way, I don't think your loop is doing what you think it should: it assigns to $data, which is only a copy of each element, so the $rows array is not modified once you've exited the loop. To do that, you need something more like:
foreach ($rows as $key => $data) {
$rows[$key]=preg_replace("/^\"(.*)\"$/","$1",preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", trim($data)));