str_getcsv to separate lines removes enclosing characters in lines - php

str_getcsv () has some odd behaviour. It removes all characters that match the enclosing character, instead of just the enclosing ones. I'm trying to parse a CSV string (contents of an uploaded file) in two steps:
split the CSV string into an array of lines
split each line into an array of fields
with this code:
$whole_file_string = file_get_contents($file);
$array_of_lines = str_getcsv ($whole_file_string, "\n", "\""); // step 1. split csv into lines
foreach ($array_of_lines as $one_line_string) {
$splitted_line = str_getcsv ($one_line_string, ",", "\""); // step 2. split line into fields
};
In the code example nothing is done with $splitted_line for clarity of example
Then I feed this script a file with the following contents: "text,with,delimiter",secondfield.
When step 1 is performed the first (and only) element of $array_of_lines is text,with,delimiter,secondfield. So when step 2 is performed, it splits the line into 4 fields, but that needs to be 2.
I can't use fgetcsv() because some string conversion is done (checking BOM, converting encoding accordingly and stuff like that) after reading the file and before splitting it into lines in step 1.
I'm at the point of writing my own string parser (which isn't that complicated for CSV format), but before I do so I want to make sure that that's the best approach. I'm a bit disappointed that the PHP functions are letting me down on this simple (and I guess quite common) use case: processing an uploaded csv file with varying encoding.
Any tips?

You should only be calling str_getcsv() on one line at a time, not the whole file.
$array_of_lines = file($file, FILE_IGNORE_NEW_LINES); // split CSV into lines
foreach($array_of_lines as $one_line_string) {
$splitted_line = str_getcsv($one_line_string, ",", "\""); // split line into fields
}
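If the contents have to be handled as a string first (for instance for the BOM check and encoding conversion mentioned in the question), one possible alternative is to write the converted string to an in-memory stream and let fgetcsv() split records and fields in one pass. That also copes with line breaks inside quoted fields. This is only a minimal sketch under that assumption; csv_string_to_rows() is a hypothetical helper name, not a PHP built-in.

```php
<?php
// Hypothetical helper (not from the thread): parse CSV that is held in a string.
function csv_string_to_rows($csv, $delimiter = ",", $enclosure = "\"")
{
    $stream = fopen('php://memory', 'r+'); // in-memory stream, nothing touches the disk
    fwrite($stream, $csv);
    rewind($stream);

    $rows = array();
    // The escape character is passed explicitly instead of relying on the default.
    while (($row = fgetcsv($stream, 0, $delimiter, $enclosure, "\\")) !== false) {
        if ($row === array(null)) {
            continue; // fgetcsv() reports a blank line as array(null)
        }
        $rows[] = $row;
    }
    fclose($stream);
    return $rows;
}

// The line from the question, plus a record with a line break inside quotes.
$rows = csv_string_to_rows("\"text,with,delimiter\",secondfield\n\"multi\nline\",x\n");
print_r($rows); // 2 rows with 2 fields each
```

The string passed in would be whatever remains after your own BOM and encoding handling.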

Here's my own CSV parser, fully compliant with IETF rfc 4180.
I'm curious to know if this can be done in regular expressions, those are not my forte.
/**
* Parse a string according to CSV format (https://tools.ietf.org/html/rfc4180), with variable delimiter (default ,).
* @param string $string String to be parsed as csv
* @param string $delimiter character to be used as field delimiter
* @param bool $line_mode true to split into lines first (default), false to split one line into fields
* @return array Array with for each line an array with csv field values
*/
function csv_parse ($string, $delimiter = ",", $line_mode = true) {
// This function parses on line-level first ($line_mode = true) and calls itself recursively to parse each line on field-level ($line_mode = false).
// when in line mode, the delimiter is eol (\n, \r\n and \n\r).
// when in field mode, the delimiter is the passed $delimiter.
$delimiter = substr ($delimiter,0,1); // delimiter is one character
$length = strlen ($string);
$parsed_array = array();
$end_of_line_state = false;
$enclosed_state = false;
$i = 0;
$field = "";
do {
switch (true) {
case (!$enclosed_state && $end_of_line_state && ($string[$i] == "\r" && $string[$i-1] == "\n")) :
case (!$enclosed_state && $end_of_line_state && ($string[$i] == "\n" && $string[$i-1] == "\r")) :
// ...found second character of eol (\r\n or \n\r). Ignore
$end_of_line_state = false;
break;
case (!$enclosed_state && !$end_of_line_state && ($string[$i] == "\n" || $string[$i] == "\r")) :
// ... found first character of eol \n, \r\n or \n\r
$end_of_line_state = true; // eol can be two characters, so prepare for the second
if ($field != "") { // ignore empty lines. Prohibited in csv
$parsed_array [] = csv_parse ($field, $delimiter, false); // recursive call to parse on field-level. Flush result
};
$field = ""; // prepare for next one
break;
case (!$enclosed_state && $string[$i] == $delimiter && !$line_mode) :
// ...delimiter found
$parsed_array [] = $field; // flush field as new array element
$field = ""; // prepare for next one
break;
case ($string[$i] == "\"") :
// ...encloser found
if ($enclosed_state) {
if ($i + 1 < $length && $string[$i+1] == "\"") { // look ahead only when a next character exists
// ... escaped " found
if (!$line_mode) {
$field .= "\""; // when parsing fieldlevel, only " is part of the line
} else {
$field .= "\"\""; // when parsing line level, the escaping " is also part of the line
};
$i++;
} else {
// ...closing encloser found
$enclosed_state = false;
if ($line_mode) {
$field .= $string[$i]; // when parsing line level, the enclosing " are part of the line
};
};
} else {
// ... opening encloser found
$enclosed_state = true;
if ($line_mode) {
$field .= $string[$i]; // when parsing line level, the enclosing " are part of the line
};
};
break;
default:
// ...regular character found
$field .= $string[$i];
};
$i++;
if ($i >= $length) { // end of string
if ($line_mode) {
if ($field != "") { // nothing left to flush when the input ends with an eol
$parsed_array [] = csv_parse ($field, $delimiter, false); // recursive call to parse on field-level. Flush result.
};
} else {
$parsed_array [] = $field; // flush last field
};
};
} while ($i < $length);
return $parsed_array;
};

Hm. Strange. I'm fiddling around a bit and I discovered something interesting:
$a = str_getcsv("\"#ne piece, #f text\"", "\n");
$b = str_getcsv("\"#ne piece, #f text\"", "\n","\"");
$c = str_getcsv("\"#ne piece, #f text\"", "\n","#");
echo $a[0]; // #ne piece, #f text
echo $b[0]; // #ne piece, #f text
echo $c[0]; // "#ne piece, #f text"
So passing # as the enclosing character instead of " (or leaving it out and using the default, which is " as well) is semantically bollocks, but it does the job. It leaves the enclosing " around the field, so if you then call str_getcsv($c[0], ",") it results in one value, as it should. And if you have a field enclosed in #, it leaves it untouched. I tested that.
It is clear that str_getcsv has its flaws. On splitting to lines it shouldn't strip quotes, just like it doesn't strip # when that's the enclosing parameter. But sadly it does and therefore isn't CSV compliant.

Related

PHP Method Validation: Can I parse a CSV with just explode and str_replace

Yesterday I put together a parser that takes single line inputs from PHP's file() function and parses out each line into fields (code shown below). I'm using file() instead of fopen() so as not to lock the files in question.
I'm reviewing other solutions and came across greg.kindel's comment on this post saying that any solution using splits or pattern matching is doomed to fail:
Javascript code to parse CSV data
I realize that kindel is answering a question about parsing an entire CSV file (line breaks included) so this is a slightly different application, but I would still like to validate my method. The only regex used is to clean individual line data of non-printable characters, but not to parse out individual fields. Am I overlooking something by using splits this way?
Code:
function read_csv($fname = '', $use_headers = true)
{
if(strlen($fname) >= 5 && substr($fname, strlen($fname)-4, 4) == '.csv')
{
$data_array = array();
$headers = array();
# Parse file into individual rows
# Iterate through rows to parse fields.
$rows = file($fname);
for($i = 0; $i < count($rows); $i++)
{
# Remove non-printable characters
# Split string by commas
$rows[$i] = preg_replace('/[[:^print:]]/', '', $rows[$i]);
$split = explode(',', $rows[$i]);
$text = array();
$count = 0;
$fields = array();
# Iterate through split string
foreach($split as $key => $value)
{
# Count cumulative number of quotation marks
# Build field string
# When cumulative number of quotation marks is even, save string and start new field
$count += strlen($value) - strlen(str_replace('"', '', $value));
$text[] = $value;
if($count % 2 == 0)
{
# Reinsert commas in string
# Remove escape quotes from fields encapsulated by quotes
# Convert double-quotation marks to single
$result = implode(',', $text);
if(substr($result, 0, 1) == '"')
{$result = str_replace('""', '"', substr($result, 1, strlen($result)-2));}
$fields[] = $result;
$count = 0;
$text = array();
}
}
# Write $fields to associative array, headers optional
if($i == 0 && $use_headers)
{
foreach($fields as $key => $header)
{$headers[$key] = $header;}
} else {
$tmp = array();
foreach($fields as $key => $value)
{
if($use_headers)
{$tmp[$headers[$key]] = $value;}
else
{$tmp[] = $value;}
}
$data_array[] = $tmp;
}
}
return $data_array;
} else {
# If provided filename is not a csv file, return an error
# Uses the same associative array format as $data_array
return array(0 => array('Error' => 'Invalid filename', 'Filename' => $fname));
}
}
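For comparison, here is a small sketch (not the asker's code) that runs one sample line through both a plain explode() and str_getcsv(). The sample line is made up for illustration. Note that neither approach survives a quoted field containing a line break when the rows come from file(), because file() splits on every newline.

```php
<?php
// Made-up sample line: a quoted comma and doubled (escaped) quotes.
$line = '1,"Smith, John","said ""hi"""';

// Plain splitting cuts straight through the quoted comma.
$naive = explode(',', $line);
echo count($naive), "\n"; // 4

// str_getcsv() honours the enclosure and unescapes doubled quotes.
// The escape character is passed explicitly instead of relying on the default.
$fields = str_getcsv($line, ',', '"', '\\');
print_r($fields); // 3 fields: 1 / Smith, John / said "hi"
```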

Endline after certain number of characters

I have a text file with a lot of inserts that looks like this:
INSERT INTO yyy VALUES ('1','123123','da,sdadwa','6.7','24f,5','f5,5','dasdad,fsdfsdfsfsasada dasdasd','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa','dasdasd','q231e','','0','','g','1','123123','dasdadwa','6.7','24f,5','f5,5','dasdad,fsdfsdfsfsasada dasdasd','','','q231e','','0','','a','1','123123','dasdadwa','655.755','24f,5','f5,5','dasdad,fsdfsdfsfsasada dasdasd','','','q231e','','','','a');
INSERT INTO yyy VALUES ('2','123123','dasdadwa','6.8','24f,6','f5,5','dasdad,fsdfsdfsfsasada dasdasd','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa','dasdasd','q231e','','0','','g','2','123123','dasdadwa','6.8','24f,6','f5,5','dasdad,fsdfsdfsfsasada dasdasd','','','q231e','','0','','a','2','123123','dasdadwa','6.8','24f,6','f5,5','dasdad,fsdfsdfsfsasada dasdasd','','','q231e','','','','a');
INSERT INTO yyy VALUES ('3','123123','dasdadwa','6.9','24f,7','f5,5','dasdad,fsdfsdfsfsasada dasdasd','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa','dasdasd','q231e','','0','','g','3','123123','dasdadwa','6.9','24f,7','f5,5','dasdad,fsdfsdfsfsasada dasdasd','','','q231e','','0','','a','3','123123','dasdadwa','6.9','24f,7','f5,5','dasdad,fsdfsdfsfsasada dasdasd','','','q231e','','','','a');
INSERT INTO yyy VALUES ('4','123123','dasdadwa','6.10','24f,8','f5,5','dasdad,fsdfsdfsfsasada dasdasd','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa','dasdasd','q231e','','0','','g','4','123123','dasdadwa','6.10','24f,8','f5,5','dasdad,fsdfsdfsfsasada dasdasd','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa','','q231e','','0','','a','4','123123','dasdadwa','6.10','24f,8','f5,5','dasdad,fsdfsdfsfsasada dasdasd','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa','','q231e','','','','a');
INSERT INTO yyy VALUES ('5','123123','dasdadwa','6.11','24f,9','f5,5','dasdad,fsdfsdfsfsasada dasdasd','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa','dasdasd','q231e','','0','','g','5','123123','dasdadwa','6.11','24f,9','f5,5','dasdad,fsdfsdfsfsasada dasdasd','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa','','q231e','','0','','a','5','123123','dasdadwa','6.11','24f,9','f5,5','dasdad,fsdfsdfsfsasada dasdasd','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa','','q231e','','','','a');
I must modify this text file so that each line can have a maximum of 50 characters. The problem is that I cannot simply put an endline after 50 characters because that would break the elements in those inserts, so I need to put the endline before the last comma.
For the first row it needs to be something like this:
INSERT INTO yyy VALUES ('1','123123','da,sdadwa',
'6.7','24f,5','f5,5',
'dasdad,fsdfsdfsfsasada dasdasd',
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'dasdasd','q231e','','0','','g','1','123123',
'dasdadwa','6.7','24f,5','f5,5',
'dasdad,fsdfsdfsfsasada dasdasd','','','q231e','',
'0','','a','1','123123','dasdadwa','655.755',
'24f,5','f5,5','dasdad,fsdfsdfsfsasada dasdasd',
'','','q231e','','','','a');
As you can see there can be commas even inside the elements('da,sdadwa') which makes this a tad more difficult. I tried putting everything into arrays but I ran into some problems and couldn't get it to work.
What I tried:
if(is_array($testFileContents))
{
foreach($testFileContents as $line)
{
$j=0;
for($i=0;$i<=strlen($line);$i++)
{
//echo $line[$i];
$ct=1;
if($j==50)
{
if($line[$j]==",")
{
//$line[$j]=$line[$j].PHP_EOL;
}
else
{
$temporaryJ = $j;
while($line[$temporaryJ]!=",")
{
$temporaryJ--;
}
//$line[$temporaryJ] = $line[$temporaryJ].PHP_EOL;
//$j=$i-$ct*50;
$j=0;
$ct=$ct+1;
echo $ct." ";
}
}
else
{
$j++;
}
}
}
}
I know there has to be a much simpler way of going about this without using arrays, but I cannot figure it out.
You can use preg_split() to split the lines. I found a pattern another user posted in this answer for matching values for an INSERT statement:
"~'(?:\\\\'|[^'])*'(*SKIP)(*F)|,~". This utilizes Special Backtracking Control Verbs.
You can play with the PHP code in this PhpFiddle.
foreach($lines as $line) {
$matches = preg_split("~'(?:\\\\'|[^'])*'(*SKIP)(*F)|,~",$line);
$currentIndex = 0;
$currentLine = '';
$outputLines = array();
$delimeter = ',';
while($currentIndex < count($matches)) {
if ($currentIndex == count($matches)-1 ) {
$delimeter = '';
}
$tempLine = $currentLine . $matches[$currentIndex] . $delimeter;
if (strlen($tempLine) <= 50) {
$currentLine .= $matches[$currentIndex] . $delimeter;
}
else {//push current line into array and start a new line
$outputLines[] = $currentLine;
$currentLine = $matches[$currentIndex] . $delimeter;
}
if ($currentIndex == count($matches)-1 ) {
$outputLines[] = $currentLine;
}
$currentIndex++;
}
//can use implode("\n",$outputLines) to write out to file
//or whatever your needs are
}
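The same idea can be condensed into one function and checked against the first line of the desired output. This is only a sketch: wrap_insert() is a hypothetical name, and it assumes that no single value plus its comma is longer than the width (such a value would simply get a line of its own).

```php
<?php
// Hypothetical helper: wrap one INSERT statement at commas that sit outside quotes.
function wrap_insert($line, $width = 50)
{
    // Quoted values are matched and skipped, so only commas outside them split.
    $parts = preg_split("~'(?:\\\\'|[^'])*'(*SKIP)(*F)|,~", $line);
    $last = count($parts) - 1;
    $lines = array();
    $current = '';
    foreach ($parts as $i => $part) {
        $piece = $part . ($i < $last ? ',' : ''); // re-attach the comma removed by the split
        if ($current !== '' && strlen($current . $piece) > $width) {
            $lines[] = $current; // the current line is full, start a new one
            $current = '';
        }
        $current .= $piece;
    }
    $lines[] = $current;
    return implode("\n", $lines);
}

$sql = "INSERT INTO yyy VALUES ('1','123123','da,sdadwa','6.7','24f,5','f5,5','x');";
$wrapped = wrap_insert($sql, 50);
echo $wrapped, "\n";
// first line: INSERT INTO yyy VALUES ('1','123123','da,sdadwa',
```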

How to normalise CSV content in PHP?

Problem:
I'm looking for a PHP function to easily and efficiently normalise CSV content in a string (not in a file). I have made a function for that. I provide it in an answer, because it is a possible solution. Unfortunately it doesn't work when the separator is included in incoming string values.
Can anyone provide a better solution?
Why not use fputcsv / fgetcsv?
Because:
it requires at least PHP 5.1.0 (which is sometimes not available)
it can only read from files, not from a string, yet sometimes the input is not a file (e.g. if you fetch the CSV from an email)
putting the content into a temporary file might be unavailable due to security policies.
Why / what kind of normalisation?
Normalise in such a way that the encloser encloses every field, because the encloser can be optional and can differ per line and per field. This can happen if one is implementing unclean/incomplete specifications and/or using CSV content from different sources/programs/developers.
Example function call:
$csvContent = "'a a',\"b\",c,1, 2 ,3 \n a a,'bb',cc, 1, 2, 3 ";
echo "BEFORE:\n$csvContent\n";
normaliseCSV($csvContent);
echo "AFTER:\n$csvContent\n";
Output:
BEFORE:
'a a',"b",c,1, 2 ,3
a a,'bb',cc, 1, 2, 3
AFTER:
"a a","b","c","1","2","3"
"a a","bb","cc","1","2","3"
To specifically address your concern regarding f*csv working only with files:
Since PHP 5.3 there's str_getcsv.
For at least PHP >= 5.1 (and I really hope that's the oldest you'll have to deal with these days), you can use stream wrappers:
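$buffer = fopen('php://memory', 'r+');
fwrite($buffer, $string);
rewind($buffer);
while (($fields = fgetcsv($buffer)) !== false) {
// $fields is an array holding the values of one CSV record
}
fclose($buffer);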
Or obviously the reverse if you want to use fputcsv.
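Applied to the normalisation itself, the stream approach could look roughly like the sketch below. It assumes a single encloser character for the whole input (so it does not cover the mixed ' and " input from the question), normalise_csv_string() is a hypothetical name, and the explicit escape argument to fgetcsv() needs PHP 5.3 or later.

```php
<?php
// Hypothetical helper: re-emit CSV text with every field trimmed and enclosed.
function normalise_csv_string($csv, $separator = ',', $encloser = '"')
{
    $buffer = fopen('php://memory', 'r+');
    fwrite($buffer, $csv);
    rewind($buffer);

    $lines = array();
    while (($fields = fgetcsv($buffer, 0, $separator, $encloser, '\\')) !== false) {
        if ($fields === array(null)) {
            continue; // blank line
        }
        foreach ($fields as $k => $field) {
            // Double any encloser inside the value, then wrap the value.
            $escaped = str_replace($encloser, $encloser . $encloser, trim($field));
            $fields[$k] = $encloser . $escaped . $encloser;
        }
        $lines[] = implode($separator, $fields);
    }
    fclose($buffer);
    return implode("\n", $lines);
}

$normalised = normalise_csv_string("a, b ,\"c,d\"\n1,2,\"say \"\"hi\"\"\"\n");
echo $normalised, "\n";
// "a","b","c,d"
// "1","2","say ""hi"""
```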
This is a possible solution. But it doesn't consider the case that the separator (,) might be included in incoming strings.
function normaliseCSV(&$csv,$lineseperator = "\n", $fieldseperator = ',', $encloser = '"')
{
$csvArray = explode ($lineseperator,$csv);
foreach ($csvArray as &$line)
{
$lineArray = explode ($fieldseperator,$line);
foreach ($lineArray as &$field)
{
$field = $encloser.trim($field,"\0\t\n\x0B\r \"'").$encloser;
}
$line = implode ($fieldseperator,$lineArray);
}
$csv = implode ($lineseperator,$csvArray);
}
It is a simple chain of explode -> explode -> trim -> implode -> implode .
Although I agree with @deceze that you could expect at least 5.1 these days, I'm sure there are some internal company servers somewhere that nobody wants to update.
I altered your method to be able to use field and line separators between double quotes, or in your case the $encloser value.
<?php
/*
In regards to the specs on http://tools.ietf.org/html/rfc4180 I use the following rules:
- "Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes."
- "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
Exception:
Even though the spec says to use double quotes, I'm using your $encloser variable
*/
echo normaliseCSV('a,b,\'c\',"d,e","f","g""h""i","""j"""' . "\n" . "\"k\nl\nm\"");
function normaliseCSV($csv,$lineseperator = "\n", $fieldseperator = ',', $encloser = '"')
{
//We need 3 temporary replacement values:
//escaped quotes (double or triple), line separator, field separator
$keys = array();
while (count($keys)<3) {
$tmp = "##".md5(rand().rand().microtime())."##";
if (strpos($csv, $tmp)===false) {
$keys[] = $tmp;
}
}
//first we exchange "" (double $encloser) and """ to make sure its not exploded
$csv = str_replace($encloser.$encloser.$encloser, $keys[0], $csv);
$csv = str_replace($encloser.$encloser, $keys[0], $csv);
//Explode on $encloser
//Every odd index is within quotes
//Exchange line and field seperators for something not used.
$content = explode($encloser,$csv);
$len = count($content);
if ($len>1) {
for ($x=1;$x<$len;$x=$x+2) {
$content[$x] = str_replace($lineseperator,$keys[1], $content[$x]);
$content[$x] = str_replace($fieldseperator,$keys[2], $content[$x]);
}
}
$csv = implode('',$content);
$csvArray = explode ($lineseperator,$csv);
foreach ($csvArray as &$line)
{
$lineArray = explode ($fieldseperator,$line);
foreach ($lineArray as &$field)
{
$val = trim($field,"\0\t\n\x0B\r '");
//put back the exchanged values
$val = str_replace($keys[0],$encloser.$encloser,$val);
$val = str_replace($keys[1],$lineseperator,$val);
$val = str_replace($keys[2],$fieldseperator,$val);
$val = $encloser.$val.$encloser;
$field = $val;
}
$line = implode ($fieldseperator,$lineArray);
}
$csv = implode ($lineseperator,$csvArray);
return $csv;
}
?>
Output would be:
"a","b","c","d,e","f","g""h""i","""j"""
"k
l
m"
Codepad example
When I first read this question I wasn't sure whether it needed solving at all, since pre-5.1 environments should have died out a long time ago. Despite that, it is a hell of a question, so we should think about which approach to take, and my guess is that it should be a character-by-character examination.
I have separated logic in three main scenarios:
A: CHAR is a separator
B: CHAR is a Fuc$€/& quotation
C: CHAR is a Value
The result is this weapon class (with logging included) for our arsenal:
<?php
Class CSVParser
{
#basic requirements
public $input;
public $separator;
public $currentQuote;
public $insideQuote;
public $result;
public $field;
public $quotation = array();
public $parsedArray = array();
# for logging purposes only
public $logging = TRUE;
public $log = array();
function __construct($input, $separator, $quotation=array())
{
$this->separator = $separator;
$this->input = $input;
$this->quotation = $quotation;
}
/**
* The main idea is to go through the string char by char;
* when a complete field is detected it is quoted accordingly and added to an array
*/
public function parse()
{
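for($i = 0; $i < strlen($this->input); $i++){
$this->processStream($i);
}
# flush a trailing unquoted field, which no separator or quote terminates
if(!is_null($this->field)){
$this->saveField($this->field);
$this->field = NULL;
}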
foreach($this->parsedArray as $value)
{
if(!is_null($value))
$this->result .= '"'.addslashes($value).'",';
}
return rtrim($this->result, ',');
}
private function processStream($i)
{
#A case (its a separator)
if($this->input[$i]===$this->separator){
$this->log("A", $this->input[$i]);
if($this->insideQuote){
$this->field .= $this->input[$i];
}else
{
$this->saveField($this->field);
$this->field = NULL;
}
}
#B case (its a f"·%$% quote)
if(in_array($this->input[$i], $this->quotation)){
$this->log("B", $this->input[$i]);
if(!$this->insideQuote){
$this->insideQuote = TRUE;
$this->currentQuote = $this->input[$i];
}
else{
if($this->currentQuote===$this->input[$i]){
$this->insideQuote = FALSE;
$this->currentQuote ='';
$this->saveField($this->field);
$this->field = NULL;
}else{
$this->field .= $this->input[$i];
}
}
}
#C case (its a value :-) )
if(!in_array($this->input[$i], array_merge(array($this->separator), $this->quotation))){
$this->log("C", $this->input[$i]);
$this->field .= $this->input[$i];
}
}
private function saveField($field)
{
$this->parsedArray[] = $field;
}
private function log($type, $value)
{
if($this->logging){
$this->log[] = "CASE ".$type." WITH ".$value." AS VALUE";
}
}
}
An example of how to use it would be:
$original = 'a,"ab",\'ab\'';
$test = new CSVParser($original, ',', array('"', "'"));
echo "<PRE>ORIGINAL: ".$original."</PRE>";
echo "<PRE>PARSED: ".$test->parse()."</PRE>";
echo "<pre>";
print_r($test->log);
echo "</pre>";
and here are the results:
ORIGINAL: a,"ab",'ab'
PARSED: "a","ab","ab"
Array
(
[0] => CASE C WITH a AS VALUE
[1] => CASE A WITH , AS VALUE
[2] => CASE B WITH " AS VALUE
[3] => CASE C WITH a AS VALUE
[4] => CASE C WITH b AS VALUE
[5] => CASE B WITH " AS VALUE
[6] => CASE A WITH , AS VALUE
[7] => CASE B WITH ' AS VALUE
[8] => CASE C WITH a AS VALUE
[9] => CASE C WITH b AS VALUE
[10] => CASE B WITH ' AS VALUE
)
I might have made mistakes, since I only dedicated 25 minutes to it, so any comments will be appreciated and edited in.

PHP json Scanning escaped character

I am writing a JSONScanner class that basically takes a string and scans the whole thing to construct a JSONObject. Currently I'm writing read_string() method, to read a string. When reading a string that escapes '\', I get some invalid output.
Here is my JSONScanner class
class JSONScanner {
private $in;
private $pos;
public function __construct($in) {
$this->in = $in;
$this->pos = 0;
}
#########################################################
############### Method used for debugging ###############
#########################################################
public function display() {
$this->pos = 1;
echo $this->read_string($this->get_char());
}
#########################################################
#########################################################
private function read_string($quote) {
$str = "";
while(($c = $this->get_char()) != $quote) {
if($c == '\\') {
$str .= $this->get_escaped_char();
} else {
$str .= $c;
}
}
return $str;
}
private function get_escaped_char() {
$c = $this->get_char();
switch($c) {
case 'n':
return "\n"; // double quotes, so this is a real newline rather than backslash + n
case 't':
return "\t";
case 'r':
return "\r";
// display the characters being escaped
case '\\':
case '\'':
case '"':
default:
return $c;
}
}
private function get_char() {
if($this->pos >= strlen($this->in)) {
return -1; // END OF INPUT
}
return substr($this->in, $this->pos++, 1);
}
}
Here is my running code
$str = '{"a\\":1,"b":2}';
$jscan = new JSONScanner($str);
$jscan->display();
With the above string, I'm getting
a":1,
However when I try
$str = '{"a\\\":1,"b":2}';
$jscan = new JSONScanner($str);
$jscan->display();
I get what I need, which is
a\
Why am I needing to put 2 backslashes to escape 1 backslash?
EDIT:
I tried the same JSON string with json_decode and got the same results: with 2 backslashes it returned nothing, but with 3 backslashes it gave me a\. Why is that? Doesn't escaping a backslash take 2 consecutive ones (\\)?
$str = '{"a\\":1,"b":2}';
This is a PHP string literal, which has its own escaping rules. The actual string you're representing with the above is:
{"a\":1,"b":2}
If you want to represent one backslash in a PHP string literal, you need to write two backslashes. So the correct string representation for what you want is:
$str = '{"a\\\\":1,"b":2}';
It happens to work with three backslashes, because \\ becomes one \ and the next \ isn't followed by any special character, so it by itself also represents a single backslash.
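A quick way to see this is to compare the literals directly. The following is just an illustrative sketch using the strings from the question; the variable names are made up.

```php
<?php
$two   = '{"a\\":1,"b":2}';   // 2 backslashes in source: the string holds {"a\":1,"b":2}
$three = '{"a\\\":1,"b":2}';  // 3 backslashes in source: the string holds {"a\\":1,"b":2}
$four  = '{"a\\\\":1,"b":2}'; // 4 backslashes in source: the string holds {"a\\":1,"b":2}

var_dump($three === $four);   // bool(true): both spell the same bytes
var_dump(json_decode($two));  // NULL: \" escapes the quote, so the JSON is malformed
print_r(array_keys(json_decode($four, true))); // the keys are a\ and b
```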

Print to a file a sorted string with numbers and letters, looking for a preg_match solution

Ok, so I have a text file filled with student numbers and the corresponding name. I want to get all those data, format them properly (uppercase, proper number of spaces, etc.) and put them in another file. The original text format is somewhat like this:
20111101613 XXXXXXXX , XXXX
20111121235 xxXXXX, xxxxxx
20111134234 XXXX, XxxX
20111104142 XXXXXxxxX, XXXX
20111131231 XX , XXXXXX
Example:
Input file content is something like this:
20111112346 Zoomba, Samthing
20111122953 Acosta, Arbyn
20110111241 Smith, John
20111412445 Over, Flow Stack
20111112345 foo, BAR
And the output file content should be like this:
20111122953 ACOSTA, ARBYN
20111112345 FOO, BAR
20111412445 OVER, FLOW STACK
20110111241 SMITH, JOHN
20111112346 ZOOMBA, SAMTHING
EDIT: Can someone give me a hint or a solution for writing this function using regular expressions?
function sortslist($infile, $outfile)
{
// open the input file named conversion.txt
$inptr = fopen($infile, "r");
if (!$inptr)
{
trigger_error("File cannot be opened: $infile", E_USER_ERROR);
}
// initialize student number to zero
$snum = 0;
// number of letters in the name string
$n = 0;
// initialize the name string to be empty
$name = "";
// iteratively scan the input file
$done = false;
while (!$done)
{
// get each character in the file
$c = fgetc($inptr);
// if the character is a digit, add it to the student number
if (ctype_digit($c))
{
$snum = (($snum * 10) + ($c - '0'));
}
// else, add to name string including commas and space. Input file might have tabs
else if (ctype_alpha($c) || ($n > 0 && ($c == " " || $c == "\t")) || $c == ",")
{
// append the new character to name string
$name .= $c;
// add a space after the comma
if ($c == ",")
{
$name .= " ";
$n++;
}
// increment the number of letters
$n++;
}
// end of the line
else if ($c == "\n" || !$c)
{
// 0 is added to student numbers when newline is detected so neglect them
if ($snum != 0 && $name != "\n")
{
// replace consecutive spaces with one space only
$name = preg_replace(['/\s\s+/', '/\s,/', '/(\s*)(?>$)/'], [' ', ',', ''], $name);
// record student number and name
$info['snum'][] = $snum;
$info['name'][] = strtoupper($name);
}
// reset the values needed
$snum = 0;
$n = 0;
$name = "";
// if we hit the end of the file then it is done
if (!$c)
{
$done = true;
}
}
}
// sort the students names alphabetically
array_multisort($info['name'], $info['snum']);
// combine the name strings and their corresponding student number
$i = 0;
$students = [];
$last_student = end($info['snum']);
foreach ($info['snum'] as $snum)
{
$students[$snum] = $snum . " " . $info['name'][$i++];
// update input file too
fwrite($inptr, $students[$snum]);
// don't add a newline to the end of the file
if ($snum != $last_student)
{
$students[$snum] .= "\n";
}
}
// put it into a new file called slist.txt
file_put_contents($outfile, $students, LOCK_EX);
// close the input file
fclose($inptr);
}
Your problem lies in the fact that $hashtable values are stored with the student ID first in the string. asort() will always look at the beginning of the value, and sort according to that. So in order to sort by the name, you will have to split up the student ID and the name into two separate arrays and then sort them using array_multisort().
Replace:
$hashtable[$snum] = $snum . " " . strtoupper($name) . "\n";
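with:
$snums[] = $snum;
$names[] = strtoupper($name);
Then, after the loop that reads the file has finished, sort both arrays together and build the output:
array_multisort($names, $snums);
foreach ($names as $j => $sorted_name) {
$hashtable[$snums[$j]] = $snums[$j] . " " . $sorted_name . "\n";
}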
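As for the regular-expression variant asked about in the edit, one possible sketch is to match each line with preg_match() instead of scanning character by character. format_students() is a hypothetical name, and the pattern assumes every line has the shape "digits, whitespace, surname, comma, given names"; lines that do not match are skipped.

```php
<?php
// Hypothetical helper: returns the formatted lines, sorted by name.
function format_students(array $lines)
{
    $names = array();
    $snums = array();
    foreach ($lines as $line) {
        // student number, surname, comma, given names (spacing may vary)
        if (preg_match('/^\s*(\d+)\s+([^,]+?)\s*,\s*(.+?)\s*$/', $line, $m)) {
            $surname = preg_replace('/\s+/', ' ', $m[2]); // collapse runs of spaces and tabs
            $given   = preg_replace('/\s+/', ' ', $m[3]);
            $snums[] = $m[1];
            $names[] = strtoupper($surname . ', ' . $given);
        }
    }
    array_multisort($names, $snums); // sort by name, keep the numbers aligned
    $out = array();
    foreach ($names as $i => $name) {
        $out[] = $snums[$i] . ' ' . $name;
    }
    return $out;
}

$input = array(
    "20111112346 Zoomba, Samthing",
    "20111122953 Acosta, Arbyn",
    "20110111241 Smith, John",
    "20111412445   Over , Flow   Stack",
    "20111112345 foo, BAR",
);
$sorted = format_students($input);
echo implode("\n", $sorted), "\n";
```

The lines of a file could be fed in with file($infile), and the result written out with file_put_contents($outfile, implode("\n", $sorted)).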
