Exploding a comma-delimited CSV file when column values themselves contain commas - php

I am reading a comma-delimited CSV file line by line and then separating each column value with PHP's explode function. The problem is that some columns themselves contain commas (,), so they are exploded as well.
A row of data:
03,1392,06,1000,1,"1000,36,21,68",4,AF,TJ,AF,44071000
Here "1000,36,21,68" must be treated as a single value, but PHP's explode also splits it. I know this is how explode works, but is there an alternative function that can be used in this case? I would also need to remove the double quotes (") from both sides of this value.

Don't try using explode and parsing it yourself:
use PHP's built-in str_getcsv() function
or use fgetcsv() to read and parse each line directly from file
EDIT
If you're feeling really adventurous, you can use SPL to read and parse the file
$file = new SplFileObject("data.csv");
while (!$file->eof()) {
    var_dump($file->fgetcsv());
}
or
$file = new SplFileObject("data.csv");
$file->setFlags(SplFileObject::READ_CSV);
foreach ($file as $fields) {
    var_dump($fields);
}
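As a minimal sketch of the str_getcsv() suggestion, the sample row from the question can be parsed like this. The variable names are only illustrative, and the delimiter, enclosure and escape arguments are spelled out explicitly instead of relying on defaults.

```php
<?php
// Sample row from the question; the sixth column contains commas inside quotes
$line = '03,1392,06,1000,1,"1000,36,21,68",4,AF,TJ,AF,44071000';

// Arguments: input, delimiter, enclosure, escape character
$fields = str_getcsv($line, ',', '"', '\\');

// The quoted column stays in one piece and its surrounding quotes are removed
echo count($fields), PHP_EOL;
echo $fields[5], PHP_EOL;
```

No extra step is needed to strip the double quotes; the parser removes the enclosure characters itself.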


How to use fgetcsv() when the CSV has several double quotes """" or if the entire line is wrapped in quotes?

Some CSV files that we import to our server cannot be parsed correctly.
We are reading the CSV file with PHP's fgetcsv():
while (($line = fgetcsv($file)) !== false) { ... }
However, when the CSV line is wrapped in quotes (and contains two double quotes inside), for example:
"first entry,"""","""",Data Chunk,2022-05-30"
The fgetcsv() function cannot handle the line correctly and sees the first entry,"""","""",Data Chunk,2022-05-30 as one entry.
How can we make sure the function treats first entry as a separate entry, and also interprets the other parts """" as empty entries?
On more research I found:
Fields containing double quotes ("), Line Break (CRLF) and Comma must be enclosed with double quotes.
If fields enclosed by double quotes (") contain a double quote character, then the double quote inside the field must be preceded by another double quote as an escape sequence.
This is likely the issue that we face here.
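To illustrate the two quoting rules above, here is a small round trip through fputcsv() and fgetcsv() on an in-memory stream. The sample row is made up for the example and is not taken from the real export.

```php
<?php
// Hypothetical row: one field with quotes, one with a line break, one with a comma
$row = ['He said "hi"', "two\nlines", 'a,b'];

$h = fopen('php://memory', 'w+');
fputcsv($h, $row, ',', '"', '\\');   // writer encloses fields and doubles inner quotes

rewind($h);
$raw = stream_get_contents($h);      // the CSV text as it would appear in a file

rewind($h);
$parsed = fgetcsv($h, 0, ',', '"', '\\');  // reader undoes the quoting
fclose($h);

echo $raw;
var_dump($parsed === $row);
```

If a file written this way is later re-saved by a tool that quotes each line a second time, the reader no longer gets the original fields back, which is consistent with the symptom described here.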
A more complete data example of the CSV:
Allgemeines
Subject,Body,Attachment,Author,Created At,Updated At
"Hello everyone, this is a sample. Kind regards,"""","""",Author name (X),2022-05-30 14:54:32 UTC,2022-05-30 14:54:37 UTC"
","""",https://padlet-uploads.storage.googleapis.com/456456456/testfile.docx,Author name (X),2022-05-15 13:53:04 UTC,2022-05-15 13:54:40 UTC"
",""Hello everyone!"
This is some fun text.
More to come.
Another sentence.
And more text.
Even more text
See you soon.
","",Author name (X),2021-07-22 09:41:06 UTC,2021-07-23 16:12:42 UTC
""
Important Things to Know in 2022
Subject,Body,Attachment,Author,Created At,Updated At
"","
01.01.2022 First day of new year
02.02.2202 Second day of new year
Please plan ahead.
","",Author name (X),2021-07-22 09:58:19 UTC,2022-03-24 14:16:50 UTC
""
Note: Line starts with double quote and ends with double quote and carriage return and new line feed.
Turns out the CSV data was corrupted.
The user had edited the CSV in Excel and, as stated in the comments, likely overwrote the original CSV, which caused the double escaping.
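The double escaping can be made visible with a quick diagnostic, sketched below under the assumption that the sample line from the question is representative. It is meant for inspection only, not as a recovery tool.

```php
<?php
// The problematic line from the question, wrapped in one extra layer of quotes
$line = '"first entry,"""","""",Data Chunk,2022-05-30"';

// First pass: the whole line comes back as a single field
$once = str_getcsv($line, ',', '"', '\\');

// Second pass over that single field: the intended columns appear
$twice = str_getcsv($once[0], ',', '"', '\\');

echo count($once), PHP_EOL;
echo count($twice), PHP_EOL;
print_r($twice);
```

A record that only splits into columns after a second parse is a sign that the file was CSV-encoded twice.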
For anyone facing the same issue:
Do not waste your time in trying to recover corrupted CSV files with a custom parser.
Ask your user to give you access to the original CSV export site and generate the CSV yourself.
Check the CSV integrity. See code below.
$file = fopen($csvfile, 'r');
// Check that all records have the same number of fields: empty lines give a count of 1,
// full entries a count of 6 (depends on your CSV structure)
$length_array = array();
while (($data = fgetcsv($file, 1000, ",")) !== false) {
    // count the number of fields in this record
    $length_array[] = count($data);
}
$length_array = array_unique($length_array);
// release the file handle
fclose($file);
// depending on your CSV structure, count($length_array) should be 1 or 2
if (count($length_array) > 2) {
    // field count mismatch
    return 'Invalid CSV!';
}

fgetcsv/fputcsv $escape parameter fundamentally broken

Overview
fgetcsv and fputcsv support an $escape argument, however, it's either broken, or I'm not understanding how it's supposed to work. Ignore the fact that you don't see the $escape parameter documented on fputcsv, it is supported in the PHP source, there's a small bug preventing it from coming through in the documentation.
The function also supports $delimiter and $enclosure parameters, defaulting to a comma and a double quote respectively. I would expect the $escape parameter should be passed in order to have a field containing any one of those metacharacters (backslash, comma or double quote), however this certainly isn't the case. (I now understand from reading Wikipedia, these are to be enclosed in double-quotes).
What I've tried
Take for example the pitfall that has affected numerous posters in the comments section from the fgetcsv documentation. The case where we'd like to write a single backslash to a field.
$r = fopen('/tmp/test.csv', 'w');
fwrite($r, '"\"');
fclose($r);
$r = fopen('/tmp/test.csv', 'r');
var_dump(fgetcsv($r));
fclose($r);
This returns false. I've also tried "\\", however that also returns false. Padding the backslash(es) with some nebulous text gives fgetcsv the boost it needs... "hi\\there" and "hi\there" both parse and have the same result, but the result has only 1 backslash, so what's the point of the $escape at all?
I've observed the same behavior when not enclosing the backslash in double quotes. Writing a 'CSV' file containing the string \, and \\, have the same result when parsed by fgetcsv, 1 backslash.
Let's ask PHP how it might encode a backslash as a field in a CSV using fputcsv
$r = fopen('/tmp/test.csv', 'w');
fputcsv($r, array('\\'));
fclose($r);
echo file_get_contents('/tmp/test.csv');
The result is a double-quote enclosed single backslash (and I've tried 3 versions of PHP > 5.5.4, when $escape support was supposedly added to fputcsv). The hilarity of this is that fgetcsv can't even read it properly per my notes above; it returns false. I'd expect fputcsv not to enclose the backslash in double quotes, or fgetcsv to be able to read "\" as fputcsv has written it, or really, in my apparently misconstrued mind, for fputcsv to write a double-quote enclosed pair of backslashes and for fgetcsv to be able to properly parse it!
Reproducible Test
Try writing a single backslash to a file using fputcsv, then reading it via fgetcsv.
$aBackslash = array('\\');
// Write a single backslash to a file using fputcsv
$r = fopen('/tmp/test.csv', 'w');
fputcsv($r, $aBackslash);
fclose($r);
// Read the file using fgetcsv
$r = fopen('/tmp/test.csv', 'r');
$aFgetcsv = fgetcsv($r);
fclose($r);
// Compare the read value from fgetcsv to our original value
if(count(array_diff($aBackslash, $aFgetcsv)))
echo "PHP CSV support is broken\n";
Questions
Taking a step back I have some questions
What's the point of the $escape parameter?
Given the loose definition of CSV files, can it be said PHP is supporting them correctly?
What's the 'proper' way to encode a backslash in a CSV file?
Background
I initially discovered this when a co-worker provided me a CSV file produced from Python, which wrote out a single backslash enclosed by double quotes. After fgetcsv failed to read it, I had the gall to ask him if he could use a standard Python function. Little did I know the PHP CSV toolkit is a tangled mess! (FWIW: the Python dev tells me he's using the CSV writing module).
From a quick look at Python's documentation on CSV Format Parameters, the escape character used within enclosed values (i.e. inside double quotes) is another double quote.
For PHP, the default escape character is a backslash (^); to match Python's behaviour you need to use this (^^):
$data = fgetcsv($r, 0, ',', '"', '"');
(^) Actually fgetcsv() treats both $enclosure||$enclosure and $escape||$enclosure in the same way, so the $escape argument is used to avoid treating the backslash as a special character.
(^^) Setting the $length parameter to 0 instead of a fixed hard limit makes it less efficient.
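A short check of this suggestion, using str_getcsv() on the exact three characters the Python writer produced. This is only a sketch; it assumes str_getcsv() and fgetcsv() treat the $escape argument the same way.

```php
<?php
// The three characters "\" : a lone backslash enclosed in double quotes
$line = '"\\"';

// Escape character set to the enclosure character, as suggested above
$fields = str_getcsv($line, ',', '"', '"');

var_dump($fields);
```

With the escape character set to a double quote, the backslash is no longer special and comes back as ordinary field content.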
EDIT 2
So after some sleep and another look at the code, it turns out fputcsv doesn't accept the escape parameter, and I was being stupid. I've updated the code below to proper working code. The same basic principle applies: the escape parameter is there to alter the escape character, so you can load a CSV with backslashes without them being treated as escape characters. The trick is to use a character that isn't contained within the CSV. You can find one by grepping the file for specific characters until you find one that isn't returned.
EDIT
Ok, so the verdict is that it checks for the escape char, and then never stops checking. So, if it finds it, it's escaped. That simple.
That said, the purpose of the escape parameter is to allow for this exact situation, where you can alter the escape char to a character that isn't needed.
Here I've converted your example code to a working code:
$aBackslash = array('\\');
// Write a single backslash to a file using fputcsv
$r = fopen('/tmp/test.csv', 'w');
fputcsv($r, $aBackslash, ',', '"'); // EDIT 2: Removed escape param that causes PHP Notice.
fclose($r);
// Read the file using fgetcsv
$r = fopen('/tmp/test.csv', 'r');
$aFgetcsv = fgetcsv($r, 0, ',', '"', '#'); // second argument is $length; 0 means no line length limit
fclose($r);
// Compare the read value from fgetcsv to our original value
if(count(array_diff($aBackslash, $aFgetcsv)))
echo "PHP CSV support is broken\n";
else
echo "PHP WORKS!\n";
One important caveat is that both fgetcsv and fputcsv must have the same parameters, otherwise the returned array will not match up to the original array.
ORIGINAL ANSWER
You are very correct. This is a failing with the language. I've tried every permutation of slashes that I can think of, and I've yet to actually achieve a successful response from the CSV. It always returns just as your example says.
I think what @deceze was mentioning is that in your example you use array('\\'), which is actually the string literal "\"; PHP interprets it as such and passes "\" to the CSV, which is then returned that way. This returns the erroneous response \", which, as I stated above, is definitely wrong.
I did manage to find a work around, so that the result is actually appropriate:
First, for your example we'll either need to generate /tmp/test.csv with "\" as the body, or alter the array slightly. The easiest method is just changing the array to:
array('"\\\\"');
After that, we should change up the fgetcsv request a bit.
$aFgetcsv = fgetcsv($r);
$aFgetcsv = array_map('stripslashes', $aFgetcsv);
By doing this, we're telling PHP to strip the first slash, thus making the string within $aFgetcsv "\"
Just had the same problem. The solution was to set $escape to false:
$row = ['a', '{"b":"single dquote=\""}', 'c'];
fputcsv($f, $row); // invalid csv: a,"{""b"":""single dquote=\"""}",c
fputcsv($f, $row, ',', '"', false); // valid csv: a,"{""b"":""single dquote=\""""}",c
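On PHP 7.4 and later, the same effect can be requested with an empty string as the escape argument, which turns the backslash escape mechanism off. The round trip below is a sketch under that version assumption, using an in-memory stream instead of a real file.

```php
<?php
// Requires PHP 7.4+ (empty-string escape). Same sample row as above.
$row = ['a', '{"b":"single dquote=\""}', 'c'];

$f = fopen('php://memory', 'w+');
fputcsv($f, $row, ',', '"', '');        // no escape character: every quote is doubled

rewind($f);
$raw = stream_get_contents($f);

rewind($f);
$parsed = fgetcsv($f, 0, ',', '"', ''); // read back with the same settings
fclose($f);

echo $raw;
var_dump($parsed === $row);
```

Using identical settings for writing and reading is what makes the round trip lossless.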

php function for csv conversion w/ commas and other formatting characters

I am downloading my data from MySQL to .csv format. I am having no problem using mysql_real_escape_string(), but this function removes any commas or formatting characters that exist in my data. So the .csv structure is maintained, but my grammatical characters (such as commas) are unexpectedly removed.
mysql_real_escape_string doesn't REMOVE data. It simply makes a string safe to insert into an SQL query. The standard rule for CSV is to enclose any string containing commas in double quotes, so
This is my comma , containing string
becomes
"This is my comma, containing string"
in the CSV output. And any fields containing double-quotes should have the quotes doubled:
This is my "little" friend
becomes
"This is my ""little"" friend"
Enclosing each field with double quotes helps.
A function to convert an array to CSV:
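function arr2csv($twoDarray) {
    foreach ($twoDarray as $k => $v) {
        // join the fields of one row so that every field ends up quoted
        $row = implode('","', $v);
        // terminate the row with CRLF (carriage return, then line feed)
        echo '"' . $row . '"' . chr(13) . chr(10);
    }
}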
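The function above quotes every field but does not double quotes that occur inside a field. A sketch that delegates the quoting to fputcsv() instead is shown below; rowsToCsv is a hypothetical helper name, and the sample strings are the ones used earlier in this answer thread.

```php
<?php
// Hypothetical helper: convert a two-dimensional array to a CSV string
function rowsToCsv(array $rows): string
{
    $h = fopen('php://temp', 'w+');
    foreach ($rows as $row) {
        // fputcsv() encloses fields where needed and doubles embedded quotes
        fputcsv($h, $row, ',', '"', '\\');
    }
    rewind($h);
    $csv = stream_get_contents($h);
    fclose($h);
    return $csv;
}

$csv = rowsToCsv([
    ['This is my comma, containing string', 'This is my "little" friend'],
]);
echo $csv;
```

Note that fputcsv() ends each row with a line feed by default rather than CRLF.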
I solved this by wrapping the entire string in quotes, then individually wrapping quotes and commas to maintain the formatting:
...
$csv_output .= "\"" . eregi_replace("\"", "\"\"", stripslashes($rowr[$j])) . "\",";
...
You'll note that I strangely applied stripslashes(). Unfortunately the script I am working on only works in php4, and so slashes are added by default settings of the .ini. So I just strip them out.
I'll also probably replace eregi_replace() with str_replace() as I believe it's deprecated.
Anyhow. The above solution works to remove commas and slashes and maintains them where

Remove extra comma after string

I have a lot of data in a CSV file. I wrote some code to extract only column 1 and put it in a txt file:
fwrite($file2, $data[0].',');
Now, this created a TXT file with all values separated by a comma.
However, after the last value was read there was an extra comma
I don't need this, because when I used foreach($splitcontents as $x=> $y) using a comma delimiter, it reads a garbage value at the end because of the extra comma.
How do I remove or avoid the comma at the end?
Use fputcsv() instead of misreimplementing it.
Instead of assembling the CSV file yourself field-wise you could use fputcsv() which puts it into the right format:
while (...) {
    fputcsv($file2, array($data[0], $data[1], $data[22]) );
}
The second parameter must be an array. If you really only want one column, then leave out the rest.
Also for reading the files back in, check out fgetcsv(). This might simplify your foreach + $splitstring approach.
One way to solve the problem is to use rtrim($data, ',') on the data you load from the second file before splitting it. This will remove the trailing comma.
If you want to fix the file itself, you can do this:
ftruncate($file2, ftell($file2)-1);
You have to do this just before you call fclose()
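The rtrim() approach mentioned above can be sketched as follows. The file contents are a made-up stand-in for whatever the first script wrote.

```php
<?php
// Hypothetical contents of the TXT file, including the unwanted trailing comma
$contents = 'alpha,beta,gamma,';

// Strip the trailing comma before splitting, so no empty element is produced
$splitcontents = explode(',', rtrim($contents, ','));

foreach ($splitcontents as $x => $y) {
    echo $x, ': ', $y, PHP_EOL;
}
```

Without the rtrim() call, explode() would return a fourth, empty element for the text after the last comma.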

Use fgetcsv with and without quotation marks around entries

Edit: is there an alternative to fgetcsv?
The code below processes CSV files where each entry is encased in quotes and separated by commas, e.g. "Name","Last"... The problem I'm having is that sometimes the CSV files do not have quotes around each entry and just have commas to separate them, e.g. Name,Last. How can I handle both types?
$uploadcsv = "/temp/files/Load15.csv";
$handle = fopen($uploadcsv, 'r');
$column_headers = array();
$row_count = 0;
while (($data = fgetcsv($handle, 100000, ",")) !== FALSE) {
    if ($row_count==0){
        $column_headers = $data;
    } else {
        print_r($data);
    }
    ++$row_count;
}
this csv works:
"Name","Last"
"Mike","Aidens"
"Mike1","Aidens1"
this csv does not work:
Name,Last
Mike,Aidens
Mike1,Aidens1
Edit: Strange error... I tried a small snippet from the CSV file with no quotations and it worked. Oddly, I then tried a larger piece and then the entire CSV content (all pasted into a new test.csv file) and it worked too. Both files are exactly the same size, 17,151 KB, yet the original CSV file will not process. There are no trailing spaces or empty lines at the end.
Set the 4th parameter to an empty string, it sets the enclosure, which is default ".
fgetcsv($handle, 100000, ",", '');
Use this line of code before the fgetcsv function call:
ini_set('auto_detect_line_endings',TRUE);
As far as I am aware fgetcsv should work fine with or without quotes around the data.
Unless the CSV file is malformed, this will "just work".
In other words, you don't need to worry about whether or not every field has quotes around it; fgetcsv will take care of this for you.
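A quick way to check this claim is to parse one quoted and one unquoted header line with the same settings. The sketch uses str_getcsv() on the header rows from the question, assuming it applies the same quoting rules as fgetcsv().

```php
<?php
// Header rows from the two sample files in the question
$quoted   = '"Name","Last"';
$unquoted = 'Name,Last';

// Same delimiter, enclosure and escape settings for both lines
$a = str_getcsv($quoted, ',', '"', '\\');
$b = str_getcsv($unquoted, ',', '"', '\\');

print_r($a);
print_r($b);
var_dump($a === $b);
```

Both lines produce the same two fields, which suggests that a failure on the unquoted file has another cause, such as line endings or encoding.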
Had the same problem, it couldn't read Hebrew (utf-8) letters without double quotes. It ran fine on the command line (could read Hebrew without double quotes), but in Apache it read only the header which had double quotes and returned empty strings instead of Hebrew strings in the rest of the lines which did not have double quotes at all.
Checked the locale in Apache and it returned the letter "C", but in the command line it returned "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C"
Thus I've added the following line before the fgetcsv command:
setlocale(LC_CTYPE, 'en_US.UTF-8');
And it worked, and read Hebrew letters without double quotes successfully.
