Standardizing CSV file types - PHP

I'm using a csv parser class (http://code.google.com/p/php-csv-parser/) to parse and extract data from csv files. The problem I'm encountering is that it only works for certain csv file types. (It seems that there is a csv type for Mac, for MS-DOS, and for Windows.)
The code works if I use a csv file which was saved on a Mac (in Excel) using the CSV - Windows option. However, if I save a file on a Windows machine simply as csv, that doesn't work. (You would think that would be the same format as saving CSV - Windows on a Mac.) It does work from a Windows machine if I save it as a CSV - MS-DOS file. This seems a little ridiculous.
Is there a way to standardize these three file types so that my code can read any type of csv that is uploaded?
I'm thinking it would be something like this:
$standardizedCSV = preg_replace('/\r(?!\n)/', "\r\n", $csvContent);
I know it has something to do with how each file type handles line endings, but I'm a little put out trying to figure out those differences. If anybody has any advice, please let me know.
Thanks.
UPDATE:
This is the relevant code from the csv parser I'm using which extracts data row by row:
$c = 0;
$d = $this->settings['delimiter'];
$e = $this->settings['escape'];
$l = $this->settings['length'];
$res = fopen($this->_filename, 'r');
while ($keys = fgetcsv($res, $l, $d, $e)) {
    if ($c == 0) {
        $this->headers = $keys;
    } else {
        array_push($this->rows, $keys);
    }
    $c++;
}
I guess I need to understand how fgetcsv handles EOLs, so that I can make sure that csv files of any format are handled in the same manner.

This seems to do the trick:
ini_set("auto_detect_line_endings", true);
The problem was with line endings, but I didn't need to create my own EOL parser. This runtime setting does it for me. See http://us.php.net/manual/en/filesystem.configuration.php#ini.auto-detect-line-endings.
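For reference, a minimal sketch of how this could fit around the parser code above; the filename is just a placeholder (and note that this ini setting has since been deprecated as of PHP 8.1):
ini_set("auto_detect_line_endings", true); // must be set before the file is opened

$res = fopen('upload.csv', 'r'); // 'upload.csv' is just a placeholder
while (($keys = fgetcsv($res, 0, ',')) !== false) {
    // rows come back as arrays regardless of Mac, Unix or Windows line endings
}
fclose($res);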

I don't think line endings are the issue. The thing about CSV is that it's only a "comma separated values" file and not standardized beyond that. So some systems separate the values using commas, some using semicolons (;). I'm sure there are variations that use even other value separators.
Additionally, the escape character (most often backslash \) can be different between CSV files, and some CSV files also use quotation marks around each value (").
A CSV file can use any variation between the above. For instance, I'm fairly certain that Microsoft Excel exports CSV files separating values using semicolons and without any quotation around the values.
I'm sure there are ways to auto-detect how to parse the CSV file, but the best way would be allowing the user to decide. That's what Excel does.

If you use CSV files, you have to agree on many details which are not properly standardized:
Line endings (Unix 0x0a, Macintosh 0x0d, DOS 0x0d 0x0a)
Field separators (comma, semicolon etc.)
Field quoting (all fields quoted, only string fields, only string fields containing field and line separators)
Escaping of double quotes within string fields (doubling of double quotes, backslash character before double quote etc.)
Multiline string fields (are they allowed or not)
File encoding (ISO-8859-1, UTF-8 etc.)
If you create a CSV reader, you can automatically handle different variations of line endings and field quoting. But the rest has to be known to the CSV parser beforehand.
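To illustrate, here is a rough sketch of how such per-file settings could be fed to fgetcsv(); the concrete values (semicolon delimiter, ISO-8859-1 encoding, the filename) are only assumptions that would have to be agreed on beforehand:
// Settings that have to be known in advance for this particular file (assumed values):
$delimiter = ';';
$enclosure = '"';
$escape    = '\\';
$encoding  = 'ISO-8859-1';

ini_set("auto_detect_line_endings", true); // line endings can at least be auto-detected

$fh = fopen('export.csv', 'r'); // placeholder filename
while (($row = fgetcsv($fh, 0, $delimiter, $enclosure, $escape)) !== false) {
    // normalise the encoding of every field to UTF-8
    $row = array_map(function ($field) use ($encoding) {
        return mb_convert_encoding($field, 'UTF-8', $encoding);
    }, $row);
    // ... work with $row ...
}
fclose($fh);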
The de facto standard is the CSV format produced by Excel. However, Excel uses different variations of the format:
Usually DOS line endings (but I've never tried it with Excel for Macintosh)
Field separator depending on the locale. If the comma is used to group the digits in long numbers, Excel uses the semicolon as field separator. Otherwise the comma.
Excel uses double quotes if needed.
Excel doubles the double quotes within string fields.
Excel supports multiline string fields.
The file encoding seems to be the file encoding of the current locale. So it varies.
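As a small illustration of the quoting rules above, str_getcsv() understands the doubled double quotes that Excel writes; the sample record here is made up:
// One Excel-style record: a quoted field containing a comma and a doubled quote.
$line   = '"Smith, John","He said ""hello""",42';
$fields = str_getcsv($line, ',', '"');
// $fields is now: ['Smith, John', 'He said "hello"', '42']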

Related

Regex to delete faulty characters within a csv file to make SplFileObject work correctly in PHP

I try to parse a csv file in PHP via SplFileObject. Sadly, SplFileObject sometimes gets stuck if there are erroneous invisible characters in the text: while iterating over the lines of the csv file it detects a quote instead of skipping the character or reading it as a normal one.
The following is copied from Textwrangler (the invisible char should be between "forgé." and "Circa"):
Fer forgé.� Circa
My code (SplFileObject part):
$splFile = new \SplFileObject($file);
$splFile->setFlags(\SplFileObject::DROP_NEW_LINE | \SplFileObject::SKIP_EMPTY | \SplFileObject::READ_AHEAD | \SplFileObject::READ_CSV);
$splFile->setCsvControl(",", '"', '"');
I tried to figure out which charset the csv file has via file -I my.csv. Output: my.csv: application/octet-stream; charset=binary. That is a weird result as the file is readable via Textwrangler and is therefore NOT binary. I also read another csv generated in the same way and the output is as expected: second.csv: text/plain; charset=utf-8. The tool used to generate the csv files is called Visual Web Ripper (a tool for crawling web pages).
How can I determine which character this upside-down question mark is (it seems not to be the Spanish upside-down question mark - maybe just a placeholder inserted by Textwrangler)?
How can I delete this character and all "invalid" characters in my csv file? Is there a regular expression which matches every character, number and sign (punctuation and other textual symbols) that is in fact a real character and leaves out something like in the example above? I am looking for a Unicode-safe regular expression (it needs to preserve German umlauts and French, Russian, Chinese, Japanese and Korean characters as well). Alternatively: how can I convert a csv file with charset=binary to UTF-8?
Edit:
If I open it via nano editor it shows forgé.^# Circa. After a quick search it seems to be a NUL character or \u0000 (see comments and https://en.wikipedia.org/wiki/Null_character for reference).
Edit 2:
I dug a little more into it: it seems that there is a problem with the $splFile->current() function, which reads a line at the current file pointer. The line gets truncated after the NUL character (no matter whether I try to read it via SplFileObject::READ_CSV or just as a normal string, without the SplFileObject::READ_CSV flag).
The solution was to omit the SplFileObject::DROP_NEW_LINE parameter. I also checked if the NUL character is present: It is present, but it is now considered as part of the text value of the specific column in the csv and is NOT detected as quote or column enclosure.
Of course you now have to filter out empty lines yourself, e.g. with something like:
$splFileObject = new \SplFileObject($file);
$splFileObject->setFlags(\SplFileObject::SKIP_EMPTY | \SplFileObject::READ_AHEAD | \SplFileObject::READ_CSV);
$columns = $splFileObject->current();
if (count($columns) === 1 && array_key_exists(0, $columns) && $columns[0] === NULL) {
    // empty csv line
}
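If you also want to get rid of the NUL byte (and similar control characters) inside the parsed values, a small sketch along these lines should work; looping over the whole file and the exact character class used are just assumptions:
foreach ($splFileObject as $columns) {
    if (count($columns) === 1 && $columns[0] === NULL) {
        continue; // skip empty csv lines as above
    }
    // Strip NUL and other ASCII control characters from every field;
    // the /u modifier keeps multi-byte UTF-8 sequences intact.
    $columns = array_map(function ($value) {
        return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', (string) $value);
    }, $columns);
    // ... work with the cleaned $columns ...
}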

Import ods with newline in cells

I have an ods spreadsheet (managed with OpenOffice). Several cells contain multiple lines. The data table contents are used for display on a website.
When I import the file with phpmyadmin, these cells are truncated at the first newline character.
In the ods file, the newline character is char(10). In my case this has to be replaced with the string <br/>, the HTML newline tag. Writing a PHP program that does the replacement makes no sense since the newline character is already cut off after the import. For the moment I run a PC program that patches the char(10) with the '|' character in the ods file. After import, I replace the '|' with <br/> using PHP. Terrible! Is there a way to prevent the phpMyAdmin import from truncating on char(10)?
Thanks, Chris.
I had the same problem. My solution is not the perfect one but did the job for me.
What I did was replace the newline character in the ODS file so I could replace it back in PHP.
Open the ODS file, open the search & replace box, then search for \n and replace it with some unique string you can locate in PHP.
In my case I used something like -EOL-
In my PHP script I then replaced -EOL- with <br/>
I know it's not a shortcut but it's a solution...
Hope it works for you as well
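For what it's worth, a minimal sketch of the PHP side of this workaround; the -EOL- marker and the column name are just whatever was chosen in the spreadsheet:
// Turn the placeholder that was inserted into the ODS file back into an HTML line break.
$displayValue = str_replace('-EOL-', '<br/>', $row['description']); // 'description' is a made-up column name
echo $displayValue;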

Identify all non-standard special characters in an ANSI-encoded CSV

I have an ANSI-encoded CSV file that contains a number of 'problem' special characters. I'm looking for a script (preferably php or javascript) that I can use to check each record in the CSV and identify those that have problem characters.
I have no trouble looping through the CSV records, so I'm just looking for a good way to determine whether a single string contains any characters that would cause problems if the string was inserted directly into a UTF-8 encoded file.
Background: I used a script to convert an ANSI CSV directly to UTF-8 XML without taking care to convert the CSV to UTF-8, first. Boneheaded move on my part. The script created XML entities for records with problem characters, but all textNodes into which the script tried to insert text with problem characters ended up empty. What I'm looking for, now, is a way to parse the original CSV file and identify all records containing problem characters. With ~18,000 records, it's not a job that I'd like to do manually :-)
Clarification
I should have first converted the ANSI CSV to UTF-8, then run my 'convert to XML' script on the UTF-8 encoded CSV file. Instead, I skipped the first step and ran my 'convert to XML' script on the ANSI encoded CSV file. XML entities were created for all cells, but the XML entities for cells with characters such as — (em dash) and ½ (one half) were all empty. The 'convert to XML' script silently failed to insert these strings into the UTF-8 encoded XML document (using DOMDocument in PHP).
Folks, this is quick and dirty, but that's the kind of solution I needed in this situation. I used the following code to scan through the original CSV, looking at each character in each row. Any row containing a character with ord() > 127 I inserted into a second CSV. This new CSV file contained only the rows that had 'special' characters.
In this particular case, my original CSV was larger than 5MB, and the new CSV containing only rows with special characters was much smaller, on the order of a couple hundred KB, which made it much easier to work with.
$input_file  = fopen($input_filePath, 'rt');
$output_file = fopen($output_filePath, 'w');
// Get the column headers of the file
$headers = fgetcsv($input_file);
// Loop through each row
while (($row = fgetcsv($input_file)) !== FALSE)
{
    // Loop through each cell
    foreach ($headers as $i => $header)
    {
        $cell = $row[$i];
        // Loop through each char until we find a 'special' char
        // or reach the end of the cell, whichever comes first
        for ($j = 0; $j < strlen($cell); $j++) {
            if (ord(substr($cell, $j, 1)) > 127) {
                // If we find a special char, add this row to the new CSV file
                // and skip straight to the next row ("break 2" so the same row
                // isn't written once per offending cell)
                fputcsv($output_file, $row);
                break 2;
            }
        }
    }
}
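As an aside, the inner character loop could probably be replaced by a single regular expression test per cell; something like this sketch should be equivalent for a single-byte (ANSI) encoded file:
// True if the cell contains any byte outside the 7-bit ASCII range.
$hasSpecialChar = preg_match('/[\x80-\xFF]/', $cell) === 1;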

Find CSV cell enclosure

I'm currently working on a CSV handling class that mainly uses PHP's fgetcsv() function.
I'd like to be able to detect the CSV file's delimiter and enclosure character.
Now I'm just trying to figure out how to find the cell enclosure, knowing that I've got one hell of a file to parse:
## *CSV File* ##
,,,foo,bar,,cats are
dead,lorem ipsum,csv,"this cell's enclosure is set",,
Anyway I can't figure out a good algorithm; for now I've only thought of brute-forcing everything (reading the file with different enclosures and checking the output)...
You can try all known combinations and then check whether the outcome is valid:
All lines have the same number of values
If this is possible with no enclosure, you should prefer no enclosure.
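A rough sketch of that brute-force idea, where the candidate delimiters and enclosures are assumptions you would adjust to the formats you expect:
// Try every delimiter/enclosure combination and keep the first one that
// yields the same number of columns (> 1) on every line.
// (The "prefer no enclosure" rule is not implemented in this sketch.)
function detectCsvFormat($lines)
{
    $delimiters = array(',', ';', "\t", '|');
    $enclosures = array('"', "'");

    foreach ($delimiters as $d) {
        foreach ($enclosures as $e) {
            $counts = array();
            foreach ($lines as $line) {
                $counts[] = count(str_getcsv($line, $d, $e));
            }
            if (count(array_unique($counts)) === 1 && $counts[0] > 1) {
                return array('delimiter' => $d, 'enclosure' => $e);
            }
        }
    }
    return null; // nothing consistent found
}

$lines  = file('my.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES); // placeholder filename
$format = detectCsvFormat($lines);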

PHP generating csv not sending correct new line feeds

I have a script that generates a csv file using the following code:
header('Content-type: text/csv');
header('Content-Disposition: attachment; filename="'.date("Ymdhis").'.csv"');
print $content;
The $content variable simply contains lines with fields separated by commas, each line finalised with . "\n"; to generate a new line.
When I open the csv file it looks fine; however, when I try to use the file to import into an external program (MYOB) it does not recognise the End Of Line (\n) character and assumes one long line of text.
When I view the contents of the file in notepad, the end of line character (\n) is a small rectangle box which looks like the character code 0x7F.
If I open the file and re-save it in excel, it removes this character and replaces it with a proper end of line character and I can import the file.
What character do I need to be generating in PHP so that notepad recognises it as a valid End Of Line character? (\n) obviously doesn't do the job.
Use "\r\n". (with double quotes)
These are the ASCII characters for carriage return + line feed.
Historically this relates to the early days of computing on teletypes when the output was printed to paper and returning the carriage of the teletype head to the start of the line was a separate operation to feeding a line through the printer. You could overwrite lines by just doing a carriage return and insert blank lines by just a line feed. Doing both returned the head to the start of the line and fed it a new line to print on.
What precisely was required differed between systems -
Line feed only: - most Unix like systems
Carriage return plus Line feed: - DEC and MS-DOS based systems
Carriage return only: - Early Apple/Mac OS's
So what you're generating at the moment is a newline on a Unix system only. Wikipedia has quite a good page on this.
There are actually Unix command line tools to do the conversion too. The unix2dos and dos2unix commands convert ASCII text files back and forth between the Unix and DOS formats by converting line feed to carriage return plus line feed and vice versa.
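To make the quoting point concrete, here is a tiny sketch of how the CSV content might be built; the field names and values are made up:
// Double quotes make PHP interpret \r\n as an actual CR+LF pair;
// in single quotes it would stay as the literal characters backslash-r backslash-n.
$content  = implode(',', array('id', 'name', 'amount')) . "\r\n"; // header row
$content .= implode(',', array(1, 'Widget', 9.95)) . "\r\n";      // a data row

header('Content-type: text/csv');
header('Content-Disposition: attachment; filename="' . date("Ymdhis") . '.csv"');
print $content;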
Be sure to use double quotes around the \r\n, not single quotes, as mentioned in the previous answer!
I experienced the same issue. I later replaced the single quotes ' with double quotes " when building the $content variable, i.e. I kept the outer quotes double rather than single in the $content variable. And it worked :)
