Identify all non-standard special characters in an ANSI-encoded CSV - php

I have an ANSI-encoded CSV file that contains a number of 'problem' special characters. I'm looking for a script (preferably php or javascript) that I can use to check each record in the CSV and identify those that have problem characters.
I have no trouble looping through the CSV records, so I'm just looking for a good way to determine whether a single string contains any characters that would cause problems if the string were inserted directly into a UTF-8 encoded file.
Background: I used a script to convert an ANSI CSV directly to UTF-8 XML without taking care to convert the CSV to UTF-8 first. Boneheaded move on my part. The script created XML entities for records with problem characters, but all textNodes into which the script tried to insert text with problem characters ended up empty. What I'm looking for now is a way to parse the original CSV file and identify all records containing problem characters. With ~18,000 records, it's not a job I'd like to do manually :-)
Clarification
I should have first converted the ANSI CSV to UTF-8, then run my 'convert to XML' script on the UTF-8 encoded CSV file. Instead, I skipped the first step and ran my 'convert to XML' script on the ANSI encoded CSV file. XML entities were created for all cells, but the XML entities for cells with characters such as — (em dash) and ½ (one half) were all empty. The 'convert to XML' script silently failed to insert these strings into the UTF-8 encoded XML document (using DOMDocument in PHP).
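In other words, the step I skipped would have been something like this (just a sketch; $cell and $dom are placeholders, and I'm assuming the 'ANSI' source was Windows-1252):

// Hypothetical fix: convert each cell from Windows-1252 ("ANSI")
// to UTF-8 before handing it to DOMDocument
$utf8Cell = mb_convert_encoding($cell, 'UTF-8', 'Windows-1252');
$textNode = $dom->createTextNode($utf8Cell); // no longer silently empty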

Folks, this is quick and dirty, but that's the kind of solution I needed in this situation. I used the following code to scan through the original CSV, looking at each character in each row. Any row containing a character with ord() > 127 was written to a second CSV. This new CSV file contained only the rows that had 'special' characters.
In this particular case, my original CSV was larger than 5MB, and the new CSV containing only rows with special characters was much smaller, on the order of a couple hundred KB, which made it much easier to work with.
$input_file = fopen($input_filePath, 'rt');
$output_file = fopen($output_filePath, 'w');

// Get the column headers of the file
$headers = fgetcsv($input_file);

// Loop through each row
while (($row = fgetcsv($input_file)) !== FALSE)
{
    // Loop through each cell
    foreach ($headers as $i => $header)
    {
        $cell = $row[$i];
        // Loop through each char until we find a 'special' char
        // or reach the end of the cell, whichever comes first
        for ($j = 0; $j < strlen($cell); $j++) {
            if (ord(substr($cell, $j, 1)) > 127) {
                // If we find a special char, add this row to the new CSV
                // file once, then move on to the next row ('break' alone
                // would write the row again for every flagged cell)
                fputcsv($output_file, $row);
                break 2;
            }
        }
    }
}

fclose($input_file);
fclose($output_file);
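As a rough alternative (my sketch, not part of the original answer), a single preg_match per row can replace the per-character scan; without the /u modifier the pattern matches raw bytes, and bytes in the 0x80-0xFF range are exactly those with ord() > 127:

// Sketch of an equivalent per-row test on raw bytes
while (($row = fgetcsv($input_file)) !== FALSE) {
    if (preg_match('/[\x80-\xFF]/', implode('', $row))) {
        fputcsv($output_file, $row);
    }
}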

Related

How to convert UTF-8 to ANSI csv file

I'm generating CSV files in PHP, which works great, but I have to import these CSV files into Microsoft Dynamics AX, and here's the problem.
The CSV files I generate get "NUL" characters in some columns, and I have to strip those out to get clean CSV files for Dynamics AX.
Opening them in Notepad++, I saw that the CSV files are in UTF-8 BOM, and I want to convert them to ANSI; when I do the conversion to ANSI in Notepad++, all the NUL characters disappear.
I tried different things found on Stack Overflow, and it is with the iconv method that I obtained the best result, but it is far from perfect and from what I expect.
Here's the actual code:
fprintf($fp, chr(0xEF) . chr(0xBB) . chr(0xBF));
for ($a = 0; $a < count($tableau); $a++) {
    foreach ($tableau[$a] as $data) {
        fputcsv($fp, $data, ";", chr(0));
    }
}
$fp = iconv("UTF-8", "Windows-1252//TRANSLIT//IGNORE", $fp);
fclose($fp);
echo json_encode(responseAjax(true));
}
and I obtain this result:
I don't understand why it only applies to one cell instead of working on every cell which contains "NUL" characters.
I tried the mb_convert_encoding method, with no great result.
Any other idea, method or advice will be welcome,
thanks
"NUL" is the name generally given to a binary value of 00000000, which has the same meaning in all ASCII-compatible encodings, which includes UTF-8, Windows-1252, and most things that could be given the vague label "ANSI". So character encoding is not relevant here.
You appear to be explicitly adding it with chr(0) - specifically as the "enclosure" argument to fputcsv, so it's being used in place of quote marks around strings with spaces or punctuation. The string "Avion" doesn't need enclosing, which is why it doesn't have any NULs around it.
Let's add some comments to the code you've apparently copied without understanding:
// Output the three bytes known as the "UTF-8 BOM"
// - an invisible character used to help software guess that a file should be read as UTF-8
fprintf($fp, chr(0xEF) . chr(0xBB) . chr(0xBF));

// Loop over the data
for ($a = 0; $a < count($tableau); $a++) {
    foreach ($tableau[$a] as $data) {
        // Output a row of CSV data with the built-in function fputcsv
        // $data is the array of data you want to output on this row
        // Use ';' as the separator between columns
        // Use chr(0) - a NUL byte - to "enclose" fields with spaces or punctuation
        // The default would be to use comma (',') and quotes ('"')
        // See the PHP manual at https://php.net/fputcsv for more details
        fputcsv($fp, $data, ";", chr(0));
    }
}
The character you use for the "enclosure" is entirely up to you; most systems will probably expect the default " character, so you could use this:
fputcsv($fp, $data, ";");
Which is the same as this:
fputcsv($fp, $data, ";", '"');
The function doesn't support disabling the enclosure completely, but without it, CSV is fundamentally a very simple format - just combine the fields separated by some character, e.g. using implode:
fwrite($fp, implode(";", $data) . "\n"); // remember to add the line ending yourself
Character encoding is a completely separate issue. For that, you need to know two things:
What encoding is your data in
What encoding does the remote system want it in
If these don't match, you can use a function like iconv or mb_convert_encoding.
If your output is not UTF-8, you should also remove the line at the beginning of your code that outputs the UTF-8 BOM.
If your data is stored in UTF-8, and the remote system accepts data in UTF-8, you don't need to do anything here.
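For instance, here is a minimal sketch of that conversion (my assumption of the intended flow, reusing the question's $tableau and $fp, and assuming the target system wants Windows-1252 with no BOM):

// Sketch: convert each field from UTF-8 to Windows-1252 ("ANSI")
// as it is written; no BOM, since the output is no longer UTF-8
for ($a = 0; $a < count($tableau); $a++) {
    foreach ($tableau[$a] as $data) {
        $converted = array_map(function ($field) {
            return iconv("UTF-8", "Windows-1252//TRANSLIT", $field);
        }, $data);
        fputcsv($fp, $converted, ";");
    }
}
fclose($fp);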

How to keep leading zeroes in numeric string values when a csv file created with PHP is imported to Excel?

I have a code which creates a CSV file and puts certain data in there. Some of this data is text and some are numeric strings. When this CSV file is imported in Excel the program removes the leading zeroes from the numeric strings (phone numbers and zip codes). Is there a way I can format/change these numeric string values so that the Excel can read them in a way that it'll keep the leading zeroes? Or is just the Excel the problem and this problem should be worked from there?
I have tried adding apostrophe before the numeric strings and the numbers will be there but the apostrophe will also stay and I don't want that.
$dataorder = ["Receiver:", $_POST['orderperson'], $_POST['address'], $_POST['postnumber'], $_POST['city'], $_POST['email'], $_POST['phone']];
fputcsv($fh, $dataorder, $delimeter);
Excel does not know how to handle a field which contains just numbers, and thus it trims any leading zeros. There is nothing you can do during the CSV generation process. What you can do is tell Excel how to treat each data column when importing the CSV file. At step 3 of the text import wizard (Data tab > From Text > select the csv file), Excel lets you choose each column's data type, which defaults to General; change it to Text for the numeric fields and it will keep all the leading zeros. Here is a screenshot of step 3 of the text importer; it's in Greek but you'll get the point (Γενική -> General, Κείμενο -> Text):

Regex to delete faulty characters within a csv file to make SplFileObject work correctly in PHP

I'm trying to parse a csv file in PHP via SplFileObject. Sadly, SplFileObject sometimes gets stuck when there are erroneous invisible characters in the text: it detects a quote instead of skipping the character or reading it as a normal one while iterating over the lines in the csv file.
A screenshot from Textwrangler showed the character; copied out as plain text it looks like this (the invisible char should be between "forgé." and "Circa"):
Fer forgé.� Circa
My code (SplFileObject part):
$splFile = new \SplFileObject($file);
$splFile->setFlags(\SplFileObject::DROP_NEW_LINE | \SplFileObject::SKIP_EMPTY | \SplFileObject::READ_AHEAD | \SplFileObject::READ_CSV);
$splFile->setCsvControl(",", '"', '"');
I tried to figure out which charset the csv file has via file -I my.csv. Output: my.csv: application/octet-stream; charset=binary. That is a weird result, as the file is readable via Textwrangler and is therefore NOT binary. I also checked another csv generated in the same way, and the output is as expected: second.csv: text/plain; charset=utf-8. The tool used to generate the csv files is called Visual Web Ripper (a tool for crawling web pages).
How can I determine which character this upside-down question mark is (it seems not to be the Spanish upside-down question mark - maybe it's just a placeholder inserted by Textwrangler)?
How can I delete this character and all "invalid" characters in my csv file? Is there a regular expression which matches every letter, number and sign (punctuation and other textual symbols) that is in fact a real character, and leaves out something like the example above? I'm looking for a Unicode-safe regular expression (it needs to preserve German umlauts and French, Russian, Chinese, Japanese and Korean characters as well). Alternatively: how can I convert a csv file with charset=binary to UTF-8?
Edit:
If I open it via nano editor it shows forgé.^# Circa. After a quick search it seems to be a NUL character or \u0000 (see comments and https://en.wikipedia.org/wiki/Null_character for reference).
Edit 2:
I dug a little more into it: it seems there is a problem with the $splFile->current() function, which reads a line at the current file pointer. The line gets truncated after the NUL character (no matter whether I try to read it via SplFileObject::READ_CSV or just as a normal string, without the SplFileObject::READ_CSV parameter).
The solution was to omit the SplFileObject::DROP_NEW_LINE parameter. I also checked whether the NUL character is present: it is, but it is now considered part of the text value of the specific column in the csv and is NOT detected as a quote or column enclosure.
Of course, you now have to filter out empty lines yourself, e.g. with something like:
$splFileObject = new \SplFileObject($file);
$splFileObject->setFlags(\SplFileObject::SKIP_EMPTY | \SplFileObject::READ_AHEAD | \SplFileObject::READ_CSV);
$columns = $splFileObject->current();
if (count($columns) === 1 && array_key_exists(0, $columns) && $columns[0] === NULL) {
    // empty csv line
}
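And if you do want to delete the NUL bytes (and other invisible control characters) from the field values themselves, a sketch along these lines should work; the exact character class is my assumption of what counts as "invalid" here:

// Sketch: strip NUL and other C0 control bytes from each field,
// keeping tab, LF and CR; widen or narrow the class as needed
$columns = array_map(function ($field) {
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', (string) $field);
}, $columns);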

php merging txt files, issue with encoding

I found this code on stackoverflow, from user @Attgun:
link: merge all files in directory to one text file
<?php
// Name of the directory containing all files to merge
$Dir = "directory";
// Name of the output file
$OutputFile = "filename.txt";
// Scan the files in the directory into an array
$Files = scandir($Dir);
// Create a stream to the output file. Use "w" to start a new output
// file from zero; if you want to append to an existing file, use "a".
$Open = fopen($OutputFile, "w");
// Loop through the files, read their content into a string variable
// and write it to the file stream. Then, clean the variable.
foreach ($Files as $k => $v) {
    if ($v != "." AND $v != "..") {
        $Data = file_get_contents($Dir."/".$v);
        fwrite($Open, $Data);
    }
    unset($Data);
}
// Close the file stream
fclose($Open);
?>
The code works right, but when it is merging, PHP inserts a character at the beginning of every file copied. The file encoding I am using is UCS-2 LE.
I can see that character when I change the encoding to ANSI.
My problem is that I can't use another encoding than UCS-2 LE.
Can someone help me with this problem?
Edit: I don't want to change the file encoding. I want to keep the same encoding without PHP adding another character.
@AlexHowansky motivated me to search for another way.
The solution that seems to work, without messing with the file encoding, is this:
bat file:
@echo on
copy *.txt all.txt
@pause
Now the final file keeps the encoding of the files that it reads.
My compiler doesn't show any error message like before!
Most PHP string functions are encoding-agnostic. They merely see strings as a collection of bytes. You may append a b to the fopen() call in order to be sure that line feeds are not mangled, but nothing in your code should change the actual encoding.
UCS-2 (as well as its successor UTF-16 and some other members of the UTF family) is a special case, because the Unicode standard defines two possible orders in which the bytes of a multi-byte character can be written (this has the fancy name of endianness), and the order used is signalled by the presence of a byte order mark (BOM) at the start of the file.
That prefix is what prevents raw file concatenation from working. However, it's still a pretty simple format: all that's needed is removing the BOM from all files but the first one.
To be honest, I couldn't find what the BOM is for UCS-2 (it's an obsolete encoding and it's no longer present in most Unicode documentation), but since you have several samples you should be able to see it yourself. Making the assumption that it's the same as in UTF-16 LE (FF FE), you'd just need to omit two bytes, e.g.:
$Data = file_get_contents ($Dir."/".$v);
fwrite ($Open, substr($Data, 2));
I've composed a little self-contained example. I don't have any editor that's able to handle UCS-2, so I've used UTF-16 LE. The BOM is FF FE (you can inspect your BOM with a hexadecimal editor like hexed.it):
// Two sample files: the BOM (FF FE) followed by "a" and "b" in UTF-16 LE
file_put_contents('a.txt', hex2bin('FFFE6100'));
file_put_contents('b.txt', hex2bin('FFFE6200'));

$output = fopen('all.txt', 'wb');
$first = true;
foreach (scandir(__DIR__) as $position => $file) {
    if (pathinfo($file, PATHINFO_EXTENSION) === 'txt' && $file !== 'all.txt') {
        $data = file_get_contents($file);
        // Keep the BOM of the first file; strip the first two bytes
        // (the BOM) from every subsequent file
        fwrite($output, $first ? $data : substr($data, 2));
        $first = false;
    }
}
fclose($output);

var_dump(
    bin2hex(file_get_contents('a.txt')),
    bin2hex(file_get_contents('b.txt')),
    bin2hex(file_get_contents('all.txt'))
);
string(8) "fffe6100"
string(8) "fffe6200"
string(12) "fffe61006200"
As you can see, we end up with a single BOM on top and no other byte has been changed. Of course, this assumes that all your text files have the same encoding and that the encoding is exactly the one you think.

standardizing CSV file types

I'm using a csv parser class (http://code.google.com/p/php-csv-parser/) to parse and extract data from csv files. The problem I'm encountering is that it only works for certain csv file types. (It seems that there is a csv type for Mac, for MS-DOS, and for Windows.)
The code works if I use a csv file which was saved on a Mac (in Excel) using the csv - Windows option. However, if I save a file on a Windows machine simply as csv, that doesn't work. (You would think that would be the same format as saving csv-Windows on a Mac.) It does work from a Windows machine if I save it as a csv-MSDOS file. This seems a little ridiculous.
Is there a way to standardize these three file types so that my code can read any type of csv that is uploaded?
I'm thinking it would be something like this:
$standardizedCSV = preg_replace('/\r(?!\n)/', "\r\n", $csvContent);
I know it has something to do with how each file type handles end of lines, but I'm a little put out trying to figure out those differences. If anybody has any advice, please let me know.
Thanks.
UPDATE:
This is the relevant code from the csv parser I'm using which extracts data row by row:
$c = 0;
$d = $this->settings['delimiter'];
$e = $this->settings['escape'];
$l = $this->settings['length'];
$res = fopen($this->_filename, 'r');
while ($keys = fgetcsv($res, $l, $d, $e)) {
    if ($c == 0) {
        $this->headers = $keys;
    } else {
        array_push($this->rows, $keys);
    }
    $c++;
}
I guess I need to understand how fgetcsv handles eol's, so that I can make sure that csv files of any format are handled in the same manner.
This seems to do the trick:
ini_set("auto_detect_line_endings", true);
The problem was with line endings, but I didn't need to create my own EOL parser. This runtime setting does it for me. See http://us.php.net/manual/en/filesystem.configuration.php#ini.auto-detect-line-endings.
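For completeness, the setting has to be in effect before the file is opened; a minimal sketch of the placement ($csvFile is a placeholder, and note that auto_detect_line_endings is deprecated as of PHP 8.1, since classic Mac line endings are considered obsolete):

// Enable detection of classic Mac (\r) line endings before fopen()
ini_set('auto_detect_line_endings', '1');
$res = fopen($csvFile, 'r');
while (($keys = fgetcsv($res)) !== false) {
    // ... process $keys as before
}
fclose($res);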
I don't think the line endings are the issue. The thing about CSV is that it's only a "comma separated values" file and not standardized beyond that. So some systems separate the values using commas, some using semicolons (;). I'm sure there are variations that use yet other value separators.
Additionally, the escape character (most often backslash \) can be different between CSV files, and some CSV files also use quotation marks around each value (").
A CSV file can use any variation between the above. For instance, I'm fairly certain that Microsoft Excel exports CSV files separating values using semicolons and without any quotation around the values.
I'm sure there are ways to auto-detect how to parse the CSV file, but the best way would be allowing the user to decide. That's what Excel does.
If you use CSV files, you have to agree on many details which are not properly standardized:
Line endings (Unix 0x0a, Macintosh 0x0d, DOS 0x0d 0x0a)
Field separators (comma, semicolon etc.)
Field quoting (all fields quoted, only string fields, only string fields containing field and line separators)
Escaping of double quotes within string fields (doubling of double quotes, backslash character before double quote etc.)
Multiline string fields (are they allowed or not)
File encoding (ISO-8859-1, UTF-8 etc.)
If you create a CSV reader, you can automatically handle different variations of line endings and field quoting. But the rest has to be known to the CSV parser beforehand.
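As a sketch of what "known beforehand" means in PHP terms (the ';', '"' and '\\' below are assumed example values, not a standard):

// The dialect details are passed explicitly to the parser
$fh = fopen('data.csv', 'r');
while (($row = fgetcsv($fh, 0, ';', '"', '\\')) !== false) {
    // each $row is one record as an array of fields
    print_r($row);
}
fclose($fh);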
The de facto standard is the CSV format produced by Excel. However, Excel uses different variations of the format:
Usually DOS line endings (but I've never tried it with Excel for Macintosh)
Field separator depending on the locale. If the comma is used to group the digits in long numbers, Excel uses the semicolon as field separator. Otherwise the comma.
Excel uses double quotes if needed.
Excel doubles the double quotes within string fields.
Excel supports multiline string fields.
The file encoding seems to be the file encoding of the current locale. So it varies.
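Two of those conventions (quote doubling and multiline fields) can be seen in PHP's own fputcsv output, which follows the same Excel-style rules; a quick check (the data is just illustrative):

// fputcsv doubles embedded quotes and keeps a multiline value
// inside a single quoted cell
$fh = fopen('php://temp', 'r+');
fputcsv($fh, ['He said "hi"', "two\nlines", 'plain']);
rewind($fh);
echo stream_get_contents($fh);
// "He said ""hi""","two
// lines",plain
fclose($fh);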
