I'm currently working on a CSV handling class that mainly uses PHP's fgetcsv() function.
I'd like to be able to detect the CSV file's delimiter and enclosure character.
Right now I'm just trying to figure out how to find the cell enclosure, knowing that I've got a nasty file to parse:
## *CSV File* ##
,,,foo,bar,,cats are
dead,lorem ipsum,csv,"this cell's enclosure is set",,
Anyway, I can't figure out a good algorithm. For now I have only thought of brute-forcing everything (reading the file with different enclosures and checking the output)...
You can try all known combinations and then check whether the outcome is valid:
All lines have the same number of values.
If this is possible with no enclosure, you should prefer no enclosure.
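A minimal sketch of that brute-force check, under some assumptions: detectCsvFormat() is a hypothetical helper (not part of PHP), the candidate delimiters and enclosures are arbitrary defaults, and a candidate only counts as valid if every line yields the same number of values and that number is greater than one. Candidates without an enclosure are tried first, so they win when they work. It parses line by line, so cells containing line breaks are not handled.

```php
<?php
/**
 * Hypothetical helper: try delimiter/enclosure candidates and return the
 * first pair for which all lines have the same number (> 1) of values.
 * A null enclosure means "no enclosure" and is tried first.
 */
function detectCsvFormat(
    array $lines,
    array $delimiters = [',', ';', "\t"],
    array $enclosures = [null, '"', "'"]
): ?array {
    foreach ($enclosures as $enclosure) {
        foreach ($delimiters as $delimiter) {
            $counts = [];
            foreach ($lines as $line) {
                // Without an enclosure a plain split is enough.
                $fields = $enclosure === null
                    ? explode($delimiter, $line)
                    : str_getcsv($line, $delimiter, $enclosure, '');
                $counts[count($fields)] = true;
            }
            // Valid only if there is exactly one distinct field count, above 1.
            if (count($counts) === 1 && key($counts) > 1) {
                return ['delimiter' => $delimiter, 'enclosure' => $enclosure];
            }
        }
    }
    return null; // no candidate produced a consistent result
}
```

In practice you would feed it the first few lines of the file, read with fgets(), rather than the whole file.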
Related
I am trying to parse a CSV file in PHP via SplFileObject. Sadly, SplFileObject sometimes gets stuck if there are erroneous invisible characters in the text. It detects a quote instead of skipping the character or reading it as a normal character while iterating over the lines in the CSV file.
The screenshot below is from Textwrangler:
I also copied it from Textwrangler here (invisible char should be between "forgé." and "Circa"):
Fer forgé.� Circa
My code (SplFileObject part):
$splFile = new \SplFileObject($file);
$splFile->setFlags(\SplFileObject::DROP_NEW_LINE | \SplFileObject::SKIP_EMPTY | \SplFileObject::READ_AHEAD | \SplFileObject::READ_CSV);
$splFile->setCsvControl(",", '"', '"');
I tried to figure out which charset the CSV file has via file -I my.csv. Output: my.csv: application/octet-stream; charset=binary. That is a weird result, as the file is readable in Textwrangler and is therefore NOT binary. I also read another CSV generated in the same way and the output is as expected: second.csv: text/plain; charset=utf-8. The tool used to generate the CSV files is called Visual Web Ripper (a tool for crawling web pages).
How can I determine which character this upside-down question mark is (it does not seem to be the Spanish upside-down question mark, so maybe it is just a placeholder inserted by Textwrangler)?
How can I delete this character and all "invalid" characters in my CSV file? Is there a regular expression that matches every letter, number and sign (punctuation and other textual symbols) that is a real character, and leaves out something like the example above? I am looking for a Unicode-safe regular expression (I need to preserve German umlauts as well as French, Russian, Chinese, Japanese and Korean characters). Alternatively: how can I convert a CSV file with charset=binary to UTF-8?
Edit:
If I open it via nano editor it shows forgé.^# Circa. After a quick search it seems to be a NUL character or \u0000 (see comments and https://en.wikipedia.org/wiki/Null_character for reference).
Edit 2:
I dug a little more into it: it seems that there is a problem with the $splFile->current() function, which reads a line at the current file pointer. The line gets truncated after the NUL character (no matter whether I read it via SplFileObject::READ_CSV or just as a normal string, without the SplFileObject::READ_CSV flag).
The solution was to omit the SplFileObject::DROP_NEW_LINE parameter. I also checked if the NUL character is present: It is present, but it is now considered as part of the text value of the specific column in the csv and is NOT detected as quote or column enclosure.
Of course, you now have to filter out empty lines yourself, for example with something like:
$splFileObject = new \SplFileObject($file); // the constructor requires a filename
$splFileObject->setFlags(\SplFileObject::SKIP_EMPTY | \SplFileObject::READ_AHEAD | \SplFileObject::READ_CSV);
$columns = $splFileObject->current();
if (count($columns) === 1 && array_key_exists(0, $columns) && $columns[0] === NULL) {
    // empty csv line
}
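As for the original question of deleting the NUL character without touching umlauts or CJK text, here is a small sketch. stripControlChars() is a hypothetical helper name, and the assumption is that the file is UTF-8 (or another ASCII-compatible encoding). In UTF-8 every byte of a multi-byte character is 0x80 or higher, so removing only bytes below 0x20 (plus 0x7F) cannot damage them. Tab, CR and LF are kept.

```php
<?php
/**
 * Hypothetical helper: remove ASCII control characters (including NUL)
 * while keeping tab (0x09), line feed (0x0A) and carriage return (0x0D).
 * Works byte-wise, so UTF-8 multi-byte sequences are left untouched.
 */
function stripControlChars(string $text): string
{
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $text);

    // preg_replace() returns null on failure; fall back to the input then.
    return $clean ?? $text;
}
```

You could run each line through it before parsing, or clean the whole file once and parse the cleaned copy.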
I have a file that I wish to read in php and split into smaller files.
The file is base64-encoded, but each section is delimited in the file by an (unencoded) tilde, followed by the original filename of the base64-encoded data, followed by another tilde.
As a silly example, the file could look like:
NbAYnnBBA~file1.txt~NbAYnnBBANbAYnnBBANbAYnnBBANbAYnnBBA~file2.txt~
I don't want to use file_get_contents, as the files could be huge and I don't want to hit memory limits.
Can anyone think of a way of doing it without having to use fgetc to read it a char at a time?
There are no line breaks in the file, by the way. It is one continuous block.
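One possible approach, sketched under assumptions: read the file with fread() in fixed-size chunks and split each chunk on the tilde, tracking whether the reader is currently inside a filename. readTildeSegments() is a hypothetical name, and the layout assumed is the one from the example above (data, then ~filename~, repeated). Data pieces are yielded as they arrive, so only the short filename is ever buffered. The sketch does not base64-decode anything.

```php
<?php
/**
 * Hypothetical helper: stream a "data~name~data~name~" file in chunks.
 * Yields ['data', $piece] for each piece of payload and ['name', $filename]
 * once a complete filename between two tildes has been read.
 */
function readTildeSegments($handle, int $chunkSize = 8192): \Generator
{
    $inName = false; // true while reading between an opening and closing tilde
    $name = '';

    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $parts = explode('~', $chunk);
        $last = count($parts) - 1;
        foreach ($parts as $i => $part) {
            if ($inName) {
                $name .= $part;          // filenames may span chunk borders
            } elseif ($part !== '') {
                yield ['data', $part];
            }
            if ($i < $last) {            // a tilde followed this part
                if ($inName) {
                    yield ['name', $name];
                    $name = '';
                }
                $inName = !$inName;
            }
        }
    }
}
```

Because the filename comes after its data, a caller would typically append the data pieces to a temporary file and rename it once the matching 'name' event arrives. If the pieces are decoded on the fly, they have to be regrouped into multiples of four characters first.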
I'm using a csv parser class (http://code.google.com/p/php-csv-parser/) to parse and extract data from csv files. The problem I'm encountering is that it only works for certain csv file types. (It seems that there is a csv type for Mac, for Ms-Dos, and for Windows.)
The code works if I use a csv file which was saved on a mac (in excel) using the csv - windows option. However, if I save a file on a windows machine simply as csv, that doesn't work. (You would think that that would be the same format as saving csv-windows on a mac.) It does work from a windows machine if I save it as a csv-MSDOS file. This seems a little ridiculous.
Is there a way to standardize these three file types so that my code can read any type of csv that is uploaded?
I'm thinking it would be something like this:
$standardizedCSV = preg_replace('/\r(?!\n)/', "\r\n", $csvContent);
I know it has something to do with how each file type handles end of lines, but I'm a little put out trying to figure out those differences. If anybody has any advice, please let me know.
Thanks.
UPDATE:
This is the relevant code from the csv parser I'm using which extracts data row by row:
$c = 0;
$d = $this->settings['delimiter'];
$e = $this->settings['escape'];
$l = $this->settings['length'];
$res = fopen($this->_filename, 'r');
while ($keys = fgetcsv($res, $l, $d, $e)) {
    if ($c == 0) {
        $this->headers = $keys;
    } else {
        array_push($this->rows, $keys);
    }
    $c++;
}
I guess I need to understand how fgetcsv handles eol's, so that I can make sure that csv files of any format are handled in the same manner.
This seems to do the trick:
ini_set("auto_detect_line_endings", true);
The problem was with line endings, but I didn't need to create my own EOL parser. This runtime setting does it for me. See http://us.php.net/manual/en/filesystem.configuration.php#ini.auto-detect-line-endings.
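Note that the auto_detect_line_endings setting is deprecated as of PHP 8.1. A possible alternative, sketched here with hypothetical helper names (normalizeLineEndings() and parseCsvString()), is to normalize every line ending to "\n" before parsing. This assumes the CSV fits in memory and has no line breaks inside quoted fields.

```php
<?php
/**
 * Hypothetical helper: turn DOS (\r\n) and old Mac (\r) line endings
 * into Unix (\n) line endings.
 */
function normalizeLineEndings(string $text): string
{
    return preg_replace('/\r\n?/', "\n", $text) ?? $text;
}

/**
 * Hypothetical helper: parse a whole CSV string row by row after
 * normalizing its line endings. Empty lines are skipped.
 */
function parseCsvString(string $csv, string $delimiter = ','): array
{
    $rows = [];
    foreach (explode("\n", normalizeLineEndings($csv)) as $line) {
        if ($line === '') {
            continue;
        }
        $rows[] = str_getcsv($line, $delimiter, '"', '');
    }
    return $rows;
}
```

The first row can then be treated as the headers and the rest as data, as in the parser code above.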
I don't think the line endings are the issue. The thing about CSV is that it's only a "comma separated values" file and not standardized beyond that. So some systems separate the values using commas, some using semicolons (;). I'm sure there are variations that use other value separators as well.
Additionally, the escape character (most often backslash \) can be different between CSV files, and some CSV files also use quotation marks around each value (").
A CSV file can use any variation between the above. For instance, I'm fairly certain that Microsoft Excel exports CSV files separating values using semicolons and without any quotation around the values.
I'm sure there are ways to auto-detect how to parse the CSV file, but the best way would be allowing the user to decide. That's what Excel does.
If you use CSV files, you have to agree on many details which are not properly standardized:
Line endings (Unix 0x0a, Macintosh 0x0d, DOS 0x0d 0x0a)
Field separators (comma, semicolon etc.)
Field quoting (all fields quoted, only string fields, only string fields containing field and line separators)
Escaping of double quotes within string fields (doubling of double quotes, backslash character before double quote etc.)
Multiline string fields (are they allowed or not)
File encoding (ISO-8859-1, UTF-8 etc.)
If you create a CSV reader, you can automatically handle different variations of line endings and field quoting. But the rest has to be known to the CSV parser beforehand.
The de facto standard is the CSV format produced by Excel. However, Excel uses different variations of the format:
Usually DOS line endings (but I've never tried it with Excel for Macintosh)
Field separator depending on the locale. If the comma is used to group the digits in long numbers, Excel uses the semicolon as field separator. Otherwise the comma.
Excel uses double quotes if needed.
Excel doubles the double quotes within string fields.
Excel supports multiline string fields.
The file encoding seems to be the file encoding of the current locale. So it varies.
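To illustrate the Excel conventions listed above, here is a rough sketch of reading such a file with fgetcsv(). readExcelStyleCsv() is a hypothetical helper, and the semicolon default is only an assumption for a locale that uses the comma in numbers. Passing an empty string as the escape character (allowed since PHP 7.4) makes doubled double quotes the only quoting mechanism. Encoding conversion is left out.

```php
<?php
/**
 * Hypothetical helper: read an Excel-style CSV from an open stream.
 * Double quotes are used as enclosure, doubled inside fields, and quoted
 * fields may span several lines.
 */
function readExcelStyleCsv($handle, string $delimiter = ';'): array
{
    $rows = [];
    while (($fields = fgetcsv($handle, 0, $delimiter, '"', '')) !== false) {
        // fgetcsv() returns an array holding a single null for a blank line.
        if ($fields === [null]) {
            continue;
        }
        $rows[] = $fields;
    }
    return $rows;
}
```

For a comma-separated export the same function can be called with ',' as the second argument.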
I have a txt file with a list of countries. For my form I just read all the data into a select list, line by line, with fgets(). That works fine except for some problems.
1) When I have a country with ¨ on a letter, it shows up in the list as a blank.
2) When I put the data in an XML file at the end, it seems there is a return at the end of each value, in the form of '&#xD;'.
So my question: is there a way to fix these problems, or is there a better way to read data from a file? Or should I use a file type other than txt?
It sounds like a problem with the text encoding. You could try running htmlentities on the text before echoing it out. Another solution is to use utf8_encode or utf8_decode (depending on which encoding your pages are served as, and on the encoding of the file).
In character data, the carriage-return (#xD) character is represented by &#xD;
Just make sure that after you've read each line, you call str_replace("\r", '', $line) on it to remove the carriage-return character from the end of the line. Note the double quotes: in single quotes, '\r' is a literal backslash followed by the letter r.
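A small sketch of that, assuming the list is read from an already opened file handle. readLines() is a hypothetical helper name. It uses rtrim() with "\r\n" so that both the line feed and a preceding carriage return are removed from the end of each line.

```php
<?php
/**
 * Hypothetical helper: read all non-empty lines from an open handle and
 * strip the trailing line break (LF, or CR followed by LF) from each one.
 */
function readLines($handle): array
{
    $lines = [];
    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n");
        if ($line !== '') {
            $lines[] = $line;
        }
    }
    return $lines;
}
```

Each returned value can then be written into the select list or the XML without a stray carriage return.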
I sometimes import data from CSV files that were provided to me into a MySQL table.
In the last one I did, some of the entries have a weird bad character in front of the actual data, and it got imported into my database. Now I'm looking for a way to clean it up.
The bad data is in the mysql column 'email', it seems to be always right in front of the actual data. When trying to print it on my screen using PHP, it shows up as �. When exporting it to a CSV file, it looks like  , and if I SET CHARACTER SET utf8 before printing it on the screen using PHP, it looks like a normal space ' '.
I was thinking of writing a PHP script that goes over all my rows one at a time, fix the email address field, and update the row. However I'm not quite sure about the "fix the email" part!
I was thinking maybe to do a "explode" and use the bad character as a delimiter, but I don't know how to type that character into my code.
Is there maybe a way to find the underlying value/utf8/hex or whatever of that character, then find it in the string?
I hope it's clear enough.
Thanks
EDIT:
In Hex, it looks like it's A0. What can I do to search and delete a character by its hex value? Either in PHP or directly in MySQL I guess ...
SELECT HEX(field) FROM table; should help determine the character.
As an alternative solution, it might actually be easier to fix the issue at the source. I've encountered similar problems with CSV files exported from Excel and have generally found that using something along the lines of...
$correctedLine = mb_convert_encoding($sourceLine, 'UTF-8', 'Windows-1252');
...tends to rectify the issue. (That said, you'll need to ensure that you have the multi byte string extension compiled in/enabled.)
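A short sketch of what that conversion does to the A0 byte from the question, assuming the source data really is Windows-1252 and the mbstring extension is loaded. toUtf8() and trimLeadingNbsp() are hypothetical helper names. In Windows-1252 the byte A0 is a non-breaking space, which becomes the two bytes C2 A0 in UTF-8 and can then be trimmed with a Unicode-aware pattern.

```php
<?php
/**
 * Hypothetical helper: convert a Windows-1252 string to UTF-8.
 */
function toUtf8(string $value): string
{
    return mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
}

/**
 * Hypothetical helper: remove leading whitespace, including the
 * non-breaking space (U+00A0), from a UTF-8 string.
 */
function trimLeadingNbsp(string $utf8): string
{
    return preg_replace('/^[\s\x{00A0}]+/u', '', $utf8) ?? $utf8;
}
```

Applied in that order to each imported value, this leaves the rest of the address intact and only drops the leading space-like characters.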
You can trim any leading unprintable ASCII char with something like:
update t set email = substr(email, 2) where ascii(email) not between 32 and 126;
You can get the ASCII value of the offending char with this:
select ascii(email) as first_char from t;
I think I found a PHP answer that seems to work more reliably:
$newemail = preg_replace('/\xA0/', '', $row['oldemail']);
And then I'm going to update the row with the new email
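For finding the hex value from PHP and removing the character by that value, something along these lines might do. leadingByteHex() and stripByte() are hypothetical helper names. The sketch assumes the column holds single-byte (Latin-1 or Windows-1252) data. On UTF-8 data, removing a bare A0 byte could corrupt other characters, since for example "à" is encoded as C3 A0.

```php
<?php
/**
 * Hypothetical helper: return the first byte of a string as upper-case hex,
 * or an empty string for empty input.
 */
function leadingByteHex(string $value): string
{
    return $value === '' ? '' : strtoupper(bin2hex($value[0]));
}

/**
 * Hypothetical helper: remove every occurrence of a single byte,
 * given as an integer such as 0xA0.
 */
function stripByte(string $value, int $byte): string
{
    return str_replace(chr($byte), '', $value);
}
```

For the row from the question, leadingByteHex($row['oldemail']) should report A0, and stripByte($row['oldemail'], 0xA0) gives the cleaned address to write back.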