I have a script that generates a csv file using the following code:
header('Content-type: text/csv');
header('Content-Disposition: attachment; filename="'.date("Ymdhis").'.csv"');
print $content;
The $content variable simply contains lines with fields separated by commas and then finalised with ."\n"; to generate a new line.
When I open the CSV file it looks fine; however, when I try to import it into an external program (MYOB), it does not recognise the end-of-line character (\n) and treats the content as one long line of text.
When I view the contents of the file in notepad, the end of line character (\n) is a small rectangle box which looks like the character code 0x7F.
If I open the file and re-save it in excel, it removes this character and replaces it with a proper end of line character and I can import the file.
What character do I need to be generating in PHP so that notepad recognises it as a valid End Of Line character? (\n) obviously doesn't do the job.
Use "\r\n". (with double quotes)
The above are the ASCII characters for carriage return + line feed.
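As a minimal sketch of that advice, the lines of $content could be assembled as follows. The helper name build_csv_content and the sample rows are made up for illustration, and the fields are assumed to contain no commas, quotes or line breaks.

```php
<?php
// Hypothetical helper: joins each row with commas and ends every line with CR+LF.
// Assumes the fields themselves contain no commas, quotes or line breaks.
function build_csv_content(array $rows): string
{
    $content = '';
    foreach ($rows as $row) {
        // Double quotes are required so that \r\n becomes the two bytes 0x0D 0x0A.
        $content .= implode(',', $row) . "\r\n";
    }
    return $content;
}

// Sample data, invented for this example.
$content = build_csv_content([
    ['id', 'name'],
    ['1', 'Alice'],
]);
```

The resulting $content can then be printed after the two header() calls from the question.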
Historically this relates to the early days of computing on teletypes when the output was printed to paper and returning the carriage of the teletype head to the start of the line was a separate operation to feeding a line through the printer. You could overwrite lines by just doing a carriage return and insert blank lines by just a line feed. Doing both returned the head to the start of the line and fed it a new line to print on.
What precisely was required differed between systems -
Line feed only: - most Unix like systems
Carriage return plus Line feed: - DEC and MS-DOS based systems
Carriage return only: - Early Apple/Mac OS's
So what you're generating at the moment is a newline on a Unix system only. Wikipedia has quite a good page on this.
There are actually Unix command line tools to do the conversion too. The unix2dos and dos2unix commands convert ASCII text files back and forth between the Unix and DOS formats by converting line feed to carriage return plus line feed and vice versa.
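If those tools are not available, roughly the same conversion can be sketched in PHP. The function names below are hypothetical, not part of PHP or of the tools themselves.

```php
<?php
// Hypothetical PHP counterparts of the unix2dos and dos2unix tools.

// Converts CR+LF to a bare LF.
function dos_to_unix(string $text): string
{
    return str_replace("\r\n", "\n", $text);
}

// Converts LF to CR+LF. Existing CR+LF pairs are collapsed first so that
// they are not turned into CR+CR+LF.
function unix_to_dos(string $text): string
{
    return str_replace("\n", "\r\n", dos_to_unix($text));
}
```

Collapsing first means the function can be applied to a file that already has mixed line endings.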
Be sure to use double quotes around the \r\n, not the single quotes as mentioned in the previous answer!
I experienced the same issue. I then replaced the single quotes (') with double quotes (") when building the $content variable, i.e. the outer quotes of the strings appended to $content are now double rather than single.
And it worked :)
I am trying to parse a CSV file in PHP via SplFileObject. Sadly, SplFileObject sometimes gets stuck if there are erroneous invisible characters in the text. It detects a quote instead of skipping the character or reading it as a normal character while iterating over the lines of the CSV file.
The screenshot below is from Textwrangler:
I also copied it from Textwrangler here (invisible char should be between "forgé." and "Circa"):
Fer forgé.� Circa
My code (SplFileObject part):
$splFile = new \SplFileObject($file);
$splFile->setFlags(\SplFileObject::DROP_NEW_LINE | \SplFileObject::SKIP_EMPTY | \SplFileObject::READ_AHEAD | \SplFileObject::READ_CSV);
$splFile->setCsvControl(",", '"', '"');
I tried to figure out which charset the CSV file has via file -I my.csv. Output: my.csv: application/octet-stream; charset=binary. That is a weird result, as the file is readable in Textwrangler and is therefore NOT binary. I also read another CSV generated in the same way and the output is as expected: second.csv: text/plain; charset=utf-8. The tool used to generate the CSV files is called Visual Web Ripper (a tool for crawling web pages).
How can I determine which character this upside-down question mark is? (It does not seem to be the Spanish upside-down question mark; maybe it is just a placeholder inserted by Textwrangler.)
How can I delete this character and all "invalid" characters in my CSV file? Is there a regular expression that matches every letter, number and sign (punctuation and other textual symbols) that is a real character and leaves out something like the example above? I am looking for a Unicode-safe regular expression (I need to preserve German umlauts as well as French, Russian, Chinese, Japanese and Korean characters). Alternatively: how can I convert a CSV file with charset=binary to UTF-8?
Edit:
If I open it via nano editor it shows forgé.^# Circa. After a quick search it seems to be a NUL character or \u0000 (see comments and https://en.wikipedia.org/wiki/Null_character for reference).
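One way to confirm such a finding and to strip the character is sketched below. The sample string is an assumption modelled on the excerpt above, and the variable names are made up.

```php
<?php
// The sample string is an assumption modelled on the excerpt above,
// with a NUL byte between "forgé." and " Circa".
$line = "Fer forgé.\0 Circa";

// Show the raw bytes so that the invisible character can be identified.
$hex = bin2hex($line);

// Position of the first NUL byte, or false when there is none.
$nulPosition = strpos($line, "\0");

// Remove every NUL byte. Multi-byte UTF-8 characters are left untouched,
// because the byte 0x00 never occurs inside a multi-byte UTF-8 sequence.
$clean = str_replace("\0", '', $line);

// Broader variant: drop ASCII control bytes except tab, LF and CR.
// These byte values never occur inside multi-byte UTF-8 sequences either.
$printable = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $line);
```

Both replacements work on bytes, so umlauts and CJK characters in UTF-8 text pass through unchanged.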
Edit 2:
I dug a little deeper into it: there seems to be a problem with the $splFile->current() function, which reads a line at the current file pointer. The line gets truncated after the NUL character (no matter whether I read it via SplFileObject::READ_CSV or just as a normal string, without the SplFileObject::READ_CSV flag).
The solution was to omit the SplFileObject::DROP_NEW_LINE flag. I also checked whether the NUL character is still present: it is, but it is now treated as part of the text value of that column in the CSV and is NOT detected as a quote or column enclosure.
Of course, you now have to filter out empty lines yourself, for example with something like:
$splFileObject = new \SplFileObject($file); // the constructor requires a filename
$splFileObject->setFlags(\SplFileObject::SKIP_EMPTY | \SplFileObject::READ_AHEAD | \SplFileObject::READ_CSV);
$columns = $splFileObject->current();
if (count($columns) === 1 && array_key_exists(0, $columns) && $columns[0] === NULL) {
// empty csv line
}
I'm using a csv parser class (http://code.google.com/p/php-csv-parser/) to parse and extract data from csv files. The problem I'm encountering is that it only works for certain csv file types. (It seems that there is a csv type for Mac, for Ms-Dos, and for Windows.)
The code works if I use a csv file which was saved on a mac (in excel) using the csv - windows option. However, if I save a file on a windows machine simply as csv, that doesn't work. (You would think that that would be the same format as saving csv-windows on a mac.) It does work from a windows machine if I save it as a csv-MSDOS file. This seems a little ridiculous.
Is there a way to standardize these three file types so that my code can read any type of csv that is uploaded?
I'm thinking it would be something like this:
$standardizedCSV = preg_replace('/\r(?!\n)/', "\r\n", $csvContent);
I know it has something to do with how each file type handles end of lines, but I'm a little put out trying to figure out those differences. If anybody has any advice, please let me know.
Thanks.
UPDATE:
This is the relevant code from the csv parser I'm using which extracts data row by row:
$c = 0;
$d = $this->settings['delimiter'];
$e = $this->settings['escape'];
$l = $this->settings['length'];
$res = fopen($this->_filename, 'r');
while ($keys = fgetcsv($res, $l, $d, $e)) {
if ($c == 0) {
$this->headers = $keys;
} else {
array_push($this->rows, $keys);
}
$c ++;
}
I guess I need to understand how fgetcsv handles eol's, so that I can make sure that csv files of any format are handled in the same manner.
This seems to do the trick:
ini_set("auto_detect_line_endings", true);
The problem was with line endings, but I didn't need to create my own EOL parser. This runtime setting does it for me. See http://us.php.net/manual/en/filesystem.configuration.php#ini.auto-detect-line-endings.
I don't think the line endings are the issue. The thing about CSV is that it's only a "comma separated values" file and not standardized beyond that. So some systems separate the values using commas, some using semicolons (;). I'm sure there are variations that use even other value separators.
Additionally, the escape character (most often backslash \) can be different between CSV files, and some CSV files also use quotation marks around each value (").
A CSV file can use any variation between the above. For instance, I'm fairly certain that Microsoft Excel exports CSV files separating values using semicolons and without any quotation around the values.
I'm sure there are ways to auto-detect how to parse the CSV file, but the best way would be allowing the user to decide. That's what Excel does.
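As a rough illustration of such auto-detection, the heuristic below picks the candidate separator that splits the first line into the most fields. The function name and the candidate list are assumptions, and the guess can be wrong, which is why letting the user override it remains the safer option.

```php
<?php
// Hypothetical heuristic: pick the candidate that splits the first line
// into the most fields. It can guess wrong, so the user should be able to override it.
function guess_delimiter(string $firstLine, array $candidates = [',', ';', "\t"]): string
{
    $best = $candidates[0];
    $bestCount = 0;
    foreach ($candidates as $candidate) {
        // str_getcsv respects quoted fields, so a separator inside "..." is not counted.
        $count = count(str_getcsv($firstLine, $candidate, '"', '\\'));
        if ($count > $bestCount) {
            $bestCount = $count;
            $best = $candidate;
        }
    }
    return $best;
}
```

On a tie the earlier candidate wins, so a single-column file is reported as comma-separated.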
If you use CSV files, you have to agree on many details which are not properly standardized:
Line endings (Unix 0x0a, Macintosh 0x0d, DOS 0x0d 0x0a)
Field separators (comma, semicolon etc.)
Field quoting (all fields quoted, only string fields, only string fields containing field and line separators)
Escaping of double quotes within string fields (doubling of double quotes, backslash character before double quote etc.)
Multiline string fields (are they allowed or not)
File encoding (ISO-8859-1, UTF-8 etc.)
If you create a CSV reader, you can automatically handle different variations of line endings and field quoting. But the rest has to be known to the CSV parser beforehand.
The de facto standard is the CSV format produced by Excel. However, Excel uses different variations of the format:
Usually DOS line endings (but I've never tried it with Excel for Macintosh)
Field separator depending on the locale. If the comma is used to group the digits in long numbers, Excel uses the semicolon as field separator. Otherwise the comma.
Excel uses double quotes if needed.
Excel doubles the double quotes within string fields.
Excel supports multiline string fields.
The file encoding seems to be the file encoding of the current locale. So it varies.
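A writer that follows the conventions listed above could look roughly like this. The function name is hypothetical, and the separator has to be passed in because it depends on the locale.

```php
<?php
// Hypothetical helper that formats one row the way the list above describes:
// quotes only when needed, doubled double quotes, DOS line ending.
function excel_csv_line(array $fields, string $separator = ','): string
{
    $out = [];
    foreach ($fields as $field) {
        $field = (string) $field;
        // Quote the field if it contains the separator, a double quote or a line break.
        if (strpbrk($field, $separator . "\"\r\n") !== false) {
            $field = '"' . str_replace('"', '""', $field) . '"';
        }
        $out[] = $field;
    }
    return implode($separator, $out) . "\r\n";
}
```

File encoding is deliberately left out of this sketch; the strings are written as they are passed in.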
I have some questions about \r\n:
newlines are browser dependent? (not how they are displayed in a browser, but how <textarea> sends them to php via http request)
newlines are system dependent? (where php runs)
will php apply some implicit conversion?
will mysql apply some implicit conversion?
Thanks in advance!
newlines are browser dependent?
No. Use <br> to get a newline in a browser
newlines are system dependent? (where php runs)
yes : \n on OSX, \n on Unix/Linux, \r\n on Windows
will php apply some implicit conversion?
no
will mysql apply some implicit conversion?
no
Generally, for a browser \r and \n are whitespace characters, like ' ' (space) or \t (tab). Inside some tags (script, pre, etc.) they are treated as line breaks. In that case the browser will understand any of the common line break sequences (\r, \r\n, \n).
When data comes from textarea, line breaks will always be represented as \r\n.
Line breaks in PHP files don't depend on the system where they're running. They depend on the settings of the editor used to create the PHP files. When you copy a PHP file to another system, its line break format does not change.
For example, look at this code:
print_r("
" === "\r\n");
Its result will depend on settings of the editor used for creating this file. It doesn't depend on current system.
But if you're trying to read some other files contained by your system (text files, for example) these files will most probably use system's common line breaks format.
No, PHP and MySQL don't apply implicit conversions.
The system independent way is using PHP_EOL constant.
Newlines are not browser dependent. Outside of a tag styled with CSS white-space: pre, you must call the nl2br() PHP function to convert newlines to <br> tags.
You may be interested in nl2br, this takes new line characters like you described and replaces them with a HTML line break (<br />).
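A short sketch of how nl2br is typically combined with escaping; the sample text is made up.

```php
<?php
// Sample input is made up; a textarea submission would typically contain \r\n.
$text = "first line\r\nsecond <b>line</b>";

// Escape HTML first, then insert <br /> before each line break.
// nl2br keeps the original line break characters in place.
$html = nl2br(htmlspecialchars($text));
```

Escaping before calling nl2br matters: in the other order, the inserted <br /> tags would be escaped as well.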
A big gotcha for me was that in single quoted strings 'like\nthis' escape sequences (like \n) will not be interpreted. You have to use double quotes "like\nthis" to get an actual newline.
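The difference is easy to check by comparing string lengths:

```php
<?php
$single = 'like\nthis';   // backslash followed by the letter n: 10 bytes
$double = "like\nthis";   // a single line feed byte: 9 bytes

$lenSingle = strlen($single);
$lenDouble = strlen($double);
```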
<br> is browser independent, \n should be too.
Don't know about \r
MySQL won't convert it
I use this code to get the number of columns from a CSV file:
$this->dummy_file_handler = fopen($this->config['file'],'r');
if ($dataset =fgetcsv($this->dummy_file_handler))
{
$this->number_of_columns = count($dataset);
}
It works fine unless the file is exported with Excel for Mac 2011 since the new line character is then Classic Mac (CR) which fgetcsv doesn't recognize.
If I manually change the newline from Classic Mac (CR) to Unix (LF), it works, but I need this to be automated.
How can I make fgetcsv recognize the Classic Mac (CR) new line character?
From the manual:
Note: If PHP is not properly recognizing the line endings when reading files either on or created by a Macintosh computer, enabling the auto_detect_line_endings run-time configuration option may help resolve the problem.
If Saul's answer doesn't work, I'd write a simple script to read in the file all at once and str_replace all \r with \n, dumping the results into a new file, then fgetcsv'ing that new file.
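A minimal sketch of that idea follows. It swaps the new file on disk for an in-memory php://temp stream, the function name is hypothetical, and it assumes the file is small enough to read at once. It also avoids relying on auto_detect_line_endings, which is deprecated as of PHP 8.1. One caveat: CR characters inside quoted multi-line fields are rewritten too.

```php
<?php
// Hypothetical helper: reads the whole file, rewrites CR+LF and bare CR as LF,
// and returns a temporary stream that fgetcsv can read line by line.
function open_csv_normalized(string $path)
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }
    // Replace CR+LF first so that it does not become two line feeds.
    $normalized = str_replace(["\r\n", "\r"], "\n", $raw);

    $handle = fopen('php://temp', 'r+');
    fwrite($handle, $normalized);
    rewind($handle);
    return $handle;
}
```

The returned handle can be passed to fgetcsv in place of the handle from fopen($this->config['file'], 'r').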
I find it amusing that these terms come from the days of using typewriters:
\n = Line Feed(LF) - advances the paper one line.
\r = Carriage Return (CR) - returns the carriage to the left side of the typewriter.
I am currently translating my PHP application using gettext with POEdit. Since I respect the print margin in my source code, I was used to writing strings like that:
print $this->translate("A long string of text
that needs to follow the print margin and since
php outputs whitespaces for every break line I do
my sites renders correctly.");
However, in POEdit, as expected, the linebreaks are not escaped to whitespaces.
A long string of text\n
that needs to follow the print margin and since\n
php outputs whitespaces for every break line I do\n
my websites render correctly.\n
I know one approach would be to close the strings when changing lines in the source code like that:
print $this->translate("A long string of text " .
"that needs to follow the print margin and since " .
"php outputs whitespaces for every break line I do " .
"my sites renders correctly. ");
But that approach does not scale for me when texts need to change and the print margin must still be respected, unless NetBeans (the IDE I use) can do that for me automatically, just like Eclipse does in Java.
So in conclusion, is there a way to tell the POEdit parser to escape linebreaks as whitespaces in the preferences?
I know that the strings are still translatable even though the linebreaks are not escaped. I'm asking so that my translator (sometimes even the customer/user) is not confused into thinking he needs to duplicate the linebreaks while translating in POEdit.
You have to make sure that you're using the right line breaks in your script and your app:
LF: Line Feed, U+000A
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029
On Windows (and MS-DOS) systems the line break is CR+LF, on Unix-like systems it is LF, and on 8-bit Commodore machines it is CR.
You have to make sure that the source location uses the same type of line breaks as your editing location.
Your server handles line breaks differently from the host the editor is running on; double-check this and develop some means of automatically replacing the characters depending on your OS.
Since you say that you're "translating my PHP application using gettext with POEdit", I would create a script that goes through all your files via shell/DOS/PHP and automatically converts the line breaks to the type used by the system you're running on.
So if you're working on Windows, you would search for all U+000A characters and replace them with U+000D U+000A.