I open a file (saved as ISO 8859-1) in the terminal (Ubuntu) and, where the new lines should be, I see the character ^M instead (with XX before and after it).
I then run this code in PHP to see how PHP handles it:
$text=str_split($text);
var_dump($text);
In the var_dump output I see only an array of size 4, containing nothing but the 'X' characters.
Any idea what is going on in there?
EDIT: OpenOffice translates this ^M correctly to a new line.
ANOTHER EDIT:
The following code, which I run before the str_split, changes nothing:
echo str_replace("\r","XXXXXX",$text);
^M is not a newline. ^J (line feed, \n) is a newline. ^M is the carriage return, the character that Windows writes before the line feed to mark a line break (\r\n). Its escape sequence is \r.
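To check which bytes are really in the string, it can help to dump them as hex before splitting or replacing anything. The following is a minimal sketch, not the asker's code: the helper name show_bytes and the sample string are assumptions made for illustration.

```php
<?php
// Hypothetical helper: render each byte of a string as two hex digits.
function show_bytes(string $text): string
{
    // bin2hex() returns one long hex string; split it into byte-sized pairs.
    return implode(' ', str_split(bin2hex($text), 2));
}

// Sample data standing in for the file contents: "XX", a carriage return, "XX".
$text = "XX\rXX";

echo show_bytes($text), PHP_EOL;          // 58 58 0d 58 58
var_dump(strpos($text, "\r") !== false);  // bool(true) when a CR is present
```

If no 0d shows up in the dump, the string being inspected does not contain a carriage return at all, which would explain why str_replace("\r", ...) has no effect.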
I am trying to parse a csv file in PHP via SplFileObject. Sadly, SplFileObject sometimes gets stuck if there are erroneous invisible characters in the text. While iterating over the lines in the csv file, it detects such a character as a quote instead of skipping it or reading it as a normal character.
The screenshot below is from Textwrangler:
I also copied it from Textwrangler here (invisible char should be between "forgé." and "Circa"):
Fer forgé.� Circa
My code (SplFileObject part):
$splFile = new \SplFileObject($file);
$splFile->setFlags(\SplFileObject::DROP_NEW_LINE | \SplFileObject::SKIP_EMPTY | \SplFileObject::READ_AHEAD | \SplFileObject::READ_CSV);
$splFile->setCsvControl(",", '"', '"');
I tried to figure out the file's charset via file -I my.csv. Output: my.csv: application/octet-stream; charset=binary. That is a weird result, as the file is readable in Textwrangler and is therefore NOT binary. I also read another csv generated in the same way, and the output is as expected: second.csv: text/plain; charset=utf-8. The tool used to generate the csv files is called Visual Web Ripper (a tool for crawling web pages).
How can I determine which character this upside-down question mark is (it does not seem to be the Spanish upside-down question mark; maybe it is just a placeholder inserted by Textwrangler)?
How can I delete this character and all "invalid" characters in my csv file? Is there a regular expression that matches every letter, number and sign (punctuation and other textual symbols) that is a real character, and leaves out something like the example above? I am looking for a Unicode-safe regular expression (I need to preserve German umlauts as well as French, Russian, Chinese, Japanese and Korean characters). Alternatively: how can I convert a csv file with charset=binary to UTF-8?
Edit:
If I open it in the nano editor, it shows forgé.^@ Circa. After a quick search it seems to be a NUL character or \u0000 (see comments and https://en.wikipedia.org/wiki/Null_character for reference).
Edit 2:
I dug a little deeper into it: there seems to be a problem with the $splFile->current() function, which reads a line at the current file pointer. The line gets truncated after the NUL character, no matter whether I read it via SplFileObject::READ_CSV or just as a normal string (without the SplFileObject::READ_CSV flag).
The solution was to omit the SplFileObject::DROP_NEW_LINE parameter. I also checked if the NUL character is present: It is present, but it is now considered as part of the text value of the specific column in the csv and is NOT detected as quote or column enclosure.
Of course, you now have to filter out empty lines yourself, for example with something like:
// $file is the path to the csv file, as in the snippet above
$splFileObject = new \SplFileObject($file);
$splFileObject->setFlags(\SplFileObject::SKIP_EMPTY | \SplFileObject::READ_AHEAD | \SplFileObject::READ_CSV);
$columns = $splFileObject->current();
// with READ_CSV, a blank line comes back as an array holding a single null
if (count($columns) === 1 && array_key_exists(0, $columns) && $columns[0] === null) {
// empty csv line
}
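For the original question of deleting the NUL character (and similar control characters) while keeping accented and CJK text, one possible approach is a byte-level filter. This is only a sketch under assumptions: the helper name strip_control_bytes is made up here, the input is assumed to be UTF-8, and tab, LF and CR are deliberately kept.

```php
<?php
// Hypothetical helper: drop ASCII control bytes except tab, LF and CR.
function strip_control_bytes(string $line): string
{
    // No /u flag: the pattern works on single bytes below 0x80, which never
    // occur inside a UTF-8 multi-byte sequence, so accented and CJK text
    // passes through unchanged. Fall back to the input if PCRE reports an error.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $line) ?? $line;
}

// Sample value modelled on the line from the question, with a NUL after "forgé.".
$dirty = "Fer forgé.\0 Circa";
$clean = strip_control_bytes($dirty);

var_dump($clean === "Fer forgé. Circa"); // bool(true)
```

The filter would have to run on the raw text before it reaches the CSV parser, for example on the file contents read as a string.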
I have this string to be encoded (with line break)
Sender ID
Sender ID
Sender ID
When using this urlencode generator, I get the desired output which is
Sender%20ID%0ASender%20ID%0ASender%20ID
However, when I use PHP's urlencode(), I get this output:
Sender+ID%0D%0ASender+ID%0D%0ASender+ID
When I use PHP's rawurlencode(), I get this output:
Sender%20ID%0D%0ASender%20ID%0D%0ASender%20ID
How can I get the same output as the generator? It needs to be the same because a Blackberry phone shows the line break properly only if it is encoded as %0A (I am working on an SMS system).
Right now the only solution I can think of is to search for %0D%0A and replace it with %0A.
You have a Windows line ending, which PHP encodes as-is and which your generator tool ignores. The easy way to get rid of it is to replace it before encoding (note that str_replace returns the new string, so assign the result):
$input = str_replace( "\r\n", "\n", $input );
%0D refers to the 13th ASCII character: \r. Since this is immediately followed by %0A (the \n) it is clear that you have the MS line ending (\r\n) instead of the *nix line ending (\n) and that the urlencode generator is using the *nix approach.
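Put together, the normalisation and the encoding could look like the sketch below. The sample string is an assumption standing in for the submitted text; rawurlencode() is used because the desired output encodes spaces as %20.

```php
<?php
// Sample input with Windows (CRLF) line endings, as in the question.
$input = "Sender ID\r\nSender ID\r\nSender ID";

// Collapse CRLF to LF first, then encode; rawurlencode() writes spaces as %20.
$encoded = rawurlencode(str_replace("\r\n", "\n", $input));

echo $encoded, PHP_EOL; // Sender%20ID%0ASender%20ID%0ASender%20ID
```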
I use this code to get the number of columns from a CSV file:
$this->dummy_file_handler = fopen($this->config['file'],'r');
if ($dataset =fgetcsv($this->dummy_file_handler))
{
$this->number_of_columns = count($dataset);
}
It works fine unless the file is exported with Excel for Mac 2011 since the new line character is then Classic Mac (CR) which fgetcsv doesn't recognize.
If I manually change the newline from Classic Mac (CR) to Unix (LF), then it works, but I need this to be automated.
How can I make fgetcsv recognize the Classic Mac (CR) new line character?
From the manual:
Note: If PHP is not properly recognizing the line endings when reading files either on or created by a Macintosh computer, enabling the auto_detect_line_endings run-time configuration option may help resolve the problem.
If Saul's answer doesn't work, I'd write a simple script to read in the file all at once and str_replace all \r with \n, dumping the results into a new file, then fgetcsv'ing that new file.
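The read, replace and re-parse idea can be sketched without a second file on disk by using an in-memory stream. This may also be useful because auto_detect_line_endings is deprecated as of PHP 8.1. The helper name read_csv_any_eol is hypothetical, and the sketch assumes the file fits in memory and that no quoted field contains a carriage return that must be preserved.

```php
<?php
// Hypothetical helper: parse CSV text whose line endings may be CR, LF or CRLF.
function read_csv_any_eol(string $raw): array
{
    // Turn CRLF and lone CR into LF so fgetcsv() sees line breaks it knows.
    $normalized = preg_replace('/\r\n?/', "\n", $raw);

    // php://temp is a temporary read/write stream, so no extra file is created by hand.
    $handle = fopen('php://temp', 'r+');
    fwrite($handle, $normalized);
    rewind($handle);

    $rows = [];
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);

    return $rows;
}

// Classic Mac style sample: two rows separated by a lone CR.
$rows = read_csv_any_eol("a,b,c\r1,2,3\r");
echo count($rows[0]), PHP_EOL; // 3
```

With a real file, the argument would be file_get_contents($this->config['file']), and count($rows[0]) gives the number of columns.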
I find it amusing that these terms come from the days of using typewriters:
\n = Line Feed (LF) - advances the paper one line.
\r = Carriage Return (CR) - returns the carriage to the left side of the typewriter.
I have a script that generates a csv file using the following code:
header('Content-type: text/csv');
header('Content-Disposition: attachment; filename="'.date("Ymdhis").'.csv"');
print $content;
The $content variable simply contains lines with fields separated by commas and then finalised with ."\n"; to generate a new line.
When I open the csv file it looks fine. However, when I try to import the file into an external program (MYOB), it does not recognise the End Of Line (\n) character and treats the file as one long line of text.
When I view the contents of the file in Notepad, the end of line character (\n) is a small rectangular box which looks like the character code 0x7F.
If I open the file and re-save it in Excel, it removes this character and replaces it with a proper end of line character, and I can import the file.
What character do I need to generate in PHP so that Notepad recognises it as a valid End Of Line character? (\n) obviously doesn't do the job.
Use "\r\n" (with double quotes).
These are the ASCII characters for carriage return + line feed.
Historically this relates to the early days of computing on teletypes when the output was printed to paper and returning the carriage of the teletype head to the start of the line was a separate operation to feeding a line through the printer. You could overwrite lines by just doing a carriage return and insert blank lines by just a line feed. Doing both returned the head to the start of the line and fed it a new line to print on.
What precisely was required differed between systems:
Line feed only: most Unix-like systems
Carriage return plus line feed: DEC and MS-DOS based systems
Carriage return only: early Apple/Mac OSes
So what you're generating at the moment is a newline on a Unix system only. Wikipedia has quite a good page on this.
There are actually Unix command line tools to do the conversion too. The unix2dos and dos2unix commands convert ASCII text files back and forth between the Unix and DOS formats by converting line feed to carriage return plus line feed and vice versa.
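The same kind of conversion can be done inside PHP before the content is printed. The following is a rough sketch, not a reimplementation of unix2dos: the helper name to_crlf and the sample content are assumptions.

```php
<?php
// Hypothetical helper in the spirit of unix2dos: end every line with CRLF.
function to_crlf(string $text): string
{
    // Match CRLF first so existing Windows endings are not doubled.
    return preg_replace('/\r\n|\r|\n/', "\r\n", $text);
}

// Sample content with Unix line endings.
$content = "a,b\nc,d\n";
echo bin2hex(to_crlf($content)), PHP_EOL; // 612c620d0a632c640d0a
```

Because CRLF is matched as a unit, running the function on text that already uses CRLF leaves it unchanged.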
Be sure to use double quotes around the \r\n, not the single quotes as mentioned in the previous answer!
I experienced the same issue. I fixed it by replacing the single quotes ' with double quotes " when building the $content variable, i.e. the outer quotes around the line ending must be double rather than single.
And it worked :)
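The difference between the two quoting styles is easy to see by measuring the strings. A small sketch follows; the $row values are made up for illustration.

```php
<?php
// Single quotes: backslash, r, backslash, n stay as four literal characters.
var_dump(strlen('\r\n')); // int(4)

// Double quotes: PHP interprets the escapes as CR and LF, two bytes.
var_dump(strlen("\r\n")); // int(2)

// Building one CSV line that ends with a real CRLF.
$row  = ['id', 'name'];
$line = implode(',', $row) . "\r\n";
var_dump(bin2hex(substr($line, -2))); // string(4) "0d0a"
```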
I am currently translating my PHP application using gettext with POEdit. Since I respect the print margin in my source code, I am used to writing strings like this:
print $this->translate("A long string of text
that needs to follow the print margin and since
php outputs whitespaces for every break line I do
my sites renders correctly.");
However, in POEdit, as expected, the linebreaks are not escaped to whitespaces.
A long string of text\n
that needs to follow the print margin and since\n
php outputs whitespaces for every break line I do\n
my websites render correctly.\n
I know one approach would be to close the strings when changing lines in the source code like that:
print $this->translate("A long string of text " .
"that needs to follow the print margin and since " .
"php outputs whitespaces for every break line I do " .
"my sites renders correctly. ");
But this approach does not scale for me when texts need to change and the print margin must still be respected, unless NetBeans (the IDE I use) can do that for me automatically, just like Eclipse does in Java.
So in conclusion, is there a way to tell the POEdit parser to escape linebreaks as whitespaces in the preferences?
I know that the strings are still translatable even though the linebreaks are not escaped. I'm asking so that my translator (sometimes even the customer/user) is not confused into thinking he needs to duplicate the linebreaks while translating in POEdit.
You have to make sure that you're using the right line breaks in your script and your app:
LF: Line Feed, U+000A
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029
On Windows systems (MS-DOS) the line break is CR+LF, on "Unix-like" systems it is LF, and on 8-bit Commodore machines it is CR.
You have to make sure that the source location uses the same type of line break as your edit location.
Your server may handle line breaks differently from the host that the editor is running on. Double-check this and develop some means of automatically replacing the characters depending on your OS.
As you say that you are "translating my PHP application using gettext with POEdit", I would create a script that goes through all your files via shell/DOS/PHP and converts the line breaks to the type used by the system you are running on.
So if you are working on Windows, you would search for every U+000A that is not already preceded by U+000D and replace it with U+000D U+000A.
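Before converting anything, it may be worth checking which line breaks a file actually contains. The sketch below counts the three common styles in a string; the helper name count_line_endings and the sample text are assumptions for illustration.

```php
<?php
// Hypothetical helper: count each line-ending style found in a string.
function count_line_endings(string $text): array
{
    $crlf = substr_count($text, "\r\n");

    return [
        'CRLF' => $crlf,
        // Every CRLF contains one CR and one LF, so subtract those.
        'CR'   => substr_count($text, "\r") - $crlf,
        'LF'   => substr_count($text, "\n") - $crlf,
    ];
}

// Sample text mixing all three styles.
print_r(count_line_endings("one\r\ntwo\nthree\rfour\n"));
```

A file with a single non-zero count is consistent. Mixed counts suggest it has been edited on more than one kind of system.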