How to read non-ASCII characters from CLI standard input - php

If I type å in CMD, fgets stops waiting for more input and the loop runs until I press Ctrl-C. If I type "normal" characters like a-z0-9!?(), it works as expected.
I run the code in CMD under Windows 7 with UTF-8 as the charset (chcp 65001); the file is saved as UTF-8 without BOM. I use PHP 5.3.5 (cli).
<?php
echo "ÅÄÖåäö work here.\n";
while(1)
{
echo '> '. fgets(STDIN);
}
?>
If I change the charset with chcp 1252, the loop doesn't break when I type å and it prints "> å", but "ÅÄÖåäö work here" becomes "ÅÄÖåäö work here!". I also know that I can save the file as ANSI, but then I can't use special characters like ╠╦╗.
So why does fgets stop waiting for user input after I have typed åäö?
And how can I fix this?
EDIT:
Also found a strange bug.
echo "öäåÅÄÖåäö work here! Or?".chr(10); -> ��äåÅÄÖåäö work here! Or? re! Or?.
If the first char in echo is å/ä/ö, it prints strange chars AND the end of the output is duplicated with n - 1 chars (n = number of åäö at the beginning of the string).
E.g.: echo "åäö 1234" -> ??äö 123434 and echo "åäöåäö 1234" -> ??äöåäö 1234 1234.
EDIT2 (solved):
The problem was chcp 65001; now I use chcp 437.
Big thanks to Timothy Martens!

Possible solution:
echo '>';
$line = stream_get_line(STDIN, 999999, PHP_EOL);
Notes:
I was unable to reproduce your error using multiple versions of PHP.
Using the following PHP version 5.3.8 gave me no issues
PHP 5.3 (5.3.8)
VC9 x86 Non Thread Safe (2011-Aug-23 12:26:18)
Architecture is Win XP SP3 32 bit
You might try upgrading PHP.
I downloaded php-5.3.5-nts-Win32-VC6-x86 and was not able to reproduce your error; it works fine for me.
Edit: Additionally, I typed the characters using my Spanish keyboard.
Edit2:
CMD Command:
chcp 437
PHP Code:
<?php
$fp=fopen("php://stdin","r");
while(1){
$str = fgets(STDIN);
echo mb_detect_encoding($str)."\n";
echo '>'.stream_get_line($fp,999999,"\n")."\n";
}
?>
Output:
test
ASCII
test
>test
öïü
öïü
>öïü
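With chcp 437 the bytes that arrive on STDIN are in code page 437, not UTF-8. A minimal sketch of one way to normalise them, assuming the console really is on code page 437 and that the iconv extension is available. The helper name consoleLineToUtf8 and its default argument are hypothetical, not part of PHP, and the type declarations need PHP 7 or later:

```php
<?php
// Hypothetical helper: convert a line read from the console to UTF-8.
// $consoleCodePage is an assumption; it must match what `chcp` reports.
function consoleLineToUtf8(string $line, string $consoleCodePage = 'CP437'): string
{
    $converted = iconv($consoleCodePage, 'UTF-8', $line);
    // iconv() returns false on failure; fall back to the raw bytes then.
    return $converted === false ? $line : $converted;
}

// In CP437, byte 0x94 is "ö", 0x8B is "ï" and 0x81 is "ü".
echo bin2hex(consoleLineToUtf8("\x94\x8b\x81")), "\n"; // c3b6c3afc3bc
```

In the loop above, the same helper could wrap the value returned by fgets(STDIN) before it is compared with UTF-8 string literals from the script.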

I think that happens because PHP 5.3 does not properly support multibyte characters.
These chars: ÅÄÖåäö
In binary they are: c3 85 c3 84 c3 96 c3 a5 c3 a4 c3 b6 (without a BOM at the beginning)
Citing PHP String:
A string is a series of characters, where a character is the same as a byte. This means that PHP only supports a 256-character set, and hence does not offer native Unicode support. See details of the string type.
Normally this does not affect the final result, because the browser or reader understands multibyte characters, but for CMD and the STDIN buffer, ÅÄÖåäö is a 12-byte char array (12 chars/bytes).
Only the MB functions handle basic operations on multibyte strings.
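The difference between bytes and characters can be checked directly. This is a small sketch, assuming the mbstring extension is loaded; the string is written with byte escapes so the result does not depend on how the script file itself is saved:

```php
<?php
// "ÅÄÖåäö" written as explicit UTF-8 bytes so the example does not depend
// on the encoding of this source file.
$s = "\xc3\x85\xc3\x84\xc3\x96\xc3\xa5\xc3\xa4\xc3\xb6";

$byteCount = strlen($s);             // strlen() counts bytes
$charCount = mb_strlen($s, 'UTF-8'); // mb_strlen() counts UTF-8 characters

echo bin2hex($s), "\n";                                   // c385c384c396c3a5c3a4c3b6
echo $byteCount, " bytes, ", $charCount, " characters\n"; // 12 bytes, 6 characters
```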

Related

Is it true that string literals in PHP can only be encoded in an encoding which is a compatible superset of ASCII, such as UTF-8 or ISO-8859-1?

I come across following text from the Details of the String Type page from PHP Manual :
Given that PHP does not dictate a specific encoding for strings, one might
wonder how string literals are encoded. String will be encoded in whatever
fashion it is encoded in the script
file. Thus, if the script is written in ISO-8859-1, the string will be
encoded in ISO-8859-1 and so on. However, this does not apply if Zend
Multibyte is enabled; in that case, the script may be written in an
arbitrary encoding (which is explicitly declared or is detected) and
then converted to a certain internal encoding, which is then the
encoding that will be used for the string literals. Note that there
are some constraints on the encoding of the script (or on the internal
encoding, should Zend Multibyte be enabled) – this almost always means
that this encoding should be a compatible superset of ASCII, such as
UTF-8 or ISO-8859-1.
So my doubt is, is it true that string literals in PHP can only be encoded in an encoding which is a compatible superset of ASCII, such as UTF-8 or ISO-8859-1 and not in an encoding which is not a compatible superset of ASCII?
Is it possible to encode string literals in PHP in some non-ASCII compatible encoding like UTF-16, UTF-32 or some other such non-ASCII compatible encoding? If yes then will the strings literals encoded in such one of the non-ASCII compatible encoding work with mb_string_* functions? If no, then what's the reason?
Suppose, Zend Multibyte is enabled and I've set the internal encoding to a compatible superset of ASCII, such as UTF-8 or ISO-8859-1 or some other non-ASCII compatible encoding. Now, can I declare the encoding which is not a compatible superset of ASCII, such as UTF-16 or UTF-32 in the script file?
If yes, then in this case what encoding the string literals would get encoded in? If no, then what's the reason?
Also, explain me how does this encoding thing work for string literals if Zend Multibyte is enabled?
How to enable the Zend Multibyte? What's the main intention behind turning it On? When it is required to turn it On?
It would be better if you could clear my doubts accompanied by suitable examples.
Thank You.
String literals in PHP source code files are taken literally as the raw bytes which are present in the source code file. If you have bytes in your source code which represent a UTF-16 string or anything else really, then you can use them directly:
$ echo -n '<?php echo "' > test.php
$ echo -n 日本語 | iconv -t UTF-16 >> test.php
$ echo '";' >> test.php
$ cat test.php
<?php echo "??e?g,??";
$ cat test.php | xxd
00000000: 3c3f 7068 7020 6563 686f 2022 feff 65e5 <?php echo "..e.
00000010: 672c 8a9e 223b 0a g,..";.
$ php test.php
??e?g,??$
$ php test.php | iconv -f UTF-16
日本語
This demonstrates a source code file ostensibly written in ASCII, but containing a UTF-16 string literal in the middle, which is output as is.
The bigger problem with this kind of source code is that it's difficult to work with. It's somewhere between a pain in the neck and impossible to get a text editor to treat the PHP code in one encoding and string literals in another. So typically, you want to keep the entire source code, including string literals, in one and the same encoding throughout.
You can also easily get into trouble:
$ echo -n '<?php echo "' > test.php
$ echo -n 漢字 | iconv -t UTF-16 >> test.php
$ echo '";' >> test.php
$ cat test.php | xxd
00000000: 3c3f 7068 7020 6563 686f 2022 feff 6f22 <?php echo "..o"
00000010: 5b57 223b 0a [W";.
"漢字" here is encoded to feff 6f22 5b57, which contains 22 or ", a string literal terminator, which means you have a syntax error now.
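The same collision can be found from inside PHP. A minimal sketch, assuming mbstring is available; it uses UTF-16BE without a BOM (the shell example above also emitted a BOM), and the variable names are only illustrative:

```php
<?php
// "漢字" (U+6F22, U+5B57) as explicit UTF-8 bytes.
$utf8 = "\xe6\xbc\xa2\xe5\xad\x97";

// Re-encode as UTF-16BE; mb_convert_encoding() does not add a BOM here.
$utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
echo bin2hex($utf16), "\n"; // 6f225b57

// Byte 0x22 is the ASCII double quote, so a parser reading the file as
// ASCII sees a string terminator in the middle of the first character.
$quotePos = strpos($utf16, '"');
echo "double quote byte at offset ", $quotePos, "\n"; // offset 1
```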
By default the PHP interpreter expects the PHP code to be ASCII compatible, so if you want to keep your string literals and the rest of the source code in the same encoding, you're pretty much limited to ASCII compatible encodings. However, the Zend Multibyte extension allows you to use other encodings if you declare the used encoding accordingly (in php.ini if it's not ASCII compatible). So you could write your source code in, say, Shift-JIS throughout; probably even with string literals in some other encoding*.
* (At which point I'll quit going into details because what is wrong with you?!)
Summary:
PHP must understand all the PHP code; by default it understands ASCII, with Zend Multibyte it can understand other encodings as well.
The string literals in your source code can contain any bytes you want, as long as PHP doesn't interpret them as special characters in the string literal (e.g. the 22 example above), in which case you need to escape them (with a backslash in the encoding of the general source code).
The string value at runtime will be the raw byte sequence PHP read from the string literal.
Having said all this, it is typically a pain in the neck to diverge from ASCII compatible encodings. It's a pain in text editors and easily leads to mojibake if some tool in your workflow is treating the file incorrectly. At most I'd advise using ASCII-compatible encodings, e.g.:
echo "日本語"; // UTF-8 encoded (let's hope)
If you must have a non-ASCII-compatible string literal, you should use byte notation:
echo "\xfe\xff\x65\xe5\x67\x2c\x8a\x9e";
Or conversion:
echo iconv('UTF-8', 'UTF-16', '日本語');
[..] will the strings literals encoded in such one of the non-ASCII compatible encoding work with mb_string_* functions?
Sure, strings in PHP are raw byte arrays for all intents and purposes. It doesn't matter how you obtained that string. If you have a UTF-16 string obtained with any of the methods demonstrated above, including by hardcoding it in UTF-16 into the source code, you have a UTF-16 encoded string and you can put that through any and all string functions that know how to deal with it.
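As a sketch of what "functions that know how to deal with it" means in practice, assuming mbstring is loaded: the byte string below is the UTF-16BE form of 日本語 from the byte-notation example, without the BOM, and each mb_* call is told the encoding explicitly.

```php
<?php
// UTF-16BE bytes for 日本語 (no BOM), written in byte notation.
$utf16 = "\x65\xe5\x67\x2c\x8a\x9e";

$bytes = strlen($utf16);                            // 6 bytes
$chars = mb_strlen($utf16, 'UTF-16BE');             // 3 characters
$firstTwo = mb_substr($utf16, 0, 2, 'UTF-16BE');    // still UTF-16BE
$asUtf8 = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');

echo $bytes, " bytes, ", $chars, " characters\n";   // 6 bytes, 3 characters
echo bin2hex($firstTwo), "\n";                      // 65e5672c
echo bin2hex($asUtf8), "\n";                        // e697a5e69cace8aa9e
```

Without the explicit encoding argument the mb_* functions fall back to the configured internal encoding, which would usually give wrong counts for UTF-16 data.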
So my doubt is, is it true that string literals in PHP can only be
encoded in an encoding which is a compatible superset of ASCII, such
as UTF-8 or ISO-8859-1 and not in an encoding which is not a
compatible superset of ASCII?
It's not true.
Is it possible to encode string literals in PHP in some non-ASCII
compatible encoding like UTF-16, UTF-32 or some other such non-ASCII
compatible encoding? If yes then will the strings literals encoded in
such one of the non-ASCII compatible encoding work with mb_string_*
functions? If no, then what's the reason?
As @deceze says, you can easily convert the string to the encoding you want via mb_convert_encoding or iconv.
From the Details of the String Type page in the PHP Manual: a string will be encoded in whatever fashion it is encoded in the script file. PHP built with Zend Multibyte support and the mbstring extension can parse and run PHP files that are encoded in a non-ASCII-compatible encoding like UTF-16. See the tests in Zend/tests/multibyte.
Zend/tests/multibyte/multibyte_encoding_003.phpt demonstrates running a source file encoded in UTF-16 LE that outputs Hello World correctly.
Zend/tests/multibyte/multibyte_encoding_003.phpt
--TEST--
Zend Multibyte and UTF-16 BOM
--SKIPIF--
<?php
if (!in_array("zend.detect_unicode", array_keys(ini_get_all()))) {
die("skip Requires configure --enable-zend-multibyte option");
}
if (!extension_loaded("mbstring")) {
die("skip Requires mbstring extension");
}
?>
--INI--
zend.multibyte=1
mbstring.internal_encoding=iso-8859-1
--FILE--
<?php
print "Hello World\n";
?>
===DONE===
--EXPECT--
Hello World
===DONE===
$ run-tests.php --keep-php --show-out --show-php Zend/tests/multibyte/multibyte_encoding_003.phpt
... skip some trivial message ...
Running selected tests.
TEST 1/1 [multibyte_encoding_003.phpt]
========TEST========
<?php
print "Hello World\n";
?>
===DONE===
========DONE========
========OUT========
Hello World
===DONE===
========DONE========
PASS Zend Multibyte and UTF-16 BOM [multibyte_encoding_003.phpt]
=====================================================================
Number of tests : 1 1
Tests skipped : 0 ( 0.0%) --------
Tests warned : 0 ( 0.0%) ( 0.0%)
Tests failed : 0 ( 0.0%) ( 0.0%)
Expected fail : 0 ( 0.0%) ( 0.0%)
Tests passed : 1 (100.0%) (100.0%)
---------------------------------------------------------------------
Time taken : 0 seconds
=====================================================================
$ file multibyte_encoding_003.php
multibyte_encoding_003.php: PHP script text, Little-endian UTF-16 Unicode text
Another example is Zend/tests/multibyte/multibyte_encoding_004.phpt, which runs source code encoded in Shift JIS.
Zend/tests/multibyte/multibyte_encoding_004.phpt (Note: some Japanese characters are not displayed correctly because encodings are mixed in one file and LC_MESSAGE is set to UTF-8)
--TEST--
test for mbstring script_encoding for flex unsafe encoding (Shift_JIS)
--SKIPIF--
<?php
if (!in_array("zend.detect_unicode", array_keys(ini_get_all()))) {
die("skip Requires configure --enable-zend-multibyte option");
}
if (!extension_loaded("mbstring")) {
die("skip Requires mbstring extension");
}
?>
--INI--
zend.multibyte=1
zend.script_encoding=Shift_JIS
mbstring.internal_encoding=Shift_JIS
--FILE--
<?php
function \\\($)
{
echo $;
}
\\\("h~t#\");
?>
--EXPECT--
h~t#\
$ run-tests.php --keep-php --show-out --show-php
./multibyte_encoding_004.phpt
... skip some trivial message ...
Running selected tests.
TEST 1/1 [multibyte_encoding_004.phpt]
========TEST========
<?php
function \\\($)
{
echo $;
}
\\\("h~t#\");
?>
========DONE========
========OUT========
h~t#\
========DONE========
PASS test for mbstring script_encoding for flex unsafe encoding (Shift_JIS) [multibyte_encoding_004.phpt]
=====================================================================
Number of tests : 1 1
Tests skipped : 0 ( 0.0%) --------
Tests warned : 0 ( 0.0%) ( 0.0%)
Tests failed : 0 ( 0.0%) ( 0.0%)
Expected fail : 0 ( 0.0%) ( 0.0%)
Tests passed : 1 (100.0%) (100.0%)
---------------------------------------------------------------------
Time taken : 0 seconds
=====================================================================
$ file Zend/tests/multibyte/multibyte_encoding_004.php
multibyte_encoding_004.php: PHP script text, Non-ISO extended-ASCII text
$ cat Zend/tests/multibyte/multibyte_encoding_004.php | iconv -f SJIS -t utf-8
<?php
function 予蚕能($引数)
{
echo $引数;
}
予蚕能("ドレミファソ");
?>
Is it possible to encode string literals in PHP in some non-ASCII
compatible encoding like UTF-16, UTF-32 or some other such non-ASCII
compatible encoding? If yes then will the strings literals encoded in
such one of the non-ASCII compatible encoding work with mb_string_*
functions? If no, then what's the reason?
The answer to the first question is yes; the tests for Zend Multibyte demonstrate this convincingly. The answer to the second question is also yes, provided the correct encoding hints are given to mb_string_*.
Suppose, Zend Multibyte is enabled and I've set the internal encoding
to a compatible superset of ASCII, such as UTF-8 or ISO-8859-1 or some
other non-ASCII compatible encoding. Now, can I declare the encoding
which is not a compatible superset of ASCII, such as UTF-16 or UTF-32
in the script file?
If yes, then in this case what encoding the string literals would get
encoded in? If no, then what's the reason?
Yes. The output generated by the second command below is encoded in UTF-32 (each character is represented as 4 bytes):
$ echo -e '<?php\necho "Hello 中文";' | php | hexdump -C
00000000 48 65 6c 6c 6f 20 e4 b8 ad e6 96 87 |Hello ......|
0000000c
$ echo -e '<?php\necho "Hello 中文";' | iconv -t utf-16 | php -d zend.multibyte=1 -d zend.script_encoding=UTF-16 -d mbstring.internal_encoding=UTF-32 | hexdump -C
00000000 00 00 00 48 00 00 00 65 00 00 00 6c 00 00 00 6c |...H...e...l...l|
00000010 00 00 00 6f 00 00 00 20 00 00 4e 2d 00 00 65 87 |...o... ..N-..e.|
00000020
Also, explain me how does this encoding thing work for string literals
if Zend Multibyte is enabled?
The Zend Multibyte feature is implemented in Zend/zend_multibyte.c. It lets the Zend engine understand encodings other than ASCII and UTF-8. It is only an interface for encoding handling: the default implementation consists of dummy functions, and the real implementation is provided by the mbstring extension. mbstring must therefore be loaded to get multibyte support.
$ php -m | grep mbstring
mbstring
$ php -n -m | grep mbstring # -n: no configuration (ini) files are used, so mbstring is not loaded
$ echo -e '<?php\n echo "Hello 中文\n"; ' | iconv -t utf-16 | php -n -d zend.multibyte=1
Fatal error: Could not convert the script from the detected encoding "UTF-32LE" to a compatible encoding in Unknown on line 0
How to enable the Zend Multibyte? What's the main intention behind
turning it On? When it is required to turn it On?
Declaring zend.multibyte=1 in php.ini enables parsing of source files in multibyte encodings. You can also pass -d zend.multibyte=1 to the PHP CLI executable, as in the examples above, to enable multibyte support in the Zend engine.
How to enable the Zend Multibyte?
Compile PHP using the --enable-zend-multibyte flag (before PHP 5.4) and activate the zend.multibyte setting in the php.ini.
Cf. https://secure.php.net/manual/en/ini.core.php#ini.zend.multibyte and https://secure.php.net/manual/en/configure.about.php#configure.options.php

php filter_var FILTER_FLAG_ENCODE_HIGH

I have the following test case for the PHP function filter_var():
<?php
$inputvalue = "Ž"; //NUM = 142 on the ASCII extended list
$sanitized = filter_var($inputvalue, FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_HIGH);
echo 'The sanitized output: '.$sanitized."\n"; // --> & #197;& #189; (Å ½)
?>
If you run the above snippet, the output is not what I expect. Ž is number 142 in the extended ASCII list (see: ascii-code[dot]com), so what I expect to get back is '& #142;' (string, without the space).
I had help finding out what is going wrong; I just don't know how to solve it yet.
If you convert 'Ž' to UTF-8 you get the hex bytes C5 BD. These bytes correspond to the ISO-8859 characters Å and ½ (see: http://cs.stanford.edu/~miles/iso8859.html). These 2 characters then get encoded by filter_var to '& #197;& #189;'.
See this online converter: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C5%BD&mode=char
So basically, the UTF-8 bytes are being interpreted as Latin-1 characters. The converter page says the following: "UTF-8 bytes as Latin-1 characters" is what you typically see when you display a UTF-8 file with a terminal or editor that only knows about 8-bit characters.
I don't think my editor is the problem. I am using a Mac with Coda 2 (UTF-8 as default). The test has also been run on an HTML5 page with the meta character set set to utf-8. Furthermore, I am using a default XAMPP localhost server. With Firebug in Firefox I also checked whether the file was served as UTF-8 (it is).
Does anyone have an idea how I can solve this encoding problem?
I am going to drop this because I am not finding any solution. The email() function is also not safe, so I am going to use either phpmailer or swiftmailer (and I am leaning towards the latter).
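The three numbers in play (197 and 189, 142, and the real code point) can be told apart with a few lines. This is only an illustrative sketch, assuming mbstring is loaded and PHP 7.2 or later for mb_ord; it does not call filter_var itself, and the variable names are made up for the example:

```php
<?php
$utf8 = "\xc5\xbd"; // "Ž" (U+017D) as UTF-8 bytes

// What a byte-oriented encoder sees: two separate bytes.
$byteValues = array_map('ord', str_split($utf8));

// The Unicode code point of the character.
$codePoint = mb_ord($utf8, 'UTF-8');

// The single byte 142 (0x8E) only appears after converting to Windows-1252.
$cp1252 = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');

echo implode(' ', $byteValues), "\n"; // 197 189
echo $codePoint, "\n";                // 381
echo ord($cp1252), "\n";              // 142
```

So 142 is the position of Ž in Windows-1252, 381 is its Unicode code point, and 197/189 are simply the two UTF-8 bytes encoded one by one.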

Prolog and php encoding

I'm creating an interface between SWI-Prolog and PHP. The PHP script writes the commands it wants Prolog to run to a file and then makes a system call so Prolog runs the file. The problem is that when there are special characters in the file (like á, í, ã, ê, etc.), these characters are replaced by \uFFFD in the output from Prolog. I know that this code point stands for unknown/unidentified code points, but I have been unable to solve the issue with what I found on the Internet. If I run the file from the terminal myself it shows the correct characters; only when PHP runs it through exec or shell_exec does it go wrong.
Here's the code used, first the php:
$arquivo = fopen("/home/giz/prologDB/run.pl", "w");
$run = <<<EOT
go :-
consult('/home/giz/prologDB/pessoasOps.pl'),
addPessoa(0,'$name','$posicao','$resume','$unidade','$curso','$disciplina',$alunos,[]),
halt.
EOT;
echo $run;
fwrite($arquivo, $run);
$cmd = "prolog -f /home/giz/prologDB/run.pl -g go";
exec( $cmd, $output );
echo "\n";
print_r( $output );
echo "\n";
prolog code:
addPessoa(LOCAL, NOME, POSICAO, RESUMO, UNIDADE, CURSO, DISCIPLINA, ALUNOS, REFERENCIA):-
write( 'Prolog \nwas called \nfrom PHP \nsuccessfully.\n' ),
write('pessoa('),
write(LOCAL),
write(',\''),
write(NOME),
write('\',\''),
write(POSICAO),
write('\',\''),
write(RESUMO),
write('\',\''),
write(UNIDADE),
write('\',\''),
write(CURSO),
write('\',\''),
write(DISCIPLINA),
write('\','),
write(ALUNOS),
write(','),
write(REFERENCIA),
write(').\n'),
make.
Does someone know how to make it interpret the string properly?
Most probably Prolog expects UTF-8 encoded characters, and you are feeding it ISO-8859-n characters, where n is most probably 1 or 15. In UTF-8, when a byte >= 128 is seen, it is either the first of a multibyte sequence (if it is >= 192) or a continuation byte. If the first byte of a multibyte sequence is not followed by a continuation byte, or if a sequence starts with a continuation byte, you get an unrecognized byte sequence, in your case a U+FFFD codepoint. All characters with diacritics are above 128 in ISO-8859-n.
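If that diagnosis applies, the values can be checked and converted before they are written into the Prolog file. A minimal sketch, assuming mbstring is loaded; the helper name toUtf8 is hypothetical and the ISO-8859-1 fallback is a guess that has to match where the data really comes from:

```php
<?php
// Hypothetical helper: make sure a value is valid UTF-8 before it is
// written into the Prolog file. The ISO-8859-1 fallback is an assumption.
function toUtf8(string $value, string $fallbackEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave it alone
    }
    return mb_convert_encoding($value, 'UTF-8', $fallbackEncoding);
}

// "á" is the single byte 0xE1 in ISO-8859-1 and the two bytes C3 A1 in UTF-8.
echo bin2hex(toUtf8("\xe1")), "\n";     // c3a1
echo bin2hex(toUtf8("\xc3\xa1")), "\n"; // c3a1 (unchanged)
```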
Check also SWI-Prolog's manual page on encoding, especially the whole paragraph that starts with these two sentences:
The default encoding for files is derived from the Prolog flag
encoding, which is initialised from the environment. If the
environment variable LANG ends in "UTF-8", this encoding is assumed.
A good reason for a different behavior of swi-prolog when called from a shell or from within PHP could be a different setting of the LANG environment variable in these two cases. But in the same paragraph the manual mentions ways of forcing the encoding.
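One way to test the LANG theory from PHP is to set the variable explicitly for the child process. This is a sketch under assumptions: withUtf8Locale is a hypothetical helper, the locale name en_US.UTF-8 must actually exist on the machine (check with locale -a), and the VAR=value prefix only works with a POSIX shell.

```php
<?php
// Hypothetical helper: prefix a command with LANG so the child process
// sees a UTF-8 locale. The locale name is an assumption.
function withUtf8Locale(string $command, string $locale = 'en_US.UTF-8'): string
{
    return 'LANG=' . escapeshellarg($locale) . ' ' . $command;
}

$cmd = withUtf8Locale('prolog -f /home/giz/prologDB/run.pl -g go');
echo $cmd, "\n";

// Then run it as in the question:
// exec($cmd, $output);
```

If the output is correct with the prefix and wrong without it, the environment of the web server or PHP process is the cause rather than the file contents.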
In a shell, the fastest way to see the bytes contained in a file is to do an od -tx1z filename | less (leave out the z in case of hard-to-print characters).

remove utf-8 figure spaces with php

I have some XML files with figure spaces in them; I need to remove those with PHP.
The UTF-8 code for these is e2 80 a9. If I'm not mistaken, PHP does not seem to like 6 byte utf-8 chars; so far at least I'm unable to find a way to delete the figure spaces with functions like preg_replace.
Does anybody have any tips or, even better, a solution to this problem?
Have you tried preg_replace('/\x{2007}/u', '', $stringWithFigureSpaces);?
U+2007 is the unicode codepoint for the FIGURE SPACE.
Please see my answer on a similar unicode-regex topic with PHP which includes information about the \x{FFFF}-syntax.
Regarding your comment that it does not work: the following works perfectly on my machine:
$ php -a
Interactive shell
php > $str = "a\xe2\x80\x87b"; // \xe2\x80\x87 is the FIGURE SPACE
php > echo preg_replace('/\x{2007}/u', '_', $str); // \x{2007} is the PCRE unicode codepoint notation for the U+2007 codepoint
a_b
What's your PHP version? Are you sure the character is a FIGURE SPACE at all? Can you run the following snippet on your string?
for ($i = 0; $i < strlen($str); $i++) {
printf('%x ', ord($str[$i]));
}
On my test string this outputs
61 e2 80 87 62
a |U+2007| b
EDIT after OP comment:
\xe2\x80\xa9 is a PARAGRAPH SEPARATOR which is unicode codepoint U+2029, so your code should be preg_replace('/\x{2029}/u', '', $stringWithUglyCharacter);
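A short sketch that checks both halves of that statement, assuming mbstring is loaded and PHP 7.2 or later for mb_ord; the test string is made up for the example:

```php
<?php
$str = "a\xe2\x80\xa9b"; // "a", the three bytes e2 80 a9, "b"

// Decode the three bytes to a code point to confirm which character it is.
$codePoint = mb_ord("\xe2\x80\xa9", 'UTF-8');
printf("U+%04X\n", $codePoint); // U+2029

// Remove it with the /u modifier so PCRE treats the subject as UTF-8.
$cleaned = preg_replace('/\x{2029}/u', '', $str);
echo $cleaned, "\n"; // ab
```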
Maybe mb_convert_encoding function can help.

Problem reading accented characters in PHP

Got a strange problem in PHP land. Here's a stripped down example:
$handle = fopen("file.txt", "r");
while (($line = fgets($handle)) !== FALSE) {
echo $line;
}
fclose($handle);
As an example, if I have a file that looks like this:
Lucien Frégis
Then the above code run from the command line outputs the same name, but instead of an e acute I get :
Lucien FrÚgis
Looking at a hex dump of the file I see that the byte in question is E9, which is what I would expect for e acute in php's default encoding (ISO-8859-1), confirmed by outputting the current value of default_charset.
Any thoughts?
EDIT:
As suggested, I've checked the Windows code page, and apparently it's 850, which is obsolete (but does explain why 0xE9 is being displayed the way it is...)
0xE9 is the encoding for é in iso-8859-1. It's also the unicode codepoint for the same character. If your console interprets output in a different encoding (Such as cp-850), then the same byte will translate to a different codepoint, thus displaying a different character on screen. If you look at the code page for cp-850, you can see that the byte 0xE9 translates to Ú (Unicode codepoint 0xDA). So basically your console interprets the bytes wrongly. I'm not sure how, but you should change the charset of your console to iso-8859-1.
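The two readings of the byte 0xE9 can be reproduced with iconv. A minimal sketch, assuming the iconv extension is available and knows the names ISO-8859-1 and CP850; results are shown as hex so they do not depend on the terminal:

```php
<?php
$byte = "\xe9";

// The same byte interpreted in two different encodings, converted to UTF-8.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte); // é (U+00E9)
$asCp850  = iconv('CP850', 'UTF-8', $byte);      // Ú (U+00DA)

echo bin2hex($asLatin1), "\n"; // c3a9
echo bin2hex($asCp850), "\n";  // c39a

// Going the other way: the byte a code page 850 console needs to show é.
$forConsole = iconv('ISO-8859-1', 'CP850', $byte);
echo bin2hex($forConsole), "\n"; // 82
```

Converting the output to the console's code page in this way is an alternative to changing the console's code page.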
Before running your php on the command line, try executing the command:
chcp 1252
This will change the codepage to one where the accented characters are as you expect.
See the following links for the difference between the 850 and 1252 codepages:
http://en.wikipedia.org/wiki/Code_page_850
http://en.wikipedia.org/wiki/Windows-1252
The accent might be considered Unicode data and you will have to store it as such. Take a look at the utf8_decode, utf8_encode, and iconv functions.
No wait, it is in the ISO 8859-1 charset. I don't know. Have you tried reading in binary mode or using file_get_contents?
