PHP gettext and non-ANSII charters - php

I have a PHP web application which is originally in Polish. But I was asked to locale it into Russian. I've decided to use gettext. But I've problem when I'm trying to translate string with Polish special characters. For example:
echo gettext('Urządzenie');
Display "Urządzenie" in web browser instead of word in Russian.
All files are encoded in UTF-8 and .po file was generated with --from-code utf-8 . Translations without Polish special chars such as
echo gettext('Instrukcja');
works well. Do you know what could be the reason of this strange behaviour?

Are you sure the PHP file is in UTF-8 format? To verify, try this:
echo bin2hex('Urządzenie');
You should see the following bytes:
55 72 7a c4 85 64 7a 65 6e 69 65

Related

Charset of Textfile stored with file_put_contents() is missinterpretated

To prepare a download of some HTML contenteditable, as plain text file, I do following :
Send the html contenteditable, which inherits other html elements, through Ajax to a server side script prepareDownload.php.
There I create a new DOMDocument : $doc = new DOMDocument();
Then I do : $doc->loadHTML('<?xml encoding="UTF-8">' . $_POST["data"]);
Then I am looking for text contents in certain elements and assemble it in $plainText
Finally I write $plainText to disk with : file_put_contents($txtFile, $plainText, LOCK_EX);
So far it works … but when I open the textfile the special characters like the German Ä are a mess.
To find out where the problem might be generated I place some print_r() commands on several stages in the php script and look into the browsers console whats coming back.
Until the point where I write $plainText with to disk file_put_contents() everything is perfect. Looking into the stored text file then, characters are a mess.
Now I assume that file_put_contents() misinterprets the given charset. But how to tell file_put_contents() that it should interpret (not encode) it as UTF-8 ?
EDIT:
As a test to find out more I replaced the explizit statement :
$doc->loadHTML('<?xml encoding="UTF-8">' . $_POST["data"])
with
$doc->loadHTML($_POST["data"])
The character ä in the file still looks weired, but different. The hexdump now looks like this :
0220: 20 76 69 65 6C 2C 20 65 72 7A C3 A4 68 6C 74 20 viel, erz..hlt
Now ä has two points (two bytes) and is hex C3 A4. What kind of encoding is this ?

UTF-8 issue with PHP's json_decode

EDIT2: The issue was with how my Perl client was interpreting the output from PHP's json_encode which outputs Unicode code points by default. Putting the JSON Perl module in ascii mode (my $j = JSON->new()->ascii();) made things work as expected.
I'm interacting with an API written in PHP that returns JSON, using a client written in Perl which then submits a modified version of the JSON back to the same API. The API pulls values from a PostgreSQL database whose encoding is UTF8. What I'm running in to is that the API returns a different character encoding, even though the value PHP receives from the database is proper UTF-8.
I've managed to reproduce what I'm seeing with a couple lines of PHP (5.3.24):
<?php
$val = array("Millán");
print json_encode($val)."\n";
According to the PHP documentation, string literals are encoded ... in whatever fashion [they are] encoded in the script file.
Here is the hex dumped file encoding (UTF-8 lower case a-acute = c3 a1):
$ grep ill test.php | od -An -t x1c
24 76 61 6c 20 3d 20 61 72 72 61 79 28 22 4d 69
$ v a l = a r r a y ( " M i
6c 6c c3 a1 6e 22 29 3b 0a
l l 303 241 n " ) ; \n
And here is the output from PHP:
$ php -f test.php | od -An -t x1c
5b 22 4d 69 6c 6c 5c 75 30 30 65 31 6e 22 5d 0a
[ " M i l l \ u 0 0 e 1 n " ] \n
The UTF-8 lower case a-acute has been changed to a "Unicode" lower case a-acute by json_encode.
How can I keep PHP/json_encode from switching the encoding of this variable?
EDIT: What's interesting is that if I change the string literal to utf8_encode("Millán") then things work as expected. The utf8_encode docs say that function only supports ISO-8859-1 input, so I'm a bit confused about why that works.
This is entirely based on a misunderstanding. json_encode encodes non-ASCII characters as Unicode escape sequences \u..... These sequences do not reference any physical byte encoding in any UTF encoding, it references the character by its Unicode code point. U+00E1 is the Unicode code point for the character á. Any proper JSON parser will decode \u00e1 back into the character "á". There's no issue here.
try the below command to solve their problems.
<?php
$val = array("Millán");
print json_encode($val, JSON_UNESCAPED_UNICODE);
Note: add the JSON_UNESCAPED_UNICODE parameter to the json_encode function to keep the original values.
For python, this Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence

PHP, IMAP and Outlook 2010 - folder names encoding is different?

Im developing e-mail client in php (with symfony2) and i have problem with folders with non-ascii characters in name.
Folder created in php app is visible in same app correctly. Same in Outlook, created in outlook looks good in outlook. In other cases not. Folder created in outlook is not displayed correctly in php and vice-versa.
Im using utf-7 to encode folder names in php. Which encoding uses Outlook?
Example: Folder named "Wysłąne" (misspelled polish word meaning "sent"), first one is encoded in utf7 by php, and second created in Outlook:
PHP:
Wys&xYLEhQ-ne
Outlook:
Wys&AUIBBQ-ne
Why it differs? How to make it in same encoding?
There seems to be a mixup in your source character encoding. imap_utf7_encode (and similar) expect your string in ISO-8859-1 encoding.
AFAICT there is no way to represent Wysłąne in ISO-8859-1. "Wysłąne" represented as UTF-8 becomes (hex bytes)
byte value (hex) 57, 79, 73, C5 82, C4 85, 6E 65
unicode character W y s ł ą n e
The PHP result Wys&xYLEhQ-ne when decoded is "Wys얂쒅ne". The the two special characters therein are Korean characters with code points U+C582 and U+C485 respectively. So it appears a character-per-character translation is somehow attempted, where the UTF-8 representation of two of the characters is interpreted as Unicode code points instead.
The simplest way to fix this is to use the mbstring extension which has the mb_convert_encoding function.
$utf7encoded = mb_convert_encoding($utf8SourceString, "UTF7-IMAP","UTF-8")
$decodedAsUTF8 = mb_convert_encoding($utf7String,"UTF-8", "UTF7-IMAP")

problems with UTF-8 encoding in PHP

The characters I am getting from the URL, for example www.mydomain.com/?name=john , were fine, as longs as they were not in Russian.
If they were are in Russian, I was getting '����'.
So I added $name= iconv("cp1251","utf-8" ,$name); and now it works fine for Russian and English characters, but screws up other languages. :)))
For example 'Jānis' ( Latvian ) that worked fine before iconv, now turns into 'jДЃnis'.
Any idea if there's some universal encoder that would work with both the Cyrillic languages and not screw up other languages?
Why don't you just use UTF-8 with all files and processes?
Actually this runs down to the problem of how the URL is encoded. If you're clicking a link on a given page the browser will use the page's encoding to sent the request but if you enter the URL directly into the address-bar of your browser the behavior is somehow undefined as there is no standardized way on the encoding to use (Firefox provides an about:config switch to use UTF-8 encoded URLs).
Besides using some encoding detection there is no way to know the encoding used with the URL in the given request.
EDIT:
Just to backup what I said above, I wrote a small test script that shows the default behavior of the five major browsers (running Mac OS X in my case - Windows Vista via Parallels in case of the IE):
$p = $_GET['p'];
for ($i = 0; $i < strlen($p); $i++) {
// this displays the binary data received via the URL in hex format
echo dechex(ord($p[$i])) . ' ';
}
Calling http://path/to/script.php?p=äöü leads to
Safari (4.0.5): c3 a4 c3 b6 c3 bc
Firefox (3.6.3): c3 a4 c3 b6 c3 bc
Google Chrome (5.0.375.38): c3 a4 c3 b6 c3 bc
Opera (10.10): e4 f6 fc
Internet Explorer (8.0.6001.18904): e4 f6 fc
So obviously the first three use UTF-8 encoded URLs while Opera and IE use ISO-8859-1 or some of its variants. Conclusion: you cannot be sure what's the encoding of textual data sent via an URL.
Seems like the issue is the file encoding, you should always use UTF-8 no BOM as the prefered encoding for your .php files, code editors such as Intype let you easily specify this (UTF-8 Plain).
Also, add the following code to your files before any output:
header('Content-Type: text/html; charset=utf-8');
You should also read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.

Strange characters in PHP

This is driving me crazy.
I have this one php file on a test server at work which does not work.. I kept deleting stuff from it till it became
<?
print 'Hello';
?>
it outputs
Hello
if I create a new file and copy / paste the same script to it it works!
Why does this one file give me the strange characters all the time?
That's the BOM (Byte Order Mark) you are seeing.
In your editor, there should be a way to force saving without BOM which will remove the problem.
Found it, file -> encoding -> UTF8 with BOM , changed to to UTF :-)
I should ahve asked before wasing time trying to figure it out :-)
Just in case, here is a list of bytes for BOM
Encoding Representation (hexadecimal)
UTF-8 EF BB BF
UTF-16 (BE) FE FF
UTF-16 (LE) FF FE
UTF-32 (BE) 00 00 FE FF
UTF-32 (LE) FF FE 00 00
UTF-7 2B 2F 76, and one of the following bytes: [ 38 | 39 | 2B | 2F ]†
UTF-1 F7 64 4C
UTF-EBCDIC DD 73 66 73
SCSU 0E FE FF
BOCU-1 FB EE 28 optionally followed by FF†

Categories