Problems with UTF-8 encoding in PHP

The characters I get from the URL, for example www.mydomain.com/?name=john, were fine as long as they were not in Russian.
If they were in Russian, I got '����'.
So I added $name = iconv("cp1251", "utf-8", $name); and now it works fine for Russian and English characters, but it breaks other languages. :)))
For example, 'Jānis' (Latvian), which worked fine before iconv, now turns into 'jДЃnis'.
Any idea whether there is some universal encoder that would work with the Cyrillic languages without breaking other languages?

Why don't you just use UTF-8 with all files and processes?

Actually, this comes down to how the URL is encoded. If you click a link on a page, the browser uses that page's encoding to send the request, but if you type the URL directly into the address bar, the behavior is somewhat undefined because there is no standardized encoding to use (Firefox provides an about:config switch to use UTF-8 encoded URLs).
Apart from some form of encoding detection, there is no way to know which encoding was used for the URL in a given request.
EDIT:
Just to back up what I said above, I wrote a small test script that shows the default behavior of the five major browsers (running on Mac OS X in my case, and on Windows Vista via Parallels in the case of IE):
$p = $_GET['p'];
for ($i = 0; $i < strlen($p); $i++) {
    // display each byte received via the URL in hex format
    echo dechex(ord($p[$i])) . ' ';
}
Calling http://path/to/script.php?p=äöü leads to
Safari (4.0.5): c3 a4 c3 b6 c3 bc
Firefox (3.6.3): c3 a4 c3 b6 c3 bc
Google Chrome (5.0.375.38): c3 a4 c3 b6 c3 bc
Opera (10.10): e4 f6 fc
Internet Explorer (8.0.6001.18904): e4 f6 fc
So obviously the first three use UTF-8 encoded URLs, while Opera and IE use ISO-8859-1 or one of its variants. Conclusion: you cannot be sure which encoding was used for textual data sent via a URL.
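Building on that conclusion, one possible workaround is to check whether the incoming bytes are already valid UTF-8 and convert them only when they are not. The following is a minimal sketch, not a universal encoder: the function name toUtf8 and the Windows-1251 fallback are assumptions for illustration, and it needs the mbstring extension. The heuristic can misfire, because some legacy-encoded byte sequences also happen to be valid UTF-8.

```php
<?php
// Hypothetical helper: returns $value as UTF-8.
// Assumption: anything that is not valid UTF-8 was sent in $fallback.
function toUtf8(string $value, string $fallback = 'Windows-1251'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave it alone
    }
    return mb_convert_encoding($value, 'UTF-8', $fallback);
}

// "Jānis" sent as UTF-8 bytes stays untouched.
echo bin2hex(toUtf8("J\xC4\x81nis")), "\n";
// A Russian word sent as Windows-1251 bytes is converted to UTF-8.
echo bin2hex(toUtf8("\xCF\xF0\xE8\xE2\xE5\xF2")), "\n";
```

In the original script this would be used as $name = toUtf8($_GET['name']); in place of the unconditional iconv call.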

It seems the issue is the file encoding. You should always use UTF-8 without a BOM as the preferred encoding for your .php files; code editors such as Intype let you specify this easily (UTF-8 Plain).
Also, add the following code to your files before any output:
header('Content-Type: text/html; charset=utf-8');
You should also read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.

Related

PHP get ASCII code of a character

How can I get the ASCII code of a character in PHP?
http://www.backbone.se/urlencodingUTF8.htm
When I try:
$h = dechex(ord('ñ'));
echo $h;
I'm getting C3 when I should be getting f1
I want for example:
for 'ñ' -> %f1
for 'º' -> %ba
How can I get that?
Thanks in advance!!
As they told you in the comments, ñ is not standard ASCII. It is present in extended ASCII tables such as ISO-8859-1, but your source file is probably saved as UTF-8, so ñ is stored as two bytes (C3 B1). ord() only looks at the first byte of the string, which is why you get C3. You could try setting your editor to save documents in a different charset.
However, what you're doing is really dangerous. Since, as you can see, the encoding can change, it is always a bad idea to put characters that are not part of standard ASCII in source files.
Why do you need to do that operation? Is there another way? Can't you use UTF-8?
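If the goal really is the single-byte Latin-1 value, one option is to convert the UTF-8 string to ISO-8859-1 before percent-encoding it. This is a minimal sketch, assuming the input is valid UTF-8 and the mbstring extension is available; latin1UrlEncode is a made-up name. Note that rawurlencode emits uppercase hex digits, so the result differs in case from the lowercase form asked for.

```php
<?php
// Hypothetical helper: percent-encode a UTF-8 string using its ISO-8859-1 bytes.
function latin1UrlEncode(string $utf8): string
{
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    return rawurlencode($latin1);
}

echo latin1UrlEncode("\xC3\xB1"), "\n"; // ñ written as UTF-8 bytes
echo latin1UrlEncode("\xC2\xBA"), "\n"; // º written as UTF-8 bytes

// The Unicode code point itself is available via mb_ord (PHP 7.2+).
echo dechex(mb_ord("\xC3\xB1", 'UTF-8')), "\n";
```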

Python xml-rpc with PHP client and unicode not working

I have an XML-RPC server written in Python. It takes some values and saves them in a MySQL database. The data is in UTF-8 and the whole process works fine.
I have no problem talking to it from other languages such as Python, ASP.NET and C#, but when it comes to PHP, there is a problem. The characters are not saved in MySQL as they should be; they all come out scrambled.
I have followed all the recommendations, such as setting the header in the PHP file. I have also set the MySQL collation to utf-8, but the problem still exists.
The cURL library used is from GitHub: https://github.com/dongsheng/cURL
Source code is below:
<?php
error_reporting(E_ALL);
header('Content-Type: text/plain; charset=utf-8');
require_once('curl.php');
$rpc = "http://xmlrpc-webservice-address.com/";
$client = new xmlrpc_client($rpc, true);
$text="سلام";
$arr1=array("username", "password", array("111"), $text, "30002240123456", "ws", False);
$resp = $client->call('send', $arr1);
print_r($resp);
print_r("\n");
print_r($text);
?>
Most likely your problem is one of the following (in descending order of likelihood):
Your PHP source code isn't actually encoded as UTF-8, but as, say, CP1256. That means the non-ASCII literal string in the source is actually mojibake nonsense as far as the PHP interpreter, which reads it as UTF-8, is concerned. And those garbage bytes get passed through as-is all the way through the chain—to the XML-RPC service and back, to the browser, and to the user's screen.
Your PHP source code is encoded as UTF-8, but your PHP interpreter thinks it's, say, CP1256, because of the way it (or your server/module) is configured. So, once again, the literal string is mojibake (in the opposite direction), which again passes through the whole chain.
Your web service isn't returning UTF-8, but, say, Latin-1, and your other clients all treat it accordingly as Latin-1, but your PHP client code just assumes it's UTF-8, passes it to the browser as if it were UTF-8, and the user sees garbage.
If you're not absolutely, positively sure that your editor saved the source code as UTF-8, look at the source file in a hex editor. If it's UTF-8, the Arabic string should look like D8 B3 D9 84 D8 A7 D9 85. If it's anything different, like D3 E1 C7 E3 (CP1256) or D3 E4 C7 E5 (ISO-8859-6), that's your problem.
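Without a hex editor, the same check can be done from PHP itself. This is a rough sketch; hexBytes is a hypothetical helper, and the string is written with byte escapes so the result does not depend on how this example file is saved. In the real script you would pass the literal $text instead.

```php
<?php
// Hypothetical helper: show the raw bytes of a string as spaced uppercase hex.
function hexBytes(string $s): string
{
    return implode(' ', str_split(strtoupper(bin2hex($s)), 2));
}

// The Arabic string from the question, written as its UTF-8 bytes.
$text = "\xD8\xB3\xD9\x84\xD8\xA7\xD9\x85";

echo hexBytes($text), "\n";
// true when the bytes form valid UTF-8 (requires mbstring)
var_dump(mb_check_encoding($text, 'UTF-8'));
```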

php filter_var FILTER_FLAG_ENCODE_HIGH

I have the following test case for the PHP function filter_var():
<?php
$inputvalue = "Ž"; //NUM = 142 on the ASCII extended list
$sanitized = filter_var($inputvalue, FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_HIGH);
echo 'The sanitized output: '.$sanitized."\n"; // --> & #197;& #189; (Å ½)
?>
If you run the above snippet, the output is not what I expect. Ž is number 142 in the extended ASCII list (see: ascii-code[dot]com), so what I expect to get back is '& #142;' (as a string, without the space).
I had help finding out what is going wrong; I just don't know how to solve it yet.
If you convert 'Ž' to UTF-8 hex bytes you get C5 BD. These bytes correspond to the ISO-8859-1 characters Å and ½ (see: http://cs.stanford.edu/~miles/iso8859.html). These two characters then get encoded by filter_var as '& #197;& #189;'.
See this online converter: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C5%BD&mode=char
So basically, the UTF-8 bytes are being interpreted as Latin-1 characters. The converter page says the following: "UTF-8 bytes as Latin-1 characters" is what you typically see when you display a UTF-8 file with a terminal or editor that only knows about 8-bit characters.
I don't think my editor is the problem. I am using a Mac with Coda 2 (UTF-8 by default). The test was also run on an HTML5 page with the meta charset set to utf-8. Furthermore, I am using a default XAMPP localhost server. With Firebug in Firefox I also checked whether the file was served as UTF-8 (it is).
Does anyone have an idea how I can solve this encoding problem?
I am going to drop this because I am not finding any solution. The email() function is also not safe, so I am going to use either PHPMailer or Swift Mailer (and I am leaning towards the latter).
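For what it's worth, part of the confusion is that 142 (0x8E) is the position of Ž in Windows-1252, while its Unicode code point is U+017D, which is 381 in decimal, so a numeric entity for it would use 381 rather than 142. A hedged sketch of one way to produce such entities from UTF-8 input uses mb_encode_numericentity from the mbstring extension. The helper name encodeHigh and the chosen code point range are assumptions, and this only covers the entity encoding, not whatever else FILTER_SANITIZE_STRING does.

```php
<?php
// Hypothetical helper: replace every character above U+007F
// with a decimal numeric entity, treating the input as UTF-8.
function encodeHigh(string $utf8): string
{
    // Map format: [first code point, last code point, offset, mask]
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

echo encodeHigh("\xC5\xBD"), "\n"; // Ž written as UTF-8 bytes
```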

PHP, IMAP and Outlook 2010 - folder names encoding is different?

I'm developing an e-mail client in PHP (with Symfony2) and I have a problem with folders that have non-ASCII characters in their names.
A folder created in the PHP app is displayed correctly in that app. Likewise, a folder created in Outlook looks fine in Outlook. In the other cases it does not: a folder created in Outlook is not displayed correctly in PHP, and vice versa.
I'm using UTF-7 to encode folder names in PHP. Which encoding does Outlook use?
Example: a folder named "Wysłąne" (a misspelled Polish word meaning "sent"); the first name is encoded in UTF-7 by PHP, and the second was created in Outlook:
PHP:
Wys&xYLEhQ-ne
Outlook:
Wys&AUIBBQ-ne
Why do they differ? How can I get both into the same encoding?
There seems to be a mixup in your source character encoding. imap_utf7_encode (and similar) expect your string in ISO-8859-1 encoding.
AFAICT there is no way to represent Wysłąne in ISO-8859-1. "Wysłąne" represented as UTF-8 becomes (hex bytes):
Unicode character: W   y   s   ł      ą      n   e
Byte value (hex):  57  79  73  C5 82  C4 85  6E  65
The PHP result Wys&xYLEhQ-ne, when decoded, is "Wys얂쒅ne". The two special characters in it are Korean characters with code points U+C582 and U+C485 respectively. So it appears that a character-by-character translation is being attempted, in which the UTF-8 bytes of each of the two characters are interpreted as a Unicode code point instead.
The simplest way to fix this is to use the mbstring extension which has the mb_convert_encoding function.
$utf7encoded = mb_convert_encoding($utf8SourceString, "UTF7-IMAP", "UTF-8");
$decodedAsUTF8 = mb_convert_encoding($utf7String, "UTF-8", "UTF7-IMAP");
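To make this concrete, here is a small round trip using the folder name from the question. It is a sketch that assumes the mbstring extension is loaded; the variable names are illustrative, and the folder name is written with byte escapes so that it is UTF-8 regardless of how the file is saved.

```php
<?php
// "Wysłąne" written as UTF-8 bytes
$utf8Name = "Wys\xC5\x82\xC4\x85ne";

// UTF-8 -> modified UTF-7 as used for IMAP mailbox names
$imapName = mb_convert_encoding($utf8Name, 'UTF7-IMAP', 'UTF-8');
echo $imapName, "\n";

// ... and back again
$decoded = mb_convert_encoding($imapName, 'UTF-8', 'UTF7-IMAP');
var_dump($decoded === $utf8Name);
```

The encoded name should match the one Outlook created (Wys&AUIBBQ-ne), since both start from the UTF-16 code units 0142 and 0105.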

How to read non-ASCII characters from CLI standard input

If I type å in CMD, fgets stops waiting for more input and the loop runs until I press Ctrl-C. If I type "normal" characters like a-z0-9!?() it works as expected.
I run the code in CMD under Windows 7 with UTF-8 as the charset (chcp 65001), and the file is saved as UTF-8 without a BOM. I use PHP 5.3.5 (cli).
<?php
echo "ÅÄÖåäö work here.\n";
while(1)
{
echo '> '. fgets(STDIN);
}
?>
If I change the charset with chcp 1252, the loop doesn't break when I type å and it prints "> å", but "ÅÄÖåäö work here" becomes "ÅÄÖåäö work here!". I know that I can change the file to ANSI, but then I can't use special characters like ╠╦╗.
So why does fgets stop waiting for user input after I have typed åäö?
And how can I fix this?
EDIT:
Also found a strange bug.
echo "öäåÅÄÖåäö work here! Or?".chr(10); -> ��äåÅÄÖåäö work here! Or? re! Or?.
If the first char in the echo is å/ä/ö, it prints strange chars AND the end of the output is duplicated by n - 1 chars (n = number of åäö at the beginning of the string).
E.g.: echo "åäö 1234" -> ??äö 123434 and echo "åäöåäö 1234" -> ??äöåäö 1234 1234.
EDIT2 (solved):
The problem was chcp 65001; now I use chcp 437.
Big thanks to Timothy Martens!
Possible solution:
echo '>';
$line = stream_get_line(STDIN, 999999, PHP_EOL);
Notes:
I was unable to reproduce your error using multiple versions of PHP.
PHP version 5.3.8 gave me no issues:
PHP 5.3 (5.3.8)
VC9 x86 Non Thread Safe (2011-Aug-23 12:26:18)
The architecture is Win XP SP3 32-bit.
You might try upgrading PHP.
I downloaded php-5.3.5-nts-Win32-VC6-x86 and was not able to reproduce your error; it works fine for me.
Edit: Additionally, I typed the characters using my Spanish keyboard.
Edit2:
CMD Command:
chcp 437
PHP Code:
<?php
$fp=fopen("php://stdin","r");
while(1){
$str = fgets(STDIN);
echo mb_detect_encoding($str)."\n";
echo '>'.stream_get_line($fp,999999,"\n")."\n";
}
?>
Output:
test
ASCII
test
>test
öïü
öïü
>öïü
I think that happens because PHP 5.3 does not properly support multibyte characters.
These chars: ÅÄÖåäö
In binary they are: c3 85 c3 84 c3 96 c3 a5 c3 a4 c3 b6 (without a BOM at the beginning)
Citing the PHP manual on strings:
A string is series of characters, where a character is the same as a byte. This means that PHP only supports a 256-character set, and hence does not offer native Unicode support. See details of the string type.
Normally this does not affect the final result, because the browser/reader understands multibyte characters, but for CMD and the STDIN buffer, ÅÄÖåäö is a 12-byte char array.
Only the mb_* functions handle basic operations on multibyte strings.
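The byte-versus-character difference is easy to see by comparing strlen with mb_strlen. This is a minimal sketch, assuming the mbstring extension is available; the string is given as the UTF-8 bytes listed above.

```php
<?php
// ÅÄÖåäö as the UTF-8 bytes c3 85 c3 84 c3 96 c3 a5 c3 a4 c3 b6
$s = "\xC3\x85\xC3\x84\xC3\x96\xC3\xA5\xC3\xA4\xC3\xB6";

echo strlen($s), "\n";             // counts bytes
echo mb_strlen($s, 'UTF-8'), "\n"; // counts characters

// The first character as seen by mbstring, printed as hex (two bytes)
echo bin2hex(mb_substr($s, 0, 1, 'UTF-8')), "\n";
```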
