Encoding string with non-ascii characters - php

I have a string such as this - Panamá. I need to convert this string to Panam\xE1 so it's readable in a JavaScript file I'm generating using PHP.
Is there a function to encode this in PHP? Any ideas would be appreciated.

My rule is: if you try to encode or escape data using preg_replace, massive mapping arrays, or str_replace, STOP. You are probably doing it wrong.
All it takes is one missed or erroneous mapping (and you WILL miss some mappings), and you end up with code that doesn't work in all cases and that corrupts your data in some of them. Whole libraries are already dedicated to doing these translations for you (e.g. iconv), and for escaping data you should use the proper PHP function.
If you plan on outputting the data to a browser (the fact that you want to encode for JavaScript suggests this), then I suggest using UTF-8 encoding. If your data is in latin-1, convert it with the utf8_encode function (deprecated since PHP 8.2; iconv or mb_convert_encoding do the same job).
Whether your PHP string contains ASCII characters or not, to send any data from PHP to JS you should ALWAYS use the json_encode function.
PHP code
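// Encoding of your input data; this example assumes the string below is stored as latin-1
$your_encoding = 'latin1';
$panama = "Panamá";
// Convert your data to UTF-8 if it isn't already (json_encode requires UTF-8 input)
$panama = iconv($your_encoding, "utf-8", $panama);
$panama_encoded = json_encode($panama);
echo "var js_panama = " . $panama_encoded . ";";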
JS Output
var js_panama = "Panam\u00e1";
Even though JSON supports unicode, it may not be compatible with your non UTF-8 javascript file. This is not a problem because the json_encode PHP function will escape unicode characters by default.
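As a minimal sketch of the approach above, the conversion and the json_encode call can be wrapped in one helper. The function name js_string_literal and its $sourceEncoding parameter are hypothetical, mbstring is assumed to be installed, and the JSON_HEX_* flags are an extra precaution (not part of the answer above) for output that ends up inside an HTML script block:

```php
<?php
// Hypothetical helper: returns a JavaScript string literal for $value.
// Assumes $value is encoded in $sourceEncoding and that mbstring is available.
function js_string_literal(string $value, string $sourceEncoding = 'UTF-8'): string
{
    // json_encode() only accepts UTF-8, so convert first when needed.
    if (strcasecmp($sourceEncoding, 'UTF-8') !== 0) {
        $value = mb_convert_encoding($value, 'UTF-8', $sourceEncoding);
    }
    // Non-ASCII characters are escaped as \uXXXX by default; the JSON_HEX_*
    // flags additionally escape <, >, &, ' and " inside the string.
    return json_encode($value, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);
}

// "Panam\xE1" is the latin-1 byte sequence for Panamá.
echo 'var js_panama = ' . js_string_literal("Panam\xE1", 'ISO-8859-1') . ";\n";
```

Because the result is plain ASCII, it should survive regardless of the encoding the generated JavaScript file is served in.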

Assuming that your input is in the latin-1 encoding then ord and dechex will do what you want:
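$result = preg_replace_callback(
    '/[\x80-\xff]/',
    function ($match) {
        // Emit each high byte as an uppercase \xHH escape (always two digits for 0x80-0xFF)
        return '\x' . strtoupper(dechex(ord($match[0])));
    },
    $input
);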
If your input is in any other encoding then you would need to know what encoding that is and adapt the solution accordingly. Note that in this case it would not be possible to use specifically the \x## notation in the JS output in all cases.
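To illustrate the note about \x## not being usable in all cases, here is a rough sketch for UTF-8 input. It is not the code from the answer above: the function name js_escape_non_ascii is hypothetical, and it assumes PHP 7.2+ with mbstring (for mb_ord). It uses \xHH only where JavaScript allows it (code points up to U+00FF) and falls back to \uHHHH, with a surrogate pair above U+FFFF:

```php
<?php
// Hypothetical helper: escapes every non-ASCII character of a UTF-8 string
// so the result can be placed inside a JavaScript string literal.
function js_escape_non_ascii(string $utf8): string
{
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        $cp = mb_ord($m[0], 'UTF-8');
        if ($cp <= 0xFF) {
            // \x takes exactly two hex digits in JavaScript.
            return sprintf('\\x%02X', $cp);
        }
        if ($cp <= 0xFFFF) {
            return sprintf('\\u%04X', $cp);
        }
        // Code points above U+FFFF need a UTF-16 surrogate pair.
        $cp -= 0x10000;
        return sprintf('\\u%04X\\u%04X', 0xD800 | ($cp >> 10), 0xDC00 | ($cp & 0x3FF));
    }, $utf8);
}

echo js_escape_non_ascii("Panam\u{E1}"), "\n"; // Panam\xE1
```

This only handles non-ASCII characters; quotes, backslashes and line breaks still need escaping, which is one more reason to prefer json_encode when the exact \x## form is not required.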

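This should work for you, as long as the string is UTF-8 and only contains code points up to U+00FF (a JavaScript \x escape takes exactly two hex digits):
$str = "Panamá";
$str = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
    // Convert the matched character to its 4-byte big-endian code point
    $utf = iconv('UTF-8', 'UCS-4BE', current($m));
    return sprintf('\x%s', ltrim(strtoupper(bin2hex($utf)), "0"));
}, $str);
echo $str;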
Output (Source Code):
Panam\xE1

Related

Php json_encode converts utf8 string to characters codes

I have a Persian text "سرما"
And then when I convert it to JSON using json_encode(), I get a series of escaped character codes such as \u0633 which seems to be expected and of a rational process. But my confusion lies where I don't know how to convert them back into readable string of characters. How should I do that in PHP?
Should I use anything of mb_* family? I also have checked json_encode() parameters and have found nothing appropriate for me.
UPDATE
what I get saved in my DB is:
["u0633u0631u0645u0627"]
Which shows the characters are not escaped properly. While if I change it to
["\u0633\u0631\u0645\u0627"] it becomes easily readable by json_decode()
They should be converted back on the other end when the JSON is decoded. This is the safest option, as it might not be possible to guarantee that the transmission or storage will not corrupt a multi-byte encoding.
If you're certain that everything is safe for UTF8 end-to-end you can do:
$res = json_encode($foo, \JSON_UNESCAPED_UNICODE);
http://php.net/manual/en/function.json-encode.php
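A short sketch of both options, using the word from the question. The variable names are only illustrative, and the "\u{...}" string syntax assumes PHP 7 or later:

```php
<?php
// The Persian word from the question, written with code point escapes so the
// example does not depend on the encoding of this source file.
$words = ["\u{0633}\u{0631}\u{0645}\u{0627}"];

$escaped = json_encode($words);                         // ASCII-only JSON
$plain   = json_encode($words, JSON_UNESCAPED_UNICODE); // raw UTF-8 JSON

echo $escaped, "\n"; // ["\u0633\u0631\u0645\u0627"]
echo $plain, "\n";

// Both forms decode back to the same PHP array.
var_dump(json_decode($escaped) === $words);
var_dump(json_decode($plain) === $words);
```

The escaped form is the more robust of the two for storage or transport, provided the backslashes are preserved on the way.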
Maybe try encoding the Unicode characters first and then JSON-encoding the result; on the receiving side, decode the JSON and then decode the Unicode.
Example:
//Encode
json_encode(utf8_encode($string));
//Decode
utf8_decode(json_decode($string));
It's simple: just use the JSON_UNESCAPED_SLASHES flag.
Your problem isn't UTF-8; the backslashes of the \u escapes are being lost. (Note that JSON_UNESCAPED_SLASHES itself only controls whether "/" is escaped; it does not affect backslashes.)
Example:
$bar = array("سرما");
$res = json_encode($bar, JSON_UNESCAPED_SLASHES);
// $res is equal to ["\u0633\u0631\u0645\u0627"]
If the backslashes are missing when you check the result in your MySQL database, it happens because you didn't use addslashes().
Example:
$bar = array("سرما");
$res = json_encode($bar, JSON_UNESCAPED_SLASHES);
$res = addslashes($res);
// $res is equal to [\"\\u0633\\u0631\\u0645\\u0627\"], now ready to use in a MySQL query

How to decode hex content?

I have $_SERVER['REDIRECT_SSL_CLIENT_S_DN'] content that contains some kind of hex data. How can I decode it?
$_SERVER['REDIRECT_SSL_CLIENT_S_DN'] = '../CN=\x00M\x00\xC4\x00,\x00I\x00S\x00,\x004\x000\x003\x001\x002\x000\x000\x002/SN=..';
$pattern = '/CN=(.*)\\/SN=/';
preg_match($pattern, $_SERVER['REDIRECT_SSL_CLIENT_S_DN'], $server_matches);
print_r($server_matches[1]);
The result is:
\x00M\x00\xC4\x00,\x00I\x00S\x00,\x004\x000\x003\x001\x002\x000\x000\x002
The result I need is:
MÄ,IS,40312002
I tried to decode it with chr(hexdec($value)); and it almost works, but in an HTML input I see a lot of question marks.
EDIT:
Additional test with results. Not yet perfect. Array reveals some errors: http://pastebin.com/BC4xxqmE
After using utf8_encode, you now have a multibyte string. This means you need to use PHP's multibyte (mb_) functions.
So, str_split won't work anymore. You need to use either mb_split or preg_split with the u flag.
$splitted = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
Here's a demo showing that your code is now working: http://ideone.com/nqeC0U
Have you tried a Unicode equivalent of chr()? chr() takes its argument modulo 256, which is why you see all those question marks.
The code below is from one of the posts on the chr page of the PHP manual:
function unichr($u) {
    return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
}
Update
// New function
function unichr($intval) {
    return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}
I tested it with \xC4 = 196 and it gives me an Ä.
http://codepad.viper-7.com/3htuwW
Your input is in UTF-8. Using that conversion is similar to utf8_decode, which converts to ISO-8859-1. UTF-8, though, supports more characters than ISO-8859-1, which is why \xC4 shows up as a question mark for you.
Try using something more powerful like iconv.
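One possible way to put these suggestions together is sketched below. It rests on an assumption that the question does not confirm: that the value is UTF-16BE text (a \x00 byte before each ASCII character suggests this) in which the bytes are written as \xHH escapes. The function name decode_escaped_utf16be is hypothetical and mbstring is assumed to be installed:

```php
<?php
// Hypothetical helper: turns literal \xHH escapes back into raw bytes and
// then converts the resulting byte string from UTF-16BE to UTF-8.
function decode_escaped_utf16be(string $escaped): string
{
    $bytes = preg_replace_callback(
        '/\\\\x([0-9A-Fa-f]{2})/',
        function ($m) {
            return chr(hexdec($m[1]));
        },
        $escaped
    );
    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16BE');
}

// The CN part from the question (single quotes keep the backslashes literal).
$cn = '\x00M\x00\xC4\x00,\x00I\x00S\x00,\x004\x000\x003\x001\x002\x000\x000\x002';
echo decode_escaped_utf16be($cn), "\n"; // MÄ,IS,40312002 when viewed as UTF-8
```

If the certificate field turns out to use a different encoding, only the last argument of mb_convert_encoding would need to change.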

How to produce JSON - un-escaped unicodes in php 5.3.x [duplicate]

When I use json_encode to encode my multilingual strings, it also changes special characters. What should I do to keep them the same?
For example
<?php
echo json_encode(array('şüğçö'));
It returns something like ["\u015f\u00fc\u011f\u00e7\u00f6"]
But I want ["şüğçö"]
Try this:
<?php
echo json_encode(array('şüğçö'), JSON_UNESCAPED_UNICODE);
In JSON any character in strings may be represented by a Unicode escape sequence. Thus "\u015f\u00fc\u011f\u00e7\u00f6" is semantically equal to "şüğçö".
Although those characters can also be used plain, json_encode probably prefers the Unicode escape sequences to avoid character encoding issues.
PHP 5.4 adds the option JSON_UNESCAPED_UNICODE, which does what you want. Note that json_encode always outputs UTF-8.
You shouldn't want this
It's definitely possible, even without PHP 5.4.
First, use json_encode() to encode the string and save it in a variable.
Then simply use preg_replace() to replace all \uxxxx with unicode again.
In versions prior to 5.4, json_encode() does not provide any option for choosing how non-ASCII characters are written in its output.
<?php
print_r(json_decode(json_encode(array('şüğçö'))));
/*
Array
(
[0] => şüğçö
)
*/
So do you really need to keep these characters unescaped in the JSON?
Json_encode charset solution for PHP 5.3.3
JSON_UNESCAPED_UNICODE is not available in PHP 5.3.3, so we used this method instead, and it works.
$data = array(
    'text' => 'Päiväkampanjat'
);
$json_encode = json_encode($data);
var_dump($json_encode); // text: "P\u00e4iv\u00e4kampanjat"
// Replace each \uXXXX escape with the UTF-8 character it stands for
$unescaped_data = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($matches) {
    return html_entity_decode('&#x' . $matches[1] . ';', ENT_COMPAT, 'UTF-8');
}, $json_encode);
var_dump($unescaped_data); // text is unescaped -> Päiväkampanjat

PHP and accent characters (Ba\u015f\u00e7\u0131l)

I have a string like so "Ba\u015f\u00e7\u0131l". I'm assuming those are some special accent characters. How do I:
1) Display the string with the accents (i.e. replace the codes with the actual characters)?
2) What is best practice for storing strings like this?
3) If I don't want to allow such characters, how do I replace them with "normal characters"?
My educated guess is that you obtained such values from a JSON string. If that's the case, you should properly decode the full piece of data with json_decode():
<?php
header('Content-Type: text/plain; charset=utf-8');
$data = '"Ba\u015f\u00e7\u0131l"';
var_dump( json_decode($data) );
?>
To display the characters look at How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?
You can store the character like that, or decoded, just make sure your storage can handle the UTF8 charset.
Use iconv with the translit flag.
Here's an example...
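function replace_unicode_escape_sequence($match) {
    // Turn the four hex digits of a \uXXXX escape into the UTF-8 character
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}
$str = preg_replace_callback('/\\\\u([0-9a-f]{4})/i', 'replace_unicode_escape_sequence', $str);
echo $str;
echo '<br/>';
// Transliterate to plain ASCII; the exact result depends on the iconv implementation and locale
$str = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
echo $str;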
Here's another option:
<html><head>
<!-- don't forget to tell the browser what encoding you're using: -->
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
</head><body><?php
$string = "Ba\u015f\u00e7\u0131l";
echo json_decode('"'.str_replace('"', '\"', $string).'"');
?></body></html>
This works because the \uXXXX syntax is what JSON uses. Note that json_decode() requires the JSON module, which is now a part of the standard PHP installation.
There is no native support in PHP to decode such strings.
There are several tricks to use native function though I am not sure that any of those is safe and injection proof :
json_decode . See http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
xml parser
regex replace
If anybody has other options for escaping/unescaping Utf8 using native function, please post a reply.
Another option using Zend Framework is to download the Zend_Utf8 proposal class. See more information at Zend_Utf8 proposal for Zend Framework
Outputting them would output the appropriate character. If you don't declare an encoding for the output document, the browser will try to guess the best one to show; otherwise you should figure it out and output it explicitly.
Simply store them as they are, or turn them into normal characters and store them as binary.
Use the iconv functions to convert from one encoding to another; you should then save your source file with the desired encoding to support it.

Translate URLENCODED data into UTF-8 in PHP

I've got a string in my database like 中华武魂. When I post my request to retrieve the data via my website, the data reaches the server in the format %E4%B8%AD%E5%8D%8E%E6%AD%A6%E9%AD%82
What decoding steps do I have to take in order to get it back to a usable form,
while also cleaning the user input to ensure they're not going to try an SQL injection attack?
(escape string before or after encoding?)
EDIT:
rawurldecode(); // returns "中åŽæ­¦é­‚"
urldecode(); // returns "中åŽæ­¦é­‚"
public function utf8_urldecode($str) {
$str = preg_replace("/%u([0-9a-f]{3,4})/i","&#x\\1;",urldecode($str));
return html_entity_decode($str,null,'UTF-8');
}
// returns "中åŽæ­¦é­‚"
... which actually works when I try and use it in an SQL statement.
I think because I was doing an echo and die(); without specifying a header of UTF-8 (thus I guess that was reading to me as latin)
Thanks for the help!
When your data is actually that percent-encoded form, you just have to call rawurldecode:
$data = '%E4%B8%AD%E5%8D%8E%E6%AD%A6%E9%AD%82';
$str = rawurldecode($data);
This suffices as the data already is encoded in UTF-8: 中 (U+4E2D) is encoded with the byte sequence 0xE4B8AD in UTF-8 and that is encoded with %E4%B8%AD when using the percent-encoding.
That your output does not look as expected is probably because the output is interpreted with the wrong character encoding, probably Windows-1252 instead of UTF-8. In Windows-1252, 0xE4 represents ä, 0xB8 represents ¸, 0xAD is the (normally invisible) soft hyphen, 0xE5 represents å, and so on. So make sure to specify the output character encoding properly.
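A small sketch that makes the byte-level picture visible. The header call and the hex dump are illustrative additions, not part of the answer above:

```php
<?php
// Declare the output encoding up front so the browser does not have to guess.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$data = '%E4%B8%AD%E5%8D%8E%E6%AD%A6%E9%AD%82';
$str = rawurldecode($data);

// The decoded value is 12 bytes: four characters of three UTF-8 bytes each.
echo strlen($str), "\n";              // 12
echo strtoupper(bin2hex($str)), "\n"; // E4B8ADE58D8EE6ADA6E9AD82
echo $str, "\n";                      // readable only if displayed as UTF-8
```

The bytes are the same whichever way they are displayed; only the declared encoding decides whether they show up as four Chinese characters or as Western European symbols.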
Use PHP's urldecode:
http://php.net/manual/en/function.urldecode.php
You have choices here: urldecode or rawurldecode.
If you had encoded your string using urlencode, you must use urldecode because of the way spaces are handled: urlencode converts spaces to +, whereas rawurlencode converts them to %20, and only urldecode turns + back into a space.
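The difference in space handling can be seen with a short sketch (the sample value 'a b+c' is made up for illustration):

```php
<?php
$value = 'a b+c';

$form = urlencode($value);    // a+b%2Bc   (space becomes +)
$raw  = rawurlencode($value); // a%20b%2Bc (space becomes %20)

echo $form, "\n", $raw, "\n";

// Matching pairs round-trip cleanly.
var_dump(urldecode($form) === $value);   // bool(true)
var_dump(rawurldecode($raw) === $value); // bool(true)

// Mixing them does not: rawurldecode leaves the + in place.
echo rawurldecode($form), "\n";          // a+b+c
```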
