UTF-8 to UTF-16BE - PHP

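I save a record "فحص الرسالة العربية" in PHP, but it is always stored as:
&#1601;&#1581;&#1589; &#1575;&#1604;&#1585;&#1587;&#1575;&#1604;&#1577; &#1575;&#1604;&#1593;&#1585;&#1576;&#1610;&#1577;
I want to convert this into UTF-16BE characters when I retrieve it, so I am using a function that returns: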
002600230031003600300031003b002600230031003500380031003b002600230031003500380039003b0020002600230031003500370035003b002600230031003600300034003b002600230031003500380035003b002600230031003500380037003b002600230031003500370035003b002600230031003600300034003b002600230031003500370037003b0020002600230031003500370035003b002600230031003600300034003b002600230031003500390033003b002600230031003500380035003b002600230031003500370036003b002600230031003600310030003b002600230031003500370037003b
This is the function I am using to convert the string retrieved from the database:
function convertCharsn($string) {
    $in = '';
    // Re-encode the UTF-8 input as UTF-16BE bytes
    $out = iconv('UTF-8', 'UTF-16BE', $string);
    // Append each byte as two uppercase hex digits
    for ($i = 0; $i < strlen($out); $i++) {
        $in .= sprintf("%02X", ord($out[$i]));
    }
    return $in;
}
But when I type the same characters into the page at the URL below, it shows different characters from my string:
http://www.routesms.com/downloads/onlineunicode.asp
It returns:
0641062D063500200627064406310633062706440629002006270644063906310628064A0629
I want my string to be converted the same way as on the page at the above URL.
My database collation is utf8_general_ci.

Basically, you need to decode those characters out of HTML entities first. Just use html_entity_decode():
$rawChars = html_entity_decode($string, ENT_QUOTES | ENT_HTML401, 'UTF-8');
convertCharsn($rawChars);
Otherwise, you're just encoding the entities themselves. You can see this because & is 0026 in UTF-16BE and # is 0023, which gives the repeating sequence 00260023 in the output that you posted. So decode it first, and you should be set.
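As a minimal sketch of the decode-then-convert order described above (the helper name utf8ToUtf16beHex is hypothetical, and mb_convert_encoding plus bin2hex stand in for the asker's iconv/sprintf loop), the two steps can be combined like this:

```php
<?php
// Hypothetical helper, not from the original post.
function utf8ToUtf16beHex(string $stored): string
{
    // Turn numeric entities such as &#1601; back into real UTF-8 characters.
    $raw = html_entity_decode($stored, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // Re-encode as UTF-16BE and dump the bytes as uppercase hex.
    return strtoupper(bin2hex(mb_convert_encoding($raw, 'UTF-16BE', 'UTF-8')));
}

echo utf8ToUtf16beHex('&#1601;&#1581;&#1589;'), "\n"; // 0641062D0635
```

The output matches the first three code units of the expected result in the question (0641 062D 0635).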

Related

Encode cyrillic UTF-8 to Unicode symbols

In a project's code there's this part:
$companies = $db->fetchAll("SELECT id FROM companies where contact_person like '%num%\"". $num. "\"%'");
For example, $num = 'ЦВ123456' (ЦВ are Cyrillic characters). But in the DB, ЦВ is stored as Unicode escapes, \u0426\u0412, so there are no hits. How can I convert $num so that the query becomes ...contact_person like '\u0426\u0412123456 ?
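You can use PHP's built-in mb_convert_encoding():
convert your $string to the encoding used in the database, and query with the encoded value.
https://www.php.net/manual/en/function.mb-convert-encoding.php
Something like this:
$string = "ЦВ123456";
// Argument order is mb_convert_encoding($string, $to_encoding, $from_encoding)
$unicode = mb_convert_encoding($string, "UTF-16BE", "UTF-8");
Note that this yields raw UTF-16 bytes. If the column holds the literal text \u0426\u0412 (JSON-style escapes), the search value has to be written in that escaped form instead.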
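If the column really contains JSON-style escapes as literal text, one possible approach is to let json_encode() produce the same escapes. This is only a sketch under that assumption; the helper name toJsonEscapes is hypothetical:

```php
<?php
// Hypothetical helper: returns the JSON-style \uXXXX form of a UTF-8 string.
function toJsonEscapes(string $utf8): string
{
    // json_encode() escapes non-ASCII characters as \uXXXX by default;
    // substr() strips the surrounding double quotes it adds.
    return substr(json_encode($utf8), 1, -1);
}

$num = 'ЦВ123456';
echo toJsonEscapes($num), "\n"; // \u0426\u0412123456
```

In a MySQL LIKE pattern the backslash is itself an escape character, so the escaped value would likely need further escaping (or a bound parameter) before it is placed in the query.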

How to change ASCII alphabet to UTF-8 in PHP

I have an ASCII string and would like to change its encoding to UTF-8.
But I have not found a simple function to change ASCII to UTF-8 in PHP.
And vice versa: I would like to change a UTF-8 alphabet to ASCII.
Please advise.
I have tried:
<?php
// utf-8
$str = "ＣＨＯＮＫＩＯＫ";
// I don't even know how to print these UTF-8 characters in PHP. I just copied and pasted the string.
// strlen($str) => 24 bytes
// mb_detect_encoding($str) => UTF-8
$str2 = "CHONKIOK";
// strlen($str2) => 8 bytes
// mb_detect_encoding($str2) => ASCII
// change ascii to utf-8
$str = mb_convert_encoding($str2, "UTF-8");
echo mb_detect_encoding($str);
// returns ASCII
What you are doing is correct.
The documentation for mb_detect_encoding states that it detects the most likely character encoding.
As the entire ASCII set is contained within UTF-8 at the exact same character positions, this function is telling you that it's an ASCII string because it technically is. The bytes of this string are identical whether it is encoded as ASCII or as UTF-8.
As you've found, when you include some characters outside of the ASCII set, it will give you the next most probable encoding.
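A short check, offered as a sketch, that illustrates the point about identical bytes: converting a pure ASCII string to UTF-8 changes nothing, and the same bytes also validate as UTF-8.

```php
<?php
$ascii = "CHONKIOK";

// Converting ASCII to UTF-8 leaves every byte untouched.
$converted = mb_convert_encoding($ascii, "UTF-8", "ASCII");
var_dump($converted === $ascii);                // bool(true)

// The same bytes are valid in both encodings.
var_dump(mb_check_encoding($ascii, "ASCII"));   // bool(true)
var_dump(mb_check_encoding($ascii, "UTF-8"));   // bool(true)
```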
What exactly should I do to obtain this string: "ＣＨＯＮＫＩＯＫ" from "CHONKIOK"?
The characters you're after are called "Fullwidth Latin" characters.
Given that the Ｃ character provided is code point 65,315 (U+FF23) and a regular C is code point 67 (U+0043), you could possibly obtain the strings you're after by adding the difference of 65,248 (0xFEE0). This is only possible because the alphabet tends to repeat in the same order throughout different parts of the character charts.
You can get the code point of a character using mb_ord and convert it back to a character using mb_chr, after adding 65,248.
That might look something like:
$str_input = "ABC abc 123";
$convertable = "ABCDEFG12349abcdefg";
$str_output = "";
// Byte-wise indexing is safe here only because the input is plain ASCII
for ($i = 0; $i < strlen($str_input); $i++) {
    $char = mb_ord($str_input[$i], "UTF-8");
    // Shift the listed characters into the fullwidth block (str_contains needs PHP 8+)
    if (str_contains($convertable, $str_input[$i])) {
        $char += 65248;
    }
    $str_output .= mb_chr($char, "UTF-8");
}
echo $str_output; // outputs "ＡＢＣ ａｂｃ １２３"
Just be sure to include the whole alphabet in $convertable.
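As an alternative sketch that avoids listing every character, the same offset can be applied to the whole printable ASCII range, since U+0021..U+007E lines up with the fullwidth block U+FF01..U+FF5E. The function name toFullwidth is hypothetical, and mb_str_split assumes PHP 7.4 or later:

```php
<?php
// Hypothetical helper, assuming UTF-8 input and PHP 7.4+ (mb_str_split).
function toFullwidth(string $s): string
{
    $out = '';
    foreach (mb_str_split($s, 1, 'UTF-8') as $ch) {
        $cp = mb_ord($ch, 'UTF-8');
        // Printable ASCII U+0021..U+007E maps onto U+FF01..U+FF5E.
        if ($cp >= 0x21 && $cp <= 0x7E) {
            $cp += 0xFEE0; // 65,248
        }
        $out .= mb_chr($cp, 'UTF-8');
    }
    return $out;
}

echo toFullwidth("CHONKIOK"), "\n";          // ＣＨＯＮＫＩＯＫ
echo strlen(toFullwidth("CHONKIOK")), "\n";  // 24
```

Spaces (U+0020) are left alone here, matching the behaviour of the loop above.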
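Try this to convert ISO-8859-1 (a superset of ASCII) to UTF-8:
utf8_encode(string $string): string
Try this to convert UTF-8 back to ISO-8859-1:
utf8_decode(string $string): string
Note that both functions are deprecated as of PHP 8.2.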

How to convert %E5%AE%89%E5%85%A8 to this format &amp;#23433;&amp;#20840; in PHP 5.1

I get a keyword variable from a URL like this:
search.php?keyword=%E5%AE%89%E5%85%A8
It is UTF-8 encoded. I want to convert the keyword to this format:
&amp;#23433;&amp;#20840;
so that I can use it in MySQL/PHP.
How can I achieve this?
This should work:
$html = mb_convert_encoding($_GET['keyword'], 'HTML-ENTITIES', 'UTF-8');
echo htmlspecialchars($html);
// outputs &amp;#23433;&amp;#20840;, which a browser displays as &#23433;&#20840;
The first one is URL-encoded, and what you want are HTML entities.
$string = urldecode('%E5%AE%89%E5%85%A8'); // == 安全
// htmlentities() only emits named entities, so numeric ones need mb_encode_numericentity()
$string2 = mb_encode_numericentity($string, array(0x80, 0x10FFFF, 0, 0x1FFFFF), 'UTF-8'); // == &#23433;&#20840;
$string3 = htmlspecialchars($string2); // == &amp;#23433;&amp;#20840;
The one from your question is double encoded (& got converted to &amp;), which I assume is wrong.
23433 is decimal and equals hexadecimal 5B89 (written &#x5B89; as an entity). The same goes for the second code. For the browser, it doesn't matter whether it's decimal or hexadecimal.
If you really intend double encoding, use htmlspecialchars() on the result, as in the last line of the code above.
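A quick sketch to check the decimal/hexadecimal equivalence: decoding either entity form with html_entity_decode() gives the same UTF-8 string as URL-decoding the original keyword.

```php
<?php
// Decimal and hexadecimal numeric entities for the same two characters.
$dec = html_entity_decode('&#23433;&#20840;', ENT_QUOTES | ENT_HTML401, 'UTF-8');
$hex = html_entity_decode('&#x5B89;&#x5168;', ENT_QUOTES | ENT_HTML401, 'UTF-8');

var_dump($dec === $hex);                              // bool(true)
var_dump($dec === urldecode('%E5%AE%89%E5%85%A8'));   // bool(true)
printf("%d = 0x%X\n", 23433, 23433);                  // 23433 = 0x5B89
```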

PHP multibyte string accessing via key [$i]

There is a string $string = "öşğüçı"; note that the last character is a dotless ı, not i.
When I try to print the first character with echo $string[0], it prints nothing. I know these are multibyte characters. Printing the first character can be accomplished with
echo $string[0].$string[1], but that is not what I want. The question is:
how can I get the code below to work
for($i = 0; $i < sizeof($string); $i++)
echo $string[$i] . " ";
so that it prints the following?
ö ş ğ ü ç ı
Masters of PHP, please help.
To split a string into characters:
$string = "öşğüçı";
preg_match_all('/./u', $string, $m);
$chars = $m[0];
Note the "u" flag in the regular expression; it makes PCRE treat the pattern and the subject as UTF-8.
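Put together with the loop from the question, the regex approach might look like the sketch below. The helper name splitChars is hypothetical, and the extra "s" flag is an addition so that "." also matches newline characters:

```php
<?php
// Hypothetical helper, not part of the original answer.
function splitChars(string $s): array
{
    // "u" = treat the subject as UTF-8; "s" = let "." match newlines too.
    preg_match_all('/./us', $s, $m);
    return $m[0];
}

$chars = splitChars("öşğüçı");
for ($i = 0; $i < count($chars); $i++) {
    echo $chars[$i] . " ";
}
// prints: ö ş ğ ü ç ı
```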
<?php
// inform the browser you are sending text encoded with utf-8
header("Content-type: text/plain; charset=utf-8");
// if you're using a literal string make sure the file
// is saved using utf-8 as encoding
// or if you're getting it from another source make sure
// you get it in utf-8
$string = "öşğüçı";
// if you do not have your string in utf-8
// you need to find out the actual encoding
// and use "iconv" to convert it to utf-8
// process the string using the mb_* functions
// knowing that it is encoded in utf-8 at this point
$encoding = "UTF-8";
for ($i = 0; $i < mb_strlen($string, $encoding); $i++) {
    echo mb_substr($string, $i, 1, $encoding);
}
Of course, if you prefer another encoding (I don't see why you would; maybe just UTF-16), you can substitute each instance of "utf-8" above with your desired encoding and read and use the string accordingly.
Example for UTF-16 output (the file/input is encoded in UTF-8):
<?php
header("Content-type: text/plain; charset=utf-16");
$string = "öşğüçı";
$string = iconv("UTF-8", "UTF-16", $string);
$encoding = "UTF-16";
for ($i = 0; $i < mb_strlen($string, $encoding); $i++) {
    echo mb_substr($string, $i, 1, $encoding);
}
You cannot handle multibyte strings in this way in PHP. If it's a fixed-length encoding, where every character takes up, say, two bytes, you can simply take two bytes at a time. If it's a variable-length encoding like UTF-8, though, you will need to use mb_substr and mb_strlen.
May I recommend What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, which explains this in more detail.
Use iconv_substr or mb_substr to get a character, and iconv_strlen or mb_strlen to get the length of the string.
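To make the byte-versus-character distinction concrete, here is a small sketch (assuming the source file is saved as UTF-8) that compares the byte-oriented and multibyte-aware functions on the string from the question:

```php
<?php
$string = "öşğüçı"; // six characters, two bytes each in UTF-8

echo strlen($string), "\n";                     // 12 (bytes)
echo mb_strlen($string, "UTF-8"), "\n";         // 6  (characters)

// $string[0] is only the first byte of "ö", which is not printable on its own.
echo bin2hex($string[0]), "\n";                 // c3
echo mb_substr($string, 0, 1, "UTF-8"), "\n";   // ö
```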

json_encode() non-UTF-8 strings?

So I have an array of strings, and all of the strings are using the system default ANSI encoding and were pulled from a SQL database. So there are 256 different possible character byte values (single byte encoding).
Is there a way I can get json_encode() to work and display these characters instead of having to use utf8_encode() on all of my strings and ending up with stuff like \u0082?
Or is that the standard for JSON?
Is there a way I can get json_encode() to work and display these characters instead of having to use utf8_encode() on all of my strings and ending up with stuff like "\u0082"?
If you have an ANSI-encoded string, utf8_encode() is the wrong function for this. You need to properly convert it from ANSI to UTF-8 first. That will certainly reduce the number of Unicode escape sequences like \u0082 in the JSON output, but technically these sequences are valid JSON, so you need not fear them.
Converting ANSI to UTF-8 with PHP
json_encode works with UTF-8 encoded strings only. If you need to create valid json successfully from an ANSI encoded string, you need to re-encode/convert it to UTF-8 first. Then json_encode will just work as documented.
To convert an encoding from ANSI (more correctly I assume you have a Windows-1252 encoded string, which is popular but wrongly referred to as ANSI) to UTF-8 you can make use of the mb_convert_encoding() function:
$str = mb_convert_encoding($str, "UTF-8", "Windows-1252");
Another function in PHP that can convert the encoding / charset of a string is iconv, which is based on libiconv. You can use it as well:
$str = iconv("CP1252", "UTF-8", $str);
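As a sketch of the full path from a Windows-1252 string to JSON (the sample bytes are made up for illustration), the conversion followed by json_encode() might look like this:

```php
<?php
// Illustrative input: "café €" as Windows-1252 bytes (0xE9 = é, 0x80 = €).
$ansi = "caf\xE9 \x80";

// Convert to UTF-8 first; json_encode() only accepts UTF-8.
$utf8 = mb_convert_encoding($ansi, "UTF-8", "Windows-1252");

echo json_encode($utf8), "\n";                          // "caf\u00e9 \u20ac"
echo json_encode($utf8, JSON_UNESCAPED_UNICODE), "\n";  // "café €"
```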
Note on utf8_encode()
utf8_encode() only works for Latin-1, not for ANSI. So you will destroy some of the characters in that string when you run it through that function.
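The difference shows up on a byte such as 0x82, which is a control code in Latin-1 but a punctuation mark in Windows-1252. In this sketch, mb_convert_encoding() from ISO-8859-1 is used as a stand-in for what utf8_encode() does:

```php
<?php
$byte = "\x82";

// Interpreting the byte as Latin-1 (the assumption utf8_encode() makes).
$asLatin1 = mb_convert_encoding($byte, "UTF-8", "ISO-8859-1");
// Interpreting the same byte as Windows-1252.
$asCp1252 = mb_convert_encoding($byte, "UTF-8", "Windows-1252");

echo json_encode($asLatin1), "\n"; // "\u0082" (C1 control character)
echo json_encode($asCp1252), "\n"; // "\u201a" (single low-9 quotation mark)
```

This would explain the \u0082 seen in the question if the data is in fact Windows-1252.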
Related: What is ANSI format?
For more fine-grained control over what json_encode() returns, see the list of predefined constants (PHP version dependent, including PHP 5.4; some constants remain undocumented and are so far available in the source code only).
Changing the encoding of an array/iteratively (PDO comment)
As you wrote in a comment that you have problems applying the function to an array, here is a code example. The encoding always has to be changed before calling json_encode. That's just a standard array operation; for the simpler case of PDO::fetch(), a foreach iteration:
while ($row = $q->fetch(PDO::FETCH_ASSOC))
{
    foreach ($row as &$value)
    {
        $value = mb_convert_encoding($value, "UTF-8", "Windows-1252");
    }
    unset($value); # safety: remove reference
    $items[] = $row; # already UTF-8; running utf8_encode() here would double-encode
}
The JSON standard ENFORCES Unicode encoding. From RFC 4627:
3. Encoding
JSON text SHALL be encoded in Unicode. The default encoding is
UTF-8.
Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8
Therefore, in the strictest sense, ANSI-encoded JSON wouldn't be valid JSON; this is why PHP enforces Unicode encoding when using json_encode().
As for "default ANSI", I'm pretty sure that your strings are encoded in Windows-1252. It is incorrectly referred to as ANSI.
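The null-byte table quoted from the RFC can be turned into a small check. This is only a sketch: the function name detectJsonEncoding is hypothetical, and treating inputs shorter than four bytes as UTF-8 is an assumption, not something the RFC text above specifies.

```php
<?php
// Hypothetical helper implementing the null-pattern table quoted above.
function detectJsonEncoding(string $bytes): string
{
    if (strlen($bytes) < 4) {
        return 'UTF-8'; // assumption for very short inputs
    }
    // Build a pattern such as "0x0x": "0" for a null byte, "x" otherwise.
    $pattern = '';
    for ($i = 0; $i < 4; $i++) {
        $pattern .= ($bytes[$i] === "\0") ? '0' : 'x';
    }
    switch ($pattern) {
        case '000x': return 'UTF-32BE';
        case '0x0x': return 'UTF-16BE';
        case 'x000': return 'UTF-32LE';
        case 'x0x0': return 'UTF-16LE';
        default:     return 'UTF-8';
    }
}

$json = '{"a":1}';
echo detectJsonEncoding($json), "\n";                                           // UTF-8
echo detectJsonEncoding(mb_convert_encoding($json, 'UTF-16BE', 'UTF-8')), "\n"; // UTF-16BE
echo detectJsonEncoding(mb_convert_encoding($json, 'UTF-32LE', 'UTF-8')), "\n"; // UTF-32LE
```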
<?php
$array = array('first word' => array('Слово','Кириллица'),'second word' => 'Кириллица','last word' => 'Кириллица');
echo json_encode($array);
/*
return {"first word":["\u0421\u043b\u043e\u0432\u043e","\u041a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"],"second word":"\u041a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430","last word":"\u041a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"}
*/
echo json_encode($array, JSON_UNESCAPED_UNICODE); // the constant's value is 256
/*
return {"first word":["Слово","Кириллица"],"second word":"Кириллица","last word":"Кириллица"}
*/
?>
JSON_UNESCAPED_UNICODE (integer)
Encode multibyte Unicode characters literally (default is to escape as \uXXXX). Available since PHP 5.4.0.
http://php.net/manual/en/json.constants.php#constant.json-unescaped-unicode
I found the following answer for an analogous problem with a nested array that was not UTF-8 encoded and that I had to JSON-encode:
$inputArray = array(
    'a'=>'First item - à',
    'c'=>'Third item - é'
);
$inputArray['b']= array (
    'a'=>'First subitem - ù',
    'b'=>'Second subitem - ì'
);
if (!function_exists('recursive_utf8')) {
    function recursive_utf8 ($data) {
        if (!is_array($data)) {
            return utf8_encode($data);
        }
        $result = array();
        foreach ($data as $index=>$item) {
            if (is_array($item)) {
                $result[$index] = array();
                foreach($item as $key=>$value) {
                    $result[$index][$key] = recursive_utf8($value);
                }
            }
            else if (is_object($item)) {
                $result[$index] = array();
                foreach(get_object_vars($item) as $key=>$value) {
                    $result[$index][$key] = recursive_utf8($value);
                }
            }
            else {
                $result[$index] = recursive_utf8($item);
            }
        }
        return $result;
    }
}
$outputArray = json_encode(array_map('recursive_utf8', $inputArray ));
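json_encode($str, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);
These flags make json_encode() escape <, >, &, ' and " as \u00XX sequences. They do not convert Windows ANSI to UTF-8 by themselves, so the input still has to be valid UTF-8 for the error to go away.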
Use this instead:
<?php
// $return_arr = the array of data to JSON-encode
// $out = the output
// Don't forget to escape the data before using it!
$out = '["' . implode('","', $return_arr) . '"]';
?>
Copied from the comments on the json_encode page of the PHP manual. Always read the comments. They are useful.
