How to convert Unicode to Arabic characters in PHP?

Let us say that the string is
$uni_str="06280628002006280628";
In Arabic, it is: بب بب
How can I convert it in PHP without using HTML entities, like this:
for($i=0; $i<strlen($uni_str); $i+=4)
{
    $text_str .= "&#x".substr($uni_str,$i,4).";";
}
This code only solves the problem of displaying the result in an HTML page, but I want to put the result in a PHP variable. The result of the code above was:
&#x0628;&#x0628;&#x0020;&#x0628;&#x0628;

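I found the solution, hope it helps:
function uni2arabic($uni_str)
{
    $All = ''; // initialise the accumulator to avoid an undefined-variable notice
    for($i=0; $i<strlen($uni_str); $i+=4)
    {
        // build a numeric entity such as &#x0628; and decode it to UTF-8
        $new="&#x".substr($uni_str,$i,4).";";
        $txt = html_entity_decode($new, ENT_COMPAT, "UTF-8");
        $All.=$txt;
    }
    return $All;
}
The returned value ($All) contains the Arabic string.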

Use hex2bin to decode the hex into a sequence of bytes, and then you can unpack each pair of bytes as a UTF-16 code unit (which is what I assume your string represents).
Assuming you are producing UTF-8 text output:
iconv('UTF-16BE', 'UTF-8', hex2bin('06280628002006280628'))
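A minimal sketch of the same idea wrapped in a function, assuming the input is always big-endian UTF-16 written as hex digits (four per code unit) and that PHP 7+ with the iconv extension is available. The name hex_utf16be_to_utf8 is made up for illustration:

```php
<?php
// Hypothetical helper (not from the original answers): decode a hex string
// holding UTF-16BE code units into UTF-8 text.
function hex_utf16be_to_utf8(string $hex): string
{
    // Each UTF-16 code unit is 2 bytes, i.e. 4 hex digits.
    if (strlen($hex) % 4 !== 0 || !ctype_xdigit($hex)) {
        throw new InvalidArgumentException('Expected groups of four hex digits');
    }
    $utf8 = iconv('UTF-16BE', 'UTF-8', hex2bin($hex));
    if ($utf8 === false) {
        throw new InvalidArgumentException('Input is not valid UTF-16BE');
    }
    return $utf8;
}

$text = hex_utf16be_to_utf8('06280628002006280628');
echo $text, "\n"; // بب بب
```

The result is an ordinary PHP string, so it can be stored in a variable, written to a file, or echoed on a page served as UTF-8.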

The following code allows you to decode the characters as well as re-encode them if necessary
Code :
if (!function_exists('codepoint_encode')) {
function codepoint_encode($str) {
return substr(json_encode($str), 1, -1);
}
}
if (!function_exists('codepoint_decode')) {
function codepoint_decode($str) {
return json_decode(sprintf('"%s"', $str));
}
}
How to use :
header('Content-Type: text/html; charset=utf-8');
var_dump(codepoint_encode('ඔන්ලි'));
var_dump(codepoint_encode('සින්ග්ලිෂ්'));
var_dump(codepoint_decode('\u0d94\u0db1\u0dca\u0dbd\u0dd2'));
var_dump(codepoint_decode('\u0dc3\u0dd2\u0db1\u0dca\u0d9c\u0dca\u0dbd\u0dd2\u0dc2\u0dca'));
Output :
string(30) "\u0d94\u0db1\u0dca\u0dbd\u0dd2"
string(60) "\u0dc3\u0dd2\u0db1\u0dca\u0d9c\u0dca\u0dbd\u0dd2\u0dc2\u0dca"
string(15) "ඔන්ලි"
string(30) "සින්ග්ලිෂ්"
If you want more complex functionality, see How to get the character from unicode code point in PHP?.

Related

Converting Windows-1252 to UTF-8 Issue

I have created a function to convert the following text to UTF-8, as it appeared to be in Windows-1252 format after being copied into a database table from a Word document.
Testing weird characterâ€™s correction
This seems to fix the dodgy â€™ characters. However, I'm now getting � in the following:
Devon�s most prominent dealerships
When passing the following through the same function:
Devon's most prominent dealerships
Below is the code which does the converting:
function Windows1252ToUTF8($text) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
Edit:
The database can't be changed because it holds thousands of custom records. I tried the code below, but mb_detect_encoding thinks characterâ€™s correction is UTF-8.
function Windows1252ToUTF8($text) {
if (mb_detect_encoding($text) == "UTF-8") {
return $text;
}
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
Edit 2:
Just tried the example from the PHP Documentation:
$str = 'áéóú'; // ISO-8859-1
echo "<pre>";
var_dump(mb_detect_encoding($str, 'UTF-8')); // 'UTF-8'
var_dump(mb_detect_encoding($str, 'UTF-8', true)); // false
echo "</pre>";
die();
but this simply outputs:
string(5) "UTF-8"
string(5) "UTF-8"
So I can't even detect the encoding of the string :S
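One likely reason both calls report UTF-8 is that the script file itself is saved as UTF-8, in which case the literal 'áéóú' really is valid UTF-8. A small sketch, assuming the mbstring extension is loaded, that writes the bytes explicitly with hex escapes so the file encoding no longer matters:

```php
<?php
// "áéóú" spelled out byte by byte in two encodings.
$latin1 = "\xE1\xE9\xF3\xFA";                 // ISO-8859-1: one byte per character
$utf8   = "\xC3\xA1\xC3\xA9\xC3\xB3\xC3\xBA"; // UTF-8: two bytes per character

// mb_check_encoding validates the byte sequence instead of guessing.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
```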
Edit 3:
This seems to do the trick:
function Windows1252ToUTF8($text) {
$badChars = [ "â", "á", "ú", "é", "ó" ];
$match = preg_match("/[".join("",$badChars)."]/", $text);
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
Edit 4:
I have matched the hex values to their corresponding characters. However, when I get to the weird characters they don't appear to match.
Converting "Testing weird characterâ€™s correction" using bin2hex
gives me
54657374696e6720776569726420636861726163746572c3a2e282ace284a27320636f7272656374696f6e
This means the "â€™" is actually the bytes \xc3\xa2\xe2\x82\xac\xe2\x84\xa2. This is a typical sign of a UTF-8 string having been interpreted as Windows Latin-1/1252, and then re-encoded to UTF-8.
’ (UTF-8 \xe2\x80\x99)
→ bytes interpreted as Windows-1252 equal the string â€™
→ characters encoded to UTF-8 result in \xc3\xa2\xe2\x82\xac\xe2\x84\xa2
To restore the original, you need to reverse that chain of mis-encodings:
$s = "\xc3\xa2\xe2\x82\xac\xe2\x84\xa2";
echo mb_convert_encoding($s, 'Windows-1252', 'UTF-8');
This interprets the string as UTF-8, converts it to the Windows-1252 equivalent, which is then the valid UTF-8 representation of ’.
Preferably you figure out at what point the encoding screwed up like this and you stop that from happening in the future. If it happened by "copy and pasting from Word", then basically somebody pasted garbage into your database and you need to fix the workflow with Word somehow. Otherwise there may be an incorrect encoding-conversion step somewhere in your code which you need to fix.
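The chain described above can be reproduced end to end. This is only an illustrative sketch, assuming the mbstring extension is loaded; the variable names are made up:

```php
<?php
$original = "\xE2\x80\x99"; // ’ (U+2019) as UTF-8

// The mis-encoding step: treat the UTF-8 bytes as Windows-1252 characters
// and encode those characters as UTF-8.
$garbled = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');
echo bin2hex($garbled), "\n"; // c3a2e282ace284a2

// The repair step: reverse the conversion.
$restored = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');
var_dump($restored === $original); // bool(true)
```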
The following seems to do the trick. Not the way I wanted it to work by checking for specific characters, but it does the trick.
function Windows1252ToUTF8($text) {
$badChars = [ "â", "á", "ú", "é", "ó" ];
$match = preg_match("/[".join("",$badChars)."]/", $text);
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
Edit:
function Windows1252ToUTF8($text) {
// http://www.fileformat.info/info/charset/UTF-8/list.htm
$illegal_hex = [ "c3a2", "c3a1", "c3ba", "c3a9", "c3b3" ];
$match = preg_match("/".join("|",$illegal_hex)."/", bin2hex($text));
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
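Instead of listing specific "bad" characters, one possible heuristic is to attempt the reverse conversion and keep it only if it round-trips cleanly. The sketch below assumes the mbstring extension and PHP 7+; the function name is hypothetical. It is still a guess and can misfire on text that legitimately contains sequences such as "Ã©":

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 read as Windows-1252",
// but only when the result is valid UTF-8 and converts back to the input.
function repair_double_encoded_utf8(string $text): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        return $text; // not UTF-8 at all, leave it untouched
    }
    $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');
    $looksRepaired = $candidate !== $text
        && mb_check_encoding($candidate, 'UTF-8')
        && mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') === $text;
    return $looksRepaired ? $candidate : $text;
}

$garbled = "Testing weird character\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2s correction";
echo repair_double_encoded_utf8($garbled), "\n";
echo repair_double_encoded_utf8("Devon's most prominent dealerships"), "\n";
```

Plain ASCII and correctly encoded UTF-8 such as "café" pass through unchanged, because converting them to Windows-1252 either changes nothing or yields bytes that are not valid UTF-8.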

How to convert Emoji from Unicode in PHP?

I use this table of Emoji and try this code:
<?php print json_decode('"\u2600"'); // This convert to ☀ (black sun with rays) ?>
If I try to convert this \u1F600 (grinning face) through json_decode, I see this symbol — ὠ0.
Whats wrong? How to get right Emoji?
PHP 5
JSON's \u can only handle one UTF-16 code unit at a time, so you need to write the surrogate pair instead. For U+1F600 this is \uD83D\uDE00, which works:
echo json_decode('"\uD83D\uDE00"');
😀
PHP 7
You now no longer need to use json_decode and can just use the \u and the unicode literal:
echo "\u{1F30F}";
🌏
In addition to the answer of Tino, I'd like to add code to convert hexadecimal code like 0x1F63C to a unicode symbol in PHP5 with splitting it to a surrogate pair:
function codeToSymbol($em) {
    if ($em >= 0x10000) {
        // Supplementary plane: split the code point into a UTF-16 surrogate pair
        $first = (($em - 0x10000) >> 10) + 0xD800;
        $second = (($em - 0x10000) % 0x400) + 0xDC00;
        return json_decode('"' . sprintf("\\u%X\\u%X", $first, $second) . '"');
    } else {
        // BMP: JSON requires exactly four hex digits after \u
        return json_decode('"' . sprintf("\\u%04X", $em) . '"');
    }
}
echo codeToSymbol(0x1F63C); outputs 😼
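On newer PHP versions the surrogate arithmetic can be avoided. A minimal sketch, assuming PHP 7.2+ with the mbstring extension (which provides mb_chr); the wrapper name code_point_to_utf8 is made up for illustration:

```php
<?php
// Hypothetical wrapper around mb_chr(): turn a Unicode code point into a
// UTF-8 string, rejecting values that are not valid code points.
function code_point_to_utf8(int $codePoint): string
{
    $char = mb_chr($codePoint, 'UTF-8');
    if ($char === false) {
        throw new InvalidArgumentException(sprintf('Invalid code point U+%04X', $codePoint));
    }
    return $char;
}

echo code_point_to_utf8(0x1F63C), "\n"; // 😼
echo code_point_to_utf8(0x2600), "\n";  // ☀
```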
Example of code parsing string including emoji unicode format
$str = 'Test emoji \U0001F607 \U0001F63C';
echo preg_replace_callback(
'/\\\U([A-F0-9]+)/',
function ($matches) {
return mb_convert_encoding(hex2bin($matches[1]), 'UTF-8', 'UTF-32');
},
$str
);
Output: Test emoji 😇 😼
https://3v4l.org/63dUR

check if the string begin with euro/pound symbol

I'm trying to check whether a string starts with '€' or '£' in PHP.
Below is the code:
$text = "€123";
if($text[0] == "€"){
echo "true";
}
else{
echo "false";
}
//output false
If I only check a single character, it works fine:
$symbol = "€";
if($symbol == "€"){
echo "true";
}
else{
echo "false";
}
// output true
I have also tried printing the string in the browser:
$text = "€123";
echo $text; // displays the euro symbol correctly
echo $text[0]; // shows a question mark
I have tried using substr(), but the same problem occurred.
Characters, such as '€' or '£' are multi-byte characters. There is an excellent article that you can read here. According to the PHP docs, PHP strings are byte arrays. As a result, accessing or modifying a string using array brackets is not multi-byte safe, and should only be done with strings that are in a single-byte encoding such as ISO-8859-1.
Also make sure your file is saved as UTF-8: you can use a text editor such as Notepad++ to convert it.
If I reduce the PHP to this, it works, the key being to use mb_substr:
<?php
header ('Content-type: text/html; charset=utf-8');
$text = "€123";
echo mb_substr($text,0,1,'UTF-8');
?>
Finally, it would be a good idea to add the UTF-8 meta-tag in your head tag:
<meta charset="utf-8">
I suggest this as the easiest solution. Convert the symbols to their HTML entity names using htmlentities():
$text = htmlentities($text, ENT_QUOTES, "UTF-8");
This gives you a string starting with either &pound; or &euro;. That allows you to run a switch statement (or your if statements) to check:
$symbols = explode(";", $text);
switch($symbols[0]) {
case "&pound":
echo "It's Pounds";
break;
case "&euro":
echo "It's Euros";
break;
}
This happens because you’re using a multi-byte character encoding (probably UTF-8) in which both € and £ are recorded using multiple bytes. That means that "€" is a string of three bytes, not just one.
When you use $text[0] you're getting only the first byte of the first character, and so it doesn't match the three bytes of "€". You need to get the first three bytes instead, to check whether one string starts with another.
Here’s the function I use to do that:
function string_starts_with($string, $prefix) {
return substr($string, 0, strlen($prefix)) == $prefix;
}
The question mark appears because the first byte of "€" isn’t enough to encode a whole character: the error is indicated by ‘�’ when available, otherwise ‘?’.
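A byte-wise prefix check like the one above extends naturally to several symbols. This is a sketch only; starts_with_any is a hypothetical name, and it assumes the script file and the input use the same encoding (for example UTF-8). On PHP 8, the built-in str_starts_with() could replace the strncmp() call:

```php
<?php
// Hypothetical helper: true if $text begins with any of the given prefixes.
function starts_with_any(string $text, array $prefixes): bool
{
    foreach ($prefixes as $prefix) {
        // strncmp compares raw bytes, so multi-byte prefixes such as "€"
        // (three bytes in UTF-8) are compared in full.
        if ($prefix !== '' && strncmp($text, $prefix, strlen($prefix)) === 0) {
            return true;
        }
    }
    return false;
}

var_dump(starts_with_any("€123", ["€", "£"])); // bool(true)
var_dump(starts_with_any("123€", ["€", "£"])); // bool(false)
```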

How do I write UTF-8 data to a UTF-16LE file using PHP?

Given a string of UTF-8 data in PHP, how can I convert it and save it to a UTF-16LE file? (This particular file is destined for InDesign, to be placed as a tagged text document.)
Data:
$copy = "<UNICODE-MAC>\n";
$copy .= "<Version:8><FeatureSet:InDesign-Roman><ColorTable:=<Black:COLOR:CMYK:Process:0,0,0,1>>\n";
$copy .= "A bunch of unicode special characters like ñ, é, etc.";
I am using the following code, but to no avail:
file_put_contents("output.txt", pack("S",0xfeff) . $copy);
You can use iconv:
$copy_utf16 = iconv("UTF-8", "UTF-16LE", $copy);
file_put_contents("output.txt", $copy_utf16);
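Note that UTF-16LE output does not include a byte order mark (BOM), because the byte order is already fixed by the encoding name. To have iconv emit a BOM, use "UTF-16" instead, but be aware that the byte order it then picks depends on the iconv implementation.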
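If the consuming application expects a BOM and little-endian data, one option is to prepend the two BOM bytes by hand so that nothing depends on the platform. A sketch under that assumption (whether InDesign requires the BOM is not confirmed here); the function name is hypothetical:

```php
<?php
// Hypothetical helper: convert UTF-8 text to UTF-16LE and prepend the
// little-endian BOM (bytes FF FE) explicitly.
function utf8_to_utf16le_with_bom(string $utf8): string
{
    $utf16 = iconv('UTF-8', 'UTF-16LE', $utf8);
    if ($utf16 === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return "\xFF\xFE" . $utf16;
}

// "ñ" is U+00F1, so after the BOM we expect the bytes F1 00.
echo bin2hex(utf8_to_utf16le_with_bom("\xC3\xB1")), "\n"; // fffef100

// Usage with the data from the question would look like:
// file_put_contents("output.txt", utf8_to_utf16le_with_bom($copy));
```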
Using the following code, I have found a solution:
this function changes the byte order (from http://shiplu.mokadd.im/95/convert-little-endian-to-big-endian-in-php-or-vice-versa/):
function chbo($num) {
$data = dechex($num);
if (strlen($data) <= 2) {
return $num;
}
$u = unpack("H*", strrev(pack("H*", $data)));
$f = hexdec($u[1]);
return $f;
}
Used with a UTF-8 to UTF-16LE conversion, it creates a file that works with InDesign:
file_put_contents("output.txt", pack("S",0xfeff). chbo(iconv("UTF-8","UTF-16LE",$copy)));
Alternatively, you could use mb_convert_encoding() as follows:
$copy_UTF16LE = mb_convert_encoding($copy,'UTF-16LE','UTF-8');

Handling UTF-8 string in PHP 5.3

We are using some third-party PHP library functions and are having difficulty converting UTF-8 strings.
After some experimenting, this is what we have so far:
(1) The following prints the correct Unicode character (it is a single character) in the browser (we use Firefox):
$s = "\345\244\247";
echo $s;
大 <-- (prints the correct Unicode character)
(2) However, the library function returns something like this:
$s2 = "\\345\\244\\247";
echo $s2;
\345\244\247 <-- the output contains the literal escape sequences, so the character isn't displayed correctly
(3) So the question is: is there a PHP function that can convert $s2 to the correct form (like $s)?
Thanks.
The environment is PHP 5.3.
Something like http://ideone.com/Owl2a3 :
function _conv($oct) {
return chr(octdec($oct[1]));
}
$es = "\\345\\244\\247";
$es = preg_replace_callback('#\\\\(\d{3})#', '_conv', $es);
echo $es;
outputs 大
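PHP's built-in stripcslashes() also understands C-style octal escapes, so it may be enough on its own. A minimal sketch, with the caveat that it interprets every C-style escape in the string (such as \n or \x41), which may or may not be wanted for the library's output:

```php
<?php
// Single quotes keep the backslashes literal, as in the library's return value.
$escaped = '\345\244\247';

// stripcslashes() turns each \ooo octal escape into the corresponding byte.
$decoded = stripcslashes($escaped);

echo bin2hex($decoded), "\n"; // e5a4a7
echo $decoded, "\n";          // 大
```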
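The problem is that the backslashes are escaped. To collapse each doubled backslash into a single one, use this:
$s2 = str_replace("\\\\", "\\", $s2);
Note that this only changes the backslashes themselves; the octal sequences still have to be decoded separately.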
