I want to convert hindi / Devanagari text for example "आए थे पर्यटक, खुद ही बह ग" into Unicode escaped characters like "\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917".
I am developing a Hindi website, and I have seen that most sites use escaped Unicode sequences inside their meta tags and schema.org markup.
So I decided to give it a try.
I can see the Hindi (a.k.a. Devanagari) letters with their escaped Unicode sequences at http://www.endmemo.com/unicode/devanagari.php
and I have also seen a tool that does the same conversion: https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php
But I cannot find any way to convert these Devanagari letters into escaped Unicode sequences with PHP.
I have tried a few things, but nothing works, and Google is not much help: all the articles and forum posts talk about decoding Unicode escape sequences to Unicode, and none of them talk about encoding.
header( 'Content-Type: text/html; charset=utf-8' );
function encode2($str) {
$str = mb_convert_encoding($str , 'UTF-32', 'UTF-8');
$t = unpack("N*", $str);
$t = array_map(function($n) { return "&#$n;"; }, $t);
return implode("", $t);
}
$message = "आए थे पर्यटक, खुद ही बह गए";
$message_convert = encode2($message);
echo $message_convert;
echo "fdfdfdfdfdfdfd<br/>";
echo mb_convert_encoding($message, "HTML-ENTITIES", "auto");
I want this "आए थे पर्यटक, खुद ही बह ग" to "\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917"
Please help!
As suggested by @paskl, I tried:
$message = "आए थे पर्यटक, खुद ही बह गए";
$unicode = json_encode($message);
echo $unicode;
And i got ""\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917\u090f""
I hope this helps others who want to convert Devanagari/Hindi letters into escaped Unicode sequences with PHP on their website.
Thanks to @paskl
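The doubled quotes in that output are the JSON string delimiters that json_encode() adds. A minimal sketch of stripping them, assuming the input is valid UTF-8; the helper name json_unicode_escape is made up for illustration. Note that json_encode() also escapes characters such as `"`, `\` and `/`, so this is only a fit when that extra escaping is acceptable.

```php
<?php
// Hypothetical helper: not part of PHP, named here for illustration only.
function json_unicode_escape(string $text): string
{
    // json_encode() returns false when the input is not valid UTF-8.
    $json = json_encode($text);
    if ($json === false) {
        throw new InvalidArgumentException(json_last_error_msg());
    }
    // Drop the leading and trailing double quote added by JSON.
    return substr($json, 1, -1);
}

echo json_unicode_escape("आए थे"), "\n";
```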
Unless you're looking to transmit this data as JSON, I wouldn't really recommend using json_encode(), as it will wrap your output in literal double quotes that you'd need to strip back off. However, there's no easy way to produce Unicode escapes in PHP that is also memory-efficient.
That said, here is the not-easy code:
// PHP < 7.2
// https://github.com/symfony/polyfill-mbstring/blob/master/Mbstring.php#L708-L730
if( ! function_exists("mb_ord") ) {
function mb_ord($s) {
if (1 === \strlen($s)) {
return \ord($s);
}
$code = ($s = unpack('C*', substr($s, 0, 4))) ? $s[1] : 0;
if (0xF0 <= $code) {
return (($code - 0xF0) << 18) + (($s[2] - 0x80) << 12) + (($s[3] - 0x80) << 6) + $s[4] - 0x80;
}
if (0xE0 <= $code) {
return (($code - 0xE0) << 12) + (($s[2] - 0x80) << 6) + $s[3] - 0x80;
}
if (0xC0 <= $code) {
return (($code - 0xC0) << 6) + $s[2] - 0x80;
}
return $code;
}
}
function ord2seqlen($ord) {
if($ord < 128){
return 1;
} else if($ord < 224) {
return 2;
} else if($ord < 240) {
return 3;
} else if($ord < 248) {
return 4;
} else {
throw new \Exception("No support for 5 or 6 byte sequences.");
}
}
function utf8_seq_iter($input) {
for($i=0,$c=strlen($input); $i<$c; ) {
$bytes = ord2seqlen(ord($input[$i]));
yield substr($input, $i, $bytes);
$i += $bytes;
}
}
function escape_codepoint($codepoint, $skip_low=true) {
$ord = mb_ord($codepoint);
if( $skip_low && $ord < 128 ) {
return $codepoint;
} else {
return sprintf("\\u%04x", $ord);
}
}
$input = "आए थे पर्यटक, खुद ही बह गए";
$output = '';
foreach( utf8_seq_iter($input) as $codepoint ) {
$output .= escape_codepoint($codepoint);
}
var_dump($output);
Output:
string(121) "\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917\u090f"
Edit: I've turned this into a small composer package available here:
https://packagist.org/packages/wrossmann/utf8_escape
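One thing to watch with the `\u%04x` format above: code points beyond U+FFFF (emoji, for example) come out as five hex digits, whereas JSON and JavaScript expect a UTF-16 surrogate pair. A minimal sketch of that variant, assuming PHP 7.2+ with mbstring and valid UTF-8 input; escape_utf16 is a hypothetical name, not part of the package mentioned above.

```php
<?php
// Hypothetical helper: escapes non-ASCII characters as \uXXXX,
// using UTF-16 surrogate pairs for code points above U+FFFF.
function escape_utf16(string $text): string
{
    // The /u flag splits on UTF-8 characters; it returns false on invalid UTF-8.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    $out = '';
    foreach ($chars as $char) {
        $cp = mb_ord($char, 'UTF-8');
        if ($cp < 0x80) {
            $out .= $char;                      // leave ASCII untouched
        } elseif ($cp < 0x10000) {
            $out .= sprintf('\\u%04x', $cp);    // BMP: a single escape
        } else {
            $cp -= 0x10000;                     // supplementary plane: lead + trail
            $out .= sprintf('\\u%04x\\u%04x', 0xD800 | ($cp >> 10), 0xDC00 | ($cp & 0x3FF));
        }
    }
    return $out;
}

echo escape_utf16("बह 😌"), "\n";
```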
Related
I've a string like this :
%d8%b7%d8%b1%d8%a7%d8%ad%db%8c-%d8%a7%d9%be%d9%84%db%8c%da%a9%db%8c%d8%b4%d9%86-%d9%81%d8%b1%d9%88%d8%b4%da%af%d8%a7%d9%87%db%8c
The page's meta tag is set to UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
I want to convert this encoded string to a plain, readable UTF-8 string.
I've tested lots of code; this is my latest attempt:
function convertFarsi($str) {
return html_entity_decode(preg_replace('/\\\\u([a-f0-9]{4})/i', '&#x$1;', $str),ENT_QUOTES, 'UTF-8');
}
and it doesn't work.
How can I convert this encoded string to a UTF-8 string?
You can use urldecode() to get the following result:
<?php
$string = '%d8%b7%d8%b1%d8%a7%d8%ad%db%8c-%d8%a7%d9%be%d9%84%db%8c%da%a9%db%8c%d8%b4%d9%86-%d9%81%d8%b1%d9%88%d8%b4%da%af%d8%a7%d9%87%db%8c';
$output = urldecode($string);
echo $output; // طراحی-اپلیکیشن-فروشگاهی
urldecode() doesn't decode the non-standard %uXXXX escapes. I wrote a function that does.
function unicode_urldecode($url)
{
preg_match_all('/%u([[:alnum:]]{4})/', $url, $a);
foreach ($a[1] as $uniord)
{
$dec = hexdec($uniord);
$utf = '';
if ($dec < 128)
{
$utf = chr($dec);
}
else if ($dec < 2048)
{
$utf = chr(192 + (($dec - ($dec % 64)) / 64));
$utf .= chr(128 + ($dec % 64));
}
else
{
$utf = chr(224 + (($dec - ($dec % 4096)) / 4096));
$utf .= chr(128 + ((($dec % 4096) - ($dec % 64)) / 64));
$utf .= chr(128 + ($dec % 64));
}
$url = str_replace('%u'.$uniord, $utf, $url);
}
return urldecode($url);
}
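On PHP 7.2+ with mbstring, the same idea can be sketched more compactly by letting mb_chr() build the UTF-8 bytes. This is an illustrative variant under those assumptions, not the original author's code; decode_percent_u is a hypothetical name.

```php
<?php
// Hypothetical helper: decodes non-standard %uXXXX escapes, then ordinary %XX escapes.
function decode_percent_u(string $url): string
{
    $url = preg_replace_callback(
        '/%u([0-9a-fA-F]{4})/',
        function (array $m): string {
            // mb_chr() returns false for invalid code points; the cast turns that into ''.
            return (string) mb_chr(hexdec($m[1]), 'UTF-8');
        },
        $url
    );
    // The remaining %XX sequences (and '+') are handled by urldecode().
    return urldecode($url);
}

echo decode_percent_u('%u0906%u090f%20a'), "\n";
```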
This seems to do it:
<?php
$s = '%d8%b7%d8%b1%d8%a7%d8%ad%db%8c-%d8%a7%d9%be%d9%84%db%8c%da%a9%db%8c%d8%b4%d9%86-%d9%81%d8%b1%d9%88%d8%b4%da%af%d8%a7%d9%87%db%8c';
$t = urldecode($s);
var_dump($t == 'طراحی-اپلیکیشن-فروشگاهی');
https://php.net/function.urldecode
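If the decoded value is going to be displayed or stored, it may be worth confirming that the bytes really are UTF-8. A small sketch, assuming mbstring is available; decode_utf8_slug is a hypothetical name. It uses rawurldecode() rather than urldecode() so that a literal `+` is not turned into a space.

```php
<?php
// Hypothetical helper: percent-decodes a slug and verifies the result is UTF-8.
function decode_utf8_slug(string $encoded): string
{
    $decoded = rawurldecode($encoded);
    if (!mb_check_encoding($decoded, 'UTF-8')) {
        throw new UnexpectedValueException('Decoded bytes are not valid UTF-8');
    }
    return $decoded;
}

echo decode_utf8_slug('%d8%b7%d8%b1%d8%a7%d8%ad%db%8c'), "\n";
```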
I have an encoded string like ªÙªÑ à¾ç§Íé¹
Below is the function I have used to decode it (UTF-8 to TIS-620).
function utf8_to_tis620($string) {
$str = $string;
$res = "";
for ($i = 0; $i < strlen($str); $i++) {
if (ord($str[$i]) == 224) {
$unicode = ord($str[$i+2]) & 0x3F;
$unicode |= (ord($str[$i+1]) & 0x3F) << 6;
$unicode |= (ord($str[$i]) & 0x0F) << 12;
$res .= chr($unicode-0x0E00+0xA0);
$i += 2;
} else {
$res .= $str[$i];
}
}
return $res;
}
It returns a string like ชูชัย Gงอ้น, which is not correct Thai.
It should return ชูชัย เพ็งอ้น, which is what http://string-functions.com/encodedecode.aspx returns.
However, that site uses Windows-874 decoding.
How can I decode UTF-8 to Windows-874 in PHP?
You can use mb_convert_encoding to change the character encoding:
function utf8_to_tis620($string) {
return mb_convert_encoding($string, 'UTF-8', 'TIS-620');
}
As noted by krasipenkov in their comment,
There is small difference between ISO-8859-11 and TIS-620. ISO-8859-11 is
nearly identical to the national Thai standard TIS-620 (1990). The
sole difference is that ISO/IEC 8859-11 allocates non-breaking space
to code 0xA0, while TIS-620 leaves it undefined. (In practice, this
small distinction is usually ignored.)
Instead of 'TIS-620', you can use 'ISO-8859-11' for the Windows 874 character encoding if needed.
You can use iconv to convert the string to the required character encoding.
$string= iconv('TIS-620','UTF-8//ignore',$string);
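Whether mbstring or iconv recognises the 'TIS-620' label depends on the build, so as a fallback here is a dependency-light sketch of the TIS-620 to UTF-8 direction. It relies on the fact that TIS-620 bytes 0xA1-0xDA and 0xDF-0xFB map linearly onto U+0E01-U+0E3A and U+0E3F-U+0E5B. The function name tis620_to_utf8 is hypothetical, and mb_chr() (PHP 7.2+) is assumed.

```php
<?php
// Hypothetical helper: converts TIS-620 bytes to UTF-8 without iconv.
function tis620_to_utf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $out .= $bytes[$i];                           // ASCII is unchanged
        } elseif (($b >= 0xA1 && $b <= 0xDA) || ($b >= 0xDF && $b <= 0xFB)) {
            $out .= mb_chr(0x0E00 + $b - 0xA0, 'UTF-8');  // Thai block offset
        } else {
            $out .= "\u{FFFD}";                           // undefined in TIS-620
        }
    }
    return $out;
}

// The TIS-620 bytes for the first word of the expected output.
echo tis620_to_utf8("\xAA\xD9\xAA\xD1\xC2"), "\n";
```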
I was using emojione to convert emoticons, but there is a problem.
When someone uploads an emoticon from a mobile device, I receive escaped Unicode like \ud83d\ude0c\ud83d\ude0c\ud83d\ude0c.
emojione doesn't convert this type of code.
Can anybody help me convert this code, or suggest another package?
I got it done at last:
$str = '\ud83d\ude0c\ud83d\ude0c\ud83d\ude0c';
$regex = '/\\\u([dD][89abAB][\da-fA-F]{2})\\\u([dD][c-fC-F][\da-fA-F]{2})
|\\\u([\da-fA-F]{4})/sx';
echo preg_replace_callback($regex, function($matches) {
if (isset($matches[3])) {
$cp = hexdec($matches[3]);
} else {
$lead = hexdec($matches[1]);
$trail = hexdec($matches[2]);
// http://unicode.org/faq/utf_bom.html#utf16-4
$cp = ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00;
}
// https://tools.ietf.org/html/rfc3629#section-3
// Characters between U+D800 and U+DFFF are not allowed in UTF-8
if ($cp > 0xD7FF && 0xE000 > $cp) {
$cp = 0xFFFD;
}
// https://github.com/php/php-src/blob/php-5.6.4/ext/standard/html.c#L471
// php_utf32_utf8(unsigned char *buf, unsigned k)
if ($cp < 0x80) {
return chr($cp);
} else if ($cp < 0xA0) {
return chr(0xC0 | $cp >> 6) . chr(0x80 | $cp & 0x3F);
}
return html_entity_decode('&#' . $cp . ';');
}, $str);
output will be:
😌😌😌
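A shorter route that may be enough in simple cases is to let json_decode() combine the surrogate pairs. This sketch assumes the input contains only \uXXXX escapes plus plain text, with no unescaped double quotes or stray backslashes; decode_u_escapes is a hypothetical name.

```php
<?php
// Hypothetical helper: treats the input as the body of a JSON string literal.
function decode_u_escapes(string $escaped): ?string
{
    // json_decode() returns null if the wrapped text is not a valid JSON string.
    return json_decode('"' . $escaped . '"');
}

echo decode_u_escapes('\ud83d\ude0c\ud83d\ude0c\ud83d\ude0c'), "\n";
```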
I have this code to decode numeric html entities to the UTF8 equivalent character.
I'm trying to convert this character: &#146;
which should output: ’
However, it just disappears (no output). I've checked the page source, and the page has the correct UTF-8 charset headers and meta tags.
Does anyone know what is wrong with the code?
function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8") {
$string = html_entity_decode($string, $quote_style, $charset);
$string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr_utf8("\\1")', $string);
//this is another method, which also doesn't work..
//$string = preg_replace_callback("/(\&#[0-9]+;)/", "entity_decode_callback", $string);
return $string;
}
function chr_utf8_callback($matches) {
return chr_utf8(hexdec($matches[1]));
}
function chr_utf8($num) {
if ($num < 128) return chr($num);
if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
return '';
}
function entity_decode_callback($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}
echo '=' . entity_decode('&#146;');
html_entity_decode already does what you're looking for:
$string = '&#146;';
echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');
It will return the character:
’ binary hex: c292
Which is PRIVATE USE TWO (U+0092). As it's private use, your PHP configuration/version/compile might not return it at all.
Also there are some more quirks:
But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
See: &#146; is getting converted as “\u0092” by nokogiri in ruby on rails
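To get the browser-style result in PHP, one option is to rewrite numeric references 128 to 159 through Windows-1252 before any other decoding. A minimal sketch, assuming mbstring is available; decode_cp1252_refs is a hypothetical name, only decimal references are handled, and the few bytes that cp1252 leaves undefined are converted however mbstring chooses.

```php
<?php
// Hypothetical helper: maps &#128;..&#159; to the cp1252 characters browsers show.
function decode_cp1252_refs(string $html): string
{
    return preg_replace_callback(
        '/&#(\d+);/',
        function (array $m): string {
            $n = (int) $m[1];
            if ($n >= 128 && $n <= 159) {
                // Interpret the number as a Windows-1252 byte, not a code point.
                return mb_convert_encoding(chr($n), 'UTF-8', 'Windows-1252');
            }
            return $m[0]; // leave every other reference for html_entity_decode()
        },
        $html
    );
}

echo decode_cp1252_refs('It&#146;s'), "\n";
```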
I want to get the UCS-2 code points for a given UTF-8 string. For example the word "hello" should become something like "0068 0065 006C 006C 006F". Please note that the characters could be from any language including complex scripts like the east asian languages.
So, the problem comes down to "convert a given character to its UCS-2 code point"
But how? Please, any kind of help will be very very much appreciated since I am in a great hurry.
Transcription of questioner's response posted as an answer
Thanks for your reply, but it needs to be done in PHP v 4 or 5 but not 6.
The string will be a user input, from a form field.
I want to implement a PHP version of utf8to16 or utf8decode like
function get_ucs2_codepoint($char)
{
// calculation of ucs2 codepoint value and assign it to $hex_codepoint
return $hex_codepoint;
}
Can you help me with PHP or can it be done with PHP with version mentioned above?
Use an existing utility such as iconv, or whatever libraries come with the language you're using.
If you insist on rolling your own solution, read up on the UTF-8 format. Basically, each code point is stored as 1-4 bytes, depending on the value of the code point. The ranges are as follows:
U+0000 — U+007F: 1 byte: 0xxxxxxx
U+0080 — U+07FF: 2 bytes: 110xxxxx 10xxxxxx
U+0800 — U+FFFF: 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
U+10000 — U+10FFFF: 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Where each x is a data bit. Thus, you can tell how many bytes compose each code point by looking at the first byte: if it begins with a 0, it's a 1-byte character. If it begins with 110, it's a 2-byte character. If it begins with 1110, it's a 3-byte character. If it begins with 11110, it's a 4-byte character. If it begins with 10, it's a non-initial byte of a multibyte character. If it begins with 11111, it's an invalid character.
Once you figure out how many bytes are in the character, it's just a matter of bit twiddling. Also note that UCS-2 cannot represent characters above U+FFFF.
Since you didn't specify a language, here's some sample C code (error checking omitted):
wchar_t utf8_char_to_ucs2(const unsigned char *utf8)
{
if(!(utf8[0] & 0x80)) // 0xxxxxxx
return (wchar_t)utf8[0];
else if((utf8[0] & 0xE0) == 0xC0) // 110xxxxx
return (wchar_t)(((utf8[0] & 0x1F) << 6) | (utf8[1] & 0x3F));
else if((utf8[0] & 0xF0) == 0xE0) // 1110xxxx
return (wchar_t)(((utf8[0] & 0x0F) << 12) | ((utf8[1] & 0x3F) << 6) | (utf8[2] & 0x3F));
else
return ERROR; // uh-oh, UCS-2 can't handle code points this high
}
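For the PHP case the questioner asked about, the "existing utility" route can be sketched with mbstring rather than hand-written bit twiddling. This assumes the mbstring extension and valid UTF-8 input; ucs2_hex is a hypothetical name, and characters above U+FFFF cannot be represented in UCS-2, so they will not survive the conversion.

```php
<?php
// Hypothetical helper: returns the UCS-2 code units of a UTF-8 string as hex words.
function ucs2_hex(string $utf8): string
{
    // Big-endian UCS-2 gives two bytes per character, most significant first.
    $ucs2 = mb_convert_encoding($utf8, 'UCS-2BE', 'UTF-8');
    // Four hex digits per code unit, separated by spaces.
    return strtoupper(implode(' ', str_split(bin2hex($ucs2), 4)));
}

echo ucs2_hex('hello'), "\n"; // 0068 0065 006C 006C 006F
```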
Scott Reynen wrote a function to convert UTF-8 into Unicode. I found it looking at the PHP documentation.
function utf8_to_unicode( $str ) {
$unicode = array();
$values = array();
$lookingFor = 1;
for ($i = 0; $i < strlen( $str ); $i++ ) {
$thisValue = ord( $str[ $i ] );
if ( $thisValue < ord('A') ) {
// exclude 0-9
if ($thisValue >= ord('0') && $thisValue <= ord('9')) {
// number
$unicode[] = chr($thisValue);
}
else {
$unicode[] = '%'.dechex($thisValue);
}
} else {
if ( $thisValue < 128)
$unicode[] = $str[ $i ];
else {
if ( count( $values ) == 0 ) $lookingFor = ( $thisValue < 224 ) ? 2 : 3;
$values[] = $thisValue;
if ( count( $values ) == $lookingFor ) {
$number = ( $lookingFor == 3 ) ?
( ( $values[0] % 16 ) * 4096 ) + ( ( $values[1] % 64 ) * 64 ) + ( $values[2] % 64 ):
( ( $values[0] % 32 ) * 64 ) + ( $values[1] % 64 );
$number = dechex($number);
$unicode[] = (strlen($number)==3)?"%u0".$number:"%u".$number;
$values = array();
$lookingFor = 1;
} // if
} // if
}
} // for
return implode("",$unicode);
} // utf8_to_unicode
PHP code (which assumes valid utf-8, no check for non-valid utf-8):
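function ord_utf8($c) {
    $b0 = ord($c[0]);
    if ( $b0 < 0x80 ) { // single-byte (ASCII) character
        return $b0;
    }
    $b1 = ord($c[1]);
    if ( $b0 < 0xE0 ) { // two-byte sequence: 110xxxxx 10xxxxxx
        return (($b0 & 0x1F) << 6) + ($b1 & 0x3F);
    }
    // three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx (four-byte sequences are not handled)
    return (($b0 & 0x0F) << 12) + (($b1 & 0x3F) << 6) + (ord($c[2]) & 0x3F);
}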
I'm amused because I just gave this problem to students on a final exam. Here's a sketch of UTF-8:
hex          binary                  UTF-8 binary
0000-007F    00000000 0abcdefg  =>   0abcdefg
0080-07FF    00000abc defghijk  =>   110abcde 10fghijk
0800-FFFF    abcdefgh ijklmnop  =>   1110abcd 10efghij 10klmnop
And here's some C99 code:
static void check(char c) {
    // continuation bytes must look like 10xxxxxx
    if ((c & 0xc0) != 0x80) RAISE(Bad_UTF8);
}
uint16_t Utf8_decode(char **p) { // return code point and advance *p
    char *s = *p;
    if ((s[0] & 0x80) == 0) {           // 0xxxxxxx: one byte
        (*p)++;
        return s[0];
    } else if ((s[0] & 0x40) == 0) {    // 10xxxxxx: stray continuation byte
        RAISE (Bad_UTF8);
        return ~0; // prevent compiler warning
    } else if ((s[0] & 0x20) == 0) {    // 110xxxxx: two bytes
        check(s[1]);
        (*p) += 2;
        return ((s[0] & 0x1f) << 6)
             + ((s[1] & 0x3f));
    } else {                            // 1110xxxx: three bytes
        if ((s[0] & 0xf0) != 0xe0) RAISE (Bad_UTF8);
        check(s[1]); check(s[2]);
        (*p) += 3;
        return ((s[0] & 0x0f) << 12)
             + ((s[1] & 0x3f) << 6)
             + ((s[2] & 0x3f));
    }
}
Use mb_ord() in php >= 7.2.
Or this function:
function ord_utf8($c) {
$len = strlen($c);
$code = ord($c);
if($len > 1) {
$code &= 0x7F >> $len;
for($i = 1; $i < $len; $i++) {
$code <<= 6;
$code += ord($c[$i]) & 0x3F;
}
}
return $code;
}
$c is a single UTF-8 character.
If you need to convert a string to an array of characters, you can use this:
$string = 'abcde';
$string = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
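Putting the two pieces together, a string can be split into characters and each one mapped to its code point. A minimal sketch, assuming PHP 7.2+ with mbstring (for mb_ord()) and valid UTF-8 input; codepoints_hex is a hypothetical name.

```php
<?php
// Hypothetical helper: lists the code points of a UTF-8 string as hex words.
function codepoints_hex(string $text): array
{
    // The /u flag makes the empty pattern split between UTF-8 characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return array_map(function (string $char): string {
        return sprintf('%04X', mb_ord($char, 'UTF-8'));
    }, $chars);
}

echo implode(' ', codepoints_hex('hello')), "\n"; // 0068 0065 006C 006C 006F
```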