I have this code to decode numeric html entities to the UTF8 equivalent character.
I'm trying to convert this character:
which should output:
However, it just disappears (no output). (i've checked the source code of the page, the page has the correct utf8 character set headers/meta tags).
Does anyone know what is wrong with the code?
function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8") {
$string = html_entity_decode($string, $quote_style, $charset);
$string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr_utf8("\\1")', $string);
//this is another method, which also doesn't work..
//$string = preg_replace_callback("/(\&#[0-9]+;)/", "entity_decode_callback", $string);
return $string;
}
function chr_utf8_callback($matches) {
return chr_utf8(hexdec($matches[1]));
}
function chr_utf8($num) {
if ($num < 128) return chr($num);
if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
return '';
}
function entity_decode_callback($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}
echo '=' . entity_decode('');
html_entity_decode already does what you're looking for:
$string = '';
echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');
It will return the character:
’ binary hex: c292
Which is PRIVATE USE TWO (U+0092). As it's private use, your PHP configuration/version/compile might not return it at all.
Also there are some more quirks:
But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range to are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
See: is getting converted as “\u0092” by nokogiri in ruby on rails
Related
I want to convert hindi / Devanagari text for example "आए थे पर्यटक, खुद ही बह ग" into Unicode escaped characters like "\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917".
I am developing a hindi website and i have seen most of sites are using Escaped Unicode sequence inside their meta tags and schema.org.
So i decided to give it a try.
i can see Hindi AKA Devanagari letters with their Escaped Unicode sequence at http://www.endmemo.com/unicode/devanagari.php
and i have also seen a tool which works the same https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php
but i cannot find any way to convert these Devanagari letters into Escaped Unicode sequence via php.
I have tried few things but nothing is working and i am not getting much help from google because all articles / forums are talking to decoding unicode escape sequence to unicode but none of them is taking about encoding..
header( 'Content-Type: text/html; charset=utf-8' );
function encode2($str) {
$str = mb_convert_encoding($str , 'UTF-32', 'UTF-8');
$t = unpack("N*", $str);
$t = array_map(function($n) { return "&#$n;"; }, $t);
return implode("", $t);
}
$message = "आए थे पर्यटक, खुद ही बह गए";
$message_convert = encode2($message);
echo $message_convert;
echo "fdfdfdfdfdfdfd<br/>";
echo mb_convert_encoding($message, "HTML-ENTITIES", "auto");
I want this "आए थे पर्यटक, खुद ही बह ग" to "\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917"
Please help!
as suggest by #paskl i tried:
$message = "आए थे पर्यटक, खुद ही बह गए";
$unicode = json_encode($message)
echo $unicode;
And i got ""\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917\u090f""
I hope it will help others who want to convert devanagari/hindi letters into Escaped Unicode sequence with php on their website.
Thanks to #paskl
Unless you're looking to transmit this data as JSON I wouldn't really recommend using json_encode() as it will wrap your output in literal double quotes that you'd need to strip back off. However there's not an easy way to encode unicode escapes in PHP in a way that is memory-efficient.
That said, here is the not-easy code:
// PHP < 7.2
// https://github.com/symfony/polyfill-mbstring/blob/master/Mbstring.php#L708-L730
if( ! function_exists("mb_ord") ) {
function mb_ord($s) {
if (1 === \strlen($s)) {
return \ord($s);
}
$code = ($s = unpack('C*', substr($s, 0, 4))) ? $s[1] : 0;
if (0xF0 <= $code) {
return (($code - 0xF0) << 18) + (($s[2] - 0x80) << 12) + (($s[3] - 0x80) << 6) + $s[4] - 0x80;
}
if (0xE0 <= $code) {
return (($code - 0xE0) << 12) + (($s[2] - 0x80) << 6) + $s[3] - 0x80;
}
if (0xC0 <= $code) {
return (($code - 0xC0) << 6) + $s[2] - 0x80;
}
return $code;
}
}
function ord2seqlen($ord) {
if($ord < 128){
return 1;
} else if($ord < 224) {
return 2;
} else if($ord < 240) {
return 3;
} else if($ord < 248) {
return 4;
} else {
throw new \Exception("No support for 5 or 6 byte sequences.");
}
}
function utf8_seq_iter($input) {
for($i=0,$c=strlen($input); $i<$c; ) {
$bytes = ord2seqlen(ord($input[$i]));
yield substr($input, $i, $bytes);
$i += $bytes;
}
}
function escape_codepoint($codepoint, $skip_low=true) {
$ord = mb_ord($codepoint);
if( $skip_low && $ord < 128 ) {
return $codepoint;
} else {
return sprintf("\\u%04x", $ord);
}
}
$input = "आए थे पर्यटक, खुद ही बह गए";
$output = '';
foreach( utf8_seq_iter($input) as $codepoint ) {
$output .= escape_codepoint($codepoint);
}
var_dump($output);
Output:
string(121) "\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917\u090f"
Edit: I've turned this into a small composer package available here:
https://packagist.org/packages/wrossmann/utf8_escape
I have encoded string like ªÙªÑ à¾ç§Íé¹
Please check below function that i have used for decode(utf-8 to tis620) it.
function utf8_to_tis620($string) {
$str = $string;
$res = "";
for ($i = 0; $i < strlen($str); $i++) {
if (ord($str[$i]) == 224) {
$unicode = ord($str[$i+2]) & 0x3F;
$unicode |= (ord($str[$i+1]) & 0x3F) << 6;
$unicode |= (ord($str[$i]) & 0x0F) << 12;
$res .= chr($unicode-0x0E00+0xA0);
$i += 2;
} else {
$res .= $str[$i];
}
}
return $res;
}
So it will return string like ชูชัย Gงอ้น but it isn't correct in THAI language.
Actually it should return ชูชัย เพ็งอ้น that is returned from http://string-functions.com/encodedecode.aspx.
But there is used windows 874 decoding.
Please let me know how can i decode utf-8 to windows 874 in php?
You can use mb_convert_encoding to change the character encoding:
function utf8_to_tis620($string) {
return mb_convert_encoding($string, 'UTF-8', 'TIS-620');
}
As noted by krasipenkov in their comment,
There is small difference between ISO-8859-11 and TIS-620. ISO-8859-11 is
nearly identical to the national Thai standard TIS-620 (1990). The
sole difference is that ISO/IEC 8859-11 allocates non-breaking space
to code 0xA0, while TIS-620 leaves it undefined. (In practice, this
small distinction is usually ignored.)
Instead of 'TIS-620', you can use 'ISO-8859-11' for the Windows 874 character encoding if needed.
You can use iconv for convert string to request character encoding.
$string= iconv('TIS-620','UTF-8//ignore',$string);
Is it possible to input a character and get the unicode value back? for example, i can put ⽇ in html to output "⽇", is it possible to give that character as an argument to a function and get the number as an output without building a unicode table?
$val = someFunction("⽇");//returns 12103
or the reverse?
$val2 = someOtherFunction(12103);//returns "⽇"
I would like to be able to output the actual characters to the page not the codes, and I would also like to be able to get the code from the character if possible.
The closest I got to what I want is php.net/manual/en/function.mb-decode-numericentity.php but I cant get it working, is this the code I need or am I on the wrong track?
function _uniord($c) {
if (ord($c[0]) >=0 && ord($c[0]) <= 127)
return ord($c[0]);
if (ord($c[0]) >= 192 && ord($c[0]) <= 223)
return (ord($c[0])-192)*64 + (ord($c[1])-128);
if (ord($c[0]) >= 224 && ord($c[0]) <= 239)
return (ord($c[0])-224)*4096 + (ord($c[1])-128)*64 + (ord($c[2])-128);
if (ord($c[0]) >= 240 && ord($c[0]) <= 247)
return (ord($c[0])-240)*262144 + (ord($c[1])-128)*4096 + (ord($c[2])-128)*64 + (ord($c[3])-128);
if (ord($c[0]) >= 248 && ord($c[0]) <= 251)
return (ord($c[0])-248)*16777216 + (ord($c[1])-128)*262144 + (ord($c[2])-128)*4096 + (ord($c[3])-128)*64 + (ord($c[4])-128);
if (ord($c[0]) >= 252 && ord($c[0]) <= 253)
return (ord($c[0])-252)*1073741824 + (ord($c[1])-128)*16777216 + (ord($c[2])-128)*262144 + (ord($c[3])-128)*4096 + (ord($c[4])-128)*64 + (ord($c[5])-128);
if (ord($c[0]) >= 254 && ord($c[0]) <= 255) // error
return FALSE;
return 0;
} // function _uniord()
and
function _unichr($o) {
if (function_exists('mb_convert_encoding')) {
return mb_convert_encoding('&#'.intval($o).';', 'UTF-8', 'HTML-ENTITIES');
} else {
return chr(intval($o));
}
} // function _unichr()
Here's a more compact implementation of unichr/uniord based on pack:
// code point to UTF-8 string
function unichr($i) {
return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}
// UTF-8 string to code point
function uniord($s) {
return unpack('V', iconv('UTF-8', 'UCS-4LE', $s))[1];
}
If you're using PHP7.2 (or later), you don't need to define a new function. There are two functions for your purposes from Multibyte String extension!
To get code point of a character (i.e. Unicode value), use mb_ord(); and to get a specific character from that value, use mb_chr().
E.g.:
mb_chr(12103, "utf8"); // ⽇
mb_ord("⽇", "utf8"); // 12103
This also works, (for someone who understands bitshifting this might be more readable than Mark Bakers answer):
public function ordinal($str){
$charString = mb_substr($str, 0, 1, 'utf-8');
$size = strlen($charString);
$ordinal = ord($charString[0]) & (0xFF >> $size);
//Merge other characters into the value
for($i = 1; $i < $size; $i++){
$ordinal = $ordinal << 6 | (ord($charString[$i]) & 127);
}
return $ordinal;
}
You can use the following functions
For encoding
string utf8_encode ( string $data )
http://php.net/manual/en/function.utf8-encode.php
For decoding
string utf8_decode ( string $data )
http://php.net/manual/en/function.utf8-decode.php
Also check
http://php.net/manual/en/function.htmlspecialchars.php
<?php
echo htmlspecialchars_decode("⽇");//will print ⽇
?>
I want to get the UCS-2 code points for a given UTF-8 string. For example the word "hello" should become something like "0068 0065 006C 006C 006F". Please note that the characters could be from any language including complex scripts like the east asian languages.
So, the problem comes down to "convert a given character to its UCS-2 code point"
But how? Please, any kind of help will be very very much appreciated since I am in a great hurry.
Transcription of questioner's response posted as an answer
Thanks for your reply, but it needs to be done in PHP v 4 or 5 but not 6.
The string will be a user input, from a form field.
I want to implement a PHP version of utf8to16 or utf8decode like
function get_ucs2_codepoint($char)
{
// calculation of ucs2 codepoint value and assign it to $hex_codepoint
return $hex_codepoint;
}
Can you help me with PHP or can it be done with PHP with version mentioned above?
Use an existing utility such as iconv, or whatever libraries come with the language you're using.
If you insist on rolling your own solution, read up on the UTF-8 format. Basically, each code point is stored as 1-4 bytes, depending on the value of the code point. The ranges are as follows:
U+0000 — U+007F: 1 byte: 0xxxxxxx
U+0080 — U+07FF: 2 bytes: 110xxxxx 10xxxxxx
U+0800 — U+FFFF: 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
U+10000 — U+10FFFF: 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Where each x is a data bit. Thus, you can tell how many bytes compose each code point by looking at the first byte: if it begins with a 0, it's a 1-byte character. If it begins with 110, it's a 2-byte character. If it begins with 1110, it's a 3-byte character. If it begins with 11110, it's a 4-byte character. If it begins with 10, it's a non-initial byte of a multibyte character. If it begins with 11111, it's an invalid character.
Once you figure out how many bytes are in the character, it's just a matter if bit twiddling. Also note that UCS-2 cannot represent characters above U+FFFF.
Since you didn't specify a language, here's some sample C code (error checking omitted):
wchar_t utf8_char_to_ucs2(const unsigned char *utf8)
{
if(!(utf8[0] & 0x80)) // 0xxxxxxx
return (wchar_t)utf8[0];
else if((utf8[0] & 0xE0) == 0xC0) // 110xxxxx
return (wchar_t)(((utf8[0] & 0x1F) << 6) | (utf8[1] & 0x3F));
else if((utf8[0] & 0xF0) == 0xE0) // 1110xxxx
return (wchar_t)(((utf8[0] & 0x0F) << 12) | ((utf8[1] & 0x3F) << 6) | (utf8[2] & 0x3F));
else
return ERROR; // uh-oh, UCS-2 can't handle code points this high
}
Scott Reynen wrote a function to convert UTF-8 into Unicode. I found it looking at the PHP documentation.
function utf8_to_unicode( $str ) {
$unicode = array();
$values = array();
$lookingFor = 1;
for ($i = 0; $i < strlen( $str ); $i++ ) {
$thisValue = ord( $str[ $i ] );
if ( $thisValue < ord('A') ) {
// exclude 0-9
if ($thisValue >= ord('0') && $thisValue <= ord('9')) {
// number
$unicode[] = chr($thisValue);
}
else {
$unicode[] = '%'.dechex($thisValue);
}
} else {
if ( $thisValue < 128)
$unicode[] = $str[ $i ];
else {
if ( count( $values ) == 0 ) $lookingFor = ( $thisValue < 224 ) ? 2 : 3;
$values[] = $thisValue;
if ( count( $values ) == $lookingFor ) {
$number = ( $lookingFor == 3 ) ?
( ( $values[0] % 16 ) * 4096 ) + ( ( $values[1] % 64 ) * 64 ) + ( $values[2] % 64 ):
( ( $values[0] % 32 ) * 64 ) + ( $values[1] % 64 );
$number = dechex($number);
$unicode[] = (strlen($number)==3)?"%u0".$number:"%u".$number;
$values = array();
$lookingFor = 1;
} // if
} // if
}
} // for
return implode("",$unicode);
} // utf8_to_unicode
PHP code (which assumes valid utf-8, no check for non-valid utf-8):
function ord_utf8($c) {
$b0 = ord($c[0]);
if ( $b0 < 0x10 ) {
return $b0;
}
$b1 = ord($c[1]);
if ( $b0 < 0xE0 ) {
return (($b0 & 0x1F) << 6) + ($b1 & 0x3F);
}
return (($b0 & 0x0F) << 12) + (($b1 & 0x3F) << 6) + (ord($c[2]) & 0x3F);
}
I'm amused because I just gave this problem to students on a final exam. Here's a sketch of UTF-8:
hex binary UTF-8 binary
0000-007F 00000000 0abcdefg => 0abcdefg
0080-07FF 00000abc defghijk => 110abcde 10fghijk
0800-FFFF abcdefgh ijklmnop => 1110abcd 10efghij 10klmnop
And here's some C99 code:
static void check(char c) {
if ((c & 0xc0) != 0xc0) RAISE(Bad_UTF8);
}
uint16_t Utf8_decode(char **p) { // return code point and advance *p
char *s = *p;
if ((s[0] & 0x80) == 0) {
(*p)++;
return s[0];
} else if ((s[0] & 0x40) == 0) {
RAISE (Bad_UTF8);
return ~0; // prevent compiler warning
} else if ((s[0] & 0x20) == 0) {
if ((s[0] & 0xf0) != 0xe0) RAISE (Bad_UTF8);
check(s[1]); check(s[2]);
(*p) += 3;
return ((s[0] & 0x0f) << 12)
+ ((s[1] & 0x3f) << 6)
+ ((s[2] & 0x3f));
} else {
check(s[1]);
(*p) += 2;
return ((s[0] & 0x1f) << 6)
+ ((s[1] & 0x3f));
}
}
Use mb_ord() in php >= 7.2.
Or this function:
function ord_utf8($c) {
$len = strlen($c);
$code = ord($c);
if($len > 1) {
$code &= 0x7F >> $len;
for($i = 1; $i < $len; $i++) {
$code <<= 6;
$code += ord($c[$i]) & 0x3F;
}
}
return $code;
}
$c is a character.
If you need convert string to character array.You can use this.
$string = 'abcde';
$string = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
i use curl to fetch a page's source, it works great but not the inside html, it's returned with something like this :
the html look like this : t i & # 7 8 7 1; t, #&&#%$^%, etc....
yes, i really want to convert it to normal text, i try php decode functions but there's no luck at all
Thank you
##################3
edit :
thank you sirs,i tried
$fixed_result = html_entity_decode($result, ENT_COMPAT, "UTF-8");
and it works like a charm, but there are some character become "�", as this :
" S�C KH�CH"
i have no idea what is this
thank you sirs
That appears to be HTML entity encoded, you should be able to revert to the normal characters using html_entity_decode with the appropriate character set specified. e.g.:
$fixed_result = html_entity_decode($result, ENT_COMPAT, "UTF-8");
As per ajreal's comment check the manual for html_entity_decode and try the fallback solution:
// For users prior to PHP 4.3.0 you may do this:
function unhtmlentities($string)
{
// replace numeric entities
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
// replace literal entities
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl = array_flip($trans_tbl);
return strtr($string, $trans_tbl);
}
$c = unhtmlentities($a);
(edit) For multi-byte support try the following version, taken from the PHP comments page,
function unhtmlentities($string)
{
// replace numeric entities
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'code2utf(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'code2utf("\\1")', $string);
// replace literal entities
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl = array_flip($trans_tbl);
return strtr($string, $trans_tbl);
}
// Returns the utf string corresponding to the unicode value (from php.net, courtesy - romans#void.lv)
function code2utf($num)
{
if ($num < 128) return chr($num);
if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
return '';
}
$c = unhtmlentities($a);