Convert UTF-8 to Window 874 using PHP - php

I have encoded string like ªÙªÑ à¾ç§Íé¹
Please check below function that i have used for decode(utf-8 to tis620) it.
function utf8_to_tis620($string) {
$str = $string;
$res = "";
for ($i = 0; $i < strlen($str); $i++) {
if (ord($str[$i]) == 224) {
$unicode = ord($str[$i+2]) & 0x3F;
$unicode |= (ord($str[$i+1]) & 0x3F) << 6;
$unicode |= (ord($str[$i]) & 0x0F) << 12;
$res .= chr($unicode-0x0E00+0xA0);
$i += 2;
} else {
$res .= $str[$i];
}
}
return $res;
}
So it will return string like ชูชัย Gงอ้น but it isn't correct in THAI language.
Actually it should return ชูชัย เพ็งอ้น that is returned from http://string-functions.com/encodedecode.aspx.
But there is used windows 874 decoding.
Please let me know how can i decode utf-8 to windows 874 in php?

You can use mb_convert_encoding to change the character encoding:
function utf8_to_tis620($string) {
return mb_convert_encoding($string, 'UTF-8', 'TIS-620');
}
As noted by krasipenkov in their comment,
There is small difference between ISO-8859-11 and TIS-620. ISO-8859-11 is
nearly identical to the national Thai standard TIS-620 (1990). The
sole difference is that ISO/IEC 8859-11 allocates non-breaking space
to code 0xA0, while TIS-620 leaves it undefined. (In practice, this
small distinction is usually ignored.)
Instead of 'TIS-620', you can use 'ISO-8859-11' for the Windows 874 character encoding if needed.

You can use iconv for convert string to request character encoding.
$string= iconv('TIS-620','UTF-8//ignore',$string);

Related

HDF5: How to decode UTF8-encoded string from h5dump output?

I'm writing an attribute to an HDF5 file using UTF-8 encoding. As an example, I've written "äöüß" to the attribute "notes" in the file.
I'm now trying to parse the output of h5ls (or h5dump) to extract this data back. Either tool gives me an output like this:
ATTRIBUTE "notes" {
DATATYPE H5T_STRING {
STRSIZE 8;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
DATA {
(0): "\37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637"
}
}
I'm aware that, e.g., \37777777703\37777777644 somehow encodes ä as 0xC3 0xA4, however, I have a really hard time coming up with how this encoding works.
What's the magic formula behind this and how can I properly decode it back into äöüß?
The strings are encoded using base 8. I've decoded them in the PHP backend using:
$line = "This is the text including some UTF-8 bytes \37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637";
// extract UTF-8 Bytes
$octbytes;
preg_match_all("/\\\\37777777(\\d{3})/", $line, $octbytes);
// parse extracted Bytes
for ($m = 0; $m < count($octbytes[1]); ) {
$B = octdec($octbytes[1][$m]);
// UTF-8 may span over 2 to 4 Bytes
$numBytes;
if (($B & 0xF8) == 0xF0) { $numBytes = 4; }
else if (($B & 0xF0) == 0xE0) { $numBytes = 3; }
else if (($B & 0xE0) == 0xC0) { $numBytes = 2; }
else { $numBytes = 1; }
$hxstr = "";
$replaceStr = "";
for ($j = 0; $j < $numBytes; $j++) {
$match = $octbytes[1][$m+$j];
$dec = octdec($match) & 255;
$hx = strtoupper(dechex($dec));
$hxstr = $hxstr . $hx;
$replaceStr = $replaceStr . "\\37777777" . $match;
}
// pack extracted bytes into one hex string
$utfChar = pack("H*", $hxstr); // < this will be interpreted correctly
// replace Bytes in the input with the parsed chars
$parsedData = str_replace($replaceStr,$utfChar,$line);
// go to next byte
$m+=$numBytes;
}
echo "The parsed line: $line";

Convert Hindi text to Escaped Unicode character with php

I want to convert hindi / Devanagari text for example "आए थे पर्यटक, खुद ही बह ग" into Unicode escaped characters like "\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917".
I am developing a hindi website and i have seen most of sites are using Escaped Unicode sequence inside their meta tags and schema.org.
So i decided to give it a try.
i can see Hindi AKA Devanagari letters with their Escaped Unicode sequence at http://www.endmemo.com/unicode/devanagari.php
and i have also seen a tool which works the same https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php
but i cannot find any way to convert these Devanagari letters into Escaped Unicode sequence via php.
I have tried few things but nothing is working and i am not getting much help from google because all articles / forums are talking to decoding unicode escape sequence to unicode but none of them is taking about encoding..
header( 'Content-Type: text/html; charset=utf-8' );
function encode2($str) {
$str = mb_convert_encoding($str , 'UTF-32', 'UTF-8');
$t = unpack("N*", $str);
$t = array_map(function($n) { return "&#$n;"; }, $t);
return implode("", $t);
}
$message = "आए थे पर्यटक, खुद ही बह गए";
$message_convert = encode2($message);
echo $message_convert;
echo "fdfdfdfdfdfdfd<br/>";
echo mb_convert_encoding($message, "HTML-ENTITIES", "auto");
I want this "आए थे पर्यटक, खुद ही बह ग" to "\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917"
Please help!
as suggest by #paskl i tried:
$message = "आए थे पर्यटक, खुद ही बह गए";
$unicode = json_encode($message)
echo $unicode;
And i got ""\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917\u090f""
I hope it will help others who want to convert devanagari/hindi letters into Escaped Unicode sequence with php on their website.
Thanks to #paskl
Unless you're looking to transmit this data as JSON I wouldn't really recommend using json_encode() as it will wrap your output in literal double quotes that you'd need to strip back off. However there's not an easy way to encode unicode escapes in PHP in a way that is memory-efficient.
That said, here is the not-easy code:
// PHP < 7.2
// https://github.com/symfony/polyfill-mbstring/blob/master/Mbstring.php#L708-L730
if( ! function_exists("mb_ord") ) {
function mb_ord($s) {
if (1 === \strlen($s)) {
return \ord($s);
}
$code = ($s = unpack('C*', substr($s, 0, 4))) ? $s[1] : 0;
if (0xF0 <= $code) {
return (($code - 0xF0) << 18) + (($s[2] - 0x80) << 12) + (($s[3] - 0x80) << 6) + $s[4] - 0x80;
}
if (0xE0 <= $code) {
return (($code - 0xE0) << 12) + (($s[2] - 0x80) << 6) + $s[3] - 0x80;
}
if (0xC0 <= $code) {
return (($code - 0xC0) << 6) + $s[2] - 0x80;
}
return $code;
}
}
function ord2seqlen($ord) {
if($ord < 128){
return 1;
} else if($ord < 224) {
return 2;
} else if($ord < 240) {
return 3;
} else if($ord < 248) {
return 4;
} else {
throw new \Exception("No support for 5 or 6 byte sequences.");
}
}
function utf8_seq_iter($input) {
for($i=0,$c=strlen($input); $i<$c; ) {
$bytes = ord2seqlen(ord($input[$i]));
yield substr($input, $i, $bytes);
$i += $bytes;
}
}
function escape_codepoint($codepoint, $skip_low=true) {
$ord = mb_ord($codepoint);
if( $skip_low && $ord < 128 ) {
return $codepoint;
} else {
return sprintf("\\u%04x", $ord);
}
}
$input = "आए थे पर्यटक, खुद ही बह गए";
$output = '';
foreach( utf8_seq_iter($input) as $codepoint ) {
$output .= escape_codepoint($codepoint);
}
var_dump($output);
Output:
string(121) "\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917\u090f"
Edit: I've turned this into a small composer package available here:
https://packagist.org/packages/wrossmann/utf8_escape

Decoding numeric html entities via PHP

I have this code to decode numeric html entities to the UTF8 equivalent character.
I'm trying to convert this character:
’
which should output:
’
However, it just disappears (no output). (i've checked the source code of the page, the page has the correct utf8 character set headers/meta tags).
Does anyone know what is wrong with the code?
function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8") {
$string = html_entity_decode($string, $quote_style, $charset);
$string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr_utf8("\\1")', $string);
//this is another method, which also doesn't work..
//$string = preg_replace_callback("/(\&#[0-9]+;)/", "entity_decode_callback", $string);
return $string;
}
function chr_utf8_callback($matches) {
return chr_utf8(hexdec($matches[1]));
}
function chr_utf8($num) {
if ($num < 128) return chr($num);
if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
return '';
}
function entity_decode_callback($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}
echo '=' . entity_decode('’');
html_entity_decode already does what you're looking for:
$string = '’';
echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');
It will return the character:
’ binary hex: c292
Which is PRIVATE USE TWO (U+0092). As it's private use, your PHP configuration/version/compile might not return it at all.
Also there are some more quirks:
But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range € to Ÿ are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
See: ’ is getting converted as “\u0092” by nokogiri in ruby on rails

drop 0 from md5() PHP if byte representation is less than 0x10

Using md5() function in PHP directly gives me the String. What I want to do before saving the string in the database is remove zeroes 0 if any in the byte representation of that hex and that byte representation is < 0x10 and then save the string in the database.
How can I do this in PHP?
MD5 - PHP - Raw Value - catch12 - 214423105677f2375487b4c6880c12ae - This is what I get now. Below is the value that I want the PHP to save in the database.
MD5 - Raw Value - catch12 - 214423105677f2375487b4c688c12ae
Wondering why? The MD5 code I have in my Android App for Login and Signup I did not append zeroes for the condition if ((b & 0xFF) < 0x10) hex.append("0"); Works fine. But the Forgot Password functionality in the site is PHP which is when the mismatch happens if the user resets password. JAVA code below.
byte raw[] = md.digest();
StringBuffer hexString = new StringBuffer();
for (int i=0; i<raw.length; i++)
hexString.append(Integer.toHexString(0xFF & raw[i]));
v_password = hexString.toString();
Any help on the PHP side so that the mismatch does not happen would be very very helpful. I can't change the App code because that would create problems for existing users.
Thank you.
Pass the "normal" MD5 hash to this function. It will parse it into the individual byte pairs and strip leading zeros.
EDIT: Fixed a typo
function convertMD5($md5)
{
$bytearr = str_split($md5, 2);
$ret = '';
foreach ($bytearr as $byte)
$ret .= ($byte[0] == '0') ? str_replace('0', '', $byte) : $byte;
return $ret;
}
Alternatively, if you don't want zero-bytes completely stripped (if you want 0x00 to be '0'), use this version:
function convertMD5($md5)
{
$bytearr = str_split($md5, 2);
$ret = '';
foreach ($bytearr as $byte)
$ret .= ($byte[0] == '0') ? $byte[1] : $byte;
return $ret;
}
$md5 = md5('catch12');
$new_md5 = '';
for ($i = 0; $i < 32; $i += 2)
{
if ($md5[$i] != '0') $new_md5 .= $md5[$i];
$new_md5 .= $md5[$i+1];
}
echo $new_md5;
To strip leading zeros (00->0, 0a->a, 10->10)
function stripZeros($md5hex) {
$res =''; $t = str_split($md5hex, 2);
foreach($t as $pair) $res .= dechex(hexdec($pair));
return $res;
}
To strip leading zeros & zero bytes (00->nothing, 0a->a, 10->10)
function stripZeros($md5hex) {
$res =''; $t = str_split($md5hex, 2);
foreach($t as $pair) {
$b = dechex(hexdec($pair));
if ($b!=0) $res .= $b;
}
return $res;
}

php utf-8 encoding problems

Hi All:
I met a tricky problem here: I need to read some files and convert its content into some XML files. For each line in the file, I believe most of them are valid ASCII code, so that I could just read the line into php and save the line into an XML file with default encoding XML as 'UTF-8'. However, I noticed that there might be some GBK, GB2312(Chinese character), SJIS(Japanese characters) etc.. existed in the original files, php have no problems to save the string into XML directly. However, the XML parser will detect there are invalid UTF-8 codes and crashed.
Now, I think the best library php function for my purpose is probably:
$decode_str = mb_convert_encoding($str, 'UTF-8', 'auto');
I try to run this conversation function for each line before inserting it into XML. However, as I tested with some UTF-16 and GBK encoding, I don't think this function could correctly discriminate the input string encoding schema.
In addition, I tried to use CDATA to wrap the string, it's weird that the XML parser still complain about invalid UTF-8 codes etc.. of course, when I vim the xml file, what's inside the CDATA is a mess for sure.
Any suggestions?
I spend once a lot of time to create a safe UTF8 encoding function:
function _convert($content) {
if(!mb_check_encoding($content, 'UTF-8')
OR !($content === mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8' ), 'UTF-8', 'UTF-32'))) {
$content = mb_convert_encoding($content, 'UTF-8');
if (mb_check_encoding($content, 'UTF-8')) {
// log('Converted to UTF-8');
} else {
// log('Could not be converted to UTF-8');
}
}
return $content;
}
The main problem was to figure out which encoding the input string is already using. Please tell me if my solution works for you as well!
I ran into this problem while using json_encode. I use this to get everything into utf8.
Source: http://us2.php.net/manual/en/function.json-encode.php
function ascii_to_entities($str)
{
$count = 1;
$out = '';
$temp = array();
for ($i = 0, $s = strlen($str); $i < $s; $i++)
{
$ordinal = ord($str[$i]);
if ($ordinal < 128)
{
if (count($temp) == 1)
{
$out .= '&#'.array_shift($temp).';';
$count = 1;
}
$out .= $str[$i];
}
else
{
if (count($temp) == 0)
{
$count = ($ordinal < 224) ? 2 : 3;
}
$temp[] = $ordinal;
if (count($temp) == $count)
{
$number = ($count == 3) ? (($temp['0'] % 16) * 4096) +
(($temp['1'] % 64) * 64) +
($temp['2'] % 64) : (($temp['0'] % 32) * 64) +
($temp['1'] % 64);
$out .= '&#'.$number.';';
$count = 1;
$temp = array();
}
}
}
return $out;
}

Categories