php utf-8 encoding problems

Hi all,
I've run into a tricky problem: I need to read some files and convert their content into XML files. I believe most lines are plain ASCII, so I can read each line in PHP and write it to an XML file declared with the default 'UTF-8' encoding. However, the original files may also contain text in GBK or GB2312 (Chinese), SJIS (Japanese) and so on. PHP has no problem writing those strings into the XML directly, but the XML parser then detects invalid UTF-8 byte sequences and crashes.
I think the best PHP library function for my purpose is probably:
$decode_str = mb_convert_encoding($str, 'UTF-8', 'auto');
I tried running this conversion function on each line before inserting it into the XML. However, after testing with some UTF-16 and GBK input, I don't think it can reliably detect the encoding of the input string.
In addition, I tried wrapping the string in CDATA, but oddly the XML parser still complains about invalid UTF-8 sequences. And of course, when I open the XML file in vim, the content inside the CDATA is a mess.
Any suggestions?

I once spent a lot of time creating a safe UTF-8 encoding function:
function _convert($content) {
    // Re-encode unless the input is valid UTF-8 that also survives a
    // UTF-8 -> UTF-32 -> UTF-8 round trip unchanged
    if (!mb_check_encoding($content, 'UTF-8')
        OR !($content === mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32'))) {
        // No source encoding is passed, so mbstring falls back to its
        // configured internal/default encoding
        $content = mb_convert_encoding($content, 'UTF-8');
        if (mb_check_encoding($content, 'UTF-8')) {
            // log('Converted to UTF-8');
        } else {
            // log('Could not be converted to UTF-8');
        }
    }
    return $content;
}
The main problem was figuring out which encoding the input string already uses. Please tell me if my solution works for you as well!
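
One way to make the guessing step explicit is to try a short list of candidate encodings in order and convert from the first one in which the bytes are valid. The sketch below is only an illustration: the function name `to_utf8` and the default candidate list are assumptions, not part of the thread. The result is still a guess, since the same bytes can be valid in more than one legacy encoding, so the order of the list matters.

```php
<?php
// Hypothetical helper: the name and the candidate list are illustrative assumptions.
function to_utf8(string $line, array $candidates = ['SJIS', 'GB18030']): ?string
{
    // Valid UTF-8 (which includes plain ASCII) is returned untouched.
    if (mb_check_encoding($line, 'UTF-8')) {
        return $line;
    }
    // Otherwise convert from the first candidate in which the bytes are valid.
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($line, $encoding)) {
            return mb_convert_encoding($line, 'UTF-8', $encoding);
        }
    }
    // Nothing matched: let the caller decide whether to skip or log the line.
    return null;
}
```

Returning null instead of the raw bytes keeps invalid sequences out of the XML file, which is what made the parser fail in the first place.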

I ran into this problem while using json_encode. I use the function below to turn every multibyte UTF-8 character into a numeric entity, so that only plain ASCII remains.
Source: http://us2.php.net/manual/en/function.json-encode.php
function ascii_to_entities($str)
{
$count = 1;
$out = '';
$temp = array();
for ($i = 0, $s = strlen($str); $i < $s; $i++)
{
$ordinal = ord($str[$i]);
if ($ordinal < 128)
{
if (count($temp) == 1)
{
$out .= '&#'.array_shift($temp).';';
$count = 1;
}
$out .= $str[$i];
}
else
{
if (count($temp) == 0)
{
$count = ($ordinal < 224) ? 2 : 3;
}
$temp[] = $ordinal;
if (count($temp) == $count)
{
$number = ($count == 3) ? (($temp['0'] % 16) * 4096) +
(($temp['1'] % 64) * 64) +
($temp['2'] % 64) : (($temp['0'] % 32) * 64) +
($temp['1'] % 64);
$out .= '&#'.$number.';';
$count = 1;
$temp = array();
}
}
}
return $out;
}

Related

HDF5: How to decode UTF8-encoded string from h5dump output?

I'm writing an attribute to an HDF5 file using UTF-8 encoding. As an example, I've written "äöüß" to the attribute "notes" in the file.
I'm now trying to parse the output of h5ls (or h5dump) to extract this data back. Either tool gives me an output like this:
ATTRIBUTE "notes" {
DATATYPE H5T_STRING {
STRSIZE 8;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
DATA {
(0): "\37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637"
}
}
I'm aware that, e.g., \37777777703\37777777644 somehow encodes ä as 0xC3 0xA4, however, I have a really hard time coming up with how this encoding works.
What's the magic formula behind this and how can I properly decode it back into äöüß?

Each byte is printed as a base-8 (octal) escape, apparently of the sign-extended byte, so the low 8 bits of the last three octal digits hold the byte value. I decoded them in the PHP backend using:
// Single quotes keep the backslashes literal (inside double quotes PHP would read \377 as an octal escape)
$line = 'This is the text including some UTF-8 bytes \37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637';
// extract UTF-8 bytes
preg_match_all("/\\\\37777777(\\d{3})/", $line, $octbytes);
$total = count($octbytes[1]);
// parse extracted bytes
for ($m = 0; $m < $total; ) {
    $B = octdec($octbytes[1][$m]) & 255;
    // a UTF-8 character spans 1 to 4 bytes
    if (($B & 0xF8) == 0xF0) { $numBytes = 4; }
    else if (($B & 0xF0) == 0xE0) { $numBytes = 3; }
    else if (($B & 0xE0) == 0xC0) { $numBytes = 2; }
    else { $numBytes = 1; }
    // do not read past the last match if the input is truncated
    $numBytes = min($numBytes, $total - $m);
    $hxstr = "";
    $replaceStr = "";
    for ($j = 0; $j < $numBytes; $j++) {
        $match = $octbytes[1][$m+$j];
        $dec = octdec($match) & 255;
        // always two hex digits per byte, as pack("H*") expects
        $hx = str_pad(strtoupper(dechex($dec)), 2, "0", STR_PAD_LEFT);
        $hxstr = $hxstr . $hx;
        $replaceStr = $replaceStr . "\\37777777" . $match;
    }
    // pack extracted bytes into one binary string
    $utfChar = pack("H*", $hxstr); // < this will be interpreted correctly
    // replace the escapes in the input with the decoded character
    $line = str_replace($replaceStr, $utfChar, $line);
    // go to the next character
    $m += $numBytes;
}
echo "The parsed line: $line";
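
Since every escape stands for exactly one byte, grouping the bytes into characters is not strictly required: replacing each escape with its byte already yields the UTF-8 string. A shorter sketch along those lines follows; the function name `decode_h5dump_octal` is made up for illustration, and it assumes the escapes always carry the `\37777777` prefix seen in the dump above.

```php
<?php
// Hypothetical helper: replaces each \37777777NNN escape with the single byte it stands for.
function decode_h5dump_octal(string $text): string
{
    return preg_replace_callback(
        '/\\\\37777777([0-7]{3})/',
        static function (array $m): string {
            // Keep only the low 8 bits of the three octal digits.
            return chr(octdec($m[1]) & 0xFF);
        },
        $text
    );
}

// Single quotes, so PHP does not interpret the octal escapes itself.
echo decode_h5dump_octal('\37777777703\37777777644'), "\n";
```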

Convert UTF-8 to Windows 874 using PHP

I have an encoded string like ªÙªÑÂ à¾ç§Íé¹
Please check the function below, which I use to decode it (UTF-8 to TIS-620).
function utf8_to_tis620($string) {
$str = $string;
$res = "";
for ($i = 0; $i < strlen($str); $i++) {
if (ord($str[$i]) == 224) {
$unicode = ord($str[$i+2]) & 0x3F;
$unicode |= (ord($str[$i+1]) & 0x3F) << 6;
$unicode |= (ord($str[$i]) & 0x0F) << 12;
$res .= chr($unicode-0x0E00+0xA0);
$i += 2;
} else {
$res .= $str[$i];
}
}
return $res;
}
It returns a string like ชูชัย Gงอ้น, which is not correct Thai.
It should return ชูชัย เพ็งอ้น, which is what http://string-functions.com/encodedecode.aspx returns.
However, that site uses Windows 874 decoding.
How can I convert between UTF-8 and Windows 874 in PHP?

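You can use mb_convert_encoding to change the character encoding. Note the argument order (target encoding first, then source encoding), so despite its name this function converts from TIS-620 to UTF-8: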
function utf8_to_tis620($string) {
return mb_convert_encoding($string, 'UTF-8', 'TIS-620');
}
As noted by krasipenkov in their comment,
There is small difference between ISO-8859-11 and TIS-620. ISO-8859-11 is
nearly identical to the national Thai standard TIS-620 (1990). The
sole difference is that ISO/IEC 8859-11 allocates non-breaking space
to code 0xA0, while TIS-620 leaves it undefined. (In practice, this
small distinction is usually ignored.)
Instead of 'TIS-620', you can use 'ISO-8859-11' for the Windows 874 character encoding if needed.

You can use iconv to convert the string to the required character encoding.
$string= iconv('TIS-620','UTF-8//ignore',$string);
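
If neither extension recognises the Thai encoding names on a given system, the TIS-620 to UTF-8 direction can also be done by hand, because TIS-620 bytes 0xA1 to 0xFB map linearly onto the Unicode Thai block starting at U+0E01. The sketch below is only an illustration under that assumption: `tis620_to_utf8` is a made-up name, and it does not cover the extra characters that Windows 874 adds outside that range.

```php
<?php
// Hypothetical helper (not from the thread): maps TIS-620 bytes to UTF-8 by hand.
function tis620_to_utf8(string $bytes): string
{
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            // ASCII is identical in both encodings.
            $out .= $bytes[$i];
        } elseif (($b >= 0xA1 && $b <= 0xDA) || ($b >= 0xDF && $b <= 0xFB)) {
            // Thai range: byte 0xA1 corresponds to U+0E01, and so on.
            $cp = 0x0E00 + ($b - 0xA0);
            // Encode the code point as a three-byte UTF-8 sequence.
            $out .= chr(0xE0 | ($cp >> 12))
                  . chr(0x80 | (($cp >> 6) & 0x3F))
                  . chr(0x80 | ($cp & 0x3F));
        } else {
            // Bytes that TIS-620 leaves undefined become a placeholder.
            $out .= '?';
        }
    }
    return $out;
}
```

Fed the first five bytes of the sample string (0xAA 0xD9 0xAA 0xD1 0xC2), this produces ชูชัย, matching the expected output quoted in the question.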

Enable Extract and Display Foreign Language Content in browsers using fgetcsv

iD;English [en];Chinese [zh];German [de];Hindi [hi];Hindi (TOGO) [hi_TG];Japanese [ja]
Source[local].AlarmGroup[AlarmText_02].ID[1310:90];Unwinder: Accu position difference too big. Check for laminate break;拆卷器： 蓄存器位置差过大。 检查复合片材是否中断;Laminatspeicher: Zu grosse Positionsänderung - Auf Laminatriss prüfen;290;;巻出装置: アキュムレーター位置の差が大きすぎます。 ラミネートが壊れていないか確認してください
Source[local].AlarmGroup[AlarmText_02].ID[1311:91];Unwinder: Accu level too small for auto splice;拆卷器： 自动拼接的蓄存器级别过小;Abwickler: Akku Füllstand zu klein für Autospleiss;291;;巻出装置: 自動紙継を行うにはアキュムレーターのレベルが小さすぎます
I am trying to read CSV content like the sample above.
The CSV file is saved as Unicode Text and contains Chinese, German and Japanese text.
I am unable to read the foreign-language text in the correct form.
CSV reader code:
header('Content-Type: text/html; charset=utf-8');
$row = 1;
$up_file = 'text_SHOT_S.csv';
setlocale(LC_ALL, 'en_US.UTF-8');
if (($handle = fopen($up_file, "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) {
$num = count($data);
$row++;
for ($c=0; $c < $num; $c++) {
echo $data[$c].'<br>';
}
}
fclose($handle);}
Output of this code:
iD戼㹲䔀渀最氀椀猀栀 嬀攀渀崀㰀牢>Chinese [zh]戼㹲䜀攀爀洀愀渀 嬀搀攀崀㰀牢>Hindi [hi]戼㹲䠀椀渀搀椀 ⠀吀伀䜀伀⤀ 嬀栀椀开吀䜀崀㰀牢>Japanese [ja] 戼㹲匀漀甀爀挀攀嬀氀漀挀愀氀崀⸀䄀氀愀爀洀䜀爀漀甀瀀嬀䄀氀愀爀洀吀攀砀琀开　㈀崀⸀䤀䐀嬀㄀㌀㄀　㨀㤀　崀㰀牢>Unwinder: Accu position difference too big. Check for laminate break戼㹲였睢桓ᩖ⃿쐀墄桛䵖湏읝➏ə‰쀀൧࡙䝔偲⽧♦ⵔ굎㱥牢>Laminatspeicher: Zu grosse Positionsänderung - Auf Laminatriss prüfen戼㹲㈀㤀　㰀牢>戼㹲ﬀ艹앑溈㩿  ꈀ괰ﰰ뼰ﰰ䴰湏湿䱝✰䵙夰丰縰夰Ȱ‰�촰ﰰ젰䰰쨰豘昰䐰樰䐰䬰먰赸垊昰估怰唰䐰ര㰀牢>Source[local].AlarmGroup[AlarmText_02].ID[1311:91]戼㹲唀渀眀椀渀搀攀爀㨀 䄀挀挀甀 氀攀瘀攀氀 琀漀漀 猀洀愀氀氀 昀漀爀 愀甀琀漀 猀瀀氀椀挀攀㰀牢>拆卷器： 自动拼接的蓄存器级别过小戼㹲䄀戀眀椀挀欀氀攀爀㨀 䄀欀欀甀 䘀ﰀ氀氀猀琀愀渀搀 稀甀 欀氀攀椀渀 昀ﰀ爀 䄀甀琀漀猀瀀氀攀椀猀猀㰀牢>291戼㹲㰀牢>巻出装置: 自動紙継を行うにはアキュムレーターのレベルが小さすぎます 戼㹲㰀牢
I either get garbage characters or most of the content is turned into Chinese characters.
I also tried header('Content-Type: text/html; charset=iso-8859-1') and setlocale(LC_CTYPE, 'zh_CN.UTF-8','zh_ZH.big5');
I want the output to be the same as the CSV content.
Thanks in advance.

To read the CSV content I used PHPExcel and first converted the UTF-16 file to UTF-8; after that the Chinese content is read properly.
The article referenced below explains how to convert a UTF-16 file to UTF-8:
How to Convert an UTF-16 File to an UTF-8 file using PHP
To convert a file simply call the convert_file_to_utf8() function
and pass to it the file path of the file you wish to convert. The
function then uses the PHP function file_get_contents() to pack the
input file’s contents into a string variable which is then passed to
the main converter function which converts the string from UTF-16 to
UTF-8 encoding if necessary. Finally, it uses file_put_contents() to
stuff the resulting string back into the original file, overwriting
the original file contents.
function utf16_to_utf8($str) {
$c0 = ord($str[0]);
$c1 = ord($str[1]);
if ($c0 == 0xFE && $c1 == 0xFF) {
$be = true;
} else if ($c0 == 0xFF && $c1 == 0xFE) {
$be = false;
} else {
return $str;
}
$str = substr($str, 2);
$len = strlen($str);
$dec = '';
for ($i = 0; $i < $len; $i += 2) {
$c = ($be) ? ord($str[$i]) << 8 | ord($str[$i + 1]) :
ord($str[$i + 1]) << 8 | ord($str[$i]);
if ($c >= 0x0001 && $c <= 0x007F) {
$dec .= chr($c);
} else if ($c > 0x07FF) {
$dec .= chr(0xE0 | (($c >> 12) & 0x0F));
$dec .= chr(0x80 | (($c >> 6) & 0x3F));
$dec .= chr(0x80 | (($c >> 0) & 0x3F));
} else {
$dec .= chr(0xC0 | (($c >> 6) & 0x1F));
$dec .= chr(0x80 | (($c >> 0) & 0x3F));
}
}
return $dec;
}
function convert_file_to_utf8($csvfile) {
$utfcheck = file_get_contents($csvfile);
$utfcheck = utf16_to_utf8($utfcheck);
file_put_contents($csvfile, $utfcheck);
}

Before reading this answer, please read the comments.
Mudassir, you can see the exact charset with Tortoise, using its file comparison tool.
Your software uses UTF-16 encoding, not UTF-8. If you can't change that, you can use http://php.net/manual/en/function.mb-convert-encoding.php
http://php.net/manual/fr/mbstring.supported-encodings.php
I tried this function with your file and it works correctly. See the code:
header('Content-Type: text/html; charset=utf-8');
$row = 1;
$up_file = 'text_SHOT_S.csv';
setlocale(LC_ALL, 'en_US.UTF-8');
if (($handle = fopen($up_file, "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) {
$num = count($data);
$row++;
for ($c=0; $c < $num; $c++) {
// echo $data[$c].'<br>';
echo mb_convert_encoding($data[$c],'utf8','utf-16').'<br>';
}
}
fclose($handle);}
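
A caveat worth keeping in mind: fgetcsv() splits on single bytes, while every UTF-16 character takes at least two bytes, so fields cut out of raw UTF-16 data can start or end in the middle of a character. A more defensive sketch is to convert the whole content to UTF-8 first and parse afterwards. The function name `read_utf16_csv_string` is hypothetical, little-endian is assumed when no byte order mark is present, and splitting on line breaks means quoted fields that contain newlines are not supported.

```php
<?php
// Hypothetical helper: converts UTF-16 CSV bytes to UTF-8 first, then parses each line.
function read_utf16_csv_string(string $bytes, string $delimiter = ';'): array
{
    // Pick the byte order from the BOM; little-endian is an assumption when there is none.
    $bom = substr($bytes, 0, 2);
    if ($bom === "\xFF\xFE") {
        $from = 'UTF-16LE';
        $bytes = substr($bytes, 2);
    } elseif ($bom === "\xFE\xFF") {
        $from = 'UTF-16BE';
        $bytes = substr($bytes, 2);
    } else {
        $from = 'UTF-16LE';
    }
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', $from);

    $rows = [];
    foreach (preg_split('/\r\n|\r|\n/', $utf8) as $line) {
        if ($line === '') {
            continue; // skip empty lines, including the one after a trailing newline
        }
        $rows[] = str_getcsv($line, $delimiter, '"', '\\');
    }
    return $rows;
}

// Usage with the file from the question:
// $rows = read_utf16_csv_string(file_get_contents('text_SHOT_S.csv'));
```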

drop 0 from md5() PHP if byte representation is less than 0x10

The md5() function in PHP gives me the hex string directly. Before saving it in the database, I want to drop the leading zero of every byte whose value is less than 0x10 in that hex representation.
How can I do this in PHP?
MD5 - PHP - Raw Value - catch12 - 214423105677f2375487b4c6880c12ae - This is what I get now. Below is the value that I want PHP to save in the database.
MD5 - Raw Value - catch12 - 214423105677f2375487b4c688c12ae
Wondering why? In the MD5 code in my Android app for login and signup I did not append zeroes, i.e. I left out the check if ((b & 0xFF) < 0x10) hex.append("0");. It works fine within the app, but the Forgot Password functionality on the site is written in PHP, so a mismatch happens when a user resets the password. Java code below.
byte raw[] = md.digest();
StringBuffer hexString = new StringBuffer();
for (int i=0; i<raw.length; i++)
hexString.append(Integer.toHexString(0xFF & raw[i]));
v_password = hexString.toString();
Any help on the PHP side so that the mismatch does not happen would be very very helpful. I can't change the App code because that would create problems for existing users.
Thank you.

Pass the "normal" MD5 hash to this function. It will parse it into the individual byte pairs and strip leading zeros.
EDIT: Fixed a typo
function convertMD5($md5)
{
$bytearr = str_split($md5, 2);
$ret = '';
foreach ($bytearr as $byte)
$ret .= ($byte[0] == '0') ? str_replace('0', '', $byte) : $byte;
return $ret;
}
Alternatively, if you don't want zero-bytes completely stripped (if you want 0x00 to be '0'), use this version:
function convertMD5($md5)
{
$bytearr = str_split($md5, 2);
$ret = '';
foreach ($bytearr as $byte)
$ret .= ($byte[0] == '0') ? $byte[1] : $byte;
return $ret;
}

$md5 = md5('catch12');
$new_md5 = '';
for ($i = 0; $i < 32; $i += 2)
{
if ($md5[$i] != '0') $new_md5 .= $md5[$i];
$new_md5 .= $md5[$i+1];
}
echo $new_md5;

To strip leading zeros (00->0, 0a->a, 10->10)
function stripZeros($md5hex) {
$res =''; $t = str_split($md5hex, 2);
foreach($t as $pair) $res .= dechex(hexdec($pair));
return $res;
}
To strip leading zeros & zero bytes (00->nothing, 0a->a, 10->10)
function stripZeros($md5hex) {
$res =''; $t = str_split($md5hex, 2);
foreach($t as $pair) {
$b = dechex(hexdec($pair));
if ($b !== '0') $res .= $b; // compare as strings; a loose != 0 also drops hex digits such as 'a' on PHP 7
}
return $res;
}
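
The Java loop in the question works on the raw digest bytes and lets Integer.toHexString() drop the leading zero of each byte. The same thing can be mirrored byte by byte in PHP, starting from md5($password, true). A minimal sketch, with `java_style_hex` as a made-up name:

```php
<?php
// Hypothetical helper: hex-encodes raw bytes without zero padding,
// the way Integer.toHexString(0xFF & b) does in the Java loop.
function java_style_hex(string $raw): string
{
    if ($raw === '') {
        return '';
    }
    $hex = '';
    foreach (str_split($raw) as $byte) {
        // dechex() yields one digit for values below 0x10, and '0' for a zero byte.
        $hex .= dechex(ord($byte));
    }
    return $hex;
}

// Usage: pass the raw 16-byte digest rather than the hex string.
echo java_style_hex(md5('catch12', true)), "\n";
```

Note that the shortened string is ambiguous, because different digests can collapse to the same text, so it is only suitable for matching what the existing app already stores.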

Decompressing a .gz file via PHP

I need to decompress, in PHP, some data that I have in a string in gzip format. It has to be done in PHP itself, not by calling an external program (through system, for example).
I went to the documentation and found gzdecode. Too bad it doesn't exist. Digging further through Google, it appears this function was implemented in PHP6, which I cannot use. (Interestingly enough, gzencode exists and works.)
I believe, but I'm not sure, that the gzip format simply has some extra header data. Is there a way to uncompress it by manipulating this extra data and then using gzuncompress, or some other way?
Thanks

gzdecode() is not yet in PHP. But you can use the implementation from upgradephp. It really is just a few extra header bytes.
Another option would be to use gzopen. Maybe just like gzopen("data:app/bin,....") even.
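
To illustrate the "few extra header bytes" point: in the simplest case a gzip string is a 10-byte header, a raw deflate stream, and an 8-byte footer (CRC32 and size), so gzinflate() on the middle part is enough. The sketch below uses made-up function names and deliberately rejects anything with optional header fields instead of parsing them; it also skips the CRC check. On PHP 5.4 and later the built-in gzdecode() is available and is used when present.

```php
<?php
// Hypothetical helper: handles only gzip data without optional header fields (FLG == 0).
function gunzip_minimal(string $data)
{
    // Need the 10-byte header plus 8-byte footer, the gzip magic bytes and method 8 (deflate).
    if (strlen($data) < 18 || substr($data, 0, 3) !== "\x1f\x8b\x08") {
        return false;
    }
    // Extra field, file name, comment or header CRC present: not covered by this sketch.
    if (ord($data[3]) !== 0) {
        return false;
    }
    // Strip header and footer, then inflate the raw deflate stream.
    return gzinflate(substr($data, 10, -8));
}

// Prefer the built-in function where it exists.
function gunzip_string(string $data)
{
    return function_exists('gzdecode') ? gzdecode($data) : gunzip_minimal($data);
}
```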

Well I found my answer by reading the comments on the gzdecode page I linked in my original post. One of the users, Aaron G, provided an implementation of it and it works:
<?php
function gzdecode($data) {
$len = strlen($data);
if ($len < 18 || strcmp(substr($data,0,2),"\x1f\x8b")) {
return null; // Not GZIP format (See RFC 1952)
}
$method = ord(substr($data,2,1)); // Compression method
$flags = ord(substr($data,3,1)); // Flags
if (($flags & 31) != $flags) { // parentheses needed: != binds tighter than &
// Reserved bits are set -- NOT ALLOWED by RFC 1952
return null;
}
// NOTE: $mtime may be negative (PHP integer limitations)
$mtime = unpack("V", substr($data,4,4));
$mtime = $mtime[1];
$xfl = substr($data,8,1); // Extra flags (byte 8)
$os = substr($data,9,1); // Operating system (byte 9)
$headerlen = 10;
$extralen = 0;
$extra = "";
if ($flags & 4) {
// 2-byte length prefixed EXTRA data in header
if ($len - $headerlen - 2 < 8) {
return false; // Invalid format
}
$extralen = unpack("v",substr($data,10,2)); // XLEN follows the 10-byte fixed header
$extralen = $extralen[1];
if ($len - $headerlen - 2 - $extralen < 8) {
return false; // Invalid format
}
$extra = substr($data,12,$extralen); // EXTRA data starts after the 2-byte XLEN
$headerlen += 2 + $extralen;
}
$filenamelen = 0;
$filename = "";
if ($flags & 8) {
// C-style string file NAME data in header
if ($len - $headerlen - 1 < 8) {
return false; // Invalid format
}
$filenamelen = strpos(substr($data,$headerlen),chr(0)); // search from the current header offset
if ($filenamelen === false || $len - $headerlen - $filenamelen - 1 < 8) {
return false; // Invalid format
}
$filename = substr($data,$headerlen,$filenamelen);
$headerlen += $filenamelen + 1;
}
$commentlen = 0;
$comment = "";
if ($flags & 16) {
// C-style string COMMENT data in header
if ($len - $headerlen - 1 < 8) {
return false; // Invalid format
}
$commentlen = strpos(substr($data,$headerlen),chr(0)); // search from the current header offset
if ($commentlen === false || $len - $headerlen - $commentlen - 1 < 8) {
return false; // Invalid header format
}
$comment = substr($data,$headerlen,$commentlen);
$headerlen += $commentlen + 1;
}
$headercrc = "";
if ($flags & 1) {
// 2-bytes (lowest order) of CRC32 on header present
if ($len - $headerlen - 2 < 8) {
return false; // Invalid format
}
$calccrc = crc32(substr($data,0,$headerlen)) & 0xffff;
$headercrc = unpack("v", substr($data,$headerlen,2));
$headercrc = $headercrc[1];
if ($headercrc != $calccrc) {
return false; // Bad header CRC
}
$headerlen += 2;
}
// GZIP FOOTER - These may be negative due to PHP's integer limitations
$datacrc = unpack("V",substr($data,-8,4));
$datacrc = $datacrc[1];
$isize = unpack("V",substr($data,-4));
$isize = $isize[1];
// Perform the decompression:
$bodylen = $len-$headerlen-8;
if ($bodylen < 1) {
// This should never happen - IMPLEMENTATION BUG!
return null;
}
$body = substr($data,$headerlen,$bodylen);
$data = "";
if ($bodylen > 0) {
switch ($method) {
case 8:
// Currently the only supported compression method:
$data = gzinflate($body);
break;
default:
// Unknown compression method
return false;
}
} else {
// I'm not sure if zero-byte body content is allowed.
// Allow it for now... Do nothing...
}
// Verify decompressed size and CRC32:
// NOTE: This may fail with large data sizes depending on how
// PHP's integer limitations affect strlen() since $isize
// may be negative for large sizes.
if ($isize != strlen($data) || crc32($data) != $datacrc) {
// Bad format! Length or CRC doesn't match!
return false;
}
return $data;
}
?>

Try gzinflate.

Did you try gzuncompress?
http://www.php.net/manual/en/function.gzuncompress.php
