Extract and display foreign-language content in browsers using fgetcsv - php

iD;English [en];Chinese [zh];German [de];Hindi [hi];Hindi (TOGO) [hi_TG];Japanese [ja]
Source[local].AlarmGroup[AlarmText_02].ID[1310:90];Unwinder: Accu position difference too big. Check for laminate break;拆卷器: 蓄存器位置差过大。 检查复合片材是否中断;Laminatspeicher: Zu grosse Positionsänderung - Auf Laminatriss prüfen;290;;巻出装置: アキュムレーター位置の差が大きすぎます。 ラミネートが壊れていないか確認してください
Source[local].AlarmGroup[AlarmText_02].ID[1311:91];Unwinder: Accu level too small for auto splice;拆卷器: 自动拼接的蓄存器级别过小;Abwickler: Akku Füllstand zu klein für Autospleiss;291;;巻出装置: 自動紙継を行うにはアキュムレーターのレベルが小さすぎます
I am trying to read the CSV content shown above.
The CSV file is saved as Unicode Text. It contains Chinese, German, and Japanese text.
I am unable to read the foreign-language text in the correct format.
CSV reader code:
header('Content-Type: text/html; charset=utf-8');
$row = 1;
$up_file = 'text_SHOT_S.csv';
setlocale(LC_ALL, 'en_US.UTF-8');
if (($handle = fopen($up_file, "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) {
$num = count($data);
$row++;
for ($c=0; $c < $num; $c++) {
echo $data[$c].'<br>';
}
}
fclose($handle);}
Output of the code above:
iD戼㹲䔀渀最氀椀猀栀 嬀攀渀崀㰀牢>Chinese [zh]戼㹲䜀攀爀洀愀渀 嬀搀攀崀㰀牢>Hindi [hi]戼㹲䠀椀渀搀椀 ⠀吀伀䜀伀⤀ 嬀栀椀开吀䜀崀㰀牢>Japanese [ja] 戼㹲匀漀甀爀挀攀嬀氀漀挀愀氀崀⸀䄀氀愀爀洀䜀爀漀甀瀀嬀䄀氀愀爀洀吀攀砀琀开 ㈀崀⸀䤀䐀嬀㄀㌀㄀ 㨀㤀 崀㰀牢>Unwinder: Accu position difference too big. Check for laminate break戼㹲였睢桓ᩖ⃿쐀墄桛䵖湏읝➏ə‰쀀൧࡙䝔偲⽧♦ⵔ굎㱥牢>Laminatspeicher: Zu grosse Positionsänderung - Auf Laminatriss prüfen戼㹲㈀㤀 㰀牢>戼㹲ff艹앑溈㩿  ꈀ괰ﰰ뼰ﰰ䴰湏湿䱝✰䵙夰丰縰夰Ȱ‰�촰ﰰ젰䰰쨰豘昰䐰樰䐰䬰먰赸垊昰估怰唰䐰ര㰀牢>Source[local].AlarmGroup[AlarmText_02].ID[1311:91]戼㹲唀渀眀椀渀搀攀爀㨀 䄀挀挀甀 氀攀瘀攀氀 琀漀漀 猀洀愀氀氀 昀漀爀 愀甀琀漀 猀瀀氀椀挀攀㰀牢>拆卷器: 自动拼接的蓄存器级别过小戼㹲䄀戀眀椀挀欀氀攀爀㨀 䄀欀欀甀 䘀ﰀ氀氀猀琀愀渀搀 稀甀 欀氀攀椀渀 昀ﰀ爀 䄀甀琀漀猀瀀氀攀椀猀猀㰀牢>291戼㹲㰀牢>巻出装置: 自動紙継を行うにはアキュムレーターのレベルが小さすぎます 戼㹲㰀牢
I either get garbage characters, or most of the content is converted to Chinese characters.
I also tried header('Content-Type: text/html; charset=iso-8859-1') and setlocale(LC_CTYPE, 'zh_CN.UTF-8','zh_ZH.big5'), without success.
I want the output to be the same as the CSV content.
Thanks in advance .

To read the CSV content I used PHPExcel and first converted the UTF-16 file to UTF-8; after that, the Chinese content is read properly.
See the link below for converting a UTF-16 file to a UTF-8 file.
How to Convert an UTF-16 File to an UTF-8 file using PHP
To convert a file simply call the convert_file_to_utf8() function
and pass to it the file path of the file you wish to convert. The
function then uses the PHP function file_get_contents() to pack the
input file’s contents into a string variable which is then passed to
the main converter function which converts the string from UTF-16 to
UTF-8 encoding if necessary. Finally, it uses file_put_contents() to
stuff the resulting string back into the original file, overwriting
the original file contents.
function utf16_to_utf8($str) {
$c0 = ord($str[0]);
$c1 = ord($str[1]);
if ($c0 == 0xFE && $c1 == 0xFF) {
$be = true;
} else if ($c0 == 0xFF && $c1 == 0xFE) {
$be = false;
} else {
return $str;
}
$str = substr($str, 2);
$len = strlen($str);
$dec = '';
for ($i = 0; $i < $len; $i += 2) {
$c = ($be) ? ord($str[$i]) << 8 | ord($str[$i + 1]) :
ord($str[$i + 1]) << 8 | ord($str[$i]);
if ($c <= 0x007F) { // one byte, including U+0000 (avoids the overlong form C0 80)
$dec .= chr($c);
} else if ($c > 0x07FF) {
$dec .= chr(0xE0 | (($c >> 12) & 0x0F));
$dec .= chr(0x80 | (($c >> 6) & 0x3F));
$dec .= chr(0x80 | (($c >> 0) & 0x3F));
} else {
$dec .= chr(0xC0 | (($c >> 6) & 0x1F));
$dec .= chr(0x80 | (($c >> 0) & 0x3F));
}
}
return $dec;
}
function convert_file_to_utf8($csvfile) {
$utfcheck = file_get_contents($csvfile);
$utfcheck = utf16_to_utf8($utfcheck);
file_put_contents($csvfile, $utfcheck);
}

Before reading this answer, please read the comments.
Mudassir, you can check the exact charset with Tortoise, using its file comparison tool.
Your software does not use UTF-8 but UTF-16 encoding. If you cannot change this, you can use http://php.net/manual/en/function.mb-convert-encoding.php
http://php.net/manual/fr/mbstring.supported-encodings.php
I tried this function with your file and it works correctly. See the code:
header('Content-Type: text/html; charset=utf-8');
$row = 1;
$up_file = 'text_SHOT_S.csv';
setlocale(LC_ALL, 'en_US.UTF-8');
if (($handle = fopen($up_file, "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) {
$num = count($data);
$row++;
for ($c=0; $c < $num; $c++) {
// echo $data[$c].'<br>';
echo mb_convert_encoding($data[$c],'utf8','utf-16').'<br>';
}
}
fclose($handle);}
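Converting each field after fgetcsv() depends on how the bytes happen to line up, because fgetcsv() splits on the single bytes 0x3B (;) and 0x0A, and in UTF-16 either byte can also be one half of a two-byte code unit. A more robust variant is to convert the whole file to UTF-8 first and parse afterwards. The following is only a sketch under stated assumptions: the function name read_utf16_csv() is hypothetical, the file is assumed to start with a UTF-16 byte order mark, and the whole file is assumed to fit in memory.

```php
<?php
// Hypothetical helper: read a UTF-16 CSV file and return its rows as UTF-8 strings.
function read_utf16_csv(string $path, string $delimiter = ';'): array
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }

    // Use the byte order mark to pick the byte order; without one the
    // content is assumed (an assumption of this sketch) to be UTF-8 already.
    $bom = substr($raw, 0, 2);
    if ($bom === "\xFF\xFE") {
        $raw = mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16LE');
    } elseif ($bom === "\xFE\xFF") {
        $raw = mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16BE');
    }

    // Parse the converted text from an in-memory stream so that fgetcsv()
    // only ever sees UTF-8, where ';' and "\n" are never part of another character.
    $stream = fopen('php://temp', 'r+');
    fwrite($stream, $raw);
    rewind($stream);

    $rows = [];
    while (($fields = fgetcsv($stream, 0, $delimiter, '"', '\\')) !== false) {
        $rows[] = $fields;
    }
    fclose($stream);

    return $rows;
}
```

With the file from the question, read_utf16_csv('text_SHOT_S.csv') would return one array per CSV line, and each field can be echoed directly on a page served as charset=utf-8.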

Related

HDF5: How to decode UTF8-encoded string from h5dump output?

I'm writing an attribute to an HDF5 file using UTF-8 encoding. As an example, I've written "äöüß" to the attribute "notes" in the file.
I'm now trying to parse the output of h5ls (or h5dump) to extract this data back. Either tool gives me an output like this:
ATTRIBUTE "notes" {
DATATYPE H5T_STRING {
STRSIZE 8;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
DATA {
(0): "\37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637"
}
}
I'm aware that, e.g., \37777777703\37777777644 somehow encodes ä as 0xC3 0xA4, however, I have a really hard time coming up with how this encoding works.
What's the magic formula behind this and how can I properly decode it back into äöüß?
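The bytes are written in base 8 (octal). Each escape appears to be a byte sign-extended to 32 bits, so 0xC3 shows up as \37777777703 (0xFFFFFFC3), and the low 8 bits give back the original byte. I've decoded them in the PHP backend using: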
// Single quotes keep the backslashes literal; in double quotes PHP would read \377 as an octal escape.
$line = 'This is the text including some UTF-8 bytes \37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637';
// extract the UTF-8 bytes (the last three octal digits of each escape)
preg_match_all("/\\\\37777777(\\d{3})/", $line, $octbytes);
// start from the input and replace the escapes one character at a time
$parsedData = $line;
for ($m = 0; $m < count($octbytes[1]); ) {
$B = octdec($octbytes[1][$m]) & 255;
// a UTF-8 character spans 1 to 4 bytes; the lead byte tells how many
if (($B & 0xF8) == 0xF0) { $numBytes = 4; }
else if (($B & 0xF0) == 0xE0) { $numBytes = 3; }
else if (($B & 0xE0) == 0xC0) { $numBytes = 2; }
else { $numBytes = 1; }
$hxstr = "";
$replaceStr = "";
for ($j = 0; $j < $numBytes; $j++) {
$match = $octbytes[1][$m+$j];
$dec = octdec($match) & 255;
$hx = strtoupper(dechex($dec));
$hxstr = $hxstr . $hx;
$replaceStr = $replaceStr . "\\37777777" . $match;
}
// pack the extracted bytes into one binary string
$utfChar = pack("H*", $hxstr); // < this will be interpreted correctly
// replace the escapes in the already parsed text, not in the original input
$parsedData = str_replace($replaceStr, $utfChar, $parsedData);
// go to the next character
$m+=$numBytes;
}
echo "The parsed line: $parsedData";
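Since every escape stands for exactly one byte, the grouping into characters is not strictly needed: replacing each escape with its byte already yields valid UTF-8. A shorter sketch of the same idea follows; the function name decode_h5dump_octal() is hypothetical, and it assumes that all escapes have the \37777777ddd form shown in the question.

```php
<?php
// Hypothetical helper: turn h5dump-style octal escapes back into raw bytes.
function decode_h5dump_octal(string $text): string
{
    return preg_replace_callback(
        '/\\\\37777777([0-7]{3})/',          // a literal backslash, the sign-extension digits, 3 octal digits
        static function (array $m): string {
            return chr(octdec($m[1]) & 0xFF); // keep only the low 8 bits
        },
        $text
    );
}

$dumped = 'notes: \37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637';
echo decode_h5dump_octal($dumped), "\n";      // the escapes become the UTF-8 bytes of the original text
```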

Check if csv file is in UTF-8 with PHP

Is there a way which checks a CSV-file for UTF-8 without BOM encoding? I want to check the whole file and not a single string.
I would try putting a special character in the first line, then reading that string and checking whether it matches the same string hard-coded in my script. But I don't know if this is a good idea.
Google only showed me this. But the link in the last post isn't available.
if (mb_check_encoding(file_get_contents($file), 'UTF-8')) {
// yup, all UTF-8
}
You can also go through it line by line with fgets, if the file is large and you don't want to store it all in memory at once. Not sure what you mean by the second part of your question.
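The line-by-line variant could look roughly like this. It is a sketch, not a definitive implementation: the function name file_is_utf8_without_bom() is hypothetical, and it treats a leading UTF-8 byte order mark (EF BB BF) as a failure because the question asks for UTF-8 without BOM. Splitting at line ends is safe here because the byte 0x0A never occurs inside a multi-byte UTF-8 sequence.

```php
<?php
// Hypothetical helper: true if the file is valid UTF-8 and does not start with a BOM.
function file_is_utf8_without_bom(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    try {
        $first = true;
        while (($line = fgets($handle)) !== false) {
            // Reject a UTF-8 byte order mark at the very start of the file.
            if ($first && strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
                return false;
            }
            $first = false;

            // Validate one line at a time so memory use stays small.
            if (!mb_check_encoding($line, 'UTF-8')) {
                return false;
            }
        }
        return true;
    } finally {
        fclose($handle);
    }
}
```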
I recommend this function (from the symfony toolkit):
<?php
/**
* Checks if a string is an utf8.
*
* Yi Stone Li<yili#yahoo-inc.com>
* Copyright (c) 2007 Yahoo! Inc. All rights reserved.
* Licensed under the BSD open source license
*
* @param string
*
* @return bool true if $string is valid UTF-8 and false otherwise.
*/
public static function isUTF8($string)
{
for ($idx = 0, $strlen = strlen($string); $idx < $strlen; $idx++)
{
$byte = ord($string[$idx]);
if ($byte & 0x80)
{
if (($byte & 0xE0) == 0xC0)
{
// 2 byte char
$bytes_remaining = 1;
}
else if (($byte & 0xF0) == 0xE0)
{
// 3 byte char
$bytes_remaining = 2;
}
else if (($byte & 0xF8) == 0xF0)
{
// 4 byte char
$bytes_remaining = 3;
}
else
{
return false;
}
if ($idx + $bytes_remaining >= $strlen)
{
return false;
}
while ($bytes_remaining--)
{
if ((ord($string[++$idx]) & 0xC0) != 0x80)
{
return false;
}
}
}
}
return true;
}
But as it checks every character of the string, I don't recommend using it on a large file. Just check the first 10 lines, for example:
<?php
$handle = fopen("mycsv.csv", "r");
$check_string = "";
$line = 1;
if ($handle) {
while ($line < 11 && ($buffer = fgets($handle, 4096)) !== false) {
$check_string .= $buffer;
$line++;
}
// fgets() returning false before the end of the file means a read error
if ($buffer === false && !feof($handle)) {
echo "Error: unexpected fgets() fail\n";
}
fclose($handle);
var_dump( self::isUTF8($check_string) );
}

php utf-8 encoding problems

Hi All:
I have run into a tricky problem: I need to read some files and convert their content into XML files. I believe most lines in each file are valid ASCII, so I could just read each line in PHP and save it to an XML file whose encoding is declared as the default 'UTF-8'. However, the original files may also contain text in GBK or GB2312 (Chinese characters), SJIS (Japanese characters), and so on. PHP has no problem saving such strings into the XML directly, but the XML parser then detects invalid UTF-8 sequences and crashes.
Now, I think the best PHP library function for my purpose is probably:
$decode_str = mb_convert_encoding($str, 'UTF-8', 'auto');
I tried running this conversion function on each line before inserting it into the XML. However, after testing with some UTF-16 and GBK input, I don't think this function can correctly identify the encoding of the input string.
In addition, I tried wrapping the string in CDATA, but strangely the XML parser still complains about invalid UTF-8 sequences. Of course, when I open the XML file in vim, what's inside the CDATA is a mess.
Any suggestions?
I once spent a lot of time creating a safe UTF-8 encoding function:
function _convert($content) {
if(!mb_check_encoding($content, 'UTF-8')
OR !($content === mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8' ), 'UTF-8', 'UTF-32'))) {
$content = mb_convert_encoding($content, 'UTF-8');
if (mb_check_encoding($content, 'UTF-8')) {
// log('Converted to UTF-8');
} else {
// log('Could not be converted to UTF-8');
}
}
return $content;
}
The main problem was to figure out which encoding the input string is already using. Please tell me if my solution works for you as well!
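One way to make the guess explicit is to try a fixed list of candidate encodings in order and convert from the first one in which the bytes are valid. This is only a sketch: the function name to_utf8() and the candidate order are assumptions, and the order matters because the same bytes can be valid in several legacy encodings (many GBK strings are also valid SJIS). UTF-16 cannot be recognised by validation alone, so the sketch only accepts it when a byte order mark is present.

```php
<?php
// Hypothetical helper: convert $bytes to UTF-8 using the first candidate encoding
// in which they are valid. Returns null when no candidate fits.
function to_utf8(string $bytes, array $candidates = ['UTF-8', 'CP936', 'SJIS']): ?string
{
    // UTF-16 is only detected through its byte order mark.
    $bom = substr($bytes, 0, 2);
    if ($bom === "\xFF\xFE") {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16LE');
    }
    if ($bom === "\xFE\xFF") {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    }

    // CP936 is the mbstring name for GBK; the order of the list is a guess
    // about which encodings are most likely in the input files.
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return mb_convert_encoding($bytes, 'UTF-8', $encoding);
        }
    }

    return null;
}
```

Lines for which to_utf8() returns null can be logged and skipped instead of being written into the XML file.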
I ran into this problem while using json_encode. I use this to get everything into utf8.
Source: http://us2.php.net/manual/en/function.json-encode.php
function ascii_to_entities($str)
{
$count = 1;
$out = '';
$temp = array();
for ($i = 0, $s = strlen($str); $i < $s; $i++)
{
$ordinal = ord($str[$i]);
if ($ordinal < 128)
{
if (count($temp) == 1)
{
$out .= '&#'.array_shift($temp).';';
$count = 1;
}
$out .= $str[$i];
}
else
{
if (count($temp) == 0)
{
$count = ($ordinal < 224) ? 2 : 3;
}
$temp[] = $ordinal;
if (count($temp) == $count)
{
$number = ($count == 3) ? (($temp['0'] % 16) * 4096) +
(($temp['1'] % 64) * 64) +
($temp['2'] % 64) : (($temp['0'] % 32) * 64) +
($temp['1'] % 64);
$out .= '&#'.$number.';';
$count = 1;
$temp = array();
}
}
}
return $out;
}

Decompressing a .gz file via PHP

I need to be able to decompress through PHP some data that I have in a string which uses the gzip format. I need to do this via PHP, not by calling - through system for example - an external program.
I go to the documentation and I find gzdecode. Too bad it doesn't exist. Digging further through google it appears this function was implemented in PHP6, which I cannot use. (Interestingly enough gzencode exists and is working).
I believe - but I'm not sure - that the gzip format simply has some extra header data. Is there a way to uncompress it by manipulating this extra data and then using gzuncompress, or some other way?
Thanks
gzdecode() is not yet in PHP. But you can use the implementation from upgradephp. It really is just a few extra header bytes.
Another option would be to use gzopen. Maybe just like gzopen("data:app/bin,....") even.
Well I found my answer by reading the comments on the gzdecode page I linked in my original post. One of the users, Aaron G, provided an implementation of it and it works:
<?php
function gzdecode($data) {
$len = strlen($data);
if ($len < 18 || strcmp(substr($data,0,2),"\x1f\x8b")) {
return null; // Not GZIP format (See RFC 1952)
}
$method = ord(substr($data,2,1)); // Compression method
$flags = ord(substr($data,3,1)); // Flags
if (($flags & 31) != $flags) { // parentheses needed: != binds tighter than &
// Reserved bits are set -- NOT ALLOWED by RFC 1952
return null;
}
// NOTE: $mtime may be negative (PHP integer limitations)
$mtime = unpack("V", substr($data,4,4));
$mtime = $mtime[1];
$xfl = substr($data,8,1);
$os = substr($data,9,1); // OS byte follows XFL (RFC 1952)
$headerlen = 10;
$extralen = 0;
$extra = "";
if ($flags & 4) {
// 2-byte length prefixed EXTRA data in header
if ($len - $headerlen - 2 < 8) {
return false; // Invalid format
}
$extralen = unpack("v",substr($data,10,2)); // XLEN follows the 10-byte fixed header
$extralen = $extralen[1];
if ($len - $headerlen - 2 - $extralen < 8) {
return false; // Invalid format
}
$extra = substr($data,12,$extralen);
$headerlen += 2 + $extralen;
}
$filenamelen = 0;
$filename = "";
if ($flags & 8) {
// C-style string file NAME data in header
if ($len - $headerlen - 1 < 8) {
return false; // Invalid format
}
$filenamelen = strpos(substr($data,$headerlen),chr(0));
if ($filenamelen === false || $len - $headerlen - $filenamelen - 1 < 8) {
return false; // Invalid format
}
$filename = substr($data,$headerlen,$filenamelen);
$headerlen += $filenamelen + 1;
}
$commentlen = 0;
$comment = "";
if ($flags & 16) {
// C-style string COMMENT data in header
if ($len - $headerlen - 1 < 8) {
return false; // Invalid format
}
$commentlen = strpos(substr($data,$headerlen),chr(0));
if ($commentlen === false || $len - $headerlen - $commentlen - 1 < 8) {
return false; // Invalid header format
}
$comment = substr($data,$headerlen,$commentlen);
$headerlen += $commentlen + 1;
}
$headercrc = "";
if ($flags & 1) {
// 2-bytes (lowest order) of CRC32 on header present
if ($len - $headerlen - 2 < 8) {
return false; // Invalid format
}
$calccrc = crc32(substr($data,0,$headerlen)) & 0xffff;
$headercrc = unpack("v", substr($data,$headerlen,2));
$headercrc = $headercrc[1];
if ($headercrc != $calccrc) {
return false; // Bad header CRC
}
$headerlen += 2;
}
// GZIP FOOTER - These may be negative due to PHP's integer limitations
$datacrc = unpack("V",substr($data,-8,4));
$datacrc = $datacrc[1];
$isize = unpack("V",substr($data,-4));
$isize = $isize[1];
// Perform the decompression:
$bodylen = $len-$headerlen-8;
if ($bodylen < 1) {
// This should never happen - IMPLEMENTATION BUG!
return null;
}
$body = substr($data,$headerlen,$bodylen);
$data = "";
if ($bodylen > 0) {
switch ($method) {
case 8:
// Currently the only supported compression method:
$data = gzinflate($body);
break;
default:
// Unknown compression method
return false;
}
} else {
// I'm not sure if zero-byte body content is allowed.
// Allow it for now... Do nothing...
}
// Verify decompressed size and CRC32:
// NOTE: This may fail with large data sizes depending on how
// PHP's integer limitations affect strlen() since $isize
// may be negative for large sizes.
if ($isize != strlen($data) || crc32($data) != $datacrc) {
// Bad format! Length or CRC doesn't match!
return false;
}
return $data;
}
?>
Try gzinflate.
Did you try gzuncompress?
http://www.php.net/manual/en/function.gzuncompress.php
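On PHP 5.4 and later gzdecode() is built in, so no hand-written replacement is needed there. For older versions, the "few extra header bytes" idea can be sketched as below. The function name gunzip_string() is hypothetical, and the fallback assumes the simplest gzip layout: a 10-byte header with no optional fields, followed by the deflate data and an 8-byte trailer, which is what gzencode() produces.

```php
<?php
// Hypothetical helper: decompress a gzip-format string without external programs.
function gunzip_string(string $gz): string
{
    if (function_exists('gzdecode')) {
        // Built in since PHP 5.4; parses the full gzip header.
        $out = gzdecode($gz);
    } else {
        // Fallback sketch: skip the 10-byte fixed header and the 8-byte
        // CRC32/size trailer. Only valid when no optional header fields are set.
        $out = gzinflate(substr($gz, 10, -8));
    }

    if ($out === false) {
        throw new RuntimeException('Not valid gzip data');
    }

    return $out;
}

echo gunzip_string(gzencode('hello')), "\n";
```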

Php Convert to ISO-8859-9

I use JSON to encode an array, and I get a string like this:
{"name":"\u00fe\u00fd\u00f0\u00f6\u00e7"}
Now I need to convert this to ISO-8859-9. I tried the following but it fails:
header('Content-type: application/json; charset=ISO-8859-9');
$json = json_encode($response);
$json = utf8_decode($json);
$json = mb_convert_encoding($json, "ISO-8859-9", "auto");
echo $json;
It doesn't seem to work. What am I missing?
Thank you for your time.
You can do:
$json = json_encode($response);
header('Content-type: application/json; charset=ISO-8859-9');
echo mb_convert_encoding($json, "ISO-8859-9", "UTF-8");
This assumes that the strings in $response are in UTF-8. But I would strongly suggest that you just use UTF-8 all the way through.
Edit: Sorry, just realised that won't work, since json_encode escapes Unicode code points as JavaScript escape codes. You'll have to convert these to UTF-8 sequences first. I don't think there is any built-in functionality for that, but you can use a slightly modified variation of this library to get there. Try the following:
function unicode_hex_to_utf8($hexcode) {
$arr = array(hexdec($hexcode[1])); // the four hex digits form one code point, not two separate bytes
$dest = '';
foreach ($arr as $src) {
if ($src < 0) {
return false;
} elseif ( $src <= 0x007f) {
$dest .= chr($src);
} elseif ($src <= 0x07ff) {
$dest .= chr(0xc0 | ($src >> 6));
$dest .= chr(0x80 | ($src & 0x003f));
} elseif ($src == 0xFEFF) {
// nop -- zap the BOM
} elseif ($src >= 0xD800 && $src <= 0xDFFF) {
// found a surrogate
return false;
} elseif ($src <= 0xffff) {
$dest .= chr(0xe0 | ($src >> 12));
$dest .= chr(0x80 | (($src >> 6) & 0x003f));
$dest .= chr(0x80 | ($src & 0x003f));
} elseif ($src <= 0x10ffff) {
$dest .= chr(0xf0 | ($src >> 18));
$dest .= chr(0x80 | (($src >> 12) & 0x3f));
$dest .= chr(0x80 | (($src >> 6) & 0x3f));
$dest .= chr(0x80 | ($src & 0x3f));
} else {
// out of range
return false;
}
}
return $dest;
}
print mb_convert_encoding(
preg_replace_callback(
"~\\\\u([1234567890abcdef]{4})~", 'unicode_hex_to_utf8',
json_encode($response)),
"ISO-8859-9", "UTF-8");
As you can see on the PHP documentation site JSON encoding/decoding functions only work with utf8 encoding, so trying to change this can cause you some data problems, you may get unexpected and wrong results.
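On PHP 5.4 and later, the \uXXXX escapes can be avoided in the first place with the JSON_UNESCAPED_UNICODE flag, after which a single conversion is enough. This is a sketch under the assumption that the strings in $response are valid UTF-8; the function name json_encode_iso_8859_9() is hypothetical. Characters that do not exist in ISO-8859-9 are replaced by mbstring's substitute character (by default a question mark), and strict JSON consumers may still expect UTF-8.

```php
<?php
// Hypothetical helper: JSON-encode UTF-8 data and emit the result as ISO-8859-9 bytes.
function json_encode_iso_8859_9(array $response): string
{
    // Keep non-ASCII characters as real UTF-8 instead of \uXXXX escapes.
    $json = json_encode($response, JSON_UNESCAPED_UNICODE);
    if ($json === false) {
        throw new RuntimeException(json_last_error_msg());
    }

    // One conversion of the whole document from UTF-8 to ISO-8859-9.
    return mb_convert_encoding($json, 'ISO-8859-9', 'UTF-8');
}
```

The result can then be sent after header('Content-type: application/json; charset=ISO-8859-9'), as in the question.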
