Is there a way which checks a CSV-file for UTF-8 without BOM encoding? I want to check the whole file and not a single string.
I would try to set the first line with a special character and than reading the string and checking if it matches the same string hard-coded in my script. But I don't know if this is a good idea.
Google only showed me this. But the link in the last post isn't available.
if (mb_check_encoding(file_get_contents($file), 'UTF-8')) {
// yup, all UTF-8
}
You can also go through it line by line with fgets, if the file is large and you don't want to store it all in memory at once. Not sure what you mean by the second part of your question.
I recommand this function (from the symfony toolkit):
<?php
/**
* Checks if a string is an utf8.
*
* Yi Stone Li<yili#yahoo-inc.com>
* Copyright (c) 2007 Yahoo! Inc. All rights reserved.
* Licensed under the BSD open source license
*
* #param string
*
* #return bool true if $string is valid UTF-8 and false otherwise.
*/
public static function isUTF8($string)
{
for ($idx = 0, $strlen = strlen($string); $idx < $strlen; $idx++)
{
$byte = ord($string[$idx]);
if ($byte & 0x80)
{
if (($byte & 0xE0) == 0xC0)
{
// 2 byte char
$bytes_remaining = 1;
}
else if (($byte & 0xF0) == 0xE0)
{
// 3 byte char
$bytes_remaining = 2;
}
else if (($byte & 0xF8) == 0xF0)
{
// 4 byte char
$bytes_remaining = 3;
}
else
{
return false;
}
if ($idx + $bytes_remaining >= $strlen)
{
return false;
}
while ($bytes_remaining--)
{
if ((ord($string[++$idx]) & 0xC0) != 0x80)
{
return false;
}
}
}
}
return true;
}
But as it check all the characters of the string, I don't recommand to use it on a large file. Just check the first 10 lines i.e.
<?php
$handle = fopen("mycsv.csv", "r");
$check_string = "";
$line = 1;
if ($handle) {
while ((($buffer = fgets($handle, 4096)) !== false) && $line < 11) {
$check_string .= $buffer;
$line++;
}
if (!feof($handle)) {
echo "Error: unexpected fgets() fail\n";
}
fclose($handle);
var_dump( self::isUTF8($check_string) );
}
Related
I'm writing an attribute to an HDF5 file using UTF-8 encoding. As an example, I've written "äöüß" to the attribute "notes" in the file.
I'm now trying to parse the output of h5ls (or h5dump) to extract this data back. Either tool gives me an output like this:
ATTRIBUTE "notes" {
DATATYPE H5T_STRING {
STRSIZE 8;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
DATA {
(0): "\37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637"
}
}
I'm aware that, e.g., \37777777703\37777777644 somehow encodes ä as 0xC3 0xA4, however, I have a really hard time coming up with how this encoding works.
What's the magic formula behind this and how can I properly decode it back into äöüß?
The strings are encoded using base 8. I've decoded them in the PHP backend using:
$line = "This is the text including some UTF-8 bytes \37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637";
// extract UTF-8 Bytes
$octbytes;
preg_match_all("/\\\\37777777(\\d{3})/", $line, $octbytes);
// parse extracted Bytes
for ($m = 0; $m < count($octbytes[1]); ) {
$B = octdec($octbytes[1][$m]);
// UTF-8 may span over 2 to 4 Bytes
$numBytes;
if (($B & 0xF8) == 0xF0) { $numBytes = 4; }
else if (($B & 0xF0) == 0xE0) { $numBytes = 3; }
else if (($B & 0xE0) == 0xC0) { $numBytes = 2; }
else { $numBytes = 1; }
$hxstr = "";
$replaceStr = "";
for ($j = 0; $j < $numBytes; $j++) {
$match = $octbytes[1][$m+$j];
$dec = octdec($match) & 255;
$hx = strtoupper(dechex($dec));
$hxstr = $hxstr . $hx;
$replaceStr = $replaceStr . "\\37777777" . $match;
}
// pack extracted bytes into one hex string
$utfChar = pack("H*", $hxstr); // < this will be interpreted correctly
// replace Bytes in the input with the parsed chars
$parsedData = str_replace($replaceStr,$utfChar,$line);
// go to next byte
$m+=$numBytes;
}
echo "The parsed line: $line";
I got the problem when convert between this 2 type in PHP. This is the code I searched in google
function strToHex($string){
$hex='';
for ($i=0; $i < strlen($string); $i++){
$hex .= dechex(ord($string[$i]));
}
return $hex;
}
function hexToStr($hex){
$string='';
for ($i=0; $i < strlen($hex)-1; $i+=2){
$string .= chr(hexdec($hex[$i].$hex[$i+1]));
}
return $string;
}
I check it and found out this when I use XOR to encrypt.
I have the string "this is the test", after XOR with a key, I have the result in string ↕↑↔§P↔§P ♫§T↕§↕. After that, I tried to convert it to hex by function strToHex() and I got these 12181d15501d15500e15541215712. Then, I tested with the function hexToStr() and I have ↕↑↔§P↔§P♫§T↕§q. So, what should I do to solve this problem? Why does it wrong when I convert this 2 style value?
For people that end up here and are just looking for the hex representation of a (binary) string.
bin2hex("that's all you need");
# 74686174277320616c6c20796f75206e656564
hex2bin('74686174277320616c6c20796f75206e656564');
# that's all you need
Doc: bin2hex, hex2bin.
For any char with ord($char) < 16 you get a HEX back which is only 1 long. You forgot to add 0 padding.
This should solve it:
<?php
function strToHex($string){
$hex = '';
for ($i=0; $i<strlen($string); $i++){
$ord = ord($string[$i]);
$hexCode = dechex($ord);
$hex .= substr('0'.$hexCode, -2);
}
return strToUpper($hex);
}
function hexToStr($hex){
$string='';
for ($i=0; $i < strlen($hex)-1; $i+=2){
$string .= chr(hexdec($hex[$i].$hex[$i+1]));
}
return $string;
}
// Tests
header('Content-Type: text/plain');
function test($expected, $actual, $success) {
if($expected !== $actual) {
echo "Expected: '$expected'\n";
echo "Actual: '$actual'\n";
echo "\n";
$success = false;
}
return $success;
}
$success = true;
$success = test('00', strToHex(hexToStr('00')), $success);
$success = test('FF', strToHex(hexToStr('FF')), $success);
$success = test('000102FF', strToHex(hexToStr('000102FF')), $success);
$success = test('↕↑↔§P↔§P ♫§T↕§↕', hexToStr(strToHex('↕↑↔§P↔§P ♫§T↕§↕')), $success);
echo $success ? "Success" : "\nFailed";
PHP :
string to hex:
implode(unpack("H*", $string));
hex to string:
pack("H*", $hex);
Here's what I use:
function strhex($string) {
$hexstr = unpack('H*', $string);
return array_shift($hexstr);
}
function hexToStr($hex){
// Remove spaces if the hex string has spaces
$hex = str_replace(' ', '', $hex);
return hex2bin($hex);
}
// Test it
$hex = "53 44 43 30 30 32 30 30 30 31 37 33";
echo hexToStr($hex); // SDC002000173
/**
* Test Hex To string with PHP UNIT
* #param string $value
* #return
*/
public function testHexToString()
{
$string = 'SDC002000173';
$hex = "53 44 43 30 30 32 30 30 30 31 37 33";
$result = hexToStr($hex);
$this->assertEquals($result,$string);
}
Using #bill-shirley answer with a little addition
function str_to_hex($string) {
$hexstr = unpack('H*', $string);
return array_shift($hexstr);
}
function hex_to_str($string) {
return hex2bin("$string");
}
Usage:
$str = "Go placidly amidst the noise";
$hexstr = str_to_hex($str);// 476f20706c616369646c7920616d6964737420746865206e6f697365
$strstr = hex_to_str($str);// Go placidly amidst the noise
You can try the following code to convert the image to hex string
<?php
$image = 'sample.bmp';
$file = fopen($image, 'r') or die("Could not open $image");
while ($file && !feof($file)){
$chunk = fread($file, 1000000); # You can affect performance altering
this number. YMMV.
# This loop will be dog-slow, almost for sure...
# You could snag two or three bytes and shift/add them,
# but at 4 bytes, you violate the 7fffffff limit of dechex...
# You could maybe write a better dechex that would accept multiple bytes
# and use substr... Maybe.
for ($byte = 0; $byte < strlen($chunk); $byte++)){
echo dechex(ord($chunk[$byte]));
}
}
?>
I only have half the answer, but I hope that it is useful as it adds unicode (utf-8) support
/**
* hexadecimal to unicode character
* #param string $hex
* #return string
*/
function hex2uni($hex) {
$dec = hexdec($hex);
if($dec < 128) {
return chr($dec);
}
if($dec < 2048) {
$utf = chr(192 + (($dec - ($dec % 64)) / 64));
} else {
$utf = chr(224 + (($dec - ($dec % 4096)) / 4096));
$utf .= chr(128 + ((($dec % 4096) - ($dec % 64)) / 64));
}
return $utf . chr(128 + ($dec % 64));
}
To string
var_dump(hex2uni('e641'));
Based on: http://www.php.net/manual/en/function.chr.php#Hcom55978
iD;English [en];Chinese [zh];German [de];Hindi [hi];Hindi (TOGO) [hi_TG];Japanese [ja]
Source[local].AlarmGroup[AlarmText_02].ID[1310:90];Unwinder: Accu position difference too big. Check for laminate break;拆卷器: 蓄存器位置差过大。 检查复合片材是否中断;Laminatspeicher: Zu grosse Positionsänderung - Auf Laminatriss prüfen;290;;巻出装置: アキュムレーター位置の差が大きすぎます。 ラミネートが壊れていないか確認してください
Source[local].AlarmGroup[AlarmText_02].ID[1311:91];Unwinder: Accu level too small for auto splice;拆卷器: 自动拼接的蓄存器级别过小;Abwickler: Akku Füllstand zu klein für Autospleiss;291;;巻出装置: 自動紙継を行うにはアキュムレーターのレベルが小さすぎます
I am trying to fetch csv content as mentioned above :
The csv file is saved as Unicode Text. It has Chinese, German, Japanese Language.
I am unable to fetch foreign language in correct format.
CSV reader Code
header('Content-Type: text/html; charset=utf-8');
$row = 1;
$up_file = 'text_SHOT_S.csv';
setlocale(LC_ALL, 'en_US.UTF-8');
if (($handle = fopen($up_file, "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) {
$num = count($data);
$row++;
for ($c=0; $c < $num; $c++) {
echo $data[$c].'<br>';
}
}
fclose($handle);}
Output of the following Code:
iD戼㹲䔀渀最氀椀猀栀 嬀攀渀崀㰀牢>Chinese [zh]戼㹲䜀攀爀洀愀渀 嬀搀攀崀㰀牢>Hindi [hi]戼㹲䠀椀渀搀椀 ⠀吀伀䜀伀⤀ 嬀栀椀开吀䜀崀㰀牢>Japanese [ja] 戼㹲匀漀甀爀挀攀嬀氀漀挀愀氀崀⸀䄀氀愀爀洀䜀爀漀甀瀀嬀䄀氀愀爀洀吀攀砀琀开 ㈀崀⸀䤀䐀嬀㌀ 㨀㤀 崀㰀牢>Unwinder: Accu position difference too big. Check for laminate break戼㹲였睢桓ᩖ쐀墄桛䵖湏읝➏ə‰쀀൧࡙䝔偲⽧♦ⵔ굎㱥牢>Laminatspeicher: Zu grosse Positionsänderung - Auf Laminatriss prüfen戼㹲㈀㤀 㰀牢>戼㹲ff艹앑溈㩿 ꈀ괰ﰰ뼰ﰰ䴰湏湿䱝✰䵙夰丰縰夰Ȱ‰�촰ﰰ젰䰰쨰豘昰䐰樰䐰䬰먰赸垊昰估怰唰䐰ര㰀牢>Source[local].AlarmGroup[AlarmText_02].ID[1311:91]戼㹲唀渀眀椀渀搀攀爀㨀 䄀挀挀甀 氀攀瘀攀氀 琀漀漀 猀洀愀氀氀 昀漀爀 愀甀琀漀 猀瀀氀椀挀攀㰀牢>拆卷器: 自动拼接的蓄存器级别过小戼㹲䄀戀眀椀挀欀氀攀爀㨀 䄀欀欀甀 䘀ﰀ氀氀猀琀愀渀搀 稀甀 欀氀攀椀渀 昀ﰀ爀 䄀甀琀漀猀瀀氀攀椀猀猀㰀牢>291戼㹲㰀牢>巻出装置: 自動紙継を行うにはアキュムレーターのレベルが小さすぎます 戼㹲㰀牢
I either check garbage character or most of the content converted to Chinese.
Also tried the header('Content-Type: text/html; charset=iso-8859-1') and setlocale(LC_CTYPE, 'zh_CN.UTF-8','zh_ZH.big5');
I want the output same as CSV content.
Thanks in advance .
For reading CSV content I used PHPExcel and converted UTF-16 file into UTF-8 then it will fetch Chinese content properly.
Please refer below link for converting UTF-16 File to an UTF-8.
How to Convert an UTF-16 File to an UTF-8 file using PHP
To convert a file simply call the convert_file_to_utf8() function
and pass to it the file path of the file you wish to convert. The
function then uses the PHP function file_get_contents() to pack the
input file’s contents into a string variable which is then passed to
the main converter function which converts the string from UTF-16 to
UTF-8 encoding if necessary. Finally, it uses file_put_contents() to
stuff the resulting string back into the original file, overwriting
the original file contents.
function utf16_to_utf8($str) {
$c0 = ord($str[0]);
$c1 = ord($str[1]);
if ($c0 == 0xFE && $c1 == 0xFF) {
$be = true;
} else if ($c0 == 0xFF && $c1 == 0xFE) {
$be = false;
} else {
return $str;
}
$str = substr($str, 2);
$len = strlen($str);
$dec = '';
for ($i = 0; $i < $len; $i += 2) {
$c = ($be) ? ord($str[$i]) << 8 | ord($str[$i + 1]) :
ord($str[$i + 1]) << 8 | ord($str[$i]);
if ($c >= 0x0001 && $c <= 0x007F) {
$dec .= chr($c);
} else if ($c > 0x07FF) {
$dec .= chr(0xE0 | (($c >> 12) & 0x0F));
$dec .= chr(0x80 | (($c >> 6) & 0x3F));
$dec .= chr(0x80 | (($c >> 0) & 0x3F));
} else {
$dec .= chr(0xC0 | (($c >> 6) & 0x1F));
$dec .= chr(0x80 | (($c >> 0) & 0x3F));
}
}
return $dec;
}
function convert_file_to_utf8($csvfile) {
$utfcheck = file_get_contents($csvfile);
$utfcheck = utf16_to_utf8($utfcheck);
file_put_contents($csvfile, $utfcheck);
}
Please before read this answer, read the different coment.
Mudassir, you can see the exact charset with tortoise, with comparator of file (see img)
Your soft use not utf-8 but utf-16 encoding. If you cant change this, you can use http://php.net/manual/en/function.mb-convert-encoding.php
http://php.net/manual/fr/mbstring.supported-encodings.php
I've try with your file and this function, and it's work correctly. See the code :
header('Content-Type: text/html; charset=utf-8');
$row = 1;
$up_file = 'text_SHOT_S.csv';
setlocale(LC_ALL, 'en_US.UTF-8');
if (($handle = fopen($up_file, "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) {
$num = count($data);
$row++;
for ($c=0; $c < $num; $c++) {
// echo $data[$c].'<br>';
echo mb_convert_encoding($data[$c],'utf8','utf-16').'<br>';
}
}
fclose($handle);}
I need to be able to decompress through PHP some data that I have in a string which uses the gzip format. I need to do this via PHP, not by calling - through system for example - an external program.
I go to the documentation and I find gzdecode. Too bad it doesn't exist. Digging further through google it appears this function was implemented in PHP6, which I cannot use. (Interestingly enough gzencode exists and is working).
I believe - but I'm not sure - that the gzip format simply has some extra header data. Is there a way to uncompress it by manipulating this extra data and then using gzuncompress, or some other way?
Thanks
gzdecode() is not yet in PHP. But you can use the implementation from upgradephp. It really is just a few extra header bytes.
Another option would be to use gzopen. Maybe just like gzopen("data:app/bin,....") even.
Well I found my answer by reading the comments on the gzdecode page I linked in my original post. One of the users, Aaron G, provided an implementation of it and it works:
<?php
function gzdecode($data) {
$len = strlen($data);
if ($len < 18 || strcmp(substr($data,0,2),"\x1f\x8b")) {
return null; // Not GZIP format (See RFC 1952)
}
$method = ord(substr($data,2,1)); // Compression method
$flags = ord(substr($data,3,1)); // Flags
if ($flags & 31 != $flags) {
// Reserved bits are set -- NOT ALLOWED by RFC 1952
return null;
}
// NOTE: $mtime may be negative (PHP integer limitations)
$mtime = unpack("V", substr($data,4,4));
$mtime = $mtime[1];
$xfl = substr($data,8,1);
$os = substr($data,8,1);
$headerlen = 10;
$extralen = 0;
$extra = "";
if ($flags & 4) {
// 2-byte length prefixed EXTRA data in header
if ($len - $headerlen - 2 < 8) {
return false; // Invalid format
}
$extralen = unpack("v",substr($data,8,2));
$extralen = $extralen[1];
if ($len - $headerlen - 2 - $extralen < 8) {
return false; // Invalid format
}
$extra = substr($data,10,$extralen);
$headerlen += 2 + $extralen;
}
$filenamelen = 0;
$filename = "";
if ($flags & 8) {
// C-style string file NAME data in header
if ($len - $headerlen - 1 < 8) {
return false; // Invalid format
}
$filenamelen = strpos(substr($data,8+$extralen),chr(0));
if ($filenamelen === false || $len - $headerlen - $filenamelen - 1 < 8) {
return false; // Invalid format
}
$filename = substr($data,$headerlen,$filenamelen);
$headerlen += $filenamelen + 1;
}
$commentlen = 0;
$comment = "";
if ($flags & 16) {
// C-style string COMMENT data in header
if ($len - $headerlen - 1 < 8) {
return false; // Invalid format
}
$commentlen = strpos(substr($data,8+$extralen+$filenamelen),chr(0));
if ($commentlen === false || $len - $headerlen - $commentlen - 1 < 8) {
return false; // Invalid header format
}
$comment = substr($data,$headerlen,$commentlen);
$headerlen += $commentlen + 1;
}
$headercrc = "";
if ($flags & 1) {
// 2-bytes (lowest order) of CRC32 on header present
if ($len - $headerlen - 2 < 8) {
return false; // Invalid format
}
$calccrc = crc32(substr($data,0,$headerlen)) & 0xffff;
$headercrc = unpack("v", substr($data,$headerlen,2));
$headercrc = $headercrc[1];
if ($headercrc != $calccrc) {
return false; // Bad header CRC
}
$headerlen += 2;
}
// GZIP FOOTER - These be negative due to PHP's limitations
$datacrc = unpack("V",substr($data,-8,4));
$datacrc = $datacrc[1];
$isize = unpack("V",substr($data,-4));
$isize = $isize[1];
// Perform the decompression:
$bodylen = $len-$headerlen-8;
if ($bodylen < 1) {
// This should never happen - IMPLEMENTATION BUG!
return null;
}
$body = substr($data,$headerlen,$bodylen);
$data = "";
if ($bodylen > 0) {
switch ($method) {
case 8:
// Currently the only supported compression method:
$data = gzinflate($body);
break;
default:
// Unknown compression method
return false;
}
} else {
// I'm not sure if zero-byte body content is allowed.
// Allow it for now... Do nothing...
}
// Verifiy decompressed size and CRC32:
// NOTE: This may fail with large data sizes depending on how
// PHP's integer limitations affect strlen() since $isize
// may be negative for large sizes.
if ($isize != strlen($data) || crc32($data) != $datacrc) {
// Bad format! Length or CRC doesn't match!
return false;
}
return $data;
}
?>
Try gzinflate.
Did you tried gzuncompress?
http://www.php.net/manual/en/function.gzuncompress.php
i programmed this php function that takes any text/html string and trims it.
For example:
gen_string("Hello, how are you today?",10);
Returns:
Hello, how...
The problem arises when the function string limit is the same as the position of a special character such as: á, ñ, etc...
In which case:
gen_string("Helló my friend",5);
Returns: Hell�...
Any ideas on how to solve this issue? This is the current function:
# string: advanced substr
function gen_string($string,$min,$clean=false) {
$text = trim(strip_tags($string));
if(strlen($text)>$min) {
$blank = strpos($text,' ');
if($blank) {
# limit plus last word
$extra = strpos(substr($text,$min),' ');
$max = $min+$extra;
$r = substr($text,0,$max);
if(strlen($text)>=$max && !$clean) $r=trim($r,'.').'...';
} else {
# if there are no spaces
$r = substr($text,0,$min).'...';
}
} else {
# if original length is lower than limit
$r = $text;
}
return trim($r);
}
Thanks!
You should use the multibyte string functions to correctly handle unicode characters.
For example you could try using mb_strimwidth to truncate a string to a specified length.
You could also take a different approach and make use of the PCRE regex extension's UTF-8 capabilities (assuming your strings are UTF-8!).
function gen_string($string, $length)
{
$str = trim(strip_tags($string));
$strlen = strlen(utf8_decode($str));
// String is less than limit
if ($strlen <= $length) return $str;
// Shorten string, preserving whole "words" (non-whitespace)
preg_match('/^.{'.($length-1).'}\S*/su', $str, $match);
// Append ellipsis if needed (bytes length is OK to check)
if (strlen($match[0]) !== strlen($str)) $match[0] .= '...';
return $match[0];
}
Aside from the multibyte issue, maybe you can write it shorter
function gen_string($str, $limit) {
if ($str >= strlen($limit))
return $str;
$offset = -(strlen($str) - $limit);
return substr($str, 0, strrpos($str, ' ', $offset)).'...';
}
It will limit the length of the string, so rather than cut it after the first word beyond the limit, it ensures that the length is never larger than the limit.
strlen() cannot be used for UTF-8 string, because it would count also the continuation characters, which should not be counted.
You can try with the following code:
define('PREG_CLASS_UNICODE_WORD_BOUNDARY',
'\x{0}-\x{2F}\x{3A}-\x{40}\x{5B}-\x{60}\x{7B}-\x{A9}\x{AB}-\x{B1}\x{B4}' .
'\x{B6}-\x{B8}\x{BB}\x{BF}\x{D7}\x{F7}\x{2C2}-\x{2C5}\x{2D2}-\x{2DF}' .
'\x{2E5}-\x{2EB}\x{2ED}\x{2EF}-\x{2FF}\x{375}\x{37E}-\x{385}\x{387}\x{3F6}' .
'\x{482}\x{55A}-\x{55F}\x{589}-\x{58A}\x{5BE}\x{5C0}\x{5C3}\x{5C6}' .
'\x{5F3}-\x{60F}\x{61B}-\x{61F}\x{66A}-\x{66D}\x{6D4}\x{6DD}\x{6E9}' .
'\x{6FD}-\x{6FE}\x{700}-\x{70F}\x{7F6}-\x{7F9}\x{830}-\x{83E}' .
'\x{964}-\x{965}\x{970}\x{9F2}-\x{9F3}\x{9FA}-\x{9FB}\x{AF1}\x{B70}' .
'\x{BF3}-\x{BFA}\x{C7F}\x{CF1}-\x{CF2}\x{D79}\x{DF4}\x{E3F}\x{E4F}' .
'\x{E5A}-\x{E5B}\x{F01}-\x{F17}\x{F1A}-\x{F1F}\x{F34}\x{F36}\x{F38}' .
'\x{F3A}-\x{F3D}\x{F85}\x{FBE}-\x{FC5}\x{FC7}-\x{FD8}\x{104A}-\x{104F}' .
'\x{109E}-\x{109F}\x{10FB}\x{1360}-\x{1368}\x{1390}-\x{1399}\x{1400}' .
'\x{166D}-\x{166E}\x{1680}\x{169B}-\x{169C}\x{16EB}-\x{16ED}' .
'\x{1735}-\x{1736}\x{17B4}-\x{17B5}\x{17D4}-\x{17D6}\x{17D8}-\x{17DB}' .
'\x{1800}-\x{180A}\x{180E}\x{1940}-\x{1945}\x{19DE}-\x{19FF}' .
'\x{1A1E}-\x{1A1F}\x{1AA0}-\x{1AA6}\x{1AA8}-\x{1AAD}\x{1B5A}-\x{1B6A}' .
'\x{1B74}-\x{1B7C}\x{1C3B}-\x{1C3F}\x{1C7E}-\x{1C7F}\x{1CD3}\x{1FBD}' .
'\x{1FBF}-\x{1FC1}\x{1FCD}-\x{1FCF}\x{1FDD}-\x{1FDF}\x{1FED}-\x{1FEF}' .
'\x{1FFD}-\x{206F}\x{207A}-\x{207E}\x{208A}-\x{208E}\x{20A0}-\x{20B8}' .
'\x{2100}-\x{2101}\x{2103}-\x{2106}\x{2108}-\x{2109}\x{2114}' .
'\x{2116}-\x{2118}\x{211E}-\x{2123}\x{2125}\x{2127}\x{2129}\x{212E}' .
'\x{213A}-\x{213B}\x{2140}-\x{2144}\x{214A}-\x{214D}\x{214F}' .
'\x{2190}-\x{244A}\x{249C}-\x{24E9}\x{2500}-\x{2775}\x{2794}-\x{2B59}' .
'\x{2CE5}-\x{2CEA}\x{2CF9}-\x{2CFC}\x{2CFE}-\x{2CFF}\x{2E00}-\x{2E2E}' .
'\x{2E30}-\x{3004}\x{3008}-\x{3020}\x{3030}\x{3036}-\x{3037}' .
'\x{303D}-\x{303F}\x{309B}-\x{309C}\x{30A0}\x{30FB}\x{3190}-\x{3191}' .
'\x{3196}-\x{319F}\x{31C0}-\x{31E3}\x{3200}-\x{321E}\x{322A}-\x{3250}' .
'\x{3260}-\x{327F}\x{328A}-\x{32B0}\x{32C0}-\x{33FF}\x{4DC0}-\x{4DFF}' .
'\x{A490}-\x{A4C6}\x{A4FE}-\x{A4FF}\x{A60D}-\x{A60F}\x{A673}\x{A67E}' .
'\x{A6F2}-\x{A716}\x{A720}-\x{A721}\x{A789}-\x{A78A}\x{A828}-\x{A82B}' .
'\x{A836}-\x{A839}\x{A874}-\x{A877}\x{A8CE}-\x{A8CF}\x{A8F8}-\x{A8FA}' .
'\x{A92E}-\x{A92F}\x{A95F}\x{A9C1}-\x{A9CD}\x{A9DE}-\x{A9DF}' .
'\x{AA5C}-\x{AA5F}\x{AA77}-\x{AA79}\x{AADE}-\x{AADF}\x{ABEB}' .
'\x{D800}-\x{F8FF}\x{FB29}\x{FD3E}-\x{FD3F}\x{FDFC}-\x{FDFD}' .
'\x{FE10}-\x{FE19}\x{FE30}-\x{FE6B}\x{FEFF}-\x{FF0F}\x{FF1A}-\x{FF20}' .
'\x{FF3B}-\x{FF40}\x{FF5B}-\x{FF65}\x{FFE0}-\x{FFFD}');
function utf8_strlen($text) {
if (function_exists('mb_strlen')) {
return mb_strlen($text);
}
// Do not count UTF-8 continuation bytes.
return strlen(preg_replace("/[\x80-\xBF]/", '', $text));
}
function utf8_truncate($string, $max_length, $wordsafe = FALSE, $add_ellipsis = FALSE, $min_wordsafe_length = 1) {
$ellipsis = '';
$max_length = max($max_length, 0);
$min_wordsafe_length = max($min_wordsafe_length, 0);
if (utf8_strlen($string) <= $max_length) {
// No truncation needed, so don't add ellipsis, just return.
return $string;
}
if ($add_ellipsis) {
// Truncate ellipsis in case $max_length is small.
$ellipsis = utf8_substr('...', 0, $max_length);
$max_length -= utf8_strlen($ellipsis);
$max_length = max($max_length, 0);
}
if ($max_length <= $min_wordsafe_length) {
// Do not attempt word-safe if lengths are bad.
$wordsafe = FALSE;
}
if ($wordsafe) {
$matches = array();
// Find the last word boundary, if there is one within $min_wordsafe_length
// to $max_length characters. preg_match() is always greedy, so it will
// find the longest string possible.
$found = preg_match('/^(.{' . $min_wordsafe_length . ',' . $max_length . '})[' . PREG_CLASS_UNICODE_WORD_BOUNDARY . ']/u', $string, $matches);
if ($found) {
$string = $matches[1];
}
else {
$string = utf8_substr($string, 0, $max_length);
}
}
else {
$string = utf8_substr($string, 0, $max_length);
}
if ($add_ellipsis) {
$string .= $ellipsis;
}
return $string;
}
function utf8_substr($text, $start, $length = NULL) {
if (function_exists('mb_substr')) {
return $length === NULL ? mb_substr($text, $start) : mb_substr($text, $start, $length);
}
else {
$strlen = strlen($text);
// Find the starting byte offset.
$bytes = 0;
if ($start > 0) {
// Count all the continuation bytes from the start until we have found
// $start characters or the end of the string.
$bytes = -1;
$chars = -1;
while ($bytes < $strlen - 1 && $chars < $start) {
$bytes++;
$c = ord($text[$bytes]);
if ($c < 0x80 || $c >= 0xC0) {
$chars++;
}
}
}
elseif ($start < 0) {
// Count all the continuation bytes from the end until we have found
// abs($start) characters.
$start = abs($start);
$bytes = $strlen;
$chars = 0;
while ($bytes > 0 && $chars < $start) {
$bytes--;
$c = ord($text[$bytes]);
if ($c < 0x80 || $c >= 0xC0) {
$chars++;
}
}
}
$istart = $bytes;
// Find the ending byte offset.
if ($length === NULL) {
$iend = $strlen;
}
elseif ($length > 0) {
// Count all the continuation bytes from the starting index until we have
// found $length characters or reached the end of the string, then
// backtrace one byte.
$iend = $istart - 1;
$chars = -1;
$last_real = FALSE;
while ($iend < $strlen - 1 && $chars < $length) {
$iend++;
$c = ord($text[$iend]);
$last_real = FALSE;
if ($c < 0x80 || $c >= 0xC0) {
$chars++;
$last_real = TRUE;
}
}
// Backtrace one byte if the last character we found was a real character
// and we don't need it.
if ($last_real && $chars >= $length) {
$iend--;
}
}
elseif ($length < 0) {
// Count all the continuation bytes from the end until we have found
// abs($start) characters, then backtrace one byte.
$length = abs($length);
$iend = $strlen;
$chars = 0;
while ($iend > 0 && $chars < $length) {
$iend--;
$c = ord($text[$iend]);
if ($c < 0x80 || $c >= 0xC0) {
$chars++;
}
}
// Backtrace one byte if we are not at the beginning of the string.
if ($iend > 0) {
$iend--;
}
}
else {
// $length == 0, return an empty string.
return '';
}
return substr($text, $istart, max(0, $iend - $istart + 1));
}
}
For your return statement you could try:
return htmlspecialchars(trim($r));
EDIT: I tried your code as you provided it and it ran fine for me without having to use htmlspecialchars(). This is probably due to the face that in the <head> of the page the code was running on, the charset was set to UTF-8. So your options could be to set the encoding of the page like this:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
or to use htmlspecialchars() as above.