I've a string like this :
%d8%b7%d8%b1%d8%a7%d8%ad%db%8c-%d8%a7%d9%be%d9%84%db%8c%da%a9%db%8c%d8%b4%d9%86-%d9%81%d8%b1%d9%88%d8%b4%da%af%d8%a7%d9%87%db%8c
the meta tag of page is set to utf-8
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
i want to convert this unicode to pure readable utf-8 string
I've tested lots of code ,thie is my last code :
function convertFarsi($str) {
return html_entity_decode(preg_replace('/\\\\u([a-f0-9]{4})/i', '&#x$1;', $str),ENT_QUOTES, 'UTF-8');
}
and it doesn't work.
How can I convert these unicode to utf8 string ?
You can use url_decode to get the following result:
<?php
$string = '%d8%b7%d8%b1%d8%a7%d8%ad%db%8c-%d8%a7%d9%be%d9%84%db%8c%da%a9%db%8c%d8%b4%d9%86-%d9%81%d8%b1%d9%88%d8%b4%da%af%d8%a7%d9%87%db%8c';
$outpout = urldecode($string);
echo $outpout; // طراحی-اپلیکیشن-فروشگاهی
This function doesn't decode unicode characters. I wrote a function that does.
function unicode_urldecode($url)
{
preg_match_all('/%u([[:alnum:]]{4})/', $url, $a);
foreach ($a[1] as $uniord)
{
$dec = hexdec($uniord);
$utf = '';
if ($dec < 128)
{
$utf = chr($dec);
}
else if ($dec < 2048)
{
$utf = chr(192 + (($dec - ($dec % 64)) / 64));
$utf .= chr(128 + ($dec % 64));
}
else
{
$utf = chr(224 + (($dec - ($dec % 4096)) / 4096));
$utf .= chr(128 + ((($dec % 4096) - ($dec % 64)) / 64));
$utf .= chr(128 + ($dec % 64));
}
$url = str_replace('%u'.$uniord, $utf, $url);
}
return urldecode($url);
}
Source
Demo
This seems to do it:
<?php
$s = '%d8%b7%d8%b1%d8%a7%d8%ad%db%8c-%d8%a7%d9%be%d9%84%db%8c%da%a9%db%8c%d8%b4%d9%86-%d9%81%d8%b1%d9%88%d8%b4%da%af%d8%a7%d9%87%db%8c';
$t = urldecode($s);
var_dump($t == 'طراحی-اپلیکیشن-فروشگاهی');
https://php.net/function.urldecode
Related
I got the problem when convert between this 2 type in PHP. This is the code I searched in google
function strToHex($string){
$hex='';
for ($i=0; $i < strlen($string); $i++){
$hex .= dechex(ord($string[$i]));
}
return $hex;
}
function hexToStr($hex){
$string='';
for ($i=0; $i < strlen($hex)-1; $i+=2){
$string .= chr(hexdec($hex[$i].$hex[$i+1]));
}
return $string;
}
I check it and found out this when I use XOR to encrypt.
I have the string "this is the test", after XOR with a key, I have the result in string ↕↑↔§P↔§P ♫§T↕§↕. After that, I tried to convert it to hex by function strToHex() and I got these 12181d15501d15500e15541215712. Then, I tested with the function hexToStr() and I have ↕↑↔§P↔§P♫§T↕§q. So, what should I do to solve this problem? Why does it wrong when I convert this 2 style value?
For people that end up here and are just looking for the hex representation of a (binary) string.
bin2hex("that's all you need");
# 74686174277320616c6c20796f75206e656564
hex2bin('74686174277320616c6c20796f75206e656564');
# that's all you need
Doc: bin2hex, hex2bin.
For any char with ord($char) < 16 you get a HEX back which is only 1 long. You forgot to add 0 padding.
This should solve it:
<?php
function strToHex($string){
$hex = '';
for ($i=0; $i<strlen($string); $i++){
$ord = ord($string[$i]);
$hexCode = dechex($ord);
$hex .= substr('0'.$hexCode, -2);
}
return strToUpper($hex);
}
function hexToStr($hex){
$string='';
for ($i=0; $i < strlen($hex)-1; $i+=2){
$string .= chr(hexdec($hex[$i].$hex[$i+1]));
}
return $string;
}
// Tests
header('Content-Type: text/plain');
function test($expected, $actual, $success) {
if($expected !== $actual) {
echo "Expected: '$expected'\n";
echo "Actual: '$actual'\n";
echo "\n";
$success = false;
}
return $success;
}
$success = true;
$success = test('00', strToHex(hexToStr('00')), $success);
$success = test('FF', strToHex(hexToStr('FF')), $success);
$success = test('000102FF', strToHex(hexToStr('000102FF')), $success);
$success = test('↕↑↔§P↔§P ♫§T↕§↕', hexToStr(strToHex('↕↑↔§P↔§P ♫§T↕§↕')), $success);
echo $success ? "Success" : "\nFailed";
PHP :
string to hex:
implode(unpack("H*", $string));
hex to string:
pack("H*", $hex);
Here's what I use:
function strhex($string) {
$hexstr = unpack('H*', $string);
return array_shift($hexstr);
}
function hexToStr($hex){
// Remove spaces if the hex string has spaces
$hex = str_replace(' ', '', $hex);
return hex2bin($hex);
}
// Test it
$hex = "53 44 43 30 30 32 30 30 30 31 37 33";
echo hexToStr($hex); // SDC002000173
/**
* Test Hex To string with PHP UNIT
* #param string $value
* #return
*/
public function testHexToString()
{
$string = 'SDC002000173';
$hex = "53 44 43 30 30 32 30 30 30 31 37 33";
$result = hexToStr($hex);
$this->assertEquals($result,$string);
}
Using #bill-shirley answer with a little addition
function str_to_hex($string) {
$hexstr = unpack('H*', $string);
return array_shift($hexstr);
}
function hex_to_str($string) {
return hex2bin("$string");
}
Usage:
$str = "Go placidly amidst the noise";
$hexstr = str_to_hex($str);// 476f20706c616369646c7920616d6964737420746865206e6f697365
$strstr = hex_to_str($str);// Go placidly amidst the noise
You can try the following code to convert the image to hex string
<?php
$image = 'sample.bmp';
$file = fopen($image, 'r') or die("Could not open $image");
while ($file && !feof($file)){
$chunk = fread($file, 1000000); # You can affect performance altering
this number. YMMV.
# This loop will be dog-slow, almost for sure...
# You could snag two or three bytes and shift/add them,
# but at 4 bytes, you violate the 7fffffff limit of dechex...
# You could maybe write a better dechex that would accept multiple bytes
# and use substr... Maybe.
for ($byte = 0; $byte < strlen($chunk); $byte++)){
echo dechex(ord($chunk[$byte]));
}
}
?>
I only have half the answer, but I hope that it is useful as it adds unicode (utf-8) support
/**
* hexadecimal to unicode character
* #param string $hex
* #return string
*/
function hex2uni($hex) {
$dec = hexdec($hex);
if($dec < 128) {
return chr($dec);
}
if($dec < 2048) {
$utf = chr(192 + (($dec - ($dec % 64)) / 64));
} else {
$utf = chr(224 + (($dec - ($dec % 4096)) / 4096));
$utf .= chr(128 + ((($dec % 4096) - ($dec % 64)) / 64));
}
return $utf . chr(128 + ($dec % 64));
}
To string
var_dump(hex2uni('e641'));
Based on: http://www.php.net/manual/en/function.chr.php#Hcom55978
I want to convert hindi / Devanagari text for example "आए थे पर्यटक, खुद ही बह ग" into Unicode escaped characters like "\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917".
I am developing a hindi website and i have seen most of sites are using Escaped Unicode sequence inside their meta tags and schema.org.
So i decided to give it a try.
i can see Hindi AKA Devanagari letters with their Escaped Unicode sequence at http://www.endmemo.com/unicode/devanagari.php
and i have also seen a tool which works the same https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php
but i cannot find any way to convert these Devanagari letters into Escaped Unicode sequence via php.
I have tried few things but nothing is working and i am not getting much help from google because all articles / forums are talking to decoding unicode escape sequence to unicode but none of them is taking about encoding..
header( 'Content-Type: text/html; charset=utf-8' );
function encode2($str) {
$str = mb_convert_encoding($str , 'UTF-32', 'UTF-8');
$t = unpack("N*", $str);
$t = array_map(function($n) { return "&#$n;"; }, $t);
return implode("", $t);
}
$message = "आए थे पर्यटक, खुद ही बह गए";
$message_convert = encode2($message);
echo $message_convert;
echo "fdfdfdfdfdfdfd<br/>";
echo mb_convert_encoding($message, "HTML-ENTITIES", "auto");
I want this "आए थे पर्यटक, खुद ही बह ग" to "\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917"
Please help!
as suggest by #paskl i tried:
$message = "आए थे पर्यटक, खुद ही बह गए";
$unicode = json_encode($message)
echo $unicode;
And i got ""\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917\u090f""
I hope it will help others who want to convert devanagari/hindi letters into Escaped Unicode sequence with php on their website.
Thanks to #paskl
Unless you're looking to transmit this data as JSON I wouldn't really recommend using json_encode() as it will wrap your output in literal double quotes that you'd need to strip back off. However there's not an easy way to encode unicode escapes in PHP in a way that is memory-efficient.
That said, here is the not-easy code:
// PHP < 7.2
// https://github.com/symfony/polyfill-mbstring/blob/master/Mbstring.php#L708-L730
if( ! function_exists("mb_ord") ) {
function mb_ord($s) {
if (1 === \strlen($s)) {
return \ord($s);
}
$code = ($s = unpack('C*', substr($s, 0, 4))) ? $s[1] : 0;
if (0xF0 <= $code) {
return (($code - 0xF0) << 18) + (($s[2] - 0x80) << 12) + (($s[3] - 0x80) << 6) + $s[4] - 0x80;
}
if (0xE0 <= $code) {
return (($code - 0xE0) << 12) + (($s[2] - 0x80) << 6) + $s[3] - 0x80;
}
if (0xC0 <= $code) {
return (($code - 0xC0) << 6) + $s[2] - 0x80;
}
return $code;
}
}
function ord2seqlen($ord) {
if($ord < 128){
return 1;
} else if($ord < 224) {
return 2;
} else if($ord < 240) {
return 3;
} else if($ord < 248) {
return 4;
} else {
throw new \Exception("No support for 5 or 6 byte sequences.");
}
}
function utf8_seq_iter($input) {
for($i=0,$c=strlen($input); $i<$c; ) {
$bytes = ord2seqlen(ord($input[$i]));
yield substr($input, $i, $bytes);
$i += $bytes;
}
}
function escape_codepoint($codepoint, $skip_low=true) {
$ord = mb_ord($codepoint);
if( $skip_low && $ord < 128 ) {
return $codepoint;
} else {
return sprintf("\\u%04x", $ord);
}
}
$input = "आए थे पर्यटक, खुद ही बह गए";
$output = '';
foreach( utf8_seq_iter($input) as $codepoint ) {
$output .= escape_codepoint($codepoint);
}
var_dump($output);
Output:
string(121) "\u0906\u090f \u0925\u0947 \u092a\u0930\u094d\u092f\u091f\u0915, \u0916\u0941\u0926 \u0939\u0940 \u092c\u0939 \u0917\u090f"
Edit: I've turned this into a small composer package available here:
https://packagist.org/packages/wrossmann/utf8_escape
I am trying to show a characters html entity
echo htmlentities(htmlentities("&"));
//outputs &
echo htmlentities(htmlentities("<"));
//outputs <
but it does not seem to work with emoji
echo htmlentities(htmlentities("😎"));
//outputs 😎
How can I get it to output 😎?
Edit:
I am trying to display a string input by the user with all of the html entities encoded.
echo htmlentities(htmlentities($input))
Example:
"this & that 😎" -> "this & that 😎"
This works for regular HTML entities, UTF-8 emoticons (and other utf stuff) as well as regular strings of course.
I was just having trouble with empty string value, so I had to put this condition into the function.
function entities( $string ) {
$stringBuilder = "";
$offset = 0;
if ( empty( $string ) ) {
return "";
}
while ( $offset >= 0 ) {
$decValue = ordutf8( $string, $offset );
$char = unichr($decValue);
$htmlEntited = htmlentities( $char );
if( $char != $htmlEntited ){
$stringBuilder .= $htmlEntited;
} elseif( $decValue >= 128 ){
$stringBuilder .= "&#" . $decValue . ";";
} else {
$stringBuilder .= $char;
}
}
return $stringBuilder;
}
// source - http://php.net/manual/en/function.ord.php#109812
function ordutf8($string, &$offset) {
$code = ord(substr($string, $offset,1));
if ($code >= 128) { //otherwise 0xxxxxxx
if ($code < 224) $bytesnumber = 2; //110xxxxx
else if ($code < 240) $bytesnumber = 3; //1110xxxx
else if ($code < 248) $bytesnumber = 4; //11110xxx
$codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0);
for ($i = 2; $i <= $bytesnumber; $i++) {
$offset ++;
$code2 = ord(substr($string, $offset, 1)) - 128; //10xxxxxx
$codetemp = $codetemp*64 + $code2;
}
$code = $codetemp;
}
$offset += 1;
if ($offset >= strlen($string)) $offset = -1;
return $code;
}
// source - http://php.net/manual/en/function.chr.php#88611
function unichr($u) {
return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
}
/* ---- */
var_dump( entities( "&" ) ) . "\n";
var_dump( entities( "<" ) ) . "\n";
var_dump( entities( "😎" ) ) . "\n";
var_dump( entities( "☚" ) ) . "\n";
var_dump( entities( "" ) ) . "\n";
var_dump( entities( "A" ) ) . "\n";
var_dump( entities( "Hello 😎 world" ) ) . "\n";
var_dump( entities( "this & that 😎" ) ) . "\n";
$emoji = "\xF0\x9F\x98\x8E"; // its your emoji
I get this callback from convert unicode to html entities hex
$hex = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
$char = current($m);
$utf = iconv('UTF-8', 'UCS-4', $char);
return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $emoji);
echo $hex;
echo json_encode(("\xF0\x9F\x98\x8E")); // its decoded. htmlentities doesn't work with it.
Is this OK ?
htmlentities documentation states that
all characters which have HTML character entity equivalents are
translated into these entities.
Your emoji does not have an equivalent like < is for <, so it doesn't get converted. 😎 is just an HTML code, not an HTML entity.
function htmlEntitiesOrCode($string) {
//try htmlentities first
$result = htmlentities($string, ENT_COMPAT, "UTF-8");
//if the output is different from input, an entity was returned
if ($result != $string) {
return $result;
}
//get the html code
$offset = 0;
$code = ord(substr($string, $offset,1));
if ($code >= 128) {
if ($code < 224) {
$bytesnumber = 2;
} else if ($code < 240) {
$bytesnumber = 3;
} else if ($code < 248) {
$bytesnumber = 4;
}
$codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0);
for ($i = 2; $i <= $bytesnumber; $i++) {
$offset ++;
$code2 = ord(substr($string, $offset, 1)) - 128;
$codetemp = $codetemp*64 + $code2;
}
$code = $codetemp;
}
$offset += 1;
if ($offset >= strlen($string)) {
$offset = -1;
}
$result = "&#" . $code;
return $result;
}
HTML code function taken from here: http://php.net/manual/en/function.ord.php#109812
I was using emojione to convert emoticons but there is problem.
When someone upload emoticon from mobile then something like, \ud83d\ude0c\ud83d\ude0c\ud83d\ude0c this unicode.
emojione doesn't convert this type of code.
Can anybody help me to convert this code or suggest me to use any other package
I have done # last.
$str = '\ud83d\ude0c\ud83d\ude0c\ud83d\ude0c';
$regex = '/\\\u([dD][89abAB][\da-fA-F]{2})\\\u([dD][c-fC-F][\da-fA-F]{2})
|\\\u([\da-fA-F]{4})/sx';
echo preg_replace_callback($regex, function($matches) {
if (isset($matches[3])) {
$cp = hexdec($matches[3]);
} else {
$lead = hexdec($matches[1]);
$trail = hexdec($matches[2]);
// http://unicode.org/faq/utf_bom.html#utf16-4
$cp = ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00;
}
// https://tools.ietf.org/html/rfc3629#section-3
// Characters between U+D800 and U+DFFF are not allowed in UTF-8
if ($cp > 0xD7FF && 0xE000 > $cp) {
$cp = 0xFFFD;
}
// https://github.com/php/php-src/blob/php-5.6.4/ext/standard/html.c#L471
// php_utf32_utf8(unsigned char *buf, unsigned k)
if ($cp < 0x80) {
return chr($cp);
} else if ($cp < 0xA0) {
return chr(0xC0 | $cp >> 6) . chr(0x80 | $cp & 0x3F);
}
return html_entity_decode('&#' . $cp . ';');
}, $str);
output will be:
😌😌😌
i programmed this php function that takes any text/html string and trims it.
For example:
gen_string("Hello, how are you today?",10);
Returns:
Hello, how...
The problem arises when the function string limit is the same as the position of a special character such as: á, ñ, etc...
In which case:
gen_string("Helló my friend",5);
Returns: Hell�...
Any ideas on how to solve this issue? This is the current function:
# string: advanced substr
function gen_string($string,$min,$clean=false) {
$text = trim(strip_tags($string));
if(strlen($text)>$min) {
$blank = strpos($text,' ');
if($blank) {
# limit plus last word
$extra = strpos(substr($text,$min),' ');
$max = $min+$extra;
$r = substr($text,0,$max);
if(strlen($text)>=$max && !$clean) $r=trim($r,'.').'...';
} else {
# if there are no spaces
$r = substr($text,0,$min).'...';
}
} else {
# if original length is lower than limit
$r = $text;
}
return trim($r);
}
Thanks!
You should use the multibyte string functions to correctly handle unicode characters.
For example you could try using mb_strimwidth to truncate a string to a specified length.
You could also take a different approach and make use of the PCRE regex extension's UTF-8 capabilities (assuming your strings are UTF-8!).
function gen_string($string, $length)
{
$str = trim(strip_tags($string));
$strlen = strlen(utf8_decode($str));
// String is less than limit
if ($strlen <= $length) return $str;
// Shorten string, preserving whole "words" (non-whitespace)
preg_match('/^.{'.($length-1).'}\S*/su', $str, $match);
// Append ellipsis if needed (bytes length is OK to check)
if (strlen($match[0]) !== strlen($str)) $match[0] .= '...';
return $match[0];
}
Aside from the multibyte issue, maybe you can write it shorter
function gen_string($str, $limit) {
if ($str >= strlen($limit))
return $str;
$offset = -(strlen($str) - $limit);
return substr($str, 0, strrpos($str, ' ', $offset)).'...';
}
It will limit the length of the string, so rather than cut it after the first word beyond the limit, it ensures that the length is never larger than the limit.
strlen() cannot be used for UTF-8 string, because it would count also the continuation characters, which should not be counted.
You can try with the following code:
define('PREG_CLASS_UNICODE_WORD_BOUNDARY',
'\x{0}-\x{2F}\x{3A}-\x{40}\x{5B}-\x{60}\x{7B}-\x{A9}\x{AB}-\x{B1}\x{B4}' .
'\x{B6}-\x{B8}\x{BB}\x{BF}\x{D7}\x{F7}\x{2C2}-\x{2C5}\x{2D2}-\x{2DF}' .
'\x{2E5}-\x{2EB}\x{2ED}\x{2EF}-\x{2FF}\x{375}\x{37E}-\x{385}\x{387}\x{3F6}' .
'\x{482}\x{55A}-\x{55F}\x{589}-\x{58A}\x{5BE}\x{5C0}\x{5C3}\x{5C6}' .
'\x{5F3}-\x{60F}\x{61B}-\x{61F}\x{66A}-\x{66D}\x{6D4}\x{6DD}\x{6E9}' .
'\x{6FD}-\x{6FE}\x{700}-\x{70F}\x{7F6}-\x{7F9}\x{830}-\x{83E}' .
'\x{964}-\x{965}\x{970}\x{9F2}-\x{9F3}\x{9FA}-\x{9FB}\x{AF1}\x{B70}' .
'\x{BF3}-\x{BFA}\x{C7F}\x{CF1}-\x{CF2}\x{D79}\x{DF4}\x{E3F}\x{E4F}' .
'\x{E5A}-\x{E5B}\x{F01}-\x{F17}\x{F1A}-\x{F1F}\x{F34}\x{F36}\x{F38}' .
'\x{F3A}-\x{F3D}\x{F85}\x{FBE}-\x{FC5}\x{FC7}-\x{FD8}\x{104A}-\x{104F}' .
'\x{109E}-\x{109F}\x{10FB}\x{1360}-\x{1368}\x{1390}-\x{1399}\x{1400}' .
'\x{166D}-\x{166E}\x{1680}\x{169B}-\x{169C}\x{16EB}-\x{16ED}' .
'\x{1735}-\x{1736}\x{17B4}-\x{17B5}\x{17D4}-\x{17D6}\x{17D8}-\x{17DB}' .
'\x{1800}-\x{180A}\x{180E}\x{1940}-\x{1945}\x{19DE}-\x{19FF}' .
'\x{1A1E}-\x{1A1F}\x{1AA0}-\x{1AA6}\x{1AA8}-\x{1AAD}\x{1B5A}-\x{1B6A}' .
'\x{1B74}-\x{1B7C}\x{1C3B}-\x{1C3F}\x{1C7E}-\x{1C7F}\x{1CD3}\x{1FBD}' .
'\x{1FBF}-\x{1FC1}\x{1FCD}-\x{1FCF}\x{1FDD}-\x{1FDF}\x{1FED}-\x{1FEF}' .
'\x{1FFD}-\x{206F}\x{207A}-\x{207E}\x{208A}-\x{208E}\x{20A0}-\x{20B8}' .
'\x{2100}-\x{2101}\x{2103}-\x{2106}\x{2108}-\x{2109}\x{2114}' .
'\x{2116}-\x{2118}\x{211E}-\x{2123}\x{2125}\x{2127}\x{2129}\x{212E}' .
'\x{213A}-\x{213B}\x{2140}-\x{2144}\x{214A}-\x{214D}\x{214F}' .
'\x{2190}-\x{244A}\x{249C}-\x{24E9}\x{2500}-\x{2775}\x{2794}-\x{2B59}' .
'\x{2CE5}-\x{2CEA}\x{2CF9}-\x{2CFC}\x{2CFE}-\x{2CFF}\x{2E00}-\x{2E2E}' .
'\x{2E30}-\x{3004}\x{3008}-\x{3020}\x{3030}\x{3036}-\x{3037}' .
'\x{303D}-\x{303F}\x{309B}-\x{309C}\x{30A0}\x{30FB}\x{3190}-\x{3191}' .
'\x{3196}-\x{319F}\x{31C0}-\x{31E3}\x{3200}-\x{321E}\x{322A}-\x{3250}' .
'\x{3260}-\x{327F}\x{328A}-\x{32B0}\x{32C0}-\x{33FF}\x{4DC0}-\x{4DFF}' .
'\x{A490}-\x{A4C6}\x{A4FE}-\x{A4FF}\x{A60D}-\x{A60F}\x{A673}\x{A67E}' .
'\x{A6F2}-\x{A716}\x{A720}-\x{A721}\x{A789}-\x{A78A}\x{A828}-\x{A82B}' .
'\x{A836}-\x{A839}\x{A874}-\x{A877}\x{A8CE}-\x{A8CF}\x{A8F8}-\x{A8FA}' .
'\x{A92E}-\x{A92F}\x{A95F}\x{A9C1}-\x{A9CD}\x{A9DE}-\x{A9DF}' .
'\x{AA5C}-\x{AA5F}\x{AA77}-\x{AA79}\x{AADE}-\x{AADF}\x{ABEB}' .
'\x{D800}-\x{F8FF}\x{FB29}\x{FD3E}-\x{FD3F}\x{FDFC}-\x{FDFD}' .
'\x{FE10}-\x{FE19}\x{FE30}-\x{FE6B}\x{FEFF}-\x{FF0F}\x{FF1A}-\x{FF20}' .
'\x{FF3B}-\x{FF40}\x{FF5B}-\x{FF65}\x{FFE0}-\x{FFFD}');
function utf8_strlen($text) {
if (function_exists('mb_strlen')) {
return mb_strlen($text);
}
// Do not count UTF-8 continuation bytes.
return strlen(preg_replace("/[\x80-\xBF]/", '', $text));
}
function utf8_truncate($string, $max_length, $wordsafe = FALSE, $add_ellipsis = FALSE, $min_wordsafe_length = 1) {
$ellipsis = '';
$max_length = max($max_length, 0);
$min_wordsafe_length = max($min_wordsafe_length, 0);
if (utf8_strlen($string) <= $max_length) {
// No truncation needed, so don't add ellipsis, just return.
return $string;
}
if ($add_ellipsis) {
// Truncate ellipsis in case $max_length is small.
$ellipsis = utf8_substr('...', 0, $max_length);
$max_length -= utf8_strlen($ellipsis);
$max_length = max($max_length, 0);
}
if ($max_length <= $min_wordsafe_length) {
// Do not attempt word-safe if lengths are bad.
$wordsafe = FALSE;
}
if ($wordsafe) {
$matches = array();
// Find the last word boundary, if there is one within $min_wordsafe_length
// to $max_length characters. preg_match() is always greedy, so it will
// find the longest string possible.
$found = preg_match('/^(.{' . $min_wordsafe_length . ',' . $max_length . '})[' . PREG_CLASS_UNICODE_WORD_BOUNDARY . ']/u', $string, $matches);
if ($found) {
$string = $matches[1];
}
else {
$string = utf8_substr($string, 0, $max_length);
}
}
else {
$string = utf8_substr($string, 0, $max_length);
}
if ($add_ellipsis) {
$string .= $ellipsis;
}
return $string;
}
function utf8_substr($text, $start, $length = NULL) {
if (function_exists('mb_substr')) {
return $length === NULL ? mb_substr($text, $start) : mb_substr($text, $start, $length);
}
else {
$strlen = strlen($text);
// Find the starting byte offset.
$bytes = 0;
if ($start > 0) {
// Count all the continuation bytes from the start until we have found
// $start characters or the end of the string.
$bytes = -1;
$chars = -1;
while ($bytes < $strlen - 1 && $chars < $start) {
$bytes++;
$c = ord($text[$bytes]);
if ($c < 0x80 || $c >= 0xC0) {
$chars++;
}
}
}
elseif ($start < 0) {
// Count all the continuation bytes from the end until we have found
// abs($start) characters.
$start = abs($start);
$bytes = $strlen;
$chars = 0;
while ($bytes > 0 && $chars < $start) {
$bytes--;
$c = ord($text[$bytes]);
if ($c < 0x80 || $c >= 0xC0) {
$chars++;
}
}
}
$istart = $bytes;
// Find the ending byte offset.
if ($length === NULL) {
$iend = $strlen;
}
elseif ($length > 0) {
// Count all the continuation bytes from the starting index until we have
// found $length characters or reached the end of the string, then
// backtrace one byte.
$iend = $istart - 1;
$chars = -1;
$last_real = FALSE;
while ($iend < $strlen - 1 && $chars < $length) {
$iend++;
$c = ord($text[$iend]);
$last_real = FALSE;
if ($c < 0x80 || $c >= 0xC0) {
$chars++;
$last_real = TRUE;
}
}
// Backtrace one byte if the last character we found was a real character
// and we don't need it.
if ($last_real && $chars >= $length) {
$iend--;
}
}
elseif ($length < 0) {
// Count all the continuation bytes from the end until we have found
// abs($start) characters, then backtrace one byte.
$length = abs($length);
$iend = $strlen;
$chars = 0;
while ($iend > 0 && $chars < $length) {
$iend--;
$c = ord($text[$iend]);
if ($c < 0x80 || $c >= 0xC0) {
$chars++;
}
}
// Backtrace one byte if we are not at the beginning of the string.
if ($iend > 0) {
$iend--;
}
}
else {
// $length == 0, return an empty string.
return '';
}
return substr($text, $istart, max(0, $iend - $istart + 1));
}
}
For your return statement you could try:
return htmlspecialchars(trim($r));
EDIT: I tried your code as you provided it and it ran fine for me without having to use htmlspecialchars(). This is probably due to the face that in the <head> of the page the code was running on, the charset was set to UTF-8. So your options could be to set the encoding of the page like this:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
or to use htmlspecialchars() as above.