Shorten/truncate UTF8 string in PHP


I need a good fast function that shortens strings to a set length with UTF8 support. Adding trailing '...' at ends is a plus. Can anyone help?

Assuming mb_* functions installed.
function truncate($str, $length, $append = '…') {
    $strLength = mb_strlen($str);
    if ($strLength <= $length) {
        return $str;
    }
    return mb_substr($str, 0, $length) . $append;
}
Keep in mind this will add one character (the ellipsis) beyond $length. If you want $append counted within the truncated length, subtract mb_strlen($append) from the length you pass to mb_substr().
Obviously, this will also chop in the middle of words.
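A minimal sketch of that variant, where the appended ellipsis counts toward the limit. The function name truncate_within is made up for illustration, and the encoding is assumed to be UTF-8. If $length is smaller than the length of $append, the result can still exceed $length.

```php
<?php
// Hypothetical helper (not from the answer above): the returned string,
// including $append, is at most $length characters long.
function truncate_within(string $str, int $length, string $append = '…'): string
{
    if (mb_strlen($str, 'UTF-8') <= $length) {
        return $str;
    }
    // Reserve room for the appended text; never go below zero.
    $keep = max(0, $length - mb_strlen($append, 'UTF-8'));
    return mb_substr($str, 0, $keep, 'UTF-8') . $append;
}

echo truncate_within('Zażółć gęślą jaźń', 10), "\n";
```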
Update
Here is a version that can optionally preserve whole words...
function truncate($str, $length, $breakWords = TRUE, $append = '…') {
    $strLength = mb_strlen($str);
    if ($strLength <= $length) {
        return $str;
    }
    if ( ! $breakWords) {
        // The /u modifier makes \pL match multibyte UTF-8 letters.
        while ($length < $strLength AND preg_match('/^\pL$/u', mb_substr($str, $length, 1))) {
            $length++;
        }
    }
    return mb_substr($str, 0, $length) . $append;
}
If the third argument is FALSE, it extends the cut through all letter characters up to the first non-letter character, so words are not broken.

I guess you need to truncate text, so this may be helpful:
if (!function_exists('truncate_string')) {
    function truncate_string($string, $max_length) {
        if (mb_strlen($string, 'UTF-8') > $max_length) {
            $string = mb_substr($string, 0, $max_length, 'UTF-8');
            // The third argument of mb_strrpos() is an integer offset.
            $pos = mb_strrpos($string, ' ', 0, 'UTF-8');
            if ($pos === false) {
                // No space to break on: fall back to a hard cut.
                return mb_substr($string, 0, $max_length, 'UTF-8') . '…';
            }
            return mb_substr($string, 0, $pos, 'UTF-8') . '…';
        } else {
            return $string;
        }
    }
}
This is something like what @alex just posted, but it does not break words.

Try this:
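$length = 100;
if (mb_strlen($text, "utf-8") > $length) {
    $cut = mb_substr($text, 0, $length, "utf-8");
    // The third argument of mb_strrpos() is the offset; the encoding is the fourth.
    $last_space = mb_strrpos($cut, " ", 0, "utf-8");
    // Fall back to the hard cut if there is no space to break on.
    $text = ($last_space === false ? $cut : mb_substr($cut, 0, $last_space, "utf-8")) . " ...";
}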
Cheers...

Related

cut text after html a ending tag

I have a comment system on my website, and some users write very long comments, longer than 500 chars. I need to cut them after 200 chars and add a "see more" option. The problem is that users can use <a>test</a> tags, and in some cases the 200-char limit cuts a tag in the middle, like <a>t or <a or <a>test</. If any of these cases happens, the limit should extend to the end of the HTML tag, e.g. <a>test</a>
I have this code:
function truncate($string,$length=200,$append="…") {
$string = trim($string);
if(strlen($string) > $length) {
$string = wordwrap($string, $length);
$string = explode("\n", $string, 2);
$string = $string[0] . $append;
}
return $string;
}
Any idea how to make this?
Thanks

Well, I think I did it. If anyone has any suggestion, feel free to modify this answer or comment.
function cut_text($string, $length = 350, $append = "…")
{
$string = trim($string);
$string_length = strlen($string);
$original_string = $string;
if ($string_length > $length) {
$remaining_chars = $string_length - $length;
if (strpos($string, '<') !== false && strpos($string, '>') !== false) {
$string = wordwrap($string, $length);
$string = explode("\n", $string, 2);
$string = $string[0] . $append;
$fillimi = substr_count($string, '<');
$fundi = substr_count($string, '>');
if ($fillimi == $fundi) {
$string = $string;
} else {
$i = 1;
while ($i <= $remaining_chars) {
$string = wordwrap($original_string, $length + $i);
$string = explode("\n", $string, 2);
$new_remaining_chars = $string_length - ($length + $i);
if ($new_remaining_chars > 0) {
$string = $string[0] . $append;
} else {
$string = $string[0];
}
$fillimi = substr_count($string, '<');
$fundi = substr_count($string, '>');
if ($fillimi == $fundi) {
$string = $string;
break;
}
$i++;
}
}
} else {
$string = trim($string);
$string = wordwrap($string, $length);
$string = explode("\n", $string, 2);
$string = $string[0] . $append;
}
}
return $string;
}

I think this should already exist somewhere on the Internet, but I wasn't able to find it. What you basically need to do is count the opened and closed tags: if more tags have been opened than closed at the cut point, a tag is still open and you can't cut there yet. Counting the opened and closed tags is easy and should push you in the right direction.
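A rough sketch of that counting idea, assuming only <a> tags need to be handled, as in the question. The helper names anchors_balanced and cut_after_anchor are made up for illustration. The cut point is moved forward one character at a time until no tag is left unfinished and every opened <a> has been closed.

```php
<?php
// Hypothetical helper: true when the fragment does not end inside a tag
// and every opened <a ...> has a matching </a>.
function anchors_balanced(string $fragment): bool
{
    // A '<' with no '>' after it means the cut fell inside a tag.
    if (preg_match('/<[^>]*$/', $fragment)) {
        return false;
    }
    $opened = preg_match_all('/<a\b[^>]*>/i', $fragment);
    $closed = preg_match_all('/<\/a\s*>/i', $fragment);
    return $opened === $closed;
}

// Hypothetical helper: cut at $length characters, extending the cut until
// the <a> tags are balanced. Not optimized; it re-scans for every step.
function cut_after_anchor(string $text, int $length = 200, string $append = '…'): string
{
    $total = mb_strlen($text, 'UTF-8');
    if ($total <= $length) {
        return $text;
    }
    for ($n = $length; $n < $total; $n++) {
        $cut = mb_substr($text, 0, $n, 'UTF-8');
        if (anchors_balanced($cut)) {
            return $cut . $append;
        }
    }
    return $text; // no safe cut point before the end of the text
}

echo cut_after_anchor('Hello <a href="#">world</a> and more text', 10), "\n";
```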

mb_stripos() in PHP won't work correctly

This code:
setlocale(LC_ALL, 'pl_PL', 'pl', 'Polish_Poland.28592');
$result = mb_stripos("ĘÓĄŚŁŻŹĆŃ",'ęóąśłżźćń');
returns false;
How to fix that?
P.S. The answer to "stripos returns false when special characters is used" is not the correct answer here.
UPDATE: I made a test:
function test() {
$search = "zawór"; $searchlen=strlen($search);
$opentag="<valve>"; $opentaglen=strlen($opentag);
$closetag="</valve>"; $closetaglen=strlen($closetag);
$test[0]['input']="test ZAWÓR test"; //normal test
$test[1]['input']="X\nX\nX ZAWÓR X\nX\nX"; //white char test
$test[2]['input']="<br> ZAWÓR <br>"; //html newline test
$test[3]['input']="ĄąĄą ZAWÓR ĄąĄą"; //polish diacritical test
$test[4]['input']="テスト ZAWÓR テスト"; //japanese katakana test
foreach ($test as $key => $val) {
$position = mb_stripos($val['input'],$search,0,'UTF-8');
if($position!=false) {
$output = $val['input'];
$output = substr_replace($output, $opentag, $position, 0);
$output = substr_replace($output, $closetag, $position+$opentaglen+$searchlen, 0);
$test[$key]['output'] = $output;
}
else {
$test[$key]['output'] = null;
}
}
return $test;
}
FIREFOX OUTPUT:
$test[0]['output'] == "test <valve>ZAWÓR</valve> test" // ok
$test[1]['output'] == "X\nX\nX <valve>ZAWÓR</valve> X\nX\nX" // ok
$test[2]['output'] == "<br> <valve>ZAWÓR</valve> <br>" // ok
$test[3]['output'] == "Ąą�<valve>�ą ZA</valve>WÓR ĄąĄą" // WTF??
$test[4]['output'] == "テ�<valve>��ト </valve>ZAWÓR テスト" // WTF??
Solution https://drupal.org/node/1107268 does not change anything.

The function works fine when told what encoding your strings are in:
var_dump(mb_stripos("ĘÓĄŚŁŻŹĆŃ",'ęóąśłżźćń', 0, 'UTF-8')); // 0 (note the explicit 'UTF-8' argument)
Without the explicit encoding argument, it may assume the wrong encoding and cannot treat your string correctly.
The problem with your test code is that you're mixing character-based indices with byte-offset-based indices. mb_strpos returns offsets in characters, while substr_replace works with byte offsets. Read about the topic here: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
If you want to wrap a certain word in tags in a multi-byte string, I'd rather suggest this approach:
preg_replace('/zawór/iu', '<valve>$0</valve>', $text)
Note that $text must be UTF-8 encoded, /u regular expressions only work with UTF-8.
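If you do want to keep substr_replace(), the character offset has to be translated into a byte offset first. The sketch below is one way to do that, not the only one. The function name wrap_first_match is made up, and it assumes UTF-8 input and that the matched text has the same number of characters as the search term.

```php
<?php
// Hypothetical helper: wrap the first case-insensitive match of $search,
// converting mb_stripos()'s character offset into a byte offset.
function wrap_first_match(string $text, string $search, string $open, string $close): string
{
    $charPos = mb_stripos($text, $search, 0, 'UTF-8');
    if ($charPos === false) { // strict check: position 0 is a valid match
        return $text;
    }
    $charLen = mb_strlen($search, 'UTF-8');
    // Byte offset = byte length of everything before the match.
    $bytePos = strlen(mb_substr($text, 0, $charPos, 'UTF-8'));
    // Byte length of the match as it appears in $text (case may differ).
    $byteLen = strlen(mb_substr($text, $charPos, $charLen, 'UTF-8'));
    $match = substr($text, $bytePos, $byteLen);
    return substr_replace($text, $open . $match . $close, $bytePos, $byteLen);
}

echo wrap_first_match('ĄąĄą ZAWÓR ĄąĄą', 'zawór', '<valve>', '</valve>'), "\n";
```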

I'm not sure why the mb_stripos function did not work, but this workaround should:
$str = mb_convert_case("ęóąśłżźćń", MB_CASE_UPPER, "UTF-8");
$result = mb_strrichr($str,"ĘÓĄŚŁŻŹĆŃ");
var_dump($result);

Using your tip, dear Rikesh, I wrote that:
function patched_mb_stripos($content,$search) {
$content=mb_convert_case($content, MB_CASE_LOWER, "UTF-8");
$search=mb_convert_case($search, MB_CASE_LOWER, "UTF-8");
return mb_strpos($content, $search, 0, "UTF-8");
}
and it seems to work :)

Solution from https://gist.github.com/stemar/8287074 :
function mb_substr_replace($string, $replacement, $start, $length=NULL) {
if (is_array($string)) {
$num = count($string);
// $replacement
$replacement = is_array($replacement) ? array_slice($replacement, 0, $num) : array_pad(array($replacement), $num, $replacement);
// $start
if (is_array($start)) {
$start = array_slice($start, 0, $num);
foreach ($start as $key => $value)
$start[$key] = is_int($value) ? $value : 0;
}
else {
$start = array_pad(array($start), $num, $start);
}
// $length
if (!isset($length)) {
$length = array_fill(0, $num, 0);
}
elseif (is_array($length)) {
$length = array_slice($length, 0, $num);
foreach ($length as $key => $value)
$length[$key] = isset($value) ? (is_int($value) ? $value : $num) : 0;
}
else {
$length = array_pad(array($length), $num, $length);
}
// Recursive call
return array_map(__FUNCTION__, $string, $replacement, $start, $length);
}
preg_match_all('/./us', (string)$string, $smatches);
preg_match_all('/./us', (string)$replacement, $rmatches);
if ($length === NULL) $length = mb_strlen($string);
array_splice($smatches[0], $start, $length, $rmatches[0]);
return join("",$smatches[0]);
}
solves the problem with function test()

substr_replace function returns weird symbols along with the string

I've a variable with some string in it, for example:
$var = "myText";
What I want to do is to "inject" a single quotation mark (') before the last character, so the output will be:
myTex't
I've got this code:
$var = "myText";
$var = substr_replace($var, "'", strlen($var)-1, 0);
echo $var;
And it works well. The only problem is that when I try it with another language (Hebrew in this case), I get additional characters. For instance, for this input:
עברית I'm expecting result of: עברי'ת but instead, I'm getting this as a result: עברי�'�
Any ideas?
P.S. Hebrew is Right to Left language

You are using a multibyte string, and substr_replace is not multibyte compatible.
Here is a version that mimics the behavior of substr_replace() exactly: (From substr_replace PHP Manual user comment)
<?php
if (function_exists('mb_substr_replace') === false)
{
function mb_substr_replace($string, $replacement, $start, $length = null, $encoding = null)
{
if (extension_loaded('mbstring') === true)
{
$string_length = (is_null($encoding) === true) ? mb_strlen($string) : mb_strlen($string, $encoding);
if ($start < 0)
{
$start = max(0, $string_length + $start);
}
else if ($start > $string_length)
{
$start = $string_length;
}
if ($length < 0)
{
$length = max(0, $string_length - $start + $length);
}
else if ((is_null($length) === true) || ($length > $string_length))
{
$length = $string_length;
}
if (($start + $length) > $string_length)
{
$length = $string_length - $start;
}
if (is_null($encoding) === true)
{
return mb_substr($string, 0, $start) . $replacement . mb_substr($string, $start + $length, $string_length - $start - $length);
}
return mb_substr($string, 0, $start, $encoding) . $replacement . mb_substr($string, $start + $length, $string_length - $start - $length, $encoding);
}
return (is_null($length) === true) ? substr_replace($string, $replacement, $start) : substr_replace($string, $replacement, $start, $length);
}
}
?>

This happens because you are working with Unicode multibyte strings. substr_replace() works byte-wise. So if you insert at the last byte, you may split the last character in two (if it is a multibyte character).
You can use preg_replace instead of substr_replace(); it is Unicode safe if you pass the u option:
preg_replace('~(.)$~u', '\'$1', $string);
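The same thing can also be done with mb_substr(), which counts characters rather than bytes. This is only a sketch: the function name insert_before_last_char is made up, and UTF-8 is assumed as the default encoding.

```php
<?php
// Hypothetical helper: insert $insert before the last character of $s,
// splitting on characters instead of bytes.
function insert_before_last_char(string $s, string $insert, string $encoding = 'UTF-8'): string
{
    if ($s === '') {
        return $insert;
    }
    $head = mb_substr($s, 0, -1, $encoding);   // everything except the last character
    $last = mb_substr($s, -1, null, $encoding); // the last character only
    return $head . $insert . $last;
}

echo insert_before_last_char('myText', "'"), "\n"; // myTex't
echo insert_before_last_char('עברית', "'"), "\n";
```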

Remove random generated characters from a string to display the word in clear text

I have this function to put some random characters into a string:
function random($string) {
$chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
$shuffle_start = substr(str_shuffle($chars), 0, 6);
$shuffle_end = substr(str_shuffle($chars), 0, 6);
$letters = str_split($string);
$str = '';
$count = count($letters);
foreach($letters AS $l) {
$count--;
$str .= $l;
if($count) {
$str .= substr(str_shuffle($chars), 0, 5);
}
}
return $shuffle_start . $str . $shuffle_end;
}
For the string "hello", this function prints something like: aApi3VhKJrDjeAbCkalprX7ll7N0Qjo3qymiw. Now I want to remove the random characters from the string so that the word "hello" can be seen clearly.
How can I do this?

Just work backwards. Strip 6 characters from the start and the end, and then take every sixth character:
function unrandom($str){
$base = substr($str, 6, strlen($str)-12);
$ret = '';
for($i=0;$i < strlen($base); $i+=6) {
$ret .= substr($base, $i,1);
}
return $ret;
}
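The same idea can be written with str_split(), which may be easier to read: every 6-byte chunk starts with one original character. This is just an alternative sketch with a made-up name (unrandom_chunks). Like random() itself, it works on bytes, so it assumes single-byte input such as ASCII.

```php
<?php
// Hypothetical variant of unrandom(): drop the 6-byte prefix and suffix,
// then keep the first byte of every 6-byte chunk.
function unrandom_chunks(string $str): string
{
    $base = substr($str, 6, -6);
    if ($base === '') {
        return '';
    }
    $out = '';
    foreach (str_split($base, 6) as $chunk) {
        $out .= $chunk[0];
    }
    return $out;
}

echo unrandom_chunks('aApi3VhKJrDjeAbCkalprX7ll7N0Qjo3qymiw'), "\n"; // hello
```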

Cut strings (UTF-8) (PHP)

How do I cut a string in UTF-8?
I found this function on the web:
function cutString($str, $lenght = 100, $end = ' …', $charset = 'UTF-8', $token = '~') {
$str = strip_tags($str);
if (mb_strlen($str, $charset) >= $lenght) {
$wrap = wordwrap($str, $lenght, $token);
$str_cut = mb_substr($wrap, 0, mb_strpos($wrap, $token, 0, $charset), $charset);
return $str_cut .= $end;
} else {
return $str;
}
}
But the result of this function isn't very good: if we ask it to cut at 200 letters, it returns about 110, but I need about 200.

I have just tested it and it works fine. If you run it with
echo cutString($mystring, 200);
It returns 201 characters from the string I gave it.

I think wordwrap() is the problem in this case: it counts bytes, not characters.
Cut the string manually instead. I use a function like this (just add the 'mb_' prefix and $charset to the string functions):
function str_cut_end_by_word($s, $max_len, $trailer = "...")
{
if (strlen($s) <= $max_len)
return $s;
$s = trim($s);
$s = substr($s, 0, $max_len);
for ($i = strlen($s) - 1; $i >= 0; $i--)
{
if (in_array($s[$i], array(" ", "\t", "\r", "\n")))
{
return rtrim(substr($s, 0, $i)).$trailer;
}
}
return $s;
}
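A minimal multibyte-aware sketch of the same function, along the lines of "add mb_ and $charset". The name mb_cut_end_by_word and the default '…' trailer are made up for illustration. Unlike the function above, it still appends the trailer when no whitespace is found.

```php
<?php
// Hypothetical helper: cut to at most $maxLen characters, then back up to
// the last whitespace so the final word is not broken.
function mb_cut_end_by_word(string $s, int $maxLen, string $trailer = '…', string $charset = 'UTF-8'): string
{
    $s = trim($s);
    if (mb_strlen($s, $charset) <= $maxLen) {
        return $s;
    }
    $cut = mb_substr($s, 0, $maxLen, $charset);
    // Greedy match up to the last whitespace; /u treats the subject as UTF-8.
    if (preg_match('/^(.*)\s\S*$/us', $cut, $m)) {
        return rtrim($m[1]) . $trailer;
    }
    // No whitespace to break on: fall back to the hard cut.
    return $cut . $trailer;
}

echo mb_cut_end_by_word('Привет мир, как дела', 9), "\n";
```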
