Truncating with ascii symbols - php

i've a problem with the truncate php function..
<?php
print_r(truncate ('cia???☺☻♥♀♂☼•◘○♠♣xas?????!!!!----'));
function truncate($text) {
$length = 100;
$ending = '...';
$exact = true;
$considerHtml = false;
$stripTags = false;
$wordsLenght = 20;
$textArray = explode ( " ", $text );
foreach ( $textArray as $key => $word ) {
if (strlen ( $word ) > $wordsLenght) {
$truncatedWord = substr ( $word, 0, $wordsLenght );
$textArray [$key] = $truncatedWord . "[...]";
}
}
$text = implode ( " ", $textArray );
// end truncate long word
if (strlen ( $text ) <= $length) {
return $text;
} else {
$truncate = substr ( $text, 0, $length - mb_strlen ( $ending, 'UTF-8' ) );
}
}
// if the words shouldn't be cut in the middle...
if (! $exact) {
// ...search the last occurance of a space...
$spacepos = strrpos ( $truncate, ' ' );
if (isset ( $spacepos )) {
// ...and cut the text in this position
$truncate = substr ( $truncate, 0, $spacepos );
}
}
// add the defined ending to the text
$truncate .= $ending;
if ($considerHtml) {
// close all unclosed html-tags
foreach ( $open_tags as $tag ) {
$truncate .= '';
}
}
return $truncate;
The problem, whit the string o given, is that the truncate function doesn't work well with unicode symbols...
The result is this:
cia???☺☻♥♀��[...]
Is there a way to split correctly?
I tried in different ways but none of them works correctly... I'm going out of mind :)

Related

Check if a word can be created from a random letter string using PHP

<?php
$randomstring = 'raabccdegep';
$arraylist = array("car", "egg", "total");
?>
Above $randomstring is a string which contain some alphabet letters.
And I Have an Array called $arraylist which Contain 3 Words Such as 'car' , 'egg' , 'total'.
Now I need to check the string Using the words in array and print if the word can be created using the string.
For Example I need an Output Like.
car is possible.
egg is not possible.
total is not possible.
Also Please Check the repetition of letter. ie, beep is also possible. Because the string contains two e. But egg is not possible because there is only one g.
function find_in( $haystack, $item ) {
$match = '';
foreach( str_split( $item ) as $char ) {
if ( strpos( $haystack, $char ) !== false ) {
$haystack = substr_replace( $haystack, '', strpos( $haystack, $char ), 1 );
$match .= $char;
}
}
return $match === $item;
}
$randomstring = 'raabccdegep';
$arraylist = array( "beep", "car", "egg", "total");
foreach ( $arraylist as $item ) {
echo find_in( $randomstring, $item ) ? " $item found in $randomstring." : " $item not found in $randomstring.";
}
This should do the trick:
<?php
$randomstring = 'raabccdegep';
$arraylist = array("car", "egg", "total");
foreach($arraylist as $word){
$checkstring = $randomstring;
$beMade = true;
for( $i = 0; $i < strlen($word); $i++ ) {
$char = substr( $word, $i, 1 );
$pos = strpos($checkstring, $char);
if($pos === false){
$beMade = false;
} else {
substr_replace($checkstring, '', $i, 1);
}
}
if ($beMade){
echo $word . " is possible \n";
} else {
echo $word . " is not possible \n";
}
}
?>

How to extract keywords from bengali text using PHP

I want to extract keywords automatically from Bengali text files using php.I have this code for reading a Bengali text file.
<?php
$target_path = $_FILES['uploadedfile']['name'];
header('Content-Type: text/plain;charset=utf-8');
$fp = fopen($target_path, 'r') or die("Can't open CEDICT.");
$i = 0;
while ($line = fgets($fp, 1024))
{
print $line;
$i++;
}
fclose($fp) or die("Can't close file.");
And I found following codes to extract most common 10 keywords but it's not working for Bengali texts. What changes should I make?
function extractCommonWords($string){
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;
}
Please help :(
You should make simple changes:
replace stopwords in $stopWords array with proper Bengali stopwords
remove this string $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); because Bengali sybmols doesn't match this pattern
Full code looks like:
<?php
function extractCommonWords($string){
// replace array below with proper Bengali stopwords
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string
// remove this preg_replace because Bengali sybmols doesn't match this pattern
// $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\s.*?\s/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower(trim($item)), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = trim(strtolower($val));
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;
}
$string = <<<EOF
টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক
EOF;
var_dump(extractCommonWords($string), $string);
Output will be:
array(4) {
["বোঝে"]=>
int(2)
["টোপ"]=>
int(1)
["না"]=>
int(1)
["কেমন"]=>
int(1)
}
string(127) "টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক"

Limit WP excerpt to second sentence

I'm using this function to limit my WP excerpt to a sentence instead of just cutting it off after a number of words.
add_filter('get_the_excerpt', 'end_with_sentence');
function end_with_sentence($excerpt) {
$allowed_end = array('.', '!', '?', '...');
$exc = explode( ' ', $excerpt );
$found = false;
$last = '';
while ( ! $found && ! empty($exc) ) {
$last = array_pop($exc);
$end = strrev( $last );
$found = in_array( $end{0}, $allowed_end );
}
return (! empty($exc)) ? $excerpt : rtrim(implode(' ', $exc) . ' ' .$last);
}
Works like a charm, but I would like to limit this to two sentences. Anyone have an idea how to do this?
Your code didn't work for me for 1 sentence, but hey it's 2am here maybe I missed something. I wrote this from scratch:
add_filter('get_the_excerpt', 'end_with_sentence');
function end_with_sentence( $excerpt ) {
$allowed_ends = array('.', '!', '?', '...');
$number_sentences = 2;
$excerpt_chunk = $excerpt;
for($i = 0; $i < $number_sentences; $i++){
$lowest_sentence_end[$i] = 100000000000000000;
foreach( $allowed_ends as $allowed_end)
{
$sentence_end = strpos( $excerpt_chunk, $allowed_end);
if($sentence_end !== false && $sentence_end < $lowest_sentence_end[$i]){
$lowest_sentence_end[$i] = $sentence_end + strlen( $allowed_end );
}
$sentence_end = false;
}
$sentences[$i] = substr( $excerpt_chunk, 0, $lowest_sentence_end[$i]);
$excerpt_chunk = substr( $excerpt_chunk, $lowest_sentence_end[$i]);
}
return implode('', $sentences);
}
I see complexities in your sample code that make it (maybe) harder than it needs to be.
Regular expressions are awesome. If you want to modify this one, I'd recommend using this tool: https://regex101.com/
Here we're going to use preg_split()
function end_with_sentence( $excerpt, $number = 2 ) {
$sentences = preg_split( "/(\.|\!|\?|\...)/", $excerpt, NULL, PREG_SPLIT_DELIM_CAPTURE);
var_dump($sentences);
if (count($sentences) < $number) {
return $excerpt;
}
return implode('', array_slice($sentences, 0, ($number * 2)));
}
Usage
$excerpt = 'Sentence. Sentence! Sentence? Sentence';
echo end_with_sentence($excerpt); // "Sentence. Sentence!"
echo end_with_sentence($excerpt, 1); // "Sentence."
echo end_with_sentence($excerpt, 3); // "Sentence. Sentence! Sentence?"
echo end_with_sentence($excerpt, 4); // "Sentence. Sentence! Sentence? Sentence"
echo end_with_sentence($excerpt, 10); // "Sentence. Sentence! Sentence? Sentence"

PHP substr but keep HTML tags?

I am wondering if there is an elegant way to trim some text but while being HTML tag aware?
For example, I have this string:
$data = '<strong>some title text here that could get very long</strong>';
And let's say I need to return/output this string on a page but would like it to be no more than X characters. Let's say 35 for this example.
Then I use:
$output = substr($data,0,20);
But now I end up with:
<strong>some title text here that
which as you can see the closing strong tags are discarded thus breaking the HTML display.
Is there a way around this? Also note that it is possible to have multiple tags in the string for example:
<p>some text here <strong>and here</strong></p>
A few mounths ago I created a special function which is solution for your problem.
Here is a function:
function substr_close_tags($code, $limit = 300)
{
if ( strlen($code) <= $limit )
{
return $code;
}
$html = substr($code, 0, $limit);
preg_match_all ( "#<([a-zA-Z]+)#", $html, $result );
foreach($result[1] AS $key => $value)
{
if ( strtolower($value) == 'br' )
{
unset($result[1][$key]);
}
}
$openedtags = $result[1];
preg_match_all ( "#</([a-zA-Z]+)>#iU", $html, $result );
$closedtags = $result[1];
foreach($closedtags AS $key => $value)
{
if ( ($k = array_search($value, $openedtags)) === FALSE )
{
continue;
}
else
{
unset($openedtags[$k]);
}
}
if ( empty($openedtags) )
{
if ( strpos($code, ' ', $limit) == $limit )
{
return $html."...";
}
else
{
return substr($code, 0, strpos($code, ' ', $limit))."...";
}
}
$position = 0;
$close_tag = '';
foreach($openedtags AS $key => $value)
{
$p = strpos($code, ('</'.$value.'>'), $limit);
if ( $p === FALSE )
{
$code .= ('</'.$value.'>');
}
else if ( $p > $position )
{
$close_tag = '</'.$value.'>';
$position = $p;
}
}
if ( $position == 0 )
{
return $code;
}
return substr($code, 0, $position).$close_tag."...";
}
Here is DEMO: http://sandbox.onlinephpfunctions.com/code/899d8137c15596a8528c871543eb005984ec0201 (click "Execute code" to check how it works).
Using #newbieuser his function, I had the same issue, like #pablo-pazos, that it was (not) breaking when $limit fell into an html tag (in my case <br /> at the r)
Fixed with some code
if ( strlen($code) <= $limit ){
return $code;
}
$html = substr($code, 0, $limit);
//We must find a . or > or space so we are sure not being in a html-tag!
//In my case there are only <br>
//If you have more tags, or html formatted text, you must do a little more and also use something like http://htmlpurifier.org/demo.php
$_find_last_char = strrpos($html, ".")+1;
if($_find_last_char > $limit/3*2){
$html_break = $_find_last_char;
}else{
$_find_last_char = strrpos($html, ">")+1;
if($_find_last_char > $limit/3*2){
$html_break = $_find_last_char;
}else{
$html_break = strrpos($html, " ");
}
}
$html = substr($html, 0, $html_break);
preg_match_all ( "#<([a-zA-Z]+)#", $html, $result );
......
substr(strip_tags($content), 0, 100)

How to reverse a Unicode string

It was hinted in a comment to an answer to this question that PHP can not reverse Unicode strings.
As for Unicode, it works in PHP
because most apps process it as
binary. Yes, PHP is 8-bit clean. Try
the equivalent of this in PHP: perl
-Mutf8 -e 'print scalar reverse("ほげほげ")' You will get garbage,
not "げほげほ". – jrockway
And unfortunately it is correct that PHPs unicode support atm is at best "lacking". This will hopefully change drastically with PHP6.
PHPs MultiByte functions does provide the basic functionality you need to deal with unicode, but it is inconsistent and does lack a lot of functions. One of these is a function to reverse a string.
I of course wanted to reverse this text for no other reason then to figure out if it was possible. And I made a function to accomplish this enormous complex task of reversing this Unicode text, so you can relax a bit longer until PHP6.
Test code:
$enc = 'UTF-8';
$text = "ほげほげ";
$defaultEnc = mb_internal_encoding();
echo "Showing results with encoding $defaultEnc.\n\n";
$revNormal = strrev($text);
$revInt = mb_strrev($text);
$revEnc = mb_strrev($text, $enc);
echo "Original text is: $text .\n";
echo "Normal strrev output: " . $revNormal . ".\n";
echo "mb_strrev without encoding output: $revInt.\n";
echo "mb_strrev with encoding $enc output: $revEnc.\n";
if (mb_internal_encoding($enc)) {
echo "\nSetting internal encoding to $enc from $defaultEnc.\n\n";
$revNormal = strrev($text);
$revInt = mb_strrev($text);
$revEnc = mb_strrev($text, $enc);
echo "Original text is: $text .\n";
echo "Normal strrev output: " . $revNormal . ".\n";
echo "mb_strrev without encoding output: $revInt.\n";
echo "mb_strrev with encoding $enc output: $revEnc.\n";
} else {
echo "\nCould not set internal encoding to $enc!\n";
}
here's another approach using regex:
function utf8_strrev($str){
preg_match_all('/./us', $str, $ar);
return implode(array_reverse($ar[0]));
}
Here's another way. This seems to work without having to specify an output encoding (tested with a couple of different mb_internal_encodings):
function mb_strrev($text)
{
return join('', array_reverse(
preg_split('~~u', $text, -1, PREG_SPLIT_NO_EMPTY)
));
}
Grapheme functions handle UTF-8 string more correctly than mbstring and PCRE functions/ Mbstring and PCRE may break characters. You can see the defference between them by executing the following code.
function str_to_array($string)
{
$length = grapheme_strlen($string);
$ret = [];
for ($i = 0; $i < $length; $i += 1) {
$ret[] = grapheme_substr($string, $i, 1);
}
return $ret;
}
function str_to_array2($string)
{
$length = mb_strlen($string, "UTF-8");
$ret = [];
for ($i = 0; $i < $length; $i += 1) {
$ret[] = mb_substr($string, $i, 1, "UTF-8");
}
return $ret;
}
function str_to_array3($string)
{
return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}
function utf8_strrev($string)
{
return implode(array_reverse(str_to_array($string)));
}
function utf8_strrev2($string)
{
return implode(array_reverse(str_to_array2($string)));
}
function utf8_strrev3($string)
{
return implode(array_reverse(str_to_array3($string)));
}
// http://www.php.net/manual/en/function.grapheme-strlen.php
$string = "a\xCC\x8A" // 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5)
."o\xCC\x88"; // 'LATIN SMALL LETTER O WITH DIAERESIS' (U+00F6)
var_dump(array_map(function($elem) { return strtoupper(bin2hex($elem)); },
[
'should be' => "o\xCC\x88"."a\xCC\x8A",
'grapheme' => utf8_strrev($string),
'mbstring' => utf8_strrev2($string),
'pcre' => utf8_strrev3($string)
]));
The result is here.
array(4) {
["should be"]=>
string(12) "6FCC8861CC8A"
["grapheme"]=>
string(12) "6FCC8861CC8A"
["mbstring"]=>
string(12) "CC886FCC8A61"
["pcre"]=>
string(12) "CC886FCC8A61"
}
IntlBreakIterator can be used since PHP 5.5 (intl 3.0);
function utf8_strrev($str)
{
$it = IntlBreakIterator::createCodePointInstance();
$it->setText($str);
$ret = '';
$pos = 0;
$prev = 0;
foreach ($it as $pos) {
$ret = substr($str, $prev, $pos - $prev) . $ret;
$prev = $pos;
}
return $ret;
}
The answer
function mb_strrev($text, $encoding = null)
{
$funcParams = array($text);
if ($encoding !== null)
$funcParams[] = $encoding;
$length = call_user_func_array('mb_strlen', $funcParams);
$output = '';
$funcParams = array($text, $length, 1);
if ($encoding !== null)
$funcParams[] = $encoding;
while ($funcParams[1]--) {
$output .= call_user_func_array('mb_substr', $funcParams);
}
return $output;
}
Another method:
function mb_strrev($str, $enc = null) {
if(is_null($enc)) $enc = mb_internal_encoding();
$str = mb_convert_encoding($str, 'UTF-16BE', $enc);
return mb_convert_encoding(strrev($str), $enc, 'UTF-16LE');
}
It is easy utf8_strrev( $str ). See the relevant source code of my Library that I copied below:
function utf8_strrev( $str )
{
return implode( array_reverse( utf8_split( $str ) ) );
}
function utf8_split( $str , $split_length = 1 )
{
$str = ( string ) $str;
$ret = array( );
if( pcre_utf8_support( ) )
{
$str = utf8_clean( $str );
$ret = preg_split('/(?<!^)(?!$)/u', $str );
// \X is buggy in many recent versions of PHP
//preg_match_all( '/\X/u' , $str , $ret );
//$ret = $ret[0];
}
else
{
//Fallback
$len = strlen( $str );
for( $i = 0 ; $i < $len ; $i++ )
{
if( ( $str[$i] & "\x80" ) === "\x00" )
{
$ret[] = $str[$i];
}
else if( ( ( $str[$i] & "\xE0" ) === "\xC0" ) && ( isset( $str[$i+1] ) ) )
{
if( ( $str[$i+1] & "\xC0" ) === "\x80" )
{
$ret[] = $str[$i] . $str[$i+1];
$i++;
}
}
else if( ( ( $str[$i] & "\xF0" ) === "\xE0" ) && ( isset( $str[$i+2] ) ) )
{
if( ( ( $str[$i+1] & "\xC0" ) === "\x80" ) && ( ( $str[$i+2] & "\xC0" ) === "\x80" ) )
{
$ret[] = $str[$i] . $str[$i+1] . $str[$i+2];
$i = $i + 2;
}
}
else if( ( ( $str[$i] & "\xF8" ) === "\xF0" ) && ( isset( $str[$i+3] ) ) )
{
if( ( ( $str[$i+1] & "\xC0" ) === "\x80" ) && ( ( $str[$i+2] & "\xC0" ) === "\x80" ) && ( ( $str[$i+3] & "\xC0" ) === "\x80" ) )
{
$ret[] = $str[$i] . $str[$i+1] . $str[$i+2] . $str[$i+3];
$i = $i + 3;
}
}
}
}
if( $split_length > 1 )
{
$ret = array_chunk( $ret , $split_length );
$ret = array_map( 'implode' , $ret );
}
if( $ret[0] === '' )
{
return array( );
}
return $ret;
}
function utf8_clean( $str , $remove_bom = false )
{
$regx = '/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./s';
$str = preg_replace( $regx , '$1' , $str );
if( $remove_bom )
{
$str = utf8_str_replace( utf8_bom( ) , '' , $str );
}
return $str;
}
function utf8_str_replace( $search , $replace , $subject , &$count = 0 )
{
return str_replace( $search , $replace , $subject , $count );
}
function utf8_bom( )
{
return "\xef\xbb\xbf";
}
function pcre_utf8_support( )
{
static $support;
if( !isset( $support ) )
{
$support = #preg_match( '//u', '' );
//Cached the response
}
return $support;
}

Categories