I have a string in Arabic like:
على احمد يوسف
Now I need to cut this string and output it like:
...على احمد يو
I tried this function:
function short_name($str, $limit) {
if ($limit < 3) {
$limit = 3;
}
if (strlen($str) > $limit) {
if (preg_match('/\p{Arabic}/u', $str)) {
return substr($str, 0, $limit - 3) . '...';
}
else {
return '...'.substr($str, 0, $limit - 3);
}
}
else {
return $str;
}
}
The problem is that sometimes it displays a symbol like this at the end of the string:
...�على احمد يو
Why does this happen?
The symbol displayed after the cut is the result of substr() cutting in the middle of a multibyte character, which leaves an invalid byte sequence.
You need to use the Multibyte String Functions to handle Arabic strings, such as mb_strlen() and mb_substr().
You also need to make sure the internal encoding for those functions is set to UTF-8. You can set this globally at the top of your script:
mb_internal_encoding('UTF-8');
Which leads to this:
strlen('على احمد يوسف') returns 24, the size in octets
mb_strlen('على احمد يوسف') returns 13, the size in characters
Note that mb_strlen('على احمد يوسف') would also return 24 if the internal encoding were a single-byte one such as ISO-8859-1 (the default on older PHP versions; since PHP 5.6 it follows default_charset, which is UTF-8).
Answer:
return '...'.mb_substr($str, 0, $limit - 3, "UTF-8"); // UTF-8 is optional
Background:
Arabic cannot be represented in ISO 8859-1, which is an 8-bit character set. substr() works on 8-bit units (bytes). Characters with code points higher than 255 (Arabic, Cyrillic, Korean, etc.) need more bits, for example 16 or sometimes even 32, and in UTF-8 an Arabic letter takes two bytes. When you cut at a byte offset such as $limit - 3, the cut can land inside a character and leave an undisplayable byte sequence. Especially if you're going to use a lot of multibyte strings, make sure you use the correct string functions, such as mb_strlen() and mb_substr().
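Putting the points above together, the function from the question could be rewritten roughly as follows. This is only a sketch: the name short_name_mb is made up for illustration, and it always appends the ellipsis at the logical end of the string rather than reproducing the original branch on Arabic text.

```php
<?php
// Hypothetical rewrite of short_name() that counts characters, not bytes.
function short_name_mb(string $str, int $limit): string
{
    // Keep room for the three dots
    $limit = max($limit, 3);

    // Nothing to cut if the string already fits
    if (mb_strlen($str, 'UTF-8') <= $limit) {
        return $str;
    }

    // mb_substr() cuts on character boundaries, so no partial bytes remain
    return mb_substr($str, 0, $limit - 3, 'UTF-8') . '...';
}

echo short_name_mb('على احمد يوسف', 10), "\n";
```

The result is always valid UTF-8 and at most $limit characters long, so the replacement symbol no longer appears.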
Try this function:
public static function shorten_arabic_text($text, $length)
{
    mb_internal_encoding('UTF-8');
    $out = mb_strlen($text) > $length ? mb_substr($text, 0, $length) . " ..." : $text;
    return $out;
}
Related
I'm trying to get about 200 letters/chars (including spaces) from an external text file. I have the code to display the text, which I'll include, but I have no idea how to get only a certain number of letters. Once again, I'm not talking about lines; I really mean letters.
<?php
$file = "Nieuws/NieuwsTest.txt";
echo file_get_contents($file) . '<br /><br />';
?>
Use the fifth parameter of file_get_contents:
$s = file_get_contents('file', false, null, 0, 200);
This reads 200 bytes, not 200 characters. It works with single-byte (256-character) encodings but will not work correctly with multi-byte characters, since PHP strings are byte sequences and PHP does not offer native Unicode support, unfortunately.
Unicode
In order to read specific number of Unicode characters, you will need to implement your own function using PHP extensions such as intl and mbstring. For example, a version of fread accepting the maximum number of UTF-8 characters can be implemented as follows:
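function utf8_fread($handle, $length = null) {
    if ($length > 0) {
        // A UTF-8 character takes at most 4 bytes
        $string = fread($handle, $length * 4);
        return $string ? mb_substr($string, 0, $length, 'UTF-8') : false;
    }
    // fread() requires a length, so read the rest of the stream this way
    return stream_get_contents($handle);
}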
If $length is positive, the function reads the maximum number of bytes that a UTF-8 string of that number of characters can take (a UTF-8 character is represented as 1 to 4 8-bit bytes), and extracts the first $length multi-byte characters using mb_substr. Otherwise, the function reads the entire file.
A UTF-8 version of file_get_contents can be implemented in similar manner:
function utf8_file_get_contents(...$args) {
if (!empty($args[4])) {
$maxlen = $args[4];
$args[4] *= 4;
$string = call_user_func_array('file_get_contents', $args);
return $string ? mb_substr($string, 0, $maxlen) : false;
}
return call_user_func_array('file_get_contents', $args);
}
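As a usage illustration of the same idea, here is a small self-contained sketch. The helper name read_first_chars and the temporary file are assumptions made for the example; the approach (read up to four bytes per wanted character, then trim with mb_substr) is the one described above.

```php
<?php
// Hypothetical helper: return the first $chars UTF-8 characters of a file.
function read_first_chars(string $path, int $chars)
{
    // A UTF-8 character is at most 4 bytes, so $chars * 4 bytes always suffice
    $bytes = file_get_contents($path, false, null, 0, $chars * 4);
    if ($bytes === false) {
        return false;
    }
    // Keep only the first $chars characters; anything after them,
    // including a character split by the byte limit, is discarded
    return mb_substr($bytes, 0, $chars, 'UTF-8');
}

// Demo with a throwaway file containing multibyte text
$path = tempnam(sys_get_temp_dir(), 'utf8');
file_put_contents($path, "héllo wörld, こんにちは");
echo read_first_chars($path, 7), "\n";
unlink($path);
```

Note that this reads more bytes than strictly needed, which is fine for a one-off read but matters if you continue reading from the same stream afterwards.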
You could use substr().
But I recommend the multibyte-safe mb_substr():
$text = mb_substr( file_get_contents($file), 0, 200 ) . '<br /><br />';
With substr() you will run into trouble if the text contains accented characters and the like. Those problems will not happen with mb_substr().
Use this:
<?php
$file = "Nieuws/NieuwsTest.txt";
echo substr( file_get_contents($file), 0, 200 ) . '<br /><br />';
?>
I'm trying to use PHP's similar_text() with Arabic, but it's not working.
However, it works great with English.
<?php
$var = similar_text("ياسر", "عمار", $per);
echo $var;
?>
Output: 5
That's the wrong result; it should be 2. Is there a similar_text() that works with Arabic letters?
Here's one I'm using
//from http://www.phperz.com/article/14/1029/31806.html
function mb_split_str($str) {
preg_match_all("/./u", $str, $arr);
return $arr[0];
}
//based on http://www.phperz.com/article/14/1029/31806.html, added percent
function mb_similar_text($str1, $str2, &$percent) {
    // Compare the sets of distinct characters in each string
    $arr_1 = array_unique(mb_split_str($str1));
    $arr_2 = array_unique(mb_split_str($str2));
    // Number of distinct characters of $str2 that also occur in $str1
    $similarity = count($arr_2) - count(array_diff($arr_2, $arr_1));
    // Use character counts, not byte counts, for the percentage
    $percent = ($similarity * 200) / (mb_strlen($str1, 'UTF-8') + mb_strlen($str2, 'UTF-8'));
    return $similarity;
}
So
$var = mb_similar_text('عمار', 'ياسر', $per);
output: $var = 2, $per = 50
Because Arabic text consists of multibyte characters, byte-oriented PHP functions (such as similar_text()) cannot be used on it directly.
echo(strlen("عمار"));
The above code outputs: 8
echo(mb_strlen("عمار", "UTF-8"));
Using the mb_strlen function with the UTF-8 encoding specified, the output is: 4 (the correct number of characters).
You can use the mb_ functions to make your own version of the similar_text function: http://php.net/manual/en/ref.mbstring.php
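One possible way to do that is sketched below. It follows the same longest-common-substring idea that similar_text() uses, but compares whole characters instead of bytes. The function names mb_similar_chars and mb_similar_text_chars are made up for this example, the input is assumed to be valid UTF-8, and the result is not guaranteed to match similar_text() in every edge case.

```php
<?php
// Hypothetical helper: count matching characters between two character arrays
// by finding the longest common run, then recursing on both sides of it.
function mb_similar_chars(array $a, array $b): int
{
    $na = count($a);
    $nb = count($b);
    $max = 0;
    $posA = 0;
    $posB = 0;
    for ($i = 0; $i < $na; $i++) {
        for ($j = 0; $j < $nb; $j++) {
            $k = 0;
            while ($i + $k < $na && $j + $k < $nb && $a[$i + $k] === $b[$j + $k]) {
                $k++;
            }
            if ($k > $max) {
                $max = $k;
                $posA = $i;
                $posB = $j;
            }
        }
    }
    if ($max === 0) {
        return 0;
    }
    return $max
        + mb_similar_chars(array_slice($a, 0, $posA), array_slice($b, 0, $posB))
        + mb_similar_chars(array_slice($a, $posA + $max), array_slice($b, $posB + $max));
}

// Hypothetical character-based counterpart of similar_text()
function mb_similar_text_chars(string $s1, string $s2, &$percent = null): int
{
    // Split into UTF-8 characters; fall back to an empty list on invalid input
    $a = preg_split('//u', $s1, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $b = preg_split('//u', $s2, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $sim = mb_similar_chars($a, $b);
    $total = count($a) + count($b);
    $percent = $total > 0 ? $sim * 200 / $total : 0;
    return $sim;
}

echo mb_similar_text_chars("ياسر", "عمار", $per), "\n"; // 2
```

For the two names from the question this returns 2, the value the asker expected.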
Just for the record, and hopefully to be of some help, I want to clarify how the similar_text() function behaves when it is given multi-byte strings (including Arabic ones).
The function simply treats each byte of the input string as an individual character (which means it supports neither multi-byte characters nor Unicode).
The byte streams of the عمار and ياسر strings, as UTF-8, are shown below (bytes are in hexadecimal, separated by . within a character and by : between characters):
D8.B9:D9.85:D8.A7:D8.B1 <-- Byte stream for عمار
D9.8A:D8.A7:D8.B3:D8.B1 <-- Byte stream for ياسر
similar_text() first finds the longest common run of bytes, D8.A7.D8 (three bytes), and then repeats the search on the pieces to the left and right of that run, which adds D9 on the left and B1 on the right. That makes five matching bytes, and that's the reason why the function returns 5 in this case (every two hexadecimal digits represent a byte).
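To inspect the bytes that similar_text() actually compares, something like the following can be used. It is a sketch that writes the strings with \u{...} escapes so the result does not depend on how the file is saved; the last line shows the UTF-16BE form, where each pair of bytes is simply the code point (06 39 and so on).

```php
<?php
$a = "\u{0639}\u{0645}\u{0627}\u{0631}"; // عمار
$b = "\u{064A}\u{0627}\u{0633}\u{0631}"; // ياسر

// UTF-8 bytes as seen by byte-oriented functions
echo bin2hex($a), "\n"; // d8b9d985d8a7d8b1
echo bin2hex($b), "\n"; // d98ad8a7d8b3d8b1

// similar_text() counts matching bytes, not characters
echo similar_text($a, $b), "\n"; // 5

// The same text as UTF-16BE: two bytes per character here
echo bin2hex(mb_convert_encoding($a, 'UTF-16BE', 'UTF-8')), "\n"; // 0639064506270631
```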
Sorry for the title, I really didn't know how to say this...
I often have a string that needs to be cut after X characters. My problem is that this string often contains special characters such as &egrave;
So I'm wondering: is there a way to know in PHP, without transforming my string, whether I am cutting in the middle of a special character?
Example
This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact
So right now my result with a substring would be:
This is my string with a special char : &egra
but I want to have something like this:
This is my string with a special char : &egrave;
The best thing to do here is store your string as UTF-8 without any html entities, and use the mb_* family of functions with utf8 as the encoding.
But, if your string is ASCII or iso-8859-1/win1252, you can use the special HTML-ENTITIES encoding of the mbstring library (note that this encoding is deprecated as of PHP 8.2):
$s = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
echo mb_substr($s, 0, 40, 'HTML-ENTITIES');
echo mb_substr($s, 0, 41, 'HTML-ENTITIES');
However, if your underlying string is UTF-8 or some other multibyte encoding, using HTML-ENTITIES is not safe! This is because HTML-ENTITIES really means "win1252 with high-bit characters as html entities". This is an example of where this can go wrong:
// Assuming that é is in utf8:
mb_substr('é ', 0, 2, 'HTML-ENTITIES') === 'é'
// should be 'é '
When your string is in a multibyte encoding, you must instead convert all html entities to a common encoding before you split. E.g.:
$strings_actual_encoding = 'utf8';
$s_noentities = html_entity_decode($s, ENT_QUOTES, $strings_actual_encoding);
$s_trunc_noentities = mb_substr($s_noentities, 0, 41, $strings_actual_encoding);
The best solution would be to store your text as UTF-8 instead of storing it as HTML entities. Other than that, if you don't mind the count being off (&egrave; counts as one character instead of 8), then the following snippet should work:
<?php
$string = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
// Decode the entities, cut by characters, then re-encode
$cut_string = htmlentities(mb_substr(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), 0, 45, 'UTF-8'), ENT_QUOTES, 'UTF-8')."<br><br>";
Note: If you use a different function to encode the text (e.g. htmlspecialchars()), then use that function instead of htmlentities(), and its inverse (e.g. htmlspecialchars_decode()) instead of html_entity_decode(). The same goes for a custom encoder: decode with the function that does the opposite, cut, then re-encode with your custom function.
The longest HTML 4 entity is 10 characters long, including the ampersand and semicolon (HTML5 defines longer ones). If you intend to cut the string at X bytes, check bytes X-9 through X-1 for an ampersand. If the corresponding semicolon appears at byte X or later, cut the string after the semicolon instead of after byte X.
However, if you're willing to preprocess the string, Mike's solution will be more accurate because his cuts the string at X characters, not bytes.
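The byte-window check described in this answer could look roughly like the sketch below. The function name cut_without_splitting_entity and the $maxEntityLen parameter are assumptions for illustration, positions are 0-based, and every & is assumed to start an entity.

```php
<?php
// Hypothetical helper: cut after $x bytes, but extend the cut to the closing
// ';' when byte $x would fall inside an entity such as &egrave;
function cut_without_splitting_entity(string $text, int $x, int $maxEntityLen = 10): string
{
    if (strlen($text) <= $x) {
        return $text;
    }

    // Only the last ($maxEntityLen - 1) bytes before the cut can hold the
    // '&' of an entity that is still open at the cut
    $start = max(0, $x - ($maxEntityLen - 1));
    $amp = strrpos(substr($text, $start, $x - $start), '&');
    if ($amp === false) {
        return substr($text, 0, $x);
    }
    $amp += $start;

    // Find the semicolon that closes this entity
    $semi = strpos($text, ';', $amp);
    if ($semi === false || $semi < $x) {
        // Entity never closes, or closes before the cut: cut as planned
        return substr($text, 0, $x);
    }

    // The cut would split the entity: keep it whole
    return substr($text, 0, $semi + 1);
}

echo cut_without_splitting_entity('a special char : &egrave; - and more', 21), "\n";
```

As the answer notes, this counts bytes and entity source characters, so the visible length of the result can vary.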
You can use html_entity_decode() first to decode all the HTML entities. Then split your string. Then htmlentities() to re-encode the entities.
$decoded_string = html_entity_decode($original_string);
// implement logic to split string here
// then for each string part do the following:
$encoded_string_part = htmlentities($split_string_part);
A somewhat brute-force solution that I'm not really happy with would be a PCRE expression. Let's say that you want to keep 80 characters and the longest possible HTML entity is 7 chars long:
$regex = '~^(.{73}([^&]{7}|.{0,7}$|[^&]{0,6}&[^;]+;))(.*)~mx';
// Note, this could return a bit of shorter text
return preg_replace($regex, '$1', $text);
Just so you know:
.{73} - 73 characters
[^&]{7} - okay, we may fill it with anything that doesn't contain &
.{0,7}$ - keep in mind the possible end (this shouldn't be necessary because shorter text wouldn't match at all)
[^&]{0,6}&[^;]+; - up to 6 characters (you'd be at 79th), then & and let it finish
Something that seems much better, but requires a bit of play with numbers, is:
// If $text is no longer than $N bytes, there is nothing to cut :)
if (strlen($text) <= $N) {
    return $text;
}
$head = substr($text, 0, $N);
// Get the last & within the first $N bytes
$pos = strrpos($head, '&');
// We're not young anymore, we have to check this too (no entities at all) :)
if ($pos === false) {
    return $head;
}
// Get the last ; within the first $N bytes
$end = strrpos($head, ';');
// Okay, entity closed before the cut (; is after &)
if ($end !== false && $end > $pos) {
    return $head;
}
// The entity is still open at the cut, so find the first ; after the &
$end = strpos($text, ';', $pos);
if ($end === false) {
    // Not valid HTML, unclosed entity, do whatever you want; here we drop it
    return substr($text, 0, $pos);
}
// Keep everything up to and including the closing ;
return substr($text, 0, $end + 1);
Check numbers, there may be +/-1 somewhere in indexes...
I think you would have to use a combination of strpos and strrpos to find the next and previous spaces, parse the text between the spaces, check that against a known list of special characters, and if it matches, extend your "cut" to the position of the next space. If you had a code sample of what you have now, we could give you a better answer.
I was trying the range() function with a non-English language. It is not working.
$i = 0;
foreach(range('क', 'म') as $ab) {
++$i;
$alphabets[$ab] = $i;
}
Output: à =1
These are Hindi (India) letters. The loop iterates only once, as the output shows.
I don't understand what to do about this.
So, if possible, please tell me how to fix this, and what I should do first before working with non-English text in any PHP functions.
Short answer: it's not possible to use range like that.
Explanation
You are passing the string 'क' as the start of the range and 'म' as the end. You are getting only one character back, and that character is à.
You are getting back à because your source file is encoded (saved) in UTF-8. One can tell this by the fact that à is code point U+00E0, while 0xE0 is also the first byte of the UTF-8 encoded form of 'क' (which is 0xE0 0xA4 0x95). Sadly, PHP has no notion of encodings so it just takes the first byte it sees in the string and uses that as the "start" character.
You are getting back only à because the UTF-8 encoded form of 'म' also starts with 0xE0 (so PHP also thinks that the "end character" is 0xE0 or à).
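The shared first byte is easy to confirm. The sketch below writes both letters with \u{...} escapes so the bytes do not depend on the encoding of the source file.

```php
<?php
$ka = "\u{0915}"; // क
$ma = "\u{092E}"; // म

// Full UTF-8 byte sequences
echo bin2hex($ka), "\n"; // e0a495
echo bin2hex($ma), "\n"; // e0a4ae

// The first byte is the same for both, which is all a byte-based range sees
echo dechex(ord($ka[0])), ' ', dechex(ord($ma[0])), "\n"; // e0 e0
```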
Solution
You can write range as a for loop yourself, as long as there is some function that returns the Unicode code point of an UTF-8 character (and one that does the reverse). So I googled and found these here:
// Returns the UTF-8 character with code point $intval
function unichr($intval) {
return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}
// Returns the code point for a UTF-8 character
function uniord($u) {
$k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
$k1 = ord(substr($k, 0, 1));
$k2 = ord(substr($k, 1, 1));
return $k2 * 256 + $k1;
}
With the above, you can now write:
for($char = uniord('क'); $char <= uniord('म'); ++$char) {
$alphabet[] = unichr($char);
}
print_r($alphabet);
See it in action.
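On PHP 7.2 or later, the mbstring extension provides mb_ord() and mb_chr(), which can stand in for the two helper functions above. A minimal sketch, where the name mb_range is made up for this example:

```php
<?php
// Hypothetical helper: list every character from $first to $last by code point.
function mb_range(string $first, string $last): array
{
    $from = mb_ord($first, 'UTF-8');
    $to = mb_ord($last, 'UTF-8');

    $out = [];
    for ($cp = $from; $cp <= $to; $cp++) {
        // Convert each code point back to a UTF-8 character
        $out[] = mb_chr($cp, 'UTF-8');
    }
    return $out;
}

print_r(mb_range('क', 'म'));
```

Like the loop above, this walks every code point in the interval, so it returns whatever Unicode assigns between the two letters.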
The lazy solution would be to use html_entity_decode() and range() only for the numeric ranges it was originally intended (that it works with ASCII is a bit silly anyway):
$alphabets = [];
$i = 0;
foreach (range(0x0915, 0x092E) as $char) {
    // Turn the numeric code point into a UTF-8 character
    $char = html_entity_decode("&#$char;", ENT_COMPAT, "UTF-8");
    $alphabets[$char] = ++$i;
}
Another solution would be to translate the characters, get the range, and then translate back again.
$first = file_get_contents("http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&langpair=|en&q=क");
$second = file_get_contents("http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&langpair=|en&q=म"); //not real value
$jsonfirst = json_decode($first);
$jsonsecond = json_decode($second);
$f = $jsonfirst->responseData->translatedText;
$l = $jsonsecond->responseData->translatedText;
foreach(range($f, $l) as $ab) {
echo $ab;
}
Outputs
ABCDEFGHI
To translate back, use array_map() with a callback function that translates each of the English values back to Hindi.
I have a unique problem with multibyte character strings and need to be able to shuffle, with some fair degree of randomness, a long UTF-8 encoded multibyte string in PHP without dropping or losing or repeating any of the characters.
In the PHP manual under str_shuffle there is a multi-byte function (the first user submitted one) that doesn't work: If I use a string with for example all the Japanese hiragana and katakana of string length (ex) 120 chars, I am returned a string that's 119 chars or 118 chars. Sometimes I've seen duplicate chars even though the original string doesn't have them. So that's not functional.
To make this more complex, I also need to include if possible Japanese UTF-8 newlines and line feeds and punctuation.
Can anyone with experience dealing in multiple languages with UTF-8 mb strings help? Does PHP have any built in functions to do this? str_shuffle is EXACTLY what I want. I just need it to also work on multibyte chars.
Thanks very much!
Try splitting the string using mb_strlen and mb_substr to create an array, then using shuffle before joining it back together again. (Edit: As also demonstrated in #Frosty Z's answer.)
An example from the PHP interactive prompt:
php > $string = "Pretend I'm multibyte!";
php > $len = mb_strlen($string);
php > $sploded = array();
php > while($len-- > 0) { $sploded[] = mb_substr($string, $len, 1); }
php > shuffle($sploded);
php > echo join('', $sploded);
rmedt tmu nIb'lyi!eteP
You'll want to be sure to specify the encoding, where appropriate.
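With the encoding spelled out, the same approach might be wrapped up as below. The function name mb_shuffle_utf8 is an assumption for this sketch, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: shuffle a UTF-8 string character by character.
function mb_shuffle_utf8(string $s): string
{
    $chars = [];
    $len = mb_strlen($s, 'UTF-8');
    for ($i = 0; $i < $len; $i++) {
        // One whole character per array element, never a partial byte sequence
        $chars[] = mb_substr($s, $i, 1, 'UTF-8');
    }
    shuffle($chars);
    return implode('', $chars);
}

$in = "あいうえお\nカキクケコ、。";
$out = mb_shuffle_utf8($in);

// Same number of characters and bytes in and out: nothing dropped or repeated
var_dump(mb_strlen($in, 'UTF-8') === mb_strlen($out, 'UTF-8'));
var_dump(strlen($in) === strlen($out));
```

Newlines and punctuation are treated like any other character, so they are shuffled along with the rest.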
This should do the trick, too. I hope.
// "String" is a reserved class name since PHP 7, so a different name is used
class MbString
{
    public function mbStrShuffle($string)
    {
        $chars = $this->mbGetChars($string);
        shuffle($chars);
        return implode('', $chars);
    }
    public function mbGetChars($string)
    {
        $chars = [];
        for($i = 0, $length = mb_strlen($string, 'UTF-8'); $i < $length; ++$i)
        {
            $chars[] = mb_substr($string, $i, 1, 'UTF-8');
        }
        return $chars;
    }
}
I like to use this function (mb_str_split() requires PHP 7.4 or later):
function mb_str_shuffle($multibyte_string = "abcčćdđefghijklmnopqrsštuvwxyzžß,.-+'*?=)(/&%$#!~ˇ^˘°˛`˙´˝") {
$characters_array = mb_str_split($multibyte_string);
shuffle($characters_array);
return implode('', $characters_array); // or join('', $characters_array); if you have a death wish (JK)
}
Split string into an array of multibyte characters
Shuffle the good guy array who doesn't care about his residents being multibyte
Join the shuffled array together into a string
Of course, I normally wouldn't give the function's parameter a default value.