Trying to use php similar_text() with arabic, but it's not working.
However it works great with english.
<?php
$var = similar_text("ياسر","عمار","$per");
echo $var;
?>
outbot : 5
that's wrong result, it should be 2. Is there similar_text() with arabic letters?
Here's one I'm using
//from http://www.phperz.com/article/14/1029/31806.html
function mb_split_str($str) {
preg_match_all("/./u", $str, $arr);
return $arr[0];
}
//based on http://www.phperz.com/article/14/1029/31806.html, added percent
function mb_similar_text($str1, $str2, &$percent) {
$arr_1 = array_unique(mb_split_str($str1));
$arr_2 = array_unique(mb_split_str($str2));
$similarity = count($arr_2) - count(array_diff($arr_2, $arr_1));
$percent = ($similarity * 200) / (strlen($str1) + strlen($str2) );
return $similarity;
}
So
$var = mb_similar_text('عمار', 'ياسر', $per);
output: $var = 2, $per = 25
Because the Arabic text are multibyte strings normal PHP functions cannot be used (such as 'similar_text()').
echo(strlen("عمار"));
The above code outputs: 8
echo(mb_strlen("عمار", "UTF-8"));
Using the mb_strlen function with the UTF-8 encoding specified, the output is: 4 (the correct number of characters).
You can use the mb_ functions to make your own version of the similar_text function: http://php.net/manual/en/ref.mbstring.php
Just for the record and hopefully to make some help, I want to clarify the behavior of the similar_text() function when some multi-byte character strings are given (including the character strings of the Arabic.)
The function simply treats each byte of the input string as an individual character (which implies it neither supports multi-byte characters nor the Unicode.)
The byte streams of the عمار and ياسر strings are respectively represented as the following (the bytes (in the hexadecimal representation) are separated using . and, where the end of a character is reached, then a : is used instead):
06.39:06.45:06.27:06.31 <-- Byte stream for عمار
|| || || || ||
06.4A:06.27:06.33:06.31 <-- Byte stream for ياسر
As you can tell, there are five matching, and that's the reason why the function returns 5 in this case (every two hexadecimal digits represent a byte.)
Related
I need to make an international accreditation validation by name and surname. But the problem is, that i need to return TRUE even if characters are different.
EXAMPLE:
$str1 = 'Bożydar Kamiński';
$str2 = 'BOZYDAR KAMINSKI';
// I need this to be TRUE
if ($str1 == $str2) {
echo 'YOUR BUNNY WROTE';
}
Is there is some php default functions to convert UTF-8 string with unicode characters (str1) to a plain latin characters?
If you know the possible differences regarding the UTF-8 encoder, you can create your own decoder. You can create a table with all possible strange characters and in the function you should perform the comparison and then exchange for the equivalent.
$coll = Collator::create('');
$coll->setStrength(Collator::PRIMARY);
var_dump(0 == $coll->compare(
'Bożydar Kamiński',
'BOZYDAR KAMINSKI'
)); // bool(true)
This question looks embarrassingly simple, but I haven't been able to find an answer.
What is the PHP equivalent to the following C# line of code?
string str = "\u1000";
This sample creates a string with a single Unicode character whose "Unicode numeric value" is 1000 in hexadecimal (4096 in decimal).
That is, in PHP, how can I create a string with a single Unicode character whose "Unicode numeric value" is known?
PHP 7.0.0 has introduced the "Unicode codepoint escape" syntax.
It's now possible to write Unicode characters easily by using a double-quoted or a heredoc string, without calling any function.
$unicodeChar = "\u{1000}";
Because JSON directly supports the \uxxxx syntax the first thing that comes into my mind is:
$unicodeChar = '\u1000';
echo json_decode('"'.$unicodeChar.'"');
Another option would be to use mb_convert_encoding()
echo mb_convert_encoding('က', 'UTF-8', 'HTML-ENTITIES');
or make use of the direct mapping between UTF-16BE (big endian) and the Unicode codepoint:
echo mb_convert_encoding("\x10\x00", 'UTF-8', 'UTF-16BE');
I wonder why no one has mentioned this yet, but you can do an almost equivalent version using escape sequences in double quoted strings:
\x[0-9A-Fa-f]{1,2}
The sequence of characters matching the regular expression is a
character in hexadecimal notation.
ASCII example:
<?php
echo("\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21");
?>
Hello World!
So for your case, all you need to do is $str = "\x30\xA2";. But these are bytes, not characters. The byte representation of the Unicode codepoint coincides with UTF-16 big endian, so we could print it out directly as such:
<?php
header('content-type:text/html;charset=utf-16be');
echo("\x30\xA2");
?>
ア
If you are using a different encoding, you'll need alter the bytes accordingly (mostly done with a library, though possible by hand too).
UTF-16 little endian example:
<?php
header('content-type:text/html;charset=utf-16le');
echo("\xA2\x30");
?>
ア
UTF-8 example:
<?php
header('content-type:text/html;charset=utf-8');
echo("\xE3\x82\xA2");
?>
ア
There is also the pack function, but you can expect it to be slow.
PHP does not know these Unicode escape sequences. But as unknown escape sequences remain unaffected, you can write your own function that converts such Unicode escape sequences:
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', create_function('$match', 'return mb_convert_encoding(pack("H*", $match[1]), '.var_export($encoding, true).', "UTF-16BE");'), $str);
}
Or with an anonymous function expression instead of create_function:
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function($match) use ($encoding) {
return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
}, $str);
}
Its usage:
$str = unicodeString("\u1000");
html_entity_decode('エ', 0, 'UTF-8');
This works too. However the json_decode() solution is a lot faster (around 50 times).
Try Portable UTF-8:
$str = utf8_chr( 0x1000 );
$str = utf8_chr( '\u1000' );
$str = utf8_chr( 4096 );
All work exactly the same way. You can get the codepoint of a character with utf8_ord(). Read more about Portable UTF-8.
As mentioned by others, PHP 7 introduces support for the \u Unicode syntax directly.
As also mentioned by others, the only way to obtain a string value from any sensible Unicode character description in PHP, is by converting it from something else (e.g. JSON parsing, HTML parsing or some other form). But this comes at a run-time performance cost.
However, there is one other option. You can encode the character directly in PHP with \x binary escaping. The \x escape syntax is also supported in PHP 5.
This is especially useful if you prefer not to enter the character directly in a string through its natural form. For example, if it is an invisible control character, or other hard to detect whitespace.
First, a proof example:
// Unicode Character 'HAIR SPACE' (U+200A)
$htmlEntityChar = " ";
$realChar = html_entity_decode($htmlEntityChar);
$phpChar = "\xE2\x80\x8A";
echo 'Proof: ';
var_dump($realChar === $phpChar); // bool(true)
Note that, as mentioned by Pacerier in another answer, this binary code is unique to a specific character encoding. In the above example, \xE2\x80\x8A is the binary coding for U+200A in UTF-8.
The next question is, how do you get from U+200A to \xE2\x80\x8A?
Below is a PHP script to generate the escape sequence for any character, based on either a JSON string, HTML entity, or any other method once you have it as a native string.
function str_encode_utf8binary($str) {
/** #author Krinkle 2018 */
$output = '';
foreach (str_split($str) as $octet) {
$ordInt = ord($octet);
// Convert from int (base 10) to hex (base 16), for PHP \x syntax
$ordHex = base_convert($ordInt, 10, 16);
$output .= '\x' . $ordHex;
}
return $output;
}
function str_convert_html_to_utf8binary($str) {
return str_encode_utf8binary(html_entity_decode($str));
}
function str_convert_json_to_utf8binary($str) {
return str_encode_utf8binary(json_decode($str));
}
// Example for raw string: Unicode Character 'INFINITY' (U+221E)
echo str_encode_utf8binary('∞') . "\n";
// \xe2\x88\x9e
// Example for HTML: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_html_to_utf8binary(' ') . "\n";
// \xe2\x80\x8a
// Example for JSON: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_json_to_utf8binary('"\u200a"') . "\n";
// \xe2\x80\x8a
function unicode_to_textstring($str){
$rawstr = pack('H*', $str);
$newstr = iconv('UTF-16BE', 'UTF-8', $rawstr);
return $newstr;
}
$msg = '67714eac99c500200054006f006b0079006f002000530074006100740069006f006e003a0020';
echo unicode_to_textstring($str);
Seams to be like sprintf have a problem with foregin characters? Or is it me doing something wrong? Looks like it work when removing chars like åäö from the string though. Should that be necessary?
I want the following lines to be aligned correctly for a report:
2011-11-27 A1823 -Ref. Leif - 12 873,00 18.98
2011-11-30 A1856 -Rättat xx - 6 594,00 19.18
I'm using sprintf() like this: %-12s %-8s -%-10s -%20s %8.2f
Using: php-5.3.23-nts-Win32-VC9-x86
Strings in PHP are basically arrays of bytes (not characters). They cannot work natively with multibyte encodings (such as UTF-8).
For details see:
https://www.php.net/manual/en/language.types.string.php#language.types.string.details
Most string functions in PHP have multibyte equivalent though (with the mb_ prefix). But the sprintf does not.
There's a user comment (by "viktor at textalk dot com") with multibyte implementation of the sprintf on the function's documentation page at php.net. It may work for you:
https://www.php.net/manual/en/function.sprintf.php#89020
I was actually trying to find out if PHP ^7 finally has a native mb_sprintf() but apparently no xD.
For the sake of completeness, here is a simple solution I've been using in some old projects. It just adds the diff between strlen & mb_strlen to the desired $targetLengh.
The non-multibyte example is just added for the sake of easy comparison =).
$text = "Gultigkeitsprufung ist fehlgeschlagen: %{errors}";
$mbText = "Gültigkeitsprüfung ist fehlgeschlagen: %{errors}";
$mbTextRussian = "Проверка не удалась: %{errors}";
$targetLength = 60;
$mbTargetLength = strlen($mbText) - mb_strlen($mbText) + $targetLength;
$mbRussianTargetLength = strlen($mbTextRussian) - mb_strlen($mbTextRussian) + $targetLength;
printf("%{$targetLength}s\n", $text);
printf("%{$mbTargetLength}s\n", $mbText);
printf("%{$mbRussianTargetLength}s\n", $mbTextRussian);
result
Gultigkeitsprufung ist fehlgeschlagen: %{errors}
Gültigkeitsprüfung ist fehlgeschlagen: %{errors}
Проверка не удалась: %{errors}
update 2019-06-12
#flowtron made me give it another thought. A simple mb_sprintf() could look like this.
function mb_sprintf($format, ...$args) {
$params = $args;
$callback = function ($length) use (&$params) {
$value = array_shift($params);
return strlen($value) - mb_strlen($value) + $length[0];
};
$format = preg_replace_callback('/(?<=%|%-)\d+(?=s)/', $callback, $format);
return sprintf($format, ...$args);
}
echo mb_sprintf("%-10s %-10s %10s\n", 'thüs', 'wörks', 'ök');
echo mb_sprintf("%-10s %-10s %10s\n", 'this', 'works', 'ok');
result
thüs wörks ök
this works ok
I only did some happy path testing here, but it works for PHP >=5.6 and should be good enough to give ppl an idea on how to encapsulate the behavior.
It does not work with the repetition/order modifiers though - e.g. %1$20s will be ignored/remain unchanged.
If you're using characters that fit in the ISO-8859-1 character set, you can convert the strings before formatting, and convert the result back to UTF8 when you are done
utf8_encode(sprintf("%-12s %-8s", utf8_decode($paramOne), utf8_decode($paramTwo))
Problem
There is no multibyte format functions.
Idea
You can't convert input strings. You should change format lengths.
A format %4s means 4 widths (not characters - see footnote). But PHP format functions count bytes.
So you should add format lengths to bytes - widths.
Implementations
from #nimmneun
function mb_sprintf($format, ...$args) {
$params = $args;
$callback = function ($length) use (&$params) {
$value = array_shift($params);
return $length[0] + strlen($value) - mb_strwidth($value);
};
$format = preg_replace_callback('/(?<=%|%-)\d+(?=s)/', $callback, $format);
return sprintf($format, ...$args);
}
And don't forget another option str_pad($input, $length, $pad_char=' ', STR_PAD_RIGHT)
function mb_str_pad(...$args) {
$args[1] += strlen($args[0]) - mb_strwidth($args[0]);
return str_pad(...$args);
}
Footnote
Asian characters have 3 bytes and 2 width and 1 character length.
If your format is %4s and the input is one asian character, you should need two spaces (padding) not three.
Sorry for the title, I really didn't know how to say this...
I often have a string that needs to be cut after X characters, my problem is that this string often contains special characters like : & egrave ;
So, I'm wondering, is their a way to know in php, without transforming my string, if when I am cutting my string, I am in the middle of a special char.
Example
This is my string with a special char : è - and I want it to cut in the middle of the "è" but still keeping the string intact
so right now my result with a sub string would be :
This is my string with a special char : &egra
but I want to have something like this :
This is my string with a special char : è
The best thing to do here is store your string as UTF-8 without any html entities, and use the mb_* family of functions with utf8 as the encoding.
But, if your string is ASCII or iso-8859-1/win1252, you can use the special HTML-ENTITIES encoding of the mb_string library:
$s = 'This is my string with a special char : è - and I want it to cut in the middle of the "è" but still keeping the string intact';
echo mb_substr($s, 0, 40, 'HTML-ENTITIES');
echo mb_substr($s, 0, 41, 'HTML-ENTITIES');
However, if your underlying string is UTF-8 or some other multibyte encoding, using HTML-ENTITIES is not safe! This is because HTML-ENTITIES really means "win1252 with high-bit characters as html entities". This is an example of where this can go wrong:
// Assuming that é is in utf8:
mb_substr('é ', 0, 2, 'HTML-ENTITIES') === 'é'
// should be 'é '
When your string is in a multibyte encoding, you must instead convert all html entities to a common encoding before you split. E.g.:
$strings_actual_encoding = 'utf8';
$s_noentities = html_entity_decode($s, ENT_QUOTES, $strings_actual_encoding);
$s_trunc_noentities = mb_substr($s_noentities, 0, 41, $strings_actual_encoding);
The best solution would be to store your text as UTF-8, instead of storing them as HTML entities. Other than that, if you don't mind the count being off (` equals one character, instead of 7), then the following snippet should work:
<?php
$string = 'This is my string with a special char : è - and I want it to cut in the middle of the "è" but still keeping the string intact';
$cut_string = htmlentities(mb_substr(html_entity_decode($string, NULL, 'UTF-8'), 0, 45), NULL, 'UTF-8')."<br><br>";
Note: If you use a different function to encode the text (e.g. htmlspecialchars()), then use that function instead of htmlentities(). If you use a custom function, then use another custom function that does the opposite of your new custom function instead of html_entity_decode() (and custom function instead of htmlentities()).
The longest HTML entity is 10 characters long, including the ampersand and semicolon. If you intend to cut the string at X bytes, check bytes X-9 through X-1 for an ampersand. If the corresponding semicolon appears at byte X or later, cut the string after the semicolon instead of after byte X.
However, if you're willing to preprocess the string, Mike's solution will be more accurate because his cuts the string at X characters, not bytes.
You can use html_entity_decode() first to decode all the HTML entities. Then split your string. Then htmlentities() to re-encode the entities.
$decoded_string = html_entity_decode($original_string);
// implement logic to split string here
// then for each string part do the following:
$encoded_string_part = htmlentities($split_string_part);
A little bruteforce solution, that I'm not really happy with would a PCRE expression, let's say that you want to pass 80 characters and the longest possible HTML expression is 7 chars long:
$regex = '~^(.{73}([^&]{7}|.{0,7}$|[^&]{0,6}&[^;]+;))(.*)~mx'
// Note, this could return a bit of shorter text
return preg_replace( $regexp, '$1', $text);
Just so you know:
.{73} - 73 characters
[^&]{7} - okay, we may fill it with anything that doesn't contain &
.{0,7}$ - keep in mind the possible end (this shouldn't be necessary because shorter text wouldn't match at all)
[^&]{0,6}&[^;]+; - up to 6 characters (you'd be at 79th), then & and let it finish
Something that seems much better but requires bit of play with numbers is to:
// check whether $text is at least $N chars long :)
if( strlen( $text) < $N){
return;
}
// Get last &
$pos = strrpos( $text, '&', $N);
// We're not young anymore, we have to check this too (not entries at all) :)
if( $pos === false){
return substr( $text, 0, $N);
}
// Get Last
$end = strpos( $text, ';', $N);
// false wouldn't be smaller then 0 (entry open at the beginning
if( $end === false){
$end = -1;
}
// Okay, entry closed (; is after &)(
if( $end > $pos){
return substr($text, 0, $N);
}
// Now we need to find first ;
$end = strpos( $text, ';', $N)
if( $end === false){
// Not valid HTML, not closed entry, do whatever you want
}
return substr($text, 0, $end);
Check numbers, there may be +/-1 somewhere in indexes...
I think you would have to use a combination of strpos and strrpos to find the next and previous spaces, parse the text between the spaces, check that against a known list of special characters, and if it matches, extend your "cut" to the position of the next space. If you had a code sample of what you have now, we could give you a better answer.
I have two strings that look the same when I echo them, but when I var_dump() them they are different string types:
Echo:
http://blah
http://blah
var dump:
string(14) "http://blah"
string(11) "http://blah"
strToHex:
%68%74%74%70%3a%2f%2f%62%6c%61%68%00%00%00
%68%74%74%70%3a%2f%2f%62%6c%61%68
When I compare them, they return false. How can I manipulate the string type, so that I can perform a comparison that returns true?
What is the difference between string 11 and string 14? I am sure there is a simple resolution, but I have not found anything yet. No matter how I implode, explode, UTF-8 encode, etc., they will not compare the strings or change type.
Letter "a" can be written in another encoding.
For example: blаh. Here a is a Cyrillic 'а'.
All of these letters are Cyrillic, but it looks like Latin: у, е, х, а, р, о, с
Trim the strings before comparing. There are escaped characters, like \t and \n, which are not visible.
$clean_str = trim($str);
When using var_dump(), then string(14) means that the value is a string that holds 14 bytes. So string(11) and string(14) are not different "types" of strings; they are just strings of different length.
I would use something like this to see what actually is inside those strings:
function strToHex($value, $prefix = '') {
$result = '';
$length = strlen($value);
for ( $n = 0; $n < $length; $n++ ) {
$result .= $prefix . sprintf('%02x', ord($value[$n]));
}
return $result;
}
echo strToHex("test\r\n", '%');
Output:
%74%65%73%74%0d%0a
This decodes as:
%74 - t
%65 - e
%73 - s
%74 - t
%0d - \r (carriage return)
%0a - \n (line feed)
Or, as pointed out in comments by Karolis, you can use the built-in function bin2hex():
echo bin2hex("test\r\n");
Output:
746573740d0a
Try to trim these strings:
if (trim($string1) == trim($string2)) {
// Do things
}
Probably Unicode strings within the upper range are counted as double bytes.
Use mb_strlen() to check lengths.
Also some characters may not be visible, but present (there are many of Unicode spaces, etc.)
Generally, when you work with Unicode functions, you should use the mb_* string functions.
You may overload string encoding functions in php.ini to always use mb_* functions instead the standard ones (I am not sure if Xdebug honors those settings).
In PHP 6 this problem will be solved, as it should be globally Unicode-aware.