PHP arabic text compare using strpos

I have an Arabic keyword in a MySQL table, stored like this:
*#1591; *#1610; *#1585;*#1575;*#1606
// Please read * as & here; a value written with a real '&' is automatically converted to Arabic when displayed.
MySQL table collation: utf8_general_ci
I am getting strings from external sources, for example Twitter.
I would like to match the keyword against each tweet I receive.
$tweet = 'وينج وأداسي الاماراتية توقعان اتفاقية تعاون لتوفير أنظمة الطائرات بدون طيا';
$keyword = '*#1591; *#1610; *#1585;*#1575;*#1606'; // from the database
$status = strpos($tweet, $keyword);
$status is always false.
I have tried utf8_encode(), utf8_decode() and mb_strpos() without any luck.
I know I need to convert both strings to one common format before comparing them, but which format should I convert to?
Please help me with this.

Because Arabic characters are encoded as multibyte sequences in UTF-8, you should use functions that are multibyte-aware: grapheme_strpos and mb_strpos (in that order of preference).
Using them instead of plain old strpos should do the trick.
Also, keep in mind that you may have to check that they are available before using them, as not all hosting environments have the intl and mbstring extensions enabled:
if (function_exists('grapheme_strpos')) {
    $pos = grapheme_strpos($tweet, $keyword);
} elseif (function_exists('mb_strpos')) {
    $pos = mb_strpos($tweet, $keyword);
} else {
    $pos = strpos($tweet, $keyword);
}
And last but not least, check the docs for the arguments each function takes, such as the encoding used by the strings.
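The keyword in the question appears to be stored as HTML numeric character references rather than as UTF-8 text. If that is the case, one possible approach is to decode the references before searching. This is only a minimal sketch: it assumes the stored value uses a real `&` and that every reference ends with a semicolon, and the helper name `containsKeyword` and the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: decode HTML numeric references, then search as UTF-8.
function containsKeyword(string $text, string $storedKeyword): bool
{
    // Turn "&#1591;" style references into real UTF-8 characters.
    $keyword = html_entity_decode($storedKeyword, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    if ($keyword === '') {
        return false; // nothing to search for
    }
    // mb_strpos returns false when there is no match, and may return 0 on a match,
    // so compare strictly against false.
    return mb_strpos($text, $keyword, 0, 'UTF-8') !== false;
}

// Illustrative value only, not the asker's exact database row.
$stored = '&#1591;&#1610;&#1585;&#1575;&#1606;';
// Illustrative text only.
$tweet = 'شركة طيران جديدة';

var_dump(containsKeyword($tweet, $stored));
```

Note that any spaces stored between the references are decoded as spaces, so they would have to appear in the tweet as well for the match to succeed.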

Related

How to compare 2 strings with different encoding and different unicode characters in PHP

I need to validate an international accreditation by name and surname. The problem is that I need to return TRUE even if the characters differ.
EXAMPLE:
$str1 = 'Bożydar Kamiński';
$str2 = 'BOZYDAR KAMINSKI';
// I need this to be TRUE
if ($str1 == $str2) {
    echo 'YOUR BUNNY WROTE';
}
Is there a built-in PHP function to convert a UTF-8 string with Unicode characters (str1) to plain Latin characters?

If you know which characters can differ between the strings, you can write your own converter: build a table that maps each accented or special character to its plain equivalent, replace the characters in both strings using that table, and then compare the results.
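The lookup-table idea could be sketched roughly as follows. The map covers only the two characters from the example and the helper name `foldName` is made up; a real table would need every character you expect to meet.

```php
<?php
// Hypothetical helper: replace known special characters, then ignore case.
function foldName(string $name): string
{
    // Illustrative map only; extend it for your own data.
    static $map = ['ż' => 'z', 'Ż' => 'Z', 'ń' => 'n', 'Ń' => 'N'];

    // strtr() with an array replaces whole multibyte sequences safely.
    return mb_strtoupper(strtr($name, $map), 'UTF-8');
}

$str1 = 'Bożydar Kamiński';
$str2 = 'BOZYDAR KAMINSKI';

var_dump(foldName($str1) === foldName($str2));
```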

$coll = Collator::create('');
$coll->setStrength(Collator::PRIMARY);
var_dump(0 == $coll->compare(
'Bożydar Kamiński',
'BOZYDAR KAMINSKI'
)); // bool(true)

Normalize Name-Surname strings: PHP+REGEX (Spanish chars- UTF8)

I have strings containing a name and surname which I need to normalize with a function so that they look like:
Name Surname (I can receive strings like NAME SURNAME, Name SURNAME, etc.)
I've found this snippet:
echo nameize("HÉCTOR MAÑAÇ");
function nameize($str,$a_char = array("'","-"," ")){
    //$str contains the complete raw name string
    //$a_char is an array containing the characters we use as separators for capitalization. If you don't pass anything, there are three in there as default.
    $string = strtolower($str);
    foreach ($a_char as $temp){
        $pos = strpos($string,$temp);
        if ($pos){
            //we are in the loop because we found one of the special characters in the array, so lets split it up into chunks and capitalize each one.
            $mend = '';
            $a_split = explode($temp,$string);
            foreach ($a_split as $temp2){
                //capitalize each portion of the string which was separated at a special character
                $mend .= ucfirst($temp2).$temp;
            }
            $string = substr($mend,0,-1);
        }
    }
    return ucfirst($string);
}
This works pretty well but, as you can see by testing this exact example, it doesn't handle Spanish characters (UTF-8). I've tried mb_regex_encoding("UTF-8"), mb_internal_encoding("UTF-8"), UTF-8 headers, etc., but I can't make it work properly with "special" Spanish characters.
Any suggestions?

I can't see where you use the multibyte string functions.
Maybe this would suit your needs:
echo mb_convert_case("HÉCTOR MAÑAÇ", MB_CASE_TITLE, "UTF-8");
output:
Héctor Mañaç
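If you need to keep the custom separator list from the original snippet, a multibyte-aware variant might look like the sketch below. The function name `mbNameize` is made up, and it assumes the input is valid UTF-8 and that the separators are single ASCII characters.

```php
<?php
// Hypothetical multibyte-aware rewrite of the nameize() idea.
function mbNameize(string $str, array $separators = ["'", "-", " "]): string
{
    $result = mb_strtolower($str, 'UTF-8');

    foreach ($separators as $sep) {
        $parts = explode($sep, $result);
        foreach ($parts as $i => $part) {
            // Upper-case only the first character of each chunk.
            $first = mb_substr($part, 0, 1, 'UTF-8');
            $rest  = mb_substr($part, 1, null, 'UTF-8');
            $parts[$i] = mb_strtoupper($first, 'UTF-8') . $rest;
        }
        $result = implode($sep, $parts);
    }

    return $result;
}

echo mbNameize("HÉCTOR MAÑAÇ"), "\n";
echo mbNameize("JEAN-PIERRE O'NEIL"), "\n";
```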

Your function also works fine for the given example. Please check your file encoding; it must be UTF-8. You can check it in Notepad++.

php sprintf() with foreign characters?

It seems that sprintf has a problem with foreign characters. Or am I doing something wrong? It works when I remove characters like åäö from the string, though. Should that be necessary?
I want the following lines to be aligned correctly for a report:
2011-11-27 A1823 -Ref. Leif - 12 873,00 18.98
2011-11-30 A1856 -Rättat xx - 6 594,00 19.18
I'm using sprintf() like this: %-12s %-8s -%-10s -%20s %8.2f
Using: php-5.3.23-nts-Win32-VC9-x86

Strings in PHP are basically arrays of bytes (not characters). They cannot work natively with multibyte encodings (such as UTF-8).
For details see:
https://www.php.net/manual/en/language.types.string.php#language.types.string.details
Most string functions in PHP have a multibyte equivalent (with the mb_ prefix), but sprintf does not.
There's a user comment (by "viktor at textalk dot com") with a multibyte implementation of sprintf on the function's documentation page at php.net. It may work for you:
https://www.php.net/manual/en/function.sprintf.php#89020
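To see why the columns drift, it may help to compare byte and character counts directly. This small sketch assumes the source file is saved as UTF-8 and that the mbstring extension is loaded:

```php
<?php
$word = 'Rättat';

// strlen() counts bytes; "ä" takes two bytes in UTF-8.
var_dump(strlen($word));
// mb_strlen() counts characters.
var_dump(mb_strlen($word, 'UTF-8'));

// sprintf() pads to 10 *bytes*, so the visible width ends up one short.
$padded = sprintf('%-10s', $word);
var_dump(strlen($padded));
var_dump(mb_strlen($padded, 'UTF-8'));
```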

I was actually trying to find out whether PHP 7+ finally has a native mb_sprintf(), but apparently not.
For the sake of completeness, here is a simple solution I've been using in some old projects. It just adds the difference between strlen and mb_strlen to the desired $targetLength.
The non-multibyte example is included only for easy comparison =).
$text = "Gultigkeitsprufung ist fehlgeschlagen: %{errors}";
$mbText = "Gültigkeitsprüfung ist fehlgeschlagen: %{errors}";
$mbTextRussian = "Проверка не удалась: %{errors}";
$targetLength = 60;
$mbTargetLength = strlen($mbText) - mb_strlen($mbText) + $targetLength;
$mbRussianTargetLength = strlen($mbTextRussian) - mb_strlen($mbTextRussian) + $targetLength;
printf("%{$targetLength}s\n", $text);
printf("%{$mbTargetLength}s\n", $mbText);
printf("%{$mbRussianTargetLength}s\n", $mbTextRussian);
Result:
            Gultigkeitsprufung ist fehlgeschlagen: %{errors}
            Gültigkeitsprüfung ist fehlgeschlagen: %{errors}
                              Проверка не удалась: %{errors}
Update 2019-06-12
@flowtron made me give it another thought. A simple mb_sprintf() could look like this.
function mb_sprintf($format, ...$args) {
    $params = $args;
    $callback = function ($length) use (&$params) {
        $value = array_shift($params);
        return strlen($value) - mb_strlen($value) + $length[0];
    };
    $format = preg_replace_callback('/(?<=%|%-)\d+(?=s)/', $callback, $format);
    return sprintf($format, ...$args);
}
echo mb_sprintf("%-10s %-10s %10s\n", 'thüs', 'wörks', 'ök');
echo mb_sprintf("%-10s %-10s %10s\n", 'this', 'works', 'ok');
Result:
thüs       wörks              ök
this       works              ok
I only did some happy-path testing here, but it works for PHP >= 5.6 and should be good enough to give people an idea of how to encapsulate the behavior.
It does not work with the repetition/order modifiers, though: e.g. %1$20s will be ignored and remain unchanged.

If you're only using characters that fit in the ISO-8859-1 character set, you can convert the strings before formatting and convert the result back to UTF-8 when you are done:
utf8_encode(sprintf("%-12s %-8s", utf8_decode($paramOne), utf8_decode($paramTwo)));
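The same round trip can also be written with mb_convert_encoding(), which may be preferable on newer PHP versions where utf8_encode() and utf8_decode() are deprecated (as of PHP 8.2). A minimal sketch, assuming every argument is a string whose characters all exist in ISO-8859-1; the helper name `latin1Sprintf` is made up:

```php
<?php
// Hypothetical helper: format in a single-byte encoding, then convert back.
function latin1Sprintf(string $format, string ...$args): string
{
    $toLatin1 = function (string $s): string {
        return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    };

    // In ISO-8859-1 one character is one byte, so sprintf() pads correctly.
    $formatted = sprintf($toLatin1($format), ...array_map($toLatin1, $args));

    return mb_convert_encoding($formatted, 'UTF-8', 'ISO-8859-1');
}

// The trailing "|" only makes the padding visible.
$line = latin1Sprintf('%-12s %-8s -%-10s|', '2011-11-30', 'A1856', 'Rättat xx');
echo $line, "\n";
```

Characters outside ISO-8859-1 cannot survive this conversion, so the approach is unsuitable for text such as Cyrillic or Arabic.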

Problem
There are no multibyte format functions.
Idea
Converting the input strings is not always possible; change the format lengths instead.
A format of %4s means a width of 4 columns (not 4 characters, see the footnote), but PHP's format functions count bytes.
So for each argument, add (bytes - width) to its format length.
Implementations
Adapted from @nimmneun:
function mb_sprintf($format, ...$args) {
    $params = $args;
    $callback = function ($length) use (&$params) {
        $value = array_shift($params);
        return $length[0] + strlen($value) - mb_strwidth($value);
    };
    $format = preg_replace_callback('/(?<=%|%-)\d+(?=s)/', $callback, $format);
    return sprintf($format, ...$args);
}
And don't forget another option: str_pad($input, $length, $pad_char = ' ', STR_PAD_RIGHT)
// Note: PHP 8.3+ already defines a native mb_str_pad(), so rename this helper there.
function mb_str_pad(...$args) {
    $args[1] += strlen($args[0]) - mb_strwidth($args[0]);
    return str_pad(...$args);
}
Footnote
Many East Asian characters take 3 bytes in UTF-8, have a display width of 2, and a character length of 1.
If your format is %4s and the input is one such character, you need two spaces of padding, not three.
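The footnote can be checked with a few lines. This is only a sketch: the helper name `padToWidth` is made up (chosen so it cannot clash with a built-in), and it assumes UTF-8 input and a single-byte padding character.

```php
<?php
// Hypothetical helper: left-pad a string to a given display width.
function padToWidth(string $input, int $width, string $padChar = ' '): string
{
    // str_pad() counts bytes, so enlarge the target by (bytes - display width).
    $target = $width + strlen($input) - mb_strwidth($input, 'UTF-8');
    return str_pad($input, $target, $padChar, STR_PAD_LEFT);
}

$char = '日';
var_dump(strlen($char));               // bytes
var_dump(mb_strlen($char, 'UTF-8'));   // characters
var_dump(mb_strwidth($char, 'UTF-8')); // display columns

// Both lines should occupy four columns between the brackets.
echo '[', padToWidth($char, 4), "]\n";
echo '[', padToWidth('ab', 4), "]\n";
```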

get http url parameter without auto decoding using PHP

I have a URL like
test.php?x=hello+world&y=%00h%00e%00l%00l%00o
When I write the parameters to files, they are decoded:
file_put_contents('x.txt', $_GET['x']); // -->hello world
file_put_contents('y.txt', $_GET['y']); // -->\0h\0e\0l\0l\0o
but I need to write them without decoding:
file_put_contents('x.txt', ????); // -->hello+world
file_put_contents('y.txt', ????); // -->%00h%00e%00l%00l%00o
How can I do this?
Thanks

You can get unencoded values from the $_SERVER["QUERY_STRING"] variable.
function getNonDecodedParameters() {
    $a = array();
    foreach (explode ("&", $_SERVER["QUERY_STRING"]) as $q) {
        $p = explode ('=', $q, 2);
        $a[$p[0]] = isset ($p[1]) ? $p[1] : '';
    }
    return $a;
}
$input = getNonDecodedParameters();
file_put_contents('x.txt', $input['x']);

Because the $_GET and $_REQUEST superglobals are automatically run through a decoding function (equivalent to urldecode()), you simply need to re-urlencode() the data to get it to match the characters passed in the URL string:
file_put_contents('x.txt', urlencode($_GET['x'])); // -->hello+world
file_put_contents('y.txt', urlencode($_GET['y'])); // -->%00h%00e%00l%00l%00o
I've tested this out locally and it's working perfectly. However, from your comments, you might want to look at your encoding settings as well. If the result of urlencode($_GET['y']) is %5C0h%5C0e%5C0l%5C0l%5C0o then it appears that the null character that you're passing in (%00) is being interpreted as a literal string "\0" (like a \ character concatenated to a 0 character) instead of correctly interpreting the \0 as a single null character.
You should have a look at the PHP documentation on string encoding and ASCII device control characters.
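One caveat worth checking: re-encoding only reproduces the original text when the sender used the same escaping rules as urlencode(). A small sketch, with sample inputs made up for illustration:

```php
<?php
// Hypothetical inputs: values as different clients might have encoded them.
$rawValues = ['hello+world', 'hello%20world', '%7e%2F'];

foreach ($rawValues as $raw) {
    // Simulate what PHP does for $_GET, then encode again.
    $roundTrip = urlencode(urldecode($raw));
    printf(
        "%-15s -> %-15s %s\n",
        $raw,
        $roundTrip,
        $raw === $roundTrip ? 'same' : 'differs'
    );
}
```

Spaces sent as %20 and lowercase hex digits do not survive the round trip, so when the bytes must match exactly, reading $_SERVER["QUERY_STRING"] is the safer route.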

I think you can use urlencode() to pass the value in the URL and urldecode() to get the value.

PHP - smart, error tolerating string comparison

I'm looking for either a routine or an approach for error-tolerant string comparison.
Let's say we have the test string Čakánka (yes, it contains Central European characters).
Now, I want to accept any of the following strings as OK:
cakanka
cákanká
ČaKaNKA
CAKANKA
CAAKNKA
CKAANKA
cakakNa
The problem is that I often transpose letters in a word, and I want to minimize the user's frustration at not being able to type a word correctly (e.g. when in a rush).
So, I know how to do a case-insensitive comparison (just make it lowercase :]) and I can strip the CE characters; I just can't wrap my head around tolerating a few transposed characters.
Also, you often put a character not just one place off (character=>cahracter), but sometimes shift it by several places (character=>carahcter), just because one finger was lazy while typing.
Thank you :]

Not sure (especially about the accents / special characters, which you might have to deal with first), but for characters that are in the wrong place or missing, the levenshtein function, which calculates the Levenshtein distance between two strings, might help you (quoting):
int levenshtein ( string $str1 , string $str2 )
int levenshtein ( string $str1 , string $str2 , int $cost_ins , int $cost_rep , int $cost_del )
The Levenshtein distance is defined as
the minimal number of characters you
have to replace, insert or delete to
transform str1 into str2
Other possibly useful functions could be soundex, similar_text, or metaphone.
And some of the user notes on the manual pages of those functions, especially the manual page of levenshtein might bring you some useful stuff too ;-)
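A rough sketch of how this could be combined with case and accent folding is shown below. The helper name `isCloseEnough`, the tiny accent map and the threshold of 2 are all assumptions for illustration, not a tuned solution.

```php
<?php
// Hypothetical helper: fold case and a few accents, then compare by edit distance.
function isCloseEnough(string $input, string $expected, int $maxDistance = 2): bool
{
    $fold = function (string $s): string {
        // Illustrative accent map only; extend it for the characters you expect.
        $map = ['č' => 'c', 'á' => 'a'];
        return strtr(mb_strtolower($s, 'UTF-8'), $map);
    };

    // levenshtein() works on bytes, which is fine once the input is plain ASCII.
    return levenshtein($fold($input), $fold($expected)) <= $maxDistance;
}

foreach (['cakanka', 'cákanká', 'ČaKaNKA', 'CAAKNKA', 'banana'] as $candidate) {
    var_dump(isCloseEnough($candidate, 'Čakánka'));
}
```

A swap of two neighbouring letters counts as two edits here, which is why the threshold is set to 2 rather than 1.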

You could transliterate the words to Latin characters and use a phonetic algorithm like Soundex to get the essence of your word and compare it to the ones you have. In your case that would be C252 for all of your words except the last one that is C250.
Edit    The problem with comparative functions like levenshtein or similar_text is that you need to call them for each pair of input value and possible matching value. That means that if you have a database with 1 million entries, you will need to call these functions 1 million times.
But functions like soundex or metaphone, which calculate some kind of digest, can help to reduce the number of actual comparisons. If you store the soundex or metaphone value for each known word in your database, you can narrow down the possible matches very quickly. Later, when the set of possible matching values has been reduced, you can use the comparative functions to get the best match.
Here’s an example:
// building the index that represents your database
$knownWords = array('Čakánka', 'Cakaka');
$index = array();
foreach ($knownWords as $key => $word) {
    $code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
    if (!isset($index[$code])) {
        $index[$code] = array();
    }
    $index[$code][] = $key;
}
// test words
$testWords = array('cakanka', 'cákanká', 'ČaKaNKA', 'CAKANKA', 'CAAKNKA', 'CKAANKA', 'cakakNa');
echo '<ul>';
foreach ($testWords as $word) {
    $code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
    if (isset($index[$code])) {
        echo '<li> '.$word.' is similar to: ';
        $matches = array();
        foreach ($index[$code] as $key) {
            similar_text(strtolower($word), strtolower($knownWords[$key]), $percentage);
            $matches[$knownWords[$key]] = $percentage;
        }
        arsort($matches);
        echo '<ul>';
        foreach ($matches as $match => $percentage) {
            echo '<li>'.$match.' ('.$percentage.'%)</li>';
        }
        echo '</ul></li>';
    } else {
        echo '<li>no match found for '.$word.'</li>';
    }
}
echo '</ul>';

Spelling checkers do something like fuzzy string comparison. Perhaps you can adapt an algorithm based on that reference. Or grab the spell checker guessing code from an open source project like Firefox.
