Normalize Name-Surname strings: PHP+REGEX (Spanish chars- UTF8) - php

I have strings containing a name and surname that I need to normalize with a function so that they look like:
Name Surname (I can receive strings like NAME SURNAME, Name SURNAME, etc.)
I've found this snippet:
echo nameize("HÉCTOR MAÑAÇ");
function nameize($str,$a_char = array("'","-"," ")){
//$str contains the complete raw name string
//$a_char is an array containing the characters we use as separators for capitalization. If you don't pass anything, there are three in there as default.
$string = strtolower($str);
foreach ($a_char as $temp){
$pos = strpos($string,$temp);
if ($pos){
//we are in the loop because we found one of the special characters in the array, so lets split it up into chunks and capitalize each one.
$mend = '';
$a_split = explode($temp,$string);
foreach ($a_split as $temp2){
//capitalize each portion of the string which was separated at a special character
$mend .= ucfirst($temp2).$temp;
}
$string = substr($mend,0,-1);
}
}
return ucfirst($string);
}
This works pretty well but, as you can see by testing this exact example, it doesn't handle Spanish (UTF-8) characters. I've tried mb_regex_encoding("UTF-8"); mb_internal_encoding("UTF-8");, UTF-8 headers, etc., but I can't make it work with "special" Spanish characters.
Any suggestions?

I can't see where you use the Multibyte String Functions.
Maybe this would suit your needs:
echo mb_convert_case("HÉCTOR MAÑAÇ", MB_CASE_TITLE, "UTF-8");
output:
Héctor Mañaç
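If the separators from the original function (apostrophe, hyphen, space) still matter, one way to keep that behaviour is to do the case conversion with mbstring and find the separators with a /u regex. This is only a minimal sketch, assuming UTF-8 input and the mbstring extension; nameize_utf8 is a hypothetical name, not part of the original snippet.

```php
<?php
// Hypothetical helper: lower-cases a UTF-8 name, then upper-cases the first
// letter of the string and every letter that follows whitespace, ' or -.
function nameize_utf8($str)
{
    $lower = mb_strtolower($str, 'UTF-8');

    return preg_replace_callback(
        '/(^|[\s\'-])(\p{L})/u',
        function ($m) {
            // $m[1] is the separator (or empty at the start), $m[2] the letter
            return $m[1] . mb_strtoupper($m[2], 'UTF-8');
        },
        $lower
    );
}

echo nameize_utf8("HÉCTOR MAÑAÇ"), "\n";
echo nameize_utf8("maría-josé o'NEIL"), "\n";
```

Unlike the original, the separators are fixed in the regex here; they could be passed in as a parameter if needed.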

Your function also works fine for the given example. Please check your file's encoding. It must be UTF-8. You can check it in Notepad++.

Related

PHP str_replace unintentionally removing Chinese characters

I have a PHP script that removes special characters, but unfortunately some Chinese characters are also removed.
<?php
function removeSpecialCharactersFromString($inputString){
$inputString = str_replace(str_split('#/\\:*?\"<>|[]\'_+(),{}’! &'), "", $inputString);
return $inputString;
}
$test = '赵景然 赵景然';
print(removeSpecialCharactersFromString($test));
?>
Oddly, the output is 赵然 赵然. The character 景 is removed.
In addition, 陈 and 一 are also removed. What might be the cause?
The string you're using as the list of things to replace doesn't work well with the mixed encoding: str_split splits it byte by byte, so the multibyte ’ is broken into single bytes that also occur inside some Chinese characters. What I've done is to convert this string to UTF-16 and then split it.
function removeSpecialCharactersFromString($inputString){
$inputString = str_replace(str_split(
mb_convert_encoding('#/\\:*?\"<>|[]\'_+(),{}’! &', 'UTF16')), "", $inputString);
return $inputString;
}
$test = '#赵景然 赵景然';
print(removeSpecialCharactersFromString($test));
Which gives...
赵景然赵景然
BTW, str_replace is multibyte safe, sort of (see the note at http://php.net/manual/en/ref.mbstring.php#109937).
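Another way to approach this is to split the list of special characters into whole UTF-8 characters rather than bytes, so that ’ stays in one piece. The following is a minimal sketch, assuming the script file and the input are both UTF-8; removeSpecialCharactersUtf8 is a hypothetical name.

```php
<?php
// Hypothetical variant of the function above that splits the list of special
// characters per UTF-8 character instead of per byte.
function removeSpecialCharactersUtf8($inputString)
{
    $specials = '#/\\:*?"<>|[]\'_+(),{}’! &';

    // An empty pattern with the /u modifier splits between UTF-8 characters
    $chars = preg_split('//u', $specials, -1, PREG_SPLIT_NO_EMPTY);

    return str_replace($chars, '', $inputString);
}

echo removeSpecialCharactersUtf8('#赵景然 赵景然'), "\n";
echo removeSpecialCharactersUtf8('陈 一’'), "\n";
```

On PHP 7.4 or later, mb_str_split($specials, 1, 'UTF-8') should do the same job as the preg_split call.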

PHP Arabic text compare using strpos

I have an Arabic keyword in a MySQL table like
*#1591; *#1610; *#1585;*#1575;*#1606
// Please read & in place of *; a value written with '&' is automatically converted into Arabic.
MySQL table encoding: utf8_general_ci
I am getting strings from external sources, for example Twitter.
I would like to match the keyword against the tweet I am getting.
$tweet = 'وينج وأداسي الاماراتية توقعان اتفاقية تعاون لتوفير أنظمة الطائرات بدون طيا';
$keyword = '*#1591; *#1610; *#1585;*#1575;*#1606'; //From db
$status = strpos($tweet, $keyword);
$status is always false.
I have tried utf8_encode(), utf8_decode() and mb_strpos() without any luck.
I know I need to convert both strings to one common format before comparing them, but which format should I convert them to?
Please help me with this.
As Arabic symbols are encoded as multibyte characters, you must use functions that support them: grapheme_strpos and mb_strpos (in that order of preference).
Using them instead of plain old strpos will do the trick.
Also, keep in mind that you may have to check that they are available before using them, as not all hosting environments have them enabled:
if (function_exists('grapheme_strpos')) {
$pos = grapheme_strpos($tweet, $keyword);
} elseif (function_exists('mb_strpos')) {
$pos = mb_strpos($tweet, $keyword);
} else {
$pos = strpos($tweet, $keyword);
}
And last but not least, check the docs for the different arguments these functions take, such as the encoding used by the strings.
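Since the keyword in the question appears to be stored as numeric HTML entities (&#1591; and so on) while the tweet is plain UTF-8 text, one possible "common format" is to decode the entities before searching. This is a sketch under that assumption, not a confirmed diagnosis; the sample keyword and tweet below are made up for illustration, and every entity is terminated with a semicolon.

```php
<?php
// Keyword as it might come from the database: numeric HTML entities.
// (Made-up sample value; the real column content may differ.)
$keywordFromDb = '&#1591;&#1610;&#1585;';

// Decode the entities into real UTF-8 characters
$keyword = html_entity_decode($keywordFromDb, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Made-up sample tweet that contains the keyword
$tweet = 'هذا طير جميل';

// Both strings are UTF-8 now, so a multibyte-aware search can compare them
$pos = mb_strpos($tweet, $keyword, 0, 'UTF-8');

var_dump($pos !== false);
```

Remember to compare the result with false using !==, because a match at position 0 is also falsy.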

php: converting from cp1251 to utf8

I have a problem converting a string from cp1251 to utf8...
I need to get some names from a database, and those names are in cp1251 (I'm not the one who made that database, so I can't edit it, but I know for sure that these names are cp1251)...
The name in database is this - "Р?нтернет РІ цифрах"
I'm converting it to utf8 using iconv function like this:
iconv("UTF-8", "CP1251//IGNORE", $name)
and what I have in the result is this - "�?нтернет в цифрах"(it's Russian), but the first two symbols are not correct... it should be "Интернет в цифрах"...
So the final thing I have to do is somehow change these two symbols "�?" to the Russian letter "И"... and I really don't know how to do that... I've tried to use preg_replace, but it doesn't work... or I'm not using it correctly.
And I'm sorry for Russian letters, it is really hard to explain what I need without showing them.
The first letter comes out wrong because one of the bytes needed to store the UTF-8 encoding of И (0x98, to be exact) is not used in CP1251. If the database has replaced the 0x98 byte with a question mark, you have to change it back before using iconv:
$name = str_replace("\xD0\x3F", "\xD0\x98", $name);
echo iconv("UTF-8", "CP1251//IGNORE", $name);
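To see why converting "from UTF-8 to CP1251" repairs the text, it may help to look at the bytes. The sketch below assumes PHP 7 or later (for the \u{...} escapes) and the iconv extension; it uses the "РІ" from the question, which is what the UTF-8 bytes of "в" look like when they are read as CP1251.

```php
<?php
// "РІ" (U+0420 U+0406) is the mojibake form of "в" (U+0432)
$mojibake = "\u{0420}\u{0406}";

// Encoding the mojibake as CP1251 gives back the original raw bytes D0 B2,
// which are exactly the UTF-8 encoding of "в"
$fixed = iconv('UTF-8', 'CP1251', $mojibake);

echo bin2hex($fixed), "\n"; // d0b2
echo $fixed, "\n";          // в

// The problem letter: И is D0 98 in UTF-8, and 0x98 is unassigned in CP1251
echo bin2hex("\u{0418}"), "\n"; // d098
```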
use this:
mb_convert_encoding($model->text, 'cp1252', 'utf8')
Try this:
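function cp1251_to_utf8($s){
// Converts CP1251 Cyrillic letters to UTF-8; all other bytes are copied unchanged.
$t = ''; // initialise the output buffer
$c209 = chr(209); $c208 = chr(208); $c129 = chr(129); $c145 = chr(145);
for($i=0; $i<strlen($s); $i++) {
$c=ord($s[$i]);
if ($c>=192 and $c<=239) $t.=$c208.chr($c-48); // А..п -> D0 90..D0 BF
elseif ($c>239) $t.=$c209.chr($c-112); // р..я -> D1 80..D1 8F
elseif ($c==184) $t.=$c209.$c145; // ё -> D1 91
elseif ($c==168) $t.=$c208.$c129; // Ё -> D0 81
else $t.=$s[$i];
}
return $t;
}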

PHP regex issue: cannot find $C

I'm trying to parse dollar amounts from a text in mixed French (Canadian) and English. The text is in UTF-8. They use $C to denote currency. For some reason, when I use preg_match, neither the '$' nor the 'C' can be found. Everything else works fine. Any ideas?
e.g. use
preg_match_all('/\$C/u', $match)
on "Thanks for a payment of 46,00 $C" returns empty.
I think the regex can't find those characters because they aren't there. If you initialize the string like this:
$source = "Thanks for a payment of 46,00 $C";
...(i.e., as a double-quoted string literal), $C gets interpreted as a variable name. Since you never initialized that variable, it gets replaced with nothing in the actual string. You should either use single-quotes to initialize the string, or escape the dollar sign with a backslash like you did in the regex.
By the way, this couldn't be an encoding problem, because (in the example, at least), all the characters are from the ASCII character set. Whether it was encoded as UTF-8, ISO-8859-1 or ASCII, the binary representation of the string would be identical.
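A small sketch of the two safe ways to write the literal, to make the point about quoting concrete (the variable names are only for illustration):

```php
<?php
// Single quotes: no variable interpolation, so $C stays in the string
$single = 'Thanks for a payment of 46,00 $C';

// Double quotes with the dollar sign escaped: same resulting string
$escaped = "Thanks for a payment of 46,00 \$C";

var_dump($single === $escaped);

// The pattern from the question finds "$C" in either of them
var_dump(preg_match('/\$C/u', $single));
```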
preg_match_all('/\$C/u', 'Thanks for a payment of 46,00 $C', $matches);
print_r($matches);
works fine for me:
Array
(
[0] => Array
(
[0] => $C
)
)
Maybe this helps:
// assuming $text is the input string
$matches = array();
// preg_match_all returns the number of matches; $matches itself is never empty
if (preg_match_all('/([0-9,\\.]+)\\s*\\$C/u', $text, $matches) > 0) {
$price = floatval(str_replace(',', '.', $matches[1][0]));
printf("%.2f\n", $price);
} else {
printf("No price found\n");
}
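Just make sure the input string ($text) is valid UTF-8, because the /u modifier requires it. (If it is in another encoding such as ISO-8859-1, convert it to UTF-8 first, for example with mb_convert_encoding; utf8_decode goes the other way, from UTF-8 to ISO-8859-1.)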

preg_replace not working for utf-8 Arabic text

I am writing a PHP function to check for the existence of bad whole words (whole words, not substrings) and also to highlight those whole words in a given string.
function badwordChecherAndHighLighter($str,$replace){
// $replace=1 will Highlight
// $replace=0 will Check the existence of any badwords
$result = mysql_query("SELECT settings_badwords_en,settings_badwords_ar FROM settings_badwords WHERE settings_badwords_status=1") or die(mysql_error());
// i dont create an array, may create overhead, so i directly apply in preg_replace
if($replace==1){
while($row = mysql_fetch_row($result))
{
//$str=preg_replace('/'.$row[0].'/i', str_repeat("*",strlen($row[0])), $str);
$str=preg_replace('/\b('.$row[0].'\b)/i',"" .$row[0] . "" , $str);
$str=preg_replace('/\b('.$row[1].'\b)/i',"" .$row[1] . "" , $str);
}
return $str;
}else{
while($row = mysql_fetch_row($result))
{
if(preg_match('/\b('.$row[0].'\b)/i',$str)) return 1;
if(preg_match('/\b('.$row[1].'\b)/i',$str)) return 1;
}
return 0;
}
}
// $row[1] contains Arabic bad words, and $row[0] contains English bad words.
This function gives correct results on Windows (WAMP5 1.7.3) for both Arabic and English.
But on the web server it only works for English words, not for Arabic.
So if Arabic text is given to this function, it can neither detect a bad word nor highlight an Arabic word.
I searched and tried many options, including \u, but got no error and no success.
So please help.
\b does not work reliably with UTF-8 characters. Try this:
preg_match('/(?<=^|[^\p{L}])' . preg_quote($utf8word,'/') . '(?=[^\p{L}]|$)/ui',$utf8string);
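As a sketch of how that pattern could replace both the check and the highlight branches, the two helpers below use negative lookarounds, (?<!\p{L}) and (?!\p{L}), which are meant to be equivalent to the lookarounds above. The function names and the [ ] markers are hypothetical; the question's own highlight markup is not shown, so plain brackets stand in for it.

```php
<?php
// Hypothetical helper: builds a whole-word pattern for a UTF-8 word.
// A "word boundary" here means: no Unicode letter directly before or after.
function wholeWordPattern($word)
{
    return '/(?<!\p{L})' . preg_quote($word, '/') . '(?!\p{L})/ui';
}

// Returns true if $word occurs in $text as a whole word
function containsWholeWord($text, $word)
{
    return preg_match(wholeWordPattern($word), $text) === 1;
}

// Wraps every whole-word occurrence in placeholder markers
function highlightWholeWord($text, $word, $open = '[', $close = ']')
{
    return preg_replace_callback(
        wholeWordPattern($word),
        function ($m) use ($open, $close) {
            return $open . $m[0] . $close;
        },
        $text
    );
}

var_dump(containsWholeWord('هذا طير جميل', 'طير'));
var_dump(containsWholeWord('هذا طيران', 'طير'));
echo highlightWholeWord('a bad word, not badly', 'bad'), "\n";
```

Both the text and the word have to be valid UTF-8 for the /u modifier to work, so the database connection charset is worth checking on the server where the Arabic words fail.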
