preg_replace not working for utf-8 Arabic text - php

I am writing a PHP function to check for the existence of bad whole words (whole words, not substrings) and also to highlight those whole words in a given string.
function badwordChecherAndHighLighter($str,$replace){
// $replace=1 will Highlight
// $replace=0 will Check the existence of any badwords
$result = mysql_query("SELECT settings_badwords_en,settings_badwords_ar FROM settings_badwords WHERE settings_badwords_status=1") or die(mysql_error());
// i dont create an array, may create overhead, so i directly apply in preg_replace
if($replace==1){
while($row = mysql_fetch_row($result))
{
//$str=preg_replace('/'.$row[0].'/i', str_repeat("*",strlen($row[0])), $str);
$str=preg_replace('/\b('.$row[0].'\b)/i',"" .$row[0] . "" , $str);
$str=preg_replace('/\b('.$row[1].'\b)/i',"" .$row[1] . "" , $str);
}
return $str;
}else{
while($row = mysql_fetch_row($result))
{
if(preg_match('/\b('.$row[0].'\b)/i',$str)) return 1;
if(preg_match('/\b('.$row[1].'\b)/i',$str)) return 1;
}
return 0;
}
}
// $row[1] contains Arabic bad words, and $row[0] contains English bad words.
This function gives correct results on Windows (WAMP5 1.7.3) for both Arabic and English.
But on the web server it only works for English words, not for Arabic.
So if Arabic text is given to this function, it can neither detect a bad word nor highlight an Arabic word.
I searched and tried many options, including \u, but got no error and no success.
So please help.

The \b word boundary does not work reliably with UTF-8 characters. Try this:
preg_match('/(?<=^|[^\p{L}])' . preg_quote($utf8word,'/') . '(?=[^\p{L}]|$)/ui',$utf8string);
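A minimal sketch of the same idea wrapped in a function, assuming PHP 7 or later with PCRE Unicode support. The name containsWholeWord and the sample strings are made up for illustration. It uses the negative lookarounds (?<!\p{L}) and (?!\p{L}), which should behave like the pattern above without needing the ^ and $ alternatives:

```php
<?php
// Hypothetical helper: true if $word occurs in $text as a whole word,
// i.e. not directly preceded or followed by another Unicode letter.
function containsWholeWord(string $text, string $word): bool
{
    // preg_quote() escapes regex metacharacters in the word;
    // the u modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = '/(?<!\p{L})' . preg_quote($word, '/') . '(?!\p{L})/ui';
    return preg_match($pattern, $text) === 1;
}

var_dump(containsWholeWord('هذا نص تجريبي', 'نص')); // bool(true): whole word
var_dump(containsWholeWord('هذا نصوص', 'نص'));      // bool(false): only a substring
```

For highlighting, the same pattern could be passed to preg_replace instead of preg_match.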

Related

php ucfirst ucwords discussion [duplicate]

I just wanted to share my experience of needing a language-independent version of ucfirst.
The problem arises when you mix English text with Japanese, Chinese or other languages (in my case sometimes Swedish, with ÅÄÖ): the traditional ucfirst has trouble capitalizing the string.
Some time ago, however, I stumbled across the following code snippet here on Stack Overflow:
function myucfirst($str) {
$fc = mb_strtoupper(mb_substr($str, 0, 1));
return $fc.mb_substr($str, 1);
}
It works fine in most cases, but recently I also needed to autogenerate translated texts in dynamic PDFs using TCPDF.
That is when I started banging my head over why TCPDF had issues with the text. I had no problems anywhere else and the character encoding was UTF-8, but it still broke.
When showing kanji for Japanese, I simply skipped the above function, but then with Swedish I hit the same failure when I needed to capitalize ÅÄÖ.
That led me to realize that the problem with the function above is that it only looks at the first character. Each of Å, Ä and Ö takes up 2 bytes in UTF-8, and Chinese or Japanese kanji take up 3 bytes. The function above did not account for that, which broke TCPDF.
To give more context: when generating PDF documents with TCPDF, the font ends up with errors because the general mb_string function turns the first character into "?�"vrigt for the Swedish word Övrigt and, for instance, "?��"のととろ for the Japanese 隣のトトロ (My Neighbour Totoro). This makes the font translation for the � fail. You need to convert the first two bytes, substr($str, 0, 2), for ÅÄÖ to be converted properly.
Also, I am not sure whether you can see the code examples I gave, but since neither Chinese nor Japanese uses upper-case letters, I exclude every character that takes 3 bytes, as such characters have no upper or lower case at all. I don't really want to exclude them, but passing them through the mb_string functions leads to similar errors in TCPDF, so my examples are a workaround for now, unless someone has a better solution.
So my approach was to solve the above problem with the following function.
function myucfirst($str) {
if ($str[0] !== "?"){
for($i = 1; $i <= 3; $i++){
$first = substr($str, 0, $i);
$first = mb_convert_case($first, MB_CASE_UPPER, "UTF-8");
if ($first !== '?'){
$rest = substr($str, $i);
break;
}
}
if ($i < 3){
$ret_string = $first . $rest;
} else {
$ret_string = $str;
}
} else {
$ret_string = $str;
}
return $ret_string;
}
Thanks to Steven Penny's help below, this is the solution that works with both Swedish and Japanese/Chinese characters, even when the string is used with the TCPDF library for dynamically creating PDFs:
function myucfirst($str) {
$ret_string = mb_convert_case($str, MB_CASE_TITLE, 'UTF-8');
return $ret_string;
}
And the following is a similar fix for ucwords:
function myucwords($str){
$str = trim($str);
// Capitalize each space-separated word with myucfirst and join the words again.
// (The original loop appended to $ret_str before it was defined.)
return implode(' ', array_map('myucfirst', explode(' ', $str)));
}
myucwords uses myucfirst to capitalize each word.
Since I am not that experienced as a developer or as a Stack Overflow contributor: you should be able to see 3 code examples, and I would really appreciate hearing of better ways to write these functions. For now, for those who have a similar problem, please enjoy!
/Chris
The examples you gave are poor, as with Övrigt the input is exactly the same
as the output. So I modified the examples so they can be useful. See below:
<?php
# example 1
$s1 = mb_convert_case('åäö', MB_CASE_TITLE);
# example 2
$s2 = mb_convert_case('övrigt', MB_CASE_TITLE);
# example 3
$s3 = mb_convert_case('隣のトトロ', MB_CASE_TITLE);
# print
var_dump($s1 == 'Åäö', $s2 == 'Övrigt', $s3 == '隣のトトロ');
Note you will need this in your php.ini, if it's not already there:
extension = mbstring
https://php.net/function.mb-convert-case
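For comparison, a minimal sketch of a ucfirst that only touches the first character, assuming the mbstring extension is loaded and the input is valid UTF-8 (mbUcfirst is a made-up name). MB_CASE_TITLE capitalizes the first letter of every word and lower-cases the rest, so it is closer to ucwords than to ucfirst. Passing the encoding explicitly avoids depending on mb_internal_encoding():

```php
<?php
// Hypothetical helper: upper-case only the first character of a string.
function mbUcfirst(string $str, string $encoding = 'UTF-8'): string
{
    $first = mb_substr($str, 0, 1, $encoding);    // first character, not first byte
    $rest  = mb_substr($str, 1, null, $encoding); // everything after it, unchanged
    return mb_strtoupper($first, $encoding) . $rest;
}

var_dump(mbUcfirst('övrigt text') === 'Övrigt text'); // bool(true)
var_dump(mbUcfirst('隣のトトロ') === '隣のトトロ');      // bool(true): kanji have no case
```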

PHP - Find occurrence in array and then place the replaced part at the start

Here is my code:
function TranslatedTitle($Title) {
ConnectWithMySQLDatabase();
$v = mysql_query("SELECT * FROM `ProductTranslations`");
while($vrowis = mysql_fetch_array($v)){
$English[] = $vrowis['English'];
$Bulgarian[] = $vrowis['Bulgarian'];
}
$TranslatedTitle = str_replace($English, $Bulgarian, $Title);
return $TranslatedTitle;
}
I am using this code to fetch data from a MySQL table, search for certain English phrases, and replace each one with its Bulgarian counterpart.
Example:
I have very big blue eyes.
Will be translated to:
I have very големи сини eyes.
It takes the phrase big blue and replaces it with големи сини at the position where it was found.
How can I instead move the replaced part to the beginning of the string, so that the final result for my example is големи сини I have very eyes.
The sentence in the example has no meaning; I created it only as an example.
I would try looping through the $English array and when finding the matching word move it to the beginning, then translating... something like:
foreach($English as $word){
$pos = strpos($Title, $word);
if ($pos !== false) {
// English phrase found: put it at the front and remove it from its old position
$Title = $word . ' ' . str_replace($word, '', $Title);
break;
}
}
Then
$TranslatedTitle = str_replace($English, $Bulgarian, $Title);
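The two steps above can also be combined. The following is only a sketch under stated assumptions: translateAndMoveToFront is a made-up name, the $dictionary array stands in for the rows of the ProductTranslations table, and only the first phrase found is moved.

```php
<?php
// Hypothetical helper: translate the first matching phrase and move it to the front.
// $dictionary maps English phrases to their Bulgarian translations.
function translateAndMoveToFront(string $title, array $dictionary): string
{
    foreach ($dictionary as $english => $bulgarian) {
        $pos = mb_strpos($title, $english, 0, 'UTF-8');
        if ($pos === false) {
            continue; // phrase not present, try the next one
        }
        // Cut the phrase out of the title.
        $rest = mb_substr($title, 0, $pos, 'UTF-8')
              . mb_substr($title, $pos + mb_strlen($english, 'UTF-8'), null, 'UTF-8');
        // Collapse the double space left behind by the removal.
        $rest = trim(preg_replace('/\s+/u', ' ', $rest));
        return $bulgarian . ' ' . $rest;
    }
    return $title; // nothing to translate
}

echo translateAndMoveToFront('I have very big blue eyes.', ['big blue' => 'големи сини']);
// големи сини I have very eyes.
```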
First off, you will want to use PDO to interact with your database. The mysql_ functions are deprecated (and were removed in PHP 7), are bad practice, and make it easy to write code that is vulnerable to SQL injection. You can manipulate your strings using strpos; see php.net/manual/en/function.strpos.php. The steps are: find the text to replace, translate it, remove the word from wherever it is using $strip = str_replace($word, "", $Title), and finally build your result in a new variable like this: $variable = $translate . $strip. Hope that helps.

Normalize Name-Surname strings: PHP+REGEX (Spanish chars- UTF8)

I have strings containing a name and surname which I need to normalize with a function so that they look like:
Name Surname (I can receive strings like NAME SURNAME, Name SURNAME, etc.)
I've found this snippet:
echo nameize("HÉCTOR MAÑAÇ");
function nameize($str,$a_char = array("'","-"," ")){
//$str contains the complete raw name string
//$a_char is an array containing the characters we use as separators for capitalization. If you don't pass anything, there are three in there as default.
$string = strtolower($str);
foreach ($a_char as $temp){
$pos = strpos($string,$temp);
if ($pos){
//we are in the loop because we found one of the special characters in the array, so lets split it up into chunks and capitalize each one.
$mend = '';
$a_split = explode($temp,$string);
foreach ($a_split as $temp2){
//capitalize each portion of the string which was separated at a special character
$mend .= ucfirst($temp2).$temp;
}
$string = substr($mend,0,-1);
}
}
return ucfirst($string);
}
Which works pretty well but, as you can see by testing this exact example, it doesn't handle Spanish characters (UTF-8). I've tried mb_regex_encoding("UTF-8"); mb_internal_encoding("UTF-8");, UTF-8 headers, etc., but I can't make it work with the "special" Spanish characters.
Any suggestion?
I can't see where you use the multibyte string functions.
Maybe this would suit your needs:
echo mb_convert_case("HÉCTOR MAÑAÇ", MB_CASE_TITLE, "UTF-8");
output:
Héctor Mañaç
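If the custom separators of the original nameize() matter, a multibyte variant could look like the sketch below. It assumes UTF-8 input and the mbstring extension; nameizeUtf8 is a made-up name and the second sample name is invented for illustration.

```php
<?php
// Hypothetical multibyte variant of nameize(): lower-cases the whole string,
// then upper-cases the first character after each separator.
function nameizeUtf8(string $name, string $separators = "'- "): string
{
    $lower = mb_strtolower($name, 'UTF-8');
    // Split on any separator character, keeping the separators in the result.
    $pattern = '/([' . preg_quote($separators, '/') . '])/u';
    $parts = preg_split($pattern, $lower, -1, PREG_SPLIT_DELIM_CAPTURE);

    $result = '';
    foreach ($parts as $part) {
        // Upper-casing a separator leaves it unchanged, so no special case is needed.
        $result .= mb_strtoupper(mb_substr($part, 0, 1, 'UTF-8'), 'UTF-8')
                 . mb_substr($part, 1, null, 'UTF-8');
    }
    return $result;
}

echo nameizeUtf8("HÉCTOR MAÑAÇ"), "\n";   // Héctor Mañaç
echo nameizeUtf8("o'connor-lópez"), "\n"; // O'Connor-López
```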
Your function also works fine for the given example. Please check your file's encoding; it must be UTF-8. You can check it in Notepad++.

PHP arabic text compare using strpos

I have an Arabic keyword in a MySQL table, like
*#1591; *#1610; *#1585;*#1575;*#1606
// Please read & in place of * above; with '&' the value is automatically rendered as Arabic.
MySQL table encoding: utf8_general_ci
I am getting some strings from external sources, for example Twitter.
I would like to match the keyword against the tweet I am getting.
$tweet = 'وينج وأداسي الاماراتية توقعان اتفاقية تعاون لتوفير أنظمة الطائرات بدون طيا';
$keyword = '*#1591; *#1610; *#1585;*#1575;*#1606'; //From db
$status = strpos($tweet, $keyword);
$status is always false.
I have tried utf8_encode(), utf8_decode() and mb_strpos() without any luck.
I know I need to convert both strings to one common format before comparing them, but which format should I convert to?
Please help me with this.
As Arabic symbols are encoded as multibyte characters, you must use functions that support them: grapheme_strpos and mb_strpos (in that order of preference).
Using them instead of plain old strpos will do the trick.
Also, keep in mind that you may have to check that they are available before using them, as not all hosted environments have them enabled:
if (function_exists('grapheme_strpos')) {
$pos = grapheme_strpos($tweet, $keyword);
} elseif (function_exists('mb_strpos')) {
$pos = mb_strpos($tweet, $keyword);
} else {
$pos = strpos($tweet, $keyword);
}
And last but not least, check the docs for the different arguments these functions take, such as the encoding used by the strings.
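One more thing may be worth checking. If the keyword really is stored as numeric HTML entities, as the question describes, it is a plain ASCII string and cannot match Arabic text until it is decoded. A minimal sketch, assuming complete entities (each ending in a semicolon) and using a made-up sample tweet:

```php
<?php
// Keyword stored as numeric HTML entities, in the form the question describes.
$storedKeyword = '&#1591;&#1610;&#1585;&#1575;&#1606;';
// Made-up sample text for illustration.
$tweet = 'شركة طيران جديدة';

// Decode the entities into real UTF-8 characters before comparing.
$keyword = html_entity_decode($storedKeyword, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump(mb_strpos($tweet, $storedKeyword, 0, 'UTF-8'));     // bool(false)
var_dump(mb_strpos($tweet, $keyword, 0, 'UTF-8') !== false); // bool(true)
```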

Regexp and variable

I have this form, which submits some letters and a word length. But I've got some problems getting the right output from my database.
if ($_SERVER['REQUEST_METHOD'] == 'POST') {
$letters = mysql_real_escape_string($_POST['letters']);
$length = mysql_real_escape_string($_POST['length']);
echo "Letters: $letters";
echo "Lengte: $length";
$res=mysql_query("SELECT word FROM words WHERE word REGEXP '[$letters]{$length}' ")
or die ('Error: '.mysql_error ());
while ($row=mysql_fetch_array($res)){
echo $row['word'];
echo "<br />";
}
}
else {
echo "Foutje";
}
If I replace $length with the integer that was entered in the form, my script works. Copy/pasting [$letters] 6 times also works. I guess there is a problem with the quotes, but I totally can't figure out what exactly it is.
Can anyone see what I did wrong?
Thanks.
The {} are being interpreted by PHP as delimiters for the variable inside since you are using a double-quoted string. Change your quoting around with concatenation:
$res=mysql_query("SELECT word FROM words WHERE word REGEXP '[" . $letters . "]{" . $length ."}'")
Or double up the {} inside a double-quoted string so the outer pair are interpreted as literals.
$res=mysql_query("SELECT word FROM words WHERE word REGEXP '[$letters]{{$length}}' ")
Note, you should also verify that $length contains a positive integer.
if (!ctype_digit($length)) {
// error - length must be an int
}
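The difference between the variants is easy to see in isolation. A small sketch with made-up sample values:

```php
<?php
$letters = 'abc';
$length  = '6';

// {$length} is PHP's complex variable syntax, so the braces are consumed.
$broken = "[$letters]{$length}";
// Doubling the braces leaves the outer pair as literal characters.
$doubled = "[$letters]{{$length}}";
// Concatenation avoids the ambiguity altogether.
$concatenated = '[' . $letters . ']{' . $length . '}';

var_dump($broken);       // string(6) "[abc]6"
var_dump($doubled);      // string(8) "[abc]{6}"
var_dump($concatenated); // string(8) "[abc]{6}"
```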
Try doing this:
$res=mysql_query("SELECT word FROM words WHERE word REGEXP '[".$letters."]{".$length."}' ")
I have a hunch that the $ is getting interpreted as part of the regex.
