Encode cyrillic UTF-8 to Unicode symbols - php

In a project's code there's this part:
$companies = $db->fetchAll("SELECT id FROM companies where contact_person like '%num%\"". $num. "\"%'");
For example, $num = 'ЦВ123456' (ЦВ are Cyrillic letters). But in the database ЦВ is stored as Unicode escape sequences, \u0426\u0412, so the query finds no matches. How can I convert $num to that escaped form so the query becomes ...contact_person like '\u0426\u0412123456 ?

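You can use PHP's built-in mb_convert_encoding() function: convert your $string to the encoding used in the database, and query with the converted value.
https://www.php.net/manual/en/function.mb-convert-encoding.php
Something like this (the second argument is the target encoding, the third is the source encoding):
$string = "ЦВ123456";
$unicode = mb_convert_encoding($string, "UTF-16BE", "UTF-8");
Note that this only changes the byte encoding. It does not produce literal \u0426\u0412 escape text; if the column holds JSON-style escapes, json_encode() is the function that generates them.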
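The \u0426\u0412 form in the question looks like JSON escaping, which is what json_encode() emits by default for non-ASCII characters. The following is a minimal sketch, assuming the column really holds JSON text and that the database is MySQL, where a backslash inside a LIKE pattern must itself be escaped. The names $escaped and $pattern, and the commented-out $pdo usage, are illustrative and not part of the original code.

```php
<?php
$num = 'ЦВ123456';

// json_encode() escapes non-ASCII characters as \uXXXX by default and wraps
// the value in double quotes; strip the quotes to get the bare escaped text.
$escaped = substr(json_encode($num), 1, -1);
echo $escaped, PHP_EOL; // \u0426\u0412123456

// Assumption: MySQL LIKE, where "\" is the escape character, so every literal
// backslash in the pattern has to be doubled.
$pattern = '%' . str_replace('\\', '\\\\', $escaped) . '%';
echo $pattern, PHP_EOL; // %\\u0426\\u0412123456%

// Hypothetical usage with a bound parameter instead of string concatenation:
// $stmt = $pdo->prepare('SELECT id FROM companies WHERE contact_person LIKE ?');
// $stmt->execute([$pattern]);
```

If the search value can contain % or _, those would also need escaping before being placed in the pattern.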

Related

how to change ascii alphabet to utf-8 in php

I have an ASCII string and would like to change its encoding to UTF-8.
But I couldn't find a simple function to change ASCII to UTF-8 in PHP.
And vice versa: I would like to change the UTF-8 alphabet to ASCII.
Please advise.
I have tried:
<?php
// utf-8
$str = "ＣＨＯＮＫＩＯＫ";
// I don't even know how to print these UTF-8 characters in PHP. I just copied and pasted the string.
// strlen($str) => 24 bytes
// mb_detect_encoding($str) => utf-8
$str2 = "CHONKIOK";
// strlen($str2) => 8 bytes
// mb_detect_encoding($str2) => ascii
// change ascii to utf-8
$str = mb_convert_encoding($str2, "UTF-8");
echo mb_detect_encoding($str);
// returns ascii
What you are doing is correct.
The documentation for mb_detect_encoding states that it detects the most likely character encoding.
Since the entire ASCII set is contained within UTF-8 at exactly the same code points, the function reports ASCII because the string technically is ASCII: its bytes are identical whether it is encoded as ASCII or UTF-8.
As you've found, once the string includes characters outside the ASCII set, the function reports the next most probable encoding.
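The point about ASCII being a subset of UTF-8 can be checked directly. This is a small sketch using mb_check_encoding(); the sample strings are illustrative.

```php
<?php
$ascii = "CHONKIOK";

// Converting pure ASCII to UTF-8 changes nothing: the bytes are identical.
$utf8 = mb_convert_encoding($ascii, "UTF-8", "ASCII");
var_dump($utf8 === $ascii);                       // bool(true)

// The same bytes are valid in both encodings.
var_dump(mb_check_encoding($ascii, "ASCII"));     // bool(true)
var_dump(mb_check_encoding($ascii, "UTF-8"));     // bool(true)

// Fullwidth letters are valid UTF-8 but not ASCII, and take 3 bytes each.
$fullwidth = "\u{FF23}\u{FF28}";                  // fullwidth "CH"
var_dump(mb_check_encoding($fullwidth, "ASCII")); // bool(false)
var_dump(strlen($fullwidth));                     // int(6)
```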
What exactly should I do to obtain this string: "ＣＨＯＮＫＩＯＫ" from "CHONKIOK"?
The characters you're after are called "Fullwidth Latin" characters.
Given that the C character provided is code point 65,315 and a regular C is code point 67, you could possibly obtain the string you're after by adding the difference of 65,248. This only works because the alphabet repeats in the same order in different parts of the Unicode charts.
You can get the code point of a character using mb_ord and convert it back to a character using mb_chr, after adding 65,248.
That might look something like:
$str_input = "ABC abc 123";
$convertable = "ABCDEFG12349abcdefg";
$str_output = "";
// Byte-wise indexing is fine here because the input is plain ASCII.
for ($i = 0; $i < strlen($str_input); $i++) {
    $char = mb_ord($str_input[$i], "UTF-8");
    // Shift only the characters listed in $convertable into the fullwidth range.
    if (str_contains($convertable, $str_input[$i])) $char += 65248;
    $str_output .= mb_chr($char, "UTF-8");
}
echo $str_output; // outputs "ＡＢＣ ａｂｃ １２３"
Just be sure to include the whole alphabet in $convertable
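Instead of listing every convertible character, the same offset can be applied to a code point range. The sketch below assumes PHP 7.4+ (for mb_str_split) and shifts all printable ASCII from U+0021 to U+007E, so punctuation is converted too, unlike the whitelist above. The function names toFullwidth and toHalfwidth are made up for this example.

```php
<?php
// Offset between ASCII and the fullwidth forms: U+FF01 - U+0021 = 65248.
const FULLWIDTH_OFFSET = 65248;

function toFullwidth(string $input): string
{
    $output = '';
    foreach (mb_str_split($input, 1, 'UTF-8') as $char) {
        $code = mb_ord($char, 'UTF-8');
        // Printable ASCII except the space: U+0021..U+007E.
        if ($code >= 0x21 && $code <= 0x7E) {
            $code += FULLWIDTH_OFFSET;
        }
        $output .= mb_chr($code, 'UTF-8');
    }
    return $output;
}

function toHalfwidth(string $input): string
{
    $output = '';
    foreach (mb_str_split($input, 1, 'UTF-8') as $char) {
        $code = mb_ord($char, 'UTF-8');
        // Fullwidth counterparts of printable ASCII: U+FF01..U+FF5E.
        if ($code >= 0xFF01 && $code <= 0xFF5E) {
            $code -= FULLWIDTH_OFFSET;
        }
        $output .= mb_chr($code, 'UTF-8');
    }
    return $output;
}

echo toFullwidth("ABC abc 123"), PHP_EOL;
echo toHalfwidth(toFullwidth("CHONKIOK")), PHP_EOL; // CHONKIOK
```

The space is left alone here, matching the original example; the ideographic space U+3000 does not follow the same offset.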
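Try this to convert to UTF-8:
utf8_encode(string $string): string
Try this to convert back:
utf8_decode(string $string): string
Note that these functions convert between ISO-8859-1 and UTF-8 (utf8_decode() targets ISO-8859-1, not ASCII), and both are deprecated as of PHP 8.2.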

How to convert E5%AE%89%E5%85%A8 to this format &#23433;&#20840; in PHP 5.1

I got a keyword var from an url like this:
search.php?keyword=%E5%AE%89%E5%85%A8
with utf8 encoding. Then I want to convert the keyword to this format:
&#23433;&#20840;
So I can use it in MySQL/PHP/
How can I achieve this?
This should work.
$html = mb_convert_encoding($_GET['keyword'], 'HTML-ENTITIES', 'UTF-8');
echo htmlspecialchars($html);
// displays &#23433;&#20840; in the browser
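The first one is URL-encoded, and what you want are HTML entities.
$string = urldecode('%E5%AE%89%E5%85%A8'); // == 安全
$string2 = htmlentities($string); // intended: &#x5B89;&#x5168; (htmlentities() emits only named entities, so on current PHP this text comes back unchanged)
$string3 = htmlspecialchars($string2); // escapes "&" once more: &amp;#x5B89;&amp;#x5168;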
The one from your question is double-encoded (& got converted to &amp;), which I assume is wrong.
23433 is decimal and equals hexadecimal 5B89; the same goes for the second code. For the browser, it doesn't matter whether the entity is decimal or hexadecimal.
If you really intend double encoding, apply htmlspecialchars() to the result of the above code.
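For decimal numeric entities specifically, mbstring also provides mb_encode_numericentity(). The sketch below assumes the keyword arrives URL-encoded as in the question; note that $_GET values are already URL-decoded by PHP, so urldecode() is only needed for a raw string like this one. The conversion map values are an assumption chosen to cover every code point above ASCII.

```php
<?php
// Raw URL-encoded value, as it appears in the query string.
$keyword = urldecode('%E5%AE%89%E5%85%A8'); // "安全" as UTF-8 bytes

// Map: encode code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF.
$convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
$entities = mb_encode_numericentity($keyword, $convmap, 'UTF-8');
echo $entities, PHP_EOL; // &#23433;&#20840;

// Round trip back to UTF-8 text.
$decoded = html_entity_decode($entities, ENT_QUOTES, 'UTF-8');
var_dump($decoded === $keyword); // bool(true)
```

This avoids the 'HTML-ENTITIES' pseudo-encoding of mb_convert_encoding(), which is deprecated as of PHP 8.2.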

How to get substring of unicode characters from mysql using php

The Unicode characters are stored in the MySQL database in this format:
&#1740;&#1729;&#1575;&#1722; &#1578;&#1608;
There are not only Unicode characters in my database but also HTML and English characters mixed in.
The problem is that I want to get part of the string from the database field 'post_body'.
I have used the following SQL query:
"SELECT SUBSTRING(post_body,1,120) as pst_body from mytable";
This query returns exactly 120 characters. But if there are Unicode symbols in the field, an entity such as &#1740; stands for one Unicode character while SUBSTRING counts it as seven, so my requirement is not met this way.
Is there any function that returns my specified number of characters regardless of whether they are Unicode or English characters, meaning that for Unicode data it should count &#1740; as one character?
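I do not think there is an option for this in MySQL; you can fetch the data from MySQL and then process it in PHP.
function getSubstring($string, $number) {
    // Escape "&" so each stored entity becomes "&amp;#NNNN;", then split on "&".
    $keywords = preg_split("/([&])+/", htmlentities($string));
    $finalArray = array();
    unset($keywords[0]); // drop whatever precedes the first entity
    // Stop early if the string holds fewer than $number entities.
    for ($index = 1; $index <= $number && isset($keywords[$index]); $index++) {
        $finalArray[] = $keywords[$index];
    }
    // Each piece starts with "amp;"; turn it back into "&".
    return str_replace('amp;', '&', implode('', $finalArray));
}
$string = "&#1740;&#1729;&#1575;&#1722; &#1578;&#1608;";
// 10 = number of characters to fetch
echo getSubstring($string, 10);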
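Another way to count each entity as one character is to decode the entities first and then cut with mb_substr(). This is a minimal sketch under the assumption that the field holds UTF-8 text with numeric entities; the function name entitySubstring and the sample value are made up for illustration.

```php
<?php
function entitySubstring(string $stored, int $start, int $length): string
{
    // Turn "&#1740;" and friends into real UTF-8 characters.
    $text = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // mb_substr() counts characters, not bytes. Note it is 0-indexed,
    // whereas MySQL's SUBSTRING() is 1-indexed.
    return mb_substr($text, $start, $length, 'UTF-8');
}

$stored = '&#1740;&#1729;&#1575;&#1722; &#1578;&#1608; and some English text';
$part = entitySubstring($stored, 0, 7);
echo $part, PHP_EOL;
echo mb_strlen($part, 'UTF-8'), PHP_EOL; // 7
```

HTML tags in the field are not removed by this sketch; strip_tags() could be applied before decoding if markup should not count toward the length.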

php: converting from cp1251 to utf8

I have a problem converting a string from cp1251 to utf8...
I need to get some names from a database, and those names are in cp1251 (I'm not the one who made that database, so I can't edit it, but I know for sure that these names are cp1251).
The name in database is this - "Р?нтернет РІ цифрах"
I'm converting it to utf8 using iconv function like this:
iconv("UTF-8", "CP1251//IGNORE", $name)
and what I have in the result is this - "�?нтернет в цифрах"(it's Russian), but the first two symbols are not correct... it should be "Интернет в цифрах"...
So the final thing that I have to do is somehow change these two symbols "�?" to russian letter "И"... and I really don't know how to do that... I've tried to use preg_replace, but it doesn't work...or I'm not using it correctly.
And I'm sorry for Russian letters, it is really hard to explain what I need without showing them.
The first letter comes out incorrect because one of the bytes needed to store the UTF-8 encoding of И (0x98 to be exact) is not used in CP1251. If the database has replaced the 98 byte by a question mark you have to change it back before using iconv:
$name = str_replace("\xD0\x3F", "\xD0\x98", $name);
echo iconv("UTF-8", "CP1251//IGNORE", $name);
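The effect of the missing 0x98 byte can be seen on a hand-built string. This sketch assumes the damaged value reaches PHP with the byte pair 0xD0 0x3F where 0xD0 0x98 (the UTF-8 encoding of И) should be; how and where the question mark was introduced in the real data is not known, so the byte sequence here is an assumption.

```php
<?php
// Assumption: the lead byte 0xD0 survived, but 0x98 was replaced by "?" (0x3F).
$damaged = "\xD0\x3F" . "нтернет в цифрах";
var_dump(mb_check_encoding($damaged, 'UTF-8'));  // bool(false): 0xD0 lacks its continuation byte

// Put the missing continuation byte back.
$repaired = str_replace("\xD0\x3F", "\xD0\x98", $damaged);
var_dump(mb_check_encoding($repaired, 'UTF-8')); // bool(true)
echo $repaired, PHP_EOL;                         // Интернет в цифрах
```

The pair 0xD0 0x3F never occurs in valid UTF-8, so the replacement cannot touch correctly encoded text.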
use this:
mb_convert_encoding($model->text, 'cp1252', 'utf8')
Try this:
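function cp1251_to_utf8($s) {
    $c209 = chr(209); $c208 = chr(208); $c129 = chr(129); $c145 = chr(145);
    $t = ''; // initialise the output buffer
    for ($i = 0; $i < strlen($s); $i++) {
        $c = ord($s[$i]);
        if ($c >= 192 and $c <= 239) $t .= $c208 . chr($c - 48);  // А..п -> D0 90..D0 BF
        elseif ($c > 239) $t .= $c209 . chr($c - 112);             // р..я -> D1 80..D1 8F
        elseif ($c == 184) $t .= $c209 . $c145;                    // ё -> D1 91
        elseif ($c == 168) $t .= $c208 . $c129;                    // Ё -> D0 81
        else $t .= $s[$i];                                         // everything else is copied as-is
    }
    return $t;
}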
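The mbstring extension also ships a Windows-1251 converter, so a hand-written table is not strictly needed. A short sketch, assuming the input really is cp1251 bytes; the sample bytes spell "Интернет" and are written as hex escapes so the example does not depend on the source file's encoding.

```php
<?php
// "Интернет" encoded as cp1251: one byte per letter.
$cp1251 = "\xC8\xED\xF2\xE5\xF0\xED\xE5\xF2";

// Arguments: value, target encoding, source encoding.
$utf8 = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251');
echo $utf8, PHP_EOL;         // Интернет
echo strlen($utf8), PHP_EOL; // 16 (two bytes per Cyrillic letter)

// And back again.
$back = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');
var_dump($back === $cp1251); // bool(true)
```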

Utf-8 to UTF-16BE

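I save the record "فحص الرسالة العربية" in PHP, and it is always stored as:
&#1601;&#1581;&#1589; &#1575;&#1604;&#1585;&#1587;&#1575;&#1604;&#1577; &#1575;&#1604;&#1593;&#1585;&#1576;&#1610;&#1577;
I want to convert this into UTF-16BE characters when I retrieve it, so I am using a function, which returns: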
002600230031003600300031003b002600230031003500380031003b002600230031003500380039003b0020002600230031003500370035003b002600230031003600300034003b002600230031003500380035003b002600230031003500380037003b002600230031003500370035003b002600230031003600300034003b002600230031003500370037003b0020002600230031003500370035003b002600230031003600300034003b002600230031003500390033003b002600230031003500380035003b002600230031003500370036003b002600230031003600310030003b002600230031003500370037003b
This is the function that I'm using to convert the string retrieved from the database:
function convertCharsn($string) {
    $in = '';
    $out = iconv('UTF-8', 'UTF-16BE', $string);
    // Print every byte of the UTF-16BE string as two uppercase hex digits.
    for ($i = 0; $i < strlen($out); $i++) {
        $in .= sprintf("%02X", ord($out[$i]));
    }
    return $in;
}
But when I type the same characters into the page below, it shows different output compared to my string.
http://www.routesms.com/downloads/onlineunicode.asp
returning:
0641062D063500200627064406310633062706440629002006270644063906310628064A0629
I want my string to be converted the same way as on the above page.
My database collation is utf8_general_ci.
Basically, you need to decode those characters out of HTML entities first. Just use html_entity_decode()
$rawChars = html_entity_decode($string, ENT_QUOTES | ENT_HTML401, 'UTF-8');
convertCharsn($rawChars);
Otherwise, you're just encoding the entities themselves. You can see this because & is 0026 in UTF-16 and # is 0023, which explains the repeating sequence 00260023 in the output you posted. So decode first, and you should be set.
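Put together, the decode-then-convert order can be sketched as follows. The function name toUtf16BeHex is made up for this example, and bin2hex() stands in for the sprintf loop from the question; the short input is the first word of the stored value.

```php
<?php
function toUtf16BeHex(string $stored): string
{
    // 1. Turn the stored entities back into real UTF-8 characters.
    $raw = html_entity_decode($stored, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // 2. Re-encode as UTF-16BE and dump the bytes as uppercase hex.
    return strtoupper(bin2hex(iconv('UTF-8', 'UTF-16BE', $raw)));
}

echo toUtf16BeHex('&#1601;&#1581;&#1589;'), PHP_EOL; // 0641062D0635

// Without the decode step, the entity text itself gets encoded:
echo strtoupper(bin2hex(iconv('UTF-8', 'UTF-16BE', '&#'))), PHP_EOL; // 00260023
```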
