php: converting from cp1251 to utf8

I have a problem converting a string from cp1251 to UTF-8.
I need to read some names from a database, and those names are stored in cp1251 (I didn't create the database, so I can't edit it, but I know for sure that the names are cp1251).
One name in the database looks like this: "Р?РЅС‚РµСЂРЅРµС‚ РІ С†РёС„СЂР°С…"
I'm converting it with the iconv function like this:
iconv("UTF-8", "CP1251//IGNORE", $name)
The result is "�?нтернет в цифрах" (it's Russian), but the first two symbols are wrong; it should be "Интернет в цифрах".
So the last thing I have to do is somehow change the two symbols "�?" into the Russian letter "И", and I really don't know how to do that. I've tried preg_replace, but either it doesn't work or I'm not using it correctly.
I'm sorry for the Russian letters; it is really hard to explain what I need without showing them.

The first letter comes out wrong because one of the two bytes in the UTF-8 encoding of И (0xD0 0x98) is 0x98, which is not assigned in CP1251. If the database has replaced the 0x98 byte with a question mark, you have to change it back before using iconv:
$name = str_replace("\xD0\x3F", "\xD0\x98", $name);
echo iconv("UTF-8", "CP1251//IGNORE", $name);
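Depending on where the "?" sits in your data, the byte pair D0 3F may only exist after the conversion. The following is a minimal sketch of that variant, assuming $name arrives as UTF-8 text in which the lost byte already shows up as a literal "?". The sample string is rebuilt in code so the bytes are exact, and //IGNORE is left out because every character in the sample converts cleanly.

```php
<?php
// Assumption: the stored value is the UTF-8 bytes of the real name, misread
// as CP1251, with the unmappable byte 0x98 already replaced by "?" (0x3F).

// Rebuild the damaged sample: "Р" (U+0420) followed by "?" stands in for "И".
$damaged = "\xD0\xA0?" . iconv('CP1251', 'UTF-8', 'нтернет в цифрах');

// Undo the misreading: each CP1251 character turns back into one original byte.
$bytes = iconv('UTF-8', 'CP1251', $damaged);

// D0 3F is never valid UTF-8, so here it can only be the broken "И" (D0 98).
$fixed = str_replace("\xD0\x3F", "\xD0\x98", $bytes);

echo $fixed, "\n"; // Интернет в цифрах
```

The same replacement cannot repair other letters whose second byte was lost; it only covers the single byte that CP1251 leaves unassigned.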

Use this (the arguments are the string, the target encoding, and the source encoding):
mb_convert_encoding($model->text, 'CP1251', 'UTF-8')

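Try this:
function cp1251_to_utf8($s) {
    // Converts raw CP1251 bytes to UTF-8; handles only the letters А-я, Ё and ё.
    $t = '';
    for ($i = 0; $i < strlen($s); $i++) {
        $c = ord($s[$i]);
        if ($c >= 192 && $c <= 239) $t .= chr(208) . chr($c - 48);  // А-п -> D0 90..D0 BF
        elseif ($c > 239) $t .= chr(209) . chr($c - 112);           // р-я -> D1 80..D1 8F
        elseif ($c == 184) $t .= chr(209) . chr(145);               // ё -> D1 91
        elseif ($c == 168) $t .= chr(208) . chr(129);               // Ё -> D0 81
        else $t .= $s[$i];  // copied unchanged (only correct for ASCII bytes)
    }
    return $t;
}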
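If the input really is raw CP1251 bytes, the built-in converters cover the whole code page, including the punctuation that a hand-written loop copies through unchanged. A short sketch, assuming the iconv extension is available; the sample bytes are made up for illustration:

```php
<?php
// "Инт" as raw CP1251 bytes: И = 0xC8, н = 0xED, т = 0xF2.
$cp1251 = "\xC8\xED\xF2";

// iconv(from, to, string): decode CP1251 and re-encode as UTF-8.
$utf8 = iconv('CP1251', 'UTF-8', $cp1251);

echo $utf8, "\n";          // Инт
echo bin2hex($utf8), "\n"; // d098d0bdd182
```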

Related

How to compare 2 strings with different encoding and different unicode characters in PHP

I need to validate an international accreditation by name and surname. The problem is that I need to return TRUE even if the characters differ.
EXAMPLE:
$str1 = 'Bożydar Kamiński';
$str2 = 'BOZYDAR KAMINSKI';
// I need this to be TRUE
if ($str1 == $str2) {
echo 'YOUR BUNNY WROTE';
}
Are there any built-in PHP functions to convert a UTF-8 string with Unicode characters (str1) to plain Latin characters?

If you know which characters can differ in the UTF-8 input, you can write your own converter: build a table of all the special characters you expect, and in the function replace each one with its plain equivalent before comparing.
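A minimal sketch of that table idea, assuming the source file is saved as UTF-8. The $map table and the fold_name helper are hypothetical and cover only the letters in this example; a real table would have to list every accented letter you expect.

```php
<?php
// Hypothetical lookup table: only the characters needed for this example.
$map = ['ż' => 'z', 'ń' => 'n', 'Ż' => 'Z', 'Ń' => 'N'];

function fold_name(string $name, array $map): string
{
    // strtr() with an array replaces whole multi-byte keys, not single bytes.
    $ascii = strtr($name, $map);
    // After folding, the text is plain ASCII, so strtoupper() is safe.
    return strtoupper($ascii);
}

$same = fold_name('Bożydar Kamiński', $map) === fold_name('BOZYDAR KAMINSKI', $map);
var_dump($same); // bool(true)
```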

$coll = Collator::create('');
$coll->setStrength(Collator::PRIMARY);
var_dump(0 == $coll->compare(
'Bożydar Kamiński',
'BOZYDAR KAMINSKI'
)); // bool(true)

Decoding ascii in url php

I'm trying to use accented letters (è, é, à, etc.) in my site's URLs, but I don't know how. I tried urlencode/urldecode, but the page can't be displayed when the URL contains one of these letters, and I don't know how to solve this.
Can you help me with an example, please?
function my_url($offer) {
$dd = url_title($offer->number." ".$offer->number,"-",true);
return site_url("siteweb\view".$offer->number."/".urlencode($dd));
}
My problem is in "urlencode($dd)".
Update:
function my_url($offer) {
$dd = url_title($offer->number." ".$offer->number,"-",true);
$ddd = rawurldecode($dd);
return site_url("siteweb\view".$offer->number."/".htmlspecialchars($ddd));
}

I think what you're looking for is htmlentities
"htmlentities — Convert all applicable characters to HTML entities"

PHP: UTF-8 character gets messy in function which takes the first letter from each word of a sentence

I have this function which when executed it returns the first letters of each word of a string.
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', $stringsoftext) as $word)
$retturns .= ($word[0]);
return $retturns;
}
Everything works fine. The only problem is that when a word begins with a special character, the result gets messy.
For example, "test økonomi" becomes "t�" instead of "tø".
How can I correct this?

That happens because $word[0] takes the first byte of a string, whereas you are using a multi-byte encoding, in which a character may consist of several bytes. The character ø, for example, consists of 2 bytes: 0xC3 0xB8.
This is how you would extract the first character instead:
mb_substr($word, 0, 1, 'utf8')
Working demo: http://ideone.com/XVnC87
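The byte-versus-character difference can be seen directly. A small sketch, assuming the mbstring extension is loaded and the source file is saved as UTF-8:

```php
<?php
$word = 'økonomi';

$byteCount = strlen($word);                 // counts bytes
$charCount = mb_strlen($word, 'UTF-8');     // counts characters
$firstByte = bin2hex($word[0]);             // only the first half of "ø"
$firstChar = mb_substr($word, 0, 1, 'UTF-8');

echo $byteCount, "\n"; // 8
echo $charCount, "\n"; // 7
echo $firstByte, "\n"; // c3
echo $firstChar, "\n"; // ø
```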

You should use mb_substr together with mb_internal_encoding, as in this example:
<?php
header('Content-Type: text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');
echo initials('ąęść óęłęł');
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', $stringsoftext) as $word) {
$retturns .= mb_substr($word,0,1);
}
return $retturns;
}

Complementing the answers above, you could convert the UTF-8 encoded string (to be precise, a string assumed to be UTF-8) to its ISO 8859-1 counterpart.
This needs no multibyte support, which is not enabled by default in many PHP configurations.
Use utf8_decode() to do so:
<?php
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', utf8_decode($stringsoftext)) as $word)
$retturns .= ($word[0]);
return $retturns;
}
echo initials("test økonomi");
//return tø
?>
Edit: This approach can break if the characters being converted are not defined in the ISO 8859-1 charset (e.g. non-Latin symbols). To reiterate: if PHP multibyte support is turned on, the mb_substr() solution is certainly the most appropriate, as it properly processes the string in UTF-8.
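Note that utf8_decode() is deprecated as of PHP 8.2. If mbstring is unavailable, another option is PCRE with the /u modifier, which treats the string as UTF-8 and keeps the result in UTF-8. This is a sketch rather than a drop-in replacement, and it assumes the input is valid UTF-8:

```php
<?php
function initials(string $text): string
{
    $result = '';
    // With /u the subject is read as UTF-8, so "." matches one whole character.
    foreach (preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if (preg_match('/^./us', $word, $m)) {
            $result .= $m[0];
        }
    }
    return $result;
}

echo initials('test økonomi'), "\n"; // tø
```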

Normalize Name-Surname strings: PHP+REGEX (Spanish chars- UTF8)

I have strings containing a name and surname, which I need to normalize with a function so that they look like:
Name Surname (I can receive strings like NAME SURNAME, Name SURNAME, etc.)
I've found this snippet:
echo nameize("HÉCTOR MAÑAÇ");
function nameize($str,$a_char = array("'","-"," ")){
//$str contains the complete raw name string
//$a_char is an array containing the characters we use as separators for capitalization. If you don't pass anything, there are three in there as default.
$string = strtolower($str);
foreach ($a_char as $temp){
$pos = strpos($string,$temp);
if ($pos){
//we are in the loop because we found one of the special characters in the array, so lets split it up into chunks and capitalize each one.
$mend = '';
$a_split = explode($temp,$string);
foreach ($a_split as $temp2){
//capitalize each portion of the string which was separated at a special character
$mend .= ucfirst($temp2).$temp;
}
$string = substr($mend,0,-1);
}
}
return ucfirst($string);
}
This works pretty well but, as you can see by testing this exact example, it doesn't handle Spanish characters (UTF-8). I've tried mb_regex_encoding("UTF-8"); mb_internal_encoding("UTF-8");, UTF-8 headers, etc., but I can't make it work with "special" Spanish characters.
Any suggestions?

I can't see where you use the Multibyte String functions.
Maybe this would suit your needs:
echo mb_convert_case("HÉCTOR MAÑAÇ", MB_CASE_TITLE, "UTF-8");
output:
Héctor Mañaç

Your function also works fine for the given example. Please check your file's encoding; it must be UTF-8. You can check it in Notepad++.

How to check if the word is Japanese or English using PHP

I want to process English words and Japanese words differently in this function:
function process_word($word) {
if($word is english) {
/////////
}else if($word is japanese) {
////////
}
}
thank you

A quick solution that doesn't need the mb_string extension:
if (strlen($str) != strlen(utf8_decode($str))) {
// $str uses multi-byte chars (isn't English)
}
else {
// $str is ASCII (probably English)
}
Or a modification of the solution provided by @Alexander Konstantinov:
function isKanji($str) {
return preg_match('/[\x{4E00}-\x{9FBF}]/u', $str) > 0;
}
function isHiragana($str) {
return preg_match('/[\x{3040}-\x{309F}]/u', $str) > 0;
}
function isKatakana($str) {
return preg_match('/[\x{30A0}-\x{30FF}]/u', $str) > 0;
}
function isJapanese($str) {
return isKanji($str) || isHiragana($str) || isKatakana($str);
}

This function checks whether a word contains at least one Japanese character (I found the Unicode ranges for Japanese characters on Wikipedia).
function isJapanese($word) {
return preg_match('/[\x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}]/u', $word);
}

You could try Google's Translation API that has a detection function:
http://code.google.com/apis/language/translate/v2/using_rest.html#detect-language

Try the mb_detect_encoding function: if the encoding is EUC-JP or UTF-8/UTF-16, the text may be Japanese; otherwise it is English.
It is better if you can determine which encoding is used for each language, since the UTF encodings can be used for many languages.

English text usually consists only of ASCII characters (or better say, characters in ASCII range).
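A minimal sketch of that check; the isAsciiOnly name is hypothetical. It only tells you whether every byte is in the ASCII range, which is a hint that the text is English rather than proof.

```php
<?php
function isAsciiOnly(string $str): bool
{
    // Any byte outside 0x00-0x7F means the string is not pure ASCII.
    return preg_match('/[^\x00-\x7F]/', $str) === 0;
}

var_dump(isAsciiOnly('hello'));      // bool(true)
var_dump(isAsciiOnly('こんにちは'));  // bool(false)
var_dump(isAsciiOnly('café'));       // bool(false): not ASCII, yet not Japanese either
```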

You can try to convert the charset and check if it succeeds.
Take a look at iconv: http://www.php.net/manual/en/function.iconv.php
If you can convert a string to ISO-8859-1, it might be English; if you can convert it to ISO-2022-JP, it is probably Japanese (I might be wrong about the exact charsets; you should look them up).
