UTF8 Encoding problem - With good examples - php

I have the following character encoding issue: somehow I have managed to save data with different character encodings into my database (UTF8). The code and outputs below show 2 sample strings and how they are output. One of them needs to be converted to UTF8 and the other already is UTF8.
How should I go about checking whether to encode a string or not?
I need each string to be output correctly, so how do I check whether it is already UTF8 or needs to be converted?
I am using PHP 5.2 with MySQL MyISAM tables:
CREATE TABLE IF NOT EXISTS `entities` (
....
`title` varchar(255) NOT NULL
....
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'UTF8 Encode : ', utf8_encode($text)."<br />";
echo 'UTF8 Decode : ', utf8_decode($text)."<br />";
echo 'TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//TRANSLIT", $text)."<br />";
echo 'IGNORE TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//IGNORE//TRANSLIT", $text)."<br />";
echo 'IGNORE : ', iconv("ISO-8859-1", "UTF-8//IGNORE", $text)."<br />";
echo 'Plain : ', iconv("ISO-8859-1", "UTF-8", $text)."<br />";
?>
Output 1:
Original : France Télécom
UTF8 Encode : France Télécom
UTF8 Decode : France T�l�com
TRANSLIT : France Télécom
IGNORE TRANSLIT : France Télécom
IGNORE : France Télécom
Plain : France Télécom
Output 2:
Original : Cond� Nast Publications
UTF8 Encode : Condé Nast Publications
UTF8 Decode : Cond?ast Publications
TRANSLIT : Condé Nast Publications
IGNORE TRANSLIT : Condé Nast Publications
IGNORE : Condé Nast Publications
Plain : Condé Nast Publications
Thanks for your time on this one. Character encoding and I don't get on very well!
UPDATE:
echo strlen($string)."|".strlen(utf8_encode($string))."|";
echo (strlen($string)!==strlen(utf8_encode($string))) ? $string : utf8_encode($string);
echo "<br />";
echo strlen($string)."|".strlen(utf8_decode($string))."|";
echo (strlen($string)!==strlen(utf8_decode($string))) ? $string : utf8_decode($string);
echo "<br />";
23|24|Cond� Nast Publications
23|21|Cond� Nast Publications
16|20|France Télécom
16|14|France Télécom

This may be a job for the mb_detect_encoding() function.
In my limited experience with it, it's not 100% reliable when used as a generic "encoding sniffer" - It checks for the presence of certain characters and byte values to make an educated guess - but in this narrow case (it'll need to distinguish just between UTF-8 and ISO-8859-1 ) it should work.
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1");
echo 'Detected encoding '.$enc."<br />";
echo 'Fixed result: '.iconv($enc, "UTF-8", $text)."<br />";
?>
You may get incorrect results for strings that do not contain special characters, but that is not a problem: a pure ASCII string is byte-for-byte identical in UTF-8 and ISO-8859-1.
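A stricter variant of the same idea, as a minimal sketch: instead of asking mb_detect_encoding() to guess, check whether the bytes are already valid UTF-8 with mb_check_encoding() and convert only when they are not. The helper name ensure_utf8() is hypothetical, and the fallback to ISO-8859-1 is an assumption that only holds when those are the two encodings in play.

```php
<?php
// Hypothetical helper: returns $text as UTF-8, assuming that anything which
// is not valid UTF-8 was stored as ISO-8859-1 (an assumption, not a fact).
function ensure_utf8($text)
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (pure ASCII also lands here)
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "Cond\xE9 Nast Publications";    // "é" as one Latin-1 byte
$utf8   = "France T\xC3\xA9l\xC3\xA9com";  // "é" as two UTF-8 bytes

echo bin2hex(ensure_utf8($latin1)), "\n";
echo bin2hex(ensure_utf8($utf8)), "\n";
```

A Latin-1 string whose bytes happen to form valid UTF-8 would be left untouched by this check, so it is a heuristic rather than a guarantee.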

I made a function that addresses all these issues. It's called Encoding::toUTF8().
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'Encoding::toUTF8 : ', Encoding::toUTF8($text)."<br />";
?>
Output:
Original : France Télécom
Encoding::toUTF8 : France Télécom
Original : Cond� Nast Publications
Encoding::toUTF8 : Condé Nast Publications
You don't need to know what the encoding of your strings is, as long as you know it is Latin1 (ISO 8859-1), Windows-1252 or UTF8. The string can contain a mix of them too.
Encoding::toUTF8() will convert everything to UTF8.
I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.
Usage:
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
http://dl.dropbox.com/u/186012/PHP/forceUTF8.zip
I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.
Usage:
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

Another way, maybe faster and less unreliable:
echo (strlen($str)!==strlen(utf8_decode($str)))
? $str //is multibyte, leave as is
: utf8_encode($str); //encode
It compares the byte length of the original string with the byte length of the utf8_decode()d string.
A string that contains a multibyte character has a strlen() that differs from the strlen() of the same text in a single-byte encoding.
For example:
strlen('Télécom')
should return 7 in Latin1 and 9 in UTF8
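The byte counts can be checked directly. This is a small sketch that builds the UTF-8 string from explicit bytes so the result does not depend on how the script file itself is saved; it uses mb_convert_encoding() rather than utf8_decode(), which is deprecated as of PHP 8.2.

```php
<?php
$utf8   = "T\xC3\xA9l\xC3\xA9com";                            // UTF-8 bytes
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');  // Latin-1 bytes

echo strlen($utf8), "\n";              // 9: strlen() counts bytes
echo strlen($latin1), "\n";            // 7: one byte per character
echo mb_strlen($utf8, 'UTF-8'), "\n";  // 7: characters, not bytes
```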

I made these two little functions that work well for UTF-8 and ISO-8859-1 detection / conversion...
function detect_encoding($string)
{
    //http://w3.org/International/questions/qa-forms-utf-8.html
    if (preg_match('%^(?: [\x09\x0A\x0D\x20-\x7E] | [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )*$%xs', $string))
        return 'UTF-8';
    //If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list.
    //if you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
    return mb_detect_encoding($string, array('UTF-8', 'ASCII', 'ISO-8859-1', 'JIS', 'EUC-JP', 'SJIS'));
}
function convert_encoding($string, $to_encoding, $from_encoding = '')
{
    if ($from_encoding == '')
        $from_encoding = detect_encoding($string);
    if ($from_encoding == $to_encoding)
        return $string;
    return mb_convert_encoding($string, $to_encoding, $from_encoding);
}
If your database contains strings in 2 different charsets, what I would do, instead of plaguing all your application code with charset detection / conversion, is write a "one shot" script that reads all of your tables' records and updates their strings to the correct format (I would pick UTF-8 if I were you). This way your code will be cleaner and simpler to maintain.
Just loop over the records in every table of your database and convert the strings like this:
//if the 3rd param is not specified the "from encoding" is detected automatically
$newString = convert_encoding($oldString, 'UTF-8');
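As a rough sketch of what such a one-shot pass could look like, the loop below works on a plain array standing in for fetched records, so it runs without a database. The $rows data, the id/title column names and the "not valid UTF-8 means ISO-8859-1" rule are assumptions for illustration only.

```php
<?php
// Hypothetical stand-in for rows fetched from the `entities` table.
$rows = array(
    array('id' => 1, 'title' => "France T\xC3\xA9l\xC3\xA9com"),  // already UTF-8
    array('id' => 2, 'title' => "Cond\xE9 Nast Publications"),   // Latin-1 bytes
);

$updates = array(); // id => corrected title, only for rows that need fixing

foreach ($rows as $row) {
    $title = $row['title'];
    if (!mb_check_encoding($title, 'UTF-8')) {
        // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
        $updates[$row['id']] = mb_convert_encoding($title, 'UTF-8', 'ISO-8859-1');
    }
}

foreach ($updates as $id => $title) {
    // In a real script this is where an UPDATE would be issued.
    echo $id, ' => ', bin2hex($title), "\n";
}
```

Collecting the changes first and writing them afterwards makes it easy to review what would be modified before touching the data.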

I didn't try your samples here, but from past experience there is a quick fix for this. Right after connecting to the database, execute the following query BEFORE running any other queries:
SET NAMES UTF8;
This is SQL Standard compliant, and works well with other databases, like Firebird and PostgreSQL.
But remember, you need to ensure UTF-8 is declared in other places too for your application to work properly. Here is a quick checklist.
All files should be saved as UTF-8 (preferably without a BOM [Byte Order Mark]).
Your HTTP server should send the encoding header UTF-8. Use Firebug or Live HTTP Headers to inspect.
If your server compress and/or tokenize the response, you may see header content as chunked or gzipped. This is not a problem if you save your files as UTF-8 and
Declare the encoding in the HTML header, using the proper meta tag.
Across the whole application (sockets, file system, databases...), do not forget to flag UTF-8 every time you can. Doing this when opening a database connection, for example, saves you from having to encode/decode/debug all the time. Grab 'em by the root.
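A few of the checklist items can be sketched in PHP as follows. The host and database names are placeholders, and no connection is opened here. Note that the charset parameter of the PDO MySQL DSN is ignored before PHP 5.3.6, so on older versions the SET NAMES query is still needed.

```php
<?php
// Hypothetical connection settings; replace with real values.
$host = 'localhost';
$db   = 'example_db';

// 1. Tell the browser the response is UTF-8 (must run before any output).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// 2. Ask the MySQL driver for a UTF-8 connection. The charset parameter in
//    the PDO DSN plays the same role as running SET NAMES after connecting.
$dsn = sprintf('mysql:host=%s;dbname=%s;charset=utf8', $host, $db);

// 3. Keep mbstring's default encoding in line with everything else.
mb_internal_encoding('UTF-8');

// $pdo = new PDO($dsn, $user, $password); // not executed in this sketch
```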

What database do you use?
You need to know the charset of the original string before you convert it to UTF-8. If it's ISO-8859-1 (latin1), then utf8_encode() is the easiest way; otherwise you need to use either the iconv or mbstring library to convert, and both of these need to know the charset of the input in order to convert properly.
Do you tell your database about charset when you insert/select data?

Related

PHP Htmlentities function not encoding string to database using PDO

I have a string (in a foreign language) and I need to convert it to HTML entities.
I'm running a PHP script from my terminal on Ubuntu Linux.
I need this:
$str = "Ettől a pillanattól kezdve,"
To become something like this:
EttЗl a pillanattßl kezdve,
$str = "Ettől a pillanattól kezdve,";
$strEncoded = htmlentities($str, ENT_QUOTES, "UTF-8");
$cmd = $pdo->prepare("UPDATE table SET field = :a");
$cmd->bindValue(":a", $strEncoded);
$cmd->execute();
Database/Table Information:
Charset: utf8
Collation: utf8_general_ci
It is not saving as expected.
Obs: I know it's not the best practice to use htmlentities to save into database, but I need to do it this way.
Example 2:
$a = "Quantit&agrave; totale delle";
$b = html_entity_decode($a);
echo $a; //output: Quantit&agrave; totale delle
echo $b; //output: Quantità totale delle (Need the reverse)
echo htmlspecialchars($b, ENT_QUOTES, 'UTF-8') . "\n"; //output: Quantità totale delle (didn't convert the special character to `&agrave;`)
To match the question, you have to rebuild the entity yourself using the decimal value. This will work with strings like the one you specified:
<?php
$str = str_split("Ettől a pillanattól kezdve,");
foreach ($str as $k => $v){
echo "&#".ord($v).";";
}
// Ettől a pillanattól kezdve,
But this won't work for chars above 255.
https://www.php.net/manual/en/function.ord.php
Interprets the binary value of the first byte of string as an unsigned
integer between 0 and 255.
If the string is in a single-byte encoding, such as ASCII, ISO-8859, or Windows 1252, this is equivalent to returning the
position of a character in the character set's mapping table. However,
note that this function is not aware of any string encoding, and in
particular will never identify a Unicode code point in a multi-byte
encoding such as UTF-8 or UTF-16.
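Because ord() only sees single bytes, a code-point-aware alternative is needed for characters such as "ő" (U+0151). One option, sketched here under the assumption that the mbstring extension is available and the input is UTF-8, is mb_encode_numericentity(), which turns every character in a chosen range into a decimal entity.

```php
<?php
// "Ettől a pillanattól kezdve," written as explicit UTF-8 bytes.
$str = "Ett\xC5\x91l a pillanatt\xC3\xB3l kezdve,";

// Map every code point from U+0080 to U+10FFFF to a decimal entity:
// array(start, end, offset, mask).
$map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);

$encoded = mb_encode_numericentity($str, $map, 'UTF-8');
echo $encoded, "\n"; // Ett&#337;l a pillanatt&#243;l kezdve,
```

ASCII characters are left alone because the map starts at 0x80, and html_entity_decode($encoded, ENT_QUOTES, 'UTF-8') restores the original string.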

How to convert “Ã©” to “é” in PHP?

I'm trying to convert a string from this: “Ã©” to this: “é”. It's a latin1 character but I can't get it right. So far I've tried two functions, but neither gives me the right output.
$translation = 'CopÃ © rnico was Italian';
$translation = mb_convert_encoding($translation, 'utf-8', 'iso-8859-1'); //opt 1
$translation = iconv('utf-8', 'latin1', $translation); //opt 2
I'm getting this data from an API, so I don't know what's going on in the database.
This is the string in Spanish: Copérnico es italiano.
This is the data from the API: CopÃ © rnico is Italian
This is the result with $translation = bin2hex($translation);
436f70c38320c2a920726e69636f206973204974616c69616e
What's the right way to go? Greetings.
I had the same problem before, and this option
$translation = iconv('utf-8', 'latin1', $translation); //opt 2
worked very well.
Your problem is that `CopÃ © rnico was Italian` is not the same as `Copérnico was Italian`.
So when you try to convert, iconv sees 2 wrong UTF-8 symbols because of the spaces: "Ã © " (2 invalid UTF-8 symbols and 2 spaces) is not the same as "é" (1 valid UTF-8 symbol).
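To make that concrete, here is a minimal sketch that starts from the exact bytes in the question's bin2hex() dump, removes the stray spaces, and then applies the UTF-8 to Latin-1 conversion from option 2. It assumes the space after each mojibake character is an artifact and not real content; hex2bin() needs PHP 5.4 or later.

```php
<?php
// Bytes taken from the bin2hex() dump in the question.
$raw = hex2bin('436f70c38320c2a920726e69636f206973204974616c69616e');

// Step 1: drop the stray space that follows each mojibake character
// (C3 83 and C2 A9). Assumes those spaces are not real content.
$joined = str_replace(
    array("\xC3\x83 ", "\xC2\xA9 "),
    array("\xC3\x83", "\xC2\xA9"),
    $raw
);

// Step 2: undo the double encoding by mapping each character back to a
// single Latin-1 byte; the resulting bytes C3 A9 are valid UTF-8 for "é".
$fixed = iconv('UTF-8', 'ISO-8859-1', $joined);

echo $fixed, "\n"; // Copérnico is Italian
```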

PHP UTF-8 mb_convert_encode and Internet-Explorer

For a few days I have been reading about character encoding, and I want to serve all my pages as UTF-8 for compatibility. But I get stuck when I try to convert user input to UTF-8: this works in all browsers except Internet Explorer (as always).
I don't know what's wrong with my code; it seems fine to me.
I set the header with the character encoding.
I saved the file as UTF-8 (no BOM).
This happens only when you access the page with a $_GET parameter in Internet Explorer: myscript.php?c=äüöß
When I write special characters directly in my page, they are displayed correctly.
This is my Code:
// User Input
$_GET['c'] = "äüöß"; // Access URL ?c=äüöß
//--------
header("Content-Type: text/html; charset=utf-8");
mb_internal_encoding('UTF-8');
$_GET = userToUtf8($_GET);
function userToUtf8($string) {
    if(is_array($string)) {
        $tmp = array();
        foreach($string as $key => $value) {
            $tmp[$key] = userToUtf8($value);
        }
        return $tmp;
    }
    return userDataUtf8($string);
}
function userDataUtf8($string) {
    print("1: " . mb_detect_encoding($string) . "<br>"); // Shows: 1: UTF-8
    $string = mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string)); // Convert non UTF-8 String to UTF-8
    print("2: " . mb_detect_encoding($string) . "<br>"); // Shows: 2: ASCII
    $string = preg_replace('/[\xF0-\xF7].../s', '', $string);
    print("3: " . mb_detect_encoding($string) . "<br>"); // Shows: 3: ASCII
    return $string;
}
echo $_GET['c']; // Shows nothing
echo mb_detect_encoding($_GET['c']); // ASCII
echo "äöü+#"; // Shows "äöü+#"
The most confusing part is that it shows me it was converted from UTF-8 to ASCII... Can someone tell me why it doesn't show the special characters correctly and what's wrong here? Or is this a bug in Internet Explorer?
Edit:
If I disable converting it says, it's all UTF-8 but the Characters won't show to me either... They are displayed like "????"....
Note: This happens ONLY in the Internet-Explorer!
Although I prefer using URL-encoded strings in the address bar, in your case you can try encoding $_GET['c'] to UTF-8, e.g.:
$_GET['c'] = utf8_encode($_GET['c']);
An approach to display the characters using IE 11.0.18 which worked:
Retrieve the Unicode of your character : example for 'ü' = 'U+00FC'
According to this post, convert it to utf8 entity
Decode it using utf8_decode before dumping
The line of code illustrating the example with the 'ü' character is :
var_dump(utf8_decode(html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", 'U+00FC'), ENT_NOQUOTES, 'UTF-8')));
To summarize: For displaying purposes, go from Unicode to UTF8 then decode it before displaying it.
Other resources:
a post to retrieve characters' unicode

replace multibyte utf8 character in php

I am trying to preg_replace the multibytecharacter for euro in UTF (shown as ⬠in my html) to a "$" and the * for an "#"
$orig = "2 **** reviews ⬠19,99 price";
$orig = mb_ereg_replace(mb_convert_encoding('&euro;', 'UTF-8', 'HTML-ENTITIES'), "$", $orig);
$orig = preg_replace("/[\$\;\?\!\{\}\(\)\[\]\/\*\>\<]/", "#", $orig);
$a = htmlentities($orig);
$b = html_entity_decode($a);
The "*" are being replaced but not the "â¬" .......
Also tried to replace it with
$orig = preg_replace("/[\xe2\x82\xac]/", "$", $orig);
Doesn't convert either....
Another plan which didn't work:
$orig= mb_ereg_replace(mb_convert_encoding('&euro;', 'UTF-8', 'HTML-ENTITIES'), "$", $orig);
Brrr, does someone know how to get rid of this utf8 euro character:
echo html_entity_decode('&euro;');
(driving me nuts)
This could be caused by two reasons:
The actual source text is UTF8 encoded, but your PHP code is not.
You can solve this by just using this line and save your file UTF8 encoded (try using notepad++).
str_replace('€', '$', $source);
The source text is corrupted: multibyte characters are converted to latin1 (wrong database charset?). You can try to convert them back to latin1:
str_replace('€', '$', utf8_decode($source))
Pasting my comment here as an answer so you can mark it!
Wouldn't
str_replace(html_entity_decode('&euro;'), '$', $source)
work?
In your $orig string you do not have euro sign.
When I run this php file:
<?php
$orig = "â¬";
for ($i = 0; $i < strlen($orig); $i++)
    echo '0x' . dechex(ord($orig[$i])) . ' '; // $orig{$i} was removed in PHP 8
?>
If saved as utf-8 I get: 0xc3 0xa2 0xc2 0xac
If saved as latin-1 I get: 0xe2 0xac
In any case it is not € sign which is:0xE2 0x82 0xAC or unicode \u20AC ( http://www.fileformat.info/info/unicode/char/20ac/index.htm ).
0x82 is missing!!!!!
Run this program above, see what do you get and use this hex values to get rid of â¬.
For real € sign this works:
<?php
$orig = html_entity_decode('&euro;', ENT_COMPAT, 'UTF-8');
$dest = preg_replace('~\x{20ac}~u', '$', $orig);
echo "($orig) ($dest)";
?>
BTW, if a UTF-8 file containing € is displayed as latin-1 you should get:
â‚¬ and not â¬ as in your example.
So in fact, you have problems with encoding and conversion between encodings. If you try to save € in latin1, the middle character will be lost (for example, my Komodo will alert me and then replace ‚ with ?). In other words, you somehow damaged your € sign, and then you tried to replace it as if it were complete. :D
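The difference between an intact and a damaged euro sign can be sketched with explicit bytes, so the result does not depend on how the script file is saved. The sample strings here are made up for illustration; the damaged bytes are the ones from the "saved as utf-8" dump above.

```php
<?php
$euro    = "\xE2\x82\xAC";      // U+20AC as UTF-8: three bytes
$damaged = "\xC3\xA2\xC2\xAC";  // the two leftover characters, saved as UTF-8

$good = "price {$euro} 19,99";
$bad  = "price {$damaged} 19,99";

// A plain byte-wise replace is enough for an intact euro sign.
echo str_replace($euro, '$', $good), "\n";

// With the /u modifier, \x{20AC} matches the whole character, not one byte.
echo preg_replace('/\x{20AC}/u', '$', $good), "\n";

// The damaged text contains no euro sign, so searching for one finds nothing...
var_dump(strpos($bad, $euro)); // bool(false)

// ...and the leftover characters have to be targeted directly.
echo str_replace($damaged, '$', $bad), "\n";
```

A character class such as /[\xe2\x82\xac]/ without the /u modifier matches any one of the three bytes separately, which is why it is not a substitute for matching the whole sequence.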

get utf8 urlencoded characters in another page using php

I have used rawurlencode on a utf8 word.
For example
$tit = 'தேனின் "வாசம்"';
$t = (rawurlencode($tit));
when I click the utf8 word ($t), I will be transferred to another page using .htaccess and I get the utf8 word using $_GET['word'];
The word displays as தேனினà¯_"வாசமà¯" not the actual word. How can I get the actual utf8 word. I have used the header charset=utf-8.
Was my comment first, but should have been an answer:
magic_quotes is off? Would be weird if it was still on in 2011. But you should check and do stripslashes.
Did you use rawurldecode($_GET['word']); ? And do you use UTF-8 encoding for your PHP file?
<?php
$s1 = <<<EOD
தேனினà¯_"வாசமà¯"
EOD;
$s2 = <<<EOD
தேனின் "வாசம்"
EOD;
$s1 = mb_convert_encoding($s1, "WINDOWS-1252", "UTF-8");
echo bin2hex($s1), "\n";
echo bin2hex($s2), "\n";
echo $s1, "\n", $s2, "\n";
Output:
e0aea4e0af87e0aea9e0aebfe0aea9e0af5f22e0aeb5e0aebee0ae9ae0aeaee0af22
e0aea4e0af87e0aea9e0aebfe0aea9e0af8d2022e0aeb5e0aebee0ae9ae0aeaee0af8d22
தேனின��_"வாசம��"
தேனின் "வாசம்"
You're probably just not showing the data as UTF-8 and you're showing it as ISO-8859-1 or similar.
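A quick round trip shows that rawurlencode() itself does not damage the UTF-8 bytes, which points to the display side rather than the transfer. This is a sketch that assumes the script file is saved as UTF-8; PHP already decodes the values in $_GET, so calling rawurldecode() on them again is normally unnecessary.

```php
<?php
// Sending the charset with the page lets the browser show the bytes as Tamil.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$tit = 'தேனின் "வாசம்"';

$encoded = rawurlencode($tit);       // percent-encoded, safe to put in a URL
$decoded = rawurldecode($encoded);   // what the receiving page ends up with

echo $encoded, "\n";
var_dump($decoded === $tit); // bool(true): the UTF-8 bytes survive unchanged
```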
