Codeigniter curly quotes (despite UTF-8 collation) - php

I'm using a simple form to write string to database, but the curly quotes (on retrieval) are being displayed as special characters.
So far I have tried changing the table collation to UTF-8 as well as using the CodeIgniter Typography class:
1. alter table test.vocab convert to character set utf8 collate
utf8_unicode_ci;
2. $this->load->library('typography');
$data['item'] = $this->typography->auto_typography($data['item'], FALSE);
This hasn't helped, though. Here's what the output looks like:
If needed I'll post more code. Please help.

Thank you everyone, but it seems like what I needed was the mb_convert_encoding() function:
$i['item'] = mb_convert_encoding($i['item'], 'HTML-ENTITIES', 'UTF-8');
Many thanks for taking out the time for commenting or answering.

I had a similar issue with a migrated DB. Turned out I needed to convert 'UTF-8' to 'Windows-1252'.
I found out by running the code below and checking which combination looked right:
$html = 'Your HTML here';
$charsets = array(
"UTF-8",
"ASCII",
"Windows-1252",
"ISO-8859-15",
"ISO-8859-1",
"ISO-8859-6",
"CP1256"
);
foreach ($charsets as $ch1) {
foreach ($charsets as $ch2){
echo "<h1>Combination $ch1 to $ch2 produces: </h1>".iconv($ch1, $ch2, $html);
}
}

Related

Revert badly encoded umlauts

For some reason my special characters got encoded as the following string in a mysql database:
Ã?
Which shows up as:
Ã?
But actually should show up as:
Ö
What went wrong here? I use UTF-8 everywhere.
How can I fix this without recreating all content?
I executed the following in PHP:
<?php
echo str_replace("&", "&", htmlentities("Ö", 0, "ISO-8859-1")) , '<br />';
echo str_replace("&", "&", htmlentities("Ö", 0, "UTF-8")), "</br>";
?>
The str_replace is just there to reveal any HTML mnemonics, which would otherwise
be translated by the browser to the original character, which I don't want to happen.
You will get this as output:
�
Ö
You'll recognise the first value as what you found in the database, and the second one
is a bit like you wanted it to be.
Add to this the fact that the default value for the third argument to htmlentities
depends on your PHP version and is ISO-9959-1 in the case of version 5.3, the one you use.
Also realise that HTML documents which do not specify a character encoding will
by default post form data in ISO-8859-1 format.
Combining all this might give a clue about the cause of your problem:
My guess is that the data is correctly posted as UTF-8 to the server, but then htmlentities interprets this as a non-UTF-8, single byte encoding, and so turns one, multi-byte character into two single byte characters.
Now to the measures to take that this does not continue to happen:
First make sure that your HTML form has the UTF-8 encoding, because this determines the
default encoding that a form will use for sending its data to the server:
<head>
<meta charset="UTF-8">
</head>
Make sure this is not overruled by another encoding in the form tag's accept-charset
attribute.
Then, skip the htmlentities call. You should not turn characters into their
HTML mnemonic when storing them in the database. MySql
supports UTF-8 characters, so just store them like that.
For the second question, you'll have to find all cases and bulk replace them as you find
new instances. You could get get a little help by producing some SQL statements
with a PHP script like the following:
<?php
// list all your non-ASCII characters here. Do not use str_split.
$chars = ["Ö","õ","Ũ","ũ"];
foreach ($chars as $ch) {
$bad = str_replace("&", "&", htmlentities($ch, 0, "ISO-8859-1"));
echo "update mytable set myfield = replace(myfield, '$bad', '$ch')
where instr(myfield, '$bad') > 0;<br />";
}
?>
The output of this script will look like this:
update mytable set myfield = replace(myfield, 'Ã�', 'Ö') where instr(myfield, 'Ã�') > 0;
update mytable set myfield = replace(myfield, 'õ', 'õ') where instr(myfield, 'õ') > 0;
update mytable set myfield = replace(myfield, 'Ũ', 'Ũ') where instr(myfield, 'Ũ') > 0;
update mytable set myfield = replace(myfield, 'Å©', 'ũ') where instr(myfield, 'Å©') > 0;
Of course, you could decide to make a PHP script that will even do the updates itself.
Hopefully you can use this information to fix the issues.
For PDO, use something like
$db = new PDO('dblib:host=host;dbname=db;charset=UTF-8', $user, $pwd);
Ã? is two or three things going wrong, not just one!
C396 is the utf8 hex for Ö or the latin1 hex for the two characters Ö. It requires something else to go wrong to get ? or the black diamond.
Let's see what is in the table; do
SELECT col, HEX(col) FROM tbl WHERE ...
(If you have already done the previously suggested replace(), then the table may be in an even worse mess. Or it might be fixed.)

Gettext not detecting utf8 properly in PHP

I have a PHP application using Gettext as the i18n engine. The translation works fine, the only problem is that I'm having encoding issues with UTF8 characters. My PHP code to load gettext is something like this:
bindtextdomain( $domain, PATH_BASE . DS . "language" . DS );
$this->utf8Encode = strtolower($encoding) == "utf-8";
bind_textdomain_codeset($domain, $encoding);
textdomain($domain);
My templates render the pages using the utf8 charset and I've tried just about anything to load the proper charset. For the current locale I'm loading SL_sl, the names appear correctly but have issues with UTF8 chars, so where it should appear Država, it shows up Dr?ava
So, it has happened before, and now it happened again, I found the solution myself! The problem was that like I said to #bozdoz, I was converting UTF8 text already, but I didn't realized that the gettext function returned a UTF8 string, so if you do this:
$encoded = utf8_encode($utf8String);
Then you'll have a really nasty bug when $utf8String is an actual UTF8 string. Therefore I did some modifications to my code and the translation method (simplified) ended up like this:
$translation = gettext($singular);
$encoded = $this->utf8Encode ? $this->Utf8Encode($translation) : $translation;
And the Utf8Encode method is like this:
private function Utf8Encode( $text )
{
if ( mb_check_encoding($text, "utf8") == TRUE ){
return $text;
return utf8_encode($text);
}
I hope that if somebody has the same error this can help!
From the partial information I can suggest you take a look at the actual mo/po files, in poedit there are several warnings about utf8 encoding. Assuming that everything else is correct (meta, headers, etc) it's the only thing left to check
Try encoding it with utf8_encode(). I can't really tell from your code, but perhaps it could be implemented like this:
utf8_encode($domain);

Problem with utf-8

I have a problem with utf-8. I use the framework Codeigniter. For a client i have to
convert a CSV file to a database. But when i add the data trough a query to the database
the is a problem. Some characters doesn,t work. For example this word: Eén. When i add this word at PhpMyadmin, it's right.
When i try trought Codeigniter query, it doesn't.
My database stands on Utf-8. The Codeigniter config is utf-8. The database config is on utf-8.
Here is the query:
$query = "INSERT INTO lds_leerdoel(id,leerdoel,kind_omschrijving,cito,groep_id,OCW,opbouw,
kerngebied_id,jaar_maand,KVH,craats,refnivo,toelichting,auteur)
VALUES
(
'".$this->db->escape_str($id)."',
'".$leerdoel."',
'".$this->db->escape_str($kind_omschrijving)."',
'".$this->db->escape_str($cito)."',
'".$this->db->escape_str($groep_id)."',
'".$this->db->escape_str($OCW)."',
'".$this->db->escape_str($opbouw)."',
'".$this->db->escape_str($kerngebied_id)."',
'".$this->db->escape_str($jaar_maand)."',
'".$this->db->escape_str($KVH)."',
'".$this->db->escape_str($craats)."',
'".$this->db->escape_str($refnivo)."',
'".$this->db->escape_str($toelichting)."',
'".$this->db->escape_str($auteur)."'
)";
$this->db->query($query);
The problem is the field leerdoel. Does somebody a solution. Thank you verry much!!
Greetings,
Jelle
You'll need to run this query before the insert query
"SET NAMES utf8"
Shouldn't you use a national character string literal? http://dev.mysql.com/doc/refman/5.0/en/charset-national.html
Meaning that you would write:
$query = "INSERT INTO lds_leerdoel(id,leerdoel,kind_omschrijving,cito,groep_id,OCW,opbouw,
kerngebied_id,jaar_maand,KVH,craats,refnivo,toelichting,auteur)
VALUES
(
'".$this->db->escape_str($id)."',
N'".$leerdoel."',
-- rest of query omitted
Try to convert the text to Unicode with iconv():
iconv( "ISO-8859-1", "UTF-8", $leerdoel );
You might need to experiment a little if you don't know what encoding the file uses. (I think ISO-8859-1 or ISO-8859-15 are the most common.)
Try adding this to your header
header('Content-Type: text/html; Charset=UTF-8');
Also check the encoding settings of your editor.
I had a similar problem and i solve adding this in the beginning of my PHP file:
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');
Additionally, is very important to check if you are saving your PHP file in UTF-8 format without BOM, i had a big headache with this. I recomend Notepad++, it shows the current file encoding and allow you to convert to UTF-8 without BOM if necessary.
If you would like to see my problem and solution, it is here.
Hope it can help you!

Error in encoding mysql -> How can I reconvert it to something else?

I started a website some time ago using the wrong CHARSET in my DB and site. The HTML was set to ISO... and the DB to Latin... , the page was saved in Western latin... a big mess.
The site is in French, so I created a function that replaced all accents like "é" to "é". Which solved the issue temporarily.
I just learned a lot more about programming, and now my files are saved as Unicode UTF-8, the HTML is in UTF-8 and my MySQL table columns are set to ut8_encoding...
I tried to move back the accents to "é" instead of the "é", but I get the usual charset issues with the (?) or weird characters "â" both in MySQL and when the page is displayed.
I need to find a way to update my sql, through a function that cleans the strings so that it can finally go back to normal. At the moment my function looks like this but doesn't work:
function stripAcc3($value){
$ent = array(
'à'=>'à',
'â'=>'â',
'ù'=>'ù',
'û'=>'û',
'é'=>'é',
'è'=>'è',
'ê'=>'ê',
'ç'=>'ç',
'Ç'=>'Ç',
"î"=>'î',
"Ï"=>'ï',
"ö"=>'ö',
"ô"=>'ô',
"ë"=>'ë',
"ü"=>'ü',
"Ä"=>'ä',
"€"=>'€',
"′"=> "'",
"é"=> "é"
);
return strtr($value, $ent);
}
Any help welcome. Thanks in advance. If you need code, please tell me which part.
UPDATE
If you want the bounty points, I need detailed instructions on how to do it. Thanks.
Try using the following function instead, it should handle all the issues you described:
function makeStringUTF8($data)
{
if (is_string($data) === true)
{
// has html entities?
if (strpos($data, '&') !== false)
{
// if so, revert back to normal
$data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');
}
// make sure it's UTF-8
if (function_exists('iconv') === true)
{
return #iconv('UTF-8', 'UTF-8//IGNORE', $data);
}
else if (function_exists('mb_convert_encoding') === true)
{
return mb_convert_encoding($data, 'UTF-8', 'UTF-8');
}
return utf8_encode(utf8_decode($data));
}
else if (is_array($data) === true)
{
$result = array();
foreach ($data as $key => $value)
{
$result[makeStringUTF8($key)] = makeStringUTF8($value);
}
return $result;
}
return $data;
}
Regarding the specific instructions of how to use this, I suggest the following:
export your old latin database (I hope you still have it) contents as an SQL/CSV dump *
use the above function on the file contents and save the result on another file
import the file you generated in the previous step into the UTF-8 aware schema / database
* Example:
file_put_contents('utf8.sql', makeStringUTF8(file_get_contents('latin.sql')));
This should do it, if it doesn't let me know.
You might want to investigate what is used to fix WP database encoding issues:
http://codex.wordpress.org/Converting_Database_Character_Sets
To cut a long story short, most old WP sites were created with Swedish/Latin1 collated tables, which were used to store UTF8 strings. To collate the tables properly, the approach is to change the column to binary type, and then to change it to UTF8 text.
This avoids that the text gets wrangled when converting from Latin1 to UTF8 directly.
You will need to convert the offending rows using for example iconv. The challenge for you will be to know what rows are already UTF-8 and which are latin-1.
I'm not completely sure I understand your question, but
if you have
a UTF-8 database
all special characters in there stored as HTML entities
then a
html_entity_decode($string, ENT_QUOTES, "UTF-8");
should do the trick and turn all entities back into their UTF-8 native characters.
Make sure, not just your tables use utf-8, your database connection should use utf-8 as well.
$this->db = mysql_connect(MYSQL_SERVER,DB_LOGIN,DB_PASS);
mysql_set_charset ('utf8',$this->getConnection());
If you want to discuss with your database in UTF-8 you have to tell the Database that the connexion flow is a UTF-8 flow. You have to sent a request before each request you make to the database, this request in the following :
"SET NAMES utf8";
Personnaly I use that in the connect.inc.php files which create the connection to the database. Which this statement the database know that your sending UTF-8 encoded string and works perfectly !
mysql_set_charset function isn't working well, i tried this function in the past but the truth is that it don't do the trick.
For your complete issue, if you want to convert latin1 string to UTF-8, you have to convert first the latin1 string to a binary string format. Then convert the binary string into UTF-8 string, all can be done inside the database with database commands. See that artile (in french) : http://www.noidea.ca/2009/06/15/comment-convertir-une-db-de-latin1-a-utf8/
I can tell you that this method works because i used it to transform data from a database I've created.

PHP - how to detect encoding?

I'm using Amazon's API to obtain the description of books. The API returns XML responses and the description is marked up (with HTML) very poorly. To deal with this poorly marked up description, which oftentimes breaks the layout of my site, I'm trying to use HTML Tidy to "clean it up."
In order to prevent "weird" characters from being displayed on my web page, I think I need to tell Tidy what the input encoding is and what the desired output encoding is. I know I want the output to be UTF8. However, I'm not sure how to determine the encoding of the input (Amazon's book description).
I've tried something like this:
mb_detect_encoding($amazon_description);
It's helped, but I'm still occasionally getting weird characters (a black diamond with a question mark in it: �). My guess is that I'm not detecting the encoding properly.
Any suggestions what I need to do?
EDIT:
This is my current solution:
$sanitized_amazon_markup = preg_replace('/[^\w`~!##$%^&*()-=_+[\]{}|;\':",.\/<>? ]/', '', $sanitized_amazon_markup);
I'm not sure about this as this may delete stuff that I should be keeping.
Can you provide your tidy repairString call?
If you tried to use input-encoding and output-encoding from tidy options, try to not use these and use the third argument or repairString instead, something like this :
$oTidy = new tidy();
$page_content = $oTidy->repairString($page_content,
array("show-errors" => 0, "show-warnings" => false),
"utf8"
);
Edit :
After doing some tests, what I said before cannot work if you don't have utf8 encoding in $page_content already before calling repairString
But you will mostly end up with ISO-8859-1 (latin1) encoding if not UTF-8 already.
May I suggest you try :
$charset = mb_detect_encoding($amazon_description, 'UTF-8, ISO-8859-1');
if ($charset == "ISO-8859-1") {
$amazon_description = utf8_encode($amazon_description);
}
$oTidy = new tidy();
$amazon_description = $oTidy->repairString($amazon_description,
array("show-errors" => 0, "show-warnings" => false),
"utf8"
);

Categories