I am building an app with Apache Cordova for my company's support team. Everything worked while I was using a test database encoded in UTF8.
When I switched to the real database, I noticed it is encoded in win-1252.
The problem is that, even though the database is win-1252, many rows contain special characters such as "ç", "~", "´" and "`". Because of that, when I run the PHP script, the rows from my tables are not shown.
Keep in mind that I can't convert the database to UTF8.
PS: The only solution I can see is to go through each row and remove those characters, but that isn't a good solution (about 20,000 rows).
........................
PHP file:
header("Access-Control-Allow-Origin: *");
$dbconn = pg_connect("host=localhost dbname=bdgestclientes2 user=postgres password=postgres")
    or die('Could not connect: ' . pg_last_error());
$data = array();
$q = pg_query($dbconn, "SELECT * FROM clientes WHERE idcliente = 3");
// Collect every returned row
while ($row = pg_fetch_object($q)) {
    $data[] = $row;
}
echo json_encode($data);
I just needed to add one line to the PHP that sets the client encoding to Unicode; then I could use the data and display it as it is:
pg_set_client_encoding($dbconn, "UNICODE");
That shouldn't be a problem at all.
Windows-1252 supports “ç” (code point 0xE7), “~” (code point 0x7E), “`” (code point 0x60) and “´” (code point 0xB4).
PostgreSQL will automatically convert the characters to the database encoding.
You will get problems if you want to store characters that do not occur in Windows-1252, like “Σ”.
In that case, the correct solution is to use a database with a different encoding (UTF8).
If you cannot do that, you'll have to store the strings as binary objects (data type bytea) and handle encoding in your application. That will only work well if you don't need to process these strings in the database (e.g., use an index for case insensitive search).
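The code points quoted in this answer can be checked without a database. The following is a minimal Python sketch (not PHP), and the helper name `cp1252_byte` is made up for illustration:

```python
def cp1252_byte(ch: str):
    """Return the Windows-1252 byte for ch as hex, or None if it has none."""
    try:
        return ch.encode("cp1252").hex()
    except UnicodeEncodeError:
        return None

# The four characters from the question, plus one that is not in Windows-1252
for ch in ["ç", "~", "`", "´", "Σ"]:
    print(repr(ch), cp1252_byte(ch))
```

A character that comes back as `None` is one the database cannot store as text in this encoding.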
I have a similar issue, where I cannot modify the database setup, but I use php's html entity encode to work around:
I removed the html key elements from the native htmlentities because I work with wysiwyg editors and need to keep the content like that. If you have no such limitations you can just use htmlentities on the string.
function makeFriendly($string)
{
    // Full character => entity table
    $list = get_html_translation_table(HTML_ENTITIES);
    // Leave the HTML markup characters untouched
    unset($list['"']);
    unset($list['\'']);
    unset($list['<']);
    unset($list['>']);
    unset($list['&']);
    $search = array_keys($list);
    $replace = array_values($list);
    $search = array_map('utf8_encode', $search);
    // Replace each special character with its entity and return the result
    return str_replace($search, $replace, $string);
}
If I need the actual characters I can always call html_entity_decode on the database string to get the 'real' string.
Related
For some reason my special characters got encoded as the following string in a MySQL database:
&Atilde;?
Which shows up as:
Ã?
But actually should show up as:
Ö
What went wrong here? I use UTF-8 everywhere.
How can I fix this without recreating all content?
I executed the following in PHP:
<?php
echo str_replace("&", "&amp;", htmlentities("Ö", 0, "ISO-8859-1")), '<br />';
echo str_replace("&", "&amp;", htmlentities("Ö", 0, "UTF-8")), '<br />';
?>
The str_replace is just there to reveal any HTML mnemonics, which would otherwise
be translated by the browser to the original character, which I don't want to happen.
You will get this as output:
&Atilde;�
&Ouml;
You'll recognise the first value as what you found in the database, and the second one
is a bit like you wanted it to be.
Add to this the fact that the default value for the third argument to htmlentities
depends on your PHP version and is ISO-8859-1 in the case of version 5.3, the one you use.
Also realise that HTML documents which do not specify a character encoding will
by default post form data in ISO-8859-1 format.
Combining all this might give a clue about the cause of your problem:
My guess is that the data is correctly posted as UTF-8 to the server, but then htmlentities interprets this as a non-UTF-8, single byte encoding, and so turns one, multi-byte character into two single byte characters.
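That guess can be reproduced outside PHP. The sketch below is Python, not PHP, and `entities_as_latin1` is a hypothetical stand-in for an entity encoder that wrongly assumes ISO-8859-1. It is not the real htmlentities implementation.

```python
from html.entities import codepoint2name

def entities_as_latin1(utf8_bytes: bytes) -> str:
    """Mimic an entity encoder that treats UTF-8 bytes as ISO-8859-1."""
    out = []
    # Decoding as latin-1 turns every single byte into one character
    for ch in utf8_bytes.decode("latin-1"):
        name = codepoint2name.get(ord(ch))
        out.append(f"&{name};" if name else ch)
    return "".join(out)

posted = "Ö".encode("utf-8")  # one character, two bytes: C3 96
print(repr(entities_as_latin1(posted)))
```

The first byte (C3) has a named entity and the second (96) does not, so the result is an entity followed by a stray byte that browsers cannot display.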
Now for the measures to take so that this does not keep happening:
First make sure that your HTML form has the UTF-8 encoding, because this determines the
default encoding that a form will use for sending its data to the server:
<head>
<meta charset="UTF-8">
</head>
Make sure this is not overruled by another encoding in the form tag's accept-charset
attribute.
Then, skip the htmlentities call. You should not turn characters into their
HTML mnemonic when storing them in the database. MySQL
supports UTF-8 characters, so just store them like that.
For the second question, you'll have to find all cases and bulk replace them as you find
new instances. You could get a little help by producing some SQL statements
with a PHP script like the following:
<?php
// list all your non-ASCII characters here. Do not use str_split.
$chars = ["Ö","õ","Ũ","ũ"];
foreach ($chars as $ch) {
$bad = str_replace("&", "&amp;", htmlentities($ch, 0, "ISO-8859-1"));
echo "update mytable set myfield = replace(myfield, '$bad', '$ch')
where instr(myfield, '$bad') > 0;<br />";
}
?>
The output of this script will look like this:
update mytable set myfield = replace(myfield, '&Atilde;�', 'Ö') where instr(myfield, '&Atilde;�') > 0;
update mytable set myfield = replace(myfield, '&Atilde;&micro;', 'õ') where instr(myfield, '&Atilde;&micro;') > 0;
update mytable set myfield = replace(myfield, '&Aring;&uml;', 'Ũ') where instr(myfield, '&Aring;&uml;') > 0;
update mytable set myfield = replace(myfield, '&Aring;&copy;', 'ũ') where instr(myfield, '&Aring;&copy;') > 0;
Of course, you could decide to make a PHP script that will even do the updates itself.
Hopefully you can use this information to fix the issues.
For PDO, use something like
$db = new PDO('dblib:host=host;dbname=db;charset=UTF-8', $user, $pwd);
Ã? is two or three things going wrong, not just one!
C396 is the utf8 hex for Ö, or the latin1 hex for the two characters Ã–. It requires something else to go wrong to get ? or the black diamond.
Let's see what is in the table; do
SELECT col, HEX(col) FROM tbl WHERE ...
(If you have already done the previously suggested replace(), then the table may be in an even worse mess. Or it might be fixed.)
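To interpret what HEX(col) returns, it helps to know what the correct and the double-encoded bytes look like. This is a small Python sketch, and the variable names are illustrative only:

```python
correct = "Ö".encode("utf-8")
print(correct.hex())             # hex of a correctly stored UTF-8 Ö

# The same two bytes read as Windows-1252 (MySQL's latin1) give two characters
misread = correct.decode("cp1252")
print(misread)

# If those two characters are then stored as UTF-8 again, the text is double-encoded
double_encoded = misread.encode("utf-8")
print(double_encoded.hex())
```

If the table shows the short hex value, the stored bytes are fine and the problem is on the way out. If it shows the long one, the data itself was double-encoded on the way in.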
I have come across some problems when inputting certain characters into my MySQL database using PHP. What I am doing is submitting user-entered text to a database. I cannot figure out what I need to change to allow any kind of character to be put into the database and printed back out through PHP as it's supposed to be.
My MySQL collation is: latin1_swedish_ci
Just before I send the text to the database from my form I use mysql_real_escape_string() on the data.
Example below
this text:
�People are just as happy as they make up their minds to be.�
� Abraham Lincoln
is supposed to look like this:
“People are just as happy as they make up their minds to be.”
― Abraham Lincoln
As mentioned by others, you need to convert to UTF8 from end to end if you want to support "special" characters. This means your web page, PHP, mysql connection and mysql table. The web page is fairly simple, just use the meta tag for UTF8. Ideally your headers would say UTF8 also.
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
Set your PHP to use UTF8. Things would probably work anyway, but it's a good measure to do this:
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
For MySQL, you want to convert your table to UTF8; there is no need to export/import.
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8
You can, and should, configure MySQL to default to utf8. But you can also run the query:
SET NAMES UTF8
as the first query after establishing a connection and that will "convert" your database connection to UTF8.
That should solve all your character display problems.
The likeliest cause of the problem is that the database connection is set to latin1 but you are feeding it text encoded in UTF-8. The simplest way to solve this is to convert your input into what the client expects:
$quote = iconv("UTF-8", "WINDOWS-1252//TRANSLIT", $quote);
(What MySQL calls latin1 is windows-1252 in the rest of the world.) Note that many characters, such as the quotation dash U+2015 that you use there, cannot be represented in this encoding and will be converted into something else. Ideally you should change the column encoding to utf8.
An alternative solution: set the database connection to utf8. It doesn't matter how the columns are encoded: MySQL internally converts text from the connection encoding into the storage encoding, you can keep the columns as latin1 if you want to. (If you do, the quotation dash U+2015 will be turned into a question mark ? because it's not in latin1)
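What happens to each character when text lands in a latin1 (Windows-1252) column can be sketched as follows. This is Python rather than PHP, `to_latin1_column` is a hypothetical helper, and the "?" replacement is Python's `errors="replace"` behaviour, used here to stand in for what the answer describes for MySQL.

```python
def to_latin1_column(text: str) -> str:
    """Simulate a round trip through a Windows-1252 column."""
    stored = text.encode("cp1252", errors="replace")  # unmappable -> b"?"
    return stored.decode("cp1252")

quote = "“People are just as happy as they make up their minds to be.”\n\u2015 Abraham Lincoln"
print(to_latin1_column(quote))
```

The curly quotes survive because Windows-1252 has them (bytes 0x93 and 0x94), while the quotation dash U+2015 does not and is lost.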
How to set the connection encoding depends on what library you are using: if you use the deprecated MySQL library it's mysql_set_charset, if MySQLi it's mysqli_set_charset, if PDO add charset=utf8 to the DSN.
If you do this you'll have to set the page encoding to UTF-8 with the Content-Type header.
Otherwise you would be having the same problem with the browser: feeding it text encoded in UTF-8 when it's expecting something else:
header("Content-Type: text/html; charset=utf-8");
The solutions provided are helpful if starting from scratch. Putting all possible connections to UTF-8 is indeed the safest. UTF-8 is the most used charset on the net for a variety of reasons.
Some suggestions and a word of warning:
copy the tables you want to sanitize with a unique prefix (tmp_)
although your db-connection is forced to utf8, check your General Settings collation and change it to utf8_bin if that was not done yet
you need to run this on the local server
the funny char error is mostly due to mixing LATIN1 with UTF-8 configurations. This solution is designed for this. It could work with char-sets other than LATIN1, but I haven't checked this
check these tmp_tables extensively before copying back to the original
Build the two arrays needed for the magic:
$chars = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, "UTF-8");
$LATIN1 = $UTF8 = array();
// foreach instead of each(), which was removed in PHP 8
foreach ($chars as $key => $val) {
$UTF8[] = $key;
$LATIN1[] = $val;
}
Now build up the routines you need: (tables->)rows->fields and at each field call
$row[$field] = mysql_real_escape_string(str_replace($LATIN1 , $UTF8 , $row[$field]));
$q[] = "$field = '{$row[$field]}'";
Finally build up and send the query:
mysql_query("UPDATE $table SET " . implode(" , " , $q) . " WHERE id = '{$row['id']}' LIMIT 1");
Change the MySQL collation to utf8_unicode_ci or utf8_general_ci, for the database as well as the table.
Yes, you will need to set your database to UTF-8. There are many ways to do it: by changing the config file, via phpMyAdmin, or by calling a PHP function (sorry, memory blank) right before inserting into and updating MySQL.
Unfortunately, I think you will have to re-enter any data you entered before.
One thing you also need to know, from personal experience: make sure all related tables have the same collation, or you won't be able to JOIN them.
For reference: http://dev.mysql.com/doc/refman/5.6/en/charset-syntax.html
Also, it can be an Apache setting. We experienced the same issue on a 'free-hosting' server as well as on my brother's server. Once we switched to another server, all the characters were fine. Verify your Apache settings; sorry, but I can't shed more light on Apache's config.
Forget everything else; you just need to follow these three points, and every problem regarding special language characters will be resolved.
1- Define the collation of your table as utf8_general_ci.
2- Put <meta http-equiv="content-type" content="text/html; charset=utf-8"> in the HTML right after the opening head tag.
3- Call mysql_set_charset('utf8',$link_identifier); in the file where you connect to the database, right after selecting the database with mysql_select_db. This lets you store and retrieve data properly in whatever language it is.
If your text has been encoded and decoded with the wrong encoding and so the mojibake is actually "solidified" into unicode characters, then the solutions mentioned so far won't work. I ended up having success with the ftfy Python package to automatically detect/fix mojibake:
https://github.com/LuminosoInsight/python-ftfy
https://pypi.org/project/ftfy/
https://ftfy.readthedocs.io/en/latest/
>>> import ftfy
>>> print(ftfy.fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง
Hopefully this helps people who are in a similar situation.
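For the simplest kind of mojibake (UTF-8 bytes that were read once as Windows-1252), the repair can be sketched with the standard library alone. This is a narrow illustration with a made-up function name, not a replacement for ftfy, which handles many more cases.

```python
def undo_simple_mojibake(text: str) -> str:
    """Reverse one round of 'UTF-8 bytes read as Windows-1252'."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage; leave it unchanged

good = "(ง'⌣')ง"
broken = good.encode("utf-8").decode("cp1252")  # produce the mojibake on purpose
print(broken)
print(undo_simple_mojibake(broken))
```

Text that was never damaged either fails the round trip and is returned as-is, or contains only ASCII and passes through unchanged.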
The feed in question is: http://api.inoads.com/snowstorm/feed.xml
Here is the PHP code I am using for the generation:
<?php
$database = 'xxxx';
$dbconnect = mysql_pconnect('xxxx', 'xxxx', 'xxxx');
mysql_select_db($database, $dbconnect);
$query = "SELECT * FROM the_queue WHERE id LIKE '%' ORDER BY id DESC LIMIT 25";
$result = mysql_query($query, $dbconnect);
while ($line = mysql_fetch_assoc($result))
{
$return[] = $line;
}
$now = date("D, d M Y H:i:s T");
$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<rss version=\"2.0\">
<channel>
<title>The Queue</title>
<link>http://readapp.net</link>
<description>A curated reading list.</description>
<language>en-us</language>
<pubDate>$now</pubDate>
<lastBuildDate>$now</lastBuildDate>
";
foreach ($return as $line)
{
$output .= "<item><title>".htmlspecialchars($line['title'])."</title>
<description>".htmlspecialchars($line['description'])."</description>
<link>".htmlspecialchars($line['link'])."</link>
<pubDate>".htmlspecialchars($line['pubDate'])."</pubDate>
</item>";
}
$output .= "</channel></rss>";
$fh = fopen('feed.xml', 'w');
fwrite($fh, $output);
?>
What might be causing the error?
Here's a link from the feed validator: http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fapi.inoads.com%2Fsnowstorm%2Ffeed.xml
You said the XML file is UTF-8, but when I download it and open it in my text editor it auto-detects the windows latin1 encoding, and the quotes display perfectly.
If I force my text editor to use UTF-8, it shows an error message because there are illegal characters for the UTF-8 encoding.
Therefore, your data is not UTF-8, it is latin1. You need to find out exactly where that's happening. It could be any one, or several of:
is the HTML page where the content is typed in by the user set to UTF-8?
If not, the browser will be sending latin1 quotes. To fix this, the first tag in your <head> needs to be:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
</head>
is every browser correctly respecting your UTF-8 setting in that page's HTML?
If you specify UTF-8 and the page contains characters illegal in that encoding, some browsers might decide to use a different encoding despite the <meta> tag. How to check this is different in every browser.
is the MySQL connection when inserting into the database set to use UTF-8?
You need to be using UTF-8 here, or else MySQL may try to convert the encoding for you, often corrupting the text. Set the encoding with:
$database = 'xxxx';
$dbconnect = mysql_pconnect('xxxx', 'xxxx', 'xxxx');
mysql_select_db($database, $dbconnect);
mysql_query('SET NAMES utf8', $dbconnect);
is the MySQL table (and individual column) set to use UTF-8?
Again, to avoid MySQL doing its own buggy conversion, you need to make sure it's using UTF-8 for the table and also the individual column. Do a structure dump of the database and check for:
CREATE TABLE `the_queue` (
...
) ... DEFAULT CHARSET=utf8;
And also make sure there isn't something like this on any of the columns:
`description` varchar(255) CHARACTER SET latin1,
is the MySQL connection when reading the database set to use UTF-8?
Your read connection also needs to be utf8. So double check that.
are you doing anything in the PHP that cannot handle UTF-8?
PHP has some functions which cannot be used on UTF-8 strings, as they will corrupt them. One of those functions is htmlentities() when it is called without an explicit UTF-8 charset argument, so make sure you always use htmlspecialchars(). The easiest way to test this is to start commenting out big chunks of your code to see where the encoding is breaking.
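The check described at the start of this answer (forcing UTF-8 and seeing whether it fails) can be automated. Here is a minimal Python sketch in which the function name and the sample feed fragment are made up for illustration:

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        return exc.start

# A title with curly quotes, saved as Windows-1252 but labelled UTF-8
feed = "<title>“Hi”</title>".encode("cp1252")
print(first_invalid_utf8_offset(feed))
```

Running the same check on the output of each stage (form input, database read, generated file) shows where the bytes stop being valid UTF-8.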
There is one problem here:
$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
...
There is a string containing "?>". This is the finalization marker for php. It will give you an error.
You can avoid these problems this way:
$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?".">
...
The point of htmlentities is to replace all characters that have defined HTML character entities with those entities. If you really don't want any character entities (as your desired result suggests), don't use htmlentities.
By default, htmlentities uses the latin-1 charset, so it chokes on the smart quotes (indeed, all multibyte characters), which is where you see the question marks. One fix is to use htmlspecialchars to convert a much more limited set of characters (&, <, >, ' and "). This will still convert the double quotes because, well, that's the point of htmlspecialchars, unless you specify the ENT_NOQUOTES as the second argument. Another fix is to specify the character set as the third argument (this isn't exclusive of using htmlspecialchars).
The fourth argument to either specifies whether or not to encode already encoded characters. Whether or not to double-encode depends on the source data.
$line['description'] = '"Dave, stop. Stop, will you? Stop, Dave. Will you stop, Dave?” ... “Dave, my mind is going,” HAL says, forlornly. “I can feel it. I can feel it.”';
echo "<description>" . htmlspecialchars($line['description'], ENT_NOQUOTES, 'UTF-8', false) . "</description>";
See also:
RSS 2.0 Best Practice Tip: Entity-encoded HTML in Descriptions
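The difference between escaping only the markup characters and converting every character to an entity can be seen in a short Python sketch. It uses `html.escape` as a rough analogue of htmlspecialchars with ENT_NOQUOTES. It has no equivalent of the double-encode flag, and `rss_description` is a hypothetical helper.

```python
from html import escape

def rss_description(text: str) -> str:
    """Escape only &, < and >; leave quotes and non-ASCII characters alone."""
    return "<description>" + escape(text, quote=False) + "</description>"

print(rss_description("“Dave, my mind is going,” HAL says. <b>R&D</b>"))
```

The curly quotes pass through untouched, so they stay valid as long as the file is really written as UTF-8.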
The problem, I assume, is that you are storing this string with its quotes in the database. If so, PHP is removing the quotes (which is proper, to avoid bugs such as SQL injection). So remove the quotes in the database and add them when generating the XML file; that is the simplest approach in my opinion. Also try to avoid double quotes " and use single quotes ' instead: inside double quotes the PHP parser additionally checks the contents.
Another error is the format of the date. The date must be in RFC 822 format, like "Wed, 02 Oct 2002 08:00:00 EST", not "July/August 2008".
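As an illustration of the expected date shape, here is a Python sketch using the standard library's email date formatter. `rss_pub_date` is a made-up name, and the function assumes the timestamp is already in UTC.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def rss_pub_date(moment: datetime) -> str:
    """Format a UTC datetime the way RSS pubDate expects (RFC 822 style)."""
    return format_datetime(moment, usegmt=True)

print(rss_pub_date(datetime(2002, 10, 2, 8, 0, 0, tzinfo=timezone.utc)))
```

Each stored pubDate value would need to be parsed into a real date first, since free text like "July/August 2008" cannot be converted automatically.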
I started a website some time ago using the wrong charset in my DB and site. The HTML was set to ISO..., the DB to Latin..., and the page was saved as Western Latin...: a big mess.
The site is in French, so I created a function that replaced all accented characters such as "é" with entities such as "&eacute;", which solved the issue temporarily.
I have since learned a lot more about programming, and now my files are saved as Unicode UTF-8, the HTML is in UTF-8 and my MySQL table columns are set to utf8_encoding...
I tried to change the accents back to "é" instead of "&eacute;", but I get the usual charset issues, with (?) or weird characters like "â", both in MySQL and when the page is displayed.
I need to find a way to update my sql, through a function that cleans the strings so that it can finally go back to normal. At the moment my function looks like this but doesn't work:
function stripAcc3($value){
$ent = array(
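'&agrave;'=>'à',
'&acirc;'=>'â',
'&ugrave;'=>'ù',
'&ucirc;'=>'û',
'&eacute;'=>'é',
'&egrave;'=>'è',
'&ecirc;'=>'ê',
'&ccedil;'=>'ç',
'&Ccedil;'=>'Ç',
"&icirc;"=>'î',
"&Iuml;"=>'ï',
"&ouml;"=>'ö',
"&ocirc;"=>'ô',
"&euml;"=>'ë',
"&uuml;"=>'ü',
"&Auml;"=>'ä',
"&euro;"=>'€',
"&prime;"=> "'",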
"é"=> "é"
);
return strtr($value, $ent);
}
Any help welcome. Thanks in advance. If you need code, please tell me which part.
UPDATE
If you want the bounty points, I need detailed instructions on how to do it. Thanks.
Try using the following function instead, it should handle all the issues you described:
function makeStringUTF8($data)
{
if (is_string($data) === true)
{
// has html entities?
if (strpos($data, '&') !== false)
{
// if so, revert back to normal
$data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');
}
// make sure it's UTF-8
if (function_exists('iconv') === true)
{
return @iconv('UTF-8', 'UTF-8//IGNORE', $data);
}
else if (function_exists('mb_convert_encoding') === true)
{
return mb_convert_encoding($data, 'UTF-8', 'UTF-8');
}
return utf8_encode(utf8_decode($data));
}
else if (is_array($data) === true)
{
$result = array();
foreach ($data as $key => $value)
{
$result[makeStringUTF8($key)] = makeStringUTF8($value);
}
return $result;
}
return $data;
}
Regarding the specific instructions of how to use this, I suggest the following:
export your old latin database (I hope you still have it) contents as an SQL/CSV dump *
use the above function on the file contents and save the result on another file
import the file you generated in the previous step into the UTF-8 aware schema / database
* Example:
file_put_contents('utf8.sql', makeStringUTF8(file_get_contents('latin.sql')));
This should do it, if it doesn't let me know.
You might want to investigate what is used to fix WP database encoding issues:
http://codex.wordpress.org/Converting_Database_Character_Sets
To cut a long story short, most old WP sites were created with Swedish/Latin1 collated tables, which were used to store UTF8 strings. To collate the tables properly, the approach is to change the column to binary type, and then to change it to UTF8 text.
This prevents the text from getting mangled when converting from Latin1 to UTF8 directly.
You will need to convert the offending rows using for example iconv. The challenge for you will be to know what rows are already UTF-8 and which are latin-1.
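One common heuristic for telling the two kinds of rows apart is to try a strict UTF-8 decode first and fall back to Latin-1 (Windows-1252) only when that fails. The sketch below is Python, `to_text` is a hypothetical name, and the heuristic is not foolproof: a few Latin-1 byte sequences happen to be valid UTF-8 too.

```python
def to_text(raw: bytes) -> str:
    """Decode a row value that may be UTF-8 or Windows-1252."""
    try:
        return raw.decode("utf-8")           # valid UTF-8: keep as is
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")

rows = ["café".encode("utf-8"), "café".encode("cp1252")]
print([to_text(r) for r in rows])
```

Pure ASCII rows decode the same way under both encodings, so only rows containing bytes above 0x7F need attention.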
I'm not completely sure I understand your question, but
if you have
a UTF-8 database
all special characters in there stored as HTML entities
then a
html_entity_decode($string, ENT_QUOTES, "UTF-8");
should do the trick and turn all entities back into their UTF-8 native characters.
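The same decoding step looks like this in Python, shown only to illustrate what the conversion does to a stored string. `html.unescape` is the standard-library counterpart, and the sample text is made up.

```python
from html import unescape

stored = "&eacute;t&eacute; &agrave; la plage, &euro;5 &amp; plus"
print(unescape(stored))
```

Note that this also decodes markup entities such as &amp;amp;, so it should only be applied to text that is not meant to be kept as HTML source.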
Make sure that not just your tables use utf-8; your database connection should use utf-8 as well.
$this->db = mysql_connect(MYSQL_SERVER,DB_LOGIN,DB_PASS);
mysql_set_charset ('utf8',$this->getConnection());
If you want to talk to your database in UTF-8, you have to tell the database that the connection uses UTF-8. You have to send the following query before the other queries you make to the database:
"SET NAMES utf8";
Personally, I put it in the connect.inc.php file which creates the connection to the database. With this statement the database knows that you are sending UTF-8 encoded strings, and it works perfectly!
The mysql_set_charset function didn't work well for me: I tried it in the past, but it didn't do the trick.
For your complete issue, if you want to convert a latin1 string to UTF-8, you first have to convert the latin1 string to a binary string format, then convert the binary string into a UTF-8 string. All of this can be done inside the database with database commands. See this article (in French): http://www.noidea.ca/2009/06/15/comment-convertir-une-db-de-latin1-a-utf8/
I can tell you that this method works because I used it to transform data in a database I created.
When I insert data containing non a-z characters, such as chrząszcz, into the database with Zend_Form, the string gets cut off and only chrz is saved in the database.
Everything in MySQL is set to utf8_general_ci, I call SET CHARACTER SET 'utf8' when connecting to MySQL, and the files are also UTF-8.
I have no idea what to do about it.
I also wrote a standalone script, and it inserts and reads that string correctly. Zend Framework also reads it correctly. The problem is only with inserting.
Does anyone know how to fix it?
UPDATE:
If I insert data with:
$db->query("INSERT INTO unit SET name = 'chrząszcz'");
in Zend Framework, it works. The problem is with inserting this way:
$unitTable = new Model_Unit_Table();
$unit = $unitTable->createRow();
$unit->setFromArray($form->getValues());
$unit->save();
UPDATE 2:
The problem is with Zend_Filter_StringToLower: it turns the string chrząszcz into chrz�szcz.
How can I get this filter to work correctly?
Responding to your comment:
No. var_dump of $form->getValue()
gives chrz�szcz. When var_dump a
$_POST superglobal it gives correct
chrząszcz.
Does this work?
$testArray = array('name' => 'chrząszcz');
$unitTable = new Model_Unit_Table();
$unit = $unitTable->createRow();
$unit->setFromArray($testArray);
$unit->save();
If yes, your problem may be more Zend_Form related.
Edit:
Your filter needs to use mb_strtolower() instead of strtolower().
Edit2:
Try this:
$filter = new Zend_Filter_StringToLower();
$filter->setEncoding('UTF-8');
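Why a byte-oriented lowercase function breaks this word can be sketched in Python. The sketch assumes the byte-wise function applies a Latin-1 case table to each byte, which is one plausible way to get the corruption seen in the question. `bytewise_lower` is a made-up name, not a PHP or Zend API.

```python
def bytewise_lower(utf8_bytes: bytes) -> bytes:
    """Lowercase byte by byte with a Latin-1 case table, ignoring UTF-8 structure."""
    return utf8_bytes.decode("latin-1").lower().encode("latin-1")

word = "chrząszcz"
damaged = bytewise_lower(word.encode("utf-8"))
print(damaged.decode("utf-8", errors="replace"))   # contains a replacement character

# Encoding-aware lowercasing works on characters, not bytes
print("CHRZĄSZCZ".lower())
```

The first byte of "ą" (0xC4) looks like an uppercase Latin-1 letter and gets changed to 0xE4, which leaves an invalid UTF-8 sequence. An encoding-aware function never touches the individual bytes.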
I'm pretty sure this is an encoding problem. Strings dropping off at the point of the first non-standard (i.e. above the ASCII character set) character is most often caused by inserting UTF-8 data in a non-UTF8 context so my first suspicion would be that the encoding of the database connection is not properly set.
Can you try $db->query("SET NAMES utf8"); before calling the insertion commands?
Is the connection Zend_form uses definitely $db?
Are you 100% sure the page your form is in is UTF-8 encoded?