Revert badly encoded umlauts - php

For some reason my special characters got encoded as the following string in a mysql database:
Ã?
Which shows up as:
Ã?
But actually should show up as:
Ö
What went wrong here? I use UTF-8 everywhere.
How can I fix this without recreating all content?

I executed the following in PHP:
<?php
echo str_replace("&", "&", htmlentities("Ö", 0, "ISO-8859-1")) , '<br />';
echo str_replace("&", "&", htmlentities("Ö", 0, "UTF-8")), "</br>";
?>
The str_replace is just there to reveal any HTML mnemonics, which would otherwise
be translated by the browser to the original character, which I don't want to happen.
You will get this as output:
�
Ö
You'll recognise the first value as what you found in the database, and the second one
is a bit like you wanted it to be.
Add to this the fact that the default value for the third argument to htmlentities
depends on your PHP version and is ISO-9959-1 in the case of version 5.3, the one you use.
Also realise that HTML documents which do not specify a character encoding will
by default post form data in ISO-8859-1 format.
Combining all this might give a clue about the cause of your problem:
My guess is that the data is correctly posted as UTF-8 to the server, but then htmlentities interprets this as a non-UTF-8, single byte encoding, and so turns one, multi-byte character into two single byte characters.
Now to the measures to take that this does not continue to happen:
First make sure that your HTML form has the UTF-8 encoding, because this determines the
default encoding that a form will use for sending its data to the server:
<head>
<meta charset="UTF-8">
</head>
Make sure this is not overruled by another encoding in the form tag's accept-charset
attribute.
Then, skip the htmlentities call. You should not turn characters into their
HTML mnemonic when storing them in the database. MySql
supports UTF-8 characters, so just store them like that.
For the second question, you'll have to find all cases and bulk replace them as you find
new instances. You could get get a little help by producing some SQL statements
with a PHP script like the following:
<?php
// list all your non-ASCII characters here. Do not use str_split.
$chars = ["Ö","õ","Ũ","ũ"];
foreach ($chars as $ch) {
$bad = str_replace("&", "&", htmlentities($ch, 0, "ISO-8859-1"));
echo "update mytable set myfield = replace(myfield, '$bad', '$ch')
where instr(myfield, '$bad') > 0;<br />";
}
?>
The output of this script will look like this:
update mytable set myfield = replace(myfield, 'Ã�', 'Ö') where instr(myfield, 'Ã�') > 0;
update mytable set myfield = replace(myfield, 'õ', 'õ') where instr(myfield, 'õ') > 0;
update mytable set myfield = replace(myfield, 'Ũ', 'Ũ') where instr(myfield, 'Ũ') > 0;
update mytable set myfield = replace(myfield, 'Å©', 'ũ') where instr(myfield, 'Å©') > 0;
Of course, you could decide to make a PHP script that will even do the updates itself.
Hopefully you can use this information to fix the issues.

For PDO, use something like
$db = new PDO('dblib:host=host;dbname=db;charset=UTF-8', $user, $pwd);
Ã? is two or three things going wrong, not just one!
C396 is the utf8 hex for Ö or the latin1 hex for the two characters Ö. It requires something else to go wrong to get ? or the black diamond.
Let's see what is in the table; do
SELECT col, HEX(col) FROM tbl WHERE ...
(If you have already done the previously suggested replace(), then the table may be in an even worse mess. Or it might be fixed.)

Related

Php problems with characters

I am building an app with Apache cordova for the support team for my company and everything was ok when I was using a test database in UTF8 was working.
Then when I was implement the real db I notice it was encoded with win-1252.
The problem is, even the db is with win-1252 we have many rows using special caracters like "ç" and "~" and "´" and "`" and with that when I am running the php all rows in the tables in my db will not show becasue of that.
Keep in mind I cann't convert the db to utf8.
ps:The solution I see is go to each row and remove that caracters but isn't a good solution(about 20,000 rows)
........................
PHP file:
header("Access-Control-Allow-Origin: *");
$dbconn = pg_connect("host=localhost dbname=bdgestclientes2
user=postgres password=postgres")
or die('Could not connect: ' . pg_last_error());
$data=array();
$q=pg_query($dbconn,"SELECT * FROM clientes WHERE idcliente = 3");
$row=pg_fetch_object($q)){$data[]=$row};
echo json_encode($data);
I just needed to add a line in php to encode to unicode so I could use the data and display the way it is
pg_set_client_encoding($dbconn, "UNICODE");
That shouldn't be a problem at all.
Windows-1252 supports “ç” (code point 0xE7), “~” (code point 0x7E), “`” (code point 0x60) and “´” (code point 0xB4).
PostgreSQL will automatically convert the characters to the database encoding.
You will get problems if you want to store characters that do not occur in Windows-1252, like “Σ”.
In that case, the correct solution is to use a database with a different encoding (UTF8).
If you cannot do that, you'll have to store the strings as binary objects (data type bytea) and handle encoding in your application. That will only work well if you don't need to process these functions in the database (e.g., use an index for case insensitive search).
I have a similar issue, where I cannot modify the database setup, but I use php's html entity encode to work around:
I removed the html key elements from the native htmlentities because I work with wysiwyg editors and need to keep the content like that. If you have no such limitations you can just use htmlentities on the string.
function makeFriendly($string)
$list = get_html_translation_table(HTML_ENTITIES);
unset($list['"']);
unset($list['\'']);
unset($list['<']);
unset($list['>']);
unset($list['&']);
$search = array_keys($list);
$replace = array_values($list);
$search = array_map('utf8_encode', $search);
str_replace($replace, $search, $string);
}
If I need the actual characters I can always call html_entity_decode on the database string to get the 'real' string.

Getting special characters out of a MySQL database with PHP [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 9 years ago.
I have a table that includes special characters such as ™.
This character can be entered and viewed using phpMyAdmin and other software, but when I use a SELECT statement in PHP to output to a browser, I get the diamond with question mark in it.
The table type is MyISAM. The encoding is UTF-8 Unicode. The collation is utf8_unicode_ci.
The first line of the html head is
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I tried using the htmlentities() function on the string before outputting it. No luck.
I also tried adding this to php before any output (no difference):
header('Content-type: text/html; charset=utf-8');
Lastly I tried adding this right below the initial mysql connection (this resulted in additional odd characters being displayed):
$db_charset = mysql_set_charset('utf8',$db);
What have I missed?
Below code works for me.
$sql = "SELECT * FROM chartest";
mysql_set_charset("UTF8");
$rs = mysql_query($sql);
header('Content-type: text/html; charset=utf-8');
while ($row = mysql_fetch_array($rs)) {
echo $row['name'];
}
There are a couple things that might help. First, even though you're setting the charset to UTF-8 in the header, that might not be enough. I've seen the browser ignore that before. Try forcing it by adding this in the head of your html:
<meta charset='utf-8'>
Next, as mentioned here, try doing this:
mysql_query ("set character_set_client='utf8'");
mysql_query ("set character_set_results='utf8'");
mysql_query ("set collation_connection='utf8_general_ci'");
EDIT
So I've just done some reading up an playing around a bit. First let me tell you, despite what I mentioned in the comments, utf8_encode() and utf8_decode() will not help you here. It helps to actually understand UTF-8 encoding. I found the Wikipedia page on UTF-8 very helpful. Assuming the value you are getting back from the database is in fact already UTF-8 encoded and you simply dump it out right after getting it then it should be fine.
If you are doing anything with the database result (manipulating the string in any way especially) and you don't use the unicode aware functions from the PHP mbstring library then it will probably mess it up since the standard PHP string functions are not unicode aware.
Once you understand how UTF-8 encoding works you can do something cool like this:
$test = "™";
for($i = 0; $i < strlen($test); $i++) {
echo sprintf("%b ", ord($test[$i]));
}
Which dumps out something like this:
11100010 10000100 10100010
That's a properly encoded UTF-8 '™' character. If you don't have a character like that in your data retrieved from the database then something is messed up.
To check, try searching for a special character that you know is in the result using mb_strpos():
var_dump(mb_strpos($db_result, '™'));
If that returns anything other than false then the data from the database is fine, otherwise we can at least establish that it's a problem between PHP and the database.
you need to execute the following query first.
mysql_query("SET NAMES utf8");

PHP/MySQL encoding problems. � instead of certain characters

I have come across some problems when inputting certain characters into my mysql database using php. What I am doing is submitting user inputted text to a database. I cannot figure out what I need to change to allow any kind of character to be put into the database and printed back out through php as it's suppose to.
My MySQL collation is: latin1_swedish_ci
Just before I send the text to the database from my form I use mysql_real_escape_string() on the data.
Example below
this text:
�People are just as happy as they make up their minds to be.�
� Abraham Lincoln
is suppose to look like this:
“People are just as happy as they make up their minds to be.”
― Abraham Lincoln
As mentioned by others, you need to convert to UTF8 from end to end if you want to support "special" characters. This means your web page, PHP, mysql connection and mysql table. The web page is fairly simple, just use the meta tag for UTF8. Ideally your headers would say UTF8 also.
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
Set your PHP to use UTF8. Things would probably work anyway, but it's a good measure to do this:
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
For mysql, you want to convert your table to UTF8, no need to export/import.
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8
You can, and should, configure mysql to default utf8. But you can also run the query:
SET NAMES UTF8
as the first query after establishing a connection and that will "convert" your database connection to UTF8.
That should solve all your character display problems.
The likeliest cause of the problem is that the database connection is set to latin1 but you are feeding it text encoded in UTF-8. The simplest way to solve this is to convert your input into what the client expects:
$quote = iconv("UTF-8", "WINDOWS-1252//TRANSLIT", $quote);
(What MySQL calls latin1 is windows-1252 in the rest of the world.) Note that many characters, such as the quotation dash U+2015 that you use there, cannot be represented in this encoding and will be converted into something else. Ideally you should change the column encoding to utf8.
An alternative solution: set the database connection to utf8. It doesn't matter how the columns are encoded: MySQL internally converts text from the connection encoding into the storage encoding, you can keep the columns as latin1 if you want to. (If you do, the quotation dash U+2015 will be turned into a question mark ? because it's not in latin1)
How to set the connection encoding depends on what library you are using: if you use the deprecated MySQL library it's mysql_set_charset, if MySQLi it's mysqli_set_charset, if PDO add encoding=utf8 to the DSN.
If you do this you'll have set the page encoding to UTF-8 with the Content-Type header.
Otherwise you would be having the same problem with the browser: feeding it text encoded in UTF-8 when it's expecting something else:
header("Content-Type: text/html; charset=utf-8");
The solutions provided are helpful if starting from scratch. Putting all possible connections to UTF-8 is indeed the safest. UTF-8 is the most used charset on the net for a variety of reasons.
Some suggestions and a word of warning:
copy the tables you want to sanitize with a unique prefix (tmp_)
although your db-connection is forced to utf8, check you General Settings collation, change to utf8_bin if that was not done yet
you need to run this on the local server
the funny char error is mostly due to mixing LATIN1 with UTF-8 configurations. This solution is designed for this. It could work with other used char-sets that LATIN1 but I haven't checked this
check these tmp_tables extensively before copying back to the original
Builds the 2 array needed for the magic:
$chars = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, "UTF-8");
$LATIN1 = $UTF8 = array();
while (list($key,$val) = each ($chars)) {
$UTF8[] = $key;
$LATIN1[] = $val;
}
Now build up the routines you need: (tables->)rows->fields and at each field call
$row[$field] = mysql_real_escape_string(str_replace($LATIN1 , $UTF8 , $row[$field]));
$q[] = "$field = '{$row[$field]}'";
Finally build up and send the query:
mysql_query("UPDATE $table SET " . implode(" , " , $q) . " WHERE id = '{$row['id']}' LIMIT 1");
change the MySQL collation to utf8_unicode_ci or utf8_general_ci, including the table and the database.
You will need to set your database in utf-8 yes. There is many ways to do it. By changin the config file, via phpmyadmin or by calling php function (sorry memory blank) right before insert and update the mysql.
Unfortunately, i think you will have to re-enter any data you entered before.
One thing you also need to know, from personnal experience, make sure all table with relation have the same collation or you won'T be able to JOIN them.
as reference: http://dev.mysql.com/doc/refman/5.6/en/charset-syntax.html
Also, i can be a apache setting. We've experienced the same issue on 'free-hosting' server as well as on my brother's server. Once switched to another server, all the charater's became neat. Verfiy you apache setting, sorry but i can't bting more light on apache's config.
Get rid of everything you just need to follow these two points, every problem regarding special languages characters will be resolved.
1- You need to define the collation of your table to be utf8_general_ci.
2- define <meta http-equiv="content-type" content="text/html; charset=utf-8"> in the HTML after head tag.
2- You need to define the mysql_set_charset('utf8',$link_identifier); in the file where you made connection with the database and right after the selection of database like 'mysql_select_db' use this 'mysql_set_charset' this will allow you to add and retrieve data properly in what ever the language it is.
If your text has been encoded and decoded with the wrong encoding and so the mojibake is actually "solidified" into unicode characters, then the solutions mentioned so far won't work. I ended up having success with the ftfy Python package to automatically detect/fix mojibake:
https://github.com/LuminosoInsight/python-ftfy
https://pypi.org/project/ftfy/
https://ftfy.readthedocs.io/en/latest/
>>> import ftfy
>>> print(ftfy.fix_encoding("(ง'⌣')ง"))
(ง'⌣')ง
Hopefully this helps people who are in a similar situation.

Character encoding error, cannot write valid XML from MySQL via PHP

The feed in question is: http://api.inoads.com/snowstorm/feed.xml
Here is the PHP code I am using for the generation:
<?php
$database = 'xxxx';
$dbconnect = mysql_pconnect('xxxx', 'xxxx', 'xxxx');
mysql_select_db($database, $dbconnect);
$query = "SELECT * FROM the_queue WHERE id LIKE '%' ORDER BY id DESC LIMIT 25";
$result = mysql_query($query, $dbconnect);
while ($line = mysql_fetch_assoc($result))
{
$return[] = $line;
}
$now = date("D, d M Y H:i:s T");
$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<rss version=\"2.0\">
<channel>
<title>The Queue</title>
<link>http://readapp.net</link>
<description>A curated reading list.</description>
<language>en-us</language>
<pubDate>$now</pubDate>
<lastBuildDate>$now</lastBuildDate>
";
foreach ($return as $line)
{
$output .= "<item><title>".htmlspecialchars($line['title'])."</title>
<description>".htmlspecialchars($line['description'])."</description>
<link>".htmlspecialchars($line['link'])."</link>
<pubDate>".htmlspecialchars($line['pubDate'])."</pubDate>
</item>";
}
$output .= "</channel></rss>";
$fh = fopen('feed.xml', 'w');
fwrite($fh, $output);
?>
What might be causing the error?
Here's a link from the feed validator: http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fapi.inoads.com%2Fsnowstorm%2Ffeed.xml
You said the XML file is UTF-8, but when I download it and open it in my text editor it auto-detects the windows latin1 encoding, and the quotes display perfectly.
If I force my text editor to use UTF-8, it shows an error message because there are illegal characters for the UTF-8 encoding.
Therefore, your data is not UTF-8, it is latin1. You need to find out exactly where that's happening. It could be any one, or several of:
is the HTML page where the content is typed in by the user set to UTF-8?
If not, the browser will be sending latin1 quotes. To fix this, the first tag in your <head> needs to be:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
</head>
is every browser correctly respecting your UTF-8 setting in that page's HTML?
If you specify UTF-8 and the page contains characters illegal in that encoding, some browsers might decide to use a different encoding despite the <meta> tag. How to check this is different in every browser.
is the MySQL connection when inserting into the database set to use UTF-8?
You need to be using UTF-8 here, or else MySQL may try to convert the encoding for you, often corrupting them. Set the encoding with:
$database = 'xxxx';
$dbconnect = mysql_pconnect('xxxx', 'xxxx', 'xxxx');
mysql_select_db($database, $dbconnect);
mysql_query('SET NAMES utf8', $dbconnect);
is the MySQL table (and individual column) set to use UTF-8?
Again, to avoid MySQL doing it's own buggy conversion, you need to make sure it's using UTF-8 for the table and also the individual comment. Do a structure dump of the database and check for:
CREATE TABLE `the_queue` (
...
) ... DEFAULT CHARSET=utf8;
And also make sure there isn't something like this on any of the columns:
`description` varchar(255) CHARACTER SET latin1,
is the MySQL connection when reading the database set to use UTF-8?
Your read connection also needs to be utf8. So double check that.
are you doing anything in the PHP that cannot handle UTF-8?
PHP has some functions which cannot be used on utf-8 strings, as it will corrupt them. One of those functions is htmlentities() so make sure you always use htmlspecialchars(). The easiest way to test this is to start commenting out big chunks of your code to see where the encoding is breaking.
There is one problem here:
$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
...
There is a string containing "?>". This is the finalization marker for php. It will give you an error.
You can avoid these problems this way:
$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?".">
...
The point of htmlentities is to replace all characters that have define HTML character entities with those entities. If you really don't want any character entities (as your desired result suggests), don't use htmlentities.
By default, htmlentities uses the latin-1 charset, so it chokes on the smart quotes (indeed, all multibyte characters), which is where you see the question marks. One fix is to use htmlspecialchars to convert a much more limited set of characters (&, <, >, ' and "). This will still convert the double quotes because, well, that's the point of htmlspecialchars, unless you specify the ENT_NOQUOTES as the second argument. Another fix is to specify the character set as the third argument (this isn't exclusive of using htmlspecialchars).
The fourth argument to either specifies whether or not to encode already encoded characters. Whether or not do double-encode depends on the source data.
$line['description'] = '"Dave, stop. Stop, will you? Stop, Dave. Will you stop, Dave?” ... “Dave, my mind is going,” HAL says, forlornly. “I can feel it. I can feel it.”';
echo "<description>" . htmlspecialchars($line['description'], ENT_NOQUOTES, 'UTF-8', false) . "</description>";
See also:
RSS 2.0 Best Practice Tip: Entity-encoded HTML in Descriptions
Problem is that you are holding this string with quotes in database (as I assume). If it is true, PHP is removing quotes (which is proper), because of not causing bugs (SQL injection ex). So you have to remove quotes in DB and while generating XML file just add them. It is the simplest in my opinion. And try avoid double quotes ". You should use single ones '. In double PHP parser additionally checks what is in. So try to remove qoutes from DB and add them while generating XML. Should help.
Another error that you have it´s the format of the date. The date must be in format RFC-822, it must be in a format like this "Wed, 02 Oct 2002 08:00:00 EST", not "July/August 2008".

Another charset problem with php and MySQL

I'm having a problem with some characters like 'í' or 'ñ' working in a web project with PHP and MySQL.
The database table is in UTF-8 charset and the web page is ISO-8859-1 (latin-1). at first look everything is handled ok, but a problem is coming when I use the JSON_ENCODE function of PHP.
When I get a query result, let's say, this row:
| ID | VALUE |
--------------------
| 1 | Línea |
I got the following (correct) array in PHP:
Array("ID"=>"1","VALUE"=>"Línea");
So far, so good. But, when i apply the JSON_ENCODE
$result = json_encode($result);
//$result is {"id":"1","value":"L"}
Then i tried some coding/decoding but i couldn't get the right result.
First I tried to decode the UTF-8 chars like follow:
$result['value'] = utf8_decode($result['value']);
//and I get $result['value'] is "L?a"
Then I tried with mb functions:
$result['value'] = mb_convert_encoding($result['value'],"ISO-8859-1","UTF-8");
//and I get that $result['value'] is "Lnea"
I don't really know why is the Json_encode breaking my string and i can't figure out what else to try. I will appreciate any help :)
Thanks!
The documentation for json_encode states that the function will only work on UTF-8 data. If it's not working for you, it means that your data is not UTF-8.
To understand what's going wrong, you need to know what your connection character set is. Is it UTF-8? Something else? Use SET NAMES utf-8 and see if it makes any difference.
Assuming the connection character set is indeed UTF-8, json_encode should work just fine. Then, you still have the final issue of converting the encoded data to ISO-8859-1. For example:
// assume any strings in $result are UTF-8 encoded
$json = json_encode($result);
$output = mb_convert_encoding($json, 'ISO-8859-1', 'UTF-8');
echo $output;
If it still doesn't work, it means that your UTF-8 strings contain characters not available in the ISO-8859-1 character set. There's nothing you can do about that.
Update:
When debugging complex character set conversions like this, you can use file_put_contents to write intermediate results to a file which you can inspect with a hex editor. This will help confirm that the output of a particular step of the process is correct or not.

Categories