List of 'messed up characters' in utf8

List of 'messed up characters' in utf8 - php

one of my clients has a website which has been totally messed up by the hosting companie forcing a characterset on the complete database. We've had troubles before with character sets but now it's just straight forward a drama!
So far I've added the charset=utf-8 to the page content type and set the charset for the mysql connection to utf8. And now it's time to replace all characters. So far what I've found is:
Ã¶ = ö
Ã« = ë
Ã© = é
The data inside the database is being updated like so:
UPDATE table SET `fieldname` = REPLACE(`fieldname`, 'Ã¶', 'ö');
Now I just need to find a complete list of alle characters that are messed up. I tried a MySQL query searching for field LIKE '%Ã%' but this returns me all records inside the database.
Google also just displays a couple of characters (mostly the 3 above) in some topics of other people that have had troubles, however it seems there's nowhere a complete list of these characters (or at least the most common) which I can use to find and replace all data for my client.
If anyone perhaps knows such location or is able to complete my list I will, in return, create a page containing these characters to help others (unless there's a list already which I'm not aware of somewhere ofcourse).
// EDIT:
it would be for the most common european characters such as é è ë, á à ä, ö ó ò, ï, ü and perhaps the ringel-S (German double S). Not so much for the spaning signs like ñ or ã, but if they are in a list somewhere that would be much appreciated aswel.
// EDIT 2:
I updated the MySQL database and tables using the 2 ALTER queries from the 1st part of this article: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet. I DID NOT make use of the mb_ functions so far and didn't do any MB configuration as it seems.
The headers are all set to utf-8 in the files (I still have to check the headers for some ajax scripts tho, not sure if that's needed but it won't be harmfull doing so). And the files are all saved as UTF8 without BOM. Also PHPFreakMailer is updated by setting the charset to utf-8.
Bad enough, I'm still having these weird characters. I wasn't thinking they'd go away by theirself, but at least it was worth hoping so :-) So what's the final step I should take? Continuïng using the REPLACE query and changing all wierd characters manually?
Thanks in advance!

This is a bit crazy; what character set do you think "Ã¶" is in?
It looks like that's actually a correct UTF-8 sequence (since it's two bytes), you're just displaying it as ISO-8559-1.
Edit:
Based on your comment I think the following is going on:
I think (but really not 100% sure) that the correct UTF-8 binary sequence is stored in the database. But since the table is marked as ISO-8559-1, and you requested to automatically convert character set. So it thinks it's ISO-8559-1 (which looks like Ã¶), but then tries to convert that to UTF-8.
You should be able to verify this, if strlen('Ã¶') is 4, and not 2. If the length is indeed 2, your browser encoding somehow screws up.
To fix this, don't set the MySQL to encode the characters.
Option 2
The data could also be 'double encoded' in the table. To check this, simply also check the string length on the database. If the 'Ã¶' is 4 bytes long, this is the issue.
My advice in this case is to not try to make a big 'messed up character'-map. You should simply be able to 'utf8_decode' the string. Normally this function will output a ISO-8559-1 string, but in your case.. it should turn out to be the original valid UTF-8 string.
I hope this works!
Edit2
Ok so effectively what I believe has happened is Option 2. To put it in simple (php) terms:
$output = utf8_encode(utf8_encode('string'));
So one utf8_decode() should be enough.
Do test this before you run your migration scripts though :)

If they forced a character change, why is your database not converted? Are your tables still the old character set (see your phpMyAdmin on table information).
Is the data wrong if it shows up in your phpMyAdmin or only on your webpage? -> your names and collation should change, as well as headers and filetype (safe file as utf-8).
Or try:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
I would start replacing characters only if there are no options from within MySQL left.

Since you've tagged this question with "php", I assume you read the database and it's values with PHP? If so, please have a look at mb_convert_encoding if you no longer have control over the database.
The better solution would be to fix the inconsistency between the data and the tables characterset. Backup the database (just in case), and alter all tables and columns to UTF-8. Note: when using MySQL, it is not enough to alter the table's charset, you'll have to do this per column.

Why don't you use: ä = ä, ö = ö,...
Do htmlentities(); in php and it will convert all special characters into Entitys. I think this would be the easiest way to do it.

Related

Can't get my unicode from MySQL to print to browser correctly with PHP

I have been poring over stackoverflow all night looking for a way to solve my issues, but I absolutely cannot get the browser to display my Unicode characters correctly when pulling them from my database. In particular, I am trying to use the "combining macron" character (U+0304), added after a character to put a macron over it. I want the user to have the option to turn them on and off, and having one character to look for and ignore seems easier to accomplish this than instead of making conversions between individual macroned letters and their non-macroned counterpart (e.g. Ā -> A).
It would be trivial to use the HTML entity (& #772;) to accomplish this, but if I were to use the MySQL database for something other than making a webpage I want it to be easily transferable. I have tested with the HTML entity and I can get it to successfully add a macron to the previous character.
However, when using the Unicode character in my MySQL table, I absolutely cannot get it to print anything other than question marks (?) in the browser. In the table itself, the entry is a VARCHAR(64) and looks like 'word¯' with the macron appearing afterwards, but I assume that's just a limitation of the cmd environment that it doesn't put the macron over the d. The column Collation is latin1_swedish_ci, if that makes a difference. Here is what I have tried to get the entry to print correctly:
Changing my php.ini to have a default charset of utf-8
Making the top of my php file read:
<?php
header('Content-Type: text/html; charset=utf-8');
?>
And setting the first parameter of my database PDO as mysql:dbname=NAME;host=localhost;charset=utf8'
When I simply make the php file echo the character I want, it prints to the page correctly. So I'm thinking the problem isn't with the encoding? Or maybe the encoding of the database and the server aren't the same and that is creating the ??
EDIT:
I can get it to correctly display if I insert the value from PhPMyAdmin, but not when I enter it through the cmd. In both cases I am pasting the same word with an ending character of 'U+0304'. Is there a reason that it works with PHPMyAdmin and not through a direct query, and what can I do so it works with both?

Special characters in mySQL (and php) - THE BASICS

I am confused! Recently my webhotel updated php and now my old tables render special characters differently (wrongly).
Both my tables and my input/output-php-pages are set to utf-8 and since this update, also the inputs from php are treated differently; now my special characters are being utf-8-encoded as they enter the database. So since this change, when I review tables within phpMyAdmin, the old inserts have the original (non-encoded) special characters - the new posts have utf-8-encoded charcters (also special).
So what I would like to do is rewrite input and output to insert and show non-encoded characters - but I am not sure if this is possible without skipping utf-8 entirely (in php and mySQL). But is there an utf-8- way to submit non-encoded characters?
AND - perhaps more fundamentally - I need to understand what the possible downsides are. I am using Danish characters in and out and I'm not going to use any other language (for this project). So if it IS possible to insert and output non-encoded characters using utf-8 - am I then going to have unexpected/destructive issues?
I have read a lot of posts regarding php/mySQL/special characters but I haven't seen this angle on the issue yet. Hope I am not duplicating
I hope not because it has been working very nicely until the update.

Even if you are using only Danish characters, you may as well go utf8 all the way.
There are many places where the encoding needs to be stated:
The at the top of the html
The columns in the database (column CHARACTER SET defaults from table, which defaults from database)
The encoding in your PHP code.
When you CREATE TABLE, tack on DEFAULT CHARACTER SET utf8. If you have existing tables, without that, speak up; we may need to deal with them.
If you want Danish collation, the specify COLLATION utf8_danish_ci, too. Then (if I recall correctly), aa will sort after z.
(The default is utf8_general_ci, which won't do that sorting.)
Figure out what encoding you have (or can get) in your php code. If you have some text with accents in it, do this:
$hex = unpack('H*', $text);
echo implode('', $hex)
If you have utf8, å will be C3A5, for latin1 it will be E5.
Regardless of what encoding in in the tables, you must call set_charset('utf8') or set_charset('latin1') depending on what encoding is in the data in PHP. MySQL will gladly transcode between latin1 and utf8 as things are passed between PHP and MySQL. For different APIs:
⚈ mysql: mysql_set_charset('utf8');
⚈ mysqli: $mysqli_obj->set_charset('utf8');
⚈ PDO: $db = new PDO('dblib:host=host;dbname=db;charset=UTF-8', $user, $pwd);
For much more info, see http://mysql.rjweb.org/doc.php/charcoll .

PHP - Can't remove strange character

I'd really appreciate some help with this. I've wasted days on this problem and none of the suggestions I have found online seem to give me a fix.
I have a CSV file from a supplier. It appears to have been exported from an Microsoft system.
I'm using PHP to import the data into MySQL (both latest versions).
I have one particular record which contains a strange character that I can't get rid of. Manual editing to remove the character is possible, but I would prefer an automated solution as this will happen multiple times a day.
The character appears to be an interpretation of a “smart quote”. A hex editor tells me that the character codes are C2 and 92. In the hex editor it looks like a weird A followed by a smart quote. In other editors and Calc, Writer etc it just appears as a box. ﾒ
I'm using mb_detect_encoding to determine the encoding. All records in the CSV file are returned as ASCII, except the one with the strange character, which is returned as UTF-8.
I can insert the offending record into MySQL and it just appears in Workbench as a square.
MySQL tables are configured to utf-8 – utf8_unicode_ci and other unusual UTF characters (eg fractions) are ok.
I've tried lots of solutions to this...
How to detect malformed utf-8 string in PHP?
Remove non-utf8 characters from string
Removing invalid/incomplete multibyte characters
How to detect malformed utf-8 string in PHP?
How to replace Microsoft-encoded quotes in PHP
etc etc but none of them have worked for me.
All I really want to do is remove or replace the offending character, ideally with a search and replace for the hex values but none of the examples I have tried have worked.
Can anyone help me move forward with this one please?
EDIT:
Can't post answer as not enough reputation:
Thanks for your input. Much appreciated.
I'm just going to go with the hex search and replace:
$DodgyText = preg_replace("/\xEF\xBE\x92/", "" ,$DodgyText);
I know it's not the elegant solution, but I need a quick fix and this works for me.

Another solution is:
$contents = iconv('UTF-8', 'Windows-1251//IGNORE',$contents);
$contents = iconv('Windows-1251', 'UTF-8//IGNORE',$contents);
Where you can replace Windows-1251 to your local encoding.

At a quick glance, this looks like a UTF-8 file. (UTF-8 is identical with the first 128 characters in the ASCII table, hence everything is detected as ASCII except for the special character.)
It should work if your database connection is also UTF-8 encoded (which it may not be by default).
How to do that depends on your database library, let us know which one you're using if you need help setting the connection encoding.

updated code based on established findings
You can do search & replace on strings using hexadecimal notation:
str_replace("\xEF\xBE\x92", '', $value);
This would return the value with the special code removed
That said, if your database table is UTF-8, you shouldn't need that conversion; instead you could look at the connection (or session) character set (i.e. SET NAMES utf8;). Configuring this depends on what library you use to connect to your database.
To debug the value you could use bin2hex(); this usually helps in doing searches online.

PHP MySQL Chinese UTF-8 Issue

I have a MySQL table & fields that are all set to UTF-8. The thing is, a previous PHP script, which was in charge of the database writing, was using some other encoding, not sure whether it is in the script itself, the MySQL connection or somewhere else. The result is that although the table & fields are set to UTF-8, we see the wrong chars instead of Chinese.
It looks like that:
Now, the previous scripts (which were in charge of the writing and corrupted the data) can read it well for some reason, but my new script which all encoded in UTF-8, shows chars like ½©. How can that be fixed?

By the sound of it, you have a utf8 column but you are writing to it and reading from it using a latin1 connection, so what is actually being stored in the table is mis-encoded. Your problem is that when you read from the table using a utf8 connection, you see the data that's actually stored there, which is why it looks wrong. You can fix the mis-encoded data in the table by converting to latin1, then back to utf8 via the binary character set (three steps in total).

The original database was in a Chinese encoding – GB-18030 or similar, not Latin-1 – and the bytes that make up these characters, when displayed in UTF-8, show up as a bunch of Latin diacritics. Read each string as GB-18030, convert it to UTF-8, and save.

Special Characters Problem

When I display contents from the database, I get this:
��Some will have a job. Others will want one. They are my people, they are my clients and they are being denied their rights.
This text had been entered by the user via textarea with tinyMCE. How can I replace special characters (using preg_replace()) from the sentence to ' ' except for the characters: <>?

This article is totally worth a read. Dealing with UTF-8 characters is something that we all go through at some point. The trick seems to be to catch them before they go into the database or to fix the database so that when they're going in they aren't broken. Once they're in there though it's slightly more difficult.

As Chuck mentioned above, it is the database problem. Unless you only wish to display non-Unicode, ie Latin characters, then yes, preg_replace is the way to go. You will need to know the character sets well enough to filter out what you don't want.
But if you just want everything to display nicely, ie no garbage characters, then change the corresponding parts of the db to accept utf-8.
e.g. If you are using mySQL, try changing the field and table encoding to be able to accept UTF-8. The default is latin1_general_ci - try changing it to utf8_general_ci. Hope that explains my point.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.