I have multi-language characters that are being inserted into a varchar column of a MySQL MyISAM table.
One particular character (and I'm sure there are others) is failing to be inserted:
�
What is this character, and how can I convert it without affecting the entire word? The server-side code that can be used to manipulate these words is PHP.
Examples of actual words including this symbol are as follows:
portugu�s
espa�a
etc
What is the best way to insert these words correctly?
Getting character set issues right can be a bit of a mission. Have a look at this blog post I wrote a couple of weeks ago, which should cover this. Post back here if you're still having problems.
How to Avoid Character Encoding Problems in PHP
How to change the character set of a mysql database after setting it to the wrong character set.
In fact, I'm having some problems printing special characters pulled from a MySQL database.
The programming language is PHP.
There are several things you could be doing wrong. Please study the 'best practices' in here.
If that does not suffice, perhaps it can help with "problems printing".
If still not sufficient, show us the connection, set_charset, SHOW CREATE TABLE, sample printout, and sample of text that is giving trouble.
I have been poring over Stack Overflow all night looking for a way to solve my issues, but I absolutely cannot get the browser to display my Unicode characters correctly when pulling them from my database. In particular, I am trying to use the "combining macron" character (U+0304), added after a character to put a macron over it. I want the user to have the option to turn macrons on and off, and having one character to look for and ignore seems easier than converting between individual macroned letters and their non-macroned counterparts (e.g. Ā -> A).
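The "one character to look for and ignore" idea could be sketched roughly as below. This is only an illustration, assuming the text is UTF-8 and the macrons are stored as separate combining characters; the constant and the helper name `stripMacrons` are hypothetical. A precomposed letter such as Ā (U+0100) would not be touched by it.

```php
<?php
// U+0304 COMBINING MACRON, written with a PHP 7+ Unicode escape (UTF-8 bytes CC 84).
const COMBINING_MACRON = "\u{0304}";

// Hypothetical helper: returns $text with every combining macron removed.
function stripMacrons(string $text): string
{
    return str_replace(COMBINING_MACRON, '', $text);
}

$word  = "word" . COMBINING_MACRON; // "word" followed by a combining macron
$plain = stripMacrons($word);       // the same word without the macron
```

Because only one code point is removed, the same stored value can serve both display modes.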
It would be trivial to use the HTML entity (&#772;) to accomplish this, but if I were to use the MySQL database for something other than making a webpage I want it to be easily transferable. I have tested with the HTML entity and I can get it to successfully add a macron to the previous character.
However, when using the Unicode character in my MySQL table, I absolutely cannot get it to print anything other than question marks (?) in the browser. In the table itself, the entry is a VARCHAR(64) and looks like 'word¯' with the macron appearing afterwards, but I assume that's just a limitation of the cmd environment that it doesn't put the macron over the d. The column Collation is latin1_swedish_ci, if that makes a difference. Here is what I have tried to get the entry to print correctly:
Changing my php.ini to have a default charset of utf-8
Making the top of my php file read:
<?php
header('Content-Type: text/html; charset=utf-8');
?>
And setting the first parameter of my database PDO as 'mysql:dbname=NAME;host=localhost;charset=utf8'
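For reference, the DSN from the last item might be assembled as in the sketch below. The helper name `buildMysqlDsn` is hypothetical, and the actual connection is left commented out because it needs real credentials. Note that the `charset` part of the DSN is only honoured from PHP 5.3.6 onward.

```php
<?php
// Hypothetical helper that assembles the DSN string described above.
function buildMysqlDsn(string $dbName, string $host, string $charset = 'utf8'): string
{
    return sprintf('mysql:dbname=%s;host=%s;charset=%s', $dbName, $host, $charset);
}

$dsn = buildMysqlDsn('NAME', 'localhost');

// Connecting requires real credentials, so this line is illustrative only:
// $pdo = new PDO($dsn, $user, $password, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
```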
When I simply make the php file echo the character I want, it prints to the page correctly. So I'm thinking the problem isn't with the encoding? Or maybe the encoding of the database and the server aren't the same and that is creating the ??
EDIT:
I can get it to display correctly if I insert the value from phpMyAdmin, but not when I enter it through the cmd. In both cases I am pasting the same word with an ending character of 'U+0304'. Is there a reason that it works with phpMyAdmin and not through a direct query, and what can I do so it works with both?
I'd really appreciate some help with this. I've wasted days on this problem and none of the suggestions I have found online seem to give me a fix.
I have a CSV file from a supplier. It appears to have been exported from a Microsoft system.
I'm using PHP to import the data into MySQL (both latest versions).
I have one particular record which contains a strange character that I can't get rid of. Manual editing to remove the character is possible, but I would prefer an automated solution as this will happen multiple times a day.
The character appears to be an interpretation of a “smart quote”. A hex editor tells me that the character codes are C2 and 92. In the hex editor it looks like a weird A followed by a smart quote. In other editors and Calc, Writer etc it just appears as a box. メ
I'm using mb_detect_encoding to determine the encoding. All records in the CSV file are returned as ASCII, except the one with the strange character, which is returned as UTF-8.
I can insert the offending record into MySQL and it just appears in Workbench as a square.
MySQL tables are configured to utf-8 – utf8_unicode_ci and other unusual UTF characters (eg fractions) are ok.
I've tried lots of solutions to this...
How to detect malformed utf-8 string in PHP?
Remove non-utf8 characters from string
Removing invalid/incomplete multibyte characters
How to replace Microsoft-encoded quotes in PHP
etc etc but none of them have worked for me.
All I really want to do is remove or replace the offending character, ideally with a search and replace for the hex values but none of the examples I have tried have worked.
Can anyone help me move forward with this one please?
EDIT:
Can't post answer as not enough reputation:
Thanks for your input. Much appreciated.
I'm just going to go with the hex search and replace:
$DodgyText = preg_replace("/\xEF\xBE\x92/", "", $DodgyText);
I know it's not the elegant solution, but I need a quick fix and this works for me.
Another solution is:
$contents = iconv('UTF-8', 'Windows-1251//IGNORE', $contents);
$contents = iconv('Windows-1251', 'UTF-8//IGNORE', $contents);
Here you can replace Windows-1251 with your local encoding.
At a quick glance, this looks like a UTF-8 file. (UTF-8 is identical to ASCII for the first 128 characters, hence everything is detected as ASCII except for the record with the special character.)
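The ASCII/UTF-8 overlap can be checked with a small sketch like the following. The helper names `isAscii` and `isValidUtf8` are hypothetical, and the sample string simply embeds the C2 92 byte pair reported in the question. It relies only on PCRE, which rejects malformed UTF-8 subjects when the `u` modifier is used.

```php
<?php
// True when every byte is below 0x80, i.e. the string is plain ASCII.
function isAscii(string $s): bool
{
    return preg_match('/^[\x00-\x7F]*$/D', $s) === 1;
}

// True when the bytes are well-formed UTF-8; with the /u modifier
// preg_match() fails (returns false) on malformed input.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$plain   = 'plain text';   // ASCII, and therefore also valid UTF-8
$special = "it\xC2\x92s";  // contains the C2 92 pair: valid UTF-8, not ASCII
```

A record that passes `isAscii()` is by definition also valid UTF-8, which is why only the one record with the special character is reported differently.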
It should work if your database connection is also UTF-8 encoded (which it may not be by default).
How to do that depends on your database library, let us know which one you're using if you need help setting the connection encoding.
updated code based on established findings
You can do search & replace on strings using hexadecimal notation:
str_replace("\xEF\xBE\x92", '', $value);
This returns the value with the offending bytes removed.
That said, if your database table is UTF-8, you shouldn't need that conversion; instead you could look at the connection (or session) character set (i.e. SET NAMES utf8;). Configuring this depends on what library you use to connect to your database.
To debug the value you could use bin2hex(); this usually helps in doing searches online.
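As a rough illustration of that debugging step, the sketch below dumps the bytes of a value and then removes the sequence found. The sample value is hypothetical; only the EF BE 92 sequence comes from the thread.

```php
<?php
// Hypothetical sample: "caf" followed by the EF BE 92 byte sequence.
$value = "caf\xEF\xBE\x92";

// Hex dump of the raw bytes, split into pairs for readability.
$hex = implode(' ', str_split(bin2hex($value), 2));

// Once the byte sequence is known, it can be removed by its hex notation.
$clean = str_replace("\xEF\xBE\x92", '', $value);
```

Searching online for the hex pairs from the dump usually identifies the character quickly.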
When I display contents from the database, I get this:
��Some will have a job. Others will want one. They are my people, they are my clients and they are being denied their rights.
This text had been entered by the user via textarea with tinyMCE. How can I replace special characters (using preg_replace()) from the sentence to ' ' except for the characters: <>?
This article is totally worth a read. Dealing with UTF-8 characters is something that we all go through at some point. The trick seems to be to catch them before they go into the database or to fix the database so that when they're going in they aren't broken. Once they're in there though it's slightly more difficult.
As Chuck mentioned above, it is a database problem. If you only wish to display non-Unicode (i.e. Latin) characters, then yes, preg_replace is the way to go. You will need to know the character sets well enough to filter out what you don't want.
But if you just want everything to display nicely, i.e. no garbage characters, then change the corresponding parts of the database to accept UTF-8.
For example, if you are using MySQL, try changing the field and table encoding so that they can accept UTF-8. The default is typically latin1_swedish_ci (in MySQL versions before 8.0); try changing it to utf8_general_ci. Hope that explains my point.
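If the preg_replace route mentioned above is taken, a minimal sketch might look like this. It assumes that "special characters" means anything outside printable ASCII, and the helper name `asciiOnly` is hypothetical. Since "<" and ">" are printable ASCII, they are left alone.

```php
<?php
// Hypothetical helper: replaces each run of bytes outside printable ASCII
// (tab, CR and LF are also kept) with a single space.
function asciiOnly(string $text): string
{
    return preg_replace('/[^\x20-\x7E\t\r\n]+/', ' ', $text);
}

// Two U+FFFD replacement characters (EF BF BD each) in front of some markup.
$raw   = "\xEF\xBF\xBD\xEF\xBF\xBD<p>Some will have a job.</p>";
$clean = asciiOnly($raw);
```

This also discards legitimate accented letters, so it only suits content that is meant to be plain ASCII.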
I am running into a very strange issue with a site that I am working on. The site is basically a job board where the owner or users can create job listings, including a description that ends up being stored in a MySQL text field. Whenever listings from certain sources are entered, they initially end up with the "black diamond with a question mark inside" character in place of apostrophes and double spaces. This part I know is an encoding issue and can correct. The real question is this: the black diamonds show when the record is displayed in a MySQL admin tool and when the job listing is viewed in a web browser (a simple SELECT statement displays the listing in a PHP app), but after the first time it is viewed, the problem somehow fixes itself. It is as if running the SELECT and then displaying the record updates the job description field and fixes the encoding issues. How could this be? Has anyone ever heard of this or anything similar? I cannot understand how a database field would change without running an UPDATE statement...
How are the job listings entered? Are they entered via a web page? If so, what character encoding does the web page use? (This should determine the character encoding of the submitted data AFAIK.) What character set does the connection used to communicate with MySQL use? What is the character set of the column the data is stored in? Finally, what is the character encoding of the web page(s) on which the entered data is reviewed?
Here is what I do: I declare all of my pages as UTF-8 encoded, using the following tag at the start of the <head> section:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I issue the following command immediately when I connect to MySQL, so as to make sure that MySQL understands the data I send to it will be UTF-8 encoded:
SET NAMES utf8
(Depending on the database abstraction method you use, a special function might be recommended in order to set the connection character set, like mysqli's mysqli_set_charset().)
I also make sure that those columns in which I intend to store UTF-8 data are declared to be UTF-8. You can find out what the character set of a column is by issuing SHOW CREATE TABLE table_name. The character set of the table (which by default is the character set for any column in the table) is displayed at the end. If the character set for the column is different to the default character set for the table then it is displayed as part of the column definition. If you wish to change the character set of a column then you can do so using ALTER TABLE.
If you have not previously taken the steps to handle character sets in your app then you may find that the tables are all using the latin1 character set. If you naively store UTF-8-encoded data (for example) into these columns, you may run into character encoding issues. Changing the column character set using ALTER TABLE does not necessarily fix your old data, because MySQL reads your old data assuming it to be valid latin1-encoded text and converts it to the equivalent UTF-8 (correctly converting what it has read, but not giving the result you want).
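The effect described above can be reproduced outside the database with a short sketch. Here ISO-8859-1 stands in for MySQL's latin1 (an approximation, since the two are not exactly the same), and the single "é" is a hypothetical sample.

```php
<?php
// "é" encoded as UTF-8: two bytes, C3 A9.
$utf8 = "\xC3\xA9";

// Reading those bytes as ISO-8859-1 and converting to UTF-8 treats each
// byte as its own character: the step that mangles the old data.
$mangled = iconv('ISO-8859-1', 'UTF-8', $utf8);

// Converting back the other way recovers the original bytes.
$repaired = iconv('UTF-8', 'ISO-8859-1', $mangled);
```

The conversion itself is correct for what it was told. The input was simply mislabelled, which is why the result displays as two characters instead of one.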
The above steps would hopefully mean that future data will be correctly encoded and correctly displayed, but you may have data already mis-encoded in your database, so be aware that if you follow the above steps and still see older data displaying incorrectly, this may be why. Good luck.
I ran into this problem a few years ago. I remember finding those notorious characters and replacing them in PHP with a single quote or a double quote, with escaping of course. A simple preg_replace for those characters will do the trick. It's just an encoding issue.
This page, though geared towards WordPress, might help:
http://codex.wordpress.org/Converting_Database_Character_Sets
I had the same issue (MySQL encoding and webpage encoding set to UTF-8, but black diamonds showing up in my query results). I found this snippet while googling but cannot for the life of me find its source to give proper attribution:
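// Note: the mysql_* extension was removed in PHP 7; mysqli_set_charset() is the mysqli counterpart.
if (function_exists('mysql_set_charset')) {
    // Preferred: set the connection character set through the API.
    mysql_set_charset('utf8', $db_connection);
} else {
    // Fallback for old PHP versions without mysql_set_charset().
    mysql_query("SET NAMES 'utf8'", $db_connection);
}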
Anyway, it cleared up the issue for me.