I have a feed from which I pull data into a database. It provides the data in XML format. However, the data includes "illegal" characters. For example:
A GREAT NEIGHBOURHOOD – WITH A
or
large “country style†eat-in
or
Garage 14’x32’, large
or
OR…….ENDLESS POSSIBILITIES!!
My question is first, how do I identify the encoding of these characters, and second, how do I change the encoding to match the UTF8 format expected by my database?
EDIT: To be clear, there's no database involved in this process (at this point in the process, anyway). The data will be inserted into the DB later, but at the moment I'm just reading the data via a PHP script and printing it on screen using var_dump.
EDIT 2: the data is being pulled from a RETS feed using the PHP PHRETS library
The problem is that your UTF-8 response is treated differently somewhere along the chain, or the database is not set up correctly. Here are some examples of where this could happen and how to fix it.
Before Using Curl
header("Content-Type: text/html; charset=utf-8");
MySQL (my.cnf)
[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
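To check whether these my.cnf settings actually took effect, a minimal sketch (the credentials are hypothetical) that asks the server which character sets the current connection ended up with:
<?php
// Hedged sketch: adjust host, user, password and database to your setup.
$mysqli = new mysqli('localhost', 'user', 'password', 'test');
$mysqli->set_charset('utf8');

// List the character_set_* variables the server reports for this
// connection; they should all say utf8 (and utf8_unicode_ci collation).
$result = $mysqli->query("SHOW VARIABLES LIKE 'character_set_%'");
while ($row = $result->fetch_assoc()) {
    echo $row['Variable_name'] . ' = ' . $row['Value'] . PHP_EOL;
}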
When Creating The Database Manually
CREATE DATABASE `your_database_name` DEFAULT CHARACTER SET utf8 COLLATE utf8_polish_ci;
When Using Frameworks such as Doctrine
$conn = array(
    'driver'        => 'pdo_mysql',
    'dbname'        => 'test',
    'user'          => 'root',
    'password'      => '*****',
    'charset'       => 'utf8',
    'driverOptions' => array(1002 => 'SET NAMES utf8')
);
It seems that at some point the XML data, which is already UTF-8, is treated as ISO-8859-1 and converted to UTF-8 again (double encoding). Depending on how you generate the feed, this could happen at several points.
The most likely point is the encoding for the database connection. Make sure it is UTF-8.
Another possibility is the content type header you send.
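For example, a minimal sketch covering both points, assuming PDO and hypothetical connection details:
<?php
// Hedged sketch: force the HTTP response and the DB connection
// to agree on UTF-8 (credentials and database name are placeholders).
header('Content-Type: text/html; charset=utf-8');

$pdo = new PDO(
    'mysql:host=localhost;dbname=feed_import;charset=utf8', // charset in the DSN
    'user',
    'password',
    array(PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8')  // belt and braces
);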
Please add your database encoding type so we can answer better.
In order to detect the encoding of a string you can use mb_detect_encoding() as follows:
echo mb_detect_encoding("your-string");
You can also use mb_convert_encoding() to convert a string from one encoding to another:
$str = mb_convert_encoding($str, $to_encoding, $from_encoding);
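Putting the two together, a minimal sketch using one of the garbled values from the question, and assuming the bytes were originally mis-read as Windows-1252 (a common culprit for this kind of mojibake):
<?php
// Hedged sketch: one of the garbled values from the question.
$broken = 'large “country style†eat-in';

// The mojibake itself is valid UTF-8, which is why strict detection
// still reports UTF-8 here.
var_dump(mb_detect_encoding($broken, 'UTF-8', true)); // string(5) "UTF-8"

// If the original UTF-8 bytes were mis-read as Windows-1252 and then
// re-encoded to UTF-8, converting back to Windows-1252 restores the
// original UTF-8 byte sequence (undoes one layer of double encoding).
$fixed = mb_convert_encoding($broken, 'Windows-1252', 'UTF-8');
var_dump($fixed);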
Related
I'm working on a project using Doctrine 2.4.3 with a MySQL 5.7.21 database with utf8 as default charset.
Recently, I've been looking to implement emoji support. To overcome MySQL's limitation of 3 bytes for utf8, I need to change the columns that can receive emojis to the utf8mb4 charset (see https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html).
However, I have not found a way to reflect this in my entities (using annotations).
My database connection config is the following:
$data = array(
    'driver'   => 'pdo_mysql',
    'host'     => $dbhost,
    'port'     => $dbport,
    'dbname'   => $dbname,
    'user'     => $dbuser,
    'password' => $dbpw,
    'charset'  => 'utf8mb4'
);
I tried adding annotations to the table:
/**
 * @Entity(repositoryClass="path\to\DAO")
 * @Table(name="post", indexes={@Index(name="uid", columns={"uid"})}, options={"charset":"utf8mb4", "collation":"utf8mb4_unicode_ci"})
 * @HasLifecycleCallbacks
 */
class Post extends BaseEntity
{
...
}
In the same fashion, I tried adding annotations to the column (in the same table) itself:
/** @Column(type="text", options={"charset":"utf8mb4", "collation":"utf8mb4_unicode_ci"}) */
protected $text;
None of the above worked. I expected an ALTER TABLE query when executing doctrine orm:schema-tool:update --dump-sql, but Doctrine sees no change, and I still can't insert 4-byte emojis.
If I update the column's charset myself directly in MySQL, emojis do get supported, but when I then run orm:schema-tool:update, Doctrine sees a difference between my entity and the schema and does not seem to know what to make of it, since the output I get is the following:
ALTER TABLE post CHANGE text text LONGTEXT NOT NULL ;
I also tried to add SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci as driverOptions in my database connection config array, alas to no result either.
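For reference, a sketch of roughly what that driverOptions attempt looks like (reusing the connection array from above; the constant is the standard PDO one):
$data = array(
    'driver'        => 'pdo_mysql',
    'host'          => $dbhost,
    'dbname'        => $dbname,
    'user'          => $dbuser,
    'password'      => $dbpw,
    'charset'       => 'utf8mb4',
    // PDO::MYSQL_ATTR_INIT_COMMAND runs this statement on every connect.
    'driverOptions' => array(
        PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci'
    )
);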
Unfortunately, I could not find anything regarding this matter in Doctrine's documentation.
If any of you has any clue regarding this matter, feel free to hit me up! Thanks in advance.
To convert the whole table:
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;
Please provide
SHOW CREATE TABLE ...
For more troubleshooting: Trouble with UTF-8 characters; what I see is not what I stored
As I have legacy requirements and cannot update Doctrine's lib as of right now, I had to find a workaround.
What I did was manually convert my tables to utf8mb4 with SQL queries, which is not overwritten by Doctrine back to utf8 when executing orm:schema-tool:update --force after the charset conversion.
For the record, I generated the update statements with the following script :
SELECT CONCAT('ALTER TABLE ', t.table_schema, '.', t.table_name, ' CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;')
FROM information_schema.tables t
WHERE t.table_schema LIKE 'your_schema';
^ Do not execute this blindly - check beforehand whether existing data and index key lengths will still fit once utf8mb4-encoded. For more details, check the very good article by Mathias Bynens on the matter: https://mathiasbynens.be/notes/mysql-utf8mb4#column-index-length
I also changed the database's charset settings.
ALTER DATABASE {database_name} CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
I did keep the 'charset' => 'utf8mb4' in Doctrine's database connection settings array for correct transmission of the data.
For new entities (tables), annotating them with the correct settings in the table options does create them with the right charset and collation:
@Entity @Table(name="table", options={"charset":"utf8mb4", "collate":"utf8mb4_unicode_ci"})
Cheers.
Inserting UTF-8 encoded string into UTF-8 encoded table gives incorrect string value.
PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9D\x84\x8E i...' for column 'body_value' at row 1: INSERT INTO
I have a 𝄎 character in a string that mb_detect_encoding claims is UTF-8 encoded.
I try to insert this string into a MySQL table, which is defined as (among other things) DEFAULT CHARSET=utf8
Edit: Drupal always does SET NAMES utf8 with an optional COLLATE (at least when talking to MySQL).
Edit 2: Some more details that appear to be relevant. I grab some text from a PostgreSQL database. I stick it onto an object, use mb_detect_encoding to verify that it's UTF-8, and persist the object to the database, using node_save. So while there is an HTTP request that triggers the import, the data does not come from the browser.
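A minimal sketch of that verification step (the variable name is made up); note that mb_detect_encoding is only reliable with its strict flag, and mb_check_encoding gives a plain boolean:
<?php
// Hedged sketch: $text stands in for the value pulled from PostgreSQL.
$text = "Some text containing 𝄎";

// Strict mode (third argument) makes the detection trustworthy.
var_dump(mb_detect_encoding($text, 'UTF-8', true)); // "UTF-8" or false

// Simple boolean check that the byte sequence is valid UTF-8.
var_dump(mb_check_encoding($text, 'UTF-8'));        // true or false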
Edit 3: Data is denormalized over two tables:
SELECT character_set_name FROM information_schema.COLUMNS C WHERE table_schema = "[database]" AND table_name IN ("field_data_body", "field_revision_body") AND column_name = "body_value";
+--------------------+
| character_set_name |
+--------------------+
| utf8 |
| utf8 |
+--------------------+
Edit 4: Is it possible that the character is "too new"? I'm more than a little fuzzy on the relationship between Unicode and UTF-8, but this Wikipedia article implies that the character was standardized fairly recently.
I don't understand how that can fail with "Incorrect string value".
𝄎 (U+1D10E) is a Unicode character outside the BMP (Basic Multilingual Plane), i.e. above U+FFFF, and thus cannot be represented in UTF-8 in 3 bytes. MySQL's utf8 charset only accepts UTF-8 characters that can be represented in 3 bytes. If you need to store this character, you'll need the MySQL charset utf8mb4, which requires MySQL 5.5.3 or later. You can use ALTER TABLE to change the character set without much trouble; since it needs more space to store the characters, a couple of issues show up that may require you to reduce string sizes. See http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html .
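If it helps, a minimal sketch (the helper function and sample strings are made up) for spotting values that contain such 4-byte characters before handing them to a 3-byte utf8 column:
<?php
// Hedged sketch: returns true if $str contains any character outside
// the BMP, i.e. one that needs 4 bytes in UTF-8.
function contains_supplementary_chars($str)
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $str) === 1;
}

var_dump(contains_supplementary_chars("plain text"));   // false
var_dump(contains_supplementary_chars("music 𝄎 sign")); // true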
To solve this issue, first change your database field to the utf8mb4 charset. For example:
ALTER TABLE `tb_name` CHANGE `field_name` `field_name` VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL;
Then, in your DB connection, set the driver options to utf8mb4. For example, if you use PDO:
$db = new PDO('mysql:host=localhost;dbname=testdb;charset=utf8mb4', 'username', 'password');
or in Zend Framework 1.2:
$dbParam = array(
    'host'           => 'localhost',
    'username'       => 'db_user_name',
    'password'       => 'password',
    'dbname'         => 'db_name',
    'driver_options' => array(
        '1002' => "SET NAMES 'utf8mb4'", // 1002 is PDO::MYSQL_ATTR_INIT_COMMAND
        '12'   => 0                      // this is not necessary
    )
);
In your PDO connection, set the charset.
new PDO('mysql:host=localhost;dbname=the_db;charset=utf8mb4', $user, $password);
I fixed the error:
SQLSTATE[HY000]: General error: 1366 Incorrect string value ......
with this method:
I use utf8mb4_unicode_ci for the database
Set utf8mb4_unicode_ci for all tables
Set the longblob datatype for the column (not text, longtext, etc.; you need a big binary datatype to store the 4-byte characters of your content)
It is okay now.
If you use Laravel, continue to edit config/database.php:
'charset' => 'utf8mb4',
'collation' => 'utf8mb4_unicode_ci',
If you use the strtolower function, replace it with mb_strtolower (see the sketch below).
Notice: you have to put <meta charset="utf-8"> in your <head> tag
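On the strtolower point, a minimal sketch (the sample string is an assumption) showing why the multibyte variant matters:
<?php
$title = 'ÅÄÖ Test';

// strtolower() works byte by byte and, depending on locale, will simply
// leave multibyte characters alone (or, in some locales, corrupt them).
var_dump(strtolower($title)); // likely "ÅÄÖ test"

// mb_strtolower() lowercases correctly when told the encoding.
var_dump(mb_strtolower($title, 'UTF-8')); // "åäö test"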
My charset in the database is set to utf8_unicode_ci, and all files are encoded in UTF-8 (without BOM).
Here is my PHP code:
<?php
require_once("./includes/config.php");
$article = new Article();
$fields = array(
    'status' => '0',
    'title' => 'מכבי ת"א אלופת אירופה בפעם ה-9',
    'shorttitle' => 'מכבי ת"א אלופת אירופה',
    'priority' => '1',
    'type' => '1',
    'category' => '2',
    'template' => '68',
    'author' => '1',
    'date' => date("Y-m-d H:i"),
    'lastupdate' => date("Y-m-d H:i"),
    'preview' => 'בלה בלה בלה',
    'content' => 'עוד קצת בלה בלה בלה',
    'tags' => 'מכבי ת"א,יורוליג,אליפות אירופה',
    'comments' => '1'
);
$article->set($fields);
$article->save();
For some reason, the Hebrew characters appear like this in phpMyAdmin:
מכבי ת"× ×לופת ×ירופה ×‘×¤×¢× ×”-9
Database connection code:
<?php
final class Database
{
    protected $fields;
    protected $con;

    public function __construct($host = "", $name = "", $username = "", $password = "")
    {
        if ($host == "")
        {
            global $config;
            $this->fields = array(
                'dbhost' => $config['Database']['host'],
                'dbname' => $config['Database']['name'],
                'dbusername' => $config['Database']['username'],
                'dbpassword' => $config['Database']['password']
            );
            $this->con = new mysqli($this->fields['dbhost'], $this->fields['dbusername'], $this->fields['dbpassword'], $this->fields['dbname']);
            if ($this->con->connect_errno > 0)
                die("<b>Database connection error:</b> ".$this->con->connect_error);
        }
        else
        {
            $this->con = new mysqli($host, $username, $password, $name);
            if ($this->con->connect_errno > 0)
                die("<b>Database connection error:</b> ".$this->con->connect_error);
        }
    }
Any ideas why?
You have set the database's and file's character set to UTF-8, but the data transfer between PHP and the database also needs to be set correctly.
You can do this using set_charset:
Sets the default character set to be used when sending data from and to the database server.
Add the following as the last statement of your Database constructor:
$this->con->set_charset("utf8");
This will not fix the issue for the data that is already in the database, but for new data written to the database you should notice the difference.
If you decide to rebuild your database, then please consider using the superior utf8mb4 character set, as described in the MySql docs:
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character and supports supplementary characters:
For a BMP character, utf8 and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.
For a supplementary character, utf8 cannot store the character at all, while utf8mb4 requires four bytes to store it. Since utf8 cannot store the character at all, you do not have any supplementary characters in utf8 columns and you need not worry about converting characters or losing data when upgrading utf8 data from older versions of MySQL.
utf8mb4 is a superset of utf8
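A quick way to see the 3-byte versus 4-byte difference from PHP (PHP 7+ is assumed for the \u escape; the sample characters are arbitrary):
<?php
// "€" is a BMP character: 3 bytes in UTF-8, so MySQL's utf8 can store it.
var_dump(strlen("€"));         // int(3)

// U+1F600 (an emoji) is outside the BMP: 4 bytes in UTF-8, needs utf8mb4.
var_dump(strlen("\u{1F600}")); // int(4)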
It's important that every step of your chain uses the same charset, to avoid issues where characters display incorrectly.
There are a few settings that need to be properly defined, and I'd strongly recommend UTF-8, as it has all the letters you would need (Hebrew) but also supports a wide variety of other scripts (Scandinavian, Greek, Arabic).
Here's a little list of things that have to be set to a specific charset.
Headers
Setting the charset in both HTML and PHP headers to UTF-8
PHP: header('Content-Type: text/html; charset=utf-8');
(PHP headers have to be placed before any kind of output (echo, whitespace, HTML))
HTML: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
(HTML-headers are placed within the <head> / </head> tag)
Connection
You also need to specify the charset in the connection itself (placed directly after creating the connection).
$this->con->set_charset("utf8");
Database and tables
Your database and all its tables have to be set to UTF-8. Note that charset is not exactly the same as collation (see this post).
You can do that by running the queries below once for each database and table (for example in phpMyAdmin):
ALTER DATABASE yourDatabase CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE yourTable CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Other
Some specific functions take a charset or encoding argument (for example htmlspecialchars() and the mb_* string functions); if you are using such functions, the charset should be specified there as well.
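A minimal sketch of what that looks like in practice (the sample string is an assumption):
<?php
$title = 'מכבי ת"א & “smart quotes”';

// Tell the mb_* functions which encoding to assume by default.
mb_internal_encoding('UTF-8');

// Pass the charset explicitly where a function accepts one.
echo htmlspecialchars($title, ENT_QUOTES, 'UTF-8'), PHP_EOL;
echo mb_substr($title, 0, 8, 'UTF-8'), PHP_EOL;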
It may be that you already have values in your database that are not encoded with UTF-8. Updating them manually could be a pain and could consume a lot of time. Should this be the case, you could use something like ForceUTF8 and loop through your databases, updating the fields with that function.
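For the ForceUTF8 route, a rough sketch, assuming the neitanod/forceutf8 package is installed via Composer; the table and column names here are made up:
<?php
require 'vendor/autoload.php';

use ForceUTF8\Encoding;

// Hedged sketch: walk a hypothetical articles table and repair rows
// whose title was stored with a broken encoding.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'password');
$update = $pdo->prepare('UPDATE articles SET title = ? WHERE id = ?');

foreach ($pdo->query('SELECT id, title FROM articles') as $row) {
    $fixed = Encoding::fixUTF8($row['title']); // repairs double-encoded UTF-8
    if ($fixed !== $row['title']) {
        $update->execute(array($fixed, $row['id']));
    }
}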
Should you follow all of the pointers above, chances are your problem will be solved. If not, you can take a look at this StackOverflow post: UTF-8 all the way through.
I have a project in Phalcon PHP and MySQL.
When UTF-8 characters have to be saved, they are stored with errors like these.
For example:
I save : nueva descripción ñññ
in Database: nueva descipción ñññ
I have tried several types of collations in the database, tables and fields.
Thanks for your help.
While having properly defined database elements, you also have to set your connection to use UTF-8 encoding. Since Phalcon makes use of PDO, you can try to modify your connection like this:
$di["db"] = function() {
return new \Phalcon\Db\Adapter\Pdo\Mysql(array(
"host" => "localhost",
"username" => "root",
"password" => "1234",
"dbname" => "test",
"options" => array( // this is your important part
PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8'
)
));
};
Example from Phalcon Forum.
Since I'm working with the Polish language, my DB collations are mostly set to utf8_polish_ci, or sometimes to utf8_unicode_ci. You have to test it out because of result sorting issues.
Check whether your project database uses the utf8_unicode_ci collation.
Also check that each individual table has the utf8_unicode_ci collation.
If that is not OK, check the my.ini MySQL config file of your Apache/MySQL stack.
In it, make sure the UTF-8 settings are not commented out with a hash (#); they should look like this:
## UTF 8 Settings
init-connect='SET NAMES utf8'
collation_server=utf8_unicode_ci
character_set_server=utf8
In CakePHP, I have this issue with the Finnish language not displaying properly. I have set UTF-8 encoding in config.php, the charset output in default.ctp, and also the config in core.php.
Is there a reason why it's not coming out properly?
To give you an idea the link is below:
http://www.likeslomakkeet.net/petitions/add
What if you re-import your data into the database after changing your database.php and database collations? Try re-adding some commune with special characters, like "Hämeenkyrö", and see how it looks in the database.
Edit: You could also filter out all communes marked "(lakkautettu)" (i.e. abolished), because they no longer exist.
Did you also set the database connection to UTF-8 in database.php?
For MySQL, that would be:
'encoding' => 'utf8' // no hyphen
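For context, a sketch of where that setting lives in a CakePHP 2.x app/Config/database.php; the credentials and database name are placeholders:
<?php
class DATABASE_CONFIG {

    public $default = array(
        'datasource' => 'Database/Mysql',
        'persistent' => false,
        'host'       => 'localhost',
        'login'      => 'your_user',
        'password'   => 'your_password',
        'database'   => 'your_database',
        'prefix'     => '',
        'encoding'   => 'utf8' // utf8, not utf-8
    );
}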