How to insert ISO-8859-1 charset values in DB using php - php

I'm inserting contents into table as bulk upload format (.csv).
In my CSV some columns include special characters but server can't insert special characters included rows into db and stops execution.
How can I insert these values?
I'm trying:
$field1= addslashes(trim($data[0]));
and I'm using mysqli_real_escape_string, it's working but it omits all special characters from the fields...
$field2= mysqli_real_escape_string(trim($data[1]));

Do not call any conversion routines. Simply declare that the .csv file is encoded latin1. Do that with this clause in the LOAD DATA statement:
CHARACTER SET latin1
You have not said what charset the column/table is. Please provide SHOW CREATE TABLE. The column can be either latin1 or utf8.
Also, use mysqli_set_charset to establish which charset your client code is using. (It does not need to be the same as the column/table.)
You say "omits special characters". Do you mean that data is stored into the table(s), but with strings truncated at the special characters? See "Truncated text" in Trouble with utf8 characters; what I see is not what I stored .
You say "stops execution". That seems unlikely; please provide more details, such as the SQL that caused the termination, whether other commands come be run after it, whether you needed to start (not "restart") mysqld again.
You say "special characters"; let's see some examples. You used escaping to get $field variables, but what did you then do with those variables?

MySQL should do the conversion automatically, as long as you tell it what encoding your data is using. With the mysqli extension you use the mysqli::set_charset() function.
If you app is not using ISO-8859-1 and only your current data set does, you need to declare the application encoding and convert data yourself. You have at least three functions:
mb_convert_encoding()
iconv()
utf8_encode()
In either case, you'll get SQL errors if incompatible conversions happen. It shouldn't be the case if databases uses an encoding that contains all ISO-8859-1 characters (such as UTF-8).
Note: get rid of addslashes(), it serves no purpose at all and can only corrupt your data.

Related

How does charset affect php when doing MySQL queries?

I noticed that when doing database queries in PHP (e.g. Zend_db, mysqli...), you can set the character set. For example: mysqli_set_charset($con,"utf8"); I'm a little foggy as to what this actually does behind the scenes.
If I use php to do a database SELECT query, and I indicate a character set, what happens if it's not the same character set that the column was defined as in the database?
I mean, the database returns a binary sequence, but what is actually returned if the character is not encoded the same in the two character sets? Will mySQL take the internal binary data and return it "As-is"?
Or will MySQL try to convert it to the binary sequence that's the equivalent in the character set you indicated?
I guess the gist of my question is that I would like to know how the data is encoded when PHP is sending in the query, how it's transmitted back from MySQL, and whether there's another step of translation after PHP gets it back and stores it into a string in PHP internal memory.
Similarly, if you're doing an INSERT or update, how does it get sent from PHP to MySQL? Does PHP convert it to the correct binary encoding THEN send it into MySQL?
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Update:
Thanks to Raymond Nijland. I was able to fix my bug. But I did notice that for nonstandard characters, the charset does seem to matter.
I did a select statement using $db = new \PDO("mysql:host=$host;dbname=$database;charset=latin1", $dbuser, $dbpassword);. First, I tried latin1, then I tried utf8.
My problem was that I had a column with encrypted data, which I guess had some wierd characters in it. if I did an md5 on that field directly in the database query, it gave me an encoding that began with 889... BUT, I tried to pulled it into PHP with a SELECT statement. If I used PDO with a charset of latin1, then did an MD5() inside of PHP, it gives me the same hash (889...). Which implies that PHP has an exact copy of the binary that's in the database. BUT if I did used PDO with charset 'UTF-8', then did an MD5() in PHP, it gave me a hash beginning with 087... So somewhere, a conversion must be taking place.
At this point, my orignal bug is fixed, but I'm still curious as to what's happening. Is MYSQL doing the conversion before returning it to PHP, or does PDO do some sort of conversion on the PHP side?
mysqli_set_charset($con,"utf8"); (or other code for other client libraries) declares to MySQL that the encoding in the client will be MySQL's CHARACTER SET utf8. If bytes with a different encoding are sent to (think INSERT) mysql, garbage or errors will occur.
That setting also declares that the client desires that encoding from SELECTs.
The CHARACTER SET on each column in each table may be something else (eg, "latin1"). If so, MySQL will attempt to convert the encoding during the transmission.
Caution: MySQL's CHARACTER SET utf8 is a subset of the well-known UTF-8. To get the latter, use CHARACTER SET utf8mb4 in tables and mysqli_set_charset($con,"utf8mb4"); when connecting.
Going forward, utf8mb4 is preferred in most situations.
Non-text stuff ("as-is") should be put into BLOB or VARBINARY columns -- this bypasses any checking of the encoding. (Think a .jpg or AES_ENCRYPT.)
MySQL's MD5() function returns a hex string. UNHEX(MD5('...')) return binary stuff and must be store in, say, a BINARY(16) column.
Many forms of garbled text are discussed in Trouble with UTF-8 characters; what I see is not what I stored .

problems with character encoding pulled from mysql database [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
UTF-8 all the way through
okay, this is stupid that I can't figure it out.
Mysql database is set to utf8_general_ci collation. The field i'm having problems with is longtext type.
characters added to the database as &eacute or other accented characters are returning as �.
I run the output through stripslashes and i've tried both with and without html_entity_decode but can find no change in the output. What am I doing wrong?
Cheers
What character encoding does the string have that you try to insert? If it is in ISO-8859-1 you can use the PHP function utf8_encode() to encode it to UTF-8 before inserting it into the database.
http://php.net/manual/en/function.utf8-encode.php
Getting encoding right is really tricky - there are too many layers:
Browser
Page
PHP
MySQL
The SQL command "SET CHARSET utf8" from PHP will ensure that the client side (PHP) will get the data in utf8, no matter how they are stored in the database. Of course, they need to be stored correctly first.
DDL definition vs. real data
Encoding defined for a table/column doesn't really mean that the data are in that encoding. If you happened to have a table defined as utf8 but stored as differtent encoding, then MySQL will treat them as utf8 and you're in trouble. Which means you have to fix this first.
What to check
You need to check in what encoding the data flow at each layer.
Check HTTP headers, headers.
Check what's really sent in body of the request.
Don't forget that MySQL has encoding almost everywhere:
Database
Tables
Columns
Server as a whole
Client
Make sure that there's the right one everywhere.
Conversion
If you receive data in e.g. windows-1250, and want to store in utf-8, then use this SQL before storing:
SET NAMES 'cp1250';
If you have data in DB as windows-1250 and want to retreive utf8, use:
SET CHARSET 'utf8';
Last note:
Don't rely on too "smart" tools to show the data. E.g. phpMyAdmin does (was doing when I was using it) encoding really bad. And it goes through all the layers so it's hard to find out. Also, Internet Explorer had really stupid behavior of "guessing" the encoding based on weird rules. Use simple editors where you can switch encoding. Also, I recommend MySQL Workbench.

mysql and encoding

I moved my php application to the new server. i use mysql5 db. When i'm Updating or Inserting something to db, every " and - sign changed to ?. I use SET NAMES UTF8 and SET CHARACTER SET but it don't work. Any ideas?
SET NAMES UTF8 should be used on every page, when selecting as well as when updating or inserting.
actually this query must be used every time you connect to the database. just add it to connect code.
You need UTF-8 all the way through to make smart quotes and dashes (“”—) and other non-ASCII characters work reliably:
(1) Ensure that the browser sends you characters encoded to UTF-8. Do this by declaring the page that includes the form to be UTF-8:
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
...
(Ignore <form accept-encoding>, which doesn't work right in IE.)
(2) PHP deals with raw bytes and doesn't care what encoding they're in, but the database does care, so you have to tell it what encoding the bytes from PHP are coming in. This is what SET NAMES is doing, though mysql_set_charset may be preferable.
(3) Once the proper characters have reached the database, it'll need to store them in a Unicode encoding to make sure all characters can fit. Each column can have a different encoding, but you can use DEFAULT CHARACTER SET utf8 when you CREATE table to make all the text columns in it use UTF-8. You can also set the default character set for a database or the whole server to utf8 if you prefer.
If you have already CREATE​d the tables and they a non-UTF-8 collation, you'll have to recreate or alter the tables. You can check the current collation using SHOW FULL COLUMNS FROM sometable;.
(4) Make sure you HTML-encode text you output from PHP using htmlspecialchars() and not htmlentities(), which by default will mess up non-ASCII characters.
[You can, as an alternative to (2) and (3), just use the default Latin-1 encoding for the connection and the table storage, but put UTF-8 bytes in it nonetheless. The disadvantage of this approach is that it'll look wrong to other tools looking at the database, and lower/upper case characters won't compare against each other in the expected case-insensitive way.]
My guess is you are pasting from some text editor which is transforming the " into an angled pretty quote, and transforming your - into an mdash, which is causing both to be represented as ?.
While you set your database to accept UTF8 characters, you probably did not set your webserver/PHP to accept those characters. Try playing with mbstring functions, but check to make sure you arent using the slanted quotes or dashes.

PostgreSQL + PHP + UTF8 = invalid byte sequence for encoding

I'm migrating a db from mysql to postgresql. The mysql db's default collation is UTF8, postgres is also using UTF8, and I'm encoding the data with pg_escape_string(). For whatever reason however, I'm running into some funky errors about bad encoding:
pg_query() [function.pg-query]: Query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xeb7374
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client"
I've been poking around trying to figure this out, and noticed that php is doing something weird; if a string has only ascii chars in it (eg. "hello"), the encoding is ASCII. If the string contains any non ascii chars, it says the encoding is UTF8 (eg. "Hëllo").
When I use utf8_encode() on strings that are already UTF8, it kills the special chars and makes them all messed up, so.. what can I do to get this to work?
(the exact char hanging it up right now is "�", but instead of just search/replace, i'd like to find a better solution so this kinda problem doesn't happen again)
Most likely, the data in your MySQL database isn't UTF8. It's a pretty common scenario. MySQL at least used to not do any proper validation at all on the data, so it accepted anything you threw at it as UTF8 as long as your client claimed it was UTF8. They may have fixed that by now (or not, I don't know if they even consider it a problem), but you may already have incorrectly encoded data in the db. PostgreSQL, of course, performs full validation when you load it, and thus it may fail.
You may want to feed the data through something like iconv that can be set to ignore unknown characters, or transform them to "best guess".
BTW, an ASCII string is exactly the same in UTF-8 because they share the same first 127 characters; so "Hello" in ASCII is exactly the same as "Hello" in UTF-8, there's no conversion needed.
The collation in the table may be UTF-8 but you may not be fetching information from it in the same encoding. Now if you have trouble with information you give to pg_escape_string it's probably because you're assuming content fetched from MySQL is encoded in UTF-8 while it's not. I suggest you look at this page on MySQL documentation and see the encoding of your connection; you're probably fetching from a table where the collation is UTF-8 but you're connection is something like Latin-1 (where special characters such as çéèêöà etc won't be encoded in UTF-8).

php mysql character set: storing html of international content

i'm completely confused by what i've read about character sets. I'm developing an interface to store french text formatted in html inside a mysql database.
What i understood was that the safe way to have all french special characters displayed properly would be to store them as utf8. so i've created a mysql database with utf8 specified for the database and each table.
I can see through phpmyadmin that the characters are stored exactly the way it is supposed to. But outputting these characters via php gives me erratic results: accented characters are replaced by meaningless characters. Why is that ?
do i have to utf8_encode or utf8_decode them? note: the html page character encodign is set to utf8.
more generally, what is the safe way to store this data? Should i combine htmlentities, addslashes, and utf8_encode when saving, and stripslashes,html_entity_decode and utf8_decode when i output?
MySQL performs character set conversions on the fly to something called the connection charset. You can specify this charset using the sql statement
SET NAMES utf8
or use a specific API function such as mysql_set_charset():
mysql_set_charset("utf8", $conn);
If this is done correctly there's no need to use functions such as utf8_encode() and utf8_decode().
You also have to make sure that the browser uses the same encoding. This is usually done using a simple header:
header('Content-type: text/html;charset=utf-8');
(Note that the charset is called utf-8 in the browser but utf8 in MySQL.)
In most cases the connection charset and web charset are the only things that you need to keep track of, so if it still doesn't work there's probably something else your doing wrong. Try experimenting with it a bit, it usually takes a while to fully understand.
I strongly recomend to read this article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky, to understand what are you doing and why.
It is useful to consider the PHP-generated front end and the MySQL backend separate components. MySQL should not have to worry about display logic, nor should PHP assume that the backend does any sort of preprocessing on the data.
My advice would be to store the data in plain characters using utf8 encoding, and escape any dangerous characters with MySQLs methods.
PHP then reads the utf8 encoded data from database, processes them (with htmlentities(), most often), and displays it via whichever template you choose to use.
Emil H. correctly suggested using
SET NAMES utf8
which should be the first thing you call after making a MySQL connection. This makes the MySQL treat all input and output as utf8.
Note that if you have to use utf8_encode or utf8_decode functions, you are not setting the html character encoding correctly. It is easiest to require that every component of your system uses utf8, since that way you should never have to do manual encoding/decoding, which can cause hard to track issues later on.
In adition to what Emil H said, you also need this in your page head tag:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

Categories