I have a Fedora machine acting as a server, with Apache running PHP 5.3.
A script acts as an entry page for various sources sending me "messages".
The PHP script is called like this: serverAddress/phpScript.php?message=MyMessage. The message is then saved to a SQL Server 2008 database via PDO.
If the message contains any special characters (e.g. German ones such as üäöß), I get gibberish in the db instead of the correct string: üäöß
The db is perfectly capable of handling UTF-8: I can connect and send/retrieve German characters without any issue using other tools (not via PHP).
Inside the php script:
if I echo the input string I get the correct string üäöß
if I save it to a file (log the input) I see the gibberish: üäöß
What is causing this behavior? How can I fix it?
Multibyte support is enabled (yum install php-mbstring followed by an Apache restart).
at the start of my php script I have:
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
from what I understand the default encoding type when dealing with mssql via PDO is UTF-8
New development:
A colleague pointed me to the PDO_DBLIB page (visible only from cache at the moment), where I saw $res->bindValue(':value', iconv('UTF-8', 'ISO8859-1', $value));
I replaced all my $res->bindParam(':text',$text); with $res->bindParam(':text',iconv('UTF-8', 'ISO8859-1',$text)); and everything worked :).
The mb_internal_encoding.... and all other lines were no longer needed.
Why does it work when using the ISO8859-1 encoding?
A database may handle special characters without even supporting the Unicode set (of which UTF-8 happens to be an encoding, specifically a variable-length one).
A character set is a mapping between numbers and characters. Unicode and ASCII are common examples of charsets. Unicode states that the sign € maps to the number 8364 (usually written as the code point U+20AC). UTF-8 is one way to encode Unicode code points, and represents U+20AC with three bytes: 0xE2 0x82 0xAC. UTF-16 is another encoding for Unicode code points; it uses two bytes for this character (0x20 0xAC) and four bytes for characters outside the Basic Multilingual Plane. Both of these encodings refer to the same 8364th entry in the Unicode catalogue.
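The distinction between a code point and its encoded bytes can be checked directly. The snippet below is only an illustration, written in Python rather than PHP for brevity; the variable names are made up for this example.

```python
# Illustration only: one character, one code point, two different byte encodings.
euro = "\u20ac"  # the euro sign

code_point = euro and ord(euro)          # the number Unicode assigns to the sign
utf8_bytes = euro.encode("utf-8")        # variable-length encoding
utf16_bytes = euro.encode("utf-16-be")   # big-endian UTF-16, no byte order mark

print(code_point, hex(code_point))
print(utf8_bytes.hex(" "), "|", utf16_bytes.hex(" "))
```

Both byte sequences decode back to the same single character; only the stored bytes differ.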
ASCII is both a charset and an encoding scheme: the ASCII character set maps the numbers 0 to 127 to 128 characters, and the ASCII encoding stores each of them in a single byte.
Always remember that a string is a human concept. It's represented in a computer by the tuple (byte_content, encoding). Let's say you want to store Unicode strings in your database. Please note: it's not necessary to use the Unicode set if you just need to support German users. It's useful when you want to store Arabic, Chinese, Hebrew and German at the same time in the same column. MS SQL Server uses UCS-2 to encode Unicode, and this holds true for columns declared NCHAR or NVARCHAR (note the N prefix). So your first action should be to check whether the target column types are actually nvarchar (or nchar).
Then, let's assume that all input strings are UTF-8 encoded in your PHP script. You want to execute something like
$stmt->bindParam(':text', $utf8_encoded_text);
According to the documentation, UTF-8 is the default string encoding. I hope it's smart enough to work with NVARCHAR, otherwise you may need to use the extra options.
Your colleague's solution doesn't store Unicode strings: it converts the text to ISO-8859-1, then saves the bytes in simple CHAR or VARCHAR columns. The difference is that you won't be able to store characters outside of the ISO-8859-1 repertoire (e.g. Polish ones).
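To see what that limitation means in practice, here is a small sketch (in Python for illustration, not PHP; fits_latin1 is a hypothetical helper invented for this example) that checks whether a string survives conversion to ISO-8859-1:

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character of text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


german = "üäöß"
polish = "łódź"

print(fits_latin1(german))  # the German characters are all part of ISO-8859-1
print(fits_latin1(polish))  # ł and ź are not
print(len(german.encode("utf-8")), len(german.encode("iso-8859-1")))
```

The German sample takes two bytes per character in UTF-8 but one byte per character in ISO-8859-1, which is why the converted bytes fit a plain VARCHAR column.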
Take a look at this article on "Handling Unicode Front to Back in a Web App". By far one of the best articles I've seen on the subject. If you follow the guide and the issues are still present, then you know for sure that it's not your fault.
My boss likes to use n-dashes. They always cause problems with encoding and I cannot work out why.
I store my TEXT field in a database under the charset: utf8_general_ci.
I have the following tags under my <head> on my webpage:
<meta charset="UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I pull the information from my database with the following set:
mysql_set_charset('UTF8',$connection);
(I know the mysql_* extension is deprecated)
But when I get information from the database, I end up with this:
â€“ Europe
If I take this string and run it through utf8_decode, I get this:
��? Europe
I even tried running it through utf8_encode, and I got this:
âÃâ¬Ãâ Europe
Can someone explain to me why this is happening? I don't understand. I even ran the string through mb_detect_encoding and it said the string was UTF-8. So why is it not printing correctly?
The solution (or not really a solution, because it ruins the rest of the website) is to remove the mysql_set_charset line, and use utf8_decode. Then it prints out fine. BUT WHY!?
You have to remember that computers handle all forms of data as nothing more than sequences of 1s and 0s. In order to turn those 1s and 0s into something meaningful, the computer must somehow be told how those bits should be interpreted.
When it comes to a textual string, such information regarding its bits' interpretation is known as its character encoding. For example, the bit sequence 111000101000000010010011, which for brevity I will express in hexadecimal notation as 0xe28093, is interpreted under the UTF-8 character encoding to be your boss's much-loved U+2013 (EN-DASH); however that same sequence of bits could mean absolutely anything under a different encoding: indeed, under the ISO-8859-1 encoding (for example), it represents a sequence of three characters: U+00E2 (LATIN SMALL LETTER A WITH CIRCUMFLEX), U+0080 (<control>) and U+0093 (SET TRANSMIT STATE).
Unfortunately, in their infinite wisdom, PHP's developers decided not to keep track of the encoding under which your string variables are stored—that is left to you, the application developer. Worse still, many PHP functions make arbitrary assumptions about the encoding of your variables, and they happily go ahead manipulating your bits without any thought of the consequences.
So, when you call utf8_decode on a string: it takes whatever bits you provide, works out what characters they happen to represent in UTF-8, and then returns to you those same characters encoded in ISO-8859-1. It's entirely possible to come up with an input sequence that, when passed to this function, produces absolutely any given result; indeed, if you provide as input 0xc3a2c280c293 (which happens to be the UTF-8 encoding of the three characters mentioned above), it will produce a result of 0xe28093—the UTF-8 encoding of an "en dash"!
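The behaviour described above can be reproduced byte for byte. The sketch below uses Python rather than PHP, and utf8_decode_like is a hypothetical stand-in written for this example; it only mimics utf8_decode for input whose characters all exist in ISO-8859-1 (PHP substitutes "?" for anything else, which this sketch does not model).

```python
def utf8_decode_like(data: bytes) -> bytes:
    """Interpret data as UTF-8, then re-encode the same characters as ISO-8859-1."""
    return data.decode("utf-8").encode("iso-8859-1")


double_encoded = bytes.fromhex("c3a2c280c293")  # the six bytes from the example above
result = utf8_decode_like(double_encoded)

print(result.hex())                 # the three bytes e2 80 93
print(repr(result.decode("utf-8"))) # read as UTF-8, those bytes are a single en dash
```

The function never "knows" it produced an en dash; it simply emitted three Latin-1 bytes that happen to form one when read as UTF-8.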
Such double encoding (i.e. UTF-8 encoded, treated as ISO-8859-1 and transcoded to UTF-8) appears to be what you're retrieving from MySQL when you do not call mysql_set_charset (in such circumstances, MySQL transcodes results to whatever character set the client specifies upon connection—the standard drivers use latin1 unless you override their default configuration). In order for a result that MySQL transcodes to latin1 to produce such double encoded UTF-8, the value that is actually stored in your column must have been triple encoded (i.e. UTF-8 encoded, treated as ISO-8859-1, transcoded to UTF-8, then treated as latin1 again)!
You need to fix the data that is stored in your database:
Identify exactly how the incumbent data has actually been encoded. Some values may well be triple-encoded as described above, but others (perhaps that predate particular changes to your application code; or that were inserted/updated from a different source) may be encoded in some other way. I find SELECT HEX(myColumn) FROM myTable WHERE ... to be very useful for this purpose.
Correct the encodings of those values that are currently incorrect: e.g. UPDATE myTable SET myColumn = BINARY CONVERT(myColumn USING latin1) WHERE ...—if an entire column is misencoded, you can instead use ALTER TABLE to change it to a binary string type and then back to a character string of the correct encoding. Beware of transformations that increase the encoded length, as the result might overflow your existing column size.
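If you would rather inspect or repair values in application code before touching the table, the same idea can be sketched outside SQL. This is Python for illustration only; undo_one_layer is a hypothetical helper, and it is a heuristic: a value that legitimately contains text such as "Ã©" would be altered too, so check the hex dumps first.

```python
def undo_one_layer(text: str) -> str:
    """Undo one round of 'UTF-8 bytes mistaken for ISO-8859-1'.

    Returns the input unchanged when it cannot have been produced that way.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


en_dash = "\u2013"
# Simulate the damage: each round reads UTF-8 bytes as if they were ISO-8859-1.
once = en_dash.encode("utf-8").decode("iso-8859-1")
twice = once.encode("utf-8").decode("iso-8859-1")

print(undo_one_layer(once) == en_dash)
print(undo_one_layer(undo_one_layer(twice)) == en_dash)
```

Applying the helper once per layer of damage restores the original character, and correctly encoded values that contain characters outside ISO-8859-1 pass through untouched.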
My PHP form is submitting special latin characters as symbols.
So, Québec turns into QuÃ©bec
My form is set to UTF-8 and my database table has latin1_swedish_ci collation.
PHP: $db = new PDO('mysql:host=localhost;dbname=x;charset=utf8', 'x', 'x');
A bindParam: $sql->bindParam(":x", $_POST['x'],PDO::PARAM_STR);
I am new to PDO so I am not sure what the problem is. Thank you
*I am using phpMyAdmin
To expand a little bit more on the encoding problem...
Any time you see one character in a source turn into two (or more) characters, you should immediately suspect an encoding issue, especially if UTF-8 is involved. Here's why. (I apologize if you already know some of this, but I hope to help some future SO'ers as well.)
All characters are stored in your computer not as characters, but as bytes. Back in the olden days, space and transmission time were much more limited than now, so people tried to save every byte possible, even down to not using a full byte to store a character. Now, because we realize that we need to communicate with the whole world, we've decided it's more important to be able to represent every character in every language. That transition hasn't always been smooth, and that's what you're running up against.
Latin-1 (in various flavors) is an encoding that always uses a single 8-bit byte for a character. Which means it can only have 256 possible characters. Plenty if you only want to write English or Swedish, but not enough to add Russian and Chinese. (background on Latin-1)
UTF-8 encodes the first half of Latin-1 in exactly the same way, which is why you see most of the characters looking the same. But it doesn't always use a single byte for a character -- it can use up to four bytes on one character. (utf-8) As you discovered, it uses 2 bytes for é. But Latin-1 doesn't know that, and is doing its best to display those two bytes.
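The effect is easy to reproduce. The following is a small illustration in Python (not PHP); the variable names are invented for the example.

```python
word = "Québec"

utf8_bytes = word.encode("utf-8")          # é becomes the two bytes c3 a9
misread = utf8_bytes.decode("iso-8859-1")  # Latin-1 shows each byte as its own character

print(len(word), len(utf8_bytes))  # six characters, seven bytes
print(misread)
```

Every ASCII letter comes through unchanged; only the two-byte é is split into two unrelated Latin-1 characters.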
The trick is to always specify your encoding for byte streams (like info from a file, a URL, or a database), and to make sure that encoding is correct. (Sometimes that's a pain to find out, for sure.) Most modern languages, like Java and PHP do a good job of handling all the translation issues between different encodings, as long as you've correctly specified what you're dealing with.
You've pretty much answered your own question: you're receiving UTF-8 from the form but trying to store it in a Latin-1 column. You can either change the encoding on the column in MySQL or use the iconv function to translate between the two encodings.
Change your database table and column to utf8_unicode_ci.
Make sure you are saving the file with UTF-8 encoding (this is often overlooked)
Set headers:
<?php header("Content-type: text/html; charset=utf-8"); ?>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Twitter Bootstrap won't show the Croatian letters đ ć č ž š.
I set the charset to UTF-8 and it still doesn't work...
I'm getting the data from MySQL, and in the database it is OK. At first I thought it was the font, Helvetica, so I got Helvetica Neue Pro LT, but I'm not sure I am loading it correctly. Any idea how to load it with Twitter Bootstrap + Font Awesome?
So I tried with Arial, and it doesn't work either!
Any help?
Thanks a lot
There can be several reasons.
Before troubleshooting the database, you should first try with some static content to ensure that your editor and server are correctly set up.
Try adding some HTML content that contains your Croatian characters to your PHP file. If these characters come out wrong in your browser, make sure:
Your code editor saves your PHP files using UTF-8 encoding
Your webserver outputs your PHP files as UTF-8. To check this, look at the http header in your browser. There should be a line named "Content-Type" with a value of "text/html; charset=UTF-8". Look at this screenshot from Chrome to locate the http header: http://www.jesperkristensen.dk/webstandarder/doctype-chooser/chrome.png
If the static characters come out right, the next step is to troubleshoot the database.
The character set
Computers only know numbers by nature, so internally the computer thinks of letters and characters as numbers. For example, the letter a is, by default, stored as the number 97 on American computers, while b is 98. For a complete list, see http://www.asciitable.com/
Put very simply, whenever it displays characters on the screen, the computer takes this numeric value and looks it up in a font library to find the appropriate glyph to display.
The set of glyphs (characters) that the computer is searching whenever it is displaying some text is called the character set. The specific encoding rules, that define what numbers map to what glyphs in the character set, are called character encodings.
When people talk about the ASCII set they are talking about a collection of glyphs that includes the English alphabet in both uppercase and lowercase, the Arabic numerals (0-9) and a handful of special characters. But they may also be referring to the ASCII character encoding, which specifies which numbers map to which glyphs. Again, see www.asciitable.com
Unfortunately, there are more than English latin characters and as computers proliferated throughout the world in the 60's and 70's, local character sets, fonts and encodings were invented to suit local needs. These have esoteric names like ISO-8859-5, EUC-JP or IBM860.
Attempting to read a text using a different character encoding standard than the text was encoded with would often cause headaches. English characters would work, because they are represented the same across different encodings and sets, but anything else would break. For instance, the character æ is a special Danish vowel character that has numeric value of 230 when using the ISO-8859-1 standard which was the predominant encoding standard in Denmark. However, if you saved a text file containing In Denmark, an apple is called æble. to a floppy disk and sent it to a friend in Bulgaria, his computer would assume that the text file is encoded using the ISO-8859-5 standard for cyrillic texts and it would show up as In Denmark, an apple is called цble which is wrong, because according to the ISO-8859-5 standard the numeric value of 230 maps to ц.
To compare the two character encodings, please see:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
http://en.wikipedia.org/wiki/ISO/IEC_8859-5
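The Danish/Bulgarian mix-up described above can be replayed in a few lines. This is an illustration in Python (not PHP); the variable names are made up.

```python
raw = bytes([230])  # the single byte stored on the floppy disk

as_danish = raw.decode("iso-8859-1")    # what the sender meant
as_cyrillic = raw.decode("iso-8859-5")  # what the receiver's computer assumes

sentence = "In Denmark, an apple is called æble.".encode("iso-8859-1")
seen_in_bulgaria = sentence.decode("iso-8859-5")

print(as_danish, as_cyrillic)
print(seen_in_bulgaria)
```

The bytes never change; only the table used to look them up does, and the English letters survive because both encodings agree on the ASCII range.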
In the olden days, developers had to pick which character set and encoding to use for their application, because no universal solution existed, but fortunately a new character set called Unicode evolved in the 1990's.
This character set includes thousands and thousands of glyphs from around the world, enough to cover pretty much every current alphabet and language in the world. Together with Unicode, a few specifications on how to encode text were devised. The most popular today is called utf-8, which is conveniently backwards compatible with the old American 7-bit ASCII character encoding. Because of that, all valid ASCII text is also valid utf-8 text. This backward compatibility is also a curse, because it frequently leads novice developers to conclude that their software is working, when in fact it is only working with characters present in the ASCII character set.
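Both halves of that statement can be checked quickly. The snippet is an illustration in Python (not PHP), with invented variable names: ASCII text produces identical bytes under both encodings, while the Croatian letters from the question do not exist in ASCII at all.

```python
ascii_text = "Hello, world"
croatian = "đćčžš"

# Pure ASCII text yields exactly the same bytes under ASCII and utf-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Each Croatian letter needs more than one byte in utf-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in croatian}

print(same_bytes)
print(utf8_lengths)
print(croatian.isascii())
```

This is why a test suite that only uses English sample data tells you nothing about whether the encoding setup is correct.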
First step - making sure your database is storing text as utf-8
Before you can display the text correctly, it must be stored correctly. Use your SQL management tool to check the character encoding for the table you are working with. If no information is present, defaults are probably inherited from the database schema or server configuration, so check that too.
If your table is NOT using utf-8 or you are unable to verify that it is using utf-8, you may run this command in your SQL tool against your table to explicitly instruct the database to store the data using utf-8:
ALTER TABLE name_of_your_table
CHARACTER SET utf8
COLLATE utf8_unicode_ci;
This tells the database to store data using utf-8 encoding, making it capable of storing your special characters. Databases also use a concept called collation, which is a set of rules on how characters are sorted and compared. For instance, ß may be treated as two s characters in German, while other languages might consider it a special character that comes before or after normal Latin characters when sorting. Unless you have a good reason, use the utf8_unicode_ci collation, which is language agnostic and will usually sort your things correctly. The _ci in the name means case insensitive, meaning that when doing comparisons such as WHERE country = 'USA', records in lowercase will also match.
Second step - the webserver and the database need to speak utf-8 together
Now that your database is storing things the right way, you have to make sure your webserver and database are communicating correctly too. Again, a multitude of environment settings affect their defaults, so it's a good idea to be explicit when connecting to the database. If you are using PDO in PHP to connect, you can use the following example to connect (taken from php.net) :
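<?php
// Connection details (placeholders, as in the php.net example).
$dsn = 'mysql:host=localhost;dbname=testdb';
$username = 'username';
$password = 'password';
$options = array(
    // Run this SQL command every time a connection is opened.
    PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8',
);
$dbh = new PDO($dsn, $username, $password, $options);
?>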
What is important here is the options associative array. It contains "SET NAMES utf8" which is a SQL command that is run against the database whenever a connection is opened. It has two implications.
It instructs the database that any queries sent subsequently will be encoded using utf-8. That way the database will understand non-ascii characters coming from the webserver.
It instructs the database to encode any responses it returns to the web server using utf-8, so the web server can treat them as such.
With a database that stores your data using utf-8 encoding and a web server that connects and transfers query results using utf-8 encoding, you should be ready to display your Croatian characters on your website.
We are trying to migrate database content (with a PHP script).
Content has been copied into a CMS and then written to the database. The copied content could be in any character encoding scheme (e.g. ISO-...-14) and come from any website.
The PHP CMS is UTF-8, so a character pasted into a textbox would be converted to UTF-8 when it was POSTed, but then written to the database as Latin-1 (MSSQL db; db charset and query charset are both latin-1).
We are desperately trying to work out how this could be reversed, or whether it is even possible (to get the character back as proper UTF-8), in PHP.
If we can work out the logic, we can write an extension in C++ if PHP can't handle it (which it probably can't, given the mb_* functions and iconv).
I keep getting lost in UTF-8 4-byte character streams (i.e. 0-127 is... etc.).
Anybody got any ideas?
So far we have used PHP's ord() function to try to produce a Unicode/ASCII character reference for each char (I know ord returns ASCII, but it prints character numbers over 128, which I thought was weird if it is just meant to be ASCII, or maybe it repeats itself).
My thinking is that the latin1 data will struggle to convert back to UTF-8 and will result in black diamonds, because Latin1 (ISO-...-1) is a single-byte character stream.
If latin1 is an 8-bit clean encoding for your database (it is in MySQL; I don't know about MSSQL), then you don't need to do anything to reconstruct the utf-8 string. When you pull it out of your database into PHP you will get back the same bytes you put in, i.e. UTF-8.
If latin1 is not an 8-bit-clean encoding for your database then your strings are irretrievably broken. This means any characters which the database considered invalid were either dropped or replaced the moment you wrote your utf-8 string to the database. There isn't any way to recover from this.
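The difference between the two cases can be sketched as follows. This is Python for illustration only, and it makes two assumptions: Python's iso-8859-1 codec stands in for an 8-bit clean "latin1", and its cp1252 codec (which leaves a few byte values such as 0x81 undefined) stands in for a code page that is not 8-bit clean. Neither is a statement about how MSSQL actually behaves.

```python
utf8_bytes = "Łódź".encode("utf-8")  # Ł encodes to c5 81

# 8-bit clean: every byte value maps to some character, so nothing is lost.
stored_clean = utf8_bytes.decode("iso-8859-1")
recovered = stored_clean.encode("iso-8859-1")

# Not 8-bit clean: byte 0x81 has no character, so it is replaced on the way in.
stored_lossy = utf8_bytes.decode("cp1252", errors="replace")

print(recovered == utf8_bytes)
print("\ufffd" in stored_lossy)
```

In the first case the stored text looks like gibberish but still carries every original byte; in the second the replacement has already destroyed information.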
The manual clearly states " ucs2 cannot be used as a client character set, which means that it does not work for SET NAMES or SET CHARACTER SET". So how can I insert, for example, the codepoint U+2193? I am using PHP 5.3 + PDO.
If you want to use Unicode for communicating with a MySQL server, your only option is to use UTF-8.
If you're working with UCS-2 or UTF-16 strings in PHP now, you'll have to convert them to UTF-8 before trying to store them. Also note that MySQL will give you back UTF-8 if that's what you set your client character set to, so you'll need to convert query results as well if you're committed to working with UCS-2 on the PHP side. (If you're in a position to make bigger changes, you'd likely be better off simply using UTF-8 everywhere than doing all this extra conversion.)
As for storing the codepoint U+2193, no worries: UTF-8 can represent every Unicode codepoint (in this specific case, it'd be 0xE2 0x86 0x93).
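As a quick check of those bytes, and of the conversion you would do before sending a UCS-2 string to the server, here is a short illustration in Python (not PHP; the names are invented, and big-endian UTF-16 is used as the byte form of UCS-2, which is identical for this code point):

```python
arrow = "\u2193"  # DOWNWARDS ARROW

utf8_bytes = arrow.encode("utf-8")
ucs2_bytes = arrow.encode("utf-16-be")  # same bytes as UCS-2 for BMP characters

# Converting a UCS-2 byte string to UTF-8 before handing it to the database layer.
converted = ucs2_bytes.decode("utf-16-be").encode("utf-8")

print(utf8_bytes.hex(" "), "|", ucs2_bytes.hex(" "))
print(converted == utf8_bytes)
```

In PHP the same conversion would be a call to iconv or mb_convert_encoding with the matching encoding names.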
Technically, this is fudging a little, since MySQL's utf8 and ucs2 character sets only cover a subset of Unicode called the Basic Multilingual Plane (BMP). The world of Unicode charsets is expanded in MySQL 5.5 to move beyond the BMP, but you still can't use ucs2, the new utf16 or utf32 charsets as client charsets, leaving you still stuck with UTF-8.
For posterity: CREATE TABLE test (encoding varchar(255) CHARACTER SET ucs2); and then INSERT INTO test VALUES (CHAR(0x2193));. If I then run SELECT * FROM test I see a down arrow.