Form saves special Latin characters as symbols - PHP

My PHP form is submitting special latin characters as symbols.
So, Québec turns into QuÃ©bec
My form is set to UTF-8 and my database table has latin1_swedish_ci collation.
PHP: $db = new PDO('mysql:host=localhost;dbname=x;charset=utf8', 'x', 'x');
A bindParam: $sql->bindParam(":x", $_POST['x'], PDO::PARAM_STR);
I am new to PDO so I am not sure what the problem is. Thank you
*I am using phpMyAdmin

To expand a little bit more on the encoding problem...
Any time you see one character in a source turn into two (or more characters), you should immediately suspect an encoding issue, especially if UTF-8 is involved. Here's why. (I apologize if you already know some of this, but I hope to help some future SO'ers as well.)
All characters are stored in your computer not as characters, but as bytes. Back in the olden days, space and transmission time were much more limited than now, so people tried to save every byte possible, even down to not using a full byte to store a character. Now, because we realize that we need to communicate with the whole world, we've decided it's more important to be able to represent every character in every language. That transition hasn't always been smooth, and that's what you're running up against.
Latin-1 (in various flavors) is an encoding that always uses a single 8-bit byte for a character. Which means it can only have 256 possible characters. Plenty if you only want to write English or Swedish, but not enough to add Russian and Chinese. (background on Latin-1)
UTF-8 encodes the first half of Latin-1 in exactly the same way, which is why you see most of the characters looking the same. But it doesn't always use a single byte for a character -- it can use up to four bytes on one character. (utf-8) As you discovered, it uses 2 bytes for é. But Latin-1 doesn't know that, and is doing its best to display those two bytes.
The trick is to always specify your encoding for byte streams (like info from a file, a URL, or a database), and to make sure that encoding is correct. (Sometimes that's a pain to find out, for sure.) Most modern languages, like Java and PHP, do a good job of handling all the translation issues between different encodings, as long as you've correctly specified what you're dealing with.
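A quick way to see that byte difference for yourself (a minimal sketch; it assumes the PHP source file itself is saved as UTF-8 and that the iconv extension is available):
<?php
echo bin2hex('é'), "\n";                               // c3a9 - two bytes in UTF-8
echo bin2hex(iconv('UTF-8', 'ISO-8859-1', 'é')), "\n"; // e9   - one byte in Latin-1
?>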

You've pretty much answered your own question: you're receiving UTF-8 from the form but trying to store it in a Latin-1 column. You can either change the encoding on the column in MySQL or use the iconv function to translate between the two encodings.
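For instance, a minimal sketch of the iconv route, assuming the column stays Latin-1 and the form posts UTF-8 (the table and column names here are made up):
<?php
$db = new PDO('mysql:host=localhost;dbname=x;charset=utf8', 'x', 'x');
// 'mytable' and its column 'x' are placeholders for your own schema.
// Transliterate anything Latin-1 cannot represent instead of failing outright.
$value = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $_POST['x']);
$sql = $db->prepare('INSERT INTO mytable (x) VALUES (:x)');
$sql->bindParam(':x', $value, PDO::PARAM_STR);
$sql->execute();
?>
Changing the column to utf8 is usually the cleaner fix, since anything outside the Latin-1 range cannot survive the iconv conversion.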

Change your database table and columns to the utf8 character set (for example with the utf8_unicode_ci collation).

Make sure you are saving the file with UTF-8 encoding (this is often overlooked)
Set headers:
<?php header("Content-type: text/html; charset=utf-8"); ?>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Related

Charsets and Databases

My boss likes to use n-dashes. They always cause problems with encoding and I cannot work out why.
I store my TEXT field in a database under the charset: utf8_general_ci.
I have the following tags under my <head> on my webpage:
<meta charset="UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I pull the information from my database with the following set:
mysql_set_charset('UTF8',$connection);
(I know the mysql_* functions are deprecated)
But when I get information from the database, I end up with this:
â€“ Europe
If I take this string and run it through utf8_decode, I get this:
��? Europe
I even tried running it through utf8_encode, and I got this:
âÃâ¬Ãâ Europe
Can someone explain to me why this is happening? I don't understand. I even ran the string through mb_detect_encoding and it said the string was UTF-8. So why is it not printing correctly?
The solution (or not really a solution, because it ruins the rest of the website) is to remove the mysql_set_charset line and use utf8_decode. Then it prints out fine. BUT WHY!?
You have to remember that computers handle all forms of data as nothing more than sequences of 1s and 0s. In order to turn those 1s and 0s into something meaningful, the computer must somehow be told how those bits should be interpreted.
When it comes to a textual string, such information regarding its bits' interpretation is known as its character encoding. For example, the bit sequence 111000101000000010010011, which for brevity I will express in hexadecimal notation as 0xe28093, is interpreted under the UTF-8 character encoding to be your boss's much-loved U+2013 (EN-DASH); however that same sequence of bits could mean absolutely anything under a different encoding: indeed, under the ISO-8859-1 encoding (for example), it represents a sequence of three characters: U+00E2 (LATIN SMALL LETTER A WITH CIRCUMFLEX), U+0080 (<control>) and U+0093 (SET TRANSMIT STATE).
Unfortunately, in their infinite wisdom, PHP's developers decided not to keep track of the encoding under which your string variables are stored—that is left to you, the application developer. Worse still, many PHP functions make arbitrary assumptions about the encoding of your variables, and they happily go ahead manipulating your bits without any thought of the consequences.
So, when you call utf8_decode on a string: it takes whatever bits you provide, works out what characters they happen to represent in UTF-8, and then returns to you those same characters encoded in ISO-8859-1. It's entirely possible to come up with an input sequence that, when passed to this function, produces absolutely any given result; indeed, if you provide as input 0xc3a2c280c293 (which happens to be the UTF-8 encoding of the three characters mentioned above), it will produce a result of 0xe28093—the UTF-8 encoding of an "en dash"!
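You can reproduce that round trip yourself; a small sketch, using the same byte sequences discussed above:
<?php
$double = "\xC3\xA2\xC2\x80\xC2\x93";     // the UTF-8 encoding of those three ISO-8859-1 characters
echo bin2hex(utf8_decode($double)), "\n"; // e28093 - the UTF-8 bytes of an en dash
?>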
Such double encoding (i.e. UTF-8 encoded, treated as ISO-8859-1 and transcoded to UTF-8) appears to be what you're retrieving from MySQL when you do not call mysql_set_charset (in such circumstances, MySQL transcodes results to whatever character set the client specifies upon connection—the standard drivers use latin1 unless you override their default configuration). In order for a result that MySQL transcodes to latin1 to produce such double encoded UTF-8, the value that is actually stored in your column must have been triple encoded (i.e. UTF-8 encoded, treated as ISO-8859-1, transcoded to UTF-8, then treated as latin1 again)!
You need to fix the data that is stored in your database:
Identify exactly how the incumbent data has actually been encoded. Some values may well be triple-encoded as described above, but others (perhaps that predate particular changes to your application code; or that were inserted/updated from a different source) may be encoded in some other way. I find SELECT HEX(myColumn) FROM myTable WHERE ... to be very useful for this purpose.
Correct the encodings of those values that are currently incorrect: e.g. UPDATE myTable SET myColumn = BINARY CONVERT(myColumn USING latin1) WHERE ...—if an entire column is misencoded, you can instead use ALTER TABLE to change it to a binary string type and then back to a character string of the correct encoding. Beware of transformations that increase the encoded length, as the result might overflow your existing column size.
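A sketch of those two steps driven from PHP, assuming a PDO connection and the hypothetical myTable/myColumn names used above:
<?php
$pdo = new PDO('mysql:host=localhost;dbname=x;charset=utf8', 'x', 'x');

// Step 1: inspect the raw bytes to work out how each row is actually encoded.
foreach ($pdo->query('SELECT id, HEX(myColumn) AS raw FROM myTable') as $row) {
    echo $row['id'], ': ', $row['raw'], "\n";
}

// Step 2: undo one layer of mis-encoding, but only for the rows identified in step 1.
$ids = [/* placeholder: the ids that turned out to be double/triple encoded */];
$fix = $pdo->prepare('UPDATE myTable SET myColumn = BINARY CONVERT(myColumn USING latin1) WHERE id = ?');
foreach ($ids as $id) {
    $fix->execute([$id]);
}
?>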

MySQL: Different charsets for different text contents, is it worth it?

I have my database with utf8mb4 in all tables and all char/varchar/text columns. All is working fine, but I was wondering if I really need it for all columns. I mean, I have columns that will contain user text that require utf8mb4, since the user can type in any language, insert emoticons, and so on. However, I have other columns that will contain different kinds of strings, like user access tokens, country codes, and user nicknames that do not contain strange characters.
Is it worth changing the charset of these columns to something like ascii or latin1? Would it improve database space or efficiency? My feeling is that setting a charset like utf8mb4 for something that will never contain Unicode characters is a waste of 'something'... but I really do not know how this is managed internally by MySQL.
On the other side, I am connecting to this database from PHP and setting the connection charset to utf8mb4, so I suppose that all non-utf8 columns will be converted automatically. I suppose it is not a problem, as utf8 is a superset of ascii or latin1.
Any tips? Pros and cons? Thanks!
The short answer is to make all your columns and tables defaulting to the same thing, UTF-8.
The long answer: because of the way UTF-8 is encoded, ASCII maps 1:1 onto UTF-8 and incurs no additional storage overhead of the kind you might see with UTF-16 or UTF-32, so it's not a big deal. If you're storing non-ASCII characters it will take more space, but if you're storing those, you need the support anyway.
Having mixed character sets in your tables is just asking for trouble. The only exception is when defining BINARY or BLOB type columns that are not UTF-8 but instead binary.
Even the documentation makes it clear the only place this is an issue is with CHAR columns rather than VARCHAR, but it's not really a good idea to use CHAR columns in the first place.
ASCII is a strict subset of UTF-8, so there is exactly zero gain in space efficiency if you have nothing that uses special characters stored in UTF-8. There is a marginal improvement in space efficiency if you use latin-1 instead of UTF-8 for storing latin-derived text (special characters that UTF-8 uses 2 bytes for can be stored with just one byte in latin-1), but you gain a lot of headaches on the way, and you lose compatibility with wider character sets.
For example, ñ is stored as 0xC3 0xB1 in UTF-8, whereas latin-1 stores it as 0xF1. On the other hand, a is 0x61 in both encodings. The people who designed UTF-8 did this deliberately: you save a single byte, and only for special characters.
TL;DR Use UTF-8 for everything. If you have to ask, you don't need anything else.
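Purely as an illustration of the mix the question describes (hypothetical table and column names), a per-column setup would look like the following; as the answers above argue, you are usually better off leaving everything utf8mb4:
CREATE TABLE user_profile (
    access_token CHAR(64) CHARACTER SET ascii,   -- hex tokens never need Unicode
    country_code CHAR(2)  CHARACTER SET ascii,
    nickname VARCHAR(50)  CHARACTER SET utf8mb4,
    bio TEXT              CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
) DEFAULT CHARACTER SET utf8mb4;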

Croatian letters encoding + Twitter Bootstrap + Helvetica Neue Pro

Twitter Bootstrap won't show the letters đ ć č ž š (Croatian letters).
I set the charset to UTF-8 and it still won't work...
I'm getting data from MySQL and in the database it is OK. First I thought it was the font Helvetica, so I got Helvetica Neue Pro LT, but... I'm not sure I load it correctly... any idea how to load it? Twitter Bootstrap + Font Awesome.
I also tried with Arial, and it won't work!
So, any help?
Thanks a lot.
There can be several reasons.
Before troubleshooting the database, you should first try with some static content to ensure that your editor and server are correctly set up.
Try adding some HTML content to your PHP file that contains your Croatian characters. If these characters come out wrong in your browser, make sure:
Your code editor saves your PHP files using UTF-8 encoding
Your webserver outputs your PHP files as UTF-8. To check this, look at the http header in your browser. There should be a line named "Content-Type" with a value of "text/html; charset=UTF-8". Look at this screenshot from Chrome to locate the http header: http://www.jesperkristensen.dk/webstandarder/doctype-chooser/chrome.png
If the static characters come out right, the next step is to troubleshoot the database.
The character set
Computers only know numbers by nature, so internally the computer thinks of letters and characters as numbers. For example, the letter a is, by default, stored as the number 97 on American computers, while b is 98. For a complete list, see http://www.asciitable.com/
Put very simply, whenever displaying characters on the screen, the computer will take this numeric value and look it up in a font library to find the appropriate glyph to display on screen.
The set of glyphs (characters) that the computer is searching whenever it is displaying some text is called the character set. The specific encoding rules that define which numbers map to which glyphs in the character set are called character encodings.
When people talk about the ASCII set they are talking about a collection of glyphs that includes the English alphabet in both uppercase and lowercase, the Arabic numerals (0-9) and a handful of special characters. But they may also be referring to the ASCII character encoding, which specifies which numbers map to which glyphs. Again - see www.asciitable.com
Unfortunately, there are more characters in the world than just the English Latin ones, and as computers proliferated throughout the world in the 60's and 70's, local character sets, fonts and encodings were invented to suit local needs. These have esoteric names like ISO-8859-5, EUC-JP or IBM860.
Attempting to read a text using a different character encoding standard than the text was encoded with would often cause headaches. English characters would work, because they are represented the same across different encodings and sets, but anything else would break. For instance, the character æ is a special Danish vowel character that has numeric value of 230 when using the ISO-8859-1 standard which was the predominant encoding standard in Denmark. However, if you saved a text file containing In Denmark, an apple is called æble. to a floppy disk and sent it to a friend in Bulgaria, his computer would assume that the text file is encoded using the ISO-8859-5 standard for cyrillic texts and it would show up as In Denmark, an apple is called цble which is wrong, because according to the ISO-8859-5 standard the numeric value of 230 maps to ц.
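A tiny sketch of that very mix-up (the byte 230 is 0xE6; assumes PHP's iconv extension is available):
<?php
$byte = "\xE6";                                  // 230 - "æ" under ISO-8859-1
echo iconv('ISO-8859-1', 'UTF-8', $byte), "\n";  // æ
echo iconv('ISO-8859-5', 'UTF-8', $byte), "\n";  // ц - same byte, read with the wrong character set
?>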
To compare the two character encodings, please see:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
http://en.wikipedia.org/wiki/ISO/IEC_8859-5
In the olden days, developers had to pick which character set and encoding to use for their application, because no universal solution existed, but fortunately a new character set called Unicode evolved in the 1990's.
This character set includes thousands and thousands of glyphs from around the world, enough to cover pretty much every current alphabet and language in the world. Together with Unicode, a few specifications on how to encode text were devised. The most popular today is called utf-8, which is conveniently backwards compatible with the old American ASCII 7-bit character encoding. Because of that, all valid ASCII text is also valid utf-8 text. This backward compatibility is also a curse, because it frequently leads novice developers to conclude that their software is working, when in fact it is only working with characters present in the ASCII character set.
First step - making sure your database is storing text as utf-8
Before you can display the text correctly, it must be stored correctly. Use your SQL management tool to check the character encoding for the table you are working with. If no information is present, defaults are probably inherited from the database schema or server configuration, so check that too.
If your table is NOT using utf-8 or you are unable to verify that it is using utf-8, you may run this command in your SQL tool against your table to explicitly instruct the database to store the data using utf-8:
ALTER TABLE name_of_your_table
CONVERT TO CHARACTER SET utf8
COLLATE utf8_unicode_ci;
This tells the database to store data using utf-8 encoding, making it capable of storing your special characters. Databases also use a concept called collation, which is a set of rules for how glyphs are sorted and compared. For instance, ß may be interpreted as two s characters in German, while other languages might consider it a special character that comes before or after normal Latin characters when sorting. Unless you have a good reason, use the utf8_unicode_ci collation, which is language agnostic and will usually sort your things correctly. The _ci in the name means case insensitive, meaning that when doing comparisons such as WHERE country = 'USA', records in lowercase will also match.
Second step - the webserver and the database need to speak utf-8 together
Now that your database is storing things the right way, you have to make sure your webserver and database are communicating correctly too. Again, a multitude of environment settings affect their defaults, so it's a good idea to be explicit when connecting to the database. If you are using PDO in PHP to connect, you can use the following example (taken from php.net):
<?php
$dsn = 'mysql:host=localhost;dbname=testdb';
$username = 'username';
$password = 'password';
$options = array(
    PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8',
);
$dbh = new PDO($dsn, $username, $password, $options);
?>
What is important here is the options associative array. It contains "SET NAMES utf8" which is a SQL command that is run against the database whenever a connection is opened. It has two implications.
It instructs the database that any queries sent subsequently will be encoded using utf-8. That way the database will understand non-ascii characters coming from the webserver.
It instructs the database that any responses it returns to the web server should be encoded using utf-8, and the web server will assume and treat them as such.
With a database that stores your data using utf-8 encoding and a web server that connects and transfers query results using utf-8 encoding, you should be ready to display your croatian characters on your website.

Store special characters (German) in SQL Server via PHP

I have a Fedora machine acting as a server, with Apache running PHP 5.3.
A script acts as an entry page for various sources sending me "messages".
The PHP script is called like: serverAddress/phpScript.php?message=MyMessage and the message is then saved via PDO to a SQL Server 2008 db.
If the message contains any special (e.g. German) characters like üäöß, then in the db I will get some gibberish instead of the correct string: üäöß
The db is perfectly capable of UTF-8 - I can connect and send/retrieve german characters without any issue with other tools (not via php).
Inside the php script:
if I echo the input string I get the correct string üäöß
if I save it to a file (log the input) I see the gibberish: Ã¼Ã¤Ã¶ÃŸ
What is causing this behavior? How can I fix it?
multibyte is enabled (yum install php-mbstring followed by an Apache restart)
at the start of my php script I have:
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
From what I understand, the default encoding type when dealing with MSSQL via PDO is UTF-8.
New development:
A colleague pointed me to the PDO_DBLIB page (visible only from cache at the moment) where I saw $res->bindValue(':value', iconv('UTF-8', 'ISO8859-1', $value));
I replaced all my $res->bindParam(':text',$text); with $res->bindParam(':text',iconv('UTF-8', 'ISO8859-1',$text)); and everything worked :).
The mb_internal_encoding.... and all other lines were no longer needed.
Why does it work when using the ISO8859-1 encoding?
A database may handle special characters without even supporting the Unicode set (of which UTF-8 happens to be one encoding, specifically a variable-length one).
A character set is a mapping between numbers and characters. Unicode and ASCII are common examples of charsets. Unicode states that the sign € maps to the number 8364 (really, it uses the code point U+20AC). UTF-8 is a way to encode Unicode code points, and represents U+20AC with three bytes: 0xE2 0x82 0xAC; UTF-16 is another encoding for Unicode code points, and represents U+20AC with two bytes: 0x20 0xAC. Both of these encodings refer to the same 8364th entry in the Unicode catalogue.
ASCII is both a charset and an encoding scheme: the ASCII character set maps number from 0 to 127 to 128 human chars, and the ASCII encoding requires a single byte.
Always remember that a String is a human concept. It's represented in a computer by the tuple (byte_content, encoding). Let's say you want to store Unicode strings in your database. Please note: it's not necessary to use the Unicode set if you just need to support German users. It's useful when you want to store Arabic, Chinese, Hebrew and German at the same time in the same column. MS SQL Server uses UCS-2 to encode Unicode, and this holds true for columns declared NCHAR or NVARCHAR (note the N prefix). So your first action will be checking whether the target column types are actually nvarchar (or nchar).
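For example, one way to perform that check is via INFORMATION_SCHEMA (the table and column names below are hypothetical):
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'myTable' AND COLUMN_NAME = 'text';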
Then, let's assume that all input strings are UTF-8 encoded in your PHP script. You want to execute something like
$stmt->bindParam(':text', $utf8_encoded_text);
According to the documentation, UTF-8 is the default string encoding. I hope it's smart enough to work with NVARCHAR, otherwise you may need to use the extra options.
Your colleague's solution doesn't store Unicode strings: it converts into the ISO-8859-1 space, then saves the bytes in simple CHAR or VARCHAR columns. The difference is that you won't be able to store characters outside of the ISO-8859-1 space (e.g. Polish).
Take a look at this article on "Handling Unicode Front to Back in a Web App". By far one of the best articles I've seen on the subject. If you follow the guide and the issues are still present, then you know for sure that it's not your fault.

Why are extended ASCII characters (â, é, etc.) getting replaced with <?> characters?

Why are extended ASCII characters (â, é, etc.) getting replaced with <?> characters?
I attached a pic... but I am using PHP to pull the data from MySQL, and some of these locations have extended characters... I am using the font Arial.
You can see the screen shot here: http://img269.imageshack.us/i/funnychar.png/
Still happening after the suggestions, here is what I did:
My Firefox (View -> Encoding) is set to UTF-8 after adding the line; however, the text inside the option tags is still showing the funny character instead of the actual accented one. What should I look for now?
UPDATE:
I have the following in the PHP program that is giving my those <?> characters...
ini_set( 'default_charset', 'UTF-8' );
And right after my zend db object creation, I am setting the following query:
$db->query("SET NAMES utf8;");
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
Also STATUS is reporting:
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/run/mysqld/mysqld.sock
Uptime: 4 days 20 hours 59 min 41 sec
Looking at the source of the page, I see
<option value="Br�l� Lake"> Br�l� Lake
OK- NEW UPDATE-
I changed everything in my PHP and HTML to:
<meta http-equiv="Content-Type" content="text/html; charset=latin1">
and
header('Content-Type: text/html; charset=latin1');
Now it works, what gives?? How do I convert it all to UTF-8?
That's what the browser does when it doesn't know the encoding to use for a character. Make sure you specify the encoding type of the text you send to the client either in headers or markup meta.
In HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
In PHP (before any other content is sent to the client):
header('Content-Type: text/html; charset=utf-8');
I'm assuming you'll want UTF-8 encoding. If your site uses another encoding for text, then you should replace UTF-8 with the encoding you're using.
One thing to note about using HTML to specify the encoding is that the browser will restart rendering a page once it sees the Content-Type meta tag, so you should include the <meta> tag immediately after the opening <head> tag in your page so the browser doesn't do any more extra processing than it needs.
Another common charset is "iso-8859-1" (Latin-1), which you may want to use instead of UTF-8. You can find more detailed info from this awesome article on character encodings and the web. You can also get an exhaustive list of character encodings here if you need a specific type.
If nothing else works, another (rare) possibility is that you may not have a font installed on your computer with the characters needed to display the page. I've tried repeating your results on my own server and had no luck, possibly because I have a lot of fonts installed on my machine so the browser can always substitute unavailable characters from one font with another font.
What I did notice by investigating further is that if text is sent in an encoding different than the encoding the browser reports as, Unicode characters can render unexpectedly. To work around this, I used the HTML character entity representation of special characters, so â becomes &acirc; in my HTML and é becomes &eacute;. Once I did this, no matter what encoding I reported as, my characters rendered correctly.
Obviously you don't want to modify your database to HTML encode Unicode characters. Your best option if you must do this is to use a PHP function, htmlentities(). You should use this function on any data-driven text you expect to have Unicode characters in. This may be annoying to do, but if specifying the encoding doesn't help, this is a good last resort for forcing Unicode characters to work.
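A minimal sketch of that last resort (the $location value here is just a stand-in for text pulled from your database, assumed to be UTF-8):
<?php
$location = 'Brûlé Lake'; // e.g. a value fetched from the database as UTF-8
echo htmlentities($location, ENT_QUOTES, 'UTF-8');
// prints: Br&ucirc;l&eacute; Lake - renders correctly whatever charset the page is served with
?>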
There is no such standard called "extended ASCII", just a bunch of proprietary extensions.
Anyway, there are a variety of possible causes, but it's not your font. You can start by checking the character set in MySQL, and then see what PHP is doing. As Dan said, you need to make sure PHP is specifying the character encoding it's actually using.
As others have mentioned, this is a character-encoding question. You should read Joel Spolsky's article about character encoding.
Setting
header('Content-Type: text/html; charset=utf-8');
will fix your problem if your php page is writing UTF-8 characters to the browser. If the text is still garbled, it's possible your text is not UTF-8; in that case you need to use the correct encoding name in the Content-Type header. If you have a choice, always use UTF-8 or some other Unicode encoding.
Simplest fix
ini_set( 'default_charset', 'UTF-8' );
This way you don't have to worry about manually sending the Content-Type header yourself.
EDIT
Make sure you are actually storing data as UTF-8 - sending non-UTF-8 data to the browser as UTF-8 is just as likely to cause problems as sending UTF-8 data as some other character set.
SELECT table_collation
FROM information_schema.`TABLES` T
WHERE table_name=[Table Name];
SELECT default_character_set_name
, default_collation_name
FROM information_schema.`SCHEMATA` S
WHERE schema_name=[Schema Name];
Check those values
There are two transmission encodings, PHP<->browser and Mysql<->PHP, and they need to be consistent with each other. Setting up the encoding for Mysql<->PHP is dealt with in the answers to the questions below:
Special characters in PHP / MySQL
How to make MySQL handle UTF-8 properly
php mysql character set: storing html of international content
The quick answer is "SET NAMES UTF8".
The slow answer is to read the articles recommended in the other answers - it's a lot better to understand what's going on and make one precise change than to apply trial and error until things seem to work. This isn't just a cosmetic UI issue, bad encoding configurations can mess up your data very badly. Think about the Simpsons episode where Lisa gets chewing gum in her hair, which Marge tries to get out by putting peanut butter on.
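A minimal sketch of keeping the two legs consistent, assuming UTF-8 end to end and PDO as shown in the answers above:
<?php
header('Content-Type: text/html; charset=utf-8');                      // PHP <-> browser
$db = new PDO('mysql:host=localhost;dbname=x;charset=utf8', 'x', 'x'); // MySQL <-> PHP (has the effect of SET NAMES utf8)
?>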
You should encode all special chars into HTML entities instead of depending on the charset.
htmlentities() will do the work for you.
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
If your original data was latin1, then inserting it into a UTF-8 database won't convert it to UTF-8, AFAIK, it will insert the same data but now believe it's UTF-8, thus breaking.
If you've got a SQL dump, I'd suggest running it through a tool to convert to UTF-8. Notepad++ does this pretty well - simply open the file, check that the accented characters are displaying correctly, then find "convert to UTF-8" in the menu.
These special characters generally appear when the page's encoding is not declared. Providing a meta tag with charset=utf-8 eliminates them; add
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
to your page's meta tags.
