Problems with utf-8 encoding in php - php

Another utf-8 related problem I believe...
I am using PHP to update data in a MySQL db and then display that data elsewhere on the site. I have run into UTF-8 problems before, where special characters were displayed as question marks when viewed in a browser, but this one seems slightly different.
I have a number of records to enter that contain the è character. If I enter this directly in the db then it appears correctly on the page so I take this to mean that utf-8 content is being output correctly.
However, when I try to update the values in the db through PHP, the è character is replaced. What appears instead is & Atilde ; & uml ; (without the spaces), which appears in the browser as Ã¨
I have the tables in the database set to use UTF-8. I believe this is correct because, as mentioned, if I update the db through phpMyAdmin, it's all OK. Similarly, I have set the character encoding for the page, which seems to be correct. I am also running the SQL statement "SET NAMES 'utf8';" before trying to update the db.
Anyone have any other ideas as to where the problem may lie?
Many thanks

Yup.
The character you have is LATIN SMALL LETTER E WITH GRAVE. As you can see, in UTF-8 that character is encoded into two bytes 0xC3 and 0xA8.
But in many default, western encodings (such as ISO-8859-1) which are single-byte only, this multi-byte character is decoded as two separate characters, LATIN CAPITAL LETTER A WITH TILDE and DIAERESIS. Notice how they are both encoded as C3 and A8 in ISO-8859-1?
Furthermore, it looks like PHP is processing these characters through htmlentities(), which results in &Atilde; and &uml; respectively.
So, where exactly is the problem in your code? Well, htmlentities() could be doing it all by itself, since its 3rd argument is an encoding name - which you may not have properly set to 'UTF-8'. But it could be some other string processing function as well. (Note: As a general rule, it's a bad idea to store HTML entities in the database - this step should be reserved for time of display)
There are a bunch of other ways to trip yourself up with UTF-8 in PHP - I suggest hitting up the cheatsheet and making sure you're in good shape.
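
To see the effect described above in isolation, here is a minimal sketch that uses only core PHP functions. It feeds the two UTF-8 bytes of è to htmlentities() twice, once with the wrong encoding argument and once with the right one. The variable names are made up for illustration, and this is not the asker's actual code.

```php
<?php
// "è" (U+00E8) written as its two raw UTF-8 bytes.
$eGrave = "\xC3\xA8";

// Declaring the input as ISO-8859-1 makes htmlentities() treat each byte
// as a separate character: 0xC3 is "Ã" and 0xA8 is "¨" in that encoding.
$wrong = htmlentities($eGrave, ENT_QUOTES, 'ISO-8859-1');

// Declaring the real encoding keeps both bytes together as one character.
$right = htmlentities($eGrave, ENT_QUOTES, 'UTF-8');

echo bin2hex($eGrave), "\n"; // c3a8
echo $wrong, "\n";           // &Atilde;&uml;
echo $right, "\n";           // &egrave;
```

If the first output matches what ends up in your table, an htmlentities() call with a missing or wrong encoding argument is a likely suspect.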

Well, it is your own code that converts characters into entities.
To make it right:
Ban the htmlentities function from your scripts forever.
Use htmlspecialchars, not on insert but when displaying data.
Repair existing data in the database using html_entity_decode.
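
The steps above could look roughly like the sketch below. Both function names are hypothetical. The repair step assumes the stored entities came from UTF-8 bytes that were misread as ISO-8859-1, as in the question, so decoding back to ISO-8859-1 restores the original bytes. Data damaged in some other way would need a different approach.

```php
<?php
// Hypothetical helper: undo entities that were created from misread UTF-8 bytes.
function repair_entities(string $stored): string
{
    // Decoding to ISO-8859-1 turns each entity back into the single byte it
    // came from, so "&Atilde;&uml;" becomes the bytes C3 A8, i.e. "è" in UTF-8.
    return html_entity_decode($stored, ENT_QUOTES, 'ISO-8859-1');
}

// Hypothetical helper: escape only at display time, never before storing.
function e(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

$fixed = repair_entities('cr&Atilde;&uml;me');
echo $fixed, "\n";                 // crème
echo e('<b>' . $fixed . '</b>'), "\n"; // &lt;b&gt;crème&lt;/b&gt;
```

Try the repair on a copy of the table first and compare a few rows by hand before updating the real data.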

I suppose you're taking the results of some form submission and inserting the results in the database. If so, you must ensure that you instruct the browser to send UTF-8 data and you should validate the user input for a valid UTF-8 stream.
Change your form element to include accept-charset:
<form accept-charset="utf-8" method="post" ... >
<input type="text" name="field" />
...
</form>
Validate the data with:
$valid = array_key_exists("field", $_POST) && !is_array($_POST['field']) &&
preg_match('//u', $_POST['field']) && ...; //check length with mb_strlen etc.
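
The validation above can be wrapped in a small function. This is only a sketch: the function name, the parameter names and the length limit are assumptions, and it counts code points with preg_match_all so that it does not depend on the mbstring extension.

```php
<?php
// Hypothetical helper: true when $input[$key] is a string of valid UTF-8
// that is at most $maxChars code points long.
function valid_utf8_field(array $input, string $key, int $maxChars): bool
{
    if (!array_key_exists($key, $input) || !is_string($input[$key])) {
        return false;
    }
    // With the /u modifier, preg_match() returns false on malformed UTF-8.
    if (preg_match('//u', $input[$key]) !== 1) {
        return false;
    }
    // Count code points rather than bytes ("/s" lets "." match newlines too).
    return preg_match_all('/./us', $input[$key]) <= $maxChars;
}

var_dump(valid_utf8_field(['field' => "caf\xC3\xA8"], 'field', 10)); // bool(true)
var_dump(valid_utf8_field(['field' => "caf\xE8"], 'field', 10));     // bool(false)
```

The second call fails because a lone 0xE8 byte is è in ISO-8859-1 but is not a valid UTF-8 sequence.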

I think you miss Content-Type declaration on the html page:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If you don't have it, the browser will guess the encoding, and convert any characters outside of that encoding to entities when posting a form.

Related

Special characters from forms using as variables on the same page

I've run into an issue with special characters being used as variables.
Here's my HTML that has something to do with characters:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
<form class="nav-search left" method="POST" enctype="application/x-www-form-urlencoded">
<input name="summoner_name" type="text" placeholder="Summoner Name" required autofocus />
</form>
Here's the PHP which is trying to get the $_POST:
$summoner = htmlspecialchars($_POST['summoner_name']);
$summoner_name = strtolower($summoner);
-> outputs nothing, as it isn't read properly.
Letters such as Śý will be used, and I think these are from extended-latin
Using this walkthrough as a jumping off point, I would say:
First
...verify that your application is emitting the Content-Type: text/html; charset=utf-8 header. There's really no reason not to, and it's really simple.
Next
...the first place you may have a real issue is your form. Again referencing the above link, the encoding to be used for application/x-www-form-urlencoded data is practically undefined.
Make sure your form is using the attribute accept-charset set to "utf-8" (accept-charset="utf-8").
One More
...If you aren't using a database in here, then there's nothing to do there, but if you ever store this data in a database, that's almost definitely where the mis-encoding is happening.
There's a lot that could go wrong in the database - from connecting, to how the data is stored, to how the data is retrieved. If you are using a database at all to store this data before it is sent back to the browser, take extra care to make sure the database is properly encoding and retrieving UTF-8 text.
Finally
...strtolower doesn't convert characters not represented by your locale. If your locale is, say, en-US, your special UTF-8 characters will not be converted, and you'll get an empty string (if there aren't any other characters).
If you were to use mb_strtolower, you could pass an encoding like: mb_strtolower($summoner, 'UTF-8');
This properly handles special characters.
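
A minimal sketch of that last point, assuming the mbstring extension is loaded and the input really is UTF-8. The wrapper name is made up for illustration.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
function lower_utf8(string $text): string
{
    // The explicit encoding avoids depending on the configured default.
    return mb_strtolower($text, 'UTF-8');
}

echo lower_utf8("ŚÝ Name"), "\n"; // śý name
```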

Converting odd character encoding back to utf-8

I have a database full of strings containing strange characters such as:
Design Tattoo Ãœbungshaut
MehrflÃ¤chiges Biozid Reinigungs- & Desinfektionsmittel
Where the Ãœ and Ã¤ should be, as I understand, an Ü and à when in proper UTF-8.
Is there a standard function to revert these multiple characters back to their proper UTF-8 form?
In PHP I have come across $url = iconv('utf-8', 'iso-8859-1', $url); which seems to get close but falls short. Perhaps I have the wrong parameters, but in any case I was just wondering how well this issue is known and if there is an established fix?
The original data was taken from the eCommerce system CubeCart which seems to have no problem converting it back to normal text FYI.
The data shown as example is UTF-8 encoded data mistakenly interpreted as ISO-8859-1 (or windows-1252). The problem combinations are in fact “Ü” and “ä” (“Ā” does not appear in German). So apparently what you need to do is to read the data as UTF-8 and display it that way, instead of converting it.
If the database and output are UTF-8, it could be because you're not using UTF-8 as the client character set.
If you're using mysqli you can use set_charset or run SET NAMES utf8 as a query before fetching data.
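
If the connection settings are correct and the stored bytes themselves turn out to be double-encoded, one possible repair is sketched below. It assumes the text was misread as Windows-1252 rather than ISO-8859-1, which would explain why the iconv call in the question falls short: the œ in Ãœ exists in Windows-1252 but not in ISO-8859-1. The function name is hypothetical, and a few characters have no Windows-1252 byte at all, so some strings may not be recoverable this way.

```php
<?php
// Hypothetical helper: reverse one round of "UTF-8 read as Windows-1252".
function fix_double_encoding(string $text): string
{
    // Map each character back to the single Windows-1252 byte it was misread from.
    $bytes = @iconv('UTF-8', 'CP1252', $text);

    // Keep the original if the conversion failed or the result is not valid
    // UTF-8, which is what happens for text that was never double-encoded.
    if ($bytes === false || preg_match('//u', $bytes) !== 1) {
        return $text;
    }
    return $bytes;
}

echo fix_double_encoding("Design Tattoo Ãœbungshaut"), "\n"; // Design Tattoo Übungshaut
echo fix_double_encoding("MehrflÃ¤chiges Biozid"), "\n";     // Mehrflächiges Biozid
```

Pure ASCII strings pass through unchanged, so the sketch can be run over a whole column, though testing on a copy first is advisable.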

HTML - Mixing UTF-8 coming from MySQL database and special chars into HTML

I have a database where everything is defined in UTF-8 (charsets, collations, ...).
I have a PHP page that gets data from that database and displays it.
That PHP page contains some hard-coded text with special characters, like é, à, ...
My PHP page has meta charset defined to utf-8.
I call mysql_set_charset("utf8");
My PHP page is written on an editor that is configured to encode to utf-8 Unicode (Dreamweaver CS4, there is no other utf-8 option)
Anything coming from the database is ok, but...
I can't get the hard-coded special characters (é, à, ù, ...) to display correctly.
I have the same problem when I use strip_tags(html_entity_decode($datafromdatabase)); on data coming from the database. Here it's really problematic.
What can I do to keep using UTF-8 while displaying the special chars correctly, without having to use their HTML equivalents (&eacute;, &agrave;, ...)?
EDIT
The problem with the hard-coded characters came from the PHP page not being saved with the right encoding. I created a new document, copied and pasted the old code into it, and saved it over the old page. No more problems with hard-coded characters.
But I still have problems with strip_tags(html_entity_decode($datafromdatabase));
Using $datafromdatabase = htmlentities(strip_tags(html_entity_decode($datafromdatabase)), ENT_COMPAT, "UTF-8") does not solve the problem. I get strange characters starting with # for each é, à, ù in the text coming from the database (stored as &eacute;, ...)
It looks like the problem is with your browser displaying the characters properly rather than with saving them.
Check two things.
Issue a utf8 http header
header( 'Content-Type: text/html; charset=UTF-8' );
And make sure your html declaration is mentioning utf8
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
That's for html 4
If your document is properly encoded, this should do it.
The problem with the hard-coded characters came from the PHP page not being saved with the right encoding. I created a new document, copied and pasted the old code into it, and saved it over the old page. No more problems with hard-coded characters.
For the problem coming from strip_tags(html_entity_decode($datafromdatabase)); I had in fact to use strip_tags(html_entity_decode($datafromdatabase, ENT_QUOTES, "UTF-8"));
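
A short sketch of that fix, with a made-up function name and sample string. Passing the encoding explicitly means html_entity_decode() produces UTF-8 characters regardless of the default charset of the PHP version in use.

```php
<?php
// Hypothetical helper: decode stored entities to UTF-8, then drop any tags.
function plain_text(string $stored): string
{
    $decoded = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');
    return strip_tags($decoded);
}

echo plain_text('<p>caf&eacute; &agrave; la carte</p>'), "\n"; // café à la carte
```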

There are symbols like  and so on in database, what to do?

I have a few symbols in my description like  ⠀ and so on. Can I do anything about it? Or, if they're already in the database, is there nothing I can do now?
It sort of depends what the problem actually is...
If it's that those characters are supposed to be there (such as "Mañana" in Spanish) then you'll need to ensure everything is in UTF-8... the best way is to:
1: check the database tables are in "utf-8" encoding (if not convert them to utf-8)
2: as Martin noted, ensure the database connector is utf-8 using something like:
mysql_set_charset('utf8'); //note that MySQL uses no hyphen here
3: ensure the document is utf-8 (you can add a header at the top)
<?php header('Content-type:text/html;charset=utf-8'); ?>
4: just to be on the safe side, add it in as a meta tag as well
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
HOWEVER
It's quite possible you've got some duff characters in the database where something like ISO-8859-1 has been juggled to UTF-8, badly. In this case you'll notice things like Â£ where what you actually want is £ (because UTF-8 characters contain more data than ISO-8859-1 characters, that extra data can become an additional character if you're not careful).
In which case your best bet is to clean the database (you could probably do something like UPDATE table SET field = REPLACE(field, 'Â£', '£') for common "errors") and then convert the whole kaboodle to UTF-8 (as outlined above) to avoid the problem recurring.
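
The same targeted clean-up could be done on the PHP side before writing rows back. The sketch below is only an illustration: the map and the function name are hypothetical, and the map lists just a few sequences, so it would need extending for whatever "errors" actually occur in your data.

```php
<?php
// Hypothetical map of common mis-encoded sequences to the intended characters.
const MOJIBAKE_MAP = [
    "Â£" => "£",
    "Ã©" => "é",
    "Ã¨" => "è",
];

// Hypothetical helper: replace every known bad sequence in one pass.
function replace_common_errors(string $text): string
{
    return strtr($text, MOJIBAKE_MAP);
}

echo replace_common_errors("Price: Â£5 for the cafÃ©"), "\n"; // Price: £5 for the café
```

The source file has to be saved as UTF-8 for the literal strings in the map to match.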
To avoid having such characters,
Set the charset for your form. HTML forms have an accept-charset attribute; you can set it to UTF-8
Set the charset for the document, via PHP or using META tags (but this only works on the output)
Set the charset for the db table
Get a class/function to do ASCII character conversion as part of your data filtering and escaping

Google Autocompleter and character encoding

I am using this autocompleter from Google
http://code.google.com/p/jquery-autocomplete/ (if you click on "Source" you can find all the source files for the script)
and everything is working fine, except it's having problems with special Croatian characters (like č, ć, ž etc. I'm not sure if you'll see these, so here's an idea of what I am talking about: link - the letter c with a hachek on top etc.)
Here's the setup:
an html file points to a jquery autocomplete script and a php file with the results array
the metadata for the html file has a charset of utf-8, no other pages have any kind of encoding at all
the array in the php file has those special characters encoded with HTML codes (the letter "ž" is replaced with its HTML entity code, so a typical array element is "Požega" with the ž written as an entity => "5")
when I enter a search string into the input field, the returned results are displayed correctly - Požega etc. - but when I click the result to accept it, it enters the raw entity code instead of the ž into the input field, which is obviously not what I want
when my search string has a special letter in it, the script doesn't find anything
How do I fix this? Should I just replace the HTML special codes in the array with the actual special letters(it seems to work fine then, but I'm not sure whether everybody will see this as I intended)? If not, how do I set the character encoding on all pages so the special letters display correctly on the input field and they're searchable?
Thanks for the help!
Character encoding is such a pain in the ass with browsers. There are several things you can do to cover your bases, one of which you've already done.
Set the meta tag to indicate a charset of UTF-8
Use .htaccess to define a charset of UTF-8
Use PHP to define a charset of UTF-8 in the header (something like: header('Content-Type: text/html; charset=UTF-8');)
Making sure these are true should ensure that the data shows up on all UTF-8 supported browsers. By the way, I can see the special characters, so you must be doing something right. :)
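
On the question of replacing the HTML codes in the array with the actual letters, one way to do it without retyping the data is sketched below. The array contents are hypothetical, and the numeric entity &#382; for ž is an assumption, since the entity actually used in the question is not shown. Once the page is served as UTF-8 as described above, the decoded names can be sent as they are.

```php
<?php
// Hypothetical data in the shape the question describes.
$places = ['Po&#382;ega' => '5', 'Zagreb' => '1'];

// Rebuild the array with real UTF-8 characters as keys.
$decoded = [];
foreach ($places as $name => $id) {
    $decoded[html_entity_decode($name, ENT_QUOTES, 'UTF-8')] = $id;
}

print_r(array_keys($decoded)); // the first key is now "Požega"
```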
