Best method of converting user input to UTF-8 - php

I'm building a PHP web application, and it works in UTF-8: the database is UTF-8, the pages are served as UTF-8, and I set the charset to UTF-8 with a meta tag. Of course, with users on Internet Explorer copying and pasting from Microsoft Office, I still occasionally end up with non-UTF-8 input.
The ideal solution would be to throw an HTTP 400 Bad Request error, but obviously I can't do that. The next best thing is converting $_GET, $_POST and $_REQUEST to UTF-8. Is there any way to detect what character encoding the input is in so I can pass it to iconv? If not, what's the best solution for doing this?

Check out mb_detect_encoding(). Example:
$utf8 = iconv(mb_detect_encoding($input), 'UTF-8', $input);
There's also utf8_encode() if you can guarantee that the input string is ISO-8859-1.
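If the goal is to normalize $_GET, $_POST and $_REQUEST wholesale, a small recursive wrapper built on those functions could look like the sketch below. The helper name to_utf8 and the candidate encoding list are assumptions; tune the list to the encodings your users actually send.

function to_utf8($value) {
    if (is_array($value)) {
        // recurse into nested arrays (e.g. checkbox groups)
        return array_map('to_utf8', $value);
    }
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave it untouched
    }
    // guess among a short, explicit list; a bare mb_detect_encoding() call is unreliable
    $from = mb_detect_encoding($value, array('UTF-8', 'Windows-1252', 'ISO-8859-1'), true);
    return iconv($from ?: 'ISO-8859-1', 'UTF-8//IGNORE', $value);
}

$_GET     = to_utf8($_GET);
$_POST    = to_utf8($_POST);
$_REQUEST = to_utf8($_REQUEST);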

In some cases, using just utf8_encode or a generic check is OK, but you might lose some characters from the string. If you can build a small list of likely encodings (this example targets Windows code pages), you can salvage quite a bit more.
if (!mb_detect_encoding($fileContents, "UTF-8", true)) {
    // not valid UTF-8: check against a list of likely Windows code pages
    $checkArr = array("windows-1252", "windows-1251");
    $encodeString = '';
    foreach ($checkArr as $encode) {
        if (mb_check_encoding($fileContents, $encode)) {
            $encodeString .= $encode . ",";
        }
    }
    // drop the trailing comma; mb_convert_encoding accepts a comma-separated list of candidates
    $encodeString = substr($encodeString, 0, -1);
    $fileContents = mb_convert_encoding($fileContents, "UTF-8", $encodeString);
}

Related

UTF-8 data received by php isn't decoded

I'm having some trouble with my $_POST/$_REQUEST data; it appears to still be utf8_encoded.
I am sending conventional Ajax POST requests under these conditions:
oXhr.setRequestHeader("Content-type", "application/x-www-form-urlencoded; charset=utf-8");
JS file saved in UTF-8 (no BOM) format
meta tags set up in the HTML <head> tag
PHP files saved in UTF-8 (no BOM) format as well
encodeURIComponent is used, but I tried without it and it gives the same result
OK, so everything seems fine: the database is also in UTF-8, receives the data that way, and the pages display well.
But when I receive the character "º" for example (through $_REQUEST or $_POST), its binary representation is 11000010 10111010, while "º" hardcoded in PHP (UTF-8...) has the binary representation 10111010 only.
Wtf? I just don't know whether it is a good thing or not... for instance, if I use "#º#" as a delimiter for PHP's explode function, it won't get detected, and this is actually the problem which led me here.
Any help will be as usual greatly appreciated, thank you so much for your time.
Best rgds.
EDIT1: checking against mb_check_encoding
if (mb_check_encoding($_REQUEST[$i], 'UTF-8')) {
    raise("$_REQUEST is encoded properly in utf8 at index " . $i);
} else {
    raise(false);
}
The encoding was confirmed; the message was raised properly.
Single-byte UTF-8 characters do not have bit 7 (the eighth bit) set, so 10111010 is not UTF-8 on its own; your file is probably encoded in ISO-8859-1.
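To see the difference directly, here is a small illustrative snippet of ours that prints the bits of "º" in both encodings:

// UTF-8 "º" is the two bytes 0xC2 0xBA; ISO-8859-1 "º" is the single byte 0xBA
foreach (str_split("\xC2\xBA") as $byte) {
    echo sprintf("%08b ", ord($byte)); // prints: 11000010 10111010
}
echo PHP_EOL . sprintf("%08b", ord("\xBA")); // prints: 10111010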

Error in encoding mysql -> How can I reconvert it to something else?

I started a website some time ago using the wrong CHARSET in my DB and site. The HTML was set to ISO..., the DB to Latin..., and the pages were saved as Western Latin... a big mess.
The site is in French, so I created a function that replaced all accented characters like "é" with their HTML entities ("&eacute;"), which solved the issue temporarily.
I have since learned a lot more about programming, and now my files are saved as Unicode UTF-8, the HTML is in UTF-8, and my MySQL table columns are set to utf8 encoding...
I tried to convert the entities back to the accented characters ("é" instead of "&eacute;"), but I get the usual charset issues with "?" or weird characters like "â", both in MySQL and when the page is displayed.
I need to find a way to update my SQL through a function that cleans the strings so everything can finally go back to normal. At the moment my function looks like this, but it doesn't work:
function stripAcc3($value){
    $ent = array(
        '&agrave;'=>'à',
        '&acirc;'=>'â',
        '&ugrave;'=>'ù',
        '&ucirc;'=>'û',
        '&eacute;'=>'é',
        '&egrave;'=>'è',
        '&ecirc;'=>'ê',
        '&ccedil;'=>'ç',
        '&Ccedil;'=>'Ç',
        "&icirc;"=>'î',
        "&Iuml;"=>'ï',
        "&ouml;"=>'ö',
        "&ocirc;"=>'ô',
        "&euml;"=>'ë',
        "&uuml;"=>'ü',
        "&Auml;"=>'ä',
        "&euro;"=>'€',
        "&prime;"=> "'",
        "&eacute;"=> "é"
    );
    return strtr($value, $ent);
}
Any help welcome. Thanks in advance. If you need code, please tell me which part.
UPDATE
If you want the bounty points, I need detailed instructions on how to do it. Thanks.
Try using the following function instead; it should handle all the issues you described:
function makeStringUTF8($data)
{
    if (is_string($data) === true)
    {
        // has HTML entities?
        if (strpos($data, '&') !== false)
        {
            // if so, revert them back to normal characters
            $data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');
        }

        // make sure it's UTF-8
        if (function_exists('iconv') === true)
        {
            return @iconv('UTF-8', 'UTF-8//IGNORE', $data);
        }
        else if (function_exists('mb_convert_encoding') === true)
        {
            return mb_convert_encoding($data, 'UTF-8', 'UTF-8');
        }

        return utf8_encode(utf8_decode($data));
    }
    else if (is_array($data) === true)
    {
        $result = array();

        foreach ($data as $key => $value)
        {
            $result[makeStringUTF8($key)] = makeStringUTF8($value);
        }

        return $result;
    }

    return $data;
}
Regarding specific instructions on how to use this, I suggest the following:
export the contents of your old Latin database (I hope you still have it) as an SQL/CSV dump *
use the above function on the file contents and save the result to another file
import the file you generated in the previous step into the UTF-8 aware schema / database
* Example:
file_put_contents('utf8.sql', makeStringUTF8(file_get_contents('latin.sql')));
This should do it; if it doesn't, let me know.
You might want to investigate what is used to fix WP database encoding issues:
http://codex.wordpress.org/Converting_Database_Character_Sets
To cut a long story short, most old WP sites were created with Swedish/Latin1-collated tables, which were then used to store UTF-8 strings. To convert the tables properly, the approach is to change each column to a binary type first, and then change it to UTF-8 text.
This avoids the text getting mangled when converting from Latin1 to UTF-8 directly.
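As an illustration of that two-step conversion, here is a rough sketch. The table and column names are invented, the statements are issued from PHP via mysqli, and you should back up the database and follow the codex page above for the exact procedure on a real WordPress schema.

// hypothetical example: a latin1 TEXT column that actually holds UTF-8 bytes
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

// step 1: go through a binary type so MySQL forgets the old character set but keeps the bytes
$db->query("ALTER TABLE posts MODIFY content BLOB");

// step 2: reinterpret the same bytes as UTF-8 text
$db->query("ALTER TABLE posts MODIFY content TEXT CHARACTER SET utf8");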
You will need to convert the offending rows using, for example, iconv. The challenge will be knowing which rows are already UTF-8 and which are Latin-1.
I'm not completely sure I understand your question, but
if you have
a UTF-8 database
all special characters in there stored as HTML entities
then a
html_entity_decode($string, ENT_QUOTES, "UTF-8");
should do the trick and turn all entities back into their UTF-8 native characters.
Make sure that not just your tables use UTF-8; your database connection should use UTF-8 as well.
$this->db = mysql_connect(MYSQL_SERVER,DB_LOGIN,DB_PASS);
mysql_set_charset ('utf8',$this->getConnection());
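Putting those two pieces of advice together, a one-off cleanup script could look roughly like the sketch below. It is only an illustration: the table, column and credential names are invented, and it uses mysqli rather than the old mysql_* calls shown above.

$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->set_charset('utf8'); // talk to MySQL in UTF-8

$stmt = $db->prepare("UPDATE articles SET body = ? WHERE id = ?");
$rows = $db->query("SELECT id, body FROM articles");
while ($row = $rows->fetch_assoc()) {
    // turn stored entities (&eacute; ...) back into real UTF-8 characters
    $clean = html_entity_decode($row['body'], ENT_QUOTES, 'UTF-8');
    $stmt->bind_param('si', $clean, $row['id']);
    $stmt->execute();
}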
If you want to talk to your database in UTF-8, you have to tell the database that the connection is a UTF-8 flow. You have to send the following statement before the other queries you make to the database:
"SET NAMES utf8";
Personally, I put that in the connect.inc.php file which creates the connection to the database. With this statement the database knows that you are sending UTF-8 encoded strings, and it works perfectly!
The mysql_set_charset function didn't work well for me; I tried it in the past, but the truth is that it didn't do the trick.
For your broader issue: if you want to convert a latin1 string to UTF-8, you first have to convert the latin1 string to a binary string format, and then convert the binary string into a UTF-8 string; all of this can be done inside the database with database commands. See this article (in French): http://www.noidea.ca/2009/06/15/comment-convertir-une-db-de-latin1-a-utf8/
I can tell you that this method works, because I used it to transform data from a database I created.
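For reference, issuing that statement right after connecting might look like this with the legacy mysql_* extension used elsewhere in this thread (it was removed in PHP 7, so treat this purely as a sketch; the connection details are placeholders):

$link = mysql_connect('localhost', 'user', 'pass');
mysql_select_db('mydb', $link);
mysql_query("SET NAMES utf8", $link); // every string exchanged on this connection is now treated as UTF-8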

PHP - how to detect encoding?

I'm using Amazon's API to obtain the description of books. The API returns XML responses and the description is marked up (with HTML) very poorly. To deal with this poorly marked up description, which oftentimes breaks the layout of my site, I'm trying to use HTML Tidy to "clean it up."
In order to prevent "weird" characters from being displayed on my web page, I think I need to tell Tidy what the input encoding is and what the desired output encoding is. I know I want the output to be UTF8. However, I'm not sure how to determine the encoding of the input (Amazon's book description).
I've tried something like this:
mb_detect_encoding($amazon_description);
It's helped, but I'm still occasionally getting weird characters (a black diamond with a question mark in it: �). My guess is that I'm not detecting the encoding properly.
Any suggestions what I need to do?
EDIT:
This is my current solution:
$sanitized_amazon_markup = preg_replace('/[^\w`~!@#$%^&*()-=_+[\]{}|;\':",.\/<>? ]/', '', $sanitized_amazon_markup);
I'm not sure about this, as it may delete stuff that I should be keeping.
Can you provide your tidy repairString call?
If you tried to use input-encoding and output-encoding from the Tidy options, try not using those and use the third argument of repairString instead, something like this:
$oTidy = new tidy();
$page_content = $oTidy->repairString($page_content,
    array("show-errors" => 0, "show-warnings" => false),
    "utf8"
);
Edit:
After doing some tests, what I said before cannot work if $page_content is not already UTF-8 encoded before calling repairString.
But you will most likely end up with ISO-8859-1 (Latin1) encoding if it is not already UTF-8.
May I suggest you try :
$charset = mb_detect_encoding($amazon_description, 'UTF-8, ISO-8859-1');

if ($charset == "ISO-8859-1") {
    $amazon_description = utf8_encode($amazon_description);
}

$oTidy = new tidy();
$amazon_description = $oTidy->repairString($amazon_description,
    array("show-errors" => 0, "show-warnings" => false),
    "utf8"
);

POST from Flash (AS2) to PHP, outputs ??? when non-english characters are used

I am trying to use POST in Flash (ActionScript 2) to POST values to a PHP mail script.
I tried the PHP mail script with an HTML form, and it worked perfectly fine.
But when I POST from Flash and input non-English characters, I get "????" in the mail.
I tried utf8_encode($_POST["name"]), but it doesn't help.
Edit:
I also tried utf8_decode($_POST["name"]); it didn't work either.
Update (so you won't have to go through all the comments):
I checked the variables in Flash; the values are stored correctly.
The HTML page where the Flash is embedded is UTF-8 encoded.
I watched the POST headers with Firebug; the POST itself is already messed up, showing "????" instead of the real value.
The messed-up "????" value is then url-encoded by Flash and decoded by PHP, resulting in $_POST["name"] == "???";
I suspect it's the sendAndLoad method that creates the mess.
Update:
Here is the flash code:
System.useCodepage = true;

send_btn.onRelease = function() {
    my_vars = new LoadVars();
    my_vars.email = email_box.text;
    my_vars.name = name_box.text;
    my_vars.family_box = comment.text;
    my_vars.phone = phone_box.text;
    if (my_vars.email != "" and my_vars.name != "") {
        my_vars.sendAndLoad("http://aram.co.il/ido/sendMail.php", my_vars, "POST");
        gotoAndStop(2);
    } else {
        error_clip.gotoAndPlay(2);
    }
    my_vars.onLoad = function() {
        gotoAndStop(3);
    };
};

email_box.onSetFocus = name_box.onSetFocus = message_box.onSetFocus = function () {
    if (error_clip._currentframe != 1) {
        error_clip.gotoAndPlay(6);
    }
};
Flash uses UTF-8 encoding for all strings anyway. If you use LoadVars, the transfer as a URL-encoded string should also work automatically.
So your problem is most probably in the PHP part of your application. For example, for UTF-8 to work correctly, all individual PHP files must themselves be saved in UTF-8 encoded format.
If just changing the file encoding doesn't work, try inspecting $HTTP_RAW_POST_DATA first, check whether all the fields have been transferred correctly, then echo your way through until you find the place where the encoding is lost.
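A quick way to do that kind of inspection on the PHP side might look like the snippet below (just a debugging sketch; on current PHP versions $HTTP_RAW_POST_DATA no longer exists, so the raw body is read from php://input instead):

// dump the raw request body and the hex bytes of each POST field
$raw = file_get_contents('php://input');
echo "raw body: " . $raw . "\n";

foreach ($_POST as $field => $value) {
    echo $field . ": " . $value . " [" . bin2hex($value) . "]\n";
}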
Update:
Here is your problem: you use System.useCodepage = true;. This requires you to specifically encode all your data as Unicode before sending it. Unless you have other documents in other encodings, and/or allow your users to upload their own text data in their localized encodings, set System.useCodepage = false;, and your UTF-8 problem should go away.
If you receive data from Flash, you need to use utf8_decode and not utf8_encode.
Flash uses UTF-8, as long as you don't tell it to use the local character set. And you want PHP to decode that to good old ISO-8859-1, which PHP uses internally.
You'd only use utf8_encode when preparing data for Flash.
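In other words, something along these lines on the receiving end (a sketch using the field names from the Flash code above, and only if the rest of your pipeline really is ISO-8859-1; if everything downstream is UTF-8, skip the decode entirely):

// Flash sends UTF-8; convert to ISO-8859-1 only if the mail script expects Latin-1
$name  = utf8_decode($_POST["name"]);
$email = utf8_decode($_POST["email"]);
$phone = utf8_decode($_POST["phone"]);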

how to avoid encoding problems using ajax-json and php-postgresql

I have problems with encoding when using JSON and Ajax.
Chrome and IE encode umlauts as Unicode escapes when converting to JSON; Firefox and Safari return UTF-8 escaped umlauts like ¼Ã.
Where is the best place to give everything the same encoding?
In JS, in PHP on GET, or when writing the values to the database?
And I think the next trouble is: when I reload the UTF-8 encoded data from the DB, write it to the browser and then send it back to the DB again via an Ajax request, do I get real chaos?
Can I avoid the chaos? Can I handle the encoding in an easy way?
Please help :-)
It is also very important to provide security.
You must set everything to UTF-8 (see the sketch after this list); this means:
Database collation
Table collation
Field collation
The file encoding of your coding software (for example Notepad++)
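On the PHP side, a minimal pattern that keeps the whole Ajax round-trip in UTF-8 might look like this (a sketch only; json_encode expects its input to already be valid UTF-8, and it escapes non-ASCII characters as \uXXXX sequences, which every browser decodes correctly):

header('Content-Type: application/json; charset=utf-8');

// $rows is assumed to come from a database connection that is already set to UTF-8
echo json_encode(array('status' => 'ok', 'data' => $rows));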
Had a similar problem. Maybe you are actually interpreting the encoding the wrong way on the client side. Try setting the client encoding before your queries.
<?php
$connection = pg_pconnect("dbname=data");
pg_set_client_encoding($connection, "encoding goes here"); // check the encoding alternatives in the PostgreSQL docs
$result = pg_query($connection, "SELECT whatever FROM wherever");
// and so on...
?>
I'm a newbie, but it may help. Also, it won't affect security in any way if you are already protected against DB injection.
Cheers
