PHP imap weird encoding characters

I'm using PHP to get my gmail email messages. It gives me email titles which look like this:
=?ISO-8859-13?Q?Darba_s=E2k=F0ana_ar_Gmail?=
It should actually look like this (those are Latvian characters normally available in utf8):
Darba sākšana ar Gmail
I tried:
utf8_encode(quoted_printable_decode(
"=?ISO-8859-13?Q?Darba_s=E2k=F0ana_ar_Gmail?="
));
And it gives me the following, which is not correct:
=?ISO-8859-13?Q?Darba_sâkðana_ar_Gmail?
How do I get this - Darba sākšana ar Gmail

Use the imap_mime_header_decode() function, which splits the header into elements and reports the charset of each one:
$text = "=?ISO-8859-13?Q?Darba_s=E2k=F0ana_ar_Gmail?=";
$elements = imap_mime_header_decode($text);
foreach ($elements as $element) {
echo "Charset: $element->charset\n";
echo "Text: $element->text\n\n";
}
You can then use iconv() to convert each element's text to UTF-8 (inside the loop):
iconv($element->charset, 'utf-8', $element->text);
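Putting the two steps together, a minimal sketch might look like the following. The helper name decode_mime_header_to_utf8 is hypothetical, and the code assumes the imap and iconv extensions are loaded. Segments that were not encoded are reported with the charset "default" and are passed through unchanged here.

```php
<?php
// Hypothetical helper: decode a MIME-encoded header and return it as UTF-8.
function decode_mime_header_to_utf8(string $header): string
{
    $result = '';
    foreach (imap_mime_header_decode($header) as $part) {
        if (strtolower($part->charset) === 'default') {
            // Unencoded segment: keep it as-is.
            $result .= $part->text;
            continue;
        }
        $converted = iconv($part->charset, 'UTF-8', $part->text);
        // Fall back to the raw text if iconv cannot convert from this charset.
        $result .= ($converted === false) ? $part->text : $converted;
    }
    return $result;
}

// Only run the demo when the imap extension is available.
if (function_exists('imap_mime_header_decode')) {
    echo decode_mime_header_to_utf8('=?ISO-8859-13?Q?Darba_s=E2k=F0ana_ar_Gmail?='), "\n";
}
```

Joining the converted parts without an extra separator avoids the stray spaces that appear when a space is appended after every element.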

As Piotr answered, you need to use imap_mime_header_decode(). I'm assuming that you are using imap_headerinfo() with fromaddress. This is what I did:
if (strpos($header->fromaddress, '?=') !== false) {
$wrong_charset = imap_mime_header_decode($header->fromaddress);
$corrected_charset = '';
for ($i = 0; $i < count($wrong_charset); $i++) {
$corrected_charset .= "{$wrong_charset[$i]->text} ";
}
}
else {
$corrected_charset = $header->fromaddress;
}
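If the imap extension is not available, the iconv extension offers a different route. This is a swapped-in technique, not what the answers above use: iconv_mime_decode() decodes the header and converts it to the requested charset in one call. A short sketch, assuming the iconv extension is loaded and knows ISO-8859-13:

```php
<?php
$header = '=?ISO-8859-13?Q?Darba_s=E2k=F0ana_ar_Gmail?=';

// The third argument is the charset the decoded result should be returned in.
$subject = iconv_mime_decode($header, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, 'UTF-8');

echo $subject, "\n";
```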

Related

Make PHPBB 3.0.14 and ABBC3 compatible with PHP 7.3

I'm trying to make ABBC3 work with PHP 7.3 and PHPBB 3.0.14, since I can't move to PHPBB 3.3 due to lots of issues with MODs not ported to extensions and with my theme (Absolution).
I asked for help in the PHPBB forum without luck, because the 3.0.x and 3.1.x versions are not supported anymore.
After dozens of hours trying to understand the bbcode functions, I'm almost there.
My code works when there's a single bbcode in the message, but it doesn't work when there are several bbcodes or when they are mixed with text.
So I would like some help to solve this part and make everything work.
In line 98 in includes/bbcode.php this function:
$message = preg_replace($preg['search'], $preg['replace'], $message);
Is returning something like this:
$message = "some text $this->Text_effect_pass('glow', 'red', 'abc') another text. $this->moderator_pass('"fernando"', 'hello!') more text"
For this message:
some text [glow=red]abc[/glow] another text.
[mod="fernando"]hello![/mod] more text
The input for preg_replace above is like this just for context:
"some text [glow=red:mkpanc3g]abc[/glow:mkpanc3g] another text. [mod="fernando":mkpanc3g]hello![/mod:mkpanc3g]"
So basically I have to split this string in valid expressions to apply eval() then concatenate everything. Like this:
$message = "some text". eval($this->Text_effect_pass('glow', 'red', 'abc');) . "another text " . eval($this->moderator_pass('"fernando"', 'hello!');). "more text"
In this specific case there's also double quotes left in '"fernando"'.
I know it is not safe to apply eval() to user input, so I would like to use some kind of preg_match and/or preg_split to get the values inside the () and pass them as parameters to my functions.
The functions are basically:
Text_effect_pass()
moderator_pass()
anchor_pass()
simpleTabs_pass()
I'm thinking in something like this (Please ignore errors here):
if(preg_match("/$this->Text_effect_pass/", $message)
{
then split the string and get value inside of() and remove extra single or double quotes.
after:
$textEffect = Text_effect_pass($value[0], $value[1], $value[2]);
Finally concatenate everything:
$message = $string[0] .$textEffect. $string[1];
}
if(preg_match("/$this->moderator_pass/", $message)
{
.....
}
P.S.: ABBC3 is not compatible with PHP 7.3 due to its use of the e modifier. I have edited everything to remove the modifier.
Here you can see it working separately:
bbcode 1
bbcode 2
Can someone give me some help please?

After a long time searching for a solution to this problem, I found this site, which helped me build the regex.
I have now managed to solve the problem, and my forum is fully working with PHPBB 3.0.14, PHP 7.3 and ABBC3.
My solution is:
// Start Text_effect_pass
$regex = "/(\\$)(this->Text_effect_pass)(\().*?(\')(,)( )(\').*?(\')(,)( )(\').*?(\'\))/is";
if (preg_match_all($regex, $message, $matches)) {
foreach ($matches[0] as $key => $func) {
$bracket = preg_split("/(\\$)(this->Text_effect_pass)/", $func);
$param = explode("', '", $bracket[1]);
$param[0] = substr($param[0], 2);
$param[2] = substr($param[2], 0, strrpos($param[2], "')"));
$effect = $this->Text_effect_pass($param[0], $param[1], $param[2]);
if ($key == 0) {
$init = $message;
} else {
$init = $mess;
}
$mess = str_replace($matches[0][$key], $effect, $init);
}
$message = $mess;
} // End Text_effect_pass
// Start moderator_pass
$regex = "/(\\$)(this->moderator_pass)(\().*?(\')(,).*?(\').*?(\'\))/is";
if (preg_match_all($regex, $message, $matches)) {
foreach ($matches[0] as $key => $func) {
$bracket = "/(\\$)(this->moderator_pass)/";
$bracket = preg_split($bracket, $func);
$param = explode("', '", $bracket[1]);
$param[0] = substr($param[0], 2);
$param[1] = substr($param[1], 0, strrpos($param[1], "')"));
$effect = $this->moderator_pass($param[0], $param[1]);
if ($key == 0) {
$init = $message;
} else {
$init = $mess;
}
$mess = str_replace($matches[0][$key], $effect, $init);
}
$message = $mess;
} // End moderator_pass
Anyone interested can find patch files and instructions here.
Best regards.
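For readers porting other /e patterns, the usual replacement is preg_replace_callback() applied directly to the BBCode, so that no "$this->..." string is ever produced and nothing needs to be evaluated. The sketch below is not ABBC3's actual code: the pattern, the uid format and the text_effect_pass() function are hypothetical stand-ins for the real $this->Text_effect_pass() method.

```php
<?php
// Hypothetical stand-in for $this->Text_effect_pass().
function text_effect_pass(string $effect, string $colour, string $text): string
{
    return '<span class="' . $effect . '" style="color:' . htmlspecialchars($colour) . '">'
        . $text . '</span>';
}

$message = 'some text [glow=red:mkpanc3g]abc[/glow:mkpanc3g] another text.';

// Group 1 = colour, group 2 = bbcode uid, group 3 = enclosed text.
$message = preg_replace_callback(
    '#\[glow=([^\]:]+):([a-z0-9]+)\](.*?)\[/glow:\2\]#is',
    function (array $m): string {
        // The callback receives the captured groups directly; no eval() involved.
        return text_effect_pass('glow', $m[1], $m[3]);
    },
    $message
);

echo $message, "\n";
```

Because the callback runs once per match, several bbcodes mixed with plain text are handled without any manual splitting and re-joining.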

php unserialize error notice [duplicate]

$ser = 'a:2:{i:0;s:5:"héllö";i:1;s:5:"wörld";}'; // fails
$ser2 = 'a:2:{i:0;s:5:"hello";i:1;s:5:"world";}'; // works
$out = unserialize($ser);
$out2 = unserialize($ser2);
print_r($out);
print_r($out2);
echo "<hr>";
But why?
Should I encode before serializing, then? How?
I am using JavaScript to write the serialized string to a hidden field, then PHP's $_POST to read it.
In JS I have something like:
function writeImgData() {
var caption_arr = new Array();
$('.album img').each(function(index) {
caption_arr.push($(this).attr('alt'));
});
$("#hidden-field").attr("value", serializeArray(caption_arr));
};

The reason why unserialize() fails with:
$ser = 'a:2:{i:0;s:5:"héllö";i:1;s:5:"wörld";}';
is that the lengths declared for héllö and wörld are wrong: the number after s: is a byte count, not a character count, and in UTF-8 these strings are longer than 5 bytes:
echo strlen('héllö'); // 7
echo strlen('wörld'); // 6
However if you try to unserialize() the following correct string:
$ser = 'a:2:{i:0;s:7:"héllö";i:1;s:6:"wörld";}';
echo '<pre>';
print_r(unserialize($ser));
echo '</pre>';
It works:
Array
(
[0] => héllö
[1] => wörld
)
If you use PHP's serialize(), it computes the byte lengths of multi-byte strings correctly.
On the other hand, if you want to exchange serialized data between multiple programming languages, you should drop this format and move to something like JSON, which is far more standardized.
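The byte-versus-character difference can be checked directly. This is a small sketch, assuming PHP 7 or later and the mbstring extension; the "\u{...}" escapes are used only so the strings are UTF-8 regardless of how the file itself is saved.

```php
<?php
// "\u{e9}" is é and "\u{f6}" is ö, each encoded as two bytes in UTF-8.
$captions = ["h\u{e9}ll\u{f6}", "w\u{f6}rld"];

// strlen() counts bytes, mb_strlen() counts characters.
echo strlen($captions[0]), ' ', mb_strlen($captions[0], 'UTF-8'), "\n"; // 7 5

// serialize() writes the byte length, so the result round-trips cleanly.
$ser = serialize($captions);
echo $ser, "\n";
var_dump(unserialize($ser) === $captions); // bool(true)
```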

I know this was posted about a year ago, but I just had this issue and came across this question, and I found a solution for it. This piece of code works like a charm!
The idea behind it is simple: it recalculates the lengths of the multibyte strings, as described by @Alix above.
A few modifications should make it suit your code:
/**
* Multi-byte Unserialize
*
* UTF-8 will screw up a serialized string
*
* @access private
* @param string
* @return string
*/
function mb_unserialize($string) {
$string = preg_replace('!s:(\d+):"(.*?)";!se', "'s:'.strlen('$2').':\"$2\";'", $string);
return unserialize($string);
}
Source: http://snippets.dzone.com/posts/show/6592
Tested on my machine, and it works like a charm!

Lionel Chan's answer, modified to work with PHP >= 5.5:
function mb_unserialize($string) {
$string2 = preg_replace_callback(
'!s:(\d+):"(.*?)";!s',
function($m){
$len = strlen($m[2]);
$result = "s:$len:\"{$m[2]}\";";
return $result;
},
$string);
return unserialize($string2);
}
This code uses preg_replace_callback() because preg_replace() with the /e modifier has been deprecated since PHP 5.5 and was removed in PHP 7.0.

The issue is, as pointed out by Alix, related to encoding.
Until PHP 5.4 the internal encoding for PHP was ISO-8859-1. This encoding uses a single byte for some characters that take multiple bytes in UTF-8. The result is that multibyte values serialized on a UTF-8 system will not be readable on ISO-8859-1 systems.
To avoid problems like this, make sure all systems use the same encoding:
mb_internal_encoding('utf-8');
$arr = array('foo' => 'bár');
$buf = serialize($arr);
You can use utf8_(encode|decode) to clean up:
// Set system encoding to iso-8859-1
mb_internal_encoding('iso-8859-1');
$arr = unserialize(utf8_encode($serialized));
print_r($arr);

In reply to @Lionel above: the mb_unserialize() function you proposed won't work if the serialized string itself contains the character sequence "; (a quote followed by a semicolon).
Use it with caution. For example:
$test = serialize('test";string');
// $test is now 's:12:"test";string";'
$string = preg_replace('!s:(\d+):"(.*?)";!se', "'s:'.strlen('$2').':\"$2\";'", $test);
print $string;
// output: s:4:"test";string"; (Wrong!!)
JSON is the way to go, as mentioned by others, IMHO.
Note: I post this as new answer as I don't know how to reply directly (new here).

This solution worked for me:
$unserialized = unserialize(utf8_encode($st));

Do not use PHP serialization/unserialization when the other end is not PHP. It is not meant to be a portable format. For example, it even embeds NUL bytes (ASCII 0) around the names of protected properties, which is nothing you want to deal with in JavaScript (even though it would work perfectly fine, it's just extremely ugly).
Instead, use a portable format like JSON. XML would do the job, too, but JSON has less overhead and is more programmer-friendly as you can easily parse it into a simple data structure instead of having to deal with XPath, DOM trees etc.

One more slight variation here which will hopefully help someone ... I was serializing an array then writing it to a database. On retrieving the data the unserialize operation was failing.
It turns out that the database longtext field I was writing into was using latin1 not UTF8. When I switched it round everything worked as planned.
Thanks to all above who mentioned character encoding and got me on the right track!

I would advise you to use javascript to encode as json and then use json_decode to unserialize.
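On the PHP side this could look like the sketch below. The field name captions is hypothetical, and the $posted string stands in for whatever JSON.stringify() wrote into the hidden field.

```php
<?php
// Stand-in for $_POST['captions'] (hypothetical field name).
$posted = '["h\u00e9ll\u00f6","w\u00f6rld"]';

// true = decode JSON arrays/objects into PHP arrays.
$captions = json_decode($posted, true);

if (!is_array($captions)) {
    // json_last_error_msg() describes why decoding failed.
    exit('Invalid caption data: ' . json_last_error_msg());
}

print_r($captions);
```

No length prefixes are involved, so multi-byte characters cannot desynchronize the format the way they do with a hand-built serialize() string.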

/**
* MULTI-BYTE UNSERIALIZE
*
* UTF-8 will screw up a serialized string
*
* @param string
* @return mixed
*/
function mb_unserialize($string) {
// Recompute each declared string length in bytes; $matches[2] is the string body.
$string = preg_replace_callback('!s:(\d+):"(.*?)";!s', function($matches) { return 's:'.strlen($matches[2]).':"'.$matches[2].'";'; }, $string);
return unserialize($string);
}

We can break the string down into an array:
$finalArray = array();
$nodeArr = explode('&', $_POST['formData']);
foreach($nodeArr as $value){
$childArr = explode('=', $value);
$finalArray[$childArr[0]] = $childArr[1];
}
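PHP's built-in parse_str() does the same splitting and also URL-decodes the values, which the manual explode() loop does not. A brief sketch, where $formData stands in for $_POST['formData'] and the sample field names are made up:

```php
<?php
// Stand-in for $_POST['formData'], e.g. the output of jQuery's .serialize().
$formData = 'title=h%C3%A9ll%C3%B6&tags%5B%5D=a&tags%5B%5D=b';

// parse_str() fills $fields with decoded keys and values, including nested arrays.
parse_str($formData, $fields);

print_r($fields);
```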

Serialize:
foreach ($income_data as $key => &$value)
{
$value = urlencode($value);
}
$data_str = serialize($income_data);
Unserialize:
$data = unserialize($data_str);
foreach ($data as $key => &$value)
{
$value = urldecode($value);
}

This one worked for me.
function mb_unserialize($string) {
$string = mb_convert_encoding($string, "UTF-8", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));
$string = preg_replace_callback(
'/s:([0-9]+):"(.*?)";/',
function ($match) {
return "s:".strlen($match[2]).":\"".$match[2]."\";";
},
$string
);
return unserialize($string);
}

In my case the problem was with line endings (likely some editor had changed my file from DOS to Unix).
I put together these adaptive wrappers:
function unserialize_fetchError($original, &$unserialized, &$errorMsg) {
$unserialized = @unserialize($original);
$errorMsg = error_get_last()['message'] ?? null;
return ( $unserialized !== false || $original == 'b:0;' ); // "$original == serialize(false)" is a good serialization even if deserialization actually returns false
}
function unserialize_checkAllLineEndings($original, &$unserialized, &$errorMsg, &$lineEndings) {
if ( unserialize_fetchError($original, $unserialized, $errorMsg) ) {
$lineEndings = 'unchanged';
return true;
} elseif ( unserialize_fetchError(str_replace("\n", "\n\r", $original), $unserialized, $errorMsg) ) {
$lineEndings = '\n to \n\r';
return true;
} elseif ( unserialize_fetchError(str_replace("\n\r", "\n", $original), $unserialized, $errorMsg) ) {
$lineEndings = '\n\r to \n';
return true;
} elseif ( unserialize_fetchError(str_replace("\r\n", "\n", $original), $unserialized, $errorMsg) ) {
$lineEndings = '\r\n to \n';
return true;
} //else
return false;
}

Converting SMS encoding to UTF-8 in PHP

I wrote an SMPP Server Transceiver in PHP.
I get this SMS string from my SMPP. It's a UTF8 message which is actually at 7Bit. Here is a sample message:
5d30205d30205d3
I know how to convert it. It should be:
\x5d3\x020\x5d3\x020\x5d3
I don't want to write it myself. I guess there is already a function that does that for me. Some hidden iconv or using pack() / unpack() to convert this string to the correct format.
I am trying to achieve this using PHP.
Any ideas?
Thanks.

This should do it :
$message = "5d30205d30205d3";
echo "\x".implode("\x", str_split($message, 3));
// \x5d3\x020\x5d3\x020\x5d3

Here is what I used eventually:
public static function sms__from_unicode($message)
{
$org_msg = str_split(strtolower($message), 3);
for($i = 0;$i < count($org_msg); $i++)
$org_msg[$i] = "\u0{$org_msg[$i]}";
$str = implode('', $org_msg);
$str = preg_replace_callback('/\\\\u([0-9a-f]{4})/i', 'replace_unicode_escape_sequence', $str);
return $str;
}
function replace_unicode_escape_sequence($match) {
return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}
Thanks, all.
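The same conversion can be sketched without the intermediate \u escape strings. This assumes, as in the question's sample, that every character arrives as exactly three hex digits; the function name sms_hex_to_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: turn 3-hex-digit groups into a UTF-8 string.
function sms_hex_to_utf8(string $hex): string
{
    $utf16 = '';
    foreach (str_split($hex, 3) as $chunk) {
        // Left-pad each 3-digit group to a 4-digit (2-byte) UTF-16 code unit.
        $utf16 .= hex2bin(str_pad($chunk, 4, '0', STR_PAD_LEFT));
    }
    return mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');
}

echo sms_hex_to_utf8('5d30205d30205d3'), "\n";
```

For the sample message this yields the Hebrew letter U+05D3 three times, separated by spaces.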

How to handle user input of invalid UTF-8 characters

I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.
Even though my web application uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode() and overall seems like a bad idea to have around.
W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back.".
How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
How do you present the error in a helpful way to the user?
How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave as-is in the database but doing something (what?) before json_encode()?
I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP?". I'd like advice from people with experience in real-world situations how they've handled this.
As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.

The accept-charset="UTF-8" attribute is only a guideline for browsers to follow, and they are not forced to submit that in that way. Crappy form submission bots are a good example...
I usually ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv, you also have the option to transliterate bad characters.
Here is an example using iconv():
$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
If you want to display an error message to your users I'd probably do this in a global way instead of a per value received basis. Something like this would probably do just fine:
function utf8_clean($str)
{
return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}
$clean_GET = array_map('utf8_clean', $_GET);
if (serialize($_GET) != serialize($clean_GET))
{
$_GET = $clean_GET;
$error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}
// $_GET is clean!
You may also want to normalize new lines and strip (non-)visible control chars, like this:
function Clean($string, $control = true)
{
$string = iconv('UTF-8', 'UTF-8//IGNORE', $string);
if ($control === true)
{
return preg_replace('~\p{C}+~u', '', $string);
}
return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}
Code to convert from UTF-8 to Unicode code points:
function Codepoint($char)
{
$result = null;
$codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
if (is_array($codepoint) && array_key_exists(1, $codepoint))
{
$result = sprintf('U+%04X', $codepoint[1]);
}
return $result;
}
echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072
It is probably faster than any other alternative, but I haven't tested it extensively though.
Example:
$string = 'hello world�';
// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);
function Bad_Codepoint($string)
{
$result = array();
foreach ((array) $string as $char)
{
$codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
if (is_array($codepoint) && array_key_exists(1, $codepoint))
{
$result[] = sprintf('U+%04X', $codepoint[1]);
}
}
return implode('', $result);
}
This may be what you were looking for.

Receiving invalid characters from your web application might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:
<form action="..." accept-charset="UTF-8">
You might also want to look at similar questions on Stack Overflow for pointers on how to handle invalid characters. However, I think that signaling an error to the user is better than trying to clean up those invalid characters, because cleaning can cause unexpected loss of significant data or unexpected changes to your user's input.
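One way to signal such an error site-wide is to check every submitted value once, before any page-specific code runs. The following is only a sketch under that assumption: the helper name find_invalid_utf8_fields and the sample input are hypothetical, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: return the keys of all string fields that are not valid UTF-8.
function find_invalid_utf8_fields(array $input): array
{
    $bad = [];
    // array_walk_recursive() also visits values inside nested arrays.
    array_walk_recursive($input, function ($value, $key) use (&$bad) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[] = $key;
        }
    });
    return $bad;
}

// Stand-in for $_POST; the second value contains a lone Latin-1 byte.
$input = ['name' => "Zo\u{eb}", 'comment' => "caf\xE9"];

$bad = find_invalid_utf8_fields($input);
if ($bad !== []) {
    echo 'Please re-enter: ' . implode(', ', $bad) . "\n";
}
```

The original input is left untouched, so the form can be redisplayed with the valid fields intact and only the flagged ones cleared or highlighted.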

I put together a fairly simple class to check if input is in UTF-8 and to run through utf8_encode() as needs be:
class utf8
{
/**
* @param array $data
* @return array
*/
public static function encode(array $data)
{
foreach ($data as $key=>$val) {
if (is_array($val)) {
$data[$key] = self::encode($val);
} else {
if (false === self::check($val)) {
$data[$key] = utf8_encode($val);
}
}
}
return $data;
}
/**
* Regular expression to test a string is UTF8 encoded
*
* RFC3629
*
* @param string $string The string to be tested
* @return bool
*
* @link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
*/
public static function check($string)
{
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs',
$string);
}
}
// For example
$data = utf8::encode($_POST);

For completeness to this question (not necessarily the best answer)...
function as_utf8($s) {
return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
}

There is a multibyte extension for PHP. See Multibyte String
You should try the mb_check_encoding() function.
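As a sketch of how mb_check_encoding() might be combined with replacement by U+FFFD: the helper below validates first and only rewrites strings that fail. The name utf8_or_replace is hypothetical; the approach relies on mb_substitute_character() to choose the replacement and on mb_convert_encoding() from UTF-8 to UTF-8 to apply it.

```php
<?php
// Hypothetical helper: return $s unchanged if it is valid UTF-8,
// otherwise return a copy with invalid byte sequences replaced by U+FFFD.
function utf8_or_replace(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    // Remember the current substitute character so it can be restored afterwards.
    $previous = mb_substitute_character();
    mb_substitute_character(0xFFFD);
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);
    return $clean;
}

$clean = utf8_or_replace("hello\xFFworld");
var_dump(mb_check_encoding($clean, 'UTF-8')); // bool(true)
```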

I recommend merely not allowing garbage to get in. Don't rely on custom functions, which can bog your system down.
Simply walk the submitted data against an alphabet you design. Create an acceptable alphabet string and walk the submitted data, byte by byte, as if it were an array. Push acceptable characters to a new string, and omit unacceptable characters.
The data you store in your database then is data triggered by the user, but not actually user-supplied data.
<?php
// Build alphabet
// Optionally, you can remove characters from this array
$alpha[] = chr(0); // null
$alpha[] = chr(9); // tab
$alpha[] = chr(10); // new line
$alpha[] = chr(11); // vertical tab
$alpha[] = chr(13); // carriage return
for ($i = 32; $i <= 126; $i++) {
$alpha[] = chr($i);
}
/* Remove comment to check ASCII ordinals */
// /*
// foreach ($alpha as $key => $val) {
// print ord($val);
// print '<br/>';
// }
// print '<hr/>';
//*/
//
// // Test case #1
//
// $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv ' . chr(160) . chr(127) . chr(126);
//
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
//
// // Test case #2
//
// $str = '' . '©?™???';
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
//
// $str = '©';
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
$file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
$testfile = implode(chr(10), file($file));
$string = teststr($alpha, $testfile);
print $string;
print '<hr/>';
function teststr(&$alpha, &$str) {
$strlen = strlen($str);
$newstr = chr(0); // null
$x = 0;
if($strlen >= 2) {
for ($i = 0; $i < $strlen; $i++) {
$x++;
if(in_array($str[$i], $alpha)) {
// Passed
$newstr .= $str[$i];
}
else {
// Failed
print 'Found out of scope character. (ASCII: ' . ord($str[$i]). ')';
print '<br/>';
$newstr .= '�';
}
}
}
elseif($strlen <= 0) {
// Failed to qualify for test
print 'Non-existent.';
}
elseif($strlen === 1) {
$x++;
if(in_array($str, $alpha)) {
// Passed
$newstr = $str;
}
else {
// Failed
print 'Total character failed to qualify.';
$newstr = '�';
}
}
else {
print 'Non-existent (scope).';
}
if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
// Skip
}
else {
$newstr = utf8_encode($newstr);
}
// Test encoding:
if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
print 'UTF-8 :D<br/>';
}
else {
print 'ENCODED: ' . mb_detect_encoding($newstr, "UTF-8") . '<br/>';
}
return $newstr . ' (scope: ' . $x . ', ' . $strlen . ')';
}

Strip all characters outside your given subset. At least in some parts of my application I would not allow characters outside the [a-zA-Z] and [0-9] sets, for example in usernames.
You can build a filter function that silently strips all characters outside this range, or that returns an error if it detects them and pushes the decision to the user.
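Both variants can be sketched in a few lines. The function names are hypothetical, and the allowed set is the [a-zA-Z0-9] example from above.

```php
<?php
// Variant 1: report whether the name is acceptable and let the caller decide.
function is_valid_username(string $name): bool
{
    // The D modifier makes $ match only at the very end, not before a trailing newline.
    return preg_match('/^[a-zA-Z0-9]+$/D', $name) === 1;
}

// Variant 2: silently drop every byte outside the allowed set.
function strip_to_alnum(string $name): string
{
    return preg_replace('/[^a-zA-Z0-9]/', '', $name);
}

var_dump(is_valid_username('fernando42')); // bool(true)
var_dump(is_valid_username("fern\xE9"));   // bool(false)
echo strip_to_alnum("fern\xE9 42"), "\n";  // fern42
```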

Try doing what Ruby on Rails does to force all browsers always to post UTF-8 data:
<form accept-charset="UTF-8" action="#{action}" method="post"><div
style="margin:0;padding:0;display:inline">
<input name="utf8" type="hidden" value="✓" />
</div>
<!-- form fields -->
</form>
See railssnowman.info or the initial patch for an explanation.
To have the browser send form-submission data in the UTF-8 encoding, just render the page with a Content-Type header of "text/html; charset=utf-8" (or use a meta http-equiv tag).
To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), use accept-charset="UTF-8" in the form.
To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), and even if the browser is Internet Explorer and the user switched the page encoding to Korean and entered Korean characters in the form fields, add a hidden input to the form with a value such as ✓ which can only be from the Unicode charset (and, in this example, not the Korean charset).
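On the receiving side, a PHP script could additionally verify that the hidden field arrived intact. This check is an assumption about how the marker might be used, not something the Rails technique itself requires; the function name is hypothetical and the field name utf8 is taken from the form above.

```php
<?php
// Hypothetical check: the hidden "utf8" field should arrive as U+2713 encoded in UTF-8.
function form_was_posted_as_utf8(array $post): bool
{
    return isset($post['utf8']) && $post['utf8'] === "\u{2713}";
}

// "\xE2\x9C\x93" is the UTF-8 byte sequence for the check mark.
var_dump(form_was_posted_as_utf8(['utf8' => "\xE2\x9C\x93"])); // bool(true)
var_dump(form_was_posted_as_utf8(['utf8' => '?']));            // bool(false)
```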

Set UTF-8 as the character set in every Content-Type header output by your PHP code:
header('Content-Type: text/html; charset=utf-8');
