Prepare UTF-8 string for mysql - php

I got a string from an email subject
input
=?UTF-8?B?RndkOiDwn5G+IEZpbmFsIEhvdXJzIHRvIFNhdmUg8J+Rvg==?=
output (from the convert_non_ascii method)
Fwd: 👾 Final Hours to Save 👾
I want to insert it into mysql, but get an error
error
2015-12-06 11:11 SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x91\xBE F...' for column 'label' at ...
code
I use this code to prepare the string for mysql
private function convert_non_ascii($string){
$return = '';
if(preg_match('/^=\?(iso-8859-1|utf-8)\?q\?/i', $string)){
$return = str_replace('_',' ', mb_decode_mimeheader($string));
}
elseif(preg_match('/^(iso-8859-1\'\')(.*)$/i', $string, $matches)){
$return = utf8_encode(rawurldecode($matches[2]));
}
else{
$return = imap_utf8($string);
}
// Fix: Remove all non UTF-8 characters if mail is not correctly encoded
$return = mb_convert_encoding($return, 'UTF-8', 'UTF-8');
return $return;
}

The column/table in MySQL must be declared CHARACTER SET utf8mb4. MySQL's utf8 won't suffice for Emoji.
And you need SET NAMES utf8mb4.
The utf8 hex for 👾 is F09F91BE (4 bytes), and can be seen in the error message.
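A minimal sketch of both points, assuming PDO with the pdo_mysql driver. The DSN, the credentials, the table name `mail`, and the `VARCHAR(255)` type are placeholders; only the column name `label` comes from the error message. `needs_utf8mb4()` is a hypothetical helper name used for illustration.

```php
<?php
// U+1F47E (the emoji in the subject line) is four bytes long in UTF-8.
$emoji = "\xF0\x9F\x91\xBE";
echo strtoupper(bin2hex($emoji)), "\n"; // F09F91BE

// Hypothetical helper: true if the string contains any character above
// U+FFFF, i.e. one that MySQL's 3-byte utf8 character set rejects.
function needs_utf8mb4(string $s): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $s) === 1;
}

var_dump(needs_utf8mb4("Fwd: $emoji Final Hours to Save $emoji")); // bool(true)
var_dump(needs_utf8mb4('Fwd: Final Hours to Save'));               // bool(false)

// Assumed schema and connection setup (placeholders, not executed here):
// ALTER TABLE mail MODIFY label VARCHAR(255) CHARACTER SET utf8mb4;
// $pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', $user, $pass);
```

Putting `charset=utf8mb4` in the DSN has the same effect on the connection as running SET NAMES utf8mb4 after connecting.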

Related

PHP: UTF-8 character gets messy in function which takes the first letter from each word of a sentence

I have this function which when executed it returns the first letters of each word of a string.
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', $stringsoftext) as $word)
$retturns .= ($word[0]);
return $retturns;
}
Everything works fine. The only problem is that when the words begin with special characters it starts to get messy.
For example, "test økonomi" becomes "t�" instead of "tø".
How can I correct this?
That happens because $word[0] takes the first byte of a string, whereas you are using a multi-byte encoding, in which a character may consist of several bytes. The ø character consists of 2 bytes: 0xC3 0xB8.
That is how you would extract the first character instead:
mb_substr($word, 0, 1, 'utf8')
Working demo: http://ideone.com/XVnC87
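The difference between bytes and characters can be checked directly. This is a small sketch that assumes the mbstring extension is loaded; the word is written with explicit byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// "økonomi" with ø spelled out as its two UTF-8 bytes, 0xC3 0xB8.
$word = "\xC3\xB8konomi";

echo strlen($word), "\n";             // 8 (bytes)
echo mb_strlen($word, 'UTF-8'), "\n"; // 7 (characters)

// Indexing returns a single byte, which is only half of the ø.
echo bin2hex($word[0]), "\n";                        // c3
// mb_substr() returns the whole first character.
echo bin2hex(mb_substr($word, 0, 1, 'UTF-8')), "\n"; // c3b8
```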
You should use mb_substr together with mb_internal_encoding, as in this example:
<?php
header('Content-Type: text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');
echo initials('ąęść óęłęł');
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', $stringsoftext) as $word) {
$retturns .= mb_substr($word,0,1);
}
return $retturns;
}
Complementing the answers above, you could convert each UTF-8 encoded character (to be precise, each character assumed to be UTF-8) to its ISO 8859-1 counterpart.
This requires no multibyte support, which is not enabled by default in many PHP configurations.
Use utf8_decode() to do so (note that utf8_encode() and utf8_decode() are deprecated as of PHP 8.2):
<?php
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', utf8_decode($stringsoftext)) as $word)
$retturns .= ($word[0]);
return $retturns;
}
echo initials("test økonomi");
// returns tø as ISO 8859-1 bytes (0x74 0xF8), so the output must be served as ISO 8859-1 to display correctly
?>
Edit: This approach breaks if a character being converted is not defined in the ISO 8859-1 charset (e.g. non-Latin symbols). To reiterate, if PHP multibyte support is turned on, the mb_substr() solution is certainly the most appropriate, as it processes the string properly in UTF-8.

MySQL insert String error - UTF-8?

I am inserting into a mysql database. I get the following error when trying to do the insert
Incorrect string value: '\xF0\x9F\x87\xB7\xF0\x9F...' for column 'field_4' at row 1
I thought I had fixed this error by simply changing the column encoding to utf8mb4, and I had tested it, but recently the error appeared again. I am using PHP to parse the string and run the following function before inserting...
function strip_emoji($subject) {
if (is_array($subject)) {
// Recursive strip for multidimensional array
foreach ($subject as &$value) $value = $this->strip_emoji($value);
return $subject;
} else {
// Match Emoticons
$regexEmoticons = '/[\x{1F600}-\x{1F64F}]/u';
$clean_text = preg_replace($regexEmoticons, '', $subject);
// Match Miscellaneous Symbols and Pictographs
$regexSymbols = '/[\x{1F300}-\x{1F5FF}]/u';
$clean_text = preg_replace($regexSymbols, '', $clean_text);
// Match Transport And Map Symbols
$regexTransport = '/[\x{1F680}-\x{1F6FF}]/u';
$clean_text = preg_replace($regexTransport, '', $clean_text);
return $clean_text;
}
}
There are several similar questions to this but I still have these errors. Any further advice on how to prevent this error? I realize that it is an emoji unicode character / sprite but not sure how to deal with it.
You are trying to insert a character that spans 4 bytes, so you have to convert the column to the utf8mb4 character set.
The utf8 character set is limited to characters that span 3 bytes (the Unicode characters U+0000 through U+FFFF).
Does the connection use the utf8mb4 charset as well? The column charset alone is not enough for 4-byte characters.
Add ";charset=utf8mb4" to the PDO connection string, or execute the query "SET NAMES utf8mb4".
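The bytes in the error message, \xF0\x9F\x87\xB7, decode to U+1F1F7, a regional indicator symbol (the letters that make up flag emoji). That code point lies outside all three ranges in strip_emoji(), which explains why the function lets it through. If the goal is to strip rather than store such characters, one option is to remove everything above U+FFFF. This is only a sketch, and `strip_4byte_chars()` is a hypothetical name.

```php
<?php
// Hypothetical helper: remove every code point above U+FFFF, i.e. every
// character that needs four bytes in UTF-8, instead of listing emoji blocks.
function strip_4byte_chars(string $text): string
{
    $clean = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
    // preg_replace() returns null when the input is not valid UTF-8.
    return $clean ?? '';
}

// U+1F1F7 and U+1F1FA, written as the bytes F0 9F 87 B7 and F0 9F 87 BA.
$flag = "\xF0\x9F\x87\xB7\xF0\x9F\x87\xBA";
echo strip_4byte_chars("Hello $flag!"), "\n"; // Hello !
```

Stripping loses data, so with a utf8mb4 column and a utf8mb4 connection it should not be needed at all.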

How to get substring of unicode characters from mysql using php

The Unicode characters are stored in the MySQL database in this format:
&#1740;&#1729;&#1575;&#1722; &#1578;&#1608;
The database does not contain only Unicode characters; HTML and English characters are mixed in as well.
The problem is that I want to get part of the string from the database field 'post_body'.
I have used the following sql query
"SELECT SUBSTRING(post_body,1,120) as pst_body from mytable";
This query returns exactly 120 characters. The problem is that when the field contains Unicode symbols, &#1740; counts as seven characters although it represents one Unicode character, so the result does not meet my requirement.
Is there a function that returns the specified number of characters regardless of whether they are Unicode or English characters, meaning that &#1740; counts as one character?
I do not think there is an option for this in MySQL. You can fetch the data from MySQL and then do the work in PHP.
function getSubstring($string, $number){
$keywords = preg_split("/([&])+/", htmlentities($string));
$finalArray = array();
unset($keywords[0]);
for($index = 1;$index <= $number;$index++){
$finalArray[] = $keywords[$index];
}
return str_replace('amp;', '&', implode('', $finalArray));
}
//$string = '&#1740;&#1729;&#1575;&#1722; &#1578;&#1608;';
//$number = 10; // number of characters to fetch
echo getSubstring($string,10);
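An alternative sketch, assuming the field holds HTML numeric character references (such as &#1740;) mixed with plain text and that mbstring is available: decode the references to UTF-8 first, then cut by characters. `entity_substr()` is a hypothetical name.

```php
<?php
// Hypothetical helper: decode entities first, then cut by characters.
function entity_substr(string $html, int $start, int $length): string
{
    $text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return mb_substr($text, $start, $length, 'UTF-8');
}

$stored = '&#1740;&#1729;&#1575;&#1722; &#1578;&#1608;';

// The first four characters, returned as raw UTF-8 text.
echo entity_substr($stored, 0, 4), "\n";

// The whole value is only 7 characters long once decoded.
echo mb_strlen(entity_substr($stored, 0, 120), 'UTF-8'), "\n"; // 7
```

The result is raw UTF-8 rather than entities. Real HTML tags in the field would need strip_tags() or similar before cutting, which this sketch leaves out.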

Utf-8 to UTF-16BE

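I save a record "فحص الرسالة العربية" in PHP, and it is always stored as:
&#1601;&#1581;&#1589; &#1575;&#1604;&#1585;&#1587;&#1575;&#1604;&#1577; &#1575;&#1604;&#1593;&#1585;&#1576;&#1610;&#1577;
I want to convert this into UTF-16BE characters when I retrieve it, so I am using a function that returns: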
002600230031003600300031003b002600230031003500380031003b002600230031003500380039003b0020002600230031003500370035003b002600230031003600300034003b002600230031003500380035003b002600230031003500380037003b002600230031003500370035003b002600230031003600300034003b002600230031003500370037003b0020002600230031003500370035003b002600230031003600300034003b002600230031003500390033003b002600230031003500380035003b002600230031003500370036003b002600230031003600310030003b002600230031003500370037003b
This is the function that I am using to convert the string retrieved from the database:
function convertCharsn($string) {
$in = '';
$out = iconv('UTF-8', 'UTF-16BE', $string);
for($i=0; $i<strlen($out); $i++) {
$in .= sprintf("%02X", ord($out[$i]));
}
return $in;
}
But when I enter the same text at the URL below, it shows different characters from my output.
http://www.routesms.com/downloads/onlineunicode.asp
It returns:
0641062D063500200627064406310633062706440629002006270644063906310628064A0629
I want my string to be converted the way the page above converts it.
My database collation is utf8_general_ci.
Basically, you need to decode those characters out of HTML entities first. Just use html_entity_decode()
$rawChars = html_entity_decode($string, ENT_QUOTES | ENT_HTML401, 'UTF-8');
convertCharsn($rawChars);
Otherwise, you're just encoding the entities themselves. & is 0026 in UTF-16 and # is 0023, which is why the sequence 00260023 repeats throughout the output you posted. Decode first, and you should be set.
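Put together, the fix can be sketched as follows. This assumes the stored value consists of numeric character references, which is what the repeated 00260023 pattern indicates. It uses mbstring rather than iconv for the transcoding step, an arbitrary choice, and `utf16be_hex()` is a hypothetical name.

```php
<?php
// Same idea as convertCharsn(), but decoding the entities first.
function utf16be_hex(string $stored): string
{
    $raw = html_entity_decode($stored, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // bin2hex() replaces the manual sprintf("%02X", ...) loop.
    return strtoupper(bin2hex(mb_convert_encoding($raw, 'UTF-16BE', 'UTF-8')));
}

$stored = '&#1601;&#1581;&#1589; &#1575;&#1604;&#1585;&#1587;&#1575;&#1604;&#1577; '
        . '&#1575;&#1604;&#1593;&#1585;&#1576;&#1610;&#1577;';

echo utf16be_hex($stored), "\n";
// 0641062D063500200627064406310633062706440629002006270644063906310628064A0629
```

The output matches the string returned by the online converter mentioned in the question.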

json_encode() non utf-8 strings?

So I have an array of strings, and all of the strings are using the system default ANSI encoding and were pulled from a SQL database. So there are 256 different possible character byte values (single byte encoding).
Is there a way I can get json_encode() to work and display these characters instead of having to use utf8_encode() on all of my strings and ending up with stuff like \u0082?
Or is that the standard for JSON?
Is there a way I can get json_encode() to work and display these characters instead of having to use utf8_encode() on all of my strings and ending up with stuff like "\u0082"?
If you have an ANSI encoded string, utf8_encode() is the wrong function for this. You need to convert it properly from ANSI to UTF-8 first. That will certainly reduce the number of Unicode escape sequences like \u0082 in the JSON output, but technically these sequences are valid JSON; you need not fear them.
Converting ANSI to UTF-8 with PHP
json_encode works with UTF-8 encoded strings only. If you need to create valid json successfully from an ANSI encoded string, you need to re-encode/convert it to UTF-8 first. Then json_encode will just work as documented.
To convert an encoding from ANSI (more correctly I assume you have a Windows-1252 encoded string, which is popular but wrongly referred to as ANSI) to UTF-8 you can make use of the mb_convert_encoding() function:
$str = mb_convert_encoding($str, "UTF-8", "Windows-1252");
Another function in PHP that can convert the encoding / charset of a string is called iconv based on libiconv. You can use it as well:
$str = iconv("CP1252", "UTF-8", $str);
Note on utf8_encode()
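utf8_encode() works only for Latin-1 (ISO 8859-1), not for ANSI (Windows-1252). Bytes in the range 0x80 to 0x9F, which hold printable characters such as the euro sign and curly quotes in Windows-1252, are converted to control characters instead, so you lose those characters when you run the string through that function.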
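A short sketch of the difference, assuming mbstring is available. Because utf8_encode() is deprecated as of PHP 8.2, the Latin-1 interpretation is reproduced here with mb_convert_encoding() and 'ISO-8859-1' as the source encoding.

```php
<?php
// Byte 0x80 is the euro sign in Windows-1252 but a C1 control character
// in ISO-8859-1 (Latin-1).
$byte = "\x80";

// Correct: treat the input as Windows-1252.
$good = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
// What utf8_encode() effectively does: treat the input as Latin-1.
$bad = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo json_encode($good), "\n"; // "\u20ac"
echo json_encode($bad), "\n";  // "\u0080"

// With JSON_UNESCAPED_UNICODE the character is written out literally.
echo json_encode($good, JSON_UNESCAPED_UNICODE), "\n"; // "€"
```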
Related: What is ANSI format?
For more fine-grained control over what json_encode() returns, see the list of predefined constants (these depend on the PHP version, including PHP 5.4; some constants remain undocumented and are so far available only in the source code).
Changing the encoding of an array/iteratively (PDO comment)
As you wrote in a comment that you have problems applying the function to an array, here is a code example. The encoding always needs to be changed before calling json_encode. This is a standard array operation; for the simpler case of PDOStatement::fetch(), a foreach iteration:
while($row = $q->fetch(PDO::FETCH_ASSOC))
{
foreach($row as &$value)
{
$value = mb_convert_encoding($value, "UTF-8", "Windows-1252");
}
unset($value); # safety: remove reference
$items[] = $row; # already UTF-8 here; running utf8_encode on it again would double-encode
}
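For nested data the same conversion can be written with array_walk_recursive(). This is a sketch under the same Windows-1252 assumption; `to_utf8_deep()` is a hypothetical name, and the sample rows stand in for fetched PDO results.

```php
<?php
// Hypothetical helper: convert every string in a (possibly nested) array
// from Windows-1252 to UTF-8, leaving integers, floats, etc. untouched.
// Array keys are not converted.
function to_utf8_deep(array $data): array
{
    array_walk_recursive($data, function (&$value) {
        if (is_string($value)) {
            $value = mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
        }
    });
    return $data;
}

// "\xE9" is é and "\xEF" is ï in Windows-1252.
$rows = [['name' => "Caf\xE9", 'id' => 1], ['name' => "na\xEFve", 'id' => 2]];

echo json_encode(to_utf8_deep($rows)), "\n";
// [{"name":"Caf\u00e9","id":1},{"name":"na\u00efve","id":2}]
```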
The JSON standard ENFORCES Unicode encoding. From RFC4627:
3. Encoding
JSON text SHALL be encoded in Unicode. The default encoding is
UTF-8.
Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8
Therefore, in the strictest sense, ANSI encoded JSON would not be valid JSON; this is why PHP enforces Unicode encoding when using json_encode().
As for "default ANSI", I'm pretty sure that your strings are encoded in Windows-1252. It is incorrectly referred to as ANSI.
<?php
$array = array('first word' => array('Слово','Кириллица'),'second word' => 'Кириллица','last word' => 'Кириллица');
echo json_encode($array);
/*
return {"first word":["\u0421\u043b\u043e\u0432\u043e","\u041a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"],"second word":"\u041a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430","last word":"\u041a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"}
*/
echo json_encode($array, JSON_UNESCAPED_UNICODE); // the constant's value is 256
/*
return {"first word":["Слово","Кириллица"],"second word":"Кириллица","last word":"Кириллица"}
*/
?>
JSON_UNESCAPED_UNICODE (integer)
Encode multibyte Unicode characters literally (default is to escape as \uXXXX). Available since PHP 5.4.0.
http://php.net/manual/en/json.constants.php#constant.json-unescaped-unicode
I found the following answer to an analogous problem, where I had to JSON-encode a nested array that was not UTF-8 encoded:
$inputArray = array(
'a'=>'First item - à',
'c'=>'Third item - é'
);
$inputArray['b']= array (
'a'=>'First subitem - ù',
'b'=>'Second subitem - ì'
);
if (!function_exists('recursive_utf8')) {
function recursive_utf8 ($data) {
if (!is_array($data)) {
return utf8_encode($data);
}
$result = array();
foreach ($data as $index=>$item) {
if (is_array($item)) {
$result[$index] = array();
foreach($item as $key=>$value) {
$result[$index][$key] = recursive_utf8($value);
}
}
else if (is_object($item)) {
$result[$index] = array();
foreach(get_object_vars($item) as $key=>$value) {
$result[$index][$key] = recursive_utf8($value);
}
}
else {
$result[$index] = recursive_utf8($item);
}
}
return $result;
}
}
$outputArray = json_encode(array_map('recursive_utf8', $inputArray ));
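json_encode($str,JSON_HEX_TAG|JSON_HEX_AMP|JSON_HEX_APOS|JSON_HEX_QUOT);
Note that these flags only escape <, >, &, ' and " as \uXXXX sequences. They do not convert Windows-based ANSI to UTF-8, so the input still has to be converted to UTF-8 first for the error to go away.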
Use this instead:
<?php
//$return_arr = the array of data to json encode
//$out = the output of the function
//don't forget to escape the data before using it!
$out = '["' . implode('","', $return_arr) . '"]';
?>
Copied from the comments on the json_encode page of the PHP manual. Always read the comments; they are useful.
