MySQL insert String error - UTF-8? - php

I am inserting into a mysql database. I get the following error when trying to do the insert
Incorrect string value: '\xF0\x9F\x87\xB7\xF0\x9F...' for column 'field_4' at row 1
I thought I had figured out this error by simply changing the column encoding to utf8mb4 and had tested it, but recently the error appeared again. I am using PHP to parse the string and run the following function before inserting:
function strip_emoji($subject) {
    if (is_array($subject)) {
        // Recursive strip for multidimensional array
        foreach ($subject as &$value) $value = $this->strip_emoji($value);
        return $subject;
    } else {
        // Match Emoticons
        $regexEmoticons = '/[\x{1F600}-\x{1F64F}]/u';
        $clean_text = preg_replace($regexEmoticons, '', $subject);
        // Match Miscellaneous Symbols and Pictographs
        $regexSymbols = '/[\x{1F300}-\x{1F5FF}]/u';
        $clean_text = preg_replace($regexSymbols, '', $clean_text);
        // Match Transport And Map Symbols
        $regexTransport = '/[\x{1F680}-\x{1F6FF}]/u';
        $clean_text = preg_replace($regexTransport, '', $clean_text);
        return $clean_text;
    }
}
There are several similar questions to this but I still have these errors. Any further advice on how to prevent this error? I realize that it is an emoji unicode character / sprite but not sure how to deal with it.

You are trying to insert a character that spans 4 bytes, so you have to convert the column to the utf8mb4 character set.
MySQL's utf8 character set is limited to characters that span at most 3 bytes (the Unicode characters U+0000 through U+FFFF).
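The bytes quoted in the error message, F0 9F 87 B7, decode to U+1F1F7, which lies outside all three ranges that strip_emoji() removes. A minimal sketch, assuming the goal is to drop every character that needs 4 bytes in UTF-8 when the column cannot be converted; the name strip_four_byte_chars is hypothetical and the mbstring extension is assumed to be available:

```php
<?php
// Hypothetical helper: removes every code point above U+FFFF,
// i.e. every character that UTF-8 encodes with 4 bytes.
function strip_four_byte_chars(string $text): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $clean = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
    // preg_replace() returns null on failure (for example invalid UTF-8);
    // fall back to the untouched text in that case.
    return $clean ?? $text;
}

// The first four bytes from the error message.
$char = "\xF0\x9F\x87\xB7";
printf("U+%X\n", mb_ord($char, 'UTF-8'));      // U+1F1F7
var_dump(strip_four_byte_chars("Hi $char!"));  // string(4) "Hi !"
```

Stripping loses data, so converting the column and the connection to utf8mb4 is usually the better fix when that is possible.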

Is the charset set for the connection as well? For 4-byte characters the connection has to use utf8mb4 too.
Add ";charset=utf8mb4" to the PDO connection string, or execute the query "SET NAMES utf8mb4".
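A minimal sketch of a PDO connection string that requests utf8mb4, since 4-byte characters need that charset on the connection as well as on the column. The helper names, host, and database name are placeholders, and connect() is defined but not called because it needs a running server:

```php
<?php
// Hypothetical helper that builds a PDO MySQL DSN requesting utf8mb4
// for the connection (host and database name are placeholders).
function build_dsn(string $host, string $dbname): string
{
    return "mysql:host={$host};dbname={$dbname};charset=utf8mb4";
}

// Hypothetical connect function; it is not invoked here because it
// requires a live MySQL server and real credentials.
function connect(string $host, string $dbname, string $user, string $pass): PDO
{
    return new PDO(build_dsn($host, $dbname), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

echo build_dsn('localhost', 'mydb'), "\n";
// mysql:host=localhost;dbname=mydb;charset=utf8mb4
```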

Related

how to change ascii alphabet to utf-8 in php

I have an ASCII string. I would like to change its encoding to UTF-8.
But I could not find a simple function to change ASCII to UTF-8 in PHP.
And vice versa, I would like to change a UTF-8 alphabet to ASCII.
Please advise.
I have tried:
<?php
// utf-8
$str = "ＣＨＯＮＫＩＯＫ";
// I don't even know how to type these UTF-8 characters in PHP. I just copied and pasted the string.
// strlen($str) => 24 bytes
// mb_detect_encoding($str) => utf-8
$str2 = "CHONKIOK";
// strlen($str2) => 8 bytes
// mb_detect_encoding($str2) => ascii
// change ascii to utf-8
$str = mb_convert_encoding($str2, "UTF-8");
echo mb_detect_encoding($str);
// returns ascii
What you are doing is correct.
The documentation for mb_detect_encoding states that it detects the most likely character encoding.
As the entire ASCII set is contained within UTF-8 at the exact same character positions, this function is telling you that it's an ASCII string because it technically is. The bytes of this string are identical whether it is encoded as ASCII or as UTF-8.
As you've found, when you include some characters outside of the ASCII set then it will give you the next probable encoding.
What exactly should I do to obtain this string: "ＣＨＯＮＫＩＯＫ" from "CHONKIOK"?
The characters you're after are called "Fullwidth Latin" characters.
Given that the fullwidth C provided is character 65,315 and a regular C is character 67, you could possibly obtain the strings you're after by adding the difference of 65,248. This is only possible because the alphabet tends to repeat in the same order throughout different parts of the character charts.
You can get the code point of a character using mb_ord and convert it back to a character using mb_chr, after adding 65,248.
That might look something like:
$str_input = "ABC abc 123";
$convertable = "ABCDEFG12349abcdefg";
$str_output = "";
for ($i = 0; $i < strlen($str_input); $i++) {
    $char = mb_ord($str_input[$i], "UTF-8");
    // Shift only the characters listed in $convertable into the fullwidth range
    if (str_contains($convertable, $str_input[$i])) $char += 65248;
    $str_output .= mb_chr($char, "UTF-8");
}
echo $str_output; // outputs "ＡＢＣ ａｂｃ １２３"
Just be sure to include the whole alphabet in $convertable
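The offset approach above can be sketched without a hand-written character list, assuming that every printable ASCII character except the space (U+0021 through U+007E) should map onto the fullwidth block U+FF01 through U+FF5E. The function and constant names are hypothetical, and the sketch assumes valid UTF-8 input and PHP 7.4 or later with mbstring:

```php
<?php
// 65,248: the distance from "!" (U+0021) to its fullwidth form (U+FF01).
const FULLWIDTH_OFFSET = 0xFEE0;

// Hypothetical helper: ASCII -> fullwidth.
function to_fullwidth(string $text): string
{
    $out = '';
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $cp = mb_ord($char, 'UTF-8');
        // Printable ASCII except space maps onto U+FF01..U+FF5E.
        if ($cp >= 0x21 && $cp <= 0x7E) {
            $cp += FULLWIDTH_OFFSET;
        }
        $out .= mb_chr($cp, 'UTF-8');
    }
    return $out;
}

// Hypothetical helper: fullwidth -> ASCII.
function to_halfwidth(string $text): string
{
    $out = '';
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $cp = mb_ord($char, 'UTF-8');
        if ($cp >= 0xFF01 && $cp <= 0xFF5E) {
            $cp -= FULLWIDTH_OFFSET;
        }
        $out .= mb_chr($cp, 'UTF-8');
    }
    return $out;
}

$wide = to_fullwidth('CHONKIOK');
echo $wide, "\n";               // ＣＨＯＮＫＩＯＫ
echo strlen($wide), "\n";       // 24
echo to_halfwidth($wide), "\n"; // CHONKIOK
```

Each fullwidth character takes 3 bytes in UTF-8, which is why the 8-character string reports 24 bytes.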
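Try this to convert ISO-8859-1 (of which ASCII is a subset) to UTF-8:
utf8_encode(string $string): string
Try this to convert UTF-8 to ISO-8859-1 (ASCII characters pass through unchanged):
utf8_decode(string $string): string
Note that both functions are deprecated as of PHP 8.2.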

Prepare UTF-8 string for mysql

I got a string from an email subject
input
=?UTF-8?B?RndkOiDwn5G+IEZpbmFsIEhvdXJzIHRvIFNhdmUg8J+Rvg==?=
output (from the convert_non_ascii method)
Fwd: 👾 Final Hours to Save 👾
I want to insert it into mysql, but get an error
error
2015-12-06 11:11 SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x91\xBE F...' for column 'label' at ...
code
I use this code to prepare the string for mysql
private function convert_non_ascii($string){
    $return = '';
    if(preg_match('/^=\?(iso-8859-1|utf-8)\?q\?/i', $string)){
        $return = str_replace('_',' ', mb_decode_mimeheader($string));
    }
    elseif(preg_match('/^(iso-8859-1\'\')(.*)$/i', $string, $matches)){
        $return = utf8_encode(rawurldecode($matches[2]));
    }
    else{
        $return = imap_utf8($string);
    }
    // Fix: Remove all non UTF-8 characters if mail is not correctly encoded
    $return = mb_convert_encoding($return, 'UTF-8', 'UTF-8');
    return $return;
}
The column/table in MySQL must be declared CHARACTER SET utf8mb4. MySQL's utf8 won't suffice for Emoji.
The connection needs utf8mb4 as well: run SET NAMES utf8mb4.
The utf8 hex for 👾 is F09F91BE (4 bytes), and can be seen in the error message.
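A small sketch that decodes the base64 payload from the header in the question and checks whether the result contains characters that need utf8mb4. The helper name needs_utf8mb4 is hypothetical; only core PHP functions are used:

```php
<?php
// The base64 payload of the MIME header quoted in the question.
$payload = 'RndkOiDwn5G+IEZpbmFsIEhvdXJzIHRvIFNhdmUg8J+Rvg==';
$subject = base64_decode($payload);

// Hypothetical helper: true if the text contains a character above U+FFFF,
// which MySQL's 3-byte utf8 cannot store.
function needs_utf8mb4(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

echo $subject, "\n";
var_dump(needs_utf8mb4($subject));           // bool(true)
var_dump(needs_utf8mb4('Final Hours'));      // bool(false)
echo strtoupper(bin2hex("\u{1F47E}")), "\n"; // F09F91BE
```

A check like this could decide, before the insert, whether a value is safe for a column that is still declared as 3-byte utf8.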

PHP: UTF8_decode needed with filter for ASCII values 126-160; proposed solution

I previously began exploring this problem here. Here is the true problem, and a proposed solution:
Filenames with character values between 32 and 255 pose a problem for utf8_encode(). Specifically, it doesn't correctly handle the character values between 126 and 160 inclusive. While filenames with those characters may be written to a database, passing those filenames to a function in PHP code will produce error messages stating the file cannot be found, etc.
I discovered this when trying to pass a filename with the offending characters to getimagesize().
What is needed for utf8_encode() is a filter to EXCLUDE the conversion of the values between 126 and 160 inclusive, while INCLUDING the conversion of all other characters (or any character, characters, or character ranges of the user's desire; mine is for the ranges stated, for the reason provided).
The solution I devised requires two functions, listed below, and their application that follows:
// With thanks to Mark Baker for this function, posted elsewhere on StackOverflow
function _unichr($o) {
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding('&#'.intval($o).';', 'UTF-8', 'HTML-ENTITIES');
    } else {
        return chr(intval($o));
    }
}
// For each character where value is inclusively between 126 and 160,
// write out the _unichr of the character, else write out the UTF8_encode of the character
function smart_utf8_encode($source) {
    $text_array = str_split($source, 1);
    $new_string = '';
    foreach ($text_array as $character) {
        $value = ord($character);
        if ((126 <= $value) && ($value <= 160)) {
            $new_string .= _unichr($character);
        } else {
            $new_string .= utf8_encode($character);
        }
    }
    return $new_string;
}
$file_name = "abcdefghijklmnopqrstuvxyz~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ.jpg";
// This MUST be done first:
$file_name = iconv('UTF-8', 'WINDOWS-1252', $file_name);
// Next, smart_utf8_encode the variable (from encoding.inc.php):
$file_name = smart_utf8_encode($file_name);
// Now the file name may be passed to getimagesize(), etc.
$getimagesize = getimagesize($file_name);
If only PHP7 (6 is being skipped in the numbering, yes?) would include a filter on utf8_encode() to exclude certain character values, none of this would be necessary.
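The symptom described above is consistent with the file names being Windows-1252 rather than ISO-8859-1, which is the only source encoding utf8_encode() understands; that is an assumption, not something the post confirms. A minimal sketch of how the two interpretations differ for a single byte, assuming the mbstring extension is available:

```php
<?php
// Byte 0x80 is the euro sign in Windows-1252 but a C1 control
// character in ISO-8859-1.
$byte = "\x80";

// Treating the byte as ISO-8859-1 (what utf8_encode() assumes) gives U+0080.
echo bin2hex(mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1')), "\n";   // c280
// Treating it as Windows-1252 gives U+20AC, the euro sign.
echo bin2hex(mb_convert_encoding($byte, 'UTF-8', 'Windows-1252')), "\n"; // e282ac
```

If that assumption holds, naming the source encoding explicitly in the conversion may make a per-character filter unnecessary.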

PHP: UTF-8 character gets messy in function which takes the first letter from each word of a sentence

I have this function which when executed it returns the first letters of each word of a string.
function initials($stringsoftext) {
    $retturns = '';
    foreach (explode(' ', $stringsoftext) as $word)
        $retturns .= ($word[0]);
    return $retturns;
}
Everything works fine. The only problem is that when the words begin with special characters it starts to get messy.
For example "test økonomi" become "t�" instead of "tø"
How can I correct this?
That happens because $word[0] takes the first byte of a string, whereas you are using a multi-byte encoding, so a character may consist of multiple bytes. The ø character, for example, consists of 2 bytes: 0xC3 0xB8
That is how you would extract the first character instead:
mb_substr($word, 0, 1, 'utf8')
Working demo: http://ideone.com/XVnC87
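The difference between byte offsets and character offsets can be seen directly. A minimal sketch, assuming the mbstring extension is available; the word is written with an escape sequence so the encoding of the source file does not matter:

```php
<?php
$word = "\u{F8}konomi"; // "økonomi" as UTF-8

// String offsets index bytes, so this yields only the first half of "ø".
echo bin2hex($word[0]), "\n";                        // c3
// mb_substr() counts characters in the given encoding.
echo bin2hex(mb_substr($word, 0, 1, 'UTF-8')), "\n"; // c3b8
echo strlen($word), ' bytes, ', mb_strlen($word, 'UTF-8'), " characters\n";
// 8 bytes, 7 characters
```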
You should use mb_substr together with mb_internal_encoding, as in this example:
<?php
header('Content-Type: text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');
echo initials('ąęść óęłęł');
function initials($stringsoftext) {
    $retturns = '';
    foreach (explode(' ', $stringsoftext) as $word) {
        $retturns .= mb_substr($word,0,1);
    }
    return $retturns;
}
Complementing the answers above, you could convert each UTF-8 encoded character (to be precise, assumed to be UTF-8) to its ISO 8859-1 counterpart.
No multibyte support is required, which helps because the mbstring extension is not enabled by default in many PHP configurations.
Use utf8_decode() to do so:
<?php
function initials($stringsoftext) {
    $retturns = '';
    // After utf8_decode() every character is a single ISO 8859-1 byte
    foreach (explode(' ', utf8_decode($stringsoftext)) as $word)
        $retturns .= ($word[0]);
    return $retturns;
}
echo initials("test økonomi");
// returns tø (as ISO 8859-1 bytes)
?>
Edit: This approach could break if a character being converted is not defined in the ISO 8859-1 charset (e.g. non-Latin symbols). To reiterate, if PHP multibyte support is turned on, the mb_substr() solution is certainly the most appropriate, as it is able to properly process the string in UTF-8 encoding.

How to get substring of unicode characters from mysql using php

The Unicode characters are stored in mysql database in this format
یہاں تو
There are not only Unicode characters in my database but also HTML and English characters mixed in.
The Problem is I want to get a part of the string from database field 'post_body'
I have used the following sql query
"SELECT SUBSTRING(post_body,1,120) as pst_body from mytable";
This query gives me back exactly 120 characters. But the problem is that if there are Unicode symbols in the database, then ی should count as 1 Unicode character, so my requirement is not fulfilled this way.
Is there any function that can give me back my specified number of characters regardless of whether they are Unicode characters or English characters? In other words, if there is Unicode data it should count ی as one character.
I do not think there is an option for this in MySQL; you can fetch the data from MySQL and then do the work in PHP.
function getSubstring($string, $number){
    $keywords = preg_split("/([&])+/", htmlentities($string));
    $finalArray = array();
    unset($keywords[0]);
    for($index = 1;$index <= $number;$index++){
        $finalArray[] = $keywords[$index];
    }
    return str_replace('amp;', '&', implode('', $finalArray));
}
//$string = یہاں تو
//$number = 10;// number of character to be fetch
echo getSubstring($string,10);
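An alternative sketch, assuming the column stores the text as decimal numeric character references such as &#1740; (the question does not show the raw bytes, so this storage format is an assumption). It decodes the references first and then cuts by characters; the function name entity_aware_substr is hypothetical and mbstring is assumed to be available:

```php
<?php
// Hypothetical helper: decodes HTML entities, then cuts by characters.
function entity_aware_substr(string $html, int $length): string
{
    // Turn references such as "&#1740;" into real UTF-8 characters.
    $decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return mb_substr($decoded, 0, $length, 'UTF-8');
}

// Assumed storage format: decimal numeric character references mixed with ASCII.
$stored = '&#1740;&#1729;&#1575;&#1722; &#1578;&#1608; abc';
echo entity_aware_substr($stored, 4), "\n"; // یہاں
```

This sketch does not treat HTML tags specially, so tags in post_body would still count toward the length unless they are removed first.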
