Corrupted data using UTF-8 and mb_substr - php

I'm getting data from a MySQL database (a varchar(255) utf8_general_ci field) and trying to write the text to a PDF with PHP. I need to determine the string length in the PDF to limit the output of the text in a table. But I noticed that the output of mb_substr/substr is really strange.
For example:
mb_internal_encoding("UTF-8");
$_tmpStr = $vfrow['title'];
$_tmpStrLen = mb_strlen($vfrow['title']);
for($i=$_tmpStrLen; $i >= 0; $i--){
file_put_contents('cutoffattributes.txt',$vfrow['field']." ".$_tmpStr."\n",FILE_APPEND);
file_put_contents('cutoffattributes.txt',$vfrow['field']." ".mb_substr($_tmpStr, 0, $i)."\n",FILE_APPEND);
}
outputs this:
(linked screenshot of the output file in Notepad++, not preserved)
Database:
(screenshot not preserved)
My question is where does the extra character come from?

You need to ensure you're actually getting the data from the database in UTF-8 encoding by setting your connection encoding appropriately. This depends on your database adapter, see UTF-8 all the way through for details.
You need to tell your mb_ functions that the data is in UTF-8 so they can treat it correctly. Either set this globally for all functions using mb_internal_encoding, or pass the $encoding parameter to your function when you call it:
mb_substr($_tmpStr, 0, $i, 'UTF-8')
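To make the difference concrete, here is a minimal sketch comparing a byte-oriented cut with a character-oriented one. The sample value "Gdańsk" and the variable names are made up for illustration; they are not from the question.

```php
<?php
// Hypothetical sample value; any UTF-8 string with a multi-byte character works.
$title = "Gda\xC5\x84sk"; // "Gdańsk", with "ń" written as its UTF-8 bytes C5 84

// Byte-oriented: cuts after 4 bytes, i.e. in the middle of the 2-byte "ń".
$byteCut = substr($title, 0, 4);

// Character-oriented: cuts after 4 characters, encoding passed explicitly.
$charCut = mb_substr($title, 0, 4, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false): dangling lead byte
var_dump(bin2hex($charCut));                    // string(10) "476461c584"
```

The dangling C5 byte left behind by substr() is the kind of "extra character" a text editor shows at the end of a truncated line.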

The extra character is the first byte of a two-byte UTF-8 sequence. You may have a problem with the internal encoding of the Multibyte String functions: your code treats the text as a fixed-width, 1-byte encoding. The ń in UTF-8, hex C5 84, is read as Ĺ„ in CP-1250 and as Ĺ[IND] in ISO-8859-2, i.e. as two characters.
Try calling this at the top of the script:
mb_internal_encoding("UTF-8");
http://php.net/manual/en/function.mb-internal-encoding.php
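The byte-level picture described above can be checked with a short sketch. The variable names are illustrative, and the ISO-8859-2 reinterpretation is only there to show how one character turns into two.

```php
<?php
$n = "\xC5\x84"; // the UTF-8 bytes of "ń"

echo bin2hex($n), "\n";                 // c584
echo strlen($n), "\n";                  // 2 (bytes)
echo mb_strlen($n, 'UTF-8'), "\n";      // 1 (character)
echo mb_strlen($n, 'ISO-8859-2'), "\n"; // 2: each byte counted as its own character

// Reinterpreting the same two bytes as ISO-8859-2 yields "Ĺ" plus a second character.
$misread = mb_convert_encoding($n, 'UTF-8', 'ISO-8859-2');
$first = mb_substr($misread, 0, 1, 'UTF-8');
echo bin2hex($first), "\n";             // c4b9, the UTF-8 bytes of "Ĺ"
```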

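Aside from the table and field being set to UTF-8, you also need to set the connection charset, e.g. mysqli_set_charset($link, 'utf8') if you are using mysqli (MySQL's name for the charset is utf8 or utf8mb4, not UTF-8; $link stands for your connection).
Also, did you try the following? It only helps if the data actually arrives as ISO-8859-1; on data that is already UTF-8 it double-encodes.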
$_tmpStr = utf8_encode( $vfrow['title'] );
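Because utf8_encode() assumes ISO-8859-1 input, a guard in front of the conversion avoids re-encoding data that is already UTF-8. This is only a sketch: ensureUtf8() is a hypothetical helper, and treating every non-UTF-8 value as ISO-8859-1 is an assumption that will not hold for all data.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the input is not valid UTF-8 already.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
function ensureUtf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already UTF-8; converting again would double-encode it
    }
    // Same mapping utf8_encode() performs: each byte becomes one ISO-8859-1 character.
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(ensureUtf8("\xC5\x84")), "\n"; // c584: valid UTF-8 is left untouched
echo bin2hex(ensureUtf8("\xF1")), "\n";     // c3b1: ISO-8859-1 "ñ" converted to UTF-8
```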

Related

mb_detect_encoding returns both ASCII and UTF8 [duplicate]

I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method by looking for Latin1 values, and if none occurred I would let mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encodings don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specify the source encoding explicitly:
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));
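A sketch of how detection and conversion might be combined, using strict mode so that invalid byte sequences rule a candidate out. The helper name toUtf8() and the candidate list are assumptions; the list should reflect the encodings your IPTC data can actually use. Note that a valid UTF-8 string is also a valid ISO-8859-1 string, which is why UTF-8 is listed before it.

```php
<?php
// Hypothetical helper around mb_detect_encoding(); the candidate list is an assumption.
function toUtf8(string $val, array $candidates = ['ASCII', 'UTF-8', 'ISO-8859-1']): string
{
    // Third argument true = strict mode: candidates with invalid sequences are rejected.
    $from = mb_detect_encoding($val, $candidates, true);
    if ($from === false) {
        throw new RuntimeException('None of the candidate encodings matched.');
    }
    return mb_convert_encoding($val, 'UTF-8', $from);
}

// "café" with "é" as the single ISO-8859-1 byte E9: not ASCII, not valid UTF-8.
echo bin2hex(toUtf8("caf\xE9")), "\n"; // 636166c3a9
```

Detection remains a guess for short or ambiguous input, so passing the source encoding explicitly is preferable whenever it is known.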

iconv with ASCII//TRANSLIT triggers ErrorException: "iconv(): Detected an illegal character in input string"

First of all, I should say that I am new to multilingual conversions.
I have strings that I want to lowercase in UTF-8 form if possible (something like a clean URL), and I use
$str = iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($str));
$str = preg_replace("/[^a-zA-Z0-9_]/", "", $str);
$str = mb_strtolower($str);
to achieve this (a lowercase UTF-8 string).
However, when I stress that function with "çokGüŞelLl" using CocoaRestClient, I get à as $str (thanks to my client?) and iconv triggers an error complaining about an illegal character in the input string (Ã).
What is the problem with iconv? The string is already encoded as UTF-8 by utf8_encode($str). How can it contain an illegal character?
Notes:
I read about #iconv questions here, but I think it is not a good solution to have empty database entries.
Thanks to all answers, I will read and try to understand each of them.
The PHP function utf8_encode() expects your string to be ISO-8859-1 encoded. If it isn’t, well, you get funny results.
Ensure that your data is proper UTF-8 before saving it to your database:
// Validate that the input string is valid UTF-8
if (preg_match("//u", $string) === false) {
throw new \InvalidArgumentException("String contains invalid UTF-8 characters.");
}
// Normalize to Unicode NFC form (recommended by W3C)
$string = \Normalizer::normalize($string);
Now everything is stored the same way in our database and we don't have to care about this problem anymore when receiving data from our database.
$string = $database->getSomeRecordWithUnicode();
echo mb_strtolower($string);
Done!
PS: If you want to ensure that your database is using the exact same encoding as PHP either use utf8mb4 as character set (and utf8mb4_unicode_ci as default collation for perfect sorting) or a BLOB (binary) data type.
PPS: Use your database configuration file to force proper encoding of all strings instead of using e.g. $mysqli->set_charset("utf8") or similar.
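Applied to the clean-URL goal from the question, the validate-first approach could look like the sketch below. cleanSlug() is a hypothetical helper, it deliberately drops utf8_encode(), and what //TRANSLIT produces for a given character depends on the iconv implementation and locale, so no particular transliteration is promised here.

```php
<?php
// Hypothetical helper sketching the asker's clean-URL goal without utf8_encode().
function cleanSlug(string $str): ?string
{
    // Reject input that is not valid UTF-8 instead of guessing its encoding.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return null;
    }
    // Transliteration results depend on the iconv implementation and locale.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $str);
    if ($ascii === false) {
        return null;
    }
    // Keep only URL-safe ASCII characters, then lowercase (plain ASCII by now).
    $ascii = preg_replace('/[^a-zA-Z0-9_]/', '', $ascii);
    return strtolower($ascii);
}

var_dump(cleanSlug("Hello World_1")); // string(12) "helloworld_1"
var_dump(cleanSlug("\xE7ok"));        // NULL: a lone E7 byte is not valid UTF-8
```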
About HTML forms
Since you asked in the comments of your question: how data is sent to your server has nothing to do with the locale the user has set in their operating system. It has to do with the client's browser. Modern browsers submit form data in the encoding of the page, so a page served as UTF-8 gets UTF-8 back. If you are afraid that some of your clients might be using totally broken browsers, simply tell them that you only accept UTF-8. Drupal is doing that on all their forms.
<!doctype html>
<html>
<body>
<form accept-charset="UTF-8">
Now all browsers should encode the data they submit in utf-8.
If you encode çokGüŞelLl as UTF-8 you should get the following bytes:
var_dump( bin2hex('çokGüŞelLl') );
string(26) "c3a76f6b47c3bcc59e656c4c6c"
That's a check you must do. You also have this:
utf8_encode($str)
Your string contains Ş, which cannot be represented in ISO-8859-1 to begin with.
So, whatever reason you have to convert your original UTF-8 (as stored in DB) to ISO-8859-1, I'm afraid that it's corrupting your data.
You're double encoding. First you set your database to UTF-8, which means your data is now UTF-8 encoded. Then you call utf8_encode() on the input to iconv(), but that input is already UTF-8. Try removing the utf8_encode() call from the iconv() line.
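A small sketch of what double encoding does to a single character. It uses mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode(), which performs the same byte-to-character mapping; the variable names are illustrative.

```php
<?php
$utf8 = "\xC3\xA7"; // "ç" as UTF-8

// Stand-in for utf8_encode(): treat every byte as one ISO-8859-1 character.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a7
echo bin2hex($double), "\n"; // c383c2a7, which displays as "Ã§"

// The damage is reversible as long as nothing else touched the string.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $utf8); // bool(true)
```

This is where the Ã in the question comes from: it is the first byte of ç, re-encoded as a character of its own.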

PHP string array UTF-8 encoding fails

Everything is set to UTF-8 (file encoding, MySQL [however I don't use it], Apache, meta, mbstring etc...) but check this out:
$s="áéőúöüóűí";
echo $s; //works perfectly
echo $s[0]; // doesn't work. Prints out a single '?'.
I have tried almost everything. Any ideas? Thanks in advance!
This is absolutely correct behavior.
If you want the first letter of a multi-byte string, rather than the first byte of a binary string, you have to use mb_substr():
mb_internal_encoding("UTF-8");
echo mb_substr($s,0,1);
You should use mb_* functions for multibyte strings. mb_substr() in your case.
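For illustration, a sketch contrasting byte indexing with character access on the string from the question. It assumes the source file is saved as UTF-8 with precomposed characters; splitting with preg_split() and the /u modifier is one possible way to get an array of characters.

```php
<?php
$s = "áéőúöüóűí"; // assumes this file is saved as UTF-8

echo bin2hex($s[0]), "\n";               // c3: only the first byte of "á"
echo mb_substr($s, 0, 1, 'UTF-8'), "\n"; // á

// Split into characters so that array-style access works per character.
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
echo count($chars), "\n";                // 9
echo $chars[2], "\n";                    // ő
```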
And if you define $s[0]="á", does it work? I believe that when encoded in UTF-8, those special characters are stored as two bytes each.
If some UTF-8 text is displayed as ANSI, it is rendered like this:
Ã¡Ã©Å‘ÃºÃ¶Ã¼Ã³Å±Ã­
You see that á becomes Ã¡.
So rendering the first char ($s[0]) would only display the "Ã", which is an incomplete character.
You have to make some changes in the database. Go to the table structure, where you will find a "Collation" column.
For the column you want to change, click Edit in the right-hand menu.
The default collation is 'latin1_general_ci'; change it to 'utf8_general_ci'.

Read ansi file and convert to UTF-8 string

Is there any way to do that with PHP?
The data to be inserted looks fine when I print it out.
But when I insert it in the database the field becomes empty.
$tmp = iconv('YOUR CURRENT CHARSET', 'UTF-8', $string);
or
$tmp = utf8_encode($string);
The strange thing is that you end up with an empty string in your DB. I could understand ending up with some garbage in your DB, but nothing at all (an empty string) is strange.
I just typed this in my console:
iconv -l | grep -i ansi
It showed me:
ANSI_X3.4-1968
ANSI_X3.4-1986
ANSI_X3.4
ANSI_X3.110-1983
ANSI_X3.110
MS-ANSI
These are possible values for YOUR CURRENT CHARSET
As pointed out before, when your input string contains only characters that are already valid UTF-8, you don't need to convert anything.
Change UTF-8 to UTF-8//TRANSLIT if you don't want to omit characters but replace them with a look-alike (when they cannot be represented in the target charset).
"ANSI" is not really a charset. It's a short way of saying "whatever charset is the default in the computer that creates the data". So you have a double task:
1. Find out what charset the data is using.
2. Use an appropriate function to convert it into UTF-8.
For #2, I'm normally happy with iconv() but utf8_encode() can also do the job if source data happens to use ISO-8859-1.
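The two steps might be sketched as follows. readAnsiFileAsUtf8() is a hypothetical helper, and the Windows-1252 default is an assumption standing in for step 1: it is the usual "ANSI" code page on Western European Windows systems, but your files may use a different one.

```php
<?php
// Hypothetical helper; Windows-1252 as the "ANSI" code page is an assumption.
function readAnsiFileAsUtf8(string $path, string $sourceCharset = 'Windows-1252'): string
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return mb_convert_encoding($raw, 'UTF-8', $sourceCharset);
}

// Usage with a throwaway file containing "café" as Windows-1252 bytes.
$tmp = tempnam(sys_get_temp_dir(), 'ansi');
file_put_contents($tmp, "caf\xE9");
$text = readAnsiFileAsUtf8($tmp);
unlink($tmp);

echo bin2hex($text), "\n"; // 636166c3a9
```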
Update
It looks like you don't know what charset your data is using. In some cases, you can figure it out if you know the country and language of the user (e.g., Spain/Spanish) through the default encoding used by Microsoft Windows in such territory.
Be careful: iconv() can return false if the conversion fails.
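One way to avoid silently inserting an empty value is to check the return value before using it. iconvOrFail() below is a hypothetical wrapper, shown as a sketch rather than a complete error-handling strategy.

```php
<?php
// Hypothetical wrapper: turn iconv()'s false return value into an exception.
function iconvOrFail(string $from, string $to, string $string): string
{
    // @ suppresses the notice iconv() raises on failure; the result is checked instead.
    $converted = @iconv($from, $to, $string);
    if ($converted === false) {
        throw new RuntimeException("Conversion from $from to $to failed.");
    }
    return $converted;
}

$ok = iconvOrFail('ISO-8859-1', 'UTF-8', "\xE9"); // ISO-8859-1 "é"
echo bin2hex($ok), "\n";                          // c3a9
```

Without such a check, false ends up cast to an empty string on insert, which would explain an empty database field.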
I am also having a somewhat similar problem: some Chinese characters are mistaken for \n if the file is encoded in UNICODE, but not if it is UTF-8.
To get back to your problem, make sure the encoding of your file is the same as that of your database. Also, using utf8_encode() on text that is already UTF-8 can have unpleasant results. Try using mb_detect_encoding() to see the encoding of the file, but unfortunately this doesn't always work. There is no easy fix for character encoding from what I can see :(
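The newline confusion mentioned above is what one would expect if "UNICODE" means UTF-16, which is an assumption here (it is how Windows editors usually label UTF-16LE). In UTF-16 a character's code units can contain the byte 0A, whereas UTF-8 never uses that byte inside a multi-byte sequence. A sketch with one such character:

```php
<?php
$char = "\u{4E0A}"; // 上 (U+4E0A), written as UTF-8 by the \u escape

$utf16 = mb_convert_encoding($char, 'UTF-16LE', 'UTF-8');

echo bin2hex($char), "\n";  // e4b88a: no 0a byte in the UTF-8 form
echo bin2hex($utf16), "\n"; // 0a4e: the low byte equals "\n"

var_dump(strpos($utf16, "\n") !== false); // bool(true)
var_dump(strpos($char, "\n") !== false);  // bool(false)
```

Byte-oriented line handling such as explode("\n", ...) would therefore split a UTF-16 file in the middle of this character.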
