Which char is upsetting iconv?

I have a database of hundreds of thousands of sentences which I translate from utf-8 using iconv.
In two of the sentences, I get the following error:
Notice: iconv(): Detected an illegal character in input string
I tried to check the input strings using the methods here: How to detect malformed utf-8 string in PHP?
$isUTF8 = mb_check_encoding($input, 'utf-8');
But, this function returns true (i.e. $input is a valid utf-8), and I still get those two error notices.
How can I detect which sentences are causing the problems?

It's very possible that your input is valid UTF-8, but that your target character encoding doesn't include one of the characters you're using. See utf8 to ISO-8859-1 not converting some characters correctly through Curl for an example.
If this is the case, you could perhaps use //TRANSLIT, if a substitution would be appropriate, or consider if it's possible to deliver content in UTF-8.
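To pin down the exact characters, one option is to round-trip each character through the target encoding and flag the ones that do not survive. The sketch below is only an illustration and not part of the original answer: the function name unconvertibleChars is made up, ISO-8859-1 is an assumed target (substitute your own), and it uses mbstring instead of iconv because mb_convert_encoding() substitutes unrepresentable characters rather than failing.

```php
<?php
// Hypothetical helper: list the characters of a UTF-8 string that cannot be
// represented in $target, keyed by their character offset.
function unconvertibleChars(string $utf8, string $target = 'ISO-8859-1'): array
{
    // Split into characters; preg_split() returns false on invalid UTF-8.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    $bad = [];
    foreach ($chars as $offset => $char) {
        // Convert to the target encoding and back again.
        $roundTrip = mb_convert_encoding(
            mb_convert_encoding($char, $target, 'UTF-8'),
            'UTF-8',
            $target
        );
        // A character that comes back changed has no slot in $target.
        if ($roundTrip !== $char) {
            $bad[$offset] = $char;
        }
    }
    return $bad;
}
```

Running this over every sentence and printing the non-empty results should point at the offending rows and characters.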

Related

How to convert binary string to normal string in php

Description of the problem
I am trying to import email content into a database table. Sometimes, I get an SQL error while inserting a message. I found that it fails when the message content is a binary string instead of a string.
For example, I get this in the console if I print a message that is imported successfully (truncated):
However, I get this with a problematic import:
I found out that if I use the function utf8_encode, I can import it into SQL successfully. The problem is that it "breaks" the accented characters of imports that previously succeeded:
What I have tried
Detect whether the string is binary with ctype_print: it returned false for both binary and non-binary strings. (Had it worked, I could have called utf8_encode only on binary strings.)
Use unpack: did not work
Detect the string encoding with mb_detect_encoding: it returned UTF-8 for both
Use iconv: failed with iconv(): Detected an illegal character in input string
Cast the content to a string using (string) / settype($html, 'string')
Question
How can I transform the binary string in a normal string so I can then import it in my database without breaking accented characters in other imports?

This is pretty late, but for anyone else reading... Apparently the b prefix is meaningless in PHP, it's a bit of a red herring. See: https://stackoverflow.com/a/51537602/6111743
What encodings did you pass to iconv()? This is the correct solution but you have to give it the correct first argument, which depends on your input. In my example I use "LATIN1" because that turned out to be the correct way to interpret my input but your use case may vary.
You can use mb_check_encoding() to check if it is valid UTF-8 or not. This returns a boolean.
Assuming the question is really something like "how to convert extended ascii string to valid utf-8 string in PHP" - Here is how I did it in my application:
// Only convert strings that are not already valid UTF-8
if (!mb_check_encoding($string, 'UTF-8')) {
    $string = iconv("LATIN1", "UTF-8//TRANSLIT//IGNORE", $string);
}
The "TRANSLIT" part tells it to attempt transliteration, that's optional for you. The "IGNORE" will prevent it from throwing Detected an illegal character in input string if it does detect one; instead the character will just get ignored, meaning, removed. Your use case may not need either of these.
When you're debugging, I recommend just using "UTF-8" as the second argument so you can see what it's doing. It's useful to see if it throws an error. For me, I had given it the wrong first argument at first (I wrote "ASCII" instead of "LATIN-1") and it threw the illegal character error on an accented character. That error went away once I passed it the correct encoding.
By the way, mb_detect_encoding() was no help to me in figuring out that Latin-1 was what I needed. What helped was dumping the contents of unpack("C*", $string) to see what exact bytes were in there. That's more debugging advice than solution but worth mentioning in case it helps.
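As a small illustration of that debugging step, the sketch below dumps the bytes of a string and shows why a Latin-1 string fails the UTF-8 check. The helper name hexBytes and the sample string are assumptions made for the example, not code from the answer.

```php
<?php
// Hypothetical helper: render every byte of a string as two hex digits.
function hexBytes(string $s): string
{
    $bytes = unpack('C*', $s); // 1-based array of integers 0..255
    return implode(' ', array_map(static function (int $b): string {
        return sprintf('%02x', $b);
    }, $bytes));
}

$latin1 = "caf\xE9";          // "café" in ISO-8859-1: é is the single byte E9
echo hexBytes($latin1), "\n"; // 63 61 66 e9

// A lone E9 byte is not valid UTF-8, so the check fails ...
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// ... and converting from the real source encoding repairs it.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
echo hexBytes($utf8), "\n";   // 63 61 66 c3 a9
```

Bytes in the range 80 to FF that appear on their own, without forming a UTF-8 multibyte sequence, are a strong hint that the input is in a single-byte encoding such as Latin-1.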

iconv with ASCII//TRANSLIT triggers ErrorException: "iconv(): Detected an illegal character in input string"

First of all, I have to say that I am new to multilingual conversions.
I have strings that I want to lowercase in UTF-8 form if possible (something like a clean URL), and I use
$str = iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($str));
$str = preg_replace("/[^a-zA-Z0-9_]/", "", $str);
$str = mb_strtolower($str);
to achieve my requirements (a UTF-8, lowercase string)
However, when I stress that function with "çokGüŞelLl" using CocoaRestClient; I get Ã as $str (thanks to my client?) and iconv triggers an error complaining about an illegal character in input string (Ã).
What is the problem with iconv? the str is encoded as utf8 by utf8_encode($str) already. How can it be an illegal character?
Notes:
I read about #iconv questions here, but I think it is not a good solution to have empty database entries.
Thanks to all answers, I will read and try to understand each of them.

The PHP function utf8_encode() expects your string to be ISO-8859-1 encoded. If it isn’t, well, you get funny results.
Ensure that your data is proper UTF-8 before saving it to your database:
// Validate that the input string is valid UTF-8
if (preg_match("//u", $string) === false) {
    throw new \InvalidArgumentException("String contains invalid UTF-8 characters.");
}
// Normalize to Unicode NFC form (recommended by W3C)
$string = \Normalizer::normalize($string);
Now everything is stored the same way in our database and we don't have to care about this problem anymore when receiving data from our database.
$string = $database->getSomeRecordWithUnicode();
echo mb_strtolower($string);
Done!
PS: If you want to ensure that your database is using the exact same encoding as PHP either use utf8mb4 as character set (and utf8mb4_unicode_ci as default collation for perfect sorting) or a BLOB (binary) data type.
PPS: Use your database configuration file to force proper encoding of all strings instead of using e.g. $mysqli->set_charset("utf8") or similar.
About HTML forms
Because you asked in the comments of your question. How data is sent to your server has nothing to do with the locale the user has set in his operating system. It has to do with the client's browser. All modern browsers default to utf-8 when sending form data. If you are afraid that some of your clients might be using totally broken browsers, simply tell them that you only accept utf-8. Drupal is doing that on all their forms.
<!doctype html>
<html>
<body>
<form accept-charset="UTF-8">
Now all browsers should encode the data they submit in utf-8.

If you encode çokGüŞelLl as UTF-8 you should get the following bytes:
var_dump( bin2hex('çokGüŞelLl') );
string(26) "c3a76f6b47c3bcc59e656c4c6c"
That's a check you must do. You also have this:
utf8_encode($str)
Your string contains Ş, which cannot be represented in ISO-8859-1 to begin with.
So, whatever reason you have to convert your original UTF-8 (as stored in DB) to ISO-8859-1, I'm afraid that it's corrupting your data.

You're double encoding. First you set your database to UTF-8, which means the data you read from it is already UTF-8 encoded. Then you pass it through utf8_encode() before handing it to iconv(), which encodes it a second time. Try removing the utf8_encode() call from the iconv() line.
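The effect can be seen on a single character. This sketch uses mb_convert_encoding() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode() (which performs the same conversion but is deprecated as of PHP 8.2); the variable names are only illustrative.

```php
<?php
$utf8 = "\xC3\xA7"; // "ç" already encoded as UTF-8 (bytes C3 A7)

// Same conversion utf8_encode() performs: treat each byte as ISO-8859-1.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a7
echo bin2hex($double), "\n"; // c383c2a7, which displays as "Ã§"

// Converting back to ISO-8859-1 strips the extra layer again.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $utf8); // bool(true)
```

This is where a stray "Ã" typically comes from: the first byte of a two-byte UTF-8 sequence has been treated as a Latin-1 character of its own.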

I'm trying to make sure my string has only valid UTF-8 characters in PHP. How can I do that? [duplicate]

I have a string that has an invalid character in it (it's not UTF-8) such as the following displaying SUB:
I think it's some kind of foreign invalid character.
Is there a way in PHP to take a string and use preg_replace or something else to ensure that I am only using valid UTF-8 characters in my strings, and anything else just gets removed?
Thanks.

First of all, there are no invalid UTF-8 characters. There are invalid UTF-8 bytes and byte sequences, which can mean that someone is attempting an encoding attack on your server. You can validate incoming data with mb_check_encoding and fail immediately with 400 Bad Request if it is not valid UTF-8.
What you have is just the SUBSTITUTE control character, a valid character but unprintable.
Originally intended for use as a transmission control character to
indicate that garbled or invalid characters had been received. It has
often been put to use for other purposes when the in-band signaling of
errors it provides is unneeded, especially where robust methods of
error detection and correction are used, or where errors are expected
to be rare enough to make using the character for other purposes
advisable.
You can use this regex to get rid of it (and a few others):
$reg = '/(?![\r\n\t])[\p{Cc}]/u';
$str = preg_replace($reg, "", $str);
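Putting the two steps together, validation first and then removal of control characters, might look like the following sketch. The function name stripControlChars and the sample input are assumptions made for the example.

```php
<?php
// Hypothetical helper: reject invalid UTF-8, then drop control characters
// other than carriage return, line feed and tab.
function stripControlChars(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return preg_replace('/(?![\r\n\t])\p{Cc}/u', '', $input);
}

$dirty = "foo\x1Abar\tbaz\n"; // \x1A is the SUBSTITUTE control character
$clean = stripControlChars($dirty);
var_dump($clean === "foobar\tbaz\n"); // bool(true)
```

The negative lookahead keeps \r, \n and \t, so ordinary line breaks and tabs in the text are left alone.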

The mb_check_encoding function should be able to do this.
mb_check_encoding("Jetzt gibts mehr Kanonen", "UTF-8");
Note: I haven't tested this.

PHP htmlspecialchars() function error when trying to use UTF-8 string

I did the following things:
I have a spreadsheet with data. One of the rows has a ü character in it.
I save this as a CSV file in OpenOffice.org. When it asks me for a character encoding, I choose UTF-8.
I use Navicat to create a MySQL database table, InnoDB with UTF-8 utf8_general encoding and import the CSV.
I try to use PHP function htmlspecialchars($string, ENT_COMPAT, 'UTF-8') where $string is the string containing the special ü character.
It gives me an error: Invalid multibyte sequence in argument. When I change 'UTF-8' with 'ISO8859-1', no error is thrown, but the incorrect character is shown. (The 'unknown character' character, looks like <?>)
If I use an HTML form to update the string in the database, the error disappears and the character is displayed correctly. However, when I then look at the record in Navicat, it looks like two characters:
[1/4][A with some thing on top of it]
Some multibyte that isn't seen as one character.
What is going on, where are things going wrong, and what can I do about it?

Although I don't understand where the "invalid multibyte" error comes from, I'm pretty sure htmlspecialchars() is not your culprit:
For the purposes of this function, the charsets ISO-8859-1, ISO-8859-15, UTF-8, cp866, cp1251, cp1252, and KOI8-R are effectively equivalent, as the characters affected by htmlspecialchars() occupy the same positions in all of these charsets.
In my understanding, htmlspecialchars() should work fine for a UTF-8 string without specifying a character set. My bet would be that either the HTML page containing the form, or the database connection you use is not UTF-8 encoded. For the latter, try sending a
SET NAMES utf8;
to MySQL before doing the insert.
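One plausible cause of that warning is that the string reaching htmlspecialchars() is not actually UTF-8, for example a ü stored as the single Latin-1 byte FC. This is an assumption, since the question does not show the bytes. The sketch below shows how such a string can be checked and how current PHP versions behave in that case, where an empty string is returned instead of a warning; the variable names are illustrative.

```php
<?php
$latin1 = "\xFC";     // "ü" in ISO-8859-1: one byte, invalid as UTF-8
$utf8   = "\xC3\xBC"; // "ü" in UTF-8: two bytes

var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)

// Invalid input: an empty string comes back instead of escaped text.
$broken = htmlspecialchars($latin1 . ' <b>', ENT_COMPAT, 'UTF-8');

// ENT_SUBSTITUTE swaps the bad byte for U+FFFD rather than dropping everything.
$substituted = htmlspecialchars($latin1 . ' <b>', ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8');

// Valid input is escaped as expected.
$ok = htmlspecialchars($utf8 . ' <b>', ENT_COMPAT, 'UTF-8');
```

If mb_check_encoding() returns false for the value read from the database, the problem lies in the import or in the connection charset rather than in htmlspecialchars().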

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medÃºlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt

To transform a UTF-8 string into a URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $str);
The IGNORE part tells iconv() not to fail with a notice when it meets a character it can't convert, and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
The next step is to preg_replace() spaces into underscores and substitute or drop any character that is unsafe within a URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing the charset of an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
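The steps described above can be sketched as a single function. The name slugify and the exact set of allowed characters are assumptions made for the example. Note that what TRANSLIT produces for a given character depends on the iconv implementation and the current locale, so the result for accented input may differ between systems.

```php
<?php
// Hypothetical helper: turn a UTF-8 string into a lowercase ASCII slug.
function slugify(string $utf8): string
{
    // Transliterate to ASCII; @ silences the notice for unconvertible input.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $utf8);
    if ($ascii === false) {
        $ascii = ''; // conversion failed entirely
    }

    $ascii = strtolower(trim($ascii));
    // Collapse runs of whitespace into a single underscore.
    $ascii = preg_replace('/\s+/', '_', $ascii);
    // Drop everything that is not a letter, digit, underscore or hyphen.
    return preg_replace('/[^a-z0-9_-]/', '', $ascii);
}

echo slugify('Hello World'), "\n"; // hello_world
```

Whatever the platform does with the accented characters, the final preg_replace() ensures that only URL-safe ASCII is left.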

I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8-encode and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
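A short illustration of the point above, with the UTF-8 bytes written out explicitly so the example does not depend on the encoding of the source file (the variable names are illustrative):

```php
<?php
$title = "Med\xC3\xBAlla"; // "Medúlla" as UTF-8 bytes

$url = 'http://en.wikipedia.org/wiki/' . rawurlencode($title);
echo $url, "\n"; // http://en.wikipedia.org/wiki/Med%C3%BAlla

// The two functions differ on spaces: %20 for paths, + for query strings.
echo rawurlencode('a b'), "\n"; // a%20b
echo urlencode('a b'), "\n";    // a+b
```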
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1 (the default before PHP 5.4), and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" (in double quotes, since PHP does not interpret \x escapes in single-quoted strings) to avoid the problem.
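A minimal sketch of the byte-escape approach follows. The quoting matters: PHP only interprets \x escapes inside double-quoted strings. The variable name is illustrative.

```php
<?php
$name = "med\xC3\xBAlla"; // "medúlla" as UTF-8, independent of the file's encoding

// Double quotes: \xC3\xBA becomes the two bytes of "ú".
echo str_replace("\xC3\xBA", 'u', $name), "\n"; // medulla

// Single quotes: the escapes stay as eight literal characters, so nothing matches.
var_dump(str_replace('\xC3\xBA', 'u', $name) === $name); // bool(true)
var_dump(strlen('\xC3\xBA')); // int(8)
```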
Running SET NAMES utf8 on the database before querying
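Use mysql_set_charset() in preference (or mysqli_set_charset() with the mysqli extension, since the old mysql extension was removed in PHP 7).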
