excel-reader utf problem

Can you help me with the problem described below?
I am getting the error iconv() [function.iconv]: Detected an illegal character in input string when I read an xls file and convert it to csv. I am using php-excel-library, and the error occurs at this line:
case 'iconv' :
$result = iconv('UTF-16LE', $this->_defaultEncoding, $string);
break;
Does anybody know how to fix it?

It means your file contains a byte sequence that cannot be mapped onto a character because it is meaningless in the source encoding, and therefore cannot be converted to another encoding.
Use the //IGNORE flag as described here.

Be sure the file is UTF-16LE encoded as trying to import the wrong format will throw those kinds of errors.
Use PHP 5.3 or greater, as PHP < 5.3 cannot handle UTF-16 encoding (check the notes here: link)
Instead of //IGNORE you may want to try //TRANSLIT, which makes iconv convert the character to one that is compatible with your default encoding (e.g. convert MS "fancy quotes" to regular single/double quotes in ASCII).
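
The suggestions above can be sketched roughly as follows. This is only an illustration: the helper name utf16le_to_utf8 is hypothetical and not part of the Excel reader library, and UTF-8 is assumed as the target encoding. It tries a strict conversion first and only then falls back to //IGNORE, which on some platforms can still fail, hence the null return.

```php
<?php
// Hypothetical helper (not part of the Excel reader library).
// Assumes the input is meant to be UTF-16LE and the target is UTF-8.
function utf16le_to_utf8(string $bytes): ?string
{
    // Strict pass: iconv returns false and raises a notice on illegal input,
    // so the notice is suppressed and the return value is checked instead.
    $strict = @iconv('UTF-16LE', 'UTF-8', $bytes);
    if ($strict !== false) {
        return $strict;
    }

    // Lenient pass: ask iconv to skip what it cannot convert.
    // Behaviour of //IGNORE varies between iconv implementations.
    $lenient = @iconv('UTF-16LE', 'UTF-8//IGNORE', $bytes);

    return $lenient === false ? null : $lenient;
}

$converted = utf16le_to_utf8("h\x00i\x00"); // "hi" encoded as UTF-16LE
```

Checking for a null result makes it obvious when the file was not UTF-16LE in the first place, which is the case the second answer warns about.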

How to convert binary string to normal string in php

Description of the problem
I am trying to import email content into a database table. Sometimes, I get an SQL error while inserting a message. I found that it fails when the message content is a binary string instead of a string.
For example, I get this in the console if I print a message that is imported successfully (truncated):
However, I get this with problematic import:
I found out that if I use the function utf8_encode, I am able to import it into SQL successfully. The problem is that it "breaks" the accented characters in imports that previously worked:
What I have tried
Detect whether the string was a binary string with ctype_print: it returned false for both the non-binary and the binary string. Had it worked, I could have called utf8_encode only on binary strings
Use unpack: did not work
Detect the string encoding with mb_detect_encoding: it returns UTF-8 for both
Use iconv: failed with iconv(): Detected an illegal character in input string
Cast the content to string using (string) / settype($html, 'string')
Question
How can I transform the binary string in a normal string so I can then import it in my database without breaking accented characters in other imports?

This is pretty late, but for anyone else reading... Apparently the b prefix is meaningless in PHP, it's a bit of a red herring. See: https://stackoverflow.com/a/51537602/6111743
What encodings did you pass to iconv()? This is the correct solution but you have to give it the correct first argument, which depends on your input. In my example I use "LATIN1" because that turned out to be the correct way to interpret my input but your use case may vary.
You can use mb_check_encoding() to check if it is valid UTF-8 or not. This returns a boolean.
Assuming the question is really something like "how to convert extended ascii string to valid utf-8 string in PHP" - Here is how I did it in my application:
if (!mb_check_encoding($string)) {
    $string = iconv("LATIN1", "UTF-8//TRANSLIT//IGNORE", $string);
}
The "TRANSLIT" part tells it to attempt transliteration, that's optional for you. The "IGNORE" will prevent it from throwing Detected an illegal character in input string if it does detect one; instead the character will just get ignored, meaning, removed. Your use case may not need either of these.
When you're debugging, I recommend just using "UTF-8" as the second argument so you can see what it's doing. It's useful to see whether it throws an error. In my case, I had given it the wrong first argument at first (I wrote "ASCII" instead of "LATIN1") and it threw the illegal character error on an accented character. That error went away once I passed it the correct encoding.
By the way, mb_detect_encoding() was no help to me in figuring out that Latin-1 was what I needed. What helped was dumping the contents of unpack("C*", $string) to see what exact bytes were in there. That's more debugging advice than solution but worth mentioning in case it helps.
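
A small sketch of the approach described in this answer, written with an explicit 'UTF-8' check and no //TRANSLIT or //IGNORE so that a wrong source encoding surfaces as an error. The function names ensure_utf8 and hex_bytes are made up for illustration, and ISO-8859-1 (Latin-1) as the fallback source encoding is an assumption that has to be verified against your own data.

```php
<?php
// Hypothetical helper: returns valid UTF-8.
// Assumption: anything that is not valid UTF-8 is Latin-1 (ISO-8859-1).
function ensure_utf8(string $s, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave it untouched
    }

    $converted = iconv($assumedSource, 'UTF-8', $s);
    if ($converted === false) {
        throw new RuntimeException("Input is not valid $assumedSource");
    }

    return $converted;
}

// Debugging aid in the spirit of unpack("C*", $string):
// shows the raw bytes as space-separated hex pairs.
function hex_bytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

$fixed = ensure_utf8("caf\xE9");      // "café" stored as Latin-1
$bytes = hex_bytes($fixed);           // inspect what actually came out
```

Because already-valid UTF-8 is returned unchanged, strings that imported correctly before are not double-encoded the way they are with an unconditional utf8_encode.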

Which char is upsetting iconv?

I have a database of hundreds of thousands of sentences which I translate from utf-8 using iconv.
In two of the sentences, I get the following error:
Notice: iconv(): Detected an illegal character in input string
I tried to check the input strings using the methods here: How to detect malformed utf-8 string in PHP?
$isUTF8 = mb_check_encoding($input, 'utf-8');
But, this function returns true (i.e. $input is a valid utf-8), and I still get those two error notices.
How can I detect which sentences are causing the problems?

It's very possible that your input is valid UTF-8, but that your target character encoding doesn't include one of the characters you're using. See utf8 to ISO-8859-1 not converting some characters correctly through Curl for an example.
If this is the case, you could perhaps use //TRANSLIT, if a substitution would be appropriate, or consider if it's possible to deliver content in UTF-8.
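
One way to find the offending characters is a round trip through the target encoding. The sketch below does not use iconv; it swaps in mbstring, which substitutes a placeholder for characters the target cannot represent instead of raising a notice. The function name unmappable_chars is hypothetical, ISO-8859-1 is only an assumed target (the question does not say which one is used), and mb_str_split requires PHP 7.4 or later.

```php
<?php
// Hypothetical helper: lists the characters of a valid UTF-8 string that
// have no representation in the target encoding.
// Assumption: ISO-8859-1 as the default target.
function unmappable_chars(string $utf8, string $target = 'ISO-8859-1'): array
{
    $bad = [];
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        // Convert to the target and back; a character that does not
        // survive the round trip cannot be represented in the target.
        $there = mb_convert_encoding($char, $target, 'UTF-8');
        $back  = mb_convert_encoding($there, 'UTF-8', $target);
        if ($back !== $char) {
            $bad[] = $char;
        }
    }

    return $bad;
}

// "café €5": the euro sign does not exist in ISO-8859-1
$problems = unmappable_chars("caf\xC3\xA9 \xE2\x82\xAC5");
```

Running this over every sentence and keeping those with a non-empty result should point at the rows that upset iconv, provided the target encoding passed in matches the one used in the real conversion.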

Weird char (�) appears after doing html_entity_decode

In a separate YML file I have:
flags: [<img src="/images/cms_bo/icons/english.png" alt="English"/>]
When I use this in my code, it is not interpreted, so I used html_entity_decode.
It works, but I get one strange char just before my image: �
<?php echo html_entity_decode($form['lang']->render()); ?>
All my files are UTF-8 encoded. Do you have any idea what I've missed?
PS:
public static function getI18nCulturesForChoice()
{
return array_combine(self::getI18nCultures(), self::getI18nCulturesFlags());
}

Try using html_entity_decode($form['lang']->render(),ENT_QUOTES, "UTF-8");

Prior to PHP 5.4, the default character set for html_entity_decode was ISO-8859-1! If you're working with UTF-8, you will need to use the third argument to the function to tell it to deal with UTF-8 instead of assuming ISO-8859-1.
This is blindly assuming you're using an older version of PHP.
If you are using a newer version of PHP, consider using iconv with the //IGNORE//TRANSLIT flags to try and remove any bad UTF-8 sequences before passing the string into html_entity_decode.

Maybe your file has a Byte Order Mark (BOM) set.
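
Both suggestions (an explicit charset and a possible BOM) can be combined in a short sketch. The helper strip_utf8_bom and the sample string are made up for illustration; the sample assumes the markup arrives entity-encoded and prefixed with a UTF-8 BOM, which may not be what happens in the asker's setup.

```php
<?php
// Hypothetical helper: drop a leading UTF-8 byte order mark (EF BB BF), if any.
function strip_utf8_bom(string $s): string
{
    return strncmp($s, "\xEF\xBB\xBF", 3) === 0 ? substr($s, 3) : $s;
}

// Assumed input: entity-encoded markup preceded by a BOM.
$raw = "\xEF\xBB\xBF" . '&lt;img src="english.png" alt="English"/&gt;';

// Pass the charset explicitly rather than relying on the version-dependent default.
$html = html_entity_decode(strip_utf8_bom($raw), ENT_QUOTES, 'UTF-8');
```

If the stray character disappears once the BOM is stripped, the real fix is to save the YML file as UTF-8 without BOM.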

htmlspecialchars(): Invalid multibyte sequence in argument

I am getting this error in my local site.
Warning (2): htmlspecialchars(): Invalid multibyte sequence in argument in [/var/www/html/cake/basics.php, line 207]
Does anyone know what the problem is, or what the solution should be?
Thanks.

Be sure to specify the encoding to UTF-8 if your files are encoded as such:
htmlspecialchars($str, ENT_COMPAT, 'UTF-8');
The default charset for htmlspecialchars is ISO-8859-1 (as of PHP v5.4 the default charset was turned to 'UTF-8'), which might explain why things go haywire when it meets multibyte characters.
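
A short sketch of what the explicit encoding does with valid and invalid input. The sample strings are invented. ENT_SUBSTITUTE is not mentioned in the answer above; it is shown here only as one documented option (PHP 5.4+) for keeping the rest of a string when a byte sequence is invalid.

```php
<?php
$valid   = "<b>caf\xC3\xA9";   // "<b>café" in UTF-8
$invalid = "<b>caf\xE9";       // same text in Latin-1: not valid UTF-8

// Valid input: only the HTML special characters are escaped.
$a = htmlspecialchars($valid, ENT_COMPAT, 'UTF-8');

// Invalid input: the function returns an empty string.
$b = htmlspecialchars($invalid, ENT_COMPAT, 'UTF-8');

// With ENT_SUBSTITUTE the invalid byte becomes U+FFFD instead.
$c = htmlspecialchars($invalid, ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8');
```

An unexpectedly empty result is therefore a strong hint that the data is not in the encoding passed as the third argument.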

I ran in to this error on production and found this great post about it -
http://insomanic.me.uk/post/191397106/php-htmlspecialchars-htmlentities-invalid
It appears to be a bug in PHP (on CentOS at least) that displays this error when display_errors is Off!

You are feeding corrupted character data into the function, or not specifying the right encoding.
I had this issue a while ago. The old behavior (prior to PHP 5.2.7, I believe) was to return the string despite the corruption, but since that version it throws this error instead.
My solution involved writing a script to feed my strings through iconv using the //IGNORE modifier to remove corrupted data.
(We had a corrupted database which had some strings in UTF-8, some in latin-1 usually with incorrectly defined character types on the columns).
(Looking at the comment to Tatu's answer, I would start by looking at (and playing with) the contents of the $charset variable.)

The correct code to avoid any error is:
htmlentities($string, ENT_IGNORE, 'UTF-8');
Besides this, you can also use str_replace to replace bad characters as you see fit and then call htmlentities.
Have a look at this RSS feed: it replaced the greater-than sign with the &gt; entity, which might not look nice when reading the RSS feed. You can replace it with something like a "-" or ")" sign instead.

I had the same problem because I was using substr on a UTF-8 string.
The error was infrequent and seemingly random: it occurred only if the string was cut in the middle of a multibyte char!
mb_substr solved the problem :)
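
The difference can be shown in a few lines. The sample string is invented; "ä" takes two bytes in UTF-8, so cutting after one byte leaves half a character behind.

```php
<?php
$text = "\xC3\xA4bc";                         // "äbc" in UTF-8

// substr counts bytes: this keeps only the first byte of "ä".
$byteCut = substr($text, 0, 1);

// mb_substr counts characters in the given encoding.
$charCut = mb_substr($text, 0, 1, 'UTF-8');

$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');
$charCutIsValid = mb_check_encoding($charCut, 'UTF-8');
```

The byte-cut result is what later trips htmlspecialchars, and only when the cut happens to land inside a multibyte character, which explains why the error looks random.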

That's actually one of the most frequent errors I get.
Sometimes I don't use the __() translation function, just plain German text containing äöü.
There it is especially important to mind the encoding of the files.
So make sure you properly save the files that contain special chars as UTF8.

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medÃºlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt

To transform a UTF-8 string into a URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $str);
The IGNORE part tells iconv() not to raise an error when it meets a character it can't handle, and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
The next step is to preg_replace() spaces into underscores and substitute or drop any character that is unsafe within a URL, either with preg_replace() or urlencode().
As for the database, you really should have applied all these settings before INSERTing UTF-8 content. Changing the charset of an existing table is somewhat like changing a file extension in Windows: it doesn't convert a JPEG into a GIF. But don't worry, and remember that the database will return byte for byte exactly what you stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
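
The two steps described above might be combined like this. It is a sketch under assumptions: the function name slugify and the exact replacement rules are invented, and the result of //TRANSLIT depends on the platform's iconv implementation and locale, so the output for accented input is not guaranteed to be the same everywhere.

```php
<?php
// Hypothetical helper; the name and the replacement rules are illustrative only.
function slugify(string $utf8): string
{
    // Step 1: transliterate to ASCII where the platform's iconv can.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $utf8);
    if ($ascii === false) {
        // Fallback if iconv refuses: drop every non-ASCII byte.
        $ascii = preg_replace('/[^\x00-\x7F]/', '', $utf8);
    }

    // Step 2: lowercase, then collapse anything that is not a letter
    // or digit into a single underscore.
    $ascii = strtolower($ascii);
    $ascii = preg_replace('/[^a-z0-9]+/', '_', $ascii);

    return trim($ascii, '_');
}

$slug = slugify("Hello  World!");
```

Whatever the transliteration step produces, the final preg_replace guarantees that only lowercase letters, digits and underscores remain.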

"I'm trying to make a URL-safe version of a string."
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should encode it as UTF-8 and then %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (with the occasional exception of IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
"the conversion function doesn't see the ú character"
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily large, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1 (the default before PHP 5.4), and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
"(even a simple str_replace() doesn't work either)."
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as the data you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" to avoid the problem (in double quotes, since PHP does not interpret \x escapes in single-quoted strings).
"Running SET NAMES utf8 on the database before querying"
Use mysql_set_charset() in preference (or mysqli_set_charset() with mysqli, since the old mysql extension was removed in PHP 7.0).
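
The byte-level points in this answer can be checked with a short sketch. The variable names are invented; the string is written with escape sequences so the example does not depend on how the source file itself is saved.

```php
<?php
$name = "Med\xC3\xBAlla";                       // "Medúlla" as UTF-8 bytes

// Percent-encode for use in a URL path component.
$path = rawurlencode($name);

// Double-quoted escapes match the UTF-8 bytes of "ú"
// regardless of the encoding of the PHP source file.
$plain = str_replace("\xC3\xBA", 'u', $name);

// Misreading the UTF-8 bytes as Latin-1 and re-encoding them
// yields the "Ãº" mojibake described in the question.
$mojibake = mb_convert_encoding($name, 'UTF-8', 'ISO-8859-1');
```

Seeing "Ãº" where "ú" was expected usually means that one layer (connection, editor or output) treated UTF-8 bytes as a single-byte encoding.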
