convert string with UTF-16 and UTF-8 text to UTF-8

convert string with UTF-16 and UTF-8 text to UTF-8 - php

I read many posts on how to convert UTF-16 from/to UTF-8 but none advise what to do if I have both. I'm trying to insert an email body text that has UTF-16 and UTF-8 characters, using PHP, into SQL Server 2008 table column (UTF-8).
I use iconv() to convert from UTF-16 to UTF-8 but as I said it is not enough because it doesn't handle UTF-8:
$email->description_html = iconv("UTF-16","UTF-8//TRANSLIT",$that->getMessageText(
$msgNo, 'HTML', $structure, $fullHeader,$clean_email));
$email->description = iconv("UTF-16","UTF-8//TRANSLIT",$that->getMessageText(
$msgNo, 'PLAIN', $structure, $fullHeader,$clean_email));
I tried this for both UTF-16 and UTF-8 but it doesn't work, gives a database error:
can't convert UTF-16 to UTF-8
$email->description_html= iconv('','UTF-8',$that->getMessageText(
$msgNo, 'HTML', $structure, $fullHeader,$clean_email));
I don't know what else to do , please help.

There shouldn't be such a thing as "having both UTF-16 and UTF-8" in one text string. If this is the case, the string is broken. There must be an indicator stating which encoding was used, and this encoding only. This indicator must be trusted for converting characters into another encoding. If it doesn't work: Blame the source for incorrectly stating an encoding that wasn't true.
As for email: It might be possible to have a multipart mail that has two (read: more than one) different parts with two different multipart headers, both of them stating different encoding. Dealing with this must be done by applying the rules for parsing multipart mails, i.e. you cannot treat the whole mail as a single string, but must separate these parts first - and then you have a perfectly valid single encoding case for each part. :)

Related

PHP detect string encoding from different languages [duplicate]

I am trying to detect the encoding of a given string in order to convert it later on to utf-8 using iconv. I want to restrict the set of source encodings to utf8, iso8859-1, windows-1251, CP437
//...
$acceptedEncodings = array('utf-8',
'iso-8859-1',
'windows-1251'
);
$srcEncoding = mb_detect_encoding($content, $acceptedEncodings, true);
if($srcEncoding)
{
$content = iconv($srcEncoding, 'UTF-8', $content);
}
//...
The problem is thet mb_detect_encoding does not seem to accept CP437 as a supported encoding and when I give it a CP437 encoded string this is classified as iso-8859-1 which causes iconv to ignore characters like ü.
My question is: Is there a way to detect CP437 encoding earlier? The conversion from CP437 to UTF-8 using iconv works fine but I just cannot find the proper way to detect CP437.
Thank you very much.

As has been discussed countless times before: it is fundamentally impossible to distinguish any single-byte encoding from any other single-byte encoding. What you get are a bunch of bytes. In encoding A the byte x42 may map to character X and in encoding B the same byte may map to character Y. But nothing about the blob of bytes you have tells you that, because you only have the bytes. They can mean anything. They're equally valid in all encodings. It's possible to identify more complex multi-byte encodings like UTF-8, since they need to follow more complex internal rules. So it's possible to definitely be able to say This is not valid UTF-8. However, it is impossible to say with 100% certainty This is definitely UTF-8, not ISO-8859.
You need to have meta data about the content you receive which tells you what encoding the content is in. It's not practical to identify it after the fact. You'd need to employ actual content analysis to figure out which encoding a piece of text makes the most sense in.

php urlencod utf-8 string makes it ascii in mb_detect_encoding

During my work in updating some old projects im working through some old ANSI/ASCII files and encodings.
I want to have everything running utf-8 to make sure that i can support all kinds of languages.
I have a service where i send out sms'es using a microservice. I have an endpoint: /sms.php where i accept some parameters from _GET and these are then used in the application.
I have some test files where i make some requests to test if everything is ok.
My problem is that even though all files are utf8-encoded (i've checked multiple times)
My code looks like this:
$text = "message with æøå to make it utf8";
$params = urlencode($text);
$url = "http://localhost/sms.php?text=".$params;
echo mb_detect_encoding($text, "auto"); // this prints utf8
echo mb_detect_encoding($url, "auto"); // this prints ascii
$res = file_get_contents($url);
And this is also what i see in my receiving endpoint.
First i thought it was something to do with file_get_contents but since its being converted AFTER the urlencode it thought i might be it. But im not sure how to get around this problem.
The other problem i have is that a lot of my clients are using this old 2012 code (before i started using utf8 as standard) so i cant change the endpoint without causing them to make changes in their current setups.
In a comment i've been suggested to try to check for if the string is utf8 using
bin2hex:
bin2hex($_GET['text']); // 6d657373616765207769746820c3a6c3b8c3a520746f206d616b652069742075746638 which is inserted into the database: message with Ã¦Ã¸Ã¥ to make it utf8
bin2hex(utf8_decode($_GET['text'])); // 6d657373616765207769746820e6f8e520746f206d616b652069742075746638 which is inserted into the database: message with æøå to make it utf8
Hope someone out there can point me in a correct direction.
I've looked into multiple stackoverflow entries for example
get utf8 urlencoded characters in another page using php
What's the correct encoding of HTTP get request strings?
but im not sure if what im looking for is even possible?
i was just hoping to be able to rewrite entire project to be utf8-ready
Thanks
/Wel

mb_detect_encoding gives you the first encoding in which the tested string is valid. If left to its own devices, it tests for ASCII before UTF-8. Since a URL-encoded string consists solely of a subset of ASCII characters, it is valid ASCII and mb_detect_encoding will tell you so. Whereas a string containing non-ASCII characters is not valid ASCII, so it will continue testing other encodings and eventually arrive at UTF-8.
UTF-8 is a superset of ASCII, so any string that is valid ASCII is also valid UTF-8. A string can be valid in multiple encodings at once; mb_detect_encoding telling you it's valid ASCII does not mean that it's not also valid UTF-8, or Latin-1, or numerous other encodings for that matter. That's how Mojibake is born.
Detecting encodings is largely vague nonsense anyway and you should never do that. If you expect a string to be in UTF-8, simply test whether it is valid UTF-8 or not:
mb_check_encoding($url, 'UTF-8')
If it's not valid in the expected encoding, discard it, since you have no clue what it really is then.

How to detect CP437 using PHP

I am trying to detect the encoding of a given string in order to convert it later on to utf-8 using iconv. I want to restrict the set of source encodings to utf8, iso8859-1, windows-1251, CP437
//...
$acceptedEncodings = array('utf-8',
'iso-8859-1',
'windows-1251'
);
$srcEncoding = mb_detect_encoding($content, $acceptedEncodings, true);
if($srcEncoding)
{
$content = iconv($srcEncoding, 'UTF-8', $content);
}
//...
The problem is thet mb_detect_encoding does not seem to accept CP437 as a supported encoding and when I give it a CP437 encoded string this is classified as iso-8859-1 which causes iconv to ignore characters like ü.
My question is: Is there a way to detect CP437 encoding earlier? The conversion from CP437 to UTF-8 using iconv works fine but I just cannot find the proper way to detect CP437.
Thank you very much.

As has been discussed countless times before: it is fundamentally impossible to distinguish any single-byte encoding from any other single-byte encoding. What you get are a bunch of bytes. In encoding A the byte x42 may map to character X and in encoding B the same byte may map to character Y. But nothing about the blob of bytes you have tells you that, because you only have the bytes. They can mean anything. They're equally valid in all encodings. It's possible to identify more complex multi-byte encodings like UTF-8, since they need to follow more complex internal rules. So it's possible to definitely be able to say This is not valid UTF-8. However, it is impossible to say with 100% certainty This is definitely UTF-8, not ISO-8859.
You need to have meta data about the content you receive which tells you what encoding the content is in. It's not practical to identify it after the fact. You'd need to employ actual content analysis to figure out which encoding a piece of text makes the most sense in.

utf8_encode function purpose

Supposed that im encoding my files with UTF-8.
Within PHP script, a string will be compared:
$string="ぁ";
$string = utf8_encode($string); //Do i need this step?
if(preg_match('/ぁ/u',$string))
//Do if match...
Its that string really UTF-8 without the utf8_encode() function?
If you encode your files with UTF-8 dont need this function?

If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible to ASCII. That means, all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.

PHP does not care about string encoding generally, strings are binary data within PHP. So you must know the encoding of data inside the string if you need encoding. The question is: does encoding matter in your case?
If you set a string variables content to something like you did:
$string="ぁ";
It will not contain UTF-8. Instead it contains a binary sequence that is not a valid UTF-8 character. That's why the browser or editor displays a questionmark or similar. So before you go on, you already see that something might not be as intended. (Turned out it was a missing font on my end)
This also shows that your file in the editor is supporting UTF-8 or some other flavor of unicode encoding. Just keep the following in mind: One file - one encoding. If you store the string inside the file, it's in the encoding of that file. Check your editor in which encoding you save the file. Then you know the encoding of the string.
Let's just assume it is some valid UTF-8 like so (support for my font):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions encoding does play a role. That's why there is the u modifier to signal you want to make the expression work on and with unicode encoded data. However, if the data is already unicode encoded, you don't need to change it into unicode before you use preg_match. However with your code example, regular expressions are not necessary at all and a simple string comparison does the job.
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff

Your string is not a utf-8 character so it can't preg match it, hence why you need to utf8_encode it. Try encoding the PHP file as utf-8 (use something like Notepad++) and it may work without it.

Summary:
The utf8_encode() function will encode every byte from a given string to UTF-8.
No matter what encoding has been used previously to store the file.
It's purpose is encode strings¹ that arent UTF-8 yet.
1.- The correctly use of this function is giving as a parameter an ISO-8859-1 string.
Why? Because Unicode and ISO-8859-1 have the same characters at same positions.
[Char][Value/Position] [Encoded Value/Position]
[Windows-1252] [€][80] ----> [C2|80] Is this the UTF-8 encoded value/position of the [€]? No
[ISO-8859-1] [¢][A2] ----> [C2|A2] Is this the UTF-8 encoded value/position of the [¢]? Yes
The function seems that work with another encodings: it work if the string to encode contains only characters with same
values that the ISO-8859-1 encoding (e.g On Windows-1252 00-EF & A0-FF positions).
We should take into account that if the function receive an UTF-8 string (A file encoded as a UTF-8) will encode again that UTF-8 string and will make garbage.

PHP: What is that character encoding of this string?

In PHP, i have the following string: =CA=CC=D1=C8=C9
what is its character encoding?

It does not make sense to have a string without knowing what encoding it uses.
Those 5 bytes mean different things in different encodings.
In UTF-8, it's invalid. All lead bytes and no trail bytes.
In ISO-8859-1 and windows-1252, it's the string ÊÌÑÈÉ.
According to chardet, it's in KOI8-R, and decodes to йляхи

The answer and comments that you got assumed that you knew already that the transportation encoding was "quoted-printable" ... decoding using that, "=CA=CC=D1=C8=C9" becomes "\xCA\xCC\xD1\xC8\xC9" (which is NOT UTF-8, as you asked for in a comment) ... and they concentrated on what encoding might reasonably be used to produce Unicode out of that. To arrive at UTF-8, you need two more steps: decode "\xCA\xCC\xD1\xC8\xC9" into Unicode (using an encoding appropriate to Arabic text) and then encode into UTF-8.

It is called quoted printable
I can deceode it using :
quoted_printable_decode($string);

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.