How to detect CP437 using PHP

I am trying to detect the encoding of a given string in order to convert it later on to UTF-8 using iconv. I want to restrict the set of source encodings to UTF-8, ISO-8859-1, Windows-1251 and CP437:
//...
$acceptedEncodings = array('utf-8', 'iso-8859-1', 'windows-1251');
$srcEncoding = mb_detect_encoding($content, $acceptedEncodings, true);
if ($srcEncoding) {
    $content = iconv($srcEncoding, 'UTF-8', $content);
}
//...
The problem is that mb_detect_encoding does not seem to accept CP437 as a supported encoding, and when I give it a CP437-encoded string it is classified as ISO-8859-1, which causes iconv to ignore characters like ü.
My question is: is there a way to detect CP437 encoding earlier? The conversion from CP437 to UTF-8 using iconv works fine; I just cannot find the proper way to detect CP437.
Thank you very much.

As has been discussed countless times before: it is fundamentally impossible to distinguish one single-byte encoding from another. What you get is a bunch of bytes. In encoding A the byte 0x42 may map to character X, and in encoding B the same byte may map to character Y. Nothing about the blob of bytes tells you which, because you only have the bytes; they can mean anything, and they are equally valid in every single-byte encoding. It is possible to identify more complex multi-byte encodings like UTF-8, since they have to follow stricter internal rules, so you can definitively say "this is not valid UTF-8". However, it is impossible to say with 100% certainty "this is definitely UTF-8, not ISO-8859-1".
You need to have metadata about the content you receive which tells you what encoding it is in. It is not practical to identify it after the fact; you would have to employ actual content analysis to figure out in which encoding a piece of text makes the most sense.
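If metadata is unavailable and content analysis is the only option, all you can build is a heuristic. Here is a minimal sketch, assuming the only candidates are UTF-8, CP437 and ISO-8859-1, and assuming the CP437 input is DOS-style text containing runs of box-drawing characters; the guessEncoding helper and the byte range are illustrative, not a reliable detector:

function guessEncoding($content)
{
    // UTF-8 has strict internal rules, so it can actually be validated.
    if (mb_check_encoding($content, 'UTF-8')) {
        return 'UTF-8';
    }
    // Bytes 0xB0-0xDF are box-drawing/shading characters in CP437 but
    // letters and punctuation in ISO-8859-1; a run of them suggests DOS
    // text art (assumption: such runs are rare in real Latin-1 text).
    if (preg_match('/[\xB0-\xDF]{3,}/', $content)) {
        return 'CP437';
    }
    return 'ISO-8859-1'; // fallback guess, may well be wrong
}

$content = iconv(guessEncoding($content), 'UTF-8', $content);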

Related


php urlencode utf-8 string makes it ascii in mb_detect_encoding

While updating some old projects, I'm working through some old ANSI/ASCII files and encodings. I want everything running UTF-8 to make sure that I can support all kinds of languages.
I have a service where I send out SMSes using a microservice. I have an endpoint, /sms.php, where I accept some parameters from $_GET; these are then used in the application.
I have some test files where I make some requests to test if everything is OK.
My problem is that the string is detected as ASCII after URL-encoding, even though all files are UTF-8 encoded (I've checked multiple times). My code looks like this:
$text = "message with æøå to make it utf8";
$params = urlencode($text);
$url = "http://localhost/sms.php?text=".$params;
echo mb_detect_encoding($text, "auto"); // this prints utf8
echo mb_detect_encoding($url, "auto"); // this prints ascii
$res = file_get_contents($url);
And this is also what I see in my receiving endpoint.
At first I thought it had something to do with file_get_contents, but since the string is converted AFTER the urlencode, I thought that might be it. I'm not sure how to get around this problem.
The other problem I have is that a lot of my clients are using this old 2012 code (from before I started using UTF-8 as standard), so I can't change the endpoint without forcing them to make changes to their current setups.
In a comment I was advised to check whether the string is UTF-8 using bin2hex:
bin2hex($_GET['text']); // 6d657373616765207769746820c3a6c3b8c3a520746f206d616b652069742075746638, which is inserted into the database as: message with æøå to make it utf8
bin2hex(utf8_decode($_GET['text'])); // 6d657373616765207769746820e6f8e520746f206d616b652069742075746638, which is inserted into the database as: message with æøå to make it utf8
I hope someone out there can point me in the right direction.
I've looked into multiple Stack Overflow entries, for example
get utf8 urlencoded characters in another page using php
What's the correct encoding of HTTP get request strings?
but I'm not sure if what I'm looking for is even possible. I was just hoping to be able to rewrite the entire project to be UTF-8-ready.
Thanks
/Wel
mb_detect_encoding gives you the first encoding in which the tested string is valid. If left to its own devices, it tests for ASCII before UTF-8. Since a URL-encoded string consists solely of a subset of ASCII characters, it is valid ASCII and mb_detect_encoding will tell you so; a string containing non-ASCII characters, on the other hand, is not valid ASCII, so testing continues with other encodings and eventually arrives at UTF-8.
UTF-8 is a superset of ASCII, so any string that is valid ASCII is also valid UTF-8. A string can be valid in multiple encodings at once; mb_detect_encoding telling you it's valid ASCII does not mean that it's not also valid UTF-8, or Latin-1, or numerous other encodings for that matter. That's how Mojibake is born.
Detecting encodings is largely vague nonsense anyway and you should never do that. If you expect a string to be in UTF-8, simply test whether it is valid UTF-8 or not:
mb_check_encoding($url, 'UTF-8')
If it's not valid in the expected encoding, discard it, since you have no clue what it really is then.
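A short sketch of the superset effect described above (the candidate list is illustrative; the comments show the expected output):

$text = "message with æøå to make it utf8";
$params = urlencode($text);

// The URL-encoded form contains only ASCII bytes, so ASCII matches first.
echo mb_detect_encoding($params, array('ASCII', 'UTF-8')); // ASCII

// After decoding, the multibyte characters are back and ASCII no longer fits.
echo mb_detect_encoding(urldecode($params), array('ASCII', 'UTF-8')); // UTF-8

// The recommended check: validate against the encoding you expect.
var_dump(mb_check_encoding(urldecode($params), 'UTF-8')); // bool(true)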

How to detect which type of Chinese encoding a text file has?

On http://www.gnu.org/software/libiconv/ there are over a dozen encodings listed for Chinese:
Chinese EUC-CN, HZ, GBK, CP936, GB18030, EUC-TW, BIG5, CP950,
BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001, BIG5-HKSCS:1999,
ISO-2022-CN, ISO-2022-CN-EXT
So I have a text file that is not UTF-8. It's ASCII. And I want to convert it to UTF-8 using iconv(). But for that I need to know the character encoding of the source.
How can I do that if I don't know Chinese? :(
I noticed that:
$str = iconv('GB18030', 'UTF-8', $str);
file_put_contents('file.txt', $str);
produces a UTF-8-encoded file, while other encodings I tried (CP950, GBK and EUC-CN) produced an ASCII file. Could that mean that iconv is able to detect whether the input encoding is wrong for the given string?
This may work for your needs (but I really can't tell). Setting the locale, applying utf8_decode, and using mb_check_encoding instead of mb_detect_encoding seems to give some useful output:
// some text from http://chinesenotes.com/chinese_text_l10n.php
// tried both as a string literal and as content loaded from a file
$chinese = '譧躆 礛簼繰 剆坲姏 潧 騔鯬 跠 瘱瘵瘲 忁曨曣 蛃袚觙';
$chinese = utf8_decode($chinese);
$chinese_encodings = 'EUC-CN,HZ,GBK,CP936,GB18030,EUC-TW,BIG5,CP950,BIG5-HKSCS,BIG5-HKSCS:2004,BIG5-HKSCS:2001,BIG5-HKSCS:1999,ISO-2022-CN,ISO-2022-CN-EXT';
$encodings = explode(',', $chinese_encodings);
// set Chinese locale
setlocale(LC_CTYPE, 'Chinese');
foreach ($encodings as $encoding) {
    if (@mb_check_encoding($chinese, $encoding)) { // @ suppresses warnings for unsupported encodings
        echo 'The string seems to be compatible with ' . $encoding . '<br>';
    } else {
        echo 'Not compatible with ' . $encoding . '<br>';
    }
}
outputs
The string seems to be compatible with EUC-CN
The string seems to be compatible with HZ
The string seems to be compatible with GBK
The string seems to be compatible with CP936
Not compatible with GB18030
The string seems to be compatible with EUC-TW
The string seems to be compatible with BIG5
The string seems to be compatible with CP950
Not compatible with BIG5-HKSCS
Not compatible with BIG5-HKSCS:2004
Not compatible with BIG5-HKSCS:2001
Not compatible with BIG5-HKSCS:1999
Not compatible with ISO-2022-CN
Not compatible with ISO-2022-CN-EXT
It is a total guess; now it at least seems to recognise some of the Chinese encodings. Delete this if it is total junk.
I have zero experience with Chinese encodings and I know this question is tagged iconv, but if it will get the job done, you may try mb_detect_encoding to detect your encoding. The second argument is the list of encodings to check, and there is a user-contributed comment about Chinese encodings:
For Chinese developers: please note that the second argument of this
function DOES NOT include 'GB2312' and 'GBK' and the return value is
'EUC-CN' when it is detected as a GB2312 string.
so maybe it will work if you explicitly provide the full list of Chinese encodings as the second argument? It could work like this:
$encoding = mb_detect_encoding($chineseString, 'GB2312,GBK,(...)');
if ($encoding) $utf8text = iconv($encoding, 'UTF-8', $chineseString);
You may also want to play with the third argument (strict).
What makes it hard to detect the encoding is the fact that octet sequences decode to valid characters in several encodings, but the result makes sense only in the correct one. What I've done in these cases is take the decoded text to an automatic translation service and see whether it comes back as legible text or a jumble of syllables.
You can do this programmatically, for example by analyzing trigram frequencies in the input text. Libraries like this one have already been created to solve this problem, and there are external programs that do it, but I have yet to see anything with a PHP API. This approach is not fool-proof, though.
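As a rough illustration of such content analysis, here is a sketch that scores each candidate encoding by how many decoded code points land in the CJK Unified Ideographs block; the scoring rule and the candidate list are assumptions for illustration, not an established algorithm:

// raw, unknown-encoding input
$bytes = file_get_contents('file.txt');

function scoreCandidate($bytes, $encoding)
{
    $utf8 = @iconv($encoding, 'UTF-8', $bytes);
    if ($utf8 === false) {
        return -1.0; // not even decodable under this encoding
    }
    // Fraction of code points in the CJK Unified Ideographs block.
    $cjk = preg_match_all('/[\x{4E00}-\x{9FFF}]/u', $utf8);
    $total = mb_strlen($utf8, 'UTF-8');
    return $total > 0 ? $cjk / $total : 0.0;
}

$best = null;
$bestScore = -1.0;
foreach (array('GB18030', 'BIG5', 'EUC-TW', 'CP950') as $candidate) {
    $score = scoreCandidate($bytes, $candidate);
    if ($score > $bestScore) {
        $bestScore = $score;
        $best = $candidate;
    }
}
echo "Best guess: $best";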

PHP Encoding Conversion to Windows-1252 whilst keeping UTF-8 Compatibility

I need to convert uploaded filenames with an unknown encoding to Windows-1252 whilst also keeping UTF-8 compatibility.
As I pass those files on to a controller (on which I don't have any influence), the filenames have to be Windows-1252 encoded. This controller then generates a list of valid file(names) that are stored via MySQL into a database; therefore I need UTF-8 compatibility. Filenames passed to the controller and filenames written to the database MUST match. So far so good.
In some rare cases, when converting to Windows-1252 (like with the character "ï"), the character is converted to something invalid in UTF-8. MySQL then drops those invalid characters; as a result, filenames on disk and filenames stored in the database no longer match. This conversion, which sometimes fails, is achieved with simple recoding:
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//IGNORE", $sOriginalFilename);
To prevent invalid characters being generated by the conversion, I can then remove all invalid UTF-8 characters from the recoded string:
ini_set('mbstring.substitute_character', "none");
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//TRANSLIT", $sOriginalFilename);
$sTargetFilename = mb_convert_encoding($sTargetFilename, 'UTF-8', 'Windows-1252');
But this will completely remove / recode any special characters left in the string. For example, I lose all "äöüÄÖÜ" etc., which are quite common in the German language.
If you know a cleaner and simpler way of encoding to Windows-1252 (without losing valid special characters), please let me know.
Any help is very appreciated. Thank you in advance!
I think the main problem is that mb_detect_encoding() does not do exactly what you think it does. It attempts to detect the character encoding, but it does so from a fairly limited list of predefined encodings. By default, those encodings are the ones returned by mb_detect_order(). On my computer they are:
ASCII
UTF-8
So this function is completely useless unless you take care of compiling a list of candidate encodings and feeding it to the function.
Additionally, there's basically no reliable way to guess the encoding of an arbitrary input string, even if you restrict yourself to a small subset of encodings. In your case, Windows-1252 is so close to ISO-8859-1 and ISO-8859-15 that you have no way to tell them apart other than visual inspection of key characters like ¤ or €.
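To illustrate, a short sketch (the candidate list is an illustrative assumption; note that a match for a single-byte candidate only means the bytes are valid in it, not that they are meant to be read in it):

// What mb_detect_encoding() falls back to when given no list:
print_r(mb_detect_order()); // e.g. Array ( [0] => ASCII [1] => UTF-8 )

// Supply the candidates explicitly, with strict mode enabled:
$candidates = array('UTF-8', 'Windows-1252', 'ISO-8859-1');
$sEncoding = mb_detect_encoding($sOriginalFilename, $candidates, true);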
You can't have a string be Windows-1252 and UTF-8 at the same time. The character sets are identical for the first 128 characters (they contain e.g. the basic Latin alphabet), but beyond that (like for umlauts), it's either one or the other: the same characters are encoded as different byte sequences in UTF-8 than in Windows-1252.
Keep to ASCII in the filesystem. If you need to preserve characters outside ASCII in a filename, there are schemes you can use to represent Unicode characters while keeping to ASCII.
For example, percent encoding:
äöüÄÖÜ.txt <-> %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt
Of course this will hit the file name limit pretty fast and is not very optimal.
How about punycode?
äöüÄÖÜ.txt <-> xn--4caa7cb2ac.txt
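For the percent-encoding scheme, the round trip is one call in each direction; a tiny sketch:

$name = 'äöüÄÖÜ.txt';         // UTF-8 in the source
$safe = rawurlencode($name);  // %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt
$back = rawurldecode($safe);  // original UTF-8 name restored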

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column to utf8 encoding
Setting the table to utf8 encoding
Setting the entire database to utf8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform a UTF-8 string into a URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str);
The IGNORE part tells iconv() not to fail on a character it can't manage, and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
The next step is to preg_replace() spaces into underscores and to substitute or drop any character which is unsafe within a URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this configuration before INSERTing UTF-8 content. Changing the charset of an existing table is somewhat like changing a file extension in Windows: it doesn't convert a JPEG into a GIF. But don't worry, and remember that the database will return to you, byte by byte, exactly what you stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
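Putting the two steps together, a minimal slug sketch, assuming UTF-8 input; note that //TRANSLIT consults the active locale, so the setlocale() call (and that locale's availability on the system) is part of the assumption:

setlocale(LC_CTYPE, 'en_US.UTF-8'); // //TRANSLIT depends on LC_CTYPE

function slugify($str)
{
    // Transliterate to ASCII, dropping anything without an equivalent.
    $str = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str);
    // Collapse everything that is not a letter or digit into hyphens.
    $str = preg_replace('/[^A-Za-z0-9]+/', '-', $str);
    return strtolower(trim($str, '-'));
}

echo slugify('medúlla'); // medulla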
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only 'slugs' in URLs, it is actually possible to have web addresses including non-ASCII characters, e.g.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8-encode it and then %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (though sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú if, as is presumably the case, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() gives the same result, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as the text you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly named) "ANSI" code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" (in double quotes, so the escapes are interpreted) to avoid the problem.
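For instance (a tiny sketch; $value stands in for the string retrieved from the database):

// "\xc3\xba" is the UTF-8 byte sequence for ú; an escape sequence keeps
// the encoding of the source file itself out of the equation.
$value = str_replace("\xc3\xba", 'u', $value);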
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.
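Roughly like this, using the old mysql extension the answer implies (connection details are placeholders; with mysqli, the equivalent is mysqli_set_charset()):

$link = mysql_connect('localhost', 'user', 'password');
// Unlike a bare "SET NAMES", this also tells the client library about
// the charset, so e.g. mysql_real_escape_string() behaves correctly.
mysql_set_charset('utf8', $link);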
