Convert special character "Ⓡ" and "®" - php

I want to convert "Ⓡ" and "®" to readable. Currently when i using htmlEntities($text, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1'), it show great and readable in my development(Laptop/Window). But when i set this in production server (Centos), it didn't show the symbol. I'm using php 5.5.14 for both.

You're telling PHP that the $text string is encoded in ISO-8859-1. Are you sure that's true? Depending on where you obtained that string from, it may be using the system's default encoding, which can vary from machine to machine and is probably UTF-8 on a CentOS system.
Look carefully at wherever you're getting that string from, and see if you can force it to always be UTF-8, regardless of the system's default encoding. (UTF-8 is preferable since it can represent any character in Unicode; ISO-8859-1 can't.)
If you're careful to keep all your strings in a consistent, known encoding, you can avoid needing to call htmlentities at all. Just use the Content-Type header's charset parameter to tell the browser what encoding the response is in (e.g. UTF-8) and it'll understand the characters.
BTW, the "Ⓡ" symbol (U+24C7 CIRCLED LATIN CAPITAL LETTER R) doesn't exist in ISO-8859-1. Only "®" (U+00AE REGISTERED SIGN) does. (You probably want to use the latter anyway, though.)

Related

Will comparing the binary data of a string with an unknown character encoding validate what its encoding is?

I need to automatically determine the character encoding of strings from email content and headers. For the most part this isn't an issue however there is an occasional email with content and/or a header that has an oddball character such as an en dash. Now I received an answer that technically seems to work if I statically test it on a specific header for a specific email however that blatantly ignores the fact that importing email needs to be a completely automated process in which case I am utterly unable to automatically determine the string's character encoding.
I've started with the basics such as detecting common trouble characters that seem to guarantee a character encoding issue will occur. However strpos('en dash: –', '–') works fine while intentionally / manually testing though it fails outright when added directly to the automated process. I'm going to guess that the issue there is that the string parameters have a UTF-8 encoding while the automated process is testing a string that isn't yet UTF-8 and thus internally the same character isn't using the same subset of code (via character encoding).
So my second attempt was mb_detect_encoding's second parameter can be an array. So I tried the following:
$encodings = array('UTF-8','UCS-4','UCS-4BE','UCS-4LE','UCS-2','UCS-2BE','UCS-2LE','UTF-32','UTF-32BE','UTF-32LE','UTF-16','UTF-16BE','UTF-16LE','UTF-7','UTF7-IMAP','ASCII','EUC-JP','SJIS','eucJP-win','SJIS-win','ISO-2022-JP','ISO-2022-JP-MS','CP932','CP51932','SJIS-mac','SJIS-Mobile#DOCOMO','SJIS-Mobile#KDDI','SJIS-Mobile#SOFTBANK','UTF-8-Mobile#DOCOMO','UTF-8-Mobile#KDDI-A','UTF-8-Mobile#KDDI-B','UTF-8-Mobile#SOFTBANK','ISO-2022-JP-MOBILE#KDDI','JIS','JIS-ms','CP50220','CP50220raw','CP50221','CP50222','ISO-8859-1','ISO-8859-2','ISO-8859-3','ISO-8859-4','ISO-8859-5','ISO-8859-6','ISO-8859-7','ISO-8859-8','ISO-8859-9','ISO-8859-10','ISO-8859-13','ISO-8859-14','ISO-8859-15','ISO-8859-16','byte2be','byte2le','byte4be','byte4le','BASE64','HTML-ENTITIES','7bit','8bit','EUC-CN','CP936','GB18030','HZ','EUC-TW','CP950','BIG-5','EUC-KR','UHC','ISO-2022-KR','Windows-1251','Windows-1252','CP866','KOI8-R','KOI8-U','ArmSCII-8');
$encoding = mb_detect_encoding($s, $encodings, true);
$compare = mb_convert_encoding($s, 'UTF-8', $encoding);
foreach ($encodings as $k1)
{
if (mb_convert_encoding($s, 'UTF-8', $k1) === $s) {$encoding = $k1; break;}
}
Unfortunately that seemed to result in the same failure based on what I presume was the same underlying issue.
So my third idea I'm looking for some more experienced validation. I could convert the string down in its binary form (ones and zeroes, not binary data). Then I could try converting the string and then converting that second string to binary to compare the two binary versions; if they === match then I might have determined the correct character encoding?
Now I can easily try this with this answer from an unrelated thread however I'm not certain if this is a valid idea or not. This is all intended to answer my question:
How can I determine the actual character encoding of a string in order to convert it to UTF-8 with fully automated validation without corrupting data?
By validation I'm talking about stuff like comparing the binary data though again, I'm not certain if that is a valid approach or not. I do know that I absolutely hate en dashes though.
The answer won't change: it's impossible. You have to rely on external information which encoding is used on text.
Guessing an encoding can horribly go wrong:
Based on the order in which you test against it can either turn out as i.e. ASCII or UTF-8 or Windows-1252, just because so far it fits in. Your list is questionable, because it may match Base64 which is not even a text encoding.
If the source is not properly encoded itself then guessing its encoding will most likely exclude the correct one. And guess a wrong one. Which makes things worse.
Many encodings share the same area: the source can either fit i.e. Windows-1252 or Windows-1251 and even detecting the lexical sense of the text cannot guarantee which of both is correct.
Also: ones and zeroes are binary. PHP strings are only byte arrays, so they're binary to begin with. How they're interpreted relies on you: if your code is $text= "グリーン"; then it's up to which encoding your PHP text file has and how your PHP defaults are set. There is no "internal ... character", only bytes. Which is also the reason why there are functions which operate on bytes (i.e. strlen()) and on a specific text encoding (i.e. mb_strlen()).
If you hate single characters or not: they can be easily used as what they are: characters in texts. And – has its own valid meaning in contrast to — and ‒ and -; don't replace it by personal opinion, because that could corrupt a context's meaning. It's like ignoring the fact that A and Α and A are all different characters. You might want to look up the difference between homoglyphs and synoglyphs - the latter is your current perspective.
You may ask "And in which encoding does PHP interpret the scripts?" Luckily ASCII is for most encodings the most common denominator, so interpreting the first bytes of a file as such to search for <?php (all these are ASCII characters, so for PHP code itself it doesn't matter if it is effectively UTF-8 or ISO-8859-1 or Shift-JIS) will only fail when the document is encoded in i.e. UTF-16 - in that case you must set your PHP defaults to that encoding. Which again proves: text encodings must be told outside of the text.

PHP Encoding Conversion to Windows-1252 whilst keeping UTF-8 Compatibility

I need to convert uploaded filenames with an unknown encoding to Windows-1252 whilst also keeping UTF-8 compatibility.
As I pass on those files to a controller (on which I don't have any influence), the files have to be Windows-1252 encoded. This controller then again generates a list of valid file(names) that are stored via MySQL into a database - therefore I need UTF-8 compatibility. Filenames passed to the controller and filenames written to the database MUST match. So far so good.
In some rare cases, when converting to "Windows-1252" (like with te character "ï"), the character is converted to something invalid in UTF-8. MySQL then drops those invalid characters - as a result filenames on disk and filenames stored to the database don't match anymore. This conversion, which failes sometimes, is achieved with simple recoding:
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//IGNORE", $sOriginalFilename);
To prevent invalid characters being generated by the conversion, I then again can remove all invalid UTF-8 characters from the recoded string:
ini_set('mbstring.substitute_character', "none");
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//TRANSLIT", $sOriginalFilename);
$sTargetFilename = mb_convert_encoding($sTargetFilename, 'UTF-8', 'Windows-1252');
But this will completely remove / recode any special characters left in the string. For example I lose all "äöüÄÖÜ" etc., which are quite regular in german language.
If you know a cleaner and simpler way of encoding to Windows-1252 (without losing valid special characters), please let me know.
Any help is very appreciated. Thank you in advance!
I think the main problem is that mb_detect_encoding() does not do exactly what you think it does. It attempts to detect the character encoding but it does it from a fairly limited list of predefined encodings. By default, those encodings are the ones returned by mb_detect_order(). In my computer they are:
ASCII
UTF-8
So this function is completely useless unless you take care of compiling a list of candidate encodings and feeding the function with it.
Additionally, there's basically no reliable way to guess the encoding of an arbitrary input string, even if you restrict yourself to a small subset of encodings. In your case, Windows-1252 is so close to ISO-8859-1 and ISO-8859-15 that you have no way to tell them apart other than visual inspection of key characters like ¤ or €.
You can't have a string be Windows-1252 and UTF-8 at the same time. The character sets are identical for the first 128 characters (they contain e.g. the basic latin alphabet), but when it goes beyond that (like for Umlauts), it's either one or the other. They have different code points in UTF-8 than they have in Windows-1252.
Keep to ASCII in the filesystem - if you need to sustain characters outside ASCII in a filename, there are
schemes you can use to represent unicode characters while keeping to ASCII.
For example, percent encoding:
äöüÄÖÜ.txt <-> %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt
Of course this will hit the file name limit pretty fast and is not very optimal.
How about punycode?
äöüÄÖÜ.txt <-> xn--4caa7cb2ac.txt

Encoding of strings

This is what is found in the php manual under the String datatype http://php.net/manual/en/language.types.string.php
Given that PHP does not dictate a specific encoding for strings, one might wonder how string literals are encoded. For instance, is the string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8, C form), "\x61\xCC\x81" (UTF-8, D form) or any other possible representation? The answer is that string will be encoded in whatever fashion it is encoded in the script file. Thus, if the script is written in ISO-8859-1, the string will be encoded in ISO-8859-1 and so on. However, this does not apply if Zend Multibyte is enabled; in that case, the script may be written in an arbitrary encoding (which is explicity declared or is detected) and then converted to a certain internal encoding, which is then the encoding that will be used for the string literals. Note that there are some constraints on the encoding of the script (or on the internal encoding, should Zend Multibyte be enabled) – this almost always means that this encoding should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1. Note, however, that state-dependent encodings where the same byte values can be used in initial and non-initial shift states may be problematic.
Could you explain in simple terms as to what this means ?. Thanks
Given that PHP does not dictate a specific encoding for strings, one
might wonder how string literals are encoded. For instance, is the
string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8,
Cform), "\x61\xCC\x81" (UTF-8, D form) or any other possible
representation? The answer is that string will be encoded in whatever
fashion it is encoded in the script file. Thus, if the script is
written in ISO-8859-1, the string will be encoded in ISO-8859-1 and
soon.
This part of statement says that if your webpage is encoded in (UTF-8, C form) than "á" will be equivalent to "\xC3\xA1" you specify encoding in php.ini it's config file for your php script.
However, this does not apply if Zend Multibyte is enabled; in that
case, the script may be written in an arbitrary encoding (which is
explicity declared or is detected) and then converted to a certain
internal encoding, which is then the encoding that will be used for
the string literals. Note that there are some constraints on the
encoding of the script (or on the internal encoding, should Zend
Multibyte be enabled) – this almost always means that this encoding
should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1.
Note, however, that state-dependent encodings where the same byte
values can be used in initial and non-initial shift states may be
problematic.
Down here they just say that there is another option to specify your encoding, but now you are doing it in a script, but your encoding must be compatible with ASCII superset

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform an UTF-8 string into an URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $strt);
The IGNORE part tells iconv() not to raise an exception when facing a character it can't manage, and the TRANSLIT part converts an UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a U​RI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba' to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.

(PHP) rawurlencode/decode seems to encode '£' sign as '£' (%C2%A3 instead of %A3)

So, I've run into a problem with PHP's rawurlencode function. All text fields in our web app are of course converted before being processed by the web-server, and we've used rawurlencode for this. This works fine with almost every character I've found, expect for the "£" sign. Now, there is no reason for our users to ever enter a pound sign, but they might, so I want to take care of this.
The problem is that rawurlencode doesn't encode a pound sign entered on the webpage as %A3, but instead as %C2%A3. Even worse, if the user failed to enter another bit of critical information (which causes the webpage to refresh - the checks are done on the backend side - and try and refill the form boxes with the information the user had used), then when the %C2 is run through rawurldecode/encode, it becomes Ã? - aka, %C3?. And of course the "£" is also turned into another £!
So, what is causing this? I assume it's a character encoding issue, but I'm not that knowledgable about these things. I heard somewhere that I can encode £s as &pound manually, but why should I need to do that when the database can handle "£"s, and there is a percentage-encoding for a pound sign? Is this a bug in rawurlencode, or a bug caused by differing character sets?
Thanks for any help.
The standard requires forms to be submitted in the character encoding you specify in <form accept-charset="..."> or UTF-8 if it's not specified or the text the user has entered cannot be represented in the charset you specify.
Clearly, you're receiving the pound sign encoded in UTF-8. If you want to convert it to ISO-8859-15, write:
iconv("UTF-8", "ISO-8859-15//TRANSLIT", $original)
This is probably encoding A3 character in your native character set to C2A3 in UTF-8 encoding, which seems to be the valid UTF-8 encoding for an ANSI A3. Just consume your encoded url using UTF-8 encoding, or specify an ANSI encoding to urlencode.
Artefacto's answer represents a case when you need to convert character encodings, for example, you are displaying a page and the page encoding is set to Latin-1. (Raw)Urlencode will produce escaped strings with multibyte character representations. (Raw)Urldecode will by default produce utf-8 encoded strings, and will represent £ as two bytes. If you display this string making a claim that it is a ISO-8859 encoded string, it will appear as two characters.
A primer on PHP and UTF-8: http://www.phpwact.org/php/i18n/utf-8
Some "hot tips": http://www.sitepoint.com/blogs/2006/08/10/hot-php-utf-8-tips/
Likely, between getting the string from rawurldecode, and using the string, the locale is assumed to be ISO8859, so two bytes get interpreted as two characters when they represent one.
Use mb_convert_encoding to force PHP to realize that the bytes in the string represent a UTF-8 encoded string.

Categories