Encoding of strings

Encoding of strings - php

This is what is found in the PHP manual under the String datatype: http://php.net/manual/en/language.types.string.php
Given that PHP does not dictate a specific encoding for strings, one might wonder how string literals are encoded. For instance, is the string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8, C form), "\x61\xCC\x81" (UTF-8, D form) or any other possible representation? The answer is that string will be encoded in whatever fashion it is encoded in the script file. Thus, if the script is written in ISO-8859-1, the string will be encoded in ISO-8859-1 and so on. However, this does not apply if Zend Multibyte is enabled; in that case, the script may be written in an arbitrary encoding (which is explicitly declared or is detected) and then converted to a certain internal encoding, which is then the encoding that will be used for the string literals. Note that there are some constraints on the encoding of the script (or on the internal encoding, should Zend Multibyte be enabled) – this almost always means that this encoding should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1. Note, however, that state-dependent encodings where the same byte values can be used in initial and non-initial shift states may be problematic.
Could you explain in simple terms what this means? Thanks.

Given that PHP does not dictate a specific encoding for strings, one
might wonder how string literals are encoded. For instance, is the
string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8,
C form), "\x61\xCC\x81" (UTF-8, D form) or any other possible
representation? The answer is that string will be encoded in whatever
fashion it is encoded in the script file. Thus, if the script is
written in ISO-8859-1, the string will be encoded in ISO-8859-1 and
so on.
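This part of the statement says that a string literal gets the encoding of the script file it is written in. If your script file is saved as UTF-8 (C form), then "á" will be equivalent to "\xC3\xA1". The encoding is not something you set in php.ini; it is determined by how your editor saves the file.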
However, this does not apply if Zend Multibyte is enabled; in that
case, the script may be written in an arbitrary encoding (which is
explicitly declared or is detected) and then converted to a certain
internal encoding, which is then the encoding that will be used for
the string literals. Note that there are some constraints on the
encoding of the script (or on the internal encoding, should Zend
Multibyte be enabled) – this almost always means that this encoding
should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1.
Note, however, that state-dependent encodings where the same byte
values can be used in initial and non-initial shift states may be
problematic.
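Down here they just say that there is another option: with Zend Multibyte enabled, the script's encoding can be declared (or detected) and PHP converts the script to an internal encoding, which the string literals then use. Either way, the encoding should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1.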
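To make the manual's example concrete, here is a minimal sketch that prints the three byte spellings of "á" it mentions. The array keys and variable names are only illustrative, and the bytes are written as escapes so the result does not depend on how the snippet itself is saved.

```php
<?php
// Three possible byte spellings of "á", taken from the manual's example.
// They are written as escapes so the file's own encoding does not matter.
$candidates = [
    'ISO-8859-1'     => "\xE1",
    'UTF-8 (C form)' => "\xC3\xA1",
    'UTF-8 (D form)' => "\x61\xCC\x81", // "a" followed by a combining acute accent
];

foreach ($candidates as $label => $bytes) {
    // strlen() counts bytes, not characters; bin2hex() shows the raw bytes.
    printf("%-16s %d byte(s): %s\n", $label, strlen($bytes), bin2hex($bytes));
}

// A literal typed directly gets whatever bytes the editor saved for it,
// so this line prints different hex depending on the file's encoding.
$literal = "á";
echo bin2hex($literal), "\n";
```

Running the last line against your own script file is a quick way to find out which encoding your editor actually uses.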

Related

Convert special character "Ⓡ" and "®"

I want to convert "Ⓡ" and "®" into something readable. Currently, when I use htmlentities($text, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1'), the output looks fine and readable on my development machine (laptop, Windows). But when I deploy this to the production server (CentOS), the symbol isn't shown. I'm using PHP 5.5.14 on both.

You're telling PHP that the $text string is encoded in ISO-8859-1. Are you sure that's true? Depending on where you obtained that string from, it may be using the system's default encoding, which can vary from machine to machine and is probably UTF-8 on a CentOS system.
Look carefully at wherever you're getting that string from, and see if you can force it to always be UTF-8, regardless of the system's default encoding. (UTF-8 is preferable since it can represent any character in Unicode; ISO-8859-1 can't.)
If you're careful to keep all your strings in a consistent, known encoding, you can avoid needing to call htmlentities at all. Just use the Content-Type header's charset parameter to tell the browser what encoding the response is in (e.g. UTF-8) and it'll understand the characters.
BTW, the "Ⓡ" symbol (U+24C7 CIRCLED LATIN CAPITAL LETTER R) doesn't exist in ISO-8859-1. Only "®" (U+00AE REGISTERED SIGN) does. (You probably want to use the latter anyway, though.)
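As a rough illustration of the mismatch described above, the sketch below passes the same UTF-8 bytes to htmlentities() with a matching and a non-matching charset argument. The variable names are made up for the example, and the input is assumed to be UTF-8.

```php
<?php
// "®" (U+00AE) and "Ⓡ" (U+24C7) as UTF-8 bytes, written as escapes.
$registered = "\xC2\xAE";
$circledR   = "\xE2\x93\x87";

// Declared encoding matches the data: the sign becomes a named entity.
$ok = htmlentities($registered, ENT_COMPAT | ENT_HTML401, 'UTF-8');

// Declared encoding does not match: each UTF-8 byte is treated as a
// separate ISO-8859-1 character, which produces two unrelated entities.
$wrong = htmlentities($registered, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');

// Alternative: keep the text in UTF-8 and escape only the markup characters.
$escaped = htmlspecialchars($circledR . ' & ' . $registered, ENT_QUOTES, 'UTF-8');

echo $ok, "\n";    // &reg;
echo $wrong, "\n"; // &Acirc;&reg;
```

The htmlspecialchars() variant only displays correctly if the response also declares UTF-8, for example via the Content-Type header's charset parameter mentioned above.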

ini_set('mbstring.internal_encoding','UTF-8')

ini_set('mbstring.internal_encoding','UTF-8')
What does this signify at the beginning of a PHP file, and what is it used for?
I know that a PHP manual exists, but it does not explain this in plain language.

It sets the default internal character encoding to UTF-8.
This is typically done on multilingual sites, so that the mbstring functions interpret strings as UTF-8. The manual describes the setting like this:
encoding is the character encoding name used for the HTTP input
character encoding conversion, HTTP output character encoding
conversion, and the default character encoding for string functions
defined by the mbstring module. You should notice that the internal
encoding is totally different from the one for multibyte regex.
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode.
You can set the value to a different encoding instead, such as ISO-8859-1.
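A small sketch of what the internal encoding changes in practice, assuming the mbstring extension is loaded. It uses mb_internal_encoding() to set the value at runtime instead of ini_set(); note that the mbstring.internal_encoding ini directive itself has been deprecated since PHP 5.6 in favour of default_charset.

```php
<?php
// Assumes the mbstring extension is available.
$word = "\xC3\xA1rbol"; // "árbol" as UTF-8 bytes

// Sets the default encoding used by the mb_* string functions.
mb_internal_encoding('UTF-8');

echo strlen($word), "\n";                  // 6: strlen() counts bytes
echo mb_strlen($word), "\n";               // 5: characters, using the internal encoding
echo mb_strlen($word, 'ISO-8859-1'), "\n"; // 6: every byte treated as one character
```

Functions without the mb_ prefix, such as strlen(), are not affected by this setting.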

PHP Encoding Conversion to Windows-1252 whilst keeping UTF-8 Compatibility

I need to convert uploaded filenames with an unknown encoding to Windows-1252 whilst also keeping UTF-8 compatibility.
As I pass on those files to a controller (on which I don't have any influence), the files have to be Windows-1252 encoded. This controller then again generates a list of valid file(names) that are stored via MySQL into a database - therefore I need UTF-8 compatibility. Filenames passed to the controller and filenames written to the database MUST match. So far so good.
In some rare cases, when converting to "Windows-1252" (as with the character "ï"), the character is converted to something invalid in UTF-8. MySQL then drops those invalid characters; as a result, filenames on disk and filenames stored in the database don't match anymore. This conversion, which sometimes fails, is achieved with simple recoding:
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//IGNORE", $sOriginalFilename);
To prevent invalid characters being generated by the conversion, I then again can remove all invalid UTF-8 characters from the recoded string:
ini_set('mbstring.substitute_character', "none");
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//TRANSLIT", $sOriginalFilename);
$sTargetFilename = mb_convert_encoding($sTargetFilename, 'UTF-8', 'Windows-1252');
But this will completely remove or recode any special characters left in the string. For example, I lose all "äöüÄÖÜ" etc., which are quite common in German.
If you know a cleaner and simpler way of encoding to Windows-1252 (without losing valid special characters), please let me know.
Any help is very appreciated. Thank you in advance!

I think the main problem is that mb_detect_encoding() does not do exactly what you think it does. It attempts to detect the character encoding, but it only chooses from a fairly limited list of predefined encodings. By default, those encodings are the ones returned by mb_detect_order(). On my computer they are:
ASCII
UTF-8
So this function is completely useless unless you take care of compiling a list of candidate encodings and feeding the function with it.
Additionally, there's basically no reliable way to guess the encoding of an arbitrary input string, even if you restrict yourself to a small subset of encodings. In your case, Windows-1252 is so close to ISO-8859-1 and ISO-8859-15 that you have no way to tell them apart other than visual inspection of key characters like ¤ or €.
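Since guessing is unreliable, one alternative is to validate instead of detect. The sketch below is only an assumption-laden example: normalizeToUtf8() is a hypothetical helper, and it assumes every upload is either valid UTF-8 or Windows-1252, which you would have to confirm for your own clients. It requires mbstring.

```php
<?php
// Hypothetical helper: returns $raw as UTF-8.
// ASSUMPTION: input that is not valid UTF-8 is in $fallback (Windows-1252 here).
function normalizeToUtf8(string $raw, string $fallback = 'Windows-1252'): string
{
    // mb_check_encoding() validates the bytes rather than guessing an encoding.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

$fromLegacyClient = "na\xEFve.txt";     // "naïve.txt" in Windows-1252
$fromModernClient = "na\xC3\xAFve.txt"; // "naïve.txt" in UTF-8

// Both spellings end up as the same UTF-8 byte sequence.
var_dump(normalizeToUtf8($fromLegacyClient) === normalizeToUtf8($fromModernClient)); // bool(true)
```

This cannot tell Windows-1252 apart from ISO-8859-1 or ISO-8859-15 either; it just makes the fallback an explicit decision instead of a guess.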

You can't have a string be Windows-1252 and UTF-8 at the same time. The two encodings are identical for the first 128 characters (which contain e.g. the basic Latin alphabet), but beyond that (for umlauts, for example) it's either one or the other: those characters are encoded as different byte sequences in UTF-8 than in Windows-1252.
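To see the difference at byte level, here is a short sketch (assuming mbstring) that converts one UTF-8 filename to Windows-1252 and dumps both. The variable names are only illustrative.

```php
<?php
// Assumes the mbstring extension is available.
$utf8   = "\xC3\xA4.txt"; // "ä.txt" in UTF-8
$cp1252 = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');

echo bin2hex($utf8), "\n";   // c3a42e747874
echo bin2hex($cp1252), "\n"; // e42e747874

// The ASCII part (".txt") is byte-identical; only the umlaut differs,
// and the Windows-1252 version is not valid UTF-8.
var_dump(mb_check_encoding($cp1252, 'UTF-8')); // bool(false)
```

One way to live with this, offered as a suggestion rather than the only option, is to keep a single UTF-8 copy of the name as the reference (for the database) and derive the Windows-1252 bytes from it only at the point where the controller needs them.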

Keep to ASCII in the filesystem. If you need to preserve characters outside ASCII in a filename, there are schemes you can use to represent Unicode characters while keeping to ASCII.
For example, percent encoding:
äöüÄÖÜ.txt <-> %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt
Of course, this will hit the file name length limit pretty fast and is far from optimal.
How about punycode?
äöüÄÖÜ.txt <-> xn--4caa7cb2ac.txt
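The percent-encoding variant can be sketched with PHP's built-in rawurlencode() and rawurldecode(), assuming the original name is UTF-8. Punycode is left out here because it needs an extra extension.

```php
<?php
// "äöüÄÖÜ.txt" as UTF-8 bytes, written as escapes.
$name = "\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x84\xC3\x96\xC3\x9C.txt";

// Every non-ASCII byte becomes %XX; ASCII letters, digits and "." stay as they are.
$asciiName = rawurlencode($name);
echo $asciiName, "\n"; // %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt

// The mapping is reversible, so the original name can be recovered later.
var_dump(rawurldecode($asciiName) === $name); // bool(true)
```

Each non-ASCII character grows from two bytes to six here, which is the length problem mentioned above.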

utf8_encode function purpose

Suppose that I'm encoding my files as UTF-8.
Within a PHP script, a string will be compared:
$string="ぁ";
$string = utf8_encode($string); //Do i need this step?
if(preg_match('/ぁ/u',$string))
//Do if match...
Is that string really UTF-8 without the utf8_encode() function?
If you encode your files as UTF-8, do you not need this function?

If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible with ASCII. That means all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal byte sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.
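Here is a small sketch of both points: the literal is already UTF-8, and converting it again garbles it. It assumes mbstring and uses mb_convert_encoding() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode(), since that is the same conversion and utf8_encode() is deprecated in recent PHP versions.

```php
<?php
// "あ" (U+3042) as the bytes PHP reads from a UTF-8 source file.
$string = "\xE3\x81\x82";

echo bin2hex($string), "\n";                  // e38182
var_dump(preg_match('/\x{3042}/u', $string)); // int(1)

// The ISO-8859-1 -> UTF-8 conversion, wrongly applied to data that is
// already UTF-8: each of the three bytes is encoded again on its own.
$garbled = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');

echo bin2hex($garbled), "\n";                  // c3a3c281c282
var_dump(preg_match('/\x{3042}/u', $garbled)); // int(0)
```

The pattern uses \x{3042} rather than a typed character so that the example behaves the same however the snippet is saved.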

PHP does not care about string encoding generally, strings are binary data within PHP. So you must know the encoding of data inside the string if you need encoding. The question is: does encoding matter in your case?
If you set a string variables content to something like you did:
$string="ぁ";
At first it looked to me as if this did not contain UTF-8 but a binary sequence that is not a valid UTF-8 character, because my browser displayed a question mark or similar. (It turned out to be a missing font on my end.) Still, when you see something like that, it is a hint that something might not be as intended, so check before you go on.
This also shows that your file in the editor is supporting UTF-8 or some other flavor of unicode encoding. Just keep the following in mind: One file - one encoding. If you store the string inside the file, it's in the encoding of that file. Check your editor in which encoding you save the file. Then you know the encoding of the string.
Let's just assume it is some valid UTF-8 like so (support for my font):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions, encoding does play a role. That's why there is the u modifier, which signals that the pattern and the subject are UTF-8 encoded. If the data is already UTF-8, you don't need to convert it before you use preg_match. With your code example, though, regular expressions are not necessary at all and a simple string comparison does the job.
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff

If your file is not saved as UTF-8, the string is not valid UTF-8 and preg_match with the u modifier can't match it, which is why utf8_encode appears to be needed. Try saving the PHP file as UTF-8 (with something like Notepad++) and it may work without it.

Summary:
The utf8_encode() function converts every byte of a given string to UTF-8, treating each byte as one ISO-8859-1 character.
It does this no matter what encoding the string was actually stored in.
Its purpose is to encode strings¹ that aren't UTF-8 yet.
1.- The correct use of this function is to pass it an ISO-8859-1 string.
Why? Because Unicode and ISO-8859-1 have the same characters at the same positions.
Encoding       Char   Byte   After utf8_encode()   Is that the UTF-8 encoding of the character?
Windows-1252   €      80     C2 80                 No
ISO-8859-1     ¢      A2     C2 A2                 Yes
The function may seem to work with other encodings: it works if the string to encode contains only characters that have the same
values as in ISO-8859-1 (e.g. in Windows-1252, the positions 00-7F and A0-FF).
We should take into account that if the function receives a UTF-8 string (for example from a file saved as UTF-8), it will encode that UTF-8 string again and produce garbage.
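The two cases above can be checked with a short sketch. It assumes mbstring and uses mb_convert_encoding() with an explicit source encoding, which for ISO-8859-1 performs the same conversion as utf8_encode().

```php
<?php
// Assumes the mbstring extension is available.
$euroByte = "\x80"; // "€" in Windows-1252
$centByte = "\xA2"; // "¢" in both Windows-1252 and ISO-8859-1

// Treating the bytes as ISO-8859-1 (what utf8_encode() does):
echo bin2hex(mb_convert_encoding($euroByte, 'UTF-8', 'ISO-8859-1')), "\n"; // c280
echo bin2hex(mb_convert_encoding($centByte, 'UTF-8', 'ISO-8859-1')), "\n"; // c2a2

// Naming the real source encoding gives the correct result for both:
echo bin2hex(mb_convert_encoding($euroByte, 'UTF-8', 'Windows-1252')), "\n"; // e282ac
echo bin2hex(mb_convert_encoding($centByte, 'UTF-8', 'Windows-1252')), "\n"; // c2a2
```

So for Windows-1252 input, the conversion is only wrong for bytes in the 80-9F range, which matches the positions listed above.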

Error on file saving

When I create a file on Windows hosting, it gets a name like РґР¶СѓР»РёСЏ.jpg
It should be a Cyrillic name.
fopen() is used to create the file.
What can I do about this?

It's an encoding issue.
Setting PHP to use UTF-8 encoding will probably suffice: http://php.net/manual/en/function.utf8-encode.php
UTF-8 can represent every character in the Unicode character set, plus it has the special property of being backwards-compatible with ASCII.
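For what it's worth, the garbled name in the question is exactly what UTF-8 bytes look like when they are read as Windows-1251. The sketch below reproduces that, assuming mbstring; the variable names are made up, and the diagnosis is an inference from the bytes, not something the hosting setup confirms.

```php
<?php
// Assumes the mbstring extension is available.
// "джулия.jpg" as UTF-8 bytes, written as escapes.
$utf8Name = "\xD0\xB4\xD0\xB6\xD1\x83\xD0\xBB\xD0\xB8\xD1\x8F.jpg";

// Pretend the bytes are Windows-1251 and show what a viewer would display.
$misread = mb_convert_encoding($utf8Name, 'UTF-8', 'Windows-1251');
echo $misread, "\n"; // РґР¶СѓР»РёСЏ.jpg

// Undoing the misreading recovers the original UTF-8 name.
$repaired = mb_convert_encoding($misread, 'Windows-1251', 'UTF-8');
var_dump($repaired === $utf8Name); // bool(true)
```

If that is what is happening, one option is to convert the name to the encoding the filesystem functions expect before calling fopen(); which encoding that is depends on the PHP version and the server, so treat it as something to verify.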

Check whether all the script files use the same encoding (ANSI, ISO-..., UTF-8, etc.).
Check the internal encoding your script uses and the encoding of the string:
multibyte functions
internal encoding of your script
encoding of your string
NB: I don't recommend using string input from websites in the filesystem!
But if you expect input in a certain format, be sure to specify the content type of your HTML page.
