ini_set('mbstring.internal_encoding','UTF-8')
What does this signify at the beginning of a PHP file, and what is it used for?
I know that a PHP manual exists, but it does not explain this in plain language.
It sets the default internal character encoding to UTF-8.
It is typically used on multilingual sites: the mbstring.internal_encoding value tells the mbstring functions which encoding your strings are in.
encoding is the character encoding name used for the HTTP input
character encoding conversion, HTTP output character encoding
conversion, and the default character encoding for string functions
defined by the mbstring module. You should notice that the internal
encoding is totally different from the one for multibyte regex.
And UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode.
You can set it to other encodings as well, such as ISO-8859-1 or UTF-8.
Here is the list of Unicode characters.
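A minimal sketch of what the internal encoding changes in practice, assuming the mbstring extension is loaded. It uses mb_internal_encoding(), the runtime function for the same setting (the mbstring.internal_encoding ini option itself has been deprecated since PHP 5.6). The sample strings and variable names are illustrative only.

```php
<?php
// Assumption: the mbstring extension is available.
mb_internal_encoding('UTF-8');

// Sample strings written as explicit UTF-8 bytes, so the encoding of
// this script file does not matter.
$samples = [
    'a'       => "a",            // U+0061, one byte in UTF-8
    'e-acute' => "\xC3\xA9",     // U+00E9, two bytes in UTF-8
    'euro'    => "\xE2\x82\xAC", // U+20AC, three bytes in UTF-8
];

$byteLengths = [];
$charLengths = [];
foreach ($samples as $name => $text) {
    $byteLengths[$name] = strlen($text);    // strlen() counts bytes
    $charLengths[$name] = mb_strlen($text); // mb_strlen() counts characters in the internal encoding
    printf("%-8s bytes=%d chars=%d\n", $name, $byteLengths[$name], $charLengths[$name]);
}
```

Because UTF-8 is variable-length, the byte counts differ (1, 2, 3) while each sample is a single character once mbstring knows the strings are UTF-8.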
I learned that ISO-8859-1 is a single-byte charset.
See the page http://www.manoramaonline.com/cgi-bin/MMOnline.dll/portal/ep/malayalamContentView.do?tabId=11&programId=1073753760&BV_ID=###&contentId=15238737&contentType=EDITORIAL&articleType=Malayalam%20News. It is using Malayalam language.
The HTTP header and meta tag say that it is using ISO-8859-1 as its character encoding.
But this page uses a two-byte character (U+201A) (http://unicodelookup.com/#%E2%80%9A).
(Copy the character and look it up at http://unicodelookup.com.)
<div id="articleTitleMal" style="padding-top:10px;">
<font face= "Manorama" >
¼ÈØOVA¢: ÜÍß‚Äí 1.28 ...
</font>
</div>
How is it possible to use a two-byte character in a single-byte encoding?
This is not just curiosity. One of my tasks is stuck because I do not understand this issue.
Update: They are using the font www.manoramaonline.com/portal/mmcss/Manorama.ttf, and I think some of the characters in the Manorama font use two bytes.
UPDATE2: I tried to convert the document from ISO-8859-1 to UTF-8 using the code below.
<?php
$t = file_get_contents('http://www.manoramaonline.com/cgi-bin/MMOnline.dll/portal/ep/malayalamContentView.do?tabId=11&programId=1073753760&BV_ID=###&contentId=15238737&contentType=EDITORIAL&articleType=Malayalam%20News');
// Change the charset info in meta-tag
$t = str_replace('ISO-8859-1', 'UTF-8', $t);
file_put_contents('t.html', utf8_encode($t));
After this conversion, the character shown above is missing.
Even though the page is declared as ISO-8859-1 encoded in HTTP headers, browsers interpret it as Windows-1252 encoded. This is a longstanding tradition, now being formalized e.g. in the WHATWG Encoding Standard.
Thus, when the data contains the byte 82 (hex), it is not taken as a control character (as per ISO 8859-1) but as U+201A “‚” (as per Windows-1252).
However, the page uses font trickery that maps code positions to Malayalam characters according to a special internal, nonstandard encoding. (You can see this if you disable style sheets on the page. All texts become gibberish.) The page is not really meant to contain U+201A “‚” but the byte 82 to which a Malayalam character is assigned in the font.
So you need to preserve the byte as-is to get the same results. A conversion to UTF-8 would break this.
If you wanted to convert the data to Unicode, you would need to find out the internal encoding of the font being used and perform that mapping at the character level.
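To illustrate the difference described above, here is a small sketch, assuming the mbstring extension is available. It decodes the single byte 82 (hex) once as Windows-1252 and once as ISO-8859-1; the variable names are made up for the example.

```php
<?php
// Assumption: the mbstring extension is available.
$byte = "\x82";

// Read as Windows-1252 (what browsers do), the byte is U+201A.
$asWindows1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

// Read as ISO-8859-1 (what utf8_encode() assumes), the same byte is the
// C1 control character U+0082, which has no visible glyph.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asWindows1252), "\n"; // e2809a
echo bin2hex($asLatin1), "\n";      // c282
```

This is why the character seems to disappear after utf8_encode(): the byte ends up as an invisible control character. Neither reading recovers the Malayalam glyph, because that mapping exists only in the font.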
This is what the PHP manual says under the String data type: http://php.net/manual/en/language.types.string.php
Given that PHP does not dictate a specific encoding for strings, one might wonder how string literals are encoded. For instance, is the string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8, C form), "\x61\xCC\x81" (UTF-8, D form) or any other possible representation? The answer is that string will be encoded in whatever fashion it is encoded in the script file. Thus, if the script is written in ISO-8859-1, the string will be encoded in ISO-8859-1 and so on. However, this does not apply if Zend Multibyte is enabled; in that case, the script may be written in an arbitrary encoding (which is explicitly declared or is detected) and then converted to a certain internal encoding, which is then the encoding that will be used for the string literals. Note that there are some constraints on the encoding of the script (or on the internal encoding, should Zend Multibyte be enabled) – this almost always means that this encoding should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1. Note, however, that state-dependent encodings where the same byte values can be used in initial and non-initial shift states may be problematic.
Could you explain in simple terms what this means? Thanks.
Given that PHP does not dictate a specific encoding for strings, one
might wonder how string literals are encoded. For instance, is the
string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8,
C form), "\x61\xCC\x81" (UTF-8, D form) or any other possible
representation? The answer is that string will be encoded in whatever
fashion it is encoded in the script file. Thus, if the script is
written in ISO-8859-1, the string will be encoded in ISO-8859-1 and
so on.
This part says that a string literal simply has whatever encoding the script file itself is saved in. If your script file is saved as UTF-8 (C form), then "á" is equivalent to "\xC3\xA1"; if it is saved as ISO-8859-1, it is equivalent to "\xE1". PHP does not convert anything here.
However, this does not apply if Zend Multibyte is enabled; in that
case, the script may be written in an arbitrary encoding (which is
explicitly declared or is detected) and then converted to a certain
internal encoding, which is then the encoding that will be used for
the string literals. Note that there are some constraints on the
encoding of the script (or on the internal encoding, should Zend
Multibyte be enabled) – this almost always means that this encoding
should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1.
Note, however, that state-dependent encodings where the same byte
values can be used in initial and non-initial shift states may be
problematic.
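This part says there is another option: with Zend Multibyte enabled, the encoding of the script can be declared explicitly (or detected), and PHP then converts the string literals to an internal encoding. In either case, the encoding should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1.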
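The three representations of "á" from the quoted passage can be compared directly. This is only a sketch: the bytes are written with escape sequences so the result does not depend on how the script file is saved, and the variable names are made up for the example. It assumes the mbstring extension for the conversion step.

```php
<?php
$latin1  = "\xE1";         // "á" in ISO-8859-1
$utf8Nfc = "\xC3\xA1";     // "á" in UTF-8, C form (one precomposed character)
$utf8Nfd = "\x61\xCC\x81"; // "á" in UTF-8, D form ("a" plus a combining acute accent)

// To PHP these are just byte strings of different lengths.
$lengths = [strlen($latin1), strlen($utf8Nfc), strlen($utf8Nfd)];

// They look the same on screen but do not compare equal.
$sameBytes = ($utf8Nfc === $utf8Nfd);

// Converting the ISO-8859-1 byte to UTF-8 yields the C form.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($lengths, $sameBytes, $converted === $utf8Nfc);
```

So if one script is saved as ISO-8859-1 and another as UTF-8, the "same" literal compares as different until one side is converted.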
I'm currently trying to remove all special characters and accents from a UTF-8 string by turning them into their equivalent ASCII characters where possible.
So I'm simply using this code:
$result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
The problem is that for example the word "début" turns into "dbut" instead of "debut".
To make it work, I need to add a call to setlocale, like this:
setlocale(LC_ALL, 'en_US.UTF8');
$result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
And I don't understand why. I thought UTF-8 and ASCII were always the same, whatever locale you use.
EDIT: I didn't mean UTF-8 equals ASCII, I meant UTF-8 always equals UTF-8 and ASCII always equals ASCII
The subset of UTF-8 that overlaps with ASCII (code points 0-127) is indeed identical to ASCII. However, accented Latin characters are not part of the ASCII character set, so iconv has to transliterate them, and how it does that depends on the current locale. If you don't call setlocale yourself, the system's default locale (which evidently has no replacement for these accented characters) is used.
In general, iconv can be a little iffy; this is mentioned in the introduction of the extension:
This module contains an interface to iconv character set conversion
facility. With this module, you can turn a string represented by a
local character set into the one represented by another character set,
which may be the Unicode character set. Supported character sets
depend on the iconv implementation of your system. Note that the iconv
function on some systems may not work as you expect. In such case,
it'd be a good idea to install the GNU libiconv library. It will
most likely end up with more consistent results.
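As a sketch of the setlocale approach from the question, the helper below sets LC_CTYPE only for the duration of the conversion. The function name toAscii and the list of locale names are assumptions for illustration; which locales are installed, and what //TRANSLIT actually produces, depend on the system's iconv implementation.

```php
<?php
/**
 * Hypothetical helper: transliterate UTF-8 text to ASCII with iconv.
 * Returns null if iconv fails on this system.
 */
function toAscii(string $input, array $locales = ['en_US.UTF-8', 'en_US.UTF8']): ?string
{
    // Remember the current LC_CTYPE locale so it can be restored afterwards.
    $previous = setlocale(LC_CTYPE, '0');

    // Try each candidate locale in turn; setlocale() returns false
    // if none of them is installed, and the locale is left unchanged.
    setlocale(LC_CTYPE, $locales);

    // Errors are suppressed because some iconv builds reject these suffixes.
    $result = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);

    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return $result === false ? null : $result;
}

// "début" written as explicit UTF-8 bytes; the output is system-dependent.
var_dump(toAscii("d\xC3\xA9but"));
```

Setting only LC_CTYPE rather than LC_ALL is a design choice here: it avoids changing number and date formatting elsewhere in the script.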
I want to convert “Cool†back to its original string. The original string is cool (' is a backquote).
It seems that you just forgot to specify the character encoding properly.
This is because “ is what you get when the character “ (U+201C) encoded in UTF-8 (0xE2809C) is interpreted with a single-byte character encoding like Windows-1252 (the default character encoding in some browsers), where 0xE2, 0x80, and 0x9C represent the characters â, €, and œ respectively.
So just make sure to specify your character encoding properly. Or if you actually want to use Windows-1252 as your output character encoding, you can convert your UTF-8 data with mb_convert_encoding, iconv or similar functions.
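The effect can be reproduced in a few lines. This sketch assumes the mbstring extension is available; it garbles the opening quotation mark the same way and then reverses the step. The variable names are made up for the example.

```php
<?php
// Assumption: the mbstring extension is available.
$original = "\xE2\x80\x9C"; // U+201C (left double quotation mark) in UTF-8

// Misread the three UTF-8 bytes as three Windows-1252 characters
// and re-encode them as UTF-8. This is how the garbled text arises.
$garbled = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

// Reverse the mistake: map each character back to its Windows-1252 byte.
$repaired = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');

echo bin2hex($garbled), "\n";  // c3a2e282acc593
echo bin2hex($repaired), "\n"; // e2809c
```

Reversing the conversion only works when every byte of the garbled text has a Windows-1252 mapping, so fixing the declared encoding at the source is the more reliable cure.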
There's a wide variety of character encoding functions in PHP, especially if you have access to the multibyte string functions. (mbstring is thankfully enabled on most PHP installs.)
What you need to do is convert the original string to the encoding you require. As I don't know which encoding was used or which one is required, all I can suggest is that you try the mb_convert_encoding function, possibly after using mb_detect_encoding on the original string.
Incidentally, I'd highly recommend attempting to keep all data in UTF-8, (text files, HTML encoding, database connections/data, etc.) as you'll make your life a lot easier this way.
When I create a file on Windows hosting, it gets a name like джулия.jpg
It should be a Cyrillic name.
fopen() is used to create the file.
What can I do about this?
It's an encoding issue.
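Making PHP handle the name consistently as UTF-8 will probably suffice. See http://php.net/manual/en/function.utf8-encode.php, but note that utf8_encode() only converts ISO-8859-1 to UTF-8, so on its own it does not help with Cyrillic text.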
UTF-8 can represent every character in the Unicode character set, plus it has the special property of being backwards-compatible with ASCII.
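The garbled name in the question looks like UTF-8 bytes being read as Windows-1251. The sketch below reproduces that pattern, assuming the mbstring extension is available and assuming the server's ANSI code page is Windows-1251 (a guess based on the garbling, not something stated in the question). As far as I know, PHP versions before 7.1 on Windows pass file name bytes to the system in the ANSI code page, so converting the name first is one possible workaround there.

```php
<?php
// Assumption: the mbstring extension is available.
// "джулия.jpg" written as explicit UTF-8 bytes.
$utf8Name = "\xD0\xB4\xD0\xB6\xD1\x83\xD0\xBB\xD0\xB8\xD1\x8F.jpg";

// What happens when the UTF-8 bytes are read as Windows-1251:
// every Cyrillic letter turns into two unrelated characters.
$garbled = mb_convert_encoding($utf8Name, 'UTF-8', 'Windows-1251');

// Possible workaround (assumes a Windows-1251 code page): convert the
// name to the code page before handing it to fopen().
$ansiName = mb_convert_encoding($utf8Name, 'Windows-1251', 'UTF-8');

// Converting back recovers the original UTF-8 name.
$roundTrip = mb_convert_encoding($ansiName, 'UTF-8', 'Windows-1251');

echo $garbled, "\n";
echo strlen($ansiName), " bytes in Windows-1251\n";
```

A name containing characters outside the code page cannot be represented this way, which is one more reason to keep everything in UTF-8 where the platform allows it.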
Check whether all the script files use the same encoding (ANSI, ISO-..., UTF-8, etc.).
Check the internal encoding your script uses and the encoding of the string. In particular, look at the multibyte functions, the internal encoding of your script, and the encoding of your string.
NB: I do not recommend using string input from websites in the filesystem!
But if you expect input in a certain format, be sure to specify the content type of your HTML page.