I was wondering if there were any way to define the default encoding for htmlentities(). I have a big project going that uses htmlentities calls all over the place, and was wondering if there was a simple way to set it from ISO-8859-1 to UTF-8 as the default character encoding, using something simple like ini_set(). Or possibly with a separate namespace declaration.
Failing that, I would not be opposed to renaming and overriding the htmlentities function to always use Unicode, but am reluctant to install anything as freaky (to me) as PECL apd.
As the manual page doesn't say anything about changing the default charset, I don't think there is a way to do that; I also don't remember ever having seen anything about it.
I wouldn't use anything like apd either. Instead, I would probably:
create my own function, that calls htmlentities with the right parameters
and replace every call to htmlentities by a call to my new function (this can probably be done automatically, using a few lines of scripts)
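The wrapper approach described above could look like the following minimal sketch. The function name html_utf8() is hypothetical (it is not from the answer), and the scalar type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical wrapper: the name html_utf8() is an assumption for illustration.
// It pins the encoding so call sites no longer depend on PHP's default.
function html_utf8(string $text, int $flags = ENT_QUOTES): string
{
    return htmlentities($text, $flags, 'UTF-8');
}

// Assumes this file itself is saved as UTF-8.
echo html_utf8('café <b>'), PHP_EOL;
```

Existing htmlentities( calls can then be switched to html_utf8( with a project-wide search and replace, after checking for calls that already pass an explicit encoding.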
@Pascal MARTIN's solution is definitely correct. You can also use utf8_encode to convert ISO-8859-1 to UTF-8.
And utf8_decode to convert UTF-8 to ISO-8859-1.
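Note that utf8_encode() and utf8_decode() are deprecated as of PHP 8.2. A rough equivalent, assuming the mbstring extension is available, is to name both encodings explicitly with mb_convert_encoding():

```php
<?php
// "caf\xE9" is "café" encoded as ISO-8859-1 (a single byte for é).
$latin1 = "caf\xE9";

// ISO-8859-1 -> UTF-8 (what utf8_encode() did).
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// UTF-8 -> ISO-8859-1 (what utf8_decode() did).
$roundTrip = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), PHP_EOL;      // hex dump of the UTF-8 bytes
echo bin2hex($roundTrip), PHP_EOL; // hex dump of the ISO-8859-1 bytes
```

Either way, the conversion is only correct if the input really is in the source encoding you name.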
Related
I have a URL like: domain.tld/Σχετικά_με_μας
[edit]
Reading the $_SERVER['REQUEST_URI'] I get to work with:
%CE%A3%CF%87%CE%B5%CF%84%CE%B9%CE%BA%CE%AC_%CE%BC%CE%B5_%CE%BC%CE%B1%CF%82
[/edit]
In PHP I need to convert it to HTML, I get pretty far with:
htmlentities(urldecode($navstring), ENT_QUOTES, 'UTF-8');
It results in:
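&Sigma;&chi;&epsilon;&tau;&iota;&kappa;ά_&mu;&epsilon;_&mu;&alpha;&sigmaf;
but the 'ά' is left as a literal 'ά' instead of being converted to an entity. I need it converted to:
&#940;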
I'd really appreciate help. I need a universal solution, not a "string replace".
I have been playing around a little, and the following worked. Use mb_convert_encoding instead of htmlentities:
mb_convert_encoding(urldecode($navstring),'HTML-ENTITIES','UTF-8');
//string(90) "domain.tld/Σχετικά_με_μας"
See mb_convert_encoding
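On newer PHP versions the 'HTML-ENTITIES' pseudo-encoding is deprecated (as of PHP 8.2). A possible alternative, sketched here under the assumption that numeric entities are acceptable for every non-ASCII character, is mb_encode_numericentity(). The helper name to_numeric_entities() is hypothetical.

```php
<?php
// Hypothetical helper: converts every non-ASCII character to a decimal
// numeric entity. The convmap reads: code points 0x80..0x10FFFF,
// offset 0, mask 0x1FFFFF.
function to_numeric_entities(string $utf8): string
{
    return mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

// Percent-encoded UTF-8 for the first word of the URL in the question.
$navstring = '%CE%A3%CF%87%CE%B5%CF%84%CE%B9%CE%BA%CE%AC';

echo to_numeric_entities(urldecode($navstring)), PHP_EOL;
```

This does not escape <, >, & or quotes. If the string goes into HTML, run htmlspecialchars() on it first and apply the numeric conversion afterwards.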
Information
All modern web browsers understand UTF-8 character encoding.
My advice would be :
Always know the character encoding of the data you are using.
Store your data with UTF-8.
Output data with UTF-8
The mbstring php extension doesn't just manipulate Unicode strings. It also converts multibyte strings between various character encodings.
Use the mb_detect_encoding() and mb_convert_encoding() functions to detect a string's character encoding and to convert it from one character encoding to another.
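As a hedged sketch of the "always know your encoding" advice: mb_detect_encoding() is only a heuristic, so the example below validates with mb_check_encoding() instead and falls back to an assumed source encoding. The helper name ensure_utf8() and the Windows-1252 fallback are assumptions for illustration, not part of the answer.

```php
<?php
// Hypothetical helper: returns $text as UTF-8. If the bytes are not valid
// UTF-8, they are assumed (not detected) to be in $assumedSource.
function ensure_utf8(string $text, string $assumedSource = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

// Smart quotes as Windows-1252 bytes; this is not valid UTF-8.
$fromLegacy = ensure_utf8("\x93hi\x94");

echo bin2hex($fromLegacy), PHP_EOL;
```

The fallback is only as good as the assumption behind it; where the real source encoding is known, pass it explicitly.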
PHP Needs to know !
You also need to tell PHP that you are working with UTF-8. To set it as the default value, you can do it in your php.ini file:
default_charset = "UTF-8"
That default value is added to the default Content-Type header returned by PHP unless you specified it with the header() function :
header('Content-Type: application/json;charset=utf-8');
Keep in mind
The default character set is used by a lot of functions in PHP such as :
htmlentities()
htmlspecialchars()
all the mbstring functions
...
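The effect of the default character set on htmlentities() can be seen in a small experiment. This sketch assumes PHP 5.6 or later (where htmlentities() falls back to default_charset when no encoding is passed) and that the internal_encoding ini setting is left empty.

```php
<?php
$text = 'é'; // two bytes (C3 A9) when this file is saved as UTF-8

// With a UTF-8 default, the two bytes are read as one character.
ini_set('default_charset', 'UTF-8');
$asUtf8 = htmlentities($text);

// With an ISO-8859-1 default, each byte is read as its own character.
ini_set('default_charset', 'ISO-8859-1');
$asLatin1 = htmlentities($text);

// Restore a sane default before producing any output.
ini_set('default_charset', 'UTF-8');

echo $asUtf8, ' vs ', $asLatin1, PHP_EOL;
```

Passing the encoding explicitly to htmlentities() avoids depending on this setting at all.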
According to the PHP website, mb_internal_encoding() does this:
encoding is the character encoding name used for the HTTP input
character encoding conversion, HTTP output character encoding
conversion, and the default character encoding for string functions
defined by the mbstring module. You should notice that the internal
encoding is totally different from the one for multibyte regex.
Can someone please explain this in simpler terms?
1. HTTP input character encoding conversion
2. HTTP output character encoding conversion
3. default character encoding for string functions
4. What is meant by “internal encoding is totally different from the one for multibyte regex”?
My guess is that:
1. means GET and POST are treated as that encoding.
2. means it outputs to that encoding.
3. means it uses that encoding for all multibyte string functions.
4. I have no idea about. Why would regex be different from normal string functions?
If point 2 is correct would you need to do:
ini_set('default_charset', 'UTF-8');
If I understand 3 correctly does that mean if you do:
mb_internal_encoding('UTF-8')
You don't need to do:
mb_strtolower($str, 'UTF-8');
Just:
mb_strtolower($str);
I did read on another SO post that mb_strtolower($str) should not be trusted and that you need to set the encoding for each multibyte string function. Is this true?
The mbstring extension added the glorious idea (</sarcasm>) to automatically convert all incoming data and all output data from some encoding to another. See mbstring HTTP Input and Output. It's configured with the mbstring.http_input ini setting and by using the mb_output_handler. mb_internal_encoding influences this conversion. IMO you should leave those settings off and never touch them; I have yet to find any problem that can elegantly be solved by this and it sounds like a terrible idea overall to have implicit encoding conversions going on. Especially if it's all controlled via one global flag (mb_internal_encoding) which is used in a variety of different contexts.
So that's 1. and 2.
For 3., yes indeed, mb_internal_encoding basically sets the default value for all mb_ functions which accept an $encoding parameter. Essentially it just sets a global variable (internally) which other functions read from, that's all.
The last part refers to the fact that there's a separate mb_regex_encoding function to set the internal encoding for mb_ereg_ functions.
I did read on another SO post that mb_strtolower($str) should not be trusted and that you need to set the encoding for each multibyte string function. Is this true?
I'd agree to this insofar as all global state cannot be trusted. This is pretty trustworthy:
mb_internal_encoding('UTF-8');
mb_strtolower($string);
However, this is not really:
mb_strtolower($string);
See the difference? If you rely on global state being set correctly elsewhere, you can never be sure it actually is correct. You just need to make a call to some third party library which sets mb_internal_encoding to something else without you knowing, and your mb_strtolower call will suddenly behave very differently.
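The risk can be reproduced in a few lines. In this sketch the call to mb_internal_encoding('ISO-8859-1') stands in for a hypothetical third-party library changing the global setting; passing the encoding explicitly is unaffected by it.

```php
<?php
$string = 'ÀÉÎ'; // assumes this file is saved as UTF-8

$previous = mb_internal_encoding();

// Simulates a third-party library changing the global default behind your back.
mb_internal_encoding('ISO-8859-1');

$implicit = mb_strtolower($string);          // depends on global state
$explicit = mb_strtolower($string, 'UTF-8'); // does not

mb_internal_encoding($previous); // restore the original setting

// Shows whether each result matches the expected lowercase string.
var_dump($implicit === 'àéî', $explicit === 'àéî');
```

Only the explicit call is guaranteed to treat the input as UTF-8, whatever other code has done to the global setting.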
There is a string I'm trying to output in an htmlencoded way, and the htmlentities() function always returns an empty string.
I know exactly why it does so. However, I am not running PHP 5.4; I have the latest PHP 5.3 release installed.
The question is how I can HTML-encode a string that contains invalid code unit sequences.
According to the manual, ENT_SUBSTITUTE is the way to go. But this constant is not defined in PHP 5.3.X.
I did this:
if (!defined('ENT_SUBSTITUTE')) {
define('ENT_SUBSTITUTE', 8);
}
Still no luck: htmlentities() still returns an empty string.
I wanted to try ENT_DISALLOWED instead, but I cannot find its corresponding long value for it.
So my question is twofold:
What's the constant value of PHP 5.4's ENT_DISALLOWED?
How do I strip non-UTF-8 characters (such as the smart quotes) out of a string? Not just the smart quotes, but anything that causes htmlentities() to return a blank string.
It is true that htmlentities() in PHP 5.3 does not have the ENT_SUBSTITUTE flag; however, it does have the (not really recommended) ENT_IGNORE flag. Beware of the note in the manual and try to understand it before use:
Using this flag is discouraged as it may have security implications.
It is far better to understand why there is a problem with the input string in the first place. Most often, users have simply failed to specify the correct encoding.
E.g. first re-encode the string into UTF-8, then pass it to htmlspecialchars() or htmlentities(). Speaking of smart-quotes you are probably using a Windows-1252 encoded string. You won't even need to convert that one before use, you can just specify the charset properly (PHP 5.3):
htmlentities($string, ENT_QUOTES, $encoding = 'Windows-1252');
Naturally this only works if the input $string is encoded in Windows-1252 (CP1252). Find out the correct encoding first, then it's normally no problem. For non-supported encodings, re-encode into a supported one first, for example with iconv or mbstring.
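To make the options concrete, here is a small sketch. The sample string is an assumption (two smart quotes as Windows-1252 bytes), and the ENT_SUBSTITUTE variant needs PHP 5.4 or later, so it does not help on 5.3.

```php
<?php
$cp1252 = "\x93quoted\x94"; // smart quotes as Windows-1252 bytes

// Option 1: tell htmlentities() the real source encoding.
$named = htmlentities($cp1252, ENT_QUOTES, 'Windows-1252');

// Option 2 (PHP 5.4+): treat the bytes as UTF-8 and replace each invalid
// sequence with U+FFFD instead of failing.
$substituted = htmlentities($cp1252, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// Without ENT_SUBSTITUTE or ENT_IGNORE, invalid UTF-8 input gives ''.
$empty = htmlentities($cp1252, ENT_QUOTES, 'UTF-8');

echo $named, PHP_EOL;
var_dump($empty === '');
```

Option 1 keeps the information in the string; option 2 only stops the function from returning an empty result.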
As you say, these constants were added in 5.4.0. The thing is, support for them is new in 5.4.0 as well, meaning you can pass whatever values you want and an older htmlentities() will not understand them.
As is often the case, the PHP changelog is quite misleading.
Is there a way to tell PHP to use UTF-8 as default for functions like htmlspecialchars ?
I have already set this:
ini_set('mbstring.internal_encoding','UTF-8');
ini_set('mbstring.func_overload',7);
If not, please can you post a list of all functions where I need to specify the charset?
(I need this because I am refactoring my whole framework to work with UTF-8)
Just use htmlspecialchars() instead of htmlentities(). Because it doesn't touch the non-ASCII characters, it doesn't matter whether you use 'utf8' charset or the default 'latin1'(*), the results are the same. As a bonus your output is smaller. (Though it does mean you have to ensure you're actually serving your page with the correct encoding.)
(*: there are a few East Asian multibyte charsets which can differ in their use of ASCII code points, so if you're using those you would still need to pass a $charset argument to htmlspecialchars(). But certainly no such problem for UTF-8.)
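A quick illustration of that point, assuming the file is saved as UTF-8: htmlspecialchars() escapes only the HTML-significant ASCII characters and passes the Greek text through unchanged.

```php
<?php
$html = '<a href="x">Σχετικά & more</a>';

// Only <, >, & and quotes are escaped; non-ASCII characters are left as-is.
$escaped = htmlspecialchars($html, ENT_QUOTES, 'UTF-8');

echo $escaped, PHP_EOL;
```

The page must then be served as UTF-8 for the untouched characters to display correctly.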
Is there a way to tell PHP to use UTF-8 as default for functions like htmlspecialchars ?
Nope, not as far as I know. mbstring.internal_encoding will define a default encoding for the mb_* family of functions only.
If not, please can you post a list of all functions where I need to specify the charset?
I'm not sure whether such a list exists - if in doubt, just walk through the manual and look out for any charset parameters.
EDIT: the script mentioned in the question, and the other script pointed among the answers, both work just fine with multibyte strings - turned out my problem was elsewhere.
Does anyone know of such implementation? The script at http://phpjs.org/functions/view/469 works well, just not on multibyte strings.
This implementation seems to handle UTF-8 strings correctly. If you want to test the demo, make sure you change the encoding of the page to UTF-8 in your browser settings first.
The script you've posted has str = utf8_encode(str);.
You should probably remove this line and pass in your Cyrillic as UTF-8.