let htmlspecialchars use UTF-8 as default charset? - php

Is there a way to tell PHP to use UTF-8 as default for functions like htmlspecialchars ?
I have already setted this:
ini_set('mbstring.internal_encoding','UTF-8');
ini_set('mbstring.func_overload',7);
If not, please can you post a list of all functions where I need to specify the charset?
(I need this because I am re-factorizing all my framework to get working with UTF-8)

Just use htmlspecialchars() instead of htmlentities(). Because it doesn't touch the non-ASCII characters, it doesn't matter whether you use 'utf8' charset or the default 'latin1'(*), the results are the same. As a bonus your output is smaller. (Though it does mean you have to ensure you're actually serving your page with the correct encoding.)
(*: there are a few East Asian multibyte charsets which can differ in their use of ASCII code points, so if you're using those you would still need to pass a $charset argument to htmlspecialchars(). But certainly no such problem for UTF-8.)

Is there a way to tell PHP to use UTF-8 as default for functions like htmlspecialchars ?
Nope, not as far as I know. mbstring.internal_encoding will define a default encoding for the mb_* family of functions only.
If not, please can you post a list of all functions where I need to specify the charset?
I'm not sure whether such a list exists - if in doubt, just walk through the manual and look out for any charset parameters.

Related

PHP greek url convert

I have a URL like: domain.tld/Σχετικά_με_μας
[edit]
Reading the $_SERVER['REQUEST_URI'] I get to work with:
%CE%A3%CF%87%CE%B5%CF%84%CE%B9%CE%BA%CE%AC_%CE%BC%CE%B5_%CE%BC%CE%B1%CF%82
[/edit]
In PHP I need to convert it to HTML, I get pretty far with:
htmlentities(urldecode($navstring), ENT_QUOTES, 'UTF-8');
It results in:
Σχετικά_με_μας
but the 'ά' becomes 'ά' But I need it converted to
ά
I'dd really appreciate help. I need a universal solution, not a "string replace"
I have been playing around a little, and the following worked. Use mb-convert-encoding instead of htmlentities.:
mb_convert_encoding(urldecode($navstring),'HTML-ENTITIES','UTF-8');
//string(90) "domain.tld/Σχετικά_με_μας"
See mb-convert-encoding
Information
All modern web browsers understand UTF-8 character encoding.
My advice would be :
Always know the character encoding of the data you are using.
Store your data with UTF-8.
Output data with UTF-8
The mbstring php extension doesn't just manipulate Unicode strings. It also converts multibyte strings between various character encodings.
Use the mb_detect_encoding() (ref) and mb_convert_encoding() (ref 2) functions to convert Unicode strings from one character encoding to another.
PHP Needs to know !
You also need to tell PHP that you are working with UTF-8, to tell him the default value, you can do it in your php.ini file :
default_charset = "UTF-8";
That default value is added to the default Content-Type header returned by PHP unless you specified it with the header() function :
header('Content-Type: application/json;charset=utf-8');
Keep in mind
The default character set is used by a lot of functions in PHP such as :
htmlentities()
htmlspecialchars()
all the mbstring functions
...

PHP: parsing ascii string safely when running in multibyte mode

In my PHP config file I have
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
To ensure UTF8 support. I have read that one should also use the multibyte string manipulation functions throughout if you have set these settings. I am currently altering a library which parses an excel file, and I need to split the one attribute value in the form N12 to determine the spreadsheet size. I know for a fact that the value cannot have values outside of ascii range. Do I need to use the multibyte string manipulation functions to parse the 12 out of N12 or can I use the normal ones. I am asking as I would like to keep the solution general and maybe submit the solution back to the library. If I need to use the correct function depending on whether current mode is utf8 or not, what is the best way to check for this?
UTF-8 is a pure superset of ASCII. If your functions can handle UTF-8, they by definition can also handle ASCII. The core PHP string functions mostly expect single-byte encodings, but that doesn't mean they won't work with other encodings; for example: Multibyte trim in PHP?.
So it depends on what exactly you're trying to do. Possibly core PHP string functions will already work fine regardless of encoding. If they do not, and your operation would break when using multi-byte strings, then you can use the appropriate MB function instead which by definition will also handle ASCII just fine when treating the input as UTF-8.

What does PHP's mb_internal_encoding actually do?

According to the PHP website it does this:
encoding is the character encoding name used for the HTTP input
character encoding conversion, HTTP output character encoding
conversion, and the default character encoding for string functions
defined by the mbstring module. You should notice that the internal
encoding is totally different from the one for multibyte regex.
Can someone please explain this in simpler terms?
HTTP input character encoding conversion
HTTP output character encoding conversion
default character encoding for string functions
What is meant by “internal encoding is totally different from the one for multibyte regex”?
My guess is that
means GET and POST are treated as that encoding.
means it outputs to that encoding.
means it uses that encoding for all multibyte string functions.
I have no idea about. Why would regex be different to normal string functions?
If point 2 is correct would you need to do:
ini_set('default_charset', 'UTF-8');
If I understand 3 correctly does that mean if you do:
mb_internal_encoding('UTF-8')
You don't need to do:
mb_strtolower($str, 'UTF-8');
Just:
mb_strtolower($str);
I did read on another SO post that mb_strtolower($str) should no be trusted and that you need to set the encoding for each multibyte string function. Is this true?
The mbstring extension added the glorious idea (</sarcasm>) to automatically convert all incoming data and all output data from some encoding to another. See mbstring HTTP Input and Output. It's configured with the mbstring.http_input ini setting and by using the mb_output_handler. mb_internal_encoding influences this conversion. IMO you should leave those settings off and never touch them; I have yet to find any problem that can elegantly be solved by this and it sounds like a terrible idea overall to have implicit encoding conversions going on. Especially if it's all controlled via one global flag (mb_internal_encoding) which is used in a variety of different contexts.
So that's 1. and 2.
For 3., yes indeed, mb_internal_encoding basically sets the default value for all mb_ functions which accept an $encoding parameter. Essentially it just sets a global variable (internally) which other functions read from, that's all.
The last part refers to the fact that there's a separate mb_regex_encoding function to set the internal encoding for mb_ereg_ functions.
I did read on another SO post that mb_strtolower($str) should no be trusted and that you need to set the encoding for each multibyte string function. Is this true?
I'd agree to this insofar as all global state cannot be trusted. This is pretty trustworthy:
mb_internal_encoding('UTF-8');
mb_strtolower($string);
However, this is not really:
mb_strtolower($string);
See the difference? If you rely on global state being set correctly elsewhere, you can never be sure it actually is correct. You just need to make a call to some third party library which sets mb_internal_encoding to something else without you knowing, and your mb_strtolower call will suddenly behave very differently.

How to handle character encoding in PHP - Codeigniter?

What is the best way to convert user input to UTF-8?
I have a simple form where a user will pass in HTML, the HTML can be in any language and it can be in any character encoding format.
My question is:
Is it possible to represent everything as UTF-8?
What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?
I am trying to work out how to best implement this - advice and links appreciated.
I am making use of Codeigniter and its input class to retrieve post data.
A few points I should make:
I need to convert HTML special characters to their respective entities
It might be a good idea to accept encoding and return it in that same encoding. However, my web app is making use of :
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This might have an adverse effect on things.
Specify accept-charset in your <form> tag to tell the browser to submit user-entered data encoded in UTF-8:
<form action="foo" accept-charset="UTF-8">...</form>
See here for a complete guide on HOW TO Use UTF-8 Throughout Your Web Stack.
Is it possible to represent everything as UTF-8?
Yes, UTF-8 is a Unicode encoding, so you can use any character defined in Unicode. That's the best you can do with a computer to date.
What can I use to effectively convert any character encoding to UTF-8
iconv lets you convert virtually any encoding to any other encoding. But, for that you have to know what encoding you're dealing with. You can't say "iconv, whatever this is, make it UTF-8!". That's unfortunately not how it works. You can only say "iconv, I have this string here in BIG5, please convert that to UTF-8.".
If you're only dealing with form data in UTF-8 though, you'll probably never need to convert anything.
so that I can parse it with PHP string functions
"PHP string functions" work on bytes. They don't care about characters or encodings. Depending on what you want to do, working with naive PHP string functions on UTF-8 text will give you bad results. Use encoding-aware string functions in the MB extension for any multi-byte encoding string manipulation.
save it to my database
Just make sure your database stores text in UTF-8 and you have set your database connection to UTF-8 (i.e. the database knows you're sending it UTF-8 data). You should be able to specify that in the CodeIgniter database connection settings.
subsequently echo out using htmlentities?
Just echo htmlentities($text), nothing more you need to do.
However, my web app is making use of : <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This might have an adverse effect on things.
Not at all. It just signals to the browser that your page is encoded in UTF-8. Now you just need to make sure that's actually the case (as you're trying to do anyway). It also implies to the browser that it should send UTF-8 to the server. You can make that explicit with the accept-charset attribute on forms.
May I recommend What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, which might help you understand more.
1) Is it possible to represent everything as UTF-8?
Yes, everything defined in UNICODE. That's the most you can get nowadays, and there is room for the future that UNICODE can support.
2) What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?
The only thing you need to know is the actual encoding of your data. If you want your webapplication to support UTF-8 for input and output, the frontend needs to signal that it supports UTF-8. See Character Encodings for a guide regarding your applications user-interface.
Within PHP you need to feed any function with the encoding it supports. Some need to have the encoding specified, for some you need to convert it. Always check the function docs if it supports what you ask for. Additionally check your PHP configuration.
Related:
Preparing PHP application to use with UTF-8
How to detect malformed utf-8 string in PHP?
If you want to change the encoding of a string you can try
$utf8_string = mb_convert_encoding( $yourBadString , 'UTF-8' );
I found out that the only thing that works out for UTF-8 encoding is setting inside my config.php
putenv('LC_ALL=en_US.utf8'); // or whatever language you need
setlocale(LC_ALL, 'en_US.utf8'); // or whatever language you need
bindtextdomain("mydomain", dirname(__FILE__) . "/../language");
textdomain("mydomain");
EDIT :
Is it possible to represent everything as UTF-8?
Yes, these is what you need to ensure :
html : headers/meta-header set to utf-8
all files saved as utf-8
database collation, tables and data encoding to utf-8
What can I use to effectively convert any character encoding to UTF-8
You can use utf8_encode (Since for a system set up mainly for Western European languages, it will generally be ISO-8859-1 or its close relation,ref) before saving it into your database.
// eg
$name = utf8_encode($this->input->post('name'));
And as i mention before, you need to make sure database collation, tables and data encoding to utf-8. In CI, at your database connection config
// Make sure have these lines
$db['default']['char_set'] = 'utf8';
$db['default']['dbcollat'] = 'utf8_general_ci';

Define Default Charset for htmlentities()

I was wondering if there were any way to define the default encoding for htmlentities(). I have a big project going that uses htmlentities calls all over the place, and was wondering if there was a simple way to set it from ISO-8859-1 to UTF-8 as the default character encoding, using something simple like ini_set(). Or possibly with a separate namespace declaration.
Failing that, I would not be opposed to renaming and overriding the htmlentities function to always use Unicode, but am reluctant to install anything as freaky (to me) as PECL apd.
As the manual page doesn't say anything about changing the default charset, I don't think there is a way to do that ; and I don't remember having ever seen anything about that.
I wouldn't use anything like apd either -- instead, I would probably :
create my own function, that calls htmlentities with the right parameters
and replace every call to htmlentities by a call to my new function (this can probably be done automatically, using a few lines of scripts)
#Pascal MARTIN's solution is definitely correct, you can also use utf8-encode to convert ISO-8859-1 to UTF-8.
And utf8_decode to convert UTF-8 to ISO-8859-1.

Categories