PHP and UTF-8 String functions WITHOUT MB-Functions? - php

I'm trying to use UTF-8 with PHP. The output on my site seems fine (äöüß etc. display correctly when testing), but there is one simple problem: when I use echo strlen("Ä"); it shows "2". I read this topic: strlen() and UTF-8 encoding
In the answer I read this:
The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.
I wonder, why my Data is not valid UTF-8? Because:
I saved all my files in "UTF-8 no BOM"
Used UTF-8 header on the first line
My browser says also "Encoding: UTF-8"
This is my code:
<?php
header("Content-Type: text/html; charset=utf-8");
$test = 'Ä';
echo strlen($test);
var_dump($test);
?>
My question: can I use the normal PHP functions with UTF-8, or must I use the "mb" functions?
If it's possible to use the normal PHP functions, why does strlen() return 2 in my code instead of 1?

strlen() returns the length of the string in bytes, not characters. You can change this with the mbstring.func_overload ini setting, which tells PHP to return characters from a strlen() call instead, but the setting is global and affects a number of other functions as well, such as strpos() and substr() (the full list is in the mbstring documentation). Note that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0.
This can have serious adverse effects elsewhere in your code, particularly if you're using third-party libraries that aren't aware of it, so it isn't recommended.
It's better to use the mb_* functions when you know you're working with UTF-8 strings. Setting mbstring.func_overload simply tells PHP to use the mb_* functions in place of the normal string functions "under the hood".
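To make the byte/character distinction concrete, here is a minimal sketch. It assumes the mbstring extension is loaded, and it writes 'Ä' as explicit byte escapes so the result does not depend on how the file itself is saved.

```php
<?php
// Sketch: byte length vs. character length of a UTF-8 string.
$test = "\xC3\x84"; // the two UTF-8 bytes of 'Ä' (U+00C4)

$bytes = strlen($test);             // counts bytes
$chars = mb_strlen($test, 'UTF-8'); // counts characters

echo $bytes, "\n"; // 2
echo $chars, "\n"; // 1
```

Passing 'UTF-8' explicitly keeps the result independent of whatever internal encoding happens to be configured.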

Related

PHP greek url convert

I have a URL like: domain.tld/Σχετικά_με_μας
[edit]
Reading the $_SERVER['REQUEST_URI'] I get to work with:
%CE%A3%CF%87%CE%B5%CF%84%CE%B9%CE%BA%CE%AC_%CE%BC%CE%B5_%CE%BC%CE%B1%CF%82
[/edit]
In PHP I need to convert it to HTML, I get pretty far with:
htmlentities(urldecode($navstring), ENT_QUOTES, 'UTF-8');
It results in:
&Sigma;&chi;&epsilon;&tau;&iota;&kappa;ά_&mu;&epsilon;_&mu;&alpha;&sigmaf;
but the 'ά' stays 'ά', and I need it converted to
&#940;
I'd really appreciate help. I need a universal solution, not a "string replace".
I have been playing around a little, and the following worked. Use mb_convert_encoding() instead of htmlentities():
mb_convert_encoding(urldecode($navstring),'HTML-ENTITIES','UTF-8');
//string(90) "domain.tld/&Sigma;&chi;&epsilon;&tau;&iota;&kappa;&#940;_&mu;&epsilon;_&mu;&alpha;&sigmaf;"
See mb-convert-encoding
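On newer PHP versions the 'HTML-ENTITIES' pseudo-encoding is deprecated (as of PHP 8.2), so an alternative may be useful. The sketch below swaps in mb_encode_numericentity(), which is not what the answer above uses: it emits decimal numeric entities for every non-ASCII character, including Greek letters that also have named entities. The helper name to_numeric_entities is hypothetical.

```php
<?php
// Hypothetical helper (not from the original answer): encode every
// non-ASCII character as a decimal numeric entity.
function to_numeric_entities(string $utf8): string
{
    // convmap: [first code point, last code point, offset, mask]
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

$navstring = '%CE%AC_%CE%BC'; // percent-encoded "ά_μ"
echo to_numeric_entities(urldecode($navstring)), "\n"; // &#940;_&#956;
```

ASCII characters such as the underscore pass through unchanged, so the result can still be run through htmlspecialchars()-style escaping rules separately if needed.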
Information
All modern web browsers understand UTF-8 character encoding.
My advice would be :
Always know the character encoding of the data you are using.
Store your data with UTF-8.
Output data with UTF-8
The mbstring php extension doesn't just manipulate Unicode strings. It also converts multibyte strings between various character encodings.
Use the mb_detect_encoding() and mb_convert_encoding() functions to detect a string's encoding and convert it from one character encoding to another.
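A minimal sketch of that conversion step follows. It assumes the incoming data is ISO-8859-1 whenever it is not valid UTF-8, which is an assumption for illustration only. It also uses mb_check_encoding() rather than mb_detect_encoding(), because detection is only a heuristic guess while validation gives a definite answer.

```php
<?php
// Sketch: validate first, convert only when the source encoding is known.
$input = "\xE4"; // 'ä' as a single ISO-8859-1 byte (assumed source encoding)

if (!mb_check_encoding($input, 'UTF-8')) {
    // Not valid UTF-8, so convert from the encoding we assume it is in.
    $input = mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex($input), "\n"; // c3a4
```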
PHP needs to know!
You also need to tell PHP that you are working with UTF-8. You can set the default value in your php.ini file:
default_charset = "UTF-8";
That default value is added to the default Content-Type header returned by PHP unless you specified it with the header() function :
header('Content-Type: application/json;charset=utf-8');
Keep in mind
The default character set is used by a lot of functions in PHP such as :
htmlentities()
htmlspecialchars()
all the mbstring functions
...

PHP: parsing ascii string safely when running in multibyte mode

In my PHP config file I have
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
to ensure UTF-8 support. I have read that one should also use the multibyte string manipulation functions throughout once these settings are in place. I am currently altering a library which parses an Excel file, and I need to split one attribute value of the form N12 to determine the spreadsheet size. I know for a fact that the value cannot contain anything outside the ASCII range. Do I need to use the multibyte string manipulation functions to parse the 12 out of N12, or can I use the normal ones? I am asking because I would like to keep the solution general and maybe submit it back to the library. If I need to pick the correct function depending on whether the current mode is UTF-8 or not, what is the best way to check for this?
UTF-8 is a pure superset of ASCII. If your functions can handle UTF-8, they by definition can also handle ASCII. The core PHP string functions mostly expect single-byte encodings, but that doesn't mean they won't work with other encodings; for example: Multibyte trim in PHP?.
So it depends on what exactly you're trying to do. Possibly core PHP string functions will already work fine regardless of encoding. If they do not, and your operation would break when using multi-byte strings, then you can use the appropriate MB function instead which by definition will also handle ASCII just fine when treating the input as UTF-8.
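For the N12 case specifically, a byte-oriented approach might look like the sketch below. The function name split_cell_ref and the returned array shape are hypothetical, not part of the library in question. Because the pattern only accepts ASCII letters and digits, it behaves the same whatever mbstring settings are active.

```php
<?php
// Hypothetical helper: split a cell reference such as "N12" into column and row.
function split_cell_ref(string $ref): ?array
{
    // Only ASCII letters and digits are accepted, so byte-oriented matching is safe.
    if (preg_match('/^([A-Z]+)([0-9]+)$/', $ref, $m) !== 1) {
        return null;
    }
    return ['column' => $m[1], 'row' => (int) $m[2]];
}

var_dump(split_cell_ref('N12'));
```

Anything that is not plain ASCII simply fails to match and yields null, so no encoding check is needed beforehand.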

What does PHP's mb_internal_encoding actually do?

According to the PHP website it does this:
encoding is the character encoding name used for the HTTP input
character encoding conversion, HTTP output character encoding
conversion, and the default character encoding for string functions
defined by the mbstring module. You should notice that the internal
encoding is totally different from the one for multibyte regex.
Can someone please explain this in simpler terms?
1. HTTP input character encoding conversion
2. HTTP output character encoding conversion
3. default character encoding for string functions
4. What is meant by “internal encoding is totally different from the one for multibyte regex”?
My guess is that
1. means GET and POST are treated as that encoding.
2. means it outputs to that encoding.
3. means it uses that encoding for all multibyte string functions.
4. I have no idea about. Why would regex be different to normal string functions?
If point 2 is correct would you need to do:
ini_set('default_charset', 'UTF-8');
If I understand 3 correctly does that mean if you do:
mb_internal_encoding('UTF-8')
You don't need to do:
mb_strtolower($str, 'UTF-8');
Just:
mb_strtolower($str);
I did read on another SO post that mb_strtolower($str) should not be trusted and that you need to set the encoding for each multibyte string function. Is this true?
The mbstring extension added the glorious idea (</sarcasm>) to automatically convert all incoming data and all output data from some encoding to another. See mbstring HTTP Input and Output. It's configured with the mbstring.http_input ini setting and by using the mb_output_handler. mb_internal_encoding influences this conversion. IMO you should leave those settings off and never touch them; I have yet to find any problem that can elegantly be solved by this and it sounds like a terrible idea overall to have implicit encoding conversions going on. Especially if it's all controlled via one global flag (mb_internal_encoding) which is used in a variety of different contexts.
So that's 1. and 2.
For 3., yes indeed, mb_internal_encoding basically sets the default value for all mb_ functions which accept an $encoding parameter. Essentially it just sets a global variable (internally) which other functions read from, that's all.
The last part refers to the fact that there's a separate mb_regex_encoding function to set the internal encoding for mb_ereg_ functions.
I did read on another SO post that mb_strtolower($str) should not be trusted and that you need to set the encoding for each multibyte string function. Is this true?
I'd agree to this insofar as all global state cannot be trusted. This is pretty trustworthy:
mb_internal_encoding('UTF-8');
mb_strtolower($string);
However, this is not really:
mb_strtolower($string);
See the difference? If you rely on global state being set correctly elsewhere, you can never be sure it actually is correct. You just need to make a call to some third party library which sets mb_internal_encoding to something else without you knowing, and your mb_strtolower call will suddenly behave very differently.
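One way to sidestep the global-state problem is to pass the encoding on every call. The sketch below uses a hypothetical wrapper, lower_utf8, and simulates a third-party library changing the internal encoding; the wrapper's result is unaffected.

```php
<?php
// Hypothetical wrapper: always pass the encoding explicitly.
function lower_utf8(string $s): string
{
    return mb_strtolower($s, 'UTF-8');
}

// Simulate a third-party library changing the global default.
mb_internal_encoding('ISO-8859-1');

$upper = "\xC3\x84\xC3\x96";            // 'ÄÖ' in UTF-8
echo bin2hex(lower_utf8($upper)), "\n"; // c3a4c3b6, i.e. 'äö'
```

A bare mb_strtolower($upper) at the same point would interpret the bytes as ISO-8859-1 instead, which is exactly the kind of surprise described above.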

using the php 5.4's new constant ENT_DISALLOWED in htmlentities

There is a string I'm trying to output in an htmlencoded way, and the htmlentities() function always returns an empty string.
I know exactly why it does so: I am not running PHP 5.4; I have the latest PHP 5.3 release installed.
The question is how I am gonna be able to htmlencode a string which has invalid code unit sequences.
According to the manual, ENT_SUBSTITUTE is the way to go. But this constant is not defined in PHP 5.3.X.
I did this:
if (!defined('ENT_SUBSTITUTE')) {
define('ENT_SUBSTITUTE', 8);
}
still no luck. htmlentities is still returning empty string.
I wanted to try ENT_DISALLOWED instead, but I cannot find its corresponding long value for it.
So my question is twofold:
What's the constant value of PHP 5.4's ENT_DISALLOWED?
How can I make sure that a string containing non-UTF-8 characters (such as smart quotes) is cleared of them? Not just the smart quotes, but anything that causes htmlentities() to return an empty string.
It is true that htmlentities() in PHP 5.3 does not have the ENT_SUBSTITUTE flag; however, it has the (not really recommended) ENT_IGNORE flag. Beware of the note in the manual and try to understand it before use:
Using this flag is discouraged as it may have security implications.
It is far better to understand why there is a problem with the input string in the first place. Most often, users have simply failed to specify the correct encoding.
E.g. first re-encode the string into UTF-8, then pass it to htmlspecialchars() or htmlentities(). Speaking of smart-quotes you are probably using a Windows-1252 encoded string. You won't even need to convert that one before use, you can just specify the charset properly (PHP 5.3):
htmlentities($string, ENT_QUOTES, $encoding = 'Windows-1252');
Naturally this only works if the input $string is encoded in Windows-1252 (CP1252). Find out the correct encoding first, then it's normally no problem. For unsupported encodings, re-encode into a supported one first, for example with iconv or mbstring.
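The two approaches described above can be sketched side by side. The sample input is an assumption: a short string with Windows-1252 curly quotes, written as byte escapes. Both options rely on knowing the real source encoding in advance.

```php
<?php
// Sketch: the input is assumed to be Windows-1252; "\x93" and "\x94" are its curly quotes.
$cp1252 = "\x93Hi\x94";

// Option 1: tell htmlentities() the real input encoding.
$direct = htmlentities($cp1252, ENT_QUOTES, 'Windows-1252');

// Option 2: re-encode to UTF-8 first, then escape as UTF-8.
$utf8    = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');
$escaped = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

echo $direct, "\n";
echo $escaped, "\n";
```

Option 2 is usually the easier one to live with when the rest of the application already works in UTF-8, since the data is normalized once at the boundary.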
As you say, these constants were added in 5.4.0. The thing is, support for them is new in 5.4.0 as well, meaning you can pass whatever values you want and an older htmlentities() will not understand them.
As is most probably the case here, the PHP changelog is quite misleading.

Is there any downside to save all my source code files in UTF-8?

If that's relevant (it very well could be), they are PHP source code files.
There are a few pitfalls to take care of:
PHP is not aware of the BOM character certain editors or IDEs like to put at the very beginning of UTF-8 files. This character indicates the file is UTF-8, but it is not necessary, and it is invisible. This can cause "headers already sent out" warnings from functions that deal with HTTP headers because PHP will output the BOM to the browser if it sees one, and that will prevent you from sending any header. Make sure your text editor has a UTF-8 (No BOM) encoding; if you're not sure, simply do the test. If <?php header('Content-Type: text/html') ?> at the beginning of an otherwise empty file doesn't trigger a warning, you're fine.
Default string functions are not aware of multibyte encodings. This means that strlen really returns the number of bytes in the string, not the actual number of characters. This isn't too much of a problem until you start slicing strings of non-ASCII characters with functions like substr: when you do, the indices you pass refer to byte offsets rather than character offsets, and this can cause your script to break non-ASCII characters in two. For instance, echo substr("é", 0, 1) will output an invalid UTF-8 sequence because in UTF-8, é takes two bytes and substr returns only the first one. (The solution is to use the mb_ string functions, which are aware of multibyte encodings.)
You must ensure that your data sources (like external text files or databases) return UTF-8 strings too, because PHP makes no automagic conversion. To that end, you may use implementation-specific means (for instance, MySQL has a special query that lets you specify in which encoding you expect the result: SET CHARACTER SET UTF8 or something along these lines), or if you couldn't find a better way, mb_convert_encoding or iconv will convert one string into another encoding.
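The first two pitfalls can be illustrated with a short sketch. It assumes the mbstring extension is available; strip_bom is a hypothetical helper name, shown only as one possible way to deal with a BOM in text you read from elsewhere.

```php
<?php
// Sketch: byte-based vs. character-based slicing, plus a BOM check.
$e = "\xC3\xA9"; // 'é' in UTF-8 (two bytes)

$byteSlice = substr($e, 0, 1);             // first byte only: not valid UTF-8
$charSlice = mb_substr($e, 0, 1, 'UTF-8'); // the whole character

var_dump(mb_check_encoding($byteSlice, 'UTF-8')); // bool(false)
var_dump($charSlice === $e);                      // bool(true)

// Hypothetical helper: strip a leading UTF-8 BOM (EF BB BF) from file contents.
function strip_bom(string $contents): string
{
    return strncmp($contents, "\xEF\xBB\xBF", 3) === 0 ? substr($contents, 3) : $contents;
}
```

Note that strip_bom only helps with data files; a BOM at the top of a PHP source file still has to be removed in the editor, because it is output before any code runs.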
It's actually usually recommended that you keep all sources in UTF-8. It won't affect the size of regular code with Latin characters at all, but it will prevent glitches with any special characters.
If you are using special characters in, e.g., string values, the size is a little bigger, but that shouldn't matter.
Nevertheless, my suggestion is to always leave the default format. I spent many hours on an error where the format was saved wrongly and all the characters changed.
From a technical point of view, there isn't a difference!
Very relevant: the PHP parser may start to output spurious characters, like a funky upside-down question mark. Just stick to the norm; it's much preferred.
