Encoding of $_SERVER variables like $_SERVER['REQUEST_URI'] - php

I need to do some string operations on $_SERVER value especially $_SERVER['REQUEST_URI']
How PHP encodes such strings? Should I use mb_* family functions?
To understand better my question, let's say I have a page on my webserver called like this:
ããã.php
And I need to get the second char:
echo mb_substr($_SERVER['REQUEST_URI'],1,1);

How PHP encodes such strings?
PHP won't do anything to the string, but the web browser will usually percent encode any non-ASCII characters in REQUEST_URI. (I say "usually" because I have seen IE not do it. I expect, however, Apache to do the job in that case - but I'm not entirely sure whether it will. You'd have to try out.)
Running urldecode() will decode those characters.
Related reading: Unicode characters in URLs

$_SERVER['REQUEST_URI'] usually doesn't contain stuff which would require usage of Multibyte functions so I guess if you don't tend to do unusual stuff with the request URIs such as stuffing them with non-ASCII characters (caused by weird URI rewrites etc.) you are safe to use the "normal" PHP string manipulation functions.

Related

Purpose of using esc_url

I don't understand why we need to use the esc_url if I myself am the one who actually wrote the URL like:
echo get_template_directory_url . '/someText'
Although the /someText is hardcoded but I know it's clean and safe because I wrote it. What are the circumstances that this will be unsafe (like how do bad guys do bad things when I don't use the esc_url in this case? Do they hack into the server? If they can really hack into the server, they won't even bother the esc_url already?
I have referred to https://stackoverflow.com/a/30583251/19507498 , but he just explain how we use it without explaining why we need it.
The purpose of this function is to replace spaces and special characters with their encoded url pendants. For example will be replace with %20. This is needed, because spaces and some other special characters like umlauts or ß are not allowed in urls.
EDIT:
Furthermore ? and & need to be encoded, because those have special meanings in urls.

What does PHP's mb_internal_encoding actually do?

According to the PHP website it does this:
encoding is the character encoding name used for the HTTP input
character encoding conversion, HTTP output character encoding
conversion, and the default character encoding for string functions
defined by the mbstring module. You should notice that the internal
encoding is totally different from the one for multibyte regex.
Can someone please explain this in simpler terms?
HTTP input character encoding conversion
HTTP output character encoding conversion
default character encoding for string functions
What is meant by “internal encoding is totally different from the one for multibyte regex”?
My guess is that
means GET and POST are treated as that encoding.
means it outputs to that encoding.
means it uses that encoding for all multibyte string functions.
I have no idea about. Why would regex be different to normal string functions?
If point 2 is correct would you need to do:
ini_set('default_charset', 'UTF-8');
If I understand 3 correctly does that mean if you do:
mb_internal_encoding('UTF-8')
You don't need to do:
mb_strtolower($str, 'UTF-8');
Just:
mb_strtolower($str);
I did read on another SO post that mb_strtolower($str) should no be trusted and that you need to set the encoding for each multibyte string function. Is this true?
The mbstring extension added the glorious idea (</sarcasm>) to automatically convert all incoming data and all output data from some encoding to another. See mbstring HTTP Input and Output. It's configured with the mbstring.http_input ini setting and by using the mb_output_handler. mb_internal_encoding influences this conversion. IMO you should leave those settings off and never touch them; I have yet to find any problem that can elegantly be solved by this and it sounds like a terrible idea overall to have implicit encoding conversions going on. Especially if it's all controlled via one global flag (mb_internal_encoding) which is used in a variety of different contexts.
So that's 1. and 2.
For 3., yes indeed, mb_internal_encoding basically sets the default value for all mb_ functions which accept an $encoding parameter. Essentially it just sets a global variable (internally) which other functions read from, that's all.
The last part refers to the fact that there's a separate mb_regex_encoding function to set the internal encoding for mb_ereg_ functions.
I did read on another SO post that mb_strtolower($str) should no be trusted and that you need to set the encoding for each multibyte string function. Is this true?
I'd agree to this insofar as all global state cannot be trusted. This is pretty trustworthy:
mb_internal_encoding('UTF-8');
mb_strtolower($string);
However, this is not really:
mb_strtolower($string);
See the difference? If you rely on global state being set correctly elsewhere, you can never be sure it actually is correct. You just need to make a call to some third party library which sets mb_internal_encoding to something else without you knowing, and your mb_strtolower call will suddenly behave very differently.

using the php 5.4's new constant ENT_DISALLOWED in htmlentities

There is a string I'm trying to output in an htmlencoded way, and the htmlentities() function always returns an empty string.
I know exactly why it does so. Well, I am not running PHP 5.4 I got the latest PHP 5.3 flavor installed.
The question is how I am gonna be able to htmlencode a string which has invalid code unit sequences.
According to the manual, ENT_SUBSTITUTE is the way to go. But this constant is not defined in PHP 5.3.X.
I did this:
if (!defined('ENT_SUBSTITUTE')) {
define('ENT_SUBSTITUTE', 8);
}
still no luck. htmlentities is still returning empty string.
I wanted to try ENT_DISALLOWED instead, but I cannot find its corresponding long value for it.
So my question is two folded
What's the constant value of PHP 5.4's ENT_DISALLOWED?
How do I make sure that a string containing non UTF-8 characters (such as the smart quotes), can be cleared out of them? - Not just the smart quotes but anything that causes htmlentities() to return blank string.
It is true that htmlentities() in PHP 5.3 does not have the ENT_SUBSTITUTE flag, however it has the (not really suggested) ENT_IGNORE flag. Be ware of the note and try to understand it before use:
Using this flag is discouraged as it » may have security implications.
It is far better that you understand why there is a problem with the input string in the first place. Most often users are only missing to specify the correct encoding.
E.g. first re-encode the string into UTF-8, then pass it to htmlspecialchars() or htmlentities(). Speaking of smart-quotes you are probably using a Windows-1252 encoded string. You won't even need to convert that one before use, you can just specify the charset properly (PHP 5.3):
htmlentities($string, ENT_QUOTES, $encoding = 'Windows-1252');
Naturally this only works if the input $string is encoded in Windows-1252 (CP1252). Find out the correct encoding first, then it's normally no problem. For non-supported encodings re-encode into a supported one first, for example with iconv or mb_string.
As you say, these constants were added in 5.4.0. The thing is, the support is new to 5.4.0 as well. Meaning you can pass whatever values you want, older htmlentities will not understand it.
As it is most probably the case, php changelog is quite misleading.

Handling Multibyte characters in php

Am working on php based mime parser. If the body contains string like Iñtërnâtiônàlizætiøn we see that It is getting converted into Iñtërnâtiônàlizætiøn. Can somebody suggest how to handle (what functions) for such string ?
So we are doing the following
Using Zend Library connecting to the IMAP server
mail = new Zend_Mail_Storage_Imap($params);
Read the message using
$message = $mail->getMessage($i);
in the loop.
When we print the $message we see the string e.g. Iñtërnâtiônàlizætiøn printed as Iñtërnâtiônà lizætiøn.
What I need is if there is someway by which we can retain the original string? And this is just one example we may run into other multi-byte characters, so what to know how we handle this generically?
There's no specific function for that, you simply need to treat the string in the encoding it's in. A string is just a blob of bytes, it gets turned into characters by whatever is interpreting those bytes as text. And that something needs to use the correct encoding for that, otherwise those bytes are not interpreted as the characters they were supposed to be. See Handling Unicode Front To Back In A Web App for a rundown of the common pitfalls.
as mentioned in the comment, you can use php mb_* functions to work with multibyte characters. Here is just an example to detect the encoding of a string:
$s="Iñtërnâtiônàlizætiøn";
echo mb_detect_encoding($s); //UTF-8
then you can work with this, use utf8_decode($s) or any mb_ functions to convert the string to your wished encoding.

Is there any downside to save all my source code files in UTF-8?

If that's relevant (it very well could be), they are PHP source code files.
There are a few pitfalls to take care of:
PHP is not aware of the BOM character certain editors or IDEs like to put at the very beginning of UTF-8 files. This character indicates the file is UTF-8, but it is not necessary, and it is invisible. This can cause "headers already sent out" warnings from functions that deal with HTTP headers because PHP will output the BOM to the browser if it sees one, and that will prevent you from sending any header. Make sure your text editor has a UTF-8 (No BOM) encoding; if you're not sure, simply do the test. If <?php header('Content-Type: text/html') ?> at the beginning of an otherwise empty file doesn't trigger a warning, you're fine.
Default string functions are not multibyte encodings-aware. This means that strlen really returns the number of bytes in the string, not the actual number of characters. This isn't too much of a problem until you start splicing strings of non-ASCII characters with functions like substr: when you do, indices you pass to it refer to byte indices rather than character indices, and this can cause your script to break non-ASCII characters in two. For instance, echo substr("é", 0, 1) will return an invalid UTF-8 character because in UTF-8, é actually takes two bytes and substr will return only the first one. (The solution is to use the mb_ string functions, which are aware of multibyte encodings.)
You must ensure that your data sources (like external text files or databases) return UTF-8 strings too, because PHP makes no automagic conversion. To that end, you may use implementation-specific means (for instance, MySQL has a special query that lets you specify in which encoding you expect the result: SET CHARACTER SET UTF8 or something along these lines), or if you couldn't find a better way, mb_convert_encoding or iconv will convert one string into another encoding.
It's actually usually recommended that you keep all sources in UTF8. It won't matter size of regular code with latin characters at all, but will prevent glitches with any special characters.
If you are using any special chars in e.g string values, the size is a little bit bigger, but that shouldn't matter.
Nevertheless my suggestion is, to always leave the default format. I spent so many hours because there was an error with the format saving and all characters changed.
From a technical point of few, there isn't a difference!
Very relevant, the PHP parser may start to output spurious characters, like a funky unside-down questionmark. Just stick to the norm, much preferred.

Categories