I've done some tests, and it appears that when I test this:
http://127.0.0.1/test.php?x={some non-english string}
http://127.0.0.1/test.php?x=الapple
By examining the output of:
echo bin2hex($_GET["x"]);
In Firefox and Chrome, I get the UTF-8 representation of the string in the $_GET['x'] variable: d8a7d9846170706c65.
In IE, I get 3f3f6170706c65, which is wrong.
And I know that PHP does not change encoding, and only sees the string as a byte array.
The question is:
Is this controlled by the browser used?
Is it reliable to always assume the input is in UTF-8 encoding?
Is there a way to control what encoding the browser sends to the server, across all browsers?
It depends on where the request originated.
If it’s from a user’s input, e.g., entering the URL into the browser’s address field, most browsers follow the suggestion in RFC 3986 and use UTF-8 as encoding:
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; […]
That recommendation is aimed at new URI schemes, though, and HTTP is quite old.
However, if the URL was embedded in a document, e.g., as a link or form action, the document’s encoding is used unless the data was already encoded using URL encoding. If the data cannot be represented in that encoding, invalid sequences may be replaced with a character that denotes them, as � (U+FFFD) does in Unicode. Similarly, the characters ا and ل, which could not be encoded, may have been replaced by ?, which has the code point 0x3F in ASCII.
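Since the bytes that arrive depend on the client, one option is to check the input on the server instead of assuming it is UTF-8. A minimal sketch, assuming a hypothetical helper named is_valid_utf8 (not part of the original post) and using the hex values from the question as sample data:

```php
<?php
// Hypothetical helper: true if $value is well-formed UTF-8.
// An empty pattern with the /u modifier makes PCRE validate the subject as UTF-8;
// preg_match() returns false instead of 1 when the bytes are malformed.
function is_valid_utf8(string $value): bool
{
    return preg_match('//u', $value) === 1;
}

$utf8Bytes   = hex2bin('d8a7d9846170706c65'); // what Firefox and Chrome sent
$lossyBytes  = hex2bin('3f3f6170706c65');     // what IE sent: "??apple"
$latin1Bytes = "caf\xE9";                     // "café" in ISO-8859-1, for comparison

var_dump(is_valid_utf8($utf8Bytes));   // bool(true)
var_dump(is_valid_utf8($lossyBytes));  // bool(true): plain ASCII is also valid UTF-8
var_dump(is_valid_utf8($latin1Bytes)); // bool(false)
```

Note that such a check only catches malformed byte sequences. It cannot detect the IE case, where the characters were already replaced by ? before the request was sent.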
I think it should come down to how urldecode (http://www.php.net/manual/en/function.urldecode.php) interprets it, since the $_GET variables are all passed through that function (see http://php.net/manual/en/reserved.variables.get.php)
EDIT
To encode the characters to UTF-8 for use in a URL from the client side, you can use the encodeURI in JavaScript.
For the example you gave, you can do encodeURI('الapple');, which should return "%D8%A7%D9%84apple"
Giving this to PHP's urldecode function (as it would be automatically) returns the original string, with the following hex output;
echo bin2hex(urldecode("%D8%A7%D9%84apple")); //outputs d8a7d9846170706c65
Yes, it's possible!
To encode the URL :
<?php
$url = "http://127.0.0.1/test.php?x=".urlencode("some non-english string");
?>
To decode the URL :
<?php
$x = $_GET["x"]; // already URL-decoded by PHP; calling urldecode() here would decode it a second time
?>
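As a rough sketch of the full round trip, the query string can be built with http_build_query, which percent-encodes each value, and parse_str can stand in for the decoding PHP performs when it populates $_GET. The variable names are illustrative, and the string is written as byte escapes so the example does not depend on the encoding of the source file:

```php
<?php
$text = "\xD8\xA7\xD9\x84apple"; // UTF-8 bytes of "الapple"

// Encoding: build the query string; each value is percent-encoded byte by byte.
$query = http_build_query(['x' => $text]);
$url   = 'http://127.0.0.1/test.php?' . $query;
echo $url, "\n"; // http://127.0.0.1/test.php?x=%D8%A7%D9%84apple

// Decoding: parse_str() mimics what PHP does before filling $_GET,
// so the receiving script does not need to call urldecode() itself.
parse_str($query, $params);
echo bin2hex($params['x']), "\n"; // d8a7d9846170706c65
```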
Related
As far as I can tell, everything is set to UTF-8: my database (PostgreSQL) is using UTF-8 encoding, and I've checked the php.ini file and its encoding is UTF-8. I tried debugging to see whether any of the functions I use were responsible, but everything runs as expected. However, after my frontend sends a POST request through curl to the backend server for some text to be inserted in the database, some sequences like 'da' are converted to '?' in PostgreSQL and in memcached. I think php is converting them to Latin-1 again after the request reaches the other side for some reason, because I use utf8_encode before the request and utf8_decode on the other side.
this is the code to send the request
$pre_opp->Send_Request_To_BackEnd("/Settings", $school_name, $uuid, "Upload_Bio", "POST", str_replace(" ", "%", utf8_encode($bio)));
this is how the backend system receives this
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));
Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf8_encode and utf8_decode anyway. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data = utf8_decode(urldecode($_POST["Data"]));
You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what utf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request-sending code; you shouldn't need to decode it at the other end, because PHP will do that for you.
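The failure described above can be reproduced in a few lines. This is only a sketch under assumptions: the sample text "foo da" is made up, and urldecode stands in for the decoding PHP applies to an incoming form-encoded body.

```php
<?php
$bio = "foo da"; // sample text: a space followed by the letters "da"

// The question's approach: spaces become "%", so the text now contains "%da",
// which the receiving side decodes as the single byte 0xDA.
$mangled  = str_replace(' ', '%', $bio); // "foo%da"
$received = urldecode($mangled);         // stands in for PHP's automatic decoding
echo bin2hex($received), "\n";           // 666f6fda
var_dump(preg_match('//u', $received) === 1); // bool(false): no longer valid UTF-8

// Percent-encoding the value properly keeps the text intact.
$encoded = rawurlencode($bio);             // "foo%20da"
var_dump(rawurldecode($encoded) === $bio); // bool(true)
```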
I have Chinese characters encoded in ISO-8859-1, for example &#20860; = 兼
Those characters are taken from the database using AJAX and sent as JSON using json_encode.
I then use the template Handlebars to set the data on the page.
When I look at the AJAX response in the browser, the characters are displayed correctly, although the source is still encoded.
But the final result displays the encoded characters.
I tried to decode on the javascript part with unescape but there is no foreach with the template that gives me the possibility to decode the specific variable, so it crashes.
I tried to decode on the PHP side with htmlspecialchars_decode but without success.
Both pages are encoded in ISO-8859-1. I can change them to UTF-8 if necessary, but the data in the database remains encoded in ISO-8859-1.
Thank you for your help.
You're simply representing your characters as HTML entities. If you want them as "actual characters", you'll need to use an encoding that can represent those characters; ISO-8859-1 won't do. htmlspecialchars_decode doesn't work because it only decodes a handful of characters that are special in HTML and leaves other characters alone. You'll need html_entity_decode to decode all entities, and you'll need to provide it with a character set to decode to which can handle Chinese characters, UTF-8 being the obvious best choice:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
You'll then need to make sure the browser knows that you're sending it UTF-8. If you want to store the text in the database in UTF-8 as well (which you really should), best follow the guide How to handle UTF-8 in a web app which explains all the pitfalls.
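The difference between the two decoding functions can be seen in a small sketch. The sample value is an assumption: it reuses the numeric reference for 兼 (U+517C, decimal 20860) as a stand-in for what the database returns.

```php
<?php
$fromDb = '&#20860;'; // sample value holding a numeric character reference

// htmlspecialchars_decode() only handles the few entities that are special in HTML
// (&amp; &quot; &lt; &gt; and the single quote), so this passes through untouched.
$unchanged = htmlspecialchars_decode($fromDb);

// html_entity_decode() converts every entity into the target encoding.
$decoded = html_entity_decode($fromDb, ENT_COMPAT, 'UTF-8');

echo $unchanged, "\n";        // &#20860;
echo bin2hex($decoded), "\n"; // e585bc, the three UTF-8 bytes of 兼
```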
Are you including your text with the "double-stache" Handlebars syntax?
{{your expression}}
As the Handlebars documentation mentions, that syntax HTML-escapes its output, which would cause the results you're mentioning, where you're seeing the entity &#20860; instead of 兼.
Using three braces instead ("triple-stache") won't escape the output and will let the browser correctly interpret those numeric entities:
{{{your expression}}}
So, I've run into a problem with PHP's rawurlencode function. All text fields in our web app are of course converted before being processed by the web server, and we've used rawurlencode for this. This works fine with almost every character I've found, except for the "£" sign. Now, there is no reason for our users to ever enter a pound sign, but they might, so I want to take care of this.
The problem is that rawurlencode doesn't encode a pound sign entered on the webpage as %A3, but instead as %C2%A3. Even worse, if the user failed to enter another bit of critical information (which causes the webpage to refresh - the checks are done on the backend side - and try and refill the form boxes with the information the user had used), then when the %C2 is run through rawurldecode/encode, it becomes Ã? - aka, %C3?. And of course the "£" is also turned into another £!
So, what is causing this? I assume it's a character encoding issue, but I'm not that knowledgeable about these things. I heard somewhere that I can encode £s as &pound; manually, but why should I need to do that when the database can handle "£"s, and there is a percent-encoding for a pound sign? Is this a bug in rawurlencode, or a bug caused by differing character sets?
Thanks for any help.
The standard requires forms to be submitted in the character encoding you specify in <form accept-charset="..."> or UTF-8 if it's not specified or the text the user has entered cannot be represented in the charset you specify.
Clearly, you're receiving the pound sign encoded in UTF-8. If you want to convert it to ISO-8859-15, write:
iconv("UTF-8", "ISO-8859-15//TRANSLIT", $original)
The pound sign is byte A3 in your native single-byte character set, but the browser submitted it as C2 A3, which is the valid UTF-8 encoding of that character. Either consume the encoded URL as UTF-8, or convert the string to a single-byte encoding before passing it to urlencode.
Artefacto's answer covers the case where you need to convert character encodings, for example because you are displaying a page whose encoding is set to Latin-1. (raw)urlencode escapes the string byte by byte, so a multibyte character produces several escapes. (raw)urldecode returns those same bytes, so a £ submitted as UTF-8 comes back as two bytes. If you display this string while claiming that it is an ISO-8859 encoded string, it will appear as two characters.
A primer on PHP and UTF-8: http://www.phpwact.org/php/i18n/utf-8
Some "hot tips": http://www.sitepoint.com/blogs/2006/08/10/hot-php-utf-8-tips/
Likely, between getting the string from rawurldecode, and using the string, the locale is assumed to be ISO8859, so two bytes get interpreted as two characters when they represent one.
Use mb_convert_encoding to force PHP to realize that the bytes in the string represent a UTF-8 encoded string.
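The byte-level behaviour discussed in this thread can be sketched as follows. It assumes the iconv extension is available (it is what the earlier answer uses), and the pound sign is written as byte escapes so the result does not depend on the source file's encoding:

```php
<?php
$pound = "\xC2\xA3"; // "£" as UTF-8, which is what the browser submitted

echo rawurlencode($pound), "\n"; // %C2%A3

// Converting to a single-byte encoding first yields the one-byte form.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $pound);
echo rawurlencode($latin1), "\n"; // %A3

// Treating the UTF-8 bytes as ISO-8859-1 and converting them again encodes the
// character twice, which is the kind of garbling described in the question.
$double = iconv('ISO-8859-1', 'UTF-8', $pound);
echo rawurlencode($double), "\n"; // %C3%82%C2%A3
```

The last case is why a form value that is decoded and re-encoded under the wrong assumed charset grows extra characters on every page refresh.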
I have user-submitted tags that can be any type of (valid) UTF-8 string. I want to know if it is safe to include them in the URL merely by running them through urlencode().
In other words, is urlencode() safe to use for valid UTF-8 strings?
(By valid I mean I have already force-encoded them to UTF-8.)
urlencode does not depend on a specific character encoding. It just looks at the bytes, interprets them as ASCII characters, and replaces any byte that is either outside ASCII (0x80–0xFF) or not allowed to appear unescaped in a URL.
Now to your question: Yes, using urlencode does encode any string in any character encoding to be safely used – but only in the URL query! Because urlencode formats the input according to application/x-www-form-urlencoded that differs from the “normal” percent encoding in how the space is encoded: In application/x-www-form-urlencoded spaces are replaced by + while the “normal” percent encoding replaces them by %20.
If you want “normal” percent encoding, use rawurlencode instead.
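A short sketch of the difference, using a made-up tag written as UTF-8 byte escapes:

```php
<?php
$tag = "caf\xC3\xA9 au lait"; // "café au lait" as UTF-8 bytes

// application/x-www-form-urlencoded style: space becomes "+"
echo urlencode($tag), "\n";    // caf%C3%A9+au+lait

// "normal" percent encoding: space becomes "%20"
echo rawurlencode($tag), "\n"; // caf%C3%A9%20au%20lait

// Each encoder round-trips with its matching decoder.
var_dump(urldecode(urlencode($tag)) === $tag);       // bool(true)
var_dump(rawurldecode(rawurlencode($tag)) === $tag); // bool(true)
```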
Just to be entirely on the safe side, I would remove newlines first. They are not dangerous in themselves, but they can be stepping stones in exploiting other vulnerabilities.
Yes, urlencode() should make a safe URL string out of any input string, as long as whatever that URL maps to (folder/file/htaccess) doesn't have funky characters in it. Whenever I sanitize user input that could contain something funky, I love this function:
utf8_encode()
curl downloads http://mysite.com/Lunacy%20Disc%202%20of%202%20(U)(Saturn).zip
but not
http://mysite.com/Lunacy Disc 2 of 2 (U)(Saturn).zip
Why is this the case?
Do I need to convert it to the first format ?
Using the URL generated via urlencode($url) fails.
Two problems:
urlencode will also encode the slashes on you. It's meant to encode query strings for use in urls, not full urls.
urlencode encodes spaces as +. You need rawurlencode if you want spaces as %20.
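Both problems go away if only the path segments are encoded, and with rawurlencode. A minimal sketch, assuming a hypothetical helper named build_url that is not part of the original post:

```php
<?php
// Hypothetical helper: joins a base URL and path segments,
// percent-encoding each segment but leaving the separating slashes alone.
function build_url(string $base, array $segments): string
{
    return rtrim($base, '/') . '/' . implode('/', array_map('rawurlencode', $segments));
}

$file = 'Lunacy Disc 2 of 2 (U)(Saturn).zip';

echo build_url('http://mysite.com', [$file]), "\n";
// http://mysite.com/Lunacy%20Disc%202%20of%202%20%28U%29%28Saturn%29.zip

// For comparison, encoding the whole URL mangles the scheme and slashes:
echo urlencode('http://mysite.com/' . $file), "\n";
// http%3A%2F%2Fmysite.com%2FLunacy+Disc+2+of+2+%28U%29%28Saturn%29.zip
```

rawurlencode also escapes the parentheses as %28 and %29. That form is equivalent to the URL in the question, since a server decodes both to the same file name.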
To convert a URL to the "first format", you can use the PHP function urlencode.
Now, for the "why", the answer can probably be found in the RFC 1738 - Uniform Resource Locators (URL).
Quoting some paragraphs :
Octets must be encoded if they have no corresponding graphic
character within the US-ASCII coded character set, if the use of the
corresponding character is unsafe, or if the corresponding character
is reserved for some other interpretation within the particular URL
scheme.
No corresponding graphic US-ASCII:
URLs are written only with the graphic printable characters of the
US-ASCII coded character set. The octets 80-FF hexadecimal are not
used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
control characters; these must be encoded.
A space is octet 20 hexadecimal, which is not in the range 00-1F, so this rule alone does not require it to be encoded. But a bit later:
Unsafe:
Characters can be unsafe for a number of reasons. The space
character is unsafe because significant spaces may disappear and
insignificant spaces may be introduced when URLs are transcribed or
typeset or subjected to the treatment of word-processing programs.
And here, you know why the space character has to be escaped/encoded too ;-)
urlencode() does indeed fail with curl. If your problem is just with spaces, you can substitute them manually:
$url = str_replace(' ', '%20', $url);
You need to urlencode to translate the spaces (in your example; there are other characters that require it) for transmission across the internet. The encoding ensures that the various communications protocols don't terminate or otherwise mangle the string while they're handling it.
http://mysite.com/Lunacy Disc 2 of 2 (U)(Saturn).zip
That is not a valid URL. Accessing URLs like this may work in your browser because most modern browsers will automatically encode the URL for you if required. The curl library apparently does not do this automatically.
Why? Because some characters have special meanings, such as # (HTML anchor).
So urlencode encodes all non-alphanumeric characters except -_. regardless of whether they strictly need to be encoded.