I have an online service that uses MD5s of JSON-encoded REQUEST objects as a form of verification that the object that is sent over POST is the object that is received, and has not been edited or truncated in transit.
We have come across a problem when working with clients written in VBA, which emit Unicode escape sequences with uppercase hex digits when converting to JSON through WebHelpers.ConvertToJson.
For example, in PHP, json_encode("£") will return '\u00a3', but in VBA, WebHelpers.ConvertToJson("£") will return '\u00A3'.
When I MD5 those two strings, the MD5s are obviously very different.
So, my question is - how do I set WebHelpers.ConvertToJson to output lowercase unicode?
Or, how would I translate a JSON string after the conversion to lowercase the unicode?
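One way to approach the second question is a post-processing pass over the JSON text. The following is only a sketch, written in JavaScript for illustration; the function name lowercaseUnicodeEscapes is hypothetical, and the same regular-expression idea would have to be ported to VBA or PHP for the setup described above.

```javascript
// Lowercase the hex digits of every \uXXXX escape in a JSON document.
// The second alternative consumes any other escape (such as \\ or \") so that
// an escaped backslash followed by the literal text "u00A3" is left untouched.
function lowercaseUnicodeEscapes(json) {
  return json.replace(/\\(u[0-9a-fA-F]{4}|[\s\S])/g, (match, body) =>
    body.length === 5 ? "\\" + body.toLowerCase() : match
  );
}

const fromVba = '"\\u00A3"'; // the JSON text "\u00A3" as a VBA client might send it
const normalized = lowercaseUnicodeEscapes(fromVba);

// Both spellings decode to the same character; only the raw text differs.
const sameValue = JSON.parse(fromVba) === JSON.parse(normalized);
console.log(normalized, sameValue);
```

Because both spellings are valid JSON for the same value, an alternative is to hash a normalized form on both sides rather than the raw text.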
If JSON data such as
{"inf": "Väri-väri"}
is saved as
{"inf": "Vu00e4ri-vu00e4ri"}
how can I recover the letters õ, ä, ö, ü, etc. across the whole JSON document with PHP? I have tried utf8_decode and utf8_encode.
Thank you.
There are flags for json_encode that control this: http://php.net/manual/en/json.constants.php. Try:
json_encode($myVar,JSON_UNESCAPED_UNICODE)
The problem in your case is not the JSON encoding in itself, but how you store the encoded JSON document. Note what the encoded JSON document should actually look like:
$a = ["inf" => "Väri-väri"];
echo json_encode($a) . "\n";
// prints: {"inf":"V\u00e4ri-v\u00e4ri"}
This is the expected behaviour in PHP and totally consistent with the JSON spec in RFC 7159:
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A though
F can be upper or lower case. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
However, you're losing the \ characters at some point when storing the data. A wild guess is that you're storing these strings in a relational database using SQL and did not escape properly. The first thing I'd suggest is to investigate how you store your data and ensure that backslashes are properly escaped when storing these strings in a database. If stored correctly, json_decode will easily decode the encoded characters back to regular unicode characters.
Alternatively, you can disable this behaviour by passing the JSON_UNESCAPED_UNICODE flag into json_encode:
echo json_encode($a, JSON_UNESCAPED_UNICODE);
Have a look at the PHP documentation. If you decode the JSON, the letters will be recovered.
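To illustrate the difference between intact and damaged data, here is a small JavaScript sketch. With the backslashes in place, an ordinary JSON decode restores the letters. For data that has already lost its backslashes, the helper repairStrippedEscapes (a hypothetical name) applies a heuristic that also rewrites any genuine text that happens to look like "u" followed by four hex digits, so it is a last resort rather than a reliable fix.

```javascript
// Intact document: the decoder turns \u00e4 back into "ä".
const intact = '{"inf":"V\\u00e4ri-v\\u00e4ri"}';
const decoded = JSON.parse(intact).inf;

// Heuristic repair for text whose backslashes were stripped before storage.
// Assumption: "uXXXX" never occurs legitimately in the stored text.
function repairStrippedEscapes(text) {
  return text.replace(/u([0-9a-fA-F]{4})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

const damaged = '{"inf": "Vu00e4ri-vu00e4ri"}';
const repaired = JSON.parse(repairStrippedEscapes(damaged)).inf;
console.log(decoded, repaired);
```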
I've done some tests, and it appears that when I test this:
http://127.0.0.1/test.php?x={some non-english string}
http://127.0.0.1/test.php?x=الapple
By examining the output of:
echo bin2hex($_GET["x"]);
In Firefox & Chrome, I get the UTF-8 representation of the string in the $_GET['x'] variable: d8a7d9846170706c65.
In IE, I get 3f3f6170706c65, which is wrong.
And I know that PHP does not change encoding, and only sees the string as a byte array.
The question is:
Is this controlled by the browser used?
Is it reliable to always assume the input is in UTF-8 encoding?
Is there a way to control which encoding the browser sends to the server, across all browsers?
There is a difference from where the request originated.
If it’s from a user’s input, e.g., entering the URL into the browser’s address field, most browsers follow the suggestion in RFC 3986 and use UTF-8 as encoding:
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; […]
Although this is intended for new URI schemes and HTTP is quite old.
However, if the URL was embedded in a document, e.g., as a link or form action, the document’s encoding is used unless the data was already encoded using the URL encoding. And in case the data has a wrong encoding, invalid sequences may be replaced with certain characters that denote those invalid sequences, as � (U+FFFD) does in Unicode. Similarly, the invalidly encoded characters ل and ا may have been replaced by ?, which has the code point 0x3F in ASCII.
I think it should come down to how urldecode (http://www.php.net/manual/en/function.urldecode.php) interprets it, since the $_GET variables are all passed through that function (see http://php.net/manual/en/reserved.variables.get.php)
EDIT
To encode the characters to UTF-8 for use in a URL from the client side, you can use encodeURI in JavaScript.
For the example you gave, you can do encodeURI('الapple');, which should return "%D8%A7%D9%84apple"
Giving this to PHP's urldecode function (as it would be automatically) returns the original string, with the following hex output:
echo bin2hex(urldecode("%D8%A7%D9%84apple")); //outputs d8a7d9846170706c65
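The client-side half of this round trip can be checked in JavaScript alone. This is a minimal sketch; the Arabic letters are written as \u escapes so the character order is unambiguous, and the hex helper is only there to mirror PHP's bin2hex.

```javascript
// "الapple" spelled with escapes: U+0627 (alef), U+0644 (lam), then ASCII.
const original = "\u0627\u0644apple";

// encodeURI always percent-encodes the UTF-8 bytes of non-ASCII characters.
const encoded = encodeURI(original);

// Decoding reverses the step, which is what the server does with the query string.
const decoded = decodeURIComponent(encoded);

// Hex dump of the UTF-8 bytes, comparable to bin2hex() on the PHP side.
const hex = Array.from(new TextEncoder().encode(decoded), (b) =>
  b.toString(16).padStart(2, "0")
).join("");

console.log(encoded, hex);
```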
Yes, it's possible!
To encode the URL:
<?php
$url = "http://127.0.0.1/test.php?x=".urlencode("some non-english string");
?>
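To read the value back (PHP has already URL-decoded $_GET, so calling urldecode again would decode it twice):
<?php
$x = $_GET["x"]; // already URL-decoded by PHP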
?>
My web application communicates with the server over JSON protocol. Before sending each JSON message from the web application, I run a hmac-sha1 function on it (on already encoded object) and insert the resulting HMAC into the header of JSON request.
On server side, I decode JSON message with PHP, extract the HMAC, unset() the HMAC from the object and then encode the object back into JSON and create a HMAC of it.
The HMACs match as long as I don't use characters like "ž, š, č". When I use those characters in the message, the HMACs don't match anymore.
In the web application I'm using jQuery.post() to transmit the already encoded JSON string.
If I send the data I got from the web application back to it in the JSON encoded reply, the application will display "ž, č, š" just nicely.
How can I make the HMACs match?
UPDATE:
This is only a problem on latest version of Firefox and Opera. It works fine on IE8 and Chrome. On the former browsers, the JSON string (before it is sent) is:
{"body":[{"name":"Žiga Kraljevič","email":"test#email.com","password":"secretpass"}],"header":{"apiID":"person-27jhfa83ha-js84sjj18dasjd","hmac":"e4259d6ef8f477c020d644409cc16dd9c42301e8"}}
While on the latter browsers (IE8 and Chrome, where it works) is the following:
{"body":[{"name":"\u017diga Kraljevi\u010d","email":"test#email.com","password":"secretpass"}],"header":{"apiID":"person-27jhfa83ha-js84sjj18dasjd","hmac":"e4e9e2d0d8d11728a2b4329ad6dacdb9409b1de1"}}
You're probably running into multiple issues. One of them may well be that the character encoding used on the client differs from the one used on the server; it is worth ensuring that they're the same (more about character encoding in Joel's excellent essay). Another may well be that there are multiple correct ways to encode things, and the two encoders may be choosing different ones. For instance, you can encode a " within a string as either \" or \u0022. Both are valid, and they're equivalent, but the hashes won't match. For the same reason, I'm a bit surprised you're not running into more trouble even without accented characters, for instance with whitespace.
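The point about equivalent encodings can be seen directly. In this sketch two JSON texts carry the same value but differ as raw strings, so any hash over the raw text would differ too. Parsing and re-serializing with the same encoder brings them to one spelling. This is not a complete canonical form: key order, number formatting, and whitespace in the original would still need to be agreed on by both sides.

```javascript
// Same value, two valid spellings: escaped and unescaped "Ž" (U+017D).
const escaped = '{"name":"\\u017diga"}';
const literal = '{"name":"\u017diga"}';

const sameText = escaped === literal; // the raw texts differ
const sameValue = JSON.parse(escaped).name === JSON.parse(literal).name;

// Re-serializing both with one encoder yields identical text to hash.
const normalizedA = JSON.stringify(JSON.parse(escaped));
const normalizedB = JSON.stringify(JSON.parse(literal));

console.log(sameText, sameValue, normalizedA === normalizedB);
```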
What is your hmac-sha1 function, where's it from? If it is taking a JSON String as input then there's an implicit encode-to-bytes step going on here because SHA1 operates on bytes, not UTF-16 code units like JS String.
I would suspect that your JS function is using a “one code unit n per byte n” type of encoding, for easy calculation with tools like charCodeAt. This is effectively the same as if the character string input had been encoded to ISO-8859-1. Whereas if you are using encodeURIComponent or posting the raw characters via XMLHttpRequest, the implicit encoding there is UTF-8.
You could convert the String to UTF-8-bytes-stored-as-code-units format for the JS hmac-sha1 function; that might make it match PHP. There's a sneaky idiom to do this:
var utf8= unescape(encodeURIComponent(s));
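To see what that idiom produces, the sketch below compares its output with the UTF-8 bytes from TextEncoder. The helper name toUtf8ByteString is hypothetical. Note that unescape is a legacy function, although it is still available in browsers and Node.js.

```javascript
// Turn a JS string into a "byte string": one code unit per UTF-8 byte.
function toUtf8ByteString(s) {
  return unescape(encodeURIComponent(s));
}

const byteString = toUtf8ByteString("\u017e"); // "ž"

// Each code unit now holds a single byte value (0-255).
const codes = Array.from(byteString, (ch) => ch.charCodeAt(0));

// Reference: the UTF-8 bytes as computed by the platform encoder.
const reference = Array.from(new TextEncoder().encode("\u017e"));

console.log(codes.map((c) => c.toString(16)), reference.map((c) => c.toString(16)));
```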
When POSTing JSON I base64 and urlencode it anyway
URL-encoding should be enough (with encodeURIComponent, not escape which is the wrong thing for absolutely everything except the reverse step of the UTF-8-conversion trick above).
BTW, what's the purpose of this? You do know it doesn't in any way secure the connection between the browser and the server, yeah?
Edit:
I'm using jssha.sourceforge.net for sha1-hmac. In PHP I'm using hash_hmac.
Works for me:
var data= '\u017E, \u010D, \u0161'; // 'ž, č, š' in a Unicode string
var utf8bytes= unescape(encodeURIComponent(data));
var hmac= new jsSHA(utf8bytes).getHMAC('foo', 'ASCII', 'SHA-1', 'HEX');
alert(hmac); // 5d15f0b9...
var form= 'message='+encodeURIComponent(data)+'&hmac='+encodeURIComponent(hmac);
xmlhttprequest.send(form);
...
$utf8bytes= $_POST['message']; // "\xc5\xbe, \xc4\x8d, \xc5\xa1"
// which is 'ž, č, š' as UTF-8 in byte string
$hmac= hash_hmac('sha1', $utf8bytes, 'foo');
echo $hmac; // 5d15f0b9...
echo strtolower($hmac)===strtolower($_POST['hmac']); // true
This uses the binary ('ASCII' to jsSHA) key foo. If you are using a binary key with non-ASCII characters in it, you would have to make sure that those are properly encoded too, in the same way as the data.
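The effect of the implicit encode-to-bytes step can also be reproduced outside the browser. The sketch below assumes Node.js and its built-in crypto module as a stand-in for both sides; hmacSha1Hex is a hypothetical helper. It shows that hashing the UTF-8 bytes gives the same result as hashing the byte string from the unescape trick, while a one-code-unit-per-byte (Latin-1 style) conversion of the original string gives a different digest.

```javascript
const crypto = require("crypto");

// HMAC-SHA1 over explicit bytes, returned as lowercase hex.
function hmacSha1Hex(key, bytes) {
  return crypto.createHmac("sha1", Buffer.from(key, "utf8")).update(bytes).digest("hex");
}

const data = "\u017e, \u010d, \u0161"; // 'ž, č, š'

// Server-style: hash the UTF-8 bytes of the message.
const viaUtf8 = hmacSha1Hex("foo", Buffer.from(data, "utf8"));

// Client-style with the trick: code units already hold UTF-8 byte values.
const byteString = unescape(encodeURIComponent(data));
const viaTrick = hmacSha1Hex("foo", Buffer.from(byteString, "latin1"));

// Without the trick: each code unit is cut down to one byte, losing data.
const viaTruncation = hmacSha1Hex("foo", Buffer.from(data, "latin1"));

console.log(viaUtf8 === viaTrick, viaUtf8 === viaTruncation);
```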
The key for HMAC is a shared secret between the server and the client, which has been previously exchanged over a secure connection.
It's not only the key you'd have to send over a secure connection, but the entire page and all scripts in it. Otherwise a man in the middle attack could sabotage your scripts on the way to the browser to replace them with a version that used the secret key to sign bogus messages. If you've got an HTTPS server for all this stuff, fine. I'm not sure what the HMAC would be doing in that case though, it seems a bit involved for an anti-XSRF scheme.
So, I've run into a problem with PHP's rawurlencode function. All text fields in our web app are of course converted before being processed by the web-server, and we've used rawurlencode for this. This works fine with almost every character I've found, except for the "£" sign. Now, there is no reason for our users to ever enter a pound sign, but they might, so I want to take care of this.
The problem is that rawurlencode doesn't encode a pound sign entered on the webpage as %A3, but instead as %C2%A3. Even worse, if the user failed to enter another bit of critical information (which causes the webpage to refresh - the checks are done on the backend side - and try and refill the form boxes with the information the user had used), then when the %C2 is run through rawurldecode/encode, it becomes Ã? - aka, %C3?. And of course the "£" is also turned into another £!
So, what is causing this? I assume it's a character encoding issue, but I'm not that knowledgeable about these things. I heard somewhere that I can encode £s as &pound; manually, but why should I need to do that when the database can handle "£"s, and there is a percentage-encoding for a pound sign? Is this a bug in rawurlencode, or a bug caused by differing character sets?
Thanks for any help.
The standard requires forms to be submitted in the character encoding you specify in <form accept-charset="...">; if none is specified, the encoding of the page itself is used, which is UTF-8 for a UTF-8 page.
Clearly, you're receiving the pound sign encoded in UTF-8. If you want to convert it to ISO-8859-15, write:
iconv("UTF-8", "ISO-8859-15//TRANSLIT", $original)
rawurlencode is probably taking the character that is A3 in your native character set and producing C2 A3, which is the valid UTF-8 encoding of that character. Either consume your encoded URL as UTF-8, or convert the string to an ANSI encoding before passing it to urlencode.
Artefacto's answer represents a case when you need to convert character encodings, for example, you are displaying a page and the page encoding is set to Latin-1. (Raw)Urlencode will produce escaped strings with multibyte character representations. (Raw)Urldecode will by default produce UTF-8 encoded strings, and will represent £ as two bytes. If you display this string making a claim that it is an ISO-8859 encoded string, it will appear as two characters.
A primer on PHP and UTF-8: http://www.phpwact.org/php/i18n/utf-8
Some "hot tips": http://www.sitepoint.com/blogs/2006/08/10/hot-php-utf-8-tips/
Likely, between getting the string from rawurldecode, and using the string, the locale is assumed to be ISO8859, so two bytes get interpreted as two characters when they represent one.
Use mb_convert_encoding to force PHP to realize that the bytes in the string represent a UTF-8 encoded string.
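The double encoding described in this thread can be reproduced step by step. The following JavaScript sketch is only an illustration of the byte-level mechanics; it reads the UTF-8 bytes as ISO-8859-1 by mapping each byte to the code point with the same value, which is what happens when the wrong character set is assumed.

```javascript
// The pound sign (U+00A3) is percent-encoded from its two UTF-8 bytes.
const encoded = encodeURIComponent("\u00a3");

// The raw UTF-8 bytes: 0xC2, 0xA3.
const bytes = Array.from(new TextEncoder().encode("\u00a3"));

// Misreading those bytes as ISO-8859-1 yields two characters: U+00C2, U+00A3.
const misread = String.fromCharCode(...bytes);

// Encoding the misread text again doubles the damage.
const reencoded = encodeURIComponent(misread);

console.log(encoded, misread.length, reencoded);
```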
I have a PHP API (an HTTP GET API for Flash Builder and C# apps). To submit data to it, you use a URL like
http://localhost/cms/api.php?method=someMethod&string=Your_String
If it contains only English letters, that's fine. But what should I do if I need to pass a UTF-8 string like Русское Имя to my API?
Use the rawurlencode() function. It encodes your string byte by byte, but that is not a problem, since UTF-8 is an ASCII-compatible representation. All code points below 128 are encoded identically to ASCII, and all code points above 127 are represented by byte sequences whose bytes all lie between 128 and 255, so you will not have problems with it. The input wrapper should decode the parameters into your $_REQUEST array properly.
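For a client that is not written in PHP, the same percent-encoding can be sketched in JavaScript, where encodeURIComponent plays a role comparable to rawurlencode (both work on the UTF-8 bytes and encode a space as %20). The URL is the one from the question; the URL parser at the end only stands in for the server-side decoding.

```javascript
const value = "Русское Имя";

// Percent-encode the UTF-8 bytes; the resulting URL is pure ASCII.
const url =
  "http://localhost/cms/api.php?method=someMethod&string=" +
  encodeURIComponent(value);

const isAscii = /^[\x00-\x7f]*$/.test(url);

// Stand-in for the server: decoding the query restores the original string.
const received = new URL(url).searchParams.get("string");

console.log(url, isAscii, received === value);
```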