PHP - string encoding - php

I am receiving a string as a $_GET parameter whose hex representation is "6d617263f2".
As far as I understand character encoding, this is not a UTF-8 string. If I print it as UTF-8, I get "marc�". If I convert the string to UTF-8 with utf8_encode, I get the correct representation, which is marcò.
I have set all my character encodings (default_charset, iconv and mbstring) in php.ini to UTF-8. I also have mbstring.encoding_translation set to On.
I don't fully understand what is going on. Why am I not getting my $_GET parameter correctly encoded as UTF-8?
My guesses are:
the client is using another character encoding, and if I want to use UTF-8 there is no other way than to explicitly convert my parameter to UTF-8
I am missing something somewhere...
Could you please help shed some light on this?

If you don't control the origin of that GET parameter, then there's nothing you can do. PHP will give you the string as is and won't automatically convert its encoding. It can't, since it doesn't know what encoding to convert from. There's no spec or anything where anyone could get that information from. You need to specify what encoding you accept strings in. Don't leave it up to the client to decide, because then you have no idea what you're going to get.
If the client sends you ISO-8859-1 encoded text, but you want it to be UTF-8 encoded internally (a sensible choice BTW), you will simply have to convert its encoding. I'd use iconv('ISO-8859-1', 'UTF-8', $_GET['foo']) for that since it's more explicit about what it does, but utf8_encode happens to do exactly the same thing.
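A minimal sketch of that conversion, using the bytes from the question. It assumes the client really sent ISO-8859-1; PHP cannot confirm that for you, so the source encoding is something you have to decide.

```php
<?php
// The raw bytes from the question, rebuilt from their hex form.
$raw = hex2bin('6d617263f2');

// A lone 0xF2 byte is not a complete UTF-8 sequence, hence the "�" when printed.
$isUtf8 = mb_check_encoding($raw, 'UTF-8');

// Assumption: the client sent ISO-8859-1.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);

var_dump($isUtf8);          // bool(false)
echo bin2hex($utf8), "\n";  // 6d617263c3b2, i.e. "marcò" in UTF-8
```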

Related

Why is php converting certain characters to '?'

Everything in my code is running. My database (PostgreSQL) uses UTF-8 encoding, and I've checked that the encoding in php.ini is UTF-8. I tried debugging to see whether any of the functions I use were doing this, but everything runs as expected. However, after my frontend sends a POST request through curl to the backend server for some text to be inserted in the database, some characters like 'da' are converted to '?' in PostgreSQL and in memcached. I think php is converting them to Latin-1 again after the request reaches the other side for some reason, because I use utf8_encode before the request and utf8_decode on the other side.
This is the code that sends the request:
$pre_opp->Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST",str_replace(" ","%",utf8_encode($bio)));
This is how the backend system receives it:
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));
Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf8_encode and utf8_decode anyway. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data= utf8_decode(urldecode($_POST["Data"]));
You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what utf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request sending code; you shouldn't need to decode it at the other end, because PHP will do that for you.
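To see how the "%" replacement damages the text, here is a small sketch. The sample string is made up, and urldecode stands in for the decoding PHP applies to form data on the receiving side.

```php
<?php
// Hypothetical sample text: a space followed by two hex digits ("da") triggers the problem.
$bio = 'cidade da luz';

// What the question's code does: spaces become bare "%" signs.
$mangled = str_replace(' ', '%', $bio);      // "cidade%da%luz"

// URL decoding reads "%da" as the single byte 0xDA; "%lu" is not hex and is left alone.
$decoded = urldecode($mangled);
$valid = mb_check_encoding($decoded, 'UTF-8');

// Percent-encoding the UTF-8 bytes directly round-trips without any transcoding.
$text = "cidade da luz \u{00E9}";            // ends with "é"
$roundTrip = urldecode(urlencode($text));

var_dump($valid);                // bool(false)
var_dump($roundTrip === $text);  // bool(true)
```

The decoded string is no longer valid UTF-8, which is the kind of input that utf8_decode replaces with "?".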

json encode utf8 error

I have a problem encoding this character with json_encode:
http://www.fileformat.info/info/unicode/char/92/index.htm
At first it gives me this error:
JSON_ERROR_UTF8, which is
'Malformed UTF-8 characters, possibly incorrectly encoded'
So I tried running utf8_encode() before json_encode.
Now it returns this result: '\u0092'
Then I found this function:
function jsonRemoveUnicodeSequences($struct) {
return preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", json_encode($struct));
}
The character shows up, but together with another one:
Â’
I also tried htmlentities followed by html_entity_decode,
with no result.
json_encode() requires input that is one of:
null
integer, float, boolean
string encoded as UTF-8
objects implementing JsonSerializable
arrays of JSON-encodable values
stdClass instances whose properties are JSON-encodable values
So, if you have a string, you must first transcode it to UTF-8. The correct tool for that is the iconv library, but you need to know which encoding the string currently has in order to correctly transcode it.
Your approach to recursively transcode arrays or objects should work, but I'd strongly suggest not using anything but UTF-8 internally. If you have an interface where you have to accept different encodings, validate and reject immediately and use UTF-8 onwards. Similarly, when replying, keep UTF-8 until the last possible point where you can still signal encoding problems.
If you look at the link you included to the character U+0092, it is a control character, and it is also known as PRIVATE USE TWO. Its existence in your string means that your string is almost certainly not a UTF-8 string. Instead, it is probably a Windows-specific encoding, likely Windows-1252 if your text is English, in which 0x92 is a "smart quote" apostrophe, also known as a right single quotation mark. The Unicode equivalent of this character is U+2019.
Thus your data source is not giving you UTF-8 text. Either you can fix the source data to be UTF-8 encoded, or you can convert the text you receive. For example, the output of
echo iconv('Windows-1252','UTF-8', "\x92")
is
’
which is probably what you want. However, you want to make sure that all of your input is the same encoding. If some of your data is UTF-8 and some is Windows-1252, the above iconv call will properly handle Windows-1252 encoded apostrophes, but it will convert UTF-8 encoded apostrophes to
â€™
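A rough sketch of the conversion step in front of json_encode. The toUtf8 helper and the sample string are hypothetical. The "valid UTF-8 means leave it alone" check is only a heuristic for mixed input, not a reliable detector.

```php
<?php
// Hypothetical input: "It's" with a Windows-1252 right single quotation mark (0x92).
$raw = "It\x92s";

$failed = json_encode($raw);   // false: 0x92 is not valid UTF-8
$error = json_last_error();    // JSON_ERROR_UTF8

// Heuristic (an assumption, not a guarantee): leave valid UTF-8 alone and
// treat everything else as Windows-1252, to avoid converting twice.
function toUtf8(string $s): string
{
    return mb_check_encoding($s, 'UTF-8') ? $s : iconv('Windows-1252', 'UTF-8', $s);
}

echo json_encode(toUtf8($raw)), "\n";            // "It\u2019s"
echo json_encode(toUtf8("It\u{2019}s")), "\n";   // "It\u2019s" (already UTF-8, untouched)
```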

Why is my PHP urlencode not functioning as examples on internet?

Why does my urlencode() produce something different than I expected?
This might be my expectations being wrong but then I would be even more puzzled.
example
urlencode("ä");
expectations = returns %C3%A4
reality = returns %E4
Where have I gone wrong in my expectations? It seems to be linked to encoding, but I'm not very familiar with what I should do or use.
Should I change something on my server so that the function uses the right encoding?
urlencode encodes the raw bytes in your string into a percent-encoded representation. If you expect %C3%A4 that means you expect the UTF-8 byte representation of "ä". If you get %E4 that means your string is actually encoded in ISO-8859-1 instead.
Encode your string in UTF-8 to get the expected result. How to do this depends on where this string comes from. If it's a string literal in your source code file, save the file as UTF-8 in your text editor. If it comes from a database, see UTF-8 all the way through.
For more background information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
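The difference is easy to reproduce by writing the bytes out explicitly. This sketch uses escape sequences, so the result does not depend on how the source file itself is saved.

```php
<?php
$utf8   = "\xC3\xA4"; // "ä" as UTF-8 (two bytes)
$latin1 = "\xE4";     // "ä" as ISO-8859-1 (one byte)

echo urlencode($utf8), "\n";    // %C3%A4
echo urlencode($latin1), "\n";  // %E4

// Converting the ISO-8859-1 bytes to UTF-8 first gives the expected escape.
echo urlencode(iconv('ISO-8859-1', 'UTF-8', $latin1)), "\n"; // %C3%A4
```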

Can php convert strings with all charset encodes to utf8

Can PHP convert strings in any character encoding to UTF-8?
Solutions that don't work:
utf8_encode($string) - but it only encodes an ISO-8859-1 string to UTF-8
iconv($incharset, $outcharset, $text) - but how can I find the string's current encoding?
(this is only possible if the string is part of an HTML DOM document, not just a plain string)
Thanks
It is possible to convert a string from any encoding supported by iconv() into UTF-8 in PHP.
but how can I find the string's current encoding?
You should never need to "find" the current encoding: Your script should always know what it is. Any resource you query, if properly encoded, will give you its encoding in the content-type header or through other means.
As Artefacto says, there is the possibility of using mb_detect_encoding() but this is not a reliable method. The data flow of the program should always have it defined what encoding a string is in (and preferably work with UTF-8 internally) - that's the way to go.
In general, you cannot know which encoding a given string is using.
All you can do is guess. There's mb_detect_encoding, which doesn't really work well and then there are more complex heuristics, such as those used by browsers, which employ language cues.
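A short illustration of why the bytes alone cannot settle it: the same single byte is valid text in several encodings. The three candidate encodings here are arbitrary picks for the example.

```php
<?php
$bytes = "\xE4"; // one byte, with no metadata about its encoding

// Each interpretation succeeds and yields a different character.
$asLatin1   = iconv('ISO-8859-1', 'UTF-8', $bytes);   // "ä" (U+00E4)
$asGreek    = iconv('ISO-8859-7', 'UTF-8', $bytes);   // "δ" (U+03B4)
$asCyrillic = iconv('Windows-1251', 'UTF-8', $bytes); // "д" (U+0434)

echo bin2hex($asLatin1), "\n";   // c3a4
echo bin2hex($asGreek), "\n";    // ceb4
echo bin2hex($asCyrillic), "\n"; // d0b4
```

None of these conversions fails, so only knowledge of where the data came from can tell you which one is right.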

(PHP) rawurlencode/decode seems to encode '£' sign as 'Â£' (%C2%A3 instead of %A3)

So, I've run into a problem with PHP's rawurlencode function. All text fields in our web app are of course converted before being processed by the web server, and we've used rawurlencode for this. This works fine with almost every character I've found, except for the "£" sign. Now, there is no reason for our users to ever enter a pound sign, but they might, so I want to take care of this.
The problem is that rawurlencode doesn't encode a pound sign entered on the webpage as %A3, but instead as %C2%A3. Even worse, if the user failed to enter another bit of critical information (which causes the webpage to refresh - the checks are done on the backend side - and try and refill the form boxes with the information the user had used), then when the %C2 is run through rawurldecode/encode, it becomes Ã? - aka, %C3?. And of course the "£" is also turned into another £!
So, what is causing this? I assume it's a character encoding issue, but I'm not that knowledgeable about these things. I heard somewhere that I can encode £s as &pound; manually, but why should I need to do that when the database can handle "£"s, and there is a percent-encoding for a pound sign? Is this a bug in rawurlencode, or a bug caused by differing character sets?
Thanks for any help.
The standard requires forms to be submitted in the character encoding you specify in <form accept-charset="..."> or UTF-8 if it's not specified or the text the user has entered cannot be represented in the charset you specify.
Clearly, you're receiving the pound sign encoded in UTF-8. If you want to convert it to ISO-8859-15, write:
iconv("UTF-8", "ISO-8859-15//TRANSLIT", $original)
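A sketch of the three byte sequences involved. The //TRANSLIT suffix is left out here on the assumption that it is not needed for this one character, since £ exists in ISO-8859-15. The last step is one plausible way the extra "Â" can appear: UTF-8 bytes treated as ISO-8859-1 and converted to UTF-8 a second time.

```php
<?php
$pound = "\xC2\xA3"; // "£" as UTF-8, which is what the form submitted

echo rawurlencode($pound), "\n";  // %C2%A3

// Converted to ISO-8859-15, the pound sign is the single byte 0xA3.
$latin9 = iconv('UTF-8', 'ISO-8859-15', $pound);
echo rawurlencode($latin9), "\n"; // %A3

// Misreading the UTF-8 bytes as ISO-8859-1 and converting again double-encodes them.
$double = iconv('ISO-8859-1', 'UTF-8', $pound);
echo rawurlencode($double), "\n"; // %C3%82%C2%A3
```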
This is probably encoding the A3 character in your native character set as C2 A3 in UTF-8, which is the valid UTF-8 encoding of the pound sign (U+00A3). Either consume your encoded URL as UTF-8, or convert the string to your single-byte encoding before passing it to urlencode.
Artefacto's answer covers the case where you need to convert character encodings, for example when you are displaying a page whose encoding is set to Latin-1. (Raw)urlencode escapes whatever bytes it is given, so a UTF-8 string yields multibyte escapes, and (raw)urldecode returns those same bytes, representing £ as two bytes. If you display this string while claiming that it is ISO-8859 encoded, it will appear as two characters.
A primer on PHP and UTF-8: http://www.phpwact.org/php/i18n/utf-8
Some "hot tips": http://www.sitepoint.com/blogs/2006/08/10/hot-php-utf-8-tips/
Likely, between getting the string from rawurldecode and using it, the encoding is assumed to be ISO-8859, so two bytes that represent one character get interpreted as two characters.
Use mb_convert_encoding to convert the string from UTF-8 to the encoding the rest of your code expects.
