Can php convert strings with all charset encodes to utf8 - php

Can PHP convert strings from any charset encoding to UTF-8?
Solutions that don't work:
utf8_encode($string) - but it only converts an ISO-8859-1 string to UTF-8.
iconv($incharset, $outcharset, $text) - but how can I find the string's current encoding?
(That only seems possible if the string is part of an HTML DOM document, not for a bare string.)
Thanks

It is possible to convert a string from any encoding supported by iconv() into UTF-8 in PHP.
but how can I find the string's current encoding?
You should never need to "find" the current encoding: Your script should always know what it is. Any resource you query, if properly encoded, will give you its encoding in the content-type header or through other means.
As Artefacto says, there is the possibility of using mb_detect_encoding() but this is not a reliable method. The data flow of the program should always have it defined what encoding a string is in (and preferably work with UTF-8 internally) - that's the way to go.
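To make the "your script should know the encoding" point concrete, here is a minimal sketch. Both helper names (charset_from_content_type, to_utf8) are made up for illustration, and the header value is an assumed example; real responses may declare the charset elsewhere (for instance in a meta tag) or not at all.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header value.
function charset_from_content_type(string $header): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._-]+)"?/i', $header, $m) === 1) {
        return $m[1];
    }
    return null; // the resource did not declare an encoding
}

// Hypothetical helper: convert to UTF-8 once, from a charset that is known.
function to_utf8(string $text, string $sourceCharset): string
{
    $converted = iconv($sourceCharset, 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException("Cannot convert from $sourceCharset");
    }
    return $converted;
}

$charset = charset_from_content_type('text/html; charset=ISO-8859-1');
// "é" is the single byte E9 in ISO-8859-1 and the two bytes C3 A9 in UTF-8.
echo bin2hex(to_utf8("caf\xE9", $charset)), "\n"; // 636166c3a9
```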

In general, you cannot know which encoding a given string is using.
All you can do is guess. There's mb_detect_encoding, which doesn't really work well and then there are more complex heuristics, such as those used by browsers, which employ language cues.
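A rough sketch of why this is only ever a guess, assuming the mbstring extension is available. The helper name guess_encoding and the candidate list are assumptions for the example. The same byte is a perfectly valid character in several single-byte encodings, so nothing in the data itself says which one was meant.

```php
<?php
// Hypothetical helper: return the first candidate in which $bytes is valid,
// or null if none fits. This is a guess, not a fact about the data.
function guess_encoding(string $bytes, array $candidates): ?string
{
    $guess = mb_detect_encoding($bytes, $candidates, true); // strict mode
    return $guess === false ? null : $guess;
}

$bytes = "caf\xE9"; // not valid UTF-8, so the guess falls through to ISO-8859-1
$guess = guess_encoding($bytes, ['UTF-8', 'ISO-8859-1']);

// The byte E9 alone is ambiguous: "é" in ISO-8859-1, "щ" in ISO-8859-5.
$asLatin1   = mb_convert_encoding("\xE9", 'UTF-8', 'ISO-8859-1');
$asCyrillic = mb_convert_encoding("\xE9", 'UTF-8', 'ISO-8859-5');
echo bin2hex($asLatin1), ' vs ', bin2hex($asCyrillic), "\n";
```

Both conversions succeed, which is exactly the problem: a detector can only rank candidates, it cannot know which one the sender intended.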

Related

Why is php converting certain characters to '?'

Everything in my code is running. My database (PostgreSQL) uses UTF-8 encoding, and I've checked that the encoding in php.ini is UTF-8. I tried debugging to see whether any of the functions I use were causing this, but everything runs as expected. However, after my frontend sends a POST request through curl to the backend server for some text to be inserted into the database, some characters like 'da' are converted to '?' in PostgreSQL and in memcached. I think PHP is converting them to Latin-1 again after the request reaches the other side for some reason, because I use utf8_encode before the request and utf8_decode on the other side.
This is the code that sends the request:
$pre_opp->Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST",str_replace(" ","%",utf8_encode($bio)));
This is how the backend system receives it:
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));
Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf8_encode and utf8_decode anyway. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data = utf8_decode(urldecode($_POST["Data"]));
You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what utf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request-sending code; you shouldn't need to decode it at the other end, because PHP will do that for you.
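The failure can be reproduced in a few lines. This is only a sketch with an assumed sample string ("vida dada"); it uses preg_match with the u modifier as a cheap UTF-8 validity check.

```php
<?php
$bio = "vida dada"; // assumed sample text, already UTF-8 (plain ASCII here)

// What the question did: spaces become "%", so "%da" now looks like a percent escape.
$mangled = str_replace(" ", "%", $bio);            // "vida%dada"

// Percent-decoding turns "%da" into the single byte 0xDA.
$decoded = urldecode($mangled);                    // "vida" . "\xDA" . "da"

// 0xDA starts a two-byte UTF-8 sequence but is followed by "d", so this is invalid.
$isValidUtf8 = preg_match('//u', $decoded) === 1;  // false

// Real percent-encoding round-trips without touching the character encoding.
$roundTrip = urldecode(urlencode($bio));

var_dump($isValidUtf8, $roundTrip === $bio);
```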

PHP - string encoding

I am receiving, as a $_GET parameter, a string whose hex representation is "6d617263f2".
As far as I understand character encoding, this is not a UTF-8 string. If I print it with UTF-8 encoding, what I get is "marc�". If I convert the string to UTF-8 with utf8_encode, I get the correct representation, which is marcò.
I set all my character encodings (default_charset, iconv and mbstring) in the php.ini file to work with UTF-8. I also have mbstring.encoding_translation set to On.
I'm not able to fully understand what is going on... why am I not getting my $_GET parameter correctly encoded as UTF-8?
My guesses are:
the client is using another character encoding, and if I want to use UTF-8 there is no other way than to explicitly convert my parameter to UTF-8
I am missing something somewhere...
Could you please help me shed some light on this?
If you don't control the origin of that GET parameter, then there's nothing you can do. PHP will give you the string as is and won't automatically convert its encoding. It can't, since it doesn't know what encoding to convert from. There's no spec or anything where anyone could get that information from. You need to specify what encoding you accept strings in. Don't leave it up to the client to decide, because then you have no idea what you're going to get.
If the client sends you ISO-8859 encoded text, but you want it to be UTF-8 encoded internally (a sensible choice BTW), you will simply have to convert its encoding. I'd use iconv('ISO-8859-1', 'UTF-8', $_GET['foo']) for that since it's more explicit what it does, but utf8_encode happens to do exactly the same thing.
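Using the exact bytes from the question, the conversion looks like this. The sketch assumes the client really did send ISO-8859-1; nothing in the bytes themselves proves that.

```php
<?php
// The bytes from the question: "marc" followed by the single byte F2.
$raw = hex2bin('6d617263f2');

// Explicit conversion from the encoding we assume the client used.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);

// F2 ("ò" in ISO-8859-1) becomes the two-byte UTF-8 sequence C3 B2.
echo bin2hex($utf8), "\n"; // 6d617263c3b2
```

As a side note, utf8_encode and utf8_decode are deprecated as of PHP 8.2, which is one more reason to prefer the explicit iconv or mb_convert_encoding call.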

Why is my PHP urlencode not functioning as examples on internet?

Why does my urlencode() produce something different than I expected?
This might be my expectations being wrong but then I would be even more puzzled.
Example:
urlencode("ä");
Expectation: returns %C3%A4
Reality: returns %E4
Where have I gone wrong in my expectations? It seems to be linked to encoding, but I'm not very familiar with what I should do or use.
Should I change something on my server so that the function uses the right encoding?
urlencode encodes the raw bytes in your string into a percent-encoded representation. If you expect %C3%A4 that means you expect the UTF-8 byte representation of "ä". If you get %E4 that means your string is actually encoded in ISO-8859-1 instead.
Encode your string in UTF-8 to get the expected result. How to do this depends on where this string comes from. If it's a string literal in your source code file, save the file as UTF-8 in your text editor. If it comes from a database, see UTF-8 all the way through.
For more background information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
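A small sketch that makes the byte-level behaviour visible. The bytes are written as escape sequences so the result does not depend on how this snippet itself is saved.

```php
<?php
$latin1 = "\xE4";      // "ä" as one ISO-8859-1 byte
$utf8   = "\xC3\xA4";  // "ä" as two UTF-8 bytes

// urlencode has no notion of characters; it escapes whatever bytes it is given.
echo urlencode($latin1), "\n"; // %E4
echo urlencode($utf8), "\n";   // %C3%A4
```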

How to encode foreign characters in url in PHP

How do you correctly encode an URL with foreign characters in PHP?
I assumed urlencode() would do the trick but it does not.
The correct encoding for the following URL
http://eu.battle.net/wow/en/character/anachronos/Paddestøel/advanced
Is this:
http://eu.battle.net/wow/en/character/anachronos/Paddest%C3%B8el/advanced
But urlencode encodes it like this:
http://eu.battle.net/wow/en/character/anachronos/Paddest%F8el/advanced
What function do I use to encode it like on the second example?
Your PHP scripts seem to use some single-byte encoding. You can either:
Save the source code as UTF-8
Convert data to UTF-8 with iconv() or mb_convert_encoding()
In general, making the full switch to UTF-8 fixes all encoding issues at once but initial migration might require some extra work.
There is no "correct" encoding. URL-percent-encoding simply represents raw bytes. It's up to you what those bytes are or how you're going to interpret them later. If your string is UTF-8 encoded, the percent-encoded raw byte representation is %C3%B8. If your string is not UTF-8 encoded, it's something else. If you want %C3%B8, make sure your string is UTF-8 encoded.
Use UTF-8 encoding
function url_encode($string) {
    return urlencode(utf8_encode($string));
}
Then use this function to encode your url (got it in a comment here: http://php.net/manual/en/function.urlencode.php)
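One caveat with a wrapper like that: if the string is already UTF-8, converting it again produces garbage. A possible variant is sketched below. The function name encode_path_segment and the default source charset are assumptions, and the "already valid UTF-8?" check is only a heuristic, since a few ISO-8859-1 byte sequences also happen to be valid UTF-8. It uses rawurlencode because the value is a path segment, and iconv in place of utf8_encode.

```php
<?php
// Hypothetical helper: percent-encode one URL path segment as UTF-8.
function encode_path_segment(string $segment, string $sourceCharset = 'ISO-8859-1'): string
{
    // Heuristic: only convert when the bytes are not already valid UTF-8.
    if (preg_match('//u', $segment) !== 1) {
        $converted = iconv($sourceCharset, 'UTF-8', $segment);
        if ($converted === false) {
            throw new InvalidArgumentException("Cannot convert from $sourceCharset");
        }
        $segment = $converted;
    }
    return rawurlencode($segment);
}

// "ø" as the ISO-8859-1 byte F8, and as the UTF-8 bytes C3 B8.
echo encode_path_segment("Paddest\xF8el"), "\n";     // Paddest%C3%B8el
echo encode_path_segment("Paddest\xC3\xB8el"), "\n"; // Paddest%C3%B8el
```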

utf8_encode function purpose

Suppose that I'm encoding my files as UTF-8.
Within a PHP script, a string will be compared:
$string="ぁ";
$string = utf8_encode($string); // Do I need this step?
if(preg_match('/ぁ/u',$string))
// Do something if it matches...
Is that string really UTF-8 without the utf8_encode() function?
If you encode your files as UTF-8, do you not need this function?
If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible to ASCII. That means, all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.
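You can check that bit sequence yourself. The sketch below writes "あ" as the escape "\u{3042}" (PHP 7 and later) so it behaves the same regardless of how the snippet is saved; a literal "あ" in a file saved as UTF-8 yields the same bytes.

```php
<?php
$string = "\u{3042}"; // "あ", which PHP stores as its UTF-8 bytes

// Render each byte as eight binary digits.
$bits = implode(' ', array_map(
    fn (int $byte): string => sprintf('%08b', $byte),
    unpack('C*', $string)
));

echo strlen($string), "\n";  // 3 (bytes, not characters)
echo bin2hex($string), "\n"; // e38182
echo $bits, "\n";            // 11100011 10000001 10000010
```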
PHP does not care about string encoding generally, strings are binary data within PHP. So you must know the encoding of data inside the string if you need encoding. The question is: does encoding matter in your case?
If you set a string variable's content to something like you did:
$string="ぁ";
It will not contain UTF-8. Instead it contains a binary sequence that is not a valid UTF-8 character. That's why the browser or editor displays a question mark or similar. So before you go on, you already see that something might not be as intended. (Turned out it was a missing font on my end)
This also shows that your file in the editor is supporting UTF-8 or some other flavor of unicode encoding. Just keep the following in mind: One file - one encoding. If you store the string inside the file, it's in the encoding of that file. Check your editor in which encoding you save the file. Then you know the encoding of the string.
Let's just assume it is some valid UTF-8 like so (support for my font):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions encoding does play a role. That's why there is the u modifier to signal you want to make the expression work on and with unicode encoded data. However, if the data is already unicode encoded, you don't need to change it into unicode before you use preg_match. However with your code example, regular expressions are not necessary at all and a simple string comparison does the job.
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff
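For the preg_match case from the question, a short sketch of what the extra conversion step actually does. The character is written as "\u{3041}" and the pattern as \x{3041} so the example does not depend on the file's own encoding, and iconv from ISO-8859-1 stands in for utf8_encode, which performs the same conversion.

```php
<?php
$string = "\u{3041}"; // "ぁ" as UTF-8 bytes, as in a source file saved as UTF-8

// Already UTF-8: the u modifier works on it directly, no conversion needed.
$matchesAsIs = preg_match('/\x{3041}/u', $string);

// Treating those bytes as ISO-8859-1 and converting again garbles them.
$garbled = iconv('ISO-8859-1', 'UTF-8', $string);
$matchesAfterReencoding = preg_match('/\x{3041}/u', $garbled);

var_dump($matchesAsIs, $matchesAfterReencoding, bin2hex($garbled));
```

The garbled result is still valid UTF-8, just for three different characters, which is why the second match quietly fails instead of raising an error.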
Your string is not a utf-8 character so it can't preg match it, hence why you need to utf8_encode it. Try encoding the PHP file as utf-8 (use something like Notepad++) and it may work without it.
Summary:
The utf8_encode() function will encode every byte of a given string to UTF-8,
no matter what encoding was previously used to store the file.
Its purpose is to encode strings¹ that aren't UTF-8 yet.
1.- The correct use of this function is to pass it an ISO-8859-1 string as the parameter.
Why? Because Unicode and ISO-8859-1 have the same characters at the same positions.
Encoding        Char   Byte   utf8_encode output   Is that the UTF-8 encoding of the char?
Windows-1252    €      80     C2 80                No (€ is E2 82 AC in UTF-8)
ISO-8859-1      ¢      A2     C2 A2                Yes
The function may appear to work with other encodings: it works if the string to encode contains only characters that have the same
byte values as in ISO-8859-1 (e.g. in Windows-1252, positions 00-7F and A0-FF).
We should take into account that if the function receives a UTF-8 string (for example from a file saved as UTF-8), it will encode that UTF-8 string again and produce garbage.
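The two rows of that table can be checked with a short sketch, assuming the mbstring extension is available. Converting from ISO-8859-1 mirrors what utf8_encode does; converting from Windows-1252 is what the euro sign actually needs.

```php
<?php
$euroByte = "\x80"; // "€" in Windows-1252, a C1 control code in ISO-8859-1
$centByte = "\xA2"; // "¢" in both encodings

// Treating the euro byte as ISO-8859-1 yields U+0080, not the euro sign.
$wrong = mb_convert_encoding($euroByte, 'UTF-8', 'ISO-8859-1');
// Naming the real source encoding yields the euro sign.
$right = mb_convert_encoding($euroByte, 'UTF-8', 'Windows-1252');
// The cent sign sits at the same position in both, so either label works.
$cent  = mb_convert_encoding($centByte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($wrong), "\n"; // c280
echo bin2hex($right), "\n"; // e282ac
echo bin2hex($cent), "\n";  // c2a2
```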
