My PHP file is saved in UTF-8 and I am trying to encode my data so it can be sent safely to an application, but some characters get encoded incorrectly.
$text = "Š";
$text = urlencode(utf8_decode($text));
echo $text;
This echoes %3F, but according to the URL-encoding reference at w3schools (http://www.w3schools.com/tags/ref_urlencode.asp), "Š" should be converted into %8A. PHP's own documentation does not state which reference it uses. Could this be an encoding/decoding issue or something else?
utf8_decode tries to convert from UTF-8 to ISO-8859-1 but Š does not exist in ISO-8859-1. So you obtain '?' (= %3F), the substitution character.
It exists in CP1252 (maybe others), under the hexadecimal code 8A. So:
$text = urlencode(iconv('UTF-8', 'CP1252', $text));
This should give what you expect. That said, you normally shouldn't convert a UTF-8 string to a legacy encoding at all unless the receiving application requires it.
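As a minimal sketch of the difference (assuming the iconv extension is available and PHP 7+ for the "\u{...}" escape, which avoids depending on how the source file is saved), the following compares percent-encoding after a CP1252 conversion with percent-encoding the UTF-8 bytes directly. The variable names are only illustrative.

```php
<?php
// "\u{0160}" is "Š"; the escape yields the UTF-8 bytes C5 A0
// regardless of the encoding of this source file.
$text = "\u{0160}";

// Convert to Windows-1252 first, then percent-encode the single byte 0x8A.
$cp1252Encoded = urlencode(iconv('UTF-8', 'CP1252', $text));

// Percent-encode the two UTF-8 bytes directly, with no charset conversion.
$utf8Encoded = urlencode($text);

echo $cp1252Encoded, PHP_EOL; // %8A
echo $utf8Encoded, PHP_EOL;   // %C5%A0
```

Which of the two forms is right depends on what encoding the receiving application expects.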
Here is a PHP code snippet I came up with when I found a bug in my project.
print(($str == utf8_encode($str) ? "the same text" : "not the same text") . PHP_EOL);
print(mb_detect_encoding($str));
Now what this does, is tell me if a string $str has the same encoding as its UTF-8 encoded version, after that it prints its initial encoding.
What I expected is that either the UTF-8 text is the same as the original, or that the original text is already UTF-8 and therefore the UTF-8 encoded text is the same as the original.
But what really happened is the following output:
not the same text
UTF-8
This is only the case if I set $str = array_keys($_POST)[0]; and use a key with special characters in my request body, like äöü=test, so that $str will be äöü (defining it directly in the code does not produce the same output).
I interpret from the output that the original character encoding is UTF-8, but the two strings are not the same. If I print the initial string it is empty and the encoded string would be äöü.
I don't understand how a string can be different when encoded with its own encoding. Can someone please explain this to me?
The problem is your assumption that "the original text is already UTF-8 and therefore the UTF-8 encoded text is the same as the original".
From the PHP Official Documentation regarding utf8_encode (https://www.php.net/manual/en/function.utf8-encode.php):
This function converts the string data from the ISO-8859-1 encoding to UTF-8.
In other words, this function is an ISO-8859-1 to UTF-8 converter. Used properly, as the quote above says, it expects only an ISO-8859-1 string. If you pass a string in any other encoding, including UTF-8 itself, you should expect garbage.
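A small sketch of that effect, assuming the mbstring extension is loaded. Because utf8_encode is deprecated as of PHP 8.2, the sketch uses mb_convert_encoding with an explicit ISO-8859-1 source, which performs the same conversion; the variable names are illustrative only.

```php
<?php
// "äöü" as UTF-8: three characters, six bytes.
$str = "\u{00E4}\u{00F6}\u{00FC}";

// Same conversion utf8_encode() performs: treat every byte as ISO-8859-1.
$double = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');

var_dump($str === $double);                // bool(false)
var_dump(strlen($str), strlen($double));   // int(6) int(12)
var_dump(mb_check_encoding($str, 'UTF-8')); // bool(true): already valid UTF-8
```

Each of the six input bytes is 0x80 or above, so each one is turned into a two-byte sequence, which is why the "encoded" string differs from the original.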
This thread (PHP: Convert any string to UTF-8 without knowing the original character set, or at least try) discusses converting from any character encoding to UTF-8.
Hope it helps.
Please can you help me decode this URL so that it displays properly using PHP to output
This is the link
http://www.megalithic.co.uk/visits.php?op=site&sid=18341&title=Ōyu
I think it's actually coming through as UTF-8 - ie
&title=%C5%8Cyu
$title displays as ÅŒyu
How do I convert this in PHP? I need to use ISO-8859-1 on the page
None of these work
$title=iconv("UTF-8","ISO-8859-1",$title);
$title=iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $title);
$title = utf8_decode($title);
$title = urldecode($title);
Do I need to use the Multibyte MB extension and if so how?
Many thanks in advance
Andy
If that link is to your PHP page, and you get the value via $_GET['title'], then it's already decoded from the URL encoding and $_GET['title'] holds a UTF-8 encoded string with the character Ō. This character cannot be encoded in ISO-8859-1. If that is a strict requirement, you'll have to encode the character as HTML entity in order to express it in a strictly ISO-8859-1 encoded page:
echo htmlentities('Ō', ENT_COMPAT | ENT_HTML5, 'UTF-8');
The character "Ō" is not there in ISO-8859-1, so it is not possible to convert it from UTF-8 with any of the standard charset conversion functions.
It might, however, be possible to write a function that converts to numeric HTML character references, like &#332; for "Ō".
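One possible sketch of such a function, assuming the mbstring extension is available. The function name to_latin1_with_entities is hypothetical. It replaces every character above U+00FF with a numeric reference such as &#332; and then converts what remains to ISO-8859-1.

```php
<?php
// Hypothetical helper: UTF-8 in, ISO-8859-1 bytes plus numeric references out.
function to_latin1_with_entities(string $utf8): string
{
    // Conversion map: code points 0x100..0x10FFFF, offset 0, mask 0x1FFFFF.
    $map = [0x100, 0x10FFFF, 0, 0x1FFFFF];
    $withEntities = mb_encode_numericentity($utf8, $map, 'UTF-8');

    // Everything left is at or below U+00FF and therefore exists in ISO-8859-1.
    return mb_convert_encoding($withEntities, 'ISO-8859-1', 'UTF-8');
}

$title = "\u{014C}yu caf\u{00E9}";        // "Ōyu café" in UTF-8
$out = to_latin1_with_entities($title);

var_dump($out === "&#332;yu caf\xE9");    // bool(true)
```

In a real page the value would normally also be passed through htmlspecialchars before the entity step, so that user-supplied markup is escaped.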
In PHP, I want to convert a string which contains non-ASCII characters into a sequence of hexadecimal numbers which represents the UTF-8 encoding of these characters. For instance, given this:
$text = 'ąćę';
I need to produce this:
C4=85=C4=87=C4=99
How do I do that?
As your question is written, and assuming that your text is properly UTF-8 encoded to start with, this should work:
$text = 'ąćę';
$result = implode('=', str_split(strtoupper(bin2hex($text)), 2));
If your text is not UTF-8, but some other encoding, then you can use
$utf8 = mb_convert_encoding($text, 'UTF-8', $yourEncoding);
to get it into UTF-8, where $yourEncoding is some other character encoding like 'ISO-8859-1'.
This works because in PHP, strings are just arrays of bytes. So as long as your text is encoded properly to start with, you don't have to do anything special to treat it as bytes. In fact, this code will work for any character encoding you want without modification.
Now, if you want to do quoted-printable, then that's another story. You could try using the function quoted_printable_encode (requires PHP 5.3 or higher).
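For comparison, here is a short sketch that wraps the one-liner in a function (the name utf8_hex_bytes is only illustrative) and puts it next to quoted_printable_encode. It assumes PHP 7+ for the "\u{...}" escapes. Note that ą (U+0105) is the byte pair C4 85 in UTF-8.

```php
<?php
// Illustrative helper: uppercase hex of each byte, joined with "=".
function utf8_hex_bytes(string $text): string
{
    return implode('=', str_split(strtoupper(bin2hex($text)), 2));
}

$text = "\u{0105}\u{0107}\u{0119}";   // "ąćę" as UTF-8 bytes

$hex = utf8_hex_bytes($text);
$qp  = quoted_printable_encode($text);

echo $hex, PHP_EOL; // C4=85=C4=87=C4=99
echo $qp, PHP_EOL;  // =C4=85=C4=87=C4=99
```

The quoted-printable form differs in that every encoded byte carries its own leading "=", and plain ASCII characters would be left unencoded.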
Suppose that I'm saving my files as UTF-8.
Within PHP script, a string will be compared:
$string="ぁ";
$string = utf8_encode($string); // Do I need this step?
if(preg_match('/ぁ/u',$string))
//Do if match...
Is that string really UTF-8 without the utf8_encode() function?
If you save your files as UTF-8, don't you need this function?
If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible to ASCII. That means, all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.
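The byte view described above can be checked with a few lines. This is only a sketch: it uses the "\u{3042}" escape (PHP 7+) instead of a literal "あ" so that it does not depend on how the file is saved, and it assumes mbstring for the character count.

```php
<?php
$string = "\u{3042}"; // "あ" as UTF-8

// Hex and binary views of the raw bytes PHP stores.
$hex  = bin2hex($string);
$bits = implode(' ', array_map(function ($byte) {
    return sprintf('%08b', ord($byte));
}, str_split($string)));

echo $hex, PHP_EOL;  // e38182
echo $bits, PHP_EOL; // 11100011 10000001 10000010

// strlen counts bytes; mb_strlen counts characters in the given encoding.
var_dump(strlen($string));             // int(3)
var_dump(mb_strlen($string, 'UTF-8')); // int(1)
```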
PHP does not care about string encoding generally, strings are binary data within PHP. So you must know the encoding of data inside the string if you need encoding. The question is: does encoding matter in your case?
If you set a string variable's content to something like you did:
$string="ぁ";
It will not contain UTF-8. Instead it contains a binary sequence that is not a valid UTF-8 character. That's why the browser or editor displays a question mark or similar. So before you go on, you already see that something might not be as intended. (It turned out it was a missing font on my end.)
This also shows that your file in the editor is supporting UTF-8 or some other flavor of unicode encoding. Just keep the following in mind: One file - one encoding. If you store the string inside the file, it's in the encoding of that file. Check your editor in which encoding you save the file. Then you know the encoding of the string.
Let's just assume it is some valid UTF-8 like so (support for my font):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions, encoding does play a role. That's why there is the u modifier, which signals that the pattern and the subject are UTF-8 encoded. If the data is already UTF-8, you don't need to convert it before you use preg_match. With your code example, though, regular expressions are not necessary at all and a simple string comparison does the job.
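A brief sketch of what the u modifier changes, using "\u{00E4}" (PHP 7+) for "ä" so the result does not depend on the file's encoding. The pattern /^.$/ is just an illustration: it asks whether the subject is exactly one "character".

```php
<?php
$string = "\u{00E4}"; // "ä" as UTF-8, two bytes

// With /u, "." matches one UTF-8 character.
$withU = preg_match('/^.$/u', $string);

// Without /u, "." matches one byte, so a two-byte string does not match.
$withoutU = preg_match('/^.$/', $string);

// A plain binary comparison needs no regular expression at all.
$same = ("\u{00E4}" === $string);

var_dump($withU, $withoutU, $same); // int(1) int(0) bool(true)
```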
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff
If the file is not saved as UTF-8, the string literal is not valid UTF-8 and the /u pattern cannot match it, which is why utf8_encode appears to be needed. Try saving the PHP file as UTF-8 (with something like Notepad++) and it should work without it.
Summary:
The utf8_encode() function converts every byte of the given string to UTF-8 as if that byte were an ISO-8859-1 character,
no matter what encoding the string or the file actually uses.
Its purpose is to encode strings¹ that aren't UTF-8 yet.
1.- The correct use of this function is to pass an ISO-8859-1 string as the parameter.
Why? Because the first 256 Unicode code points are the ISO-8859-1 characters, at the same positions.
[Char][Value/Position] [Encoded Value/Position]
[Windows-1252] [€][80] ----> [C2|80] Is this the UTF-8 encoded value/position of the [€]? No
[ISO-8859-1] [¢][A2] ----> [C2|A2] Is this the UTF-8 encoded value/position of the [¢]? Yes
The function appears to work with other encodings too: it works if the string contains only characters that have the same
values as in ISO-8859-1 (e.g. in Windows-1252, positions 00-7F and A0-FF).
Keep in mind that if the function receives a string that is already UTF-8 (for example a literal in a file saved as UTF-8), it will encode that UTF-8 string again and produce garbage.
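The € row of the table can be reproduced with a short sketch, assuming the mbstring extension. Converting from ISO-8859-1 is used here as a stand-in for utf8_encode (which is deprecated as of PHP 8.2); converting from Windows-1252 is what the data actually calls for.

```php
<?php
$euroCp1252 = "\x80"; // "€" in Windows-1252

// What utf8_encode() does: treat the byte as ISO-8859-1 (U+0080, a control character).
$asLatin1 = mb_convert_encoding($euroCp1252, 'UTF-8', 'ISO-8859-1');

// Naming the real source encoding gives the UTF-8 bytes of "€" (U+20AC).
$asCp1252 = mb_convert_encoding($euroCp1252, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), PHP_EOL; // c280
echo bin2hex($asCp1252), PHP_EOL; // e282ac
```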
I've got a string that is in my database like 中华武魂 when I post my request to retrieve the data via my website I'm getting the data to the server in the format %E4%B8%AD%E5%8D%8E%E6%AD%A6%E9%AD%82
What decoding steps do I have to take in order to get it back to a usable form?
While also cleaning the user input to ensure they're not going to try an SQL injection attack?
(escape string before or after encoding?)
EDIT:
rawurldecode(); // returns "ä¸åŽæ¦é‚"
urldecode(); // returns "ä¸åŽæ¦é‚"
public function utf8_urldecode($str) {
$str = preg_replace("/%u([0-9a-f]{3,4})/i","&#x\\1;",urldecode($str));
return html_entity_decode($str,null,'UTF-8');
}
// returns "ä¸åŽæ¦é‚"
... which actually works when I try and use it in an SQL statement.
I think the problem was that I was doing an echo and die(); without sending a UTF-8 charset header, so the output was being displayed to me as Latin-1.
Thanks for the help!
When your data is actually that percent-encoded form, you just have to call rawurldecode:
$data = '%E4%B8%AD%E5%8D%8E%E6%AD%A6%E9%AD%82';
$str = rawurldecode($data);
This suffices as the data already is encoded in UTF-8: 中 (U+4E2D) is encoded with the byte sequence 0xE4B8AD in UTF-8 and that is encoded with %E4%B8%AD when using the percent-encoding.
That your output does not look as expected is probably because the output is interpreted with the wrong character encoding, probably Windows-1252 instead of UTF-8. In Windows-1252, 0xE4 represents ä, 0xB8 represents ¸, 0xAD is the (normally invisible) soft hyphen, 0xE5 represents å, and so on. So make sure to specify the output character encoding properly.
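A sketch of the whole round trip, assuming the mbstring extension for the validity check. The header call is guarded with headers_sent() so the snippet also runs from the command line; in a web page it has to come before any output.

```php
<?php
$data = '%E4%B8%AD%E5%8D%8E%E6%AD%A6%E9%AD%82';
$str  = rawurldecode($data);

// The decoded bytes are already valid UTF-8 for "中华武魂".
$isUtf8  = mb_check_encoding($str, 'UTF-8');
$matches = ($str === "\u{4E2D}\u{534E}\u{6B66}\u{9B42}");

// Tell the client how to interpret the bytes before printing them.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo $str, PHP_EOL;
```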
Use PHP's urldecode:
http://php.net/manual/en/function.urldecode.php
You have choices here: urldecode or rawurldecode.
If you encoded your string using urlencode, you must use urldecode because of the way spaces are handled: urlencode converts spaces to +, whereas rawurlencode converts them to %20, and rawurldecode leaves a literal + untouched.
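The difference in how the two pairs of functions treat spaces and plus signs can be seen in a few lines (the sample string is arbitrary):

```php
<?php
$raw = 'a b+c';

$formStyle = urlencode($raw);     // space becomes "+", "+" becomes %2B
$rfcStyle  = rawurlencode($raw);  // space becomes %20, "+" becomes %2B

echo $formStyle, PHP_EOL; // a+b%2Bc
echo $rfcStyle, PHP_EOL;  // a%20b%2Bc

// Decoding with the wrong function leaves the "+" in place.
var_dump(urldecode($formStyle));    // string(5) "a b+c"
var_dump(rawurldecode($formStyle)); // string(5) "a+b+c"
```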