In PHP, what function can I use to convert the text 'pétition' to 'p%E9tition'?
I have tried utf8_encode and utf8_decode with no success.
%E9 is a URL-encoded (percent-encoded) escape sequence. You can produce it with urlencode($string).
If you want HTML escaping, you can either use htmlentities($string) (more encoding) or htmlspecialchars($string) (less encoding).
http://php.net/manual/en/function.urlencode.php
http://php.net/manual/en/function.htmlentities.php
http://php.net/manual/en/function.htmlspecialchars.php
When dealing with UTF-8 strings, you will need to convert the string to ISO-8859-1 (e.g. with utf8_decode) before encoding it with urlencode to get this result in the query part of a URL.
print_r( urlencode(utf8_decode('pétition')) );
// p%E9tition
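Since utf8_decode() is deprecated as of PHP 8.2, the same conversion can be sketched with mb_convert_encoding(). This is only an illustration, assuming the mbstring extension is loaded and that the input really is UTF-8; the variable names are made up for the example, and the string is written with byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// Assumption: the input is UTF-8. "p\xC3\xA9tition" is "pétition" as UTF-8 bytes.
$utf8 = "p\xC3\xA9tition";

// Convert UTF-8 to ISO-8859-1, so that "é" becomes the single byte 0xE9.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo urlencode($latin1), "\n"; // p%E9tition
echo urlencode($utf8), "\n";   // p%C3%A9tition
```

Which of the two forms is wanted depends on the encoding the receiving side expects.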
You can try to have a look at htmlentities.
My PHP application outputs JSON where special characters are encoded; for example, the string "Brøndum" is represented as "Br\u00f8ndum".
Can you tell me which encoding this is, and how I get back from "Br\u00f8ndum" to "Brøndum"?
I have tried utf8_encode/decode but they don't work as expected.
Thanks!
That's standard JSON Unicode escaping.
You get back to the actual character by using a JSON parser, which in PHP is json_decode.
You can tell PHP not to escape Unicode characters in the first place with the JSON_UNESCAPED_UNICODE flag.
json_encode("Brøndum", JSON_UNESCAPED_UNICODE)
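A small round trip may make this clearer. It is a sketch only: the variable names are illustrative, and the decoded value is checked byte by byte with bin2hex() so the result does not depend on how a terminal renders it.

```php
<?php
// Single quotes: PHP leaves the \u00f8 sequence untouched, so this is valid JSON text.
$json = '"Br\u00f8ndum"';

// json_decode() turns the escape back into the real character (as UTF-8 bytes).
$decoded = json_decode($json);
echo bin2hex($decoded), "\n"; // 4272c3b86e64756d ("ø" is the bytes c3 b8)

// Default encoding escapes non-ASCII characters again.
echo json_encode($decoded), "\n"; // "Br\u00f8ndum"

// With the flag, the UTF-8 character is written out as-is.
echo json_encode($decoded, JSON_UNESCAPED_UNICODE), "\n";
```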
mb_detect_encoding is your function. You just pass it the string and it tries to detect the encoding. You can also pass it a list of candidate encodings, since a plain string like "hello" is valid in several encodings.
echo mb_detect_encoding("Br\u00f8ndum");
How do you correctly encode a URL with foreign characters in PHP?
I assumed urlencode() would do the trick but it does not.
The correct encoding for the following URL
http://eu.battle.net/wow/en/character/anachronos/Paddestøel/advanced
Is this:
http://eu.battle.net/wow/en/character/anachronos/Paddest%C3%B8el/advanced
But urlencode encodes it like this:
http://eu.battle.net/wow/en/character/anachronos/Paddest%F8el/advanced
What function do I use to encode it like on the second example?
Your PHP scripts seem to use some single-byte encoding. You can either:
Save the source code as UTF-8
Convert data to UTF-8 with iconv() or mb_convert_encoding()
In general, making the full switch to UTF-8 fixes all encoding issues at once but initial migration might require some extra work.
There is no "correct" encoding. URL-percent-encoding simply represents raw bytes. It's up to you what those bytes are or how you're going to interpret them later. If your string is UTF-8 encoded, the percent-encoded raw byte representation is %C3%B8. If your string is not UTF-8 encoded, it's something else. If you want %C3%B8, make sure your string is UTF-8 encoded.
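To illustrate that percent-encoding only reflects the underlying bytes, here is a hedged sketch using the character name from the question. The variable names are invented for the example, the strings use byte escapes so the script's own encoding is irrelevant, and mb_convert_encoding() assumes the mbstring extension is available.

```php
<?php
$utf8Name   = "Paddest\xC3\xB8el"; // "Paddestøel" as UTF-8 bytes
$latin1Name = "Paddest\xF8el";     // the same name as ISO-8859-1 bytes

echo rawurlencode($utf8Name), "\n";   // Paddest%C3%B8el
echo rawurlencode($latin1Name), "\n"; // Paddest%F8el

// If the data arrives as ISO-8859-1, convert it to UTF-8 before encoding.
$converted = mb_convert_encoding($latin1Name, 'UTF-8', 'ISO-8859-1');

// Encode only the path segment, not the whole URL, so the slashes survive.
$url = 'http://eu.battle.net/wow/en/character/anachronos/'
     . rawurlencode($converted) . '/advanced';
echo $url, "\n";
```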
Use UTF-8 encoding
// Only valid when $string is ISO-8859-1; utf8_encode() is deprecated as of PHP 8.2.
function url_encode($string){
    return urlencode(utf8_encode($string));
}
Then use this function to encode your URL (I got it from a comment here: http://php.net/manual/en/function.urlencode.php)
How can I convert spaces in string into %20?
Here is my attempt:
$str = "What happens here?";
echo urlencode($str);
The output is "What+happens+here%3F", so the spaces are not represented as %20.
What am I doing wrong?
Use the rawurlencode function instead.
The plus sign is the historic encoding for a space character in URL parameters, as documented in the help for the urlencode() function.
That same page contains the answer you need - use rawurlencode() instead to get RFC 3986 compatible encoding.
I believe that, if you need to use the %20 variant, you could perhaps use rawurlencode().
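For comparison, a minimal side-by-side of the two functions on the string from the question:

```php
<?php
$str = "What happens here?";

// urlencode() uses the form-style encoding, where a space becomes "+".
echo urlencode($str), "\n";    // What+happens+here%3F

// rawurlencode() follows RFC 3986, where a space becomes "%20".
echo rawurlencode($str), "\n"; // What%20happens%20here%3F
```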
I've got a string in my database like 中华武魂. When I post my request to retrieve the data via my website, the data arrives at the server in the format %E4%B8%AD%E5%8D%8E%E6%AD%A6%E9%AD%82
What decoding steps do I have to take in order to get it back to the usable form?
While also cleaning the user input to ensure they're not going to try an SQL injection attack?
(escape string before or after encoding?)
EDIT:
rawurldecode(); // returns "ä¸åŽæ¦é‚"
urldecode(); // returns "ä¸åŽæ¦é‚"
public function utf8_urldecode($str) {
$str = preg_replace("/%u([0-9a-f]{3,4})/i","&#x\\1;",urldecode($str));
return html_entity_decode($str,null,'UTF-8');
}
// returns "ä¸åŽæ¦é‚"
... which actually works when I try and use it in an SQL statement.
I think the problem was that I was doing an echo and die(); without sending a UTF-8 header, so I guess the output was being displayed to me as Latin-1.
Thanks for the help!
When your data is actually that percent-encoded form, you just have to call rawurldecode:
$data = '%E4%B8%AD%E5%8D%8E%E6%AD%A6%E9%AD%82';
$str = rawurldecode($data);
This suffices as the data already is encoded in UTF-8: 中 (U+4E2D) is encoded with the byte sequence 0xE4B8AD in UTF-8 and that is encoded with %E4%B8%AD when using the percent-encoding.
That your output does not look as expected is probably because it is being interpreted with the wrong character encoding, most likely Windows-1252 instead of UTF-8. In Windows-1252, 0xE4 represents ä, 0xB8 represents ¸, 0xAD is a soft hyphen (usually invisible), 0xE5 represents å, and so on. So make sure to specify the output character encoding properly.
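One way to confirm that the decoded value is fine, and that only the display is off, is to inspect the bytes instead of the rendered text. The following is a sketch, assuming the mbstring extension is available for mb_check_encoding(); the Content-Type line is one possible way to declare the charset and is skipped on the command line.

```php
<?php
// Declare the output charset before any output is sent (web SAPIs only).
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/plain; charset=UTF-8');
}

$data = '%E4%B8%AD%E5%8D%8E%E6%AD%A6%E9%AD%82';
$str  = rawurldecode($data);

// Check the bytes rather than trusting how a terminal or browser renders them.
var_dump(mb_check_encoding($str, 'UTF-8')); // bool(true)
echo bin2hex($str), "\n";                   // e4b8ade58d8ee6ada6e9ad82
echo mb_strlen($str, 'UTF-8'), "\n";        // 4
```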
Use PHP's urldecode:
http://php.net/manual/en/function.urldecode.php
You have choices here: urldecode or rawurldecode.
If you encoded your string using urlencode, you must use urldecode because of the way spaces are handled: urlencode converts spaces to +, whereas rawurlencode converts them to %20, and only urldecode turns + back into a space.
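The difference between the two decoders only shows up when the input contains a plus sign, as this short sketch with a made-up input string shows:

```php
<?php
$encoded = 'a+b%20c';

// urldecode() treats "+" as an encoded space.
echo urldecode($encoded), "\n";    // a b c

// rawurldecode() leaves "+" alone and only decodes %XX sequences.
echo rawurldecode($encoded), "\n"; // a+b c
```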
I'm writing some RSS feeds in PHP and struggling with character-encoding issues. Should I utf8_encode() before or after htmlentities() encoding? For example, I've got both ampersands and Chinese characters in a description element, and I'm not sure which of these is proper:
$output = utf8_encode(htmlentities($source)); or
$output = htmlentities(utf8_encode($source));
And why?
It's important to pass the character set to the htmlentities function, as the default was ISO-8859-1 before PHP 5.4:
utf8_encode(htmlentities($source,ENT_COMPAT,'utf-8'));
You should apply htmlentities first so that utf8_encode can encode the entities properly.
(EDIT: I changed from my opinion before that the order didn't matter based on the comments. This code is tested and works well).
First: The utf8_encode function converts from ISO 8859-1 to UTF-8. So you only need this function, if your input encoding/charset is ISO 8859-1. But why don’t you use UTF-8 in the first place?
Second: You don’t need htmlentities. You just need htmlspecialchars to replace the special characters with character references. htmlentities would replace too many characters that can be encoded directly using UTF-8. It is important to use the ENT_QUOTES quote style so that single quotes are replaced as well.
So my proposal:
// if your input encoding is ISO 8859-1
htmlspecialchars(utf8_encode($string), ENT_QUOTES)
// if your input encoding is UTF-8
htmlspecialchars($string, ENT_QUOTES, 'UTF-8')
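As a rough illustration of the UTF-8 case, the sketch below escapes a made-up description containing an ampersand, markup characters, and two Chinese characters. The sample text and variable names are assumptions for the example; the Chinese characters are written as byte escapes so the script's own encoding does not matter.

```php
<?php
// "Tom & Jerry 中华 <b>" with the Chinese characters as UTF-8 bytes.
$source = "Tom & Jerry \xE4\xB8\xAD\xE5\x8D\x8E <b>";

// Only &, <, > and quotes are replaced; the UTF-8 characters pass through unchanged.
$escaped = htmlspecialchars($source, ENT_QUOTES, 'UTF-8');

echo '<description>' . $escaped . '</description>', "\n";
// <description>Tom &amp; Jerry 中华 &lt;b&gt;</description>
```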
Don't use htmlentities()!
Simply use UTF-8 characters. Just make sure you declare encoding of the feed in HTTP headers (Content-Type:application/xml;charset=UTF-8) or failing that, in the feed itself using <?xml version="1.0" encoding="UTF-8"?> on the first line.
It might be easier to forget htmlentities and use a CDATA section. It works for the title section, which doesn't seem to support encoded HTML characters in Firefox's RSS viewer:
<title><![CDATA[News & Updates " > » ☂ ☺ ☹ ☃ Test!]]></title>
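One caveat with CDATA is that the text itself must not contain the sequence "]]>", which would end the section early. A possible way to handle that is sketched below; cdata() is a hypothetical helper written for this example, not a built-in PHP function.

```php
<?php
// Hypothetical helper: wraps text in a CDATA section and splits any "]]>"
// across two sections so it cannot terminate the section prematurely.
function cdata(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

echo '<title>' . cdata('News & Updates') . '</title>', "\n";
// <title><![CDATA[News & Updates]]></title>

echo cdata('a]]>b'), "\n";
// <![CDATA[a]]]]><![CDATA[>b]]>
```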
You want to do $output = htmlentities(utf8_encode($source));. This is because you want to convert your international characters into proper UTF-8 first, and then have ampersands (and possibly some of the UTF-8 characters as well) turned into HTML entities. If you do the entities first, then some of the international characters may not be handled properly.
If none of your international characters are going to be changed by utf8_encode, then it doesn't matter which order you call them in.
After much trial & error, I finally found a way to properly display a string from a utf8-encoded database value, through an xml file, to an html page:
$output = '<![CDATA['.utf8_encode(htmlentities($string)).']]>';
I hope this helps someone.