Yesterday I ran into a problem getting Unicode strings from a URL. I use CodeIgniter, and the URL segments are passed to the controller function as parameters. I don't know whether CI changes the encoding or something else does. The encoding is right in my HTML page content and in the address bar, up to the point where I read those Unicode segments in my code.
For example:
localhost/df-gamez/news/افتتاح-جدید-سایت-تیم-دریم-فکتوری
The last segment is Persian characters encoded in UTF-8. It is displayed correctly everywhere, but when I read it in my code it turns into something like this:
%d8%a7%d9%81%d8%aa%d8%aa%d8%a7%d8%ad-%d8%b3%d8%a7%db%8c%d8%aa-%d8%ac%d8%af%db%8c%d8%af-
%d8%aa%db%8c%d9%85-%d8%af%d8%b1%db%8c%d9%85-%d9%81%da%a9%d8%aa%d9%88%d8%b1%db%8c
I tried to change it to UTF-8 with mb_encode, but it didn't work. Both my HTML page and controller file are saved as UTF-8 without BOM.
Peace Out!
Use the urldecode function to decode it.
echo urldecode("%d8%a7%d9%81%d8%aa%d8%aa%d8%a7%d8%ad-%d8%b3%d8%a7%db%8c%d8%aa-%d8%ac%d8%af%db%8c%d8%af-%d8%aa%db%8c%d9%85-%d8%af%d8%b1%db%8c%d9%85-%d9%81%da%a9%d8%aa%d9%88%d8%b1%db%8c");
will give you افتتاح-سایت-جدید-تیم-دریم-فکتوری
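A minimal sketch of applying this where the segment arrives, assuming the controller method receives the still-encoded segment as its argument. The helper name decode_segment is hypothetical and not part of CodeIgniter; rawurldecode is used here instead of urldecode so that a literal + in a path segment is left alone.

```php
<?php
// Hypothetical helper: decode a percent-encoded URL path segment and
// confirm that the result is valid UTF-8 before using it.
function decode_segment(string $segment): string
{
    $decoded = rawurldecode($segment);
    // preg_match() with the /u modifier fails on invalid UTF-8 input.
    if (preg_match('//u', $decoded) !== 1) {
        throw new InvalidArgumentException('Segment is not valid UTF-8');
    }
    return $decoded;
}

// First two words of the slug from the question.
$raw = '%d8%a7%d9%81%d8%aa%d8%aa%d8%a7%d8%ad-%d8%b3%d8%a7%db%8c%d8%aa';
$title = decode_segment($raw);
echo $title, PHP_EOL;
```

The validity check is optional, but it catches segments that were encoded in something other than UTF-8 before they reach a database query.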
Everything in my stack uses UTF-8: my database (PostgreSQL) uses UTF-8 encoding, and I've checked that php.ini is set to UTF-8. I tried debugging to see whether any of the functions I use were responsible, but everything runs as expected. However, after my frontend sends a POST request to the backend server through curl, with some text to be inserted into the database, some character sequences like 'da' are converted to '?' in PostgreSQL and in memcached. I think PHP is converting them to Latin-1 again after the request reaches the other side for some reason, because I use utf8_encode before the request and utf8_decode on the other side.
this is the code to send the request
$pre_opp->
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST",str_replace(" ","%",utf8_encode($bio)));
this is how the backend system receives this
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));
Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf8_encode and utf8_decode anyway, I guess. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data = utf8_decode(urldecode($_POST["Data"]));
You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what utf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request sending code; you shouldn't need to decode it the other end, PHP will do that for you.
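As a rough illustration of that advice, here is a sketch that assumes the request body is ordinary form-encoded data. Since Send_Request_To_BackEnd is not shown, this only covers the encoding step; parse_str() stands in for what PHP does on its own when it fills $_POST, and the sample text is made up.

```php
<?php
// Assumption: $bio is already UTF-8, like the rest of the stack.
$bio = 'Olá, 50%da turma: 学校';

// Sender side: percent-encode once, with no utf8_encode() and no
// manual replacement of spaces.
$body = http_build_query(['Data' => $bio]);

// Receiver side: PHP decodes a form-encoded body into $_POST by itself.
// parse_str() is used here only to simulate that step locally.
parse_str($body, $post);
$data = $post['Data'];

var_dump($data === $bio); // bool(true)
```

The literal "%da" in the sample survives because the percent sign itself is escaped as %25 on the way out.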
I use file_get_contents() to download a JSON. There're some Chinese characters in the URL, I tried to print the URL out, it's OK. But when I ran the program, the URL I put in the function became error code. How do I know? The URL points to a JSON resource that is backed by a MySQL request, and in the MySQL console I saw that the URL had become error code. I tried lots of ways to convert the URL string to UTF-8, GB2312, etc., but none of them works. I hope I can get help here, thanks.
It's very difficult to understand your question. I think I understood the first part of your question:
I use file_get_contents() to download a JSON. There're some Chinese
characters in the URL, I tried to print the URL out, it's OK. But when
I ran the program, the URL I put in the function became error code.
You are trying to access a URL containing Chinese characters using file_get_contents().
The answer to this is:
You need to encode the part of the URL containing Chinese characters using urlencode() or rawurlencode().
The main difference between urlencode() and rawurlencode() is that urlencode() converts spaces to +, while rawurlencode() converts spaces to %20.
Use urlencode() for query parameters, for example ?q=my+search+key; in every other case use rawurlencode().
Example:
$test = 'http://www.example.com/'.rawurlencode('以怎么下载').'.html';
print_r($test);
// $html = file_get_contents($test);
// output:
http://www.example.com/%E4%BB%A5%E6%80%8E%E4%B9%88%E4%B8%8B%E8%BD%BD.html
I hope it solves your problem.
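To make the path versus query distinction concrete, here is a small sketch. The helper build_url is hypothetical, written only for this illustration; it encodes each path segment with rawurlencode() and leaves the query string to http_build_query().

```php
<?php
// Hypothetical helper: build a URL from raw path segments and query
// parameters, encoding each part with the function suited to it.
function build_url(string $base, array $segments, array $query = []): string
{
    $path = implode('/', array_map('rawurlencode', $segments));
    $url = rtrim($base, '/') . '/' . $path;
    if ($query !== []) {
        // PHP_QUERY_RFC1738 is the default and encodes spaces as "+".
        $url .= '?' . http_build_query($query, '', '&', PHP_QUERY_RFC1738);
    }
    return $url;
}

$url = build_url(
    'http://www.example.com',
    ['my files', '以怎么下载.html'],
    ['q' => 'my search key']
);
echo $url, PHP_EOL;
// http://www.example.com/my%20files/%E4%BB%A5%E6%80%8E%E4%B9%88%E4%B8%8B%E8%BD%BD.html?q=my+search+key
```

Encoding segment by segment keeps the slashes between segments intact, which would not be the case if the whole path were passed through rawurlencode() at once.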
I'm creating a PHP application that involves sending Chinese characters as URL parameters.
I have to send a query like:
http://xyz.com/?q=新
But the script at xyz.com won't automatically encode the Chinese character. So, I need to explicitly send an encoded string as the parameter. It becomes:
http://xyz.com/?q=%E6%96%B0
The problem is, PHP won't encode the Chinese character properly.
I've tried urlencode() and rawurlencode(). But they give %D0%C2 (doesn't work for my purpose) instead of %E6%96%B0 (works well with xyz.com) as the output.
I'm using this website to create the latter encoded string.
I've also defined header('Content-Type: text/html; charset=gb2312'); to display Chinese characters properly.
Is there anything I can do to urlencode the Chinese character properly?
Thanks!
PS: I'm a relatively new programmer and don't understand Chinese.
You're URLencoding using the charset you specify in your header. %D0%C2 is 新 in gb2312; %E6%96%B0 is 新 in UTF-8. Switch your charset over to UTF-8 and you should fix this issue and still be able to display Simplified Chinese Han.
In order to reproduce your problem I created a simple PHP file:
<?php
var_dump(urlencode('新'));
?>
First I saved the file as UTF-8 and got %E6%96%B0. Afterwards I changed it to GB2312 and got %D0%C2.
The tool at http://meyerweb.com/eric/tools/dencoder/ seems to use JavaScript, which is UTF-8 capable and therefore returns %E6%96%B0, too.
PS: When changing from GB2312 to UTF-8, some editors might break internationalized code. So please make sure to have a copy of your file before converting!
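If switching the whole file or page to UTF-8 is not possible right away, one alternative is to convert the string to UTF-8 just before encoding it. The sketch below assumes the iconv extension is available and that the input really is GB2312; the byte values are the ones quoted above.

```php
<?php
// "\xD0\xC2" is the GB2312 byte sequence for 新, as noted above.
$gb2312 = "\xD0\xC2";

// Encoding the raw bytes reproduces the unwanted result.
$before = urlencode($gb2312);
echo $before, PHP_EOL; // %D0%C2

// Convert to UTF-8 once, then percent-encode.
$utf8 = iconv('GB2312', 'UTF-8', $gb2312);
$after = urlencode($utf8);
echo $after, PHP_EOL; // %E6%96%B0
```

This is a stopgap; keeping everything in UTF-8 removes the need for the conversion step.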
I insert text in different languages from a CSV file.
I apply this to every element:
private function process_elements($element){
utf8_encode($element);
return $element;
}
The problem is that when they go into the database, they end up like this:
???????? ?? ???????????? ????? ??????? ??? ???????...
When I retrieve them from the database, I also get this.
This happens with Greek. However, when I retrieve Greek pages (through scraping) that are UTF-8 encoded, the characters look like this:
Δες webcam δωμάτια | Gr.ImLive.com
which is okay, because when I use the utf8_encode function, they look normal on the screen.
But when the data is taken from the CSV and put into the database, I get those question marks.
Is there a way to encode text from any language to UTF-8? Why does retrieving data from a CSV and from a UTF-8 encoded web page make such a difference? They look the same. How do I address that problem?
Please take a look at "Handling Unicode Front To Back In A Web App"; it will help you.
It's not about "languages", it's about encodings. Text is encoded as bits and bytes. Any one byte is equal to any other byte. If you only have a blob of bytes, you cannot know what encoding it represents. You can guess, but that's not accurate. You have to know what encoding some text is in by reading the accompanying meta data. That may be documentation, a <meta> tag or an HTTP header. Then you need to treat the text in that encoding.
utf8_encode actually converts text from ISO-8859-1 to UTF-8. It does not simply encode anything to UTF-8, because it does not have the means to determine what something is encoded in either. If your text is already UTF-8 encoded or was not ISO-8859-1 encoded to begin with, you're just garbling the text (as you are).
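A short sketch of the difference, under the assumption that the CSV is in ISO-8859-7 (a common Greek single-byte encoding); the actual encoding of the file has to be found out, not guessed. The helper to_utf8 is hypothetical. Note that, like utf8_encode(), iconv() returns the converted string and does not modify its argument, so the return value must be used.

```php
<?php
// Hypothetical helper: convert one CSV element from a known source
// encoding to UTF-8 and return the converted value.
function to_utf8(string $element, string $sourceEncoding): string
{
    $converted = iconv($sourceEncoding, 'UTF-8', $element);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $sourceEncoding");
    }
    return $converted;
}

// Assumption: the CSV bytes are ISO-8859-7, where 0xC4 is Δ.
$csvBytes = "\xC4";

// Wrong: treating the bytes as ISO-8859-1, which is what utf8_encode() does.
$garbled = to_utf8($csvBytes, 'ISO-8859-1');

// Right: convert from the encoding the text is really in.
$correct = to_utf8($csvBytes, 'ISO-8859-7');

echo bin2hex($garbled), PHP_EOL; // c384, the UTF-8 bytes of Ä
echo bin2hex($correct), PHP_EOL; // ce94, the UTF-8 bytes of Δ
```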
I have Chinese characters encoded in ISO-8859-1, for example 兼 = &#20860;
Those characters are taken from the database using AJAX and sent as JSON using json_encode.
I then use the template Handlebars to set the data on the page.
When I look at the AJAX page, the characters are displayed correctly, but the source is still encoded.
But the final result displays the encoded characters.
I tried to decode on the JavaScript side with unescape, but the template has no foreach that would let me decode the specific variable, so it crashes.
I tried to decode on the PHP side with htmlspecialchars_decode, but without success.
Both pages are encoded in ISO-8859-1. I can change them to UTF-8 if necessary, but the data in the database remains encoded in ISO-8859-1.
Thank you for your help.
You're simply representing your characters in HTML entities. If you want them as "actual characters", you'll need to use an encoding that can represent those characters, ISO-8859 won't do. htmlspecialchars_decode doesn't work because it only decodes a handful of characters that are special in HTML and leaves other characters alone. You'll need html_entity_decode to decode all entities, and you'll need to provide it with a character set to decode to which can handle Chinese characters, UTF-8 being the obvious best choice:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
You'll then need to make sure the browser knows that you're sending it UTF-8. If you want to store the text in the database in UTF-8 as well (which you really should), best follow the guide How to handle UTF-8 in a web app which explains all the pitfalls.
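A sketch of the server side of this, assuming the stored value is the decimal numeric entity &#20860; for 兼 (U+517C) and that the response is built with json_encode. The array key name is made up for the example.

```php
<?php
// Assumption: the stored value is the numeric entity for 兼 (U+517C).
$stored = '&#20860;';

// htmlspecialchars_decode() only handles &amp;, &lt;, &gt; and quotes,
// so the numeric entity is left untouched.
var_dump(htmlspecialchars_decode($stored) === $stored); // bool(true)

// html_entity_decode() converts the entity to a real UTF-8 character.
$str = html_entity_decode($stored, ENT_COMPAT, 'UTF-8');
echo bin2hex($str), PHP_EOL; // e585bc, the UTF-8 bytes of 兼

// JSON_UNESCAPED_UNICODE keeps the character readable in the response.
$json = json_encode(['name' => $str], JSON_UNESCAPED_UNICODE);
echo $json, PHP_EOL; // {"name":"兼"}
```

In a real endpoint this would go together with a header such as header('Content-Type: application/json; charset=utf-8'); so the client knows the response is UTF-8.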
Are you including your text with the "double-stache" Handlebars syntax?
{{your expression}}
As the Handlebars documentation mentions, that syntax HTML-escapes its output, which would cause the results you're mentioning, where you're seeing the entity &#20860; instead of 兼.
Using three braces instead ("triple-stache") won't escape the output and will let the browser correctly interpret those numeric entities:
{{{your expression}}}