Japanese and Russian characters - web encoding?

I have a Zope/Plone web service (WS) that calls some functions written in Python.
The WS is called from PHP pages (which declare UTF-8 in the header), but the characters aren't displayed.
I've tried converting special characters into HTML entities on the Python side, where possible, and that works, but not every character has a corresponding HTML entity.
I've also tried saving the original Python file as UTF-8, but I thought that wasn't the right way to do it.
Can someone help?
Note: the output passes through some PHP includes, in case that's a hint...
Edit: it's weird, because if I log all the "pieces" individually, the characters are encoded correctly. Once they reach the "main PHP page" (where I include all the pieces), everything gets messed up.
Obviously, the "main PHP page" has this:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />

496e73e972657220646174652064926172726976e9652065742064652064e970617274
That string is encoded in ISO-8859-1, not UTF-8.
Somewhere you're converting your strings to ISO-8859-1, so they can't be interpreted correctly as UTF-8. Any character outside ISO-8859-1's Western European repertoire, such as Japanese or Russian text, is lost in that conversion.
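To check this yourself, you can decode the hex dump. Here is a minimal sketch in Python (the variable names are mine, not from the question). One caveat: byte 0x92 is a control character in strict ISO-8859-1, so I'm assuming the data is really Windows-1252, the superset that browsers use when a page says ISO-8859-1; there 0x92 is the curly apostrophe.

```python
raw = bytes.fromhex(
    "496e73e972657220646174652064926172726976e9652065742064652064e970617274"
)

# Decoding as UTF-8 fails: 0xE9 announces a three-byte sequence,
# but the byte after it is plain ASCII.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Assumption: the bytes are Windows-1252 (cp1252).
text = raw.decode("cp1252")

# Re-encoding the decoded text gives the bytes a UTF-8 page expects.
utf8_bytes = text.encode("utf-8")

print(valid_utf8)
print(text)
```

The decoded text is a French sentence, so whatever produced these bytes had already left UTF-8 behind before the page was assembled.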

I just edited Python's site.py file.
I followed that guide: click here, and everything is OK now.
Thank you all for your help.

Related

PHP urlencode for chinese characters

I'm creating a PHP application that involves sending Chinese characters as URL parameters.
I have to send a query like:
http://xyz.com/?q=新
But the script at xyz.com won't automatically encode the Chinese character, so I need to explicitly send an encoded string as the parameter. It becomes:
http://xyz.com/?q=%E6%96%B0
The problem is, PHP won't encode the Chinese character properly.
I've tried urlencode() and rawurlencode(), but they give %D0%C2 (which doesn't work for my purpose) instead of %E6%96%B0 (which works well with xyz.com) as the output.
I'm using this website to create the latter encoded string.
I've also defined header('Content-Type: text/html; charset=gb2312'); to display Chinese characters properly.
Is there anything I can do to urlencode the Chinese character properly?
Thanks!
PS: I'm a relatively new programmer and don't understand Chinese.

You're URLencoding using the charset you specify in your header. %D0%C2 is 新 in gb2312; %E6%96%B0 is 新 in UTF-8. Switch your charset over to UTF-8 and you should fix this issue and still be able to display Simplified Chinese Han.
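The two outputs can be reproduced outside PHP. This is a small Python sketch of the same idea (the variable names are illustrative): percent-encoding works on bytes, so the result depends entirely on which encoding turns the character into bytes first.

```python
from urllib.parse import quote

char = "\u65b0"  # the character 新

# Same character, two byte encodings, two different percent-encoded strings.
as_gb2312 = quote(char, encoding="gb2312")
as_utf8 = quote(char, encoding="utf-8")

# If all you have are GB2312 bytes, decode them first, then encode as UTF-8.
from_bytes = quote(b"\xd0\xc2".decode("gb2312"), encoding="utf-8")

print(as_gb2312, as_utf8, from_bytes)
```

In PHP the equivalent of the last line would be converting the string from GB2312 to UTF-8 (for example with iconv() or mb_convert_encoding()) before calling urlencode(), if you can't switch the whole page to UTF-8.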

In order to reproduce your problem I created a simple PHP file:
<?php
var_dump(urlencode('新'));
?>
First I saved the file with UTF-8 encoding and got %E6%96%B0. Afterwards I changed it to GB2312 and got %D0%C2.
At http://meyerweb.com/eric/tools/dencoder/ they seem to use JavaScript, which is UTF-8 capable and therefore returns %E6%96%B0, too.
PS: When converting a file from GB2312 to UTF-8, some editors might break internationalized code. So please make sure to keep a copy of your file before converting!

Arabic Character Encoding Issue: UTF-8 versus Windows-1256

Quick Background: I inherited a large SQL dump file containing a combination of English and Arabic text and (I think) it was originally exported using 'latin1'. I changed all occurrences of 'latin1' to 'utf8' prior to importing the file. The Arabic text didn't appear correctly in phpMyAdmin (which I guess is normal), but when I loaded the text into a web page with the following...
<meta http-equiv='Content-Type' content='text/html; charset=windows-1256'/>
...everything looked good and the Arabic text displayed perfectly.
Problem: My client is really really really picky and doesn't want to change his...
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
...to the 'Windows-1256' equivalent. I didn't think this would be a problem, but when I changed the charset value to 'UTF-8', all of the Arabic characters appeared as diamonds with question marks. Shouldn't UTF-8 display Arabic text correctly?
Here are a few notes about my database configuration:
Database charset is 'utf8'
Database connection collation is 'utf8_general_ci'
All databases, tables, and applicable fields have been collated as 'utf8_general_ci'
I've been scouring Stack Overflow and other forums for anything that relates to my issue. I've found similar problems, but none of the solutions seem to work for my specific situation. Hope someone can help!

If the document looks right when declared as windows-1256 encoded, then it most probably is windows-1256 encoded. So it was apparently not exported using latin1—which would have been impossible, since latin1 has no Arabic letters.
If this is just about a single file, then the simplest way is to convert it from windows-1256 encoding to utf-8 encoding, using e.g. Notepad++. (Open the file in it, change the encoding, via File format menu, to Arabic, windows-1256. Then select Convert to UTF-8 in the File format menu and do File → Save.)
Windows-1256 and UTF-8 are completely different encodings, so data gets all messed up if you declare windows-1256 data as UTF-8 or vice versa. Only ASCII characters, such as English letters, have the same representation in both encodings.
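A short Python sketch makes the mismatch visible (the sample word and variable names are mine). The same Arabic word is seven bytes in windows-1256 and fourteen in UTF-8, and reading the windows-1256 bytes as UTF-8 yields only replacement characters, which is what the browser draws as diamonds.

```python
word = "\u0627\u0644\u0639\u0631\u0628\u064a\u0629"  # the word العربية

cp1256_bytes = word.encode("cp1256")  # one byte per Arabic letter
utf8_bytes = word.encode("utf-8")     # two bytes per Arabic letter

# Declaring windows-1256 data as UTF-8: every byte is invalid there.
garbled = cp1256_bytes.decode("utf-8", errors="replace")

# The fix: decode with the real encoding, then encode as UTF-8.
repaired = cp1256_bytes.decode("cp1256").encode("utf-8")

# ASCII text is byte-identical in both encodings.
ascii_same = "English".encode("cp1256") == "English".encode("utf-8")

print(len(cp1256_bytes), len(utf8_bytes), garbled, ascii_same)
```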

We can't find the error in your code if you don't show us your code, so we're very limited in how we can help you.
You told the browser to interpret the document as being UTF-8 rather than Windows-1256, but did you actually change the encoding used from Windows-1256 to UTF-8?
For example,
$ cat a.pl
use strict;
use warnings;
use feature qw( say );
use charnames ':full';
my $enc = $ARGV[0] or die;
binmode STDOUT, ":encoding($enc)";
print <<"__EOI__";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=$enc">
<title>Foo!</title>
</head>
<body dir="rtl">
\N{ARABIC LETTER ALEF}\N{ARABIC LETTER LAM}\N{ARABIC LETTER AIN}\N{ARABIC LETTER REH}\N{ARABIC LETTER BEH}\N{ARABIC LETTER YEH}\N{ARABIC LETTER TEH MARBUTA}
</body>
</html>
__EOI__
$ perl a.pl UTF-8 > utf8.html
$ perl a.pl Windows-1256 > cp1256.html

I think you need to go back to square one. It sounds like you have a database dump in Win-1256 encoding and you want to work with it in UTF-8 from now on. It also sounds like you are using PHP but you have lots of irrelevant tags on your question and are missing the most important one, PHP.
First, you need to convert the text dump into UTF-8 and you should be able to do that with PHP. Chances are that your conversion script will have two steps, first read the Win-1256 bytes and decode them into internal Unicode text strings, then encode the Unicode text strings into UTF-8 bytes for output to a new text file.
Once you have done that, redo the database import as you did before, but now you have correctly encoded the input data as UTF-8.
After that it should be as simple as reading the database and rendering a web page with the correct UTF-8 encoding.
P.S. It is actually possible to reencode the data every time you display it, but that does not solve the problem of having a database full of incorrectly encoded data.
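The two-step conversion described above can be sketched like this. It is written in Python rather than PHP, and the function name and file names are hypothetical; it also assumes the whole dump really is windows-1256 and fits in memory.

```python
from pathlib import Path


def convert_dump(src, dst, src_encoding="cp1256"):
    """Re-encode a text file from src_encoding to UTF-8."""
    # Step 1: read the raw bytes and decode them into Unicode text.
    text = Path(src).read_bytes().decode(src_encoding)
    # Step 2: encode the text as UTF-8 and write it to a new file.
    # Working on bytes keeps the original line endings untouched.
    Path(dst).write_bytes(text.encode("utf-8"))


# Hypothetical usage:
# convert_dump("dump_cp1256.sql", "dump_utf8.sql")
```

Writing to a new file instead of overwriting the source means a wrong guess about the source encoding costs nothing: you still have the original dump.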

In order to display Arabic characters correctly, you need to convert your PHP file to UTF-8 without BOM.
This happened to me: Arabic characters were displayed as diamonds, and converting the file to UTF-8 without BOM solved the problem.
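For reference, the UTF-8 BOM is the three bytes EF BB BF at the very start of a file; PHP sends them to the browser as ordinary output. A minimal Python sketch for detecting and stripping it (the helper name is hypothetical):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):  # b"\xef\xbb\xbf"
        return data[len(codecs.BOM_UTF8):]
    return data


print(strip_utf8_bom(b"\xef\xbb\xbf<?php echo 1;"))
```

Removing the BOM only helps if the rest of the file is already UTF-8; it does not convert windows-1256 data.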

It seems that the DB is configured as UTF-8, but the data itself is in an extended-ASCII (single-byte) encoding. If the data is converted to UTF-8, it will display correctly with the content type set to UTF-8.

HTML - Mixing UTF-8 coming from MySQL database and special chars into HTML

I have a database where everything is defined as UTF-8 (charsets, collations, ...).
I have a PHP page that gets data from that database and displays it.
That PHP page contains some hard-coded text with special characters, like é, à, ...
My PHP page has its meta charset set to utf-8.
I call mysql_set_charset("utf8");
My PHP page is written in an editor that is configured to encode as UTF-8 Unicode (Dreamweaver CS4, there is no other utf-8 option)
Anything coming from the database is OK, but...
I can't display the hard-coded special characters (é, à, ù, ...) correctly.
I have the same problem when I use strip_tags(html_entity_decode($datafromdatabase)); on data coming from the database. Here it's really problematic.
What can I do to keep using UTF-8 and still display the special characters correctly without having to use their HTML equivalents (&eacute;, &agrave;, ...)?
EDIT
The problem with the hard-coded characters came from the PHP page, which had not been saved with the proper encoding. I created a new document, copied and pasted the old code into it, and saved it over the old page. No more problems with hard-coded characters.
But I still have problems with strip_tags(html_entity_decode($datafromdatabase));
Using $datafromdatabase = htmlentities(strip_tags(html_entity_decode($datafromdatabase)), ENT_COMPAT, "UTF-8") does not solve the problem. I get strange characters starting with # for each é, à, ù in the text coming from the database (stored as &eacute;, ...).

It looks like the problem is in how your browser displays the characters rather than in how they are saved.
Check two things.
Send a UTF-8 HTTP header:
header( 'Content-Type: text/html; charset=UTF-8' );
And make sure your HTML declaration mentions UTF-8:
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
That's for HTML 4.
If your document is properly encoded, this should do it.

The problem with the hard-coded characters came from the PHP page, which had not been saved with the proper encoding. I created a new document, copied and pasted the old code into it, and saved it over the old page. No more problems with hard-coded characters.
For the problem with strip_tags(html_entity_decode($datafromdatabase)); I in fact had to use strip_tags(html_entity_decode($datafromdatabase, ENT_QUOTES, "UTF-8"));
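Why the third argument matters: before PHP 5.4, html_entity_decode() produced ISO-8859-1 output unless told otherwise, so &eacute; came back as the single byte E9, which is not valid UTF-8 on its own. The effect can be imitated in Python (a sketch with illustrative names, not PHP's actual implementation):

```python
import html

decoded = html.unescape("caf&eacute;")  # entity becomes a real character

# The same character as bytes in the two encodings.
latin1_bytes = decoded.encode("latin-1")  # what an ISO-8859-1 decode emits
utf8_bytes = decoded.encode("utf-8")      # what a UTF-8 page needs

# Sending the ISO-8859-1 byte inside a UTF-8 page gives a replacement character.
shown_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(latin1_bytes, utf8_bytes, shown_as_utf8)
```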

Character encoding in PHP

I've never had this problem before; it was usually my database or the HTML page. But now I think it's my PHP. I import text from a CSV file or from a textarea, and either way it goes wrong.
For example, é changes to Ã©. I used htmlentities to fix this, but it didn't work. The htmlentities function didn't return the HTML entity for é but the entities for Ã©, so the real character is already lost before htmlentities comes into play... So does that mean my PHP file has the wrong encoding or something?
I hope someone can help me out.
Thanks!
Chris

A file is usually ISO-8859-1 (Latin) or UTF-8 ... ISO-8859-1 is 1 byte per char, UTF-8 is 1-4 bytes per char. So if you get 2 chars when you expect one, then you are reading UTF-8 and showing it as ISO-8859-1 ... if you get strange chars, then you are reading ISO-8859-1 and showing it as UTF-8.
If you provide more details, it would be easier to pinpoint, but in short, you have inconsistent charsets and need to convert one or the other so they're all the same. From what it seems, you're using ISO-8859-1 in your project but reading some UTF-8 from somewhere... use utf8_decode($text) if that data should indeed be stored as UTF-8, or find the data and convert it manually.
EDIT: If you are using AJAX somewhere, then you will ALWAYS get UTF-8 from it, and you'll have to decode it yourself with utf8_decode() if you want to keep using ISO-8859-1.
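Both failure modes described above are easy to reproduce. A small Python sketch (variable names are mine) showing each direction and the round trip that undoes the first one:

```python
original = "\u00e9"  # é

# UTF-8 bytes shown as ISO-8859-1: one character turns into two.
mojibake = original.encode("utf-8").decode("latin-1")

# ISO-8859-1 bytes shown as UTF-8: an invalid sequence, hence a replacement character.
broken = original.encode("latin-1").decode("utf-8", errors="replace")

# The first case is reversible: go back to the bytes, then decode them as UTF-8.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(mojibake, broken, repaired)
```

The second case is not reversible once the replacement character has been stored, so it is worth fixing the charset mismatch before saving anything.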

Try opening your PHP file and changing its encoding to UTF-8.
If that doesn't help, add this to your PHP:
header('Content-Type: text/html; charset=utf-8');
Or this to your html:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Take a look at PHP's iconv().

Unicode and PHP - am I doing something wrong?

I'm using Kohana 3, which has full support for Unicode.
I have this as the first child of my <head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The Unicode character I am inserting is é, as in Café.
However, I am getting the triangle with a ? (as in could not decode character).
As far as I can tell in my own code, I am not doing any string manipulation on the text.
In fact, I have placed the accent straight into a view's PHP file and it is still not working.
I copied the character from this page: http://www.fileformat.info/info/unicode/char/00e9/index.htm
I've only just started examining PHP's Unicode limitations, so I could be doing something horribly wrong.
So, how do I display this character? Do I need to resort to the HTML entity?
Update
So this works
Caf<?php echo html_entity_decode('&eacute;', ENT_NOQUOTES, 'UTF-8'); ?>
Why does that work? If I copy the output accented e from that script and insert it into my document, it doesn't work.

View the http headers. You should see something like
Content-Type: text/html; charset=UTF-8
If a real HTTP header declares a charset, browsers use it and ignore a conflicting meta tag.
update
Whatcha get from this?
echo bin2hex('é');
echo chr(0xc3) . chr(0xa9);
You should get c3a9é; otherwise I'd say it's a file encoding issue.
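The same check works on the file as a whole: look at the bytes and see whether they form valid UTF-8. A minimal Python sketch (the helper name is hypothetical):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


e_acute = "\u00e9"  # é
utf8_hex = e_acute.encode("utf-8").hex()      # bytes of a UTF-8 saved file
latin1_hex = e_acute.encode("latin-1").hex()  # bytes of an ISO-8859-1 saved file

print(utf8_hex, latin1_hex)
print(is_valid_utf8(b"Caf\xc3\xa9"), is_valid_utf8(b"Caf\xe9"))
```

If bin2hex('é') in your view prints e9 instead of c3a9, the file was saved as ISO-8859-1 (or Windows-1252), whatever the meta tag says.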

I guess you see �, the replacement character for invalid UTF-8 byte sequences. Your text is not UTF-8 encoded. Check your editor’s settings to control the encoding of the PHP file.
If you’re not sure about the encoding of your sources, you can enforce UTF-8 compatibility as described here (German text): Force UTF-8.
You should never need entities except the basic ones.
