I apologize if this question is not directly related to programming. I'm having an issue, of which I have two examples:
I have a website where I store Arabic words in a DB, then retrieve them and display them on a page using PHP. (Here's the link to my page that is displaying Arabic incorrectly.)
I visit any random website, where the majority of the content is supposed to be in Arabic. (An example of a random website that gives me this issue.)
In both these cases, the Arabic text is displayed as 'ÇáÔíÎ: ÇáÓáÝ ãÚäÇå ÇáãÊÞÏãæä Ýßá'... or similar weird characters. Do note that, in the first case, I may be able to correct it, since I control the content and so can set the encoding.
But what about the second case [this is where I want to apologize, since it isn't directly related to programming (the code) on my end]: what do I do for random websites I visit where the Arabic text is displayed incorrectly? Any help would really be appreciated.
For the second case:
This website is encoded with Windows-1256 (Arabic); however, it wrongly declares itself to be encoded with ISO 8859-1 (Latin/Western European). If you look at the source, you can see that it declares <meta ... charset=ISO-8859-1" /> in its header.
So, what happens is that the server sends to your browser an HTML file that is encoded with Windows-1256, but your browser decodes this file with ISO 8859-1 (because that's what the file claims to be).
For the ASCII characters, this is no problem as they are encoded identically in both encodings. However, not so for the Arabic characters: each code byte corresponding to an Arabic character (as encoded by Windows-1256) maps to some Latin character of the ISO 8859-1 encoding, and these garbled Latin characters are what you see in place of the Arabic text.
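To see this mechanism concretely, here is a minimal PHP sketch that reproduces the mangling (it assumes your iconv build knows the WINDOWS-1256 encoding name, which most do):

<?php
$arabic = 'الشيخ'; // "the sheikh" -- this source file itself is saved as UTF-8
// The bytes as the mislabelled server actually sends them:
$win1256Bytes = iconv('UTF-8', 'WINDOWS-1256', $arabic);
// What the browser does when told the page is ISO 8859-1
// (re-expressed as UTF-8 here so the result is printable):
echo iconv('ISO-8859-1', 'UTF-8', $win1256Bytes); // prints "ÇáÔíÎ"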
If you want to display all the text of this website correctly, you can manually set the character encoding that your browser uses to decode this website.
You can do this, for example, in Chrome by installing the Set Character Encoding extension, then right-clicking on the page and selecting:
Set Character Encoding > Arabic (Windows-1256)
In Safari, you can do it simply by selecting:
View > Text Encoding > Arabic (Windows).
The same should be possible with other browsers, such as Firefox or Internet Explorer, too...
For the first case:
Your website (the HTML file that your server sends to the browser) is encoded with UTF-8. However, this HTML file doesn't contain any encoding declaration, so the browser doesn't know with which encoding this file has been encoded.
In this case, the browser is likely to use a default encoding to decode the file, which typically is ISO 8859-1/Windows-1252 (Latin/Western European). The result is the same as in the above case: all the Arabic characters are decoded to garbled Latin characters.
To solve this problem, you have to declare that your HTML file is encoded with UTF-8 by adding the following tag in the header of your file:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
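Alternatively (or in addition), since you're generating the page with PHP anyway, you can send the charset in the HTTP response header itself; a one-line sketch:

<?php
// Must run before any output is sent to the browser.
header('Content-Type: text/html; charset=UTF-8');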
Related
Hi everybody, I have to make a web site in Spanish, and the database has a lot of accents and special characters, like for example ñ. When I show the data in the view it appears like "Informática, Producción, Organización, Diseñador Web, Métodos" etc. By the way, I am using JSP & Servlets, MySQL, and phpMyAdmin under Fedora 20, and right now I have added this to the HTML file:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
and in the apache, i change the default charset:
#AddDefaultCharset UTF-8
AddDefaultCharset ISO-8859-1
but in the browser the data continues appearing like this: "Informática, Producción, Analista de Organización y Métodos", so I don't know what to do. I have been searching all day long about whether to build websites using UTF-8, but I don't want to convert all the accents and special characters all the time. Any advice, guys?
The encoding errors appearing in your text (e.g., Ã¡ instead of á) indicate that your application is trying to output UTF-8 text, but your pages are incorrectly specifying the text encoding as ISO-8859-1.
Specify the UTF-8 encoding in your Content-Type headers. Do not use ISO-8859-1.
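Concretely, that means flipping both of the declarations you quoted to UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

and in the Apache configuration:

AddDefaultCharset UTF-8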
It depends on the editor the file was made with, and whether that editor works in UTF-8 or ISO-8859-1 by default. If the original file was written in ISO-8859-1 and you edit it in UTF-8, you will see the special characters decoded wrongly, and if you then save the file like that, you corrupt the original encoding (it gets wrongly re-saved as UTF-8).
It depends on the configuration of Apache.
It depends on whether there is a hidden .htaccess file in the root directory that serves your website (httpdocs, public_html or similar).
It depends on whether it is specified in the META tags of the resulting HTML.
It depends on whether it is specified in the header sent by a PHP file.
It depends on the charset chosen in the database (if you use a database to display content, with a CMS such as Joomla, Drupal, phpNuke, or your own dynamic application).
I am using Flash to read contents from a UTF-8 page, which has Unicode in it.
The problem is that when Flash loads the data it displays ???????? instead of all the Unicode text.
What could be the problem?
By default Flash treats strings as if they are encoded using UTF-8. One reason you could be seeing substitute characters (for non-printable, invalid, or missing glyphs) is that you set System.useCodepage to true; if that's what happened, then why did you do that?
Otherwise, the font that is used to display the characters may be missing glyphs for the characters you need. You can check that by using Font.hasGlyphs("string with the glyphs"); to make sure the text can be displayed. This would normally only apply to embedded fonts.
Yet another possibility is that the source text you are trying to display is not a UTF-8 encoded string. Some particularly popular file formats such as XML and HTML sometimes carry an encoding declaration that bears no correspondence to the actual payload (for example, the XML declaration <?xml encoding="utf-8" ?> can be attached to any XML document regardless of the document's actual encoding). To make sure that the text is in UTF-8, read it as a ByteArray and inspect the bytes: pure ASCII never sets the high bit of a byte, single-byte national encodings use high-bit bytes freely for their own characters, and UTF-8 uses high-bit bytes only in well-formed multibyte sequences, so high-bit bytes that don't form such sequences mean the data is not valid UTF-8.
Flash internally uses UTF-8 to represent strings, so there should not be a problem if the entire stack uses UTF-8 encoding.
You probably have an implicit decode/encode step somewhere along the way.
This could really be a million things, unfortunately. Start from the ground up, insert traces and/or log messages to see where the conversion fails. Make sure your XML-content uses UTF-8, and especially if you're using PHP, make sure that all the PHP source files are saved in UTF-8 encoding - editing PHP files in simple text editors often results in Windows/Mac format source files, which will then break your character encoding. Also, verify HTML request/response headers to see if there is an encoding mismatch.
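On the PHP side, a minimal sketch of such a check before handing content to Flash (assuming the mbstring extension is available; the file name and fallback encoding are made up):

<?php
$xml = file_get_contents('data.xml'); // hypothetical data file
// mb_check_encoding() returns false if the bytes are not valid UTF-8.
if (!mb_check_encoding($xml, 'UTF-8')) {
    // Transcode from a guessed source encoding (assumption: Windows-1252).
    $xml = mb_convert_encoding($xml, 'UTF-8', 'Windows-1252');
}
header('Content-Type: application/xml; charset=UTF-8');
echo $xml;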
I have users who sometimes paste things into my site's forms after copying something from their Gmail. The characters look normal when they paste it, but in the database they have extra special characters that appear.
Here is an example of the text with the special characters.
It originally happened on this page:
http://www.hikingsanfrancisco.com/hiker_community/scheduled_hike_event.php?hike_event_id=91
But it looks like the person who made it has cleaned up the strange characters.
Does anyone know how to stop this from happening in the future?
Thanks,
Alex
I use PHP and MySQL
I'd guess that you're getting UTF-8 encoded text but your database is configured for ISO-8859-1 (AKA Latin-1). The page you reference says:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
so it is claiming to be encoded as UTF-8. A form on a UTF-8 page will be sent back to the server in UTF-8. Then you send that UTF-8 data into your database where it is stored as Latin-1 encoded text. If you're not handling the UTF-8 to Latin-1 change yourself then you'll get "funny" characters when you send the data back to a browser. As long as the text only uses standard ASCII characters then everything will be okay as UTF-8 and Latin-1 overlap on the ASCII characters.
The solution is to pick a character encoding and use it everywhere. I'd recommend UTF-8 everywhere. However, if your database is already in Latin-1 then you'll have to go with Latin-1 or change the encoding in the database and re-encode all the data. But, if all the text in your database is simple ASCII then no re-encoding will be needed.
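As a rough sketch of the PHP side, using the old mysql_* API that this vintage of question implies (credentials and table name are made up):

<?php
$conn = mysql_connect('localhost', 'user', 'password'); // hypothetical credentials
mysql_select_db('mysite', $conn);
// Tell MySQL the connection speaks UTF-8, so text isn't re-interpreted as Latin-1:
mysql_set_charset('utf8', $conn);

And if you do convert the database itself, MySQL can re-encode a table in place:

ALTER TABLE events CONVERT TO CHARACTER SET utf8;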
Hard to say what's going on without examples, but a character encoding mismatch is the usual problem when funny (funny peculiar, not funny ha-ha) characters appear only when text is sent back to the browser.
I'm making a KSSN (Korean ID Number) checker in PHP using a MySQL database.
I check if it is working by using a file_get_contents call to an external site.
The problem is that the requests (with Hangul/Korean characters in them) are using the wrong charset.
When I echo the string, the Korean characters just get replaced by question marks.
How can I make it use Korean? Should I change anything in the database too?
What should be the charset?
PHP Source and SQL Dump: http://www.multiupload.com/RJ93RASZ31
NOTE: I'm using Apache (HTML), not CLI.
You need to:
tell the browser what encoding you wish to receive in the form submission, by setting Content-Type via a header or <meta> tag, as in aviv's answer.
tell the database what encoding you're sending it bytes in, using mysql_set_charset().
Currently you are using EUC-KR in the database so presumably you want to use that encoding in both the above points. In this century I would suggest instead using UTF-8 throughout for all web apps/databases, as the East Asian multibyte encodings are an anachronistic unpleasantness. (With potential security implications, as if mysql_real_escape_string doesn't know the correct encoding, a multibyte sequence containing ' or \ can sneak through an SQL injection.)
However, if enpang.com are using EUC-KR for the encoding of the Name URL parameter you would need either to stick with EUC-KR, or to transcode the name value from UTF-8 to EUC-KR for that purpose using iconv(). (It's not clear to me what encoding enpang.com are using for URL parameters to their name check service; I always get the same results anyway.)
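A rough sketch of both points plus the transcoding (everything except mysql_set_charset() and iconv() is illustrative):

<?php
// 1. Tell the browser what encoding the page and its forms use:
header('Content-Type: text/html; charset=UTF-8');

// 2. Tell MySQL what encoding the connection uses:
$conn = mysql_connect('localhost', 'user', 'password'); // hypothetical credentials
mysql_set_charset('utf8', $conn);

// 3. If enpang.com really expects EUC-KR in its Name parameter, transcode
//    just that one value ($checkUrl stands in for their endpoint):
$name   = iconv('UTF-8', 'EUC-KR', $name);
$result = file_get_contents($checkUrl . '?Name=' . urlencode($name));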
I don't know the charset, but if you are using HTML to show the results, you should set the charset of the HTML; for Korean that would be:
<META http-equiv="Content-Type" content="text/html; charset=EUC-KR">
You can also use iconv (php function) to convert the charset to a different charset
http://php.net/manual/en/book.iconv.php
And last but not least, check your database encoding for the tables.
But I guess that in your case you will only have to change the meta tag.
Basically all charset problems stem from the fact that they're being mixed and/or misinterpreted.
A string (text) is a sequence of bytes in a specific order. The string is encoded using some specific charset, which in itself is neither right nor wrong, nor anything else. The problem arises when you try to read the string, the sequence of bytes, assuming the wrong charset. Bytes encoded using, for example, KS X 1001 just don't make sense when you read them assuming they're UTF-8; that's where the question marks come from.
The site you're getting the text from sends it to you in some specific character set, let's assume KS X 1001. Let's assume your own site uses UTF-8. Embedding a stream of bytes representing KS X 1001 encoded text in the middle of UTF-8 encoded text and telling the browser to interpret the whole site as UTF-8 leads to the KS X 1001 encoded text not making sense to the UTF-8 parser.
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
KSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKS
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
will be rendered as
Hey, this is UTF-8 encoded text, awesome!
???????I?have?no?idea?what?this?is???????
Hey, this is UTF-8 encoded text, awesome!
To solve this problem, convert the fetched text into UTF-8 (or whatever encoding you're using on your site). Look at the Content-Type header of that other site, it should tell you what encoding the site is in. If it doesn't, take a guess.
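In PHP that whole dance might look like this (a sketch; the charset-sniffing regex is simplistic, and the EUC-KR fallback is just the guess suggested above):

<?php
$text = file_get_contents($url); // $url: the external site you're checking against

// file_get_contents() over HTTP fills $http_response_header with the response headers.
$charset = 'EUC-KR'; // fallback guess
foreach ($http_response_header as $header) {
    if (preg_match('/^Content-Type:.*charset=([\w-]+)/i', $header, $m)) {
        $charset = $m[1];
    }
}

// Convert whatever we received into UTF-8 before embedding it in our own page:
$text = iconv($charset, 'UTF-8', $text);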
I have a php script which detects keyword density on given url.
My problem is that it doesn't detect Turkish characters, or deletes them.
I'm getting the contents of the URL with file_get_contents. This method works perfectly and fetches all the content, Turkish characters included.
You can see my code here or try the script here.
You seem to be fetching and converting the file_get_contents data as UTF-8 (probably correctly), but your HTML page is not specifying an encoding for itself. So, most likely, any incoming form data arrives as ISO-8859-1. Try specifying UTF-8 as your page's encoding as well:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
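And if the remote page you fetch turns out not to be UTF-8 itself, normalize it as well; a short sketch (ISO-8859-9/Latin-5 is only an assumption for a Turkish site):

<?php
$html = file_get_contents($url); // $url: the page being analyzed
if (!mb_check_encoding($html, 'UTF-8')) {
    $html = mb_convert_encoding($html, 'UTF-8', 'ISO-8859-9'); // guessed source charset
}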
And the obligatory reading link on encoding basics: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)