Advice Build Web Sites using ISO-8859-1 or UTF-8

Advice Build Web Sites using ISO-8859-1 or UTF-8 - php

I everybody, i have a decision to make about making a web site in spanish, and the database has a lot of accents and special characters like for example ñ, when i show the data into the view it appears like "InformÃ¡tica, ProducciÃ³n, OrganizaciÃ³n, DiseÃ±ador Web, MÃ©todos" etc. So by the way, i am using JSP & Servlets, MySQL, phpMyAdmin under Fedora 20 and right know i have added this to the html file:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
and in the apache, i change the default charset:
#AddDefaultCharset UTF-8
AddDefaultCharset ISO-8859-1
but in the browser, the data continue appearing like this: "InformÃ¡tica, ProducciÃ³n, Analista de OrganizaciÃ³n y MÃ©todos", so i don't know what to do, and i have searching all day long if doing the websites using UTF-8 but i don't want to convert all accents and special characters all the time, any advice guys?

The encoding errors appearing in your text (e.g, Ã¡ instead of á) indicate that your application is trying to output UTF-8 text, but your pages are incorrectly specifying the text encoding as ISO-8859-1.
Specify the UTF-8 encoding in your Content-Type headers. Do not use ISO-8859-1.

It depends on the editor that has been done anywhere, whether at work by default in UTF-8 or ISO-8859-1. If the original file was written in ISO-8859-1 and edit it in UTF-8, see the special characters encoded wrong. If we keep that which file such, we are corrupting the original encoding (bad is saved with UTF-8).
Depending on the configuration of Apache.
It depends on whether there is a hidden file. Htaccess in the root directory that serves our website (httpdocs, public_html or similar)
Depends if specified in the META tags of the resulting HTML.
Depends if specified in the header of a PHP file.
Charset chosen depends on the database (if you use a database to display content with a CMS such as Joomla, Drupal, phpNuke, or your own application that is dynamic).

Related

What makes a file UTF-8?

I've read that adding the UTF-8 Byte Order Mark (3 characters) at the start of a text file makes it a UTF-8 file, but I've also read that unicode recommends against using the BOM for UTF-8.
I'm generating files in PHP and I have a requirement that the files be UTF-8. I've added the UTF-8 BOM to the start of the file but I've received feedback about garbage characters at the start of the file from the company that is parsing the files and that gave me the requirement to make the files UTF-8.
If I open the file in notepad it doesn't show the BOM, and if I go to save as, it shows UTF-8 as the default choice.
Opening the file in Textpad32 shows the 3 characters at the start of the file.
So what makes a file UTF-8?

Text is UTF-8 because it's valid as UTF-8 and the author decides it is.
How that decision by the author is communicated to the consumer is a different question, which involves convention, guessing, and various schemes for in-band- or out-of-band-signalling, like HTTP or HTML charset, BOM (which enhances guessing), some envelope / embedding Format, additional data-streams, file-naming, and many more.

The file doesn't need any explicit indicator that it is UTF-8, modern text editors should detect UTF-8 encoding from the context as UTF-8 sequences are quite distinct.
Also, as you experienced for yourself, PHP doesn't like the BOM header, it's a silly thing that often messes up with the script output and creates more problems than it solves.
HTML has it's own way of declaring the encoding of a file, you can do it within the HTML itself:
<head>
<meta charset="UTF-8">
</head>
Or declare the encoding in the HTTP headers, here with PHP:
header('Content-Type: text/html; charset=utf-8');
Modern browsers will also assume UTF-8 as default encoding in case none is specified. It is the standard of the web after all.

UTF-8 is a particular encoding. All 7-bit ASCII files are also valid UTF-8, and it can encode every Unicode character as well.
You will often get the advice to save as UTF-8 without a BOM. In practice, it is very unlikely that a file in a legacy encoding (such as code page 1252, Big5 or Shift-JIS) would just happen to look like valid UTF-8 unless it is an intentionally-ambiguous test case. Many programs, such as web browsers, are good in practice at figuring out when a file is UTF-8. Most recent software uses UTF-8 as its preferred text encoding unless it’s forced to default to something else for compatibility with last century. (LaTeX, for example, changed its default source encoding to UTF-8 in April 2018, and both the LuaLaTeX and XeLaTeX engines had been doing the same for years.)
There are some document types with special requirements. For example, the default encoding of web pages is theoretically Windows 1252, although browsers in the real world will take their best guess. The current best practice on the Web is to save as UTF-8 without a BOM. Instead, you write inside the <head> of the document, <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> or <meta charset="utf-8"/> This tells the user agent explicitly what the character encoding is.
On the other hand, some older versions of software either break if they see a BOM, or only recognize UTF-8 if there is a BOM. Microsoft in the ’aughts was especially guilty of this, its software doesn’t want to break any files that used to work back then, and so, to this day, I save my C source files as UTF-8 with a BOM. This is the only format that just works on every compiler I use: even the latest version of MSVC might guess wrong if you don’t give it either a BOM or the right command-line flag, whereas Clang only supports UTF-8 and has no option to read files in any other encoding. Some older versions of MSVC that I was once forced to use cannot understand UTF-8 at all unless the BOM is there, and do not provide any way to override its autodetection.

Incorrect rendering of Language (e.g. Arabic)

I apologize if this question is not directly related to programming. I'm having an issue, of which I have two examples;
I have a website, where I store Arabic words in a DB, and then retrieve it, and display it on a page, using php. (Here's the link to my page, that is displaying Arabic incorrectly.)
I visit any random website, where the majority of the content is supposed to be in Arabic. (An example of a random website that gives me this issue.)
In both these cases, the Arabic text is displayed as 'ÇáÔíÎ: ÇáÓáÝ ãÚäÇå ÇáãÊÞÏãæä Ýßá'... or such weird characters. Do note that, in the first case, I may be able to correct it, since I control the content. So, I can set the encoding.
But what about the second case [this is where I want to apologize, since it isn't directly related to programming (the code) from my end] - what do I do for random websites I visit, where the text (Arabic) is displayed incorrectly? Any help would really be appreciated.

For the second case:
This website is encoded with Windows-1256 (Arabic), however, it wrongly declares to be encoded with ISO 8859-1 (Latin/Western European). If you look at the source, you can see that it declares <meta ... charset=ISO-8859-1" /> in its header.
So, what happens is that the server sends to your browser an HTML file that is encoded with Windows-1256, but your browser decodes this file with ISO 8859-1 (because that's what the file claims to be).
For the ASCII characters, this is no problem as they are encoded identically in both encodings. However, not so for the Arabic characters: each code byte corresponding to an Arabic character (as encoded by Windows-1256) maps to some Latin character of the ISO 8859-1 encoding, and these garbled Latin characters are what you see in place of the Arabic text.
If you want to display all the text of this website correctly, you can manually set the character encoding that your browser uses to decode this website.
You can do this, for example, with Chrome by installing the Set Character Encoding extension, and then right-click on the website and select:
Set Character Encoding > Arabic (Windows-1256)
In Safari, you can do it simply by selecting:
View > Text Encoding > Arabic (Windows).
The same should be possible with other browsers, such as Firefox or Internet Explorer, too...
For the first case:
Your website (the HTML file that your server sends to the browser) is encoded with UTF-8. However, this HTML file doesn't contain any encoding declaration, so the browser doesn't know with which encoding this file has been encoded.
In this case, the browser is likely to use a default encoding to decode the file, which typically is ISO 8859-1/Windows-1252 (Latin/Western European). The result is the same as in the above case: all the Arabic characters are decoded to garbled Latin characters.
To solve this problem, you have to declare that your HTML file is encoded with UTF-8 by adding the following tag in the header of your file:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

utf8 not supporting for other languages why?

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
it is not working like i have my site which can be translated in 20 languages but in some languages like turkish , japanese it shows � symbol instead of space or " and many others

Since I don't know your site I can just guess in the dark.
Setting
<meta charset="utf-8" />
will not be the only thing you have to do. If your document is saved as ASCII your problems won't be solved. Additionally you have to set the document encoding correctly (the meta tag just tells the browser which encoding to use, not which one IS actually used). So open the document with a (good) text editor like SublimeText / Notepad++ or what you prefer and set the encoding to UTF-8.

for php, you need to add a utf-8 header
header ('Content-type: text/html; charset=utf-8');

Letting know browser that text is in unicode and actually providing data in unicode is not the same. Check your files for unicode, database data for unicode and transformation that is done with it while serving. Provide more information to pinpoint your problem

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Just adding a Content-Type header in HTML doesn't make anything utf-8. It merely tells the browser to expect utf-8. If your source files are not in utf-8, or the database columns in which data is stored isn't utf-8, or if the connection to the database itself isn't utf-8, or if you're sending a HTTP header telling it isn't utf-8, it will not work. There's just one way of dealing with utf-8: make sure everything is in utf-8.

The problem is caused by the admin tool you are using. The tool injects data into a UTF-8 encoded data in some other encoding. As the tool has not been described, the specific causes cannot be isolated. The pages mentioned do not exhibit the problem, and they specify the UTF-8 encoding in HTTP headers, so the meta tag is ignored in the online context (though useful for offline use).

Changing character encoding in MySQL, PHP scripts, HTML

So, I have built on this system for quite some time, and it is currently outputting Latin1 (ISO-8859-1) to the web browser, and this is the components:
MySQL - all data is stored with the Latin1 character set
PHP - All PHP text files are stored on disk with Latin1 encoding
HTML - The output has the http-equiv="content-type" content="text/html; charset=iso-8859-1" meta tag
So, I'm trying to understand how the encoding of the different parts come into play in my workflow. If I open a PHP script and change its encoding within the text editor to UTF-8 and save it back to disk and reload the web browser, the text is all messed up - unless the text comes from the DB. If I change the encoding of the DB to UTF-8 and keep the PHP files in latin1 I have to use utf8_decode() for the data to display correctly. And if I change the HTML code the browser will read it incorrectly.
So yeah, I realise that if I want to "upgrade" to UTF8, I have to update all three parts of this setup for it to work correctly, but since it's a huge system with some 180k lines of PHP code and millions of posts in a lot of databases/tables, I don't want to start something like this without understanding everything correctly.
What haven't I thought about? What could mess this up beyond fixing? What are the procedures for changing the encoding of an entire MySQL installation and what's the easiest way to change the encoding of hundreds or thousands of PHP files on disk?
The META tag is luckily added dynamically, so I'll change that in one place only :)
Let me hear about your experiences with this.

It's tricky.
You have to:
change the DB and every table character set/encoding – I don't know much about MySQL, but see here
set the client encoding to UTF-8 in PHP (SET NAMES UTF8) before the first query
change the meta tag and possible the Content-type header (note the Content-type header has precedence)
convert all the PHP files to UTF-8 w/out BOM – you can easily do that with a loop and iconv.
the trickiest of all: you have to change most of your string function calls. Than means mb_strlen instead of strlen, mb_substr instead of substr and $str[index], etc.

Don't convert to UTF8 if you don't have to. Its not worth the trouble.
UTF8 is (becoming) the new standard, so for new projects I can recommend it.
Functions
Certain function calls don't work anymore. For latin1 it's:
echo htmlentities($string);
For UTF8 it's:
echo htmlentities($string, ENT_COMPAT, 'UTF-8');
strlen(), substr(), etc. Aren't aware of the multibyte characters.
MySQL
mysql_set_charset('UTF8') or mysql_query('SET NAMES UTF8') will convert all text to UTF8 coming from the database(SELECTs). It will also convert incoming strings(INSERT, UPDATE) from UTF8 to the encoding of the table.
So for reading from a latin1 table it's not necessary to convert the table encoding.
But certain characters are only available in unicode (like the snowman ☃, iPhone emoticons, etc) and can't be converted to latin1. (The data will be truncated)
Scripts
I try to prevent specials-characters in my php-scripts / templates.
I use the ë notation instead of ë etc. This way it doesn't matter if is saved in latin1 or utf8.

Why is this the extended ascii character (â, é, etc) getting replaced with <?> characters?

Why is this the extended ascii character (â, é, etc) getting replaced with <?> characters?
I attached a pic... but I am using PHP to pull the data from MySQL, and some of these locations have extended characters... I am using the Font Arial.
You can see the screen shot here: http://img269.imageshack.us/i/funnychar.png/
Still happening after the suggestions, here is what I did:
My firefox (view->encoding) is set to UTF-8 after adding the line, however, the text inside the option tags is still showing the funny character instead of the actual accented one. What should I look for now?
UPDATE:
I have the following in the PHP program that is giving my those <?> characters...
ini_set( 'default_charset', 'UTF-8' );
And right after my zend db object creation, I am setting the following query:
$db->query("SET NAMES utf8;");
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
Also STATUS is reporting:
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/run/mysqld/mysqld.sock
Uptime: 4 days 20 hours 59 min 41 sec
Looking at the source of the page, I see
<option value="Br�l� Lake"> Br�l� Lake
OK- NEW UPDATE-
I Changed everything in my PHP and HTML to:
and
header('Content-Type: text/html; charset=latin1');
Now it works, what gives?? How do I convert it all to UTF-8?

That's what the browser does when it doesn't know the encoding to use for a character. Make sure you specify the encoding type of the text you send to the client either in headers or markup meta.
In HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
In PHP (before any other content is sent to the client):
header('Content-Type: text/html; charset=utf-8');
I'm assuming you'll want UTF-8 encoding. If your site uses another encoding for text, then you should replace UTF-8 with the encoding you're using.
One thing to note about using HTML to specify the encoding is that the browser will restart rendering a page once it sees the Content-Type meta tag, so you should include the <meta /> tag immediately after the <head /> tag in your page so the browser doesn't do any more extra processing than it needs.
Another common charset is "iso-8859-1" (Basic Latin), which you may want to use instead of UTF-8. You can find more detailed info from this awesome article on character encodings and the web. You can also get an exhaustive list of character encodings here if you need a specific type.
If nothing else works, another (rare) possibility is that you may not have a font installed on your computer with the characters needed to display the page. I've tried repeating your results on my own server and had no luck, possibly because I have a lot of fonts installed on my machine so the browser can always substitute unavailable characters from one font with another font.
What I did notice by investigating further is that if text is sent in an encoding different than the encoding the browser reports as, Unicode characters can render unexpectedly. To work around this, I used the HTML character entity representation of special characters, so â becomes â in my HTML and é becomes é. Once I did this, no matter what encoding I reported as, my characters rendered correctly.
Obviously you don't want to modify your database to HTML encode Unicode characters. Your best option if you must do this is to use a PHP function, htmlentities(). You should use this function on any data-driven text you expect to have Unicode characters in. This may be annoying to do, but if specifying the encoding doesn't help, this is a good last resort for forcing Unicode characters to work.

There is no such standard called "extended ASCII", just a bunch of proprietary extensions.
Anyway, there are a variety of possible causes, but it's not your font. You can start by checking the character set in MySQL, and then see what PHP is doing. As Dan said, you need to make sure PHP is specifying the character encoding it's actually using.

As others have mentioned, this is a character-encoding question. You should read Joel Spolsky's article about character encoding.
Setting
header('Content-Type: text/html; charset=utf-8');
will fix your problem if your php page is writing UTF-8 characters to the browser. If the text is still garbled, it's possible your text is not UTF-8; in that case you need to use the correct encoding name in the Content-Type header. If you have a choice, always use UTF-8 or some other Unicode encoding.

Simplest fix
ini_set( 'default_charset', 'UTF-8' );
this way you don't have to worry about manually sending the Content-Type header yourself.
EDIT
Make sure you are actually storing data as UTF-8 - sending non-UTF-8 data to the browser as UTF-8 is just as likely to cause problems as sending UTF-8 data as some other character set.
SELECT table_collation
FROM information_schema.`TABLES` T
WHERE table_name=[Table Name];
SELECT default_character_set_name
, default_collation_name
FROM information_schema.`SCHEMATA` S
WHERE schema_name=[Schema Name];
Check those values

There are two transmission encodings, PHP<->browser and Mysql<->PHP, and they need to be consistent with each other. Setting up the encoding for Mysql<->PHP is dealt with in the answers to the questions below:
Special characters in PHP / MySQL
How to make MySQL handle UTF-8 properly
php mysql character set: storing html of international content
The quick answer is "SET NAMES UTF8".
The slow answer is to read the articles recommended in the other answers - it's a lot better to understand what's going on and make one precise change than to apply trial and error until things seem to work. This isn't just a cosmetic UI issue, bad encoding configurations can mess up your data very badly. Think about the Simpsons episode where Lisa gets chewing gum in her hair, which Marge tries to get out by putting peanut butter on.

You should encode all special chars into HTML entities instead of depending on the charset.
htmlentities() will do the work for you.

I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
If your original data was latin1, then inserting it into a UTF-8 database won't convert it to UTF-8, AFAIK, it will insert the same data but now believe it's UTF-8, thus breaking.
If you've got a SQL dump, I'd suggest running it through a tool to convert to UTF-8. Notepad++ does this pretty well - simply open the file, check that the accented characters are displaying correctly, then find "convert to UTF-8" in the menu.

These special characters generally appear due to the the extensions. If we provide a meta tag with charset=utf-8 we can eliminate them by adding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
to your meta tags

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.