What makes a file UTF-8?

I've read that adding the UTF-8 Byte Order Mark (3 bytes) at the start of a text file makes it a UTF-8 file, but I've also read that Unicode recommends against using the BOM for UTF-8.
I'm generating files in PHP, and I have a requirement that the files be UTF-8. I've added the UTF-8 BOM to the start of the file, but the company that parses the files (and that gave me the requirement to make the files UTF-8) has reported garbage characters at the start of the file.
If I open the file in Notepad it doesn't show the BOM, and if I go to Save As, it shows UTF-8 as the default choice.
Opening the file in Textpad32 shows the 3 characters at the start of the file.
So what makes a file UTF-8?

Text is UTF-8 because it's valid as UTF-8 and the author decides it is.
How that decision by the author is communicated to the consumer is a different question, one which involves convention, guessing, and various schemes for in-band or out-of-band signalling, like HTTP or HTML charset declarations, a BOM (which improves guessing), some envelope or embedding format, additional data streams, file naming, and many more.

The file doesn't need any explicit indicator that it is UTF-8; modern text editors should detect UTF-8 encoding from the content, as UTF-8 byte sequences are quite distinct.
Also, as you experienced for yourself, PHP doesn't like the BOM header; it's a silly thing that often messes up the script output and creates more problems than it solves.
HTML has its own way of declaring the encoding of a file; you can do it within the HTML itself:
<head>
<meta charset="UTF-8">
</head>
Or declare the encoding in the HTTP headers, here with PHP:
header('Content-Type: text/html; charset=utf-8');
Modern browsers will also assume UTF-8 as default encoding in case none is specified. It is the standard of the web after all.
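For the original question's scenario of generating files in PHP, a minimal sketch (the file name and sample text are made up) is simply to write the UTF-8 bytes and skip the BOM entirely:
// PHP strings are raw bytes; if the data is already UTF-8,
// writing it out as-is yields a UTF-8 file with no BOM.
$text = "Producción, Méthodes\n"; // sample UTF-8 text
file_put_contents('export.txt', $text);
// Optional sanity check (requires the mbstring extension):
var_dump(mb_check_encoding($text, 'UTF-8')); // bool(true)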

UTF-8 is a particular encoding. Every 7-bit ASCII file is also valid UTF-8, and UTF-8 can encode every Unicode character as well.
You will often get the advice to save as UTF-8 without a BOM. In practice, it is very unlikely that a file in a legacy encoding (such as code page 1252, Big5 or Shift-JIS) would just happen to look like valid UTF-8 unless it is an intentionally-ambiguous test case. Many programs, such as web browsers, are good in practice at figuring out when a file is UTF-8. Most recent software uses UTF-8 as its preferred text encoding unless it’s forced to default to something else for compatibility with last century. (LaTeX, for example, changed its default source encoding to UTF-8 in April 2018, and both the LuaLaTeX and XeLaTeX engines had been doing the same for years.)
There are some document types with special requirements. For example, the default encoding of web pages is theoretically Windows 1252, although browsers in the real world will take their best guess. The current best practice on the Web is to save as UTF-8 without a BOM. Instead, you write inside the <head> of the document, <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> or <meta charset="utf-8"/> This tells the user agent explicitly what the character encoding is.
On the other hand, some older versions of software either break if they see a BOM, or only recognize UTF-8 if there is a BOM. Microsoft in the ’aughts was especially guilty of this, and its software still doesn’t want to break files that used to work back then; so, to this day, I save my C source files as UTF-8 with a BOM. This is the only format that just works on every compiler I use: even the latest version of MSVC might guess wrong if you don’t give it either a BOM or the right command-line flag, whereas Clang only supports UTF-8 and has no option to read files in any other encoding. Some older versions of MSVC that I was once forced to use cannot understand UTF-8 at all unless the BOM is there, and provide no way to override their autodetection.

What's different between UTF-8 and UTF-8 with BOM? Which is better?
The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.
Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
According to the Unicode standard, the BOM for UTF-8 files is not recommended:
2.6 Encoding Schemes
... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.
The other excellent answers already answered that:
There is no official difference between UTF-8 and BOM-ed UTF-8
A BOM-ed UTF-8 string will start with the following three bytes: EF BB BF.
Those bytes, if present, must be ignored when extracting the string from the file/stream.
But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...
For example, the data [EF BB BF 41 42 43] could either be:
The legitimate ISO-8859-1 string "ï»¿ABC"
The legitimate UTF-8 string "ABC"
So while it can be nice to recognize the encoding of a file's content by looking at the first bytes, you should not rely on this, as shown by the example above.
Encodings should be known, not divined.
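To illustrate "must be ignored when extracting", here is a minimal PHP sketch (the helper name is mine) that strips a leading UTF-8 BOM when reading a file:
// Hypothetical helper: read a file and drop a leading UTF-8 BOM, if present.
function read_text_without_bom(string $path): string {
    $data = file_get_contents($path);
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        $data = substr($data, 3);
    }
    return $data;
}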
There are at least three problems with putting a BOM in UTF-8 encoded files.
Files that hold no text are no longer empty because they always contain the BOM.
Files that hold text within the ASCII subset of UTF-8 are no longer themselves ASCII because the BOM is not ASCII, which makes some existing tools break down, and it can be impossible for users to replace such legacy tools.
It is not possible to concatenate several files together because each file now has a BOM at the beginning.
And, as others have mentioned, it is neither sufficient nor necessary to have a BOM to detect that something is UTF-8:
It is not sufficient because an arbitrary byte sequence can happen to start with the exact sequence that constitutes the BOM.
It is not necessary because you can just read the bytes as if they were UTF-8; if that succeeds, it is, by definition, valid UTF-8.
Here are examples of the BOM usage that actually cause real problems and yet many people don't know about it.
BOM breaks scripts
Shell scripts, Perl scripts, Python scripts, Ruby scripts, Node.js scripts or any other executable that needs to be run by an interpreter - all start with a shebang line, which looks like one of these:
#!/bin/sh
#!/usr/bin/python
#!/usr/local/bin/perl
#!/usr/bin/env node
It tells the system which interpreter needs to be run when invoking such a script. If the script is encoded in UTF-8, one may be tempted to include a BOM at the beginning. But actually the "#!" characters are not just characters. They are in fact a magic number that happens to be composed of two ASCII characters. If you put something (like a BOM) before those characters, then the file will appear to have a different magic number, and that can lead to problems.
See Wikipedia, article: Shebang, section: Magic number:
The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 and 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[14] for this reason and for wider interoperability and philosophical concerns. Additionally, a byte order mark is not necessary in UTF-8, as that encoding does not have endianness issues; it serves only to identify the encoding as UTF-8. [emphasis added]
BOM is illegal in JSON
See RFC 7159, Section 8.1:
Implementations MUST NOT add a byte order mark to the beginning of a JSON text.
BOM is redundant in JSON
Not only is it illegal in JSON, it is also not needed to determine the character encoding, because there are more reliable ways to unambiguously determine both the character encoding and the endianness used in any JSON stream (see this answer for details).
BOM breaks JSON parsers
Not only is it illegal in JSON and not needed, it also breaks software that determines the encoding using the method presented in RFC 4627:
To determine the encoding and endianness of JSON, examine the first four bytes for NUL bytes:
00 00 00 xx - UTF-32BE
00 xx 00 xx - UTF-16BE
xx 00 00 00 - UTF-32LE
xx 00 xx 00 - UTF-16LE
xx xx xx xx - UTF-8
Now, if the file starts with BOM it will look like this:
00 00 FE FF - UTF-32BE
FE FF 00 xx - UTF-16BE
FF FE 00 00 - UTF-32LE
FF FE xx 00 - UTF-16LE
EF BB BF xx - UTF-8
Note that:
UTF-32BE no longer starts with three NULs, so it won't be recognized
UTF-32LE's first byte is no longer followed by three NULs, so it won't be recognized
UTF-16BE now has only one NUL in the first four bytes, so it won't be recognized
UTF-16LE now has only one NUL in the first four bytes, so it won't be recognized
Depending on the implementation, all of those may be interpreted incorrectly as UTF-8 and then misinterpreted or rejected as invalid UTF-8, or not recognized at all.
Additionally, if the implementation tests for valid JSON as I recommend, it will reject even the input that is indeed encoded as UTF-8, because it doesn't start with an ASCII character < 128 as it should according to the RFC.
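As a rough PHP sketch of that RFC 4627 heuristic (the function name is mine, and this is only the detection step, not a parser):
// Guess the encoding of a JSON byte stream from the NUL-byte patterns above.
function guess_json_encoding(string $data): string {
    if (strlen($data) < 4) {
        return 'UTF-8'; // too short for the pattern test
    }
    $b = array_values(unpack('C4', $data));
    if ($b[0] === 0 && $b[1] === 0 && $b[2] === 0) return 'UTF-32BE';
    if ($b[0] === 0 && $b[2] === 0)                return 'UTF-16BE';
    if ($b[1] === 0 && $b[2] === 0 && $b[3] === 0) return 'UTF-32LE';
    if ($b[1] === 0 && $b[3] === 0)                return 'UTF-16LE';
    return 'UTF-8';
}
// A BOM defeats the heuristic: "\xEF\xBB\xBF{...}" is reported as UTF-8, but a
// strict parser then trips over the three BOM bytes before the '{', and
// BOM-prefixed UTF-16/UTF-32 streams no longer match any NUL pattern at all.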
Other data formats
BOM in JSON is not needed, is illegal, and breaks software that works correctly according to the RFC. It should be a no-brainer to just not use it, and yet there are always people who insist on breaking JSON by using BOMs, comments, different quoting rules or different data types. Of course anyone is free to use things like BOMs or anything else if they need it - just don't call it JSON then.
For data formats other than JSON, take a look at how the format actually looks. If the only encodings are UTF-* and the first character must be an ASCII character lower than 128, then you already have all the information needed to determine both the encoding and the endianness of your data. Adding BOMs even as an optional feature would only make the format more complicated and error-prone.
Other uses of BOM
As for the uses outside of JSON or scripts, I think there are already very good answers here. I wanted to add more detailed info specifically about scripting and serialization, because it is an example of BOM characters causing real problems.
What's different between UTF-8 and UTF-8 without BOM?
Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.
Long answer:
Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.
UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.
Which is better?
Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.
A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.
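In PHP, for example, such a validity check could look like this (assuming the mbstring extension is available):
$bytes = file_get_contents('somefile.txt'); // any file you want to test
$isUtf8 = mb_check_encoding($bytes, 'UTF-8');
// The PCRE 'u' modifier also rejects invalid UTF-8, so this works as well:
$isUtf8Too = (bool) preg_match('//u', $bytes);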
UTF-8 with BOM is better identified. I have reached this conclusion the hard way. I am working on a project where one of the results is a CSV file, including Unicode characters.
If the CSV file is saved without a BOM, Excel thinks it's ANSI and shows gibberish. Once you add "EF BB BF" at the front (for example, by re-saving it using Notepad with UTF-8; or Notepad++ with UTF-8 with BOM), Excel opens it fine.
Prepending the BOM character to Unicode text files is recommended by RFC 3629: "UTF-8, a transformation format of ISO 10646", November 2003
at https://www.rfc-editor.org/rfc/rfc3629 (this last info found at: http://www.herongyang.com/Unicode/Notepad-Byte-Order-Mark-BOM-FEFF-EFBBBF.html)
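If you need such Excel-friendly CSV output from PHP, a hedged sketch (the file name and data are made up) is to prepend the three BOM bytes before writing:
$rows = [
    ['Name', 'City'],
    ['José', 'München'],
];
$csv = '';
foreach ($rows as $row) {
    $csv .= implode(',', $row) . "\r\n";
}
// The BOM makes Excel detect UTF-8 instead of assuming ANSI.
file_put_contents('output.csv', "\xEF\xBB\xBF" . $csv);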
BOM tends to boom (no pun intended (sic)) somewhere, someplace. And when it booms (for example, doesn't get recognized by browsers, editors, etc.), it shows up as the weird characters ï»¿ at the start of the document (for example, an HTML file, a JSON response, RSS, etc.) and causes embarrassments like the recent encoding issue experienced during the talk of Obama on Twitter.
It's very annoying when it shows up at places hard to debug or when testing is neglected. So it's best to avoid it unless you must use it.
Question: What's different between UTF-8 and UTF-8 without a BOM? Which is better?
Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.
On the meaning of the BOM and UTF-8:
The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8.
Argument for NOT using a BOM:
The primary motivation for not using a BOM is backwards-compatibility with software that is not Unicode-aware... Another motivation for not using a BOM is to encourage UTF-8 as the "default" encoding.
Argument FOR using a BOM:
The argument for using a BOM is that without it, heuristic analysis is required to determine what character encoding a file is using. Historically such analysis, to distinguish various 8-bit encodings, is complicated, error-prone, and sometimes slow. A number of libraries are available to ease the task, such as Mozilla Universal Charset Detector and International Components for Unicode.
Programmers mistakenly assume that detection of UTF-8 is equally difficult (it is not, because the vast majority of byte sequences are invalid UTF-8, while the encodings these libraries are trying to distinguish allow all possible byte sequences). Therefore not all Unicode-aware programs perform such an analysis and instead rely on the BOM.
In particular, Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.
On which is better, WITH or WITHOUT the BOM:
The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it “SHOULD forbid use of U+FEFF as a signature.”
My Conclusion:
Use the BOM only if compatibility with a software application is absolutely essential.
Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by barlop, when using the Windows Command Prompt with UTF-8†, commands such as type and more do not expect the BOM to be present. If the BOM is present, it can be problematic, as it is for other applications.
† The chcp command offers support for UTF-8 (without the BOM) via code page 65001.
This question already has a million-and-one answers and many of them are quite good, but I wanted to try and clarify when a BOM should or should not be used.
As mentioned, any use of the UTF BOM (Byte Order Mark) in determining whether a string is UTF-8 or not is educated guesswork. If there is proper metadata available (like charset="utf-8"), then you already know what you're supposed to be using, but otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte code, EF BB BF.
If a byte code corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8 and you can go from there. When forced to make this guess, however, additional error checking while reading would still be a good idea in case something comes up garbled. You should only assume a BOM is not UTF-8 (i.e. latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, however, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding.
Why is a BOM not recommended?
Non-Unicode-aware or poorly compliant software may assume it's latin-1 or ANSI and won't strip the BOM from the string, which can obviously cause issues.
It's not really needed (just check if the contents are compliant and always use UTF-8 as the fallback when no compliant encoding can be found)
When should you encode with a BOM?
If you're unable to record the metadata in any other way (through a charset tag or file-system metadata), and the programs being used expect BOMs, you should encode with a BOM. This is especially true on Windows, where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.
When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data.
UTF-8 without BOM has no BOM, which doesn't make it any better than UTF-8 with BOM, except when the consumer of the file needs to know (or would benefit from knowing) whether the file is UTF-8-encoded or not.
The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases.
Also, the BOM can be unnecessary noise/pain for those consumers that don't know or care about it, and can result in user confusion.
It should be noted that for some files you must not have the BOM, even on Windows. Examples are SQL*Plus and VBScript files. If such files contain a BOM, you get an error when you try to execute them.
Quoted at the bottom of the Wikipedia page on BOM: http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"
UTF-8 with BOM only helps if the file actually contains some non-ASCII characters. If the BOM is included and there aren't any, then it will possibly break older applications that would otherwise have interpreted the file as plain ASCII. These applications will definitely fail when they come across a non-ASCII character, so in my opinion the BOM should only be added when the file can, and should, no longer be interpreted as plain ASCII.
I want to make it clear that I prefer to not have the BOM at all. Add it in if some old rubbish breaks without it, and replacing that legacy application is not feasible.
Don't make anything expect a BOM for UTF-8.
I look at this from a different perspective. I think UTF-8 with BOM is better as it provides more information about the file. I use UTF-8 without BOM only if I face problems.
I have been using multiple languages (even Cyrillic) on my pages for a long time, and when files are saved without a BOM and I re-open them for editing with an editor (as cherouvim also noted), some characters get corrupted.
Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.
I personally save server side scripting files (.asp, .ini, .aspx) with BOM and .html files without BOM.
When you want to display information encoded in UTF-8 you may not face problems. Declare for example an HTML document as UTF-8 and you will have everything displayed in your browser that is contained in the body of the document.
But this is not the case when we have text, CSV and XML files, either on Windows or Linux.
For example, a plain text file on Windows or Linux, one of the easiest things imaginable, is not (usually) UTF-8.
Save it as XML and declare it as UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
It will not display (it will not be read) correctly, even if it's declared as UTF-8.
I had a string of data containing French letters that needed to be saved as XML for syndication. Without creating a UTF-8 file from the very beginning (changing options in the IDE, "Create New File") or adding the BOM at the beginning of the file
$file="\xEF\xBB\xBF".$string;
I was not able to save the French letters in an XML file.
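For what it's worth, an alternative to prepending the BOM, assuming the source string really was ISO-8859-1, is to convert the bytes before writing (the encoding names and file name here are my assumptions about that scenario):
$string = "M\xE9thodes"; // "Méthodes" as ISO-8859-1 bytes (example)
$utf8 = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
file_put_contents('feed.xml', $utf8);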
One practical difference is that if you write a shell script for Mac OS X and save it as UTF-8 with a BOM, you will get the response:
#!/bin/bash: No such file or directory
from the shebang line specifying which shell you wish to use:
#!/bin/bash
If you save as UTF-8, no BOM (say in BBEdit) all will be well.
The Unicode Byte Order Mark (BOM) FAQ provides a concise answer:
Q: How should I deal with BOMs?
A: Here are some guidelines to follow:
A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
Some protocols allow optional BOMs in the case of untagged text. In those cases:
Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
Some byte-oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as an encoding form signature should be avoided.
Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM must not be used.
From http://en.wikipedia.org/wiki/Byte-order_mark:
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.
My real problem with the absence of BOM is the following. Suppose we've got a file which contains:
abc
Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, for example:
abg-αβγ
Oops... Now the file is still in ANSI and guess what, "αβγ" does not occupy 6 bytes, but 3. This is not UTF-8 and this causes other problems later on in the development chain.
As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.
Invariably the layout would get destroyed when saving. It took me some time to fiddle my way around this. These files then worked well in Firefox, but showed a CSS quirk in Internet Explorer destroying the layout, again. After fiddling with the linked CSS files for hours to no avail, I discovered that Internet Explorer didn't like the BOM-fed HTML file. Never again.
Also, I just found this in Wikipedia:
The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns.
Here is my experience with Visual Studio, Sourcetree and Bitbucket pull requests, which has been giving me some problems:
It turns out that the BOM signature will show up as a red dot character on each file when reviewing a pull request (which can be quite annoying).
If you hover over it, it will show a character like "ufeff". It turns out Sourcetree does not show these types of byte marks, so your BOMs will most likely end up in your pull requests. That should be OK, because that's how Visual Studio 2017 encodes new files now; maybe Bitbucket should ignore this or show it in another way. More info here:
Red dot marker BitBucket diff view
I saved an AutoHotkey file as UTF-8, and the Chinese characters became strange.
Saved as UTF-8 with a BOM, it works fine.
AutoHotkey will not automatically recognize a UTF-8 file unless it begins with a byte order mark.
https://www.autohotkey.com/docs/FAQ.htm#nonascii
UTF-8 with a BOM is better if you use UTF-8 in HTML files and you use Serbian Cyrillic, Serbian Latin, German, Hungarian or some exotic language on the same page.
That is my opinion (30 years of computing and IT industry).

Advice: Build web sites using ISO-8859-1 or UTF-8?

Hi everybody, I have a decision to make about building a web site in Spanish, and the database has a lot of accents and special characters, like for example ñ. When I show the data in the view it appears like "Informática, Producción, Organización, Diseñador Web, Métodos", etc. By the way, I am using JSP & Servlets, MySQL, and phpMyAdmin under Fedora 20, and right now I have added this to the HTML file:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
and in Apache, I changed the default charset:
#AddDefaultCharset UTF-8
AddDefaultCharset ISO-8859-1
but in the browser the data keeps appearing like this: "Informática, Producción, Analista de Organización y Métodos", so I don't know what to do. I have been searching all day about whether to build websites using UTF-8, but I don't want to convert all the accents and special characters all the time. Any advice, guys?
The encoding errors appearing in your text (e.g., á instead of á) indicate that your application is trying to output UTF-8 text, but your pages are incorrectly specifying the text encoding as ISO-8859-1.
Specify the UTF-8 encoding in your Content-Type headers. Do not use ISO-8859-1.
It depends on the editor the files were created in, and whether it works by default in UTF-8 or ISO-8859-1. If the original file was written in ISO-8859-1 and you edit it as UTF-8, the special characters will appear encoded wrong, and if you save the file like that, you corrupt the original encoding (it gets saved badly as UTF-8).
It also depends on the configuration of Apache.
It depends on whether there is a hidden .htaccess file in the root directory that serves your website (httpdocs, public_html or similar).
It depends on whether the charset is specified in the META tags of the resulting HTML.
It depends on whether it is specified in the headers sent by a PHP file.
The charset also depends on the database (if you use a database to display content, with a CMS such as Joomla, Drupal or PHP-Nuke, or with your own dynamic application).

Change Website Character encoding from iso-8859-1 to UTF-8

About 2 years ago I made the mistake of starting a large website using iso-8859-1. I now am having issues with some characters, especially when sending data to the server using ajax. Because of this, I would like to switch to using UTF-8.
What issues do you see coming from this? I know I would have to search the site to look for characters that need to be changed from a ? to their real characters. But, are there any other risks in doing this? Has anyone done this before?
The main difficulty is making sure you've checked that all the data paths are UTF-8 clean:
Is your site DB-backed? If so, you'll need to convert all the tables to UTF-8 or some other Unicode encoding, so sorting and text searching work correctly.
Is your site using some programming language for dynamic content? (PHP, mod_perl, ASP...?) If so, you'll have to make sure the particular language interpreter you're using fully understands some form of Unicode, work out the conversions if it isn't using UTF-8 natively — UTF-16 is next most common — and check that it's configured to use UTF-8 on its output to the web server.
Does your site have some kind of back-end app server? Does it use UTF-8 for its text outputs?
There are at least three different places you can declare the charset for a web document. Be sure you change them all:
the HTTP Content-Type header
the <meta http-equiv="Content-Type"> tag in your documents' <head>
the <?xml> tag at the top of the document, if using XHTML Strict
All this comes from my experience a few years ago, when I traced some Unicode data through a moderately complex N-tier app and found conversion chains like:
Latin-1 → UTF-8 → Latin-1 → UTF-8
So, even though the data ended up in the browser claiming to be "UTF-8", the app could still only handle the subset common with Latin-1.
The biggest reason for those odd conversion chains was due to immature Unicode support in the tooling at the time, but you can still find yourself messing with ugliness like this if you're not careful to make the pipeline UTF-8 clean.
As for your comments about searching out Latin-1 characters and converting files one by one, I wouldn't do that. I'd build a script around the iconv utility found on every modern Linux system, feeding in every text file in your system, explicitly converting it from Latin-1 to UTF-8. Leave no stone unturned.
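A minimal sketch of such a conversion pass, using PHP's iconv() function instead of the command-line utility (the path and extension list are assumptions; run it on a copy, since converting a file that is already UTF-8 would double-encode it):
$root = '/var/www/site'; // hypothetical document root
$it = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root));
foreach ($it as $file) {
    if (!preg_match('/\.(php|html|css|js|txt)$/', $file->getFilename())) {
        continue; // skip directories and non-text files
    }
    $latin1 = file_get_contents($file->getPathname());
    file_put_contents($file->getPathname(), iconv('ISO-8859-1', 'UTF-8', $latin1));
}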
Such a change touches (nearly) every part of your system. You need to go through everything, from the database to the PHP to the HTML to the web browser.
Start a test site and subject it to some serious testing (various browsers on various platforms doing various things).
IMO it's important to actually get familiar with UTF-8 and what it means for software. A few quick points:
PHP is mostly byte-oriented. Learn the difference between characters and code points and bytes, and between UTF-8 and Unicode.
UTF-8 is well-designed. For instance, given two UTF-8 strings, a byte-oriented strstr() will still function correctly.
The most common problem is treating a UTF-8 string as ISO-8859-1 and vice versa - you may need to add documentation to your functions stating what kind of encoding they expect, to make these sorts of errors less likely. A variable naming convention for your strings (to indicate what encoding they use) may also help.
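A quick illustration of the bytes-versus-characters point:
$s = "héllo"; // 'é' is two bytes in UTF-8
var_dump(strlen($s));             // int(6) - counts bytes
var_dump(mb_strlen($s, 'UTF-8')); // int(5) - counts code points
// Byte-oriented search still works, since UTF-8 is self-synchronizing:
var_dump(strstr($s, "é"));        // string(5) "éllo"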

Why are these extended ASCII characters (â, é, etc.) getting replaced with <?> characters?

I attached a pic... but I am using PHP to pull the data from MySQL, and some of these locations have extended characters... I am using the Font Arial.
You can see the screen shot here: http://img269.imageshack.us/i/funnychar.png/
Still happening after the suggestions, here is what I did:
My firefox (view->encoding) is set to UTF-8 after adding the line, however, the text inside the option tags is still showing the funny character instead of the actual accented one. What should I look for now?
UPDATE:
I have the following in the PHP program that is giving my those <?> characters...
ini_set( 'default_charset', 'UTF-8' );
And right after my zend db object creation, I am setting the following query:
$db->query("SET NAMES utf8;");
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
Also STATUS is reporting:
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/run/mysqld/mysqld.sock
Uptime: 4 days 20 hours 59 min 41 sec
Looking at the source of the page, I see
<option value="Br�l� Lake"> Br�l� Lake
OK- NEW UPDATE-
I changed everything in my PHP and HTML to latin1, including:
header('Content-Type: text/html; charset=latin1');
Now it works, what gives?? How do I convert it all to UTF-8?
That's what the browser does when it doesn't know the encoding to use for a character. Make sure you specify the encoding type of the text you send to the client either in headers or markup meta.
In HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
In PHP (before any other content is sent to the client):
header('Content-Type: text/html; charset=utf-8');
I'm assuming you'll want UTF-8 encoding. If your site uses another encoding for text, then you should replace UTF-8 with the encoding you're using.
One thing to note about using HTML to specify the encoding is that the browser will restart rendering a page once it sees the Content-Type meta tag, so you should include the <meta /> tag immediately after the <head /> tag in your page so the browser doesn't do any more extra processing than it needs.
Another common charset is "iso-8859-1" (Basic Latin), which you may want to use instead of UTF-8. You can find more detailed info from this awesome article on character encodings and the web. You can also get an exhaustive list of character encodings here if you need a specific type.
If nothing else works, another (rare) possibility is that you may not have a font installed on your computer with the characters needed to display the page. I've tried repeating your results on my own server and had no luck, possibly because I have a lot of fonts installed on my machine so the browser can always substitute unavailable characters from one font with another font.
What I did notice by investigating further is that if text is sent in an encoding different from the encoding the browser reports, Unicode characters can render unexpectedly. To work around this, I used the HTML character entity representation of special characters, so â becomes &acirc; in my HTML and é becomes &eacute;. Once I did this, no matter what encoding I reported, my characters rendered correctly.
Obviously you don't want to modify your database to HTML encode Unicode characters. Your best option if you must do this is to use a PHP function, htmlentities(). You should use this function on any data-driven text you expect to have Unicode characters in. This may be annoying to do, but if specifying the encoding doesn't help, this is a good last resort for forcing Unicode characters to work.
There is no such standard called "extended ASCII", just a bunch of proprietary extensions.
Anyway, there are a variety of possible causes, but it's not your font. You can start by checking the character set in MySQL, and then see what PHP is doing. As Dan said, you need to make sure PHP is specifying the character encoding it's actually using.
As others have mentioned, this is a character-encoding question. You should read Joel Spolsky's article about character encoding.
Setting
header('Content-Type: text/html; charset=utf-8');
will fix your problem if your php page is writing UTF-8 characters to the browser. If the text is still garbled, it's possible your text is not UTF-8; in that case you need to use the correct encoding name in the Content-Type header. If you have a choice, always use UTF-8 or some other Unicode encoding.
Simplest fix
ini_set( 'default_charset', 'UTF-8' );
this way you don't have to worry about manually sending the Content-Type header yourself.
EDIT
Make sure you are actually storing data as UTF-8 - sending non-UTF-8 data to the browser as UTF-8 is just as likely to cause problems as sending UTF-8 data as some other character set.
SELECT table_collation
FROM information_schema.`TABLES` T
WHERE table_name=[Table Name];
SELECT default_character_set_name
, default_collation_name
FROM information_schema.`SCHEMATA` S
WHERE schema_name=[Schema Name];
Check those values
There are two transmission encodings, PHP<->browser and Mysql<->PHP, and they need to be consistent with each other. Setting up the encoding for Mysql<->PHP is dealt with in the answers to the questions below:
Special characters in PHP / MySQL
How to make MySQL handle UTF-8 properly
php mysql character set: storing html of international content
The quick answer is "SET NAMES UTF8".
The slow answer is to read the articles recommended in the other answers - it's a lot better to understand what's going on and make one precise change than to apply trial and error until things seem to work. This isn't just a cosmetic UI issue, bad encoding configurations can mess up your data very badly. Think about the Simpsons episode where Lisa gets chewing gum in her hair, which Marge tries to get out by putting peanut butter on.
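With mysqli, for instance, the clean way to make that one precise change is (credentials are placeholders):
$db = new mysqli('localhost', 'user', 'password', 'mydb');
// Preferable to a raw "SET NAMES utf8" query: this way the client library
// itself knows the connection encoding, e.g. for real_escape_string().
$db->set_charset('utf8');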
You should encode all special chars into HTML entities instead of depending on the charset.
htmlentities() will do the work for you.
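For example (the sample string echoes the question's data):
$text = "Brûlé Lake";
echo htmlentities($text, ENT_QUOTES, 'UTF-8');
// Prints: Br&ucirc;l&eacute; Lake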
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
If your original data was latin1, then inserting it into a UTF-8 database won't convert it to UTF-8; AFAIK, it will insert the same bytes but now believe they're UTF-8, thus breaking.
If you've got a SQL dump, I'd suggest running it through a tool to convert to UTF-8. Notepad++ does this pretty well - simply open the file, check that the accented characters are displaying correctly, then find "convert to UTF-8" in the menu.
These special characters generally appear due to encoding mismatches. If we provide a meta tag with charset=utf-8, we can eliminate them by adding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
to the document's head.

Coding in UTF-8 problem

I am using Notepad++ for PHP coding.
I don't have any problem when the format is set using Encode in ANSI.
However, when I use Encode in UTF-8, either I get a strange character at the top or nothing is displayed at all.
Q1. Am I supposed to use ANSI?
Q2. Why am I not able to display anything when I use UTF-8?
My source code for the header is the following.
<html>
<head>
<title>Hello, PHPlot!</title>
</head>
Is that because I am not using UTF-8 in the header?
It's probably a Byte Order Mark. You can use the 'Encode in UTF-8 without BOM' mode in Notepad++.
This question has some helpful information about using UTF-8 with PHP. You will also (as you suggested) need to set the content type in either the header or a meta tag in order for the browser to interpret it correctly.
It sounds like you are using UTF-8 with a BOM (which has issues) and your server is failing to specify the encoding correctly.
IIRC, BOM is unavoidable in Notepad, so I would suggest using a better editor. I'm fond of Komodo Edit myself.
(Also note that a doctype is required in HTML documents.)
As Tom Haigh says, it's probably the BOM. It's not necessary for UTF-8 encoding, so you can safely leave it out.
However I should point out that PHP has very weak support for UTF-8 - be prepared for a bumpy ride. Take a look at this page for some details on problems you might encounter.
