Coding in UTF-8 problem

Coding in UTF-8 problem - php

I am using notepad++ for php coding.
I don't have any problem with format set up using Encode in ANSI.
However, when I use Encode in UTF-8, I either get a strange character at the top of the page or nothing is displayed at all.
Q1. Am I supposed to use ANSI?
Q2. Why am I not able to display anything when I use UTF-8?
My source code for the header is the following.
<html>
<head>
<title>Hello, PHPlot!</title>
</head>
Is that because I am not using UTF-8 in the header?

It's probably a Byte Order Mark. You can use the 'Encode in UTF-8 without BOM' mode in notepad++.
This question has some helpful information about using UTF-8 with PHP. You will also (as you suggested) need to set the content type in either the header or a meta tag in order for the browser to interpret it correctly.
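To check whether a BOM is really the culprit, one option is to look at the first three bytes of the file. The sketch below is a minimal illustration, not part of the original answers: the names UTF8_BOM, has_utf8_bom() and strip_utf8_bom() are hypothetical helpers. In a PHP script, anything before the opening <?php tag (including a BOM) is sent to the browser as output, which is consistent with the stray character the asker sees.

```php
<?php
// The UTF-8 encoding of U+FEFF, i.e. the byte order mark.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: true when the byte string starts with a UTF-8 BOM.
function has_utf8_bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

// Hypothetical helper: returns the input without a leading BOM, if there is one.
function strip_utf8_bom(string $bytes): string
{
    return has_utf8_bom($bytes) ? substr($bytes, 3) : $bytes;
}
```

For a file on disk, the same check could be applied to the result of file_get_contents() before deciding whether to re-save it without a BOM.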

It sounds like you are using UTF-8 with a BOM (which has issues) and your server is failing to specify the encoding correctly.
IIRC, BOM is unavoidable in Notepad, so I would suggest using a better editor. I'm fond of Komodo Edit myself.
(Also note, that a Doctype is required in HTML documents)

As Tom Haigh says, it's probably the BOM. It's not necessary for UTF-8 encoding, so you can safely leave it out.
However I should point out that PHP has very weak support for UTF-8 - be prepared for a bumpy ride. Take a look at this page for some details on problems you might encounter.

Related

What makes a file UTF-8?

I've read that adding the UTF-8 Byte Order Mark (3 bytes) at the start of a text file makes it a UTF-8 file, but I've also read that Unicode recommends against using the BOM for UTF-8.
I'm generating files in PHP and I have a requirement that the files be UTF-8. I've added the UTF-8 BOM to the start of the file but I've received feedback about garbage characters at the start of the file from the company that is parsing the files and that gave me the requirement to make the files UTF-8.
If I open the file in notepad it doesn't show the BOM, and if I go to save as, it shows UTF-8 as the default choice.
Opening the file in Textpad32 shows the 3 characters at the start of the file.
So what makes a file UTF-8?

Text is UTF-8 because it's valid as UTF-8 and the author decides it is.
How that decision by the author is communicated to the consumer is a different question, which involves convention, guessing, and various schemes for in-band- or out-of-band-signalling, like HTTP or HTML charset, BOM (which enhances guessing), some envelope / embedding Format, additional data-streams, file-naming, and many more.
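The "valid as UTF-8" part can be checked mechanically. As a minimal sketch, assuming the mbstring extension is available, the hypothetical helper is_valid_utf8() below wraps mb_check_encoding(). It only says whether the bytes form well-formed UTF-8; it cannot tell what the author intended.

```php
<?php
// Hypothetical helper; requires the mbstring extension (an assumption about the PHP build).
// Returns true when the byte string is well-formed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}
```

Plain ASCII such as "Hello" passes, as does "\xC3\xA9" (é encoded in UTF-8), while a lone "\xE9" (é in ISO-8859-1) does not.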

The file doesn't need any explicit indicator that it is UTF-8, modern text editors should detect UTF-8 encoding from the context as UTF-8 sequences are quite distinct.
Also, as you experienced for yourself, PHP doesn't like the BOM header, it's a silly thing that often messes up with the script output and creates more problems than it solves.
HTML has its own way of declaring the encoding of a file, you can do it within the HTML itself:
<head>
<meta charset="UTF-8">
</head>
Or declare the encoding in the HTTP headers, here with PHP:
header('Content-Type: text/html; charset=utf-8');
Modern browsers will also assume UTF-8 as default encoding in case none is specified. It is the standard of the web after all.

UTF-8 is a particular encoding. All 7-bit ASCII files are also valid UTF-8, and it can encode every Unicode character as well.
You will often get the advice to save as UTF-8 without a BOM. In practice, it is very unlikely that a file in a legacy encoding (such as code page 1252, Big5 or Shift-JIS) would just happen to look like valid UTF-8 unless it is an intentionally-ambiguous test case. Many programs, such as web browsers, are good in practice at figuring out when a file is UTF-8. Most recent software uses UTF-8 as its preferred text encoding unless it’s forced to default to something else for compatibility with last century. (LaTeX, for example, changed its default source encoding to UTF-8 in April 2018, and both the LuaLaTeX and XeLaTeX engines had been doing the same for years.)
There are some document types with special requirements. For example, the default encoding of web pages is theoretically Windows 1252, although browsers in the real world will take their best guess. The current best practice on the Web is to save as UTF-8 without a BOM. Instead, you write inside the <head> of the document, <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> or <meta charset="utf-8"/>. This tells the user agent explicitly what the character encoding is.
On the other hand, some older versions of software either break if they see a BOM, or only recognize UTF-8 if there is a BOM. Microsoft in the ’aughts was especially guilty of this, and its software doesn’t want to break any files that used to work back then. So, to this day, I save my C source files as UTF-8 with a BOM. This is the only format that just works on every compiler I use: even the latest version of MSVC might guess wrong if you don’t give it either a BOM or the right command-line flag, whereas Clang only supports UTF-8 and has no option to read files in any other encoding. Some older versions of MSVC that I was once forced to use cannot understand UTF-8 at all unless the BOM is there, and do not provide any way to override their autodetection.

Encoding problems using PHP Gettext

I am trying to start using Gettext for my php project.
However, I have some encoding problems. If I use UTF-8 encoding in the .mo files and use
"bind_textdomain_codeset('messages', 'UTF-8');"
I don't see the accents properly in the browser. In Firefox, in order to see them correctly, I have to change the browser encoding to UTF-8 (it is not the default encoding). As I can't expect my visitors to change their browser encoding, what should I do?
I also tried changing everything to ISO-8859-15 and, although accents work OK (even with the browser default encoding), the € sign doesn't work. I have also read that there are problems when using languages like Russian, so it doesn't seem to be the right way.
How should I proceed?
Thank you :)

You should instruct the browser that the page you are sending is encoded in UTF-8. Do this using header before you actually output any content:
header('Content-Type: text/html; charset=utf-8');
Of course this assumes that the page is in UTF-8 in the first place.
In general, the one law that you can never disregard is that all content in your page must be in the same encoding (and that's the encoding you use when declaring the Content-Type).
If all sources for the content (e.g. your hardcoded stuff, what comes from gettext, what comes from a database) are in that encoding, everything is fine. If not then you have to manually convert all content from sources that diverge to the encoding of the page, which is possible through iconv or mb_convert_encoding.
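As a rough sketch of that manual conversion step, assuming the mbstring extension is available and that you know the encoding each source actually uses, a small wrapper around mb_convert_encoding could look like this. The function name to_page_encoding() and its parameters are made up for illustration.

```php
<?php
// Hypothetical helper: convert text from a known source encoding to the page encoding.
// $from must be the encoding the bytes are really in; guessing wrong produces garbage.
function to_page_encoding(string $text, string $from, string $page = 'UTF-8'): string
{
    if (strcasecmp($from, $page) === 0) {
        return $text; // already in the page encoding, nothing to do
    }
    return mb_convert_encoding($text, $page, $from);
}
```

With this, a string coming from an ISO-8859-15 source can be brought in line with a UTF-8 page via to_page_encoding($text, 'ISO-8859-15') before it is echoed, while content that is already UTF-8 passes through unchanged.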

html content into a page

I need to pull content from the database onto the page, but some of this content contains a whole HTML page, with css, head, etc...
What would be the best way to prevent having all the HTML tags, scripts, and css? Would an iframe help here?
The most bothering thing is that I'm getting strange characters on the page: �
and, as I found out, it is due to a different encoding.
The site has utf-8 encoding and if the content contains a different encoding, these signs come out and I cannot replace them.
The only thing that removed them was changing my encoding, but this is not a real solution.
If someone could tell me how to remove them, that would be really great.
Solution: with your help I checked the encoding, but couldn't change it. I set names in mysql_query to UTF-8, and stripped the unneeded tags. Now it seems OK.
Thanks to all of you.

I think you have no chance apart from an ugly iframe. About encoding, you should check the db encoding and the connection encoding, and convert as needed. Use iconv for full control over conversion, for example:
$html = iconv("UTF-8", "ISO-8859-15//TRANSLIT//IGNORE", $html);
In this case, you're going to lose some characters not mapped in ISO-8859-15. Consider moving your whole site to UTF-8 encoding.

The � tags in fact might not be due to encoding, the problem might be the content that is stored in the database.
Check for double quotes like “ which are supposed to be ", more so if the data in the table was copy pasted.

Project conversion from ISO 8859-1 to UTF-8

I coded a PHP project in ISO 8859-1, and for some technical reasons I want to convert the project to UTF-8. What is the best way to do it? I am afraid of losing special characters like French accents and so on. Thanks for your advice.

You should try using the shell command iconv to encode the php files from latin1 (ISO-8859-1) to UTF-8.
After that you should make sure that PHP uses UTF-8 as the default encoding (the default_charset setting in php.ini). If not, then you can set it with ini_set() for your project.
After that you should convert your database to UTF-8 or use a quickfix like this (for MySQL):
mysql_query("SET NAMES 'utf8'");
Of course you just substitute mysql_query() for whatever framework you use (if you use any).
Put it into your primary file which includes all the classes and stuff.

transcode all the files with iconv. change any and all http headers or meta tags. profit.
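One thing to watch for when transcoding a whole project is converting the same file twice, which turns é into Ã©. A minimal sketch of a guard against that, assuming the iconv and mbstring extensions are available, is shown below. The name latin1_to_utf8_once() is hypothetical, and the validity check is a heuristic: ISO-8859-1 text that happens to be valid UTF-8 (pure ASCII, for example) is left as-is, which is harmless for ASCII because the bytes are identical in both encodings.

```php
<?php
// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8, but leave input that is
// already well-formed UTF-8 untouched, so running it twice does not double-encode.
function latin1_to_utf8_once(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // assumed to be converted already (or plain ASCII)
    }
    return iconv('ISO-8859-1', 'UTF-8', $bytes);
}
```

Applied per file, this would amount to reading the contents with file_get_contents(), passing them through the helper, and writing the result back with file_put_contents(). Work on a copy or under version control so the originals can be restored.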

Here's my take on your question - you want the generated HTML (via PHP) to be UTF-8 compliant? Be aware that the HTML 4.x standard is based on iso-8859-1 and it's unclear if XHTML is based on utf-8 or iso-8859-1. Of course, pure XML is utf-8.
(1) So the first piece of the puzzle is to select your DOCTYPE for your rendered HTML.
(2) Make sure you add the language character set meta tags (charset=utf-8), etc.
(3) Take the rendered PHP/HTML string and send it through iconv either via the shell using a system call or through some PHP API method.
The resulting rendered HTML will be utf-8 encoded. The client browser needs to be set to render the HTML by means of utf-8 and not western latin1. Otherwise you get a strange non-breaking space character in the upper left hand corner of the page.

Why are extended ASCII characters (â, é, etc.) getting replaced with <?> characters?

I attached a pic... but I am using PHP to pull the data from MySQL, and some of these locations have extended characters... I am using the Font Arial.
You can see the screen shot here: http://img269.imageshack.us/i/funnychar.png/
Still happening after the suggestions, here is what I did:
My firefox (view->encoding) is set to UTF-8 after adding the line, however, the text inside the option tags is still showing the funny character instead of the actual accented one. What should I look for now?
UPDATE:
I have the following in the PHP program that is giving me those <?> characters...
ini_set( 'default_charset', 'UTF-8' );
And right after my zend db object creation, I am setting the following query:
$db->query("SET NAMES utf8;");
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
Also STATUS is reporting:
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/run/mysqld/mysqld.sock
Uptime: 4 days 20 hours 59 min 41 sec
Looking at the source of the page, I see
<option value="Br�l� Lake"> Br�l� Lake
OK- NEW UPDATE-
I Changed everything in my PHP and HTML to:
and
header('Content-Type: text/html; charset=latin1');
Now it works, what gives?? How do I convert it all to UTF-8?

That's what the browser does when it doesn't know the encoding to use for a character. Make sure you specify the encoding type of the text you send to the client either in headers or markup meta.
In HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
In PHP (before any other content is sent to the client):
header('Content-Type: text/html; charset=utf-8');
I'm assuming you'll want UTF-8 encoding. If your site uses another encoding for text, then you should replace UTF-8 with the encoding you're using.
One thing to note about using HTML to specify the encoding is that the browser will restart rendering a page once it sees the Content-Type meta tag, so you should include the <meta /> tag immediately after the <head /> tag in your page so the browser doesn't do any more extra processing than it needs.
Another common charset is "iso-8859-1" (Latin-1), which you may want to use instead of UTF-8. You can find more detailed info from this awesome article on character encodings and the web. You can also get an exhaustive list of character encodings here if you need a specific type.
If nothing else works, another (rare) possibility is that you may not have a font installed on your computer with the characters needed to display the page. I've tried repeating your results on my own server and had no luck, possibly because I have a lot of fonts installed on my machine so the browser can always substitute unavailable characters from one font with another font.
What I did notice by investigating further is that if text is sent in an encoding different than the encoding the browser reports as, Unicode characters can render unexpectedly. To work around this, I used the HTML character entity representation of special characters, so â becomes &acirc; in my HTML and é becomes &eacute;. Once I did this, no matter what encoding I reported as, my characters rendered correctly.
Obviously you don't want to modify your database to HTML encode Unicode characters. Your best option if you must do this is to use a PHP function, htmlentities(). You should use this function on any data-driven text you expect to have Unicode characters in. This may be annoying to do, but if specifying the encoding doesn't help, this is a good last resort for forcing Unicode characters to work.
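A minimal sketch of that last-resort approach follows. The wrapper name entity_encode() is hypothetical, and the sketch assumes the data coming out of the database really is UTF-8; the third argument to htmlentities() has to match the actual encoding of the input, otherwise the result will be wrong or empty.

```php
<?php
// Hypothetical wrapper: turn non-ASCII characters (and quotes) into named HTML entities.
// Assumes $utf8Text is valid UTF-8; ENT_HTML401 selects the classic HTML 4.01 entity names.
function entity_encode(string $utf8Text): string
{
    return htmlentities($utf8Text, ENT_QUOTES | ENT_HTML401, 'UTF-8');
}
```

Because the output is pure ASCII, it renders the same whichever charset the page declares, which is why this works as a fallback. It does not fix the underlying mismatch between the stored data and the declared encoding.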

There is no such standard called "extended ASCII", just a bunch of proprietary extensions.
Anyway, there are a variety of possible causes, but it's not your font. You can start by checking the character set in MySQL, and then see what PHP is doing. As Dan said, you need to make sure PHP is specifying the character encoding it's actually using.

As others have mentioned, this is a character-encoding question. You should read Joel Spolsky's article about character encoding.
Setting
header('Content-Type: text/html; charset=utf-8');
will fix your problem if your php page is writing UTF-8 characters to the browser. If the text is still garbled, it's possible your text is not UTF-8; in that case you need to use the correct encoding name in the Content-Type header. If you have a choice, always use UTF-8 or some other Unicode encoding.

Simplest fix
ini_set( 'default_charset', 'UTF-8' );
this way you don't have to worry about manually sending the Content-Type header yourself.
EDIT
Make sure you are actually storing data as UTF-8 - sending non-UTF-8 data to the browser as UTF-8 is just as likely to cause problems as sending UTF-8 data as some other character set.
SELECT table_collation
FROM information_schema.`TABLES` T
WHERE table_name=[Table Name];
SELECT default_character_set_name
, default_collation_name
FROM information_schema.`SCHEMATA` S
WHERE schema_name=[Schema Name];
Check those values

There are two transmission encodings, PHP<->browser and Mysql<->PHP, and they need to be consistent with each other. Setting up the encoding for Mysql<->PHP is dealt with in the answers to the questions below:
Special characters in PHP / MySQL
How to make MySQL handle UTF-8 properly
php mysql character set: storing html of international content
The quick answer is "SET NAMES UTF8".
The slow answer is to read the articles recommended in the other answers - it's a lot better to understand what's going on and make one precise change than to apply trial and error until things seem to work. This isn't just a cosmetic UI issue, bad encoding configurations can mess up your data very badly. Think about the Simpsons episode where Lisa gets chewing gum in her hair, which Marge tries to get out by putting peanut butter on.
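For the MySQL<->PHP leg, an alternative to issuing SET NAMES by hand is to fix the charset when the connection is opened. The sketch below only builds a PDO DSN string; the helper name mysql_utf8_dsn(), the host, and the database name are placeholders, and the default of utf8mb4 (MySQL's full 4-byte UTF-8 charset) is an assumption that goes slightly beyond the "utf8" named in this answer.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that sets the connection charset.
// utf8mb4 is an assumed default; pass 'utf8' to mirror the SET NAMES example.
function mysql_utf8_dsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Usage (not run here; credentials are placeholders):
// $pdo = new PDO(mysql_utf8_dsn('localhost', 'mydb'), 'user', 'secret');
```

Whichever mechanism is used, the point from the answer stands: the connection charset and the charset declared to the browser need to agree.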

You should encode all special chars into HTML entities instead of depending on the charset.
htmlentities() will do the work for you.

I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
If your original data was latin1, then inserting it into a UTF-8 database won't convert it to UTF-8, AFAIK, it will insert the same data but now believe it's UTF-8, thus breaking.
If you've got a SQL dump, I'd suggest running it through a tool to convert to UTF-8. Notepad++ does this pretty well - simply open the file, check that the accented characters are displaying correctly, then find "convert to UTF-8" in the menu.

These special characters generally appear due to the extensions. If we provide a meta tag with charset=utf-8 we can eliminate them by adding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
to your meta tags
