I don't get it - what encoding should I use? I've been using ANSI, but when I save the document, Spanish letters are converted to illegible markings.
Any suggestions?
In general, for almost any document, you should use UTF-8 without a BOM. (The exception is documents consisting mostly of East Asian text, which are usually stored more efficiently in UTF-16.)
You also need to make sure that:
it doesn't get transcoded en route
any database the data is stored in is UTF-8 aware
any <meta> element in an HTML document that declares an encoding declares the correct one (and if the document is intended for viewing from a file system rather than over HTTP, add such an element)
the HTTP headers state that the document is UTF-8
In short: use UTF-8 without a BOM (a short PHP example of the header and <meta> points follows below).
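A page that covers those two points might start like this; a minimal sketch, not from the question, with placeholder body text:
<?php
// Declare UTF-8 in the HTTP response header.
header('Content-Type: text/html; charset=utf-8');
?>
<!DOCTYPE html>
<html>
<head>
<!-- Repeat the declaration in the document itself, for the case where the
     file is opened from the file system without HTTP headers. -->
<meta charset="utf-8">
</head>
<body>
<p>España, ñ and á included: saved and served as UTF-8, without a BOM.</p>
</body>
</html>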
I've read that adding the UTF-8 Byte Order Mark (3 characters) at the start of a text file makes it a UTF-8 file, but I've also read that the Unicode standard recommends against using the BOM for UTF-8.
I'm generating files in PHP and I have a requirement that the files be UTF-8. I've added the UTF-8 BOM to the start of the file but I've received feedback about garbage characters at the start of the file from the company that is parsing the files and that gave me the requirement to make the files UTF-8.
If I open the file in Notepad it doesn't show the BOM, and if I go to Save As, it shows UTF-8 as the default choice.
Opening the file in Textpad32 shows the 3 characters at the start of the file.
So what makes a file UTF-8?
Text is UTF-8 because it's valid as UTF-8 and the author decides it is.
How that decision by the author is communicated to the consumer is a different question, one of convention, guessing, and various schemes for in-band or out-of-band signalling: HTTP or HTML charset declarations, a BOM (which aids the guessing), an envelope or embedding format, additional data streams, file naming, and many more.
The file doesn't need any explicit indicator that it is UTF-8; modern text editors should detect UTF-8 from the content, as UTF-8 byte sequences are quite distinctive.
Also, as you experienced for yourself, PHP doesn't like the BOM header: it's a silly thing that often messes up the script output and creates more problems than it solves.
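If you can't control which editor last saved a file and need to be sure it carries no BOM, a small PHP check works. This is just a sketch: the file name is hypothetical and the file is assumed small enough to read into memory.
<?php
// The UTF-8 BOM is the three bytes EF BB BF.
$bom = "\xEF\xBB\xBF";
$data = file_get_contents('generated.txt'); // hypothetical file name
if (strncmp($data, $bom, 3) === 0) {
    // Strip the BOM and write the file back without it.
    file_put_contents('generated.txt', substr($data, 3));
}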
HTML has its own way of declaring the encoding of a file; you can do it within the HTML itself:
<head>
<meta charset="UTF-8">
</head>
Or declare the encoding in the HTTP headers, here with PHP:
header('Content-Type: text/html; charset=utf-8');
Modern browsers will also assume UTF-8 as default encoding in case none is specified. It is the standard of the web after all.
UTF-8 is a particular encoding: every 7-bit ASCII file is also valid UTF-8, and UTF-8 can encode every Unicode character.
You will often get the advice to save as UTF-8 without a BOM. In practice, it is very unlikely that a file in a legacy encoding (such as code page 1252, Big5 or Shift-JIS) would just happen to look like valid UTF-8 unless it is an intentionally-ambiguous test case. Many programs, such as web browsers, are good in practice at figuring out when a file is UTF-8. Most recent software uses UTF-8 as its preferred text encoding unless it’s forced to default to something else for compatibility with last century. (LaTeX, for example, changed its default source encoding to UTF-8 in April 2018, and both the LuaLaTeX and XeLaTeX engines had been doing the same for years.)
There are some document types with special requirements. For example, the default encoding of web pages is theoretically Windows 1252, although browsers in the real world will take their best guess. The current best practice on the Web is to save as UTF-8 without a BOM. Instead of a BOM, you write inside the <head> of the document <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> or <meta charset="utf-8"/>. This tells the user agent explicitly what the character encoding is.
On the other hand, some older versions of software either break if they see a BOM, or only recognize UTF-8 if there is a BOM. Microsoft in the ’aughts was especially guilty of this; its software doesn't want to break any files that used to work back then, and so, to this day, I save my C source files as UTF-8 with a BOM. This is the only format that just works on every compiler I use: even the latest version of MSVC might guess wrong if you don't give it either a BOM or the right command-line flag, whereas Clang only supports UTF-8 and has no option to read files in any other encoding. Some older versions of MSVC that I was once forced to use cannot understand UTF-8 at all unless the BOM is there, and provide no way to override their autodetection.
I am using Flash to read content from a UTF-8 page, which has Unicode text in it.
The problem is that when Flash loads the data it displays ???????? instead of the Unicode characters.
What could be the problem?
By default Flash treats strings as if they are encoded using UTF-8. The characters you see are substitutes for non-printable characters or invalid/missing glyphs. One possible cause is that you set System.useCodepage to true; if that's what happened, ask yourself why you did it.
Otherwise, the font used to display the characters may be missing glyphs for the characters you need. You can check that with Font.hasGlyphs("string with the glyphs") to make sure the text can be displayed. This would normally only apply to embedded fonts.
Yet another possibility is that the source text you are trying to display is not actually UTF-8 encoded. Some popular file formats such as XML and HTML sometimes carry an encoding declaration that doesn't correspond to the actual payload (for example, <?xml encoding="utf-8" ?> can be attached to any XML document regardless of its real encoding). To make sure the text really is UTF-8, read it as a ByteArray and inspect the bytes: if the high bit of every byte is 0, the text is plain ASCII, which is also valid UTF-8; any bytes with the high bit set must form valid UTF-8 multi-byte sequences, whereas legacy single-byte encodings use such bytes for national characters in patterns that are rarely valid UTF-8.
Flash internally uses UTF-8 to represent strings, so there should not be a problem if the entire stack uses UTF-8 encoding.
You probably have an implicit decode/encode step somewhere along the way.
This could really be a million things, unfortunately. Start from the ground up and insert traces and/or log messages to see where the conversion fails. Make sure your XML content uses UTF-8, and, especially if you're using PHP, make sure all the PHP source files are saved in UTF-8; editing PHP files in simple text editors often leaves them saved in a legacy Windows/Mac encoding, which will then break your character encoding. Also verify the HTTP request/response headers to see if there is an encoding mismatch.
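One way to narrow it down on the PHP side is to check the data before it is handed to Flash. A rough sketch, assuming the raw bytes are already in a string $xml; the fallback encoding below is only a guess:
<?php
if (!mb_check_encoding($xml, 'UTF-8')) {
    // Not valid UTF-8: log it and convert from the suspected legacy
    // encoding (ISO-8859-1 here is just an assumption, adjust as needed).
    error_log('Payload is not valid UTF-8, converting');
    $xml = mb_convert_encoding($xml, 'UTF-8', 'ISO-8859-1');
}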
I'm making a KSSN (Korean ID Number) checker in PHP using a MySQL database.
I check if it is working by using a file_get_contents call to an external site.
The problem is that the requests (with Hangul/Korean characters in them) are using the wrong charset.
When I echo the string, the Korean characters just get replaced by question marks.
How can I make it use Korean? Should I change anything in the database too?
What should be the charset?
PHP Source and SQL Dump: http://www.multiupload.com/RJ93RASZ31
NOTE: I'm using Apache (HTML), not CLI.
You need to:
tell the browser what encoding you wish to receive in the form submission, by setting Content-Type by header or <meta> as in aviv's answer.
tell the database what encoding you're sending it bytes in, using mysql_set_charset().
Currently you are using EUC-KR in the database, so presumably you want to use that encoding in both of the above points. In this century I would suggest instead using UTF-8 throughout for all web apps/databases, as the East Asian multibyte encodings are an anachronistic unpleasantness. (With potential security implications: if mysql_real_escape_string doesn't know the correct encoding, a multibyte sequence containing ' or \ can sneak an SQL injection through.)
However, if enpang.com are using EUC-KR for the encoding of the Name URL parameter you would need either to stick with EUC-KR, or to transcode the name value from UTF-8 to EUC-KR for that purpose using iconv(). (It's not clear to me what encoding enpang.com are using for URL parameters to their name check service; I always get the same results anyway.)
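A rough sketch of both points, assuming a MySQL connection resource $link and a form value $name; the final transcoding step is only needed if the remote service really expects EUC-KR, and the URL is a placeholder:
<?php
// 1. Tell the browser the page (and therefore the form submission) is UTF-8.
header('Content-Type: text/html; charset=utf-8');
// 2. Tell MySQL the connection uses UTF-8 as well.
mysql_set_charset('utf8', $link);
// Transcode just the URL parameter if the remote service expects EUC-KR.
$nameEucKr = iconv('UTF-8', 'EUC-KR', $name);
$result = file_get_contents('http://example.com/check?Name=' . urlencode($nameEucKr));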
I don't know the charset, but if you are using HTML to show the results you should set the charset of the HTML:
<META http-equiv="Content-Type" content="text/html; charset=EUC-KR">
You can also use iconv (php function) to convert the charset to a different charset
http://php.net/manual/en/book.iconv.php
And last but not least, check your database encoding for the tables.
But I guess that in your case you will only have to change the meta tag.
Basically all charset problems stem from the fact that they're being mixed and/or misinterpreted.
A string (text) is a sequence of bytes in a specific order. The string is encoded using some specific charset, which in itself is neither right nor wrong nor anything else. The problem arises when you try to read the string, the sequence of bytes, assuming the wrong charset. Bytes encoded using, for example, KS X 1001 just don't make sense when you read them assuming they're UTF-8; that's where the question marks come from.
The site you're getting the text from sends it to you in some specific character set, let's assume KS X 1001. Let's assume your own site uses UTF-8. Embedding a stream of bytes representing KS X 1001 encoded text in the middle of UTF-8 encoded text and telling the browser to interpret the whole site as UTF-8 leads to the KS X 1001 encoded text not making sense to the UTF-8 parser.
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
KSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKS
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
will be rendered as
Hey, this is UTF-8 encoded text, awesome!
???????I?have?no?idea?what?this?is???????
Hey, this is UTF-8 encoded text, awesome!
To solve this problem, convert the fetched text into UTF-8 (or whatever encoding you're using on your site). Look at the Content-Type header of that other site, it should tell you what encoding the site is in. If it doesn't, take a guess.
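In PHP that conversion might look like this. A sketch, assuming the remote page turns out to be EUC-KR; the URL is a placeholder and you should confirm the encoding from the Content-Type header first:
<?php
$url = 'http://example.com/page'; // placeholder URL
$headers = get_headers($url, 1);
$contentType = $headers['Content-Type']; // e.g. "text/html; charset=euc-kr"
$raw = file_get_contents($url);
// Convert from the remote site's encoding to the one your own site uses.
$text = iconv('EUC-KR', 'UTF-8//IGNORE', $raw);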
I have an HTML form that is sometimes submitted with accented characters: à, è, ì, ò, ù
I have a PHP script that exports these form submissions into CSV format. When I look at the CSV in a text editor (vim or Notepad, for example) the characters look fine, but when it is opened with OpenOffice or Word, I get some funky results: �����
I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."
What can I do to ensure portability of my CSV file? What's the proper way to handle the encoding?
My HTML file's content type is set as: Content-Type: text/html; charset=utf-8
Data is being stored in MySQL with the latin1_swedish_ci collation.
Total encoding confusion! :-)
The table character set
The MySQL table character set only determines what encoding MySQL should use internally, and thus the range of characters permitted.
If you set it to Latin-1 (aka ISO 8859-1), you will not be able to store international characters in your table.
Importantly, the character set does not affect the encoding MySQL uses when communicating with your PHP script.
The table collation specifies rules for sorting.
The connection character set
The MySQL connection character set determines the encoding you receive table data in (and should send data to MySQL in).
The encoding is set using SET NAMES, e.g. SET NAMES "utf8".
If this does not match the table encoding, MySQL automatically converts data on the fly.
If this does not match your page character set, you'll have to manually perform character set conversion in PHP, using e.g. utf8_encode or mb_convert_encoding.
Page character set
The page character set, specified using the Content-Type header, tells the browser how to interpret the PHP script output.
As an HTTP header, it is not saved when you save the file from within your browser. The information is thus not available to OpenOffice or other programs.
Recommendations
Ideally, you should use the same encoding in all three places, and ideally, that encoding should be UTF-8.
However, CSV will cause problems, since the file format does not include encoding information. It is thus up to the application to guess the encoding, and as you've seen, the guess will be wrong.
I don't know about OpenOffice, but Microsoft Office will assume the Windows "ANSI" encoding, which usually means Latin-1 (or CP1252 to be specific).
Microsoft Office will also cause problems in countries that use "," as a decimal separator, since Office then switches to using ";" as a field separator for CSV-files.
Your best bet is to use Latin-1 for the CSV-file. I'd still use UTF-8 for the table and connection character sets though, and also UTF-8 for HTML pages.
If you use UTF-8 for the connection character set (by executing SET NAMES "utf8" after connecting), you'll need to run the text through utf8_decode to convert to Latin-1.
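A sketch of that combination: the connection stays UTF-8 while the CSV is written as Latin-1 so that Office guesses correctly. The file name is a placeholder and $rows stands for data you have already fetched:
<?php
mysql_query('SET NAMES "utf8"'); // connection character set: UTF-8
$fp = fopen('export.csv', 'w');  // placeholder file name
foreach ($rows as $row) {
    // Convert each UTF-8 field to Latin-1 before writing.
    fputcsv($fp, array_map('utf8_decode', $row));
}
fclose($fp);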
That entity problem
I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."
This sounds like you're passing HTML code in an XML context, and is unrelated to character sets. Try running the text through html_entity_decode.
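For example (just a sketch; $text stands in for whatever you currently send to salesforce):
<?php
// XML only predefines &amp; &lt; &gt; &apos; &quot;, so "&Atilde;" and friends must be decoded first.
$text = 'S&atilde;o Paulo';
$decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // "São Paulo"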
Also, what document type have you set? Is it:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Try using the htmlentities() function for any text that is not showing correctly.
You may also want to have a look at the PHP Normalizer class.
Make sure you are writing the CSV file as UTF-8. See http://www.php.net/manual/en/function.fwrite.php#55054 if you are unsure how to.
(Also, your SQL table should be using utf8, not latin1.)
It's up to you to decide which charset encoding you'll use for writing your CSV file (but note that it must be a conscious decision on your part).
Which charset encoding to use? CSV does not define a charset encoding, so I'd go for some Unicode charset, presumably UTF-8. But some CSV consumers (e.g. Excel) might not be happy with it. If you are restricted to "western" languages, then Latin-1 or one of its variants (ISO-8859-1 or ISO-8859-15) might be more appropriate. But then (in any case, actually) you must think about the conversion from user input to your particular encoding, and about what to do if there are invalid characters.
(BTW: the same consideration applies to the HTML-input-to-database conversion. You are using latin1 for your database; have you asked yourself what happens if the user types a non-latin1 character, e.g. a Japanese one?)
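If you do settle on Latin-1 for the CSV, iconv can at least degrade gracefully when the input contains characters Latin-1 cannot represent. A sketch:
<?php
$input = "naïve – 日本語"; // UTF-8 user input, including non-Latin-1 characters
// //TRANSLIT approximates what it can; //IGNORE silently drops the rest.
$latin1 = iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $input);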
I coded a PHP project under ISO 8859-1, and for some technical reasons I want to convert the project to UTF-8. What is the best way to do it? I am afraid of losing special characters like French accents and so on. Thanks for your advice.
You should try using the shell command iconv to convert the PHP files from Latin-1 (ISO-8859-1) to UTF-8.
After that you should make sure that PHP uses UTF-8 as the default encoding (the default_charset directive in php.ini, if I recall correctly). If not, you can set it with ini_set() for your project.
After that you should convert your database to UTF-8 or use a quickfix like this (for MySQL):
mysql_query("SET NAMES 'utf8'");
Of course, just replace mysql_query() with whatever your framework uses (if you use one).
Put it into your primary file which includes all the classes and stuff.
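A one-off conversion script along those lines might look like this. It is only a sketch: it assumes every .php file in the (placeholder) directory really is ISO-8859-1, and it overwrites files in place, so run it on a copy:
<?php
foreach (glob('src/*.php') as $file) { // 'src/' is a placeholder path
    $latin1 = file_get_contents($file);
    file_put_contents($file, iconv('ISO-8859-1', 'UTF-8', $latin1));
}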
Transcode all the files with iconv. Change any and all HTTP headers or meta tags. Profit.
Here's my take on your question: you want the generated HTML (via PHP) to be UTF-8 compliant? Be aware that the HTML 4.x standard assumes ISO-8859-1 as its default encoding, while XHTML defaults to UTF-8 when served as XML but to ISO-8859-1 when served as text/html. Pure XML, of course, defaults to UTF-8.
(1) So the first piece of the puzzle is to select your DOCTYPE for your rendered HTML.
(2) Make sure you add the language character set meta tags (charset=utf-8), etc.
(3) Take the rendered PHP/HTML string and send it through iconv either via the shell using a system call or through some PHP API method.
The resulting rendered HTML will be UTF-8 encoded. The client browser needs to be told to render the HTML as UTF-8 and not as Western Latin-1; otherwise you get a strange non-breaking-space character in the upper left-hand corner of the page.
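Step (3) could be sketched as a final pass over the rendered page, assuming the rest of the script still produces ISO-8859-1 and using a hypothetical render_page() function:
<?php
header('Content-Type: text/html; charset=utf-8');
$html = render_page(); // hypothetical function that builds the ISO-8859-1 page
echo iconv('ISO-8859-1', 'UTF-8', $html);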